Smart Algorithms Multimedia Imaging
Michael N. Rychagov
Ekaterina V. Tolstaya
Mikhail Y. Sirotenko Editors
Smart Algorithms for Multimedia and Imaging
Signals and Communication Technology
Series Editors
Emre Celebi, Department of Computer Science, University of Central Arkansas,
Conway, AR, USA
Jingdong Chen, Northwestern Polytechnical University, Xi'an, China
E. S. Gopi, Department of Electronics and Communication Engineering, National
Institute of Technology, Tiruchirappalli, Tamil Nadu, India
Amy Neustein, Linguistic Technology Systems, Fort Lee, NJ, USA
H. Vincent Poor, Department of Electrical Engineering, Princeton University,
Princeton, NJ, USA
This series is devoted to fundamentals and applications of modern methods of
signal processing and cutting-edge communication technologies. The main topics
are information and signal theory, acoustical signal processing, image processing
and multimedia systems, mobile and wireless communications, and computer and
communication networks. Volumes in the series address researchers in academia and
industrial R&D departments. The series is application-oriented. The level of
presentation of each individual volume, however, depends on the subject and can
range from practical to scientific.
Indexing: All books in "Signals and Communication Technology" are indexed by Scopus and zbMATH.
For general information about this book series, comments or suggestions, please
contact Mary James at mary.james@springer.com or Ramesh Nath Premnath at
ramesh.premnath@springer.com.
Mikhail Y. Sirotenko
Google Research
New York, NY, USA
This Springer imprint is published by the registered company Springer Nature Switzerland AG
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
Preface
Over the past decades, people have produced vast amounts of multimedia content,
including text, audio, images, animations, and video. The substance of this content
belongs, in turn, to various areas, including entertainment, engineering, medicine,
business, scientific research, etc. This content should be readily processed, analysed,
and displayed by numerous devices like TVs, mobile devices, VR headsets, medical
devices, media players, etc., without losing its quality. This brings researchers and
engineers to the problem of the fast transformation and processing of
multidimensional signals, where they must deal with different sizes and resolutions,
processing speed, memory, and power consumption. In this book, we describe smart
algorithms applied both for multimedia processing in general and in imaging
technology in particular.
In the first book of this series, Adaptive Image Processing Algorithms for Printing
by I.V. Safonov, I.V. Kurilin, M.N. Rychagov, and E.V. Tolstaya, published by
Springer Nature Singapore in 2018, several algorithms were considered for the
image processing pipeline of photo-printer and photo-editing software tools that
we have worked on at different times for processing still images and photos.
The second book, Document Image Processing for Scanning and Printing by the
same authors, published by Springer Nature Switzerland in 2019, dealt with document image processing for scanning and printing. A copying technology must produce perfect copies from extremely varied originals; therefore, in practice, copying is not separable from image enhancement, and from a technical perspective it is best to consider the two jointly.
This book is devoted to multimedia algorithms and imaging, and it is divided into
four main interconnected parts:
• Image and Video Conversion
• TV and Display Applications
• Machine Learning and Artificial Intelligence
• Mobile Algorithms
Image and Video Conversion includes five chapters that cover solutions on super-
resolution using a multi-frame-based approach as well as machine learning-based
super-resolution. They also cover the processing of 3D signals, namely depth
estimation and control, and semi-automatic 2D to 3D video conversion. A compre-
hensive review of visual lossless colour compression technology concludes this part.
TV and Display Applications includes three chapters in which the following
algorithms are considered: video editing, real-time sports episode detection by
video content analysis, and the generation and reproduction of natural effects.
Machine Learning and Artificial Intelligence includes four chapters, where the
following topics are covered: image classification as a service, mobile user profiling,
and automatic view planning in magnetic resonance imaging, as well as dictionary-
based compressed sensing MRI (magnetic resonance imaging).
Finally, Mobile Algorithms consists of four chapters where the following algo-
rithms and solutions implemented for mobile devices are described: a depth camera
based on a colour-coded aperture, the animated graphical abstract of an image, a
motion photo, and approaches and methods for iris recognition for mobile devices.
The solutions presented in the first two books and in the current one have been
included in dozens of patents worldwide, presented at international conferences, and
realized in the firmware of devices and software. The material is based on the
experience of both editors and the authors of particular chapters in industrial research
and technology commercialization. The authors have worked on the development of
algorithms for different divisions of Samsung Electronics Co., Ltd, including the
Printing Business, Visual Display Business, Health and Medical Equipment Divi-
sion, and Mobile Communication Business for more than 15 years.
We should especially note that this book in no way pretends to present an
in-depth review of the achievements accumulated to date in the field of image and
video conversion, TV and display applications, or mobile algorithms. Instead, in this
book, the main results of the studies that we have authored are summarized. We hope
that the main approaches, optimization procedures, and heuristic findings are still
relevant and can be used as a basis for new intelligent solutions in multimedia, TV,
and mobile applications.
How can algorithms capable of being adaptive to image content be developed? In
many cases, inductive or deductive inference can help. Many of the algorithms
include lightweight classifiers or other machine-learning-based techniques, which
have low computational complexity and model size. This makes them deployable on
embedded platforms.
As we have mentioned, the majority of the described algorithms were
implemented as systems-on-chip firmware or as software products. This was a
challenge because, for each industrial task, there are always strict specification
requirements, and, as a result, there are limitations on computational complexity,
memory consumption, and power efficiency. In this book, typically, no device-
dependent optimization tricks are described, though the ideas for effective methods
from an algorithmic point of view are provided.
This book is intended for all those who are interested in advanced multimedia
processing approaches, including applications of machine learning techniques for
the development of effective adaptive algorithms. We hope that this book will serve
as a useful guide for students, researchers, and practitioners.
It is the intention of the editors that each chapter be used as an independent text. In
this regard, at the beginning of a large fragment, the main provisions considered in
the preceding text are briefly repeated with reference to the appropriate chapter or
section. References to the works of other authors and discussions of their results are
given in the course of the presentation of the material.
We would like to thank our colleagues who worked with us both in Korea and at
the Samsung R&D Institute Rus, Moscow, on the development and implementation
of the technologies mentioned in the book, including all of the authors of the
chapters: Sang-cheon Choi, Yang Lim Choi, Dr. Praven Gulaka, Dr. Seung-Hoon
Hahn, Jaebong Yoo, Heejun Lee, Kwanghyun Lee, San-Su Lee, B’jungtae O,
Daekyu Shin, Minsuk Song, Gnana S. Surneni, Juwoan Yoo, Valery
V. Anisimovskiy, Roman V. Arzumanyan, Andrey A. Bout, Dr. Victor V. Bucha,
Dr. Vitaly V. Chernov, Dr. Alexey S. Chernyavskiy, Dr. Aleksey B. Danilevich,
Andrey N. Drogolyub, Yuri S. Efimov, Marta A. Egorova, Dr. Vladimir A. Eremeev,
Dr. Alexey M. Fartukov, Dr. Kirill A. Gavrilyuk, Ivan V. Glazistov, Vitaly
S. Gnatyuk, Aleksei M. Gruzdev, Artem K. Ignatov, Ivan O. Karacharov, Aleksey
Y. Kazantsev, Dr. Konstantin V. Kolchin, Anton S. Kornilov, Dmitry
A. Korobchenko, Mikhail V. Korobkin, Dr. Oxana V. Korzh (Dzhosan), Dr. Igor
M. Kovliga, Konstantin A. Kryzhanovsky, Dr. Mikhail S. Kudinov, Artem
I. Kuharenko, Dr. Ilya V. Kurilin, Vladimir G. Kurmanov, Dr. Gennady
G. Kuznetsov, Dr. Vitaly S. Lavrukhin, Kirill V. Lebedev, Vladislav A. Makeev,
Vadim A. Markovtsev, Dr. Mstislav V. Maslennikov, Dr. Artem S. Migukin, Gleb
S. Milyukov, Dr. Michael N. Mishourovsky, Andrey K. Moiseenko, Alexander
A. Molchanov, Dr. Oleg F. Muratov, Dr. Aleksei Y. Nevidomskii, Dr. Gleb
A. Odinokikh, Irina I. Piontkovskaya, Ivan A. Panchenko, Vladimir
P. Paramonov, Dr. Xenia Y. Petrova, Dr. Sergey Y. Podlesnyy, Petr Pohl,
Dr. Dmitry V. Polubotko, Andrey A. Popovkin, Iryna A. Reimers, Alexander
A. Romanenko, Oleg S. Rybakov, Associate Prof., Dr. Ilia V. Safonov, Sergey
M. Sedunov, Andrey Y. Shcherbinin, Yury V. Slynko, Ivan A. Solomatin, Liubov
V. Stepanova (Podoynitsyna), Zoya V. Pushchina, Prof., Dr.Sc. Mikhail
K. Tchobanou, Dr. Alexander A. Uldin, Anna A. Varfolomeeva, Kira
I. Vinogradova, Dr. Sergey S. Zavalishin, Alexey M. Vil’kin, Sergey
Y. Yakovlev, Dr. Sergey N. Zagoruyko, Dr. Mikhail V. Zheludev, and numerous
volunteers who took part in the collection of test databases and the evaluation of the
quality of our algorithms.
Contributions from our partners at academic and institutional organizations with
whom we are associated through joint publications, patents, and collaborative work,
i.e., Prof. Dr.Sc. Anatoly G. Yagola, Prof. Dr.Sc. Andrey S. Krylov, Dr. Andrey
V. Nasonov, and Dr. Elena A. Pavelyeva from Moscow State University; Academi-
cian RAS, Prof., M.D. Sergey K. Ternovoy, Prof., M.D. Merab A. Sharia, and
M.D. Dmitry V. Ustuzhanin from the Tomography Department of the Cardiology
Research Center (Moscow); Prof., Dr.Sc. Rustam K. Latypov, Dr. Ayrat
F. Khasyanov, Dr. Maksim O. Talanov, and Irina A. Maksimova from Kazan
State University; Academician RAS, Prof., Dr.Sc. Evgeniy E. Tyrtyshnikov from the
Marchuk Institute of Numerical Mathematics RAS; Academician RAS, Prof., Dr.Sc.
Sergei V. Kislyakov, Corresponding Member of RAS, Dr.Sc. Maxim A. Vsemirnov,
and Dr. Sergei I. Nikolenko from the St. Petersburg Department of Steklov Math-
ematical Institute of RAS; Corresponding Member of RAS, Prof., Dr.Sc. Rafael
M. Yusupov, Prof., and Prof., Dr.Sc. Vladimir I. Gorodetski from the St. Petersburg
Institute for Informatics and Automation RAS; Prof., Dr.Sc. Igor S. Gruzman from
Novosibirsk State Technical University; and Prof., Dr.Sc. Vadim R. Lutsiv from
ITMO University (St. Petersburg), are also deeply appreciated.
During all these years and throughout the development of these technologies, we
received comprehensive assistance and active technical support from SRR General
Directors Dr. Youngmin Lee, Dr. Sang-Yoon Oh, Dr. Kim Hyo Gyu, and Jong-Sam
Woo; the members of the planning R&D team: Kee-Hang Lee, Sang-Bae Lee,
Jungsik Kim, Seungmin (Simon) Kim, and Byoung Kyu Min; the SRR IP Depart-
ment, Mikhail Y. Silin, Yulia G. Yukovich, and Sergey V. Navasardyan from
General Administration. All of their actions were always directed toward finding the optimal forms of R&D work for both managers and engineers, generating new approaches to create promising algorithms and SW, and ultimately creating
solutions of high quality. At any time, we relied on their participation and assistance
in resolving issues.
Proofreading of all pages of the manuscript was performed by PRS agency (http://
www.proof-reading-service.com).
About the Editors
Michael N. Rychagov received his MS in acoustical imaging and PhD from Moscow State University (MSU) in 1986 and 1989, respectively. In 2000, he received a Dr.Sc. (Habilitation) from the same university. Since 1991, he has been involved in teaching and research at the National Research University of Electronic Technology (MIET) as an associate professor in the Department of Theoretical and Experimental Physics (1998), professor in the Department of Biomedical Systems (2008), and professor in the Department of Informatics and SW for Computer Systems (2014). In 2004, he joined the Samsung R&D Institute in Moscow, Russia (SRR), working on imaging algorithms for printing, scanning, and copying; TV and display technologies; multimedia; and tomographic areas for almost 14 years, including the last 8 years as Director of Division at SRR. Currently, he is Senior Manager of SW Development at Align Technology, Inc. (USA) in the Moscow branch (Russia). His technical and scientific interests are image and video signal processing, biomedical modelling, engineering applications of machine learning, and artificial intelligence. He is a Member of the Society for Imaging Science and Technology and a Senior Member of IEEE.
Xenia Y. Petrova
1.1.1 Introduction
Super-resolution (SR) is the name given to techniques that allow a single high-
resolution (HR) image to be constructed out of one or several observed low-
resolution (LR) images (Fig. 1.1). Compared to single-frame interpolation, SR
reconstruction is able to restore the high frequency component of the HR image
by exploiting complementary information from multiple LR frames. The SR prob-
lem can be stated as described in Milanfar (2010).
In a traditional setting, most of the SR methods can be classified according to the
model of image formation, the model of the image prior, and the noise model.
Commonly, research has paid more attention to image formation and noise
models, while the image prior model has remained quite simple. The image forma-
tion model may include some linear operators like smoothing and down-sampling
and the motion model that should be considered in the case of multi-frame SR.
At the present time, as machine learning approaches are becoming more popular,
there is a certain paradigm shift towards a more elaborate prior model. Here the main
question becomes “What do we expect to see? Does it look natural?” instead of the
question “What could have happened to a perfect image so that it became the one we
can observe?”
Super-resolution is a mature technology covered by numerous research and survey
papers, so in the text below, we will focus on aspects related to image formation
models, the relation and difference between SR and interpolation, supplementary
X. Y. Petrova (*)
Samsung R&D Institute Russia (SRR), Moscow, Russia
e-mail: xen@excite.com
Fig. 1.2 Interpolation grid (a) Template with pixel insertion (b) Uniform interpolation template
When pondering on how to make a bigger image out of a smaller image, there are
two basic approaches. The first one, which looks more obvious, assumes that there
are some pixels that we know for sure and are going to remain intact in the resulting
image and some other pixels that are to be “inserted” (Fig. 1.2a). In image interpo-
lation applications, this kind of idea was developed in a wide range of edge-directed
algorithms, starting with the famous NEDI algorithm, described in Li and Orchard
(2001) and more recent developments, like those by Zhang and Wu (2006), Giachetti
and Asuni (2011), Zhou et al. (2012), and Nasonov et al. (2016). However, simply
observing Fig. 1.2a, it can be seen that keeping some pixels intact in the resulting
image makes it impossible to deal with noisy signals. We can also expect the
interpolation quality to become non-uniform, which may be unappealing visually,
so the formation model from Fig. 1.2b should be more appropriate. In the interpo-
lation problem, researchers rarely consider the relation between large-resolution and
small-resolution images, but in SR formulation the main focus is on the image
formation model (light blue arrow in Fig. 1.2b). So, interpreting SR as an inverse problem to image formation has become a fruitful idea. From this point of view, the image formation model in Fig. 1.2a is a mere down-sampling operator, which is in
1 Super-Resolution: 1. Multi-Frame-Based Approach
Fig. 1.3 Image formation model (a) comprised of blur, down-sampling and additive noise; (b)
using information from multiple LR frames with subpixel shift to recover single HR frame
weak relation with the physical processes taking place in the camera, including at
least the blur induced by the optical system, down-sampling, and camera noise
(Fig. 1.3a), as described in Heide et al. (2013). Although the example in the drawing
is quite exaggerated, it emphasizes three simple yet important ideas:
1. If we want to make a real reconstruction, and not a mere guess, it would be very
useful to get more than one observed image (Fig. 1.3b). The number of frames
used for reconstruction should grow as a square of the down-sampling factor.
2. The more blurred the image, the higher the noise, and the larger the down-sampling factor, the harder it is to reconstruct the original image. There exist both theoretical estimates, like those presented in Baker and Kanade (2002) and Lin et al. (2008), and practical observations of when the solution of the SR reconstruction problem really makes sense.
3. If we know the blur kernel and noise parameters (and the accurate noise model),
the chances of successful reconstruction will increase.
The first idea is an immediate step towards multi-frame SR, which is the main
topic of this chapter. In this case, a threefold model (blur, down-sampling, noise)
becomes insufficient, and we need to consider an additional warp operator, which
describes a spatial transform applied to a high-resolution image before applying blur,
down-sampling, and noise. This operator is related to camera eigenmotions and
object motion. In some applications, the warp operator can be naturally derived from
the problem itself, e.g. in astronomy, the global motion of the sky sphere is known.
In sensor-shift SR, which is now being implemented not only in professional
products like Hasselblad cameras but also in many consumer level devices like
Olympus, Pentax, and some others, a set of sensor shifts is implemented in the
hardware, and these shifts are known by design. But in cases targeting consumer
cameras, estimation of the warp operator (or motion estimation) becomes a separate
problem. Most of the papers, such as Heide et al. (2014), consider only translational
models, while others turn to more complex parametric models, like Rochefort et al.
(2006), who assume globally affine motion, or Fazli and Fathi (2015) as well as
Kanaev and Miller (2013), who consider a motion model in the form of optical flow.
Thus, the multi-frame SR problem can be formulated as the reconstruction of a high-resolution image X from several observed low-resolution images Y_i, where the image formation model is described by Y_i = W_i X + η_i, ∀ i = 1, . . ., k, where W_i is the ith image formation operator and η_i is additive noise. Operators W_i can be composed out of warp M_i, blur G_i, and decimation (down-sampling) D for a single-channel SR problem:

W_i = D G_i M_i.
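As a concrete illustration, the forward model above can be simulated to generate synthetic LR observations; the following is a minimal sketch (not the authors' implementation — the function name and parameter values are our assumptions), with warp as a cyclic subpixel translation, blur as a Gaussian, and decimation by striding:

```python
import numpy as np
from scipy.ndimage import gaussian_filter, shift

def observe(hr, dx, dy, sigma_blur, s, noise_sigma, rng):
    """Simulate one LR observation: Y_i = D G M_i X + eta_i."""
    warped = shift(hr, (dy, dx), mode="wrap")       # M_i: (subpixel) translation
    blurred = gaussian_filter(warped, sigma_blur)   # G: optical blur
    lr = blurred[::s, ::s]                          # D: decimation by factor s
    return lr + rng.normal(0.0, noise_sigma, lr.shape)  # additive noise eta_i

rng = np.random.default_rng(0)
hr = rng.random((64, 64))
# several observations with distinct subpixel shifts, as multi-frame SR requires
frames = [observe(hr, dx, dy, 1.0, 2, 0.01, rng)
          for dx, dy in [(0, 0), (0.5, 0), (0, 0.5), (0.5, 0.5)]]
```

Each call yields one Y_i; a multi-frame method would then invert this stack jointly.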
In Bodduna and Weickert (2017), along with this popular physically motivated warp-blur-down-sample model, it was proposed to use a less popular yet more effective for practical purposes blur-warp-down-sample model, i.e. W_i = D M_i G_i.
In cases like those described in Park et al. (2008) or Gutierrez and Callico (2016),
when different systems are used to obtain different shots, the blur operators Gi can be
different for each observed frame, but using the same camera system is more
common, so it is enough to use a single blur matrix G. In any case, for spatially invariant warp and blur operators, blur and warp commute, so
there is no difference between these two approaches. The pre-blur (before warping)
model compared to the post-blur model also allows us to concentrate on finding GX
instead of X.
In Farsiu et al. (2004), a more detailed model containing two separate blur operators, responsible for camera blur G_cam and atmospheric blur G_atm, is considered:

W_i = D G_cam M_i G_atm.
Parameters a and b of the noise model (a signal-dependent variance of the form σ²(x) = a·x + b) depend on the camera model and shooting conditions, such as gain or ISO (which is usually provided in metadata) and the exposure time. Although there is abundant literature covering quite sophisticated
procedures of noise model parameter estimation, for researchers focused on the SR
problem per se, it is more efficient to refer to data provided by camera manufacturers,
like the NoiseProfile field described in the DNG specification (2012).
Camera 2 API of Android devices is also expected to provide this kind of
information. There also exist quite simple computational methods to estimate the
parameters of the noise model using multiple shots in a controlled environment.
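One such simple computational method can be sketched as follows (our own illustration, under the assumption of signal-dependent Gaussian noise with variance a·x + b; the helper name is hypothetical): given many shots of a static scene, fit a line to per-pixel mean/variance statistics.

```python
import numpy as np

def fit_noise_profile(shots):
    """Fit var ≈ a*mean + b across pixels of repeated shots of a static scene."""
    stack = np.stack(shots)                  # shape: (num_shots, H, W)
    mean = stack.mean(axis=0).ravel()
    var = stack.var(axis=0, ddof=1).ravel()
    a, b = np.polyfit(mean, var, 1)          # least-squares line fit
    return a, b

# synthetic sanity check with known parameters a = 0.01, b = 1e-4
rng = np.random.default_rng(1)
clean = rng.uniform(0.1, 0.9, (64, 64))
shots = [clean + rng.normal(0.0, np.sqrt(0.01 * clean + 1e-4))
         for _ in range(200)]
a, b = fit_noise_profile(shots)
```

With enough shots, the regression over all pixels recovers the generating parameters closely.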
Within the machine learning approach, the image formation model described
above is almost sufficient to generate artificial observations out of the “perfect” input
data to be used as the training dataset. Still, to obtain an even more realistic
formation model, some researchers like Brooks et al. (2019) consider also the colour
and brightness transformation taking place in the image signal processing (ISP) unit:
white balance, colour correction matrix, gamma compression, and tone mapping.
Some researchers like Segall et al. (2002), Ma et al. (2019), and Zhang and Sze
(2017) go even further and consider the reconstruction problem for compressed
video, but this approach is more related to the image and video compression field
rather than image reconstruction for consumer cameras.
P = argmin_P ∑_{u=1}^{t} F_data(S(P, Y_{u1}, ⋯, Y_{uk}) − X_u)

P = argmin_P ∑_{u=1}^{t} F_data(S(P, Y_u) − X_u).
Even in modern research like Zhang et al. (2018), the data term Fdata often
remains very simple, being just an L2 norm, PSNR (peak signal to noise ratio) or
SSIM (Structural Similarity Index). Although more sophisticated losses, like the
perceptual loss proposed in Johnson et al. (2016), are widely used in research, in
major super-resolution competitions like NTIRE, the outcomes of which were
discussed by Cai et al. (2019), the winners are still determined according to the
PSNR and SSIM, which are still considered to be the most objective quantitative
measures.
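For reference, PSNR, the most common of these quantitative measures, is straightforward to compute; a minimal sketch:

```python
import numpy as np

def psnr(x, y, peak=1.0):
    """Peak signal-to-noise ratio in dB between images with values in [0, peak]."""
    mse = np.mean((np.asarray(x, dtype=float) - np.asarray(y, dtype=float)) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(peak**2 / mse)

# a uniform error of 0.1 on a [0, 1] image gives MSE = 0.01, i.e. 20 dB
ref = np.zeros((8, 8))
approx = np.full((8, 8), 0.1)
# psnr(ref, approx) → 20.0
```

SSIM is more involved (local means, variances, and covariances); library implementations are normally used.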
Note that representatives of the data-driven approach include not only computationally expensive CNN-based approaches but also advanced edge-driven methods like those covered in Sect. 1.1.2 and well-known single-frame SR algorithms like A+, described by Timofte et al. (2014), and RAISR, covered in Romano et al. (2017).
The data-agnostic approach makes minimum assumptions about output images
and solves an optimization problem for each particular input. Considering a three-
fold (blur, down-sampling, noise), fourfold (warp, blur, down-sampling, noise), or
extended fourfold (warp, blur, down-sampling, Bayer down-sampling, noise) image
formation model automatically suggests the SR problem formulation
X = argmin_X ∑_{i=1}^{k} F_data(W_i X − Y_i) + F_reg(X) − ∑_{i=1}^{k} log L_noise(W_i X − Y_i),
where Fdata is a data fidelity term, Freg is the regularization term responsible for data
smoothness, and Lnoise is the likelihood term for the noise model. When developing
an algorithm targeting a specific input device, it is reasonable to assume that we
know something about the noise model, possibly more than about the input data.
Also, when the noise model is Gaussian, the problem is sufficiently described by the
square data term alone, but a more complex noise model would require a separate
noise term. In MF SR, the regularization term is indispensable, because the input
data is often insufficient for unique reconstruction. Moreover, the problem conditioning depends on a “lucky” or “unhappy” combination of warp operators, and if these warps are close to identical, the condition number will be very large even for numerous available observations (Fig. 1.5).
The data fidelity term is usually chosen as the L1 or L2 norm, but most of the
research in SR is focused on the problem with a quadratic data fidelity term and total
variation (TV) regularization term. Different types of norms, regularization terms,
and corresponding solvers are described in detail by Heide et al. (2014). In the case
of a non-linear and particularly non-convex form of the regularization term, the only
way to find a solution is an iterative approach, which may be prohibitive for real-time
implementation.
8 X. Y. Petrova
Fig. 1.5 “Lucky” (observed points cover different locations) and “unhappy” (observed points are
all the same) data for super-resolution
For the quadratic (L2–L2) case, the minimizer satisfies the normal equations

Â X = W* Y,

where Â = W* W + λ² H* H, W = [W_1, . . ., W_k], Y = [Y_1, . . ., Y_k].
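On a toy 1D problem, this linear system can be assembled and solved directly; the following sketch uses illustrative sizes, kernels, and integer warps that are our own assumptions, not values from the chapter:

```python
import numpy as np

def circ(col):
    """Circulant matrix with the given first column."""
    n = len(col)
    return np.array([np.roll(col, k) for k in range(n)]).T

n, s, lam = 8, 2, 1e-3
blur = np.zeros(n); blur[[0, 1, -1]] = [0.6, 0.2, 0.2]
G = circ(blur)                                   # blur (circulant)
D = np.kron(np.eye(n // s), np.eye(s)[:1])       # decimation by factor s
H = circ(np.r_[[-2.0, 1.0], np.zeros(n - 3), [1.0]])  # Laplacian regularizer
P = np.roll(np.eye(n), 1, axis=1)                # cyclic shift (warp)

x_true = np.sin(2 * np.pi * np.arange(n) / n)
Ws = [D @ G @ np.linalg.matrix_power(P, t) for t in (0, 1)]  # W_i = D G M_i
Ys = [W @ x_true for W in Ws]                    # noise-free observations

A_hat = sum(W.T @ W for W in Ws) + lam**2 * (H.T @ H)
rhs = sum(W.T @ Y for W, Y in zip(Ws, Ys))
x_rec = np.linalg.solve(A_hat, rhs)              # solves A_hat x = W^T Y
```

Here the two decimated, shifted observations together cover all sample positions, so the regularized solution recovers x_true almost exactly; for realistic sizes, direct solves are replaced by the structured/FFT techniques discussed later in the chapter.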
It is important to mention that this type of problem can be treated by the very fast
shift and add approach covered in Farsiu et al. (2003) and Ismaeil et al. (2013),
which allows the reconstruction of the blurred high-resolution image using averag-
ing of shifted pixels from low-resolution frames. Unfortunately, this approach leaves
the filling of the remaining holes to the subsequent deblur sub-algorithm on an
irregular grid. This means that the main computational burden is being transferred to
the next pipeline stage and remains quite demanding.
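The shift-and-add fusion step itself can be sketched as follows (a simplified illustration with integer HR-grid shifts; the function name is ours, and unobserved positions are left as holes, here marked NaN):

```python
import numpy as np

def shift_and_add(frames, shifts, s):
    """Average registered LR pixels onto the HR grid; holes stay NaN."""
    h, w = frames[0].shape
    acc = np.zeros((h * s, w * s))
    cnt = np.zeros((h * s, w * s))
    for frame, (dx, dy) in zip(frames, shifts):
        acc[dy::s, dx::s] += frame       # place LR pixels at their HR positions
        cnt[dy::s, dx::s] += 1
    return np.where(cnt > 0, acc / np.maximum(cnt, 1), np.nan)

# two 4x4 frames with shifts (0,0) and (1,1) fill half of an 8x8 HR grid
hr_est = shift_and_add([np.ones((4, 4)), 2 * np.ones((4, 4))],
                       [(0, 0), (1, 1)], 2)
```

The remaining NaN positions are exactly the holes that the subsequent deblur/inpainting stage must fill.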
As an image formation model, we are going to use W_i = D G M_i and W_i = B D G M_i.
It is possible to make this problem even narrower and assume each warp Mi and blur
G as being space-invariant. This limitation is quite reasonable when processing a
small image patch, as has been stated in Robinson et al. (2009). In this case, Â is
known to be reducible to the block diagonal form. This fact is intensively exploited
in papers on fast super-resolution by Robinson et al. (2009, 2010), Sroubek et al.
(2011), and Zhao et al. (2016). Similar results for the image formation model with
Bayer down-sampling were presented by Petrova et al. (2017) and Glazistov and
Petrova (2018).
The warp operators M_i are assumed to be already estimated with sufficient accuracy. Besides, we assume that the motion is a subpixel circular translation, which is a reasonable assumption within a small image block.
We consider a simple Gaussian noise model with the same sigma value for all the observed frames, η_i = η. Within the L2–L2 formulation, the Gaussian noise model means that minimising the data fidelity term also minimizes the noise term, so we can consider a significantly simplified problem, i.e.

X = argmin_X ∑_{i=1}^{k} F_data(W_i X − Y_i) + λ² (HX)* (HX).
The apparatus of structured matrices fits well the linear optimization problems
arising in image processing (and also as a linear inner loop inside non-linear
algorithms). The mathematical foundations of this approach were described in
P^{−1} = P^T,  Q^{−1} = Q^T.
F_n* F_n = F_n F_n* = n I_n.
∀ A ∈ ℂ_n ⇒ A = (P^u)^T A P^u.
The class of circulant matrices of size n × n is denoted by ℂ_n; so, we can write A ∈ ℂ_n. A circulant matrix is defined by a single row (or column) a = [a_1, a_2, . . ., a_n]. It can be transformed to diagonal form by the Fourier transform:

∀ A ∈ ℂ_n ⇒ A = (1/n) F_n* Λ_n F_n.
All circulant matrices of the same size commute. Many matrices used below are
circulant ones, i.e. matrices corresponding to one-dimensional convolution with
cyclic boundary conditions.
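This diagonalization is easy to verify numerically; a small sketch (the `circulant` helper is ours, and the eigenvalues come out as the DFT of the defining vector):

```python
import numpy as np

def circulant(col):
    """Circulant matrix whose first column is `col`."""
    n = len(col)
    return np.array([np.roll(col, k) for k in range(n)]).T

n = 8
a = np.arange(1.0, n + 1)
A = circulant(a)
F = np.fft.fft(np.eye(n), axis=0)        # unnormalized DFT matrix
D = F @ A @ np.linalg.inv(F)             # should be diagonal
eigs = np.diag(D)                        # equals np.fft.fft(a)
```

This is the fact that makes circulant systems solvable in O(n log n) with the FFT instead of O(n³).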
Since we are going to deal with two or more dimensions, the Kronecker product ⊗ becomes an important tool. Properties of the Kronecker product that may be useful for further derivations are summarized in Zhang and Ding (2013) and several other educational mathematical papers.

An operator that down-samples a vector of length n by the factor s can be written as D_s = I_{n/s} ⊗ e_{1,s}^T, where e_{1,s}^T is the first row of the identity matrix I_s. Suppose a two-dimensional n × n array is given:
         ⎡ x_11  x_12  …  x_1n ⎤
X_matr = ⎢ x_21  x_22  …  x_2n ⎥.
         ⎢  ⋮                  ⎥
         ⎣ x_n1  x_n2  …  x_nn ⎦
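The structure of D_s, and of its 2D counterpart D_s ⊗ D_s acting on the vectorized array, can be checked directly; a small sketch (row-major vectorization assumed, for which (A ⊗ B) vec(X) = vec(A X Bᵀ)):

```python
import numpy as np

n, s = 8, 2
e1 = np.eye(s)[:1]                     # e_{1,s}^T, the first row of I_s
Ds = np.kron(np.eye(n // s), e1)       # D_s = I_{n/s} ⊗ e_{1,s}^T

x = np.arange(n, dtype=float)
X = np.arange(n * n, dtype=float).reshape(n, n)
# 1D: Ds picks every s-th sample; 2D: (Ds ⊗ Ds) vec(X) = vec(X[::s, ::s])
y1 = Ds @ x
Y2 = (np.kron(Ds, Ds) @ X.ravel()).reshape(n // s, n // s)
```

In practice one never materializes these matrices; striding (`x[::s]`, `X[::s, ::s]`) realizes the same operators implicitly.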
(A ⊗ B)(C ⊗ D) = (A C) ⊗ (B D).
∀ A ∈ ℂ_n ⊗ ℂ_m ⇒ ∃ N_i ∈ ℂ_n, M_i ∈ ℂ_m, i = 1, . . ., r : A = ∑_{i=1}^{r} N_i ⊗ M_i.
∀ A ∈ ℂ_n ⊗ ℂ_m ⇒ A = (1/(mn)) (F_n* ⊗ F_m*) Λ (F_n ⊗ F_m),

where Λ = ∑_{i=1}^{r} Λ_i^N ⊗ Λ_i^M and Λ_i^N, Λ_i^M are diagonal matrices of eigenvalues of the matrices N_i and M_i.
Although BCCB matrices and their properties are extensively covered in the
literature, matrices arising from the SR problem (especially the Bayer case) are more
complicated, and this paper will borrow a more general concept of the matrix class
from Voevodin and Tyrtyshnikov (1987) to deal with them in a simple and unified
manner.
Definition 2 A matrix class is a linear subspace of square matrices. A matrix A with elements a_{i,j} : i, j = 1, . . ., n belongs to the matrix class described by numbers a_{ij}^{(q)}, q ∈ Q if it satisfies

∑_{i,j} a_{ij}^{(q)} a_{ij} = 0.
This definition is narrower than in the original work, which allows a non-zero
constant on the right-hand side and considers also rectangular matrices, but this
modification makes a definition more relevant to the problem under consideration.
We are interested in the ℂ_n (circulant), 𝒢_n (general, Q = ∅), and 𝒟_n (diagonal) classes of square matrices of size n × n.

The Kronecker product produces bi-level matrices of class 𝒜_1 ⊗ 𝒜_2 from matrices from classes 𝒜_1 and 𝒜_2: ∀ M_1 ∈ 𝒜_1, ∀ M_2 ∈ 𝒜_2 ⇒ M_1 ⊗ M_2 ∈ 𝒜_1 ⊗ 𝒜_2. Here, 𝒜_1 is called an outer class and 𝒜_2 an inner class. Saying A ∈ 𝒜_1 ⊗ 𝒜_2 simply means that A lies in the linear span of such Kronecker products. In the 1D SR problem, the matrix Â satisfies

Â = F_n* Λ_A F_n,

where Λ_A ∈ 𝒢_s ⊗ 𝒟_{n/s}.
In the 2D case, the warp, blur, and regularization operators become

M_i, G, H ∈ ℂ_n ⊗ ℂ_n,  D = D_{s,s} = D_s ⊗ D_s.

Such a matrix Â will satisfy Â = (F_n* ⊗ F_n*) Λ_A (F_n ⊗ F_n), where Λ_A ∈ 𝒢_s ⊗ 𝒟_{n/s} ⊗ 𝒢_s ⊗ 𝒟_{n/s}.
In the Bayer case, matrix Â can be expanded as

Â = ∑_{i=1}^{k} M̃_i* G̃* D̃* B* B D̃ G̃ M̃_i + λ² H̃* H̃,

where
D̃ = I_3 ⊗ D_{s,s},  G̃ = I_3 ⊗ G,  M̃_i = I_3 ⊗ M_i,

    ⎡ D_{2,2}          0                0               ⎤
B = ⎢ D_{2,2} P_{1,1}  0                0               ⎥,
    ⎢ 0                D_{2,2} P_{1,0}  0               ⎥
    ⎣ 0                0                D_{2,2} P_{0,1} ⎦

    ⎡ H_g     0       0       ⎤
    ⎢ 0       H_b     0       ⎥
H̃ = ⎢ 0       0       H_r     ⎥,
    ⎢ H_{c1}  H_{c1}  0       ⎥
    ⎢ H_{c2}  0       H_{c2}  ⎥
    ⎣ 0       H_{c3}  H_{c3}  ⎦

and P_{u,v} is a 2D cyclic shift by u columns and v rows. Submatrices from the expression above satisfy

M_i, G, H_r, H_g, H_b, H_{c1}, H_{c2}, H_{c3} ∈ ℂ_n ⊗ ℂ_n.
Bayer down-sampling operator B extracts and stacks channels G1, G2, R, and
B from the pattern in Fig. 1.6.
As proven in Glazistov and Petrova (2018), the matrix Â constructed as described above satisfies

Â = (I_3 ⊗ F_n* ⊗ F_n*) Λ_A (I_3 ⊗ F_n ⊗ F_n),

where Λ_A ∈ 𝒢_3 ⊗ 𝒢_{2s} ⊗ 𝒟_{n/(2s)} ⊗ 𝒢_{2s} ⊗ 𝒟_{n/(2s)}. After characterizing the matrix in terms of matrix class, it becomes possible to prescind from the original problem setting and focus on matrix class transformations.
In the papers relying on block diagonalization of BCCB matrices, like Sroubek et al. (2011), it is usually only noted that certain matrices can be transformed to block diagonal form, and no explicit transforms are provided, probably because it is hard to express the formula; but thanks to the apparatus of structured matrices, it becomes easy to obtain closed-form permutation matrices transforming from the classes 𝒢_s ⊗ 𝒟_{n/s}, 𝒢_s ⊗ 𝒟_{n/s} ⊗ 𝒢_s ⊗ 𝒟_{n/s}, and 𝒢_3 ⊗ 𝒢_{2s} ⊗ 𝒟_{n/(2s)} ⊗ 𝒢_{2s} ⊗ 𝒟_{n/(2s)} to the block diagonal forms 𝒟_{n/s} ⊗ 𝒢_s, 𝒟_{n²/s²} ⊗ 𝒢_{s²}, and 𝒟_{n²/(4s²)} ⊗ 𝒢_{12s²}, respectively.
∀ B ∈ 𝒜_1 ⊗ 𝒜_2 ⇒ (I_n ⊗ P_m^T) B (I_n ⊗ Q_m) ∈ 𝒜_1 ⊗ 𝒜_3.

The property of inner class preservation can be postulated for any matrix classes and for matrices P_n, Q_n providing ∀ A ∈ 𝒜_1 ⇒ P_n^T A Q_n ∈ 𝒜_3:

∀ B ∈ 𝒜_1 ⊗ 𝒜_2 : (P_n^T ⊗ I_m) B (Q_n ⊗ I_m) ∈ 𝒜_3 ⊗ 𝒜_2.

This means that each outer block is transformed from class 𝒜_1 to class 𝒜_3, while the inner class 𝒜_2 remains the same.
In the 2D SR problem, we can apply the class swapping operation twice and convert $\hat A$ to block diagonal form: for each $\hat A = (F_n \otimes F_n)^{*} \Lambda_A (F_n \otimes F_n)$, where $\Lambda_A \in (\mathcal{B}_s \otimes \mathcal{D}_{n/s}) \otimes (\mathcal{B}_s \otimes \mathcal{D}_{n/s})$, it holds that

$\bigl(I_{n/s} \otimes \Pi_{s,n}^{T}\bigr)\bigl(\Pi_{n/s,n}^{T} \otimes I_{s}\bigr)\,(F_n \otimes F_n)\, \hat A\, (F_n \otimes F_n)^{*}\, \bigl(\Pi_{n/s,n} \otimes I_{s}\bigr)\bigl(I_{n/s} \otimes \Pi_{s,n}\bigr) \in \mathcal{D}_{n^2/s^2} \otimes \mathcal{B}_{s^2}.$

Matrix $\hat A$ arising from the Bayer SR problem, satisfying $\hat A = (I_3 \otimes F_n \otimes F_n)^{*} \Lambda_A (I_3 \otimes F_n \otimes F_n)$, where $\Lambda_A \in \mathcal{B}_3 \otimes (\mathcal{B}_{2s} \otimes \mathcal{D}_{n/(2s)}) \otimes (\mathcal{B}_{2s} \otimes \mathcal{D}_{n/(2s)})$, can be transformed to the block diagonal form $\mathcal{D}_{n^2/(4s^2)} \otimes \mathcal{B}_{12s^2}$ in the same manner.
Table 1.1 summarizes the computational complexity of finding the matrix inverse (marked “MI”) for the 1D, 2D, and Bayer SR problems. Block diagonalization made it possible to reduce the complexity of the 2D and Bayer SR problems from $O(n^6)$ to $O(n^2 s^4) + O(n^2 \log n)$, where the $n^2 \log n$ term corresponds to the complexity of the block diagonalization process itself. Typically, n is much larger than s (e.g. n = 16, …, 32 and s = 2, …, 4), which provides significant economy.
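The source of the saving is that a block diagonal system can be inverted block by block. A minimal numerical sketch (ours, not from the book): inverting m blocks of size p costs $O(m p^3)$ instead of $O((mp)^3)$ for the assembled matrix, while giving the identical result:

```python
import numpy as np

rng = np.random.default_rng(2)
m, p = 8, 4                          # m diagonal blocks of size p x p

# diagonally dominant blocks, so every block is invertible
blocks = [rng.standard_normal((p, p)) + p * np.eye(p) for _ in range(m)]

A = np.zeros((m * p, m * p))
for i, Bk in enumerate(blocks):
    A[i * p:(i + 1) * p, i * p:(i + 1) * p] = Bk

# exploiting the structure: invert each p x p block independently
A_inv = np.zeros_like(A)
for i, Bk in enumerate(blocks):
    A_inv[i * p:(i + 1) * p, i * p:(i + 1) * p] = np.linalg.inv(Bk)

print(np.allclose(A_inv, np.linalg.inv(A)))   # True
```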
Fig. 1.10 Dependency of the proportion of energy of filter coefficients inside ε-vicinity of the
central element on the vicinity size
the central element, averaged for all filters computed for three input frames with quarter-pixel motion quantization.
The filters are extracted as shown in Fig. 1.9 during the off-line stage (Fig. 1.11)
and applied in the online stage (Fig. 1.12). These images seem self-explanatory, but
an additional description can be found in Petrova et al. (2017).
1 Super-Resolution: 1. Multi-Frame-Based Approach 19
We will show that by taking into account the symmetries intrinsic to this problem
and implementing a smart filter selection scheme, this number can be dramatically
reduced. Strict proofs were provided in Glazistov and Petrova (2018), while here
only the main results will be listed.
Let’s introduce the following transforms:

$\phi_3(B) = (I_3 \otimes U_n \otimes I_n)\, B\, (I_3 \otimes U_n \otimes I_n),$
$\phi_4(B) = (I_3 \otimes I_n \otimes U_n)\, B\, (I_3 \otimes I_n \otimes U_n),$
$\phi_5(B) = (I_3 \otimes \Pi_{n,n}^{T})\, B\, (I_3 \otimes \Pi_{n,n}),$

where $U_n = \begin{pmatrix} 0 & \cdots & 0 & 1 \\ 0 & \cdots & 1 & 0 \\ \vdots & & & \vdots \\ 1 & \cdots & 0 & 0 \end{pmatrix}$ (a permutation matrix of size $n \times n$ that flips the input vector), $J = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 0 & 1 \\ 0 & 1 & 0 \end{pmatrix}$, and $P_{x,y}$ is a 2D circular shift operator, where x is the horizontal shift and y is the vertical shift. Then the number of stored filters can be reduced using the following properties:
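As a 1D analogue of the flip used in $\phi_3$ and $\phi_4$ (our own illustration, not code from the book), conjugating a circulant convolution matrix by $U_n$ is equivalent to index-reversing its kernel, which is why negated motions can reuse flipped filters:

```python
import numpy as np

n = 5
U = np.eye(n)[::-1]                    # anti-diagonal flip permutation U_n
x = np.arange(n, dtype=float)
print(np.array_equal(U @ x, x[::-1]))  # True: U_n reverses a vector

# conjugating a circulant (convolution) matrix by U_n flips its kernel
c = np.array([1., 2., 3., 0., 0.])
C = np.stack([np.roll(c, j) for j in range(n)], axis=1)
flipped = (U @ C @ U)[:, 0]            # kernel of U C U
print(np.array_equal(flipped, np.roll(c[::-1], 1)))   # True
```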
$\hat A(-u_1, -v_1, \cdots, -u_k, -v_k) = \phi_1\bigl(\hat A(u_1, v_1, \cdots, u_k, v_k)\bigr),$
$\hat A(u_1 + x, v_1 + y, \cdots, u_k + x, v_k + y) = \phi_2\bigl(\hat A(u_1, v_1, \cdots, u_k, v_k)\bigr),$
$\hat A(-u_1 - 1, v_1, \cdots, -u_k - 1, v_k) = \phi_3\bigl(\hat A(u_1, v_1, \cdots, u_k, v_k)\bigr),$
$\hat A(u_1, -v_1 - 1, \cdots, u_k, -v_k - 1) = \phi_4\bigl(\hat A(u_1, v_1, \cdots, u_k, v_k)\bigr),$
$\hat A(v_1 + s, u_1 + s, \cdots, v_k + s, u_k + s) = \phi_5\bigl(\hat A(u_1, v_1, \cdots, u_k, v_k)\bigr).$
We can also use the same filters for different permutations of input frames: if σ(i) is any permutation of the indices i = 1, …, k, then

$\hat A(u_1, v_1, \cdots, u_k, v_k) = \hat A\bigl(u_{\sigma(1)}, v_{\sigma(1)}, \cdots, u_{\sigma(k)}, v_{\sigma(k)}\bigr).$
Adding 2s to one of the $u_i$’s or $v_i$’s also does not change the problem:

$\hat A(u_1, v_1, \cdots, u_k, v_k) = \hat A(u_1, v_1, \cdots, u_{i-1}, v_{i-1}, u_i + 2s, v_i, u_{i+1}, v_{i+1}, \cdots, u_k, v_k),$
$\hat A(u_1, v_1, \cdots, u_k, v_k) = \hat A(u_1, v_1, \cdots, u_{i-1}, v_{i-1}, u_i, v_i + 2s, u_{i+1}, v_{i+1}, \cdots, u_k, v_k).$
For some motions $u_1, v_1, \cdots, u_k, v_k$, certain non-trivial compositions of the transforms $\phi_1, \ldots, \phi_5$ keep the system invariant:

$\hat A = \phi_{i_1}\bigl(\phi_{i_2}\bigl(\ldots \phi_{i_m}(\hat A)\ldots\bigr)\bigr),$
which makes it possible to express some rows of $\hat A$ through elements of other rows. This is an additional resource for filter-bank compression. Applying exhaustive search and using the rules listed above, for filter size 16 × 16 and k = 3, we have
of stored values and the number of problems to be solved during the off-line stage
were both reduced. The compression approach increased the complexity of the
online stage to a certain extent, but the proposed compression scheme allows a
straightforward software implementation based on the index table, which stores
appropriate base filters and a list of transforms, encoded in 5 bits, for each possible
set of quantized displacements.
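Such an index table can be sketched as follows. Everything here is hypothetical (the transform codes, filter contents, and table entries are made up for illustration); it only shows the mechanism of storing one base filter plus a transform code per quantized displacement:

```python
import numpy as np

# toy transforms acting on a 2D filter kernel; a real code fits in 5 bits
TRANSFORMS = {
    0: lambda f: f,           # identity
    1: lambda f: f[::-1, :],  # vertical flip   (phi_3-like)
    2: lambda f: f[:, ::-1],  # horizontal flip (phi_4-like)
    3: lambda f: f.T,         # transpose       (phi_5-like)
}

# index table: quantized displacement tuple -> (base filter id, transform code)
base_filters = {0: np.arange(16.0).reshape(4, 4)}
index_table = {
    (0, 1):  (0, 0),
    (0, -1): (0, 1),  # hypothetical: negated motion reuses the flipped filter
}

def fetch_filter(displacement):
    fid, code = index_table[displacement]
    return TRANSFORMS[code](base_filters[fid])

print(np.array_equal(fetch_filter((0, -1)), base_filters[0][::-1, :]))  # True
```

Only the base filters are stored; all other filters are materialized on the fly from the table.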
The apparatus of multilevel matrices can be similarly applied to deblurring, multi-
frame deblurring, demosaicing, multi-frame demosaicing, or de-interlacing prob-
lems in order to obtain fast FFT-based algorithms similar to those described in Sect.
1.2.2 and to analyse problem symmetries, as was shown in this section for the Bayer
SR problem.
We have developed a high-quality multi-frame joint demosaicing and SR (Bayer SR)
solution which does not use iterations and has linear complexity. A visual
Fig. 1.13 Sample quality on real images: top row: demosaicing from Hirakawa and Parks (2005) with subsequent bicubic interpolation; bottom row: Bayer SR
Fig. 1.14 Comparison of RGB and Bayer SR: left side: demosaicing from Malvar et al. (2004) with post-processing using RGB SR; right side: Bayer SR
comparison with the traditional approach is shown in Fig. 1.13. It can also be seen
that direct reconstruction from the Bayer domain is visually more pleasing compared
to subsequent demosaicing and single-channel SR, as shown in Fig. 1.14. This is the
only case when we used for benchmarking a demosaicing algorithm from Malvar
et al. (2004), because its design purpose was to minimize colour artefacts, which
would be a desirable property for the considered example. In all other cases, we
prefer the approach suggested by Hirakawa and Parks (2005), which provides a
higher PSNR and more natural-looking results.
In Fig. 1.15, we perform a visual comparison with an implementation of an
algorithm from Heide et al. (2014), which shows that careful choice of the linear
cross-channel regularization term can result in a more visually pleasing image than a
non-linear term.
We performed a numeric evaluation of the SR algorithm quality on synthetic
images in order to concentrate on the core algorithm performance without consid-
ering issues of accuracy of motion estimation. We used a test image shown in
Fig. 1.16, which contains several challenging areas (from the point of view of
demosaicing algorithms). Since the reconstruction quality depends on the displace-
ment between low-resolution frames (worst corner case: all the images with the same
displacement), we conducted a statistical experiment with randomly generated
motions. Numeric measurements for several experiment conditions are charted in
Fig. 1.17. Measurements are made separately for each channel. Experiments with
reconstruction from two, three, and four frames were made. Four different
Fig. 1.15 Sample results of joint demosaicing and SR on rendered sample with known translational
motion: (a) ground truth; (b) demosaicing from Hirakawa and Parks (2005) + bicubic interpolation;
(c) demosaicing from Hirakawa and Parks (2005) + RGB SR; (d) Bayer SR, smaller regularization
term; (e) Bayer SR, bigger regularization term; (f) our implementation of Bayer SR from Heide et al.
(2014) with cross-channel regularization term from Heide et al. (2013)
Table 1.3 Impact of MV rounding and RGB/Bayer SR for 4× magnification, red channel (PSNR, dB). Columns 2, 3, and 4 give the number of frames used for reconstruction.

MV rounding | Domain | Demosaicing method        | Configuration  | 2    | 3    | 4
No          | RGB    | Malvar et al. (2004)      | 4× SR          | 23.5 | 23.7 | 23.9
Yes         | RGB    | Malvar et al. (2004)      | 4× SR          | 23.4 | 23.6 | 23.7
No          | RGB    | Hirakawa and Parks (2005) | 4× SR          | 24.3 | 24.5 | 24.6
Yes         | RGB    | Hirakawa and Parks (2005) | 4× SR          | 24.0 | 24.3 | 24.3
No          | Bayer  | N/A                       | 2× SR + 2× ↑   | 24.6 | 25.2 | 25.6
Yes         | Bayer  | N/A                       | 2× SR + 2× ↑   | 23.3 | 23.4 | 23.5
No          | Bayer  | N/A                       | 4× SR          | 24.9 | 25.7 | 26.3
Yes         | Bayer  | N/A                       | 4× SR          | 24.5 | 25.2 | 25.6
No          | RGB    | Malvar et al. (2004)      | 2× SR + 2× ↑   | 23.5 | 23.7 | 23.9
Yes         | RGB    | Malvar et al. (2004)      | 2× SR + 2× ↑   | 22.8 | 22.9 | 23.0
No          | RGB    | Hirakawa and Parks (2005) | 2× SR + 2× ↑   | 24.1 | 24.2 | 24.5
Yes         | RGB    | Hirakawa and Parks (2005) | 2× SR + 2× ↑   | 23.4 | 23.5 | 23.5
N/A         | RGB    | Hirakawa and Parks (2005) | 4× ↑ (single-frame baseline) | 23.0
(2004). Increasing the number of frames from 2 to 4 caused a 0.5 dB increase in the
RGB SR set-up and a 1.1–1.4 dB increase in the Bayer SR set-up. As expected, the
Bayer SR showed a superior performance, with 26.3 dB on four frames without
rounding of the motion vectors and 25.6 dB with rounded motion vectors. MV
rounding caused a quality drop by 0.1–0.3 dB for RGB SR and a quality drop by
0.4–0.7 dB for Bayer SR. The configuration with subsequent SR and up-sampling
behaves well enough without MV rounding but in the case of rounding can be even
inferior to the baseline.
Although there is clear evidence of weight decay, we had to evaluate the real impact on the quality of the algorithm output caused by filter truncation. Also, since the results of 4× SR with subsequent 2× downscaling were visually more pleasing than plain 2× SR, we evaluated these configurations numerically. The results are shown in Table 1.4. The bottom line shows the baseline with Hirakawa and Parks (2005) demosaicing followed by 2× bicubic up-sampling, providing 27.95 dB reconstruction quality. We can see that even for the simplest setting for 2× magnification, the difference in PSNR from the baseline is 1.38 dB. Increasing the number of frames from two to four allows us to increase the quality by about 0.9 dB (for 2× SR) to 1.1 dB (4× SR + down-sampling) compared to the corresponding two-frame configuration. We can also see that for each number of observed frames, the 4× SR + down-sampling is about 1.6–1.8 dB better than the corresponding plain 2× SR. The influence of the reduced kernel size (from 16 to 12) is almost negligible and never exceeds 0.15 dB. Finally, in the four-frame set-up, we can see a PSNR increase of 4.17 dB compared to the baseline.
Table 1.4 Impact of the kernel size and comparison of the 4× SR + 2× ↓ and 2× SR configurations (PSNR, dB). Columns 2, 3, and 4 give the number of frames used for reconstruction.

Kernel size | Configuration | 2     | 3     | 4
16 × 16     | 4× SR + 2× ↓  | 30.93 | 31.75 | 32.12
16 × 16     | 2× SR         | 29.34 | 30.19 | 30.25
14 × 14     | 4× SR + 2× ↓  | 30.93 | 31.75 | 32.12
14 × 14     | 2× SR         | 29.34 | 30.04 | 30.26
12 × 12     | 4× SR + 2× ↓  | 30.93 | 31.72 | 32.12
12 × 12     | 2× SR         | 29.33 | 30.05 | 30.25
N/A         | Hirakawa and Parks (2005) + 2× ↑ (single-frame baseline) | 27.95
$L = \frac{1}{M_{rgb}\sqrt{3}}(R + G + B), \qquad S = \frac{1}{M_{rgb}\sqrt{2}}(R - B), \qquad T = \frac{1}{M_{rgb}\sqrt{6}}(R - 2G + B),$

where $M_{rgb}$ is the maximum of the colour channels. These formulae assume normalized input (i.e. values between 0 and 1). Let us denote the reference frame in LST space as $f_{ref}$ and a compensated frame in LST space as $f_k$. Then two difference sub-metrics were computed: $d_1 = 1 - \bigl((f_{ref} - f_k) \ast G\bigr)^{\gamma}$, where $\ast$ is a convolution operator and G is a Gaussian filter, and $d_2 = \max(0, \mathrm{SSIM}(f_{ref}, f_k))$. The final reliability map was computed as a threshold transform over $\frac{d_1 d_2}{d_1 + d_2}$ with subsequent nearest neighbour
up-sampling to full size. In pixels where motion was detected to be unreliable, a
reduced number of frames were used for reconstruction. In Fig. 1.18, the special
filter-bank for processing areas which use a reduced number of frames (from 1 to k − 1) is denoted as “fallback”. In order to obtain the final image out of pixels
obtained using anisotropic, directional, and partial (fallback) reconstruction filters, a
special blending sub-algorithm was implemented.
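The LST transform above is straightforward to implement. The sketch below is our own illustration of the formulas; note that the book only states that $M_{rgb}$ is “the maximum of colour channels”, so taking it as the global maximum is an assumption (a per-pixel maximum would also fit that description):

```python
import numpy as np

def lst(rgb):
    """LST transform of a normalized RGB image (H x W x 3).

    M_rgb is taken here as the global maximum over the colour channels;
    the exact normalization intended by the text may differ (assumption).
    """
    R, G, B = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    m = rgb.max()
    L = (R + G + B) / (m * np.sqrt(3.0))
    S = (R - B) / (m * np.sqrt(2.0))
    T = (R - 2.0 * G + B) / (m * np.sqrt(6.0))
    return L, S, T

grey = np.full((2, 2, 3), 0.5)   # achromatic input: both chroma axes vanish
L, S, T = lst(grey)
print(np.allclose(S, 0.0), np.allclose(T, 0.0), np.allclose(L, np.sqrt(3.0)))
```

For achromatic input the chroma components S and T are exactly zero, which is the decorrelation property the metric relies on.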
Motion estimation in the Bayer domain is an interesting problem deserving
further description, which will be provided in Sect. 1.3.3.
The main goal of post-processing is to reduce the colour artefacts. Unfortunately,
a cross-channel regularizer that is strong enough to suppress colour artefacts pro-
duces undesirable blur as a side effect. In order to apply a lighter cross-channel
regularization, an additional colour artefact suppression block was implemented. It
converts an output image to YUV space, computes the map of local standard
deviations in the Y channel, smooths it using Gaussian filtering, and uses it as the
reference channel for cross-bilateral filtering of channels U and V. Then the image
with the original values in the Y channel and filtered values in the U and V channels
is transferred back to RGB.
Since the reconstruction model described above considers only Gaussian noise, it
makes sense to implement a separate and more elaborate noise reduction block using
accurate information from metadata and the camera noise profile.
In the case of higher noise levels, it is possible to use reconstruction filters
computed for a higher degree of regularization, but in this case the effect of SR
processing shifts from revealing new details to noise reduction, which is a simpler
problem that can be solved without the filter-bank approach.
In order to achieve a good balance between noise reduction and detail recon-
struction, a salience map was applied to control the local strength of the noise
reduction. A detailed description of salience map computation along with a descrip-
tion of a Bayer structure tensor is provided in Sect. 1.3.2. A visual comparison of the
results obtained with and without salience-based local control of noise reduction is
shown in Fig. 1.20. It can be seen that such adaptation provides a better detail level in
textured areas and higher noise suppression in flat regions compared to the baseline
version.
The structure tensor is a traditional instrument for the estimation of local directionality. It is a matrix composed of local gradients of pixel values,

$T = \begin{pmatrix} \sum \nabla_x^2 & \sum \nabla_x \nabla_y \\ \sum \nabla_x \nabla_y & \sum \nabla_y^2 \end{pmatrix},$

and the presence of a local directional structure is detected by a threshold transform of the coherence

$c = \left( \frac{\lambda_+ - \lambda_-}{\lambda_+ + \lambda_-} \right)^2,$

which is computed from the larger and smaller eigenvalues of the structure tensor, $\lambda_+$ and $\lambda_-$, respectively. If the coherence is small, the pixel belongs to a low-textured area or to a highly textured area without a single preferred direction. If the coherence is above the threshold, the local direction is collinear to the eigenvector corresponding to the larger eigenvalue. For RGB images, the gradients are computed simply as $\nabla_x = I_{y,x+1} - I_{y,x-1}$ and $\nabla_y = I_{y+1,x} - I_{y-1,x}$, while Bayer input requires some modifications. The gradients were computed as $\nabla_x = \max(\nabla_x R, \nabla_x G, \nabla_x B)$ and $\nabla_y = \max(\nabla_y R, \nabla_y G, \nabla_y B)$, where the gradients in each channel were computed as shown in Table 1.5.
In order to apply texture direction estimation in the filter-bank structure, the angle of the texture direction was quantized into 16 levels (we checked configurations with 8 levels, which provided visibly inferior quality, and with 32 levels, which provided a minor improvement over 16 levels but was more demanding in terms of required memory). An example of a direction map is shown in Fig. 1.21.
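The coherence computation can be written directly from the definitions above. This is a minimal sketch of our own (single-channel, whole-patch tensor instead of a windowed one): a strongly directional pattern yields coherence close to 1, while isotropic noise yields a small value:

```python
import numpy as np

def structure_tensor_coherence(I):
    """Coherence ((l+ - l-)/(l+ + l-))**2 of the structure tensor of patch I."""
    gx = I[1:-1, 2:] - I[1:-1, :-2]       # central differences, inner pixels
    gy = I[2:, 1:-1] - I[:-2, 1:-1]
    T = np.array([[np.sum(gx * gx), np.sum(gx * gy)],
                  [np.sum(gx * gy), np.sum(gy * gy)]])
    lm, lp = np.linalg.eigvalsh(T)        # ascending: smaller, larger
    return ((lp - lm) / (lp + lm)) ** 2 if lp + lm > 0 else 0.0

# vertical stripes: strongly directional -> coherence is exactly 1
x = np.arange(32)
stripes = np.tile(np.sin(0.7 * x), (32, 1))
print(structure_tensor_coherence(stripes) > 0.99)   # True

# white noise: no preferred direction -> low coherence
noise = np.random.default_rng(0).standard_normal((32, 32))
print(structure_tensor_coherence(noise) < 0.5)      # True
```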
The smaller eigenvalue $\lambda_-$ of the structure tensor was also used to compute the salience map. In each pixel location, the value of $\lambda_-$ was computed in some local window, and then a threshold transform and normalization were applied:

$r = \frac{\max(\min(\lambda_-, t_1), t_2) - t_2}{t_1 - t_2}.$

After that, the obtained map was smoothed by a Gaussian filter
Table 1.5 Computation of gradients for Bayer pattern (from Fig. 1.6)

Gradient | At R position | At B position
$\nabla_x R$ | $(I_{y,x+2} - I_{y,x-2})/2$ | $(I_{y+1,x+1} + I_{y-1,x+1} - I_{y+1,x-1} - I_{y-1,x-1})/2$
$\nabla_x G$ | $I_{y,x+1} - I_{y,x-1}$ (same at R and B) |
$\nabla_x B$ | $(I_{y+1,x+1} + I_{y-1,x+1} - I_{y+1,x-1} - I_{y-1,x-1})/2$ | $(I_{y,x+2} - I_{y,x-2})/2$
$\nabla_y R$ | $(I_{y+2,x} - I_{y-2,x})/2$ | $(I_{y+1,x+1} + I_{y+1,x-1} - I_{y-1,x+1} - I_{y-1,x-1})/2$
$\nabla_y G$ | $I_{y+1,x} - I_{y-1,x}$ (same at R and B) |
$\nabla_y B$ | $(I_{y+1,x+1} + I_{y+1,x-1} - I_{y-1,x+1} - I_{y-1,x-1})/2$ | $(I_{y+2,x} - I_{y-2,x})/2$

Gradient | At G1 position | At G2 position
$\nabla_x R$ | $(I_{y-1,x+2} + I_{y+1,x+2} - I_{y-1,x-2} - I_{y+1,x-2})/4$ | $I_{y,x+1} - I_{y,x-1}$
$\nabla_x G$ | $(I_{y,x+2} - I_{y,x-2} + I_{y+1,x+1} + I_{y-1,x+1} - I_{y+1,x-1} - I_{y-1,x-1})/4$ (same at G1 and G2) |
$\nabla_x B$ | $I_{y,x+1} - I_{y,x-1}$ | $(I_{y-1,x+2} + I_{y+1,x+2} - I_{y-1,x-2} - I_{y+1,x-2})/4$
$\nabla_y R$ | $I_{y+1,x} - I_{y-1,x}$ | $(I_{y-2,x+1} + I_{y-2,x-1} - I_{y+2,x+1} - I_{y+2,x-1})/4$
$\nabla_y G$ | $(I_{y+2,x} - I_{y-2,x} + I_{y+1,x+1} + I_{y+1,x-1} - I_{y-1,x+1} - I_{y-1,x-1})/4$ (same at G1 and G2) |
$\nabla_y B$ | $(I_{y+2,x+1} + I_{y+2,x-1} - I_{y-2,x+1} - I_{y-2,x-1})/4$ | $I_{y+1,x} - I_{y-1,x}$
subpixel displacements on the other hand. At the same time, it should have modest computational complexity. To fulfill these requirements, a multiscale architecture combining 3-Dimensional Recursive Search (3DRS) and Lucas–Kanade (LK) optical flow was implemented (Fig. 1.22). The use of a 3DRS algorithm for frame-rate conversion is also demonstrated by Pohl et al. (2018) and in Chap. 15.
Here, the LK algorithm was implemented with improvements described in Baker
and Matthews (2004). To estimate the largest scale displacement, a simplified 3DRS
implementation from Pohl et al. (2018) was applied to the ¼ scale of the Y channel.
Further motion was refined by conventional LK on ¼ and ½ resolution, and finally
one pass of a specially developed Bayer LK was applied. The single-channel Lucas–Kanade method relies on the local solution of the system $T^{T} T\,[u\ v]^{T} = T^{T} b$, where $T$ is computed similarly to the way it was done in Sect. 1.3.2, except that the Gaussian averaging window is applied to the gradient values. However, for this application the gradient values were obtained just from bilinear demosaicing of the original Bayer image. The chart of the algorithm is shown in Fig. 1.22.
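One local LK step can be sketched as follows (our own single-channel toy, not code from the book; no Gaussian window, plain whole-patch sums). It solves $T^T T\,[u\ v]^T = T^T b$ for a synthetic pair of frames with a known subpixel shift:

```python
import numpy as np

def lk_step(I0, I1):
    """One Lucas-Kanade step: least-squares solution of T^T T [u v]^T = T^T b."""
    gx = 0.5 * (I0[1:-1, 2:] - I0[1:-1, :-2])   # spatial gradients of I0
    gy = 0.5 * (I0[2:, 1:-1] - I0[:-2, 1:-1])
    b = (I0 - I1)[1:-1, 1:-1]                   # temporal difference
    T = np.stack([gx.ravel(), gy.ravel()], axis=1)
    return np.linalg.solve(T.T @ T, T.T @ b.ravel())

# synthetic pair: I1 is I0 shifted right by a subpixel amount
y, x = np.mgrid[0:32, 0:32].astype(float)
I0 = np.sin(0.3 * x) + np.cos(0.4 * y)
shift = 0.2
I1 = np.sin(0.3 * (x - shift)) + np.cos(0.4 * y)
u, v = lk_step(I0, I1)
print(abs(u - shift) < 0.05 and abs(v) < 0.05)  # True: recovers ~0.2 px
```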
References
Azzari, L., Foi, A.: Gaussian-Cauchy mixture modeling for robust signal-dependent noise estimation. In: Proceedings of the 2014 IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 5357–5361 (2014)
Baker, S., Kanade, T.: Limits on super-resolution and how to break them. IEEE Trans. Pattern Anal.
Mach. Intell. 24(9), 1167–1183 (2002)
Baker, S., Matthews, I.: Lucas-Kanade 20 years on: a unifying framework. Int. J. Comput. Vis.
56(3), 221–255 (2004)
Baker, S., Scharstein, D., Lewis, J., Roth S., Black, M.J., Szeliski, R.: A database and evaluation
methodology for optical flow. In: Proceedings of IEEE International Conference on Computer
Vision, pp. 1–8 (2007). https://doi.org/10.1007/s11263-010-0390-2
Benzi, M., Bini, D., Kressner, D., Munthe-Kaas, H., Van Loan, C.: Exploiting hidden structure in
matrix computations: algorithms and applications. In: Benzi, M., Simoncini, V. (eds.) Lecture
Notes in Mathematics, vol. 2173. Springer International Publishing, Cham (2016)
Bodduna, K., Weickert, J.: Evaluating data terms for variational multi-frame super-resolution. In:
Lauze, F., Dong, Y., Dahl, A.B. (eds.) Lecture Notes in Computer Science, vol. 10302,
pp. 590–601. Springer Nature Switzerland AG, Cham (2017)
Brooks, T., Mildenhall, B., Xue, T., Chen, J., Sharlet, D., Barron, J.-T.: Unprocessing images for
learned raw denoising. In: Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition, pp. 11036–11045 (2019)
Cai, J., Gu, S., Timofte, R., Zhang, L., Liu, X., Ding, Y. et al.: NTIRE 2019 challenge on real image
super-resolution: methods and result. In: IEEE/CVF Conference on Computer Vision and
Pattern Recognition Workshops, pp. 2211–2223 (2019)
Chen, C., Ren, Y., Kuo, C.-C.: Big Visual Data Analysis. Scene Classification and Geometric
Labeling, Springer Singapore, Singapore (2016)
Digital negative (DNG) specification, v.1.4.0 (2012)
Farsiu, S., Robinson, D., Elad, M., Milanfar, P.: Robust shift and add approach to superresolution.
In: Proceedings of IS&T International Symposium on Electronic Imaging, Applications of
Digital Image Processing XXVI, vol. 5203 (2003). https://doi.org/10.1117/12.507194.
Accessed on 02 Oct 2020
Farsiu, S., Robinson, M.D., Elad, M., Milanfar, P.: Fast and robust multi-frame super resolution.
IEEE Trans. Image Process. 13(10), 1327–1344 (2004)
Fazli, S., Fathi, H.: Video image sequence super resolution using optical flow motion estimation.
Int. J. Adv. Stud. Comput. Sci. Eng. 4(8), 22–26 (2015)
Foi, A., Alenius, S., Katkovnik, V., Egiazarian, K.: Noise measurement for raw data of digital
imaging sensors by automatic segmentation of non-uniform targets. IEEE Sensors J. 7(10),
1456–1461 (2007)
Foi, A., Trimeche, M., Katkovnik, V., Egiazarian, K.: Practical Poissonian-Gaussian noise model-
ing and fitting for single-image raw-data. IEEE Trans. Image Process. 17(10), 1737–1754
(2008)
Giachetti, A., Asuni, N.: Real time artifact-free image interpolation. IEEE Trans. Image Process.
20(10), 2760–2768 (2011)
Glazistov, I., Petrova, X.: Structured matrices in super-resolution problems. In: Proceedings of the Sixth China-Russia Conference on Numerical Algebra with Applications. Session Report (2017)
Glazistov, I., Petrova, X.: Superfast joint demosaicing and super-resolution. In: Proceedings of
IS&T International Symposium on Electronic Imaging, Computational Imaging XVI,
pp. 2721–2728 (2018)
Gutierrez, E.Q., Callico, G.M.: Approach to super-resolution through the concept of multi-camera
imaging. In: Radhakrishnan, S. (ed.) Recent Advances in Image and Video Coding (2016).
https://www.intechopen.com/books/recent-advances-in-image-and-video-coding/approach-to-
super-resolution-through-the-concept-of-multicamera-imaging
Hansen, P.C., Nagy, J.G., O’Leary, D.P.: Deblurring Images: Matrices, Spectra, and Filtering.
Fundamentals of Algorithms, vol. 3. SIAM, Philadelphia (2006)
Heide, F., Rouf, M., Hullin, M.-B., Labitzke, B., Heidrich, W., Kolb, A.: High-quality computa-
tional imaging through simple lenses. ACM Trans. Graph. 32(5), Article No. 149 (2013)
Heide, F., Steinberger, M., Tsai, Y.-T., Rouf, M., Pajak, D., Reddy, D., Gallo, O., Liu, J., Heidrich,
W., Egiazarian, K., Kautz, J., Pulli, K.: FlexISP: a flexible camera image processing framework.
ACM Trans. Graph. 33(6), 1–13 (2014)
Hirakawa, K., Parks, W.-T.: Adaptive homogeneity-directed demosaicing algorithm. IEEE Trans.
Image Process. 14(3), 360–369 (2005)
Ismaeil, K.A., Aouada, D., Ottersten B., Mirbach, B.: Multi-frame super-resolution by enhanced
shift & add. In: Proceedings of 8th International Symposium on Image and Signal Processing
and Analysis, pp. 171–176 (2013)
Johnson, J., Alahi, A., Fei-Fei, L.: Perceptual losses for real-time style transfer and super-resolution.
In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) European Conference on Computer Vision,
pp. 694–711. Springer International Publishing, Cham (2016)
Kanaev, A.A., Miller, C.W.: Multi-frame super-resolution algorithm for complex motion patterns.
Opt. Express. 21(17), 19850–19866 (2013)
Kuan, D.T., Sawchuk, A.A., Strand, T.C., Chavel, P.: Adaptive noise smoothing filter for images
with signal-dependent noise. IEEE Trans. Pattern Anal. Mach. Intell. 7(2), 165–177 (1985)
Li, X., Orchard, M.: New edge-directed interpolation. IEEE Trans. Image Process. 10(10),
1521–1527 (2001)
Lin, Z., He, J., Tang, X., Tang, C.-K.: Limits of learning-based superresolution algorithms.
Int. J. Comput. Vis. 80, 406–420 (2008)
Liu, X., Tanaka, M., Okutomi, M.: Estimation of signal dependent noise parameters from a single
image. In: Proceedings of the IEEE International Conference on Image Processing, pp. 79–82
(2013)
Liu, X., Tanaka, M., Okutomi, M.: Practical signal-dependent noise parameter estimation from a
single noisy image. IEEE Trans. Image Process. 23(10), 4361–4371 (2014)
Liu, J., Wu C.-H., Wang, Y., Xu Q., Zhou, Y., Huang, H., Wang, C., Cai, S., Ding, Y., Fan, H.,
Wang, J.: Learning raw image de-noising with Bayer pattern unification and Bayer preserving
augmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern
Recognition (CVPR) Workshops, pp. 4321–4329 (2019)
Ma, D., Afonso, F., Zhang, M., Bull, A.-D.: Perceptually inspired super-resolution of compressed
videos. In: Proceedings of SPIE 11137. Applications of Digital Image Processing XLII, Paper
1113717 (2019)
Malvar, R., He, L.-W., Cutler, R.: High-quality linear interpolation for demosaicing of Bayer-patterned color images. In: International Conference on Acoustics, Speech and Signal Processing, vol. 34(11), pp. 2274–2282 (2004)
Mastronardi, N., Ng, M., Tyrtyshnikov, E.E.: Decay in functions of multiband matrices. SIAM
J. Matrix Anal. Appl. 31(5), 2721–2737 (2010)
Milanfar, P. (ed.): Super-Resolution Imaging. CRC Press (Taylor & Francis Group), Boca Raton
(2010)
Nasonov, A. Krylov, A., Petrova, X., Rychagov M.: Edge-directional interpolation algorithm using
structure tensor. In: Proceedings of IS&T International Symposium on Electronic Imaging.
Image Processing: Algorithms and Systems XIV, pp. 1–4 (2016)
Park, J.-H., Oh, H.-M., Moon, G.-K.: Multi-camera imaging system using super-resolution. In:
Proceedings of 23rd International Technical Conference on Circuits/Systems, Computers and
Communications, pp. 465–468 (2008)
Petrova, X., Glazistov, I., Zavalishin, S., Kurmanov, V., Lebedev, K., Molchanov, A., Shcherbinin,
A., Milyukov, G., Kurilin, I.: Non-iterative joint demosaicing and super-resolution framework.
In: Proceedings of IS&T International Symposium on Electronic Imaging, Computational
Imaging XV, pp. 156–162 (2017)
Pohl, P., Anisimovsky, V., Kovliga, I., Gruzdev, A., Arzumanyan, R.: Real-time 3DRS motion
estimation for frame-rate conversion. In: Proceedings of IS&T International Symposium on
Electronic Imaging, Applications of Digital Image Processing XXVI, pp. 3281–3285 (2018)
Pyatykh, S., Hesser, J.: Image sensor noise parameter estimation by variance stabilization and
normality assessment. IEEE Trans. Image Process. 23(9), 3990–3998 (2014)
Rakhshanfar, M., Amer, M.A.: Estimation of Gaussian, Poissonian-Gaussian, and processed visual
noise and its level function. IEEE Trans. Image Process. 25(9), 4172–4185 (2016)
Robinson, M.D., Farsiu, S., Milanfar, P.: Optimal registration of aliased images using variable
projection with applications to super-resolution. Comput. J. 52(1), 31–42 (2009)
Robinson, M.D., Toth, C.A., Lo, J.Y., Farsiu, S.: Efficient Fourier-wavelet super-resolution. IEEE
Trans. Image Process. 19(10), 2669–2681 (2010)
Rochefort, G., Champagnat, F., Le Besnerais, G., Giovannelli, G.-F.: An improved observation
model for super-resolution under affine motion. IEEE Trans. Image Process. 15(11), 3325–3337
(2006)
Romano, Y., Isidoro, J., Milanfar, P.: RAISR: rapid and accurate image super resolution. IEEE
Trans. Comput. Imaging. 3(1), 110–125 (2017)
Safonov, I.V., Kurilin, I.V., Rychagov, M.N., Tolstaya, E.V.: Adaptive Image Processing Algo-
rithms for Printing. Springer Nature Singapore AG, Singapore (2018)
Safonov, I.V., Kurilin, I.V., Rychagov, M.N., Tolstaya, E.V.: Document Image Processing for
Scanning and Printing. Springer Nature Switzerland AG, Cham (2019)
Segall, C.A., Katsaggelos, A.K., Molina, R., Mateos, J.: Super-resolution from compressed video.
In: Chaudhuri, S. (ed.) Super-Resolution Imaging. The International Series in Engineering and
Computer Science Book Series, Springer, vol. 632, pp. 211–242 (2002)
Sroubek, F., Kamenick, J., Milanfar, P.: Superfast super-resolution. In: Proceedings of 18th IEEE
International Conference on Image Processing, pp. 1153–1156 (2011)
Sutour, C., Deledalle, C.-A., Aujol, J.-F.: Estimation of the noise level function based on a
non-parametric detection of homogeneous image regions. SIAM J. Imaging Sci. 8(4),
2622–2661 (2015)
Sutour, C., Aujol, J.-F., Deledalle, C.-A.: Automatic estimation of the noise level function for
adaptive blind denoising. In: Proceedings of 24th European Signal Processing Conference,
pp. 76–80 (2016)
Timofte, R., De Smet, V., Van Gool, L.: A+: adjusted anchored neighbourhood regression for fast
super-resolution. In: Asian Conference on Computer Vision, pp. 111–126 (2014)
Trench, W.: Properties of multilevel block α-circulants. Linear Algebra Appl. 431(10), 1833–1847
(2009)
Voevodin, V.V., Tyrtyshnikov, E.E.: Computational Processes with Toeplitz Matrices. Nauka, Moscow (1987) (in Russian). https://books.google.ru/books?id=pf3uAAAAMAAJ. Accessed on 02 Oct 2020
Zhang, H., Ding, F.: On the Kronecker products and their applications. J. Appl. Math. 2013, 296185
(2013)
Zhang, Z., Sze, V.: FAST: a framework to accelerate super-resolution processing on compressed
videos. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition,
pp. 1015–1024 (2017)
Zhang, L., Wu, X.L.: An edge-guide image interpolation via directional filtering and data fusion.
IEEE Trans. Image Process. 15(8), 2226–2235 (2006)
Zhang, Y., Wang, G., Xu, J.: Parameter estimation of signal-dependent random noise in CMOS/
CCD image sensor based on numerical characteristic of mixed Poisson noise samples. Sensors.
18(7), 2276–2293 (2018)
Zhao, N., Wei, Q., Basarab, A., Dobigeon, N., Kouame, D., Tourneret, J.-Y.: Fast single image
super-resolution using a new analytical solution for ℓ2-ℓ2 problems. IEEE Trans. Image Process.
25(8), 3683–3697 (2016)
Zhou, D., Shen, X., Dong, W.: Image zooming using directional cubic convolution interpolation.
IET Image Process. 6(6), 627–634 (2012)
Chapter 2
Super-Resolution: 2. Machine
Learning-Based Approach
Alexey S. Chernyavskiy
A. S. Chernyavskiy (*)
Philips AI Research, Moscow, Russia
e-mail: alexey.chernyavskiy@philips.com

2.1 Introduction
The remainder of this chapter will focus on architectural designs of SISR CNNs and on various aspects that make SISR challenging.
The blocks that make up the CNNs for super-resolution do not differ much from
neural networks used in other image-related tasks, such as object or face recognition.
They usually consist of convolution blocks, with kernel sizes of 3 × 3, interleaved
with simple activation functions such as the ReLU (rectified linear unit). Modern
super-resolution CNNs incorporate the blocks that have been successfully used in
other vision tasks, e.g. various attention mechanisms, residual connections, dilated
convolutions, etc. In contrast with CNNs that are designed for image classification,
there are usually no pooling operations involved. The input low-resolution image is
processed by sets of filters which are specified by kernels with learnable weights.
These operations produce arrays of intermediate outputs called feature maps.
Non-linear activation functions are applied to the feature maps, after adding
learnable biases, in order to zero out some values and to accentuate others. After
passing through several stages of convolutions and non-linear activations, the image
is finally transformed into a high-resolution version by means of the deconvolution
operation, also called transposed convolution (Shi et al. 2016a). Another up-scaling
option is the sub-pixel convolution layer (Shi et al. 2016b), which is faster than the
deconvolution, but is known to generate checkerboard artefacts. During training,
each LR image patch is forward propagated through the neural network, a zoomed
image is generated by the sequence of convolution blocks and activation functions,
and this image is compared to the true HR patch. A loss function is computed, and
the gradients of the loss function with respect to the neural network parameters
(weights and biases) are back-propagated; therefore the network parameters are
updated. In most cases, the loss function is the L2 or L1 distance, but the choice of
a suitable measure for comparing the generated image and its ground truth counter-
part is a subject of active research.
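The sub-pixel convolution layer mentioned above ends with a deterministic rearrangement (often called pixel shuffle). A minimal sketch of that rearrangement in plain numpy (our own illustration of the layout used in Shi et al. 2016b):

```python
import numpy as np

def pixel_shuffle(x, r):
    """Sub-pixel up-scaling: rearrange C*r^2 feature maps of size H x W
    into C maps of size rH x rW (depth-to-space)."""
    C, H, W = x.shape
    c = C // (r * r)
    x = x.reshape(c, r, r, H, W)      # split channel dim into (c, r, r)
    x = x.transpose(0, 3, 1, 4, 2)    # reorder to (c, H, r, W, r)
    return x.reshape(c, H * r, W * r)

x = np.arange(16.0).reshape(4, 2, 2)  # four 2x2 maps, r = 2 -> one 4x4 map
y = pixel_shuffle(x, 2)
print(y.shape)                        # (1, 4, 4)
```

Each output pixel at position (rh + i, rw + j) is taken from channel i·r + j at (h, w), so the convolution preceding this layer does all the learning and the up-scaling itself is just a memory reshuffle.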
For training a super-resolution model, one should create or obtain a training
dataset which consists of pairs of low-resolution images and their high-resolution
versions. Almost all the SISR neural networks are designed for one zoom factor
only, although, e.g. a 4× up-scaled image can in principle be obtained by passing
the output of a 2× zoom CNN through itself once again. In this way, the size ratio of
the HR and LR images used for training the neural network should correspond to the
desired zoom factor. Training is performed on square image patches cropped from
random locations in input images. The patch size should match the receptive field of
the CNN, which is the size of the neighbourhood that is involved in the computation
of a single pixel of the output. The receptive field of a CNN is usually related to the
typical kernel size and the depth (number of convolution layers).
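For a plain stack of stride-1 convolutions, the relation between depth, kernel size, and receptive field is a simple sum. A small helper (ours, stating the standard formula, not code from the book):

```python
def receptive_field(kernel_sizes):
    """Receptive field of a stack of stride-1 convolutions:
    each k x k layer grows the field by k - 1 pixels."""
    rf = 1
    for k in kernel_sizes:
        rf += k - 1
    return rf

print(receptive_field([3] * 10))   # 21: ten 3x3 layers see a 21x21 patch
print(receptive_field([9, 1, 5]))  # 13: an SRCNN-like 9-1-5 stack
```

This is why training patches are chosen at least as large as the receptive field: smaller patches would never expose the network to the full neighbourhood it can use.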
In terms of architectural complexity, the progress of CNNs for the super-resolu-
tion task has closely followed the successes of image classification CNNs. However,
some characteristic features are inherent to the CNNs used in SISR and to the process
of training such neural networks. We will examine these peculiarities later. There are
several straightforward design choices that come into play when one decides to
create and train a basic SISR CNN.
First, there is the issue of early vs. late upsampling. In early upsampling, the
low-resolution image is up-scaled using a simple interpolation method (e.g. bicubic
or Lanczos), and this crude version of the HR image serves as input to the CNN,
which basically performs the deblurring. This approach has an obvious drawback:
the large number of operations and the large HR-sized intermediate feature
maps that need to be stored in memory. Hence, in most modern SISR CNNs, starting
from Dong et al. (2016b), the upsampling is delayed to the last stages. In this way,
most of the convolutions are performed over feature maps that have the same size as
the low-resolution image.
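With late upsampling, the final resolution increase is typically performed by a sub-pixel (pixel-shuffle) layer that rearranges an LR-sized feature map into an HR image. A NumPy sketch of the rearrangement itself:

```python
import numpy as np

def pixel_shuffle(x, r):
    """Rearrange a (C*r*r, H, W) feature map into a (C, H*r, W*r) output."""
    c2, h, w = x.shape
    c = c2 // (r * r)
    x = x.reshape(c, r, r, h, w)      # split channels into (c, r, r)
    x = x.transpose(0, 3, 1, 4, 2)    # interleave: (c, h, r, w, r)
    return x.reshape(c, h * r, w * r)

feat = np.random.rand(3 * 9, 16, 16)  # LR-sized features for a 3x zoom
print(pixel_shuffle(feat, 3).shape)   # (3, 48, 48)
```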
While in many early CNN designs the network learned to directly generate an HR
image and the loss function that was being minimized was computed as the norm of
the difference between the generated image and the ground truth HR, Zhang et al.
(2017) proposed to learn the residual image, i.e. the difference between the HR
image and the LR image up-scaled by a simple interpolation. This strategy proved
beneficial for SISR, denoising and JPEG image deblocking. An example of CNN
architecture with residual learning is shown in Fig. 2.1; a typical model output is
shown in Fig. 2.2 for a zoom factor equal to 3. Compared to standard interpolation,
such as bicubic or Lanczos, the output of a trained SISR CNN contains much sharper
details and shows practically no aliasing.
Fig. 2.1 A schematic illustration of a CNN for super-resolution with residual learning
Fig. 2.2 Results of image up-scaling by a factor of 3: left: result of applying bicubic interpolation;
right: HR image generated by a CNN
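Residual learning changes only the training target, not the architecture. A sketch of target construction, with nearest-neighbour upscaling standing in for the bicubic interpolation used in practice:

```python
import numpy as np

def upscale_nn(img, r):
    # Nearest-neighbour upscaling, a stand-in for bicubic interpolation.
    return np.repeat(np.repeat(img, r, axis=0), r, axis=1)

def residual_target(lr, hr, r):
    # What the network learns to predict: HR minus the interpolated LR.
    return hr - upscale_nn(lr, r)

lr = np.array([[1.0, 2.0], [3.0, 4.0]])
hr = np.random.rand(4, 4)
res = residual_target(lr, hr, 2)
# At inference time, the HR estimate is interpolation plus predicted residual.
restored = upscale_nn(lr, 2) + res
print(np.allclose(restored, hr))  # True
```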
38 A. S. Chernyavskiy
The success of deep convolutional neural networks is largely due to the availability
of training data. For the task of super-resolution, the neural networks are typically
trained on high-quality natural images, from which the low-resolution images are
generated by applying a specific predefined downscaling algorithm (bicubic,
Lanczos, etc.). The images belonging to the training set should ideally come from
the same domain as the one that will be encountered after training. Man-made
architectural structures possess very specific textural characteristics; if one wants to
up-scale satellite imagery or street views, the training set should contain many
images of buildings, as in the popular dataset Urban100 (Huang et al. 2015).
Representative images from Urban100 are shown in Fig. 2.3.
On the other hand, a more general training set would allow for greater flexibility
and better average image quality. A widely used training dataset, DIV2K (Agustsson
and Timofte 2017), contains 1000 images, each with at least 2000 pixels along one
of its axes, and features very diverse content, as illustrated in Fig. 2.4.
Fig. 2.5 Left: LR image; top-right: HR ground truth; bottom-right: image reconstructed by a CNN
from an LR image that was obtained by simple decimation without smoothing or interpolation.
(Reproduced with permission from Shocher et al. 2018)
Real-world LR images, e.g. those downloaded from the Web or taken by a
smartphone camera, as well as old historic images, contain many artefacts
coming from sensor noise, non-ideal PSF, aliasing, image compression, on-device
denoising, etc. It is obvious that in a real-life scenario, a low-resolution image is
produced by an optical system and, generally, is not created by applying any kind of
subsampling and interpolation. In this regard, the whole idea of training CNNs on
carefully engineered images obtained by using a known down-sampling function
may seem questionable.
It seems natural then to create a training dataset that would simulate the real
artefacts introduced into an image by a real imaging system. Another property of real
imaging systems is the intrinsic trade-off between resolution (R) and field of view
(FoV). When zooming out the optical lens in a DSLR camera, the obtained image
has a larger FoV but loses details on subjects; when zooming in the lens, the details
of subjects show up at the cost of a reduced FoV. This trade-off also applies to
cameras with fixed focal lenses (e.g. smartphones), when the shooting distance
changes. The loss of resolution that is due to enlarged FoV can be thought of as a
degradation model that can be reversed by training a CNN (Chen et al. 2019; Cai
et al. 2019a). In a training dataset created for this task, the HR image could come, for
example, from a high-quality DSLR camera, while the LR image could be obtained
by another camera that would have inferior optical characteristics, e.g. a cheap
digital camera with a lower image resolution, different focal distance, distortion
parameters, etc. Both cameras should be mounted on tripods in order to ensure the
closest similarity of the captured scenes. Still, due to the different focus and depth of
field, it would be impossible to align whole images (Fig. 2.6). The patches suitable
for training the CNN would have to be cropped from the central parts of the image
pair. Also, since getting good image quality is not a problem when up-scaling
low-frequency regions like the sky, care should be taken to only select informative
patches that contain high-frequency information, such as edges, corners, and spots.
These parts can be selected using classical computer vision feature detectors, like
SIFT, SURF, or FAST. A subsequent distortion compensation and correlation-based
alignment (registration) must be performed in order to obtain the most accurate
LR-HR pairs. Also, the colour map of the LR image, which can differ from that of
the HR image due to white balance, exposure time, etc., should be adjusted via
histogram matching using the HR image as reference (see Migukin et al. 2020).
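The histogram-matching step can be sketched as follows for a single channel, using empirical CDFs (a real pipeline would apply it per colour channel):

```python
import numpy as np

def match_histogram(src, ref):
    """Map the intensity distribution of `src` onto that of `ref`."""
    s_vals, s_idx, s_cnt = np.unique(src.ravel(), return_inverse=True,
                                     return_counts=True)
    r_vals, r_cnt = np.unique(ref.ravel(), return_counts=True)
    s_cdf = np.cumsum(s_cnt) / src.size   # empirical CDF of the LR image
    r_cdf = np.cumsum(r_cnt) / ref.size   # empirical CDF of the HR reference
    mapped = np.interp(s_cdf, r_cdf, r_vals)
    return mapped[s_idx].reshape(src.shape)

src = np.array([[0.0, 0.0], [1.0, 1.0]])
ref = np.array([[10.0, 10.0], [20.0, 20.0]])
print(match_histogram(src, ref))
```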
During CNN training, the dataset can be augmented by adding randomly rotated
and mirrored versions of the training images. Many other useful types of image
manipulation are implemented in the Albumentations package (Buslaev et al. 2020).
Tensor reshuffling operations, such as those required for sub-pixel convolutions, can
be greatly facilitated by using the Einops package by Rogozhnikov (2018). Both packages are
available for Tensorflow and Pytorch, the two most popular deep learning
frameworks.
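The basic rotation and mirroring augmentation can be sketched in plain NumPy (Albumentations implements these and many more):

```python
import numpy as np

def dihedral_augment(patch):
    """Return all eight rotated/mirrored variants of a square training patch."""
    variants = []
    for k in range(4):
        rot = np.rot90(patch, k)
        variants.append(rot)
        variants.append(np.fliplr(rot))
    return variants

patch = np.random.rand(8, 8)
print(len(dihedral_augment(patch)))  # 8
```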
Fig. 2.7 Example of images that were presented to human judgement during the development of
LPIPS metric. (Reproduced with permission from Zhang et al. 2018a)
A popular choice is the perceptual loss, computed as a distance between
features obtained from intermediate layers of VGG16, a CNN that was for
some time responsible for the highest accuracy in image classification on the
ImageNet dataset. SISR CNNs trained with this loss (Ledig et al. 2017, among
many others) have shown more visually pleasant results, without the blurring that is
characteristic of MSE.
Recently, there has been a surge in attempts to leverage the availability of big data
for capturing the visual preferences of users and simulating them using engineered metrics
relying on deep learning. Large-scale surveys have been conducted, in the course of
which users were given triplets of images representing the same scene but containing
a variety of conventional and CNN-generated degradations and asked whether the
first or the second image was “closer”, in whatever sense they could imagine, to the
third one (Fig. 2.7). Then, a CNN was trained to predict this perceptual judgement.
The Learned Perceptual Image Patch Similarity (LPIPS) metric by Zhang et al. (2018a)
and PieAPP by Prashnani et al. (2018) are two notable examples. These metrics generalize well
even for distortions that were not present during training. Overall, the development
of image quality metrics that would better correlate with human perception is a
subject of active research (Ding et al. 2020).
Table 2.1 FSRCNN architecture modified by adding depthwise separable convolutions and
residual connections

Layer name  | Comment                                    | Type                            | Filter size   | Output channels
Data        | Input data                                 | Y channel                       |               | 1
Upsample    | Upsample the input (bicubic interpolation) | Deconvolution                   |               | 1
Conv1       |                                            | Convolution, PReLU              | 5×5           | 32
Conv2       |                                            | Convolution, PReLU              | 1×1           | 32
BasicBlock1 | See Table 2.2                              | 3×3, 1×1, ReLU, sum w/ residual | 3×3           | 32
BasicBlock2 | See Table 2.2                              | 3×3, 1×1, ReLU, sum w/ residual | 3×3           | 32
BasicBlock3 | See Table 2.2                              | 3×3, 1×1, ReLU, sum w/ residual | 3×3           | 32
BasicBlock4 | See Table 2.2                              | 3×3, 1×1, ReLU, sum w/ residual | 3×3           | 32
Conv3       |                                            | Convolution, PReLU              | 1×1           | 32
Conv4       |                                            | Convolution                     | 1×1           | 32
Deconv      | Obtain the residual                        | Deconvolution                   | 9×9, stride 3 | 1
Result      | Deconv + Upsample                          | Summation                       |               | 1
Table 2.2 Structure of the basic block used in the modified FSRCNN architecture of Table 2.1

Layer name | Type                                                | Filter size | Output channels
Conv3×3    | Depthwise separable convolution                     | 3×3         | 32
Conv1×1    | Convolution, ReLU                                   | 1×1         | 32
Sum        | Summation of the block input and the Conv1×1 output |             | 32
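The motivation for the depthwise separable convolutions in these blocks is the reduction in parameter count, which can be estimated with a quick calculation (biases ignored):

```python
def conv_params(c_in, c_out, k):
    # Weights in a standard k x k convolution.
    return c_in * c_out * k * k

def dw_separable_params(c_in, c_out, k):
    # Depthwise k x k convolution (one filter per input channel)
    # followed by a 1 x 1 pointwise convolution.
    return c_in * k * k + c_in * c_out

# 32 -> 32 channels with 3x3 kernels, as in the blocks of Table 2.2.
print(conv_params(32, 32, 3))          # 9216
print(dw_separable_params(32, 32, 3))  # 1312
```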
The network is first trained for 3× zooming. Then, we keep all the layers' parameters
frozen (by setting the learning rate to zero for these layers) and replace the final
deconvolution layer with a layer that performs a 4× zoom. In this way, after an image
is processed by the main body of the CNN, the output of the last layer before the
deconvolution is fed into the layer specifically responsible for the desired zoom
factor. The latency of the first zoom operation is therefore high, but zooming the
same image by a different factor is much faster than the first time.
Our final CNN achieves 33.25 dB on the Set5 dataset with 10K parameters,
compared to FSRCNN with 33.06 dB and 12K parameters. To make full use of the
GPU available on recent mobile phones, we ported our super-resolution CNN to a
Samsung Galaxy S9 mobile device using the Qualcomm Snapdragon Neural Processing
Engine (SNPE) SDK. We achieved 1.6 FPS for 4× up-scaling of 1024×1024 images.
This figure does not include the CPU-to-GPU data transfers and the RGB-to-YCbCr
transformations, which can take about 1 second in total. The results of super-
resolution using this CNN are shown in Table 2.2 and Fig. 2.8.
Fig. 2.8 Super-resolution on a mobile device: left column: bicubic interpolation; right column:
modified FSRCNN optimized for Samsung Galaxy S9
Several challenges on single image super-resolution have been organized since 2017.
They aim to bridge the gap between academic research and real-life applications
of single image super-resolution. The first NTIRE (New Trends in Image Restoration
and Enhancement) challenge featured two tracks. In Track 1 bicubic interpolation
was used for creating the LR images. In Track 2, all that was known was that the LR
images were produced by convolving the HR image with some unknown kernel. In
both tracks, the HR images were downscaled by factors of 2, 3, and 4, and only blur
and decimation were used for this, without adding any noise. The DIV2K image
dataset was proposed for training and validation of algorithms (Agustsson and
Timofte 2017).
The competition attracted many teams from academia and industry, and many
new ideas were demonstrated. Generally, although the PSNR figures for all the
algorithms were worse on images from Track 2 than on those coming from Track
1, there was a strong positive correlation between the success of the method in both
tracks. The NTIRE competition became a yearly event, and the tasks to solve became
more and more challenging. It now features more tracks, many of them related to
image denoising, dehazing, etc. With regard to SR, NTIRE 2018 already featured
four tracks, the first one being the same as Track 1 from 2017, while the remaining
three added unknown image artefacts that emulated the various degradation factors
Fig. 2.9 Perception-distortion plane used for SR algorithm assessment in the PIRM challenge
present in the real image acquisition process from a digital camera. In 2019, RealSR,
a new dataset captured by a high-end DSLR camera, was introduced by Cai
et al. (2019b). For this dataset, HR and LR images of the same scenes were acquired
by the same camera by changing its focal length. In 2020, the “extreme” 16× track
was added. Along with PSNR and SSIM values, the contestants were ranked based
on the mean opinion score (MOS) computed in a user study.
The PIRM (Perceptual Image Restoration and Manipulation) challenge that was
first held in 2018 was the first to really focus on perceptual quality. The organizers
used an evaluation scheme based on the perception-distortion plane. The perception-
distortion plane was divided into three regions by setting thresholds on the RMSE
values (Fig. 2.9). In each region, the goal was to obtain the best mean perceptual
quality. For each participant, the perception index (PI) was computed as a combi-
nation of the no-reference image quality measures of Ma et al. (2017) and NIQE
(Mittal et al. 2013), a lower PI indicating better perceptual quality. The PI demon-
strated a correlation of 0.83 with the mean opinion score.
Another similar challenge, AIM (Advances in Image Manipulation), was first
held in 2019. It focuses on the efficiency of SR. In the constrained SR challenge, the
participants were asked to develop neural network designs or solutions with either
the lowest number of parameters, or the lowest inference time on a common GPU, or
the best PSNR, while being constrained to maintain or improve over a variant of
SRResNet (Ledig et al. 2017) in terms of the other two criteria.
In 2020, both NTIRE and AIM introduced the Real-World Super-Resolution
(RWSR) sub-challenges, in which no LR-HR pairs are ever provided. In the Same
Domain RWSR track, the aim is to learn a model capable of super-resolving images
in the source set, while preserving low-level image characteristics of the input source
domain. Only the source (input) images are provided for training, without any HR
ground truth. In the Target Domain RWSR track, the aim is to learn a model capable
of super-resolving images in the source set, generating clean high-quality images
similar to those in the target set. The source input images in both tracks are
constructed using artificial, but realistic, image degradations. The difference with
all the previous challenges is that this time the images in the source and target set are
unpaired, so the 4× super-resolved LR images should possess the same properties as
the HR images of different scenes.
Final reports have been published for all of the above challenges. The reports are
a great illustrated source of information about the winning solutions, neural net
architectures, training strategies, and trends in SR, in general. Relevant references
are Timofte et al. (2017), Timofte et al. (2018), Cai et al. (2019a), and Lugmayr
et al. (2019).
Over the years, although many researchers in super-resolution have used the same
neural networks that produced state-of-the-art results in image classification, a lot of
SR-specific enhancements have been proposed. The architectural decisions that we
will describe next were instrumental in reaching the top positions in SISR challenges
and influenced the research in this field.
One of the major early advances in single image super-resolution was the
introduction of generative adversarial networks (GANs) to produce more realistic
high-resolution images in (Ledig et al. 2017). The proposed SRGAN network
consisted of a generator, a ResNet-like CNN with many residual blocks, and a
discriminator. The two networks were trained concurrently, with the generator
trying to produce high-quality HR images and the discriminator aiming to correctly
classify whether its input is a real HR image or one generated by an SR algorithm.
The discriminator's classification performance defined the adversarial loss.
The rationale was that this competition between the two networks
would push the generated images closer to the manifold of natural images. In that
work, the similarity between intermediate features generated by passing the two
images through a well-trained image classification model was also used as percep-
tual loss – so that, in total, three different losses (along with MSE) were combined
into one for training. This clearly demonstrated that GANs can not only synthesize
fantasy images from a random input but can also be instrumental in image
processing. Since then, GANs have become a method of choice for deblurring,
super-resolution, denoising, etc.
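The adversarial objective can be sketched with the standard binary cross-entropy formulation; this is a simplification, since SRGAN combines it with content and perceptual losses:

```python
import numpy as np

def discriminator_loss(d_real, d_fake):
    # The discriminator should output 1 on real HR images and 0 on fakes.
    return float(-np.mean(np.log(d_real) + np.log(1.0 - d_fake)))

def generator_adv_loss(d_fake):
    # The generator pushes the discriminator's output on fakes towards 1.
    return float(-np.mean(np.log(d_fake)))

d_real = np.array([0.9, 0.8])  # discriminator outputs on real images
d_fake = np.array([0.1, 0.2])  # discriminator outputs on generated images
print(discriminator_loss(d_real, d_fake))
print(generator_adv_loss(d_fake))
```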
In the LapSRN model (Lai et al. 2017), shown in Fig. 2.10, the upsampling
follows the principle of Laplacian pyramids, i.e. each level of the CNN learns to
predict a residual that should explain the difference between a simple up-scale of the
previous level and the desired result. The predicted high-frequency residuals at each
level are used to efficiently reconstruct the HR image through upsampling and
addition operations.
The model has two branches: feature extraction and image reconstruction. The
first one uses stacks of convolutional layers to produce and, later, up-scale the
Fig. 2.10 Laplacian pyramid network for 2×, 4× and 8× up-scaling. (Reproduced with permission
from Lai et al. 2017)
residual images. The second one sums the residuals coming from the feature
extraction branch with images upsampled by bilinear interpolation. The entire
network is a cascade of CNNs with a
similar structure at each level. Each level has its loss function which is computed
with respect to the corresponding ground truth HR image at the specific scale.
LapSRN generates multiscale predictions, with zoom factors equal to powers
of 2. This design facilitates resource-aware applications, such as those running on
mobile devices. For example, if there are insufficient resources for 8× zooming, the
trained LapSRN model can still perform super-resolution with factors 2× and 4×.
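The per-level reconstruction used by LapSRN can be sketched as follows, with nearest-neighbour upscaling standing in for bilinear interpolation and random arrays standing in for the predicted residuals:

```python
import numpy as np

def upscale2x(img):
    # Nearest-neighbour 2x upscaling, a stand-in for bilinear interpolation.
    return np.repeat(np.repeat(img, 2, axis=0), 2, axis=1)

def laplacian_reconstruct(lr, residuals):
    """Each level upscales the previous result by 2x and adds the
    high-frequency residual predicted at that level."""
    out = lr
    for res in residuals:
        out = upscale2x(out) + res
    return out

lr = np.random.rand(8, 8)
residuals = [np.random.rand(16, 16), np.random.rand(32, 32)]  # 2x and 4x levels
print(laplacian_reconstruct(lr, residuals).shape)  # (32, 32)
```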
Like LapSRN, ProSR proposed by Wang et al. (2018a) aimed at the power-of-
two up-scale task and was built on the same hierarchical pyramid idea (Fig. 2.11).
However, the elementary building blocks for each level of the pyramid became more
sophisticated. Instead of sequences of convolutions, the dense compression units
(DCUs) were adapted from DenseNet. In a DCU, each convolutional layer obtains
“collective knowledge” as additional inputs from all preceding layers and passes on
its own feature maps to all subsequent layers through concatenation. This results in
better gradient flow during training.
In order to reduce the memory consumption and increase the receptive field with
respect to the original LR image, the authors used an asymmetric pyramidal structure
with more layers in the lower levels. Each level of the pyramid consists of a cascade
of DCUs followed by a sub-pixel convolution layer. A GAN variant of the ProSR
was also proposed, where the discriminator matched the progressive nature of the
generator network by operating on the residual outputs of each scale.
Compared to LapSRN, which also used a hierarchical scheme for power-of-two
upsampling, in ProSR the intermediate subnet outputs are neither supervised nor
used as the base image in the subsequent level. This design simplifies the backward
pass and reduces the optimization difficulty.
Fig. 2.11 Progressive super-resolution network. (Reproduced with permission from Wang et al.
2018a)
Fig. 2.12 DBPN and its up- and down-projection units. (Reproduced with permission from Haris
et al. 2018)
The Deep Back-Projection Network DBPN (Haris et al. 2018) exploits iterative
up- and down-sampling layers, providing an error feedback mechanism for projec-
tion errors at each stage. Inspired by iterative back-projection, an algorithm used
since the 1990s for multi-frame super-resolution, the authors proposed using mutu-
ally connected up- and down-sampling stages each of which represents different
types of image degradation and high-resolution components. As in ProSR, dense
connections between upsampling and down-sampling layers were added to encour-
age feature reuse.
Initial feature maps are constructed from the LR image, and they are fed to a
sequence of back-projection modules (Fig. 2.12). Each such module performs a
change of resolution up or down, with a set of learnable kernels, with a subsequent
return to the initial resolution using another set of kernels. A residual between the
input feature map and the one that was subjected to the up-down or down-up
operation is computed and passed to the next up- or downscaling. Finally, the
Fig. 2.13 Channel attention module used to reweight feature maps. (Reproduced with permission
from Zhang et al. 2018b)
Fig. 2.14 The building and plant patches from two LR images look very similar. Without a correct
prior, GAN-based methods can add details that are not faithful to the underlying class. (Reproduced
with permission from Wang et al. 2018b)
Fig. 2.15 Modulation of SR feature maps using affine parameters derived from probabilistic
segmentation maps. (Reproduced with permission from Wang et al. 2018b)
Hu et al. (2019) proposed to use a special Meta-Upscale Module. This module can
replace the standard deconvolution modules that are placed at the very end of CNNs
and are responsible for the up-scaling. For an arbitrary scale factor, this module
takes the zoom factor as input, together with the feature maps created by any SISR
CNN, and dynamically predicts the weights of the up-scale filters. The CNN then
uses these weights to generate an HR image of arbitrary size. Besides the elegance
of a meta-learning approach and the obvious flexibility with regard to zoom factors,
an important advantage is that parameters need to be stored for only one small
trained subnetwork.
The degradation factor that produces an LR image from an HR one is often
unknown. It can be associated with a nonsymmetric blur kernel, it can contain noise
from sensors or compression, and it can even be spatially dependent. One prominent
approach to simultaneously deal with whole families of blur kernels and many
possible noise levels has been proposed by Zhang et al. (2018c). By assuming that
the degradation can be modelled as an anisotropic Gaussian, with the addition of
white Gaussian noise with standard deviation σ, a multitude of LR images are
created for every HR ground truth image present in the training dataset. These LR
images are augmented with degradation maps which are computed by projecting the
degradation kernels to a low-dimensional subspace using PCA. The degradation
maps can be spatially dependent. The super-resolution multiple-degradations
(SRMD) network performs simultaneous zooming and deblurring for several zoom
factors and a wide range of blur kernels. It is assumed that the exact shape of the blur
kernel can be reliably estimated during inference. Figure 2.16b demonstrates the
result of applying SRMD to an LR image that was obtained from the HR image by
applying a Gaussian smoothing with an isotropic kernel that had a different width for
different positions in the ground truth HR image; also spatially dependent white
Gaussian noise was added. The degradation model shown in Fig. 2.16a, b is quite
complex, but the results of simultaneous zooming and deblurring (Fig. 2.16c) still
demonstrate sharp edges and good visual quality. This work was further extended to
non-Gaussian degradation kernels by Zhang et al. (2019).
Fig. 2.16 Examples of SRMD on dealing with spatially variant degradation: (a) noise level and
Gaussian blur kernel width maps; (b) zoomed LR image with noise added according to (a); (c)
results of SRMD with scale factor 2
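The construction of such low-dimensional degradation codes can be sketched as follows; the 15×15 kernel size and the 3-dimensional PCA subspace are illustrative choices, not the values used in SRMD:

```python
import numpy as np

def anisotropic_gaussian(size, cov):
    """Anisotropic Gaussian blur kernel sampled on a size x size grid."""
    ax = np.arange(size) - size // 2
    xx, yy = np.meshgrid(ax, ax)
    pts = np.stack([xx.ravel(), yy.ravel()])
    quad = np.sum(pts * (np.linalg.inv(cov) @ pts), axis=0)
    k = np.exp(-0.5 * quad).reshape(size, size)
    return k / k.sum()

# A family of kernels with varying widths, projected by PCA.
kernels = np.stack([anisotropic_gaussian(15, np.diag([s, 2.0 * s])).ravel()
                    for s in np.linspace(0.5, 4.0, 50)])
mean = kernels.mean(axis=0)
_, _, vt = np.linalg.svd(kernels - mean, full_matrices=False)
codes = (kernels - mean) @ vt[:3].T  # 3-dim degradation code per kernel
print(codes.shape)  # (50, 3)
```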
Deep learning algorithms can be applied to multiple frames for video super-
resolution, a subject that we did not touch on in this chapter. In order to ensure proper
spatiotemporal smoothness of the generated videos, DL methods usually leverage
optical flow and other cues from traditional computer vision, although many
of these cues can also be generated and updated in a DL context. It is worth mentioning
that the Russian Internet company Yandex (2018) has successfully used deep learning
to restore and up-scale various historical movies and cartoons (see Fig. 2.17 for an
example), which the company streamed under the name DeepHD.
Super-resolution of depth maps is also a topic of intense research. Depth maps are
obtained by depth cameras; they can also be computed from stereo pairs. Since this
computation is time-consuming, it is advantageous to use classical algorithms to
obtain the LR depth map and then rescale it by a large factor of 4× to 16×. The loss of
edge sharpness during super-resolution is much more prominent in depth maps than
in regular images. It has been shown by Hui et al. (2016) that CNNs can be trained to
accurately up-scale LR depth maps given the HR intensity images as an additional
input. Song et al. (2019) proposed an improved multiscale CNN for depth map
super-resolution that does not require the corresponding images. A related applica-
tion of SR is up-scaling of stereo images. Super-resolution of stereo pairs is
challenging because of large disparities between similar-looking patches of images.
Wang et al. (2019) have proposed a special parallax-attention mechanism with a
large receptive field along the epipolar lines to handle large disparity variations.
Super-resolution was recently applied by Chen et al. (2018) for magnetic reso-
nance imaging (MRI) in medicine. The special 3D CNN processes image volumes
and allows the MRI acquisition time to be shortened at the expense of minor image
quality degradation. It is worth noting that, while in the majority of use cases we
expect SISR algorithms to generate visually appealing images, in applications where
an important decision is made by analysing the image (e.g. in security, biometrics,
and especially in medical imaging), it is far more important to keep the informative
features unchanged during up-scaling, without introducing details that look realistic
for the specific domain as a whole but are irrelevant or misleading for the particular
case. High values of similarity metrics, including perceptual ones, may give an
inaccurate impression of
good performance of an algorithm. The ultimate verdict should come from visual
inspection by a panel of experts. Hopefully, this expertise can also be learned and
simulated by a machine learning algorithm to some degree.
Deep learning-based super-resolution has come a long way since the first attempts
at zooming synthetic images obtained by naïve bicubic down-sampling. Nowadays,
super-resolution is an integral part of the general image processing pipeline which
includes denoising and image enhancement. Future super-resolution algorithms
should be tunable by the user and provide reasonable trade-offs between the zoom
factor, denoising level and the loss or hallucination of details. Suitable image quality
metrics should be developed for assessing users’ preferences.
In order to make the algorithms generic and independent of the hardware, camera-
induced artefacts should be disentangled from the image content. This should be
done without requiring much training data from the same camera. Ideally, a single
image should suffice to derive the prior knowledge necessary for up-scaling and
denoising, without the need for pairs of LR and HR images. This direction is called
zero-shot super-resolution (Shocher et al. 2018; Ulyanov et al. 2020; Bell-Kligler
et al. 2019). Unpaired super-resolution is a subject of intense research; this task is
formulated in all the recent super-resolution competitions. The capturing of image
priors is often performed using generative adversarial networks and includes not
only low-level statistics but semantics (colour, resolution) as well. It is possible to
learn the “style” of the target (high-quality) image domain and transfer it to the
super-resolved LR image (Pan et al. 2020).
Modern super-resolution algorithms are computationally demanding, and there is
no indication that increasing the number of convolutional layers in a CNN beyond
some threshold inevitably leads to higher image quality. High quality is obtained by
other methods – residual or dense connections, attention mechanisms, multiscale
processing, etc. The number of operations per pixel will most likely decrease in
future SR algorithms. The CNNs will become more suitable for real-time processing
even for large images and high zoom factors. This might be achieved by training in
fixed-point arithmetic, network pruning and compression, and automatic adaptation
of architectures to target hardware using neural architecture search (Elsken et al.
2019). Finally, the application of deep learning methods to video SR, including the
time dimension (frame rate up-conversion; see Chap. 15), will set new standards in
multimedia content generation.
References
Agustsson, E., Timofte, R.: NTIRE 2017 challenge on single image super-resolution: dataset and
study. In: Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition
Workshops, pp. 1122–1131 (2017)
Athar, S., Wang, Z.: A comprehensive performance evaluation of image quality assessment
algorithms. IEEE Access. 7, 140030–140070 (2019)
Bell-Kligler, S., Shocher, A., Irani, M.: Blind super-resolution kernel estimation using an internal-
GAN. Adv. Neural Inf. Proces. Syst. 32 (2019). http://www.wisdom.weizmann.ac.il/~vision/
kernelgan/index.html. Accessed on 20 Sept 2020
Buslaev, A., Iglovikov, V.I., Khvedchenya, E., Parinov, A., Druzhinin, M., Kalinin, A.A.:
Albumentations: fast and flexible image augmentations. Information. 11(2), 125 (2020)
Cai, J., et al.: NTIRE 2019 challenge on real image super-resolution: methods and results. In:
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition
Workshops, pp. 2211–2223 (2019a). https://ieeexplore.ieee.org/document/9025504. Accessed
on 20 Sept 2020
Cai, J., Zeng, H., Yong, H., Cao, Z., Zhang, L.: Toward real-world single image super-resolution: A
new benchmark and a new model. In: Proceedings of the IEEE/CVF International Conference
on Computer Vision, pp. 3086–3095 (2019b)
Chen, Y., Shi, F., Christodoulou, A.G., Xie, Y., Zhou, Z., Li, D.: Efficient and accurate MRI
super-resolution using a generative adversarial network and 3D multi-level densely connected
network. In: Frangi, A., Schnabel, J., Davatzikos, C., Alberola-López, C., Fichtinger, G. (eds.)
Medical Image Computing and Computer Assisted Intervention. Lecture Notes in Computer
Science, vol. 11070. Springer Publishing Switzerland, Cham (2018)
Chen, C., Xiong, Z., Tian, X., Zha, Z., Wu, F.: Camera lens super-resolution. In: Proceedings of the
2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1652–1660
(2019)
Dai, T., Cai, J., Zhang, Y., Xia, S., Zhang, L.: Second-order attention network for single image
super-resolution. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern
Recognition, pp. 11057–11066 (2019)
Ding, K., Ma, K., Wang, S., Simoncelli, E.P.: Image quality assessment: unifying structure and
texture similarity. arXiv, 2004.07728 (2020)
Dong, C., Loy, C.C., He, K., Tang, X.: Image super-resolution using deep convolutional networks.
IEEE Trans. Pattern Anal. Mach. Intell. 38(2), 295–307 (2016a)
Dong, C., Loy, C.C., He, K., Tang, X.: Accelerating the super-resolution convolutional neural
network. In: Proceedings of the European Conference on Computer Vision, pp. 391–407
(2016b)
Elsken, T., Metzen, J.H., Hutter, F.: Neural architecture search: a survey. J. Mach. Learn. Res. 20,
1–21 (2019)
Haris, M., Shakhnarovich, G., Ukita, N.: Deep back-projection networks for super-resolution. In:
Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition,
pp. 1664–1673 (2018)
Hu, J., Shen, L., Sun, G.: Squeeze-and-excitation networks. In: Proceedings of the 2018 IEEE/CVF
Conference on Computer Vision and Pattern Recognition, pp. 7132–7141 (2018)
Hu, X., Mu, H., Zhang, X., Wang, Z., Tan, T., Sun, J.: Meta-SR: a magnification-arbitrary network
for super-resolution. In: Proceedings of the 2019 IEEE/CVF Conference on Computer Vision
and Pattern Recognition, pp. 1575–1584 (2019)
Huang, J., Singh, A., Ahuja, N.: Single image super-resolution from transformed self-exemplars. In:
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition,
pp. 5197–5206 (2015)
Hui, T.-W., Loy, C.C., Tang, X.: Depth map super-resolution by deep multi-scale guidance. In:
Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) Computer Vision – ECCV 2016. ECCV 2016.
Lecture Notes in Computer Science, vol. 9907. Springer Publishing Switzerland, Cham (2016)
Johnson, J., Alahi, A., Fei-Fei, L.: Perceptual losses for real-time style transfer and super-resolution.
In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) Computer Vision – ECCV 2016. ECCV
2016. Lecture Notes in Computer Science, vol. 9906. Springer Publishing Switzerland, Cham
(2016)
2 Super-Resolution: 2. Machine Learning-Based Approach 57
Kastryulin, S., Parunin, P., Zakirov, D., Prokopenko, D.: PyTorch image quality. https://github.
com/photosynthesis-team/piq (2020). Accessed on 20 Sept 2020
Lai, W., Huang, J., Ahuja, N., Yang, M.: Deep Laplacian pyramid networks for fast and accurate
super-resolution. In: Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern
Recognition, pp. 5835–5843 (2017)
Ledig, C., et al.: Photo-realistic single image super-resolution using a generative adversarial
network. In: Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern
Recognition, pp. 105–114 (2017)
Lugmayr, A., et al.: AIM 2019 Challenge on real-world image super-resolution: methods and
results. In: Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision
Workshop, pp. 3575–3583 (2019)
Ma, C., Yang, C.-Y., Yang, M.-H.: Learning a no-reference quality metric for single-image super-
resolution. Comput. Vis. Image Underst. 158, 1–16 (2017)
Migukin, A., Varfolomeeva, A., Chernyavskiy, A., Chernov, V.: Method for image super-
resolution imitating optical zoom implemented on a resource-constrained mobile device, and
a mobile device implementing the same. US patent application 20200211159 (2020)
Mittal, A., Soundararajan, R., Bovik, A.C.: Making a “completely blind” image quality analyzer.
IEEE Signal Process. Lett. 20(3), 209–212 (2013)
Pan, X., Zhan, X., Dai, B., Lin, D., Change Loy, C., Luo, P.: Exploiting deep generative prior for
versatile image restoration and manipulation. arXiv, 2003.13659 (2020)
Prashnani, E., Cai, H., Mostofi, Y., Sen, P.: PieAPP: perceptual image-error assessment through
pairwise preference. In: Proceedings of the IEEE/CVF Conference on Computer Vision and
Pattern Recognition, pp. 1808–1817 (2018)
Rogozhnikov, A.: Einops – a new style of deep learning code. https://github.com/arogozhnikov/
einops/ (2018). Accessed on 20 Sept 2020
Safonov, I.V., Kurilin, I.V., Rychagov, M.N., Tolstaya, E.V.: Document Image Processing for
Scanning and Printing. Springer Nature Switzerland AG, Cham (2019)
Shi, W., Caballero, J., Theis, L., Huszar, F., Aitken, A., Ledig, C., Wang, Z.: Is the deconvolution
layer the same as a convolutional layer? arXiv, 1609.07009 (2016a)
Shi, W., Caballero, J., Theis, L., Huszar, F., Aitken, A., Ledig, C., Wang, Z.: Real-time single
image and video super-resolution using an efficient sub-pixel convolutional neural network. In:
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition,
pp. 1874–1883 (2016b)
Shocher, A., Cohen, N., Irani, M.: “Zero-shot” super-resolution using deep internal learning. In:
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition,
pp. 3118–3126 (2018)
Song, X., Dai, Y., Qin, X.: Deeply supervised depth map super-resolution as novel view synthesis.
IEEE Trans. Circuits Syst. Video Technol. 29(8), 2323–2336 (2019)
Timofte, R., et al.: NTIRE 2017 challenge on single image super-resolution: methods and results.
In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Work-
shops, pp. 1110–1121 (2017)
Timofte, R., et al.: NTIRE 2018 challenge on single image super-resolution: methods and results.
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition
Workshops, pp. 965–96511 (2018)
Ulyanov, D., Vedaldi, A., Lempitsky, V.: Deep image prior. Int. J. Comput. Vis. 128, 1867–1888
(2020)
Wang, Y., Perazzi, F., McWilliams, B., Sorkine-Hornung, A., Sorkine-Hornung, O., Schroers, C.:
A fully progressive approach to single-image super-resolution. In: Proceedings of the IEEE/
CVF Conference on Computer Vision and Pattern Recognition Workshops, pp. 977–97709
(2018a)
Wang, X., Yu, K., Dong, C., Change Loy, C.: Recovering realistic texture in image super-resolution
by deep spatial feature transform. In: Proceedings of the IEEE/CVF Conference on Computer
Vision and Pattern Recognition, pp. 606–615 (2018b)
58 A. S. Chernyavskiy
Wang, L., et al.: Learning parallax attention for stereo image super-resolution. In: Proceedings of
the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12242–12251
(2019)
Yandex: DeepHD: Yandex’s AI-powered technology for enhancing images and videos. https://
yandex.com/promo/deephd/ (2018). Accessed on 20 Sept 2020
Zhang, K., Zuo, W., Chen, Y., Meng, D., Zhang, L.: Beyond a Gaussian denoiser: residual learning
of deep CNN for image denoising. IEEE Trans. Image Process. 26(7), 3142–3155 (2017)
Zhang, R., Isola, P., Efros, A.A., Shechtman, E., Wang, O.: The unreasonable effectiveness of deep
features as a perceptual metric. In: Proceedings of the IEEE/CVF Conference on Computer
Vision and Pattern Recognition, pp. 586–595 (2018a)
Zhang, Y., Li, K., Li, K., Wang, L., Zhong, B., Fu, Y.: Image super-resolution using very deep
residual channel attention networks. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss,
Y. (eds.) Computer Vision – ECCV 2018. Lecture Notes in Computer Science, vol. 11211.
Springer Publishing Switzerland (2018b)
Zhang, K., Zuo, W., Zhang, L.: Learning a single convolutional super-resolution network for
multiple degradations. In: Proceedings of the IEEE/CVF Conference on Computer Vision and
Pattern Recognition, pp. 3262–3271 (2018c)
Zhang, K., Zuo, W., Zhang, L.: Deep plug-and-play super-resolution for arbitrary blur kernels. In:
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition,
pp. 1671–1681 (2019)
Chapter 3
Depth Estimation and Control
3.1 Introduction
With the release of James Cameron's hugely popular Hollywood movie "Avatar" in 2009, the era of 3D TV technology got a second wind. By 2010, interest was very high, and all major TV makers included a "3D-ready" feature in smart TVs. In early 2010, the world's biggest manufacturers, such as Samsung, LG, Sony, Toshiba, and Panasonic, launched the first home 3D TVs on the market. 3D TV technology was at the peak of expectations. Figure 3.1 shows some of the prospective technologies as they were seen in 2010.
Over the next several years, 3D TV was a hot topic at major consumer electronics shows. However, because of the absence of technical innovations, the lack of content, and the increasingly clear disadvantages of the technology, consumer interest in such devices started to subside (Fig. 3.2).
By 2016, almost all TV makers had announced the termination of the 3D feature in flat-panel TVs and turned their attention to high-resolution and high-dynamic-range features, though 3D cinema is still popular.
From the marketing and technology points of view, the possible causes of this shift in interest are the following.
E. V. Tolstaya (*)
Aramco Innovations LLC, Moscow, Russia
e-mail: ktolstaya@yandex.ru

V. V. Bucha
BIQUANTS, Minsk, Belarus
e-mail: vbucha@biquants.com

Fig. 3.1 Some prospective technologies, as they were seen in 2010: 3D flat-panel TVs and displays, cloud computing, cloud/web platforms, augmented reality, Internet TV, 3D printing, gesture recognition, and mobile robots. (Gartner Hype Cycle for Emerging Technologies 2010, www.gartner.com)

Fig. 3.2 Popularity of the term "3D TV" measured by Google Trends, given as a percentage of its maximum; the peak follows the premiere of "Avatar" in London on December 10, 2009

1. Inappropriate moment. The recent transition from analogue to digital TV had forced many consumers to buy new digital TVs, and by 2010 many of them were not ready to invest in new TVs once again.
2. Extra cost. To take full advantage of using this new technology, consumers also
had to buy a 3D-enabled Blu-ray player or get a 3D-enabled satellite box.
3. Uncomfortable glasses. The 3D images work on the principle that each of our
eyes sees a different picture. By perceiving a slightly different picture from each
eye, the brain automatically constructs the third dimension. 3D-ready TVs came
with stereo glasses: with so-called active or passive glasses technology. Glasses
of different manufacturers could be incompatible with each other. Moreover, a
family of three or more people would need additional pairs, since usually only
one or two pairs were supplied with a TV. Viewers wearing prescription glasses had to wear the 3D glasses over their own pair. And finally, the glasses needed charging, so to be able to watch TV, you had to keep several pairs fully charged.
4. Live TV. Supporting 3D TV is a difficult task for broadcast networks: a separate channel is required for broadcasting 3D content, in addition to the conventional 2D channel for viewers with no 3D feature.
5. Picture quality. The picture in the 3D mode is dimmer than in the conventional
2D mode, because each eye sees only half of the pixels (or half of the light)
intended for the picture. In addition, viewing a 3D movie on a smaller screen with
a narrow field of view did not give a great experience, because the perceived
depth is much smaller than on big screen cinema. In this case, even increasing
parallax length does not boost depth but adds to eye fatigue and headache.
6. Multiple user scenario. When several people watch a 3D movie, not all of them
can sit at the point in front of the TV that gives the best viewing experience. This
leads to additional eye fatigue and picture defects.
It is clear that some of the technological problems mentioned still exist and need to be resolved by future engineers. However, even now we can see that stereo reproduction technology may yet meet its next wave of popularity. Virtual reality headsets, which recently appeared on the market, provide an excellent 3D movie viewing experience.
In passive systems with polarised glasses, the two pictures are projected with
different light polarisation, and corresponding polarised glasses allow separating
the images. The system requires the use of a quite expensive screen that saves the
initial polarisation of reflected light. Usually, such a system is used in 3D movie
theatres. The main disadvantage is loss of brightness, since only half of the light
reaches each eye.
The most common system in home 3D TVs is based on active shutter glasses: the TV alternates the left and right views on the screen, and synchronised glasses block each eye in turn. The stereo effect and its strength depend on the parallax (i.e. the difference) between the two views of a stereopair. The main disadvantage is loss of frame rate, because only half of the video frames reach each eye.
Let us consider in more detail the cause of eye fatigue while viewing 3D video on TV. Showing a slightly different image to each eye creates the 3D effect in the viewer's brain. The bigger the parallax, the more pronounced the 3D effect. The types of parallax are illustrated in Fig. 3.3.
Fig. 3.3 Types of parallax relative to the stereo plane: zero parallax, positive parallax, and negative parallax
1. Zero parallax. The image difference between left and right views is zero, and the
eye focuses right at the plane of focus. This is a convenient situation in real life,
and generally it does not cause viewing discomfort.
2. Positive (uncrossed) parallax. The position of the convergence point is located
behind the projection screen. The most comfortable viewing is accomplished
when the parallax is almost equal to the interocular distance.
3. Negative (crossed) parallax. The focusing point is in front of the projection
screen. The parallax depends on the convergence angle and the distance of the
observer from the display, and therefore it can be more than the interocular
distance.
4. Positive diverged parallax. The optical axes must diverge to perceive stereo with a parallax exceeding the interocular distance. This case can cause serious visual discomfort for the observers.
Parallax is usually measured as a percentage of the shift relative to the frame width. In real 3D movies, the parallax can be as large as 16% (e.g. "Journey to the Center of the Earth"), 9% ("Dark Country"), or 8% ("Dolphins and Whales 3D: Tribes of the Ocean") (Vatolin 2015). This means that the parallax can reach about 1 metre when viewing a 6-metre-wide cinema screen, which is significantly larger than the average interocular distance. Such a situation causes unnatural behaviour of the eyes, which are forced to diverge.
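The arithmetic above is easy to check. The helpers below are an illustrative sketch (the function names and the 6.5 cm average interocular distance are our assumptions, not from the chapter): they convert a parallax percentage into metres for a given screen width and flag when the eyes would be forced to diverge.

```python
def physical_parallax(parallax_pct, screen_width_m):
    # Parallax is measured as a percentage of the frame width;
    # convert it to metres for a physical screen.
    return parallax_pct / 100.0 * screen_width_m

def forces_divergence(parallax_m, interocular_m=0.065):
    # Positive parallax larger than the interocular distance
    # forces the optical axes of the eyes to diverge.
    return parallax_m > interocular_m
```

For a 6-metre-wide screen, a 16% parallax gives 0.96 m, far beyond the interocular distance, so the eyes must diverge; the same 16% on a 0.5-metre TV screen gives only 8 cm.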
Smaller screens (like TV on a smartphone) have smaller parallax, and the eyes
can better adapt to 3D. However, the drawback is that on smaller screens, the 3D
effect is smaller, and objects look flat.
Usually, human eyes use the mechanism of accommodation to see objects at
different distances in focus (see Fig. 3.4). The muscles that control the lens in the eye
shorten to focus on close objects. The limited depth of field means that objects that
are not at the focal length are typically out of focus. This enables viewers to ignore
certain objects in a scene.
The most common causes of eye fatigue (Mikšícek 2006) are enumerated below.
1. Breakdown of the accommodation and convergence relationship. When a viewer
observes an object in the real world, the eyes focus on a specific point belonging
to this object. However, when the viewer watches 3D content, the eyes try to
focus on "popping-out" 3D objects, while for a sharp picture they have to focus on the screen plane. This can misguide our brains and add a feeling of sickness.

Fig. 3.4 Accommodation for a near target (near objects in focus, far objects blurred) and for a far target (far objects in focus, near objects blurred)
2. High values of the parallax. High values of parallax can lead to divergence of the
eyes, and this is the most uncomfortable situation.
3. Crosstalk (ghosts). Occurs when the picture intended for the left eye is partly visible to the right eye and vice versa. It is quite common for technologies based on colour separation and light polarisation when filtering is insufficient, or in cases of bad synchronisation between the TV display and the shutter glasses.
4. Conflict between interposition and parallax. A special type of conflict between
depth cues appears when a portion of the object on one of the views is clipped by
the screen (or image window) surround. The interposition depth cue indicates that
the image surround is in front of the object, which is in direct opposition to the
disparity depth cue. This conflict causes depth ambiguity and confusion (Fig. 3.5).

Fig. 3.5 A scene point P is perceived at a distorted position P′ when the viewer is not at the correct position in front of the stereo display
5. Vertical disparities. Vertical disparities are caused by wrong placement of the
cameras or faulty calibration of the 3D presentation apparatus (e.g. different focal
lengths of the camera lenses). Figure 3.6 illustrates vertical disparity.
6. Common cue collision. Any logical collision of binocular cues and monocular
cues, such as light and shade, relative size, interposition, textural gradient, aerial
perspective, motion parallax, perspective, and depth cueing (Fig. 3.7).
7. Viewing conditions. Viewing conditions include viewing distance, screen size,
lighting of the room, viewing angle, etc. Also, personal features of the viewer:
age, anatomical size, and eye adaptability. Generally, the older the person, the
greater the eye fatigue and sickness, because of lower adaptability of the brain.
For children, who have smaller interocular distance, the 3D effect will be more
expressed, but the younger brain and greater adaptability will decrease the
negative experience. For people with strabismus, it is impossible to perceive stereo content at all. Figure 3.8 illustrates the situation when one of the viewers is not at the optimal position.
8. Content quality. Geometrical distortions, differences in colour, sharpness, brightness/contrast, and depth of field of the production optical system between the left and right views, flipped stereo, and time shift: all these factors lower content quality and cause more eye fatigue.
The majority of the cited causes of eye fatigue relate to stereo content quality. However, even high-quality content can have inappropriate parameters, such as a high value of parallax. To compensate for this, a fast real-time depth control technology has been proposed, aimed at reducing the perceived depth of 3D content by diminishing the stereo effect (Fig. 3.9).
The depth control feature can be implemented in a 3D TV, and it can be
controlled on the TV remote (Ignatov and Joesan 2009), as shown in Fig. 3.10.
The proposed scheme of stereo effect modification is shown in Fig. 3.11. First, the depth map between the input stereo views is estimated; the depth is then post-processed to remove artefacts and mapped for the modified stereo effect; after that, the left and right views are generated.

Fig. 3.9 To reduce perceived depth and associated eye fatigue, it is necessary to diminish the stereo effect

Fig. 3.11 Scheme of stereo effect modification, including depth control parameter estimation

Fig. 3.12 Depth tone mapping: (a) initial depth; (b) tone-mapped depth
The depth control method employs two techniques.
1. Control of the depth of pop-up objects (reduction of excessive negative parallax). This can be thought of as a reduction of the stereo baseline, where the modified stereo views are moved towards each other. In this case, the 3D perception of close objects is reduced first of all. This technique can be realised via view interpolation, where the virtual view for the modified stereopair is interpolated from the initial stereopair according to a portion of the depth/disparity vector. Areas of the image with small disparity vectors remain almost the same, while areas with large disparity vectors (pop-up objects) produce less perception of depth.
2. Control of the depth of the image plane. This can be thought of as moving the image plane along the z-direction, so that the perceived 3D scene moves further away in the z-direction. This technique can be realised via depth tone mapping with subsequent view interpolation. Depth tone mapping decreases depth perception equally for every region of the image. Figure 3.12 illustrates how all objects of the scene are made more distant to an observer. Depth tone mapping can be realised through pixel-wise contrast, brightness, and gamma operations.
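As a minimal sketch of the second technique, the function below applies pixel-wise contrast, brightness, and gamma operations to a normalised depth map (the parameter names and default values are illustrative assumptions; with contrast below 1 and gamma above 1, every depth value is reduced, pushing the scene away from the observer):

```python
import numpy as np

def tone_map_depth(depth, contrast=0.7, brightness=0.0, gamma=1.5):
    # depth: array normalised to [0, 1], larger values = closer to the viewer.
    d = np.clip(depth, 0.0, 1.0)
    d = np.clip(contrast * d + brightness, 0.0, 1.0)  # linear contrast/brightness
    return d ** gamma                                 # gamma curve
```

The mapping is monotonic, so the relative ordering of depths (and hence the scene structure) is preserved while the overall depth range shrinks.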
$$D(x, y) = \arg\min_{d} \big| I_r(x, y) - I_t(x + d, y) \big| = \arg\min_{d} \mathrm{Cost}(x, y, d),$$

where $I_r$ is the pixel intensity in the reference image, $I_t$ is the pixel intensity in the target image, $d \in [d_{min}, d_{max}]$ denotes the disparity range, and $D(x, y)$ is the disparity map.
The photometric constraint applied to a single pixel pair does not provide a unique solution. Instead of comparing individual pixels, several neighbouring pixels are grouped in a support window, and their intensities are compared with those of the pixels in another window. The simplest matching measure is the sum of absolute differences (SAD). The disparity that minimises the SAD cost for each pixel is chosen. The optimisation equation can be rewritten as follows:
$$D(x, y) = \arg\min_{d} \sum_{i} \sum_{j} \big| I_r(x_i, y_j) - I_t(x_i + d, y_j) \big| = \arg\min_{d} \sum_{i} \sum_{j} \mathrm{Cost}(x_i, y_j, d),$$

where $i \in [-n, n]$ and $j \in [-m, m]$ define the support window size (Fig. 3.14).

Fig. 3.13 Typical stereo pipeline: matching cost computation, cost (support) aggregation, disparity computation/optimisation, and disparity refinement
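A brute-force implementation of the windowed SAD matcher above might look as follows (a sketch for small images, intended only to make the equation concrete; practical systems use the aggregation strategies discussed next):

```python
import numpy as np

def sad_disparity(left, right, d_max, win=3):
    # Left-to-right disparity by minimising the window-aggregated SAD cost.
    h, w = left.shape
    half = win // 2
    L, R = left.astype(np.float64), right.astype(np.float64)
    best = np.full((h, w), np.inf)
    disp = np.zeros((h, w), dtype=int)
    for d in range(d_max + 1):
        # Per-pixel absolute difference at disparity d: |L(x) - R(x - d)|.
        ad = np.full((h, w), np.inf)
        ad[:, d:] = np.abs(L[:, d:] - R[:, :w - d])
        # Aggregate the cost over the (2*half + 1)^2 support window.
        p = np.pad(ad, half, mode='edge')
        cost = np.zeros((h, w))
        for dy in range(win):
            for dx in range(win):
                cost += p[dy:dy + h, dx:dx + w]
        # Winner-takes-all: keep the disparity with the lowest cost so far.
        better = cost < best
        best[better] = cost[better]
        disp[better] = d
    return disp
```

On a synthetic pair where the left image is the right image shifted by a few pixels, the matcher recovers that shift exactly in the image interior.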
Other matching measures include normalised cross correlation (NCC), modified
normalised correlation (MNCC), rank transform, etc. However, there is a problem
with correlation and SAD matching since the window size should be large enough to
include enough intensity variation for matching but small enough to avoid effects of
projective distortions.
For this reason, approaches that adaptively select the window size depending on local variations of intensity have been proposed. Kanade and Okutomi (1994) attempt to find the ideal window size and shape for each pixel in an image.
Prazdny (1987) proposed a new function to assign support weights to
neighbouring pixels iteratively. In this method, it is assumed that neighbouring
disparities, if corresponding to the same object in a scene, are similar and that two
neighbouring pixels with similar disparities support each other.
In general, the prior-art aggregation step uses rectangular windows for grouping
the neighbouring pixels and comparing their intensities with those of the pixels in
another window. The pixels can be weighted using linear or nonlinear filters for
better results. The most popular nonlinear filter for disparity estimation with a variable support strategy is the cross-bilateral filter. However, the computational complexity of this type of filter is extremely high, especially for real-time applications.
In this work, we adapted separable recursive bilateral-like filtering for matching cost aggregation. It has constant-time complexity, independent of the filter window size, and runs much faster than the traditional filter while producing a similar cost aggregation result (Tolstaya and Bucha 2012). We used a recursive implementation of the cost aggregation function, similar to (Deriche 1990). The separable implementation of the bilateral filter allows a significant speed-up of computations while giving a result similar to the full-kernel implementation (Pham and van Vliet 2005). Right-to-left and left-to-right disparities are computed using similar considerations. First, a difference image D between the left and right images is computed and aggregated as follows:
$$F(x, y, \delta) = \frac{1}{w} \sum_{x', y' \in \Gamma} D(x', y', \delta)\, S(|x - x'|)\, h\big(\Delta(I(x, y), I(x', y'))\big),$$

where $w(x, y)$ is the weight that normalises the filter output, computed according to the following formula:

$$w(x, y) = \sum_{x', y' \in \Gamma} S(|x - x'|)\, h\big(\Delta(I(x, y), I(x', y'))\big),$$
and Γ is the support window. This helps to adapt the filtering window, according to
colour similarity of image regions.
In our work, we used the following range and spatial filter kernels h(r) and S(x), respectively:

$$h(r) = \exp\left(-\frac{|r|}{\sigma_r}\right) \quad \text{and} \quad S(x) = \exp\left(-\frac{|x|}{\sigma_s}\right).$$
Symmetric kernels allow separable accumulation over rows and columns. The kernels h(r) and S(x) are not the commonly used Gaussian kernels, but with them it is possible to construct a recursive accumulation function and significantly increase processing speed while preserving quality.
Let us consider the one-dimensional case of smoothing with S(x). For a fixed δ, we have

$$F(x) = \sum_{k=0}^{N-1} D(k)\, S(k - x).$$
Unlike (Deriche 1990), the second pass is based on the result of the first pass, with the normalising coefficient $\alpha = e^{-1/\sigma_s}$ ensuring that the range of the output signal is the same as the range of the input signal. In the case of cross-bilateral filtering, the weight α is a function of x. The backward pass is modified similarly, and F(x) denotes the resulting filtered signal.
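The exact recursive formulas of the cited work are not reproduced here, but the principle can be illustrated with a generic sketch: a forward and a backward recursive pass that together realise smoothing with the kernel S(x) = exp(−|x|/σ_s) in constant time per sample, independent of the effective kernel width. The decomposition F = F_forward + F_backward − D used below is a standard identity and our assumption, not necessarily the authors' exact scheme.

```python
import numpy as np

def recursive_exp_smooth(d, sigma):
    # Smooth a 1-D signal with the kernel S(x) = exp(-|x|/sigma) using two
    # recursive passes; the cost per sample is constant, unlike direct
    # convolution whose cost grows with the kernel width.
    alpha = np.exp(-1.0 / sigma)
    n = len(d)
    fwd = np.empty(n)
    bwd = np.empty(n)
    fwd[0] = d[0]
    for x in range(1, n):              # forward (causal) pass
        fwd[x] = d[x] + alpha * fwd[x - 1]
    bwd[-1] = d[-1]
    for x in range(n - 2, -1, -1):     # backward (anti-causal) pass
        bwd[x] = d[x] + alpha * bwd[x + 1]
    return fwd + bwd - d               # equals sum_k alpha**|x - k| * d[k]
```

The result matches the direct O(N²) evaluation of the exponential kernel exactly, which is why such recursive schemes are attractive for real-time cost aggregation.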
To compute the aggregated cost in the 2D case, four passes of the recursive equations are performed: left to right, right to left, top to bottom, and bottom to top. These formulas give only an approximate solution for the cross-bilateral filter, but for the purpose of cost function aggregation, they give adequate results.
After the matching cost function has been aggregated for every δ, a pass along δ for every pixel gives the disparity values.
Disparity in occlusion areas is additionally filtered using symmetry considerations between DL, the disparity map from the left image to the right one, and DR, the disparity map from the right image to the left one.
Fig. 3.15 Left image of stereopair (a) and computed disparity map (b)
This rule is very efficient for correcting disparity in occlusion areas in stereo matching, because it is usually known that the minimal (or maximal, depending on the stereopair format) disparity corresponds to the farthest (covered) objects, and occlusions occur near boundaries and cover farther objects.
Figure 3.15 shows the results of the proposed algorithm.
The proposed method relies on the idea of convergence from a rough estimate towards a consistent depth map through subsequent iterations of the depth filter (Ignatov et al. 2009). On each iteration, the current depth estimate is refined by filtering in accordance with the images of the stereopair. The reference image is the colour image of the stereopair for which the depth is estimated; the matching image is the other colour image of the stereopair.
The first step of the method for depth smoothing is analysis and cutting of the
reference depth histogram (Fig. 3.16). The cutting of the histogram suppresses noise
present in depth data. The raw depth estimates could have a lot of outliers. The noise
might appear due to false stereo matching in occlusion areas and in textureless areas.
The proposed method uses two thresholds: a bottom of the histogram range B and a
top of the histogram range T. These thresholds are computed automatically from the
given percentage of outliers.
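A sketch of such automatic threshold selection is given below; the percentile-based rule and parameter name are our assumptions about one reasonable realisation, not the chapter's exact procedure.

```python
import numpy as np

def cut_depth_histogram(depth, outlier_pct=4.0):
    # Choose the bottom (B) and top (T) of the histogram range so that
    # half of the given outlier percentage is cut at each end, then clip.
    B = np.percentile(depth, outlier_pct / 2.0)
    T = np.percentile(depth, 100.0 - outlier_pct / 2.0)
    return np.clip(depth, B, T), B, T
```

Clipping to [B, T] discards extreme depth values produced by false stereo matches without touching the bulk of the histogram.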
The next step of the method for depth smoothing is left, right depth cross-check.
The procedure operates as follows:
• Compute the left disparity vector (LDV) from the left depth value.
• Fetch the right depth value mapped by the LDV.
• Compute the right disparity vector (RDV) from the right depth value.
• Compute the disparity difference (DD) of absolute values of LDV and RDV.
• If DD is higher than the threshold, the left depth pixel is marked as an outlier.
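The cross-check procedure above can be sketched as follows, working directly on disparity maps and assuming the convention that a left pixel (x, y) with disparity d maps to right pixel (x − d, y); both choices are our illustrative assumptions.

```python
import numpy as np

def cross_check(disp_left, disp_right, threshold=2):
    # Mark left-view pixels whose left and right disparities disagree.
    h, w = disp_left.shape
    ys, xs = np.indices((h, w))
    x_right = np.clip(xs - disp_left.astype(int), 0, w - 1)  # fetch via the LDV
    dd = np.abs(disp_left - disp_right[ys, x_right])         # disparity difference
    out = disp_left.copy()
    out[dd > threshold] = 0   # noisy pixels are marked by 0
    return out
```

Consistent pixel pairs pass unchanged; a pixel whose counterpart in the other map disagrees by more than the threshold is zeroed out as noise.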
3 Depth Estimation and Control 73
Depth Image
Depth cross- segmentation Depth
histogram
check onto textured smoothing
cutting
Fig. 3.17 Example of left, right depth cross-check: (a) left image; (b) right image; (c) left depth;
(d) right depth; (e) left depth with noisy pixels (marked black); (f) smoothing result for left depth
without depth cross-checking; (g) smoothing result for left depth with depth cross-checking
In our implementation, the threshold for the disparity cross-check is set to 2, and noisy pixels are marked by 0. Since 0 < 64, noisy pixels are automatically treated as outliers in further processing. An example of a depth map with noisy pixels marked according to the depth cross-check is shown in Fig. 3.17. It shows that the depth cross-check successfully removes outliers from occlusion areas (shown by red circles).
Fig. 3.18 Example of image segmentation into textured and non-textured regions: (a) colour
image; (b) raw depth; (c) binary segmentation mask (black, textured regions; white, non-textured
regions); (d) smoothing result without using image segmentation; (e) smoothing result using image
segmentation
The next step of the method for depth smoothing is binary segmentation of the left
colour image into textured and non-textured regions. For this purpose, the gradients
in four directions, i.e. horizontal, vertical, and two diagonal, are computed. If all
gradients are lower than the predefined threshold, the pixel is considered to be
non-textured; otherwise it is treated as textured. This could be formulated as follows:
$$BS(x, y) = \begin{cases} 255, & \text{if } \mathrm{gradients}(x, y) < \mathrm{Threshold} \\ 0, & \text{otherwise,} \end{cases}$$
where BS is the binary segmentation mask for the pixel with coordinates (x, y); a value of 255 corresponds to a non-textured image pixel, while 0 corresponds to a textured one. Figure 3.18 presents an example of image segmentation into textured and non-textured regions, along with an example of depth map smoothing with and without the segmentation mask.
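The segmentation rule above can be sketched directly (the threshold value is an illustrative assumption):

```python
import numpy as np

def texture_mask(gray, threshold=8.0):
    # Binary segmentation: 255 = non-textured pixel (all four directional
    # gradients below the threshold), 0 = textured pixel.
    g = gray.astype(np.float64)
    p = np.pad(g, 1, mode='edge')
    grads = np.stack([
        np.abs(p[1:-1, 2:] - p[1:-1, :-2]),  # horizontal
        np.abs(p[2:, 1:-1] - p[:-2, 1:-1]),  # vertical
        np.abs(p[2:, 2:] - p[:-2, :-2]),     # first diagonal
        np.abs(p[2:, :-2] - p[:-2, 2:]),     # second diagonal
    ])
    non_textured = np.all(grads < threshold, axis=0)
    return np.where(non_textured, 255, 0).astype(np.uint8)
```

A flat image yields an all-255 mask, while pixels straddling an intensity edge are classified as textured.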
Fig. 3.19 Examples of processed depth: (a) colour images; (b) initial raw depth maps; (c) depth
maps smoothed by the proposed method
The problem of depth-based virtual view synthesis (or depth image-based rendering, DIBR) is the reconstruction of the view from a virtual camera CV, given the views from other cameras C1 and C2 (or different views captured by a moving camera) and the available scene geometry (point correspondences, depth, or a precise polygon model) (see Fig. 3.20).
The following problems need to be addressed in particular during view
generation:
• Disocclusion
• Temporal consistency
• Symmetric vs asymmetric view generation
• Toed-in camera configuration
Fig. 3.20 Virtual view synthesis: cameras C1 and C2, the virtual camera CV, the reconstructed 3D scene, and the resulting disocclusion areas
3.7.1 Disocclusion
As we intend to use one depth map for virtual view synthesis, we should be prepared for the appearance of disocclusion areas. A disocclusion area is a part of the virtual image that becomes visible from the novel viewpoint but was not visible in the initial view. Examples of disocclusion areas are marked in black in Fig. 3.21. A common way to eliminate disocclusions is to fill these areas with the colours of neighbouring pixels.
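A minimal sketch of such filling, using the nearest valid pixel along each row, is shown below. This is a simplification of what production systems do; real implementations typically prefer background pixels chosen with the help of the depth map.

```python
import numpy as np

def fill_disocclusions(image, holes):
    # Fill hole pixels (holes == True) with the nearest valid value to the
    # left in the same row; holes at a row start are filled from the right.
    out = image.astype(np.float64).copy()
    h, w = out.shape
    unfilled = holes.copy()
    for y in range(h):
        for x in range(1, w):               # forward pass
            if unfilled[y, x] and not unfilled[y, x - 1]:
                out[y, x] = out[y, x - 1]
                unfilled[y, x] = False
        for x in range(w - 2, -1, -1):      # backward pass for leading holes
            if unfilled[y, x] and not unfilled[y, x + 1]:
                out[y, x] = out[y, x + 1]
                unfilled[y, x] = False
    return out
```

Propagating the last valid colour across a hole produces a visible smear for large disocclusions, which is one reason depth decrease (rather than increase) is preferred later in this chapter.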
3.7.2 Temporal Consistency

Most stereo disparity estimation methods consider still images as input, but TV
stereo systems require real-time depth control/view generation algorithms intended
for video. When considering all frames independently, some flickering can occur,
especially near objects’ boundaries. Usually, the depth estimation algorithm is
modified to output temporally consistent depth maps. Since it is not very practical
to use some complicated algorithms like bundle adjustment, more computationally
effective methods are applied, like averaging inside a small temporal window.
3.7.3 Symmetric vs Asymmetric View Generation

The task of stereo intermediate view generation is a particular case of arbitrary view
rendering, where the positions of virtual views are constrained to lie on the line
connecting the centres of source cameras. To generate the new stereopair with a
reduced stereo effect, we applied symmetric view rendering (see Fig. 3.22), where
the middle point of baseline stays fixed and both left and right views are generated.
Other configurations will render only one view, leaving the other intact. But in this
case, the disocclusion area will be located on one side of the popping-out objects and
can be more susceptible to artefacts.
3.7.4 Toed-in Camera Configuration

There are two possible camera configurations: parallel and toed-in (see Fig. 3.23). In
the case of parallel configuration, depth has positive value, and all objects appear in
front of the screen. When the stereo effect is large, this can cause eye discomfort.
The toed-in configuration is closer to the natural human visual system. However, it generates keystone distortion in the images, including vertical
disparity. Due to non-parallel disparity lines, the algorithm of depth estimation
will give erroneous results, and such content will require rectification. To eliminate
Cl Vl V C
r r
78 E. V. Tolstaya and V. V. Bucha
L R L R
Fig. 3.23 Parallel (a) and toed-in (b) camera configuration; illustration of keystone distortion (c)
Fig. 3.24 Virtual view synthesis: initial stereopair (top); generated stereopair with 30% depth
decrease (bottom)
eye discomfort from stereo while preserving the stereo effect, a method of zero-plane setting can be applied: the virtual image plane is shifted and the depth is reduced by some amount so that it takes negative values in some image areas.
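A sketch of the depth reduction and zero-plane shift applied to a disparity map; the function name, the default strength, and the mid-depth choice of the zero plane are assumptions for illustration, not values from the chapter:

```python
import numpy as np

def adjust_disparity(disparity, strength=0.7, zero_plane=None):
    """Reduce the stereo effect and move the zero-parallax plane.

    disparity: 2D array of pixel disparities (positive = in front of screen).
    strength: scaling factor (< 1 reduces depth; a 30% decrease -> 0.7).
    zero_plane: disparity value mapped onto the screen plane; after the
    shift, areas with smaller disparity get negative values (behind screen).
    """
    d = strength * disparity
    if zero_plane is None:
        zero_plane = 0.5 * (d.min() + d.max())  # put mid-depth on screen
    d = d - zero_plane
    # Symmetric rendering: each view moves by half the adjusted disparity,
    # so the middle point of the baseline remains fixed.
    left_shift = +0.5 * d
    right_shift = -0.5 * d
    return d, left_shift, right_shift
```

Splitting the shift equally between the two generated views mirrors the symmetric rendering described above, placing the disocclusion areas on both sides of popping-out objects.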
Figures 3.24 and 3.25 illustrate resulting stereopairs with 30% depth decrease.
In the proposed application, we consider depth decrease as more applicable for
the “depth control” feature, since the automatic algorithm in this case will not face
the problem of disocclusion and, hence, hopefully produce fewer artefacts. In the
following Chap. 4 (Semi-automatic 2D to 3D Video Conversion), we will further address the topic of depth-image-based rendering (DIBR) for situations with a depth increase and emerging disocclusion areas that must be treated in a specific way.
Finally, we would like to add that, unfortunately, very few models of 3D TVs were equipped with the “3D depth control” feature for a customisable strength of the stereo effect, for example, the LG Electronics 47GA7900. Automatic 2D → 3D video conversion systems (available in production in some models of 3D TVs by LG Electronics, Samsung, etc. and also in the commercially available TriDef 3D software by DDD Group) can be considered part of such a feature, since during conventional monoscopic to stereoscopic video conversion, the user can preset the desired amount of stereo effect.
3 Depth Estimation and Control 79
Fig. 3.25 Virtual view synthesis: initial stereopair (top); generated stereopair with 30% depth
decrease (bottom)
References
Birchfield, S., Tomasi, C.: A pixel dissimilarity measure that is insensitive to image sampling. IEEE
Trans. Pattern Anal. Mach. Intell. 20(4), 401–406 (1998)
Deriche, R.: Fast algorithms for low-level vision. IEEE Trans. Pattern Anal. Mach. Intell. 12(1),
78–87 (1990)
Ignatov, A., Joesan, O.: Method and system to transform stereo content. European Patent EP2293586 (2009)
Ignatov, A., Bucha, V., Rychagov, M.: Disparity estimation in real-time 3D acquisition and
reproduction system. In: Proceedings of International Conference on Computer Graphics
“Graphicon”, pp. 61–68 (2009)
Kanade, T., Okutomi, M.: A stereo matching algorithm with an adaptive window: theory and
experiment. IEEE Trans. Pattern Anal. Mach. Intell. 16(9), 920–932 (1994)
Mikšícek, F.: Causes of visual fatigue and its improvements in stereoscopy. University of West
Bohemia in Pilsen, Pilsen, Technical Report DCSE/TR-2006-04 (2006)
Pham, T., van Vliet, L.: Separable bilateral filtering for fast video preprocessing. In: Proceedings of
IEEE International Conference on Multimedia and Expo, pp. 1–4 (2005)
Prazdny, K.: Detection of binocular disparities. In: Fischler, M.A., Firschein, O. (eds.) Readings in Computer Vision: Issues, Problems, Principles, and Paradigms, pp. 73–79. Morgan Kaufmann, Los Altos (1987)
Scharstein, D., Szeliski, R.: A taxonomy and evaluation of dense two-frame stereo correspondence
algorithms. Int. J. Comput. Vis. 47(1), 7–42 (2002)
Tolstaya, E.V., Bucha, V.V.: Silhouette extraction using color and depth information. In: Proceedings of 2012 IS&T/SPIE Electronic Imaging. Three-Dimensional Image Processing (3DIP) and Applications II, 82900B (2012). https://doi.org/10.1117/12.907690. Accessed 04 October 2020
Vatolin, D.: Why does 3D lead to the headache? Part 4: Parallax (in Russian) (2015). https://habr.com/en/post/378387/
Chapter 4
Semi-Automatic 2D to 3D Video Conversion
P. Pohl (*)
Samsung R&D Institute Rus (SRR), Moscow, Russia
e-mail: p.pohl@samsung.com
E. V. Tolstaya
Aramco Innovations LLC, Moscow, Russia
e-mail: ktolstaya@yandex.ru
The extraction of key frames is a very important step for semi-automatic video
conversion. Key frames for stereo conversion are completely different from the key frames selected for video summarization, as described in Chap. 6. The stereo
conversion key frames are selected for an operator, who will manually draw depth
maps for key frames, and then these depth map frames will be further interpolated
(propagated) through the whole video clip, followed by depth-based stereo view
rendering. The more frames that are selected, the more manual labour will be
required for video conversion, but in the case of an insufficient number of key
frames, a lot of intermediate frames may have inappropriate depths or contain
conversion artefacts. The video clip should be thoroughly analysed prior to the
start of manual work. For example, a slow-motion scene with simple, close-to-linear motion will require fewer key frames, while dramatic, fast-moving objects, especially if they are closer to the camera and have larger occlusion areas, must have more key frames to ensure better conversion quality.
84 P. Pohl and E. V. Tolstaya
In Wang et al. (2012), the key frame selection algorithm relies on the size of cumulative occlusion areas, and shot segmentation is performed using a block-based histogram comparison.
Sun et al. (2012) select key frame candidates using the ratio of SURF feature
points to the correspondence number, and a key frame is selected from among
candidates such that it has the smallest reprojection error. Experimental results
show that the propagated depth maps using the proposed method have fewer errors,
which is beneficial for generating high-quality stereoscopic video.
However, for semi-automatic 2D-3D conversion, the key frame selection algorithm should properly handle the various situations that are difficult for depth propagation, so as to diminish possible depth map quality issues while at the same time limiting the overall number of key frames.
Additionally, the algorithm should analyse all parts of video clips and group
similar scene parts. For example, very often during character conversation, the
camera switches from one object to another, while objects’ backgrounds almost do
not change. Such smaller parts of a bigger “dialogue” scene can be grouped together
and be considered as a single scene for every character.
For this purpose, video should be analysed and segmented into smaller parts
(cuts), which should be sorted into similar groups with different characteristics:
1. Scene change (shot segmentation). Many algorithms have already been proposed
in the literature. They are based on either abrupt colour change or motion vector
change. The moving averages of histograms are analysed and compared to some
threshold, meaning that the scene changes when the histogram difference exceeds
the threshold. Smooth scene transitions pose serious problems for such algorithms. Needless to say, the depth map propagation and stereo conversion of such scenes are also difficult tasks.
2. Motion type detection: still, panning, zooming, complex/chaotic motion. Slow-
motion scenes are easy to propagate, and in this case, a few key frames can save a
lot of manual work. A serious problem for depth propagation is caused by zoom
motion (objects approaching or moving away); in this case, a depth increase or
decrease should be smoothly interpolated (this is illustrated in Fig. 4.2 bottom
row).
3. Visually similar scenes (shots) grouping. Visually similar scenes very often occur
when shooting several talking people, when the camera switches from one person
to another. In this case, we can group scenes with one person and consider this
group to be a longer continuous sequence.
4. Object tracking in the scene. To produce consistent results during 2D-3D con-
version, it is necessary to analyse objects’ motion. When the main object (key
object) appears in the scene, its presence is considered to select the best key frame
when the object is fully observable, and its depth map can be well propagated to
the other frames of its appearance. Figures 4.1 and 4.2 (two top rows) illustrate this kind of situation.
5. Motion segmentation for object tracking. For better conversion results, motion is
analysed within the scenes. The simplest type of motion is panning or linear
4 Semi-Automatic 2D to 3D Video Conversion 85
Fig. 4.1 Example of the key object of the scene and its corresponding key frame
Fig. 4.2 Different situations for key frame selection: object should be fully observable in the scene,
objects appears in the scene, and object is zoomed
7. Occlusion analysis. The key frame should be selected when the cumulative area
of occlusion goes beyond the threshold. This is similar to Wang et al. (2012).
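The histogram-threshold scene change test from item 1 can be sketched as follows; for brevity the sketch compares each frame only with its predecessor rather than with a moving average, and the threshold and bin count are illustrative assumptions:

```python
import numpy as np

def detect_shot_boundaries(frames, threshold=0.4, bins=32):
    """Flag frames whose colour histogram differs sharply from the previous one.

    frames: list of HxWx3 uint8 arrays.
    Returns the indices of frames that start a new shot.
    """
    boundaries = []
    prev_hist = None
    for t, frame in enumerate(frames):
        hist, _ = np.histogram(frame, bins=bins, range=(0, 256))
        hist = hist / hist.sum()  # normalize so the L1 distance is in [0, 2]
        if prev_hist is not None and np.abs(hist - prev_hist).sum() > threshold:
            boundaries.append(t)
        prev_hist = hist
    return boundaries
```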
The key frame selection algorithm (Tolstaya and Hahn 2012), based on a function reflecting the transition complexity between every frame pair, finds the optimal distribution of key frame indices by graph optimization. The frames of a video shot are represented by the vertices of a graph, where the source is the first frame and the sink is the last frame. When two frames are too far apart, their transition complexity is set to some large value. The optimal path can then be found using the well-known Dijkstra's algorithm.
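The graph formulation above can be sketched as follows; `transition_cost` is a hypothetical stand-in for the transition complexity function of Tolstaya and Hahn (2012), and `max_span` models the cutoff beyond which the complexity would be set to a large value:

```python
import heapq

def select_key_frames(n_frames, transition_cost, max_span=10):
    """Pick key frames as the shortest source-to-sink path over frames.

    transition_cost(i, j): complexity of propagating depth from frame i
    to frame j. Frames farther apart than max_span are not connected.
    """
    INF = float("inf")
    dist = [INF] * n_frames
    prev = [-1] * n_frames
    dist[0] = 0.0
    heap = [(0.0, 0)]
    while heap:
        d, i = heapq.heappop(heap)
        if d > dist[i]:
            continue  # stale heap entry
        for j in range(i + 1, min(i + max_span, n_frames - 1) + 1):
            nd = d + transition_cost(i, j)
            if nd < dist[j]:
                dist[j] = nd
                prev[j] = i
                heapq.heappush(heap, (nd, j))
    # Walk back from the last frame (sink) to the first (source).
    path, node = [], n_frames - 1
    while node != -1:
        path.append(node)
        node = prev[node]
    return path[::-1]
```

A cost that grows superlinearly with frame distance yields densely spaced key frames; a near-constant cost collapses the path to the two endpoints.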
Ideas on the creation of an automatic video analysis algorithm based on machine
learning techniques can be further explored in Chap. 6, where we describe
approaches for video footage analysis and editing and style simulations for creating
dynamic and professional-looking video clips.
4.3.1 Introduction
One of the most challenging problems that arise in semi-automatic conversion is the
temporal propagation of depth data. The bottleneck of the whole 2D to 3D conver-
sion pipeline is the quality of the propagated depth: if the quality is not high enough,
a lot of visually disturbing artefacts appear on the final stereo frames. The quality
strongly depends on the frequency of manually assigned key frames, but drawing a
lot of frames requires more manual work and makes production slower and more
expensive. That is why the crucial problem is error-free temporal propagation of
depth data through as many frames as possible. The optimal key frame distance for
the desired quality of outputs is highly dependent on the properties of the video
sequence and can change significantly within one sequence.
The temporal propagation of key frame depth data is a complex task. Video data are temporally undersampled; they contain noise, motion blur, and optical effects such as reflections, flares, and transparent objects. Moreover, objects in the scene can disappear, become occluded, or significantly change shape or visibility.
The traditional method of depth interpolation uses a motion estimation result, using
either depth or video images. Varekamp and Barenbrug (2007) propose creating the first estimate of depth by bilateral filtering of the previous depth image and then correcting it by estimating the motion between depth frames. A similar approach is described by Muelle et al. (2010). Harman et al. (2002) use a machine
learning approach for the depth assignment of key frames. They suggest that these
should be selected manually or that techniques similar to those for shot-boundary
detection should be applied. After a few points are assigned, a classifier (separate for
each key frame) is trained using a small number of samples, and then it restores the
depth in the key frame. After that, a procedure called “depth tweening” restores
intermediate depth frames. For this purpose, both classifiers of neighbouring key
frames are fed with an image value to produce a depth value. For the final depth, both
intermediate depths are weighted by the distance to the key frames. Weights could
linearly depend on the time distance, but the authors propose the use of a non-linear
time-weight dependence. A problem with such an approach could arise when
intermediate video frames have areas that are completely different than those that
can be found on key frames (for example, in occlusion areas). However, this
situation is difficult for the majority of depth propagation algorithms. Feng et al.
(2012) describe a propagation method based on the generation of superpixels,
matching them and generating depth using matching results and key frame depths.
Superpixels are generated by SLIC (Simple Linear Iterative Clustering). Superpixels
are matched using mean colour (three channels) and the coordinates of the centre
position. Greedy search finds superpixels in a non-key frame (within some fixed
window) with minimal colour difference and L1 spatial distance, multiplied by a
regularization parameter. Cao (2011) proposes the use of motion estimation and
bilateral filtering to get a first depth estimate and refines it by applying depth motion
compensation in frames between key frames with assigned depths.
As a base method for comparison, we will use an approach similar to Cao (2011).
The motion information is then used for the warping of the depth data from previous
and subsequent key frames. These two depth fields are then mixed with the weights
of motion confidence. These weights can be acquired by an analysis of the motion
projection error from the current image to one of the key frames or the error of
projection of a small patch, which has slightly more stable behaviour. As a motion
estimation algorithm, we use the optical flow described by Pohl et al. (2014), with the addition of the third of the three YCbCr channels.
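A minimal sketch of warping key frame depth to the current frame with a precomputed flow field, as in the baseline just described; nearest-neighbour sampling is an illustrative simplification of the real resampling:

```python
import numpy as np

def warp_depth(depth_key, flow):
    """Backward-warp a key frame depth map to the current frame.

    depth_key: 2D depth map of the key frame.
    flow: HxWx2 array of motion vectors (dx, dy) from the current frame
    to the key frame, e.g. produced by an optical flow estimator.
    """
    h, w = depth_key.shape
    ys, xs = np.mgrid[0:h, 0:w]
    # For each current-frame pixel, look up where it lands in the key frame.
    src_x = np.clip(np.rint(xs + flow[..., 0]), 0, w - 1).astype(int)
    src_y = np.clip(np.rint(ys + flow[..., 1]), 0, h - 1).astype(int)
    return depth_key[src_y, src_x]
```

In the baseline, a warp of this kind is done from both the previous and the subsequent key frame, and the two results are mixed by motion confidence.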
Similar but harder problems appear in the area of greyscale video colourization,
as in Irony (2005). In this case, only greyscale images can be used for matching and
finding similar objects. Pixel values and local statistics (features) are used in this
case; spatial consistency is also taken into account.
Most motion estimation algorithms cannot handle larger displacements and cannot be used for interpolation over more than a few frames. The integration of
motion information over time leads to increasing motion errors, especially near
object edges and in occlusion areas. Bilateral filtering of either motion or depth
can cover only small displacements or errors. Moreover, it can lead to disturbing
artefacts in the case of similar foreground and background colours.
Our dense depth propagation algorithm interpolates the depth for every frame
independently. It utilizes the nearest preceding and nearest following frames with
known depth maps (key frame depth). The propagation of depth maps from two
sides is essential, as it allows us to interpolate most occlusion problems correctly.
The general idea is to find correspondence between image patches, and assuming
that patches with similar appearances have similar depths, we can synthesize an
unknown depth map based on this similarity (Fig. 4.3).
The process of finding similar patches is based on work by Korman and Avidan
(2015). First, a bank of Walsh-Hadamard (W-H) filters is applied to both images. As
a result, we have a vector of filtering results for every pixel (Fig. 4.4). After that, a
hash code is generated for each pixel using this vector of filtering results.
Fig. 4.3 Illustration of depth propagation process from two neighbouring frames, the preceding
and following frames
Hash tables are built using hash codes and the corresponding pixel coordinates for
fast search for patches with equal hashes. We assume that patches with equal hashes
have a similar appearance. For this purpose, the authors applied a coherency sensitive hashing (CSH) algorithm. The hash code is a 16-bit short integer; given the hashes for each patch, a greedy matching algorithm selects patches with the same (or closest) hash and computes the patch difference as the difference between the vectors of W-H filter results. The matching error used in the search for the best correspondence is a combination of the filter output difference and the spatial distance, with a dead zone (no penalty for small distances) and a limit on the maximum allowed distance. The spatial distance between matching patches was introduced to avoid unreasonable correspondences between similar image structures from different parts of the image; such matches are improbable for relatively close frames of video input.
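A much-reduced sketch of the hashing idea, not the actual CSH implementation of Korman and Avidan (2015): each patch is projected onto a few Walsh-Hadamard kernels and one bit per kernel (the sign of the response) is packed into a short integer hash. The kernel count, patch size, and sign-based bit packing are illustrative assumptions:

```python
import numpy as np

def hash_patches(image, patch=4):
    """Hash every patch of a greyscale image by W-H projection signs.

    Returns a dict mapping hash code -> list of (y, x) patch positions,
    so patches with equal hashes can be looked up in O(1).
    """
    # Three 1D Walsh-Hadamard rows, used as separable 2D kernels.
    rows = np.array([[1, 1, 1, 1],
                     [1, 1, -1, -1],
                     [1, -1, -1, 1]], dtype=float)
    h, w = image.shape
    table = {}
    for y in range(h - patch + 1):
        for x in range(w - patch + 1):
            p = image[y:y + patch, x:x + patch].astype(float)
            p = p - p.mean()  # remove DC so the first bit is informative
            code = 0
            for bit, r in enumerate(rows):
                resp = r @ p @ r  # separable 2D response: rows then columns
                if resp > 0:
                    code |= 1 << bit
            table.setdefault(code, []).append((y, x))
    return table
```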
In the first step, RGB video frames are converted to the YCbCr colour space. This
allows us to treat luminance and chrominance channels differently. To accommodate
fast motion and decrease the sensitivity to noise, this process is done using image
pyramids. Three pyramids for video frames and two pyramids for key frame depth
maps are created. The finest level (level = 0) has full-frame resolution, and we decrease the resolution by a factor of 0.5, so that the coarsest level (level = 2) has 1/4 of
the full resolution. We use an area-based interpolation method. The iterative scheme
starts on the coarsest level of the pyramid by matching two key frames to the current
video frame. The initial depth is generated by voting over patch correspondences
using weights dependent on colour patch similarity and temporal distances from
reference frames. In the next iterations, matching between images combined from
colour and the depth map is performed. For performance reasons, only one of the chrominance channels (whether Cr or Cb makes little difference) is replaced with the given depth in the reference frames, and thus a depth estimate is obtained for the current frame. On each level, we perform several CSH matching iterations and
iteratively update the depth image by voting. A Gaussian kernel smoothing with
decreasing kernel size or another low-pass image filter is used to blur the depth
estimate. This smooths the small amount of noise coming from logically incorrect
matches. The low-resolution depth result for the current frame is upscaled, and
the process is repeated for every pyramid level and ends at the finest resolution,
which is the original resolution of the frames and depth. The process is described by
Algorithm 4.1.
Within Algorithm 4.1, the patch matching maps from the current frame to its neighbouring key frames are denoted, for each pyramid level, as

Map^{−t}_{level}: I^{t}_{level} → I^{t−1}_{level} and Map^{+t}_{level}: I^{t}_{level} → I^{t+1}_{level},

i.e. Map^{−t} matches patches of the current frame I^{t} to the preceding key frame I^{t−1}, and Map^{+t} matches them to the following key frame I^{t+1}.
The patch-voting procedure uses an estimate of match error for forward and
backward frames and is described in Algorithm 4.2. As an estimate of error, we
use the sum of absolute differences over patch correspondence (on the coarsest level)
or the sum of absolute differences of 16 Walsh-Hadamard kernels that are available
after CSH estimation. In our experiment, the best results were achieved with errors
normalized by the largest error over the whole image. The use of Walsh-Hadamard kernels as a similarity measure is justified because it approximates the true difference between patches while decreasing sensitivity to noise, since only the coarser filtering results are used.
W_prev = exp(−Err_{t−1} / (2σ_V(level))),
W_next = exp(−Err_{t+1} / (2σ_V(level)))
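A sketch of the two-sided patch voting with exponential weights W = exp(−Err/(2σ_V)) applied to the match errors of the previous and next key frames, with errors first normalized by the image-wide maximum as the text reports gave the best results; the default sigma is illustrative:

```python
import numpy as np

def vote_depth(d_prev, d_next, err_prev, err_next, sigma_v=0.05):
    """Blend depths warped from the two key frames by match-error voting.

    d_prev, d_next: depth maps warped from the preceding/following key frame.
    err_prev, err_next: per-pixel match errors against those key frames.
    """
    # Normalize errors by the largest error over the whole image.
    e_prev = err_prev / max(err_prev.max(), 1e-12)
    e_next = err_next / max(err_next.max(), 1e-12)
    w_prev = np.exp(-e_prev / (2.0 * sigma_v))
    w_next = np.exp(-e_next / (2.0 * sigma_v))
    return (w_prev * d_prev + w_next * d_next) / (w_prev + w_next)
```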
Most parts of the algorithm are well parallelizable and can make use of multicore
CPUs or GPGPU architectures. The implementation of CSH matching from Korman
and Avidan (2015) can be parallelized by the introduction of propagation tiles. This
decreases the area over which the found match is propagated and usually leads to an
increased number of necessary iterations. When we investigated the differences on
an MPI-Sintel testing dataset Butler et al. (2012), the differences were small even on
sequences with relatively large motions. To speed up CSH matching, we use only
16 Walsh-Hadamard kernels (as we mentioned earlier), scaled to short integer. This
allows the implementation of the computation of absolute differences using SSE and
intrinsic functions. The testing of candidates can be further parallelized in GPGPU
implementation. Our solution has parts running on CPU and parts running on
GPGPU. We tested our implementation on a PC with a Core i7 960 (3.2 GHz)
CPU and an Nvidia GTX 480 graphics card. We achieved running times of ~2 s/frame on a video with 960×540 resolution. The most time-consuming part of the computation is the CSH matching. For most experiments, we used three pyramid levels with half resolution between the levels, two iterations per level, σ(level,1) = 2, σ(level,2) = 1, and σ_V(level) = 0.05·3^(level−1). A slow Python implementation of the algorithm is given by Tolstaya (2020).
4.3.4 Results
In general, some situations remain difficult for propagation, such as low-contrast videos, noise, and small parts of moving objects, since in the latter case the background pixels inside a patch occupy the biggest part of the patch and contribute too much to the voting. However, when the background does not change substantially, small details can be tracked quite acceptably. An advantage of CSH matching is that it does not estimate true motion: objects in the query frame can be assembled from completely different patches, based only on their visual similarity to reference patches (Fig. 4.5).
The main motivation for the development of a depth propagation algorithm was
the elimination or suppression of the main artefacts of the previously used algorithm
based on optical flow. The main artefacts include depth leakage and depth loss.
Depth leakage can be caused either by the misalignment of key frame depth and
motion edge or an incorrect motion estimation result. The most perceptible artefacts
are noisy tracks of object depth that are left on the background after the foreground
object moves away. Depth loss is mostly caused by an error of motion estimation in
the case of fast motion or complex scene changes like occlusions, flares, reflections,
or semi-transparent objects in the foreground. Examples of such artefacts are shown in Fig. 4.7.
Fig. 4.6 Comparison of optical flow-based interpolation (solid line) with our new method (dashed line) for four different key frame distances (key frame distance is on the x-axis); PSNR comparison with the original depth. Top: ballet sequence; bottom: breakdance sequence
Fig. 4.7 Optical flow-based interpolation: an example of depth leakage and a small depth loss (right part of the head) in the case of fast motion of an object and flares
Figure 4.8 shows the output of the proposed algorithm. We compared the
performance of our method with optical flow-based interpolation on the MSR 3D
Video Dataset from Microsoft research (Zitnick et al. 2004). The comparison of the
interpolation error (as the PSNR from the original) is shown in Fig. 4.6. Figure 4.9
compares the depth maps of the proposed algorithm and the depth map computed
with motion vectors, with the depth overlain onto the source video frames. We can
see that in the case of motion vectors, small details can be lost. Other tests were done
on proprietary datasets with the ground truth depth from stereo (computed by the
method of Ignatov et al. 2009) or manually annotated.
Fig. 4.8 Our depth interpolation algorithm—an example of solved depth leakage and no depth loss
artefact
Fig. 4.9 Comparison of depth interpolation results—optical flow-based interpolation (top) and our
method (bottom)—an example of solved depth leakage and thin object depth loss artefacts. Left,
depth + video frame overlay; right, interpolated depth. The key frame distance used is equal to eight
From our experiments, we see that the proposed depth interpolation method has
on average better performance than interpolation based on optical flow. Usually,
finer details of depth are preserved, and the artefacts coming from the imperfect
alignment of the depth edge and the true edge of objects are less perceptible or
removed altogether. Our method is also capable of capturing faster motion. On the
other hand, optical flow results are more stable in the case of consistent and not too
fast motion, especially in the presence of a high level of camera noise, video
flickering, or a complex depth structure. The proposed method has a lot of param-
eters, and many of them were set up by intelligent guesswork. One of the future steps
might be a tuning of parameters on a representative set of sequences. Another way
forward could be a hybrid approach that merges the advantages of our method and
optical flow-based interpolation. Unfortunately, we were not able to find a public
dataset for the evaluation of depth propagation that is large enough and includes a
4.4.1 Introduction
Motion vectors provide helpful insight into video content. They are used for occlusion analysis and, during the background restoration process, to fill in occlusions produced by stereo rendering.
Motion vectors are the apparent motion of brightness patterns between two images, defined by the vector field u(x). Optical flow is one of the important but not yet fully solved problems in computer vision, and it is under constant development. Recent methods using ML techniques and precomputed cost volumes like
Teed and Deng (2020) or Zhao (2020) improve the performance in the case of fast
motion and large occlusions. At the time this material was prepared, the state-of-the-
art methods generally used variational approaches. The Teed and Deng (2020) paper
states that even the most modern methods are inspired by a traditional setup with a
balance between data and regularization terms. Teed and Deng (2020) even follow an iterative structure similar to first-order primal-dual methods from variational optical flow; however, they use learned updates implemented using convolutional layers. For computing optical flow, we decided to adapt the efficient primal-dual
optimization algorithm proposed by Chambolle and Pock (2011), which is suitable
for GPU implementation. The authors propose the use of total variation optical flow
with a robust L1 norm and extend the brightness constancy assumption by an
additional field to model brightness change. The main drawbacks of the base
algorithm are incorrect smoothing around motion edges and unpredictable behaviour
in occlusion areas. We extended the base algorithm to use colour information and
replaced TV-L1 regularization by a local neighbourhood weighting known as the
non-local smoothness term, proposed by Werlberger et al. (2010) and Sun et al.
(2010a). To fix the optical flow result in occlusion areas, we decided to use motion
inpainting, which uses nearby motion information and motion over-segmentation to
fill in unknown occlusion motion. Sun et al. (2010b) propose explicitly modelling
layers of motion to model the occlusion state. However, this leads to a non-convex
problem formulation that is difficult to optimize even for a small number of motion
layers.
As a trade-off between precision and computation time, we use two colour channels (Y′ and Cr of the Y′CbCr colour space) instead of three. A basic version of two-colour variational optical flow with an L1 norm smoothness term is

E(u, w) = ∫_Ω [ λ_L |I^L_2(x + u(x)) − I^L_1(x) + w(x)| + λ_C |I^C_2(x + u(x)) − I^C_1(x)| ] dx + E_S(u, w),
E_S(u, w) = ∫_Ω ( |∇u_1(x)| + |∇u_2(x)| + γ |∇w(x)| ) dx,
where Ω is the image domain; E is the minimized energy; u(x) = (u1(x), u2(x)) is the
motion field; w(x) is the field connected to the illumination change; λL and λC are
parameters that control the data term’s importance for luminance and colour chan-
nels, respectively; γ controls the regularization of the illumination change; ES is the
smoothness part of the energy that penalizes changes in u and w; I^L_1 and I^L_2 are the luminance components of the current and next frames; and I^C_1 and I^C_2 are the colour components of the current and next frames.
s(I_1, x, dx) = exp(−k_s |dx|) · exp(−|I_1(x) − I_1(x + dx)|² / (2 k_c²)),
s_n(I_1, x, dx) = s(I_1, x, dx) / Σ_{dx′∈Ψ} s(I_1, x, dx′),
where Ψ is the non-local neighbourhood (e.g. a 5×5 square with (0,0) in the
middle); s(I1, x, dx) and sn(I1, x, dx) are non-normalized and normalized non-local
weights, respectively; and ks and kc are parameters that control the non-local
weights’ response.
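A direct per-pixel transcription of the non-local weights s and s_n for a greyscale image; the parameter values and the dictionary interface are illustrative:

```python
import numpy as np

def nonlocal_weights(img, x, y, radius=2, ks=0.5, kc=10.0):
    """Normalized non-local smoothness weights around pixel (x, y).

    img: 2D greyscale image. Returns a dict mapping offset (dx, dy)
    to its normalized weight s_n, combining a spatial falloff
    exp(-ks*|dx|) and a colour similarity exp(-|dI|^2 / (2*kc^2)).
    """
    h, w = img.shape
    weights = {}
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            xx, yy = x + dx, y + dy
            if 0 <= xx < w and 0 <= yy < h:
                colour = np.exp(-(img[y, x] - img[yy, xx]) ** 2 / (2 * kc ** 2))
                spatial = np.exp(-ks * np.hypot(dx, dy))
                weights[(dx, dy)] = spatial * colour
    total = sum(weights.values())
    return {d: v / total for d, v in weights.items()}
```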
In the original formulation of the non-local smoothness term, the size of the local
window determines the number of weights and dual variables per pixel. Thus, for
example, if the window size is 5×5, then for each pixel we need to store 50 dual
variables and 25 weights in memory. Considering that all of these variables are used
in every iteration of optimization, a large number of computations and memory
transfers are required. To overcome this problem, we devised a computation simpli-
fication to decrease the number of non-local weights and dual variables. The idea is
to decrease the non-local neighbourhood for motion and use a larger part of the
image for weights. The way that we use a 3×3 non-local neighbourhood with 5×5 image information is described by the following formula:
4.4.6 Solver
Our solver is based on Algorithm 4.1 from Chambolle and Pock (2011). According
to this algorithm, we derived the iteration scheme for the non-local smoothness term
in optical flow. In order to get a convex formulation for optimization, we need to
linearize the data term ED for both channels, Y and Cr. The linearization uses the
current state of motion field u0; derivatives are approximated using the following
scheme:
where I1 is the image from which motion is computed, I2 is the image to which
motion is computed, IT is the image time derivative estimation, and Ix and Iy are
image spatial derivative estimations. It is also possible to derive a three colour
channel version, but the computational complexity increase is considerable. Full
implementation details can be found in Pohl et al. (2014).
where n is the pyramid level (zero n means the finest level and hence the highest
resolution), λ is one of the regularization parameters that is a function of n, and nramp
is the start of the linear ramp. The parameters λ_coarsest and λ_finest define the initial and
final λ value.
The variational optical flow formulation does not explicitly handle occlusion
areas. The estimated motion in occlusion areas is incorrect and usually follows a
match to the nearest patch of similar colour. However, if the information for visible
pixels is correct, it is possible to find occlusion areas using the motion from the
nearest frames to the current frame, as shown in Fig. 4.10. Object 401 on the moving
background creates occlusion 404. The precise computation of occlusion areas uses the inverse of bilinear interpolation and thresholding.
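A common forward-backward consistency check can stand in as a sketch of this occlusion detection; nearest-neighbour sampling replaces the (inverse) bilinear interpolation of the actual method, and the threshold is illustrative:

```python
import numpy as np

def occlusion_mask(flow_fwd, flow_bwd, threshold=1.0):
    """Mark pixels where forward and backward motion disagree.

    flow_fwd: HxWx2 motion from the current frame to a neighbour frame.
    flow_bwd: HxWx2 motion from that neighbour back to the current frame.
    A pixel is flagged as occluded when following the forward motion and
    then the backward motion does not return near the starting point.
    """
    h, w = flow_fwd.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w]
    # Where does each pixel land in the neighbour frame?
    tx = np.clip(np.rint(xs + flow_fwd[..., 0]), 0, w - 1).astype(int)
    ty = np.clip(np.rint(ys + flow_fwd[..., 1]), 0, h - 1).astype(int)
    round_trip = flow_fwd + flow_bwd[ty, tx]
    return np.hypot(round_trip[..., 0], round_trip[..., 1]) > threshold
```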
where Ω is the image domain, u23(x) and u21(x) are motion fields from the central to
the next and central to the previous frame respectively, M(θ, x) is the motion given
by the model with parameters θ at the image point x, and k is the parameter that
weights the dispersion of the misfit. J is the evaluation function: higher values of this
function give better candidates. This evaluation still needs a parameter k that acts as
the preferred dispersion. However, it will still give quite reasonable results, even
when all clusters have high dispersions. After the evaluation stage, misfit histogram
analysis is done in order to find the first mode of the misfit. We search for the first local minimum in a histogram smoothed by convolution with a Gaussian, because the first local minimum in an unsmoothed histogram is too sensitive to noise. Pixels that have a
misfit below three times the standard deviation of the first mode are deemed pixels
belonging to the examined cluster.
After the joint model fit, single direction occlusion areas are added if the local
motion is in good agreement with the fitted motion model. Our experiments show
that the best over-clustering results were achieved using the similarity motion model:
(u_1, u_2, 1)^T = M(θ, x) = [ sR  t ; 0 0 1 ] (x_1, x_2, 1)^T,

where u_1 and u_2 are motion vectors, x_1 and x_2 are the coordinates of the original point, t = (t_1, t_2) is the translation, R is an orthonormal 2×2 rotation matrix, s is the scaling coefficient, and θ = (R, s, t) are the parameters of the model M.
In order to assign clusters to areas which are marked as occlusions in both
directions, we use a clustering inpainting algorithm. The algorithm searches for
every occluded pixel and assigns it a cluster number using a local colour similarity
assumption:
$$W(x, k) = \sum_{y \in \Omega} \begin{cases} \exp\!\left(-\dfrac{|I_1(x) - I_1(y)|^2}{2\sigma^2}\right), & C(y) = k \\ 0, & C(y) \neq k \end{cases}$$

$$C(x) = \arg\max_k W(x, k),$$
where Ω is the local neighbourhood domain, I1(x) is the current image pixel with
coordinates x, C(x) is the cluster index of the pixel with coordinates x, W(x, k) is the
weight of the cluster k for a pixel with coordinates x, and σ is the parameter
controlling the colour similarity measure of current and neighbourhood pixels.
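The voting rule above can be sketched as follows. This is an illustrative sketch; the window radius and the use of −1 to mark unlabelled (occluded) pixels are our assumptions, not details from the text.

```python
import numpy as np

def assign_occluded_pixel(I1, C, x, K, radius=5, sigma=10.0):
    """Assign a cluster index to occluded pixel x by colour similarity
    to already-labelled neighbours (the W(x, k) voting rule)."""
    h, w = C.shape
    y0, y1 = max(0, x[0] - radius), min(h, x[0] + radius + 1)
    x0, x1 = max(0, x[1] - radius), min(w, x[1] + radius + 1)
    weights = np.zeros(K)
    for i in range(y0, y1):
        for j in range(x0, x1):
            k = C[i, j]
            if k >= 0:  # -1 marks unlabelled (occluded) pixels
                d2 = float(np.sum((I1[x[0], x[1]] - I1[i, j]) ** 2))
                weights[k] += np.exp(-d2 / (2 * sigma**2))
    return int(np.argmax(weights))
```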
4 Semi-Automatic 2D to 3D Video Conversion 101
4.4.9 Results
The results of the clustering are the function C(x) and the list of models M_C(θ, x). This result is used to generate the unknown motion in occlusion areas using the motion model of the cluster to which the occlusion pixel was added. An example of a motion clustering result is shown in Fig. 4.11.
Most of the testing of our results was done on scene cuts from feature films to assess "real-life" performance. The main problem with this evaluation is that the ground-truth motion is unavailable, so manual inspection of the results is the only method we can use. For quantitative evaluation, as well as a comparison with the state of the art, we used the well-known Middlebury optical flow evaluation database of Baker et al. (2011) and adopted the colouring scheme used by the Middlebury benchmark.
Motion estimation results are stable over a wide range of values of the regularization parameter lambda, as can be seen from their errors against ground truth in Fig. 4.12. Results with lower values of lambda are oversmoothed, whereas too high a value of lambda causes considerable noise in the motion response, as the algorithm tries to fit the noise in the input images.
The non-local neighbourhood smoothness term improves the edge response in the constructed motion field. The simplified version can be seen as a relaxation of full non-local processing, and the quality of its results lies between TV-L1 regularization and the full non-local neighbourhood. An example of motion edge behaviour for these approaches is demonstrated in Fig. 4.13. The simplified non-local term decreases the smoothing around the motion edge but still creates a motion artefact not aligned with the edge of the object. This unwanted behaviour is usually caused by fast motion or by a lack of texture on one of the motion layers. On the testing set of the Middlebury benchmark, we found the difference between the simplified and full non-local neighbourhoods to be insignificant, presumably because the dataset contains only small or moderate motion and usually rather well-textured layers. A comparison is shown in Fig. 4.14, with a comparison of errors shown in Fig. 4.15.
The proposed method of motion estimation preserves sharp motion edges and attempts to correct erroneous motion in occlusion areas. We also presented a way to relax the underlying model that allows a considerable speedup of the computation. Our motion estimation method was ranked in the top 20 of all algorithms on the Middlebury
Fig. 4.11 Motion inpainting result on a cut of the Grove3 sequence from the Middlebury benchmark: clustering result overlaid on the grey image (left), non-inpainted motion (centre), motion after occlusion inpainting (right)
102 P. Pohl and E. V. Tolstaya
Fig. 4.12 Average endpoint and angular errors on the Middlebury testing sequence as the regularization parameter is varied
Fig. 4.13 Motion results on Dimetrodon frame 10 for lambda equal to 1, 5, 20, 100
Fig. 4.14 Comparison of TV-L1 (left), 3 × 3 neighbourhood with special weighting (middle), and full 5 × 5 non-local neighbourhood (right) optical flow results; the top row is the motion colourmap, and in the bottom row it is overlaid on the greyscale image to demonstrate the alignment of motion and object edges
Fig. 4.15 Comparison of our 3 × 3 non-local neighbourhood with special weights and the 5 × 5 neighbourhood optical flow result on the Middlebury testing sequence
optical flow dataset at the time, and only one other method reported a better processing time on the “Urban” sequence.
4.5.1 Introduction
The task of the background inpainting step is to recover occluded parts of video
frames to use later during stereo synthesis and to fill occlusion holes. We have an
input video sequence and a corresponding sequence of masks (for every frame of the
video) that denote areas to be inpainted (a space-time hole). The goal is to recover
background areas denoted (covered) by object masks so that the result is visually
plausible and temporally coherent. A lot of attention has been given to the area of
still image inpainting. Notable methods include diffusion-based approaches, such as
Telea (2004) and Bertalmio (2000), and texture synthesis or exemplar-based algo-
rithms, for example, Criminisi et al. (2004). Video inpainting imposes additional
temporal restrictions, which make it more computationally expensive and
challenging.
Generally, most video inpainting approaches fall into two broad groups: global methods, and local methods with subsequent filtering. Exemplar-based methods for still images can be naturally extended to video as a global optimization problem of filling the space-time hole.
Wexler et al. (2007) define a global method using patch similarity for video
completion. The article reports satisfactory results, but the price for global
optimization is an extreme computational cost: they report several hours of computation for a very short, low-resolution video (100 frames of 340 × 120 pixels). A similar approach was taken by Shiratori et al. (2006), who suggest a procedure called motion field transfer to estimate motion inside a space-time hole. Motion is filled in with a patch-based approach using a special similarity measure. The recovered motion allows the authors to inpaint a video sequence while maintaining temporal coherence. However, the performance is also quite low (~40 minutes for a 60-frame video of 352 × 240). Bugeau et al. (2010) propose inpainting frames
independently and filtering the results by Kalman filtering along point trajectories found by a dense optical flow algorithm. The slowest operation of this approach is the computation of the optical flow. Their method produces a visually consistent result, but the inpainted region is usually smoothed, with occasional temporal artefacts.
In our work (Pohl et al. 2016), we decided not to use a global optimization approach, in order to make the algorithm more computationally efficient and to avoid the loss of small details in the restored background video. The proposed algorithm consists of three well-defined steps:
1. Restoration of background motion in foreground regions, using motion vectors
2. Temporal propagation of background image data using the restored background
motion from step #1
3. Iterative spatial inpainting with temporal propagation of inpainted image data
Background motion is restored by computing and combining several motion
estimates based on optical flow motion vectors. Let black contour A be the edge
of a marked foreground region (see Fig. 4.16a). Outside A, we will be using motion
produced by the optical flow algorithm, M0(x, y). In area B, we obtain a local
background motion estimate M1(x, y). This local estimate can be produced, for
example, by running a diffusion-like inpainting algorithm, such as Telea (2004),
Fig. 4.16 Background motion estimation: (a) areas around object’s mask; (b) example of back-
ground motion estimation
on M0(x, y) inside region B. Next, we use M0(x, y) from area C to fit the parameters
a_0, ..., a_5 of an affine global motion model:

$$M_2(x, y) = \begin{pmatrix} a_0 + a_1 x + a_2 y \\ a_3 + a_4 x + a_5 y \end{pmatrix}.$$
W(x, y) is defined as an exponential decay (with a reasonable decay-rate parameter, in pixels) of the distance from the foreground region edge A. It equals 1 outside contour A, and the distance is computed by an efficient distance transform algorithm.
As a result, we obtain a full-frame per-pixel estimate of background motion that
generally has the properties of global motion inside previously missing regions but
does not suffer from discontinuity problems around the foreground region edges
(Fig. 4.16b).
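The combination step can be sketched as follows. The blending rule M = W·M1 + (1 − W)·M2 inside the mask is our reading of the text (the exact combination formula is not given), and the brute-force distance transform is a small stand-in for an efficient algorithm:

```python
import numpy as np

def distance_to_background(mask):
    """Brute-force Euclidean distance from each pixel to the nearest
    background (mask == False) pixel; fine for small illustrative masks."""
    bg = np.argwhere(~mask)
    pts = np.argwhere(np.ones_like(mask, dtype=bool))
    d = np.sqrt(((pts[:, None, :] - bg[None, :, :]) ** 2).sum(-1)).min(1)
    return d.reshape(mask.shape)

def restore_background_motion(M0, M1, M2, fg_mask, decay=10.0):
    """Full-frame background motion estimate: blend the local estimate M1
    and the global affine estimate M2 inside the foreground mask with the
    weight W = exp(-dist / decay); keep the optical flow M0 outside."""
    W = np.exp(-distance_to_background(fg_mask) / decay)
    M = W[..., None] * M1 + (1.0 - W[..., None]) * M2
    M[~fg_mask] = M0[~fg_mask]  # W == 1 outside contour A: keep M0 there
    return M
```

Near the edge A the local estimate M1 dominates; deep inside the region the result approaches the global affine model M2, which matches the described behaviour.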
The second step—temporal propagation—is the crucial part of our algorithm. The
forward and backward temporal passes are symmetrical, and they fill in the areas that
were visible on other frames of video. We do a forward and backward pass through
the video sequence using integrated (accumulated) motion in occluded regions. In
the forward pass, we integrate motion in the backward direction and decide which
pixels can be filled by data from the past. The same is done for the backward pass to
fill in pixels from future frames. After forward and backward temporal passes, we
can still have some areas that were not inpainted (unfilled areas). These areas were
not seen during the entire video clip. We use still image spatial inpainting to fill in the
missing data in a selected frame and propagate it using restored background motion
to achieve temporal consistency.
Let us introduce the following notation:
– I(n, x) is the nth input frame.
– M(m, n, x) is the restored background motion from frame m to frame n.
– F(n, x) is the input foreground mask for frame n (the area to be inpainted).
– Q_F(n, x) is the inpainted-area mask for frame n.
– I_F(n, x) is the inpainted frame n.
– T is the temporal window size (an algorithm parameter).
$$I_F(N_{curr}, x) = I(N_{curr}, x), \qquad Q_F(N_{curr}, x) = 0$$
Algorithm 4.3 is a greedy algorithm that inpaints the background with data from the temporally least distant frame. The backward temporal pass performs the same operations, only with a reversed order of iterations and reversed motion directions. Some areas can be filled from both sides; in this case, we need a single inpainting solution, so we use the temporally closer source. In the case of equal temporal distance, a blending procedure is applied based on the distance from the non-inpainted area.
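The forward pass can be sketched as follows. This is our reading of the greedy procedure, not the authors' Algorithm 4.3; the `motion(m, n, p)` helper, which maps point `p` from frame m to frame n using the integrated background motion, is an assumed interface.

```python
import numpy as np

def forward_temporal_pass(frames, fg_masks, motion, T=10):
    """Greedy forward pass: fill each masked pixel from the temporally
    nearest past frame where the integrated motion lands on a filled pixel."""
    N = len(frames)
    out = [f.copy() for f in frames]
    filled = [~m for m in fg_masks]          # visible pixels count as filled
    for n in range(N):
        for p in np.argwhere(fg_masks[n]):
            for dt in range(1, min(T, n) + 1):   # least distant frame first
                q = np.round(motion(n, n - dt, p)).astype(int)
                if (0 <= q[0] < frames[n].shape[0] and
                        0 <= q[1] < frames[n].shape[1] and
                        filled[n - dt][q[0], q[1]]):
                    out[n][p[0], p[1]] = out[n - dt][q[0], q[1]]
                    filled[n][p[0], p[1]] = True
                    break
    return out, filled
```

The backward pass would run the same loop with the frame order and motion directions reversed.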
The third step is the spatial pass. The goal of the spatial pass is to inpaint regions
that were not filled by temporal passes. To achieve temporal stability, we inpaint on a
selected frame and propagate inpainted data temporally. We found that a reasonable
strategy is to find the largest continuous unfilled area, use spatial inpainting to fill it
in, and then propagate through the whole sequence using a background motion
estimate. It is necessary to perform spatial inpainting with iterative propagation until
all unfilled areas are inpainted. Any spatial inpainting algorithm can be used; in our
work, we experimented with exemplar-based and diffusion-based methods. Our
experience shows that it is better to use a diffusion algorithm for filling small or thin areas, while an exemplar-based algorithm is better for larger unfilled parts.
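As an illustration of the diffusion option, a crude diffusion-style fill can be sketched as follows. This is a stand-in for a proper algorithm such as Telea (2004), not the authors' implementation:

```python
import numpy as np

def diffusion_inpaint(image, mask, iters=200):
    """Simple diffusion-style fill: iteratively replace masked pixels by
    the mean of their 4-neighbours (np.roll wraps at borders, so this
    sketch assumes the hole does not touch the image boundary)."""
    out = image.astype(float).copy()
    out[mask] = out[~mask].mean()               # rough initialisation
    for _ in range(iters):
        avg = (np.roll(out, 1, 0) + np.roll(out, -1, 0) +
               np.roll(out, 1, 1) + np.roll(out, -1, 1)) / 4.0
        out[mask] = avg[mask]                   # update only masked pixels
    return out
```

In the scheme above, such a diffusion fill would handle small or thin components, with an exemplar-based routine (e.g. Criminisi et al. 2004) handling the larger ones.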
Let us denote QFB(n, x) and IFB(n, x) as the temporally inpainted mask and the
background image, respectively, after forward and backward passes are blended
together. QS(n, x) and IS(n, x) are the mask and the image after temporal and spatial
inpainting. |D| stands for the number of pixels in the image domain D.
$$Q_S^{new}(N_{dst}, x) = F(N_{dst}, x) \cap \overline{Q_S(N_{dst}, x)} \cap Q_S(N_{curr}, x + M(N_{dst}, N_{curr}, x))$$
4.5.4 Results
Fig. 4.18 Background restoration quality: (a) simple motion; (b) affine motion
1. Simple motion. The background and foreground are moved by two different
randomly generated motions, which include only rotation and shift.
2. Affine motion. The background and foreground are moved by two different
randomly generated affine motions.
Evaluation was done by running the proposed algorithm with default parameters
on synthetic test sequences (an example is shown in Fig. 4.17). We decided to use
the publicly available implementation of an exemplar-based inpainting algorithm by
Criminisi et al. (2004) to provide a baseline for our approach. Our goal was to test
both the quality of background inpainting for each frame and the temporal stability
of the resulting sequence. For the evaluation of the background inpainting quality,
we use the PSNR between the algorithm’s output and the ground truth background.
Results are shown in Fig. 4.18.
To measure temporal stability, we use the following procedure. Let I(n, x) and I(n + 1, x) be a pair of consecutive frames with inpainted backgrounds, and let M_GT(n, n + 1, x) be the ground-truth motion from frame n to frame n + 1. We compute the PSNR between I(n, x) and I(n + 1, x + M_GT(n, n + 1, x)) (sampling is done using bicubic interpolation). The results are shown in Fig. 4.19.
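The stability metric can be sketched as follows. For brevity this sketch uses nearest-neighbour sampling rather than the bicubic interpolation used in the text:

```python
import numpy as np

def warp_nearest(frame, motion):
    """Sample frame at x + motion(x); motion[..., 0] is the y-component
    and motion[..., 1] the x-component (nearest-neighbour sampling)."""
    h, w = frame.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w]
    yy = np.clip(np.round(ys + motion[..., 0]).astype(int), 0, h - 1)
    xx = np.clip(np.round(xs + motion[..., 1]).astype(int), 0, w - 1)
    return frame[yy, xx]

def temporal_stability_psnr(f_n, f_n1, gt_motion, peak=255.0):
    """PSNR between frame n and frame n+1 warped back by the ground-truth
    motion; higher values indicate a more temporally stable result."""
    warped = warp_nearest(f_n1, gt_motion)
    mse = np.mean((f_n.astype(float) - warped.astype(float)) ** 2)
    return float('inf') if mse == 0 else 10.0 * np.log10(peak**2 / mse)
```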
As we can see, when analysed statistically, our inpainting algorithm usually provides slightly better background restoration quality than exemplar-based inpainting. More importantly, it produces far more temporally stable output, which is crucial for video inpainting.
We applied our algorithm to a proprietary database of videos at two resolutions: 1920 × 1080 (FHD) and 960 × 540 (quarter HD, qHD). The method shows reasonable quality for scenes without changes in scene properties
Fig. 4.19 Temporal stability: (a) simple motion; (b) affine motion
(brightness change, different focus, changing fog or lights) and with rigid scene
structure. A few examples of our algorithm outputs are shown in Fig. 4.20.
Typical visible artefacts include misalignments at the edges of the temporal inpainting direction (presumably in cases when the motion integration is not precise enough) and mixing of the same part of a scene whose appearance changed over time. Examples of such artefacts are shown in Fig. 4.21.
The running time of the algorithm (not including optical flow estimation) is
around 1 s/frame for qHD and 3 s/frame for FHD sequences on a PC with a single
GPU. The results are quite acceptable for the restoration of areas occluded due to
Fig. 4.21 Typical artefacts: misalignment (top row) and scene change artefacts (bottom row)
stereoscopic parallax for a limited range of scenes, but for wider applicability it is necessary to decrease the level of artefacts. A more advanced analysis of propagation reliability could be applied for a better decision between spatial and temporal inpainting, and the choice of temporal inpainting direction could use an analysis of the level of scene change. In addition, it may be useful to improve the alignment of temporally inpainted parts from different time moments, or to apply the Poisson seamless stitching approach of Pérez et al. (2003) where there are overlapping parts.
Straight line detection is performed with the Hough transform and a voting procedure. Patches along the lines are propagated inside the hole, and the rest of the hole is filled with a conventional method, such as the one described by Criminisi et al. (2004). Figure 4.22 illustrates the steps of the algorithm.
A large area of research is devoted to stereo quality estimation. Poor stereo leads to a bad viewing experience, headaches, and eye fatigue. That is why it is important to review media content and, where possible, eliminate production or post-production mistakes. The main causes of eye fatigue are listed in Chap. 3, and content quality is among them.
The most common quality issues that can be present in stereo films (including
even well-known titles with million-dollar budgets) include the following:
• Mixed left and right views
• Stereo view rotation
• Different sizes of objects
• Vertical disparity
• Temporal shift between views
• Colour imbalance between views
• Sharpness difference
References
Appia, V., Batur, U.: Fully automatic 2D to 3D conversion with aid of high-level image features. In:
Stereoscopic Displays and Applications XXV, vol. 9011, p. 90110W (2014)
Baker, S., Scharstein, D., Lewis, J.P., Roth, S., Black, M.J., Szeliski, R.: A database and evaluation
methodology for optical flow. Int. J. Comp. Vision. 92(1), 1–31 (2011)
Bertalmio, M., Sapiro, G., Caselles, V., Ballester, C.: Image inpainting. In: Proceedings of the 27th
Annual Conference on Computer graphics and Interactive Techniques, p. 417 (2000)
Bugeau, A., Piracés, P.G.I., d'Hondt, O., Hervieu, A., Papadakis, N., Caselles, V.: Coherent
background video inpainting through Kalman smoothing along trajectories. In: Proceedings of
2010–15th International Workshop on Vision, Modeling, and Visualization, p. 123 (2010)
Butler D.J., Wulff J., Stanley G.B., Black M.J.: A Naturalistic Open Source Movie for Optical Flow
Evaluation. In: Fitzgibbon A., Lazebnik S., Perona P., Sato Y., Schmid C. (eds) Computer
Vision – ECCV 2012. ECCV 2012. Lecture Notes in Computer Science, vol 7577. Springer,
Berlin, Heidelberg (2012)
Cao, X., Li, Z., Dai, Q.: Semi-automatic 2D-to-3D conversion using disparity propagation. IEEE
Trans. Broadcast. 57, 491–499 (2011)
Chambolle, A., Pock, T.: A first-order primal-dual algorithm for convex problems with applications
to imaging. J. Math. Imag. Vision. 40(1), 120–145 (2011)
Criminisi, A., Pérez, P., Toyama, K.: Region filling and object removal by exemplar-based image
inpainting. IEEE Trans. Image Process. 13(9), 1200–1212 (2004)
Feng, J., Ma, H., Hu, J., Cao, L., Zhang, H.: Superpixel based depth propagation for semi-automatic
2D-to-3D video conversion. In: Proceedings of IEEE Third International Conference on Net-
working and Distributed Computing, pp. 157–160 (2012)
Feng, Z., Chao, Z., Huamin, Y., Yuying, D.: Research on fully automatic 2D to 3D method based
on deep learning. In: Proceedings of the IEEE 2nd International Conference on Automation,
Electronics and Electrical Engineering, pp. 538–541 (2019)
Fischler, M.A., Bolles, R.C.: Random sample consensus: a paradigm for model fitting with
applications to image analysis and automated cartography. Commun. ACM. 24(6), 381–395
(1981)
Harman, P.V., Flack, J., Fox, S., Dowley, M.: Rapid 2D-to-3D conversion. In: Stereoscopic
displays and virtual reality systems IX International Society for Optics and Photonics, vol.
4660, pp. 78–86 (2002)
Ignatov, A., Bucha, V., Rychagov, M.: Disparity estimation in real-time 3D acquisition and
reproduction system. In: Proceedings of the International Conference on Computer Graphics
«Graphicon 2009», pp. 61–68 (2009)
Irony, R., Cohen-Or, D., Lischinski, D.: Colorization by example. In: Proceedings of the Sixteenth
Eurographics conference on Rendering Techniques, pp. 201–210 (2005)
Korman, S., Avidan, S.: Coherency sensitive hashing. IEEE Trans. Pattern Anal. Mach. Intell. 38
(6), 1099–1112 (2015)
Mueller, M., Zilly, F., Kauff, P.: Adaptive cross-trilateral depth map filtering. In: Proceedings of the
IEEE 3DTV Conference: The True Vision-Capture, Transmission and Display of 3D Video,
pp. 1–4 (2010)
Pérez, P., Gangnet, M., Blake, A.: Poisson image editing. In: ACM SIGGRAPH Papers,
pp. 313–318 (2003)
Pohl, P., Molchanov, A., Shamsuarov, A., Bucha, V.: Spatio-temporal video background
inpainting. Electron. Imaging. 15, 1–5 (2016)
Pohl, P., Sirotenko, M., Tolstaya, E., Bucha, V.: Edge preserving motion estimation with occlusions
correction for assisted 2D to 3D conversion. In: Image Processing: Algorithms and Systems XII,
9019, pp. 901–906 (2014)
Shiratori, T., Matsushita, Y., Tang, X., Kang, S.: Video completion by motion field transfer. In:
Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern
Recognition, vol. 1, p. 411 (2006)
Sun, J., Xie, J., Li, J., Liu, W.: A key-frame selection method for semi-automatic 2D-to-3D
conversion. In: Zhang, W., Yang, X., Xu, Z., An, P., Liu, Q., Lu, Y. (eds.) Advances on Digital
Television and Wireless Multimedia Communications. Communications in Computer and
Information Science, vol. 331. Springer, Berlin, Heidelberg (2012)
Sun, D., Roth, S., Black, M.J.: Secrets of optical flow estimation and their principles. In: Pro-
ceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recogni-
tion, pp. 2432–2439 (2010a)
Sun, D., Sudderth, E., Black, M.: Layered image motion with explicit occlusions, temporal
consistency, and depth ordering. In: Proceedings of the 24th Annual Conference on Neural
Information Processing Systems, pp. 2226–2234 (2010b)
Teed, Z., Deng, J.: Raft: Recurrent all-pairs field transforms for optical flow. In: Proceedings of the
European Conference on Computer Vision, pp. 402–419 (2020)
Telea, A.: An image inpainting technique based on the fast marching method. J. Graph. Tools. 9
(1) (2004)
Tolstaya E.: Implementation of Coherency Sensitive Hashing algorithm. (2020). Accessed on
03 October 2020. https://github.com/ktolstaya/PyCSH
Tolstaya, E., Hahn, S.-H.: Method and system for selecting key frames from video sequences. RU
Patent 2,493,602 (in Russian) (2012)
Tolstaya, E., Pohl, P., Rychagov, M.: Depth propagation for semi-automatic 2d to 3d conversion.
In: Proceedings of SPIE Three-Dimensional Image Processing, Measurement, and Applications,
vol. 9393, p. 939303 (2015)
Varekamp, C., Barenbrug, B.: Improved depth propagation for 2D to 3D video conversion using
key-frames. In: Proceedings of the 4th European Conference on Visual Media Production
(2007)
Vatolin, D.: Why Does 3D Lead to the Headache? / Part 8: Defocus and Future of 3D (in Russian)
(2019). Accessed on 03 October 2020. https://habr.com/ru/post/472782/
Vatolin, D., Bokov, A., Erofeev, M., Napadovsky, V.: Trends in S3D-movie quality evaluated on
105 films using 10 metrics. Electron. Imaging. 2016(5), 1–10 (2016)
Vatolin, D.: Why Does 3D Lead to the Headache? / Part 2: Discomfort because of Video Quality (in
Russian) (2015a). Accessed on 03 October 2020. https://habr.com/en/post/377709/
Vatolin, D.: Why Does 3D Lead to the Headache? / Part 4: Parallax (in Russian) (2015b). Accessed
on 03 October 2020. https://habr.com/en/post/378387/
Voronov, A., Vatolin, D., Sumin, D., Napadovsky, V., Borisov, A.: Methodology for stereoscopic
motion-picture quality assessment. In: Proceedings of SPIE Stereoscopic Displays and Appli-
cations XXIV, vol. 8648, p. 864810 (2013)
Wang, D., Liu, J., Sun, J., Liu, W., Li, Y.: A novel key-frame extraction method for semi-automatic
2D-to-3D video conversion. In: Proceedings of the IEEE international Symposium on Broad-
band Multimedia Systems and Broadcasting, pp. 1–5 (2012)
Werlberger, M., Pock, T., Bischof, H.: Motion estimation with non-local total variation regulari-
zation. In: Proceedings of the IEEE Computer Society Conference on Computer Vision and
Pattern Recognition, pp. 2464–2471 (2010)
Wexler, Y., Shechtman, E., Irani, M.: Space-time completion of video. IEEE Trans. Pattern Anal.
Mach. Intell. 29(3), 463–476 (2007)
Xie, J., Girshick, R., Farhadi, A.: Deep3d: Fully automatic 2d-to-3d video conversion with deep
convolutional neural networks. In: Proceedings of the European Conference on Computer
Vision, pp. 842–857 (2016)
Yuan, H.: Robust semi-automatic 2D-to-3D image conversion via residual-driven optimization.
EURASIP J. Image Video Proc. 1, 66 (2018)
Zhao, S., Sheng, Y., Dong, Y., Chang, E., Xu, Y.: MaskFlownet: asymmetric feature matching with
learnable occlusion mask. In: Proceedings of the CVPR, vol. 1, pp. 6277–6286 (2020)
Zitnick, C.L., Kang, S.B., Uyttendaele, M., Winder, S., Szeliski, R.: High-quality video view
interpolation using a layered representation. ACM Transactions on Graphics. 23(3) (2004)
Chapter 5
Visually Lossless Colour Compression
Technology
Michael N. Mishourovsky
5.1.1 Introduction
Modern video processing systems, such as TV devices, mobile devices, video coders/decoders, and surveillance equipment, process large amounts of data: high-quality colour video sequences with resolutions up to UHD at high frame rates, with HDR support (high bit depth per colour component). Real-time multistage digital video processing usually requires high-speed data buses and large intermediate buffers, which lead to increased cost and power consumption. Although the latest achievements in RAM design make it possible to store a lot of data, reducing bandwidth requirements remains a desirable and challenging task; moreover, for some applications bandwidth can become a bottleneck because of the balance between cost and technological advances in a particular product.
One possible approach to this issue is to represent video streams in a compact (compressed) format that preserves visual quality, allowing a small level of losses as long as the visual quality does not suffer. This category of algorithms is called embedded or texture memory compression. It is usually implemented in hardware (HW) and should provide very high visual fidelity, a reasonable compression ratio, and low HW complexity. It should impose no strong limitations on data fetching, and it should be easy to integrate into synchronous parts of application-specific integrated circuits (ASICs) or systems on chip (SoCs).
In this chapter, we describe in detail the so-called visually lossless colour
compression technology (VLCCT), which satisfies the abovementioned technical
M. N. Mishourovsky (*)
Huawei Russian Research Institute, Moscow, Russia
e-mail: mmishourovsky@gmail.com
requirements. We also touch on the question of visual quality evaluation (and what
visual quality means) and present an overview of the latest achievements in the field
of embedded compression, video compression, and quality evaluations.
One of the key advantages of the described technology is that it uses only one row of memory for buffering, works with extremely small blocks, and provides excellent subjective and objective image quality at a fixed compression ratio, which is critical for random-access data fetching. It also introduces no inter-block dependency, which effectively prevents error propagation. According to estimates made during the technology transfer, the technology requires only a small number of equivalent physical gates and can be widely adopted in various HW video processing pipelines.
The approach underlying VLCCT is not new: it is applied in GPUs; OpenGL and Vulkan support several algorithms for effective texture compression; and operating systems such as Android and iOS both support several techniques for encoding and decoding textures.
During the initial stage of the development, several prototypes were identified
including the following: algorithms based on decorrelating transformations (DCT,
wavelet, differential-predictive algorithms) combined with entropy coding of differ-
ent kinds; the block truncation encoding family of algorithms; block palette
methods; and the vector quantization method. Mitsubishi developed a technology
called fixed block truncation coding by Matoba et al. (1998), Takahashi et al. (2000),
and Torikai et al. (2000) and even produced an ASIC implementing this algorithm
for the mass market. It provides fixed compression with relatively high quality. Fuji
created an original method called Xena (Sugita and Watanabe 2007), Sugita (2007),
which was rather effective and innovative, but provided only lossless compression.
In 1992, a paper was published (Wu and Coll 1992) suggesting an interesting
approach which is essentially targeted to create an optimal palette for a fixed block
to minimize the maximum error. The industry adopted several algorithms from S3 –
so-called S3 Texture Compression (DXT1 ~ DXT5), and ETC/ETC1/ETC2 were
developed by Ericsson. ETC1 is currently supported by Android OS (see the
reference) and is included in the specification of OpenGL – Methods for encoding
and decoding ETC1 textures (2019). PowerVR Texture Compression was designed
for graphic cores and then patented by Imagination Technologies. Adaptive Scalable
Texture Compression (ASTC) was jointly developed by ARM and AMD and
presented in 2012 (which is several years after VLCCT was invented).
In addition to the abovementioned well-known techniques, the following tech-
nologies were reviewed: Strom and Akenine-Moeller (2005), Mitchell and Delp
(1980), Jaspers and de With (2002), Odagiri et al. (2007), and Lee et al. (2008). In
5 Visually Lossless Colour Compression Technology 117
general, all these technologies provide compression ratios from ~1.5 to 6 times, with different quality of decompressed images, different complexity, and various optimization levels. However, most of them do not provide the required trade-off between algorithmic and HW complexity (most of these algorithms require several buffered rows, from 2 to 4 or even more), visual quality, and the other requirements behind VLCCT.
To conclude this part, the reader may consult the following sources, which provide a thorough overview of different texture compression methods: Vulkan SDK updated by Paris (2020), ASTC Texture Compression (2019), and Paltashev and Perminov (2014).
Based on the business needs, the limitations on HW complexity, and the acceptance criteria for visual quality, the following requirements were defined (Table 5.1):
Most compression technologies rely on some sort of redundancy elimination mechanism; these can essentially be classified as follows:
1. Visual redundancy (caused by human visual system perception).
2. Colour redundancy.
3. Intra-frame and inter-frame redundancy.
4. Statistical redundancy, attributed to the probabilistic nature of the elements in the stream (a stream is not random), which can be exploited by different methods, such as:
• Huffman encoding and other prefix codes
• Arithmetic encoding – initial publication by Rissanen and Langdon (1979)
• Dictionary-based methods
• Context adaptive binary arithmetic encoding and others
Usually, some sort of data transformations is used:
• Discrete cosine transform (DCT), an orthogonal transform with normalized basis functions. The DCT approximately decorrelates natural images (removing linear relations in the data, approximating PCA for natural images) and is usually applied at the block level.
• Different linear prediction schemes with quantization (adaptive differential pulse-code modulation, ADPCM), applied to effectively reduce the self-similarity of images.
However, a transform or prediction itself does not provide compression; its
coefficients must be further effectively encoded. To accomplish this goal, classical
compression schemes include entropy coding, reducing the statistical redundancy.
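As an illustration of prefix-code entropy coding (mentioned above as Huffman encoding; this is a generic sketch, not part of VLCCT itself), a minimal Huffman code builder can be written as:

```python
import heapq
from collections import Counter

def huffman_code(data):
    """Build a prefix (Huffman) code for the symbols in `data`:
    repeatedly merge the two least frequent subtrees, prepending
    '0'/'1' to the codewords of each side."""
    heap = [(n, i, {s: ''}) for i, (s, n) in enumerate(Counter(data).items())]
    heapq.heapify(heap)
    i = len(heap)
    while len(heap) > 1:
        n1, _, c1 = heapq.heappop(heap)
        n2, _, c2 = heapq.heappop(heap)
        merged = {s: '0' + b for s, b in c1.items()}
        merged.update({s: '1' + b for s, b in c2.items()})
        heapq.heappush(heap, (n1 + n2, i, merged))
        i += 1
    return heap[0][2]

codes = huffman_code("aaaabbc")
# frequent symbols get shorter codewords than rare ones
```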
As HW complexity was a critical limiting factor, it was decided to consider a rather simple pipeline, which prohibited the collection of a huge amount of statistics and the provision of long-term adaptation (as the CABAC engine provides in modern H.264 and H.265 codecs). Instead, a simple approach with a one-row buffer and a sliding window was adopted. The high-level diagram of the data processing pipeline for the encoding part is shown in Fig. 5.1.
As the buffering lines contribute a lot to the overall HW complexity, the only
possible scenario with one cache line meant that the vertical decorrelation is limited.
However, the proposed sliding window processing still allowed us to implement a
variety of methods; it was realized that a diversity of images (frames) potentially
being compressed by VLCCT inevitably requires different models for effective
representation of a signal.
Before going further, let us highlight several methods that were tried:
• JPEG2000, which is based on bit-plane arithmetic encoding applied right after a wavelet transform
• Classical DCT-based encoding (JPEG-like)
Fig. 5.1 High-level data processing pipeline for encoding part of VLCCT
Fig. 5.3 The final architecture of VLCCT including all main blocks and components
With pixels denoted as specified in Fig. 5.4a, the following features are calculated using a lifting scheme (Sweldens 1997) (Fig. 5.4b):
{A, B, C, D} → {s, h, v, d} are transformed according to the following:

Forward transform:
  h = A − B
  v = A − C
  d = A − D
  s = A − (h + v + d)/4

Inverse transform:
  A = s + (h + v + d)/4
  B = A − h
  C = A − v
  D = A − d
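As a sketch, the lifting pair above can be written in a few lines of Python; the function names are illustrative, and integer floor division is used in place of the exact /4 so that the forward/inverse pair round-trips losslessly (an assumption about the intended integer implementation):

```python
def nsw_forward(A, B, C, D):
    """Forward lifting transform for one 2x2 block: {A,B,C,D} -> {s,h,v,d}."""
    h, v, d = A - B, A - C, A - D          # simplest directional differences
    s = A - (h + v + d) // 4               # mean-like value (floor division)
    return s, h, v, d

def nsw_inverse(s, h, v, d):
    """Exact inverse: uses the same floor of (h+v+d)/4 as the forward pass."""
    A = s + (h + v + d) // 4
    return A, A - h, A - v, A - d
```

Because both directions compute the same floor of (h + v + d)/4, the reconstruction is exact for any integer inputs.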
Here, s is the mean value of the four pixel values; h, v, d are the simplest directional derivative values (differences). The s-value requires 8 bits to store; h, v, and d require 8 bits plus 1 sign bit. How can this be encoded effectively with a smaller number of bits?
Let us consider two 2 × 2 blocks and the initial limits on the bit budget. According to this, we may allow an average of 15 bits per block per colour channel. The mean value s can be represented with 6-bit precision (uniform quantization proved to be a good choice); quantization of h, v, and d is based on a fixed quantization very similar to the well-known Lloyd-Max quantization (which is akin to the k-means clustering method) (Kabal 1984; Patane and Russo 2001). Several quantizers can be constructed, each consisting of more than one value and optimized for different image parts. In detail, the quantization process includes the selection of an appropriate quantizer for the h, v, and d values for each colour; then each difference is approximated by the quantization value to which it is closest (Fig. 5.5a–c). To satisfy the bit limits, the following restrictions are applied: only eight quantizer sets are provided, each consisting of four positive/negative values; the trained quantizer values are shown in Table 5.2.
To estimate an approximation error, let us note that, once the quantization is
completed, it is possible to express an error and then estimate pixel errors as follows:
h′ = h + Δh,  v′ = v + Δv,  d′ = d + Δd

ΔA = (Δh + Δv + Δd)/4
ΔB = ΔA − Δh
ΔC = ΔA − Δv
ΔD = ΔA − Δd
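These propagation identities can be checked numerically; the helper below (illustrative, using the exact floating-point /4) perturbs h, v, and d by quantization errors and compares the resulting pixel changes against the predicted ΔA, ΔB, ΔC, ΔD:

```python
def inverse(s, h, v, d):
    """Exact inverse NSW transform with floating-point /4."""
    A = s + (h + v + d) / 4.0
    return A, A - h, A - v, A - d

s, h, v, d = 100.0, 6.0, -2.0, 4.0
dh, dv, dd = 1.0, -3.0, 2.0              # quantization errors on h, v, d

A,  B,  C,  D  = inverse(s, h, v, d)
A2, B2, C2, D2 = inverse(s, h + dh, v + dv, d + dd)

dA = (dh + dv + dd) / 4.0                # predicted error on pixel A
assert (A2 - A, B2 - B, C2 - C, D2 - D) == (dA, dA - dh, dA - dv, dA - dd)
```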
Fig. 5.5 Selection of appropriate quantizers: (a) distributions for h, v, d, and reconstructed levels;
(b) bit allocation for h, v, d components; (c) an example of quantization with quantizer set selection
and optimal reconstruction levels
If the mean value is encoded with an error, this error should be added to the error estimate of pixel A; then it is possible to aggregate the errors for all pixels using, for example, the sum of squared errors (SSE). Once the SSE is calculated for every quantizer set, the quantizer set providing the minimal error is selected; other criteria can be applied to simplify the quantization. The encoding process explained above is named Fixed Quantization via Table (FQvT) and is described by the following:
arg min_{QI = 0…QS} E({d_1, …, d_K}, {Id_{QI,1}, …, Id_{QI,K}}, QI),
124 M. N. Mishourovsky
where {d_1 … d_K} are the original differences; {Id_1 … Id_K} are the levels used to reconstruct the differences; and QI is a quantizer defined by a table. E stands for the error of the reconstructed differences relative to the original values.
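A minimal sketch of the FQvT selection loop, assuming illustrative quantizer sets (the trained values of Table 5.2 are not reproduced here):

```python
# Illustrative quantizer sets; VLCCT's trained values live in Table 5.2.
QUANTIZER_SETS = [
    (-24, -8, 8, 24),
    (-48, -16, 16, 48),
    (-96, -32, 32, 96),
]

def fqvt(differences):
    """Pick the quantizer set minimising the SSE of reconstructed differences.

    Returns (set index, per-difference level indices)."""
    best = None
    for qi, levels in enumerate(QUANTIZER_SETS):
        # Map each difference to its nearest reconstruction level.
        idxs = [min(range(len(levels)), key=lambda i: abs(d - levels[i]))
                for d in differences]
        sse = sum((d - levels[i]) ** 2 for d, i in zip(differences, idxs))
        if best is None or sse < best[0]:
            best = (sse, qi, idxs)
    return best[1], best[2]
```

In the real codec, the error term would also fold in the mean-value error and the pixel-domain propagation derived above.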
Another method adopted by VLCCT is the P-method. It helps to encode areas where a specific orientation of details exists, whether vertical, horizontal, or diagonal. In addition, this method helps to encode non-texture areas (where the mean value is a good approximation) and areas where one of the NSW components is close to zero. The best sub-mode is signalled by a 2-bit value according to Table 5.3.
According to Fig. 5.6, the h sub-mode means that each 2 × 2 block is encoded using only the two pixels A and C. In the same way, the v sub-mode uses A and B to approximate the remaining pixel values; the diagonal sub-mode encodes a block similarly to the h and v sub-modes but uses a diagonal pattern; in the uniform mode, the mean value is used to approximate the whole 2 × 2 block. If any one of the h, v, or d differences is close to zero, the corresponding sub-mode is invoked. In this case, the following is applied:
The actual bit precision of the mean value is based on an analysis of the “free bits”. Depending on this, the mean value can be encoded using 6, 7, or 8 bits. The structure of the bits comprising the encoding information for this mode is shown in Fig. 5.7. The algorithm to determine the number of free bits is shown in Fig. 5.8.
The next method is the S-method. It is intended for low-light images and is derived from the N-method by changing the quantization tables and by special encoding of the s-value. In particular, it was found that the least significant 6 bits are enough for encoding the s-value of low-light areas, and the least significant bit of these is excluded, so only 5 bits are used. The modified quantization table is shown in Table 5.5.
The C-method is also based on the NSW transform, applied to all colour channels simultaneously and followed by joint encoding of all three channels. The efficiency of this method rests on the fact that the NSW values are correlated across the three colour channels. By sharing the syntax information between colour channels and removing the independent selection of quantizers for each colour (all are encoded with the same quantizer set), an increase of the quantizer's dynamic range (to eight levels instead of four) is enabled – see Table 5.6.
The quantization process is like that of the P-method. The main change is that one quantizer subset is selected for all differences h, v, d of the R, G, B colours. Then each difference value is encoded using a 3-bit value; 9 difference values take 27 bits; the quantizer subset requires another 3 bits; thus, the differences take 30 bits. Every s-value (for R, G, and B) is encoded using 5 bits by truncating the 3 LSBs, which are reconstructed with the binary value 100b. Every 2 × 2 colour block is thus encoded using
Fig. 5.8 The algorithm to determine the number of free bits for the P-method
45 bits; to provide optimal visual quality, the quantization error should take into account the colour perception of the human visual system, which is translated into weights for every colour:
ΔA_C = (Δh_C + Δv_C + Δd_C)/4
ΔB_C = ΔA_C − Δh_C
ΔC_C = ΔA_C − Δv_C
ΔD_C = ΔA_C − Δd_C

E = Σ_{C=R,G,B} W_C · [(ΔA_C)² + (ΔB_C)² + (ΔC_C)² + (ΔD_C)²] / 4,
where W_C are weights reflecting the colour perception of an observer for colour channel C = R, G, B; Δh_C, Δv_C, and Δd_C are the errors of approximating the h, v, and d differences. A simpler yet still efficient way to calculate the weighted encoding error is

E′ = Σ_{C=R,G,B} MAX(|Δh_C|, |Δv_C|, |Δd_C|) · W_C
By subjective visual testing, it was confirmed that this method provides good visual results, and it was finally adopted into the solution.
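The simplified criterion can be sketched as follows; the weights here are placeholders, not the values adopted in VLCCT:

```python
# Placeholder perceptual weights; VLCCT's trained values are not reproduced here.
W = {"R": 0.299, "G": 0.587, "B": 0.114}

def weighted_max_error(errors):
    """Simplified weighted error E'.

    errors maps a channel name to its (dh, dv, dd) approximation errors."""
    return sum(max(abs(dh), abs(dv), abs(dd)) * W[c]
               for c, (dh, dv, dd) in errors.items())
```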
Let us review the E-method, which is intended for images with sharp colour edges and colour gradients, where other methods usually cause visible distortions. To deal with this problem, it was suggested to represent each 2 × 4 block as four small “stripes” – 1 × 2 sub-blocks. Every 1 × 2 sub-block consists of three colour channels (Fig. 5.9); only one colour channel is defined as the dominant colour channel for such a sub-block; its values are encoded using 6 bits, while the remaining colour channels are encoded using average values only.
By design, a dominant channel is determined, and the pixel values for that channel are encoded using 6 bits. The remaining channels are encoded via the average of their two values at every spatial position. Besides, quantization and clipping are applied. Firstly, the R, G, and B colour channels are analysed and, if certain conditions are not met, the YCbCr colour space is used instead; the luminance channel is considered dominant, while Cb and Cr are encoded as the remaining channels. The key point of this method is that the average value is calculated jointly for both remaining channels at every spatial position. Every 1 × 2 sub-block is extended with extra information indicating the dominant channel. The algorithm describing this method is shown in Fig. 5.10.
In addition to the methods described above, seven other methods are provided that encode 2 × 4 blocks without further splitting into smaller sub-blocks. In general, they are all intended for the cases explained above but, owing to their different mechanisms for representing data, may be more efficient in specific cases. The 2 × 4 D-method is targeted at diagonal-like image patches combined with natural parts (that is, transition regions between natural and structured image parts). The sub-block with a regular diagonal structure is encoded according to a predefined template, while the remaining 2 × 2 sub-block is encoded by truncating the two least significant bits of every pixel.
The sub-block with a regular pattern is determined by calculating errors over special template locations; the block with the smallest error is detected (Fig. 5.11) and considered to be the block with a regular structure; the remaining block is encoded in simple PCM mode with 2 LSBs truncated. The template values mentioned above are calculated according to the following equations:
C0(k) = [R(0, 0+2k) + R(1, 1+2k) + G(0, 1+2k) + G(1, 0+2k) + B(0, 0+2k) + B(1, 1+2k)] / 6
C1(k) = [R(0, 1+2k) + R(1, 0+2k) + G(0, 0+2k) + G(1, 1+2k) + B(0, 1+2k) + B(1, 0+2k)] / 6
where C0, C1 are the so-called template values and k is the index of a sub-block: k = 0 means the left sub-block, k = 1 the right sub-block. The approximation error for the k-th block is defined as follows:
Fig. 5.10 The algorithm of determining dominant channels and sub-block encoding (E-method)
BlE(k) = |R(0, 0+2k) − C0(k)| + |R(1, 1+2k) − C0(k)| +
         |G(0, 1+2k) − C0(k)| + |G(1, 0+2k) − C0(k)| +
         |B(0, 0+2k) − C0(k)| + |B(1, 1+2k) − C0(k)| +
         |R(0, 1+2k) − C1(k)| + |R(1, 0+2k) − C1(k)| +
         |G(0, 0+2k) − C1(k)| + |G(1, 1+2k) − C1(k)| +
         |B(0, 1+2k) − C1(k)| + |B(1, 0+2k) − C1(k)|
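The template values and BlE(k) can be sketched directly from the equations above; the [row][column] array layout and the function name are illustrative:

```python
def template_error(R, G, B, k):
    """C0/C1 template values and error BlE(k) for sub-block k of a 2x4 block.

    R, G, B are 2x4 arrays indexed [row][col]; k = 0 (left) or 1 (right)."""
    c = 2 * k
    C0 = (R[0][c] + R[1][c+1] + G[0][c+1] + G[1][c] + B[0][c] + B[1][c+1]) / 6.0
    C1 = (R[0][c+1] + R[1][c] + G[0][c] + G[1][c+1] + B[0][c+1] + B[1][c]) / 6.0
    err = (abs(R[0][c] - C0) + abs(R[1][c+1] - C0) +
           abs(G[0][c+1] - C0) + abs(G[1][c] - C0) +
           abs(B[0][c] - C0) + abs(B[1][c+1] - C0) +
           abs(R[0][c+1] - C1) + abs(R[1][c] - C1) +
           abs(G[0][c] - C1) + abs(G[1][c+1] - C1) +
           abs(B[0][c+1] - C1) + abs(B[1][c] - C1))
    return C0, C1, err
```

A sub-block whose pixels fall exactly into the two template groups yields err = 0, i.e. a perfectly regular diagonal structure.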
(Fig. 5.13a/b residue: given reference points A1–A4 and B1–B4, calculate D1 = A1 − A2, D2 = A2 − A3, D3 = A3 − A4, D4 = B1 − B2, D5 = B2 − B3, D6 = B3 − B4, and D_Average = (|D1| + |D2| + |D3| + |D4| + |D5| + |D6|)/6.)
Fig. 5.13 (a) Reference points and difference directions. (b) The algorithm for the F-method
C_R(i, j) = I_R(i, j)/16 if (i, j) ∈ {(0,0), (0,2), (0,3), (1,0), (1,1), (1,3)}; otherwise I_R(i, j)/32
C_G(i, j) = I_G(i, j)/16 for all (i, j)
C_B(i, j) = I_B(i, j)/16 if (i, j) ∈ {(0,1), (0,3), (1,2)}; otherwise I_B(i, j)/32
where I_R, I_G, I_B stand for the input pixel colour values and C_R, C_G, C_B are the output values after quantization. Reconstruction is performed according to the following equations:
D_R(i, j) = 16·C_R(i, j) + 8 if (i, j) ∈ {(0,0), (0,2), (0,3), (1,0), (1,1), (1,3)}; otherwise 32·C_R(i, j) + 16
D_G(i, j) = 16·C_G(i, j) + 8 for all (i, j)
D_B(i, j) = 16·C_B(i, j) + 8 if (i, j) ∈ {(0,1), (0,3), (1,2)}; otherwise 32·C_B(i, j) + 16
The second mode is based on averaging combined with LSB truncation for 2 × 2 sub-blocks:
C_R(i, j) = [I_R(i, j) + I_R(i, j+2)] / 4, i = 0, 1; j = 0, 1
C_G(i, j) = [I_G(i, j) + I_G(i, j+2)] / 2, i = 0, 1; j = 0, 1
C_B(i, j) = [I_B(i, j) + I_B(i, j+2)] / 4, i = 0, 1; j = 0, 1
The third mode of the M-method is based on partial reconstruction of one of the colour channels using the two remaining colours; those two colour channels are encoded using bit truncation according to the following:
C_R(i, j) = I_R(i, j)/8 for all (i, j) ∈ {(0,0), (0,1), (0,2), (0,3), (1,0), (1,1), (1,2), (1,3)}
C_G(i, j) = I_G(i, j)/8 for all (i, j) ∈ {(0,0), (0,1), (0,2), (0,3), (1,0), (1,1), (1,2), (1,3)}
The two boundary pixels of the blue channel are encoded similarly:
C_B(1, 0) = I_B(1, 0)/8,  C_B(0, 3) = I_B(0, 3)/16
The best encoding mode is determined according to the minimal reconstruction error, and its number is signalled explicitly in the bitstream.
In the case where one colour channel strongly dominates the others, the 2 × 4 method denoted the O-method is applied. Four cases are supported – luminance, red, green, and blue – signalled by a 2-bit index in the bitstream. All pixel values of the dominant colour channel are encoded using PCM without any distortion (i.e. 8 bits per pixel value), while each remaining colour is approximated by the mean value over all pixels of the 2 × 4 block, encoded as an 8-bit value. This is explained in Fig. 5.14. This method is like the E-method but gives a different balance between precision and locality, although both methods use the idea of dominant colours.
The 2 × 4 method denoted the L-method is applied to all colour channels, each processed independently. It is based on the construction of a colour palette. For every 2 × 2 sub-block a palette of two colours is defined; then three modes are provided to encode the palettes: differential encoding of the palettes, differential encoding of the colours within every palette, and explicit PCM for colours via quantization. The differential encoding of palettes enhances the accuracy of colour encoding in some cases; in addition, the palette colours might sometimes coincide as a result of the calculations, which is why extra palette processing is provided to increase the chance of differential encoding being used. If the so-called InterCondition is true, then the colours of the first palette are encoded without any changes (8 bits per colour), while the colours of the second palette are encoded as differences relative to the colours of the first palette:
These indexes are transformed into 4-bit values, called cumulative difference values (CDVal), which encode the joint distribution of TIdx_y and TIdx_x (Table 5.7):
This table helps to encode the joint distribution for better localization and to encode rare/common combinations of indexes. Another table, Table 5.8, is provided to decode the indexes according to CDVal.
Using decoded indexes, it is possible to reconstruct colour differences (and hence
colours encoded differentially):
C00 − C10 = TIdx_y − 2. The new range is [−2..2].
C01 − C11 = TIdx_x − 1. The new range is [−1..2].
Extra palette processing is provided to increase the chance of using the differential mode. It consists of the following:
• Detecting the situation when both colours of a palette are equal and setting the flag FP to 1 in that case; this is done for each palette individually.
• Checking whether FP1 + FP2 == 1.
• Suppose the colours of the left palette are equal. Then a new colour from the second palette should be inserted into the first palette. This colour must be as far from the colours of the first palette as possible (this extends the variability of the palette): the special condition is checked and a new colour from the second palette is inserted:
sub-block, and the mean value is calculated. For example, for the red colour channel, the SAD is calculated as follows:
Then, the colour channel which can be effectively encoded with the mean value is determined: every SAD is compared with Threshold1 in a well-defined order, and the colour channel with the minimal SAD (in accordance with the order shown in Fig. 5.17) is determined. In addition, the mean value is analysed to decide whether it is small enough (compared with Threshold2, which is set to 32); it might be signalled in
the bitstream that the mean (average) value is small: a 2-bit prefix is used to signal which colour channel is picked, and the value 3 is reserved for this case. A 2-bit extra colour index is then added to notify the decoder which colour channel is encoded with the mean value using 3 bits, truncating the remaining two (as the mean value takes 5 bits at most). Otherwise, if the mean value is greater than Threshold2, 6 bits are used to encode it, with the 2 LSBs again truncated. The remaining two colours (“active colours”) within the underlying 2 × 2 sub-block are encoded using NSW + FQvT: each active colour is transformed into a set of values {s1, h1, v1, d1}, {s2, h2, v2, d2}.
{h1, v1, d1} and {h2, v2, d2} are encoded according to the FQvT procedure,
using quantization Table 5.9.
Every set of differential values is encoded using its own quantizer set. The s1 and s2 values are encoded in one of two modes:
1. 6 bits per s-value. This is used if the mean value of the uniform channel is small and the bit budget is enough to reserve 12 bits for the s-values.
2. Differential mode, used if the 6-bit mode cannot be used. In this mode:
• The error caused by encoding s1 and s2 as 5 bits each (via LSB truncation) is evaluated:
• If ErrQuant < EffDiffQuant, s1 and s2 are encoded using 5 bits per s-value; otherwise, 6 bits are used to encode s1, and 4 bits encode a 3-bit signed difference.
In terms of bitstream structure, the diagram in Fig. 5.18 shows the bit distribution. According to this diagram, the U1 sub-method spends 43 bits per 2 × 2 sub-block. The second sub-method of the U-method is called U2. Every 2 × 4 colour block is processed as three independent 2 × 4 single-colour blocks. Every independent 2 × 4 block is further represented as two adjacent 2 × 2 sub-blocks. For each 2 × 2 sub-block, approximation by a mean value is estimated. Then, the sub-block which is approximated by a mean value with the smallest error is determined; the other sub-block is encoded with NSW and FQvT (see quantization Table 5.10).
(Fig. 5.18 residue – s-value bit allocations: 6 bits → S1 and 6 bits → S2; or 5 bits → S1 and 5 bits → S2; or 6 bits → S1, 1 bit → sign(S1 − S2), 3 bits → |S1 − S2|.)
Every difference value {h, v, d} is encoded using a 3-bit quantizer index and a 3-bit quantizer set number. The s-value is encoded as a 7-bit value by truncating the LSB. A 1-bit value is also used to signal which sub-block is encoded as uniform. This sub-method spends 28 bits per independent 2 × 4 block.
The last method is based on the Hadamard transform (Woo and Won 1999) and is called the H-method. In this method the underlying 2 × 4 block is considered as a set of two 2 × 2 blocks, and the H-transform is applied to every block as follows:
S  = (A + B + C + D)/4
dH = (A + C − B − D)/4
dV = (A + B − C − D)/4
dD = (A − B − C + D)/4

A = S + dD + dH + dV
B = S + dV − dH − dD
C = S + dH − dV − dD
D = S + dD − dH − dV
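The forward/inverse pair can be sketched with exact arithmetic (function names illustrative):

```python
def h_forward(A, B, C, D):
    """2x2 Hadamard-style transform used by the H-method."""
    S  = (A + B + C + D) / 4.0   # mean
    dH = (A + C - B - D) / 4.0   # horizontal component
    dV = (A + B - C - D) / 4.0   # vertical component
    dD = (A - B - C + D) / 4.0   # diagonal component (dropped by the codec)
    return S, dH, dV, dD

def h_inverse(S, dH, dV, dD):
    A = S + dD + dH + dV
    B = S + dV - dH - dD
    C = S + dH - dV - dD
    D = S + dD - dH - dV
    return A, B, C, D
```

Setting dD = 0 before calling `h_inverse` reproduces the codec's behaviour of discarding the perceptually least important component.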
The dD value is set to zero, as it is less important to the human visual system; the remaining components S, dH, and dV are encoded as follows:
• The S-value is encoded as a 6-bit value via truncation of the 2 LSBs; these 2 LSBs are reconstructed with the fixed binary value 10.
• dH and dV are encoded using FQvT; Table 5.11 describes the quantizer indexes for the eight quantizer sets adopted for the H-transform.
Every 2 × 2 sub-block within every colour channel is encoded using 6 bits for the S-value, one 3-bit quantizer set index shared between dV and dH, and a 3-bit value for each difference value. This approach is similar to methods described before, for example, the U-method; the difference is the shared quantizer set index, which saves bits and relies on the correlation between the dV and dH values.
The key idea is to pick the method that provides the minimum error; in general, the more closely the error measure matches how a user ranks methods by visual quality, the better the final quality. However, complexity is another limitation that bounds the final efficiency and restricts the approaches that can be applied in VLCCT. Based on a state-of-the-art review and experimental data, two approaches were adopted in VLCCT.
The first approach is the weighted mean square error (WMSE), defined as follows:

WMSE = Σ_{C=R,G,B} W_C · Σ_{i=1}^{2} [(A_i − A′_i)² + (B_i − B′_i)² + (C_i − C′_i)² + (D_i − D′_i)²],
where A, B, C, and D are pixel values (original and reconstructed) and W_C are weights dependent on the colour channel. According to the experiments conducted, the following weights were adopted:
MaxSq_C = MAX[(A_C − A′_C)², (B_C − B′_C)², (C_C − C′_C)², (D_C − D′_C)²] · W_C;

• Then, calculate the sum of the MaxSq values and the maximum over all colour channels:

SMax = Σ_{C=R,G,B} MaxSq_C
MMax = MAX[MaxSq_R, MaxSq_G, MaxSq_B];
To calculate the WSMMC for a 2 × 4 block, the WSMMC values for both 2 × 2 sub-blocks are added. This method is simpler than WMSE but was still shown to be effective. In general, to determine the best encoding method, all feasible combinations of methods are checked, errors are estimated, and the one with the smallest error is selected.
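The exhaustive selection can be sketched generically; the encode/decode callables and the weights below are placeholders rather than VLCCT's actual methods:

```python
def select_method(block, methods, error_fn):
    """Try every candidate (encode, decode) pair and keep the minimal-error one.

    `methods` maps a name to (encode, decode) callables - placeholders here."""
    best_name, best_err = None, float("inf")
    for name, (encode, decode) in methods.items():
        err = error_fn(block, decode(encode(block)))
        if err < best_err:
            best_name, best_err = name, err
    return best_name, best_err

def wmse(orig, recon, w=(0.3, 0.6, 0.1)):
    """Weighted SSE in the spirit of WMSE; `w` values are illustrative."""
    return sum(wc * sum((o - r) ** 2 for o, r in zip(oc, rc))
               for wc, oc, rc in zip(w, orig, recon))
```

A lossless candidate (identity encode/decode) naturally wins with zero error, so in practice each candidate also carries a bit cost constraint that this sketch omits.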
VLCCT provides a simple way to represent the two least significant bits of the 2 × 4 colour block. To keep a reasonable trade-off between bit cost, quality, and complexity, several methods to encode the LSBs are provided:
• 3-bit encoding – the mean value of the LSBs for every colour is calculated over the whole 2 × 4 block, followed by quantization:
mLSB_c = [Σ_{y=0}^{1} Σ_{x=0}^{3} (I_c(x, y) & 3)] / 16,
where I_c(x, y) is the input 10-bit pixel value in colour channel C at position (x, y). mLSB_c is encoded using 1 bit for each colour.
• 4-bit encoding – like the 3-bit encoding approach, but every 2 × 2 sub-block is considered independently for the green colour channel (reflecting the higher sensitivity of the human visual system to green):
mLSB_G_Left = [Σ_{y=0}^{1} Σ_{x=0}^{1} (I_G(x, y) & 3)] / 8,  mLSB_G_Right = [Σ_{y=0}^{1} Σ_{x=2}^{3} (I_G(x, y) & 3)] / 8;
• 5-bit encoding – like 4-bit encoding, but the red channel is also split into left/right sub-blocks; thus, the green and red colours have higher precision for LSB encoding.
• 6-bit encoding – every colour channel is encoded using 2 bits, in the same way as the green channel in the 4-bit encoding approach.
Reconstruction is done following the next equation (applicable for every channel
and every sub-block or block):
MiddleValue = 2 if mLSB == 1, and 0 if mLSB == 0.
This middle value is assigned to 2 LSB bits of the corresponding block/sub-block for
every processed colour.
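A sketch of the 3-bit LSB mode; the rounding threshold for the 1-bit decision is an assumption, as the text specifies only the averaging and the reconstruction middle value:

```python
def encode_mlsb(pixels):
    """1-bit mean of the two LSBs over a 2x4 block for one colour channel.

    `pixels` is a 2x4 array of 10-bit values. The >= 0.5 rounding rule is an
    assumption; the text defines only the mLSB average itself."""
    total = sum(p & 3 for row in pixels for p in row)
    return 1 if total / 16 >= 0.5 else 0

def reconstruct_mlsb(mlsb):
    """Middle value written back into the 2 LSBs of every pixel."""
    return 2 if mlsb == 1 else 0
```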
(Diagram residue: encoder method blocks N, P, S, C, E, F, O, D, L, U, followed by LSB encoding.)
• Each operation or memory cell takes a predefined number of physical gates in a chip.
• For effective use of the hardware and to keep latency constrained, pipelining and concurrent execution should be enabled.
• Algorithmic optimizations must be applied where possible.
The initial HW complexity estimate was based on counting elementary operations for almost every module (method). The elementary operations are the following:
1. Addition/subtraction of 8-bit signed/unsigned values. Denoted as (+)
2. Bit shift without carry bit. Denoted as (<<)
3. Comparison of 8-bit signed or unsigned values. Denoted as (?)
4. Look-up table operation extracting a value from some position within a 1D or 2D table. Denoted as (T.Opt)
5. Multiplication/division of 8-bit signed/unsigned values. Denoted as (*/)
6. Logical AND/OR operation. Denoted as (AND/OR)
7. Some complex (combined) sequence of operations that is very HW-dependent. Denoted as (ETC)
8. Transformation from conventional complementary code into sign-magnitude representation. Denoted as (CC→MM)
Fig. 5.20 Time diagram of the encoding process; error estimation is pipelined
Table 5.13 provides the estimated HW complexity for these operations (in gates). The total number of gates required for a straightforward implementation of VLCCT (main routines of the encoder part) is shown in Table 5.14.
If pipelining is applied to the error estimation process as shown in Fig. 5.20, a small but fixed latency occurs, but the complexity is reduced (Table 5.15).
Fig. 5.21 An example of {h,v,d} distribution, middle levels, and reconstructed levels
MaxErr1 = MAX{|Δh|, |Δv|, |Δd|} and MaxErr2 = |Δh| + |Δv| + |Δd|.
Subjective testing showed that both error functions provide very similar results in terms of visual quality, with negligible differences under normal viewing conditions. Other simplifications dealt with the selection of the best compression method for a block, compact storage of the quantization tables, and modification of the quantization process by using middle levels as comparison nodes, as shown in Fig. 5.21.
After applying all these optimization techniques, the complexity of the encoding part was reduced to 217 kGates. Table 5.16 sums up the HW complexity in kGates required to implement VLCCT.
Visual quality (when an image or video is intended for humans) refers to human perception of the “goodness” of the visual information being presented. In fact, “goodness” is a very generic term: depending on the situation, it might refer to an absolute scale (usually the Mean Opinion Score (MOS)) or a relative scale (the Differential Mean Opinion Score (DMOS)), and it might concern still or moving images and low-level (human visual system) or high-level (brain) processing. Many materials on this are publicly available (see, e.g. Watson 1993; Ohta and Robertson 2005; and others).
According to the classification provided in several international standards and recommendations on visual quality evaluation, such as BT.500 (2019), BT.1676 (2004), BT.1683 (2004), BT.2095 (2017), P.910 (2008), P.913 (2016), and others, a subjective testing methodology might be based on a double-stimulus setup (where a reference and an impaired image/video are presented in one trial). Alternatively, single-stimulus continuous quality evaluation might be applied. Every setup has its own pros and cons, which are well explained in the literature.
The visual quality provided by VLCCT was analysed under 1×, 2×, and 3× zoom-in; zooming condenses all the spatial frequencies (usually expressed in cycles per degree of visual angle) towards the low part of the spectrum, which leads to the following:
• All possible defects become highly visible.
• After zoom-in, there is not much difference between low and high spatial frequencies, which means all frequencies become equally important.
To test the visual quality (how accurate VLCCT is), a special dataset was prepared, which included several categories of images:
1. Natural images
2. Computer graphics
3. Screen content
4. Synthetic images (including resolution and colour charts and various colour transitions)
5. Combined frames
Fig. 5.22 Examples of test images from the dataset used for VLCCT testing: (a) synthetic colour
transitions (red); (b) synthetic colour transitions (blue); (c) synthetic colour transitions (green); (d)
radial/moiré patterns; (e) natural, complex scene; (f) spatial-frequency chart; (g) circle-like chart; (h)
gradient/transition; (i) gamma-checker pattern; (j) TV test chart; (k) natural, complex scene; (l) colour
transitions; (m) natural, small details; (n) complex colour transition; (o) angular colour transition
Every image was compressed up to three times; this was necessary to verify that
even if an image is passed through a coder-decoder multiple times, no significant
degradation occurs, which is important for real use cases. Examples of some images
from the dataset are shown in Fig. 5.22.
The whole dataset included more than 300 test images (including colour/resolution charts (e.g. ISO 12233:2017), sharp colour transitions, gradients, and natural images). All images were subjected to both objective (PSNR) and subjective (visual testing) analysis. Finally, it was confirmed that even after three consecutive VLCCT compression-decompression loops, no visual distortions were observed; the PSNR was higher than 33–35 dB, which is close to the perceptually lossless level.
Nowadays, there is huge progress in visual quality analysis, and more and more systems perform automatic quality control. Owing to the wide adoption of machine learning methods, several metrics have appeared that are in good agreement with human scores of visual quality. Worth mentioning are VMAF (Video Multi-Method Assessment Fusion) by Li et al. (2016) and a method to recover subjective scores from noisy raw data by Li and Bampis (2017); fully blind metrics such as the Natural Image Quality Evaluator (NIQE) by Mittal et al. (2013) and eMOS (2020); the “perceptual loss” criterion used in technologies such as style transfer (Johnson et al. 2016) and super-resolution; and a recently published NVIDIA paper on image quality evaluation for computer graphics images (Andersson et al. 2020). A very detailed review of visual quality metrics is given in Athar and Wang (2019). Most of these technologies have reached maturity because data have become available, for example:
1. The dataset related to texture perceptual similarity by Zhang et al. (2018).
2. Several large annotated datasets on visual quality collected by experts from
Workgroup Multimedia Signal Processing (https://www.mmsp.uni-konstanz.de/
). The datasets are available at: http://database.mmsp-kn.de/.
3. Other datasets that became popular in recent years:
• CIDIQ by Liu et al. (2014)
• CSIQ by Larson and Chandler (2010)
• IVL by Corchs et al. (2014)
• IVC (accessed on Sept. 16, 2020)
• Image and Video Quality Assessment datasets LIVE (2020)
• TID by Ponomarenko et al. (2015)
5.8 Conclusions
References
Andersson, P., Nilsson, J., Akenine-Möller, T., Oskarsson, M., Åström, K., Fairchild, M.D.: FLIP:
A difference evaluator for alternating images. In: Proceedings of the ACM on Computer
Graphics and Interactive Techniques. 3 (2), Article 15 (2020) Accessed on 16th September
2020. https://www.highperformancegraphics.org/2020/
ASTC Texture Compression (2019). Accessed on 15 September 2020 https://www.khronos.org/
opengl/wiki/ASTC_Texture_Compression https://github.com/ARM-software/astc-encoder
Athar, S., Wang, Z.: A comprehensive performance evaluation of image quality assessment
algorithms. In: IEEE Access, vol. 7, pp. 140030–140070 (2019)
BT. 1676, Methodological framework for specifying accuracy and cross-calibration of video
quality metrics. ITU recommendation (2004)
BT.1683 Objective perceptual video quality measurement techniques for standard definition digital
broadcast television in the presence of a full reference. ITU recommendation (2004)
BT.2095, Subjective assessment of video quality using expert viewing protocol, actual revision
BT.2095-1. ITU recommendation (2017)
BT.500, Methodologies for the subjective assessment of the quality of television images. ITU
recommendation, actual revision: BT.500-14 (2019)
Corchs, S., Gasparini, F., Schettini, R.: No reference image quality classification for JPEG-distorted
images. Digital Signal Processing. 30, 86–100 (2014)
eMOS technology (2020) Accessed on 16 September 2020. https://deelvin.com/machine-learning-
technologies
Image and Video Quality Assessment at LIVE. The University of Texas at Austin, Laboratory of Image and Video Engineering (2020) Accessed on 16 September 2020. http://live.ece.utexas.edu/research/Quality/
Strom, J., Moller, T.A.: iPACKMAN: High-Quality, Low-Complexity Texture Compression for
Mobile Phones, Graphics Hardware, pp. 63–70 (2005)
ISO 12233:2017 Photography–Electronic still picture imaging–Resolution and spatial frequency
responses (2017) Accessed on 16 September 2020. https://www.iso.org/standard/71696.html
IVL dataset. Institut de Recherche en Communications et Cybernétique de Nantes (2014) Accessed
on 16 September 2020. http://ivc.univ-nantes.fr/en/pages/view/44/
Jaspers E.G.T., de With, P.H.N.: Compression for reduction of off-chip video bandwidth. In:
Proceedings of SPIE. 4674 (2002) Accessed on 22 September 2020. https://doi.org/10.1117/
12.451065
Johnson, J., Alahi, A., Fei-Fei, L.: Perceptual losses for real-time style transfer and super-resolution.
Springer (2016)
Kabal, P.: Quantizers for symmetric gamma distributions. IEEE Transactions on Acoustics, Speech,
and Signal Processing. 32(4), 836–841 (1984)
Larson, E.C., Chandler, D.M.: Most apparent distortion: full-reference image quality assessment
and the role of strategy. Journal of Electronic Imaging. 19(1), 011006 (2010)
Lee, S-Jo, Lee, Si-Hwa, Kim, Do-Hyung: Method, medium, and system compressing and/or
reconstructing image information with low complexity. US Patent Application 20080317116
(2008)
Li, Z., Aaron, A., Katsavounidis, I., Moorthy, A., Manohara, M.: Toward a practical perceptual
video quality metric (2016). Accessed on 16 September 2020. https://netflixtechblog.com/
toward-a-practical-perceptual-video-quality-metric-653f208b9652. Github repository: https://
github.com/Netflix/vmaf
Li, Z., Bampis, C.: Recover subjective quality scores from noisy measurements.
arXiv:1611.01715v3 (2017) Accessed on 16 September 2020 https://github.com/Netflix/sureal
Liu, X., Pedersen, M., Hardeberg, J.Y.: CID:IQ – A new image quality database. In: Proceedings of
the International Conference on Image and Signal Processing, pp. 193–202 (2014)
Matoba, N., Terada, K., Saito, M., Tanioka, M.: Real-time continuous recording technique using
FBTC in digital still cameras. In: Proceedings of the SPIE. Digital Solid State Cameras: Designs
and Applications, vol. 3302, (1998)
Methods for encoding and decoding ETC1 textures (2019) Accessed on 22 September 2020. https://
developer.android.com/reference/android/opengl/ETC1
Mitchell, O.R., Delp, E.J.: Multilevel graphics representation using block truncation coding. Proc.
IEEE. 68(7), 868–873 (1980)
Mittal, A., Soundararajan, R., Bovik, A.C.: Making a completely blind image quality analyzer.
IEEE Signal processing Letters. 22(3), 209–212 (2013). Accessed on 16 September 2020. http://
live.ece.utexas.edu/research/Quality/nrqa.htm
Odagiri, J., Nakano, Y., Yoshida, S.: Video compression technology for in-vehicle image trans-
mission: SmartCODEC. Fujitsu Scientific & Technical Journal. 43(4), 469–474 (2007)
Ohta, N., Robertson, A.: Colorimetry: Fundamentals and Applications. John Wiley & Sons, Ltd
(2005)
P.910: Subjective video quality assessment methods for multimedia applications (2008)
P.913: Methods for the subjective assessment of video quality, audio quality and audiovisual
quality of Internet video and distribution quality television in any environment. ITU recom-
mendation (2016)
Paltashev, T., Perminov, I.: Texture compression techniques (2014) Accessed on 15 September
2020. http://sv-journal.org/2014-1/06/en/index.php?lang¼en
Patane, G., Russo, M.: The enhanced LBG algorithm. Elsevier, Neural Networks. 14(9), 1219–1237
(2001)
Paris, G.: Vulkan SDK update (1995–2020). Accessed on 15 September 2020. https://community.
arm.com/developer/tools-software/graphics/b/blog/posts/vulkan-sdk-update
Ponomarenko, N., Jin, L., Ieremeiev, O., Lukin, V., Egiazarian, K., Astola, J., Vozel, B., Chehdi,
K., Carli, M., Battisti, F., Jay Kuo, C.-C.: Image database TID2013: peculiarities, results and
perspectives. Signal Process. Image Commun. 30, 57–77 (2015)
Rissanen, J.J., Langdon, G.G.: Arithmetic coding. IBM J. Res. Dev. 23(2), 149–162 (1979)
Someya, J., Nagase, A., Okuda, N.: Image processing apparatus and method, and image coding
apparatus and method. US Patent Application 20080019598 (2008)
Sugita, Y.: Data compression apparatus, and data compression program storage medium. US Patent
7,183,950 (2007)
154 M. N. Mishourovsky
Sugita, Y., Watanabe, A.: Development of new image compression algorithm (Xena). In: Pro-
ceedings of SPIE. Real-Time Image Processing. vol. 6496. (2007)
Sweldens, W.: The lifting scheme: a construction of second generation of wavelets. J. Math. Anal.
29(2), 511–546 (1997)
Takahashi, T., Matoba, N., Ohashi, S.: Image coding apparatus for converting image information to
variable length codes of predetermined code size, method of image coding and apparatus for
storing/transmitting image. US Patent 6,052,488 (2000)
Torikai, Y., Tanioka, M., Matoba, N.: Apparatus and method of image compression and decom-
pression not requiring raster block and block raster transformation. US Patent 6, 026,194 (2000)
Watson, A.B.: DCTune: A technique for visual optimization of DCT quantization matrices for
individual images. Society for Information Display Digest of Technical Papers. XXIV,
pp. 946–949 (1993)
Woo, S.H., Won, C.S.: Multiresolution progressive image transmission using a 2x2 DCT. In: 1999
Digest of Technical Papers. International Conference on Consumer Electronics (Cat.
No.99CH36277) (1999) https://doi.org/10.1109/ICCE.1999.785243
Wu, Y., Coll, D.C.: Single bit-map block truncation coding of color images. IEEE J. Select. Areas
Commun. 10(5), 952–959 (1992)
Zhang, R., Isola, P., Efros, A.A., Shechtman, E., Wang, O.: The unreasonable effectiveness of deep
features as a perceptual metric. In: Proceedings of the IEEE/CVF Conference on Computer
Vision and Pattern Recognition. arXiv:1801.03924 (2018)
Chapter 6
Automatic Video Editing
Sergey Y. Podlesnyy
6.1.1 Introduction
Portable devices equipped with video cameras, like smartphones and action cameras,
are showing rapid progress in imaging technology. 4K sensor resolution and highly
efficient video codecs such as H.265 provide firm ground for social video capturing and
consumption. However, despite this impressive progress, users seem to prefer photos
as the main medium to capture their daily events and mostly consume video clips
produced by professionals or skilled bloggers. We argue that the main reason is that
the time needed to watch a video is orders of magnitude longer than photo browsing.
Social video footage captured by ordinary gadget users is too long, to say nothing
of its quality. Videos should be edited before presenting them even to
one’s closest friends and/or family members, yet video editing is a lengthy and
complicated process for most video gadget users.
Video editing involves, at a minimum, selecting the most valuable footage in terms
of visual quality and the importance of the action filmed. This is time-consuming
even on its own, but the next step is to cut
the footage into a brief and coherent visual story that will be interesting to watch.
This process is cinematographic in nature and thus requires many artistic and
technical skills, which makes it almost impossible for a broad range of users to use
successfully.
Recently, deep learning has shown huge success in the visual data processing
area. We aim to apply the proven techniques of machine learning, convolutional
neural networks and reinforcement learning to create automatic tools for social video
S. Y. Podlesnyy (*)
Cinema and Photo Research Institute (NIKFI) of Gorky Film Studios, Moscow, Russia
e-mail: s.podlesnyy@nikfi.ru
Tsivian (2009) used detailed film shot length metrics to study the history of cine-
matography editing and individual styles of famous film editors. Specifically, these
measure the film’s average shot length (ASL) tabulated by the shot size (from close-
up to long shot and even their extreme scales: from big close-up to very long shot),
cutting swing (standard deviations of shorter and longer shots from ASL), their
cutting range (difference in seconds between the shortest and the longest shot of the
film) and their dynamic profiles (polynomial trend lines which reflect fluctuations of
shot lengths within the duration of the film). The results of applying dynamic
profiling to separate stories within the film, but not the full-feature film, look the
most promising (see Fig. 6.1). It is argued that the profile pattern may reflect either
some general rule of dramatic rhythm or the editors’ individual way of shaping the
narrative flow of their films.
As timing statistics may be the most distinctive “fingerprint” for attributing a
feature film to a particular author, we argue that it may be
beneficial to measure the statistics of transitions between shot types, e.g. classified
by shot size. Compliance with basic cinematography rules (e.g. famous Russian
cinematographer Kuleshov in the early twentieth century recommended transitions
between shot sizes separated by two steps) could be checked by a simple count of
transitions. The importance of the cinematographic cut is well-known to filmmakers
(Pudovkin 1974).
Examples of basic cinematography editing rules are:
• “avoid jump cuts” rule (transitions between the shots should go in two shot size
steps, shot sizes being “extra-long shot”, “long shot”, “middle long shot”, “middle
shot”, “close-up”, “extra close-up”; the camera position should move at least
30° between the two shots).
• “180° line of action” rule (the camera should never cross the line of action while
capturing dialogue and other actions having a distinct axis).
Of course, we should not take the rules for granted and demand 100% compli-
ance, as video editing is an artistic process and not a subject for mechanical
judgment.
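The two-step rule above can still be checked mechanically as a soft diagnostic. The following is a minimal Python sketch, not from the chapter: the shot-size labels follow the list above, while the example sequence and the flagging convention are invented for illustration.

```python
# Shot-size scale from the "avoid jump cuts" rule above, widest to tightest.
SHOT_SIZES = ["extra-long", "long", "middle-long", "middle", "close-up", "extra-close-up"]
INDEX = {name: i for i, name in enumerate(SHOT_SIZES)}

def transition_counts(shots):
    """Count occurrences of each (from_size, to_size) transition."""
    counts = {}
    for a, b in zip(shots, shots[1:]):
        counts[(a, b)] = counts.get((a, b), 0) + 1
    return counts

def jump_cut_violations(shots, min_step=2):
    """Flag cuts whose shot sizes differ by fewer than `min_step` levels;
    a cut between identical sizes is the classic jump cut."""
    return [
        (i, a, b)
        for i, (a, b) in enumerate(zip(shots, shots[1:]))
        if abs(INDEX[a] - INDEX[b]) < min_step
    ]

shots = ["long", "middle", "close-up", "close-up", "middle-long"]
print(jump_cut_violations(shots))
# [(1, 'middle', 'close-up'), (2, 'close-up', 'close-up')]
```

As the text cautions, such counts should inform rather than dictate: a high violation count is a hint about editing style, not a verdict.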
Fig. 6.1 Example metric profiles of four separate stories in D.W. Griffith’s 1916 Intolerance.
(Reproduced with permission from Tsivian 2009)
Fig. 6.2 15 video segments (left); reconstruction error for each video segment (right). (Reproduced
with permission from Zhao and Xing 2014)
consistency, where a dictionary of key frames is selected such that the original video
can be best reconstructed from this representative dictionary. An efficient global
optimisation algorithm is introduced to solve the dictionary selection model.
Zhao and Xing (2014) further develop a dictionary-based approach analysing the
sparse reconstruction error of a new video segment with a dictionary learnt by
observing the previous part of a potentially very long video. Figure 6.2 illustrates
a video reconstruction error.
An important insight of this work is that typical consumer videos do not have any
temporal segmentation characterised by minimal variation and consistency of
objects. Amateur users often shoot videos with a constantly moving camera, chang-
ing the zoom and shaking the device. Conventional shot boundary detection methods
(e.g. based on colour histograms or motion estimation) do not work in this setting.
Zhao and Xing (2014) cope with this problem by simply breaking the raw footage
into fixed-length sequences, e.g. each 50 frames long.
They represent features for video data as spatio-temporal cuboids of interest
points using the method of Dollar et al. (2005) and describe each detected interest
point with a histogram of gradient (HoG) and histogram of optical flow (HoF). The
feature representation for each detected interest point is then obtained by concatenat-
ing the HoG feature vector and HoF feature vector. Finally, each video segment is
represented as a collection of feature vectors, corresponding to the detected interest
points.
They initiate a dictionary by learning from feature vectors obtained from the first
m segments and further scan through the video; segments that cannot be sparsely
reconstructed using the current dictionary, indicating unseen and interesting content,
are incorporated into the summary video. The current dictionary is updated online
when a segment reconstruction error exceeds a given threshold.
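The online dictionary idea can be sketched compactly. The snippet below is a hedged illustration, not the authors' implementation: plain least-squares reconstruction stands in for the sparse coding of Zhao and Xing (2014), and all dimensions, thresholds and data are invented.

```python
import numpy as np

rng = np.random.default_rng(0)

def reconstruction_error(segment, dictionary):
    """Residual norm of the best linear reconstruction of `segment`
    from the columns of `dictionary` (least squares, not sparse coding)."""
    coef, *_ = np.linalg.lstsq(dictionary, segment, rcond=None)
    return np.linalg.norm(dictionary @ coef - segment)

def summarise(segments, m=3, threshold=1.0):
    """Initialise the dictionary from the first m segments, then keep the
    indices of segments whose reconstruction error exceeds the threshold,
    updating the dictionary online as described above."""
    dictionary = np.stack(segments[:m], axis=1)          # features x m
    summary = []
    for i in range(m, len(segments)):
        seg = segments[i]
        if reconstruction_error(seg, dictionary) > threshold:
            summary.append(i)                            # novel content
            dictionary = np.column_stack([dictionary, seg])
    return summary

# Toy data: most segments live in a 2D subspace; one outlier is "novel".
base = rng.normal(size=(16, 2))
segments = [base @ rng.normal(size=2) for _ in range(8)]
segments.insert(5, rng.normal(size=16) * 5.0)            # the novel segment
print(summarise(segments))                               # [5]
```

The design choice mirrors the paper's logic: a segment that the current dictionary cannot explain is, by definition, unseen content and therefore worth keeping.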
They used a test data set consisting of over 12 hours of raw video footage hand
labelled by human experts. The test videos span a wide variety of scenarios: indoor
and outdoor, moving camera and still camera, with and without camera zoom in/out,
with different categories of targets (human, vehicles, planes, animals etc.) and cover
a wide variety of activities and environmental conditions. For each video in the test
data set, three judges selected segments from the original video to compose their
preferred version of the summary video. The final ground truth was then constructed
by pooling together those segments selected by at least two judges. To quantitatively
determine the overlap between the algorithm-generated summary and the ground
truth, both the video segment content and time differences were considered. The
final accuracy was computed as the ratio of segments in the algorithm-generated
summary video that overlap with the ground truth.
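The pooling and scoring procedure can be illustrated in a few lines; the segment identifiers and the two-vote threshold below are assumed for the example, and the time-difference tolerance used in the paper is omitted for brevity.

```python
def pooled_ground_truth(judge_selections, min_votes=2):
    """Keep segments chosen by at least `min_votes` judges."""
    votes = {}
    for selection in judge_selections:
        for seg in selection:
            votes[seg] = votes.get(seg, 0) + 1
    return {seg for seg, v in votes.items() if v >= min_votes}

def summary_accuracy(summary, ground_truth):
    """Ratio of summary segments that appear in the ground truth."""
    if not summary:
        return 0.0
    return sum(seg in ground_truth for seg in summary) / len(summary)

judges = [{1, 4, 7}, {1, 7, 9}, {4, 7, 10}]
gt = pooled_ground_truth(judges)            # {1, 4, 7}
print(summary_accuracy([1, 4, 7, 9], gt))   # 0.75
```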
The average accuracy of their summarising was 72.30%, and the ratio of the total
time spent on generating feature representations, learning the initial dictionary, video
reconstruction and online dictionary updating to the raw video duration ranged from
0.60 to 1.71.
Although video summarising is only loosely related to the topic of this chapter, we
refer to the method of determining a measure of the importance of film shots or
segments shown in Uchihachi et al. (2003), where video shots have been clustered
using a measure of visual similarity, such as colour histograms or transform coeffi-
cients. Consecutive frames belonging to the same cluster are considered as a video
shot, each shot having attributes of length and cluster weight (total number of frames
belonging to the cluster). A shot is important if it is both long and rare. Additional
amplification factors are proposed for preferring specific shot categories, for exam-
ple, close-ups.
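A possible form of the “long and rare” measure is sketched below. The logarithmic rarity term is one common choice, assumed here for illustration rather than taken from Uchihachi et al. (2003), and the amplification factors for preferred categories are omitted.

```python
import math

def shot_importance(shots, total_frames):
    """shots: list of (length_in_frames, cluster_id).
    A shot scores highly when it is long and its cluster is rare."""
    cluster_frames = {}
    for length, cluster in shots:
        cluster_frames[cluster] = cluster_frames.get(cluster, 0) + length
    scores = []
    for length, cluster in shots:
        rarity = total_frames / cluster_frames[cluster]   # rare cluster -> large
        scores.append(length * math.log(rarity))
    return scores

# Cluster "a" dominates the footage; the short shot from rare cluster "b"
# still scores highest because of its rarity.
shots = [(120, "a"), (30, "b"), (150, "a")]
print(shot_importance(shots, total_frames=300))
```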
This approach allows for a concise presentation of long video sequences as
shortened video clips or still-frame storyboards. However, the method of visual
clustering is based on low-level graphical measurements and does not contain
semantic information about the frame content and geometry. In one of our recent
works (Podlesnaya and Podlesnyy 2016), it has been shown that feature vectors
obtained from a frame image by a convolutional neural network trained to recognise
a wide range of classes, as in the ImageNet contest (Russakovsky et al. 2014),
comprise semantic information suitable for visual example-based information
retrieval and for segmenting videos into distinct shots. Here, we will show that the
same feature vector comprises geometry-related semantic information to some
extent and is at least capable of differentiating between cinematography shot sizes
(from close-up to long shots).
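As a toy illustration of using such feature vectors for shot segmentation, consecutive-frame embeddings can be compared by cosine similarity, declaring a cut where the similarity drops. The embeddings and the threshold below are synthetic stand-ins for real CNN features.

```python
import numpy as np

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def shot_boundaries(embeddings, threshold=0.5):
    """Indices i such that a cut is declared between frames i and i+1."""
    return [i for i in range(len(embeddings) - 1)
            if cosine(embeddings[i], embeddings[i + 1]) < threshold]

rng = np.random.default_rng(1)
shot_a = np.array([1.0, 0.0, 0.0, 0.0])   # stand-in embedding for shot 1
shot_b = np.array([0.0, 1.0, 0.0, 0.0])   # stand-in embedding for shot 2
frames = [shot_a + 0.01 * rng.normal(size=4) for _ in range(3)] + \
         [shot_b + 0.01 * rng.normal(size=4) for _ in range(3)]
print(shot_boundaries(frames))            # a single cut between frames 2 and 3
```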
Arev et al. (2014) of Disney Research describe a system capable of automatic editing
of video footage obtained from multiple cameras, e.g. collected from viewers of a
basketball game. Automatic editing is performed by optimising a path in the trellis
graph constructed of frames of multiple sources, ordered in time. Edges in the graph
Fig. 6.3 Method pipeline: from multiple camera feeds to a single output cut. (Reproduced with
permission from Arev et al. 2014)
represent transitions between different frames, i.e. effectively cuts (or no cuts if the
edge connects two nodes representing the same footage).
The system efficiently produces high-quality narratives by means of constructing
the cost functions for nodes and edges of the trellis graph to closely correspond to the
basic rules of cinematography. For example, in order to enforce the 180° line of
action rule, they estimate the 3D camera position and rotation for every source of
video footage and further estimate the most important action location in a 3D scene
as a joint focus of attention of multiple cameras (see Fig. 6.3). In order to avoid jump
cuts, the system estimates the cameras’ movement in 3D space and constructs the
loss function for the graph edges so that both the transition angle and the distance
between camera positions in a transition are constrained. For example, the optimal
transition angle is reported to be around 30°, and the optimal distance around 6 metres.
By estimating the distance between the camera and the joint attention focus, the
system is capable of evaluating the size of each shot as wide, long, medium, close-up
or extreme close-up. Cost functions for the graph edges penalise transitions between
shots that are more than two sizes apart. Lastly, the system promotes cuts-on-action
transitions by means of estimating actions as local maxima of joint attention focus
acceleration.
Let’s go into greater depth on the details of this wonderful work. An input for the
algorithm is k synchronised video streams captured from numerous positions,
possibly moving in time. The overall processing pipeline is shown in Fig. 6.3.
The first step in the pipeline is 3D camera pose estimation for every video stream.
The standard procedure widely used in computational photography is described in
Snavely et al. (2006). Given k synchronously taken frames, a few thousand interest
points are found in each frame. Classic methods for interest point detection and
description are reviewed in Mikolajczyk et al. (2005), the SIFT method being just
one. Next, for each pair of frames, interest descriptors are matched between the pair,
using the approximate nearest neighbour package of Arya et al. (1998).
More recent approaches suggest that the detection and feature-matching stages
can be avoided and, instead, features can be extracted on a dense grid across the
image. In particular, Neighbourhood Consensus Networks (NCNet) (Rocco et al.
2018) allow for jointly trainable feature extraction, matching, and match-filtering to
directly output a strong set of (mostly) correct correspondences.
Given the pairwise correspondences of interest points, a fundamental matrix for
the pair is estimated using RANSAC (Fischler and Bolles 1987). After numerous
refinements and suppression of outliers, a set of geometrically consistent matches
between each image pair is obtained. Earlier, the efficiency of using the RANSAC
algorithm for outlier suppression was demonstrated, in particular, for image
matching and coordinate transformations in document image processing (Safonov
et al. 2019, Chap. 7).
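The RANSAC loop itself is compact. The sketch below applies it to line fitting rather than fundamental-matrix estimation, to stay self-contained, but the hypothesise-and-verify structure is the same; all data and tolerances are invented.

```python
import numpy as np

rng = np.random.default_rng(2)

def ransac_line(x, y, iters=200, tol=0.1):
    """Fit y = slope*x + intercept robustly: repeatedly fit a minimal
    two-point sample and keep the hypothesis with the most inliers."""
    best_inliers = np.zeros(len(x), dtype=bool)
    for _ in range(iters):
        i, j = rng.choice(len(x), size=2, replace=False)
        if x[i] == x[j]:
            continue                               # degenerate sample
        slope = (y[j] - y[i]) / (x[j] - x[i])
        intercept = y[i] - slope * x[i]
        inliers = np.abs(y - (slope * x + intercept)) < tol
        if inliers.sum() > best_inliers.sum():
            best_inliers = inliers
    # Refit on the consensus set, as is standard after RANSAC.
    slope, intercept = np.polyfit(x[best_inliers], y[best_inliers], 1)
    return slope, intercept, best_inliers

x = np.linspace(0, 1, 30)
y = 2.0 * x + 1.0
y[::7] += 5.0                                      # gross outliers
slope, intercept, inliers = ransac_line(x, y)
print(round(slope, 2), round(intercept, 2))        # recovers 2.0 and 1.0
```

In the multi-camera pipeline the "model" is a fundamental matrix and the residual is an epipolar distance, but the outlier-suppression logic is exactly this.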
Next, a set of camera parameters and a 3D location for each pair of interest points
are recovered. The recovered parameters should be consistent, in that the
reprojection error, i.e. the sum of distances between the projections of each track
and its corresponding image features, is minimised. This minimisation problem is
formulated as a non-linear least squares problem and solved iteratively, starting with
one pair of frames and adding more frames one by one to eliminate bad local
minima, which are known to affect large-scale Structure from Motion problems.
The pose of each camera is estimated using a perspective-n-point algorithm (Lepetit
et al. 2009).
The second step in the pipeline is to use the gaze clustering algorithm of Park et al.
(2012) to extract 3D points of joint attention (JA-point) in the scene at each time
instant. All gaze concurrences g are calculated in the scene through time, and their
importance rank(g) is counted as the number of camera gazes that intersect at that
point. Thus, this process produces multiple 3D locations of joint interest, and the
algorithm uses them all during its reasoning about producing the cut.
The third step in the pipeline is trellis graph construction, where each node
represents a camera paired with a 3D joint attention point estimated via gaze
concurrences; nodes are laid out in slices, each slice corresponding to a time
instant. The edges in the graph connect all nodes in a slice to the nodes in the next slice.
Both the nodes and edges of the graph are weighted. The node weights combine such
parameters of a camera frame as:
• Stabilisation cost (to limit shaky camera movement, camera shakiness estimated
from camera trajectory across time)
• Camera roll cost (to enforce camera alignment to the horizon line)
• JA-point importance rank
• 2D location of JA-point projection in the frame (to penalise JA-points lying
outside the 10% margin of the frame resulting in centring the frame around the
main point of interest of the narrative, or stabilising shaky footage, or reducing
distortion in wide FOV cameras, or creating more aesthetic compositions)
• 3D distance between the JA-point and the node’s camera centre (to eliminate
cameras that are too far or too close)
The edge weights of the graph combine:
• Absolute angle difference between the two front vectors of adjacent frames of the
same video feed (continuous feed case; no cut is produced).
• Transition angle (ensures small overlap between the frames and different back-
ground arrangements).
• Distance between the two cameras (ensures small angle change between the
transition frames).
• Shot size (the size of each shot is identified as wide, long, medium, close-up or
extreme close-up, according to the distance from the JA-point; transitions
between shots whose sizes are more than two levels apart can be confusing for
the viewer and should be penalised).
• Acceleration of the 3D JA-point as a measure for action change (to promote cut-
on-action transitions).
The final step in the pipeline is optimal path computation in the graph. A path in
the trellis graph starting at the first slice and ending at the last defines an output
movie whose length matches the length of the original footage. Following continu-
ous edges in the path continues the same shot in the movie, while following
transition edges creates a cut. The cost of a path is the sum of all the edge weights
and node costs in the path. Once the graph is constructed, it becomes possible to find
the “best” movie by choosing the lowest cost path. Dijkstra’s algorithm or the
modified dynamic programming algorithm proposed by Arev et al. (2014) can be
used to find the lowest cost path.
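Once node and edge costs are defined, the lowest-cost path can be found slice by slice with standard dynamic programming. The sketch below is only an illustration of the search: its toy edge cost keeps a single shot-size penalty and equates "same size" with "same camera", which is far simpler than the full cost model of Arev et al. (2014).

```python
def edge_cost(size_a, size_b):
    """Toy edge weight: free to continue the same shot, a small cost to cut,
    and a large penalty for cuts more than two shot-size levels apart."""
    if size_a == size_b:
        return 0.0                      # continuous feed in this toy setup
    return 1.0 + (10.0 if abs(size_a - size_b) > 2 else 0.0)

def best_path(node_costs, shot_sizes):
    """node_costs[t][c]: per-node cost; shot_sizes[t][c]: size level per camera."""
    n_slices, n_cams = len(node_costs), len(node_costs[0])
    cost = list(node_costs[0])
    back = []
    for t in range(1, n_slices):
        new_cost, choices = [], []
        for c in range(n_cams):
            prev = min(range(n_cams), key=lambda p: cost[p] +
                       edge_cost(shot_sizes[t - 1][p], shot_sizes[t][c]))
            choices.append(prev)
            new_cost.append(cost[prev] +
                            edge_cost(shot_sizes[t - 1][prev], shot_sizes[t][c]) +
                            node_costs[t][c])
        back.append(choices)
        cost = new_cost
    # Backtrack from the cheapest final node.
    path = [min(range(n_cams), key=lambda c: cost[c])]
    for choices in reversed(back):
        path.append(choices[path[-1]])
    return list(reversed(path))

node_costs = [[0.0, 5.0], [4.0, 0.0], [0.0, 5.0]]
shot_sizes = [[1, 4], [1, 4], [1, 4]]
print(best_path(node_costs, shot_sizes))   # [0, 0, 0]
```

Here the optimiser stays on camera 0 despite its momentarily higher node cost, because cutting to camera 1 and back would incur the large shot-size penalty twice, which is the qualitative behaviour the trellis formulation is designed to produce.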
For algorithm evaluation, the following methods have been proposed: compari-
son of automatically edited clips with manually edited clips and random cuts made of
the same footage. Among the metrics proposed is the time of processing (ranging
from 20 hours on average for manual editing of a few minutes’ clip to 6–17 minutes
of automatic editing including rendering). Another metric is the diversity of time
between the cuts and how well it follows the pace of actions in the scene. Yet another
way to compare automated video editing algorithms is by counting the cinematography
rule violations (e.g. 180° line of action rule violations) and the diversity of the
shot sizes of the transitions.
The proposed method relies heavily on the availability of multiple video sources
taken from different angles to make it possible to derive a joint attention point in 3D
Fig. 6.4 A visualisation of three shots from a coherent video cut of a social event. In this case, eight
social cameras record a basketball game. Their views are shown at three different times. The 3D top
view shows the 3D camera poses, the 3D joint attention estimate (blue dots) and the line-of-action
(green line). Using cinematographic guidelines, the quality of the footage and joint attention
estimation, our algorithm chooses times to cut from one camera to another (from the blue camera
to purple and then to green). (Reproduced with permission from Arev et al. 2014)
space. Cinematography rules are hard-coded into the system. The system is reported
to show the additional benefit of improving the visual quality of the resulting films
by means of applying a crop and stabilising some shots in order to achieve the
transition between shot sizes dictated by cinematography rules (Fig. 6.4).
This is an impressive result, but we would suggest adding an evaluation of the
general aesthetics of a frame to prevent the inclusion of technically defective footage
in the resulting film. Good results in determining the aesthetical score of a photo
image were shown by Jin et al. (2016). They used a crowdsourced collection of rated
photos to train a convolutional neural network to directly regress the aesthetical
score of an image. When we analysed the general patterns of their scoring, we found
that their system penalises basic technical defects like blurred images, skyline slopes,
face occlusions, etc. Bearing in mind that professional video editors may use these
widely as creative effects rather than defects, we are still convinced that, for
social video device users, these issues should be penalised as probable technical defects.
Examples of aesthetical scores are shown in Fig. 6.5 (scores range from 0, “bad”, to
1, “excellent”). Examples (a) to (g) correlate well with common sense, while
(h) assigns a relatively low score of 0.18 to a frame extracted from one of the
greatest masterpieces of the twentieth century.
Leake et al. (2017) presented a system for automatically editing video of dialogue-
driven scenes. Given a script and multiple takes of video footage, this system
performs sentiment analysis of the text spoken and facial key point analysis in
order to determine which video clips are associated with each line of dialogue, and
whether or not the performer speaking the line is visible in the corresponding clip.
Fig. 6.5 Examples of aesthetic scores of video footage: (a, b) GoPro frames, score depends on
horizon alignment; (c) high score for frame from “L.A. Confidential”, 1997 by Curtis Hanson; (d)
high score for frame from “Apocalypse Now”, 1979 by Francis Ford Coppola; (e, f) GoPro frames,
score depends on face lighting; (h) low score assigned to a frame from “Last Tango in Paris”, 1972
by Bernardo Bertolucci
For shot size attribution, they use a face detector to determine the median area of a
face in the frame. Basic film-editing rules are encoded as probability functions for
starting a scene with a particular clip or performing a transition between the takes.
Such cinematography rules as “start wide”, “avoid jump cuts”, “speaker visible” and
“intensify emotion” are encoded using the attributes obtained from the text senti-
ment, face area and aligning dialogue lines with clips. The system provides a
graphical user interface capable of composing several editing rules and controlling
the pace of the final video.
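A minimal version of the shot-size attribution from the median face area might look as follows; the area thresholds are invented for illustration and are not taken from Leake et al. (2017).

```python
def shot_size_from_face(face_area, frame_area):
    """Classify shot size from the fraction of the frame occupied by the
    (median) detected face. Thresholds are illustrative assumptions."""
    ratio = face_area / frame_area
    if ratio > 0.15:
        return "close-up"
    if ratio > 0.05:
        return "medium"
    if ratio > 0.01:
        return "long"
    return "wide"

# A 200x200-pixel face in a Full HD frame occupies about 1.9% of the area.
print(shot_size_from_face(face_area=40_000, frame_area=1920 * 1080))  # long
```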
In order to evaluate the results, automatically edited videos were compared with
manually edited film clips. It is reported that a professional video editor needed
2–3 hours to perform the job. Due to the heavy use of face detection
algorithms, the time taken for automatic editing was comparable, but no human
interference was required during that time. The proposed system is limited to
dialogue-based videos only and requires a script with the performers’ names
labelled.
As shown in Merabti et al. (2015), it is possible to model the editing process using a
Hidden Markov Model (HMM). They hand-annotate existing film scenes to serve as
training data. In particular, they annotate shot sizes and action types as “main
character entering scene”, “both characters leave scene”, etc. They also hand-
annotate the communicative values of utterances in the dialogue following the way
filmmakers construct dialogue: symbolic, moral and narrative communicative
values, as these feature a stronger relationship with the type of shot used to portray
GoPro Inc. discloses a set of patents covering its Quik software for automated video
editing (Médioni 2017; Matias and Phan 2017). According to these publications, the
following use case for automated video editing may be implemented for
non-professional users of action/sports video cameras:
• User highlights favourite moments in raw footage (see Fig. 6.6 for example of
user interface).
• Application applies a pretrained spatio-temporal convolutional neural network to
calculate semantic feature vectors of the highlights.
• Application trains an LSTM recurrent neural network to predict the semantic
feature vector of the next video segment from the previous segment’s feature
vector.
• Application finds similar video segments in the raw footage by applying the
trained LSTM network and analysing the difference between the predicted
semantic feature vectors and the actual feature vectors.
• Application composes a shortened movie, giving priority to video segments
similar to the user-highlighted ones; the movie duration and tempo may be
defined by music selected by the user.
The authors used video segments with a fixed number of frames, e.g. 16 or
24 RGB frames of 112×112 pixels, as input data for learning user highlights. Such
segments preserve the spatio-temporal relations of the video signal while being
relatively stationary, as described in Sect. 6.1.3.
For calculation of the semantic vectors, the pretrained 3D convolutional neural
network (CNN) is used. Figure 6.7a shows the overall CNN structure, and Fig. 6.7b
shows the inception block structure. The network is trained with the Sports-1M data
set (Karpathy et al. 2014) to classify the actions in the video into 487 classes. The
Fig. 6.7 Overall 3D CNN structure (a) and inception block structure (b). (Image is reproduced
from the patent by Médioni 2017)
Softmax classification layer is used only at the pretraining phase. At the inference
time, the output of the final layer with a dimension of 1000 is used as a feature vector
for the video segment.
More information can be found in Chap. 7 (Real-Time Detection of Sports
Broadcasts Using Video Content Analysis).
At the next stage of the video editing pipeline, an LSTM module is trained with
raw video content including user highlights with the goal of predicting the next
and/or previous spatio-temporal feature vectors in video highlights. After the train-
ing, the process determines the presence of one or more highlight moments within
the video content based on a comparison of the spatio-temporal feature vectors with
predicted spatio-temporal feature vectors.
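The comparison step can be sketched as follows. This is a stand-in illustration, not the patented method: the identity "predictor" is a placeholder for the trained LSTM, and the error threshold and feature vectors are invented.

```python
import numpy as np

def predict_next(feature):
    """Placeholder for the trained LSTM: predicts the next segment's
    feature vector from the previous one (here, trivially, a copy)."""
    return feature

def highlight_moments(features, max_error=0.5):
    """Indices of segments whose actual feature vector is close to the
    prediction, i.e. segments that behave like the learnt highlights."""
    hits = []
    for i in range(1, len(features)):
        predicted = predict_next(features[i - 1])
        if np.linalg.norm(features[i] - predicted) < max_error:
            hits.append(i)
    return hits

features = [np.array([1.0, 0.0]), np.array([1.1, 0.1]),
            np.array([5.0, 5.0]), np.array([5.1, 4.9])]
print(highlight_moments(features))   # segments 1 and 3 match their predictions
```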
Having identified the video segments close enough to the user highlights, the
application performs automatic video editing using hard-coded cinematography
rules, and temporal priors such as background music events (peaks and/or troughs).
Magisto Ltd. discloses a set of patents (Rav-Acha and Boiman 2016, 2017; Boiman
and Rav-Acha 2017) covering its software for automatic video editing. The company
(now acquired by Vimeo) claims that it performs numerous kinds of visual, audio
and storytelling analysis:
• Action analysis
• Camera motion analysis
• Video stabilisation
• Face detection, recognition and indexing
• Scene analysis
• Objects detection and tracking
• Speech detection
• Audio classification
• Music analysis
• Topic analysis
• Emotion analysis
The long list of features above is implemented within a unified media content
analysis platform which the authors denote as the media predictability framework.
The predictability framework is a nonparametric probabilistic approach for media
analysis, which is used for all the basic building blocks that require high-level media
analysis: recognition, clustering, classification, salience detection, etc. The predict-
ability measure is defined as follows: given a query media entity d and a reference
media entity C (e.g. portions of images, videos or audio), we say that d is predictable
from C if the likelihood P(d | C) is high and unpredictable if it is low. For instance, if
a query media is unpredictable given the reference media, we might say that this
media entity is interesting or surprising. Yet another example: we can associate a
photo of a face with a specific person if this photo is highly predictable from other
photos of that person.
Descriptor Extraction Daisy descriptors (Tola et al. 2010) are used, which compute
a gradient image and then, for each sample point, produce a log-polar sampling
(of size 200). Video descriptors describe space-time regions (e.g. three frames,
yielding a descriptor of length 200 × 3 = 600 around each sample point). Video
descriptors may be sampled at interest points, and a descriptor-space representation
with reduced dimensionality is obtained by one of the known methods.
Density Estimation Given both the descriptor-space representatives {q1,. .., qL}
and the descriptor set extracted from the reference C ¼ {f1,. .., fK}, the next step is
likelihood estimation. The log likelihood log P(qi) of each representative qi is
estimated using the nonparametric Parzen window probability density estimation
method using a Gaussian kernel.
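The density estimation step can be sketched as follows. The function name, the isotropic Gaussian kernel and the fixed bandwidth are assumptions of this sketch, not details given by the authors:

```python
import numpy as np

def parzen_log_likelihood(query, reference, bandwidth=1.0):
    """Estimate log P(q) for each query descriptor with a Parzen window
    (kernel density estimate) over the reference descriptor set, using an
    isotropic Gaussian kernel of the given bandwidth."""
    query = np.atleast_2d(query)          # (L, D) query representatives
    reference = np.atleast_2d(reference)  # (K, D) reference descriptors
    K, D = reference.shape
    # Squared Euclidean distance between every query and reference point.
    d2 = ((query[:, None, :] - reference[None, :, :]) ** 2).sum(-1)
    # Log of the normalised Gaussian kernel value for each pair.
    log_norm = -0.5 * D * np.log(2 * np.pi * bandwidth ** 2)
    log_kernels = log_norm - d2 / (2 * bandwidth ** 2)
    # log P(q) = log((1/K) * sum_k kernel(q, f_k)), via log-sum-exp for stability.
    m = log_kernels.max(axis=1, keepdims=True)
    return m.squeeze(1) + np.log(np.exp(log_kernels - m).sum(axis=1)) - np.log(K)
```

A query descriptor lying amid the reference descriptors receives a higher log likelihood (is more "predictable") than one far away from them.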
Media Predictability Score Video descriptors with their corresponding probability
density estimations are stored in a database. In order to evaluate the media likelihood
P(d | C), a weighted k-nearest neighbours method is used. The predictability
PredictabilityScore(M1 | M2) of media entity M1 given the media entity M2 as a
reference is computed. Similarly, the predictability PredictabilityScore(M2 | M1) of
media entity M2 given the media entity M1 as a reference is computed. The two
predictability scores are combined to produce a single similarity measure. As a
combination function, one can use the “average” or the “max” operators.
The basic building blocks that require high-level media analysis (recognition, clustering, classification, salience detection, etc.) can now be defined with the presented
framework. For example, the classification block computes the PredictabilityScore
(d | Ci) of the query media entity d for each class Ci. The classification decision may
be the highest scored predictability or the posterior probabilities computed using the
nonparametric Parzen window estimation. The saliency block tries to predict a
portion of a media entity (It) based on the previous media entity portions (I1, ..., It−1)
that precede it. This can also indicate that this point in time is “surprising”, “unusual”
or “interesting”. Let d be some query media entity, and let C denote the reference set
of media entities. The saliency of d with respect to C is defined as the negative log
predictability of d given C (i.e. −log PredictabilityScore(d | C)). Using this notation,
one can say an event is unusual if its saliency measure given other events is high.
The importance measure is used to describe the importance or the amount of
interest of a video clip for some application. This measure is subjective; however, in
many cases it can be estimated with no user intervention using attributes such as the
following:
6 Automatic Video Editing 169
• Camera wandering may indicate that the photographer is changing the focus of
attention; shaky camera movements also indicate that the scene is less important.
• Camera zoom is usually a good indication for high importance because, in many
cases, the photographer zooms in on some object of interest to get a close-up view
of the subject.
• Face close-up, speech recognition and laughter detection are all good indicators
for the higher importance of the corresponding scene.
Given a visual entity d (e.g. a video segment), the attributes above can be used to
compute the Importancy measure as a weighted sum of all the attribute scores:
Importancy(d) = Σi max(αi si, 0),

where αi are the relative weights of each attribute. A video editing score for a video editing selection of clips c1, ..., cn is defined as:

EditingScore(c1, ..., cn) = Σi Importancy(ci).
Finally, the authors pose the problem of automatic video editing as an optimisation of the editing score above, given some constraints (e.g. that the total length of all the selected sub-clips is not longer than a predefined value). This highly noncontinuous function can be approximately optimised using stochastic optimisation techniques (e.g. simulated annealing, genetic algorithms). As an example of a
greedy algorithm, Boiman and Rav-Acha (2017) suggest sorting the editing selection of clips c1, ..., cn in descending order of their editing score and taking the first k clips from the ordered list, given the constraints mentioned above.
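The Importancy measure and the greedy selection above can be sketched as follows. The clip representation (a dict with `score` and `duration` fields) is a hypothetical data structure of this sketch, not the authors':

```python
def importancy(attribute_scores, weights):
    """Importancy(d) = sum_i max(alpha_i * s_i, 0): a weighted sum of
    attribute scores, with negative contributions clipped to zero."""
    return sum(max(a * s, 0.0) for a, s in zip(weights, attribute_scores))

def greedy_edit(clips, max_total_length):
    """Greedy selection: walk the clips in descending order of editing
    score and keep every clip that still fits within the total duration
    constraint.  Each clip is a dict with 'score' and 'duration' keys."""
    selected, total = [], 0.0
    for clip in sorted(clips, key=lambda c: c["score"], reverse=True):
        if total + clip["duration"] <= max_total_length:
            selected.append(clip)
            total += clip["duration"]
    return selected
```

The greedy result is only an approximation of the constrained optimum, which is why the authors also mention stochastic techniques such as simulated annealing.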
An important method for improved video editing is proposed by Boiman and
Rav-Acha (2017). The authors say: “The term ‘cutaway shot’, or simply ‘cutaway’
. . . is the interruption of a continuously filmed action by inserting a view of
something else. It is usually, although not always, followed by a cut back to the
first shot. The term ‘B-roll’. . . is supplemental or alternative footage intercut with the
main shot which is referred to as the ‘A-roll’. . . B-roll is a well-known technique
used in both film and the television industry. In fiction film, B-roll is used to indicate
simultaneous action or flashbacks. In documentary films, B-roll is used in inter-
views, monologs, and usually with an accompanied voiceover, since B-rolls usually
do not have their own audio. . . As may be apparent, manually generating a video
production that involves B-roll is extremely time consuming and requires experience
in video production. It would, therefore, be advantageous to be able to automatically
generate video production that includes this feature”.
Using the same media predictability framework as described above, it is proposed
to automatically insert B-roll in moments where the visual footage is relatively
boring (e.g. a talking person who is not moving). These moments can be identified
as having a low saliency. Speech recognition and video analysis can be used to
understand the topic and add relevant footage (e.g. photos that are related to that
topic) as a B-roll. For example, detecting the words “trip” and “forest” might yield
photos taken from a forest. When detecting objects and locations in the background,
for example, if someone was taking video with the Eiffel Tower in the background, a
B-roll containing close-ups of the Eiffel Tower could be selected.
As can be seen from the review in the previous section, existing methods of
automatic film editing rely on handcrafted rules coded into the software. The rules
require hand-engineered features such as joint attention focus extracted from multiple footage sources or the parsing of dialogue scripts. Some of the systems referred
to need user input to highlight the most prominent scenes or utilise engineering
approaches for the detection of important scenes such as close-up detection or using
image classification to determine a topic of raw video materials and to cut the video
accordingly.
In this section, we explore the possibility of learning the cinematographic editing rules directly from the reference movies and applying them to new video production (Podlesnyy 2020).
Figure 6.8 shows the video footage features extraction pipeline. Frames are sampled
from the video stream of possibly several video files obtained from the user’s
gadgets. After simple preprocessing (downscaling to 227 × 227 pixels, mean colour value subtraction), the frame images are input into a GPU where a combination of three convolutional neural networks resides. A GoogLeNet (Szegedy et al. 2014)-
structured network trained to classify 1000 classes of ImageNet is used to extract a
semantic feature vector of length 1024. A network trained to regress the aesthetical
score on the AVA dataset (Jin et al. 2016) produces a vector of length 2. A network
trained to classify an image into three classes of shot sizes (close-up, medium shot,
long shot) produces a vector of length 3 of the probabilities of an image belonging to
the corresponding shot size. The vectorised attributes of every sampled frame are
stored in the Frames database.
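The preprocessing step can be sketched as follows. Nearest-neighbour resampling and the function name are assumptions of this sketch; the authors do not specify the interpolation method:

```python
import numpy as np

def preprocess_frame(frame, mean_bgr, size=227):
    """Downscale a H x W x 3 frame to size x size with nearest-neighbour
    sampling and subtract the per-channel mean colour, as in the pipeline
    described above."""
    h, w, _ = frame.shape
    rows = np.arange(size) * h // size   # source row for each output row
    cols = np.arange(size) * w // size   # source column for each output column
    small = frame[rows][:, cols].astype(np.float32)
    return small - np.asarray(mean_bgr, dtype=np.float32)
```

The resulting 227 × 227 × 3 tensor is what the three convolutional networks would consume.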
The next step of the pipeline is segmenting the video footage into coherent shots.
We use the approach described in detail in Podlesnaya and Podlesnyy (2016): to
determine shot boundaries, we analyse the vector distance between the semantic
feature vectors of neighbouring frames. If the vector distance is large enough, we
place the shot boundary there. For every separate shot, we calculate the attributes as
the mean value of the semantic feature vectors and median values of the shot size
vector and aesthetic score. The resulting attributes are stored in the Shot Attributes
database.
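The segmentation and per-shot pooling described above can be sketched as follows; the threshold value and function names are illustrative assumptions:

```python
import numpy as np

def segment_shots(frame_features, threshold):
    """Place a shot boundary wherever the Euclidean distance between the
    semantic feature vectors of neighbouring frames exceeds `threshold`;
    return (start, end) frame index ranges, end exclusive."""
    F = np.asarray(frame_features, dtype=np.float64)
    dists = np.linalg.norm(F[1:] - F[:-1], axis=1)
    cuts = [0] + [i + 1 for i, d in enumerate(dists) if d > threshold] + [len(F)]
    return list(zip(cuts[:-1], cuts[1:]))

def shot_attributes(frame_semantic, frame_shot_size, frame_aesthetic, shots):
    """Per-shot attributes as described above: mean of the semantic vectors,
    median of the shot-size probabilities and of the aesthetic score."""
    out = []
    for a, b in shots:
        out.append({
            "semantic": np.mean(frame_semantic[a:b], axis=0),
            "shot_size": np.median(frame_shot_size[a:b], axis=0),
            "aesthetic": float(np.median(frame_aesthetic[a:b])),
        })
    return out
```

Each dict would then be stored as one row of the Shot Attributes database.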
We perform the same pipeline for reference cinematography samples as well as
for user-generated video content. For reference samples, we used 63 of the 100 best
movies as listed by the American Society of Cinematographers (ASC 2019) and
processed them, excluding 2 minutes of content from the beginning and from the
end, thus eliminating captions, studio logos, etc. as irrelevant material.
The process of automatic film editing starts when the user selects the video footage they want to have edited. According to the user's selection of raw footage, the Features Preparation module reads data from the Shot Attributes database and feeds it into the Editing module comprising the learned controller model for automatic video editing. The Editing module produces a storyboard. Based on that, it is possible to compose the output video clip by cutting from the raw footage.
For semantic features extraction, the GoogLeNet (Szegedy et al. 2014) model trained
by BVLC on the ImageNet dataset (Russakovsky et al. 2014) is used. The final
classification layer is omitted and the output of layer “pool5/7x7_s1” is used as a
semantic feature vector of length 1024. After feature vectors from over 1,670,000
frames of the motion picture masterpieces mentioned in Sect. 6.2.1 have been
extracted, an incremental PCA was performed to reduce the dimensionality of the
feature vectors to 64. The residual error for the last batch on the incremental PCA
was 0.0029.
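The dimensionality reduction step can be sketched with scikit-learn's IncrementalPCA, which fits the projection batch by batch so that the 1,670,000 feature vectors never need to be held in memory at once. The batch layout is an assumption of this sketch:

```python
import numpy as np
from sklearn.decomposition import IncrementalPCA

def reduce_features(feature_batches, n_components=64):
    """Fit an incremental PCA over batches of CNN feature vectors
    (length 1024 in the pipeline above) and return the fitted model,
    which can then project any batch down to n_components dimensions."""
    ipca = IncrementalPCA(n_components=n_components)
    for batch in feature_batches:
        ipca.partial_fit(batch)   # update the projection with one batch
    return ipca
```

After fitting, `ipca.transform(batch)` yields the reduced vectors used later as shot features.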
In order to automatically detect the shot size, a classifier was trained to distinguish images between three classes: close-up, medium shot and long shot. A dataset
using detailed cinematography scripts of 72 episodes of various Russian TV series
was used. The dataset contained 566,000 frames with a nearly even distribution of
frames belonging to the three classes. The GoogLeNet-based network structure was chosen for its compact size and robustness to lighting and colouristic conditions. The network had three outputs, and the Softmax loss function was used for training. No augmentation was used for the training data except horizontal flipping. The top-1 testing
accuracy was 0.938 given the overall noisy nature of the dataset.
A trained network described by Jin et al. (2016) was used for the aesthetic scoring
of the shot. Concretely, the AVA-2 variant was used. As mentioned above, the
crowdsourced ratings of the training data are somewhat biased towards common tastes. For example, they clearly penalise the score of images with a sloped skyline or with faces partially occluded by hair. However, they do a good job in scoring low-quality images with obvious technical defects such as blur, defocus, unclear spots, etc.
For each shot, a state vector comprised the following:
• Semantic features – 64 real values (average pooling over frames in the shot)
• Shot size features – three real values (median pooling over frames in the shot)
• Aesthetic score – one (median pooling)
• Normalised Euclidean distance between the pooled semantic feature vectors of
current and previous shots
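Assembling the per-shot state vector can be sketched as follows; reading "normalised Euclidean distance" as the distance between unit-normalised vectors is an assumption of this sketch:

```python
import numpy as np

def shot_state_vector(semantic_64, shot_size_3, aesthetic, prev_semantic_64):
    """Assemble the 69-dimensional per-shot state described above:
    64 semantic values, 3 shot-size values, 1 aesthetic score and the
    normalised Euclidean distance to the previous shot's semantic vector."""
    a = np.asarray(semantic_64, dtype=np.float64)
    b = np.asarray(prev_semantic_64, dtype=np.float64)
    an = a / (np.linalg.norm(a) + 1e-9)   # unit-normalise both vectors
    bn = b / (np.linalg.norm(b) + 1e-9)
    dist = np.linalg.norm(an - bn)
    return np.concatenate(
        [a, np.asarray(shot_size_3, dtype=np.float64), [aesthetic], [dist]]
    )
```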
The plan was to use motion picture masterpieces as reference samples of good
editing. The video editing was modelled as a process of making control decisions
on whether to include a shot in a final movie or to skip it. Further, the editing rhythm
is modelled by learning to make fine-grained decisions on the duration of a shot that
was selected to be included in the final movie. In particular, the video editing process
was modelled as a sequence learning problem (Ross et al. 2011) with the Hamming
loss function. The following labels were used for shot sequence labelling (Table 6.1).
The training data were prepared as follows. As described in Sect. 6.2.2, a
sequence of shots was collected into a clip having a duration of around 2 minutes.
This could be regarded as the target movie duration. Each shot in the clip sequence
was labelled according to its duration with labels 1–4 (as in Table 6.1), producing a
reference “expert” movie cut. In order to give the model a concept of bad montage,
around 40 augmented clips were produced from each reference clip. This was
performed by randomly inserting shots taken from other movies in the masterpieces
collection, thus disrupting the author’s idea of an edit. Such a shot was assigned a
label 5. Additionally, label 5 was assigned to all shots having an aesthetical score
below some threshold (e.g. 0.1). This gave the model an idea of penalising the shots
having clear technical issues as per common tastes. As a result, a training set having
108,491 sample clips was obtained.
Vowpal Wabbit (Langford et al. 2007) was used to train the sequence learning model using DAGGER (Dataset Aggregation), an iterative algorithm that trains a deterministic policy in an imitation learning setting where expert demonstrations of good behaviour are used to learn a controller.
Given a state s, denote as C(s, a) the expected immediate cost of performing action a in state s, and denote as Cπ(s) = Ea∼π(s)[C(s, a)] the expected immediate cost of policy π in s. In imitation learning, the true costs C(s, a) for the particular task may not necessarily be known or observable. Instead, expert demonstrations are observed, and the aim is to bound the total cost J(π) for any cost function C based on how well π mimics the expert's policy π*. Denote as ℓ the observed surrogate loss function, which is minimised instead of C. The goal is to find a policy π̂ that minimises the observed surrogate loss under its induced distribution of states, i.e.:

π̂ = argminπ Es∼dπ[ℓ(s, π)].
At the first iteration, the DAGGER algorithm uses the expert’s policy to gather a
dataset of trajectories D. Then the algorithm proceeds by collecting a dataset at each
iteration under the current policy and trains the next policy under the aggregate of all
the collected datasets. The intuition behind this algorithm is that, over the iterations,
the set of inputs is built up that the learned policy is likely to encounter during its
execution based on previous experience (training iterations).
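The DAGGER loop can be sketched as follows. The three callables are hypothetical placeholders of this sketch, not APIs from the chapter or from Vowpal Wabbit:

```python
def dagger(expert_policy, train_classifier, env_rollout, n_iters=5):
    """Minimal DAGGER sketch (Ross et al. 2011).  `expert_policy(s)` returns
    the expert action for state s, `train_classifier(dataset)` fits a policy
    on (state, action) pairs, and `env_rollout(policy)` returns the states
    visited while running a policy."""
    dataset = []
    policy = expert_policy   # iteration 0: gather states under the expert
    for _ in range(n_iters):
        states = env_rollout(policy)
        # Label every visited state with the expert's action and aggregate.
        dataset += [(s, expert_policy(s)) for s in states]
        # Train the next policy on the aggregate of all collected data.
        policy = train_classifier(dataset)
    return policy
```

Because each iteration collects states under the current learned policy, the aggregate dataset covers the inputs the policy is actually likely to encounter, which is exactly the intuition stated above.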
The state vector s is constructed as follows: the action labels a of the six preceding shots are used, and the neighbouring semantic feature vectors from the sixth shot before to the third shot after the current one are added to the state vector s. The held-out loss after 32 epochs of training was 4.06, while the average length of a sequence was 21.
In Fig. 6.9a, the test footage is shown in the storyboard format. It contains an
unmodified fragment of “Cool Hand Luke”, a 1967 film edited by Stuart Rosenberg.
Figure 6.9b shows the result of automatic editing. The fact that the algorithm has modified the fragment shows that the model has not simply overfitted, because “Cool Hand Luke” was used for training. One may observe that the algorithm has shortened the footage
and deleted some beautiful shots that look foreign to the overall story, e.g. a truck
near the beginning of the fragment. Note that the transition between the medium
1 shot and the truck (a very long distance shot) is quite abrupt, and since the model
had learned average rules of editing it removed the truck shot due to the unusual
transition. The overall cut of the resulting clip looks smooth.
Figure 6.10a shows test footage constructed from a fragment of the movie not
seen by the model during training. It is “Gagarin the First in Space”, 2013, edited by
Fig. 6.9 A fragment of “Cool Hand Luke”, 1967, editor Stuart Rosenberg: (a) original footage; (b)
result of automatic edit
Pavel Parhomenko. The fragment was augmented by randomly inserting some shots
taken from random places in the same movie. In Fig. 6.10b the result of auto-editing
is presented. The shot with a close-up on the radio has been correctly removed, but a
few other foreign shots have been left. However, the overall cut looks smooth, and
even in colouristic format it shows a nice gradual change of tone throughout the
length of the clip.
In Fig. 6.11a a fragment from “Das Boot”, 1981, edited by Wolfgang Petersen, is
shown augmented by inserting a few shots from “Fanny and Alexander”, 1982,
edited by Ingmar Bergman. Note that the inserted shots have a very similar tone, with close-up faces looking in a direction that breaks the 180-degree rule of action: in Das
Boot an officer speaks to a crew member in front of him, so the correct cut would be
to montage shots with facing directions. The model trained by imitation learning
correctly removed the wrong shots, as shown in Fig. 6.11b.
In order to estimate how well the proposed method learns basic video editing rules
from unlabelled reference samples of motion picture masterpieces, the numbers of
Fig. 6.10 A fragment of “Gagarin the First in Space”, 2013, editor Pavel Parhomenko: (a) original
footage. Random shots inserted at positions 4, 8–10, 14–15 from the top; (b) the result of automatic
editing
transitions between the shot sizes in the reference footage, in raw non-professional
footage and in the automatically edited clips have been manually counted.
According to cinematography editing rules, the following shot sizes are common:
detail, close-up, medium 1, medium 2, long shot, and very long shot. It was advised,
for example, by Kuleshov in the early twentieth century, that transitions between the
shots should occur with two size steps, e.g. between the medium 1 shot and the long
shot. Transitions between the outer shot sizes – detail and very long shot – can be
done in one step, i.e. detail to close-up and long shot to very long shot.
Fig. 6.11 A fragment of “Das Boot”, 1981, edited by Wolfgang Petersen: (a) original footage.
Foreign faces inserted at positions 16, 18, 20; (b) the result of automatic editing
In order to evaluate whether the trained DAGGER model has learned the very
basic principles of video editing from the data features extracted from motion picture
masterpieces, the numbers of transitions between shots of different sizes in three
corpuses of footage: clips sampled from the masterpieces dataset, clips sampled from
non-professional video footage and clips automatically edited by the DAGGER
model, have been calculated. Figure 6.12 shows a normalised representation of the
distribution of transitions between the shots. It is easy to see that the distribution for the automatically edited clips is much closer to that of the motion picture masterpieces than the distribution for the raw non-professional footage.
Fig. 6.12 Normalised representations of distribution of transitions between the shots: (a)
non-professional video; (b) motion picture masterpieces; (c) automatically edited by our algorithm
Along with Médioni (2017) and Matias and Phan (2017), we state the problem of
video editing as finding an optimal path in a graph. Unlike these two works, we do
not construct a bipartite graph of possible transitions between the footage taken by
multiple video cameras.
Consider an acyclic directed weighted graph G(V, E). Its vertices V = {v1, ..., vN}
are video takes, each having approximately homogeneous visual content. The
vertices may be ordered naturally by video takes time codes or reordered by a
user. We add two special vertices to V: v0 and vN+1, which correspond to the film's beginning and end markers. Each vertex is characterised by its video take duration ti.
Graph edges E = {eij} are defined only for i < j and correspond to a possible montage transition from video take i to take j. Graph edges eij are characterised by the cost function wij.
In order for the edited film to include at least one video take, the following
conditions should be met:
∀ j < N + 1 ∃ e0,j,
∀ i > 0 ∃ ei,N+1.
In this section, we discuss one of the most important parts of the cost function:
transition quality. In the spirit of the previous section on imitation learning, we are
going to use motion picture masterpieces as reference samples of good editing.
Based on many publications on semantic indexing including our own (Podlesnaya
and Podlesnyy 2016), we hypothesise that the feature vector extracted from a frame
image contains information on the frame semantics, colour tone and subtle geometric
properties required for video editing decisions mimicking the experts’ actions.
To illustrate the intuition behind montage transition quality evaluation, in Fig. 6.14 the feature vectors fi, fi+1, fi+2, fi+3 for video takes #i, ..., #i+3 are
drawn as bold dots. These videos are taken from a sequence of raw video materials
provided for video editing.
It is possible to find all the feature vectors of the reference cinematographic
materials (motion picture masterpieces) within the circle of radius ri centred at fi.
These closest neighbours, drawn as small dots in Fig. 6.14, will be expert takes
having the closest visual contents to our given raw material take #i.
Consider the circles in the feature vector space, centred at fi+1, fi+2, fi+3. Let us
count all the montage transitions in the reference cinematographic materials from the
neighbours of take #i. As shown in Fig. 6.14, some of the transitions do not fall near
any circles around our raw takes. One transition is found to arrive into a neighbourhood of fi+3, and two transitions arrive into that of fi+1. None has arrived into the neighbourhood of fi+2.
The transition from take #i into take #i + 1, having the maximum count, may be
regarded as the reference in the sense that, in a case of the visual contents of a take #i,
the experts have most often chosen to make the transition into a take that is visually
similar to #i + 1.
In order to score the transition quality numerically, it is possible to normalise the
counts of reference transitions either, for example, by the maximum transitions count
for a given raw materials set or by the average value of the said count for the
reference cinematographic materials corpus. In the former case, the transition quality
may exceed 1.
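The transition counting described above can be sketched as follows. The brute-force neighbourhood test and the fixed radius are simplifications of this sketch; a real implementation would use an approximate nearest-neighbour index over the reference corpus:

```python
import numpy as np

def transition_counts(raw_takes, ref_takes, ref_transitions, radius):
    """For every ordered pair (i, j), i < j, of raw takes, count how many
    reference montage transitions (a -> b) lead from the neighbourhood of
    raw take i into the neighbourhood of raw take j.  Neighbourhoods are
    balls of the given radius in the shared feature-vector space."""
    raw = np.asarray(raw_takes, dtype=np.float64)
    ref = np.asarray(ref_takes, dtype=np.float64)
    # near[i, k] is True when reference take k lies in the ball around raw take i.
    near = np.linalg.norm(raw[:, None, :] - ref[None, :, :], axis=2) <= radius
    n = len(raw)
    counts = np.zeros((n, n), dtype=int)
    for a, b in ref_transitions:          # reference transition: take a -> take b
        for i in range(n):
            if not near[i, a]:
                continue
            for j in range(i + 1, n):
                if near[j, b]:
                    counts[i, j] += 1
    return counts
```

Normalising `counts` (by the maximum count for the raw set, or by the corpus average) then yields the transition quality values discussed above.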
As shown before, the optimal editing of raw video materials is modelled as finding
the shortest path in a graph G(V, E) constructed from video takes as vertices and all
possible montage transitions as edges. Intuitively, an inverse of the image quality
metric and inverse of the montage transition quality metric described above are
possible candidates for the cost function. However, the cost function penalising just
the quality may result in degenerate solutions with zero or single video takes
included in the edited film. Therefore, the cost function wij for the edge weights
should enable scalarised multi-objective optimisation on the graph.
Let us consider the linear scalarisation of the multi-objective cost function wij:
wij = κ/Qij + μ Dij + η Σs=i+1…j−1 Us + λ (T/N − tj)²,
where κ is the weight of the transition quality objective, μ is the weight of the
monotonic content penalty, η is the weight of the video take skipping penalty, λ is
the weight of the total desired film length objective, Qij is the transition quality
metric, Dij is the distance metric between takes i and j, Us is the unique value metric
of the skipped take, tj is a video take j duration, T is the target duration of the edited
film, and N is the number of raw video takes.
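The edge cost and the shortest-path search can be sketched as follows. The dense edge enumeration, the cost callable and the container layout are simplifications of this sketch:

```python
import heapq

def edge_cost(i, j, Q, D, U, t, T, N, kappa, mu, eta, lam):
    """w_ij = kappa/Q_ij + mu*D_ij + eta*sum_{s=i+1}^{j-1} U_s
            + lam*(T/N - t_j)^2, the scalarised cost defined above."""
    skip_penalty = sum(U[s] for s in range(i + 1, j))
    return (kappa / Q[i][j] + mu * D[i][j]
            + eta * skip_penalty + lam * (T / N - t[j]) ** 2)

def shortest_path(n_takes, cost):
    """Dijkstra from the begin marker (vertex 0) to the end marker
    (vertex n_takes + 1); cost(i, j) gives the weight of edge e_ij, i < j.
    Returns the list of selected take indices."""
    end = n_takes + 1
    dist, prev = {0: 0.0}, {}
    heap = [(0.0, 0)]
    while heap:
        d, i = heapq.heappop(heap)
        if i == end:
            break
        if d > dist.get(i, float("inf")):
            continue                      # stale heap entry
        for j in range(i + 1, end + 1):
            nd = d + cost(i, j)
            if nd < dist.get(j, float("inf")):
                dist[j], prev[j] = nd, i
                heapq.heappush(heap, (nd, j))
    path, v = [], end
    while v != 0:
        path.append(v)
        v = prev[v]
    return list(reversed(path))[:-1]      # drop the end marker
```

Because all edges go from lower to higher indices, the graph is acyclic and Dijkstra terminates quickly; limiting j to a window around i (as done later with a cap of 30 transitions) reduces the quadratic edge count.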
The transition quality metric is defined as

Qij = αj (ε + |KNNi → KNNj|),

where αj is the image quality metric based on either (Jin et al. 2016) or the nearest neighbour method described in Sect. 6.3.2; ε is a small constant for numerical stability; and |KNNi → KNNj| is the number of montage transitions in reference materials from neighbourhood i into a neighbourhood j.
Thus, the transition quality metric combines an evaluation of both image aesthetics and montage transition aesthetics.
The distance metric between takes is as follows:
Dij = 1 / (ε + ||FVi − FVj||2),
where ε is a small constant for numerical stability and FVk is the semantic feature
vector of video take k (the mean value for the number of frames comprising the take).
The distance metric between the takes constitutes the monotonic content penalty.
It appears in practice, for example, that due to a superior image or transition quality
of the subsample of video takes, the shortest path-finding algorithm aggressively
excludes all vertices except for very similar ones, resulting in a dull monotonic
narration. It is advised that the weight μ be selected interactively by the user.
The unique value metric of the skipped take is based on the ideas of Uchihachi
et al. (2003). We propose the following algorithm for the calculation of Us.
1. Given a tuple of raw materials feature vectors x = {FV1, ..., FVN}.
2. Produce a tuple of Euclidean distances between the consecutive feature vectors: d = {||FV1 − FV2||2, ..., ||FVN−1 − FVN||2}.
3. Clusterise the vectors x by the DBSCAN algorithm using density radius eps = k·STD(d), ranging the coefficient k between 0.3 and 1.3. Select the clustering producing the maximum class variance for x.
4. Produce a tuple of class labels L = {l1, ..., lN}. For x members not assigned to any class (outliers), use label −1.
5. Calculate the unique value metric for every video take.
Uk = β, if lk = −1;
Uk = −ln((1/N) Σi 1(li = lk)), otherwise,
where β is an interactively adjusted coefficient of the unique value metric over the
transition quality and 1(∙) is an indicator function valued as 1 if the argument is the
true condition and 0 if false.
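The algorithm above can be sketched with scikit-learn's DBSCAN. Using a single fixed coefficient k instead of the 0.3–1.3 variance-maximising sweep is a simplification of this sketch:

```python
import numpy as np
from sklearn.cluster import DBSCAN

def unique_value_metric(feature_vectors, beta, k=0.8):
    """U_k per take: beta for DBSCAN outliers (label -1), otherwise the
    negative log of the fraction of takes sharing the take's cluster label."""
    x = np.asarray(feature_vectors, dtype=np.float64)
    # Distances between consecutive feature vectors set the density scale.
    d = np.linalg.norm(np.diff(x, axis=0), axis=1)
    eps = max(k * float(np.std(d)), 1e-9)
    labels = DBSCAN(eps=eps, min_samples=2).fit_predict(x)
    n = len(x)
    U = np.empty(n)
    for idx, lbl in enumerate(labels):
        if lbl == -1:
            U[idx] = beta                 # rare, unclustered content
        else:
            U[idx] = -np.log(np.count_nonzero(labels == lbl) / n)
    return U
```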
Thus, the unique value metric of the skipped take prevents degenerate solutions to the shortest path-finding problem by penalising vertex skipping. The amount of the penalty depends on the content's uniqueness metric. Unique content is rare in the raw materials and therefore is not assigned to any cluster; instead, the interactively chosen value β is used. On the other hand, important content is
normally that which occupies the greatest duration of the raw footage, so it is
clustered in larger classes, and the resulting metric reflects the content's importance as well as its uniqueness.
Fig. 6.15 An example storyboard of automatic editing of non-professional video footage: (a) raw
footage; (b) automatically edited clip
To speed up the path-finding algorithm, the maximum number of possible transitions from every shot was limited to 30. Besides the performance boost, this method serves as a regulariser against aggressive optimising by Dijkstra.
The resulting video clip has a duration of 39 seconds. In the first half of the clip,
we may watch a few middle shots having different backgrounds and character poses
selected from a very long take captured by a camera facing the cyclist. Note that the
resulting clip includes a nice shot of the sun and moments when the character was
smiling or made expressive gestures. The rest of the clip is cut from the shots taken
by the camera rotated towards the front of the bike. From countless long shots, the
algorithm has selected the moments showing two different types of scenery, and,
from these, the ones with an almost horizontal skyline were selected, given that the
raw footage contained 90% of frames with a skewed horizon. Note that none of the
above outcomes is hard-coded into the algorithm as an editing rule; the whole
process was completely data-driven.
In order to evaluate the automatic video editing quality, we prepared ten raw footage
collections described in Table 6.2. All footage was taken by non-professional users
and mostly filmed family members in various settings.
Every footage set was automatically edited using the algorithm described in
Sects. 6.3.2 and 6.3.4.
To evaluate the montage quality of the automatically edited videos formally, we
manually counted the numbers of transitions between the takes of different scales
and checked if they met the cinematography editing rules. Table 6.3 summarises the
results of these calculations. For best results it is advised in cinematography that the
transition happens between shot sizes that differ by two steps from each other, while
for the closest and the longest shot sizes, one step is also recommended. In Table 6.3,
the cells corresponding to the recommended transitions are marked with a bold
outline. These cells should contain the majority of transition counts for the test
footage edited by the algorithm. As seen from Table 6.3, two types of transitions
clearly break the “rules”:
• Close-up → long distance
• First middle → second middle
This may be explained by the fact that the non-professional footage used for automatic editing most often contains close-ups and long distance shots, and the software could do nothing more than use the available shots for editing. The lack of representational power of the feature vectors to distinguish between the first and the second middle shots could be the reason for the second “rule-breaking” result.
Next, to evaluate the automatic video editing quality, we manually counted the
ratio of frames with low technical quality (frames subsampling was used to reduce
the amount of manual work). The following technical problems with video images
were taken into account:
• Out of focus or blurry image
• Wrong exposure (too bright or too dark)
• Skewed horizon line
• Messy/unclear content
• Scene occlusion
The overall mean ratio of defective frames in the edited clips was 0.15 (with a
standard deviation of 0.08). However, some clear outliers were apparent. For
example, in the ski resort videos captured by the GoPro camera attached to the
skier’s helmet, horizon skew was the predominant type of technical defect, and the
automatic editing algorithm failed to select the optimal content to cut the clip. In the
park outdoor footage, almost 17% of the frames were classified by an expert as
“messy/unclear”. If obvious outliers are removed, the mean ratio of defective frames
in the edited clips becomes 0.09 (with a standard deviation of 0.05).
It is possible to further reduce the ratio of defective frames in the resulting clips by
means of raising the value of α in Sect. 6.3.3. However, this may result in too
aggressive video shortening and a lack of moments of interest in the edited clip.
One of the well-known video editing rules is to “avoid jump cuts”. The jump cut
effect happens when a subject is filmed, and after the editing, the subject position is
changed horizontally by 1/3 or more of the frame width. We manually counted the
number of jump cuts in the automatically edited clips, and the mean ratio of jump
cuts was 0.01 (standard deviation equal to 0.02).
Sharp changes of colour tone between two consecutive shots in the cut are
regarded as bad editing. A manual calculation of abrupt colour tone changes
obtained a mean ratio value of 0.1 (standard deviation equal to 0.07).
The average shortening of the raw footage into an edited clip, given the default
values of automatic editing parameters, is 89%. However, the exact duration of the
resulting video clip depends on the parameters intended for interactive adjustment;
therefore, it is not practical to formally evaluate an algorithm by this parameter.
Figures 6.16 and 6.17 show a few examples of automatically edited video clips
made from non-professional footage.
Fig. 6.16 An example storyboard of automatic editing of non-professional video footage: from
4 minutes 54 seconds of highly shaky and messy raw footage captured with waterproof action
camera; this 35-second clip was created automatically, featuring scenery overview, underwater
scenes and crucial moment of a character approaching the camera with advanced swimming style
Fig. 6.17 An example storyboard of automatic editing of non-professional video footage: from
16 minutes 38 seconds of raw footage; this 1 minute 38 second clip was created automatically,
featuring scenes of driving toward the Grand Canyon, and the most dramatic scenery views and
character poses at the location
6.4 Conclusion
In this chapter, we have explored the ways of building an automatic video editing
system capable of extracting cinematography editing rules from the reference motion
pictures. Both methods discussed here operate with video shots as atomic units of
montage and rely on averaged semantic feature vectors extracted by a convolutional
neural network trained for ImageNet classification.
The optimal sequence search by the imitation learning algorithm (DAGGER) is in
fact a linear method, implemented as logistic regression trained with SGD. This
limits the generalising capability of the model and tends to over-smooth the results,
insofar as expert evaluation of video clips allows such a judgement. Nevertheless, this method
demonstrated that optimal sequence search by means of classifying pairs of semantic
feature vectors is capable of extracting knowledge from an unstructured corpus
of reference video content. For example, automatic editing was shown to
improve the distribution of transitions between shots of different scales by a
factor of four, compared with the distribution of transitions in raw footage filmed by
non-professional users.
Global path optimisation in the transitions graph, based on nearest-neighbour
search, allows us to perform automatic video editing by mimicking selected
reference content. The algorithm simply chooses from the raw footage the shots and
transitions most similar to the reference style. This approach permits
an arbitrary cost function with possibly non-differentiable parts and/or arbitrary functional
blocks such as smile detection. A multi-objective cost function with linear weight
coefficients is a natural choice for interactively adjusting different aspects of the video
editing.
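The path-optimisation idea can be sketched as a small dynamic-programming routine over a multi-objective, linearly weighted cost; the feature shapes, weights and function name below are illustrative assumptions, not the chapter's actual implementation.

```python
import numpy as np

def edit_path(shot_feats, ref_feats, k, w_sim=1.0, w_jump=0.5):
    """Pick k shots (in their original order) from n candidates by DP.

    shot_feats: (n, d) semantic feature vectors of raw-footage shots
    ref_feats:  (k, d) feature vectors of the reference-style shots
    The cost mixes similarity to the reference with a penalty on large
    semantic jumps; the weights w_sim, w_jump model the linear
    multi-objective coefficients mentioned in the text.
    """
    n = len(shot_feats)
    # unary cost: distance from each candidate shot to each reference slot
    unary = np.linalg.norm(
        shot_feats[None, :, :] - ref_feats[:, None, :], axis=2)  # (k, n)
    best = np.full((k, n), np.inf)
    back = np.zeros((k, n), dtype=int)
    best[0] = w_sim * unary[0]
    for t in range(1, k):
        for j in range(t, n):
            # pairwise cost penalises big semantic jumps between shots;
            # only earlier shots are allowed (linear narration)
            jumps = np.linalg.norm(shot_feats[j] - shot_feats[:j], axis=1)
            cand = best[t - 1, :j] + w_sim * unary[t, j] + w_jump * jumps
            back[t, j] = int(np.argmin(cand))
            best[t, j] = cand[back[t, j]]
    # backtrack the minimum-cost ordered selection
    j = int(np.argmin(best[k - 1]))
    path = [j]
    for t in range(k - 1, 0, -1):
        j = back[t, j]
        path.append(j)
    return path[::-1]
```

Limiting `j - i` in the inner loop would implement the depth limit on the transitions matrix discussed below.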
The dynamic programming technique has drawbacks such as the large memory
footprint for the nearest neighbour index and the quadratic complexity of the
transitions matrix calculation. The latter problem can be efficiently solved by
limiting the depth while constructing the transitions matrix, and the former problem
becomes less relevant as the memory resources of mobile devices grow. One serious
limitation of the proposed method is its linear narration: a transition graph can be
built only for an ordered sequence of footage, and reordering video takes or inserting
B-roll shots is not possible in a general way.
Interactivity is the key advantage of the proposed solution. Gamification of the
video editing process in the form of trial and error in setting various parameters
allows non-professional users to achieve quite pleasing results without requiring
technical and artistic skills. In Table 6.4 we list both the parameters described in
detail above and proposed future work on additional parameters for fine-tuning the
video editing.
References
Arev, I., Park, H.S., Sheikh, Y., Hodgins, J., Shamir, A.: Automatic editing of footage from multiple
social cameras. ACM Trans. Graph. 33(4), 1–11 (2014)
Arya, S., Mount, D.M., Netanyahu, N.S., Silverman, R., Wu, A.Y.: An optimal algorithm for
approximate nearest neighbor searching fixed dimensions. J. ACM. 45(6), 891–923 (1998)
ASC Unveils List of 100 Milestone Films in Cinematography of the 20th Century (2019) Accessed
on 06 October 2020. https://theasc.com/news/asc-unveils-list-of-100-milestone-films-in-cine
matography-of-the-20th-century
Boiman, O., Rav-Acha, A.: System and method for semi-automatic video editing. US Patent
9,570,107 (2017)
Cong, Y., Yuan, J., Luo, J.: Towards scalable summarization of consumer videos via sparse
dictionary selection. IEEE Transactions on Multimedia. 14(1), 66–75 (2012)
Dollar, P., Rabaud, V., Cottrell, G., Belongie, S.: Behavior recognition via sparse spatio-temporal
features. In: Proceedings of the IEEE International Workshop on Visual Surveillance and Perfor-
mance Evaluation of Tracking and Surveillance, pp. 65–72 (2005)
Fischler, M., Bolles, R.: Random sample consensus: a paradigm for model fitting with applications
to image analysis and automated cartography. In: Readings in Computer Vision: Issues,
Problems, Principles, and Paradigms, pp. 726–740 (1987)
Jin, X., Chi, J., Peng, S., Tian, Y., Ye, C., Li, X.: Deep image aesthetics classification using
inception modules and fine-tuning connected layer. In: Proceedings of the 8th IEEE Interna-
tional Conference on Wireless Communications and Signal Processing, pp. 1–6 (2016)
Karpathy, A., Toderici, G., Shetty, S., Leung, T., Sukthankar, R., Fei-Fei, L.: Large-scale video
classification with convolutional neural networks. In: Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition, pp. 1725–1732 (2014)
Langford, J., Li, L., Strehl, A.: Vowpal Wabbit Online Learning Project (2007) Accessed on
06 October 2020. http://hunch.net/?p=309
Leake, M., Davis, A., Truong, A., Agrawala, M.: Computational video editing for dialog-driven
scenes. ACM Trans. Graph. 36(4), 130 (2017)
Lepetit, V., Moreno-Noguer, F., Fua, P.: EPnP: An accurate O(n) solution to the PnP problem.
Int. J. Comput. Vision (2009). Accessed on 06 October 2020. https://doi.org/10.1109/ICCV.
2007.4409116
Matias, J., Phan, H.: System and method of generating video from video clips based on moments of
interest within the video clips. US Patent 10,186,298 (2017)
Médioni, T.: Three-dimensional convolutional neural networks for video highlight detection. US
Patent 9,836,853 (2017)
Merabti, B., Christie, M., Bouatouch, K.: A virtual director using hidden Markov models. In:
Computer Graphics Forum. Wiley (2015). https://doi.org/10.1111/cgf.12775.hal-01244643
Mikolajczyk, K., Tuytelaars, T., Schmid, C., Zisserman, A., Matas, J., Schaffalitzky, F., Kadir, T.,
Van Gool, L.: A comparison of affine region detectors. Int. J. Comput. Vis. 65(1/2), 43–72
(2005)
Park, H.S., Jain, E., Sheikh, Y.: 3D social saliency from head-mounted cameras. In: Proceedings of
the 25th International Conference on Neural Information Processing Systems., vol. 1, pp.
422–430 (2012)
Podlesnaya, A., Podlesnyy, S.: Deep learning based semantic video indexing and retrieval. In:
Proceedings of SAI Intelligent Systems Conference, pp. 359–372 (2016)
Podlesnyy, S.: Towards data-driven automatic video editing. In: Liu, Y., Wang, L., Zhao, L., Yu,
Z. (eds.) Advances in Natural Computation, Fuzzy Systems and Knowledge Discovery.
Advances in Intelligent Systems and Computing, vol. 1074. Springer, Cham (2020)
Pudovkin, V.I.: Model (sitter) instead of actor. In: Collected Works, vol. 1, p. 184, Moscow (1974)
(in Russian)
Rav-Acha, A., Boiman, O.: System and method for semi-automatic video editing. US Patent
9,554,111 (2017)
Rav-Acha, A., Boiman, O.: Method and system for automatic B-roll video production. US Patent
9,524,752 (2016)
Rocco, I., Cimpoi, M., Arandjelović, R., Torii, A., Pajdla, T., Sivic, J.: Neighbourhood consensus
networks. In: Proceedings of the 32nd Conference on Neural Information Processing Systems,
pp. 1658–1669 (2018)
Ross, S., Gordon, G.J., Bagnell, J.A.: A reduction of imitation learning and structured prediction to
no-regret online learning. In: Proceedings of the 14th International Conference on Artificial
Intelligence and Statistics, pp. 627–635 (2011)
Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A.,
Khosla, A., Bernstein, M., Berg, A. C., Fei-Fei, L.: ImageNet large scale visual recognition
challenge. CoRR, arXiv: 1409.0575 (2014)
Safonov, I.V., Kurilin, I.V., Rychagov, M.N., Tolstaya, E.V.: Document Image Processing for
Scanning and Printing. Springer Nature Switzerland AG (2019)
Snavely, N., Seitz, S.M., Szeliski, R.: Photo tourism: exploring photo collections in 3D. ACM
Trans. Graph. 25(3), 835–846 (2006)
Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V.,
Rabinovich, A.: Going deeper with convolutions. CoRR, arXiv:1409.4842 (2014)
Tola, E., Lepetit, V., Fua, P.: DAISY: An efficient dense descriptor applied to wide-baseline stereo.
IEEE Trans. Pattern Anal. Mach. Intellig. 32(5), 815–830 (2010)
Tsivian, Y.: Cinemetrics: part of the humanities’ cyberinfrastructure. In: Ross, M., Grauer, M.,
Freisleben, B. (eds.) Digital Tools in Media Studies, vol. 9, pp. 93–100. Transcript Verlag,
Bielefeld (2009)
Uchihachi, S., Foote, J.T., Wilcox, L.: Automatic video summarization using a measure of shot
importance and a frame-packing method. US Patent 6,535,639 (2003)
Zhao, B., Xing, E.P.: Quasi real-time summarization for consumer videos. In: Proceedings of the
IEEE Conference on Computer Vision and Pattern Recognition, pp. 2513–2520 (2014)
Chapter 7
Real-Time Detection of Sports Broadcasts
Using Video Content Analysis
7.1.1 Introduction
X. Y. Petrova (*)
Samsung R&D Institute Russia (SRR), Moscow, Russia
e-mail: xen@excite.com
V. V. Anisimovsky
Huawei Russian Research Institute, Moscow, Russia
e-mail: vanisimovsky@gmail.com
M. N. Rychagov
National Research University of Electronic Technology (MIET), Moscow, Russia
e-mail: michael.rychagov@gmail.com
Fig. 7.1 Recognition of the video genre in the video processing pipeline of a TV receiver
The main goal of our research, the results of which are presented in this chapter,
was to develop an algorithm for the detection of TV programmes containing sports
games in real-time video sequences, with the aim of automatically adjusting the
image settings in a TV receiver (see Fig. 7.1).
Prior works on video sequence classification can be subdivided into groups based on
the purpose of classification (e.g. genre detection or specific object detection), the
modalities used (such as video, audio or subtitles), the feature selection method and
the type of classifier (such as support vector machine (SVM), classification or
regression tree, neural network, etc.). In terms of purpose, the closest existing
works are the well-known studies on automatic genre detection. Genre detection
can be based on various modalities, for example, subtitles (Brezeale and Cook 2006;
Brezeale and Cook 2008), sound (Roach and Mason 2001; Dinh et al. 2002; Bai
et al. 2006), video (all other sources) or several of these modalities at the same time
(Subashini et al. 2012).
Brezeale and Cook (2006) used subtitles and DCT coefficients from the decom-
position of video frames as features. These authors achieved a high level of detection
accuracy but noted that subtitles are missing from many television broadcasts,
although in some countries such as in the USA, legislation obliges broadcasters to
provide subtitles with TV broadcasts. At the same time, however, subtitles are not a
description of what is shown on the screen and are not generated for scenes in which
there is no dialog. Finally, training and classification based on this feature may
involve great computational complexity, since the feature vector can consist of many
thousands of elements. Subtitles are not synchronised at the frame level and provide
no information on scene transitions and can therefore be used only for offline
processing. In addition, in a situation where the user is switching channels, this
type of algorithm will not be able to provide the timely response required for the
optimal selection of video processing settings.
Video sequences containing sporting events were indexed by Bai et al. (2006)
using SVM classifier based on the features of the audio stream. Approaches based on
the analysis of audio streams or subtitles are not applicable in a television receiver for
the automatic adjustment of video processing coefficients, since these approaches do
not have a sufficiently short response time to the corresponding video sequence. In the
following, we will consider methods based mostly on an analysis of the video stream.
In work by Jiang et al. (2009), the following genres are defined: cartoons,
advertising, news and sports. As a basic algorithm, the support vector method
using an oriented acyclic graph (DAGSVM) was selected. Fifteen visual features
of four types were distinguished (editing, colour, texture and movement). The
editing features captured the frequency of scene changes and its variation throughout
the programme, and the numbers of sharp and smooth transitions between scenes
were estimated. A histogram of the average brightness and saturation and the
percentage of pixels with brightness and saturation above a predetermined threshold
were used as colour features. The textural features were related to the statistical
properties of the halftone adjacency matrix: contrast, uniformity, energy, entropy
and correlation. The features related to movement were the average change in
brightness, the average difference in the RGB colour space between adjacent frames
and the proportion of frames that differed slightly from the previous frames (i.e. the
proportion of slow and/or static scenes).
The classification of movies by genre (such as action, drama or thriller) was
attempted by Huang et al. (2008). Five global features were used: the average length
of the episode, colour variation, movement, lighting (e.g. the presence of a flashlight)
and the statistics of the types of transitions between scenes. A decision tree classifier
was implemented.
The application of the properties of boundaries was described by Yuan and Wan
(2004), and a k-means classifier was used to distinguish between badminton, bas-
ketball, football, ice skating and tennis. In an earlier work by the same authors, the
properties of borders were used to identify frames containing a general view of the
spectator stands and advertising outside of the playing field.
The trajectories of faces and blocks of text were analysed by Wei et al. (2000) in
order to distinguish between advertising, news, comedy series and soap operas.
Classification was carried out by finding the maximum projection of the distribution
of the trajectories of a classified fragment on the set of trajectories of the training set.
In a study by Liu and Kender (2002), frames of video recording of lectures were
subdivided into four classes. To achieve this, a quasi-optimal procedure for selecting
features (from 300 initial features) was proposed.
Using camera movement analysis, Takagi et al. (2003a, b) presented the
categorisation of five types of sporting videos (sumo wrestling, tennis, baseball,
football and American football) in two papers. Each of these sports was
characterised by a particular style of shooting, and classification was carried out
considering the types of camera zooms, episodes with a static or shaking camera, and
transitions from one type of camera movement to another. Since their method was
not based on colour information, it could be very effective in classifying broadcasts
of games in different sports leagues (e.g. at Wimbledon, the tennis courts are green,
while at the French Open, they are red).
A new low-level attribute called the “energy flow of activity” was proposed by
Gillespie and Nguyen (2004). This feature was fed to the input of a network of radial
basis functions to define sports, cinema, landscape shots and news. The energy flow
of activity was measured within a given programme and was based on statistics of
macro-blocks of compressed video, including the number of I-blocks, that is, blocks
with reliably and unreliably predictable motion (e.g. in the case of a uniform
background with a moving camera).
Kittler et al. (2001) proposed the concept of “visual keys” for the identification of
sporting genres (tennis, track sports, swimming, yachting and cycling). These keys
impart semantic meaning to the low-level attributes of the frames of the video
sequence. In total, 16 types of keys were used: athletics tracks, boxing rings, indoor
cycle tracks, the ocean, ski jumps, pools, tennis nets, grass, blue sky, open cycle
tracks, the ocean (medium-range shot), treadmills (long-range shot), treadmills
(medium-range shot), treadmills (close-up), pools (close-up) and tennis courts. The
first nine of these detectors were represented by trained neural networks, and the last
eight used so-called texture codes. The results of the detection of these visual keys
were fed to the input of a k-means classifier to determine which type of sport was
being shown. This set of features was subsequently expanded by the addition of
multimodal elements to detect visual keys (Jaser et al. 2004). In the latter case,
hidden Markov models were used to analyse the time dependencies.
Later, a similar problem was considered which involved identifying seven types
of sporting events (climbing, basketball, motor racing, golf, ski jumping, football
and motorcycle racing) and five types of scenes within sports videos (Choroś and
Pawlaczyk 2010): final titles, commentators in the studio, initial titles, trailers and
tables and standings. A decision was made based on the calculation of colour
coherence vectors (Pass et al. 1996).
Another seven popular video genres were classified by Ionescu et al. (2012):
cartoons, advertising, documentaries, cinema, music videos, news and sports. Three
categories of descriptors were used, which related to colour, dynamics and structure.
Colour properties were characterised globally using statistics of the distribution of
various colours, elementary colours dominant within the image, the properties of
these colours (e.g. brightness, saturation) and the relationships between them. From
the point of view of dynamics, the rhythm of the video, the statistics of the
movements and the percentage of smooth transitions were evaluated. Structural
information was extracted at the frame level by constructing histograms of contours
and identifying the relationships between them.
It is easy to see that the tools used to classify images and videos are very diverse.
In particular, the effectiveness of the decision tree method (Huang et al. 2008),
principal component analysis (PCA) (Vaswani and Chellappa 2006), SVM (Dinh
et al. 2002; Bai et al. 2006), neural networks (Takagi et al. 2003a, b; Subashini et al.
2012), Kohonen networks (Koskela et al. 2009) and hidden Markov models (Truong
et al. 2000) has been demonstrated. An implementation of the nearest neighbour
classifier was reported by Mel (1997), and some researchers have used a random
forest or Bayesian approach (Machajdik and Hanbury 2010).
An interesting attempt was undertaken by Koskela et al. (2009) to establish
correspondence between a semantic concept and a list of synonyms for this concept
from WordNet. This idea can serve as the basis for a quick and effective synthesis of a
visual classifier based on a verbal description made by an expert. A similar idea was
described by Li et al. (2010), in which a bank of classifiers of visual objects utilises a
set of primitive classifiers that can form the inputs for more complex classifiers.
For the problem under consideration, methods of scene change detection are also of
interest; however, we limit ourselves here to the simplest case of a direct transition,
since this is the dominant type used in sports broadcasts.
The following requirements apply to the classification algorithm:
1. To control the settings of the video processing pipeline, a classification result
should be obtained for each individual frame, and there should be no abrupt
jumps except in the case of a scene change. Only one pass through the video
frame should be used. This requirement implies that the use of any kind of
temporal features should be avoided, since these introduce additional frame
delays.
2. The algorithm should be universal, i.e. sporting events should be differentiated
from all other types of scenes, including movies, news, cartoons, computer
graphics, concerts, etc.
3. The algorithm should be insensitive to the quality of the video stream (i.e. should
support different resolutions, both standard and high, and should be insensitive to
various compression levels and robust to progressive/interlaced types of content).
4. The complexity of the algorithm should be extremely low, allowing for a
hardware implementation as part of a video processing chip within a general,
more complex system. It is therefore desirable that only a limited local
neighbourhood should be used in the pixel analysis.
5. The algorithm should have a modular structure to enable interaction with other
video processing subsystems. It should also have a clear, straightforward inter-
pretation of its parameters, allowing simple and fast tuning towards the desired
behaviour in specific corner cases.
In addition, several considerations affect how classification quality should be assessed:
1. The importance of obtaining the correct result is not the same for different frames:
it is more important to achieve a stable detection result for frames containing a
green field with players moving around it than for frames containing spectator
stands or a commentator or even a player shown in close-up.
2. Not all classification errors have the same weight; it is perfectly permissible to
misclassify football as baseball, but it is unacceptable to misclassify football as a
music video.
3. Video data (at the level of individual frames) is inherently strongly biased, since
the distribution depends mostly on the duration of the episodes containing frames
with similar characteristics; i.e. frame frequency is only weakly related to
importance.
To formalise these requirements, an additional subclassification is introduced in
the form of a simple hierarchy and weights for penalisation of a confusion matrix.
The confusion matrix is a popular tool for assessing the quality of classification
algorithms (Godbole 2002). We divided the category of “sports games” into three
subcategories: C1 (football games), C2 (field-based sporting events other than
football) and C3 (non-game sports broadcasts, non-game scenes in game broad-
casts). All other frame types are classified into category C4. The weights for
penalisation of the confusion matrix are shown in Table 7.1.
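The weighted confusion-matrix evaluation can be sketched as follows; the penalty values are placeholders standing in for Table 7.1, which is not reproduced here.

```python
import numpy as np

# Hypothetical penalty weights for confusing true class (row) with
# predicted class (column); classes C1..C4 as defined in the text.
# The real values live in Table 7.1 -- these are placeholders.
PENALTY = np.array([
    [0.0, 0.1, 0.3, 1.0],   # C1 mistaken for C1..C4
    [0.1, 0.0, 0.3, 1.0],   # misclassifying football as another field
    [0.3, 0.3, 0.0, 0.5],   # sport is cheap; sport as music video is not
    [1.0, 1.0, 0.5, 0.0],
])

def weighted_penalty(true_labels, pred_labels, weights=PENALTY):
    """Mean penalty: each confusion counts by its class-specific cost."""
    conf = np.zeros_like(weights)
    for t, p in zip(true_labels, pred_labels):
        conf[t, p] += 1          # accumulate the confusion matrix
    return float((conf * weights).sum() / conf.sum())
```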
To simplify the development of the algorithm, class C1 was divided into three
classes: long-range shots, general shots and close-up shots. A separate classifier was
built for each class. In the first step, the simplest and most obvious case was
considered: in order to distinguish between frames containing long-distance shots
and hence distinguish frames of a football match from those of other genres, the
percentage of green pixels was used as a feature. Further development was carried
out using a truncated database, containing only frames for which the classifier
designed in the first step produced a Type 1 error.
The simplest formula to detect green pixels would be Gr_0 = (G > RB_max), where
RB_max is the maximum of the red and blue channels of the pixel. However, this formula is not
selective enough, because it misclassifies as green the following categories of
pixels: (i) almost all white pixels; (ii) pixels with low saturation; (iii) very dark
(almost black) pixels; (iv) blue-toned pixels; and (v) yellow pixels.
Effects (i)–(iv) were corrected by adding additional terms to the basic formula:

Gr_1 = Gr_0 ∧ (S_RGB > 80) ∧ ((R + B < (3/2)·G) ∨ (R + B < 255 ∧ |R − B| < 35 ∧ Y > 80)),

where S_RGB = R + G + B. Effect (v) was mitigated using an additional classifier Ye for
yellow pixels, defined in terms of RG_MAX, the maximum of the green channel G and the red
channel R; RG_MIN, the minimum of these two values; and the colour saturation S. The final
form of the formula for detecting green pixels then becomes

Gr = Gr_0 ∧ (S_RGB > 80) ∧ ((R + B < (3/2)·G) ∨ (R + B < 255 ∧ |R − B| < 35 ∧ Y > 80)) ∧ ¬Ye.
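As a rough illustration, the green-pixel test might be coded as below; the boolean structure is an approximate reconstruction of the garbled printed formula, and the yellow-pixel classifier is passed in as a flag because its definition is not given in this excerpt.

```python
def is_green(r, g, b, y, is_yellow=False):
    """Approximate green-pixel test per the chapter's empirical formula.

    r, g, b: colour channels in 0..255; y: luma in 0..255.
    is_yellow: output of the (unspecified here) yellow-pixel classifier Ye.
    The exact boolean structure is a reconstruction, not verbatim.
    """
    gr0 = g > max(r, b)            # basic "greener than red and blue" test
    s_rgb = r + g + b              # rejects very dark pixels
    extra = (r + b < 1.5 * g) or (r + b < 255 and abs(r - b) < 35 and y > 80)
    return gr0 and s_rgb > 80 and extra and not is_yellow
```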
Finding the proportion of green pixels is far from sufficient to solve the problem.
Some sporting events contain a fairly small number of green pixels, and even a
human observer may make a mistake in the absence of accompanying text or audio
commentary. A condition is therefore applied in which a frame with zero green
pixels belongs to class C4.
The classification of other types of scenes is based on the following observations.
Sporting scenes, and especially those in football, are characterised by the presence of
green pixels with high saturation. Moreover, the green pixels making up the image of
the pitch have low values for the blue colour channel and relatively high brightness
and saturation. The range of variation in the brightness of green pixels is not wide,
and if high values of the brightness gradient are seen within green areas, it is likely
that these frames correspond to images of natural landscapes rather than a soccer
pitch. In sports games, bright and saturated spots usually correspond to the players’
uniforms, while white areas correspond to the markings on the field and the players’
numbers. In close-up shots of soccer players, a small number of green pixels are
usually present in each frame. These observations can be described by the following
empirical relation:
where S_RGB = R + G + B.
The detection of bright and saturated colours is formalised as follows:

Bs = (max(R, G, B) > 150) ∧ (max(R, G, B) − min(R, G, B) > max(R, G, B)/2).
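A direct transcription of the bright-and-saturated test, under the assumption that the printed inequality compares the channel spread against half the maximum:

```python
def is_bright_saturated(r, g, b):
    """Bright-and-saturated pixel test (reconstruction of the Bs formula).

    Fires when the dominant channel is bright and the spread between the
    strongest and weakest channels exceeds half the maximum.
    """
    mx, mn = max(r, g, b), min(r, g, b)
    return mx > 150 and (mx - mn) > mx / 2
```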
The skin tone detector is borrowed from the literature (Gomez et al. 2002). Ten
per-frame features F1–F10 are then computed; those defined below are the proportion
of green pixels, F1; the proportion of skin tone pixels, F2; the average gradient for
green pixels, F4; the proportion of bright and saturated pixels, F5; the mean
saturation of green pixels, F6; the proportion of white pixels, F7; the average
brightness of green pixels, F8; the mean value of the blue colour channel for green
pixels, F9; and the compactness of the brightness histogram for green pixels, F10.
The proportion of green pixels is calculated as follows:

F1 = (1/(w·h)) · Σ_{i=1..w, j=1..h} δ(Gr(i, j)),

where w is the frame width, h is the frame height, i, j are pixel coordinates, Gr(i, j) is
the result generated by the green pixel detector, and δ is a function that converts a
logical value to a real one, i.e. δ(x) = 1 if x is TRUE and δ(x) = 0 otherwise.
The proportion of skin tone pixels is calculated as follows:

F2 = (1/(w·h)) · Σ_{i=1..w, j=1..h} δ(Sk(i, j)),

where Sk(i, j) is the result of the skin tone pixel detector. The average gradient for the
green pixels is calculated as follows:

F4 = (Σ_{i=1..w, j=1..h} |DY(i, j)| · δ(Gr(i, j))) / (Σ_{i=1..w, j=1..h} δ(Gr(i, j))).

The proportion of bright and saturated pixels is

F5 = (1/(w·h)) · Σ_{i=1..w, j=1..h} δ(Bs(i, j)),
where Bs(i, j) is the detection result of bright and saturated pixels. The mean
saturation of the green pixels is calculated using the formula

F6 = (Σ_{i=1..w, j=1..h} S(i, j) · δ(Gr(i, j))) / (Σ_{i=1..w, j=1..h} δ(Gr(i, j))),

where S(i, j) is the saturation. The proportion of white pixels is estimated as

F7 = (1/(w·h)) · Σ_{i=1..w, j=1..h} δ(W(i, j)),

where W(i, j) is the detection result of the white pixels. The average brightness of the
green pixels is derived, in turn, from the formula

F8 = (Σ_{i=1..w, j=1..h} Y(i, j) · δ(Gr(i, j))) / (Σ_{i=1..w, j=1..h} δ(Gr(i, j))).

The average value of the blue channel for the green pixels is obtained as follows:

F9 = (Σ_{i=1..w, j=1..h} B(i, j) · δ(Gr(i, j))) / (Σ_{i=1..w, j=1..h} δ(Gr(i, j))).

Finally, the compactness of the brightness histogram for the green pixels, F10, is
calculated using the following steps:
• A histogram H_YGr of the brightness values Y of the pixels belonging to green
areas is constructed.
• The width D of the histogram is computed as the distance between its rightmost and
leftmost non-zero elements.
• The feature F10 is calculated as the proportion of the histogram that lies no further
than an eighth of its width from the central value P:

F10 = (Σ_{i=P−D/8}^{P+D/8} H_YGr(i)) / (Σ_{i=0}^{255} H_YGr(i)).
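A sketch of how a few of these per-frame features (analogues of F1, F4, F6 and F10) could be computed with NumPy; the choice of the histogram's central value is an assumption, since the text only calls it the "central value".

```python
import numpy as np

def frame_features(green_mask, luma, sat, grad_y):
    """Compute analogues of features F1, F4, F6 and F10.

    green_mask: boolean HxW array from the green-pixel detector
    luma, sat, grad_y: HxW arrays of brightness, saturation, |gradient|
    Names and the histogram-compactness details are illustrative.
    """
    n_green = int(green_mask.sum())
    f1 = n_green / green_mask.size                 # share of green pixels
    if n_green == 0:                               # zero-green frame -> C4
        return f1, 0.0, 0.0, 0.0
    f4 = float(grad_y[green_mask].mean())          # mean gradient on green
    f6 = float(sat[green_mask].mean())             # mean green saturation
    # F10: compactness of the green-pixel brightness histogram
    hist, _ = np.histogram(luma[green_mask], bins=256, range=(0, 256))
    nz = np.nonzero(hist)[0]
    width = int(nz[-1] - nz[0])
    centre = int((nz[-1] + nz[0]) // 2)            # assumed "central value"
    lo, hi = centre - width // 8, centre + width // 8
    f10 = float(hist[max(lo, 0):hi + 1].sum() / hist.sum())
    return f1, f4, f6, f10
```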
Fig. 7.3 Elementary classifiers: (a) 1D threshold transform with higher threshold; (b) 1D threshold
transform with lower threshold; (c) 2D threshold transform; (d) 2D linear classifier; (e) 3D threshold
transform; (f) elliptic 2D classifier

The one-dimensional threshold transforms are defined as

θ_i^HI(x) = TRUE if x > T_i, FALSE otherwise (Fig. 7.3a),

or

θ_i^LO(x) = TRUE if x < T_i, FALSE otherwise (Fig. 7.3b).

The parameters of the two-dimensional threshold transforms (or tabular classifiers)
are presented in Fig. 7.3c. They are described by two threshold vectors,
V_T^1 = (T_0^1, T_1^1, ..., T_N^1), where T_0^1 < T_1^1 < T_2^1 < ... < T_N^1, and
V_T^2 = (T_0^2, T_1^2, ..., T_M^2), where T_0^2 < T_1^2 < T_2^2 < ... < T_M^2. The
numbers of threshold values in the two dimensions need not match, i.e. M ≠ N is allowed.
The output of the tabular classifier has the form of a two-dimensional logical vector of
dimensions M × N: θ^2D(x, y) = |y_11(x, y), y_12(x, y), ..., y_1N(x, y), ..., y_M1(x, y),
y_M2(x, y), ..., y_MN(x, y)|, where each element is a logical quantity:

y_ij(x, y) = (x ≥ T_{i−1}^1) ∧ (x ≤ T_i^1) ∧ (y ≥ T_{j−1}^2) ∧ (y ≤ T_j^2).

The output of the linear classifier (Fig. 7.3d) is calculated as θ_i^L(x, y) =
(K_1·x + K_2·y + B > 0), where K_1, K_2 and B are predefined constants. The parameters
of the three-dimensional tabular classifiers (Fig. 7.3e) are described by three threshold
vectors: V_T^1 = (T_0^1, ..., T_N^1), where T_0^1 < T_1^1 < ... < T_N^1; V_T^2 =
(T_0^2, ..., T_M^2), where T_0^2 < T_1^2 < ... < T_M^2; and V_T^3 = (T_0^3, ..., T_L^3),
where T_0^3 < T_1^3 < ... < T_L^3. The output of the three-dimensional tabular classifier
is a logical vector of dimensions M × N × L: θ^3D(x, y, z) = |y_111(x, y, z),
y_112(x, y, z), ..., y_MNL(x, y, z)|. We assume that all threshold values are real
numbers defined on the interval [0, 1].
The output of the elliptic classifier (Fig. 7.3f) is calculated as

θ_i^e(x, y) = ((x′ − X_C)²/a² + (y′ − Y_C)²/b² < 1),

where x′ = x·cos α − y·sin α, y′ = x·sin α + y·cos α,
and X_C, Y_C, a, b and α are predefined constants.
In the process of research and modelling, it was found that in many cases, it was
sufficient to limit ourselves to the simplest version of the classifier, including
one-dimensional and two-dimensional threshold transformations and a set of linear
classifiers (see Fig. 7.4).
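The elementary classifiers above are simple enough to sketch directly; the function names are illustrative:

```python
import math

def theta_hi(x, t):
    """1D threshold transform: fires when the feature exceeds t."""
    return x > t

def theta_lo(x, t):
    """1D threshold transform: fires when the feature is below t."""
    return x < t

def theta_linear(x, y, k1, k2, b):
    """2D linear classifier: half-plane test K1*x + K2*y + B > 0."""
    return k1 * x + k2 * y + b > 0

def theta_elliptic(x, y, xc, yc, a, b, alpha):
    """Elliptic 2D classifier: inside a rotated ellipse centred at (xc, yc)."""
    xr = x * math.cos(alpha) - y * math.sin(alpha)
    yr = x * math.sin(alpha) + y * math.cos(alpha)
    return (xr - xc) ** 2 / a ** 2 + (yr - yc) ** 2 / b ** 2 < 1

def theta_2d(x, y, v1, v2):
    """2D tabular classifier: cell membership over two threshold grids.

    v1, v2: ascending threshold vectors; returns an (len(v1)-1) x (len(v2)-1)
    boolean table in which at most one interior cell is True.
    """
    return [[v1[i] <= x <= v1[i + 1] and v2[j] <= y <= v2[j + 1]
             for j in range(len(v2) - 1)]
            for i in range(len(v1) - 1)]
```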
We now formulate some basic rules that can be used to draw a final conclusion on
whether the current frame belongs to a sporting event.
The first decision rule G1 is implemented as a tabular classifier based on the
proportion of green pixels F1 and the mean saturation of the green pixels F6. It
outputs a Boolean vector of 16 (four by four) elements, G1 = |y11, y12, ..., y44|,
and is parameterised by six threshold values T1...T6. The purpose of this rule is to
distinguish between long-range, medium-range and close-up shots and to apply
different decision logic to each. The single non-zero element yij (the vector's ℓ0 norm
is exactly one) indicates the decision that the current frame belongs to a certain shot
class. The result of this rule is used for the further selection of classification
thresholds: if the saturation of the green pixels is low, it makes sense to apply
stricter rules than when it is high or average.
The second tabular decision rule G2 = |y11, y12, y21, y22| relies on features F2
and F5 and is parameterised by two threshold values, T7 and T8. This rule also ensures
the selection of frames with a low proportion of green pixels.
Some rules are regarded as "positive" or "negative", indicating whether they raise or
lower the final classification result when contributing to it; these are denoted in the
following by P and N, respectively. Terms used as part of several other rules are
denoted by Q; their role may differ from rule to rule, sometimes contributing to a
positive overall classification result and, in other rules, to a negative one.
This notation simplifies collecting corner cases from a large video database and the
analysis of certain conclusions drawn by the classifier.
The case with a very low number of green pixels is considered separately and is
implemented as a "negative" threshold rule: N1 = F1 > T9. The opposite case, with a
very high proportion of skin colour pixels, is also controlled by a "negative" rule,
N2 = F2 < T10, as are the textural properties of green pixels, N3 = F4 < T11. Sports
broadcasts also contain a certain (although relatively small) proportion of white
pixels, N4 = F7 > T12.
Since the auxiliary bright and saturated pixel detector is based on fixed thresh-
olds, it is clear that the detection result will depend on the overall image brightness.
After applying a simple gamma correction to the image, the proportion of bright
and saturated pixels detected can change significantly. To solve this problem, a
linear classifier is used: Q1 = K1·F3 + K2·F5 + B > 0, where K1, K2 and B are
predefined constants.
The number of bright and saturated pixels is controlled by the rule Q2 = F5 > T13.
The average brightness of green pixels is divided into two ranges, and different
empirical logic is subsequently used for each, for example, Q3 = F8 > T14 and
P1 = F8 > T15. The average value of the blue component of green pixels is controlled
by the rule P2 = F9 < T16, and the degree of compactness of the histogram of the
brightness of green pixels is evaluated as P3 = F10 < T17. The final
classification result R is calculated as
$$
\begin{pmatrix}
\tilde R^{C_1} & \tilde R^{C_2} & \tilde R^{C_3} & \cdots & \tilde R^{C_{N_K}} \\
\tilde G^{C_1} & \tilde G^{C_2} & \tilde G^{C_3} & \cdots & \tilde G^{C_{N_K}} \\
\tilde B^{C_1} & \tilde B^{C_2} & \tilde B^{C_3} & \cdots & \tilde B^{C_{N_K}}
\end{pmatrix},
$$

where

$$
\tilde R^{C_k} = \frac{\sum_{i=1..w,\; j=1..h} \delta(K(i,j)=k)\, R(i,j)}{\sum_{i=1..w,\; j=1..h} \delta(K(i,j)=k)},
$$

$$
\tilde G^{C_k} = \frac{\sum_{i=1..w,\; j=1..h} \delta(K(i,j)=k)\, G(i,j)}{\sum_{i=1..w,\; j=1..h} \delta(K(i,j)=k)},
$$

$$
\tilde B^{C_k} = \frac{\sum_{i=1..w,\; j=1..h} \delta(K(i,j)=k)\, B(i,j)}{\sum_{i=1..w,\; j=1..h} \delta(K(i,j)=k)}.
$$
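The per-class mean-colour computation above maps directly onto array operations; a sketch assuming K is an integer label map of shape h × w:

```python
import numpy as np

# For each class label k in the map K, average the R, G, B channels over the
# pixels assigned to that class; delta(K(i,j) = k) becomes a boolean mask.

def per_class_means(rgb, K, num_classes):
    """rgb: float array (h, w, 3); K: int array (h, w) of class labels.
    Returns a (num_classes, 3) array of mean colours (NaN for empty classes)."""
    means = np.full((num_classes, 3), np.nan)
    for k in range(num_classes):
        mask = (K == k)
        if mask.any():
            means[k] = rgb[mask].mean(axis=0)
    return means

rgb = np.zeros((4, 4, 3))
rgb[:2] = [255, 0, 0]          # top half red
rgb[2:] = [0, 128, 0]          # bottom half green
K = np.zeros((4, 4), dtype=int)
K[2:] = 1
print(per_class_means(rgb, K, 2))  # row 0 -> [255, 0, 0], row 1 -> [0, 128, 0]
```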
7.3 Results
current frame were shown as a bar graph (Fig. 7.7). The classifier was tested on at
least 20 hours of real-time video (Table 7.2). Errors of the first kind were estimated
by averaging the output of the classifier on the video sequences of class C4. In the
classification process, a 95% accuracy threshold was reached. To evaluate errors of
the second kind, a total number N_total = 220 of random frames were selected from
various sequences of classes C1–C3, and the classification accuracy was calculated
as $\frac{N^{+}_{C1} + N^{+}_{C2}}{N_{total}} \cdot 100\%$, where $N^{+}_{C1}$ is the
number of frames from class C1 classified as "sport", and $N^{+}_{C2}$ is the number
of frames from class C2 also classified as "sport".
The classification accuracy was 96.5%. Calculations using the coefficients from
Table 7.1 for frames from class C3 indicated an acceptable level of accuracy
(above 95%); however, these measurements are not of great value, as experts
disagree on how to classify many of the types of images from class C3.
The performance of this algorithm on a computer with a 2 GHz processor and
2 Mb of memory reached 30 fps. The proposed algorithm can be implemented using
only shift and addition operations, which makes it attractive in terms of hardware
implementation.
Despite the high quality of the classification method described above, further work
was required to eliminate the classification errors observed (see Fig. 7.8). Figure 7.8a
shows a fragment from a nature documentary that was misclassified as a long shot of
a soccer game, and Fig. 7.8b shows a caterpillar that was confused with a close-up of
a player. In a future version of the algorithm, such errors will be avoided through the
use of an advanced skin detector, the addition of a white marking detector for long
shots and the introduction of a cross-domain feature that combines texture and
colour for regions of bright and saturated colours. The classification error in
Fig. 7.8c is caused by an overly broad interpretation of the colour green. To solve
this problem, colour detectors could be applied to the global illumination of the
scene. To correct the error shown in Fig. 7.8d, a silhouette classifier could be
developed. However, this would be quite a complicated solution, with performance
unacceptable for real-time applications.
Many of these problems can be solved, one way or another, using modern
methods of deep learning with neural networks, and a brief review of these is
given in the next section. It must be borne in mind that although this approach
does not require the manual construction of features to describe the image and video,
this is achieved in practice at the cost of high computational complexity and poor
algorithmic interpretability (Kazantsev et al. 2019).
Fig. 7.8 Detection errors: (a) nature image with low texture and a high amount of green
misclassified as soccer; (b) nature image with a high amount of saturated green and bright
colours misclassified as soccer; (c) underwater image with a high amount of green
misclassified as soccer; (d) golf misclassified as soccer
Higher-order spectra features (HOSF) are used to extract both the phase and the amplitude of the given input,
allowing the subsequent SVM classifier to use a rich feature vector for video
classification (Fig. 7.9).
Another work by Hamed et al. (2013) that leveraged classical machine learning
approaches tackled the task of video genre classification via several steps: initially,
shot detection was used to extract the key frames from the input videos, and the
feature vector was then extracted from the video shot using discrete cosine transform
(DCT) coefficients processed by PCA. The extracted features were subsequently
scaled to values of between zero and one, and, finally, weighted kernel logistic
regression (WKLR) was applied to the data prepared for classification with the aim
Fig. 7.9 Workflow of a sports video classification system. (Reproduced with permission from
Mohanan 2017)
Fig. 7.10 Overview of the sport genre classification method via sensor fusion. (Reproduced with
permission from Cricri et al. 2013)
of achieving a high level of accuracy, making WKLR an effective method for video
classification.
The method suggested by Cricri et al. (2013) utilises multi-sensor fusion for sport
genre classification in mobile videos. An overview of the method is shown in
Fig. 7.10. Multimodal data captured by a mobile device (video, audio and data
7 Real-Time Detection of Sports Broadcasts Using Video Content Analysis 211
Fig. 7.11 The processing pipeline of the two-stream CNN method. (Reproduced with permission
from Ye et al. 2015)
Fig. 7.12 Overview of a hybrid deep learning framework for video classification. (Reproduced
with permission from Wu et al. 2015)
features, while the fusion between the spatial and motion features in a regularised
feature fusion network was used to explore feature correlations.
Karpathy et al. (2014) built a large-scale video classification framework by fusing
information over the temporal dimension using only a CNN, without recurrent
networks like LSTM. They explored several approaches to the CNN-based fusion
of temporal information (see Fig. 7.13).
Another idea of theirs, motivated by the human visual system, was a
multiresolution CNN that was split into fovea and context streams, as shown in
Fig. 7.14. Input frames were fed into two separate processing streams: a context
stream, which modelled low-resolution images, and a fovea stream, which processed
a high-resolution centre crop.
This design takes advantage of the camera bias present in many online videos,
since the object of interest often occupies the central region.
Fig. 7.13 Approaches for fusing information over the temporal dimension. (Reproduced with
permission from Karpathy et al. 2014)
Fig. 7.14 Multiresolution CNN architecture split into fovea and context streams. (Reproduced
with permission from Karpathy et al. 2014)
A compound memory network (CMN) was proposed by Zhu and Yang (2018) for
a few-shot video classification task. Their CMN structure followed the key-value
memory network paradigm, in which each key memory involves multiple constitu-
ent keys. These constituent keys work collaboratively in the training process,
allowing the CMN to obtain an optimal video representation in a larger space.
They also introduced a multi-saliency embedding algorithm which encoded a
variable-length video sequence into a fixed-size matrix representation by discover-
ing multiple saliencies of interest. An overview of their method is given in Fig. 7.15.
Finally, there are several methods which combine the advantages of both classical
machine learning and deep learning. One such method (Zha et al. 2015) used a CNN
to extract features from video frame patches, which were subsequently subjected to
spatio-temporal pooling and normalisation to produce video-level CNN features.
Fig. 7.15 Architecture of compound memory network. (Reproduced with permission from Zhu
and Yang 2018)
Fig. 7.16 Video classification pipeline with video-level CNN features. (Reproduced with permis-
sion from Zha et al. 2015)
Fig. 7.17 Learnable pooling with context gating for video classification. (Reproduced with
permission from Miech et al. 2017)
SVM was used to classify video-level CNN features. An overview of this video
classification pipeline is shown in Fig. 7.16.
Another method presented by Miech et al. (2017) and depicted in Fig. 7.17
employed CNNs as feature extractors for both video and audio data and aggregated
the extracted visual and audio features over the temporal dimension using learnable
pooling (e.g. NetVLAD or NetFV). The outputs were subsequently fused using fully
connected and context gating layers.
References
Bai, L., Lao, S.Y., Liao, H.X., Chen, J.Y.: Audio classification and segmentation for sports video
structure extraction using support vector machine. In: International Conference on Machine
Learning and Cybernetics, pp. 3303–3307 (2006)
Brezeale, D., Cook, D.J.: Using closed captions and visual features to classify movies by genre. In:
Poster Session of the 7th International Workshop on Multimedia Data Mining (MDM/KDD)
(2006)
Brezeale, D., Cook, D.J.: Automatic video classification: a survey of the literature. IEEE Trans.
Syst. Man Cybern. Part C Appl. Rev. 38(3), 416–430 (2008)
Cho, S., Kang, J.-S.: Histogram shape-based scene change detection algorithm. IEEE Access. 7,
27662–27667 (2019). https://doi.org/10.1109/ACCESS.2019.2898889
Choroś, K., Pawlaczyk, P.: Content-based scene detection and analysis method for automatic
classification of TV sports news. Rough sets and current trends in computing. Lect. Notes
Comput. Sci. 6086, 120–129 (2010)
Cricri, F., Roininen, M., Mate, S., Leppänen, J., Curcio, I.D., Gabbouj, M.: Multi-sensor fusion for
sport genre classification of user generated mobile videos. In: Proceedings of 2013 IEEE
International Conference on Multimedia and Expo (ICME), pp. 1–6 (2013)
Dinh, P.Q., Dorai, C., Venkatesh, S.: Video genre categorization using audio wavelet coefficients.
In: Proceedings of the 5th Asian Conference on Computer Vision (2002)
Gade, R., Moeslund, T.: Sports type classification using signature heatmaps. In: Proceedings of the
IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 999–1004
(2013)
Gillespie, W.J., Nguyen, D.T.: Classification of video shots using activity power flow. In: Pro-
ceedings of the First IEEE Consumer Communications and Networking Conference,
pp. 336–340 (2004)
Godbole, S.: Exploiting confusion matrices for automatic generation of topic hierarchies and
scaling up multi-way classifiers. Indian Institute of Technology – Bombay. Annual Progress
Report (2002). http://www.godbole.net/shantanu/pubs/autoconfmat.pdf. Accessed on 04 Oct
2020
Gomez, G., Sanchez, M., Sucar, L.E.: On selecting an appropriate color space for skin detection. In:
Lecture Notes in Artificial Intelligence, vol. 2313, pp. 70–79. Springer-Verlag (2002)
Hamed, A.A., Li, R., Xiaoming, Z., Xu, C.: Video genre classification using weighted kernel
logistic regression. Adv. Multimedia. 2013, 1 (2013)
Huang, H.Y., Shih, W.S., Hsu, W.H.: A film classifier based on low-level visual
features. J. Multimed. 3(3) (2008)
Ionescu, B.E., Rasche, C., Vertan, C., Lambert, P.: A contour-color-action approach to automatic
classification of several common video genres. Adaptive multimedia retrieval. Context, explo-
ration, and fusion. In: Lecture Notes in Computer Science, vol. 6817, pp. 74–88 (2012)
Jaser, E., Kittler, J., Christmas, W.: Hierarchical decision-making scheme for sports video
categorisation with temporal post-processing. In: IEEE Computer Society Conference on
Computer Vision and Pattern Recognition, vol. 2, pp. 908–913 (2004)
Jiang, X., Sun, T., Chen, B.: A novel video content classification algorithm based on combined
visual features model. In: Proceedings of the 2nd International Congress on Image and Signal
Processing, pp. 1–6 (2009)
Karpathy, A., Toderici, G., Shetty, S., Leung, T., Sukthankar, R., Fei-Fei, L.: Large-scale video
classification with convolutional neural networks. In: Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition, pp. 1725–1732 (2014)
Kazantsev, R., Zvezdakov, S., Vatolin, D.: Application of physical video features in classification
problem. Int. J. Open Inf. Technol. 7(5), 33–38 (2019)
Kittler, J., Messer, K., Christmas, W., Levienaise-Obada, B., Kourbaroulis, D.: Generation of
semantic cues for sports video annotation. In: Proceedings of International Conference on
Image Processing, vol. 3, pp. 26–29 (2001)
Koskela, M., Sjöberg, M., Laaksonen, J.: Improving automatic video retrieval with semantic
concept detection. Lect. Notes Comput. Sci. 5575, 480–489 (2009)
Li, L.-J., Su, H., Fei-Fei, L., Xing, E.P.: Object Bank: a high-level image representation for scene
classification and semantic feature sparsification. In: Proceedings of the Neural Information
Processing Systems (NIPS) (2010)
Liu, Y., Kender, J.R.: Video frame categorization using sort-merge feature selection. In: Pro-
ceedings of the Workshop on Motion and Video Computing, pp. 72–77 (2002)
Machajdik, J., Hanbury, A.: Affective image classification using features inspired by psychology
and art theory. In: Proceedings of the International Conference on Multimedia, pp. 83–92 (2010)
Maheswari, S.U., Ramakrishnan, R.: Sports video classification using multi scale framework and
nearest neighbour classifier. Indian J. Sci. Technol. 8(6), 529 (2015)
Mel, B.W.: SEEMORE: combining color, shape, and texture histogramming in a neurally inspired
approach to visual object recognition. Neural Comput. 9(4), 777–804 (1997). http://www.ncbi.
nlm.nih.gov/pubmed/9161022. Accessed on 04 Oct 2020
Miech, A., Laptev, I., Sivic, J.: Learnable pooling with context gating for video classification. arXiv
preprint arXiv:1706.06905 (2017)
Mohanan, S.: Sports video categorization by multiclass SVM using higher order spectra features.
Int. J. Adv. Signal Image Sci. 3(2), 27–33 (2017)
Pass, G., Zabih, R., Miller, J.: Comparing images using color coherence vectors. In: Proceedings of
the 4th ACM International Conference on Multimedia (1996)
Roach, M., Mason, J.: Classification of video genre using audio. In: Proceedings of Eurospeech 2001, pp. 2693–2696 (2001)
Safonov, I.V., Kurilin, I.V., Rychagov, M.N., Tolstaya, E.V.: Adaptive Image Processing Algo-
rithms for Printing. Springer Nature Singapore AG, Singapore (2018)
Safonov, I.V., Kurilin, I.V., Rychagov, M.N., Tolstaya, E.V.: Document Image Processing for
Scanning and Printing. Springer Nature Switzerland AG, Cham (2019)
Simonyan, K., Zisserman, A.: Two-stream convolutional networks for action recognition in videos.
In: Proceedings of 28th Conference on Neural Information Processing Systems, pp. 568–576
(2014)
Subashini, K., Palanivel, S., Ramalingam, V.: Audio-video based classification using SVM and
AANN. Int. J. Comput. Appl. 44(6), 33–39 (2012)
Takagi, S., Hattori, S.M., Yokoyama, K., Kodate, A., Tominaga, H.: Sports video categorizing
method using camera motion parameters. In: Proceedings of International Conference on
Multimedia and Expo, vol. II, pp. 461–464 (2003a)
Takagi, S., Hattori, S.M., Yokoyama, K., Kodate, A., Tominaga, H.: Statistical analyzing method of
camera motion parameters for categorizing sports video. In: Proceedings of the International
Conference on Visual Information Engineering. VIE 2003, pp. 222–225 (2003b)
Truong, B.T., Venkatesh, S., Dorai, C.: Automatic genre identification for content-based video
categorization. In: Proceedings of 15th International Conference on Pattern Recognition, vol.
4, p. 4230 (2000)
Vaswani, N., Chellappa, R.: Principal components null space analysis for image and video
classification. IEEE Trans. Image Process. 15(7), 1816–1830 (2006)
Wei, G., Agnihotri, L., Dimitrova, N.: Tv program classification based on face and text processing.
In: IEEE International Conference on Multimedia and Expo, vol. 3, pp. 1345–1348 (2000)
Wu, Z., Wang, X., Jiang, Y.G., Ye, H., Xue, X.: Modeling spatial-temporal clues in a hybrid deep
learning framework for video classification. In: Proceedings of the 23rd ACM International
Conference on Multimedia, pp. 461–470 (2015)
Ye, H., Wu, Z., Zhao, R.W., Wang, X., Jiang, Y.G., Xue, X.: Evaluating two-stream CNN for video
classification. In: Proceedings of the 5th ACM on International Conference on Multimedia
Retrieval, pp. 435–442 (2015)
Yuan, Y., Wan, C.: The application of edge feature in automatic sports genre classification. In:
IEEE Conference on Cybernetics and Intelligent Systems, vol. 2, pp. 1133–1136 (2004)
Zha, S., Luisier, F., Andrews, W., Srivastava, N., Salakhutdinov, R.: Exploiting image-trained CNN
architectures for unconstrained video classification. arXiv preprint arXiv:1503.04144 (2015)
Zhu, L., Yang, Y.: Compound memory networks for few-shot video classification. In: Proceedings
of the European Conference on Computer Vision (ECCV), pp. 751–766 (2018)
Chapter 8
Natural Effect Generation
and Reproduction
8.1 Introduction
The creation and sharing of multimedia presentations and slideshows has become a
pervasive activity. The development of tools for the automated creation of exciting,
entertaining, and eye-catching photo transitions and animation effects, accompanied
by background music and/or voice comments, has been a trend over the last decade (Chen
example, grass swaying in the wind or raindrop ripples in water.
The development of fast and realistic animation effects is a complex problem.
Special interactive authoring tools, such as Adobe After Effects and VideoStudio,
are used to create animation from an image. In these authoring tools, the effects are
selected and adjusted manually, which may require considerable effort from the user.
The resulting animation is saved as a file, thus requiring a significant amount of
storage space. During playback, such movies are always the same, thus leading to a
feeling of repetitiveness for the viewer.
For multimedia presentations and slideshows, it is preferable to generate ani-
mated effects on the fly with a high frame rate. Very fast and efficient algorithms are
necessary in order to provide the required performance; this is extremely difficult for
low-powered embedded platforms. We have been working on the development and
implementation of automatically generated animated effects for full HD images on
ARM-based CPUs without the use of GPU capabilities. Under such limited conditions,
the creation of realistic and impressive animated effects – especially for users
K. A. Kryzhanovskiy (*)
Align Technology Research and Development, Inc., USA, Moscow Branch, Russia
e-mail: kkryzhanovsky@gmail.com
I. V. Safonov
National Research Nuclear University MEPhI, Moscow, Russia
e-mail: ilia.safonov@gmail.com
Fig. 8.1 Detected beats affect the size of the flashing light
who are experienced at playing computer games on powerful PCs and consoles – is a
challenging task.
We have developed several algorithms for the generation of content-based ani-
mation effects from photos, such as flashing light, soap bubbles, sunlight spot,
magnifier effect, rainbow effect, portrait morphing transition effect, snow, rain,
fog, etc. For these effects, we propose a new approach for automatic audio-aware
animation generation. In this chapter, we demonstrate the adaptation of effect
parameters according to background audio for three effects: flashing light (see the
example in Fig. 8.1), soap bubbles, and sunlight spot. Obviously, the concept can be
extended to other animated effects.
There are several content-adaptive techniques for the generation of animation from
still photos. Sakaino (2005) describes an algorithm for the generation of plausible
motion animation from textures. Safonov and Bucha (2010) describe the animated
thumbnail, which is a looped movie demonstrating salient regions of the scene in
sequence. Animation simulates camera tracking in, tracking out, and panning
between detected visual attention zones and the whole scene. More information
can be found in Chap. 14 (‘An Animated Graphical Abstract for an Image’).
Music plays an important role in multimedia presentations. There are some
methods aimed at aesthetical audiovisual composition in slideshows. ‘Tiling
slideshow’ (Chen et al. 2006) describes two methods for the analysis of background
audio to select timing for photo and frame switching. The first method is beat
detection. The second is energy dynamics calculated using root-mean-square values
of adjacent audio frames.
There are other concepts for the automatic combination of audio and visual
information in multimedia presentations. Dunker et al. (2011) suggest an approach
In general, the procedure to create an animation effect from a single still image
consists of the following major stages: effect initialization and effect execution.
During effect initialization, certain operations that have to be made only once for the
entire effect span are performed. Such operations may include source image format
conversion, pre-processing, analysis, segmentation, and creation of some visual
objects and elements displayed during effect duration, etc. At the execution stage
for each subsequent frame, a background audio analysis is performed, visual objects
and their parameters are modified depending on the time elapsed, audio features are
calculated, and the entire modified scene is visualized. The animation effect
processing flow chart is displayed in Fig. 8.2.
The flashing light effect displays several flashing and rotating coloured light stars
over the bright spots on the image. In this effect, the size, position, and colour of
flashing light stars are defined by the detected position, size, and colour of the bright
areas on the source still image. An algorithm performs the following steps to detect
small bright areas in the image:
• Calculation of the histogram of brightness of the source image.
• Calculation of the segmentation threshold as the grey level corresponding to a
specified fraction of the brightest pixels of the image using the brightness
histogram.
• Segmentation of the source image by thresholding; during thresholding, a
morphological majority filter is used to suppress isolated bright pixels so that
only localized groups of bright pixels are retained.
• Calculation of the following features for each connected region of interest (ROI):
(a) Mean colour Cmean.
(b) Centroid (xc,yc).
(c) Image fraction F (fraction of the image area, occupied by ROI).
(d) Roundness (the ratio of the diameter of the circle with the same area as the
ROI to the maximum dimension of the ROI):

$$K_r = \frac{2\sqrt{S/\pi}}{\max(W, H)},$$

where S is the area of the ROI and W, H are the ROI bounding box dimensions.
(e) Quality (the integral parameter characterizing the possibility of the ROI being
a light source, calculated as follows):
where $Y_{max}$ is the maximum brightness of the ROI, $Y_{mean}$ is the mean brightness
of the ROI, and $K_F$ is the coefficient of ROI size:

$$K_F = \begin{cases} F/F_0, & \text{if } F \le F_0 \\ F_0/F, & \text{if } F > F_0. \end{cases}$$
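The threshold-selection and roundness steps of the detector can be sketched as follows; connected-component labelling between the two steps is omitted, and the brightest-pixel fraction is an illustrative assumption:

```python
import numpy as np

# Sketch of two steps from the bright-area detector: (1) picking the
# segmentation threshold as the grey level above which roughly a specified
# fraction of the brightest pixels lies, using the brightness histogram, and
# (2) the roundness feature K_r = 2*sqrt(S/pi) / max(W, H) of a region.

def brightness_threshold(gray, fraction=0.02):
    """Grey level such that about `fraction` of the pixels are at or above it."""
    hist = np.bincount(gray.ravel(), minlength=256)
    target = fraction * gray.size
    cum = 0
    for level in range(255, -1, -1):
        cum += hist[level]
        if cum >= target:
            return level
    return 0

def roundness(area, width, height):
    # Diameter of the equal-area circle over the ROI's maximum dimension;
    # close to 1 for compact round blobs, small for elongated regions.
    return 2.0 * np.sqrt(area / np.pi) / max(width, height)

gray = np.zeros((100, 100), dtype=np.uint8)
gray[40:50, 40:50] = 255   # 100 brightest pixels
gray[60:70, 60:70] = 200   # next 100
t = brightness_threshold(gray, fraction=0.015)   # target: 150 pixels
print(t)                                  # 200: both bright squares kept
print(round(roundness(100, 10, 10), 4))   # 10x10 square: ~1.1284
```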
This effect displays soap bubbles moving over the image. Each bubble is
composed of a colour map, an alpha map, and highlight maps. A set of highlight
maps with different highlight orientations is calculated in advance for each bubble.
Highlight position depends on the lighting direction in corresponding areas of the
image. Lighting gradient is calculated using a downscaled brightness channel of the
image.
Fig. 8.4 Light star-shape templates: (a) halo template, (b) ray template
The colour map is modulated by the highlight map, selected from the set of
highlight maps in accordance with the average light direction around the bubble, and
then is combined with the source image using alpha blending with a bubble alpha
map. Figure 8.5 illustrates the procedure of soap bubble generation from alpha and
colour maps.
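A minimal sketch of the compositing step described above; treating highlight modulation as element-wise multiplication is an assumption, as are the synthetic map values:

```python
import numpy as np

# The bubble's colour map is modulated by a highlight map and then
# alpha-blended onto the source image using the bubble's alpha map.

def composite_bubble(image, colour, highlight, alpha):
    """image, colour, highlight: float arrays (h, w, 3) in [0, 1];
    alpha: (h, w, 1) in [0, 1]. Returns the blended image."""
    lit = np.clip(colour * highlight, 0.0, 1.0)   # modulate by highlight
    return alpha * lit + (1.0 - alpha) * image    # alpha blending

image = np.full((2, 2, 3), 0.5)
colour = np.full((2, 2, 3), 0.8)
highlight = np.ones((2, 2, 3))
alpha = np.full((2, 2, 1), 0.25)
out = composite_bubble(image, colour, highlight, alpha)
print(out[0, 0])  # 0.25*0.8 + 0.75*0.5 = [0.575, 0.575, 0.575]
```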
During animation, soap bubbles move smoothly over the image from bottom to
top or vice versa while oscillating slightly in a horizontal direction to give the
impression of real soap bubbles floating in the air. Figure 8.6 demonstrates a
frame of animation with soap bubbles.
Fig. 8.5 Soap bubble generation from alpha and colour maps
This effect displays a bright spot moving over the image. Prior to starting the effect,
the image is dimmed according to its initial average brightness. Figure 8.7a shows an
image with the sunlight spot effect. The spotlight trajectory and size are defined by
the attention zones of the photo.
Similar to the authors of many existing publications, we consider human faces
and salient regions to be attention zones. In addition, we regard text inscriptions as
attention zones because these can be the name of a hotel or town in the background
of the photo. In the case of a newspaper, such text can include headlines.
Despite great achievements by deep neural networks in the area of multi-view
face detection (Zhang and Zhang 2014), the classical Viola–Jones face detector
(Viola and Jones 2001) is widely used in embedded systems due to its low power
consumption. The number of false positives can be decreased with additional skin
tone segmentation and processing of the downsampled image (Egorova et al. 2009).
So far, a universal model of human vision does not exist, but the pre-attentive
vision model based on feature integration theory is well known. In this case, because
the observer is at the attentive stage while viewing the photo, a model of human
pre-attentive vision is not strictly required. However, existing approaches for the
detection of regions of interest are based on saliency maps, and these often provide
reasonable outcomes, whereas the use of the attentive vision model requires too
much prior information about the scene, and it is not generally applicable. Classical
saliency map-building algorithms (Itti et al. 1998) have a very high computational
Fig. 8.7 Demonstration of the sunlight spot effect: (a) particular frame, (b) detected attention zones
complexity. That is why researchers have devoted a lot of effort to developing fast
saliency map creation techniques. Cheng et al. (2011) compare several algorithms
for salient region detection. We implemented the histogram-based contrast method
into our embedded platform.
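The core idea of the histogram-based contrast (HC) method can be illustrated in a few lines; this toy version works on a quantized single-channel image, whereas the published method operates in Lab colour space with additional smoothing:

```python
import numpy as np

# HC idea (Cheng et al. 2011): the saliency of a colour is the sum of its
# distances to all other colours, weighted by how often those colours occur.
# Rare, distinct colours come out most salient.

def hc_saliency(quantized):
    """quantized: int array (h, w) with a small number of levels."""
    levels, counts = np.unique(quantized, return_counts=True)
    freq = counts / quantized.size
    # Saliency per level: frequency-weighted distance to every other level.
    sal = np.array([np.sum(freq * np.abs(levels - lv)) for lv in levels])
    sal = sal / sal.max() if sal.max() > 0 else sal
    lut = dict(zip(levels.tolist(), sal.tolist()))
    return np.vectorize(lut.get)(quantized)

img = np.zeros((10, 10), dtype=int)
img[4:6, 4:6] = 8            # small distinct patch -> most salient
sal_map = hc_saliency(img)
print(sal_map[5, 5] > sal_map[0, 0])  # True: the rare colour is more salient
```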
While developing an algorithm for the detection of areas with text, we took into
account the fact that text components are arranged in an orderly way and have
similar texture features. Firstly, we applied a LoG edge detector. Then, we filtered
the resulting connected components based on an analysis of the texture features. We
used the following features (Safonov et al. 2019):
1. Mean brightness of the block:

$$\bar B_i = \frac{\sum_{r=1}^{N}\sum_{c=1}^{N} B_i(r,c)}{N^2}.$$

2. Mean absolute difference between the block's mean brightness and that of its
four neighbouring blocks:

$$dB_i = \frac{\sum_{k=1}^{4} |\bar B_i - \bar B_k|}{4}.$$

3. Mean magnitude of the horizontal and vertical brightness differences within
the block:

$$d_{x,y}B_i = \frac{\sum_{r=1}^{N}\sum_{c=1}^{N-1} dB_{ix}(r,c) + \sum_{r=1}^{N-1}\sum_{c=1}^{N} dB_{iy}(r,c)}{2N(N-1)}.$$

4. Block homogeneity:

$$H = \sum_{i,j} \frac{N_d(i,j)}{1+|i-j|}.$$

5. The percentage of pixels whose gradient magnitude exceeds a threshold T:

$$P_g = \sum_{\forall(r,c)\in B_i} \{1 \mid \nabla B_i(r,c) > T\}\, /\, N^2,$$

where ∇Bi(r, c) is calculated as the square root of the sum of the squares of the
horizontal and vertical derivatives.

6. The percentage of pixel value changes after the morphological operation of
opening $B^o_i$ on a binary image $B^b_i$, obtained by binarization with a threshold
of 128:

$$P_m = \sum_{\forall(r,c)\in B_i} \{1 \mid B^o_i(r,c) \ne B^b_i(r,c)\}\, /\, N^2.$$
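Two of these block features translate directly into code; a sketch in which the 3×3 structuring element and the synthetic block are assumptions:

```python
import numpy as np

# P_g: fraction of pixels with gradient magnitude above a threshold T.
# P_m: fraction of pixels changed by a 3x3 morphological opening (erosion
# followed by dilation) of the binarized block.

def grad_fraction(block, T):
    gy, gx = np.gradient(block.astype(float))
    return np.mean(np.sqrt(gx**2 + gy**2) > T)

def erode(b):
    p = np.pad(b, 1, constant_values=1)
    return np.min([p[i:i + b.shape[0], j:j + b.shape[1]]
                   for i in range(3) for j in range(3)], axis=0)

def dilate(b):
    p = np.pad(b, 1, constant_values=0)
    return np.max([p[i:i + b.shape[0], j:j + b.shape[1]]
                   for i in range(3) for j in range(3)], axis=0)

def opening_change_fraction(block, thresh=128):
    binary = (block >= thresh).astype(np.uint8)
    opened = dilate(erode(binary))
    return np.mean(opened != binary)

block = np.zeros((8, 8), dtype=np.uint8)
block[2, 2] = 255          # isolated bright pixel: removed by opening
block[4:7, 4:7] = 255      # 3x3 blob: survives opening
print(opening_change_fraction(block))  # only the isolated pixel changes: 1/64
```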
order and similar colours and texture features, in groups. After that, we classified the
resulting groups. We formed final text zones on the basis of groups classified as text.
Figure 8.7b shows the detected attention zones. A red rectangle depicts a detected
face; green rectangles denote text regions; yellow is a bounding box of the most
salient area according to the HC (histogram-based contrast) method.
where f is the frequency, and c is the origin on the frequency axis. The angle defines
the hue of the current frequency.
Depending on the value of the current note, we defined the brightness of the
selected hue and drew it on a colour circle. We used three different approaches to
display colour on the colour wheel: painting sectors, painting along the radius, and
using different geometric primitives inscribed into the circle.
In the soap bubble effect, depending on the generated colour circle, we determined
the colour of the bubble texture. Figure 8.11 demonstrates an example of soap
bubbles with colour distribution depending on music. In the sunlight spot effect, the
generated colour circle determines the distribution of colours for the highlighted spot
(Fig. 8.12).
In the second approach, we detected beats or rhythm of the music. There are
numerous techniques for beat detection in the time and frequency domains (Scheirer
1998; Goto 2004; Dixon 2007; McKinney et al. 2007; Kotz et al. 2018; Lartillot and
Grandjean 2019). We faced constraints due to real-time performance limitations, and
we were dissatisfied with the outcomes for some music genres. Finally, we assumed
that a beat is present if there are significant changes of values in several
frequency bands at once. This method meets the performance requirements while finding
beats with acceptable quality. Figure 8.1 illustrates how detected beats affect the size of the flashing light. If
a beat was detected, we instantly maximized the size and brightness of the lights, and
they then gradually returned to their normal state until the next beat happened. Also,
it was possible to change the flashing lights when the beat occurred (by turning on
and off light sources). In the soap bubble effect, we maximized saturation of the soap
bubble colour when the beat occurred. We also changed the direction of moving
soap bubbles as the beat happened. In the sunlight spot effect, if a beat was detected,
we maximized the brightness and size of the spot, and these then gradually returned
to their normal states.
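The band-change heuristic can be sketched as follows; the band edges, energy-ratio threshold, and minimum band count are illustrative assumptions, not the authors' tuned values:

```python
import numpy as np

# A beat is declared when the energy in several frequency bands changes
# significantly at once. Band energies per frame come from an FFT magnitude
# spectrum of the current audio fragment.

def band_energies(frame, sample_rate, edges=(0, 200, 800, 3200, 8000)):
    spec = np.abs(np.fft.rfft(frame)) ** 2
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sample_rate)
    return np.array([spec[(freqs >= lo) & (freqs < hi)].sum()
                     for lo, hi in zip(edges[:-1], edges[1:])])

def is_beat(current, previous, ratio=2.0, min_bands=2):
    # Significant change = the band's energy exceeds the previous frame's
    # energy by `ratio`; require this in at least `min_bands` bands.
    changed = current > ratio * (previous + 1e-12)
    return int(changed.sum()) >= min_bands

sr = 8000
t = np.arange(1024) / sr
quiet = 0.01 * np.sin(2 * np.pi * 100 * t)
loud = np.sin(2 * np.pi * 100 * t) + np.sin(2 * np.pi * 1000 * t)
prev = band_energies(quiet, sr)
cur = band_energies(loud, sr)
print(is_beat(cur, prev))  # True: the low and mid bands both jump
```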
In the third approach, we analysed the presence of low, middle, and high
frequencies in audio signals. This principle is used in colour music installations. In
the soap bubble effect, we assigned a frequency range for each soap bubble and
defined its saturation according to the value of the corresponding frequency range. In
the flashing light effect, we assigned each light star to its own frequency range and
defined its size and brightness depending on the value of the frequency range.
Figure 8.13 shows how the presence of low, middle, and high frequencies affects the
flashing lights.
Another approach is not to divide the spectrum into low, middle, and high
frequencies but rather to assign these frequencies to different tones inside octaves.
Therefore, we used an equalizer containing a large number of bands and where each
octave had enough corresponding bands. We accumulated the values of each
equalizer band to a buffer cell where the corresponding cell number was calculated
using the following statement:
$$num = \frac{\log_2(f/c) \cdot 360 \bmod 360}{360/\mathrm{length}} + 1,$$
where f is the frequency, c is the origin of the frequency axis, and length is the
number of cells.
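The cell-index formula transcribes directly; truncation toward zero in the quantization step is an assumption:

```python
import math

# Map a frequency f onto a position within an octave (log2(f/c) * 360 mod 360,
# an angle on the colour circle), then quantize that angle into `length` cells.

def cell_number(f, c, length):
    angle = (math.log2(f / c) * 360.0) % 360.0
    return int(angle / (360.0 / length)) + 1

# With c = 440 Hz as the origin, 440 Hz and 880 Hz (one octave apart) land in
# the same cell, while 660 Hz lands roughly half way around the circle.
print(cell_number(440.0, 440.0, 12))  # 1
print(cell_number(880.0, 440.0, 12))  # 1
print(cell_number(660.0, 440.0, 12))  # 8
```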
Each cell controls the behaviour of selected objects. In the soap bubble effect, we
assigned each soap bubble to a corresponding cell and defined its saturation
depending on the value of the cell. In the flashing light effect, we assigned each light
to a corresponding cell and defined its size and brightness depending on the value of
the cell.
participants of the survey said that they did not like the photos or background music
used for the demonstration. It is also worth noting that eight of the respondents were
women and, on average, they rated the effects much higher than men.
Therefore, we can claim that the outcomes of the subjective evaluation demonstrate
the observers' satisfaction with this new type of audiovisual presentation: audio-aware
animation behaves uniquely each time it is played back and does not repeat itself
during playback, creating vivid and lively impressions for the observer. Many
observers were excited by the effects and would like to see such features in their
multimedia devices.
Several other audio-aware animation effects can be proposed. Figure 8.15 shows
screenshots of our audio-aware effect prototypes. In the rainbow effect (Fig. 8.15a), the
colour distribution of the rainbow changes according to the background audio
spectra. The movement direction, speed, and colour distribution of confetti and
serpentines are adjusted according to music rhythm in the confetti effect
(Fig. 8.15b). Magnifier glass movement speed and magnification are affected by
background music tempo in the magnifier effect (Fig. 8.15c). In the lightning effect
(Fig. 8.15d), lightning bolt strikes are matched to accents in background audio.
Obviously, other approaches to adapting the behaviour of animation to background
audio are also possible. For example, the left and right audio channels can be analysed
separately, with different behaviours applied to the left and right sides of the screen,
respectively. Other effects amenable to music adaptation can be created.
Fig. 8.15 Audio-aware animation effect prototypes: (a) rainbow effect; (b) confetti effect; (c)
magnifier effect; (d) lightning effect
Chapter 9
Image Classification as a Service
Mikhail Y. Sirotenko
M. Y. Sirotenko (*)
111 8th Ave, New York, NY 10011, USA
e-mail: mihail.sirotenko@gmail.com
9.1 Introduction
Image classification systems could be divided into different kinds depending on the
application and implementation:
1. Binary classifiers separate images into one of two mutually exclusive classes,
while multi-class classifiers can predict one of many classes for a given image.
2. Multi-label classifiers are classifiers that can predict many labels per image
(or none).
3. Based on the chosen classes, there are hierarchical or flat classification systems.
Hierarchical ones assume a certain taxonomy or ontology imposed on classes.
4. Classification systems could be fine-grained or coarse-grained based on the
granularity of chosen classes.
5. If classification is localised, then it means the class is applied to a certain region of
the image. If it is not localised, then the classification happens for the entire
image.
6. Specialist classifiers usually focus on a relatively narrow classification task
(e.g. classifying dog breeds), while generalist classifiers can work with any images.
Figure 9.1 shows an example of the hierarchical fine-grained localised multi-class
classification system (Jia et al. 2020).
Given that the type of image classification system discussed here is a web-based
service, we can deduce a set of constraints and assumptions.
The first assumption is that we are able to run the backend on an arbitrary server
as opposed to running image classification on a specific hardware (e.g. smartphone).
This means that we are free to choose any type of model architecture and any type of
hardware to run the model without very strict limitations on the memory, computa-
tional resources or battery life. On the other hand, transferring data to the web service
may become a bottleneck, especially for users with small bandwidth. This means
that the classification system should work well with compressed images having
relatively low resolution.
Another important consideration is concurrency. Since we are building a web
server system, it should be designed to support concurrent user requests.
The very first question to ask before even starting to design an image classification
system is how this system may end up being used and how to make sure it will do no
harm to people. Recently, we have seen a growing number of examples of unethical
or discriminatory uses of AI, including the use of computer vision for military
purposes, mass surveillance, impersonation, spying and privacy violations.
One of the recent examples is this: certain authorities use facial recognition to
track and control a minority population (Mozur 2019). This is considered by many as
the first known example of intentionally using artificial intelligence for racial
profiling. Some local governments are banning facial recognition in public places
since it is a serious threat to privacy (y Arcas et al. 2017).
While military or authoritarian governments using face recognition technology is
an instance of unethical use of otherwise useful technology, there are other cases
when AI systems are flawed by design and represent pseudoscience that could hurt
some groups of people if attempted to be used in practice. One example of such
pseudoscience is the work titled “Automated Inference on Criminality Using Face
Images” published in November 2016. The authors claimed that they trained a neural
network to classify people’s faces as criminal or non-criminal. The practice of using
people's outer appearance to infer inner character is called physiognomy, a
pseudoscience that could lead to dangerous consequences if put into practice and
represents an instance of broader scientific racism.
Another kind of issue that may lead to unethical use of the image classification
system is algorithmic bias. Algorithmic bias is defined as unjust, unfair or prejudicial
treatment of people related to race, income, sexual orientation, religion, gender and
other characteristics historically associated with discrimination and marginalisation,
when and where it is manifested in algorithmic systems or algorithmically aided
240 M. Y. Sirotenko
Fig. 9.2 The many ways in which human bias can be introduced into a machine learning system
decision-making. Algorithmic biases could amplify human biases while building the
system. Most of the time, the introduction of algorithmic biases happens without
intention. Figure 9.2 shows that human biases could be introduced into machine
learning systems at every stage of the process and even lead to positive feedback
loops.
Such algorithmic biases could be mitigated by properly collecting the training
data and using metrics that measure fairness in addition to standard accuracy metrics.
It is impossible to list all kinds of end-to-end metrics because they are very problem
dependent, but we can list some of the common ones used for image classification
systems:
• Click-through rate measures what percentage of users clicked a certain hyperlink.
This metric could be useful for the image content-based recommendation system.
Consider a user who is looking to buy an apparel item; based on her preferences,
the system shows an image of a recommended item. If the user clicks on the
image, the result is likely relevant.
• Win/loss ratio measures the number of successful applications of the system vs
unsuccessful ones over a certain period of time compared to some other system or
human. For example, if the goal is to classify images of a printed circuit board as
defective vs non-defective, we could compare the performance of the automatic
image classification system with that of the human operator. While comparing,
we count how many times the classifier made a correct prediction while the
operator made a wrong prediction (win) and how many times the classifier
made a wrong prediction while the human operator was correct (loss). Dividing
the count of wins by the count of losses, we can conclude whether deploying an
automated system makes sense.
• Man-hours saved. Consider a system that classifies a medical 3D image and, based
on the classification, highlights areas indicating potential disease. Such a system
would be able to save a certain amount of time for a medical doctor performing
a diagnosis.
There are dozens of classification metrics being used in the field of machine learning.
We will discuss the most commonly used and relevant for image classification.
Accuracy This is the simplest and most fundamental metric for any classification
problem. It is computed as the number of correctly classified samples divided by the
total number of samples. Here and further, by sample we mean an image from the
test set associated with one or more ground-truth labels. A correctly classified
sample is one for which the model prediction matches the ground truth. This metric
can be used in single-label classification problems. Its downsides are that it assumes
a prediction exists for every sample; it ignores the score and order of predictions;
it can be sensitive to incorrect or incomplete ground truth; and it ignores test data
imbalance. It would be fair to say that this metric is good for toy problems but not
sufficient for most real practical tasks.
Top K Accuracy This is a modification of the accuracy metric where the prediction
is considered correct if any of the top K-predicted classes (according to the score)
matches with the ground truth. This change makes the metric less prone to incom-
plete or ambiguous labels and helps to promote models that do a better job at ranking
and scoring classification predictions.
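As a sketch (using NumPy, with assumed array shapes for scores and labels), top K accuracy can be computed as:

```python
import numpy as np

def top_k_accuracy(scores, labels, k=1):
    # scores: (n_samples, n_classes) prediction scores
    # labels: (n_samples,) ground-truth class indices
    top_k = np.argsort(scores, axis=1)[:, -k:]     # indices of the k highest scores
    hits = (top_k == labels[:, None]).any(axis=1)  # is the ground truth among them?
    return hits.mean()                             # k=1 reduces to plain accuracy
```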
Precision Let’s define true positives (TP) as the number of all predictions that
match (at least one) ground-truth label for a corresponding sample, false positives
(FP) as the number of predictions that do not match any ground-truth labels for a
corresponding sample and false negatives (FN) as the number of ground-truth labels
for which no matching predictions exist. Then, the precision metric is computed as:
Precision = TP / (TP + FP).
The precision metric is useful for models that may or may not predict a class for
any given input (which is usually achieved by applying a threshold to the prediction
confidence). If predictions exist for all test samples, then this metric is equivalent to
the accuracy.
Recall Using the notation defined above, the recall metric is defined as:
Recall = TP / (TP + FN).
This metric ignores false predictions and only measures how many of the true
labels were correctly predicted by the model.
Note that precision and recall metrics are oftentimes meaningless if used in
isolation. By tuning a confidence threshold, one can trade off precision for recall
and vice versa. Thus, for many models, it is possible to achieve nearly perfect
precision or recall.
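One way to instantiate these definitions for multi-label predictions, representing each sample's predictions and ground truth as sets of labels (a sketch, not the chapter's exact counting scheme):

```python
def precision_recall(predictions, ground_truth):
    """predictions, ground_truth: lists of label sets, one per sample."""
    tp = fp = fn = 0
    for pred, gt in zip(predictions, ground_truth):
        tp += len(pred & gt)   # predictions matching a ground-truth label
        fp += len(pred - gt)   # predictions matching no ground-truth label
        fn += len(gt - pred)   # ground-truth labels with no matching prediction
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall
```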
Recall at X% Precision and Precision at Y% Recall These are more practical
versions of the precision and recall metrics we defined above. They measure the
maximum recall at a target precision or maximum precision at a given recall. Target
precision and recall are usually derived from the application. Consider an image
search application where the user enters a text query and the system outputs images
that match that query from the database of billions of images. If the precision of an
image classifier used as a part of that system is low, the user experience could be
extremely poor (imagine searching something and seeing only one out of ten results
to be relevant). Thus, a reasonable goal could be to set a precision goal to be 90% and
try to optimise for as high recall as possible. Now consider another example –
classifying whether an MRI scan contains a tumour or not. The cost of missing an
image with a tumour could be literally deadly, while the cost of incorrectly
predicting that an image has a tumour would require a physician to double check
the prediction and discard it. In this case, recall of 99% could be a reasonable
requirement, while the model could be optimised to deliver as high precision as
possible. Further information on MRI can be found in Chaps. 11 and 12.
9 Image Classification as a Service 243
F1 Score Precision at X% recall and recall at Y% precision are useful when there is
a well-defined precision or recall goal. But what if we don’t have a way to fix either
precision or recall and measure the other one? This could happen if, for example,
prediction scores are not available or if the scores are very badly calibrated. In that
case, we can use the F1 score to compare two models. The F1 score is defined as
follows:
F1 = 2 · Precision · Recall / (Precision + Recall).
The F1 score takes the value of 1 in the case of 100% precision and 100% recall
and takes the value of 0 if either precision or recall is 0.
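A direct transcription, assuming precision and recall have already been computed:

```python
def f1_score(precision, recall):
    # harmonic mean of precision and recall; zero if either is zero
    if precision == 0.0 or recall == 0.0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)
```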
Precision-Recall Curve and AUC-PR In many cases, model predictions have
associated confidence scores. By varying the score threshold from the minimum to
the maximum, we can calculate precision and recall at each of those thresholds and
generate a plot that will look like the one in Fig. 9.3. This plot is called the PR curve,
and it is useful to compare different models. From the example, we can conclude that
model A provides a better precision in low recall modes, while model B provides a
better recall at low precision mode.
PR curves are useful to better understand model performance in different modes,
but comparing and concluding which model is absolutely better could be hard. For
that purpose, we can calculate an area under the PR curve. This will provide us with
the single metric that captures both precision and recall of the model on the entire
range of threshold.
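A minimal sketch of this computation for a binary problem (NumPy; a step-function approximation of the area, not any specific library's implementation):

```python
import numpy as np

def pr_curve(scores, labels):
    """scores: positive-class confidences; labels: 0/1 ground truth."""
    order = np.argsort(-scores)          # sort by decreasing confidence
    labels = labels[order]
    tp = np.cumsum(labels)               # true positives at each threshold
    fp = np.cumsum(1 - labels)           # false positives at each threshold
    precision = tp / (tp + fp)
    recall = tp / labels.sum()
    return precision, recall

def auc_pr(precision, recall):
    # area under the PR curve via a step (rectangular) approximation
    return float(np.sum(np.diff(np.concatenate(([0.0], recall))) * precision))
```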
Fig. 9.3 Precision-recall curves for two models, A and B (precision on the vertical axis, recall on the horizontal axis)
Both inference and training time are very important for all practical applications of
image classification models for several reasons. First of all, faster inference means
lower classification latency and therefore better user experience. Fast inference also
means classification systems may be applied to real-time streamed data or to
multidimensional images. Inference speed also correlates with the model complexity
and required computing resources, which means it is potentially less expensive
to run.
Training speed is also an important factor. Some very complex models could take
weeks or months to train. Not only can this be very expensive, but it also slows
down innovation and increases risk, since it takes a long time before it becomes
clear whether the model has succeeded.
One of the ways to measure the computational complexity of a model is by counting
the floating point operations (FLOPs) required to run the inference or the training
step. This metric is a good approximation for comparing the complexity of different
models. However, it can be a bad predictor of real processing time. The
reason is because certain structures in the model may be better utilised by modern
hardware accelerators. For example, it is known that models that require a lot of
memory copies are running slower even if they require fewer FLOPs for operation.
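As one back-of-the-envelope illustration, the FLOP count of a single convolutional layer can be estimated as follows (counting a multiply-add as two operations and ignoring padding, stride and bias details):

```python
def conv2d_flops(out_h, out_w, c_in, c_out, kernel=3):
    # each output element requires c_in * kernel^2 multiply-adds
    return 2 * out_h * out_w * c_out * c_in * kernel * kernel
```

For example, a 3x3 convolution producing a 56x56x64 output from 64 input channels costs roughly 231 MFLOPs.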
For the reasons above, the machine learning community is working towards
standardising benchmarks to measure actual inference and training speed of certain
models on a certain hardware. Training benchmarks measure the wall-clock time
required to train a model on one of the standard datasets to achieve the specified
quality target. Inference benchmarks consist of many different runs with varying
input image resolution, floating point precision and QPS (queries per second) rates.
Another important feature of the model is how it behaves if the input images are
disturbed or from a domain different from the one the model was trained on. The set
of metrics used to measure these features is the same as that used for measuring
classification quality. The difference is in the input data. To measure model robust-
ness, one can add different distortions and noise to the input images or collect a set of
images from a different domain (e.g. if the model was trained on images collected
from the internet, one can collect a test set of images collected from smartphone
cameras).
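For instance, a robustness test set can be generated from clean images by adding synthetic distortions; the sketch below adds Gaussian noise (the noise level and seed are arbitrary choices):

```python
import numpy as np

def add_gaussian_noise(image, sigma=10.0, seed=0):
    """image: uint8 array; returns a noisy copy for robustness testing."""
    rng = np.random.default_rng(seed)
    noisy = image.astype(np.float32) + rng.normal(0.0, sigma, image.shape)
    return np.clip(noisy, 0, 255).astype(np.uint8)
```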
9.4 Data
Data is the key for modern image classification systems. In order to build an image
classification system, one needs training, validation and test data at least. In addition
to that, a calibration dataset is needed if the plan is to provide calibrated confidence
scores, and out-of-domain data is needed to fine-tune model robustness. Building a
practical machine learning system is more about working with the data than actually
training the model. This is why more and more organisations today are establishing
data operations teams (DataOps) whose responsibilities are acquisition, annotation,
quality assurance, storage and analysis of the datasets.
Figure 9.4 shows the main steps of the dataset preparation process. Preparation of
the high-quality dataset is an iterative process. It includes data acquisition, human
annotation or verification, data engineering and data analysis.
Fig. 9.4 Main steps of the dataset preparation process: data acquisition (public data, synthetic data, commercial datasets, crowdsourcing, controlled acquisition), followed by human annotation or verification, data engineering and data analysis, connected by a feedback loop
Data acquisition is the very first step in preparing the dataset. There are several ways
to acquire the data with each way having its own set of pros and cons. The most
straightforward way is to use internal historical data owned by the organisation. An
example of that could be a collection of images of printed circuit boards with quality
assurance labels whether PCB is defective or not. Such data could be used with
minimal additional processing. The downside of using only internal data is that the
volume of samples might be small and the data might be biased in some way. In the
example above, it may happen that all collected images are made with the same
camera having exactly the same settings. Thus, when the model is trained on those
images, it may overfit to a certain feature of that camera and stop working after a
camera upgrade.
Another potential source of data is publicly available data. This may include
public datasets such as ImageNet (Fei-Fei and Russakovsky 2013) or COCO (Lin
et al. 2014) as well as any publicly available images on the Internet. This approach is
the fastest and least expensive way to get started if your organisation does not own
any in-house data or the volumes are not enough. There is a big caveat with this data
though. Most of these datasets are licensed for research or non-commercial use only,
which makes them impossible to use for business applications. Even datasets with less
restrictive licences may contain images with wrong licence information which may
lead to a lawsuit by the copyright owner. The same applies to public images
collected from the Internet. In addition to copyright issues, many countries are
tightening up privacy regulations. For example, General Data Protection Regulation
(GDPR) law in the European Union treats any image that may be used to identify an
individual as personal data, and therefore companies that collect images that may
contain personally identifiable information have to comply with storage and reten-
tion requirements of the law.
The more expensive way of acquiring the data is to buy a commercial dataset if
one exists that fits your requirements. The number of companies that are selling
datasets is growing rapidly these days; so, it is possible to purchase the dataset for
most popular applications.
Data crowdsourcing is the strategy to collect the data (usually including annota-
tions) using a crowdsourcing platform. Such a platform asks users to collect the
required data either for compensation or for free as a way to contribute to the
improvement of a service. An example of a paid data crowdsourcing platform is
Mobeye, and an example of a free data crowdsourcing platform is Google
Crowdsource. Another way of data crowdsourcing implementation is through the
data donation option available in the application or service. Some services provide
an option for users to donate their photos or other useful information that is
otherwise considered private to the owner of the service in order to improve that
service.
Controlled data acquisition is the process of creating data using a specially
designed system. An example of such a system is a set of cameras pointing to a
Depending on the way the images were acquired, they may or may not have ground-
truth labels, or the labels may be not reliable enough and need to be verified by
humans. Data annotation is the process of assigning or verifying ground-truth labels
to each image.
Data annotation is one of the most expensive stages of building image classifi-
cation systems as it requires the human annotator to visually analyse every sample,
which could be time-consuming. The process is typically managed using special
software that handles a dataset of raw images and associated labels, distributes work
among annotators, combines results and implements a user interface to do annota-
tions in the most efficient way. Currently, dozens of free and paid image annotation
platforms exist that offer different kinds of UIs and features. Several strategies in
data annotation exist aimed to reduce the cost and/or increase the ground-truth
quality which we discuss below:
Outsourcing to an Annotation Vendor Currently, there exist dozens of companies
providing data annotation services. These companies specialise in the cost-effective
annotation of data. Besides having the right software tools, they
handle process management, quality assurance, storage and delivery. Many of
such companies provide jobs in areas of the world with very low income or to
incarcerated people who would not have other means to earn money. Besides the
benefits of reduced costs of labour and reduced management overhead, another
Fig. 9.5 The active learning loop: an ML model is trained on labeled data; its predictions, features or gradients on unlabeled data feed a sample selection strategy; the selected samples go to human annotation, and the newly labeled data updates the datasets
would have to spend a lot of money to annotate all of them to get annotations for all
rare fruits. In order to tackle this problem, an active learning approach could be used
(Schröder and Niekler 2020). Active learning attempts to maximise a model’s
performance gain while annotating the fewest samples possible. The general idea
of active learning is shown in Fig. 9.5.
Active learning helps to either significantly reduce costs or improve quality by
only annotating the most valuable samples.
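A common selection strategy picks the unlabeled samples the model is least confident about; a minimal sketch, assuming softmax scores are available:

```python
import numpy as np

def select_least_confident(scores, budget):
    # scores: (n_samples, n_classes) softmax outputs on unlabeled data
    confidence = scores.max(axis=1)         # confidence of the top prediction
    return np.argsort(confidence)[:budget]  # indices of least confident samples
```

The returned indices are then sent to human annotation, and the loop repeats.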
Data engineering is a process of manipulating the raw data in a way to make it useful
for training and/or evaluation. Here are some typical operations that may be required
to prepare the data:
• Finding near-duplicate images
• Ground-truth label smearing and merging
• Applying taxonomy rules and removing contradicting labels
• Offline augmentation
• Image cropping
• Removing low-quality images (low resolution or size) and low confidence labels
(e.g. labels that have low agreement between annotators)
• Removing inappropriate images (porn, violence, etc.)
• Sampling, querying and filtering samples satisfying certain criteria (e.g. we may
want to sample no more than 1000 samples for each apparel category containing a
certain attribute)
• Converting storage formats
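As one illustration of the first operation, near-duplicate images can be found with a simple perceptual hash; the average-hash sketch below (assuming grayscale input whose sides are multiples of the hash size) is a toy version of what dedicated libraries do:

```python
import numpy as np

def average_hash(gray, hash_size=8):
    """gray: 2D array whose sides are multiples of hash_size."""
    h, w = gray.shape
    blocks = gray.reshape(hash_size, h // hash_size,
                          hash_size, w // hash_size).mean(axis=(1, 3))
    return (blocks > blocks.mean()).flatten()   # boolean fingerprint

def hamming_distance(h1, h2):
    # small distance => likely near-duplicates
    return int(np.count_nonzero(h1 != h2))
```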
Multimedia data takes a lot of storage compared to text and other structured data.
At the same time, images are one of the most abundant types of information, with
billions of images uploaded online every day. This makes data engineering for
image datasets a non-trivial task.
There are several reasons why datasets should be handled by specialised software
rather than just kept as a set of files on a disk. The first reason is backup and
redundancy. As we mentioned above, multimedia data may take a lot of storage,
which increases the chance of data corruption or loss if no backup and redundancy
is used.
The second is dataset versioning. In academia, it is typical for a dataset to be used
without any change for over 10 years. Usually, this is very different in practical
applications. Datasets created for practical use cases are always evolving – new
samples added, bad samples removed, ground-truth labels could be added continu-
ously, datasets could be merged or split, etc. This leads to a situation where
introducing a bug in the dataset is very easy and debugging this bug is extremely
hard. Dataset versioning tools help to treat dataset versions similarly to how code
versions are treated. A number of tools exist for dataset version control, both
commercial and open source. DVC is one of the most popular; it allows users to
manage and version data and integrates with most cloud storage platforms.
The third reason is data retention management. A lot of useful data could contain
some private information or be copyrighted, especially if this data is crawled from
the web. This means that such data must be handled carefully and should have ways
to remove a sample following an owner request. Some regulators also require that
the data should be stored for a limited time frame after which it should be deleted.
The deep learning revolution made most of the classical computer vision approaches
to image classification obsolete. Figure 9.6 shows the progress of image classifica-
tion models over the last 10 years on a popular ImageNet dataset (Deng et al. 2009).
The only classical computer vision approach is SIFT-FV with 50.9% top 1 accuracy.
The best deep learning model is over four times better in terms of classification error
(EfficientNet-L2). For quite some time, deep learning was considered a high-
accuracy but high-cost approach because it required considerable computational
resources to run inference. In recent years however, much new specialised hardware
has been developed to speed up training and inference. Today, most flagship
smartphones have some version of a neural network accelerator. Reducing costs
and time for training and inference is one of the factors of the increasing popularity
of deep learning. Another factor is a change of software engineering paradigm that
some call “Software 2.0”. In this paradigm, developers no longer build a system
piece by piece. Instead, they specify the objective, collect training examples and let
optimisation algorithms build a desired system. This paradigm turns system devel-
opment into a set of experiments that are easier to parallelise and thus speed up
progress.
Fig. 9.6 Evolution of image classification models’ top 1 accuracy on ImageNet dataset
The amount of research produced every year in the area of machine learning and
computer vision is overwhelming. The NeurIPS conference shows a growing trend
in paper submissions (e.g. in 2019 there were 6743 submissions, and 1428 papers
were accepted). Not all published research passes the practice test. In
this chapter, we will briefly discuss the most effective model architectures and
training approaches.
A typical neural network-based image classifier can be divided into backbone and
prediction layers (Fig. 9.7). The backbone consists of the input layer, hidden layers
and feature (or bottleneck) layers. It takes an image as an input and produces the
image representation, or features. The predictor then uses this representation to
predict classes. This division is somewhat virtual, but it conveniently separates a
reusable and more complex part that extracts the image representation from a
predictor, which is usually a simple few-layer fully connected network. In the
following, we focus on choosing the architecture for the backbone part of the
classifier.
There are over a hundred different neural network model architectures that exist
today. However, nearly all successful architectures are based on the concept of the
residual network (ResNet) (He et al. 2016). A residual network consists of a chain of
residual blocks as depicted in Fig. 9.8. The main idea is to have a shortcut connection
between the input and output. Such connection allows for the gradients to flow freely
and avoid a vanishing gradient problem that for many years prevented building very
deep neural networks. There are several modifications to the standard residual block.
One modification proposes to add more levels of shortcut connections (DenseNet).
Fig. 9.7 A typical neural network-based image classifier: an input image passes through the backbone (input layer, hidden layers, feature layer) and then through the predictor (hidden layers) to produce class predictions
Fig. 9.8 A residual block: weight, activation and normalization layers between input and output, with a residual (shortcut) connection, optionally through a 1x1 convolution
Another modification proposes to assign and learn weights for the shortcut connec-
tion (HighwayNet).
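The residual computation can be sketched in a few lines of NumPy (a toy dense version; real residual blocks use convolutions, normalization and learned weights):

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, w1, w2):
    """out = relu(x + W2 @ relu(W1 @ x)); the `x +` term is the shortcut
    connection that lets gradients flow past the weight layers."""
    return relu(x + w2 @ relu(w1 @ x))
```

With zero weights the block reduces to the identity on non-negative inputs, which is exactly why residual networks are easy to optimise.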
Deploying an ML model is a trade-off between cost, latency and accuracy. Cost is
mostly managed through the hardware that runs the model inference. More costly
and powerful hardware can either run a model faster (lower latency) or run a larger and
more accurate model with the same latency. If the hardware is fixed, then the trade-
off is between the model size (i.e. latency) and the accuracy.
When choosing or designing the ResNet-based model architecture, the accuracy-
latency trade-off is achieved through varying the following parameters:
1. Resolution: this includes input image resolution as well as the resolution of
intermediate feature maps.
2. Number of layers or blocks of layers (such as residual block).
3. Block width, i.e. the number of channels of the feature maps of the
convolutional layers.
By varying model depth, width and resolution, one can influence model accuracy
and inference speed. However, predicting how accuracy will change as a result of
changing one of those parameters is not possible. The same applies to predicting
model inference speed. Even though it is straightforward to estimate the amount of
computation and memory required for a new architecture, a given piece of hardware
may run certain architectures that require more computation faster than others that
require less.
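The three knobs above are often scaled jointly rather than tuned independently. As a sketch, EfficientNet-style compound scaling (Tan and Le 2019) ties depth, width and resolution to a single coefficient phi (the base configuration below is invented for illustration):

```python
# Compound scaling (Tan and Le 2019): one coefficient phi scales depth,
# width and input resolution together. alpha, beta, gamma are found by a
# small grid search under the constraint alpha * beta**2 * gamma**2 ~ 2,
# so that raising phi by 1 roughly doubles the model's FLOPs.
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15  # values reported for EfficientNet

def scale_model(base_depth, base_width, base_resolution, phi):
    depth = round(base_depth * ALPHA ** phi)            # more layers
    width = round(base_width * BETA ** phi)             # more channels
    resolution = round(base_resolution * GAMMA ** phi)  # larger input
    return depth, width, resolution

print(scale_model(18, 64, 224, phi=0))  # baseline: (18, 64, 224)
print(scale_model(18, 64, 224, phi=2))  # a larger, more accurate variant
```

Each increment of phi trades a predictable increase in computation for accuracy, which makes the accuracy-latency trade-off far easier to navigate than tuning depth, width and resolution by hand.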
Because of the abovementioned problems, designing model architectures used to
be an art as much as science. But recently, more and more successful architectures
are designed by optimisation or search algorithms (Zoph and Le 2016). Examples of
the architectures that were designed by algorithm are EfficientNet (Tan and Le 2019)
and MobileNetV3 (Howard et al. 2019). Both architectures are the result of neural
architecture search.
254 M. Y. Sirotenko
Depending on the amount and kind of training data, privacy requirements,
resources available for model improvement and other constraints and considerations,
different training approaches can be chosen.
The most straightforward approach is fine-tuning a pre-trained model.
The idea is to find a pre-trained visual model and fine-tune it on the collected
dataset. Fine-tuning in this case means training a model whose backbone is
initialised from the pre-trained model and whose predictor is initialised randomly.
Thousands of models pre-trained on various datasets are available online for download.
There are two modes of fine-tuning: full and partial. Full-model fine-tuning trains the
entire model, while partial fine-tuning freezes most of the model and trains only
certain layers. Typically, in the latter mode, the backbone is frozen while the predictor
is trained. This mode is used when the dataset is small or when quick training is
needed. Fine-tuning is also used as a baseline before trying other ways of
improving model accuracy.
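Partial fine-tuning can be sketched as follows (a toy illustration: the "frozen backbone" is a fixed random projection standing in for a real pre-trained network, and only the randomly initialised logistic-regression predictor is trained):

```python
import numpy as np

rng = np.random.default_rng(0)
W_backbone = rng.normal(size=(16, 4))  # frozen: never updated below

def frozen_backbone(x):
    # Stand-in for a pre-trained backbone mapping raw inputs to features.
    return np.tanh(x @ W_backbone)

# Toy labelled dataset; labels chosen so they are learnable from features.
x = rng.normal(size=(100, 16))
y = (frozen_backbone(x)[:, 0] > 0).astype(float)

feats = frozen_backbone(x)   # computed once: the backbone is frozen

# Randomly initialised predictor (logistic-regression head).
w, b, lr = np.zeros(4), 0.0, 0.5
for _ in range(200):         # gradient descent on the predictor only
    p = 1.0 / (1.0 + np.exp(-(feats @ w + b)))
    w -= lr * feats.T @ (p - y) / len(y)
    b -= lr * np.mean(p - y)

acc = np.mean((p > 0.5) == (y == 1))
print(f"training accuracy: {acc:.2f}")
```

Because the backbone is never updated, the features can be pre-computed once, which is why this mode is so fast on small datasets.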
As mentioned above, data labelling is one of the most expensive stages of
building an image classification system. Acquiring unlabelled data, on the other hand,
can be much easier. Thus, it is very common to have a small labelled dataset and a
much larger dataset with no or weak labels. Self-supervised and semi-supervised
approaches aim to utilise the massive amounts of unlabelled data to improve model
performance. The general idea is to pre-train a model on a large unlabelled dataset in
an unsupervised mode and then fine-tune it on a smaller labelled dataset. This idea
is not new and has been known for about 20 years. However, only recent advances in
unsupervised pre-training made it possible for such models to compete with fully
supervised training regimes where all the data is labelled. One of the most successful
approaches to unsupervised pre-training is contrastive learning (Chen et al. 2020a).
The idea of contrastive learning is depicted in Fig. 9.9. An unlabelled input image is
transformed by two random transformation functions. The two transformed
images are fed into an encoder network to produce corresponding representations.
Finally, the two representations are used as inputs to a projection network whose
outputs are used to compute a consistency loss. The consistency loss pushes two
projections of the same image to be close, while projections of different images are
pushed apart.
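A NumPy sketch of a SimCLR-style consistency loss makes this concrete (a simplified stand-in for the loss of Chen et al. 2020a; batch size, dimensions and temperature are arbitrary):

```python
import numpy as np

def normalize(z):
    return z / np.linalg.norm(z, axis=1, keepdims=True)

def contrastive_loss(z1, z2, temperature=0.5):
    """Projections of two views of the same image are pulled together;
    every other pair in the batch acts as a negative and is pushed apart."""
    z = normalize(np.concatenate([z1, z2], axis=0))  # 2N projections
    sim = z @ z.T / temperature                      # cosine similarities
    n = len(z1)
    np.fill_diagonal(sim, -np.inf)                   # exclude self-pairs
    # the positive for row i is the other view of the same image
    pos = np.concatenate([np.arange(n, 2 * n), np.arange(n)])
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return -log_prob[np.arange(2 * n), pos].mean()

rng = np.random.default_rng(0)
z = rng.normal(size=(8, 32))
noise = 0.01 * rng.normal(size=(8, 32))
loss_matched = contrastive_loss(z, z + noise)          # two views agree
loss_random = contrastive_loss(z, rng.normal(size=(8, 32)))
print(loss_matched, loss_random)
```

Matching views give a much lower loss than unrelated projections, which is exactly the signal that trains the encoder without labels.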
It was shown that unsupervised contrastive learning works best with large
convolutional neural networks (Chen et al. 2020a, b). One way to make this approach
more practical is knowledge distillation.
Knowledge distillation in a neural network consists of two steps:
1. In the first step, a large model or an ensemble of models is trained using ground-
truth labels. This model is called the teacher model.
2. In the second step, a (typically) smaller network is trained using the predictions of
the teacher network as ground truth. This model is called the student network.
In the context of semi-supervised learning, a larger teacher model is trained
using unlabelled data and then distilled into a smaller student network. It was shown
that such distillation is possible with negligible accuracy loss (Hinton et al.
2015).
Knowledge distillation is not only useful for semi-supervised approaches but also
as a way to control model size and computational requirements while keeping high
accuracy.
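The second step can be sketched as a cross-entropy between temperature-softened teacher and student outputs (toy logits; the temperature value is illustrative):

```python
import numpy as np

def softmax(logits, t=1.0):
    z = logits / t
    z = z - z.max(axis=-1, keepdims=True)   # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, t=4.0):
    """Cross-entropy between softened teacher and student distributions
    (Hinton et al. 2015). A higher temperature t exposes the teacher's
    knowledge about relative class similarities, not just the top class."""
    p_teacher = softmax(teacher_logits, t)
    log_p_student = np.log(softmax(student_logits, t))
    return -(p_teacher * log_p_student).sum(axis=-1).mean()

teacher = np.array([[8.0, 2.0, -1.0]])
good_student = np.array([[7.5, 2.2, -0.8]])   # mimics the teacher
bad_student = np.array([[-1.0, 0.0, 6.0]])    # disagrees with it
loss_good = distillation_loss(good_student, teacher)
loss_bad = distillation_loss(bad_student, teacher)
print(loss_good, loss_bad)
```

Minimising this loss drives the student's full output distribution, not just its argmax, toward the teacher's.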
Another approach related to semi-supervised learning is domain transfer
learning. Domain transfer is the problem of using one dataset to train a model to make
predictions on a dataset from a different domain. Examples of domain transfer
problems include training on synthetic data and predicting on real data, training on
e-commerce images of products and predicting on user images, and so on.
There are two ways of tackling the domain transfer problem depending on
whether there is some training data from the target domain or not:
1. If data from the target domain is available, then contrastive training that
aims to minimise the distance between the source and target domains is one of the
most successful approaches.
2. If the target domain samples are unavailable, then the goal is to train a model
robust to the domain shift. In this case, heavy augmentation, regularisation and
contrastive losses help.
Another common problem when training image classification models is data
imbalance. Almost every real-life dataset has some kind of data imbalance, which
manifests in some classes having orders of magnitude more training samples than
others. Data imbalance may result in a biased classifier. The most obvious way to
solve the problem is to collect more data for the underrepresented classes. This, however,
is not always possible and can be too costly. Another widely used approach is
under- or over-sampling. The idea is to use fewer samples of the overrepresented
class and to duplicate samples of the underrepresented class during training. This
approach is simple to implement, but the accuracy improvement for the underrepresented
classes often comes at the price of reduced accuracy for the overrepresented
ones. There is a vast amount of research aimed at handling data imbalance by
building a better loss function (see Cao et al. 2019, for instance).
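A sketch of over-sampling by index duplication (toy labels with a 9:1 imbalance):

```python
import numpy as np

def oversample_indices(labels, rng):
    """Duplicate samples of underrepresented classes so that every class
    contributes the same number of samples per epoch."""
    labels = np.asarray(labels)
    classes, counts = np.unique(labels, return_counts=True)
    target = counts.max()
    idx = []
    for c in classes:
        members = np.flatnonzero(labels == c)
        # sample with replacement up to the size of the largest class
        idx.append(rng.choice(members, size=target, replace=True))
    return rng.permutation(np.concatenate(idx))

rng = np.random.default_rng(0)
labels = [0] * 90 + [1] * 10           # 9:1 imbalance
idx = oversample_indices(labels, rng)
balanced = np.asarray(labels)[idx]
print(np.bincount(balanced))           # both classes now equally frequent
```

Under-sampling is the mirror image: sample each class down to the size of the smallest one instead.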
One more training approach we would like to mention in this chapter is federated
learning (FL) (Yang et al. 2019). Federated learning is a type of privacy-preserving
learning in which no central data store is assumed. As shown in Fig. 9.10, in FL there is
a centralised server that performs the federated learning and a large number of user
devices. Each device downloads a shared model from the server, uses it and
computes gradients using only the data available on that device. According to a
schedule, those gradients are sent to the centralised server, where the gradients from all
users are integrated into the shared model. This approach guarantees that no raw user
data ever leaves the device. It has been gaining popularity recently, since it allows
improving model performance using users' data without compromising their privacy.
[Fig. 9.10: Federated learning: user devices (User 1 ... User N) send gradients to a centralized server and download the updated model]
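A minimal simulation of this loop, assuming a simple least-squares model and three toy devices (the data, model and learning rate are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

def client_gradient(weights, x, y):
    """Gradient of a least-squares loss computed locally on one device;
    only this gradient, never the raw (x, y) data, is sent to the server."""
    pred = x @ weights
    return x.T @ (pred - y) / len(y)

# Shared model and three simulated user devices with private data.
weights = np.zeros(3)
true_w = np.array([1.0, -2.0, 0.5])
devices = []
for _ in range(3):
    x = rng.normal(size=(50, 3))
    devices.append((x, x @ true_w))   # each device keeps its data locally

lr = 0.1
for _ in range(100):                  # one federated round per iteration
    grads = [client_gradient(weights, x, y) for x, y in devices]
    weights -= lr * np.mean(grads, axis=0)   # server aggregates updates

print(weights.round(2))
```

The server only ever sees averaged gradients, yet the shared model converges to the weights that fit all devices' data.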
9.6 Deployment
After the model is trained and evaluated, the final step is model deployment.
Deployment in the context of this chapter means running your model in production
to classify images coming from the users of your service. The important factors at
this step are latency, throughput and cost. Latency is how quickly a user gets a
response from your service after sending an input image, and throughput is how many
requests your service can process per unit of time without failures. Other important
factors during the deployment stage are the convenience of updating the model,
proper versioning and the ability to run A/B tests. The latter depend on the software
framework chosen for deployment. TensorFlow Serving is a good example of a
platform that delivers high-performance serving of models with gRPC and REST
client support.
The way to reduce the latency or increase the throughput of the model serving
system is to use more powerful hardware. This can be done either by building
your own server or by using one of many cloud solutions. Most cloud solutions for
serving models offer instances with modern GPU support that provide much better
efficiency than CPU-only solutions. Another alternative to GPUs for accelerating
neural networks is tensor processing units (TPUs), which were specifically designed
for running neural network training and inference.
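As an illustration, the REST request body expected by TensorFlow Serving's predict endpoint can be assembled as follows (the host, default port 8501, and the model name "classifier" are placeholders; only the payload is built here, no request is sent):

```python
import json

def build_predict_request(model_name, images):
    """TensorFlow Serving's REST API accepts a JSON body with an
    "instances" list at /v1/models/<name>:predict."""
    url = f"http://localhost:8501/v1/models/{model_name}:predict"
    body = json.dumps({"instances": list(images)})
    return url, body

url, body = build_predict_request("classifier", [[0.1, 0.2], [0.3, 0.4]])
print(url)
print(body)
```

In production, this body would be POSTed to the serving instance; the gRPC interface offers the same functionality with lower serialization overhead.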
Another aspect of model deployment that is important to keep in mind is
protecting the model from theft. Protecting the confidentiality of ML models is
important for two reasons: (a) a model can be a business advantage to its owner, and
(b) an adversary may use a stolen model to find transferable adversarial examples
that can evade classification by the original model. Several methods have been
proposed recently to detect model stealing attacks, as well as to protect against them
by embedding watermarks into the neural networks (Juuti et al. 2019; Uchida et al. 2017).
References
Cao, K., Wei, C., Gaidon, A., Arechiga, N., Ma, T.: Learning imbalanced datasets with label-
distribution-aware margin loss. In: Advances in Neural Information Processing Systems,
pp. 1567–1578 (2019)
Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of
visual representations. arXiv preprint arXiv:2002.05709 (2020a)
Chen, T., Kornblith, S., Swersky, K., Norouzi, M., Hinton, G.: Big self-supervised models are
strong semi-supervised learners. arXiv preprint arXiv:2006.10029 (2020b)
Cubuk, E.D., Zoph, B., Mane, D., Vasudevan, V., Le, Q.V.: AutoAugment: learning augmentation
policies from data. arXiv preprint arXiv:1805.09501 (2018)
Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical
image database. In: Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition, pp. 248–255 (2009)
Fei-Fei, L., Russakovsky, O.: Analysis of large-scale visual recognition. Bay Area Vision Meeting
(2013)
Hastings, R.: Making the most of the cloud: how to choose and implement the best services for your
library. Scarecrow Press (2013)
He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of
the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016)
Hinton, G., Vinyals, O, Dean, J.: Distilling the knowledge in a neural network, arXiv preprint
arXiv:1503.02531 (2015)
Howard, A., Sandler, M., Chu, G., Chen, L.C., Chen, B., Tan, M., Wang, W., Zhu, Y., Pang, R.,
Vasudevan, V., Le, Q.V.: Searching for MobileNetV3. In: Proceedings of the IEEE International
Conference on Computer Vision, pp. 1314–1324 (2019)
Jia, M., Shi, M., Sirotenko, M., Cui, Y., Cardie, C., Hariharan, B., Adam, H., Belongie, S.:
Fashionpedia: Ontology, Segmentation, and an Attribute Localization Dataset. arXiv preprint
arXiv:2004.12276 (2020)
Juuti, M., Szyller, S., Marchal, S., Asokan, N.: PRADA: protecting against DNN model stealing
attacks. In: Proceedings of the IEEE European Symposium on Security and Privacy (EuroS&P),
pp. 512–527 (2019)
Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., Zitnick, C.L.:
Microsoft COCO: common objects in context. In: Proceedings of the European Conference on
Computer Vision, pp. 740–755. Springer, Cham (2014)
Mozur, P.: One month, 500,000 face scans: How China is using A.I. to profile a minority (2019).
Accessed on September 27 2020. https://www.nytimes.com/2019/04/14/technology/china-
surveillance-artificial-intelligence-racial-profiling.html
Neuberger, A., Alshan, E., Levi, G., Alpert, S., Oks, E.: Learning fashion traits with label
uncertainty. In: Proceedings of KDD Workshop Machine Learning Meets Fashion (2017)
Nikolenko, S. I.: Synthetic data for deep learning, arXiv preprint arXiv:1909.11512 (2019)
Nixon, J., Dusenberry, M.W., Zhang, L., Jerfel, G., Tran, D.: Measuring calibration in deep
learning. In: Proceedings of the CVPR Workshops, pp. 38–41 (2019)
Schröder, C., Niekler, A.: A survey of active learning for text classification using deep neural
networks, arXiv preprint arXiv:2008.07267 (2020)
Singh, A., Sha, J., Narayan, K.S., Achim, T., Abbeel, P.: Bigbird: A large-scale 3d database of
object instances. In: Proceedings of the IEEE International Conference on Robotics and
Automation, pp. 509–516 (2014)
Tan, M., Le, Q.V.: EfficientNet: rethinking model scaling for convolutional neural networks, arXiv
preprint arXiv:1905.11946 (2019)
Uchida, Y., Nagai, Y., Sakazawa, S., Satoh, S.I.: Embedding watermarks into deep neural networks.
In: Proceedings of the ACM on International Conference on Multimedia Retrieval, pp. 269–277
(2017)
y Arcas, B.A., Mitchell, M., Todorov, A.: Physiognomy’s New Clothes. In: Medium. Artificial
Intelligence (2017) Accessed on September 27 2020 https://medium.com/@blaisea/
physiognomys-new-clothes-f2d4b59fdd6a.
Yang, Q., Liu, Y., Chen, T., Tong, Y.: Federated machine learning: concept and applications. ACM
Trans. Intell. Syst. Technol. 10(2), 1–19 (2019)
Zoph, B., Le, Q.V.: Neural architecture search with reinforcement learning, arXiv preprint
arXiv:1611.01578 (2016)
Chapter 10
Mobile User Profiling
10.1 Introduction
Fig. 10.1 Sensors and data sources available in a modern smartphone (part of this image was
designed by Starline/Freepik)
data among others. Most of them can be used directly as features for demographic
prediction, except Web data. Thus, another issue is to develop a way to represent
the massive amount of text data from Web pages as an input feature vector for the
demographic prediction method. To solve this issue, we propose using advanced
natural language processing (NLP) technologies based on a probabilistic topic
modelling algorithm (Blei 2012) to extract meaningful textual information from the
Web data.
In this method, we propose extracting the user's interests from Web data to construct
a compact representation of it. For example, user interests may be expressed by
which books the user reads or buys, which sports interest the user or which purchases
the user could potentially make.
This compact representation is suitable for training the demographic classifier. It
should be noted that the user's interests can also be used directly by the content
provider for better targeting of advertisement services or other interactions with the user.
To achieve flexibility in demographic prediction and to provide language independence,
we propose using common news categories as a model of user interests.
News streams are available in all languages of interest, and news categories are
reasonably universal across languages and cultures. This makes it possible to build
a multi-lingual topic model.
This endeavour requires building and training a topic model with a classifier of
text data for specified categories (interests). A list of desired categories can be
provided by the content provider. The topic model categorizes the text extracted
from Web pages and can be built with the additive regularization of topic models
(ARTM) algorithm. User interests are then extracted using the trained topic model.
ARTM is based on a generalization of two powerful algorithms: probabilistic
latent semantic analysis (PLSA) (Hofmann 1999) and latent Dirichlet allocation
(LDA) (Blei 2012). The additive regularization framework allows imposing additional
necessary constraints on the topic model, such as sparseness or a desired word
distribution. It should be noted that ARTM can be used not only for clustering but also
for classification over a given list of categories.
Next, the demographic model is trained using datasets collected from mobile
users. Features are extracted from the collected data with the help of the topic model
trained in the previous step. The demographic model comprises several (in our case
three) demographic classifiers, which must predict the user's age (one of '0–18',
'19–21', '22–29', '30+'), gender (male or female) or marital status (married/not
married) from a given feature vector. The architecture of the proposed method is
depicted in Fig. 10.2.
It is important to emphasize that the language the user will use cannot be
predicted in advance. This uncertainty means that the model must be multi-lingual:
a speaker of another language should only need to load the data for his/her own language.
ARTM allows the inclusion of various types of modalities (translations into different
languages, tags, categories, authors, etc.) in one topic model. We propose using
cross-lingual features to implement a language-independent (multi-lingual) NLP
procedure. The idea of cross-lingual feature generation involves training one topic
262 A. M. Fartukov et al.
10.2.1 Pre-processing
As mentioned earlier, the major source of our observational data is the Web pages that
the user has browsed. Pre-processing of Web pages includes the following operations:
removing HTML tags, performing stemming or lemmatization of every word,
removing stop words, converting all characters to lowercase and translating the
page content into the target languages. We consider three target languages (Russian,
English and Korean) in the proposed algorithm. We use the 'Yandex.Translate' Web
service for translation.
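A minimal sketch of these operations (using a tiny invented stop-word list; a real pipeline would also lemmatize and call a translation service such as the one above):

```python
import re

STOP_WORDS = {"the", "a", "an", "is", "and", "of"}  # tiny illustrative list

def preprocess(html):
    """Pre-processing sketch: strip HTML tags, lowercase the text and
    drop stop words, leaving a bag of content words."""
    text = re.sub(r"<[^>]+>", " ", html)          # remove HTML tags
    tokens = re.findall(r"[a-z]+", text.lower())  # lowercase word tokens
    return [t for t in tokens if t not in STOP_WORDS]

page = "<html><body><h1>The Rules of Football</h1>A game is played...</body></html>"
print(preprocess(page))  # ['rules', 'football', 'game', 'played']
```

The resulting token lists are what the topic model described below consumes as "documents".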
10 Mobile User Profiling 263
To model the user's interests, one must analyse the documents viewed by the user.
Topic modelling enables us to assign a set of topics T and to use it to estimate the
conditional probability that word w appears inside document d:

\[ p(w \mid d) = \sum_{t \in T} p(w \mid t)\, p(t \mid d). \]
In accordance with PLSA, we follow the assumption that all documents in a
collection inherit one cluster-specific distribution for every cluster of topic-related
words. Our purpose is to assign topics T that maximize the functional L:

\[ L(\Phi, \Theta) = \ln \prod_{d \in D} \prod_{w \in d} p(w \mid d)^{n_{dw}} \to \max_{\Phi, \Theta}, \]

where \(n_{dw}\) denotes the number of times word w is encountered in document d,
\(\Phi = (p(w \mid t))_{W \times T} = (\varphi_{wt})_{W \times T}\) is the matrix of term
probabilities for each topic, and \(\Theta = (p(t \mid d))_{T \times D} = (\theta_{td})_{T \times D}\)
is the matrix of topic probabilities for each document. By using the abovementioned
expressions, we obtain the following:

\[ L(\Phi, \Theta) = \sum_{d \in D} \sum_{w \in d} n_{dw} \ln \sum_{t \in T} p(w \mid t)\, p(t \mid d) \to \max_{\Phi, \Theta} \]

subject to

\[ \sum_{w \in W} p(w \mid t) = 1, \quad p(w \mid t) \ge 0; \qquad \sum_{t \in T} p(t \mid d) = 1, \quad p(t \mid d) \ge 0. \]
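The factorization \(p(w \mid d) = \sum_{t} p(w \mid t)\, p(t \mid d)\) and the log-likelihood can be checked numerically with a toy sketch (random stochastic matrices stand in for the learned \(\Phi\) and \(\Theta\)):

```python
import numpy as np

rng = np.random.default_rng(0)
W, T, D = 5, 2, 3   # vocabulary size, number of topics, number of documents

# Column-stochastic Phi (words per topic) and Theta (topics per document).
phi = rng.random((W, T)); phi /= phi.sum(axis=0)
theta = rng.random((T, D)); theta /= theta.sum(axis=0)
n = rng.integers(0, 10, size=(W, D))    # n_dw: word counts per document

p_wd = phi @ theta                      # p(w|d) = sum_t p(w|t) p(t|d)
log_likelihood = (n * np.log(p_wd)).sum()
print(log_likelihood)

# Each column of p_wd is itself a probability distribution over words.
print(p_wd.sum(axis=0))
```

Training amounts to adjusting \(\Phi\) and \(\Theta\) (under the stochasticity constraints reproduced in the code) so that this log-likelihood is maximized.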
Special attention should be paid to the following fact: zero probabilities are not
acceptable under the natural logarithm in the above equation for \(L(\Phi, \Theta)\).
To overcome this issue, we followed the ARTM method proposed by Vorontsov and
Potapenko (2015). A regularization term \(R(\Phi, \Theta)\) is added:

\[ R(\Phi, \Theta) = \sum_{i=1}^{r} \tau_i R_i(\Phi, \Theta), \quad \tau_i \ge 0, \]

where the individual regularizers \(R_i\) are built from the Kullback–Leibler divergence

\[ \mathrm{KL}(p \parallel q) = \sum_{i=1}^{n} p_i \ln \frac{p_i}{q_i}. \]

Smoothing requires the word distributions \(\varphi_{wt}\) to be close to a fixed
distribution \(\beta_w\) and the topic distributions \(\theta_{td}\) to be close to a
fixed distribution \(\alpha_t\):

\[ \sum_{t \in T} \mathrm{KL}_w(\beta_w \parallel \varphi_{wt}) \to \min_{\Phi}, \qquad \sum_{d \in D} \mathrm{KL}_t(\alpha_t \parallel \theta_{td}) \to \min_{\Theta}. \]

To achieve both minima, we combine the last two expressions into a single
regularizer \(R_s(\Phi, \Theta)\):

\[ R_s(\Phi, \Theta) = \beta_0 \sum_{t \in T} \sum_{w \in W} \beta_w \ln \varphi_{wt} + \alpha_0 \sum_{d \in D} \sum_{t \in T} \alpha_t \ln \theta_{td} \to \max, \]

and the final optimization problem becomes

\[ L(\Phi, \Theta) + R_s(\Phi, \Theta) \to \max_{\Phi, \Theta}. \]
After performing step 3, it is possible to describe each topic t with the set of its words
w using the probabilities p(w| t). It also becomes possible to map each input
document d into a vector of topics T in accordance with probabilities p(t| d ). At
the next step, we need to aggregate the topic information about the documents
\(d_1^u, \ldots, d_{N_d}^u\) viewed by the user u into a single vector. Thus, we average
the obtained topic vectors in the following manner:

\[ p_u(t_i \mid d) = \frac{1}{N_d} \sum_{j=1}^{N_d} p_u\left(t_i \mid d_j^u\right), \]

where \(N_d\) denotes the number of documents viewed by user u and \(d_j^u\) is the
j-th document viewed by the user.
The resulting topic vector (or user interest vector) is used as the feature vector for
the demographic model. Let us consider it in detail.
The demographic model consists of several demographic classifiers. In the
proposed method, the following classifiers are used: age, gender and marital status.
In the present work, such classifiers are built with a deep learning approach, and the
Veles framework (Veles 2015) is used as the deep learning platform.
Each classifier is built with a neural network (NN) and optimized with a genetic
algorithm. The NN architecture is based on the multi-layer perceptron (Collobert and
Bengio 2004). It should be noted that we determined the possible NN architecture of
each classifier and the optimal hyper-parameters using a genetic algorithm.
We used the following hyper-parameters of NN architecture: size of the
minibatch, number of layers, number of neurons in each layer, activation function,
dropout, learning rate, weight decay, gradient moment, standard deviation of
weights, gradient descent step, regularization coefficients, initial ranges of weights
and number of examples per iteration.
Using a genetic algorithm, we can adjust these hyper-parameters (Table 10.1).
We also use the genetic algorithm to select optimal features in the input feature vector
and thereby reduce the size of the input feature vector of the demographic model.
When the genetic algorithm operates, a population P of M = 75 instances of
demographic classifiers with the abovementioned parameters is created. Next, we
use error backpropagation to train these classifiers. Based on the training results, the
classifiers with the highest performance in terms of demographic profile prediction are
chosen. Subsequently, to add new classifiers to the population, a crossover operation
is applied. The crossover randomly substitutes the values of parameters in which the
two original classifiers do not coincide.
Let us illustrate the process with the following example. If classifier C1 contains
n1 = 10 neurons in the first layer and classifier C2 contains n2 = 100 neurons, then
the crossover operation may replace this value with 50. The
newly created classifier C3 ¼ crossover(C1, C2) replaces a classifier that shows the
worst performance in the population P.
To introduce modifications of the best classifiers, a mutation operation is also
applied. Each classifier with new parameters is added to the population of classifiers;
subsequently, all new classifiers are retrained and their performance is measured.
This process is repeated until classification performance stops improving. The
demographic classifier with the best performance in the last population is chosen as
the final one.
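The crossover step can be sketched as follows (a toy illustration; the hyper-parameter names, values and ranges are invented, and mutation and retraining are omitted):

```python
import random

def crossover(params_a, params_b, rng):
    """Crossover sketch: where two parent classifiers agree on a
    hyper-parameter, the child inherits it; where they disagree, the
    child gets a value drawn between the two parents' values."""
    child = {}
    for key in params_a:
        a, b = params_a[key], params_b[key]
        if a == b:
            child[key] = a
        else:
            lo, hi = sorted((a, b))
            child[key] = rng.randint(lo, hi)  # a value in [lo, hi]
    return child

rng = random.Random(0)
c1 = {"layers": 3, "neurons": 10, "dropout_pct": 20}
c2 = {"layers": 3, "neurons": 100, "dropout_pct": 50}
child = crossover(c1, c2, rng)
print(child)
```

In the full loop, each such child replaces the worst-performing member of the population, as described above.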
Let us consider the results of the proposed method. To build robust demographic
prediction models, we collected an extensive dataset with various available types of
features from mobile users. The principles and a detailed description of the dataset
are given in Sect. 10.4.
For demographic prediction, we explored different machine learning approaches
and methods: support vector machines (SVMs) (Cortes and Vapnik 1995), NNs
(Bishop 1995) and logistic regression (Bishop 2006). We performed accuracy tests
(without optimization), the results of which are presented in Table 10.2.
Based on the obtained results, the NN approach was chosen to build the demographic
prediction classifier. Although a myriad of freely available deep learning frameworks
for training NNs exist (Caffe, Torch, Theano, etc.), we decided to use our custom deep
learning framework (Veles 2015) because it was designed as a very flexible tool in
terms of workflow construction, data extraction, pre-processing and visualization.
It also has an additional advantage: the ease of porting the resulting classifier to
mobile devices.
It should be noted that we also had to optimize the number of topic features: we
decreased the initial number of ARTM features from 495 to 170. The demographic
prediction accuracies using topic features generated from ARTM and LDA are
shown in Table 10.3.
behavioural profile corresponds to a legitimate user, the new incoming data are
passed to the implicit authentication step. Otherwise, the user will be locked out, and
the authentication system will ask the user to verify his/her identity using explicit
authentication methods such as a password or biometrics (fingerprint, iris, etc.).
The framework that is elaborated above determines requirements that the implicit
authentication should satisfy. First, methods applied for implicit authentication
should be able to extract representative features that reflect user uniqueness from
noisy data (Hazan and Shabtai 2015). In particular, a user’s interaction with the
smartphone causes high intra-user variability, which should be handled effectively
by methods of implicit authentication.
Second, these methods should process sensor data and profile a user in real time
on the mobile device, without sending data off the device. This implies that implicit
authentication should be done without consuming much power (i.e. should have low
battery consumption). Fortunately, the SoCs used in modern smartphones include
special low-power blocks aimed at real-time management of the sensors without
waking the main processor (Samsung Electronics 2018).
Third, implicit authentication requires protecting both the collected data and their
processing. This protection can be provided by a special secure (trusted) execution
environment, which also imposes additional restrictions on the available computational
resources: a restricted number of available processor cores, reduced frequencies of
the processor core(s), unavailability of extra computational hardware accelerators
(e.g. GPU) and a limited amount of memory (ARM Security Technology 2009).
boosted decision trees and, more importantly, to some extent decreases the level of the
smartphone's security.
A solution to the latter issue is to use multi-modal methods for passive authentication,
which have obvious advantages over methods based on a single modality. Deb et al.
(2019) described an example of a multi-modal method for passive authentication. In
that paper, the authors proposed using a Siamese long short-term memory (LSTM)
architecture (Varior et al. 2016) to extract deep temporal features from the data of a
number of passive smartphone sensors for user authentication (Fig. 10.6). Their
method, based on keystroke dynamics, GPS location, accelerometer, gyroscope,
magnetometer, linear accelerometer, gravity and rotation modalities, can unobtrusively
verify a genuine user with a 96.47% True Accept Rate (TAR) at a 0.1% False Accept
Rate (FAR) within 3 seconds.
Fig. 10.6 Architecture of the model proposed by Deb et al. (2019). (Reproduced with permission
from Deb et al. 2019)
The main tasks of the system include data acquisition, which tracks usual user
interactions with a smartphone; subsequent storage and transmission of the collected
data; customization of the data collection procedure (selection of sources/sensors for
data acquisition) for individual users or groups of users; and controlling and
monitoring the data collection process.
The dataset collection system contains two components: a mobile application for
Android OS (hereafter called the client application) and a server for dataset storage
and controlling dataset collection (Fig. 10.7). Let us consider each component in
detail.
The client application is designed to collect sensor data and user activities in the
background and to send the collected data to the server via an encrypted
communication channel. The client application can operate on smartphones running
Android 4.0 or higher. Because users continue to use their smartphones in a typical
manner during dataset collection, the client application is optimized in terms of
battery usage.
Immediately after installation of the application on a smartphone, it requests the
user to complete a ground-truth profile. This information is needed for further
verification of the developed methods. The client collects the following categories
of information: ‘call + sensors’ data, application-related data and Web data.
‘Call + sensors’ data comprises SMS and call logs, battery and light sensor status,
location information (provided by GPS and cell towers), Wi-Fi connections, etc. All
sensitive information (contacts used for calls and SMS, SMS messages, etc.) is
transformed in a non-invertible manner (hashed) to ensure the user’s privacy. At
the same time, the hashed information can be used to characterize user behaviour.
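A sketch of such a non-invertible transform (the salt value, contact format and function name are hypothetical):

```python
import hashlib

def anonymize(value, salt):
    """Salted SHA-256 digest replacing a raw contact identifier. The same
    contact always maps to the same digest, so behaviour (e.g. call
    frequency per contact) is preserved, but the original value cannot
    be recovered from the hash."""
    return hashlib.sha256((salt + value).encode()).hexdigest()

salt = "per-device-secret"             # hypothetical per-device salt
h1 = anonymize("+1-555-0100", salt)
h2 = anonymize("+1-555-0100", salt)    # same contact, same digest
h3 = anonymize("+1-555-0199", salt)    # different contact, different digest
print(h1 == h2, h1 == h3)
```

A per-device salt also prevents digests from being compared across participants, strengthening the privacy guarantee.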
Physical sensors provide an additional source of data about the user context.
Information about the type and current state of the battery can reveal the battery
charging pattern, and the light sensor provides ambient light detection.
Fig. 10.8 Visualization of history and current distribution of participants by gender and marital
status
Another component of the data collection system is the server, which stores the
collected data and controls the collection process (Fig. 10.7). Let us briefly
describe the main functions of the server:
1. Storing collected data in a common dataset.
2. Monitoring activity of the client application. This includes tracking client
activation events; collecting information about the versions of the client application
currently used for data collection and about the number and frequency of data
transactions between the clients and the server; and gathering other technical
information about the client application's activities.
3. Providing statistical information. It is important to have up-to-date information
about the current status of the dataset collection. The server should be able to
provide, in a convenient way, information about the amount of data already
collected, the number of data collection participants, their distribution by
age/gender and marital status (Fig. 10.8), etc.
4. Control of the dataset collection procedure. The server should be able to set up a
unified configuration for all clients. Such a configuration determines the set of
sensors that should be used for data collection and the duration of data collection.
Participants can be asked different questions to establish additional data labelling
or ground-truth information immediately before the data collection. Another
aspect of data collection control is notifying participants that the client application
should be updated or requires special attention during data collection. This
function is implemented as push notifications received from the server.
5. System health monitoring and audit. The aim of this function is to detect
software issues promptly and provide information that can help to fix them. For
this endeavour, the server should be able to identify potentially problematic user
environments (by automatically collecting and sending to the server client
10 Mobile User Profiling 275
Chapter 11
Automatic View Planning in Magnetic
Resonance Imaging
11.1 Introduction
Magnetic resonance imaging (MRI) is one of the most widely used noninvasive
methods in medical diagnostic imaging. To obtain high-quality MRI slices, their
position and orientation should be chosen in accordance with anatomical
landmarks, i.e. the respective imaging planes (views or slices) should be planned
in advance. In MRI, this procedure is called view planning. The locations of planned
slices and their orientations depend on the human body parts under investigation. For
example, typical cardiac view planning consists of obtaining two-chamber, three-
chamber, four-chamber and short-axis views (see Fig. 11.1).
Typically, view planning is performed manually by a doctor. Such manual
operation has several drawbacks:
• It is time-consuming. Depending on the anatomy and study protocol, it can take
up to 10 minutes, or even more in special cases. The patient must stay in the
scanner during this procedure.
A. B. Danilevich
Samsung R&D Institute Rus (SRR), Moscow, Russia
e-mail: a.danilevich@samsung.com
M. N. Rychagov
National Research University of Electronic Technology (MIET), Moscow, Russia
e-mail: michael.rychagov@gmail.com
M. Y. Sirotenko (*)
111 8th Ave, New York, NY 10011, USA
e-mail: mihail.sirotenko@gmail.com
• It is operator-dependent. Images of the same anatomy produced with the same
protocol by different doctors may differ significantly. This degrades the quality of
diagnosis and the analysis of disease dynamics.
• It requires qualified medical personnel to carry out the whole workflow (including
the view planning) instead of only analysing the images for diagnostics.
To overcome all these disadvantages, an automatic view planning (AVP) system
may be used, which estimates the positions and orientations of the desired view
planes by analysing a scout image, i.e. a preliminary image obtained prior to
performing the major portion of a particular study.
The desired properties of such an AVP system are:
• High acquisition speed of the scout image
• High computational speed
• High accuracy of the view planning
• Robustness to noise and anatomical abnormalities
• Support (suitability) for various anatomies
The goal of our work is an AVP system (and the respective approach) developed in
accordance with the requirements listed above.
We describe a fully automatic view planning framework which is designed to be
able to process four kinds of human anatomies, namely, brain, heart, spine and knee.
Our approach is based on anatomical landmark detection and includes several
anatomy-specific pre-processing and post-processing algorithms. The key features
of our framework are (a) using deep learning methods for robust detection of the
landmarks in rapidly acquired low-resolution scout images, (b) unsupervised
learning to overcome the problem of a small training dataset, (c) redundancy-based
midsagittal plane detection for brain AVP, (d) spine disc position alignment via
3D-clustering and using a statistical model (of the vertebrae periodicity) and
(e) position refinement of detected landmarks based on a statistical model.
The AVP framework workflow consists of the following steps (see Fig. 11.2):
1. 3D scout MRI volume acquisition. It is a low-resolution volume acquired at high
speed.
2. Pre-processing of the scout image, which includes common operations, such as
bounding box estimation and statistical atlas anchoring, as well as anatomy-specific
operations, such as midsagittal plane estimation for the brain.
3. Landmark detection.
4. Post-processing, which consists of common operations for landmark position
refinement and filtering as well as anatomy-specific operations like vertebral disc
position alignment for spine AVP.
5. Estimation of the positions and orientations of the view planes.
A commonly used operation for the pre-processing stage is a bounding box
reconstruction for a body part under investigation. Such a bounding box is helpful
for various working zone estimations, local coordinate origin planning, etc.
The bounding box is a three-dimensional rectangular parallelepiped that bounds
only the essential part of the volume. For example, for brain MRI, the bounding box
simply bounds the head, ignoring the empty space around it. This first rough
estimation already brings information about the body part position, which reduces
the ambiguity of positions of anatomical points within a volume. Such ambiguity
appears due to the variety of body part positions relative to the scanner. This
reduction of ambiguity yields a reduction of the search zone for finding anatomical
landmarks. The bounding box is estimated via integral projections of the whole
volume onto coordinate axes. The bounding box is formed by utmost points of
intersections of the projections with the predefined thresholds. The integral
projection is a one-dimensional function whose value at each point is calculated as
the sum of all voxels with the respective coordinate fixed. Using non-zero thresholds
allows cutting off noise in the side areas of the volume.
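The projection-and-threshold procedure above can be sketched as follows (a minimal NumPy illustration; the function name and threshold handling are illustrative, not taken from the chapter):

```python
import numpy as np

def bounding_box(volume, threshold=0.0):
    """Estimate the 3D bounding box of the essential part of a volume.

    For each axis the volume is summed over the other two axes, giving a
    one-dimensional integral projection; the box spans the outermost
    positions where the projection exceeds the (noise-cutting) threshold.
    """
    box = []
    for axis in range(3):
        other = tuple(a for a in range(3) if a != axis)
        proj = volume.sum(axis=other)           # 1D integral projection
        idx = np.flatnonzero(proj > threshold)  # positions above threshold
        box.append((int(idx[0]), int(idx[-1])))
    return box  # [(z0, z1), (y0, y1), (x0, x1)]

# Toy volume: an 8x8x8 empty volume with a bright block inside.
vol = np.zeros((8, 8, 8))
vol[2:5, 1:5, 3:5] = 1.0
print(bounding_box(vol))  # [(2, 4), (1, 4), (3, 4)]
```

A non-zero threshold plays the role described above: projection bins produced only by background noise fall below it and are excluded from the box.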
In the next step, the search zone is reduced even further by applying the
statistical atlas. The statistical atlas contains information about the statistical
distribution of anatomical landmarks’ positions inside a certain part of the human
body. It is constructed on the basis of annotated volumetric medical images. The
positions of landmarks in a certain volume are transformed into a local coordinate
system related to the bounding box rather than to the whole volume; such a
transformation prevents wide dispersion of the annotations. On the basis of the
landmark positions calculated for several volumes, the landmarks’ spatial
distribution is estimated. In the simplest case, such a distribution can be represented
by the convex hull of all points (for a certain landmark type) in local coordinates.
When the statistical atlas is anchored to the volume, the search zone is defined
(Fig. 11.3).
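The local-coordinate transform and the anchoring of the atlas-derived search zone can be sketched as follows (a hedged illustration: an axis-aligned extent plus a margin stands in for the convex hull described above, and all names are hypothetical):

```python
import numpy as np

def to_local(points, box):
    """Map absolute voxel coordinates into the [0, 1]^3 local frame of a
    bounding box given as [(lo, hi), ...] per axis."""
    lo = np.array([b[0] for b in box], dtype=float)
    hi = np.array([b[1] for b in box], dtype=float)
    return (np.asarray(points, dtype=float) - lo) / (hi - lo)

def search_zone(local_positions, box, margin=0.05):
    """Anchor a landmark's search zone to a new volume's bounding box.

    local_positions are one landmark's bounding-box-local coordinates
    collected from annotated volumes; an axis-aligned extent (plus a
    safety margin) approximates the convex hull of those points."""
    p = np.asarray(local_positions, dtype=float)
    lo = np.clip(p.min(axis=0) - margin, 0.0, 1.0)
    hi = np.clip(p.max(axis=0) + margin, 0.0, 1.0)
    box_lo = np.array([b[0] for b in box], dtype=float)
    box_hi = np.array([b[1] for b in box], dtype=float)
    return box_lo + lo * (box_hi - box_lo), box_lo + hi * (box_hi - box_lo)

box = [(0, 10), (0, 10), (0, 10)]
print(to_local([[5, 5, 5]], box))  # [[0.5 0.5 0.5]]
```

Expressing the annotations in box-local coordinates is what keeps the zone tight: the same landmark seen in differently positioned bodies still maps to a compact region.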
From the landmark processing point of view, the post-processing stage represents
filtering out and clustering of detected points (the landmark candidates).
From the applied MRI task point of view, the post-processing stage contains
procedures which perform the computation of the desired reference lines and planes.
For knee AVP and brain AVP, the post-processing stage implies auxiliary reference
line and plane evaluation on the basis of previously detected landmarks.
During post-processing, at the first stage, all detected point candidates are filtered
by thresholds on the landmark criterion. All candidates of a certain landmark type
whose quality criterion value is less than the threshold are eliminated. Such
thresholds are estimated in advance via a set of annotated MRI volumes. They are
chosen so as to minimize a loss function composed of false-positive and
false-negative errors. Optimal thresholds provide a balanced set of false positives
and false negatives for the whole set of available volumes; this balance can be
adjusted by a trade-off parameter. In the loss function mentioned above, the
number of false negatives is calculated as the sum of all missed ground truths in all
annotated volumes.
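The threshold selection described above can be sketched as a simple sweep (a hedged illustration: the chapter does not give the exact loss, so a weighted sum of per-candidate false positives and false negatives is assumed, with `alpha` as the trade-off parameter):

```python
import numpy as np

def pick_threshold(scores, is_true, alpha=0.5):
    """Sweep candidate thresholds and keep the one minimizing a weighted
    sum of false positives and false negatives.

    scores  : landmark-quality values of all detected candidates
    is_true : True where a candidate matches a ground-truth landmark
    alpha   : trade-off between the two kinds of errors
    """
    scores = np.asarray(scores, dtype=float)
    is_true = np.asarray(is_true, dtype=bool)
    best_t, best_loss = None, np.inf
    for t in np.unique(scores):
        accepted = scores >= t
        fp = np.sum(accepted & ~is_true)   # false detections kept
        fn = np.sum(~accepted & is_true)   # true candidates rejected
        loss = alpha * fp + (1.0 - alpha) * fn
        if loss < best_loss:
            best_t, best_loss = float(t), float(loss)
    return best_t

scores = [0.9, 0.8, 0.7, 0.3, 0.2]
truth  = [True, True, False, False, False]
print(pick_threshold(scores, truth))  # 0.8: keeps both true candidates
```

Raising `alpha` penalizes false positives more heavily and pushes the chosen threshold up; lowering it favours recall, which mirrors the trade-off parameter mentioned in the text.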
Fig. 11.4 Novel idea for automatic selection of working slices to be used for longitudinal fissure
detection
Fig. 11.5 Longitudinal fissure detection: (a) example of continuous and discontinuous fissure
lines; (b) fissure detector positions (shift and rotation) which should be tested
Fig. 11.6 An example of adequately detected and wrongly detected fissures. The idea for filtering
out outliers: (a) an example of the fissure detection in a chosen set of axial slices; (b) groups of
detected directions (shown as vectors) of the fissure in axial and coronal planes, respectively
The midsagittal plane (MSP) may also be constructed directly, as a least-squares
optimization task, on the basis of the points formed by the intersections of the
reference lines with the head contour in slices. The result is practically the same as
that obtained from the averaged vectors and central points.
It should be pointed out that the redundancy in the obtained reference lines plays
an important role in MSP estimation. As each reference line may be detected with
some inaccuracy, the data redundancy reduces the impact of errors on the MSP
position. This data redundancy distinguishes our approach from others (Wang and Li
2008) and makes the procedure more stable. In other approaches, as a rule, only two
slices are used for MSP creation: one axial and one coronal.
Some authors describe an entirely different approach which does not use the MSP
at all. For example, van der Kouwe et al. (2005) create slices and map them to a
statistical atlas by solving an optimization task with a rigid-body 3D transformation;
this spatial transformation relative to the atlas is then used for MRI plane correction.
The estimated MSP is used as one of the planned MRI views. Furthermore, the MSP
helps to reduce the search space for the landmark detector, since the landmarks
(corpus callosum anterior (CCA) and corpus callosum posterior (CCP)) are located
exactly in this plane.
area defined by the statistical atlas anchored to the volume. This search area is
obtained as a union of sets of points from subvolumes that correspond to statistical
distributions of landmarks’ positions. Secondly, inside the search area, a grid of
search points is defined with some prescribed step (i.e. distance between
neighbouring points).
For the classification of a point, its surrounding context is used. The surrounding
context is a portion of voxel data extracted from the neighbourhood of the search point.
In our approach, we pick up a cubic subvolume surrounding the respective search
point and extract three orthogonal slices of this subvolume passing through the
search point (Fig. 11.8).
Thus, the landmark detector scans a selected volume with a 3D sliding window
and performs classification of every point by its surrounding context (Fig. 11.9).
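The context extraction described above can be sketched as follows (a minimal NumPy illustration; the window size and the lack of border handling are assumptions, and the function name is hypothetical):

```python
import numpy as np

def orthogonal_context(volume, point, half=16):
    """Extract the surrounding context of a search point: three orthogonal
    2D slices of the cubic subvolume centred on the point.

    Returns an array of shape (3, 2*half, 2*half): the axial, coronal and
    sagittal planes through the point (boundary points are not handled).
    """
    z, y, x = point
    cube = volume[z - half:z + half, y - half:y + half, x - half:x + half]
    return np.stack([cube[half, :, :],   # fixed z: axial slice
                     cube[:, half, :],   # fixed y: coronal slice
                     cube[:, :, half]])  # fixed x: sagittal slice

vol = np.random.rand(64, 64, 64)
ctx = orthogonal_context(vol, (32, 32, 32), half=16)
print(ctx.shape)  # (3, 32, 32)
```

With `half=16` the context matches the three 32×32 input slices fed to the network described later in the chapter.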
The main part of the landmark detector is a discriminative system used for
classification of the surrounding context extracted around the current search point. In
our approach, we utilize neural networks (Rychagov 2003). In recent years,
convolutional neural networks (LeCun and Bengio 1995; Sirotenko 2006;
LeCun et al. 2010) have been applied to various recognition tasks and have shown
very promising results (Sermanet et al. 2012; Krizhevsky et al. 2012). The network
has several feature-extracting layers, pooling layers, rectification layers and fully
connected layers (Jarrett et al. 2009). The layers of the network contain trainable
weights which prescribe the behaviour of the discriminative system; these weights
are tuned by learning (training). Convolutional layers produce feature maps
obtained by convolving the input maps and applying a nonlinear function to the
result; this nonlinearity also depends on trainable parameters. Pooling layers
alternate with convolutional layers and down-sample the feature maps; we use max
pooling in our approach. Pooling layers provide invariance to small translations and
rotations of features. As rectification layers, we use abs rectification and local
contrast normalization layers (Jarrett et al. 2009). Finally, a fully connected layer is
placed on top of the network to produce the final result. The output of the
convolutional neural network is a vector whose number of elements equals the
number of landmarks plus one. For example, for the multiclass landmark detector
designed to detect the CCA and CCP landmarks, the neural network has three
outputs: two landmarks and background. These output values correspond to
pseudo-probabilities that the respective landmark is located at the current search
point (or that no landmark is located there, in the case of background).
We train our network in two stages. In the first stage, we perform an unsupervised
pre-training using predictive sparse decomposition (PSD) (Kavukcuoglu et al.
2010). Then, we perform a supervised training (refining the pre-trained weights
and learning other weights) using stochastic gradient descent with energy-based
learning (LeCun et al. 2006).
A specially prepared dataset is used for the training. Many medical images are
required to compose this training dataset; we have collected a number of brain,
cardiac, knee and spine scout MRI images for this purpose. These MRI volumes
were manually annotated: using a special programme, we marked the positions of
the landmarks of interest in each volume. These points are used to construct true
samples corresponding to the landmarks. A sample is a combination of a class label
(target output vector) and a portion of the respective voxel data taken, with a
predefined size, from the surrounding context of a certain point. The way such
surrounding context is extracted around the point of interest is explained in
Sect. 11.2.2. The samples are randomly picked from annotated volumes. The class
label is a vector of target probabilities that the investigated landmark
(or background) is located at the respective point. These target values are calculated
using the energy-based approach. For every landmark class, the target value is
calculated on the basis of the distance from the current sample to the closest
ground-truth point of this class. For example, if a sample is picked right at the
ground-truth point of Landmark_1, then the target value for this landmark is “1”. If
the sample is picked far from any ground truth of Landmark_1 (with the distance
exceeding the threshold), then the target value for this landmark is “0”. As the
sample approaches the ground truth, the target value increases monotonically. We
added some extra samples with spatial distortions to train the system to be robust
and invariant to noise.
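The distance-based target values can be sketched as follows (the chapter does not specify the exact fall-off profile, so a linear decay to zero at the threshold radius is assumed; the function name and radius are illustrative):

```python
import numpy as np

def target_value(sample_pos, gt_points, radius=8.0):
    """Soft target for one landmark class: 1 at a ground-truth point,
    decaying monotonically to 0 at distances >= radius.  A linear
    fall-off is assumed here for illustration."""
    gt = np.asarray(gt_points, dtype=float)
    d = np.min(np.linalg.norm(gt - np.asarray(sample_pos, dtype=float), axis=1))
    return float(max(0.0, 1.0 - d / radius))

gts = [(10.0, 10.0, 10.0)]
print(target_value((10, 10, 10), gts))  # 1.0 (right at the ground truth)
print(target_value((10, 10, 30), gts))  # 0.0 (farther than the radius)
```

Any monotone profile (e.g. a Gaussian of the distance) would satisfy the description in the text; only the endpoints (1 at the landmark, 0 beyond the threshold) are fixed by it.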
At the first stage of learning, an unsupervised pre-training procedure initializes
the weights W of all convolutional (feature extraction) layers of the neural network.
Learning in unsupervised mode uses unlabelled data. This learning is performed via
sparse coding and predictive sparse decomposition techniques (Kavukcuoglu et al.
2010). The learning process is done separately for each layer by performing an
optimization procedure:
$$ W^{*} = \arg\min_{W} \sum_{y \in Y} \lVert z - F_W(y) \rVert^{2}, $$
where Y is a training set, y is an input of the layer from the training set, z is a sparse
code of y, and F_W is a predictor (a function depending on W that transforms the
input y into the output of the layer).
This optimization is performed by stochastic gradient descent. Each training
sample is encoded into a sparse code using the dictionary D. The predictor F_W
produces features of the sample which should be close to the sparse codes. The
reconstruction error, calculated on the basis of the features and the sparse code, is
used to calculate the gradient in order to update the weights W. To compute a sparse
code for a certain input, the following optimization problem is solved:
$$ z^{*} = \arg\min_{z} \lVert z \rVert_{0} \quad \text{subject to} \quad y = Dz, $$
where D is the dictionary, y is the input signal, z is the encoded signal (code), and
z* is the optimal sparse code.
In the above equation, the input y is represented as a linear combination of only a
few elements of some dictionary D; this means that the produced code z (the vector
of decomposition coefficients) is sparse. The dictionary D is obtained from the
training set in unsupervised mode (without using annotated labels). An advantage
of the unsupervised approach to finding the optimal dictionary D is that the
dictionary is learned directly from the data, so the found dictionary optimally
represents the hidden structure and specific nature of the data used. An additional
advantage is that the approach does not need a large amount of annotated input data
for the dictionary training. Finding D is equivalent to solving the optimization
problem:
$$ D^{*} = \arg\min_{D} \sum_{y \in Y} \lVert Dz - y \rVert^{2}, $$
where Y is a training set, y is an input of the layer from the training set, z is a sparse
code of y, and D is the dictionary.
This optimization is performed via stochastic gradient descent. The sparse code is
decoded to produce the decoded data. The reconstruction error (discrepancy) is
calculated on the basis of the training sample and the decoded data; the discrepancy
is used to calculate gradients for updating the dictionary D. The adjustment of the
dictionary D alternates with finding the optimal code z for the input y with D fixed.
For all layers except the first, the training set Y is formed as the set of the previous
layer’s outputs.
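The alternating sparse-coding and dictionary-update loop can be sketched as follows (a hedged illustration: an L1 relaxation solved by ISTA replaces the L0 problem above, and all hyperparameters are illustrative, not from the chapter):

```python
import numpy as np

def ista(y, D, lam=0.1, steps=100):
    """Sparse code z ~ argmin ||y - Dz||^2 + lam*||z||_1 via ISTA
    (an L1 relaxation of the L0 problem in the text)."""
    L = np.linalg.norm(D, 2) ** 2        # Lipschitz constant of the gradient
    z = np.zeros(D.shape[1])
    for _ in range(steps):
        z = z - D.T @ (D @ z - y) / L    # gradient step on the data term
        z = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)  # shrinkage
    return z

def learn_dictionary(Y, n_atoms, lam=0.1, lr=0.05, epochs=5, seed=0):
    """Alternate sparse coding with stochastic gradient updates of D;
    columns are renormalized to unit length after every update."""
    rng = np.random.default_rng(seed)
    D = rng.standard_normal((Y.shape[1], n_atoms))
    D /= np.linalg.norm(D, axis=0)
    for _ in range(epochs):
        for y in Y:
            z = ista(y, D, lam)
            D -= lr * np.outer(D @ z - y, z)   # (half-)gradient of ||Dz - y||^2
            D /= np.maximum(np.linalg.norm(D, axis=0), 1e-8)
    return D

Y = np.random.default_rng(1).standard_normal((20, 10))
D = learn_dictionary(Y, n_atoms=5)
print(D.shape)  # (10, 5)
```

This mirrors the alternation described in the text: codes are found with D fixed, then D is updated from the reconstruction discrepancy, here with explicit column renormalization to keep the dictionary well scaled.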
Unsupervised pre-training is useful when only a small amount of labelled data is
available. We demonstrate the superiority of using PSD with few labelled MRI
volumes. For this experiment, unsupervised pre-training of our network was
performed in advance. Then, several annotated volumes were picked for supervised
fine-tuning. After training, the misclassification rate (MCR) on the test dataset
(samples from MRI volumes which were not used during training) was calculated.
The plot in Fig. 11.10 shows the classification performance (the lower the MCR, the
better) as a function of the number of annotated volumes taking part in supervised
learning.
After the unsupervised training is completed, the entire convolutional neural
network is adjusted to produce multi-level sparse codes which are a good hierarchi-
cal feature representation of input data.
In the next step, supervised training is performed to tune the whole neural
network to produce features (an output vector) corresponding to the probabilities of
the appearance of a certain landmark or background at a certain point. This is done
by performing the following optimization:
$$ W^{*} = \arg\min_{W} \sum_{y \in Y} \lVert x - G_W(y) \rVert^{2}, $$
Fig. 11.10 Misclassification rate plot: the MCR is calculated on the test dataset using a CNN
trained with various numbers of annotated MRI volumes. Red line, pure supervised mode; blue
line, supervised mode with PSD initialization
the first one with their weights updating. At the beginning of the procedure, some
weights of feature extraction layers are initialized with the values computed at the
pre-training stage. The final feature extractor is able to produce a feature vector that
could be directly used for discriminative classification of the input or for assigning to
every class a probability of the input belonging to a respective class.
The trained classifier shows good performance on datasets composed of samples
from different types of MRI volumes (such as knee, brain, cardiac, spine). For future
repeatability and comparison, we also trained our classifier on the OASIS dataset of
brain MRI volumes which is available online (Marcus et al. 2007). MCR results
calculated on test datasets are shown in Table 11.2.
For validation of the convolutional neural network approach, we compared it with
the widely used support vector machine (SVM) classifier (Steinwart and Christmann
2008) applied to the same samples. We used training and testing datasets composed
of samples from cardiac MRI volumes. Table 11.3 demonstrates the superiority of
convolutional neural networks.
The spine AVP post-processing stage is used for the filtering of detected landmarks,
their clustering and spinal curve approximation on the basis of these clustered
landmark nodes.
The clustering of the detected points is performed to eliminate outliers among
them (if any are present). This operation finds several clusters – dense groups of
candidates – and all points outside these groups are filtered out. Point quality weight
factors may optionally be taken into account in this operation. After such processing,
a set of cluster centres is obtained, which may be regarded as nodes for the
subsequent creation of the spinal approximation curve. They represent more reliable
data than the original detected points. The clustering operation is illustrated in
Fig. 11.11.
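The outlier-removing clustering can be sketched as follows (a simple greedy stand-in for the clustering step; the radius, minimum size and the unweighted averaging are assumptions, and point quality weights could scale each point's contribution):

```python
import numpy as np

def cluster_centres(points, radius=10.0, min_size=3):
    """Group detected 3D candidates into dense clusters and drop outliers.

    A point joins a cluster if it lies within `radius` of the cluster's
    running centre; clusters smaller than `min_size` are discarded."""
    clusters = []  # each entry: [sum_of_points, count]
    for p in np.asarray(points, dtype=float):
        for c in clusters:
            if np.linalg.norm(c[0] / c[1] - p) <= radius:
                c[0] += p
                c[1] += 1
                break
        else:
            clusters.append([p.copy(), 1])
    return [c[0] / c[1] for c in clusters if c[1] >= min_size]

pts = [(0, 0, 0), (1, 0, 0), (0, 1, 0),   # dense group near the origin
       (50, 50, 50)]                       # lone outlier
print(cluster_centres(pts, radius=5.0, min_size=3))  # one centre survives
```

The surviving cluster centres play the role of the nodes used below for fitting the spinal approximation curve.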
On the basis of the clustered nodes, a refined sagittal plane is created. This is one
of the AVP results.
Another necessary result is a set of local coronal planes adapted to each
discovered vertebra (or intervertebral disc). To create such planes, the respective
intervertebral disc locations should be found, as well as the planes’ normal vector
directions.
Firstly, a spinal curve approximation is created on the basis of the previously
estimated nodes – the clustered landmarks. The approximation is represented by two
independent functions of the height coordinate: x(z) and y(z) in the coronal and
sagittal sections, respectively. The coronal function x(z) is represented as a sloped
straight line. The sagittal function y(z) differs for the C, T and L zones of the spine
(upper, middle and lower spine, respectively). The sagittal approximation for these
zones is represented as a series of components such as a straight line, a parabolic
function and a trigonometric one (for the C and L zones) with adjusted amplitude,
period and starting phase. The approximation is fitted to the obtained nodes via the
least-squares approach. As a rule, such an approximation is quite satisfactory; see
the illustration in Fig. 11.12.
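The least-squares fit can be sketched as follows (a simplified basis: the sinusoid's period is fixed rather than adjusted, and the C/T/L zones are not treated separately, so treat this as an illustration of the fitting step only):

```python
import numpy as np

def fit_spinal_curve(nodes, period=120.0):
    """Least-squares fit of the spinal curve through clustered nodes.

    nodes : (N, 3) array of (x, y, z) cluster centres.
    x(z) is fitted as a sloped straight line; y(z) as a straight line plus
    a parabola plus one sinusoidal component with a fixed period."""
    nodes = np.asarray(nodes, dtype=float)
    x, y, z = nodes[:, 0], nodes[:, 1], nodes[:, 2]
    cx = np.polyfit(z, x, 1)                    # coronal: straight line
    w = 2.0 * np.pi / period
    A = np.column_stack([np.ones_like(z), z, z ** 2,
                         np.sin(w * z), np.cos(w * z)])
    cy, *_ = np.linalg.lstsq(A, y, rcond=None)  # sagittal components
    return cx, cy, w

def eval_curve(cx, cy, w, z):
    """Evaluate the fitted curve at heights z."""
    z = np.asarray(z, dtype=float)
    x = np.polyval(cx, z)
    y = (cy[0] + cy[1] * z + cy[2] * z ** 2
         + cy[3] * np.sin(w * z) + cy[4] * np.cos(w * z))
    return x, y

# Synthetic nodes lying on x = 0.1 z + 2 and y = 3 + 0.05 z:
z = np.linspace(0.0, 200.0, 20)
nodes = np.stack([0.1 * z + 2.0, 3.0 + 0.05 * z, z], axis=1)
cx, cy, w = fit_spinal_curve(nodes)
```

Using sin and cos terms with free coefficients is equivalent to fitting a sinusoid's amplitude and starting phase; only the period is held fixed in this sketch.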
Fig. 11.12 An example of spinal curve approximation (via nodes obtained as the clustered
landmark centres). Average brightness measured along the curve is shown in the illustration, too
Then, the intervertebral disc locations should be determined. On the basis of the
spinal curve approximation, a curved “secant tube” is created, which passes
approximately through the vertebra centres. The averaged brightness of the voxels
in the tube’s axial sections is collected along the tube. Thus, a voxel brightness
function is obtained for the spinal curve: B(z), or B(L), where L is the running length
along the spinal curve.
In a T2 MRI protocol, vertebrae appear as bright structures, whereas
intervertebral discs appear as dark ones. Hence, the brightness function (as well as
its gradients) is used to detect candidate intervertebral disc positions (see
Fig. 11.12). Nevertheless, the disc locations may be poorly expressed, and, on the
other hand, many false-positive locations may be detected. To avoid this problem,
additional filtration, which we call “periodic filtration”, is applied to these candidate
disc positions. A statistical model of intervertebral distances is created, and the
averaged positions of the discs along the spinal curve are represented as a pulse
function. A
convolution of this pulse function with the spinal curve brightness function (or with
the brightness gradient function) permits us to detect the disc locations more
precisely. The parameters of this pulse function – its shift and scale – are adjusted
during the convolution optimization process. The processing of the spinal brightness
function is illustrated in Fig. 11.13.
Fig. 11.13 The vertebra statistical model (the periodic “pulse function”) as additional knowledge
for estimating the supposed vertebra locations
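The periodic matching can be sketched as follows (a hedged illustration: an exhaustive search over shift and period stands in for the convolution optimization, and the profile, pulse count and period range are synthetic):

```python
import numpy as np

def match_pulse_train(profile, period_range, n_pulses=5):
    """Find the shift and period of a pulse train that best matches the
    dark dips (intervertebral discs) of a brightness profile sampled
    along the spinal curve.  The score is the summed darkness (negated
    brightness) at the pulse positions."""
    p = np.asarray(profile, dtype=float)
    darkness = p.max() - p
    best = (None, None, -np.inf)
    for period in period_range:
        for shift in range(int(period)):
            pos = shift + period * np.arange(n_pulses)
            pos = pos[pos < len(p)].astype(int)
            score = darkness[pos].sum()
            if score > best[2]:
                best = (shift, period, float(score))
    return best[:2]

# Synthetic profile: bright vertebrae with dark discs every 20 samples.
profile = np.ones(100)
profile[10::20] = 0.0          # discs at 10, 30, 50, 70, 90
print(match_pulse_train(profile, period_range=range(15, 26)))  # (10, 20)
```

In the chapter's terms, `shift` and `period` are the adjusted parameters of the pulse function, and the brightness-gradient function could be matched in the same way.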
Finally, the supposed locations of the intervertebral discs are determined. The local
coronal planes are computed at these points, and the planes’ directional vectors are
evaluated as the local direction vectors of the spinal approximation curve.
The result of the intervertebral disc secant plane estimation is shown in Fig. 11.14.
As a rule, 2D sagittal slices are used for vertebra (or intervertebral disc) detection
via various techniques (gradient finding, active shape models, etc.). Some authors
use segmentation (Law et al. 2012), and some do not (Alomari et al. 2010). In the
majority of works, a statistical context model is used to make vertebra finding
more robust (Alomari et al. 2011; Neubert et al. 2011). Sometimes, a marginal space
approach is used to speed up the process (Kelm et al. 2011).
A distinctive feature of our approach is that we use mostly 3D techniques. We do
not use any segmentation, neither 3D nor 2D: we obtain the spatial location of the
spinal column directly via the detected 3D points. Our spinal curve is
three-dimensional, as are the clustering and filtering methods.
To make the detection operation more robust, we use a spine statistical model: the
vertebra “pulse function”, which represents the statistically determined
intervertebral distance along the spinal curve. This model is used in combination
with voxel brightness analysis in the spatial tubular structure created near the spinal
curve.
In some complicated cases (poor scout image quality, artefacts, abnormal anatomy,
etc.), the output of the landmark detector could be ambiguous: several landmarks
may be detected with a high landmark (LM) quality measure for a given LM type. In
order to eliminate this ambiguity, a special algorithm is used, based on the statistics
of the landmarks’ relative positions.
The goal of the algorithm is to find the configuration of landmarks with high LM
quality (LMQ) values minimizing the energy:

$$ E(X, M_X, \Sigma_X) = \sum_{x_s \in X,\; x_t \in X} \psi_{st}(x_s, x_t, M_X, \Sigma_X), $$
11.3 Results
For algorithm testing and landmark detector training, a database of real MRI
volumes of various anatomies was collected. It includes 94 brain, 80 cardiac,
31 knee and 99 spine MRI volumes. Relying on the robustness of our landmark
detector, we acquired low-resolution 3D MRI volumes; the advantage of this
approach is a short acquisition time. All data were acquired with our specific
protocol and then annotated by experienced radiologists.
In our implementation, we used a convolutional neural network with an input size of 32 × 32 × 3 slices. Input volumes (at both the learning and processing stages) were resized to a spacing of 2 mm, so 32 pixels correspond to 64 mm. The architecture of the network is the following: the first convolutional layer has 16 kernels of size 8 × 8 with a shrinkage transfer function; it is followed by abs rectification, local contrast normalization and max-pooling with a down-sampling factor of 2; then comes the second convolutional layer with 32 kernels of size 8 × 8, connected to the 16 input feature maps through a partially filled connection matrix; the network is finalized by a fully connected layer with a hyperbolic tangent sigmoid transfer function.
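The layer arithmetic above can be checked with a minimal numpy forward pass. The soft-thresholding form of the shrinkage function, the simplified per-map contrast normalization and the fully filled connection matrix in the second layer are our assumptions for brevity; weights are random, so only the shapes are meaningful.

```python
import numpy as np

rng = np.random.default_rng(0)

def conv2d(x, kernels):
    """Valid cross-correlation: x (C_in, H, W), kernels (C_out, C_in, k, k)."""
    c_out, c_in, k, _ = kernels.shape
    H, W = x.shape[1] - k + 1, x.shape[2] - k + 1
    out = np.zeros((c_out, H, W))
    for o in range(c_out):
        for i in range(H):
            for j in range(W):
                out[o, i, j] = np.sum(x[:, i:i + k, j:j + k] * kernels[o])
    return out

def shrinkage(x, theta=0.1):
    """Shrinkage transfer function (assumed soft-thresholding form)."""
    return np.sign(x) * np.maximum(np.abs(x) - theta, 0.0)

def lcn(x, eps=1e-6):
    """Crude per-map contrast normalization (simplified stand-in)."""
    m = x.mean(axis=(1, 2), keepdims=True)
    s = x.std(axis=(1, 2), keepdims=True)
    return (x - m) / (s + eps)

def max_pool(x, f):
    C, H, W = x.shape
    H2, W2 = H // f, W // f
    return x[:, :H2 * f, :W2 * f].reshape(C, H2, f, W2, f).max(axis=(2, 4))

x = rng.standard_normal((3, 32, 32))                  # 32 x 32 input, 3 slices
h = shrinkage(conv2d(x, 0.1 * rng.standard_normal((16, 3, 8, 8))))
h = max_pool(lcn(np.abs(h)), 2)                       # abs + LCN + pooling by 2
h = conv2d(h, 0.1 * rng.standard_normal((32, 16, 8, 8)))
y = np.tanh(0.01 * rng.standard_normal((1, h.size)) @ h.reshape(-1, 1))
print(h.shape, y.shape)  # (32, 5, 5) (1, 1)
```

The 32 × 32 input shrinks to 25 × 25 after the first 8 × 8 convolution, to 12 × 12 after pooling, and to 5 × 5 after the second convolution, giving 800 features for the final layer.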
Verification of the AVP framework was performed by comparing automatically built views with views built on the basis of the ground-truth landmark positions. The constructed view planes were compared with the ground-truth ones by the angle and the generalized distance between them. For spine MRI result verification, the number of missed and wrongly determined vertebral discs was counted; only discs of a specified type were considered in this procedure. Examples of the constructed views are shown in Figs. 11.15, 11.16, 11.17, and 11.18.
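A sketch of how constructed planes might be compared by angle and by a point-to-plane distance follows; the point-to-plane form is one plausible reading of "generalized distance", since the exact metric is not specified here, and all names are illustrative.

```python
import numpy as np

def plane_angle_deg(n_auto, n_gt):
    """Angle between two view planes, given their normal vectors."""
    n1, n2 = np.asarray(n_auto, float), np.asarray(n_gt, float)
    c = abs(float(n1 @ n2)) / (np.linalg.norm(n1) * np.linalg.norm(n2))
    return float(np.degrees(np.arccos(np.clip(c, -1.0, 1.0))))

def plane_distance(p_auto, n_auto, p_gt):
    """Distance from a ground-truth anchor point to the built plane."""
    n = np.asarray(n_auto, float)
    n = n / np.linalg.norm(n)
    return float(abs((np.asarray(p_gt, float) - np.asarray(p_auto, float)) @ n))

print(round(plane_angle_deg([0, 0, 1], [0, 1, 1]), 6))   # 45.0
print(plane_distance([0, 0, 0], [0, 0, 1], [0, 0, 3]))   # 3.0
```

Taking the absolute value of the dot product makes the angle insensitive to the sign convention of the normals.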
Fig. 11.15 Brain AVP results: (a–d) automatically built midsagittal views. Red lines indicate intersections with the corresponding axial and coronal planes. The axial plane passes through the CCA and CCP correctly
Fig. 11.16 Knee AVP results: (a–c) ground-truth views, (d–f) automatically built views. Red lines indicate intersections with other slices
Fig. 11.17 Spine AVP results: positions and orientations of detected discs (green) and the spinal curve (red) for different spinal zones: (a) cervical, (b) thoracic and (c) lumbar
11.4 Conclusion
Our results demonstrate that we were able to achieve reliable, robust results in much less time than our best competitors. Based on these results, we believe that the proposed AVP framework will help clinicians achieve fast, accurate and repeatable MRI scans and can become a major differentiating aspect of a company's MRI product portfolio.
Fig. 11.18 Cardiac AVP results: (a–d) ground-truth views; (e–h) automatically built views; (a, e) 2-chamber view; (b, f) 3-chamber view; (c, g) 4-chamber view; (d, h) view from the short-axis stack
300 A. B. Danilevich et al.
Table 11.4 Time comparison with competitors: imaging time (IT), processing time (PT) and total time (TT)

MRI type  Competitor  IT (s)  PT (s)  TT (s)   Our IT (s)  Our PT (s)  Our TT (s)
Cardiac   S           20      12.5    32.5     19          5           24
Cardiac   P           –       103     100+
Brain     S           42      2       44       25          1           26
Brain     G           40      2       42
Knee      S           –       30      30+      23          2           25
Knee      P           40      6       46
Spine     S           30      5       45       27          2           29
Spine     P           120     8       128
Spine     G           25      7       32
References
Alomari, R.S., Corso, J., Chaudhary, V., Dhillon, G.: Computer-aided diagnosis of lumbar disc
pathology from clinical lower spine MRI. Int. J. Comput. Assist. Radiol. Surg. 5(3), 287–293
(2010)
Alomari, R.S., Corso, J., Chaudhary, V.: Labeling of lumbar discs using both pixel-and object-level
features with a two-level probabilistic model. IEEE Trans. Med. Imaging. 30(1), 1–10 (2011)
Bauer, S., Ritacco, L.E., Boesch, C., Reyes, M.: Automatic scan planning for magnetic resonance
imaging of the knee joint. Ann. Biomed. Eng. 40(9), 2033–2042 (2012)
Bystrov, D., Pekar, V., Young, S., Dries, S.P.M., Heese, H.S., van Muiswinkel, A.M.: Automated
planning of MRI scans of knee joints. Proc. SPIE Med. Imag. 6509 (2007)
Fenchel, M., Thesen, A., Schilling, A.: Automatic labeling of anatomical structures in MR
FastView images using a statistical atlas. In: Medical Image Computing and Computer-Assisted
Intervention–MICCAI, pp. 576–584. Springer, Berlin, Heidelberg (2008)
Iskurt, A., Becerikly, Y., Mahmutyazicioglu, K.: Automatic identification of landmarks for standard
slice positioning in brain MRI. J. Magn. Reson. Imaging. 34(3), 499–510 (2011)
Jarrett, K., Kavukcuoglu, K., Ranzato, M.A., LeCun, Y.: What is the best multi-stage architecture
for object recognition? In: Proceedings of 12th International Conference on Computer Vision,
vol. 1, pp. 2146–2153 (2009)
Kavukcuoglu, K., Ranzato, M.A., LeCun, Y.: Fast inference in sparse coding algorithms with
applications to object recognition. arXiv preprint arXiv: 1010.3467 (2010)
Kelm, B.M., Zhou, K., Suehling, M., Zheng, Y., Wels, M., Comaniciu, D.: Detection of 3D spinal
geometry using iterated marginal space learning. In: Medical Computer Vision. Recognition
Techniques and Applications in Medical Imaging, pp. 96–105. Springer, Berlin, Heidelberg
(2011)
Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural
networks. Adv. Neural Inform. Process. Syst. 25(2), 1–9 (2012)
Law, M.W.K., Tay, K.Y., Leung, A., Garvin, G.J., Li, S.: Intervertebral disc segmentation in MR
images using anisotropic oriented flux. Med. Image Anal. 17(1), 43–61 (2012)
Lecouvet, F.E., Claus, J., Schmitz, P., Denolin, V., Bos, C., Vande Berg, B.C.: Clinical evaluation
of automated scan prescription of knee MR images. J. Magn. Reson. Imaging. 29(1), 141–145
(2009)
LeCun, Y., Bengio, Y.: Convolutional networks for images, speech, and time series. In: The
Handbook of Brain Theory and Neural Networks, vol. 3361(10) (1995)
LeCun, Y., Chopra, S., Hadsell, R., Ranzato, M.A., Huang, F.J.: A tutorial on energy-based
learning. In: Bakir, G., Hofman, T., Schölkopf, B., Smola, A., Taskar, B. (eds.) Predicting
Structured Data. MIT Press, Cambridge, USA (2006)
LeCun, Y., Kavukcuoglu, K., Farabet, C.: Convolutional networks and applications in vision. In:
Proceedings of 2010 IEEE International Symposium on Circuits and Systems (ISCAS),
pp. 253–256 (2010)
Li, P., Xu, Q., Chen, C., Novak, C.L.: Automated alignment of MRI brain scan by anatomic
landmarks. In: Proceedings of SPIE, Medical Imaging, vol. 7259, (2009)
Lu, X., Jolly, M.-P., Georgescu, B., Hayes, C., Speier, P., Schmidt, M., Bi, X., Kroeker, T.,
Comaniciu, D., Kellman, P., Mueller, E., Guehring, J.: Automatic view planning for cardiac
MRI acquisition. In: Medical Image Computing and Computer-Assisted Intervention–MICCAI
2011, pp. 479–486. Springer, Berlin, Heidelberg (2011)
Marcus, D.S., Wang, T.H., Parker, J., Csernansky, J.G., Morris, J.C., Buckner, R.L.: Open Access
Series of Imaging Studies (OASIS): cross-sectional MRI data in young, middle aged,
nondemented, and demented older adults. J. Cogn. Neurosci. 19(9), 1498–1507 (2007)
Neubert, A., Fripp, J., Shen, K., Engstrom, C., Schwarz, R., Lauer, L., Salvado, O., Crozier, S.:
Automated segmentation of lumbar vertebral bodies and intervertebral discs from MRI using
statistical shape models. In: Proc. of International Society for Magnetic Resonance in Medicine,
vol. 19, p. 1122 (2011)
Pekar, V., Bystrov, D., Heese, H.S., Dries, S.P.M., Schmidt, S., Grewer, R., den Harder, C.J.,
Bergmans, R.C., Simonetti, A.W., van Muiswinkel, A.M.: Automated planning of scan geom-
etries in spine MRI scans. In: Medical Image Computing and Computer-Assisted Intervention–
MICCAI, pp. 601–608. Springer, Berlin, Heidelberg (2007)
Rychagov, M.: Neural networks: Multilayer perceptron and Hopfield networks. Exponenta Pro.
Appl. Math. 1, 29–37 (2003)
Sermanet, P., Chintala, S., LeCun, Y.: Convolutional neural networks applied to house
numbers digit classification. In: Proceedings of the 21st International Conference on Pattern
Recognition (ICPR), pp. 3288–3291 (2012)
Sirotenko, M.: Applications of convolutional neural networks in mobile robots motion trajectory
planning. In: Proceedings of Scientific Conference and Workshop. Mobile Robots and
Mechatronic Systems, pp. 174–181. MSU Publishing, Moscow (2006)
Steinwart, I., Christmann, A.: Support vector machines. Springer, New York (2008)
van der Kouwe, A.J.W., Benner, T., Fischl, B., Schmitt, F., Salat, D.H., Harder, M., Sorensen, A.G.,
Dale, A.M.: On-line automatic slice positioning for brain MR imaging. Neuroimage. 27(1),
222–230 (2005)
Wang, Y., Li, Z.: Consistent detection of mid-sagittal planes for follow-up MR brain studies. In:
Proceedings of SPIE, Medical Imaging, vol. 6914, (2008)
Young, S., Bystrov, D., Netsch, T., Bergmans, R., van Muiswinkel, A., Visser, F., Sprigorum, R.,
Gieseke, J.: Automated planning of MRI neuro scans. In: Proceedings of SPIE, Medical
Imaging, vol. 6144, (2006)
Zhan, Y., Dewan, M., Harder, M., Krishnan, A., Zhou, X.S.: Robust automatic knee MR slice
positioning through redundant and hierarchical anatomy detection. IEEE Trans. Med. Imaging.
30(12), 2087–2100 (2011)
Zheng, Y., Lu, X., Georgescu, B., Littmann, A., Mueller, E., Comaniciu, D.: Automatic left
ventricle detection in MRI images using marginal space learning and component-based voting.
In: Proceedings of SPIE, vol. 7259, (2009)
Chapter 12
Dictionary-Based Compressed Sensing MRI
12.1.1 Introduction
A. S. Migukin (*)
Huawei Russian Research Institute, Moscow, Russia
e-mail: artem.migukin@huawei.com
D. A. Korobchenko
Nvidia Corporation, Moscow Office, Moscow, Russia
e-mail: dkorobchenko@nvidia.com
K. A. Gavrilyuk
University of Amsterdam, Amsterdam, The Netherlands
e-mail: kgavrilyuk@uva.nl
Fig. 12.2 Aliasing artefacts: the magnitude of the inverse Fourier transform (a) for the fully sampled spectrum and (b) for the undersampled k-space spectrum shown in Fig. 12.1
2007; Ma et al. 2008; Goldstein and Osher 2009). However, reconstruction by leading CS MRI methods with nonadaptive, global sparsifying transforms (finite differences, wavelets, contourlets, etc.) is usually limited to relatively low undersampling rates and still suffers from many undesirable artefacts and loss of features (Ravishankar and Bresler 2010). The images are usually represented by a general predefined basis or frame, which may not provide a sufficiently sparse representation for them. For instance, traditional separable wavelets fail to sparsely represent the geometric regularity along singularities, and conventional total variation (TV) results in staircase artefacts when the acquired k-space data are limited (Goldstein and Osher 2009). Contourlets (Do and Vetterli 2006) can sparsely represent smooth details but not spots in images. All these transforms favour sparse representation only of global image characteristics.
Local sparsifying transforms allow highlighting a broad set of fine details, i.e. they carry local geometric information (Qu et al. 2012). In the so-called patch-based approach, an image is divided into small overlapping blocks (patches), and the vector corresponding to each patch is modelled as a sparse linear combination of candidate vectors, termed atoms, taken from a set called the dictionary. With fixed transforms, one would require a huge arsenal of them, whose perfect fit is extremely hard to find; alternatively, the patch size would have to be decreased constantly.
Consequently, researchers have shown great interest in finding adaptive sparse regularization. Images may be represented in a content-adaptive way by patches via a collaborative sparsifying transform (Dabov et al. 2007) or in terms of dictionary-based image restoration (Aharon et al. 2006). Adaptive transforms (dictionaries) can sparsify images better because they are constructed (learnt) specifically for a particular image instance or class of images. Recent studies have shown the promise of patch-based sparsifying transforms in a variety of applications, such as image denoising (Elad and Aharon 2006), deblurring (Danielyan et al. 2011) or specific tasks such as phase retrieval (Migukin et al. 2013).
In this work, we exploit adaptive patch-based dictionaries to obtain substantially improved reconstruction performance for CS MRI. To the best of our knowledge, and following Ravishankar and Bresler (2010), Caballero et al. (2012) and Song et al. (2014), such sparse regularization with trained patch-based dictionaries
306 A. S. Migukin et al.
y = Fx,
where F is the Fourier transform matrix. The vectors x and y are of the same length.
y_u = F_u x = m ∘ Fx.
The problem is to find the unknown object vector x from the vector of the
undersampled spectral measurements yu, i.e. to solve an underdetermined system
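The forward model and the aliasing caused by zero filling can be reproduced with a few lines of numpy; the image size and mask density below are arbitrary illustration choices, not the masks used later in the chapter.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.standard_normal((64, 64)) + 1j * rng.standard_normal((64, 64))

# Normalized 2D Fourier transform, so that F^H F = I (norm="ortho")
y = np.fft.fft2(x, norm="ortho")

# Binary mask m keeps ~25% of k-space (undersampling rate about 4)
m = (rng.random((64, 64)) < 0.25).astype(float)
y_u = m * y                                 # y_u = m ∘ Fx

# Zero filling the missing samples yields an aliased reconstruction
x_zf = np.fft.ifft2(y_u, norm="ortho")
print(x_zf.shape)  # (64, 64)
```

Because three quarters of the spectrum is discarded, `x_zf` differs from `x` by exactly the aliasing artefacts illustrated in Fig. 12.2.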
12 Dictionary-Based Compressed Sensing MRI 307
This sparse coding problem can be solved by greedy algorithms (Elad 2010).
Following Donoho (2006), the CS reconstruction problem can also be simplified by replacing the ℓ0 norm with its convex relaxation, the ℓ1 norm. Since the real
measurements are always noisy, the CS problem is shown (Donoho et al. 2006) to
be efficiently solved using basis pursuit denoising. Thus, the typical formulation of
the CS MRI reconstruction problem has the following Lagrangian setup (Lustig et al.
2007):
where the subscript F denotes the Frobenius norm, the columns of the matrix X = P(x) are vectorized patches extracted by the operator P, the column vectors of the matrix Z are sparse codes, and ν is a positive parameter for the synthesis penalty term. Literally, it is assumed that each vectorized patch X_i can be approximated by the linear combination DZ_i, where each column vector Z_i contains no more than T non-zero components. Note that X is formed as a set of overlapping patches extracted from the object image x. Since the sparse approximation is performed for vectorized patches, no restriction on the patch shape is imposed; in our work, we deal with rectangular patches to harmonize their size with the specifics of k-space sampling.
x^{k+1} = arg min_x ||x − A(D Z^{k+1})||_2^2 + ν ||y_u − F_u x||_2^2.
Here A denotes the operator that assembles the vectorized image from a set of patches (the columns of the input matrix); in particular, the approximation of the image vector is assembled from the sparse approximations of the patches {DZ_i}. The positive parameter ν represents the confidence in the given (noisy) measurements, and the parameter τ controls the accuracy of the object synthesis from sparse codes. Algorithms of this kind are the subject of intensive research in various application areas (Bioucas-Dias and Figueiredo 2007; Afonso et al. 2010).
In the first step, the object estimate is assumed to be fixed, and the sparse codes {Z_i} are found using the given dictionary D in terms of basis pursuit denoising, so that the sparse object approximation is satisfied to a certain tolerance τ. In our work, this is realized with the orthogonal matching pursuit (OMP) algorithm (Elad 2010). In the second step, the sparse representation of the object is assumed to be fixed, and the object estimate is updated targeting data consistency. The last equation in the system above can be resolved from the normal equation:
(F_u^H F_u + (1/ν) I) x = F_u^H y_u + (1/ν) A(D Z^{k+1}).
The superscript {·}^H denotes the Hermitian transpose. Solving this equation directly is tedious due to the inversion of a typically huge matrix. It can be simplified by transforming from the image to the Fourier domain. Let the Fourier transform matrix F from the first equation of this chapter be normalized such that F^H F = I. Then:
(F F_u^H F_u F^H + (1/ν) I) F x = F F_u^H y_u + (1/ν) F A(D Z^{k+1}),

where F F_u^H F_u F^H is a diagonal matrix of ones and zeros; the ones are at those diagonal entries that correspond to a sampled location in k-space. Here ȳ_u = F F_u^H y_u and y^{k+1/2} = F A(D Z^{k+1}); the vector y^{k+1/2} represents the Fourier spectrum of the sparse-approximated object at the k-th iteration. It follows that the resulting Fourier spectrum estimate has the form (Ravishankar and Bresler 2010, cf. Eq. 9):

y^{k+1} = (1 − m) ∘ y^{k+1/2} + m ∘ (y^{k+1/2} + ν ȳ_u) / (1 + ν),
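This closed-form spectrum update translates directly into code; the toy 1D vectors below are purely illustrative.

```python
import numpy as np

def kspace_update(y_half, y_u_bar, m, nu):
    """Per-entry spectrum update: keep the sparse-coded estimate at
    unsampled positions, blend with the measurements at sampled ones."""
    return (1 - m) * y_half + m * (y_half + nu * y_u_bar) / (1 + nu)

m = np.array([1.0, 0.0, 1.0, 0.0])                      # sampling mask
y_half = np.array([1.0, 2.0, 3.0, 4.0], dtype=complex)  # sparse-coded spectrum
y_u_bar = np.array([2.0, 0.0, 1.0, 0.0], dtype=complex) # measurements
print(kspace_update(y_half, y_u_bar, m, nu=1.0))
```

As ν grows, the sampled entries approach the measurements exactly, i.e. the update tends to hard data consistency.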
Dictionary learning aims to solve the following basis pursuit denoising problem
(Elad 2010) with respect to D:
Since dictionary elements are basis atoms used to represent image patches, they should also be learnt on the same type of signal, namely image patches. In the equation above, the columns of the matrices X and Z represent vectorized training patches and the corresponding sparse codes, respectively. Again, one commonly alternates between searching for D with Z fixed (dictionary update step) and seeking Z with D fixed (sparse coding step) (Elad 2006, 2010; Yaghoobi et al. 2009). In our method, we exploit the state-of-the-art dictionary learning algorithm K-SVD (Aharon et al. 2006), successfully applied to image denoising (Mairal et al. 2008;
Fig. 12.3 Complex-valued dictionary with rectangular (8 × 4) atoms: (left) magnitudes and (right) arguments/phases of the atoms. The phase of the atoms is represented in the HSV (hue, saturation, value) colour map
Protter and Elad 2009). When assembling the training set, we recommend taking into consideration the type of data it contains: training patches should be extracted from datasets of images similar to the target object, or from the actual corrupted (aliased) images to be reconstructed. An example of a dictionary learnt on complex-valued data is shown in Fig. 12.3.
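As a hedged sketch of dictionary learning by alternation, the snippet below uses a MOD-style least-squares dictionary update and a crude thresholding coder as simpler stand-ins for K-SVD and OMP; the training patches are random placeholders.

```python
import numpy as np

def sparse_code(D, X, T):
    """Thresholding coder (simplified stand-in for OMP): keep the
    T largest-magnitude coefficients of D^T x for each patch."""
    Z = D.T @ X
    idx = np.argsort(np.abs(Z), axis=0)[:-T, :]
    np.put_along_axis(Z, idx, 0.0, axis=0)
    return Z

def learn_dictionary(X, n_atoms, T, n_iter=20, seed=0):
    """MOD-style alternation (a simpler relative of K-SVD): sparse
    coding step, then a least-squares dictionary update."""
    rng = np.random.default_rng(seed)
    D = rng.standard_normal((X.shape[0], n_atoms))
    D /= np.linalg.norm(D, axis=0)
    for _ in range(n_iter):
        Z = sparse_code(D, X, T)
        D = X @ np.linalg.pinv(Z)                  # dictionary update (MOD)
        D /= np.linalg.norm(D, axis=0) + 1e-12     # renormalize atoms
    return D

# Train on vectorized 8 x 4 patches (32-dimensional), as in Fig. 12.3
rng = np.random.default_rng(1)
patches = rng.standard_normal((32, 500))
D = learn_dictionary(patches, n_atoms=64, T=4)
print(D.shape)  # (32, 64)
```

K-SVD differs by updating atoms one at a time via rank-one SVDs, which usually converges faster, but the alternation structure is the same.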
In this section, we share some hints found during our long-term research on efficient CS MRI reconstruction with dictionaries precomputed in advance: effective innovations providing fast convergence and imaging enhancement, spatial adaptation to aliasing artefacts and acceleration by parallelization under limited GPU resources (Korobchenko et al. 2016).
While zeros in the empty positions of the Fourier spectrum lead to strong aliasing artefacts, a smarter initial guess is desirable. We found that the result of the split Bregman iterative algorithm (Goldstein and Osher 2009) is an efficient initial guess that essentially suppresses aliasing effects and significantly increases both the initial reconstruction quality and the convergence rate of the main, computationally expensive, dictionary-based CS algorithm.
In accordance with Wang et al. (2007), the ℓ1 and ℓ2 norms in the equation given in Sect. 12.2.2 may be decoupled as follows:
x^{k+1} = arg min_x ||F_u x − y_u||_2^2 + μ ||χ^k − Ψ(x) − b^k||_2^2,

χ^{k+1} = arg min_χ λ ||χ||_1 + μ ||χ − Ψ(x^{k+1}) − b^k||_2^2,

b^{k+1} = b^k + Ψ(x^{k+1}) − χ^{k+1},
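The χ-subproblem above has a well-known closed-form solution: element-wise soft-thresholding with threshold λ/(2μ) applied to Ψ(x^{k+1}) + b^k. A minimal sketch:

```python
import numpy as np

def soft_shrink(v, thresh):
    """Element-wise soft-thresholding, the closed-form chi update."""
    return np.sign(v) * np.maximum(np.abs(v) - thresh, 0.0)

v = np.array([-2.0, -0.3, 0.1, 1.5])   # stands for Psi(x^{k+1}) + b^k
print(soft_shrink(v, thresh=0.5))      # small entries shrink to zero
```

The x-subproblem, in turn, is a quadratic solved in the Fourier domain, which is what makes split Bregman iterations cheap.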
12.4.5 DESIRE-MRI
Let us assume that the measurement vector and the sampling mask are split into N_b bands, and that proper multi-band dictionaries {D_b} for all these bands have already been learnt by K-SVD as discussed above. Let the initial preparation of the undersampled spectrum of the object be performed by the iterative split Bregman algorithm (see Sect. 12.4.1); we denote this initial guess by x_SB. Then, considering the resulting Fourier spectrum estimation presented in Sect. 12.3.1, the reconstruction of such a pre-initialized object is performed by the proposed iterative multi-band algorithm, defined in the following form:
Z_b^{k+1} = arg min_{Z_b} Σ_i ||Z_{b,i}||_0 + γ ||P(x_b^k) − D_b Z_b||_F^2,

x = Σ_{b=1}^{N_b} x_b
We name this two-step algorithm, with split Bregman initialization and multi-band dictionaries learnt in advance, Dictionary Express Sparse Image Reconstruction and Enhancement (DESIRE-MRI).
The initial guess x0 is split into N_b bands, forming a set of estimates for the band objects {x_b}. Then, all these estimates are partitioned into patches to be sparse approximated by mini-batch OMP (Step 1); we discuss the mini-batch OMP targeting GPU realization in detail in Sect. 12.5. In Step 2, estimates of the object subspectra are found by successively assembling the sparse codes into the band objects and taking their Fourier transforms. Step 3 updates the resulting band objects by restoring the measured values in the object subspectra (as defined by the last equation in Sect. 12.3.1) and returning to the image domain. We then go back to Step 1 until DESIRE-MRI converges. Note that the output of the algorithm is the sum of the reconstructed band objects. The DESIRE algorithm for the single-coil case was originally published in Migukin et al. (2015).
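A deliberately simplified, single-band 1D skeleton of this three-step loop could read as follows; treating the whole signal as one "patch" and using a thresholding coder in place of mini-batch OMP are our simplifications, and all names are illustrative.

```python
import numpy as np

def shrink_codes(D, X, T):
    """Stand-in sparse coder: keep the T largest coefficients of D^T X."""
    Z = D.T @ X
    idx = np.argsort(np.abs(Z), axis=0)[:-T, :]
    np.put_along_axis(Z, idx, 0.0, axis=0)
    return Z

def desire_1d(x0, y_u, m, D, T=2, nu=1.0, n_iter=10):
    """Single-band 1D skeleton of the DESIRE-MRI loop."""
    x = x0.copy()
    for _ in range(n_iter):
        Z = shrink_codes(D, x[:, None], T)               # Step 1: sparse coding
        y_half = np.fft.fft(D @ Z[:, 0], norm="ortho")   # Step 2: to k-space
        y = (1 - m) * y_half + m * (y_half + nu * y_u) / (1 + nu)  # Step 3
        x = np.fft.ifft(y, norm="ortho")                 # back to image domain
    return x

x_true = np.zeros(16)
x_true[[2, 9]] = [1.0, -1.0]                      # sparse toy "object"
rng = np.random.default_rng(0)
m = (rng.random(16) < 0.6).astype(float)          # sampling mask
y_u = m * np.fft.fft(x_true, norm="ortho")        # undersampled measurements
x_rec = desire_1d(np.fft.ifft(y_u, norm="ortho"), y_u, m, np.eye(16))
print(x_rec.shape)  # (16,)
```

In the full algorithm, Step 1 runs per band over all overlapping patches, and the final output is the sum of the reconstructed band objects.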
There is a clear parallel with the well-studied Gerchberg-Saxton-Fienup algorithms (Fienup 1982), but in contrast with the loss or ambiguity of the object phase, here we face the total loss of some complex-valued observations.
One of the main problems of the classical OMP algorithm is the computationally expensive matrix pseudo-inversion (Rubinstein et al. 2008; Algorithm 1, step 7). It is efficiently resolved in OMP-Cholesky and Batch OMP by progressive Cholesky factorization (Cotter et al. 1999). Batch OMP also uses pre-computation of the Gram matrix of the dictionary atoms (Rubinstein et al. 2008), which allows omitting the iterative recalculation of residuals. We use such an optimized version of OMP because a huge set of patches is encoded with a single dictionary. Moreover, Batch OMP is
The goal of our numerical experiments is to analyse the reconstruction quality and to study the performance of the algorithm. Here, we consider the reconstruction quality for complex-valued 2D and 3D target objects, namely in vivo MR scans with normalized amplitudes. The binary sampling masks used, with an undersampling rate (the ratio of the total number of spectrum components to the sampled ones) equal to 4, are illustrated in Fig. 12.4. In addition to the illustrations, we report the reconstruction accuracy as the peak signal-to-noise ratio (PSNR) and the high-frequency error norm (HFEN). Following Ravishankar and Bresler (2010), the reference (fully sampled) and reconstructed objects are (slice-wise) filtered by a Laplacian of Gaussian filter, and HFEN is computed as the ℓ2 norm of the difference between these filtered objects.
Note that in practice, one has no true signal x with which to calculate the reconstruction accuracy in terms of PSNR or HFEN. DESIRE-MRI is assumed to have converged when the norm ||x^k − x^{k−1}||_2 of the difference between successive iterations (denoted by DIFF) reaches an empirically found threshold.
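The PSNR measure and the DIFF stopping rule can be sketched as follows; HFEN is omitted here since it additionally needs the Laplacian of Gaussian filtering, and the normalization of magnitudes to a unit peak is an assumption.

```python
import numpy as np

def psnr(ref, rec):
    """PSNR in dB, assuming magnitudes normalized to a unit peak."""
    mse = np.mean(np.abs(ref - rec) ** 2)
    return 10.0 * np.log10(1.0 / mse)

def converged(x_k, x_km1, tol):
    """Stopping rule: DIFF = ||x^k - x^{k-1}||_2 below a threshold."""
    return bool(np.linalg.norm(x_k - x_km1) < tol)

a = np.ones((8, 8))
b = a + 0.01                       # uniform 0.01 error
print(round(psnr(a, b), 1))        # 40.0
print(converged(a, b, tol=0.1))    # True
```

Unlike PSNR and HFEN, the DIFF criterion needs no ground truth, which is why it serves as the practical stopping condition.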
We exploit the efficient basis pursuit denoising approach, i.e. for particular objects, we choose proper parameters of the patch-based sparse approximation and the tolerance τ. In general, the parameters for the split Bregman initialization are λ = 100 and μ = 30 for 2D objects and λ = 0.1 and μ = 3 for 3D objects. For easy comparison with recent (at the time of development) MRI algorithms (Lustig et al. 2007; Ravishankar and Bresler 2010), all DESIRE-MRI results are given for ν = 1, i.e. we act on the assumption that our spectral measurements are noise-free.
Fig. 12.5 Influence of initialization on the convergence and accuracy of DESIRE-MRI, split Bregman (SB, solid curve) vs zero filling (ZF, dashed curve) in 2D: (top) stopping condition and reconstruction quality in terms of (middle) PSNR and (bottom) HFEN
Fig. 12.6 Imaging enhancement with the multi-band scheme: fragments of (a) the original axial magnitude and the DESIRE-MRI results with (b) the whole bandwidth and (c) two bands
Let us compare the DESIRE-MRI results with recently applied CS MRI approaches. Again, the Fourier spectrum of the Siemens T1 axial coil image is undersampled by the binary sampling mask shown in Fig. 12.4 (left) and reconstructed by LDP (Lustig et al. 2007) and DLMRI (Ravishankar and Bresler 2010).
In Fig. 12.7, we present a comparison of the normalized magnitudes. DLMRI with online recalculation of dictionaries is unable to remove the large aliasing artefacts (Fig. 12.7c, PSNR = 35.3 dB). LDP suppresses aliasing artefacts, but still not sufficiently (Fig. 12.7b, PSNR = 34.2 dB); its slightly lower PSNR compared with DLMRI is due to oversmoothing of the LDP result. DESIRE-MRI with two-band splitting (see Fig. 12.7d) produces an aliasing-free reconstruction that
Fig. 12.7 Comparison of the reconstructed MR images: (a) the original axial magnitude; (b) its reconstruction obtained by LDP (Lustig et al. 2007), PSNR = 34.2 dB; (c) DLMRI (Ravishankar and Bresler 2010), PSNR = 35.3 dB; (d) our DESIRE-MRI, PSNR = 39 dB
Fig. 12.8 Multi-coil DESIRE-MRI reconstructions for 2D and 3D objects (column-wise, from left to right): Samsung TOF 3D, Siemens TOF 3D, Siemens T1 axial 2D slice and Siemens 2D Phantom. Comparison of SOS (row-wise, from top to bottom): for the original objects, the zero-filling reconstructions and DESIRE-MRI
looks much closer to the reference: a small degree of smoothing is inevitable at high
undersampling rate.
All experimental data are multi-coil; thus, the individual coil images are typically reconstructed independently, and the final result is found by merging these reconstructed coil images with the so-called SOS ("sum of squares", Larsson et al. 2003).
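The SOS combination itself is essentially a one-liner over the per-coil complex images; the toy coil images below are illustrative.

```python
import numpy as np

def sum_of_squares(coil_images):
    """SOS combination of per-coil complex reconstructions."""
    stack = np.stack(coil_images)
    return np.sqrt(np.sum(np.abs(stack) ** 2, axis=0))

c1 = np.full((2, 2), 3.0 + 0.0j)   # toy coil images
c2 = np.full((2, 2), 4.0j)
print(sum_of_squares([c1, c2]))    # every pixel: sqrt(9 + 16) = 5
```

Taking magnitudes before summing makes SOS insensitive to the per-coil phase, which is why it needs no coil sensitivity maps.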
In Fig. 12.8, some DESIRE-MRI results of such multi-coil reconstructions are demonstrated. In the top row, we present the SOS for the original objects (column-wise, from left to right): Samsung TOF (344 × 384 × 30 angiography), Siemens TOF (348 × 384 × 48 angiography), Siemens T1 axial slice (320 × 350) and Siemens Phantom (345 × 384). In the middle row, we demonstrate the SOS for the aliased reconstructions obtained by zero filling. The undersampling of the 2D and 3D object spectra is performed by the corresponding 2D Cartesian and 3D Gaussian masks given in Fig. 12.4. In the bottom row of Fig. 12.8, the SOS for the DESIRE-MRI reconstructions is illustrated. For both the 2D and 3D cases, DESIRE-MRI results in a clean reconstruction with almost no essential degradation. Note that on the Siemens Phantom, with its many high-frequency details (Fig. 12.8, bottom-right image), DESIRE-MRI returns some visible defects on the borders of the bottom black rectangle and between the radial "beams".
12.6 Conclusion
References
Afonso, M.V., Bioucas-Dias, J.M., Figueiredo, M.A.T.: Fast image recovery using variable split-
ting and constrained optimization. IEEE Trans. Image Process. 19(9), 2345–2356 (2010)
Aggarwal, N., Bresler, Y.: Patient-adapted reconstruction and acquisition dynamic imaging method
(PARADIGM) for MRI. Inverse Prob. 24(4), 1–29 (2008)
Aharon, M., Elad, M., Bruckstein, A.: K-SVD: an algorithm for designing overcomplete dictionar-
ies for sparse representation. IEEE Trans. Signal Process. 54(11), 4311–4322 (2006)
Amdahl, G.M.: Validity of the single-processor approach to achieving large-scale computing
capabilities. In: Proceedings of AFIPS Conference, vol. 30, pp. 483–485 (1967)
Bertsekas, D.P., Tsitsiklis, J.N.: Parallel and Distributed Computation: Numerical Methods.
Prentice-Hall, Englewood Cliffs. 735 p (1989)
Bioucas-Dias, J.M., Figueiredo, M.A.T.: A new TwIST: two-step iterative shrinkage/thresholding
algorithms for image restoration. IEEE Trans. Image Process. 16(12), 2980–2991 (2007)
Accessed on 04 October 2020. http://www.lx.it.pt/~bioucas/TwIST/TwIST.htm
Blaimer, M., Breuer, F., Mueller, M., Heidemann, R.M., Griswold, M.A., Jakob, P.M.: SMASH,
SENSE, PILS, GRAPPA: how to choose the optimal method. Top. Magn. Reson. Imaging.
15(4), 223–236 (2004)
Caballero, J., Rueckert, D., Hajnal, J.V.: Dictionary learning and time sparsity in dynamic MRI. In:
Proceedings of International Conference on Medical Image Computing and Computer-Assisted
Intervention (MICCAI), vol. 15, pp. 256–263 (2012)
Candès, E., Romberg, J., Tao, T.: Robust uncertainty principles: exact signal reconstruction from
highly incomplete frequency information. IEEE Trans. Inf. Theory. 52(2), 489–509 (2006)
Cotter, S.F., Adler, R., Rao, R.D., Kreutz-Delgado, K.: Forward sequential algorithms for best basis
selection. In: IEE Proceedings - Vision, Image and Signal Processing, vol. 146 (5), pp. 235–244
(1999)
Dabov, K., Foi, A., Katkovnik, V., Egiazarian, K.: Image denoising by sparse 3-D transform-
domain collaborative filtering. IEEE Trans. Image Process. 16(8), 2080–2095 (2007)
Danielyan, A., Katkovnik, V., Egiazarian, K.: BM3D frames and variational image deblurring.
IEEE Trans. Image Process. 21(4), 1715–1728 (2011)
Do, M.N., Vetterli, M.: The contourlet transform: an efficient directional multiresolution image
representation. IEEE Trans. Image Process. 14(12), 2091–2106 (2006)
Donoho, D.: Compressed sensing. IEEE Trans. Inf. Theory. 52(4), 1289–1306 (2006)
Donoho, D.L., Elad, M., Temlyakov, V.N.: Stable recovery of sparse overcomplete representations
in the presence of noise. IEEE Trans. Inf. Theory. 52(1), 6–18 (2006)
Eckstein, J., Bertsekas, D.P.: On the Douglas–Rachford splitting method and the proximal point
algorithm for maximal monotone operators. Math. Program. 55, 293–318 (1992)
Elad, M.: Sparse and Redundant Representations: from Theory to Applications in Signal and Image
Processing. Springer Verlag, New York., 376 p (2010)
Elad, M., Aharon, M.: Image denoising via sparse and redundant representations over learned
dictionaries. IEEE Trans. Image Process. 15(12), 3736–3745 (2006)
Fienup, J.R.: Phase retrieval algorithms: a comparison. Appl. Opt. 21(15), 2758–2769 (1982)
Gabay, D., Mercier, B.: A dual algorithm for the solution of nonlinear variational problems via
finite-element approximations. Comput. Math. Appl. 2(1), 17–40 (1976)
Goldstein, T., Osher, S.: The Split Bregman method for L1-regularized problems. SIAM J. Imag.
Sci. 2(2), 323–343 (2009) Accessed on 04 October 2020. http://www.ece.rice.edu/~tag7/Tom_
Goldstein/Split_Bregman.html
Griswold, M.A., Jakob, P.M., Heidemann, R.M., Nittka, M., Jellus, V., Wang, J., Kiefer, B., Haase,
A.: Generalized autocalibrating partially parallel acquisitions (GRAPPA). Magn. Reson. Med.
47(6), 1202–1210 (2002)
Korobchenko, D.A., Danilevitch, A.B., Sirotenko, M.Y., Gavrilyuk, K.A., Rychagov, M.N.:
Automatic view planning in magnetic resonance tomography using convolutional neural net-
works. In: Proceedings of Moscow Institute of Electronic Technology. MIET., 176 p, Moscow
(2016)
Korobchenko, D.A., Migukin, A.S., Danilevich, A.B., Varfolomeeva, A.A, Choi, S., Sirotenko, M.
Y., Rychagov, M.N.: Method for restoring magnetic resonance image and magnetic resonance
image processing apparatus, US Patent Application 20180247436 (2018)
Larkman, D.J., Nunes, R.G.: Parallel magnetic resonance imaging. Phys. Med. Biol. 52(7),
R15–R55 (2007)
Chapter 13
Depth Camera Based on Colour-Coded
Aperture
Vladimir P. Paramonov
13.1 Introduction
Scene depth extraction, i.e. the computation of distances to all scene points visible on
a captured image, is an important part of computer vision. There are various
approaches for depth extraction: a stereo camera and a camera array in general, a
plenoptic camera including dual-pixel technology as a special case, and a camera
with a coded aperture to name a few. The camera array is the most reliable solution,
but it implies extra cost and extra space and increases the power consumption for any
given application. Other approaches use a single camera but require multiple images for depth extraction, and thus work only for static scenes, which severely limits the range of possible applications. The coded aperture approach is therefore a promising single-lens, single-frame solution which requires only an insignificant hardware modification
(Bando 2008) and can provide a depth quality sufficient for many applications
(e.g. Bae et al. 2011; Bando et al. 2008). However, a number of technical issues
have to be solved to achieve a level of performance which is acceptable for
applications. We discuss the following issues and their solutions in this chapter:
(1) light-efficiency degradation due to the insertion of colour filters into the camera
aperture, (2) closeness to the diffraction limit for millimetre-size lenses (e.g.,
smartphones, webcams), (3) blindness of disparity estimation algorithms in
low-textured areas, and (4) final depth estimation in millimetres in the whole
image frame, which requires the use of a special disparity with the depth conversion
method for coded apertures generalised for any imaging optical system.
V. P. Paramonov (*)
Samsung R&D Institute Russia (SRR), Moscow, Russia
e-mail: v.paramonov@samsung.com
Depth can be estimated using a camera with a binary coded aperture (Levin et al.
2007; Veeraraghavan et al. 2007). It requires computationally expensive depth
extraction techniques based on multiple deconvolutions and a sparse image gradient
prior. Disparity extraction using a camera with a colour-coded aperture which pro-
duces spatial misalignment between colour channels was first demonstrated by
Amari and Adelson in 1992 and has not changed significantly since that time
(Bando et al. 2008; Lee et al. 2010, 2013). The main advantage of these cameras
over cameras with a binary coded aperture is the lower computational complexity of
the depth extraction techniques, which do not require time-consuming
deconvolutions.
The light efficiency of the systems proposed in Amari and Adelson (1992), Bando et al. (2008), Lee et al. (2010, 2013), Levin et al. (2007), Veeraraghavan et al. (2007), and Zhou et al. (2011) is less than 20% of that of a fully opened aperture, which leads to a
decreased signal-to-noise ratio (SNR) or longer exposure times with motion blur.
That makes them impractical for compact handheld devices and for real-time
performance by design. A possible solution was proposed by Chakrabarti and
Zickler (2012), where each colour channel has an individual effective aperture
size. Therefore, the resulting image has colour channels with different depths of
field. Due to its symmetrical design, this coded aperture cannot discriminate between objects closer than and farther than the in-focus distance. Furthermore, it requires a time-consuming disparity extraction algorithm.
Paramonov et al. (2016a, b, c) proposed a solution to the problems outlined above
by presenting new light-efficient coded aperture designs and corresponding algo-
rithm modifications. All the aperture designs detailed above are analysed and
compared in Sect. 13.7 of this chapter.
Let us consider the coded aperture concept. A simplified imaging system is illustrated schematically in Fig. 13.1. It consists of a single thin lens and an RGB colour sensor.
A coded aperture is placed next to the thin lens. The aperture consists of colour
filters with different passbands, e.g. red and green colour filters (Fig. 13.2a).
Fig. 13.1 Conventional single-lens imaging system image formation: (a) focused scene; (b)
defocused scene
Fig. 13.2 Colour-coded aperture image formation: (a) in-focus foreground; (b) defocused
background
Fig. 13.3 Colour image restoration example. From left to right: image captured with colour-coded
aperture (causing a misalignment in colour channels); extracted disparity map; restored image
Defocused regions of an image captured with this system have different viewpoints
in the red and green colour channels (see Fig. 13.2b). By considering the correspon-
dence between these two channels, the disparity map for the captured scene can be
estimated as in Amari and Adelson (1992).
The original colour image cannot be restored if the blue channel is absent. Bando et al. (2008), Lee et al. (2010, 2013), and Paramonov et al.
(2016a, b, c) changed the aperture design to include all three colour channels, thus
making image restoration possible and enhancing the disparity map quality. The
image is restored by applying colour shifts based on the local disparity map value
(Fig. 13.3).
To get the depth map from an estimated disparity map, one may use the thin lens equation. In practice, most prior works in this area do not distinguish between disparity and depth, treating them as synonyms, since a one-to-one correspondence exists between them. However, a modern imaging system usually consists of a number of different lenses, i.e. an objective, which makes the thin lens formula inapplicable. Furthermore, a planar scene does not produce a planar depth map if a trivial disparity-to-depth conversion equation is applied. A number of researchers have worked on this problem for different optical systems (Dansereau et al. 2013; Johannsen et al. 2013; Trouvé et al. 2013a, b). Depth results for coded aperture cameras (Lee et al. 2013; Panchenko et al. 2016) are valid only in the centre of the captured image.
Fig. 13.4 Aperture designs for image formation numerical simulation: (a) open aperture, (b) binary
coded aperture (Levin et al. 2007); (c) colour-coded aperture (Bando et al. 2008); (d) colour-coded
aperture (Lee et al. 2010); (e) colour-coded aperture (Chakrabarti and Zickler 2012); (f) colour-
coded aperture (Paramonov et al. 2016a, b, c)
Fig. 13.5 PSF numerical simulation for different coded aperture designs at different point source
distances from camera along optical axis. From top to bottom row: conventional open aperture,
binary coded aperture (Levin et al. 2007); colour-coded aperture (Bando et al. 2008); colour-coded
aperture (Lee et al. 2010); colour-coded aperture (Chakrabarti and Zickler 2012); colour-coded
aperture (Paramonov et al. 2016a, b, c)
The first step is to simulate the point spread function (PSF) for a given coded
aperture design and different defocus levels, for which we follow the theory and the
code provided in Goodman (2008), Schmidt (2010), and Voelz (2011). The resulting
PSF images are illustrated in Fig. 13.5.
Given a set of PSFs corresponding to different defocus levels (i.e. different
distances), one can simulate the image formation process for a planar scene via
convolution of the input clear image with the corresponding PSF. In the case of a
complex scene with depth variations, this process requires multiple convolutions
with different PSFs for different depth levels. In order to do this, a continuous depth
map should be represented by a finite number of layers. In our simulation, we
precalculate 256 PSFs for a given aperture design, corresponding to 256 different defocus levels. Once the PSFs are precalculated, they can be reused for any number of simulated images without repeating this step. It should be noted that object boundaries and semi-transparent objects require extra care to make this simulation realistic.
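The layered simulation described above can be sketched as follows (a minimal, single-channel sketch with illustrative names; a real simulation would use the diffraction-based PSFs of Fig. 13.5 rather than the toy kernels below, and layer boundaries would need the extra care mentioned in the text):

```python
import numpy as np
from scipy.signal import fftconvolve

def simulate_defocused_image(image, depth_map, psfs):
    """Simulate image formation for a layered scene.

    image:     (H, W) clear input image (one channel for brevity)
    depth_map: (H, W) integer depth-layer index per pixel, in [0, len(psfs))
    psfs:      list of 2-D PSF kernels, one per quantised defocus level
    """
    out = np.zeros_like(image, dtype=np.float64)
    for layer, psf in enumerate(psfs):
        mask = (depth_map == layer).astype(np.float64)
        if not mask.any():
            continue  # skip unused depth layers
        # Blur each depth layer with its own PSF and accumulate.
        out += fftconvolve(image * mask, psf, mode="same")
    return out

# Toy usage: a two-layer scene, in-focus left half and defocused right half.
img = np.zeros((32, 32)); img[8:24, 8:24] = 1.0
depth = np.zeros((32, 32), dtype=int); depth[:, 16:] = 1
psf_sharp = np.array([[1.0]])         # in-focus layer: delta PSF
psf_blur = np.ones((5, 5)) / 25.0     # defocused layer: box PSF
res = simulate_defocused_image(img, depth, [psf_sharp, psf_blur])
```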
As a sanity check of the optics simulation model, one can numerically simulate the imaging process using a corresponding image and disparity map pair taken from an existing dataset. Here, we use an image from the Middlebury dataset
(Scharstein and Szeliski 2003) to simulate an image captured through a colour-coded
aperture illustrated in Fig. 13.4c (proposed by Bando et al. 2008). Then we use the
disparity estimation algorithm provided by the original authors of Bando et al.
(2008) for their own aperture design (link to the source code: http://web.media.
mit.edu/~bandy/rgb/). Based on the results in Fig. 13.6, we conclude that the model’s
realism is acceptable.
This gives an opportunity to generate new synthetic datasets for depth estimation
with AI algorithms. Namely, one can use existing datasets of images with
Fig. 13.6 Numerical simulation of image formation for colour-coded aperture: (a) original image;
(b) ground truth disparity map; (c) numerically simulated image; (d) raw layered disparity map
extracted by algorithm implemented by Bando et al. (2008)
Fig. 13.7 Light-efficient colour-coded aperture designs with corresponding light efficiency
approximation (based on effective area sizes)
$$w^{i,j}_{\mathrm{CYX}} = M^{-1} w^{i,j}_{\mathrm{RGB}},$$
One approach for disparity map estimation is described by Panchenko et al. (2016); its implementation for depth estimation and control is also given in Chap. 3. The approach utilizes the mutual correlation of shifted colour channels in an exponentially weighted window and uses a bilateral filter approximation for cost volume regularization. We describe the basics below for completeness, so that the current chapter is self-contained.
Let {I_i}_1^n represent a set of n captured colour channels of the same scene from different viewpoints, where I_i is an M × N frame. A conventional correlation matrix C_d is formed for the set {I_i}_1^n and candidate disparity values d:

$$C_d = \begin{pmatrix} 1 & \cdots & \operatorname{corr}(I_1^d, I_n^d) \\ \vdots & \ddots & \vdots \\ \operatorname{corr}(I_n^d, I_1^d) & \cdots & 1 \end{pmatrix},$$
where the superscript (·)^d denotes a parallel shift of d pixels in the corresponding channel; the direction of the shift is dictated by the aperture design. The determinant of the matrix C_d is a good measure of the mutual correlation of {I_i}_1^n. Indeed, when all channels are strongly correlated, all elements of the matrix are equal to one and det(C_d) = 0. On the other hand, when the data is completely uncorrelated, we have det(C_d) = 1. To extract a disparity map using this metric, one should find the disparity value d corresponding to the smallest value of det(C_d) at each pixel of the picture.
Here, we derive another particular implementation of the generalized correlation metric for n = 3, corresponding to the case of an aperture with three channels. The determinant of the correlation matrix is:

$$\det(C_d) = 1 - \operatorname{corr}^2(I_1^d, I_2^d) - \operatorname{corr}^2(I_2^d, I_3^d) - \operatorname{corr}^2(I_3^d, I_1^d) + 2\operatorname{corr}(I_1^d, I_2^d)\operatorname{corr}(I_2^d, I_3^d)\operatorname{corr}(I_3^d, I_1^d),$$

and we have

$$\operatorname*{arg\,min}_d \det(C_d) = \operatorname*{arg\,max}_d \Bigl( \sum \operatorname{corr}^2(I_i^d, I_j^d) - 2 \prod \operatorname{corr}(I_i^d, I_j^d) \Bigr).$$
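A minimal sketch of this metric for n = 3 (assuming windowed Pearson correlations and horizontally opposed channel shifts; all names and the shift geometry are illustrative, since the real shift directions are dictated by the aperture design):

```python
import numpy as np
from scipy.ndimage import uniform_filter

def local_corr(a, b, win=9, eps=1e-9):
    """Pearson correlation of a and b inside a win x win sliding window."""
    ma, mb = uniform_filter(a, win), uniform_filter(b, win)
    cov = uniform_filter(a * b, win) - ma * mb
    va = uniform_filter(a * a, win) - ma * ma
    vb = uniform_filter(b * b, win) - mb * mb
    return cov / np.sqrt(np.maximum(va * vb, eps))

def disparity_map(c1, c2, c3, disparities, win=9):
    """Per-pixel disparity via the generalized correlation metric, n = 3."""
    best_cost = np.full(c1.shape, -np.inf)
    best_d = np.zeros(c1.shape, dtype=int)
    for d in disparities:
        s1 = np.roll(c1, d, axis=1)    # shift channels toward alignment
        s3 = np.roll(c3, -d, axis=1)
        r12 = local_corr(s1, c2, win)
        r23 = local_corr(c2, s3, win)
        r31 = local_corr(s3, s1, win)
        # argmin det(C_d)  ==  argmax (sum of squared corr - 2 * product)
        cost = r12**2 + r23**2 + r31**2 - 2.0 * r12 * r23 * r31
        upd = cost > best_cost
        best_cost[upd], best_d[upd] = cost[upd], d
    return best_d
```

On synthetic channels produced by shifting one texture in opposite directions, this recovers the true shift; as noted in the text, a pairwise correlation term still contributes when one of the three channels is locally textureless.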
This metric is similar to the colour lines metric (Amari and Adelson 1992) but is more robust in the important case of low texture density in some areas of the image. The extra robustness appears when one of the three channels does not have enough texture in a local window around the point under consideration. In this case, the colour lines metric cannot provide disparity information, even if the other two channels are well defined. The generalized correlation metric avoids this disadvantage and allows the depth sensor to work similarly to a stereo camera in this case.
Usually, passive sensors provide sparse disparity maps. However, dense disparity
maps can be obtained by propagating disparity information to non-textured areas.
The propagation can be efficiently implemented via joint-bilateral filtering
(Panchenko et al. 2016) of the mutual correlation metric cost or by applying
variational methods (e.g. Chambolle and Pock 2010) for global regularization with
classic total variation or other priors. Here, we assume that the depth is smooth in
non-textured areas.
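A crude stand-in for this propagation step (the chapter uses a joint-bilateral filter approximation; here an edge-unaware Gaussian smooths each slice of the correlation cost volume before the per-pixel argmax):

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def regularize_cost_volume(cost_volume, sigma=3.0):
    """Smooth each disparity slice of a (D, H, W) cost volume so that
    correlation evidence from textured pixels spreads into neighbouring
    non-textured pixels, then take the per-pixel best disparity index."""
    smoothed = np.stack([gaussian_filter(cost_volume[k], sigma)
                         for k in range(cost_volume.shape[0])])
    return smoothed.argmax(axis=0)

# Sparse evidence only at every 4th pixel, all of it at disparity index 1:
cv = np.zeros((3, 20, 20))
cv[1, ::4, ::4] = 1.0
dense = regularize_cost_volume(cv)   # index 1 propagates everywhere
```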
In contrast to the original work by Panchenko et al. (2016), this algorithm has also been applied not in the sensor colour space but in the colour-coded aperture colour space (Paramonov et al. 2016a, b, c). This increases the texture correlation between the colour channels if they have overlapping passbands and increases the number of distinguishable depth layers compared to the RGB colour space (see Fig. 13.9c, d for a comparison of the number of depth layers sensed with the same aperture design but different colour bases).
Let us derive a disparity-to-depth conversion equation for a single thin-lens
optical system (Fig. 13.1) as was proposed by Paramonov et al. (2016a, b, c). For
a thin lens (Fig. 13.1a), we have:
Fig. 13.9 Depth sensor on-axis calibration results for different colour-coded aperture designs:
(a) three RGB circles processed in RGB colour space; (b) cyan and yellow halves coded aperture
processed in CYX colour space; (c) cyan and yellow coded aperture with open centre processed in
CYX colour space; (d) cyan and yellow coded aperture with open centre processed in conventional
RGB colour space. Please note that there are more depth layers in the same range for case (c) than
for case (d), thanks only to the CYX colour basis (the coded aperture is the same)
$$\frac{1}{z_{of}} + \frac{1}{z_{if}} = \frac{1}{f},$$
where f is the lens focal length, zof the distance between a focused object and the lens,
and zif the distance from the lens to the focused image plane. If we move the image
sensor towards the lens as shown in Fig. 13.1b, the image of the object on the sensor
is convolved with a colour-coded aperture copy, which is the circle of confusion, and
we obtain:
$$\frac{1}{z_{od}} + \frac{1}{z_{id}} = \frac{1}{f},$$

$$\frac{1}{z_{of}} + \frac{1 + c/D}{z_{id}} = \frac{1}{f},$$
where zid is the distance from the lens to the defocused image plane, zod is the
distance from the lens to the defocused object plane corresponding to zid, c is the
circle of confusion diameter, and D is the aperture diameter (Fig. 13.1b). We can
solve this system of equations for the circle of confusion diameter:
$$c = \frac{fD\,(z_{od} - z_{of})}{z_{od}\,(z_{of} - f)},$$
The disparity d (in pixels) between the colour channels is proportional to the circle of confusion, 2μd = βc, where μ is the sensor pixel size, β = r_c/R is the coded aperture coefficient, R = D/2 is the aperture radius, and r_c is the distance between the aperture centre and the single-channel centroid (Fig. 13.8b). Now, we can express the distance between the camera lens and any object only in terms of the internal camera parameters and the disparity value corresponding to that object:
$$z_{od} = \frac{bf\,z_{of}}{bf - 2\mu d\,(z_{of} - f)},$$
where b = βD = 2r_c is the distance between the two centroids (see Fig. 13.8), i.e. the colour-coded aperture baseline, equivalent to the stereo camera baseline. Note that if d = 0, then z_od is naturally equal to z_of, i.e. the object is in the camera focus.
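The disparity-to-depth conversion above translates directly into a function (a sketch; the numbers in the usage lines are illustrative, loosely based on the lens parameters reported later in this chapter):

```python
def disparity_to_depth(d, f, z_of, b, mu):
    """Thin-lens disparity-to-depth conversion:
        z_od = b*f*z_of / (b*f - 2*mu*d*(z_of - f))
    d:    disparity in pixels (signed)
    f:    focal length [mm]
    z_of: in-focus object distance from the lens [mm]
    b:    coded aperture baseline b = 2*r_c [mm]
    mu:   sensor pixel size [mm]
    """
    return b * f * z_of / (b * f - 2.0 * mu * d * (z_of - f))

# f = 51.62 mm, focus at 2 m, baseline 20 mm, 4.5 um pixels:
z0 = disparity_to_depth(0, 51.62, 2000.0, 20.0, 0.0045)  # in-focus object
z1 = disparity_to_depth(1, 51.62, 2000.0, 20.0, 0.0045)  # behind the focus
```

With d = 0 the expression reduces to z_of, matching the remark that a zero-disparity object is in focus.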
To use the last equation with any real complex system (an objective), it was proposed to substitute it with a black box with entrance and exit pupils (see Goodman 2008; Paramonov et al. 2016a, b, c for details) located at the second and the first principal points (H′ and H), respectively (see Fig. 13.10 for an example of the principal plane locations in the case of a double Gauss lens).
The distance between the entrance and exit pupils and the effective focal length
are found through a calibration procedure proposed by Paramonov et al.
(2016a, b, c). Since the pupil position is unknown for a complex lens, we measure
Fig. 13.10 Schematic diagram of the double Gauss lens used in the Canon EF 50 mm f/1.8 II lens and
its principal plane location. Please note that this approach works for any optical imaging system
(Goodman 2008)
the distances to all objects from the camera sensor. Therefore, the disparity-to-depth
conversion equation becomes:
$$\tilde{z}_{od} - \delta = \frac{bf\,(\tilde{z}_{of} - \delta)}{bf - 2\mu d\,(\tilde{z}_{of} - \delta - f)},$$

where z̃_od is the distance between the defocused object and the sensor, z̃_of is the distance between the focused object and the sensor, and δ = z_if + HH′ is the distance between the sensor and the entrance pupil. Thus, for z̃_od we have:

$$\tilde{z}_{od} = \frac{bf\,\tilde{z}_{of} - 2\mu d\,\delta\,(\tilde{z}_{of} - \delta - f)}{bf - 2\mu d\,(\tilde{z}_{of} - \delta - f)}.$$
On the right-hand side of the equation above, there are three independent unknown variables, namely, z̃_of, b, and δ. We discuss their calibration below; the other variables are either known or dependent.
Another issue arises due to the point spread function (PSF) changing across the
image. This causes a variation in the disparity values for objects with the same
distances from the sensor but with different positions in the image. A number of
researchers encountered the same problem in their works (Dansereau et al. 2013;
Johannsen et al. 2013; Trouvé et al. 2013a, b). A specific colour-coded aperture
depth sensor calibration to mitigate this effect is described below.
The first step is the conventional calibration with the pinhole camera model and a
chessboard pattern (Zhang 2000). From this calibration, we acquire the distance
zif between the sensor and the exit pupil.
To find the independent variables z̃_of, b, and HH′, we capture a set of images of a chessboard pattern moved over a certain range along the optical axis and orthogonally to it (see Fig. 13.11). Each time, the object was positioned by hand, which is why small errors are possible (up to 3 mm). The optical system is focused at a certain distance
Fig. 13.11 The depth sensor calibration procedure takes place after conventional camera calibration using the pinhole camera model. A chessboard pattern is moved along the optical axis and captured while the camera focus distance is kept constant
from the sensor. Our experience shows that the error of focusing by hand at close range is high (up to 20 mm for the Canon EF 50 mm f/1.8 lens), so we have to find an accurate value of z̃_of through the calibration as well.
Disparity values are extracted, and the corresponding distances are measured with a ruler on the test scene for all captured images. Now, we can find z̃_of and b such that the above equation for the distance between a defocused object and the sensor holds with minimal error (RMS error over all measurements).
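This fitting step can be sketched with a generic least-squares routine (an illustrative reconstruction, not the authors' code; it assumes the sensor-referenced conversion model of the previous section, and the starting guesses are arbitrary):

```python
import numpy as np
from scipy.optimize import least_squares

def calibrate(d_meas, z_meas, f, mu, delta, x0=(2000.0, 20.0)):
    """Fit the in-focus distance z_of and baseline b so that
        z_od = (b*f*z_of - 2*mu*d*delta*t) / (b*f - 2*mu*d*t),
        t = z_of - delta - f,
    matches the ruler-measured distances in the RMS sense.

    d_meas, z_meas: measured disparities [px] and distances [mm]
    f, mu, delta:   focal length, pixel size, sensor-to-pupil distance [mm]
    """
    def model(x, d):
        z_of, b = x
        t = z_of - delta - f
        return (b * f * z_of - 2 * mu * d * delta * t) / (b * f - 2 * mu * d * t)

    res = least_squares(lambda x: model(x, d_meas) - z_meas, x0)
    return res.x  # fitted (z_of, b)
```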
To account for depth distortion due to the field curvature of the optical system, we perform the calibration for every pixel in the image individually. The resulting colour-coded aperture baseline b(i, j) and in-focus surface z̃_of(i, j) are shown in Fig. 13.12a and b, respectively.
The procedure described here was implemented on a prototype based on the Canon EOS 60D camera and the Canon EF 50 mm f/1.8 II lens. Any other imaging system would also work, but this Canon lens is easy to disassemble (Bando 2008). Thirty-one images were captured (see Fig. 13.11), with the defocused object plane (z_od) moving from 1000 to 4000 mm in 100 mm steps and the camera focused at approximately 2000 mm (z_of).
The results of our calibration for different coded aperture designs are presented in
Fig. 13.9. Based on the calibration, the effective focal length of our Canon EF 50 mm f/1.8 II lens is 51.62 mm (not, in fact, 50 mm), which is in good agreement with the value (51.6 mm) provided to us by the professional opticians who performed an accurate calibration.
Using the calibration data, one can perform accurate depth map estimation: the floor in Fig. 13.13a is flat but appears concave in the extracted depth map due to depth distortion (Fig. 13.13b). After calibration, the floor surface is
corrected and is close to planar (Fig. 13.13c). The accuracy of undistorted depth
maps extracted with a colour-coded aperture depth sensor is sufficient for 3D scene
reconstruction, as discussed in the next section.
Fig. 13.12 Colour-coded aperture depth sensor 3D calibration results: (a) coded aperture equivalent baseline field b(i, j); (b) optical system in-focus surface z̃_of(i, j), where (i, j) are pixel coordinates
Fig. 13.13 Depth map of a rabbit figure standing on the floor: (a) captured image (the colour shift
in the image is visible when looking under magnification); (b) distorted depth map, floor appears to
be concave; (c) undistorted depth map, floor surface is planar
First, let us compare the depth estimation error for the layered and sub-pixel
approaches. Figure 13.14 shows that the sub-pixel estimation with the quadratic
polynomial interpolation significantly improves the depth accuracy. This approach
also allows real-time implementation as we interpolate around global maxima only.
The different aperture designs are compared in Paramonov et al. (2016a, b, c) using the same processing algorithm (except for the aperture of Chakrabarti and Zickler (2012), which utilizes a significantly different approach).
The tests were conducted using the Canon EOS 60D DSLR camera with a Canon
EF 50 mm f/1.8 II lens in the same light conditions and for the same distance to the
object, while the exposure time was adjusted to achieve a meaningful image in each
case. Typical results are shown in Fig. 13.15.
Fig. 13.14 Cyan and yellow coded aperture: depth estimation error comparison for layered and
sub-pixel approaches
Fig. 13.15 Results comparison with prior art. Rows correspond to different coded aperture designs.
From top to bottom: RGB circles (Lee et al. 2010); RGB squares (Bando et al. 2008); CMY squares,
CMY circles, CY halves (all three cases by Paramonov et al. 2016a, b, c); magenta annulus
(Chakrabarti and Zickler 2012); CY with open area (Paramonov et al. 2016a, b, c); open aperture.
Light efficiency increases from top to bottom
To avoid texture dependency and image sensor noise, we use a blurred captured image of a white sheet of paper.
The transparency Ti, j shows the fraction of light which passes through the
imaging system with the coded aperture relative to the same imaging system without
the coded aperture:
$$T_{i,j} = \frac{I^{c}_{i,j}}{I^{nc}_{i,j}}.$$
Fig. 13.16 Transparency maps for different aperture designs (columns) corresponding to different
image sensor colour channels (rows)
Fig. 13.17 SNR loss for different apertures in the central and border image areas (dB)
All photos were taken with identical camera settings. The SNR degradation from the centre to the edges appears to be induced by lens aberrations. Different apertures provide different SNRs, the value depending on the amount of captured light. The loss between apertures 1 and 3 is 2.3 dB. To recover the original SNR value with aperture 3, one should increase the exposure time by 30%.
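The quoted figures are mutually consistent under the 20·log10 (amplitude) dB convention; with that assumption, the exposure compensation factor is:

```python
def exposure_factor(snr_loss_db):
    """Exposure-time multiplier compensating an SNR loss given in dB
    (20*log10 amplitude convention, matching the 2.3 dB -> ~30% figure)."""
    return 10.0 ** (snr_loss_db / 20.0)

factor = exposure_factor(2.3)  # ~1.30, i.e. a ~30% longer exposure
```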
It is important to take into account the light efficiency while evaluating depth
sensor results. We estimated the light efficiency by capturing a white sheet of paper
through different coded apertures in the same illumination conditions and with the
same camera parameters. The light efficiency values are presented for each sensor
colour channel independently (see Fig. 13.18). The aperture designs are sorted based
on their light efficiency.
Apertures V and VI in Fig. 13.15 have almost the same light efficiency, but the depth quality of aperture V appears better. Aperture VII has a higher light efficiency and can be used when depth quality is not critical.
This analysis may be used to find a suitable trade-off between image quality and
depth quality for a given application.
Fig. 13.19 Design of colour-coded apertures: (a) inside a DSLR camera lens; (b) inside a
smartphone camera lens; (c) disassembled smartphone view
Fig. 13.20 Scenes captured with the DSLR-based prototype with their corresponding depth maps
Fig. 13.21 Depth-dependent image effects. From top to bottom rows: refocusing, pixelization,
colourization
Fig. 13.22 Tiny models captured with the smartphone-based prototype and with their
corresponding disparity maps
Fig. 13.23 Disparity map and binary mask for the image captured with the smartphone prototype
Fig. 13.25 Real-time implementation of raw disparity map estimation with web camera-based
prototype
Fig. 13.26 Point Grey Grasshopper 3 camera with inserted colour-coded aperture on the left (a)
and real-time 3D scene reconstruction process on the right (b)
Fig. 13.27 Test scenes and their corresponding 3D reconstruction examples: (a) test scene frontal view and (b) corresponding 3D reconstruction view; (c) test scene side view and (d) corresponding 3D reconstruction view
The test scene and 3D reconstruction results are shown in Fig. 13.27a–d. Note
that a chessboard pattern is not used for tracking but only to provide good texture.
In Fig. 13.28, we show the distance between depth layers corresponding to
disparity values equal to 0 and 1 based on the last formula in Sect. 13.5. The layered
Fig. 13.28 Depth sensor accuracy curves for different aperture baselines b: (a) full-size camera
with f-number 1.8 and pixel size 4.5 μm; (b) compact camera with f-number 1.8 and pixel size
1.2 μm
depth error is by definition two times smaller than this distance. The sub-pixel refinement reduces the depth estimation error by a further factor of two (see Figs. 13.6 and 13.10). That gives a final accuracy better than 15 cm at a distance of 10 m and better than 1 cm at distances below 2.5 m for an equivalent baseline b of 20 mm.
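The spacing between adjacent integer-disparity depth layers, on which these accuracy figures are based, can be sketched from the thin-lens conversion of Sect. 13.5 (parameter values illustrative):

```python
def layer_spacing(d, f, z_of, b, mu):
    """Distance between the depth layers for integer disparities d and d + 1,
    using z_od = b*f*z_of / (b*f - 2*mu*d*(z_of - f))."""
    def depth(dd):
        return b * f * z_of / (b * f - 2.0 * mu * dd * (z_of - f))
    return abs(depth(d + 1) - depth(d))

# Full-frame-like setup: f = 51.62 mm, focus at 2 m, b = 20 mm, 4.5 um pixels.
gap = layer_spacing(0, 51.62, 2000.0, 20.0, 0.0045)
# The layered depth error is half this gap; sub-pixel refinement halves it again.
```

A larger equivalent baseline b tightens the layer spacing, which is why the accuracy curves in Fig. 13.28 are parameterised by b.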
References
Amari, Y., Adelson, E.: Single-eye range estimation by using displaced apertures with color filters.
In: Proceedings of the International Conference on Industrial Electronics, Control, Instrumen-
tation and Automation, pp. 1588–1592 (1992)
Bae, Y., Manohara, H., White, V., Shcheglov, K.V., Shahinian, H.: Stereo imaging miniature
endoscope. Tech Briefs. Physical Sciences (2011)
Bando, Y.: How to disassemble the Canon EF 50mm F/1.8 II lens (2008). Accessed on
15 September 2020. http://web.media.mit.edu/~bandy/rgb/disassembly.pdf
Bando, Y., Chen, B.-Y., Nishita, T.: Extracting depth and matte using a color-filtered aperture.
ACM Trans. Graph. 27(5), 134:1–134:9 (2008)
Chakrabarti, A., Zickler, T.: Depth and deblurring from a spectrally-varying depth-of-field. In:
Fitzgibbon, A., Lazebnik, S., Perona, P., Sato, Y., Schmid, C. (eds.) Lecture Notes in Computer
Science, vol. 7576, pp. 648–661. Springer, Berlin, Heidelberg (2012)
Chambolle, A., Pock, T.: A first-order primal-dual algorithm for convex problems with applications
to imaging. J. Math. Imaging Vis. 40, 120–145 (2010)
Chang, J., Wetzstein, G.: Deep optics for monocular depth estimation and 3d object detection. In:
Proceedings of the IEEE International Conference on Computer Vision, pp. 10193–10202
(2019)
Chen, W., Xie, D., Zhang, Y., Pu, S.: All you need is a few shifts: designing efficient convolutional
neural networks for image classification. In: Proceedings of the IEEE/CVF Conference on
Computer Vision and Pattern Recognition, pp. 7234–7243 (2019)
Dansereau, D., Pizarro, O., Williams, S.: Decoding, calibration and rectification for lenselet-based
plenoptic cameras. In: Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition, pp. 1027–1034 (2013)
Goodman, J.: Introduction to Fourier Optics. McGraw-Hill, New York (2008)
He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of
the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016)
Howard, A.G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand, T., Andreetto, M., Adam, H.: MobileNets: efficient convolutional neural networks for mobile vision applications. arXiv:1704.04861 (2017)
Imatest. The SFRplus chart: features and how to photograph it (2014). Accessed on 15 September
2020. https://www.imatest.com/docs/
Johannsen, O., Heinze, C., Goldluecke, B., Perwaß, C.: On the calibration of focused plenoptic
cameras. In: Grzegorzek, M., Theobalt, C., Koch, R., Kolb, A. (eds.) Time-of-Flight and Depth
Imaging. Sensors, Algorithms, and Applications. Lecture Notes in Computer Science, vol.
8200, pp. 302–317. Springer, Berlin, Heidelberg (2013)
Lee, E., Kang, W., Kim, S., Paik, J.: Color shift model-based image enhancement for digital multi
focusing based on a multiple color-filter aperture camera. IEEE Trans. Consum. Electron. 56(2),
317–323 (2010)
Lee, S., Kim, N., Jung, K., Hayes, M.H., Paik, J.: Single image-based depth estimation using dual
off-axis color filtered aperture camera. In: Proceedings of the IEEE International Conference on
Acoustics, Speech and Signal Processing, pp. 2247–2251 (2013)
Levin, A., Fergus, R., Durand, F., Freeman, W.T.: Image and depth from a conventional camera
with a coded aperture. ACM Trans. Graph. 26(3), 70:1–70:10 (2007)
Mishima, N., Kozakaya, T., Moriya, A., Okada, R., Hiura, S.: Physical cue based depth-sensing by color coding with deaberration network. arXiv:1908.00329 (2019)
Moriuchi, Y., Sasaki, T., Mishima, N., Mita, T.: Depth from asymmetric defocus using color-
filtered aperture. The Society for Information Display. Book 1: Session 23: HDR and Image
Processing (2017). Accessed on 15 September 2020. https://doi.org/10.1002/sdtp.11639
Newcombe, R.A., Izadi, S., Hilliges, O., Molyneaux, D., Kim, D., Davison, A.J., Kohli, P., Shotton, J., Hodges, S., Fitzgibbon, A.: KinectFusion: real-time dense surface mapping and tracking. In: Proceedings of the IEEE International Symposium on Mixed and Augmented Reality, pp. 127–136 (2011)
Panchenko, I., Bucha, V.: Hardware accelerator of convolution with exponential function for image
processing applications. In: Proceedings of the 7th International Conference on Graphic and
Image Processing. International Society for Optics and Photonic, pp. 98170A–98170A (2015)
Panchenko, I., Paramonov, V., Bucha, V.: Depth estimation algorithm for color coded aperture
camera. In: Proceedings of the IS&T Symposium on Electronic Imaging. 3D Image Processing,
Measurement, and Applications, pp. 405.1–405.6 (2016)
Paramonov, V., Panchenko, I., Bucha, V.: Method and apparatus for image capturing and simul-
taneous depth extraction. US Patent 9,872,012 (2014)
Paramonov, V., Lavrukhin, V., Cherniavskiy, A.: System and method for shift-invariant artificial
neural network. RU Patent 2,656,990 (2016a)
Paramonov, V., Panchenko, I., Bucha, V., Drogolyub, A., Zagoruyko, S.: Depth camera based on
color-coded aperture. Proc. IEEE Conf. Comput. Vis. Pattern Recognit. 1, 910–918 (2016b)
Paramonov, V., Panchenko, I., Bucha, V., Drogolyub, A., Zagoruyko, S.: Color-coded aperture.
Oral presentation in 2nd Christmas Colloquium on Computer Vision, Skolkovo Institute of
Science and Technology (2016c). Accessed on 15 September 2020. http://sites.skoltech.ru/app/
data/uploads/sites/25/2015/12/CodedAperture_CCCV2016.pdf
Scharstein, D., Szeliski, R.: High-accuracy stereo depth maps using structured light. Proc. IEEE
Comput. Soc. Conf. Comput. Vis. Pattern Recognit. 1, 195–202 (2003)
Schmidt, J.D.: Numerical Simulation of Optical Wave Propagation with Examples in MATLAB.
SPIE Press, Bellingham (2010)
Sitzmann, V., Diamond, S., Peng, Y., Dun, X., Boyd, S., Heidrich, W., Heide, F., Wetzstein, G.:
End-to-end optimization of optics and image processing for achromatic extended depth of field
and super-resolution imaging. ACM Trans. Graph. 37(4), 114 (2018)
Trouvé, P., Champagnat, F., Besnerais, G.L., Druart, G., Idier, J.: Design of a chromatic 3d camera
with an end-to-end performance model approach. In: Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition Workshops, pp. 953–960 (2013a)
Trouvé, P., Champagnat, F., Besnerais, G.L., Sabater, J., Avignon, T., Idier, J.: Passive depth
estimation using chromatic aberration and a depth from defocus approach. Appl. Opt. 52(29),
7152–7164 (2013b)
Tsuruyama, T., Moriya, A., Mishima, N., Sasaki, T., Yamaguchi, J., Kozakaya, T.: Optical filter,
imaging device and ranging device. US Patent Application US 20200092482 (2020)
Veeraraghavan, A., Raskar, R., Agrawal, A., Mohan, A., Tumblin, J.: Dappled photography: mask
enhanced cameras for heterodyned light fields and coded aperture refocusing. ACM Trans.
Graph. 26(3), 69:1–69:12 (2007)
Voelz, D.G.: Computational Fourier Optics: A MATLAB Tutorial. SPIE Press, Bellingham (2011)
Wu, B., Wan, A., Yue, X., Jin, P.H., Zhao, S., Golmant, N., Gholaminejad, A., Gonzalez, J.E.,
Keutzer, K.: Shift: a zero FLOP, zero parameter alternative to spatial convolutions. In: Pro-
ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition,
pp. 9127–9135 (2018)
Zagoruyko, S., Chernov, V.: Fast depth map fusion using OpenCL. In: Proceedings of the
Conference on Low Cost 3D (2014). Accessed on 15 September 2020. http://www.lc3d.net/
programme/LC3D_2014_program.pdf
Zhang, Z.: A flexible new technique for camera calibration. IEEE Trans. Pattern Anal. Mach. Intell.
22(11), 1330–1334 (2000)
Zhou, C., Lin, S., Nayar, S.K.: Coded aperture pairs for depth from defocus and defocus deblurring.
Int. J. Comput. Vis. 93(1), 53–72 (2011)
Chapter 14
An Animated Graphical Abstract
for an Image
14.1 Introduction
Modern image capture devices are capable of acquiring thousands of files daily.
Despite tremendous progress in the development of user interfaces for personal
computers and mobile devices, the approach for browsing large collections of
images has hardly changed over the past 20 years. Usually, a user scrolls through
the list of downsampled copies of images to find the one they want. This
downsampled image is called a thumbnail or icon. Figure 14.1 demonstrates a
screenshot of File Explorer in Windows 10 with icons of photos.
Browsing is time-consuming, and search is ineffective given the meaningless names of image files. It is often difficult to recognise the detailed content of the original image from the thumbnail, or to estimate its quality. From a downsampled copy of an image, it is almost impossible to assess blurriness, noisiness, or the presence of compression artefacts. Even when simply viewing photographs, a user is frequently forced to zoom in and scroll. The situation is much harder when browsing images that have a complex layout or are intended for special applications.
Figure 14.2 shows thumbnails of scanned documents. How can a required document be found effectively when optical character recognition is inapplicable? Icons of slices of X-ray computed tomography (CT) images of two different sandstones are shown in Fig. 14.3. Is it possible to identify a given sandstone based on the thumbnail? How can the quality of the slices be estimated?
In the viewing interface, a user needs a fast and handy way to see an abstract of the image. In general, the content of the abstract is application-specific.
Fig. 14.1 Large icons for photos in File Explorer for Win10
Fig. 14.2 Large icons for documents in File Explorer for Win10
Fig. 14.3 Large icons for slices of two X-ray microtomographic images in File Explorer for
Win10: (a) Bentheimer sandstone; (b) Fontainebleau sandstone
Nevertheless, common ideas for the creation of convenient interfaces for viewing of
images can be formulated: a user would like to see clearly the regions of interest and
to estimate the visual quality. In this chapter, we describe the technique for gener-
ating a thumbnail-size animation comprising transitions between the most important
zones of the image. Such an animated graphical abstract looks attractive and provides a
user-friendly way to browse large collections of images.
It is impossible to evaluate the noise level and blurriness because the cropped fragment
is downsized and the resulting thumbnail has a lower resolution in comparison with
the original image.
The latest advances in automatic cropping for thumbnailing relate to the application of deep neural networks. Esmaeili et al. (2017) and Chen et al. (2018) describe end-to-end fully convolutional neural networks for thumbnail generation without building an intermediate saliency map. Apart from their capability of preserving the aspect ratio, these methods have the same drawbacks as other cropping-based methods.
There are completely different approaches to thumbnail creation. To reflect the noisiness (Samadani et al. 2008) or blurriness (Koik and Ibrahim 2014) of the original image, these methods fuse the corresponding defects into the thumbnail. Such algorithms do not modify the image composition and better characterise the quality of the originals. However, it is still hard to recognise relatively small regions of interest because the thumbnail is much smaller than the original.
Far fewer publications are devoted to thumbnails of scanned documents. Berkner et al. (2003) describe the so-called SmartNail for browsing document images. A SmartNail consists of a selection of cropped and scaled document segments that are recomposed to fit the available display space while maintaining the recognisability of document images and the readability of text and keeping the layout close to that of the original document. Nevertheless, the overall initial view is destroyed, especially for small display sizes, and the layout alteration is sometimes judged negatively by observers. Berkner (2006) describes a method for determining the scale factor that preserves text readability and layout recognisability in the downsized image. Safonov et al. (2018) demonstrate the rescaling of images by retargeting. That approach allows the size of a scanned document to be decreased several times, but the preservation of text readability in small thumbnails remains an unsolved problem.
The algorithm for the generation of the animated graphical abstract comprises the following three key stages:
1. Detection of attention zones
2. Selection of a region for quality estimation
3. Generation of video frames, which are transitions between the zones and the
whole image
Obviously, the attention zones differ for various types of images. To demonstrate the advantages of the animated graphical abstract as a concept, we consider the following image types: conventional consumer photographs, images of scanned documents, and slices of X-ray microtomographic images of rock samples. For the most part, human faces are adequate for identifying the content of a photo. For photos that do not contain faces, salient regions can be considered visual attention zones. The title, headers, other emphasised text elements, and pictures are enough for the identification of a document. For the investigation of images acquired by tomography, we need to examine the regions of various substances.
For the visual estimation of blurriness, noise, compression artefacts, and the specific artefacts of CT images (Kornilov et al. 2019), observers should investigate a fragment of an image without any scaling. We propose several simple rules for the selection of an appropriate fragment: the fragment should contain at least one contrasting edge and at least one flat region, and the histogram of the fragment's brightness should be wide but without clipping at the limits of the dynamic range. These rules are employed for region selection in the central part of an image or inside the attention zones.
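These rules can be sketched as a simple scoring routine. The window size, step, and all thresholds below are illustrative assumptions rather than values from the implementation described in this chapter:

```python
import numpy as np

def fragment_score(patch, edge_thresh=30, flat_thresh=5, clip_frac=0.01):
    """Score a candidate fragment for quality estimation.

    Heuristics (illustrative thresholds): the fragment should contain
    at least one contrasting edge and at least one flat region, and its
    brightness histogram should be wide, without clipping at the limits
    of the dynamic range.
    """
    patch = patch.astype(np.float32)
    # Horizontal/vertical first differences approximate local contrast.
    gx = np.abs(np.diff(patch, axis=1))
    gy = np.abs(np.diff(patch, axis=0))
    has_edge = (gx.max() > edge_thresh) or (gy.max() > edge_thresh)
    has_flat = (gx.min() < flat_thresh) and (gy.min() < flat_thresh)
    hist, _ = np.histogram(patch, bins=256, range=(0, 256))
    clipped = (hist[0] + hist[255]) > clip_frac * patch.size
    nonzero = np.nonzero(hist)[0]
    width = nonzero[-1] - nonzero[0] if nonzero.size else 0
    if not (has_edge and has_flat) or clipped:
        return 0.0
    return float(width) / 255.0

def find_quality_fragment(gray, size=64, step=32):
    """Slide a window over the image and return the best-scoring box."""
    best, best_box = -1.0, None
    h, w = gray.shape
    for r in range(0, h - size + 1, step):
        for c in range(0, w - size + 1, step):
            s = fragment_score(gray[r:r + size, c:c + size])
            if s > best:
                best, best_box = s, (r, c, size, size)
    return best_box, best
```

The same scoring can be restricted to windows inside the attention zones instead of the whole image.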
It should be clear that, when selecting approaches for important zone detection, one should take into account the application scenario and the hardware platform limitations of the implementation. Fortunately, panning over an image during animation allows image content to be recognised even when important zones were detected incorrectly. For implementations portable to embedded platforms, we prefer techniques with low computational complexity and power consumption over more comprehensive ones.
Information about the humans in a photo is important for recognising the scene. Thus, it is reasonable to apply a face detection algorithm to detect attention zones in the photo. There are numerous methods for face detection. At present, methods based on deep neural networks demonstrate state-of-the-art performance (Zhang and Zhang 2014; Li et al. 2016; Bai et al. 2018). Nevertheless, we prefer the Viola-Jones face detector (Viola and Jones 2001), which has several effective multi-view implementations for various platforms. The number of false positives of the face detector can be decreased with additional skin tone segmentation and processing of downsampled images (Egorova et al. 2009).
356 I. V. Safonov et al.
We set the upper limit for the number of detected faces to four. If more faces are detected, then we select the largest regions. Figure 14.4 illustrates detected faces as attention zones.
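The logic of keeping at most four of the largest detections can be sketched as follows. The `select_face_zones` helper is illustrative; its (x, y, w, h) box format matches the output of OpenCV's Viola-Jones cascade detector:

```python
def select_face_zones(detections, max_faces=4):
    """Keep at most max_faces detected faces, preferring the largest.

    detections is a list of (x, y, w, h) boxes, e.g. the output of
    OpenCV's CascadeClassifier.detectMultiScale() for a Viola-Jones
    frontal-face cascade.
    """
    by_area = sorted(detections, key=lambda b: b[2] * b[3], reverse=True)
    return by_area[:max_faces]

# Typical use with OpenCV (not executed here):
#   cascade = cv2.CascadeClassifier(
#       cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
#   faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
#   zones = select_face_zones(list(map(tuple, faces)))
```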
Faces may characterise a photo very well, but many photos do not contain faces. In this case, an additional mechanism has to be used to detect attention zones. Thresholding a saliency map is one of the common ways of looking for attention zones. Again, deep neural networks perform well on this problem (Wang et al. 2015; Zhao et al. 2015; Liu and Han 2016), but we use a simpler histogram-based contrast technique (Cheng et al. 2014), which usually provides a reasonable saliency map. Figure 14.5 shows examples of attention zones detected based on the saliency map.
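A greyscale simplification of the histogram-based contrast idea can be sketched as follows. The per-level formulation and the 0.5 threshold are illustrative assumptions; the full method of Cheng et al. (2014) works with colour statistics and region-level contrast:

```python
import numpy as np

def hc_saliency(gray):
    """Histogram-contrast saliency (greyscale simplification of the
    HC method): a grey level is salient if it is far, on average,
    from the grey levels of all other pixels in the image."""
    hist, _ = np.histogram(gray, bins=256, range=(0, 256))
    freq = hist.astype(np.float64) / gray.size
    levels = np.arange(256, dtype=np.float64)
    # Per-level saliency: sum_l freq(l) * |level - l|
    sal_per_level = np.abs(levels[:, None] - levels[None, :]) @ freq
    sal = sal_per_level[gray]
    return sal / (sal.max() + 1e-12)

def attention_mask(gray, thresh=0.5):
    """Binary attention mask obtained by thresholding the saliency map."""
    return hc_saliency(gray) >= thresh
```

Attention zones are then the bounding boxes of the connected regions of the thresholded mask.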
The majority of icons for document images look very similar, and it is difficult to distinguish them from one another. To recognise a document, it is important to see the title, emphasised blocks of text, and embedded pictures. There are many document layout analysis methods that allow the segmentation and detection of the various important regions of a document (Safonov et al. 2019). However, we do not need complete document segmentation to detect several attention zones, so we can use simple and computationally inexpensive methods.
We propose a fast algorithm to detect blocks of text with a very large font size, which correspond to the title and headers. The algorithm includes the following steps. First, the initial RGB image is converted to greyscale I. The next step is downsampling the original document image to a size that keeps text of 16–18 pt or greater recognisable. For example, a scanned document image with a resolution of 300 dpi should be downsampled five times; the resulting image of an A4 document has a size of about 700 × 500 pixels. Handling a greyscale downsampled copy of the initial image significantly decreases the processing time.
Downsized text regions look like a texture. These areas contain the bulk of the edges, so edge detection techniques can be applied to reveal text regions. We use Laplacian of Gaussian (LoG) filtering with zero crossing. LoG filtering is a convolution of the downsampled image I with the kernel k:

$$k(x, y) = \frac{\left(x^2 + y^2 - 2\sigma^2\right) k_g(x, y)}{2\pi\sigma^6 \sum_{x=-N/2}^{N/2} \sum_{y=-N/2}^{N/2} k_g(x, y)},$$

where k_g(x, y) is the Gaussian kernel; N is the size of the convolution kernel; σ is the standard deviation; and (x, y) are the coordinates of the Cartesian system with the origin at the centre of the kernel.
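Under the assumption that k_g is the Gaussian kernel, the normalised LoG kernel can be built directly from this formula (N = 9 and σ = 1.4 are illustrative values):

```python
import numpy as np

def log_kernel(N=9, sigma=1.4):
    """Laplacian-of-Gaussian kernel following the chapter's formula:
    k(x, y) = (x^2 + y^2 - 2*sigma^2) * k_g(x, y)
              / (2*pi*sigma^6 * sum of k_g over the N x N support),
    where k_g is assumed to be the Gaussian kernel."""
    half = N // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    kg = np.exp(-(x**2 + y**2) / (2.0 * sigma**2))
    k = (x**2 + y**2 - 2.0 * sigma**2) * kg
    return k / (2.0 * np.pi * sigma**6 * kg.sum())
```

The kernel is symmetric, negative at the centre, and sums to approximately zero, so flat regions filter to near-zero response.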
The zero-crossing approach with a fixed threshold T is preferable for edge segmentation: a pixel (r, c) of the binary image BW is set to 1 if the outcome Ie of LoG filtering changes sign between (r, c) and a neighbouring pixel and the magnitude of this difference exceeds T, and to 0 otherwise.
For segmentation of text regions, we look for the pixels that have a lot of edge pixels in the vicinity:

$$L(r, c) = \begin{cases} 1, & \displaystyle\sum_{i=r-dr/2}^{r+dr/2} \sum_{j=c-dc/2}^{c+dc/2} BW(i, j) > T_t \\ 0, & \text{otherwise,} \end{cases} \qquad \forall\, r, c,$$

where L is the image of segmented text regions; dr and dc are the sizes of blocks; and T_t is a threshold.
In addition to text, regions corresponding to vector graphics such as plots and diagrams are segmented too. Further steps are the labelling of connected regions in L and the calculation of their bounding boxes. Regions with a small height or width are eliminated.
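The window sums in the definition of L(r, c) can be computed in O(1) per pixel with an integral image; a sketch, with illustrative dr, dc, and T_t values:

```python
import numpy as np

def segment_text_regions(bw, dr=16, dc=16, t_t=20):
    """Compute L(r, c): 1 where the number of edge pixels of BW inside
    the dr x dc vicinity of (r, c) exceeds the threshold T_t.
    The default dr, dc, and t_t are illustrative values, not the
    settings of the chapter's implementation."""
    bw = bw.astype(np.int64)
    h, w = bw.shape
    # Integral image with a zero border row and column.
    ii = np.zeros((h + 1, w + 1), np.int64)
    ii[1:, 1:] = bw.cumsum(0).cumsum(1)
    r = np.arange(h)
    c = np.arange(w)
    r0 = np.clip(r - dr // 2, 0, h)
    r1 = np.clip(r + dr // 2 + 1, 0, h)
    c0 = np.clip(c - dc // 2, 0, w)
    c1 = np.clip(c + dc // 2 + 1, 0, w)
    # Window sum via four integral-image lookups per pixel.
    S = (ii[np.ix_(r1, c1)] - ii[np.ix_(r0, c1)]
         - ii[np.ix_(r1, c0)] + ii[np.ix_(r0, c0)])
    return S > t_t
```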
In the next steps, the average character size is calculated for each text region, and several zones with a large character size are selected. Let us consider how to calculate the average size of the characters of a text region, which corresponds to some connected region Z in the image L.
Figure 14.6 illustrates our approach to the detection of text regions. Detected text regions L are marked in green. The image Z consists of all the connected regions. The average size of characters is calculated for the dark connected areas inside the green region. This is a reliable way to detect the region of the title of a paper.
At the final stage of our approach, photographic illustrations are identified because they are also important for document recognition. For the detection of embedded photos, the image I is divided into non-overlapping blocks of size N × M. For each block, the energy E_i of the normalised grey level co-occurrence matrix is calculated:

$$N_I(i, j) = \frac{C_I(i, j)}{\sum_i \sum_j C_I(i, j)},$$

$$E_i = \sum_i \sum_j N_I^2(i, j),$$

where C_I is the grey level co-occurrence matrix of the block.
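The two formulas can be sketched as follows; the quantisation to eight grey levels and the (0, 1) pixel offset used to build C_I are illustrative choices:

```python
import numpy as np

def glcm_energy(block, levels=8, offset=(0, 1)):
    """Energy of the normalised grey level co-occurrence matrix of one
    block: N_I = C_I / sum(C_I), E = sum(N_I**2). Uniform blocks give
    energy close to 1; textured (photographic) blocks give low energy."""
    q = (block.astype(np.int32) * levels) // 256  # quantise grey levels
    dr, dc = offset
    # Pair each pixel with its neighbour at the given offset.
    a = q[max(0, -dr):q.shape[0] - max(0, dr),
          max(0, -dc):q.shape[1] - max(0, dc)]
    b = q[max(0, dr):, max(0, dc):][:a.shape[0], :a.shape[1]]
    C = np.zeros((levels, levels), np.float64)
    np.add.at(C, (a.ravel(), b.ravel()), 1.0)
    N = C / C.sum()
    return float((N**2).sum())
```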
The bounding box of the region with the largest area defines the zone of the embedded photo. Figure 14.7 shows the outcomes of the detection of the blocks related to a photographic illustration inside the document image.
Fig. 14.7 The results of detection of a photographic illustration inside a document image
At the beginning, the order of the zones is selected for animation creation. The first frame always represents the whole downsampled image, that is, the conventional thumbnail. The subsequent zones are ordered to provide the shortest path across the image while moving between attention zones. The animation can be looped; in this case, the final frame is the whole image too. The animation simulates the following camera effects: tracking in, tracking out, and panning between attention zones; slow panning across a large attention zone; and pausing on the zones. Tracking-in, tracking-out, and panning effects between two zones are created by constructing a sequence of N frames. Each frame of the sequence is prepared with the following steps:
1. Calculation of the coordinates of a bounding box for the cropping zone using the line equation in parametric form:

$$x = x_1 + t (x_2 - x_1), \qquad y = y_1 + t (y_2 - y_1),$$

where (x1, y1) are the coordinates of the start zone; (x2, y2) are the coordinates of the end zone; and t is the parameter, which is increased from 0 to 1 with step dt = 1/(N − 1)
2. Cropping the image using the coordinates of the calculated bounding box, preserving the aspect ratio
3. Resizing of the cropped image to the target size
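Step 1 can be sketched as a routine that interpolates the cropping box between two zones. Extending the same linear interpolation to the box size is an assumption here, and the aspect-ratio adjustment of step 2 is left out:

```python
def zone_transition_boxes(start, end, n_frames):
    """Bounding boxes for an n_frames-long transition between two zones.

    start and end are (x, y, w, h) boxes; both the position and the size
    are interpolated linearly with t = i / (n_frames - 1), following the
    parametric line equation of step 1. Requires n_frames >= 2.
    """
    x1, y1, w1, h1 = start
    x2, y2, w2, h2 = end
    boxes = []
    for i in range(n_frames):
        t = i / (n_frames - 1)  # step dt = 1 / (N - 1)
        boxes.append((x1 + t * (x2 - x1), y1 + t * (y2 - y1),
                      w1 + t * (w2 - w1), h1 + t * (h2 - h1)))
    return boxes
```

Each returned box is then cropped from the image and resized to the target thumbnail size (steps 2 and 3).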
Figure 14.8 demonstrates an example of the animated graphical abstract for a photo (Safonov and Bucha 2010). Two faces are detected, and the hands of the kids are selected as the region for quality estimation. The animation consists of four transitions between the whole image and these three zones. The first sequence of frames looks like a camera tracking in to a face; the frame is then frozen for a moment to focus on the zoomed face. The second sequence of frames looks like a camera panning between the faces. The third sequence looks like a camera panning and zooming in between the face and the hands; the frame with the hands is then frozen for a moment for visual quality estimation. The final sequence of frames looks like a camera tracking out to the whole scene, and a freeze frame takes place again.
Figure 14.9 demonstrates an example of the animated graphical abstract for the image of a scanned document. The title and an embedded picture are detected in the first stage. The fragment of the image with the title is appropriate for quality estimation. The animation consists of four transitions between the entire image and these two zones, as well as viewing of the relatively large title zone. The first sequence of frames looks like a camera tracking in to the left side of the title zone. The second sequence of frames looks like a slow camera panning across the title zone; the frame is then frozen for a moment for visual quality estimation. The third sequence of frames looks like a camera panning from the right side of the title zone to the picture inside the document, after which the frame is frozen for a moment. The final sequence of frames looks like a camera tracking out to the entire page. Finally, the frame with the entire page is frozen for a moment. This sequence of frames allows the image content to be identified confidently.
For CT images, the animation is created between zones having different characteristics according to visual similarity and an image fragment without scaling, which is used for quality estimation. In contrast to the two previous examples, panning across a tomographic image often does not clearly show the location of a zone within the image. It is preferable to make transitions between zones via the intermediate entire slice.
Fig. 14.10 Conventional large icons in the task detection of photos with a certain person
Fig. 14.11 Frames of the animated graphical abstract in the task detection of photos with a certain
person
answer. It is only a little better than random guessing. Two respondents had better results than the others because they had much experience in photography and understood the shooting conditions that can cause a blurred photo. The animated graphical abstract demonstrates zoomed-in fragments of the photo and allows low-quality photos to be identified. In our survey, 90% of respondents detected the proper photos by viewing the animation frames.
Figure 14.13 shows enlarged fragments of sharp photos in the top row and of blurred photos in the bottom row. The difference is obvious, and the blurriness is detectable. The remaining 10% of errors are probably explained by subjective interpretation of the concept of blurriness. Indeed, sharpness and blurriness are not strictly formalised and depend on viewing conditions.
The third task was the selection of the two scanned images that represent documents related to the Descreening topic. The total number of documents was nine. Figure 14.14 shows the conventional large icons of the scanned images used in that survey. For the icons, the percentage of correct answers was 20%. In general, it is impossible to solve this task properly using conventional thumbnails of small size. On the contrary, animated graphical abstracts provide a high level of correct answers: 80% of respondents selected both pages related to Descreening, thanks to zooming and panning through the titles of the papers, as shown in Fig. 14.15.
The fourth task was the classification of the sandstones whose CT image slice icons are shown in Fig. 14.3. Researchers experienced in material science can probably classify those slices from the icons more or less confidently. However, the participants in our survey had no such skills. They were instructed to make decisions based on the following text description: Bentheimer sandstone has middle-sized, sub-angular, non-uniform grains with 10–15% inclusions of other substances such as spar and clay; Fontainebleau sandstone has large, angular, uniform grains. For the icons of the slices, the percentage of right answers was 50%, which corresponds to random guessing. The enlarged fragments of the images in the animation frames allow the sandstones to be classified properly: 90% of respondents gave the right answers. Figure 14.16 shows examples of frames with enlarged fragments of slices.
Fig. 14.12 Conventional large icons in the task selection of blurred photos
Fig. 14.13 Frames of animated graphical abstract in the task selection of blurred photos: top row
for sharp photos, bottom row for blurred photos
The final task was the identification of the noisiest tomographic image. We scanned the same sample six times with different exposure times and numbers of frames for averaging (Kornilov et al. 2019). A longer exposure time and a greater number of frames for averaging produce a high-quality image, whereas a short exposure time and the absence of averaging correspond to noisy images. A conventional thumbnail does not allow the noise level to be estimated: the icons of the slices for all six images look almost identical. That is why only 20% of respondents could identify the noisiest image correctly. Frames of the animation containing zoomed fragments of slices make it easy to assess the noise level; 80% of respondents identified the noisiest image by viewing the animated graphical abstract.
Table 14.1 contains the results for all tasks of our survey. The animated graphical abstract provides the capability of recognising image content and estimating quality confidently, and it considerably outperforms conventional icons and thumbnails. In addition, such animation is an impressive way to navigate through image collections in software for PCs, mobile applications, and widgets. The idea of the animated graphical abstract can be extended to other types of files, for example, PDF documents.
Fig. 14.14 Conventional large icons in the task selection of documents related to the given topic
Fig. 14.15 Frames of animated graphical abstract with panning through the title of the document
Fig. 14.16 Frames of animated graphical abstracts in the task classification of the type of sandstone
References
Bai, Y., Zhang, Y., Ding, M., Ghanem, B.: Finding tiny faces in the wild with generative adversarial
network. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition,
pp. 21–30 (2018)
Berkner, K.: How small should a document thumbnail be? In: Digital Publishing. SPIE. 6076,
60760G (2006)
Berkner, K., Schwartz, E.L., Marle, C.: SmartNails – display and image dependent thumbnails. In:
Document Recognition and Retrieval XI. SPIE. 5296, 54–65 (2003)
Chen, H., Wang, B., Pan, T., Zhou, L., Zeng, H.: Cropnet: real-time thumbnailing. In: Proceedings
of the 26th ACM International Conference on Multimedia, pp. 81–89 (2018)
Cheng, M.M., Mitra, N.J., Huang, X., Torr, P.H., Hu, S.M.: Global contrast based salient region
detection. IEEE Trans. Pattern Anal. Mach. Intell. 37(3), 569–582 (2014)
Egorova, M.A., Murynin, A.B., Safonov, I.V.: An improvement of face detection algorithm for
color photos. Pattern Recognit. Image Anal. 19(4), 634–640 (2009)
Esmaeili, S.A., Singh, B., Davis, L.S.: Fast-at: fast automatic thumbnail generation using deep
neural networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition, pp. 4622–4630 (2017)
Itti, L., Koch, C., Niebur, E.: A model of saliency-based visual attention for rapid scene analysis.
IEEE Trans. Pattern Anal. Mach. Intell. 20(11), 1254–1259 (1998)
Koik, B.T., Ibrahim, H.: Image thumbnail based on fusion for better image browsing. In: Pro-
ceedings of the IEEE International Conference on Control System, Computing and Engineering,
pp. 547–552 (2014)
Kornilov, A., Safonov, I., Yakimchuk, I.: Blind quality assessment for slice of microtomographic
image. In: Proceedings of the 24th Conference of Open Innovations Association (FRUCT),
pp. 170–178 (2019)
Kornilov, A.S., Reimers, I.A., Safonov, I.V., Yakimchuk, I.V.: Visualization of quality of 3D
tomographic images in construction of digital rock model. Sci. Vis. 12(1), 70–82 (2020)
Li, Y., Sun, B., Wu, T., Wang, Y.: Face detection with end-to-end integration of a ConvNet and a 3d
model. In: Proceedings of the European Conference on Computer Vision, pp. 420–436 (2016)
Lie, M.M., Neto, H.V., Borba, G.B., Gamba, H.R.: Automatic image thumbnailing based on fast
visual saliency detection. In: Proceedings of the 22nd Brazilian Symposium on Multimedia and
the Web, pp. 203–206 (2016)
Lin, K.C.: On improvement of the computation speed of Otsu’s image thresholding. J. Electron.
Imaging. 14(2), 023011 (2005)
Liu, N., Han, J.: DHSNet: deep hierarchical saliency network for salient object detection. In:
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition,
pp. 678–686 (2016)
Ramalho, G.L.B., Ferreira, D.S., Rebouças Filho, P.P., de Medeiros, F.N.S.: Rotation-invariant
feature extraction using a structural co-occurrence matrix. Measurement. 94, 406–415 (2016)
Safonov, I.V., Bucha, V.V.: Animated thumbnail for still image. In: Proceedings of the Graphicon
conference, pp. 79–86 (2010)
Safonov, I.V., Kurilin, I.V., Rychagov, M.N., Tolstaya, E.V.: Adaptive Image Processing Algo-
rithms for Printing. Springer Nature Singapore AG, Singapore (2018)
Safonov, I.V., Kurilin, I.V., Rychagov, M.N., Tolstaya, E.V.: Document Image Processing for
Scanning and Printing. Springer Nature Switzerland AG, Cham (2019)
Samadani, R., Mauer, T., Berfanger, D., Clark, J., Bausk, B.: Representative image thumbnails:
automatic and manual. In: Human Vision and Electronic Imaging XIII. SPIE. 6806, 68061D
(2008)
Suh, B., Ling, H., Bederson, B.B., Jacobs, D.W.: Automatic thumbnail cropping and its effective-
ness. In: Proceedings of ACM UIST (2003)
Viola, P., Jones, M.: Rapid object detection using a boosted cascade of simple features. In:
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition,
pp. 511–518 (2001)
Wang, L., Lu, H., Ruan, X., Yang, M.H.: Deep networks for saliency detection via local estimation
and global search. In: Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition, pp. 3183–3192 (2015)
Zhang, C., Zhang, Z.: Improving multi-view face detection with multi-task deep convolutional
neural networks. In: IEEE Winter Conference on Applications of Computer Vision,
pp. 1036–1041 (2014)
Zhao, R., Ouyang, W., Li, H., Wang, X.: Saliency detection by multi-context deep learning. In:
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition,
pp. 1265–1274 (2015)
Chapter 15
Real-Time Video Frame-Rate Conversion
15.1 Introduction
The problem of increasing the frame rate of a video stream began to attract attention long ago, in the mid-1990s, just as TV screens started to get larger and the stroboscopic effect caused by the discretisation of smooth motion became more apparent. The early 100 Hz CRT TVs had a low-resolution image (PAL/NTSC), and the main point was to get rid of the inherent CRT flicker, but with ever larger LCD TV sets, the need for good real-time frame-rate conversion (FRC) algorithms kept increasing.
The problem setup is to analyse the motion of objects in a video stream and to create new frames that follow the same motion. Obviously, with high-resolution content this is a computationally demanding task: frame data must be analysed, and new frames interpolated and composed, in real time. In the TV industry, the problem of computational load was solved by dedicated FRC chips with highly optimised circuitry and without strict limitations on power consumption.
The prevailing customer of Samsung R&D Institute Russia was the mobile division, so we proposed to bring the FRC “magic” to the smartphone segment. The computational performance of mobile SoCs was steadily increasing, even more so on the GPU side. The first use cases for FRC were reasonably chosen to have a limited duration, so that the increased power consumption was not a catastrophic problem. The use cases were Motion Photo playback and Super Slow Motion capture.
We expected that, besides reasonably good quality, we would have to deliver a solution working smoothly in real time on a mobile device, possibly with power consumption limitations. These requirements more or less dictate the use of block-wise motion vectors, which dramatically decreases the complexity of all parts of the FRC algorithm.
The high-level structure of an FRC algorithm is, with slight variations, shared between many variants of FRC (Cordes and de Haan 2009). The main stages of the algorithm are:
1. Motion estimation (ME) – usually analyses two consecutive video frames and returns motion vectors suitable for tracking objects, the so-called true motion (de Haan et al. 1993; Pohl et al. 2018).
2. Occlusion processing (OP) and preparation of data for MCI – analyses several motion vector fields, makes decisions about appearing and disappearing areas, and creates data to guide their interpolation. Often, this stage modifies motion vectors (Bellers et al. 2007).
3. Motion compensated interpolation (MCI) – takes data from OP (occlusion-corrected motion vectors and weights) and produces the interpolated frame.
4. Fallback logic – keyframe repetition is applied instead of FRC processing if the input is complex video content (the scene changes, or highly nonlinear or extremely fast motion appears). In this case, strong interpolation artefacts are replaced by judder, globally or locally, which is visually better.
We had to develop a purely SW (software) FRC algorithm for the fastest possible commercialisation and prospective support of devices already released to the market. In our case, purely SW means an implementation that uses any number of available CPU cores and the GPU via the OpenCL standard. This gave us a few advantages:
• Simple upgrades of the algorithm in case severe artefacts are found in some specific types of video content
• Relatively simple integration with existing smartphone software
• Rapid implementation of possible new scenarios
• Freedom from some hardware limitations, such as a small amount of cached frame memory and one-pass motion estimation
We chose to use 8 × 8 basic blocks, but for higher resolutions, it is possible to increase the block size in the ME stage, with further upscaling of the motion vectors for subsequent stages.
Fig. 15.1 Trees of direct dependencies (green and red) and areas of indirect dependencies (light
green and light red) for a meandering order (left). Top-to-bottom meandering order used for forward
ME (right top). Bottom-to-top meandering order used for backward ME (right bottom)
376 I. M. Kovliga and P. Pohl
improve the computational cost without any noticeable degradation in the quality of
the resulting motion field.
The 3DRS algorithm is based on block matching, using a frame divided into blocks
of pixels, where X = (x, y) are the pixel-wise coordinates of the centre of the block.
Our FRC algorithm requires two motion fields for each pair of consecutive frames
F_{t-1}, F_t. The forward motion field D_{FW,t-1}(X) is the set of motion vectors assigned
to the blocks of F_{t-1}; these motion vectors point to frame F_t. The backward motion
field D_{BW,t}(X) is the set of motion vectors assigned to the blocks of F_t; these motion
vectors point to frame F_{t-1}.
To obtain a motion vector for each block, we try only a few candidates, as
opposed to an exhaustive search that tests all possible motion vectors for each
block. The candidates we try in each block are called a candidate set. The rules for
selecting motion vectors in the candidate set are the same for each block. We use the
following rules (CS – candidate set) to search the current motion field D_cur:
where W and H are the width and height of a block (we use 8×8 blocks); rnd(k) is a
function whose result is a random value from the range {-k, -k+1, ..., k}; MAD is
the mean absolute difference between the window over the current block B(X) of one
frame and the window over the block pointed to by a motion vector in another frame;
and the size of the windows is 16×12. D_cur is the motion vector from the current
motion field, and D_pred is a predictor obtained from the previously found motion
field. If the forward motion field D_{FW,t-1}(X) is searched, then the predictor will be
(-D_{BW,t-1}(X)); if the backward motion field D_{BW,t}(X) is searched, then the
predictor P_{BW,t} will be formed from D_{FW,t-1}(X) by projecting it onto the block
grid of frame F_t with subsequent inversion: P_{BW,t}(X + D_{FW,t-1}(X)) = -D_{FW,t-1}(X).
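The projection-with-inversion step can be illustrated as follows. This is a simplified Python sketch under stated assumptions: one vector per block indexed on the block grid, with the landing position rounded to whole blocks; details such as several vectors landing on the same block are ignored.

```python
def project_and_invert(fwd_field, block_w, block_h, grid_w, grid_h):
    """Build a backward-ME predictor field P_BW,t from D_FW,t-1.

    Each forward vector assigned to block (bx, by) of frame F_{t-1} is
    projected to the block of frame F_t that it points at, then inverted:
        P_BW,t(X + D_FW,t-1(X)) = -D_FW,t-1(X)
    `fwd_field` maps block indices (bx, by) to pixel vectors (dx, dy).
    """
    pred = {}
    for (bx, by), (dx, dy) in fwd_field.items():
        # Landing block on the grid of frame F_t (rounded to block units).
        tx = bx + round(dx / block_w)
        ty = by + round(dy / block_h)
        if 0 <= tx < grid_w and 0 <= ty < grid_h:
            pred[(tx, ty)] = (-dx, -dy)   # inversion
    return pred


# A vector of one block width moves block (0, 0) onto block (1, 0).
pred = project_and_invert({(0, 0): (8, 0)}, 8, 8, 4, 4)
```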
In fact, two ME passes are used for each pair of frames: the first pass is an
estimation of the forward motion field, and the second pass is an estimation of the
15 Real-Time Video Frame-Rate Conversion 377
Fig. 15.2 Sources of spatial and temporal candidates for a block (marked by a red box) in an even
row during forward ME (left); sources of a block in an odd row during forward ME (right)
backward motion field. We use different scanning orders for the first and second
passes. The top-to-bottom meandering scanning order (from top to bottom and from
left to right for odd rows and from right to left for even rows) is used for the first pass
(right-top image of Fig. 15.1). A bottom-to-top meandering scanning order (from
bottom to top and from right to left for odd rows and from left to right for even rows;
rows are numbered from bottom to top in this case) is used for the second pass (right-
bottom image of Fig. 15.1).
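Both meandering orders can be generated as follows. This is an illustrative Python sketch; rows and columns are 0-indexed here, so the "odd rows" of the text correspond to even indices.

```python
def meandering_order(rows, cols, top_to_bottom=True):
    """Yield block coordinates (row, col) in a meandering scan.

    Top-to-bottom order (first row left-to-right) is used for forward ME;
    for backward ME the pattern runs bottom-to-top, with the first
    (bottom) row scanned right-to-left, as described in the text.
    """
    row_range = range(rows) if top_to_bottom else range(rows - 1, -1, -1)
    for i, r in enumerate(row_range):
        # Row parity decides the horizontal direction of recursion.
        left_to_right = (i % 2 == 0) if top_to_bottom else (i % 2 == 1)
        col_range = range(cols) if left_to_right else range(cols - 1, -1, -1)
        for c in col_range:
            yield (r, c)


fwd_order = list(meandering_order(2, 3))                        # first pass
bwd_order = list(meandering_order(2, 3, top_to_bottom=False))   # second pass
```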
The relative positions UScur of the spatial candidate set CSspatial and the relative
positions of the temporal candidate set CStemporal in the above description are valid
only for top-to-bottom and left-to-right directions. If the direction is inverted for some
coordinates, then the corresponding coordinates in UScur and USpred should be
inverted accordingly. Thus, the direction of recursion changes in a meandering
scanning order from row to row. In Fig. 15.2, we show the sources of spatial
candidates (CSspatial, green blocks) and temporal candidates (CStemporal, orange blocks)
for two scan orders. On the left-hand side, this is the top-to-bottom, left-to-right
direction, and on the right-hand side, it is the top-to-bottom, right-to-left direction.
The block erosion process (de Haan et al. 1993) was skipped in our modification
of 3DRS. For additional smoothing of the backward motion field, we applied
additional regularisation after ME. For each block, we compared MADnarrow(X, D)
for the current motion vector D of the block and MADnarrow(X, Dmedian), where
Dmedian is a vector obtained by per-coordinate combining of the median values of
a set consisting of the nine motion vectors from an 8-connected neighbourhood and
from the current block itself. Here, MAD_narrow is different from the MAD used for
3DRS matching: the size of the window for MAD_narrow is decreased to 8×8
(equal to the block size). The original motion vector is overwritten by D_median if
MAD_narrow(X, D_median) is better, or worse by only a small margin.
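The regularisation step can be sketched as follows. This is a simplified Python sketch: `mad_narrow` is a stand-in for the 8×8-window matching cost, the margin value is illustrative, and missing neighbours at frame borders simply fall back to the block's own vector.

```python
from statistics import median

def regularise(field, mad_narrow, margin=1.0):
    """Post-ME smoothing of a motion field: replace each block's vector
    with the per-coordinate median over its 8-connected neighbourhood
    (plus the block itself), unless the narrow-window MAD of the median
    vector is worse by more than a small margin.

    `field` is a dict {(bx, by): (dx, dy)};
    `mad_narrow(block, vec)` returns the matching cost (a stand-in here).
    """
    out = {}
    for (bx, by), vec in field.items():
        neigh = [field.get((bx + i, by + j), vec)
                 for i in (-1, 0, 1) for j in (-1, 0, 1)]
        d_median = (median(v[0] for v in neigh),
                    median(v[1] for v in neigh))
        if mad_narrow((bx, by), d_median) <= mad_narrow((bx, by), vec) + margin:
            out[(bx, by)] = d_median
        else:
            out[(bx, by)] = vec
    return out


field = {(x, y): (0, 0) for x in range(3) for y in range(3)}
field[(1, 1)] = (10, 10)                      # a single outlier vector
smoothed = regularise(field, lambda block, vec: 0.0)
```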
The use of a meandering scanning order in combination with the candidate set
described above prevents the possibility of parallel processing several blocks of a
given motion field. This is illustrated in Fig. 15.1. The blocks marked in green
should be processed before starting the processing of the current block (marked in
white in a red box) due to their direct dependency via the spatial candidate set and the
blocks marked in red, which will be directly affected by the estimated motion vector
in the current block. The light green and light red blocks mark indirect dependencies.
Thus, there are no blocks that can be processed simultaneously with the current
block since all blocks need to be processed either before or after the current block.
In Al-Kadi et al. (2010), the authors propose a parallel processing scheme which
preserves the direction switching of a meandering order. The main drawback is that
the spatial candidate set is not optimal, since the upper blocks for some threads are
not yet processed at the beginning of row processing.
If we change the meandering order to a simple “raster scan” order (always from
left to right in each row), then the dependencies become smaller (see Fig. 15.3a).
We propose changing the scanning order to achieve wave-front parallel processing,
as proposed in the HEVC standard (Chi et al. 2012), or a staggered approach as shown
in Fluegel et al. (2006). A wave-front scanning order is depicted in Fig. 15.3b, together
with the dependencies and sets of blocks which can be processed in parallel
(highlighted in blue). The set of blue blocks is called a wave-front.
In the traditional method of using wave-front processing, each thread works on
one row of blocks. When a block is processed, the thread should be synchronised
with a thread that works on an upper row, in order to preserve the dependency (the
upper thread needs to stay ahead of the lower thread). This often produces stalls due
to the different times needed to process different blocks. Our approach is different;
working threads process all blocks belonging to a wave-front independently. Thus,
synchronisation that produces a stall is performed only when the processing of the
next wave-front starts, and even this stall may not happen since the starting point of
the next wave-front is usually ready for processing, provided that the number of
tasks is at least several times greater than the number of cores. The proposed
approach therefore eliminates the majority of stalls.
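The wave-front schedule and its single synchronisation point per wave-front can be sketched as follows. This is an illustrative Python sketch under an assumption: the spatial dependencies of a block are its left and up-right neighbours, which yields the HEVC-style wave-front index 2·row + col; the real candidate set of the chapter differs in detail.

```python
from concurrent.futures import ThreadPoolExecutor

def wavefront_schedule(rows, cols):
    """Group blocks into wave-fronts. With dependencies on (r, c-1) and
    (r-1, c+1), all blocks sharing the index 2*r + c are independent."""
    fronts = {}
    for r in range(rows):
        for c in range(cols):
            fronts.setdefault(2 * r + c, []).append((r, c))
    return [fronts[k] for k in sorted(fronts)]

def run_wavefronts(rows, cols, process_block, workers=4):
    """Process each wave-front with a thread pool. The only
    synchronisation point is the barrier between consecutive
    wave-fronts, avoiding the per-block stalls of row-paired threads."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for front in wavefront_schedule(rows, cols):
            list(pool.map(process_block, front))   # barrier per front


order = []                          # record the processing order
run_wavefronts(3, 4, order.append, workers=2)
```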
The wave-front scanning order changes the resulting motion field, because it uses
only “left to right” relative positions UScur and USpred during the estimation of
forward motion and only “right to left” for backward motion. In contrast, the
meandering scanning order switches between “left to right” and “right to left” after
processing every row of blocks.
Fig. 15.3 Parallel processing several blocks of a motion field: (a) trees of dependencies for a raster
(left-right) scan order; (b) wave-front scanning order; (c) slanted wave-front scanning order (two
blocks in one task)
The proposed wave-front scanning order has an inconvenient memory access pattern
and hence uses the cache of the processor ineffectively. For a meandering scanning
order with smooth motion, the memory accesses are serial, and frame data stored in
the cache are reused effectively. The main direction of the wave-front scanning order
is diagonal, which nullifies the advantage of a long cache line and degrades the reuse
of the data in the cache. As a result, the number of memory accesses (cache misses)
increases.
To solve this problem, we propose using several blocks placed in raster order as
one task for parallel processing (see Fig. 15.3c, where each task consists of two
blocks). We call this modified order a slanted wave-front scanning order. This
solution changes only the scanning order, and not the directions of recursion, so
only rnd(k) influences the resulting motion field. If rnd(k) is a function of X (the
spatial position in the frame) and the number of calls in X, then the results will be
exactly the same as for the initial wave-front scanning order. The quantity of blocks
in one task can vary; a greater number is better for the cache but can limit the number
of parallel tasks. Reducing the quantity of tasks limits the maximum number of CPU
cores used effectively but also reduces the overhead for thread management.
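Grouping raster-ordered blocks into tasks can be sketched as follows. This is an illustrative Python sketch; the dependency reasoning in the comments uses the same simplified left/up-right assumption as above, and the chosen slant index is one possible way to keep it valid per task.

```python
def slanted_wavefront_tasks(rows, cols, blocks_per_task=2):
    """Slanted wave-front: each task is a short run of horizontally
    consecutive blocks processed in raster order inside the task, so
    memory access within a task stays serial and cache-friendly.

    With task-level dependencies on the left task and the up-right task,
    tasks sharing the index 2*r + t (r = row, t = task index in the row)
    form one wave-front and can run in parallel.
    """
    tasks_per_row = (cols + blocks_per_task - 1) // blocks_per_task
    fronts = {}
    for r in range(rows):
        for t in range(tasks_per_row):
            run = [(r, c) for c in range(t * blocks_per_task,
                                         min((t + 1) * blocks_per_task, cols))]
            fronts.setdefault(2 * r + t, []).append(run)
    return [fronts[k] for k in sorted(fronts)]


fronts = slanted_wavefront_tasks(2, 4, blocks_per_task=2)
```

A larger `blocks_per_task` improves cache reuse at the price of fewer parallel tasks, which is exactly the trade-off discussed in the text.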
The computational cost for motion estimation can be represented as a sum of the
costs for a calculation of the MAD and the cost of the control programme code
(managing a scanning order, construction of a candidate set, optimisations related to
skipping the calculation of the MAD for the same candidates, and so on). During our
experiments, we found that a significant share of the computation was spent on the
control code. To decrease this overhead, we introduce double-block processing. This
means that one processing unit consists of a horizontal pair of neighbouring blocks
(called a double block) instead of a single block. The use of double-block processing
allows us to reduce almost all control cycles by half. One candidate set is considered
for both blocks of this double block. For example, in forward ME, a candidate set CS
(X) from the left block of a pair is also used for the right block of the pair. However,
calculation of the MAD is performed individually for each block of the pair, and the
best candidate is considered separately for each block of the pair. This point
distinguishes double-block processing from a horizontal enlargement of the block.
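Double-block processing can be sketched as follows. This is a minimal Python sketch: `mad` is a stand-in for the matching cost, and the blocks are identified by opaque labels for brevity.

```python
def double_block_search(block_a, block_b, candidates, mad):
    """Double-block processing: a single candidate set (built for the
    left block A) is evaluated for both blocks of the horizontal pair,
    but the MAD is computed and the winner chosen individually per
    block -- unlike a simple horizontal enlargement of the block.

    `mad(block, vec)` returns the matching cost of vector `vec`.
    """
    best = {}
    for block in (block_a, block_b):
        best[block] = min(candidates, key=lambda v: mad(block, v))
    return best[block_a], best[block_b]


# The shared candidates are (0, 0) and (2, 0); each block picks its own
# best vector according to its individual MAD (a toy cost function here).
mad = lambda block, vec: vec[0] if block == "A" else -vec[0]
best_a, best_b = double_block_search("A", "B", [(0, 0), (2, 0)], mad)
```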
A slanted wave-front scanning order where one task consists of two double-block
units is shown in Fig. 15.4. The centre of each double-block unit is marked by a red
point, and the left and right blocks of the double-block unit are separated by a red
line. The current double-block unit is highlighted by a red rectangle. For this unit, the
sources are shown for the spatial candidate set (green blocks) and for the temporal
candidate set (orange blocks) related to block A. The same sources are used for block
B, which belongs to the same double block as A. Thus, the same motion vector
candidates are tried for both blocks A and B, including the random ones.
However, a true candidate set for block B may be useful when the candidate set of
block A yields results (the best MAD values) for blocks A and B that differ too
much; this is possible on some edges of a moving object or when an object moves
nonlinearly. We therefore propose an additional step for the analysis of MAD
values, which are related to the best motion vectors of blocks A and B of the double-
block unit. Depending on results of this step, we either accept the previously
estimated motion vectors or carry out a motion estimation procedure for block B.
If x_A = (x, y) and x_B = (x + W, y) are the centres of the two blocks of a double-block
pair, we can introduce the following decision rule for the additional analysis of
block B candidates:
The quality of the proposed modifications was checked for various Full HD video
streams (1920×1080) with the help of the FRC algorithm. To get ground truth, we
down-sampled the video streams from 30fps to 15fps and then up-converted them
back to 30fps with the help of motion fields obtained by the tested ME algorithms.
The initial version of our 3DRS-based algorithm was described above in the
section entitled “Baseline 3DRS-based algorithm”. This algorithm was modified
with the proposed modifications. Luminance components of input frames (initially in
YUV 4:2:0) for ME were down-sampled twice per coordinate using an 8-tap filter,
and the resulting motion vectors had a precision of two pixels. The MCI stage
worked with the initial frames at Full HD resolution. To calculate the MAD match
metric, we used the luminance component of the frame and one of the chrominance
components: one chrominance component was used for forward ME and the other
for backward ME. In
backward ME, the wave-front scanning order was also switched to bottom-to-top
and right-to-left.
Table 15.1 presents the quality results of the FRC algorithm based on (a) the
initial version of the 3DRS-based algorithm, (b) a version modified using the wave-
front scanning order, and (c) a version modified using both the wave-front scanning
order and double-block units. The proposed modifications retain the quality of the
FRC output, except for a small drop when double-block processing is enabled.
Table 15.2 presents the overall results for performance in terms of speed. Column
8 of the table contains the mean execution time for the sum of forward and backward
ME for a pair of frames using five test video streams which were also used for quality
testing. Experiment E1 shows the parameters and speed of the initial 3DRS-based
algorithm as described in the section entitled “Baseline 3DRS-based algorithm”.
Experiment E2 shows a 19.7% drop (E2 vs. E1) when the diagonal wave-front
scanning order is applied instead of the meandering order used in E1. The proposed
slanted wave-front used in E3 and E4 (eight and 16 blocks in each task) minimises
the drop to 4% (E4 vs. E1). The proposed double-block processing increases the
speed by 12.6% (E5 vs. E4) relative to the version without double-block processing
and with the same number of blocks per task.
The speed performance of the proposed modifications for the 3DRS-based
algorithm was evaluated using a Samsung Galaxy S8 mobile phone based on the
MSM8998 chipset. The clock frequency was fixed within a narrow range for the
stability of the results. The MSM8998 chipset uses a big.LITTLE configuration with
four power-efficient and four powerful cores in clusters with different performance.
Threads performing ME algorithms were assigned to a powerful cluster of CPU
cores in experiments E1–E11.
Table 15.1 Comparison of initial quality and proposed modifications: baseline (A), wave-front
scan (B), wave-front + double block (C)

Video stream    A PSNR (dB)   B PSNR (dB)   C PSNR (dB)
Climb           42.81         42.81         42.77
Dvintsev12      27.56         27.55         27.44
Turn2           31.98         31.98         31.97
Bosphorus^a     43.21         43.21         43.20
Jockey^a        34.44         34.44         34.34
Average         36.00         36.00         35.94

^a Description of video can be found in Correa et al. (2016)
Occlusions are areas of a frame pair that are visible in only one of the two frames, so
that no matching position can be found in the other frame. For video sequences,
there are two types of occlusions – covered and uncovered areas (Bellers et al. 2007).
We consider only
the interpolation that uses two source frames nearest to the interpolated temporal
position. Normal parts of a frame have a good match – that means that locally the
image patches are very similar, and it is possible to use an appropriate patch from
either image or their mixture (in case of a correct motion vector). The main issue
with occlusion parts is the fact that motion vectors are unreliable in occlusions and
appropriate image patches exist in only one of the neighbouring frames. So two critical
decisions have to be made: which motion vector is right for a given occlusion, and
which frame is appropriate to interpolate from (if a covered area was detected, then
only the previous frame should be used; if an uncovered area was detected, then
only the next frame should be used). Figure 15.5 illustrates the occlusion problem in a
simplified 1D cut. In reality, the vectors are 2D and, in the case of real-time ME,
usually somewhat noisy.
To effectively implement block-wise interpolation (the MCI stage is described in
the next section), one has to estimate all vectors in the block grid of the interpolated
frame. Motion vectors should be corrected in occlusions. In general, there are two
main ways to fill in vectors in occlusions: spatial or temporal “propagation”, or
inpainting. Our solution uses temporal filling because its implementation is very
efficient.
Let the forward motion field D_{FW,N} be the set of motion vectors assigned to the
blocks lying in the block grid of frame F_N. These forward motion vectors point to
frame F_{N+1} from frame F_N. Motion vectors of the backward motion field D_{BW,N}
point to frame F_{N-1} from frame F_N.
For the detection of covered and uncovered areas in frame F_{N+1}, we need the motion
fields D_{FW,N} and D_{BW,N+2} generated from three consecutive frames (F_N, F_{N+1}, F_{N+2})
(Bellers et al. 2007). We should analyse the D_{BW,N+2} motion field to detect covered
areas in frame F_{N+1}: areas of F_{N+1} to which no motion vectors of D_{BW,N+2} point are
covered areas. Hence, motion vectors of D_{FW,N+1} in those covered areas may be
incorrect. Collocated inverted motion vectors from the D_{BW,N+1} motion field may be
used for the correction of incorrect motion in D_{FW,N+1} (this is the temporal filling
mentioned above). Motion field D_{FW,N} should be used to detect uncovered areas in
frame F_{N+1}. Collocated inverted motion vectors from D_{FW,N+1} in detected uncovered
areas may be used instead of potentially incorrect motion vectors of motion field
D_{BW,N+1}.
In our solution, for the interpolation of any frame F_{N+1+α} (α ∈ [0...1] is the phase of an
interpolated frame), we detect covered areas in F_{N+1}, obtaining CoverMap_{N+1}, and
detect uncovered areas in F_{N+2}, obtaining UncoverMap_{N+2}, as described above.
These maps simply indicate whether blocks of the frame belong to an occluded
area or not. We need two motion fields D_{FW,N+1} and D_{BW,N+2} between frames F_{N+1}
and F_{N+2} for the detection of those maps, and two more motion fields D_{BW,N+1} and
D_{FW,N+2}, between frames F_N, F_{N+1} and F_{N+2}, F_{N+3}, for the correction of motion
vectors in the found occlusions (see Fig. 15.5).
Basically, we calculate one of the predicted interpolated blocks with coordinates
(x, y) in frame F_{N+1+α} by using some motion vector (dx, dy) as follows (suppose that
the backward motion vector from F_{N+2} to F_{N+1} is used):
Fig. 15.5 1D visualisation of covered and uncovered areas for an interpolated frame F_{N+1+α}

Here, we mix the previous and next frames with proportion α equal to the phase. But
we need to determine the proper proportion for mixing the previous and next frames
in occlusions, because only the previous frame F_{N+1} should be used in covered areas
and only the next frame F_{N+2} in uncovered areas. So, we need to calculate a weight
α_adj for each block instead of using the phase α everywhere:
Pred(x, y) = F_{N+1}(x + α·dx, y + α·dy) · (1 - α_adj(x, y))
           + F_{N+2}(x - (1-α)·dx, y - (1-α)·dy) · α_adj(x, y).
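The per-pixel blend can be sketched as follows. This is a simplified Python sketch with integer coordinates and no bounds checking; the chapter uses bicubic interpolation for fractional positions, which is omitted here.

```python
def predict_pixel(f_prev, f_next, x, y, dx, dy, alpha, alpha_adj):
    """One motion-compensated sample of the interpolated frame at phase
    alpha, mixed with the occlusion-adjusted weight alpha_adj:
    0 in covered areas (previous frame only), 1 in uncovered areas
    (next frame only), and equal to alpha elsewhere.

    Frames are 2D lists indexed [y][x]; coordinates are rounded to
    integers for simplicity.
    """
    prev_sample = f_prev[y + round(alpha * dy)][x + round(alpha * dx)]
    next_sample = f_next[y - round((1 - alpha) * dy)][x - round((1 - alpha) * dx)]
    return (1 - alpha_adj) * prev_sample + alpha_adj * next_sample


f_prev, f_next = [[10, 20], [30, 40]], [[50, 60], [70, 80]]
normal = predict_pixel(f_prev, f_next, 0, 0, 0, 0, 0.5, 0.5)   # plain blend
covered = predict_pixel(f_prev, f_next, 0, 0, 0, 0, 0.5, 0.0)  # prev only
```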
α_adj should tend to 0 in covered areas and tend to 1 in uncovered areas; in other
areas, it should remain equal to α. In addition, it is necessary to use corrected motion
vectors in the occlusions. To calculate α_adj and obtain the corrected motion field
fixed_αD_{BW,N+2}, we do the following (details of the algorithm below are described
in Chappalli and Kim (2012)):
• Copy D_{BW,N+2} to fixed_αD_{BW,N+2}.
• (See the top part of Fig. 15.6 for details.) Look at where blocks from F_{N+2} move
in the interpolated frame F_{N+1+α} along the motion vectors of D_{BW,N+2}: the moved
position of a block with coordinates (x, y) in keyframe F_{N+2} is
(x + (1-α)·dx(x, y), y + (1-α)·dy(x, y)) in frame F_{N+1+α}. The overlapped area
between each block in the block grid of the interpolated frame and all moved
blocks from the keyframe can be found, and α_adj can be made proportional to it:

α_adj = α · (overlapped area of interpolated block) / (size of interpolated block).

If a collocated area in CoverMap_{N+1} was marked as a covered area for some block
in the interpolated frame, then (a) the block in the interpolated frame is marked as
COVER if the overlapped area of the block equals zero, and (b) the motion vector
from the collocated position of D_{BW,N+1} is copied to the collocated position of
fixed_αD_{BW,N+2} if the block was marked as COVER.
• (See the bottom part of Fig. 15.6 for details.) For D_{FW,N+1}, we look at where
blocks from F_{N+1} are moved in the interpolated frame. For blocks in the
interpolated frame which are collocated with an uncovered area in UncoverMap_{N+1},
we do the following:

(a) Calculate α_adj as:
α_adj = 1 - (1-α) · (overlapped area of interpolated block) / (size of interpolated block);
(b) Mark as UNCOVER blocks in the interpolated frame which have zero
overlapped area;
(c) Copy inverted motion vectors from collocated positions of D_{FW,N+2} to
collocated positions of fixed_αD_{BW,N+2} for all blocks marked as UNCOVER;
(d) Copy inverted motion vectors from collocated positions of D_{FW,N+1} for all
other blocks (which were not marked as UNCOVER).
Fig. 15.6 Example of obtaining fixed_αD_{BW,N+2}, α_adj and COVER/UNCOVER marking. Firstly,
covered occlusions are fixed (top part); secondly, uncovered occlusions are processed (bottom part)
Our main work in the OP stage was mostly to adapt and optimise the implementation
for an ARM-based SoC. The complexity of this stage is not as high as that of the
ME and MCI stages, because the input of the OP stage consists of block-wise
motion fields and all operations can be performed in a block-wise manner. Nevertheless,
a fully fixed-point implementation with NEON SIMD parallelisation was
needed.
In the description of the OP stage, we have focused only on the main ideas. In fact,
we have to use slightly more sophisticated methods to get CoverMap, UncoverMap,
α_adj, and fixed_αD_{BW,N+2}; these are required because of rather noisy input motion
fields or simply complex motion in a scene. On the other hand, additional details
would make the description even more difficult to follow.
Fig. 15.7 1D visualisation of (a) clustering and (b) applying motion vector candidates
Motion compensation is the final stage of the FRC algorithm that directly interpo-
lates pixel information. As already mentioned, we used interpolation based on earlier
patented work by Chappalli and Kim (2012). The algorithm samples two nearest
frames in positions determined by one or two motion vector candidates. Motion
vector candidates come from the motion clustering algorithm mentioned above. It is
possible that only one motion vector candidate will be found for some blocks,
because the collection for a block may contain identical motion vectors. Because all
our previous sub-algorithms working with motion vectors were block-based, the MCI
stage naturally uses motion hypotheses that are constant over an interpolated block.
Firstly, we form two predictors for the interpolated block with coordinates (x, y)
by using each motion vector candidate cd[i] = (dx[i](x, y), dy[i](x, y)), i = 1, 2, from
CD[K]_{α,N+2}:
Pred[i](x, y) = F_{N+1}(x + α·dx[i](x, y), y + α·dy[i](x, y)) · (1 - α_adj(x, y))
             + F_{N+2}(x - (1-α)·dx[i](x, y), y - (1-α)·dy[i](x, y)) · α_adj(x, y),
where we use bilinear interpolation to obtain the pixel-wise version of α_adj(x, y),
which was naturally block-wise in the OP stage.
Although we use the backward motion vector candidates cd[i], which point to
frame F_{N+1} from the collocated block with position (x, y) in frame F_{N+2} (see
Fig. 15.7a), for calculating the predictors we use them as if the motion vector
candidate started at position (x - (1-α)·dx[i](x, y), y - (1-α)·dy[i](x, y)) of frame
F_{N+2} (see Fig. 15.7b). Thus, those motion vector candidates pass through the
interpolated frame F_{N+1+α} exactly at position (x, y). This simplifies both the
clustering process and the MCI stage, because we do not need to keep any
“projected” motion vectors, and each interpolated block can be processed
independently during the MCI stage.
Although we do not use fractional-pixel motion vectors in the ME and OP stages,
they may appear in the MCI stage, for example, due to the operation
(1-α)·dx[i](x, y). In this case, we use bicubic interpolation, which gives a good
balance between speed and quality.
An interesting question is how to mix several predictors Pred[i](x, y). Basically, we
obtain a block of the interpolated frame F_{N+1+α} as a sum of a few predictors with
some weights:

F_{N+1+α}(x, y) = ( Σ_{i=1..p} w[i](x, y) · Pred[i](x, y) ) / ( Σ_{i=1..p} w[i](x, y) ),
where p is the number of motion vector candidates used in the block (two in our
solution, so in the worst case, block patches at four positions from two keyframes
have to be sampled). w[i](x, y) are pixel-wise mixing weights, i.e. reliabilities of the
predictors. As stated in Chappalli and Kim (2012), we can calculate w[i](x, y) as a
function of the difference between the patches which were picked out from the
keyframes when the predictors were calculated:
where l, u are small values from the [0...5] range.
f(e) must be inversely proportional to the argument, for example:
where v_ref is a background motion vector. For blocks in the interpolated frame which
were marked as COVER or UNCOVER, v_ref is the motion vector from the collocated
block of fixed_αD_{BW,N+2}, and ω = 0. For other, normal blocks, ω = 1.
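The normalised mixing itself is straightforward to sketch. This is a minimal Python sketch; the reliability weights w[i] are assumed to be given by the f(e) function of the text, which is not reproduced here.

```python
def mix_predictors(preds, weights):
    """Weighted mixing of p motion-compensated predictors (p = 2 in the
    chapter's solution):
        F(x, y) = sum_i w[i] * Pred[i](x, y) / sum_i w[i]
    `weights` are the per-pixel reliabilities of the predictors.
    """
    total = sum(weights)
    return sum(w * p for w, p in zip(weights, preds)) / total


# A more reliable second predictor pulls the result towards its value.
mixed = mix_predictors([10.0, 20.0], [1.0, 3.0])
```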
The MCI stage has a high computational complexity, comparable to that of the ME
stage, because both algorithms require reading a few (extended) blocks from the
original frames per processed block. But in contrast to ME, each block of the
interpolated frame can be processed independently during MCI. This gives us the
opportunity to implement MCI on the GPU using OpenCL technology.
There are many situations in which the FRC problem is incredibly challenging and
sometimes plainly unsolvable. An important part of our FRC algorithm development
is detecting and correcting such corner cases. It is worth noting that detection and
correction are different tasks. When a corner case is detected, two possibilities
remain: the first is to fix the problems, and the second is to skip interpolation
(fallback). Fallback, in turn, can be done for the entire frame (global fallback) or
locally, only in those areas where problems arose (local fallback; Park et al. 2012).
We apply global fallback only: the nearest keyframe is placed instead of the
interpolated frame.
The following situations are handled in our solution:
1. Periodic textures. The presence of periodic textures in video frames adversely
affects ME reliability. Motion vectors can easily step over the texture period,
which causes severe interpolation artefacts. We have a detector of periodic
textures as well as a corrector for smoothing motion. The detector recognises
periodic areas in each keyframe. The corrector makes motion fields smooth in
detected areas after each 3DRS iteration. Regularisation is not applied in the
detected areas, and no fallback is applied.
2. Strong 1D-only features. Subpixel movements of long flat edges also confuse ME,
owing to different aliasing artefacts in neighbouring keyframes, especially in the
case of a fast and simple ME with one-pixel accuracy developed for real-time
operation. A detector and a corrector work in the same manner as for periodic
textures (detection of 1D features in keyframes and smoothing of the motion field
after each 3DRS iteration, with no regularisation in detected areas), with no fallback.
3. Change of scene brightness (fade in/out). ME’s match metric (MAD) and mixing
weights w in MCI are dependent on the absolute brightness of the image. We
detect global changes between neighbouring keyframes by a linear model using
histograms and correct one of the keyframes for the ME stage only. For relatively
small and global changes, such an adjustment works well. A global fallback strategy
is applied if the calculated parameters of the model are too large or the model is
incorrect.
4. Inconsistent motion/occlusions. When a foreground object is nonlinearly
deformed (a flying bird or quick finger movements), it is impossible to restore the
object correctly. The result of the interpolation looks like a mess of blocks, which
greatly spoils the subjective quality. We analyse CoverMap, UncoverMap, and the
motion fields, and obtain a reliability value for each block in an interpolated
frame. If the gathered reliability falls below some threshold for some set of
neighbouring interpolated blocks, we apply global fallback. A local fallback
strategy may also be applied by using the reliabilities, as in Park et al. (2012),
but the appearance of strong visual artefacts is still highly possible, although the
PSNR metric would be higher for local fallback.
The detectors mostly use pixel-wise operations on preliminarily downscaled
keyframes. All detectors either fit perfectly with SIMD operations or require little
computation. In our solution, the detectors work in a separate stage called the
preprocessing stage, except for inconsistent motion/occlusion detection, which is
performed during the OP stage. The correctors are implemented in the ME stage.
The target devices (Samsung Galaxy S8/S9/S20) use ARM 64-bit eight-core SoCs
with a big.LITTLE scheme and the following execution units: four power-efficient
processor cores (little CPUs), four high-performance processor cores (big CPUs),
and a powerful GPU. The stages of the FRC algorithm should be run on different
execution units simultaneously to achieve the highest number of frame interpolations
per second (IPS). This leads us to the idea of organising a pipeline of calculations
where the stages of the FRC algorithm run in parallel. The stages are connected by
buffers to hide random delays. The pipeline of FRC for 4× upconversion is shown
in Fig. 15.8.
As shown in Fig. 15.8, the following processing steps are performed simultaneously:
• Preprocessing of keyframe F_{N+6}
• Estimation of motion fields D_{FW,N+4} and D_{BW,N+5} using F_{N+4}, F_{N+5} and the
corresponding preprocessed data (periodic and 1D areas)
• Occlusion processing for interpolated frame F_{N+2+2/4}, where the inputs are
D_{BW,N+2}, D_{FW,N+2}, D_{BW,N+3}, and D_{FW,N+3}
• Motion-compensated interpolation of frame F_{N+2+1/4}, where the input is
fixed_{1/4}D_{BW,N+2} for phase α = 1/4
Note that the OP and MCI stages can be performed for multiple interpolated frames
between a pair of consecutive keyframes, and each time at least part of the
calculations will be different. In contrast, the preprocessing and ME stages are
performed only once per pair of consecutive keyframes and do not depend on the
number of interpolated frames.
In practice, our main use case is doubling the frame rate. We optimised each stage
and assigned the execution units (shown in Fig. 15.8) so that their durations became
close for some “quite difficult” scene in the case of doubling the frame rate.
15.8 Results
We were able to develop an FRC algorithm of commercial-level quality that could
work on battery-powered mobile devices. The algorithm uses only standard modules
of the SoC (CPU + GPU), which means upgrades or fixes are quite easy. This
algorithm has been integrated into two modes of the Samsung Galaxy camera:
• Super Slow Motion (SSM) – offline 2× conversion of HD (720p) video from
480 to 960 FPS; target performance: >80 IPS
• Motion Photo – real-time 2× conversion, playback of a 3-second FHD (1080p)
video clip stored within a JPEG file; target performance: >15 IPS
Table 15.4 Performance of the FRC solution on target mobile devices in various use cases, in
interpolations per second (IPS)

Device       Motion Photo (FHD), IPS   Super Slow Motion, IPS
Galaxy S8    81                        104
Galaxy S9    94                        116
The average conversion speed performance for HD and FHD videos can be seen
in Table 15.4. The difference is small because in the case of FHD, the ME stage and
part of the OP stage were performed at reduced resolution.
Detailed measurements of various FRC stages for 2× upconversion are shown in
Table 15.5. Here, Samsung Galaxy Note 10 based on Snapdragon 855 SM8150 was
used. ME and OP stages were performed on two big cores each, the preprocessing
stage (Prep.) uses two little cores, and the MC stage was performed by GPU. Any
fallbacks were disabled, so all frames were interpolated fully. The quality and speed
performance of the described FRC algorithms were checked for various video
streams, for which a detailed description can be found in Correa et al. (2016). We
used only the first 100 frames for time measurements and all frames for quality
measurements. All video sequences were downsized to HD resolution (Super Slow
Motion use case). To get ground truth, we halved the frame rate of the video
streams and then upconverted them back to the initial frame rate with the FRC.
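The evaluation protocol above (drop every second frame, re-create it by interpolation, compare with the dropped original) can be sketched like this; `evaluate_frc` and `interpolate_mid` are hypothetical names, and the standard PSNR definition is used:

```python
import numpy as np

def psnr(ref, test, peak=255.0):
    """Standard peak signal-to-noise ratio in dB."""
    mse = np.mean((ref.astype(np.float64) - test.astype(np.float64)) ** 2)
    return float('inf') if mse == 0 else 10.0 * np.log10(peak ** 2 / mse)

def evaluate_frc(frames, interpolate_mid):
    """Drop every odd frame, re-create it with interpolate_mid(prev, next),
    and return per-frame PSNR against the dropped originals."""
    scores = []
    for i in range(1, len(frames) - 1, 2):
        recreated = interpolate_mid(frames[i - 1], frames[i + 1])
        scores.append(psnr(frames[i], recreated))
    return scores
```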
It can be seen that as the magnitude and complexity of the motion in a scene
increase, the computational cost of the algorithm grows and the quality of the
interpolation decreases. The ME stage is the most variable and the most
computationally expensive.
For scenes with moderate movement, the algorithm shows satisfactory quality
and attractive speed performance. Even in video with fast movement, the
quality can be good in most regions. In the middle part of Fig. 15.9, an interpolated
frame from the “Jockey” video is depicted. The displacement of the background between
the depicted keyframes is near 50 pixels (see Fig. 15.10 to better understand the position
of occlusion areas and the movement of objects). An attentive reader can see quite a few
Fig. 15.9 Interpolation quality. Top: keyframe #200; middle: interpolated frame #201
(PSNR 35.84 dB for the luma component); bottom: keyframe #202
Fig. 15.10 Visualisation of movement. Identical features of fast-moving background are connected
by green lines in keyframes. Identical features of almost static foreground are connected by red lines
in keyframes
defects in the interpolated frame. The most unpleasant are those which appear
regularly on a foreground object (look at the horse's ears); these are perceived as
unnatural flickering of the foreground object. Artefacts that regularly arise in
occlusions are perceived as a halo around a moving foreground object. Artefacts
that appear only in individual frames (not regularly) hardly spoil the subjective
quality of the video.
References
Al-Kadi, G., Hoogerbrugge, J., Guntur, S., Terechko, A., Duranton, M., Eerenberg, O.: Meandering
based parallel 3DRS algorithm for the multicore era. In: Proceedings of the IEEE International
Conference on Consumer Electronics (2010). https://doi.org/10.1109/ICCE.2010.5418693
Bellers, E.B., van Gurp, J.W., Janssen, J.G.W.M., Braspenning, R., Wittebrood, R.: Solving
occlusion in frame-rate up-conversion. In: Digest of Technical Papers International Conference
on Consumer Electronics, pp. 1–2 (2007)
Chappalli, M.B., Kim, Y.-T.: System and method for motion compensation using a set of candidate
motion vectors obtained from digital video. US Patent 8,175,163 (2012)
Chi, C., Alvarez-Mesa, M., Juurlink, B.: Parallel scalability and efficiency of HEVC parallelization
approaches. IEEE Trans. Circuits Syst. Video Technol. 22(12), 1827–1838 (2012)
Cordes, C.N., de Haan, G.: Invited paper: key requirements for high quality picture-rate conversion.
Dig. Tech. Pap. 40(1), 850–853 (2009)
Correa, G., Assuncao, P., Agostini, L., da Silva Cruz, L.A.: Appendix A: Common test conditions
and video sequences. In: Complexity-Aware High Efficiency Video Coding, pp. 125–158.
Springer International Publishing, Cham (2016)
de Haan, G., Biezen, P., Huijgen, H., Ojo, O.A.: True-motion estimation with 3-D recursive search
block matching. IEEE Trans. Circuits Syst. Video Technol. 3(5), 368–379 (1993)
Fluegel, S., Klussmann, H., Pirsch, P., Schulz, M., Cisse, M., Gehrke, W.: A highly parallel sub-pel
accurate motion estimator for H.264. In: Proceedings of the IEEE 8th Workshop on Multimedia
Signal Processing, pp. 387–390 (2006)
Lertrattanapanich, S., Kim, Y.-T.: System and method for motion vector collection based on
K-means clustering for motion compensated interpolation of digital video. US Patent
8,861,603 (2014)
Park, S.-H., Ahn, T.-G., Park, S.-H., Kim, J.-H.: Advanced local fallback processing for motion-
compensated frame rate up-conversion. In: Proceedings of 2012 IEEE International Conference
on Consumer Electronics (ICCE), pp. 467–468 (2012)
Pohl, P., Anisimovsky, V., Kovliga, I., Gruzdev, A., Arzumanyan, R.: Real-time 3DRS motion
estimation for frame-rate conversion. Electron. Imaging (13), 1–5 (2018). https://doi.org/10.2352/ISSN.2470-1173.2018.13.IPAS-328
Chapter 16
Approaches and Methods to Iris
Recognition for Mobile
16.1 Introduction
The modern smartphone is not a simple phone but a device which has access to,
or contains, a huge amount of personal information. Most smartphones can
perform payment operations through services such as Samsung Pay, Apple Pay, Google
Pay, etc. Thus, phone unlock protection and authentication for payment and for
access to secure folders and files are required. Among all approaches to authenticating
users of mobile devices, the most suitable are knowledge-based and biometric
methods. Knowledge-based methods ask for something the user knows (a PIN,
password, or pattern). Biometric methods refer to the use of distinctive
anatomical and behavioral characteristics (fingerprints, face, iris, voice, etc.) for
automatically recognizing a person. Today, hundreds of millions of smartphone
users around the world praise the convenience and security provided by biometrics
(Das et al. 2018).
The first commercially successful biometric authentication technology for mobile
devices was fingerprint recognition. Although fingerprint-based authentication
shows high distinctiveness, it still has drawbacks (Daugman 2006). The iris has
several important advantages in comparison with other biometric traits (Corcoran
et al. 2014): the iris image capturing procedure is contactless, and iris recognition
can be considered a more secure and convenient authentication method, especially
for mobile devices.
A. M. Fartukov (*)
Samsung R&D Institute Rus (SRR), Moscow, Russia
e-mail: a.fartukov@samsung.com
G. A. Odinokikh · V. S. Gnatyuk
Independent Researcher, Moscow, Russia
e-mail: g.odinokikh@gmail.com; vitgracer@gmail.com
This chapter is dedicated to an iris recognition solution for mobile devices.
Section 16.2 describes the iris as a recognition object, gives a brief review of the
conventional iris recognition pipeline, and formulates the main challenges and
requirements for iris recognition on mobile devices. In Sect. 16.3, the proposed iris
recognition solution for mobile devices is presented; special attention is paid to
interaction with the user and the capturing system, and the iris camera control
algorithm is described. Section 16.4 contains a brief description of the developed
iris feature extraction and matching algorithm; testing results and a comparison with
state-of-the-art iris feature extraction and matching algorithms are also provided.
In Sect. 16.5, limitations of iris recognition are discussed, and several approaches
which allow these limitations to be mitigated are described.
The iris is a highly protected, internal organ of the eye, which allows contactless
capturing (Fig. 16.1). The iris is unique for every person, even for twins. The
uniqueness of the iris is in its texture pattern that is determined by melanocytes
(pigment cells) and circular and radial smooth muscle fibers (Tortora and Nielsen
2010). Although the iris is stable over a lifetime, it is not a constant object, due to
permanent changes of pupil size. The iris consists of muscle tissue comprising a
sphincter muscle that causes the pupil to contract and a group of dilator muscles that
cause it to dilate (Fig. 16.2). This pupil-size variation is one of the main sources of
intra-class variation, which should be considered in the development of an iris
recognition algorithm (Daugman 2004).
The iris is a highly informative biometric trait (Daugman 1993), which is why iris
recognition provides high recognition accuracy and reliability.
A conventional iris recognition system includes the following main steps:
• Iris image acquisition
• Quality checking of the obtained iris image
• Iris area segmentation
• Feature extraction (Fig. 16.3).
Fig. 16.2 Responses of the pupil to light of varying brightness: (a) bright light; (b) normal light; (c)
dim light. (Reproduced with permission from Tortora and Nielsen (2010))
The recognition algorithm should handle input iris images captured under ambient
illumination, which varies over a range from 10⁻⁴ Lux at night to 10⁵ Lux under
direct sunlight. The changing
capturing environment also assumes the randomness of the locations of the light
sources, along with their unique characteristics, which creates a random distribution
of the illuminance in the iris area. The mentioned factors can cause a deformation of
the iris texture due to a change in the pupil size, making users squint and degrading
the overall image quality (Fig. 16.4).
Moreover, various factors related to interaction with the mobile device and the user
should be considered:
• The user could wear glasses or contact lenses.
• The user could try to perform the authentication attempt while walking, or simply
suffer from a hand tremor, causing the device to shake.
• The user could hold the device too far from or too close to themselves, so that the
iris falls outside the camera's depth of field.
• The iris area could be occluded by eyelids and eyelashes if the user's eye is not
opened enough.
All these and many other factors affect the quality of the input iris images, thus
influencing the accuracy of the recognition (Tabassi 2011).
Mobile iris recognition is intended for daily use. Thus, it must provide easy user
interaction and a high recognition speed, which is determined by the computational
complexity. There is a trade-off between computational complexity and power
consumption: recognition should be performed at the best camera frame rate and,
at the same time, should not consume much power. Recognition should be performed
in a special secure (trusted) execution environment, which provides limited
computational resources – a restricted number of available processor cores and
computational hardware accelerators, reduced processor core frequencies, and a
limited amount of memory (ARM Security Technology 2009). These facts should be
taken into account in the early stages of biometric algorithm development.
402 A. M. Fartukov et al.
Each quality assessment measure is performed immediately after the information for its
evaluation becomes available. This allows us not to waste computational resources
(and hence energy) on processing data unsuitable for further processing, and to provide
feedback to the user as early as possible. It should be noted that the structure of the
algorithm depicted in Fig. 16.5 is a modification of the algorithm proposed by
Odinokikh et al. (2018): the special quality buffer was replaced with the straightforward
structure shown in Fig. 16.5. All the other parts of the algorithm (except the feature
extraction and matching stages) and the quality assessment checks were used without
modification.
Besides interacting with a user, the mobile recognition system can communicate
with the iris capturing hardware and additional sensors such as an illuminometer,
rangefinder, etc. The obtained information can be used to control the parameters of
the iris capturing hardware and to adapt the algorithm to constantly changing
environmental conditions. The scheme summarizing the described approach is
presented in Fig. 16.6. Details can be found in Odinokikh et al. (2019a).
Along with the possibility of controlling the iris capturing hardware from the
recognition algorithm itself, a separate algorithm for controlling iris camera
parameters can be applied. The purpose of such an algorithm is to provide fast
correction of the sensor's shutter speed (also known as exposure time), gain, and/or
parameters of the active illuminator to obtain iris images suitable for recognition.
We propose a two-stage algorithm for automatic camera parameter adjustment
that offers fast exposure adjustment on the basis of a single shot, with further
iterative camera parameter refinement.
Many of the existing automatic exposure algorithms have been developed to
obtain optimal image quality in difficult environmental conditions. For the most
complicated scenes, a significant number of these algorithms have drawbacks:
poor exposure estimation, complexity of region-of-interest detection, and others.
The first stage relies on the mean sample value (MSV) of the five-bin image
brightness histogram h:

MSV = ( Σ_{i=0}^{4} (i + 1) hi ) / ( Σ_{i=0}^{4} hi ),

μ = 1 / (1 + e^{−cp}),

where hi is the number of pixels in bin i, c is a constant, and μ relates the
normalized exposure time p to the MSV.
Fig. 16.7 The experimental dependency between an exposure time and mean sample value
The optimal normalized exposure time p can be obtained from the value μ, which
empirically determines the optimal MSV value that allows the quality assessment
checks described below in this section to be passed successfully.
Since the exposure time varies in the (0; Emax] interval, the suboptimal exposure
time E is obtained as:
E = E0 (p + 1) / (p0 + 1),
where E0 is the exposure time of the captured scene, and p and p0 are normalized
exposure times calculated with the optimal MSV μ and the captured scene MSV μ0,
respectively.
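The fast stage can be sketched as follows. This sketch assumes the extracted sigmoid lost a minus sign (μ = 1/(1 + e^(−cp))), that the MSV is normalized to (0, 1), and an illustrative value for the constant c; all function names are hypothetical:

```python
import math
import numpy as np

def msv(image):
    """Mean sample value over a 5-bin brightness histogram,
    normalized here to [0, 1] (an assumption of this sketch)."""
    h, _ = np.histogram(image, bins=5, range=(0, 256))
    raw = sum((i + 1) * h[i] for i in range(5)) / h.sum()  # raw MSV in [1, 5]
    return (raw - 1.0) / 4.0

def normalized_exposure(mu, c=4.0):
    """Invert mu = 1 / (1 + exp(-c * p)) to recover p; mu must be in (0, 1),
    and c = 4.0 is an assumed constant."""
    return -math.log(1.0 / mu - 1.0) / c

def suboptimal_exposure(e0, mu_opt, mu_scene, c=4.0):
    """One-shot exposure guess E = E0 * (p + 1) / (p0 + 1)."""
    p = normalized_exposure(mu_opt, c)
    p0 = normalized_exposure(mu_scene, c)
    return e0 * (p + 1.0) / (p0 + 1.0)
```

If the scene MSV already equals the optimal MSV, the formula leaves the exposure unchanged, as expected for a one-shot correction.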
If the MSV lies outside the confidence zone, μ0 ∈ [μmin; μmax], we should first make a
“blind” guess at a correct exposure time. This is done by subtracting or adding the
predefined coefficient Eδ to the current exposure E0 several times until the MSV
falls between μmin and μmax. The μmin and μmax values are determined from the
experimental dependency between the exposure time and the normalized MSV.
If there is no room for further exposure adjustment, we stop and do not adjust the
scene anymore, because the sensor is probably blinded or covered by something.
Once the initial exposure time guess E is found, we try to further refine the
capture parameters.
Table 16.1 The iris image quality difference between competing auto-exposure approaches

Approach                      Full frame     Eye region
Global exposure adjustment    perfect        underexposed
Proposed                      overexposed    perfect
The key idea of the second stage is the construction of a mask to precisely adjust the
camera parameters to the face region brightness in order to obtain the most accurate
and fast iris recognition. For the recognition task, it is important to pay more attention
to eye regions; it is not enough to provide the optimal full-frame visual quality
delivered by well-known global exposure adjustment algorithms (Battiato et al.
2009). Table 16.1 illustrates this drawback.
In order to obtain a face mask, a database of indoor and outdoor image sequences
for 16 users was collected in the following manner. Each user tries to pass the
enrollment procedure on a mobile device in normal conditions, and the
corresponding image sequence is collected; this sequence is used for enrollment
template creation. Next, the user tries to verify themselves, and the corresponding
image sequence (probe sequence) is also collected. All frames from probe sequences
are used for probe (probe template) creation. After that, dissimilarity scores (Hamming
distances) between the user's enrolled template and each probe are calculated
(Odinokikh et al. 2017); in other words, only genuine comparisons are performed.
Each score is compared with the predefined threshold HDthresh, and the vector of
labels Y ≔ {yi}, i = 1…Nscores is created, where Nscores is the number of verification
attempts. The vector of labels represents the dissimilarity between probes and
enrollment templates:
yi = 0 if HDi > HDthresh,
yi = 1 if HDi < HDthresh.
Each label yi of Y shows if the person was successfully recognized at the frame i.
After the calculation of the vector Y, we downscale and reshape each frame of
probe sequences to get the feature vector xi and construct the matrix of feature
vectors X:
X = (x1; x2; …; xNscores).
Using the feature matrix X and the vector of labels Y, we calculate the logistic
regression coefficients of each feature, where the coefficients represent the signifi-
cance of each image pixel for the successful user verification (Fig. 16.9).
As a result, the most significant pixels emphasize the eye regions and the periocular
area. This method avoids handcrafted mask construction: it automatically finds the
regions that are important for correct recognition. The obtained mask values are
used for weighted MSV estimation, where each input pixel has a significance score
that determines its weight in the image histogram.
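A minimal sketch of the mask-learning procedure, assuming plain gradient-descent logistic regression in place of whatever solver the authors used; `learn_pixel_mask` and `weighted_msv` are hypothetical names:

```python
import numpy as np

def learn_pixel_mask(frames, labels, lr=0.1, steps=500):
    """Fit per-pixel logistic-regression weights: frames is (N, H, W),
    labels[i] = 1 if verification frame i matched (HD_i < threshold).
    Returns |coefficients| reshaped to (H, W) as a significance mask."""
    n, h, w = frames.shape
    x = frames.reshape(n, h * w).astype(np.float64)
    x = (x - x.mean(axis=0)) / (x.std(axis=0) + 1e-8)   # standardize pixels
    y = np.asarray(labels, dtype=np.float64)
    coef = np.zeros(h * w)
    for _ in range(steps):                               # gradient descent
        p = 1.0 / (1.0 + np.exp(-x @ coef))
        coef -= lr * x.T @ (p - y) / n
    mask = np.abs(coef).reshape(h, w)
    return mask / (mask.max() + 1e-8)

def weighted_msv(image, mask):
    """MSV where each pixel contributes with its significance weight."""
    h = np.zeros(5)
    bins = np.minimum((image.astype(int) * 5) // 256, 4)
    np.add.at(h, bins.ravel(), mask.ravel())             # weighted histogram
    return float(sum((i + 1) * h[i] for i in range(5)) / h.sum())
```

On synthetic data where a single pixel determines the match label, the learned mask correctly peaks at that pixel.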
After mask calculation, the main goal is to set camera parameters that make the MSV
fall into a predefined interval which leads to the optimal recognition accuracy. To get
the interval boundaries, each pair (HDi, MSVi) is mapped onto the coordinate plane
(Fig. 16.10a), and points with HD > HDthresh are removed to exclude probes which
corresponded to rejection during verification (Fig. 16.10b). The optimal image
quality interval center is defined as:
p = argmax_x f(x).
Here f(x) is the distribution density function, and p is the optimal MSV value
corresponding to the image with the most appropriate quality for recognition.
Visually, the plotted distribution allows three significant clusters to be distinguished:
noise, points with low pairwise distance density, and points with high pairwise
distance density. To find the optimal image quality interval borders, we partition the
plotted point pairs into three clusters using the K-means algorithm (Fig. 16.10c).
The densest “red” cluster, with the minimal pairwise point distance, is used to
determine the optimal quality interval borders. The calculated interval can be
represented as [p − δ, p + δ], where the δ parameter is determined by the cluster borders.
Fig. 16.10 Visualization of the cluster construction procedure: (a) all (HDi, MSVi) pairs are
mapped onto a coordinate plane; (b) points with HD > HDthresh are excluded and the distribution
density is plotted; (c) the obtained clusters
During the iterative refinement stage, the exposure E and gain G are increased or
decreased in steps EΔ and GΔ:

E = E + EΔ,  G = G + GΔ;

E = E − EΔ,  G = G − GΔ.
This iterative technique allows the optimal camera parameter adjustment to be
performed under different illumination conditions.
The proposed algorithm was tested as a part of the iris recognition system
(Odinokikh et al. 2018). This system operates in a mode where the False Acceptance
Rate (FAR) = 10⁻⁷. Testing was performed using a mobile phone based on the
Exynos 8895 and equipped with NIR iris capturing hardware. Testing involved
10 users (30 verification attempts were made for each user). The enrollment proce-
dure was performed in an indoor environment, while the verification procedure was
done under harsh incandescent lighting in order to prove that the proposed auto-
exposure algorithm improves recognition.
To estimate algorithm performance, we use two parameters: FRR (False Rejec-
tion Rate), a value which shows how many genuine comparisons were rejected
incorrectly, and recognition time, which determines the time interval between the
start of the verification procedure and successful verification. The results of com-
parison are shown in Table 16.2.
In accordance with the obtained results, if the fast parameter adjustment stage of the
algorithm is removed, the algorithm still adjusts the camera parameters in an
optimal way, but the exposure adaptation time increases significantly because
of the absence of suboptimal exposure values. If the iterative adjustment stage is
removed, the adaptation time is small, but the face illumination is estimated in a
non-optimal way, and the number of false rejections increases. Therefore, it is
crucial to use both the fast and the iterative steps to reduce both recognition time
and false rejection rate.
A more detailed description of the proposed method with minor modifications
and comparison with well-known algorithms for camera parameter adjustment can
be found in Gnatyuk et al. (2019).
Table 16.2 Performance comparison for the proposed algorithm of automatic camera parameter
adjustment

Value                 No adjustment   First stage only        Second stage only        Proposed method
                                      (fast exposure adj.)    (iterative refinement)   (two stages)
FRR (%)               82.3            20.0                    1.6                      1.6
Recognition time (s)  7.5             0.15                    1.5                      0.15
As mentioned in Sect. 16.2, the final stage of iris recognition includes the construction
of the iris feature vector. This procedure extracts the iris texture information relevant
to its subsequent comparison. The input to feature vector construction is the
normalized iris image (Daugman 2004), as depicted in Fig. 16.11. Since the iris
region can be occluded by eyelids, eyelashes, reflections, and other artifacts, such
areas contain information irrelevant to subsequent iris texture matching and are not
used for feature extraction. It should be noted that feature extraction and matching
are considered together because they are closely connected to each other.
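The normalized iris image is produced by Daugman's rubber-sheet mapping, which unrolls the ring between the pupil and iris boundaries onto a fixed-size rectangle. A simplified sketch (nearest-neighbour sampling, linear interpolation between the two circle boundaries; the function and parameter names are illustrative):

```python
import numpy as np

def normalize_iris(image, pupil, iris, out_h=32, out_w=256):
    """Daugman-style rubber-sheet mapping: sample the ring between the
    pupil circle (xp, yp, rp) and iris circle (xi, yi, ri) onto a grid."""
    xp, yp, rp = pupil
    xi, yi, ri = iris
    out = np.zeros((out_h, out_w), dtype=image.dtype)
    for col in range(out_w):
        theta = 2.0 * np.pi * col / out_w
        # boundary points on the pupil and iris circles at this angle
        x0, y0 = xp + rp * np.cos(theta), yp + rp * np.sin(theta)
        x1, y1 = xi + ri * np.cos(theta), yi + ri * np.sin(theta)
        for row in range(out_h):
            r = (row + 0.5) / out_h          # radial position in [0, 1]
            x = (1 - r) * x0 + r * x1
            y = (1 - r) * y0 + r * y1
            out[row, col] = image[int(round(y)) % image.shape[0],
                                  int(round(x)) % image.shape[1]]
    return out
```

The fixed output size is what makes the representation invariant to pupil dilation and to the eye's distance from the camera, which the network below exploits.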
Many feature extraction methods, considering iris texture patterns at different
levels of detail, have been proposed (Bowyer et al. 2008). A significant leap in
reliability and quality in the field was achieved with the adoption of deep neural
networks (DNNs), and there have since been numerous attempts to apply DNNs to
iris recognition. In particular, Gangwar and Joshi (2016) introduced
DeepIrisNet as a model combining all successful deep learning techniques
known at the time; the authors thoroughly investigated the obtained features and
produced a strong baseline for subsequent work. An approach with two fully
convolutional networks (FCNs) and a modified triplet loss function was recently
proposed in Zhao and Kumar (2017): one of the networks is used for iris template
extraction, whereas the second produces the accompanying mask. Fuzzy image
enhancement combined with simple linear iterative clustering and a self-organizing
map (SOM) neural network was proposed in Abate et al. (2017); despite the
method being designed for iris recognition on a mobile device, real-time performance
was not achieved. Another recent work by Zhang et al. (2018), declared suitable
for the mobile case, proposes a two-headed (iris and periocular) CNN with
fusion of embeddings. Thus, no optimal solution for iris feature extraction
and matching has been presented in published papers.
Fig. 16.12 Proposed model scheme of iris feature extraction and matching (Odinokikh et al.
2019c)
This section briefly describes the iris feature extraction and matching method
presented in Odinokikh et al. (2019c).
The proposed method is a CNN designed to utilize the advantages of the
normalized iris image as an invariant, both low- and high-level representations of
discriminative features, and information about the iris area and pupil dilation. It
contains iris feature extraction and matching parts trained together (Fig. 16.12).
It is known that shallow layers in CNNs are responsible for the extraction of
low-level textural information, while a high-level representation is achieved with
depth. The basic elements of the shallow feature extraction block and their relations
are depicted in Fig. 16.12. High-level (deep) feature representation is performed by
convolution block #2: the feature maps coming from block #1 are concatenated by
channels and passed through it. Concatenation is meaningful at this stage because of
the invariance property of the normalized iris image. The output vector FVdeep
reflects the high-level representation of discriminative features and is assumed to
handle complex nonlinear distortions of the iris texture caused by the changing
environment.
Match score calculation is performed on FVdeep, the shallow feature vector FVsh,
and additional information (FVenv) about the iris area and pupil dilation, using the
variational inference technique. The depth-wise separable convolution block, which
is memory- and computationally efficient, was chosen as the basic structural element
for the entire network architecture. Together with the lightweight CNN architecture,
this allows the model to operate in real time on a device with highly limited
computational power.
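A depthwise separable convolution factors a standard convolution into a per-channel spatial filter followed by a 1×1 channel-mixing step, reducing the parameter count from Cout·C·k² to C·k² + Cout·C. A minimal NumPy sketch (explicit loops for clarity, not speed; this is not the network's actual layer):

```python
import numpy as np

def depthwise_separable_conv(x, dw_kernels, pw_weights):
    """x: (C, H, W); dw_kernels: (C, k, k), one spatial filter per channel;
    pw_weights: (C_out, C), the 1x1 mixing. 'Valid' padding, stride 1."""
    c, h, w = x.shape
    k = dw_kernels.shape[1]
    oh, ow = h - k + 1, w - k + 1
    dw = np.zeros((c, oh, ow))
    for ch in range(c):                 # depthwise: per-channel spatial filter
        for i in range(oh):
            for j in range(ow):
                dw[ch, i, j] = np.sum(x[ch, i:i+k, j:j+k] * dw_kernels[ch])
    # pointwise: 1x1 convolution mixes the channels
    return np.tensordot(pw_weights, dw, axes=([1], [0]))
```

For example, with C = 32, Cout = 64, k = 3, a standard convolution needs 18,432 weights while the separable version needs 2,336, which is the kind of saving that matters inside a trusted execution environment.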
The following methods were selected as the state of the art: FCN + Extended Triplet
Loss (ETL) (Zhao and Kumar 2017) and DeepIrisNet (Gangwar and Joshi 2016). It
should also be noted that the results of the lightweight CNN proposed in Zhang et al.
(2018) were obtained on the same datasets used for testing the proposed method;
for detailed results, please refer to Zhang et al. (2018). Many other methods were
excluded from consideration due to their computational complexity and unsuitability
for mobile applications.
Three different datasets were used for training and evaluation (CASIA 2015):
CASIA-Iris-M1-S2 (CMS2), CASIA-Iris-M1-S3 (CMS3), and one more (Iris-
Mobile, IM) collected privately using a mobile device with an embedded NIR
camera. The latter was collected to simulate real authentication scenarios of a mobile
device user: images were captured under highly changing illumination, both indoors
and outdoors, with and without glasses. More detailed specifications of the datasets
are given in Table 16.3.
The recognition accuracy results are presented in Table 16.4. ROC curves
obtained for comparison with state-of-the-art methods on the CMS2, CMS3, and IM
datasets are depicted in Fig. 16.13.
The proposed method outperforms the chosen state-of-the-art ones on all the
datasets. After the division into subsets, it became impossible to estimate the FNMR
at FMR = 10⁻⁷ for the CMS2 and CMS3 datasets, since the number of comparisons
in the test sets did not exceed ten million. So, yet another experiment was run to
estimate the performance of the proposed model on those datasets without training
on them: the model trained on IM was evaluated on the entire CMS2 and CMS3
datasets in order to
Fig. 16.13 ROC curves obtained for comparison with state-of-the-art methods on different
datasets: (a) CASIA-Iris-M1-S2 (CMS2); (b) CASIA-Iris-M1-S3 (CMS3); (c) Iris-Mobile (IM)
get the FNMR at FMR = 10⁻⁷ (CrossDB). The obtained results demonstrate the high
generalization ability of the model.
A mobile device equipped with the Qualcomm Snapdragon 835 CPU was used
for estimating the overall execution time for these iris feature extraction and
matching methods. It should be noted that a single core of the CPU was used. The
results are summarized in Table 16.4.
Thus, the proposed algorithm showed robustness to the high variability of iris
representation caused by changes in the environment and by physiological features
of the iris itself. The benefit of using shallow textural features, feature fusion, and
variational inference as a regularization technique was also investigated in the
context of the iris recognition task. Despite the fact that the approach is based on
deep learning, it is capable of operating in real time on a mobile device, in a secure
environment with substantially limited computational power.
In conclusion, we would like to shed light on several open issues of iris recognition
technology. The first issue is related to the limitations on the use of iris recognition
in extreme environmental conditions. In near darkness, the pupil dilates and masks
almost all of the iris texture area. Outdoors in direct sunlight, the user cannot open
their eyes wide enough, and the iris texture can be masked by reflections. The second
issue is related to wearing glasses: the use of active illumination leads to glares on
the glasses, which can mask the iris area. In this case, the user has to change the
position of the mobile device for successful recognition, or take off the glasses,
which can be inconvenient, especially in daily usage. Thus, the root of the majority
of issues is in obtaining enough iris texture area for reliable recognition. It has been
observed that at least 40% of the iris area should be visible to achieve the given
accuracy level.
To mitigate the mentioned issues, several approaches, apart from changes to the
iris capturing hardware, have been proposed. One of them is the well-known multi-modal
recognition (e.g., fusion of iris and face (or periocular area) recognition), as described
in Ross et al. (2006). In this section, only approaches related to the eye itself are
considered.
The first approach is based on the idea of multi-instance iris recognition, which
performs the fusion of the two irises and uses their relative spatial information and
several factors that describe the environment. The iris is often significantly occluded
by the eyelids, eyelashes, highlights, etc. This happens mainly because of a complex
environment in which the user cannot open the eyes wide enough (bright
illumination, windy weather, etc.), which makes the multi-instance approach
reasonable when the input image contains both eyes at the same time.
The final dissimilarity score is calculated as a logistic function of the form:
Score = 1 / (1 + exp(−Σ_{i=0}^{6} wi Mi)),

Δdnorm = |dLEFT − dRIGHT| / (dLEFT + dRIGHT);

davg = (dLEFT + dRIGHT) / 2.
AOImin and AOImax are the minimum and maximum values of the area of intersection
between the two binary masks Mprobe and Menroll in each pair:

AOI = Σ Mc / (Mc_height · Mc_width),  Mc = Mprobe · Menroll (element-wise).
ΔNDmin and ΔNDmax are the minimum and maximum values of the normalized
distance ΔND between the centres of the pupil and the iris:

ΔND = sqrt( (NDXprobe − NDXenroll)² + (NDYprobe − NDYenroll)² ),

NDX = (xP − xI) / RI,  NDY = (yP − yI) / RI,
where (xP, yP) are the coordinates of the centre of the pupil, and RP is its radius;
(xI, yI) are the coordinates of the centre of the iris, and RI is its radius (Fig. 16.14).
ΔPIRavg represents the difference in pupil dilation between the enrollment and the
probe, based on the ratio PIR = RP/RI:

ΔPIRavg = ( |PIR_enroll_LEFT − PIR_probe_LEFT| + |PIR_enroll_RIGHT − PIR_probe_RIGHT| ) / 2.
The seven weight coefficients wi were obtained by training the classifier on
genuine and impostor matches on a small subset of the data. In case only
one of the two feature vectors is extracted, all the pairs of values used in the weighted
sum are assumed to be equal.
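The fusion score and the first two factors can be sketched directly from the formulas above, assuming the sign dropped from the extracted exponent is a minus; the function names are illustrative:

```python
import math

def fusion_score(weights, features):
    """Score = 1 / (1 + exp(-sum_i w_i * M_i)) over the seven fusion
    factors (per-eye dissimilarities, AOI, ΔND, ΔPIR, ...)."""
    z = sum(w * m for w, m in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))

def delta_d_norm(d_left, d_right):
    """Normalized difference of the per-eye dissimilarity scores."""
    return abs(d_left - d_right) / (d_left + d_right)

def d_avg(d_left, d_right):
    """Average of the per-eye dissimilarity scores."""
    return (d_left + d_right) / 2.0
```

A zero-valued weighted sum yields Score = 0.5, the midpoint of the logistic; the trained weights shift genuine pairs below and impostor pairs above the decision threshold.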
The proposed method allowed the threshold for the visible iris area to be decreased
from 40% to 29% during verification/identification without any loss in accuracy or
performance, which leads to a decrease in the overall FRR (in other words, improved
user convenience).
The first step of the proposed procedure (besides a successful pass of verification)
is an additional quality check of the probe feature vector FVprobe, which can be
considered as input for the update procedure. It should be noted that the thresholds
used for the quality check differ between the enrollment and verification modes.
In particular, the normalized eye opening (NEO) value, described below, is set to
0.5 for enrollment and 0.2 for verification; the non-masked area (NMA) of the iris
(not occluded by any noise) is set to 0.4 and 0.29 for the enrollment and the probe,
respectively (in the case of multi-instance iris recognition).
The NEO value reflects the eye opening condition and is calculated as:

$$NEO = \frac{E_l + E_u}{2 R_I}.$$
Here E_l and E_u are the lower and upper eyelid positions, determined as the vertical distances from the pupil centre (P_c) to the eyelids (Fig. 16.15). One of the methods for
eyelid position detection is presented in Odinokikh et al. (2019b). It is based on
applying multi-directional 2D Gabor filtering and is suitable for running on mobile
devices.
Additional checking of the probe feature vector consists of applying enrollment
thresholds for NEO and NMA values associated with FVprobe.
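The NEO formula and the mode-dependent thresholds described above can be sketched as follows (the function names and the `mode` strings are illustrative; the threshold values are the ones given in the text):

```python
def normalized_eye_opening(e_lower, e_upper, iris_radius):
    """NEO: eyelid-to-eyelid opening relative to the iris diameter."""
    return (e_lower + e_upper) / (2.0 * iris_radius)

def passes_quality_check(neo, nma, mode):
    """Apply the NEO/NMA thresholds; 'enroll' is stricter than 'verify'."""
    thresholds = {"enroll": {"neo": 0.5, "nma": 0.4},
                  "verify": {"neo": 0.2, "nma": 0.29}}
    t = thresholds[mode]
    return neo >= t["neo"] and nma >= t["nma"]

# A widely opened eye with little occlusion passes even the stricter
# enrollment thresholds, as required for an update candidate.
neo = normalized_eye_opening(20.0, 24.0, 40.0)  # (20 + 24) / 80 = 0.55
print(passes_quality_check(neo, nma=0.45, mode="enroll"))  # True
```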
The second step consists of checking the possibility to update the enrolled
template. The structure of the enrolled template is depicted in Fig. 16.16.
All FVs in the enrolled template are divided into three groups: initially enrolled
FVs obtained during enrollment of a new user and two groups corresponding to FVs
obtained at high illumination and low illumination conditions respectively. The latter
groups are initially empty and receive new FVs through appending or replacing. It is
important to note that the group of initially enrolled FVs is not updated to prevent
possible degradation of recognition accuracy.
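The three-group template structure can be sketched as a small container (the group size limit is an assumption for illustration; the text only says the sizes are predefined):

```python
from dataclasses import dataclass, field

@dataclass
class EnrolledTemplate:
    """Enrolled template: a fixed initial group plus two adaptive groups."""
    initial: list                              # FVs from enrollment; never updated
    lpir: list = field(default_factory=list)   # FVs captured at high illumination
    hpir: list = field(default_factory=list)   # FVs captured at low illumination
    max_group_size: int = 5                    # assumed limit, not given in the text

template = EnrolledTemplate(initial=["fv1", "fv2", "fv3", "fv4", "fv5"])
print(len(template.initial), len(template.lpir), len(template.hpir))  # 5 0 0
```

Keeping `initial` immutable mirrors the safeguard against accuracy degradation described above.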
Each FV in the enrolled template contains information about the corresponding
PIR value and average mutual dissimilarity score:
$$d_{am}(FV_k) = \frac{1}{N} \sum_{i \in \{1,\dots,N\},\; i \neq k} d(FV_k, FV_i).$$
Here N is the current number of FVs in the enrolled template. The d_am(FV_k) values are updated after each update cycle.
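Given a precomputed matrix of pairwise dissimilarity scores, the average mutual dissimilarity can be sketched as (names are illustrative; normalization by N follows the formula above):

```python
def average_mutual_dissimilarity(k, scores):
    """d_am(FV_k): mean dissimilarity of FV_k to every other FV in the template.

    scores: square matrix with scores[k][i] = d(FV_k, FV_i).
    """
    n = len(scores)
    return sum(scores[k][i] for i in range(n) if i != k) / n

scores = [[0.0, 0.2, 0.4],
          [0.2, 0.0, 0.6],
          [0.4, 0.6, 0.0]]
print(round(average_mutual_dissimilarity(0, scores), 6))  # (0.2 + 0.4) / 3 = 0.2
```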
Let FV^E_1, …, FV^E_M denote the set of M initially enrolled FVs. If

$$PIR(FV_{probe}) < \min\left( PIR(FV^E_1), \dots, PIR(FV^E_M) \right),$$

then FVprobe is considered as a candidate for the update of the group of FVs obtained at high illumination (the lPIR group). Otherwise, if

$$PIR(FV_{probe}) > \max\left( PIR(FV^E_1), \dots, PIR(FV^E_M) \right),$$

then FVprobe is considered as a candidate for the update of the corresponding group of FVs obtained at low illumination (the hPIR group).
The lPIR and hPIR groups have predefined maximum sizes. If the selected group is not full, FVprobe is added to it. Otherwise, the following rules are applied. If PIR(FVprobe) is the minimal value among all FVs inside the lPIR group, then FVprobe replaces the FV with the minimal PIR value in the lPIR group. Similarly, if PIR(FVprobe) is the maximal value among all FVs inside the hPIR group, then FVprobe replaces the FV with the maximal PIR value in the hPIR group.
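The group selection and the full-group replacement rules can be sketched as follows (names are illustrative, and only PIR values are stored here for brevity; real FVs would carry the full feature data):

```python
def update_group(template_pirs, probe_pir, lpir, hpir, max_size):
    """Route FV_probe into the lPIR or hPIR group and apply the full-group rules.

    template_pirs: PIR values of the initially enrolled FVs (never updated).
    lpir / hpir:   PIR values stored in the two adaptive groups (mutated in place).
    Returns the name of the updated group, or None if no rule fired here.
    """
    if probe_pir < min(template_pirs):    # small pupil -> high illumination
        group, name, extreme = lpir, "lPIR", min
    elif probe_pir > max(template_pirs):  # large pupil -> low illumination
        group, name, extreme = hpir, "hPIR", max
    else:
        return None
    if len(group) < max_size:             # group not full: simply append
        group.append(probe_pir)
        return name
    worst = extreme(group)                # group full: replace only if more extreme
    if (extreme is min and probe_pir < worst) or (extreme is max and probe_pir > worst):
        group[group.index(worst)] = probe_pir
        return name
    return None  # here the D / d_am rules described below take over

lpir, hpir = [], []
print(update_group([0.35, 0.40], 0.30, lpir, hpir, max_size=2))  # lPIR
print(lpir)  # [0.3]
```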
Otherwise, the FV closest to FVprobe in terms of PIR value is searched for inside the selected group. Let FVi denote this closest feature vector, and FVi−1 and FVi+1 its neighbors in terms of PIR value.
Then, the following values are calculated:

$$D = \left| PIR(FV_i) - PIR_{avg} \right| - \left| PIR\left(FV_{probe}\right) - PIR_{avg} \right|,$$

$$PIR_{avg} = \frac{1}{2}\left( PIR(FV_{i-1}) + PIR(FV_{i+1}) \right).$$
If D exceeds the predefined threshold, then FVprobe replaces FVi. This simple rule makes it possible to obtain a group of FVs that are distributed uniformly in terms of PIR values. Otherwise, an additional rule is applied: if dam(FVprobe) < dam(FVi), then FVprobe replaces FVi, where dam(FVprobe) and dam(FVi) are the average mutual dissimilarity scores calculated as shown above. This aids in selecting the FV that exhibits maximum similarity with the other FVs in the enrolled template.
16 Approaches and Methods to Iris Recognition for Mobile 419
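The replacement decision for a full group can be sketched as follows (the absolute values in D are assumed from the reconstruction above, and the `threshold` default is illustrative; the text only says it is predefined):

```python
def should_replace(pir_i, pir_prev, pir_next, pir_probe, d_am_i, d_am_probe,
                   threshold=0.02):
    """Decide whether FV_probe should replace FV_i, the group's closest FV by PIR.

    pir_prev / pir_next: PIR values of FV_i's neighbors within the group.
    """
    pir_avg = (pir_prev + pir_next) / 2.0
    d = abs(pir_i - pir_avg) - abs(pir_probe - pir_avg)
    if d > threshold:           # probe sits closer to the neighbors' midpoint
        return True
    return d_am_probe < d_am_i  # fallback: keep the more template-consistent FV

# FV_i drifted from the midpoint of its neighbors; the probe restores uniformity.
print(should_replace(pir_i=0.48, pir_prev=0.40, pir_next=0.50,
                     pir_probe=0.45, d_am_i=0.3, d_am_probe=0.35))  # True
```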
To prove the efficiency of the proposed methods, a dataset that emulates user interaction with a mobile device was collected privately. It is a set of two-second video sequences, each of which is a real enrollment/verification attempt. It should be noted that no such datasets are publicly available.
The dataset was collected using a mobile device with an embedded NIR camera.
It contains videos captured at different distances, in indoor (IN) and outdoor
(OT) environments, with and without glasses. During dataset capture, the following illumination ranges and conditions were set up: (i) three levels for the indoor samples (0–30, 30–300, and 300–1000 Lux) and (ii) a random value in the range 1–100K Lux for the outdoor samples (data were collected on a sunny day with different arrangements of the device relative to the sun). A detailed description of the dataset can be found in Table 16.7. The Iris Mobile (IM) dataset used in Sect. 16.4 was also randomly sampled from this dataset.
The testing procedure for proposed multi-instance iris recognition considers each
video sequence as a single attempt. The procedure contains the following steps:
1. All video sequences captured in indoor conditions and without glasses (IN&NG)
are used to produce the enrollment template. The enrolled template is successfully
created if the following conditions are satisfied:
(a) At least 5 FVs were constructed for each eye.
(b) At least 20 out of 30 frames were processed.
2. All video sequences are used to produce probes. The probe is successfully created
if at least one FV was constructed.
3. Each enrollment template is compared with all probes except the ones generated from the same video. Thus, a pairwise matching table of the dissimilarity scores for the performed comparisons is created.
4. The obtained counts of successfully created enrolled templates and probes, together with the pairwise matching table, are used to calculate FTE, FTA, FNMR, FMR, and EER as described in Dunstone and Yager (2009).
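The success conditions from steps 1 and 2 can be sketched directly (function names are illustrative):

```python
def enrollment_succeeded(fvs_per_eye, frames_processed):
    """Enrollment conditions from the testing procedure:
    at least 5 FVs per eye and at least 20 of the 30 frames processed."""
    return all(n >= 5 for n in fvs_per_eye.values()) and frames_processed >= 20

def probe_succeeded(num_fvs):
    """A probe is valid if at least one FV was constructed."""
    return num_fvs >= 1

print(enrollment_succeeded({"left": 6, "right": 5}, frames_processed=24))  # True
print(enrollment_succeeded({"left": 6, "right": 4}, frames_processed=24))  # False
```

Counting how often these checks fail is what yields the FTE and FTA rates in step 4.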
Fig. 16.17 Verification rate values obtained at each update cycle for Gabor-based feature extrac-
tion and matching (Odinokikh et al. 2017)
The recognition accuracy results are presented in Table 16.6. The proposed
CNN-based feature extraction and matching method described in Sect. 16.4 is
compared with the one described in Odinokikh et al. (2017) as a part of the whole
iris recognition pipeline. The latter method is based on Gabor wavelets with an
adaptive phase quantization technique (denoted as GAQ in Table 16.6).
Both methods were tested in three different verification environments: indoors
without glasses (IN&NG), indoors with glasses (IN&G), and outdoors without
glasses (OT&NG). The enrollment was always carried out only indoors without
glasses, and, for this reason, the value of FTE = 3.15% is the same for all the cases. The target FMR = 10⁻⁷ was set in every experiment.
Applying different matching rules was also investigated. The proposed multi-
instance fusion showed advantages over the other compared rules (Table 16.5).
To simulate template adaptation in a real-life scenario, the following testing
procedure is proposed. The subset containing video sequences captured both in
indoor and outdoor environmental conditions for 28 users is formed from the
whole dataset. For each user, one video sequence captured in indoor conditions
without glasses is randomly selected for generating the initial enrolled template. All
other video sequences (both indoor and outdoor) are used for generating probes.
After generation, the probes are split into two subsets: one for performing genuine attempts during verification (the genuine subset) and another for the enrolled template update (the update subset). It should be noted that all probes are used for performing impostor attempts during verification.
On each update cycle, one probe from the update subset is randomly selected, and
the enrolled template update is started. The updated enrolled template is involved in
performance testing after every update cycle. Figure 16.17 shows the verification rate values obtained at different update cycles for the proposed method with the Gabor-based feature extraction and matching algorithm of Odinokikh et al. (2017). It can be seen that the proposed adaptation scheme increases the verification rate by up to 6% after nine update cycles.
Portions of the research in this chapter use the CASIA-Iris-Mobile-V1.0 dataset
collected by the Chinese Academy of Sciences’ Institute of Automation
(CASIA 2015).
References
Abate, A.F., Barra, S., D’Aniello, F., Narducci, F.: Two-tier image features clustering for iris
recognition on mobile. In: Petrosino, A., Loia, V., Pedrycz, W. (eds.) Fuzzy Logic and Soft
Computing Applications. Lecture Notes in Artificial Intelligence, vol. 10147, pp. 260–269.
Springer International Publishing, Cham (2017)
ARM Security Technology. Building a secure system using TrustZone Technology. ARM Limited
(2009)
Battiato, S., Messina, G., Castorina, A.: Exposure correction for imaging devices: an overview. In: Lukac, R. (ed.) Single-Sensor Imaging: Methods and Applications for Digital Cameras, pp. 323–349. CRC Press, Boca Raton (2009)
Bowyer, K.W., Hollingsworth, K., Flynn, P.J.: Image understanding for iris biometrics: a survey.
Comput. Vis. Image Underst. 110(2), 281–307 (2008)
Chinese Academy of Sciences’ Institute of Automation (CASIA). Casia-iris-mobile-v1.0 (2015).
Accessed on 4 October 2020. http://biometrics.idealtest.org/CASIA-Iris-Mobile-V1.0/CASIA-
Iris-Mobile-V1.0.jsp
Corcoran, P., Bigioi, P., Thavalengal, S.: Feasibility and design considerations for an iris acquisi-
tion system for smartphones. In: Proceedings of the 2014 IEEE Fourth International Conference
on Consumer Electronics, Berlin (ICCE-Berlin), pp. 164–167 (2014)
Das, A., Galdi, C., Han, H., Ramachandra, R., Dugelay, J.-L., Dantcheva, A.: Recent advances in
biometric technology for mobile devices. In: Proceedings of the IEEE 9th International Con-
ference on Biometrics Theory, Applications and Systems (2018)
Daugman, J.: High confidence visual recognition of persons by a test of statistical independence.
IEEE Trans. Pattern Anal. Mach. Intell. 15(11), 1148–1161 (1993)
Daugman, J.: Recognising persons by their iris patterns. In: Li, S.Z., Lai, J., Tan, T., Feng, G.,
Wang, Y. (eds.) Advances in Biometric Person Authentication. SINOBIOMETRICS 2004.
Lecture Notes in Computer Science, vol. 3338, pp. 5–25. Springer, Berlin, Heidelberg (2004)
Daugman, J.: Probing the uniqueness and randomness of iris codes: results from 200 billion iris pair
comparisons. Proc. IEEE. 94(11), 1927–1935 (2006)
Daugman, J., Malhas, I.: Iris recognition border-crossing system in the UAE (2004). Accessed on
4 October 2020. https://www.cl.cam.ac.uk/~jgd1000/UAEdeployment.pdf
Dunstone, T., Yager, N.: Biometric System and Data Analysis: Design, Evaluation, and Data
Mining. Springer-Verlag, Boston (2009)
Fujitsu Limited. Fujitsu develops prototype smartphone with iris authentication (2015). Accessed
on 4 October 2020. https://www.fujitsu.com/global/about/resources/news/press-releases/2015/
0302-03.html
Galbally, J., Gomez-Barrero, M.: A review of iris anti-spoofing. In: Proceedings of the 4th
International Conference on Biometrics and Forensics (IWBF), pp. 1–6 (2016)
Gangwar, A.K., Joshi, A.: DeepIrisNet: deep iris representation with applications in iris recognition
and cross sensor iris recognition. In: Proceedings of 2016 IEEE International Conference on
Image Processing (ICIP), pp. 2301–2305 (2016)
Gnatyuk, V., Zavalishin, S., Petrova, X., Odinokikh, G., Fartukov, A., Danilevich, A., Eremeev, V.,
Yoo, J., Lee, K., Lee, H., Shin, D.: Fast automatic exposure adjustment method for iris
recognition system. In: Proceedings of 11th International Conference on Electronics, Computers
and Artificial Intelligence (ECAI), pp. 1–6 (2019)
IrisGuard UK Ltd. EyePay Phone (IG-EP100) specification (2019). Accessed on 4 October 2020.
https://www.irisguard.com/node/57
ISO/IEC 19794-6:2011: Information technology – Biometric data interchange formats – Part 6: Iris
image data (2011), Annex B (2011)
Korobkin, M., Odinokikh, G., Efimov, Y., Solomatin, I., Matveev, I.: Iris segmentation in challenging conditions. Pattern Recognit. Image Anal. 28, 652–657 (2018)
Nourani-Vatani, N., Roberts, J.: Automatic camera exposure control. In: Dunbabin, M., Srinivasan,
M. (eds.) Proceedings of the Australasian Conference on Robotics and Automation, pp. 1–6.
Australian Robotics and Automation Association, Sydney (2007)
Odinokikh, G., Fartukov, A., Korobkin, M., Yoo, J.: Feature vector construction method for iris
recognition. In: International Archives of the Photogrammetry, Remote Sensing and Spatial
Information Science. XLII-2/W4, pp. 233–236 (2017). Accessed on 4 October 2020. https://doi.
org/10.5194/isprs-archives-XLII-2-W4-233-2017
Odinokikh, G.A., Fartukov, A.M., Eremeev, V.A., Gnatyuk, V.S., Korobkin, M.V., Rychagov, M.
N.: High-performance iris recognition for mobile platforms. Pattern Recognit. Image Anal. 28,
516–524 (2018)
Odinokikh, G.A., Gnatyuk, V.S., Fartukov, A.M., Eremeev, V.A., Korobkin, M.V., Danilevich, A.
B., Shin, D., Yoo, J., Lee, K., Lee, H.: Method and apparatus for iris recognition. US Patent
10,445,574 (2019a)
Odinokikh, G., Korobkin, M., Gnatyuk, V., Eremeev, V.: Eyelid position detection method for
mobile iris recognition. In: Strijov, V., Ignatov, D., Vorontsov, K. (eds.) Intelligent Data
Processing. IDP 2016. Communications in Computer and Information Science, vol. 794, pp.
140–150. Springer-Verlag, Cham (2019b)
Odinokikh, G., Korobkin, M., Solomatin, I., Efimov, I., Fartukov, A.: Iris feature extraction and
matching method for mobile biometric applications. In: Proceedings of International Conference
on Biometrics, pp. 1–6 (2019c)
Prabhakar, S., Ivanisov, A., Jain, A.K.: Biometric recognition: sensor characteristics and image
quality. IEEE Instrum. Meas. Soc. Mag. 14(3), 10–16 (2011)
Rathgeb, C., Uhl, A., Wild, P.: Iris segmentation methodologies. In: Iris Biometrics. Advances in
Information Security, vol. 59. Springer-Verlag, New York (2012)
Rattani, A.: Introduction to adaptive biometric systems. In: Rattani, A., Roli, F., Granger, E. (eds.)
Adaptive Biometric Systems. Advances in Computer Vision and Pattern Recognition, pp. 1–8.
Springer, Cham (2015)
Ross, A., Jain, A., Nandakumar, K.: Handbook of Multibiometrics. Springer-Verlag, New York
(2006)
Samsung Electronics. Galaxy tab iris (sm-t116izkrins) specification (2016). Accessed on 4 October
2020. https://www.samsung.com/in/support/model/SM-T116IZKRINS/
Samsung Electronics. How does the iris scanner work on Galaxy S9, Galaxy S9+, and Galaxy
Note9? (2018). Accessed on 4 October 2020. https://www.samsung.com/global/galaxy/what-is/
iris-scanning/
Sun, Z., Tan, T.: Iris anti-spoofing. In: Marcel, S., Nixon, M.S., Li, S.Z. (eds.) Handbook of
Biometric Anti-Spoofing, pp. 103–123. Springer-Verlag, London (2014)
Tabassi, E.: Large scale iris image quality evaluation. In: Proceedings of International Conference
of the Biometrics Special Interest Group, pp. 173–184 (2011)
Tortora, G.J., Nielsen, M.: Principles of Human Anatomy, 12th edn. John Wiley & Sons, Hoboken
(2010)
Zhang, Q., Li, H., Sun, Z., Tan, T.: Deep feature fusion for iris and periocular biometrics on mobile
devices. IEEE Trans. Inf. Forensics Secur. 13(11), 2897–2912 (2018)
Zhao, Z., Kumar, A.: Towards more accurate iris recognition using deeply learned spatially
corresponding features. In: Proceedings of IEEE International Conference on Computer Vision
(ICCV), pp. 3829–3838 (2017)
Index
DIBR, 75
problems during view generation
  disocclusion area, 76
  symmetric vs. asymmetric, 77
  temporal consistency, 77
  toed-in configuration, 77, 78
  virtual view synthesis, 76
Structural similarity (SSIM), 41, 46
Structure tensor, 29, 30
Sub-pixel convolutions, 36, 41, 48
Sunlight spot effect, 225, 226, 230–232
Superpixels, 87
Super-resolution (SR)
  arrays, 4
  Bayer SR, 21–26
  BCCB matrices, 12
  block diagonalization in SR problems, 9, 10, 14
  circulant matrices, 11
  colour filter arrays, 5
  data fidelity, 7
  data-agnostic approach, 7
  filter-bank implementation, 16–18
  HR image, 1
  image formation model, 1–3, 5
  image interpolation applications, 2
  interpolation problem, 2
  LR images, 1
  machine learning, 6
  mature technology, 1
  on mobile device, 45
  modern research, 7
  optimization criteria, 6
  perfect shuffle matrix, 10, 15
  problem conditioning, 7
  problem formulation, fast implementation, 8–9
  reconstruction, 1
  sensor-shift, 3
  single- vs. multi-frame, 2
  single-channel SR, 19
  SISR, 6 (see also Single image super-resolution (SISR))
  symmetry properties, 19–21
  warping and blurring parameters, 35
Super-resolution multiple-degradations (SRMD) network, 53
Support vector machine (SVM), 194, 195, 208, 291
Symmetric stereo view rendering, 77
Synthetic data acquisition, 247
System on chip (SoC), 115

T
Temporal propagation, 105
Texture codes, 196
3D recursive search (3DRS) algorithm, 375
3D TVs
  on active shutter glasses, 62
  cause, eye fatigue, 62, 63, 65, 66
  consumer electronic shows, 59
  interest transformation
    extra cost, 60
    inappropriate moment, 59
    live TV, 61
    multiple user scenario, 61
    picture quality, 61
    uncomfortable glasses, 60
  parallax (see Parallax)
  prospective technologies, 59, 60
  smart TVs, 59
  stereo content (see Stereo content reproduction)
Thumbnail creation, 353, 354
Tiling slideshow, 220
Toed-in camera configuration, 77, 78
Tone-to-colour mapping, 229
Transposed convolution, 36
TV programmes, 194
TV screens, 373
2D-3D semi-automatic video conversion
  advantages and disadvantages, 82
  background inpainting step (see Background inpainting)
  causes, 81
  depth propagation from key frame (see Depth propagation)
  motion vector estimation (see Motion vectors)
  steps, 83
  stereo content quality, 111, 112
  stereo rig, 81, 82
  video analysis and key frame detection, 84–86
  view rendering, 110, 111
  virtual reality headsets, 81

U
Unpaired super-resolution, 55
Unsupervised contrastive learning, 254
User authentication, 267
User data collection, 269, 271–274
User interfaces, 351, 353, 354

W
Walsh-Hadamard (W-H) filters, 88
Warp-blur-down-sample model, 4

Z
Zero filling (ZF), 304, 306
Zero-shot super-resolution, 55