
SpringerBriefs in Electrical and Computer Engineering

Byung-Gyu Kim
Kalyan Goswami

Basic Prediction Techniques in Modern Video Coding Standards
SpringerBriefs in Electrical and Computer
Engineering

More information about this series at http://www.springer.com/series/10059


Byung-Gyu Kim • Kalyan Goswami

Basic Prediction Techniques in Modern Video Coding Standards

Byung-Gyu Kim
Department of IT Engineering
Sookmyung Women’s University
Seoul, Republic of Korea

Kalyan Goswami
Visual Media Research Section
Broadcasting and Media Research Laboratory
Electronics and Telecommunications Research Institute (ETRI)
Daejeon, Republic of Korea

ISSN 2191-8112 ISSN 2191-8120 (electronic)


SpringerBriefs in Electrical and Computer Engineering
ISBN 978-3-319-39239-4 ISBN 978-3-319-39241-7 (eBook)
DOI 10.1007/978-3-319-39241-7

Library of Congress Control Number: 2016942557

© The Author(s) 2016


This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of
the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation,
broadcasting, reproduction on microfilms or in any other physical way, and transmission or information
storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology
now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication
does not imply, even in the absence of a specific statement, that such names are exempt from the relevant
protective laws and regulations and therefore free for general use.
The publisher, the authors and the editors are safe to assume that the advice and information in this book
are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or
the editors give a warranty, express or implied, with respect to the material contained herein or for any
errors or omissions that may have been made.

Printed on acid-free paper

This Springer imprint is published by Springer Nature


The registered company is Springer International Publishing AG Switzerland
Preface

This book is intended as a basic technical guide to the latest video coding standards, with general descriptions of the underlying video compression technologies. The H.264/advanced video coding (AVC) scheme has been applied as a video compression standard in a variety of multimedia services over the last 10 years. As the latest video coding standard, the High Efficiency Video Coding (HEVC) technology is likewise expected to be used in a variety of ultrahigh-definition (UHD) multimedia and immersive media services over the next 10 years.
The structure of the H.264/AVC standard scheme is explained in contrast with earlier technologies, and the HEVC video compression technology is presented. The history and background of video coding technology in general, together with the hybrid video codec structure, are explained in the Introduction. A detailed explanation of the modules and functions of the hybrid video codec is presented in Chap. 2. Detailed descriptions of the intra-prediction, inter-prediction, and RD optimization techniques of the standard video codecs follow. The high degree of video quality achieved with these standards comes at the cost of high computational complexity in the video encoding system; thus, fast algorithms and schemes for reducing the computational complexity of the HEVC encoding system are presented and analyzed in Chap. 6.
A complete, comprehensive, and exhaustive analysis of the HEVC and H.264/AVC video codecs is beyond the scope of this book. However, the latest technologies used in these codecs are presented in an attempt to convey an understanding of both their structure and their function. The basic principles of video data compression, based on the removal of correlations between data, are presented and explained. We therefore hope this book will help interested readers gain an understanding of the latest video codec technology.

Byung-Gyu Kim
Seoul, Republic of Korea

Kalyan Goswami
Daejeon, Republic of Korea

March 2016

Contents

1 Introduction  1
  1.1 Background and Need for Video Compression  1
  1.2 Classifications of the Redundancies  2
    1.2.1 Statistical Redundancy  2
    1.2.2 Psycho-Visual Redundancy  5
  1.3 Hybrid Video Codec  5
  1.4 Brief History About Compression Standards  8
  1.5 About This Book  10
  References  11

2 Hybrid Video Codec Structure  13
  2.1 Picture Partitioning  13
    2.1.1 High-Level Picture Partitioning  13
  2.2 Block Partitioning  16
    2.2.1 H.264/AVC Block Partitioning  17
    2.2.2 HEVC Block Partitioning  18
  2.3 Prediction Modes  23
  2.4 In-Loop Filters  23
    2.4.1 Deblocking Filter  23
    2.4.2 Sample Adaptive Offset  25
  2.5 Entropy Coding  28
    2.5.1 Huffman Coding  28
    2.5.2 Arithmetic Coding  29
    2.5.3 CABAC  30

3 Intra-prediction Techniques  31
  3.1 Background  31
  3.2 Intra-prediction Modes in H.264/AVC  32
  3.3 Intra-prediction Modes in HEVC  34
    3.3.1 Angular Prediction  34
    3.3.2 DC and Planar Prediction  36
    3.3.3 Reference Sample Smoothing and Boundary Value Smoothing  37
  3.4 Lossless Intra-prediction Using DPCM  37
  References  38

4 Inter-prediction Techniques  39
  4.1 Motion Estimation  39
  4.2 Uni- and Bidirectional Predictions  42
  4.3 Complexity in the Inter-prediction  44
  4.4 Different Inter-prediction Modes  46
  4.5 Merge and Skip Modes  48
  4.6 Motion Vector Prediction  50

5 RD Cost Optimization  53
  5.1 Background  53
  5.2 Classical Theory of RD Cost  54
  5.3 Distortion Measurement Technique  55
    5.3.1 Mean of Squared Error  55
    5.3.2 Mean of Absolute Difference  56
    5.3.3 Sum of Absolute Difference  57
  5.4 Calculating λ for the RD Cost Function  57
  Reference  61

6 Fast Prediction Techniques  63
  6.1 Need for the Fast Prediction Algorithms  63
  6.2 Fast Options in HEVC Encoder  64
    6.2.1 Early CU Termination  64
    6.2.2 Early Skip Detection  65
    6.2.3 CBF Fast Mode Setting  66
    6.2.4 Fast Decision for Merge RD Cost  66
  6.3 Block Matching Algorithm  67
  6.4 Full Search  70
  6.5 Unsymmetrical-Cross Multihexagon-Grid Search  70
  6.6 Diamond Search  70
  6.7 Enhanced Predictive Zonal Search  72
  6.8 Test Zone Search  74
  6.9 Fixed Search Patterns  77
  6.10 Search Patterns Based on Block Correlation  78
  6.11 Search Patterns Based on Motion Classification  79
  6.12 Prediction-Based Fast Algorithms  79
  6.13 Improved RD Cost-Based Algorithms  81
  6.14 Efficient Filter-Based Algorithms  82
  6.15 Improved Transform-Based Algorithms  82
  References  83
Chapter 1
Introduction

1.1 Background and Need for Video Compression

The field of video processing is concerned with information processing activity for
which the input and output signals are video sequences. A wide range of emerging
applications, such as videophone, video conferencing through wired and wireless
medium, streaming video, digital TV/HDTV broadcast, video database service,
CD/DVD storage, etc., demand a significant amount of video compression to store
or transmit the video efficiently. Recently, video communication technology has undergone a drastic change, moving from lower-resolution formats to the ultra-high-definition (UHD) video format. In our modern society, there is a huge demand for UHD video for consumer use in real-time systems.
Now, in order to transmit or to store video data, compression of the raw file is essential. Video compression refers to the tools and techniques that operate on video sequences to reduce the quantity of data. Today, modern data compression techniques can store or transmit video sequences in an efficient and robust way. One question arises at this point: why is video compression needed, when we could simply store a raw file instead of a compressed one? The answer lies in the amount of data: an uncompressed video signal generates a huge quantity of data, which is difficult to store and to transmit through a channel. For this reason, raw video data needs to be compressed for everyday applications. This raises the next question: how can video data be compressed? In the last few decades, a large body of research has been reported in the domain of video compression. In a nutshell, most natural video sequences contain a huge amount of redundant data, which can be exploited using statistical models and the psycho-visual limitations of the human eye. Video compression algorithms are mainly based on a statistical model of the input data or on psycho-visual limitations of the human eye, and they reduce the raw video sequence to a compressed data sequence. The act of discarding data introduces distortion in the
decompressed data sequence. However, the compression is done in such a way that
the introduced distortion is not noticeable to the human eye.
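
To make the data volumes concrete, the short sketch below estimates the raw bit rate of an uncompressed UHD sequence; the resolution, frame rate, bit depth, and 4:2:0 chroma format are illustrative assumptions, not values fixed by any standard.

```python
# Raw bit rate of uncompressed video: a back-of-the-envelope sketch.
# Assumed parameters (illustrative): 3840x2160 (UHD), 30 fps, 8 bits/sample,
# 4:2:0 chroma subsampling (1.5 samples per pixel on average).

width, height = 3840, 2160
fps = 30
bits_per_sample = 8
samples_per_pixel = 1.5  # luma + two quarter-size chroma planes (4:2:0)

bits_per_frame = width * height * samples_per_pixel * bits_per_sample
bits_per_second = bits_per_frame * fps

print(f"one frame : {bits_per_frame / 8 / 1e6:.1f} MB")    # ~12.4 MB
print(f"one second: {bits_per_second / 1e9:.2f} Gbit/s")   # ~2.99 Gbit/s
```

At roughly 3 Gbit/s, even a short raw UHD clip overwhelms ordinary storage and channels, which is exactly the gap that compression must close.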
This introductory chapter starts with a brief explanation of the different redundancies, a fundamental concept in video compression theory. It continues with a description of the modern hybrid video codec used in High Efficiency Video Coding (HEVC). A brief history of compression standards is given next. Finally, the chapter ends with the organization of the book.

1.2 Classifications of the Redundancies

In the previous section, we introduced the term “redundancy.” Informally, it can be thought of as repetition of data within a data set. For example, if we consider a pixel in an image, most of its neighboring pixels have similar intensity values; moreover, in a homogeneous region, there is a high chance that most of the neighboring pixels have exactly the same value. This kind of similarity of data is generally called redundancy. Broadly, redundancies can be divided into two categories: statistical and psycho-visual redundancies.

1.2.1 Statistical Redundancy

Statistical redundancy occurs because pixels within an image tend to have intensity values similar to those of their neighbors, and, for video, intensities at the same pixel position across successive frames tend to be very similar. Accordingly, statistical redundancy can be subdivided into two categories: spatial and temporal redundancy.

1.2.1.1 Spatial Redundancy

For an image, it can easily be observed that most pixels have almost the same intensity level as those in their neighborhood; only at object boundaries does the intensity change significantly. Hence, a considerable amount of redundancy is present in an image, and it can be exploited for significant data compression. This kind of redundancy is called spatial redundancy. Spatial redundancy can be exploited using lossless and lossy compression techniques. Lossless compression algorithms operate on a statistical model of the input data; the general concept is to assign shorter code words to more frequently occurring symbols and longer code words to less frequently occurring symbols. Run-length coding, entropy coding, and Lempel-Ziv coding are examples of lossless compression techniques (see the sketch below).
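
As a minimal illustration of how spatial redundancy turns into compression, the toy run-length encoder below collapses runs of identical neighboring samples into (value, count) pairs; real codecs use far more sophisticated entropy coders.

```python
# Toy run-length encoder: exploits runs of identical neighboring samples,
# a simple form of the spatial redundancy described above.

def rle_encode(samples):
    runs = []
    for s in samples:
        if runs and runs[-1][0] == s:
            runs[-1][1] += 1          # extend the current run
        else:
            runs.append([s, 1])       # start a new run
    return runs

# A homogeneous image row: 12 pixels but only 3 (value, count) pairs.
row = [128, 128, 128, 128, 128, 130, 130, 130, 131, 131, 131, 131]
print(rle_encode(row))  # [[128, 5], [130, 3], [131, 4]]
```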
Lossy compression algorithms, on the other hand, employ the psycho-visual limitations of the human eye to discard redundant data. The human eye is more responsive to slow and gradual changes of illumination than to fine details and rapid changes of intensity. Exploitation of such psycho-visual characteristics has been incorporated into multimedia standards such as JPEG, the MPEG family, and H.26x.

Fig. 1.1 Spatial redundancy for a video frame (Foreman)

In Fig. 1.1, we show a frame from the Foreman video sequence. In this frame, there are many regions where the contents of neighboring pixels are very similar to each other; some of these similar patches are marked in the diagram, and it is clear that the pixels within the marked blocks have very similar intensity values. This is a very basic and fundamental example of spatial redundancy. In the next section, we will discuss temporal redundancy.

1.2.1.2 Temporal Redundancy

A video sequence can be considered as a sequence of frames, so spatial redundancy is present within each frame. Apart from that, between successive frames only a limited amount of object movement is possible; hence, most pixels do not exhibit any change at all between successive frames. This is called temporal redundancy, and it is exploited through the prediction of the current frame using the stored information of past frames. The temporal prediction is based on the assumption that consecutive frames in a video sequence have a very close similarity. This assumption is mostly valid, except for frames having a significant change of content or the appearance of new objects. The prediction technique is applied to the current frame with respect to the previous frame(s). Hence, for a video, redundancies are present not only within a frame (spatial redundancy) but also between successive frames (temporal redundancy). To compress a video sequence efficiently, both of these redundancies need to be exploited and reduced as much as possible; a minimal numerical illustration follows.
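
The sketch below makes temporal redundancy concrete: two synthetic “frames” (assumed for illustration only) are differenced, and the residual turns out to be almost entirely zero.

```python
# Temporal redundancy sketch: most pixels of consecutive frames are identical,
# so the frame difference (residual) is mostly zero.
import numpy as np

rng = np.random.default_rng(0)
frame1 = rng.integers(0, 256, size=(64, 64), dtype=np.int16)
frame2 = frame1.copy()
frame2[20:28, 30:38] += 5        # a small moving/changing region

residual = frame2 - frame1
changed = np.count_nonzero(residual)
print(f"changed pixels: {changed} of {residual.size}")   # 64 of 4096
```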
In Fig. 1.2, we show an example of temporal redundancy: the first ten frames of the Akiyo sequence. Looking closely, it is clear that, apart from the lip and eye regions of the lady’s face, the rest of the content in this ten-frame sequence is static. Hence, from the very first frame it is possible to predict the tenth frame, given only the motion information for the lip and eye regions of the face. This is the fundamental idea behind temporal redundancy. In the next section, we will discuss psycho-visual redundancy.

Fig. 1.2 Temporal redundancy for a video sequence (Akiyo)



1.2.2 Psycho-Visual Redundancy

This kind of redundancy arises from the nature of human perception. Generally, the human eye is more sensitive and responsive to slow and gradual changes of illumination, whereas it cannot distinguish very fine details and rapid changes of intensity. Since all of these systems are built for human viewers, these human limitations have been studied thoroughly and exploited to a great extent, and the exploitation of psycho-visual redundancies has been incorporated into the multimedia standards.

1.3 Hybrid Video Codec

A video codec is a device capable of encoding and decoding a video stream. Since the modern video codec uses a combination of predictive and transform-domain techniques, it is generally referred to as a hybrid codec. Simplified block diagrams of a hybrid video encoder and decoder are shown in Figs. 1.3 and 1.4, respectively.
In this codec, the current frame is predicted, using temporal and spatial redundancies, from the previously encoded reference frame(s).

Fig. 1.3 Block diagram of a video encoder


Fig. 1.4 Block diagram of a video decoder

The temporal prediction is based on the assumption that consecutive frames in a video sequence have a very close similarity. This assumption is mostly valid, except for frames having a significant change of content or a scene change; in such cases, the spatial redundancy of the new region (scene) needs to be exploited instead.
In a hybrid video codec, when a frame F_N (the Nth frame in a sequence) comes as an input, it is first compared with its predicted frame F̂_N. Generally, the predicted frame F̂_N is subtracted from the current frame F_N, and the resulting error image is called the residual image, ΔF. Since the current and predicted frames are very similar (depending upon the prediction technique), the residual image generally exhibits considerable spatial redundancy. Moreover, from the residual image and the predicted frame, the current frame can be reconstructed by simple addition, without any error. In Fig. 1.3, the residual image is shown in black because, in the ideal case (when the current and predicted frames are identical), each pixel of the residual image has the value 0, which produces a black image.
Since the residual image still has significant spatial redundancy, it should be exploited properly. For this reason, it is transformed into the frequency domain; generally, the discrete cosine transform (DCT) is used in the hybrid codec. One question may arise at this point: to transform into the frequency domain, why not use the discrete Fourier transform (DFT)? The main advantage of the DCT over the DFT is its energy compaction: after transformation into the frequency domain, the DCT concentrates the signal energy into fewer coefficients and therefore requires fewer bits than the DFT. A small numerical sketch follows.
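
The sketch below applies the 2-D DCT (via scipy) to a smooth, synthetic 8 × 8 block, an assumed example, and measures how much of the energy lands in a handful of coefficients.

```python
# DCT energy compaction sketch: a smooth block's energy concentrates in a few
# low-frequency coefficients, which is why the DCT is preferred here.
import numpy as np
from scipy.fft import dctn

x, y = np.meshgrid(np.arange(8), np.arange(8))
block = 100 + 5 * x + 3 * y            # a smooth (gradient) 8x8 block
coeffs = dctn(block.astype(float), norm="ortho")

energy = coeffs ** 2
top4 = np.sort(energy.ravel())[::-1][:4].sum()
print(f"energy in top 4 of 64 coefficients: {100 * top4 / energy.sum():.2f}%")
```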
Up to this point, the compression steps applied in the hybrid codec have been based on a statistical approach. After the DCT, a quantization operation is performed on the residual image, and this step is based on psycho-visual redundancy. Conceptually, it is a matrix operation on the DCT coefficients that eliminates the high-frequency terms. As mentioned earlier, the human eye is more sensitive to low-frequency components than to high-frequency ones; hence, if we drop the high-frequency terms from the DCT output and reconstruct the video signal, the result shows no significant perceptual change from the original for a human observer. The quantization parameter (QP) is one of the most important features of the hybrid video codec, because quantization is the only stage where error is introduced into the output bit stream. The quantization matrices are fixed for a particular video codec, and they were constructed after extensive psycho-visual experiments with human observers.
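
A simplified uniform quantizer shows where the controlled loss enters. The single step size below is an assumption for illustration; real codecs use standardized quantization matrices and QP-to-step-size mappings.

```python
# Simplified quantization sketch: dividing DCT coefficients by a step size and
# rounding zeroes out small high-frequency terms -- the lossy stage of the codec.
import numpy as np

coeffs = np.array([[620.0, 35.0, -8.0,  2.0],
                   [ 28.0, -6.0,  1.5, -0.7],
                   [ -5.0,  1.2, -0.4,  0.2],
                   [  1.1, -0.5,  0.1, -0.1]])

qstep = 10.0                            # assumed uniform quantizer step
levels = np.round(coeffs / qstep)       # transmitted integer levels
reconstructed = levels * qstep          # decoder-side inverse quantization

print(levels)          # most high-frequency entries become 0
print(reconstructed)   # close to the original, but not identical
```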

After quantization, the output data is further compressed by lossless entropy coding, or variable length coding (VLC), and the final output is sent to the channel after proper buffering. Generally, arithmetic coding-based approaches are used for entropy coding in the modern hybrid video codec. In Fig. 1.3, a feedback loop is drawn from the buffer to the quantizer; this loop signifies the adaptive quantization parameter setting technique that is generally used in modern codecs.
In the hybrid video codec, a decoder is also embedded in the encoder side. The block diagram of a decoder is shown in Fig. 1.4; it generally consists of an inverse quantizer, an inverse transform, and a motion-compensated predictor. Looking carefully, the same decoder blocks can be observed on the encoder side (Fig. 1.3). The reason for embedding a decoder in the encoder is that the encoder must form its predictions from exactly the same reconstructed reference pictures that the decoder, and hence the end user, will observe.
Motion estimation and motion-compensated prediction are the most important parts of the hybrid video codec: from the reference frame, the current frame is predicted using motion vectors. A detailed description of this technique is given in the next chapter.
In Fig. 1.5, the basic block diagram of the H.264/AVC encoder is shown; to date, it is the most widely deployed commercial encoder. It uses a block-based encoding technique, with the block size fixed at 16 × 16 samples. These fixed-size blocks in the H.264/AVC are generally referred to as macroblocks (MBs). The main goals of the H.264/AVC standardization effort were enhanced compression performance and provision of a “network-friendly” video representation addressing “conversational” (video telephony) and “non-conversational” (storage, broadcast, or streaming) applications [1].
For more than a decade, the hybrid video coding techniques discussed above have been used commercially. Moreover, the latest video standard, HEVC, adopts the same basic structure; the block diagram of the HEVC encoder is shown in Fig. 1.6. The HEVC standard is designed to achieve multiple goals, including coding efficiency, ease of transport system integration, and data loss resilience, as well as the ability to implement it using parallel processing architectures. A detailed description of this latest hybrid codec is given in the next chapter.
Now, one question may come to mind: what is the need for standardizations like H.264, HEVC, etc.? The answer is simple: to decode a compressed video sequence, one has to know the encoding scheme. In the absence of a standard, anybody could compress a video sequence with their own algorithm, which would be quite difficult for others to decode. More formally, a video coding standard is required to ensure interoperability between encoders and decoders from different manufacturers and to minimize the violation of patents [3]. In the next section, we discuss a brief history of the different compression standards.

Fig. 1.5 The structure of the H.264/AVC video encoder [1]

Fig. 1.6 The structure of the HEVC video encoder [2]

1.4 Brief History About Compression Standards

Efforts on the standardization of video encoders have been actively in progress since the early 1980s. An expert group, named the Moving Picture Experts Group (MPEG), was established in 1988 in the framework of the Joint ISO/IEC Technical Committee (JTC 1). The first standard was produced by this team in 1992 and is known as MPEG-1. Today, MPEG-1 is used in the video CD (VCD), which is supported by most DVD players, with video quality at 1.5 Mbit/s and 352 × 288- or 352 × 240-pixel resolution. In 1993, the next version of the standard, MPEG-2, was introduced by the same team. MPEG-2 added improved compression tools and interlace support and ushered in the era of digital television (DTV) and the DVD. To this day, most DVD players, all DTV systems, and some digital camcorders use MPEG-2 [4].
In 1994, the MPEG committee introduced a new standardization phase, called MPEG-4, which finally became a standard in 2000. In MPEG-4, many novel coding concepts were introduced, such as interactive graphics, object and shape coding, wavelet-based still image coding, face modeling, scalable coding, and 3D graphics. Very few of these techniques found their way into commercial products, and later standardization efforts focused more narrowly on the compression of regular video sequences [5].
Apart from the MPEG committee, the International Telecommunication Union Telecommunication Standardization Sector (ITU-T) also evolved standards for multimedia communications in parallel. In 1988–1990, the H.261 standard was developed by this group as a forerunner of MPEG-1. The target was to transmit video over ISDN lines, at multiples of 64 kbit/s, with CIF (352 × 288-pixel) or QCIF (176 × 144-pixel) resolution.
The H.263 standard (1995), developed by the ITU, was a big step forward and became the dominant video conferencing and cell phone codec of its day [5]. H.263 built upon H.261, an earlier video teleconferencing standard, as well as MPEG-1 and MPEG-2, and added new coding tools optimized for very low bit-rate applications [4].
The need for further improvement in coding efficiency led the Video Coding Experts Group (VCEG) of the ITU-T, in 1998, to invite proposals for a new video coding project named H.26L. The goal was to compress video at half the bit rate of the previous video standards while retaining the same picture quality. In December 2001, the two leading groups (VCEG and MPEG) merged and formed the Joint Video Team (JVT); their combined effort became known as H.264/AVC [1]. Due to its improved compression quality, H.264 quickly became the leading standard; it has been adopted in many video coding applications, such as the iPod and the PlayStation Portable, as well as in TV broadcasting standards such as DVB-H and DMB. Portable applications primarily use the Baseline Profile up to SD resolutions, while high-end video coding applications such as set-top boxes, Blu-ray, and HD DVD use the Main or High Profile at HD resolutions. The Baseline Profile does not support interlaced content; the higher profiles do [4].
The increased commercial interest in video communication calls for international video coding standards. Such standardization requires collaboration between regions and countries with different infrastructures (both academic and industrial), different technical backgrounds, and different political and commercial interests [6]. The primary goal of most video coding standards is to minimize the bit rate necessary to represent video content at a given level of video quality [7]. However, international standards do not

necessarily represent the best technical solutions but rather attempt to achieve
a trade-off between the amount of flexibility and efficiency supported by the
standard and the complexity of the implementation required for the standard [6].
Recently, the ITU-T Video Coding Experts Group (VCEG) and the ISO/IEC
Moving Picture Experts Group (MPEG) joined together in a partnership known as
the Joint Collaborative Team on Video Coding (JCT-VC) [8]. In January 2013, this
joint standardization organization finalized the latest video coding standard named
the High Efficiency Video Coding (HEVC) [2]. This new standard is designed
to achieve multiple goals, including bit-rate reduction over the previous standard
(H.264/MPEG-4 AVC [9]) while maintaining the same picture quality, ease of
transport system integration, and data loss resilience, as well as the ability to
implement it using parallel processing architectures [2]. The major motivation for
this new standard is the growing popularity of HD video and the demand of the
UHD format in commercial video transmission.
ITU-T VCEG (Q6/16) and ISO/IEC MPEG (JTC 1/SC 29/WG 11) are studying
the potential need for standardization of future video coding technology with a
compression capability that significantly exceeds that of the current HEVC standard
(including its current extensions and near-term extensions for screen content coding
and high-dynamic-range coding). Such future standardization action could either
take the form of additional extension(s) of HEVC or an entirely new standard. The
groups are working together on this exploration activity in a joint collaboration
effort known as the Joint Video Exploration Team (JVET) to evaluate compression
technology designs proposed by their experts in this area. The description of
encoding strategies used in experiments for the study of the new technology is
referred to as Joint Exploration Model (JEM). The first meeting was held on October
19–21, 2015.

1.5 About This Book

In this book, we focus on the basic prediction techniques that are widely used in modern video codecs. The hybrid codec structure and the inter- and intra-prediction techniques of MPEG-4, H.264/AVC, and HEVC are discussed together. When we started our own research on video codecs, we spent a lot of time understanding the basic algorithms behind each step; gathering this knowledge from the specification documents and research papers was time-consuming and tedious. For this reason, we believe a textbook is needed in this domain to help new researchers understand the basic algorithms of the video codec easily. Moreover, the latest research trends are also summarized in this book, which can help readers pursue further research in this area.
The book is organized as follows:
• Chapter 2 explains the hybrid video codec in detail. The picture partitioning techniques are discussed here. The basic concepts of the intra- and inter-prediction modes are also highlighted. Moreover, the in-loop filters, DCT, quantization, and entropy coding techniques are explained in detail.
• Chapter 3 focuses on the intra-prediction techniques in the latest video codecs. In this context, angular, planar, and DC intra-prediction techniques are explained in detail. After that, smoothing algorithms and DPCM-based lossless intra-prediction are also explained.
• Chapter 4 highlights inter-prediction techniques. Unidirectional and bidirectional prediction techniques are discussed here. Different inter-prediction modes are explained in detail. Moreover, motion vector prediction is also covered.
• Chapter 5 explains RD cost optimization theory. The background and the classical RD theory are also discussed here.
• Chapter 6 is dedicated to researchers in this domain. In this chapter, the latest work on fast prediction techniques is discussed in detail.

References

1. T. Wiegand, G.J. Sullivan, G. Bjontegaard, A. Luthra, Overview of the H.264/AVC video coding standard. IEEE Trans. Circ. Syst. Video Technol. 13(7), 560–576 (2003)
2. G.J. Sullivan, J.R. Ohm, W.J. Han, T. Wiegand, Overview of the High Efficiency Video Coding
(HEVC) standard. IEEE Trans. Circ. Syst. Video Technol. 22(12), 1649–1668 (2012)
3. I.E. Richardson, Introduction: The Role of Standards, in The H.264 Advanced Video Compression Standard, 2nd edn. (Wiley, New York)
4. A. Michael, Historical overview of video compression in consumer electronic devices. In: IEEE
Int. Conf. on Consumer Electronics (ICCE), Jan. 2007
5. M. Jacobs, J. Probell, A brief history of video coding. ARC International Whitepaper, Jan. 2007
6. R. Schafer, T. Sikora, Digital video coding standards and their role in video communications.
Proc. IEEE 83(6), 907–924 (1995)
7. J.R. Ohm, G.J. Sullivan, H. Schwarz, T.K. Tan, T. Wiegand, Comparison of the coding efficiency
of video coding standards - including High Efficiency Video Coding (HEVC). IEEE Trans. Circ.
Syst. Video Technol. 22(12), 1669–1684 (2012)
8. B. Bross, W.J. Han, G.J. Sullivan, J.R. Ohm, T. Wiegand, High Efficiency Video Coding (HEVC)
text specification draft 9. Document JCTVC-K1003, ITU-T/ISO/IEC Joint Collaborative Team
on Video Coding (JCT-VC), Oct. 2012
9. T. Wiegand, G.J. Sullivan, G. Bjontegaard, A. Luthra, Overview of the H.264/AVC video coding standard. IEEE Trans. Circ. Syst. Video Technol. 13(7), 560–576 (2003)
Chapter 2
Hybrid Video Codec Structure

2.1 Picture Partitioning

In the previous chapter, a brief description of the latest hybrid video codec was given. The hybrid video encoder is basically a block-based video encoder: it breaks a picture into different blocks and processes each of them, either independently or with dependencies. Generally, the hybrid video codec uses a two-layered high-level system design for picture handling: the video coding layer (VCL) and the network abstraction layer (NAL). The VCL includes the low-level processing, such as picture prediction, transform coding, entropy coding, in-loop filtering, etc. The NAL, on the other hand, performs the high-level picture partitioning, encapsulating coded data and associated information into a logical data packet format that is useful for video transmission over various transport layers. The motivation for this kind of high-level partitioning is parallel processing and packetization. In the next subsection, we discuss high-level picture partitioning in detail.

2.1.1 High-Level Picture Partitioning

As mentioned earlier, high-level picture partitioning is required for parallel processing and packetization. The latest video standard, HEVC, as well as its predecessors, uses slices for this kind of high-level picture partitioning.


2.1.1.1 Slice

A slice provides a partitioning of a picture in such a way that each slice is independently decodable. Hence, if a picture is partitioned into N slices, then N parallel processes are possible for that particular picture. As shown in Fig. 2.1, a picture divided into three slices can have each of them processed independently. Apart from enabling parallel processing, slices also provide error robustness for the encoder.
Conceptually, a slice consists of a slice header and its data; the information needed to decode the slice data is specified in the slice header. According to the latest video coding standard, there are two types of slices:
1. Independent slice: as the name suggests, these slices are independent of each other, so an independent slice can be processed without any information from previously encoded slices. Up to the H.264/AVC encoder, all slices were independent in nature; the dependent slice structure was introduced in the HEVC.
2. Dependent slice: in the HEVC, the concept of slice fragmentation has been introduced, according to which each slice can be subdivided into subsets or substreams. As shown in Fig. 2.1, slice 2 is subdivided into two parts. The first part of a slice must be an independent slice (carrying a full slice header); the remaining subsets do not carry a full slice header and instead use the information of the previous slice. These are referred to as dependent slices. This concept is mainly useful in low-delay encoding. On the other hand, dependent slice segments do not provide the same error robustness as independent slices.
Apart from that, each slice can be coded with one of several coding types, listed below:

Fig. 2.1 Slice structure of a picture



Fig. 2.2 Different slice encoding types (I-slice: I0 I1 I2 I3; P-slice: I0 P1 P2 P3; B-slice: I0 B1 B2 P3)

1. I-slice: all the elements (coding units) of the slice are encoded in intra-picture prediction mode.
2. P-slice: in addition to the intra-prediction mode, some of the elements (coding units) of the slice are predicted using inter-picture prediction from only one reference picture.
3. B-slice: the concept of the B-slice is quite similar to the P-slice, with the difference that more than one reference picture (generally two) is used, i.e., a bi-prediction method is employed.
All three slice encoding structures are shown in Fig. 2.2. For P- and B-slices, the first element must be of intra-type (I). Moreover, for a B-slice, a uni-predicted (P) element is also required as an anchor (P3 in Fig. 2.2).

2.1.1.2 Tile

The picture partitioning mechanism of tiles is quite similar to that of slices, but only rectangular partitioning is allowed, as shown in Fig. 2.3; a slice, by contrast, is not restricted to a rectangular shape. Tiles are independently decodable regions of a picture. Their main advantage is that they enhance parallel processing, and they can also be used for spatial random access. In terms of error resilience, tiles are not very attractive, whereas for coding efficiency they provide superior performance over slices.

Fig. 2.3 Tile structure in a picture

2.1.1.3 Wavefront Parallel Processing

This is another new feature of the HEVC encoder. When the WPP option is enabled, a slice is divided into rows of elements (coding tree units, or CTUs). The first row is processed in the ordinary way, and the trick starts from the second row onward: after the second element of the first row has been processed, processing of the second row can start. Similarly, after the second element of the second row has been processed, the third row can be processed, and so on. A pictorial representation of the WPP is shown in Fig. 2.4: the first thread (T1) starts normally; after T1 finishes its second element (element number 2 in the figure), the second thread (T2) starts; and after the second element of T2 (element number 10 in the figure) is finished, T3 starts working. The WPP provides excellent parallel processing within a slice. Moreover, it may provide better compression performance than tiles. A small scheduling sketch follows.

Fig. 2.4 WPP structure in a slice
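
As a minimal sketch of the dependency pattern just described, the code below computes the earliest time step at which each CTU can be processed under the WPP rule (a CTU needs its left neighbor and the top-right neighbor of the row above); the 4 × 8 grid size is an assumption for illustration.

```python
# WPP scheduling sketch: CTU (r, c) can start once (r, c-1) and (r-1, c+1)
# are done, which yields the diagonal "wavefront" described above.
rows, cols = 4, 8
start = [[0] * cols for _ in range(rows)]

for r in range(rows):
    for c in range(cols):
        deps = []
        if c > 0:
            deps.append(start[r][c - 1])          # left neighbor
        if r > 0 and c + 1 < cols:
            deps.append(start[r - 1][c + 1])      # top-right neighbor
        start[r][c] = 1 + max(deps, default=-1)

for row in start:
    print(row)   # each row lags the one above by two time steps
```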

2.2 Block Partitioning

The modern hybrid encoders divide a frame into different blocks and process each of them separately; by “processing” we mean prediction, transform, in-loop filtering, etc. The block sizes may or may not be fixed. In this context, we discuss the block partitioning techniques of the H.264/AVC and HEVC encoders separately.

2.2.1 H.264/AVC Block Partitioning

The basic building units of the H.264/AVC are the macroblocks (MBs). An MB consists of a fixed-size 16 × 16 luma sample block and two corresponding chroma sample blocks. Why is the size 16 × 16? In the literature, it has been shown to be a reasonable size, giving a good trade-off between memory requirement and coding efficiency for HD formats and below, whereas for higher resolutions the 16 × 16 size is not a good option.
For inter-prediction, each MB can be processed in a two-stage hierarchical process. An MB can be predicted as one 16 × 16, two 16 × 8, two 8 × 16, or four 8 × 8 partitions. If it is partitioned into 8 × 8 blocks, then each of the four 8 × 8 blocks can undergo a second level of partitioning: each 8 × 8 block can be partitioned as one 8 × 8, two 8 × 4, two 4 × 8, or four 4 × 4 partitions. A diagram of this partitioning scheme for inter-mode prediction is shown in Fig. 2.5.
Unlike inter-mode prediction, intra-mode prediction allows only 4 × 4, 8 × 8, and 16 × 16 partitions of an MB. For transform coding, only 4 × 4 and 8 × 8 partitioning is used.

Fig. 2.5 MB partitioning for inter-mode prediction



Fig. 2.6 (a) Block partitioning of inter-mode and (b) intra-mode for H.264/AVC

In Fig. 2.6, the inter- and intra-partition modes of the H.264/AVC are shown. In Fig. 2.6a, the segmentation of the macroblock for motion compensation is described: the top part shows the segmentation of macroblocks, and the bottom part shows the segmentation of 8 × 8 partitions. In Fig. 2.6b, the different intra-partitionings of the H.264/AVC are shown. A detailed description of each intra-partitioning mode is given in Chap. 3.

2.2.2 HEVC Block Partitioning

Unlike the fixed partitioning structure based on the MB concept, the HEVC uses more flexible and efficient block partitioning techniques. The HEVC introduces four different block concepts: CTU, CU, PU, and TU. Each CTU consists of a luma coding tree block (CTB) and two chroma CTBs; a similar relationship holds for the CU, PU, and TU. A detailed description of each block type is given below.

2.2.2.1 Coding Tree Unit

The CTU is basically the analogue of the macroblock in the H.264/AVC. Each slice contains an integer number of CTUs. A CTU has a flexible size of 64 × 64, 32 × 32, 16 × 16, or 8 × 8, which can be specified at encoding time. Since it supports block sizes of up to 64 × 64, it provides better coding efficiency for high-resolution video content.

2.2.2.2 Coding Unit

A CTU with a block size of 64 × 64 pixels can be decomposed into four 32 × 32-pixel CUs, and each 32 × 32-pixel CU can in turn be divided into four CUs of 16 × 16 pixels. This decomposition can continue down to CUs of 8 × 8 pixels, the smallest possible CU size. For the different combinations of CU structures, different CTBs are generated for a single CTU. For each CTB, the RD cost value is calculated, and the CTB with the minimum RD cost is considered the best one. The CTB structure of a CTU is illustrated in Fig. 2.7a, where a 64 × 64-pixel CTU block is shown divided into smaller CUs. Upon calculating the RD cost for every combination, the CUs under the red dotted part of Fig. 2.7a give the minimum RD value; the corresponding CTU partitioning and CTB structure for this particular (best) combination are shown in Fig. 2.7b. The split decision itself is sketched in code after this paragraph.
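
A minimal sketch of this recursive split decision is given below. The cost function is a toy stand-in (block variance plus a fixed per-CU overhead), assumed purely for illustration; the real encoder performs a full mode search and RD cost evaluation for each candidate CU.

```python
# Quadtree CU split sketch: keep a block whole if its (toy) RD cost beats the
# sum of its four sub-blocks' costs, otherwise recurse (down to 8x8 CUs).
import numpy as np

MIN_CU = 8
rng = np.random.default_rng(1)
frame = rng.integers(0, 256, size=(64, 64)).astype(float)
frame[:32, :32] = 120.0          # a homogeneous region -> large CU expected

def rd_cost(x, y, size):
    # Toy stand-in: distortion ~ block variance, rate ~ fixed per-CU overhead.
    block = frame[y:y + size, x:x + size]
    return block.var() * size * size + 200.0

def best_partition(x, y, size):
    whole = rd_cost(x, y, size)
    if size == MIN_CU:
        return whole, [(x, y, size)]
    half = size // 2
    parts, split = [], 0.0
    for dy in (0, half):
        for dx in (0, half):
            c, p = best_partition(x + dx, y + dy, half)
            split += c
            parts += p
    return (whole, [(x, y, size)]) if whole <= split else (split, parts)

cost, cus = best_partition(0, 0, 64)
print(len(cus), "CUs chosen")  # the homogeneous area merges into a larger CU
```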
The CTB is an efficient representation of variable block sizes, so that regions of different sizes can be coded with fewer bits while maintaining the same quality. It is possible to encode stationary or homogeneous regions with a larger block size, resulting in a smaller side-information overhead. On the other hand, the CTB structure dramatically increases the computational complexity. As an example, if a frame has dimensions of 704 × 576 pixels, then it is decomposed into 99 (11 × 9) CTUs, and a separate CTB is created for each CTU. For each CTB, 85 CU evaluations are involved for the different CU sizes. As a result, 8415 CU calculations are required for the CTB structure, whereas only 1584 calculations are needed for the 16 × 16 macroblocks used in the previous standard (H.264/AVC).
Let O(n) be the total number of operations when the maximum depth of the coding tree is set to n, and let P_i be the number of operations required for the given CU size at the i-th level. The computational complexity based on variable CU sizes can be described as Eq. 2.1:

O(n) = O(n − 1) + 4^n · P_n,   O(0) = P_0,   P_i = (1/4)^i · P_{i−1}   (2.1)

Fig. 2.7 (a) CTB structure which provides the lowest RD cost for the CTU and (b) corresponding CTU partitioning for the best CTB

The total number of operations can be expressed as Eq. 2.2:

O(n) = Σ_{i=0}^{3} 4^i · P_i   (2.2)

Fig. 2.8 Coding tree block (CTB) structure and the corresponding CUs for a benchmark video sequence (Blowing Bubbles)

As shown in Eq. 2.2, the computational complexity increases monotonically with respect to the CU depth; the counts quoted above can be verified with the short sketch below. In the next section, a significant analysis is provided for the early termination of the CTB structure (Fig. 2.8).
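
The counts quoted above are easy to verify with a few lines of arithmetic; the 704 × 576 frame size and depth range 0–3 are taken from the text.

```python
# Verifying the CU-count arithmetic: a 64x64 CTU evaluated at depths 0..3
# contains 1 + 4 + 16 + 64 = 85 CUs, and a 704x576 frame holds 11 x 9 = 99
# CTUs, i.e. 8415 CU evaluations per frame.

width, height, ctu = 704, 576, 64
ctus = (width // ctu) * (height // ctu)              # 11 * 9 = 99
cus_per_ctu = sum(4 ** depth for depth in range(4))  # 85
print(ctus * cus_per_ctu)                            # 8415

# For comparison, the same frame holds (704/16) * (576/16) = 1584 macroblocks.
print((width // 16) * (height // 16))                # 1584
```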
The advantages of this kind of flexible CU structure are:
• When a region is homogeneous, a large CU can represent the region by using
smaller number of bits.
• The arbitrary size of CTU enables the codec to be readily optimized for various
contents, applications, and devices.
• Finally, it is a very simple but elegant way to represent the multilevel hierarchical quadtree structure.

2.2.2.3 Prediction Unit

The prediction of each CB is signaled as intra (spatial) or inter (temporal), and the prediction is performed at the prediction unit (PU) level. A CU can be split into one, two, or four PUs according to the PU splitting type. Unlike the CU, there is no recursive decomposition procedure for the PU: it can be split only once.
Similar to the H.264, for inter-prediction each CB is split into one, two, or four prediction blocks (PBs) and predicted separately, as shown in Fig. 2.9. Both symmetric (square or rectangular) and asymmetric rectangular (AMP) PU partitionings are performed for each CB. A CB of dimension 8 × 8 requires 9 PU calculations (PART_2N×2N + 2*PART_N×2N + 2*PART_2N×N + 4*PART_N×N), whereas CBs of higher dimensions require 13 (PART_2N×2N + 2*PART_N×2N + 2*PART_2N×N + 2*4 AMPs). Moreover, a bidirectional prediction technique is also adopted in HEVC: two motion vectors (MVs) are calculated separately for each inter-PB, using two reference pictures from list-0 and list-1. For each MV, the RD cost is calculated using the original and generated predicted blocks.

Fig. 2.9 PU partition types in HEVC for intra- and inter-modes

2.2.2.4 Transform Unit

The prediction residual is coded using block transforms. A TU tree structure has its root at the CU level. The luma CB residual may be identical to the luma transform block (TB) or may be further split into smaller luma TBs; the same applies to the chroma TBs. Integer basis functions similar to those of a discrete cosine transform (DCT) are defined for the square TB sizes 4 × 4, 8 × 8, 16 × 16, and 32 × 32. For the 4 × 4 transform of luma intra-picture prediction residuals, an integer transform derived from a form of discrete sine transform (DST) is alternatively specified.

2.3 Prediction Modes

The prediction techniques temporally and spatially predict the current frame from previously stored frame(s). Temporal prediction is based on the assumption that consecutive frames in a video sequence exhibit very close similarity, except that objects, or parts of a frame in general, may be somewhat displaced in position. This assumption is mostly valid, except for frames having a significant change of content. The predicted frame, generated by exploiting temporal redundancy, is subtracted from the incoming video frame pixel by pixel, and the difference is the error image, which in general still exhibits considerable spatial redundancy. A detailed description of the inter-picture prediction techniques in the hybrid video codec is given in Chap. 4.
The intra-picture prediction technique, on the other hand, is based on spatial redundancy. It follows a concept similar to still-image compression; however, in the modern hybrid codec, sophisticated algorithms are applied for the intra-mode decision. A full chapter (Chap. 3) of this book is dedicated to this topic.

2.4 In-Loop Filters

After prediction of a block (PU/TU or MB), a good amount of artifacts is generally present at the block boundaries. For this reason, post-processing filtering is essential to smooth the sharp edges present at the block boundaries; in-loop filters are used for this purpose. In the H.264/AVC, only a deblocking filter is used as the in-loop filter, whereas the HEVC standard specifies two in-loop filters: the deblocking filter and the sample adaptive offset (SAO). A brief description of these two in-loop filters is given below.

2.4.1 Deblocking Filter

The basic concept of the deblocking filter is quite similar in H.264/AVC and HEVC. This filter is intended to reduce the blocking artifacts caused by block-based coding, and it is applied only to samples located at block boundaries. The operation of the deblocking filter can be divided into three main steps: filter strength computation, filter decision, and filter implementation.

2.4.1.1 Filter Strength Computation

Let us consider two blocks, P and Q, adjacent to each other. In Fig. 2.10, two adjacent blocks are shown for a vertical edge; the concept is quite similar for a horizontal edge. The amount of filtering is computed with the help of a parameter called the boundary strength (Bs), which depends on the current quantizer, block type, motion vector, and other parameters. In the HEVC, the boundary strength is calculated using an algorithm shown as a simplified flowchart in Fig. 2.11; if the boundary strength is greater than zero, deblocking filtering is applied to the blocks. A code sketch of this decision follows.

Fig. 2.10 Four sample segments of a vertical block boundary between adjacent blocks P and Q

Fig. 2.11 Boundary strength (Bs) calculation for two adjacent blocks P and Q
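
A compact sketch of the boundary-strength decision of Fig. 2.11 follows. The block descriptors are simplified stand-ins (plain dictionaries with assumed keys), not the encoder's real data structures, and MVs are assumed to be in quarter-sample units.

```python
# Boundary strength (Bs) decision for a P/Q block pair, following the
# simplified flowchart of Fig. 2.11. Block descriptors are assumed dicts.

def boundary_strength(p, q):
    if p["intra"] or q["intra"]:
        return 2
    if p["nonzero_coeffs"] or q["nonzero_coeffs"]:
        return 1
    if p["ref_pic"] != q["ref_pic"]:
        return 1
    # MV difference of at least one integer sample (4 quarter-pel units)
    if abs(p["mv"][0] - q["mv"][0]) >= 4 or abs(p["mv"][1] - q["mv"][1]) >= 4:
        return 1
    return 0

p = {"intra": False, "nonzero_coeffs": False, "ref_pic": 0, "mv": (6, 0)}
q = {"intra": False, "nonzero_coeffs": False, "ref_pic": 0, "mv": (1, 0)}
print(boundary_strength(p, q))  # 1: MVs differ by more than one integer sample
```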
2.4 In-Loop Filters 25

2.4.1.2 Filtering Decision

There are two kinds of filtering decisions that are taken in the HEVC encoder. These
are:
• required filtering or not ?
• if filtering is required, then is it a normal filtering or a strong filtering ?
The condition for the first decision can be formulated as Eq. 2.3:

jP2;0  2P1;0 C P0;0 j C jP2;3  2P1;3 C P0;3 jC


(2.3)
jQ2;0  2Q1;0 C Q0;0 j C jQ2;3  2Q1;3 C Q0;3 j < ˇ

In this equation, ˇ is a threshold which depends on the quantization parameter


(QP) and is derived from a lookup table in the encoder. On the other hand, for the
second decision, there are three conditions. If all of the three conditions are satisfied,
then a strong filtering is applied on the block otherwise normal filtering is applied.
These three conditions are given below:

ˇ
jP2;i  2P1;i C P0;i j C jQ2;i  2Q1;i C Q0;i j < (2.4)
8
ˇ
jP3;i  P0;i j C jQ3;i  Q0;i j < (2.5)
8
jP0;i  Q0;i j < 2:5  tc (2.6)

These conditions are applied for $i = 0$ and $i = 3$. In Eq. 2.6, $t_c$ is another
threshold, which is generally referred to as the clipping parameter. The algorithm
for the filtering decision is shown in Fig. 2.12 as a flowchart.
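
The decision logic of Eqs. 2.3–2.6 can also be summarized in a short sketch. The
helper below is a simplified illustration, not the normative process; P[j][i] and
Q[j][i] are assumed to denote the sample at distance j from the edge in row i, with
β and t_c supplied by the caller:

    def activity(P, Q, i):
        # Second-derivative activity of row i on both sides of the edge.
        return (abs(P[2][i] - 2 * P[1][i] + P[0][i]) +
                abs(Q[2][i] - 2 * Q[1][i] + Q[0][i]))

    def filter_decision(P, Q, beta, tc):
        # First decision (Eq. 2.3): filter only smooth edge regions.
        if activity(P, Q, 0) + activity(P, Q, 3) >= beta:
            return "no filtering"
        # Second decision (Eqs. 2.4-2.6), checked for rows i = 0 and i = 3.
        for i in (0, 3):
            if activity(P, Q, i) >= beta / 8:
                return "normal filtering"
            if abs(P[3][i] - P[0][i]) + abs(Q[3][i] - Q[0][i]) >= beta / 8:
                return "normal filtering"
            if abs(P[0][i] - Q[0][i]) >= 2.5 * tc:
                return "normal filtering"
        return "strong filtering"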

2.4.1.3 Filter Implementation

When the normal deblocking filter is selected, one or two samples are modified
in block P or Q, based on some conditions. On the other hand, the strong
deblocking filter is applied to smooth flat areas, where artifacts are more visible.
This filtering mode modifies three samples from the block boundary and enables
strong low-pass filtering.

2.4.2 Sample Adaptive Offset

Sample adaptive offset is the second-level in-loop filtering in the HEVC, which
attenuates ringing artifacts. Ringing artifacts generally appear for large transform sizes.
Fig. 2.12 Filter decision algorithm (flowchart: the boundary must be aligned with the 8 × 8 sample grid and be a PU or TU boundary with Bs > 0; condition (2.3) then selects between no filtering and filtering, and conditions (2.4)–(2.6) select between normal and strong filtering)
SAO is applied to the output of the deblocking filter. The HEVC
includes two kinds of SAO types. These are:
• Edge offset (EO)
• Band offset (BO)

Fig. 2.13 One-directional patterns for EO sample classification: (a) horizontal, (b) vertical, (c) 135° diagonal, and (d) 45° diagonal (each pattern compares the sample p with its neighbors n0 and n1)

Table 2.1 EdgeIdx categories in SAO edge classes

    EdgeIdx   Condition                                    Meaning
    0         Cases not listed below                       Monotonic area
    1         p < n0 and p < n1                            Local min
    2         (p < n0 and p = n1) or (p < n1 and p = n0)   Edge
    3         (p > n0 and p = n1) or (p > n1 and p = n0)   Edge
    4         p > n0 and p > n1                            Local max

2.4.2.1 Edge Offset

Edge offset is based on a comparison between the current sample and its neighboring
samples. EO uses four one-directional patterns for edge offset classification in
the CTB. These patterns are horizontal, vertical, 135° diagonal, and 45° diagonal, as
shown in Fig. 2.13. Each sample in the CTB is classified into one of five categories
by comparing it with the neighboring values. The categories are generally defined as
EdgeIdx. The meaning of the different EdgeIdx values and the corresponding conditions is
given in Table 2.1.
Depending upon the EdgeIdx, an offset value from a transmitted lookup table is
added to the sample value. For EdgeIdx = 1 and 2 a positive offset, and for EdgeIdx
= 3 and 4 a negative offset, is added to the samples for smoothing.
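
Table 2.1 translates directly into a small classification routine. The following is a
minimal sketch of the EdgeIdx decision for one sample p and its two neighbors n0
and n1 along the chosen pattern:

    def edge_idx(p, n0, n1):
        # EdgeIdx categories of Table 2.1.
        if p < n0 and p < n1:
            return 1  # local minimum (positive offset)
        if (p < n0 and p == n1) or (p < n1 and p == n0):
            return 2  # edge (positive offset)
        if (p > n0 and p == n1) or (p > n1 and p == n0):
            return 3  # edge (negative offset)
        if p > n0 and p > n1:
            return 4  # local maximum (negative offset)
        return 0      # monotonic area, no offset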

2.4.2.2 Band Offset

In this kind of SAO, the same offset is added to all samples whose values belong to the
same band. Here the amplitude of a sample is the key factor for the offset. In this
mode, the full sample amplitude range is uniformly divided into 32 bands. The sample
values belonging to four of these bands are modified by adding band offsets.
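
For 8-bit samples, dividing the full amplitude range 0–255 into 32 uniform bands
gives each band a width of 8, so the band index of a sample is simply its value
shifted right by 3. A minimal sketch follows, in which the starting band and the four
offsets are assumed to be signaled by the encoder:

    def band_offset(sample, start_band, offsets):
        # 32 uniform bands over an 8-bit range: band index = value / 8.
        band = sample >> 3
        # Only four consecutive bands carry transmitted offsets.
        if start_band <= band < start_band + 4:
            sample += offsets[band - start_band]
        return max(0, min(255, sample))  # clip to the valid range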

2.5 Entropy Coding

After the in-loop filtering stage, the next step in the hybrid codec is the entropy coding of
the quantized transform data. Here, lossless compression schemes are applied. In the modern
hybrid video codec, context-based adaptive binary arithmetic coding (CABAC)
is used. But before describing CABAC, some preliminary knowledge about
entropy coding is required. So, some basic entropy coding algorithms, like Huffman
coding and arithmetic coding, are discussed first, followed by CABAC.

2.5.1 Huffman Coding

Huffman coding is a popular lossless variable length coding scheme, based on the
following principles:
• Shorter code words are assigned to more probable symbols.
• No code word of a symbol is a prefix of another code word.
• Every source symbol must have a unique code word assigned to it.
It is easiest to explain Huffman coding by using an example. Let us consider
six symbols a1, a2, a3, a4, a5, and a6. Moreover, before applying
Huffman coding, we also know the probability of occurrence of each symbol. Let
us consider the probabilities to be 0.4, 0.3, 0.1, 0.1, 0.06, and 0.04, respectively.
The steps for the Huffman coding are given below:
step 1: Arrange the symbols in the decreasing order of their probabilities.
step 2: Combine the two lowest-probability symbols into a single compound symbol
that replaces them in the next source reduction. In this example, a5 and a6 are
combined into a compound symbol of probability 0.1.
step 3: Continue the source reductions of step 2 until we are left with only two
symbols. This is shown in Fig. 2.14. After the final reduction, the two remaining
symbols have probabilities 0.6 and 0.4. We are now in a position to assign codes
to the symbols.
step 4: Assign codes 0 and 1 to the last two symbols.
step 5: Work backward along the table to assign the codes to the elements of
the compound symbols. Continue until codes are assigned to all the elementary
symbols. This is shown in Fig. 2.15.
Hence, after applying the Huffman coding, the corresponding code words of
the symbols are a1 = 1, a2 = 00, a3 = 011, a4 = 0100, a5 = 01010, and
a6 = 01011. If we calculate it properly, the average length of this code is 2.2
bits per symbol. Huffman's procedure creates the optimal code for a set of symbols
and probabilities, subject to the constraint that the symbols are coded one at a time.
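
The reduction-and-assignment procedure can be reproduced with a standard
priority queue. The following is a minimal sketch rather than a codec implementation;
since ties may be broken differently, the individual code words can differ from those
above, but the optimal average length of 2.2 bits per symbol is preserved:

    import heapq

    def huffman_codes(probs):
        # Heap entries: (probability, tie-breaker, {symbol: code so far}).
        heap = [(p, i, {s: ""}) for i, (s, p) in enumerate(probs.items())]
        heapq.heapify(heap)
        count = len(heap)
        while len(heap) > 1:
            p0, _, c0 = heapq.heappop(heap)  # two least probable entries
            p1, _, c1 = heapq.heappop(heap)
            merged = {s: "0" + c for s, c in c0.items()}
            merged.update({s: "1" + c for s, c in c1.items()})
            heapq.heappush(heap, (p0 + p1, count, merged))
            count += 1
        return heap[0][2]

    probs = {"a1": 0.4, "a2": 0.3, "a3": 0.1, "a4": 0.1, "a5": 0.06, "a6": 0.04}
    codes = huffman_codes(probs)
    print(sum(probs[s] * len(codes[s]) for s in probs))  # 2.2 bits per symbol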

    Original source        Source reduction
    Symbol   Probability   1     2     3     4
    a1       0.4           0.4   0.4   0.4   0.6
    a2       0.3           0.3   0.3   0.3   0.4
    a3       0.1           0.1   0.2   0.3
    a4       0.1           0.1   0.1
    a5       0.06          0.1
    a6       0.04

Fig. 2.14 Huffman coding technique up to step 3

    Original source               Source reduction
    Symbol   Probability   Code   1          2          3          4
    a1       0.4           1      0.4  1     0.4  1     0.4  1     0.6  0
    a2       0.3           00     0.3  00    0.3  00    0.3  00    0.4  1
    a3       0.1           011    0.1  011   0.2  010   0.3  01
    a4       0.1           0100   0.1  0100  0.1  011
    a5       0.06          01010  0.1  0101
    a6       0.04          01011

Fig. 2.15 Huffman coding technique up to step 5

2.5.2 Arithmetic Coding

Arithmetic coding is also a variable length coding (VLC) scheme requiring a priori
knowledge of the symbol probabilities. The basic steps for this algorithm are given
below:
step 1: Consider the range of real numbers [0, 1). Subdivide this range into a
number of subranges equal to the total number of symbols in the source
alphabet. Each subrange spans a real interval equal to the probability of the
corresponding source symbol.
step 2: Consider a source message and take its first symbol. Find the subrange
to which this source symbol belongs.
step 3: Subdivide this subrange into next-level subranges, according
to the probabilities of the source symbols.
step 4: Now parse the next symbol in the given source message and determine
the next-level subrange to which it belongs.
step 5: Repeat step 3 and step 4 until all the symbols in the source message are
parsed. The message may be encoded using any real value in the last subrange
so formed. A special end-of-message symbol is reserved as the final
message indicator.
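
The interval-narrowing idea behind these steps can be illustrated with a toy
floating-point encoder. This is only a conceptual sketch with a made-up three-symbol
alphabet; practical coders, including the arithmetic engine inside CABAC, use
integer intervals with renormalization instead of floating point:

    def arithmetic_encode(message, probs):
        # Assign each symbol a cumulative subrange of [0, 1).
        cum, c = {}, 0.0
        for s, p in probs.items():
            cum[s] = (c, c + p)
            c += p
        low, high = 0.0, 1.0
        for s in message:
            lo, hi = cum[s]
            span = high - low
            # Narrow the current interval to the symbol's subrange.
            low, high = low + span * lo, low + span * hi
        return (low + high) / 2  # any value in [low, high) encodes the message

    print(arithmetic_encode(["a1", "a2", "a1"], {"a1": 0.5, "a2": 0.3, "a3": 0.2}))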

2.5.3 CABAC

As we have mentioned earlier, the context-based adaptive binary arithmetic
coding (CABAC) technique is generally used in the modern hybrid codec. The
CABAC algorithm has four distinct steps, which are given below:
step 1: A non-binary-valued symbol (transform coefficient or motion vector) is
converted into a binary code. This process is generally referred to as binarization.
step 2: A probability model for one or more bins is chosen from the recently
coded data symbols. This is referred to as context model selection.
step 3: An arithmetic coder encodes each bin according to the selected probability
model.
step 4: Finally, the selected context model is updated based on the actual coded
value.
Chapter 3
Intra-prediction Techniques

3.1 Background

In the intra-prediction, a block is predicted only with the help of the current frame.
So, in this kind of prediction, reference frames are not required. Only spatial
redundancy is exploited in this prediction. The main concept behind this prediction
is that the neighboring pixels of a block should have a high amount of correlation.
For example, let us consider the Foreman sequence shown in Fig. 3.1. In this
diagram, a block is enlarged from the sequence (Fig. 3.1a). The blocks which
are present above and to the left of the enlarged block are already encoded
and are denoted in this diagram with an "o" sign, while the other, not-yet-encoded blocks
are denoted with an "×" sign. The top neighboring pixels of this block are shown in
Fig. 3.1b. Let us consider that the current block is predicted from the top neighboring
pixels. That means that, in the predicted block, all pixels in a column have the same
value as the vertically neighboring pixel corresponding to that column. This kind of
prediction is generally referred to as padding. In Fig. 3.1b, the vertically padded
block (predicted block) for the current block is shown.
Now, one question might arise in your mind: should an error be produced
by this prediction? The answer is yes. For this reason, the corresponding residual
block is also generated. In the previous chapter, we discussed the residual block
in detail. In a nutshell, it is basically just the difference between the
predicted block and the current block.
In this example, only the vertical padding is considered. No doubt, if we
consider other orientations of padding as well, the prediction will be more
accurate. Let us consider three orientations of padding, as shown in Fig. 3.2. In
this diagram, apart from the vertical padding, horizontal padding and diagonal
padding are also considered. So, for this example, all three predictions are
performed (vertical, horizontal, and diagonal). The corresponding residual blocks
are generated, and the rate-distortion cost values are calculated.


Fig. 3.1 Conceptual diagram of a block and the correlation with its neighboring pixels. (a)
Enlarged version of block from the Foreman sequence, (b) the neighboring pixels of the block
which are present in the above and the corresponding vertical padding with these pixels

The prediction which provides the minimum rate-distortion cost is considered the best one. This
is the basic background of the modern intra-prediction technique. All the latest
hybrid encoders use this approach for intra-coding. Depending upon
the encoder, the angular prediction modes vary. We will discuss the different intra-
modes for H.264/AVC and HEVC in the next subsections.

3.2 Intra-prediction Modes in H.264/AVC

In the H.264/AVC, the intra-predictions are made only for square-shaped blocks.
The size of the square-shaped block can vary from 4 × 4 to 16 × 16 for the luma
component. The 8 × 8 luma block is a special case which is used for the high profiles.

Fig. 3.2 Different orientation of the padding

Fig. 3.3 Sample positions in the 4 × 4 macroblock: samples a–p are to be intra-predicted from the already encoded neighboring samples M and A–H (above) and I–L (left); the angular directions of modes 0–8 are also indicated, with mode 2 being DC

The 4 × 4 and 8 × 8 blocks are considered smaller blocks, and the 16 × 16 block
is considered the larger block. In the H.264/AVC, nine modes are assigned for
the smaller blocks, and four modes are assigned for the larger block.
In Fig. 3.3, a 4 × 4 macroblock and its corresponding neighboring pixels are
shown. The neighboring green-colored pixels represent the pixels which are already
encoded, and the corresponding 4 × 4 macroblock is intra-predicted with the help
of these neighboring pixels. As we have mentioned earlier, nine intra-modes are
supported by the H.264/AVC encoder. The angular directions of these nine modes are
shown in Fig. 3.3. A brief description of each mode is given in Table 3.1, and the
corresponding pictorial representation is shown in Fig. 3.4. If you compare Fig. 3.4
and Table 3.1, it is quite easy to understand the different angular intra-prediction
modes for the smaller macroblocks.
The prediction technique of mode 0, mode 1, and mode 2 is very straightforward:
only the simple padding concept and the average function are used in these three modes.
On the other hand, the remaining six modes have a slightly more complex way of calculating the
predicted pixels, and the pixels in the macroblock need not all have the
same predicted value. To understand this more easily, Fig. 3.5 provides a pictorial
view of the predicted value of each pixel in the macroblock. Figure 3.5 is self-
explanatory, and we hope the readers will understand the calculation techniques
of intra-prediction for each pixel in the macroblock.
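
As an illustration, the three simplest modes can be written down directly. The
sketch below uses the sample names of Fig. 3.4 (A–D above, I–L to the left) and is
a simplified illustration rather than reference-encoder code:

    def predict_4x4(mode, above, left):
        # above = [A, B, C, D], left = [I, J, K, L] (Fig. 3.4 notation).
        if mode == 0:  # vertical: each column repeats its above neighbor
            return [[above[x] for x in range(4)] for _ in range(4)]
        if mode == 1:  # horizontal: each row repeats its left neighbor
            return [[left[y]] * 4 for y in range(4)]
        if mode == 2:  # DC: every sample is the rounded mean of A-D and I-L
            dc = (sum(above) + sum(left) + 4) >> 3
            return [[dc] * 4 for _ in range(4)]
        raise NotImplementedError("modes 3-8 use the predictors of Fig. 3.5")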
So far, we have discussed the intra-prediction only for the smaller blocks
(4 × 4 and 8 × 8). For the 16 × 16 luma blocks, the intra-prediction is simpler.

Fig. 3.4 Pictorial representation of the nine intra-prediction modes for a smaller macroblock: 0 (vertical), 1 (horizontal), 2 (DC, mean of A–L), 3 (diagonal down-left), 4 (diagonal down-right), 5 (vertical-right), 6 (horizontal-down), 7 (vertical-left), and 8 (horizontal-up)

Table 3.1 Description of different intra-modes

    Modes    Angular direction     Description
    Mode 0   Vertical              A, B, C, D are extrapolated vertically
    Mode 1   Horizontal            I, J, K, L are extrapolated horizontally
    Mode 2   DC                    All samples are the mean of {A, B, C, D, I, J, K, L}
    Mode 3   Diagonal down-left    Interpolated at a 45° angle between lower left and upper right
    Mode 4   Diagonal down-right   Extrapolated at a 45° angle between upper left and lower right
    Mode 5   Vertical-right        Extrapolated at a 26.6° angle to the left of vertical
    Mode 6   Horizontal-down       Extrapolated at a 26.6° angle below horizontal
    Mode 7   Vertical-left         Extrapolated at a 26.6° angle to the right of vertical
    Mode 8   Horizontal-up         Extrapolated at a 26.6° angle above horizontal

Only four modes are available for these larger blocks: vertical, horizontal, DC,
and plane. Conceptually, the descriptions of the first three are the same as those of
modes 0, 1, and 2 in Table 3.1 [1].

3.3 Intra-prediction Modes in HEVC

3.3.1 Angular Prediction

The intra-prediction operates according to the transform block (TB) size. The TB
sizes vary from 4 × 4 to 32 × 32. In the HEVC, 35 different intra-prediction
modes are allowed. Among these, 33 intra-prediction modes are directional,
one is DC, and the last one is the planar mode.

Fig. 3.5 Calculation techniques for the different intra-prediction modes (predictor equations for modes 3–8; see http://sidd-reddy.blogspot.kr/2011/04/h264-intra-coding.html)

Fig. 3.6 Modes and directional orientation in the HEVC encoder of the intra-prediction [2]

We will discuss the DC and the planar modes in the next subsection. All the modes
and their directional orientations in the HEVC encoder are shown in Fig. 3.6.
The 33 angular modes are generally referred to as Intra_Angular[k], where k
is a mode number from 2 to 34. The angles are intentionally designed to provide
denser coverage for near-horizontal and near-vertical angles and coarser coverage
for near-diagonal angles, for the effectiveness of the signal prediction processing [2].
Generally, the Intra_Angular prediction targets regions which have strong
directional edges.

In Intra_Angular[k], k in the range from 2 to 17 refers to the prediction of the
horizontal modes, where the samples located in the above row are projected as
additional samples located in the left column. On the other hand, k in the range from
18 to 34 refers to the sample prediction for the vertical modes. Let us consider that
a sample which we need to predict is represented as p[x][y], where x and y are the
indexes. For k in the range from 2 to 17, p[x][y] is computed as

$p[x][y] = ((32 - f) \cdot \mathrm{ref}[y + i + 1] + f \cdot \mathrm{ref}[y + i + 2] + 16) \gg 5$    (3.1)

For k in the range from 18 to 34, p[x][y] is computed as

$p[x][y] = ((32 - f) \cdot \mathrm{ref}[x + i + 1] + f \cdot \mathrm{ref}[x + i + 2] + 16) \gg 5$    (3.2)

where i is the projected integer displacement on row y or column x, calculated
from the angular parameter A as

$i = ((x + 1) \cdot A) \gg 5, \quad k = 2, 3, \ldots, 17$    (3.3)

$i = ((y + 1) \cdot A) \gg 5, \quad k = 18, 19, \ldots, 34$    (3.4)

On the other hand, in Eqs. 3.1 and 3.2, f represents the fractional part of the
projected displacement on the same row or column and is calculated as

$f = ((x + 1) \cdot A) \,\&\, 31, \quad k = 2, 3, \ldots, 17$    (3.5)

$f = ((y + 1) \cdot A) \,\&\, 31, \quad k = 18, 19, \ldots, 34$    (3.6)

To improve the intra-prediction accuracy in the HEVC, the projected reference
sample position is computed with 1/32 sample accuracy. Bilinear interpolation
is used here to obtain the value of the projected reference sample from the
two closest reference samples located at integer positions [2].
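
To make Eqs. 3.2, 3.4, and 3.6 concrete, the following is a minimal sketch of a
vertical angular mode for an N × N block. The construction of the projected
reference array ref, which would normally come from the reference sample
projection step, is assumed to be done by the caller:

    def predict_angular_vertical(N, A, ref):
        # A is the angular parameter of the mode; ref[1] is assumed to be
        # the reference sample directly above column x = 0.
        p = [[0] * N for _ in range(N)]
        for y in range(N):
            i = ((y + 1) * A) >> 5   # integer displacement (Eq. 3.4)
            f = ((y + 1) * A) & 31   # fractional displacement (Eq. 3.6)
            for x in range(N):
                # Bilinear interpolation at 1/32-sample accuracy (Eq. 3.2).
                p[x][y] = ((32 - f) * ref[x + i + 1] +
                           f * ref[x + i + 2] + 16) >> 5
        return p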

3.3.2 DC and Planer Prediction

Conceptually, these prediction techniques are quite similar to those of the H.264/AVC. Intra-
DC prediction uses an average value of the reference samples that are present
immediately to the left of and above the block to be predicted. On the other hand, the
average values of two linear predictions using four corner reference samples are
used in intra-planar prediction to prevent discontinuities along the block boundaries.
Generally, the planar prediction has the capability to predict a region without
discontinuities on the block boundaries. The planar prediction is calculated by
averaging a vertical and a horizontal linear prediction. For example, a
sample p[x][y] can be predicted as

$p[x][y] = (p_h[x][y] + p_v[x][y] + N) \gg (\log_2(N) + 1)$    (3.7)



In this equation, $p_h[x][y]$ and $p_v[x][y]$ represent the horizontal and vertical predictions,
which are calculated as

$p_h[x][y] = (N - 1 - x) \cdot p[-1][y] + (x + 1) \cdot p[N][-1]$    (3.8)

$p_v[x][y] = (N - 1 - y) \cdot p[x][-1] + (y + 1) \cdot p[-1][N]$    (3.9)
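
Equations 3.7–3.9 map directly onto a short routine. The following is a minimal
sketch for an N × N block, with N a power of two; the argument names for the
boundary samples are illustrative:

    def predict_planar(N, left, above, top_right, bottom_left):
        # left[y] = p[-1][y], above[x] = p[x][-1],
        # top_right = p[N][-1], bottom_left = p[-1][N].
        log2_n = N.bit_length() - 1
        p = [[0] * N for _ in range(N)]
        for y in range(N):
            for x in range(N):
                ph = (N - 1 - x) * left[y] + (x + 1) * top_right     # Eq. 3.8
                pv = (N - 1 - y) * above[x] + (y + 1) * bottom_left  # Eq. 3.9
                p[x][y] = (ph + pv + N) >> (log2_n + 1)              # Eq. 3.7
        return p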

3.3.3 Reference Sample Smoothing and Boundary Value Smoothing

In the HEVC, a three-tap [1 2 1]/4 smoothing filter is used for the reference samples
in the intra-prediction. The reference sample smoothing is adaptive in nature in
the HEVC. For different block sizes, the reference sample smoothing is applied as
follows [2]:
• For 8 × 8 blocks, only the diagonal directions, Intra_Angular[k] with k = 2, 18,
or 34, use the reference sample smoothing.
• For 16 × 16 blocks, the reference samples are filtered for most directions except
the near-horizontal and near-vertical directions, k in the range of 9–11 and 25–27.
• For 32 × 32 blocks, all directions except the exactly horizontal (k = 10) and
exactly vertical (k = 26) directions use the smoothing filter.
To remove discontinuities along block boundaries, boundary value smoothing is
used. This smoothing technique is used for three modes: Intra_DC (mode 1) and
Intra_Angular[k] with k = 10 (exactly horizontal) or k = 26 (exactly vertical).

3.4 Lossless Intra-prediction Using DPCM

In the previous sections, we have discussed the intra-prediction for both
H.264/AVC and HEVC in detail. The differential pulse code modulation (DPCM)-based
approach is a special technique proposed in [3]. This technique improves the
lossless intra-coding efficiency to a good extent.
Let us consider a 4 × 4 block, and assume this block is intra-predicted horizontally. In
Fig. 3.7, the corresponding 4 × 4 block and its reference pixels are shown. In
the normal horizontal intra-prediction, the residuals of the first row are calculated as

$r_0 = p_0 - q_0, \quad r_1 = p_1 - q_0, \quad r_2 = p_2 - q_0, \quad r_3 = p_3 - q_0$    (3.10)

Fig. 3.7 Boundary samples (l0–l8 above, q0–q3 to the left) and inside samples (p0–p15) for the 4 × 4 intra-prediction [3]

In this equation, $r_0$, $r_1$, $r_2$, and $r_3$ are the corresponding residual values in the first
row. According to the DPCM-based approach, the residuals can instead be calculated as

$r_0 = p_0 - q_0, \quad r_1 = p_1 - p_0, \quad r_2 = p_2 - p_1, \quad r_3 = p_3 - p_2$    (3.11)

The encoder sends $r_0$, $r_1$, $r_2$, and $r_3$ as part of a residual block, and the decoder
can then decode the residuals as a block and apply them for reconstruction. In
the decoder, the reconstruction of $p_0$, $p_1$, $p_2$, and $p_3$ is also quite simple. The
generalized relationship for the first row of the 4 × 4 block is

$p_i = q_0 + \sum_{k=0}^{i} r_k, \quad 0 \le i \le 3$    (3.12)

The vertical prediction can be performed in a way similar to the horizontal
prediction. For the other modes, the same concept can also be applied. An overall
improvement in lossless coding compression capability of approximately 12 %
has been shown in the experimental results, without a substantial increase in the
complexity of the encoding or decoding processes [3].
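
The row-wise DPCM residual generation (Eq. 3.11) and its reconstruction
(Eq. 3.12) can be sketched as follows; the example row values are made up for
illustration:

    def dpcm_residuals(row, q0):
        # Eq. 3.11: difference to the previous sample instead of to q0.
        residuals = [row[0] - q0]
        for k in range(1, len(row)):
            residuals.append(row[k] - row[k - 1])
        return residuals

    def dpcm_reconstruct(residuals, q0):
        # Eq. 3.12: p_i = q0 + sum of the residuals r_0..r_i.
        row, acc = [], q0
        for r in residuals:
            acc += r
            row.append(acc)
        return row

    row = [104, 105, 107, 106]
    assert dpcm_reconstruct(dpcm_residuals(row, 100), 100) == row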

References

1. T. Wiegand, G.J. Sullivan, G. Bjontegaard, A. Luthra, Overview of the H.264/AVC video coding
standard. IEEE Trans. Circ. Syst. Video Technol. 13(7), 560–576 (2003)
2. G.J. Sullivan, J.-R. Ohm, W.-J. Han, T. Wiegand, Overview of the High Efficiency Video Coding
(HEVC) standard. IEEE Trans. Circ. Syst. Video Technol. 22(12), 1649–1668 (2012)
3. Y.-L. Lee, K.-H. Han, G.J. Sullivan, Improved lossless intra coding for H.264/MPEG-4 AVC.
IEEE Trans. Image Process. 15(9), 2610–2615 (2006)
Chapter 4
Inter-prediction Techniques

4.1 Motion Estimation

In the first two chapters, an overall description of the latest hybrid video codec
was given. As we have discussed earlier, there are mainly two kinds of
prediction techniques used in the modern hybrid video codec: the inter-
and intra-prediction techniques. Generally, temporal and spatial redundancies are
exploited in these prediction techniques, respectively.
Temporal prediction is based on the assumption that consecutive video
frames exhibit very close similarity. This technique is used in the motion estimation
module, which computes the difference between a current frame and a reference
frame. Generally, the immediate past frame is considered as the reference frame. The
difference in position between a candidate block and its closest match in the reference
frame is called the motion vector. After determining the motion vectors, one can
predict the current frame using the reference frame.
Motion estimation is one of the most important operations involved in any
video processing system. The ultimate goal is to minimize the total number of
bits used for coding the motion vectors and the prediction errors. According to the
temporal order of the current and reference frames, motion estimation can be divided into
two categories, forward and backward motion estimation, as shown in Fig. 4.1. In
backward motion estimation, the current frame is considered as the candidate frame,
and the reference frame is a past frame, which implies that the search is backward. In
forward motion estimation, the exact opposite scenario occurs, as
shown in Fig. 4.1. A general problem in both kinds of motion estimation is how to
parameterize the motion field. Usually, there are multiple objects in a video frame
that can move in different directions. Hence, a global parameterized model is usually
not adequate to solve this problem. The basic approaches of motion estimation are
as follows:



Fig. 4.1 Block diagram of motion estimation

• Pixel-based representation
• Block-based representation
• Mesh-based representation
However, in the hybrid video codec, block-based motion estimation techniques
are applied. For this reason, in this book, we will discuss only the block-based
motion estimation technique.
In block-based motion estimation, a picture or frame is partitioned into small
nonoverlapping blocks (a detailed description is given in Chap. 2). The motion variation
within each nonoverlapping block can be characterized well, and the motion vectors
can be estimated independently. This method provides a good compromise between
accuracy and complexity. In this technique, the motion vector is calculated for each
block independently. The main challenge in this method is how to specify the search
area for a block: for a block at a certain position in the current frame, one has to
examine its tentative positions within a search area in the reference frame.
The main disadvantage of the block-based representation is that the resulting motion
is often discontinuous across block boundaries. Unless the motion vectors of
adjacent blocks vary smoothly, the estimated motion fields may be discontinuous
and sometimes chaotic. This effect causes boundary artifacts.
Let us consider that frame t in Fig. 4.2 is the current frame and that the blocks in this
current frame are predicted from a previously decoded frame t − 1, which is referred
to as the reference frame.

Fig. 4.2 Pictorial representation of motion estimation

As shown in Fig. 4.2, first of all a search region is defined
in the reference frame for a particular block. After that, for all possible positions
in this search range, a cost function is calculated for this block. This algorithm is
generally referred to as full search block motion (FSBM) estimation, which is quite
expensive in terms of speed. There is a good number of efficient and popular fast
block matching algorithms (BMA) available, which give satisfactory results in terms
of both quality and speed.
Now one question may arise in your mind: how is the cost function calculated?
In this context, by the term "cost function," we basically mean a matching criterion
used to obtain the motion vector. Generally, different kinds of techniques can be used
for this purpose, and in the hybrid codec, the user can change the cost function by
modifying the configuration file. However, the most efficient and lowest-complexity
cost function is the sum of absolute differences (SAD). Suppose the block size is
N × N; then the SAD between two blocks in frames t and t − 1, at displacement
(i, j), can be calculated as

$\mathrm{SAD}(i, j) = \sum_{x=0}^{N-1} \sum_{y=0}^{N-1} |B_t(x, y) - B_{t-1}(x + i, y + j)|$    (4.1)

To explain the motion estimation concept more clearly, let us use a toy example.
In Fig. 4.3, the search range in the reference frame is shown as the green-colored
box. First of all, the SAD value is calculated at the origin (as shown in
Fig. 4.3). After that, the box moves toward the right by one pixel, and the corresponding
SAD value is calculated. In this way, the corresponding SAD values are calculated for
all possible positions in the search region. Let us consider that, after calculating all
possible SAD values, the corresponding SAD values for each position are as shown
in Fig. 4.3b. The minimum SAD value for this example is 22. Hence, the corresponding
motion vector (MV) for this example will be the vector from the origin to the position
which provides the minimum SAD value (as shown in Fig. 4.3). So, mathematically,
it can be written as

$\mathrm{MV} = [d_1, d_2] = \arg\min_{(i,j)} \mathrm{SAD}(i, j)$    (4.2)
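
Equations 4.1 and 4.2 together describe the full search. The following is a minimal
sketch of FSBM for one block; frame-boundary checks and all fast-search
refinements are omitted for brevity:

    def full_search(cur_block, ref_frame, bx, by, search_range):
        # Exhaustive SAD minimization (Eqs. 4.1 and 4.2) around (bx, by).
        N = len(cur_block)
        best_sad, best_mv = None, (0, 0)
        for dy in range(-search_range, search_range + 1):
            for dx in range(-search_range, search_range + 1):
                sad = 0
                for y in range(N):
                    for x in range(N):
                        sad += abs(cur_block[y][x] -
                                   ref_frame[by + y + dy][bx + x + dx])
                if best_sad is None or sad < best_sad:
                    best_sad, best_mv = sad, (dx, dy)
        return best_mv, best_sad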

Fig. 4.3 Search region in the reference frame and the corresponding motion vector

Now, the significance of the motion vector is that it points to the region in
the reference frame which, within the defined search zone, is the most similar to
the corresponding region in the current frame. One point needs to be clarified
here: the predicted region in the reference frame may not have exactly the
same illumination or chrominance characteristics as the corresponding region in
the current frame. Hence, the predicted frame constructed using the motion
estimation concept generally shows some difference from the actual current frame.
This difference between the predicted and the actual current frame is called the
residual frame. In the hybrid video codec system, the motion vectors and the
residual frame are sent to the decoder side. In the decoder, the reference frame is already
present while decoding the current frame. So, by using the reference frame, the motion
vectors, and the residual frame, the corresponding current frame can be reconstructed
without any error. In Fig. 4.4, the reconstruction of the current frame on the decoder
side from the residual frame, the motion vectors, and the reference frame is shown.

4.2 Uni- and Bidirectional Predictions

For the hybrid video codec, two kinds of inter-prediction techniques are generally used
nowadays: unidirectional and bidirectional prediction. The concept of
these two is quite straightforward: only one reference picture is used
in the unidirectional prediction, whereas two reference pictures are
used in the bidirectional prediction.
Let us consider a toy example, where eight frames are present in a group of
pictures (GOP). Among these, the first and the last ones are intra-predicted, and the
rest are inter-predicted. The intra-predicted frames are shown as I-frames in Figs. 4.5
and 4.6. For the unidirectional case, the predicted frames are shown in Fig. 4.5 as P-frames.
From this diagram, it is quite clear that a P-frame is predicted from a single
reference frame. The reference frame used to predict a picture in the unidirectional case need
not be an I-frame; it can be a P-frame also. In Fig. 4.5a, only I-frames are shown,
and in Fig. 4.5b both P-frames and I-frames are shown for the GOP, where the first
P-frame is predicted from an I-frame and the second one from a P-frame.

Fig. 4.4 Reconstruction of the current frame in the decoder side with the residual frame, motion
vectors, and the reference frame

Fig. 4.5 Unidirectional prediction for (a) I-frame and (b) P-frame


Fig. 4.6 Bidirectional prediction for (a) only one B-frame and (b) total GOP

In the case of bidirectional prediction, at least two reference frames are required.
The bidirectionally predicted frames are generally represented as B-frames, as shown
in Fig. 4.6. The reference frames can be an I-frame, a P-frame, or a B-frame for a
bidirectional prediction. One point that needs to be clarified here is that two
different motion vectors (MVs) are present in the bidirectional prediction. Generally, a
bidirectional prediction provides a better coding efficiency than the unidirectional
one. In Fig. 4.6b, all prediction techniques are shown for this toy example.

4.3 Complexity in the Inter-prediction

Generally, inter-prediction is the most complex part of the hybrid video codec. For
this reason, this module is one of the key components in terms of the time consumed
to encode a video stream. In this section, let us make a time profile of the H.264/AVC
and HEVC encoders for the different coding modules.
First of all, consider the H.264/AVC. The H.264/AVC video standard has very
high complexity in order to improve video quality and compression gain. Figure 4.7 shows
the encoding time profile for the H.264/AVC. By the term "time profile," we
mean the average time consumption of the different modules in the
H.264/AVC. From this diagram, it is very clear that the inter-prediction dominates
the other modules. It takes over 57 % of the encoding time on average, and sometimes
over 70 %. In this context, we want to mention that, among the different parts of the
inter-prediction, motion estimation is the most expensive one.

Fig. 4.7 The average consumed time profile for encoding H.264/AVC video (%): inter-prediction 57 %, intra-prediction 20 %, transform 16 %, CAVLC 4 %, and other 3 %
Fig. 4.8 The average consumed time profile for encoding HEVC video (%), shown as separate charts for QP 25 and QP 35 (inter-prediction, intra-prediction, transform, and other modules)

ME includes the variable block mode decision process and the motion vector search.
However, in this diagram, the different modules of the inter-prediction are not shown
individually. In the time profile of the H.264/AVC, the second candidate is the intra-
prediction module, which takes over 20 % of the encoding time. If we consider
both the intra- and the inter-prediction modules, then in terms of encoding time, these
are the most important parts of this codec.
Now, consider the HEVC encoder. The time profile of the HEVC encoder is
shown in Fig. 4.8. Let us first clarify the experimental environment of the given
plots: the common test set is used; the quantization parameter (QP) values are 20, 25,
30, 35, and 40; the slice structure is hierarchical B; the fast search used is the enhanced
predictive zonal search (EPZS); the dimension of the sequences is 832 × 480; and the
number of frames is 50.
In these plots, two different configurations (in terms of QP) are shown. QP refers
to the quantization parameter; generally, a low QP means high-quality encoded data.
For both of these cases, the inter-prediction takes over 40 % of the encoding time.
Again, the second position is held by the intra-prediction.

From this analysis, it is quite clear that the prediction modules are the most
important parts of the video codec. If it is analyzed more deeply, the inter-
prediction takes more encoding time than the intra-prediction. Hence, in terms of fast
encoding techniques, these modules have the highest priority for exploration.

4.4 Different Inter-prediction Modes

The coding unit (CU) in the HEVC, or a macroblock (MB) in the H.264/AVC, can
be predicted with different modes in the inter-prediction. However, the concepts of
the prediction modes in these two standards are quite similar. In the HEVC, the
prediction unit (PU) is treated separately with a different abstraction. The
different prediction modes for the HEVC standard are shown in Fig. 4.9. For the inter-
prediction, there are three kinds of modes available in the HEVC.
These are:
1. Skip mode
2. Square- and rectangular-shaped modes
3. Asymmetric modes
Let us consider that the CU size is 2N × 2N. As shown in Fig. 4.9, only the
PART_2N × 2N PU splitting is allowed for a skipped CU. Other than the skip
mode, eight different PU modes are available for the inter-prediction. Among these
eight modes, two modes are square shaped, PART_2N × 2N and PART_N × N,
and two modes are rectangular shaped, PART_2N × N and PART_N × 2N. These
four types of prediction modes (square and rectangular) are symmetric in nature.
The symmetric-shaped prediction modes are calculated for all CU sizes (64 × 64 to 8 × 8).
On the other hand, the remaining four inter-prediction modes are grouped as
asymmetric prediction modes (AMP modes). The AMP modes are PART_2N × nU,
PART_2N × nD, PART_nL × 2N, and PART_nR × 2N. For the CU size 8 × 8, the
AMP modes are not calculated.
For a CB with dimension 8 × 8, nine PU calculations are required (PART_2N × 2N
+ 2·PART_N × 2N + 2·PART_2N × N + 4·PART_N × N), whereas for CBs
with higher dimensions, 13 PU calculations (PART_2N × 2N + 2·PART_N × 2N +
2·PART_2N × N + 2×4 AMPs) are required. Moreover, the bidirectional prediction
technique is also adopted in HEVC. Hence, two motion vectors (MVs) are calculated
separately for each inter-PB using two reference pictures, from list-0 and
list-1. For each MV, the RD cost is calculated using the original and the generated
predicted blocks.
In order to get the best mode, the HEVC encoder uses a cost function for
evaluating all the possible structures which come from the quadtree splitting.
Similar to the previous standard, the rate-distortion (RD) cost is used in the
HEVC. In this process, a CTB is initially encoded with intra- or inter-prediction,
and then the forward transform (T) and quantization (Q) are performed on it, which
produces the encoded bit stream. This encoded bit rate (R) is considered as the rate
function in the final cost calculation. From the encoded bit stream, using the inverse
quantization ($Q^{-1}$) and transform ($T^{-1}$), a reconstructed CTU is generated.

Fig. 4.9 Different prediction modes for the HEVC standard: skip mode (2N×2N); intra-mode (2N×2N, N×N); inter-modes, square and rectangular (non-AMP: 2N×2N, N×N, 2N×N, N×2N); and inter-modes, asymmetric (AMP: 2N×nU, 2N×nD, nL×2N, nR×2N)

Table 4.1 Number of RD cost calculations for a CTB

    CB dimension   # CB   # PB   # RD cost calculations
    64 × 64        1      14     28
    32 × 32        4      56     112
    16 × 16        16     224    448
    8 × 8          64     640    1280
    Total          85     934    1864

The reconstructed frame provides the same visual quality as on the decoder side. To evaluate
the compression error on the decoder side, a distortion function (D) is calculated
from the original and the reconstructed frames as the sum of squared differences (SSD),
weighted SSD, or Hadamard with step, according to the specification file. The RD
cost (J) is calculated as the summation of the distortion (D) and a Lagrangian
weighted (λ) rate (R) function, as shown in Eq. 4.3:

$J = D + \lambda \cdot R$    (4.3)

The number of CB, PB, and corresponding RD cost calculations for a CTB of size
64 × 64 is given in Table 4.1. In this table, we consider only the inter-mode prediction
together with the merge/skip prediction; the RD cost of the intra-modes is not
considered. According to Table 4.1, 1864 RD cost calculations are required for a
64 × 64 CU size to predict its correct inter-mode.

4.5 Merge and Skip Modes

The HEVC includes a merge mode which is conceptually similar to the direct and skip
modes in the H.264. Whenever a CB is to be encoded in merge mode, its
motion information is derived from spatially or temporally neighboring blocks.
In contrast with the previous standards, the skip mode is considered as a special case
of the merge mode, used when there is no need to encode a motion vector and all coded block
flags are equal to zero. When a CU is encoded in skip mode, the following two
conditions are satisfied:
1. the motion vector difference between the current 2N × 2N PU and the neighboring
PU is zero (since it is merge-skip);
2. the residuals are all quantized to zero.
Since only the skip flag and the corresponding merge index are transmitted to the
decoder side, the skip mode requires the minimum number of bits to transmit.
Generally, homogeneous and motionless regions in a video sequence are encoded
in skip mode. In a word, a stationary region refers to homogeneity and
motionlessness. In Fig. 4.10, the CTU structure of a video frame from
the Traffic sequence is shown, together with the CBs of this frame which are finally
encoded in skip mode. It is quite clear from Fig. 4.10 that most
of the stationary regions of the video frame are finally encoded in skip mode. Hence,
in order to detect the skip mode before the RD cost calculation process, it would be
beneficial to identify the stationary regions of a video sequence.
We have analyzed the amount of skip modes in different benchmark video
sequences. In Table 4.2, the percentage of CUs which are finally encoded in
skip mode by the HEVC encoder is shown for six different sequences. In this
table, benchmark sequences with different resolutions and motion activities are
considered. For example, the Traffic and the Park Scene sequences have relatively
motionless backgrounds. On the other hand, the Basketball Pass sequence has
quite a good amount of foreground motion, and the BQ Terrace sequence has a
camera movement which affects the whole video frame.
From Table 4.2, it is quite clear that, in the best case, more than 80 % of the CUs are
skipped if the CU size is 64 × 64, and in the worst case, that is, for CU size 8 × 8,
over 32 % of the CUs are encoded as skip.

Fig. 4.10 The CTB structure and the corresponding CUs which are finally encoded as skip mode
in the Traffic video sequence for QP = 37. (a) The CTB structure of frame no. 5 and (b) the CUs
which are encoded as skip mode in frame no. 5, shown here in blue

If we consider the overall scenario (the average over all
QPs and CU sizes), then more than 58 % of the CUs are encoded in skip mode.
Apart from that, there are two observations that we want to highlight from this
table for all the sequences:
1. the percentage of skip is higher for the larger CU sizes than for the smaller ones;
2. generally, for larger QP values, a greater number of CUs are encoded in skip mode.
The distribution of the skip percentage for the different QP values and CU sizes of these
benchmark video sequences is shown in Fig. 4.11. It is quite clear that Fig. 4.11
justifies our observations from Table 4.2.

Table 4.2 Percentage of CUs that are encoded as skip for different benchmark
video sequences with different QP values

                                         % Skip mode for different CU sizes
    Sequence                        QP    64 × 64   32 × 32   16 × 16   8 × 8
    Traffic (2560 × 1600)           22    79        60        46        35
                                    27    85        64        51        36
                                    32    89        68        52        33
                                    37    93        70        48        28
                                    avg   86.50     65.50     49.25     33.00
    Park Scene (1920 × 1080)        22    70        64        45        30
                                    27    82        67        51        35
                                    32    88        69        54        37
                                    37    92        69        51        40
                                    avg   83.00     67.25     50.25     35.50
    BQ Terrace (1920 × 1080)        22    68        47        64        26
                                    27    82        68        51        39
                                    32    91        71        60        46
                                    37    94        74        66        50
                                    avg   83.75     65.00     60.25     40.25
    Party Scene (832 × 480)         22    87        57        33        20
                                    27    86        64        43        25
                                    32    86        66        46        27
                                    37    90        64        46        28
                                    avg   87.25     62.75     42.00     25.00
    Blowing Bubbles (416 × 240)     22    51        46        36        23
                                    27    56        53        42        28
                                    32    72        59        48        30
                                    37    75        62        54        28
                                    avg   63.50     55.00     45.00     27.25
    Basketball Pass (416 × 240)     22    82        82        61        33
                                    27    84        83        66        36
                                    32    84        83        69        35
                                    37    87        85        72        32
                                    avg   84.25     83.25     67.00     34.00
    Total average                         81.37     66.46     52.29     32.50

4.6 Motion Vector Prediction

Generally, the motion vector of a block is correlated with the motion vectors of
its neighboring blocks in the current frame or in the earlier encoded pictures. The
reason behind this phenomenon is that the neighboring blocks likely correspond to
the same moving object. Therefore, if we send only the difference between the motion
vectors to the decoder side, we can achieve a higher data compression.


Fig. 4.11 The distribution of skip percentage for different QP values and CU sizes (average of all
six benchmark video sequences which are given in Table 4.2)

Fig. 4.12 Positions of the spatial candidates of motion information: b2, b1, b0 above and a1, a0 to the left of the current block

This technique is generally known as motion vector prediction.
In the HEVC, when an inter-picture block is not encoded in skip or merge mode, the
motion vector is differentially coded using motion vector prediction. In Fig. 4.12,
five spatial candidates are shown, and among them only two are chosen. The first
one is chosen from {a0, a1}, the set of left positions, and the second one is
chosen from the set of above positions, {b0, b1, b2}. When the number of
available spatial candidates is not equal to two, the temporal motion vector prediction
is used.
In the HEVC, there is a new concept included, called the advanced motion vector
prediction (AMVP). According to this, a scaled version of the motion vector is used
when the reference index of the neighboring PU is not equal to that of the current PU. The
scaling is done according to the temporal distances between the current picture and
the reference pictures.
Chapter 5
RD Cost Optimization

5.1 Background

In the previous two chapters, we have discussed the different prediction modes.
In a hybrid video codec, for each possible combination of modes, the
reconstructed images are created. One question arises in this context: which
mode should the encoder choose among them all? Generally, the hybrid encoder uses a
cost function to measure the effectiveness of a prediction mode. The cost function
is called the rate-distortion cost, or RD cost in short. For all possible prediction
modes, the RD cost values are calculated, and the mode which provides the
minimum cost value is chosen as the best mode by the encoder. This is, no doubt,
an optimization problem, and it is referred to as RD optimization, or RDO in short.
Let us consider the HEVC encoder. In order to get the best mode, the HEVC
encoder uses an RD cost for evaluating all the possible structures which
come from the quadtree splitting. A simplified RD cost calculation technique
is shown in Fig. 5.1. In this process, a CTB is initially encoded with intra- or inter-
prediction, and then the forward transform (T) and quantization (Q) are performed on
it, which produces the encoded bit stream. This encoded bit rate (R) is considered
as the rate function in the final cost calculation. From the encoded bit stream,
using the inverse quantization ($Q^{-1}$) and transform ($T^{-1}$), a reconstructed CTU
is generated. The reconstructed frame provides the same visual quality as on the
decoder side. To evaluate the compression error on the decoder side, a distortion
function (D) is calculated from the original and the reconstructed frames as the sum of
squared differences (SSD), weighted SSD, or Hadamard with step, according to the
specification file. The RD cost (J) is calculated as the summation of the distortion
(D) and a Lagrangian weighted (λ) rate (R) function, as shown in Eq. 5.1:

$J = D + \lambda \cdot R$    (5.1)



Fig. 5.1 Rate (R) and distortion (D) calculation technique

Table 5.1 Number of RD cost calculations for a CTB

    CB dimension   # CB   # PB   # RD cost calculations
    64 × 64        1      14     28
    32 × 32        4      56     112
    16 × 16        16     224    448
    8 × 8          64     640    1280
    Total          85     934    1864

The HEVC includes the merge mode to derive the motion information from spatially
and temporally neighboring blocks. This is conceptually similar to the direct and
skip modes in the H.264/MPEG-4 AVC. The skip mode is considered as a special
case of the merge mode: in the skip mode, all coded block flags (CBF), the motion
vector difference, and the coded quantized transform coefficients are equal to zero.
Moreover, the bidirectional prediction technique is also adopted in HEVC. Hence,
two motion vectors (MVs) are calculated separately for each inter-PB using two
reference pictures, from list-0 and list-1. For each MV, the RD cost is calculated
using the original and the generated predicted blocks. The number of CB, PB, and
corresponding RD cost calculations for a CTB of size 64 × 64 is given in
Table 5.1. In this table, we consider only the inter-mode prediction together with
the merge/skip prediction; the RD cost of the intra-modes is not considered.
According to Table 5.1, 1864 RD cost calculations are required for a 64 × 64 CU
size to predict its correct inter-mode.
From this analysis, we want to emphasize that a tremendous number of
RD cost calculations take place in a hybrid encoder. So, we should understand this
process in more detail.

5.2 Classical Theory of RD Cost

Rate-distortion (RD) theory provides an analytical expression for the maximum
achievable lossy compression over a given channel. The rate-distortion theory
concerns the task of representing a source with the fewest number of bits needed to
achieve a given reproduction quality. Suppose we have an input raw video sequence
which we want to compress and transmit to a receiver. In this example,
the input raw video sequence can be considered as the source. The RD theory then
addresses the problem of determining the minimal number of bits per symbol such that
the source (input video) can be approximately reconstructed at the receiver (output
video) without exceeding a given amount of distortion.
Compression can be of two types: lossless and lossy. In the case of lossless
compression, as the name suggests, the decompressed data is an exact copy of
the original source data. This kind of compression scheme is generally important
where one needs a perfect reconstruction of the source. However, it is impractical
for applications where the source information is voluminous or
the channel bandwidth is limited. On the other hand, lossy compression is more
effective in terms of compression ratio, at the cost of an imperfect source representation.
Generally, the properties of the human visual system are carefully exploited in
lossy compression. For this reason, to the human eye, the decompressed video
sequence and the source video sequence are indistinguishable.
In lossy compression, a fundamental trade-off arises: how much
fidelity of the representation (distortion) are we willing to give up in order to reduce
the number of bits in the representation (rate)? The trade-off between source fidelity
and coding rate is exactly the rate-distortion trade-off [1].
For a given system, source, and all possible quantization choices, we can plot
the distortion achieved by each encoder/decoder pair against the rate. This
is generally called the operational rate-distortion curve. A conceptual operational rate-
distortion curve is shown in Fig. 5.2. In this curve, a boundary is always present
that distinguishes the best achievable operating points from the suboptimal or unachievable
points. The boundary between achievable and unachievable is defined by the convex
hull of the set of operating points.

5.3 Distortion Measurement Technique

In Eq. 5.1, it is shown that the RD cost is a linear combination of the rate and the
distortion. The calculation of the rate is very straightforward: it can be easily
computed from the actual encoded bits of the video stream. On the other
hand, the distortion measurement has different algorithms. The most common distortion
measurement schemes are described below (Fig. 5.2).

5.3.1 Mean of Squared Error

Considering (k − l) as the past reference frame (l > 0) for backward motion
estimation, the mean squared error of a block of pixels computed at a displacement
(i, j) in the reference frame is given by

$\mathrm{MSE}(i, j) = \frac{1}{N^2} \sum_{n_1=0}^{N-1} \sum_{n_2=0}^{N-1} \left[ s(n_1, n_2, k) - s(n_1 + i, n_2 + j, k - l) \right]^2$    (5.2)

Fig. 5.2 Operating RD characteristics: the operating points in the rate (R)–distortion (D) plane and the convex hull of the RD operating points

The physical significance of the above equation should be well understood. We
consider a block of pixels of size N × N in the reference frame at a displacement
(i, j), where i and j are integers, with respect to the candidate block position.
The MSE is computed for each displacement position (i, j) within a specified
search range in the reference image, and the displacement that gives the minimum
value of the MSE is the displacement vector, more commonly known as the motion
vector, given by

$[d_1, d_2] = \arg\min_{(i,j)} \mathrm{MSE}(i, j)$    (5.3)

The MSE criterion defined in Eq. 5.2 requires the computation of $N^2$ subtractions, $N^2$
multiplications (squaring), and $(N^2 - 1)$ additions for each candidate block at each
search position. This is computationally costly, and a simpler matching criterion, as
defined below, is often preferred over the MSE criterion.

5.3.2 Mean of Absolute Difference

Like the MSE criterion, the mean of absolute differences (MAD) also makes the error
values positive, but instead of summing up the squared differences, the absolute
differences are summed up. The MAD measure at displacement (i, j) is defined as

$\mathrm{MAD}(i, j) = \frac{1}{N^2} \sum_{n_1=0}^{N-1} \sum_{n_2=0}^{N-1} |s(n_1, n_2, k) - s(n_1 + i, n_2 + j, k - l)|$    (5.4)

$[d_1, d_2] = \arg\min_{(i,j)} \mathrm{MAD}(i, j)$    (5.5)

The MAD criterion requires the computation of $N^2$ subtractions with absolute
values and $N^2$ additions for each candidate block at each search position, together
with one averaging operation. The absence of multiplications makes this
criterion computationally more attractive and facilitates easier hardware implementation.

5.3.3 Sum of Absolute Difference

This is the least computationally expensive criterion. The sum of absolute
differences (SAD) is quite similar to the MAD, but instead of averaging over the
block dimension, only the sum is calculated. The calculation of the SAD is given
below:

$\mathrm{SAD}(i, j) = \sum_{n_1=0}^{N-1} \sum_{n_2=0}^{N-1} |s(n_1, n_2, k) - s(n_1 + i, n_2 + j, k - l)|$    (5.6)

$[d_1, d_2] = \arg\min_{(i,j)} \mathrm{SAD}(i, j)$    (5.7)

Just like the MAD, the SAD criterion requires the computation of $N^2$ subtractions
with absolute values and $N^2$ additions for each candidate block at each search
position. The absence of the averaging and multiplication operations makes this
criterion the most cost-effective of all.

5.4 Calculating λ for the RD Cost Function

In the previous sections, we have mentioned different techniques to calculate
the distortion value in the hybrid video codec. The rate calculation is quite straight-
forward, since it can be obtained from the encoded bit stream. The major problem
that arises in the rate-distortion calculation is the modeling of the λ parameter.
Because of the dependencies in the temporal and spatial domains, the modeling of
the λ parameter is quite a difficult job.
As we have mentioned in the background section of this chapter, the mode
with the minimal cost is selected as the best mode. The mode decision is made by
minimizing

$J_{\mathrm{MODE}}(s, c, \mathrm{MODE} \mid \lambda_{\mathrm{MODE}}) = \mathrm{SSD}(s, c, \mathrm{MODE}) + \lambda_{\mathrm{MODE}} \cdot R(s, c, \mathrm{MODE})$    (5.8)

In the above equation, SSD denotes the sum of squared differences between the
original block and its reconstruction, and MODE indicates a mode out of the set of
potential modes of the blocks (MB or CTU). The computation of the Lagrangian
costs for the inter-modes is much more demanding than for the intra- and SKIP
modes. This is because of the block motion estimation step [1].
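
The mode decision of Eq. 5.8 reduces to evaluating J for every candidate mode and
keeping the minimum. The following is a minimal sketch; the candidate triples and
the λ value are made-up illustrations, since the real encoder derives λ from the QP
and obtains D and R by actually reconstructing and encoding the block:

    LAMBDA = 10.0  # illustrative value only

    def best_mode(candidates):
        # candidates: (mode name, distortion D as SSD, rate R in bits).
        best = None
        for mode, dist, rate in candidates:
            j = dist + LAMBDA * rate  # Eq. 5.8
            if best is None or j < best[1]:
                best = (mode, j)
        return best

    print(best_mode([("SKIP", 920.0, 3), ("2Nx2N", 610.0, 42), ("NxN", 540.0, 95)]))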
Given the Lagrange parameter $\lambda_{\mathrm{MOTION}}$ and the decoded reference picture,
rate-constrained motion estimation for a block $S_i$ is performed by minimizing the
Lagrangian cost function

$m_i = \arg\min_{m \in M} \left[ D_{\mathrm{DFD}}(S_i, m) + \lambda_{\mathrm{MOTION}} \cdot R_{\mathrm{MOTION}}(S_i, m) \right]$    (5.9)

A final remark should be made regarding the choice of the Lagrange parameters
$\lambda_{\mathrm{MODE}}$ and $\lambda_{\mathrm{MOTION}}$. In [1], an in-depth study of the parameter selection is given.
Selected rate-distortion curves and bit-rate saving plots for video streaming, video-
conferencing, and entertainment-quality applications are given in Figs. 5.3, 5.4,
and 5.5.

Fig. 5.3 Selected rate-distortion curves and bit-rate saving plots for videoconferencing applica-
tions [1]

Fig. 5.4 Selected rate-distortion curves and bit-rate saving plots for video streaming applica-
tions [1]

Fig. 5.5 Selected rate-distortion curves and bit-rate saving plots for video entertainment applica-
tions [1]

Reference

1. T. Wiegand, H. Schwarz, A. Joch, F. Kossentini, G.J. Sullivan, Rate-constrained coder control and
comparison of video coding standards. IEEE Trans. Circ. Syst. Video Technol. 13(7), 688–703
(2003)
Chapter 6
Fast Prediction Techniques

6.1 Need for the Fast Prediction Algorithms

The CTB is an efficient representation of variable block sizes, so that regions of different sizes can be coded with fewer bits while maintaining the same quality. It is possible to encode stationary or homogeneous regions with a larger block size, resulting in a smaller side-information overhead. On the other hand, the CTB structure dramatically increases the computational complexity.
As an example, if a frame has a resolution of 704 × 576 pixels, it will be decomposed into 99 (11 × 9) CTUs, and a separate CTB will be created for each CTU. For each CTB, 85 CU evaluations are involved for the different CU sizes. As a result, 8415 CU evaluations are required for the CTB structure, whereas only 1584 evaluations are needed for 16 × 16 macroblocks, as used in the previous standard (H.264/AVC). From this analysis, it is clear that the new CTB structure in HEVC greatly increases the computational complexity.
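These counts follow directly from the quadtree geometry. The short sketch below (assuming a 64 × 64 CTU evaluated over four depth levels, down to 8 × 8) reproduces them:

```python
def cu_evaluations_per_ctb(depths=4):
    # Depths 0..3 of a 64x64 CTU (64x64, 32x32, 16x16, 8x8 CUs):
    # 1 + 4 + 16 + 64 = 85 evaluations.
    return sum(4 ** d for d in range(depths))

width, height = 704, 576
ctus = (width // 64) * (height // 64)      # 11 * 9 = 99 CTUs
print(ctus * cu_evaluations_per_ctb())     # 99 * 85 = 8415 CU evaluations
print((width // 16) * (height // 16))      # 44 * 36 = 1584 macroblocks
```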
For a comparative analysis, we compared the performance of H.264/AVC and HEVC on the same video sequences. The experimental environment was as follows: quantization parameter (QP) values of 25 and 35; hierarchical-B slice structure; sequence dimensions of 832 × 480; 50 frames; JM version 18.0 and HM version 3.0; the JM used the high profile, and the HM used the random-access, high-efficiency configuration.
Tables 6.1 and 6.2 show the compared performance at QP 25 and QP 35, respectively. According to these tables, the HM requires nearly ten times the encoding time of the JM reference software. In terms of PSNR, the HM gains almost 3 dB over the JM, and in terms of bit rate, the HM achieves almost 55 % bit-rate reduction compared to the JM.
In other words, HEVC performs well in terms of video quality (PSNR) and bit rate. On the other hand, in terms of encoding time, it suffers very large computational complexity because of the sub-tree computations in the quadtree structure.
Table 6.1 The performance analysis (QP 25, 832 × 480 sequences) on JM 18.0 and HM 3.0

                      JM 18.0                    HM 3.0                     Differential
  Sequence            PSNR    Bit rate   Time    PSNR    Bit rate   Time    ΔPSNR   B%      T%
  Basketball Drill    38.92   5916.11    592     38.70   2345.11    5345    -0.22   -60.3   +902
  Flower Vase         43.67    820.04    451     43.61    323.83    4087    -0.06   -60.3   +906
  Keiba               39.80   5519.28    392     38.11   2265.61    3735    -1.69   -58.9   +952
  Mobisode2           44.62    583.23    239     44.85    254.02    2422    +0.23   -56.4   +1013
  Party Scene         37.44  17228.56    448     35.92   6198.63    3772    -1.52   -64.0   +841
  Average                                                                   -0.65   -60.0   +923
Table 6.2 The performance analysis (QP 35, 832 × 480 sequences) on JM 18.0 and HM 3.0

                      JM 18.0                    HM 3.0                     Differential
  Sequence            PSNR    Bit rate   Time    PSNR    Bit rate   Time    ΔPSNR   B%      T%
  Basketball Drill    32.77   1218.43    459     33.16    552.74    4388    +0.39   -54.6   +955
  Flower Vase         36.91    126.47    459     33.16    552.74    4388    +1.30   -33.5   +1018
  Keiba               33.15   1304.86    331     32.41    562.24    2981    -0.74   -56.9   +900
  Mobisode2           41.23    158.65    226     41.64     62.15    2249    +0.41   -60.8   +995
  Party Scene         29.01   3107.84    349     29.13   1461.27    2866    +0.12   -53.0   +821
  Average                                                                   +0.30   -51.8   +938
Hence, for real-time applications, it is an important challenge to reduce the encoder complexity with negligible PSNR loss and bit-rate increase.

6.2 Fast Options in HEVC Encoder

6.2.1 Early CU Termination

The CU is the basic unit of region splitting used for inter-/intra-prediction. The CU concept allows recursive splitting into four equally sized blocks, starting from the quadtree root block. In [5], a fast CU depth decision algorithm is proposed, commonly known as ECU. According to the ECU, no further processing of sub-trees is required when the current CU selects SKIP mode as the best prediction mode at the current CU depth. A diagram of the algorithm is depicted in Fig. 6.1, and a sketch of the recursion follows the figure.
Fig. 6.1 Early CU termination (ECU) algorithm [5]
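A minimal sketch of this pruning, with evaluate_modes and split as hypothetical stand-ins for the HM routines (not the actual HM API):

```python
def compress_cu(cu, depth, evaluate_modes, split, max_depth=3):
    """ECU-style recursion. evaluate_modes(cu) -> (best_mode, rd_cost) and
    split(cu) -> four sub-CUs are caller-supplied hooks."""
    best_mode, best_cost = evaluate_modes(cu)
    if best_mode == "SKIP":
        # ECU [5]: SKIP won at this depth, so the sub-tree is not processed.
        return best_cost
    if depth < max_depth:
        split_cost = sum(compress_cu(sub, depth + 1, evaluate_modes, split,
                                     max_depth) for sub in split(cu))
        best_cost = min(best_cost, split_cost)
    return best_cost
```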

6.2.2 Early Skip Detection

To decide the best PU mode, the HEVC encoder computes the RD costs of all possible inter-PU and intra-PU modes. Since each of them entails high computational complexity, it is very desirable in practice for the encoder to decide the best PU mode at the earliest possible stage, without checking all possible modes exhaustively.
In [35], an early detection of SKIP mode is proposed to reduce the encoding complexity of HEVC by simply checking the differential motion vector (DMV) and the coded block flag (CBF) after searching the best inter 2N × 2N mode. The flowchart of the ESD method is depicted in Fig. 6.2. As shown in Fig. 6.2, the current CU searches the inter 2N × 2N modes (AMVP and merge) before checking the SKIP mode. After selecting the best inter 2N × 2N mode with the minimum RD cost, the method checks its DMV and CBF. If the DMV and CBF of the best inter 2N × 2N mode are equal to (0, 0) and zero, respectively (these two conditions are called the "early SKIP conditions"), the best mode of the current CU is determined early as the SKIP mode. In other words, the remaining PU modes are not investigated any further. The method can thus omit the RD calculation for the other modes, reducing encoding complexity without sizable coding-efficiency loss. A sketch of this test follows the figure.
Fig. 6.2 Early SKIP detection (ESD) algorithm [35]
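A minimal sketch of the early SKIP test, assuming the winning inter 2N × 2N result exposes its DMV and CBF:

```python
def early_skip(best_inter_2Nx2N):
    # ESD test from [35], applied after the inter 2Nx2N search (AMVP and
    # merge). best_inter_2Nx2N is assumed to carry the winner's DMV and CBF.
    dmv_is_zero = best_inter_2Nx2N["dmv"] == (0, 0)   # differential MV
    cbf_is_zero = best_inter_2Nx2N["cbf"] == 0        # coded block flag
    return dmv_is_zero and cbf_is_zero                # both "early SKIP conditions"
```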

6.2.3 CBF Fast Mode Setting

When a CU is encoded in an inter-picture, the RD costs for a total of six PUs are examined: inter 2N × 2N, inter 2N × N, inter N × 2N, inter N × N, intra 2N × 2N, and intra N × N. The RD costs for inter N × N and intra N × N are examined only for 8 × 8 CUs.
According to [7], if the CBF of an inter-PU other than an inter N × N PU in a CU is zero (CBF = 0) for luma and both chromas (CBF luma, CBF U, CBF V), the remaining PU encoding process of the CU is terminated. This algorithm is generally referred to as the CFM, and the corresponding flowchart is shown in Fig. 6.3; a sketch of the PU loop is given below.
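A minimal sketch of the resulting PU loop, with evaluate as a hypothetical hook for the encoder's per-mode RD evaluation:

```python
PU_MODES = ["inter_2Nx2N", "inter_2NxN", "inter_Nx2N", "inter_NxN",
            "intra_2Nx2N", "intra_NxN"]

def try_pu_modes(evaluate):
    """CFM-style PU loop. evaluate(mode) -> (rd_cost, cbf_y, cbf_u, cbf_v)
    is a hypothetical stand-in for the encoder's PU evaluation."""
    best_cost, best_mode = float("inf"), None
    for mode in PU_MODES:
        cost, cbf_y, cbf_u, cbf_v = evaluate(mode)
        if cost < best_cost:
            best_cost, best_mode = cost, mode
        # CFM [7]: zero CBF for luma and both chromas on an inter PU other
        # than inter NxN terminates the remaining PU evaluations.
        if mode.startswith("inter") and mode != "inter_NxN" \
                and cbf_y == cbf_u == cbf_v == 0:
            break
    return best_mode, best_cost
```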

6.2.4 Fast Decision for Merge RD Cost

This early termination rule deals with the computation of the rate-distortion cost of the motion vector predictors at the encoder side. More precisely, a termination rule is proposed to avoid estimating the rate-distortion costs of all the merge candidates. In [13], an efficient fast decision for merge RD cost algorithm is proposed, commonly referred to as FDM.

Fig. 6.3 CBF Fast Mode Setting (CFM) algorithm [7]

Figure 6.4 presents the proposed encoder change that avoids some rate-distortion cost evaluations for some merge candidates. Instead of systematically computing the rate-distortion cost of each candidate, an early termination rule is applied. The diagram in Fig. 6.4 uses a Boolean variable to signal the early termination for merge (ETM). When the condition is reached, i.e., (ETM == TRUE), the computation of the rate-distortion cost of the merge mode for a given candidate is not performed. A sketch of this loop is given below.
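A minimal sketch of one plausible reading of this loop, with cost_skip and cost_merge as hypothetical RD-cost hooks:

```python
def fdm_merge_search(candidates, cost_skip, cost_merge):
    """One plausible reading of Fig. 6.4 [13]: once the running best cost
    comes from a SKIP evaluation, ETM becomes True and merge-mode costs
    are no longer computed for the remaining candidates."""
    best_j, etm = float("inf"), False
    for cand in candidates:
        j_skip = cost_skip(cand)
        best_j = min(best_j, j_skip)
        if not etm:
            best_j = min(best_j, cost_merge(cand))
        if best_j == j_skip:   # SKIP gave the best cost for this candidate
            etm = True
    return best_j
```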

6.3 Block Matching Algorithm

Motion estimation techniques form the core of H.264/AVC video compression and video processing applications. Motion estimation extracts motion information from the video sequence, where the motion is typically represented by a motion vector (MV). The MV indicates the displacement of a pixel or a pixel block from its current location due to motion. This information is used in video compression to find the best matching block in the reference frame, to calculate a low-energy residue, and to generate temporally interpolated frames. Several families of motion estimation techniques exist: pixel-recursive techniques, which derive an MV for each pixel, and the phase plane correlation technique, which generates motion vectors via correlation between the current frame and the reference frame. However, the most popular technique is the block matching algorithm.
The block matching algorithm calculates a motion vector for an entire block of pixels instead of for individual pixels. The same motion vector applies to all the pixels in the block. This reduces the computational requirement and also results in a more accurate motion vector, since objects typically span a cluster of pixels.

Fig. 6.4 Fast decision for merge (FDM) RD cost algorithm [13]

The current frame is divided into pixel blocks, and motion estimation is performed independently for each pixel block. Motion estimation is done by identifying the best-matching pixel block in the reference frame. The displacement is given by the MV, which consists of a pair (x, y) of horizontal and vertical displacement values. Various criteria are available for evaluating block matches.
The reference pixel blocks are generated only from a region known as the search area. The search range defines the boundary for the motion vectors and limits the number of blocks to evaluate. The height and width of the search range depend on the motion in the video sequence; the available computing power also constrains the search range. A bigger search range requires more computation due to the increased number of evaluated candidates. Typically the search range is kept wider (i.e., the width exceeds the height), since many video sequences exhibit panning motion. The search region can also be adapted according to the detected motion. The horizontal and vertical search ranges, Sx and Sy, define the search range (±Sx and ±Sy) as in Figs. 6.5 and 6.6.
The H.264/AVC and HEVC standards both adopt a block-based encoding structure. For inter-prediction, motion estimation is the core of video compression and of various video processing applications, extracting the motion information from the video sequence. Using motion estimation, a motion vector is typically generated for each block (MB or CU) in the video compression standard. The motion vector indicates the displacement of a block of pixels from its current location due to object or camera motion. This information is used to find the best matching block in the reference frame so as to minimize the rate-distortion cost. This technique is known as the block matching algorithm (BMA).

Fig. 6.5 Reference frame

Fig. 6.6 Current frame

We have studied various motion estimation algorithms used in H.264/AVC and HEVC. According to our survey, the existing BMAs can be classified into the following categories: full search, unsymmetrical-cross multihexagon-grid search, diamond search, enhanced predictive zonal search, test zone search, fixed search patterns, search patterns based on block correlation, and search patterns based on motion classification.

6.4 Full Search

The FS block matching algorithm searches every possible pixel block in the search range [1]. Hence, it generates the best-matching motion vector, and this type of BMA gives the least possible residue for video compression. However, the required computations are prohibitively high due to the large number of search points to evaluate in the defined search region: the number of search points is (2·Sx + 1) × (2·Sy + 1), which is far higher than for any of the fast search algorithms. Several fast BMAs reduce the number of search points while trying to keep good block matching accuracy. Note that since these algorithms test only a limited set of candidates, they may select a candidate corresponding to a local minimum, unlike full search, which always finds the global minimum. A minimal sketch of FS is given below.
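A minimal NumPy sketch of FS with the SAD criterion (an illustration, not encoder code):

```python
import numpy as np

def full_search(cur, ref, cx, cy, sx, sy):
    """Exhaustive search over the (2*sx + 1) * (2*sy + 1) positions around
    (cx, cy); cur is an NxN block, ref the reference frame."""
    n = cur.shape[0]
    best_mv, best_sad = None, np.inf
    for dy in range(-sy, sy + 1):
        for dx in range(-sx, sx + 1):
            y, x = cy + dy, cx + dx
            if 0 <= y <= ref.shape[0] - n and 0 <= x <= ref.shape[1] - n:
                cand = ref[y:y + n, x:x + n].astype(np.int32)
                s = np.abs(cur.astype(np.int32) - cand).sum()
                if s < best_sad:
                    best_mv, best_sad = (dx, dy), s
    return best_mv, best_sad
```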

6.5 Unsymmetrical-Cross Multihexagon-Grid Search

The unsymmetrical-cross multihexagon-grid search (UMHexagonS) was proposed for fast integer-pel and fractional-pel motion estimation in H.264/AVC [4]. The UMHexagonS conducts the overall search in four steps from an initial predicted start search point: step one, a sparse uneven cross search; step two, a fine full search within a small rectangle; step three, a sparse uneven hexagon-grid search, where the grid is sparser and larger as the search point moves away from the hexagon center; and step four, a refinement with hexagon or diamond search. Figure 6.7 demonstrates a typical search procedure in a search window with a search range of 16 (assuming the start search point to be the (0,0) vector).
Compared to FS, the UMHexagonS algorithm is claimed to reduce motion estimation time by 90 % while losing less than 0.05 dB PSNR and maintaining a low bit rate. To bring the initial search point close to the best prediction point, the UMHexagonS search strategy begins with a cursory search pattern and then turns to elaborate search patterns. With multiple patterns, it avoids the disadvantage of traditional fast algorithms, which are easily trapped in local minima.
However, compared to ARPS and EPZS, the computational complexity of the UMHexagonS algorithm is high, because its search pattern shapes contain more search candidates.

6.6 Diamond Search

A diamond search (DS) algorithm for fast block matching motion estimation employs two search patterns. The first pattern, called the large diamond search pattern (LDSP) and illustrated in Fig. 6.8, comprises nine checking points: eight points surround the center to compose a diamond shape.

Fig. 6.7 Search process of the UMHexagonS algorithm, W = 16

Fig. 6.8 Large diamond search pattern

The second pattern, consisting of five checking points, forms a smaller diamond shape called the small diamond search pattern (SDSP), as illustrated in Fig. 6.9.
In the searching procedure of the DS algorithm, the LDSP is applied repeatedly until the minimum block distortion (MBD) occurs at the center point. The search pattern is then switched from LDSP to SDSP for the final search stage. Among the five checking points in the SDSP, the position yielding the MBD provides the motion vector of the best matching block.

Fig. 6.9 Small diamond search pattern

The DS algorithm is summarized as follows:

Step 1 The initial LDSP is centered at the origin of the search window, and the nine checking points of the LDSP are tested. If the MBD point is located at the center position, go to Step 3; otherwise, go to Step 2.
Step 2 The MBD point found in the previous search step is repositioned as the center point to form a new LDSP. If the new MBD point is located at the center position, go to Step 3; otherwise, repeat this step.
Step 3 Switch the search pattern from LDSP to SDSP. The MBD point found in this step is the final solution: the motion vector pointing to the best matching block.

In our algorithm, we employed the SDSP to search slow-motion sequences. A minimal sketch of the DS procedure follows.
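The sketch below follows Steps 1 to 3, with cost as a hypothetical per-displacement distortion hook (e.g., a SAD evaluation):

```python
LDSP = [(0, 0), (2, 0), (-2, 0), (0, 2), (0, -2),
        (1, 1), (1, -1), (-1, 1), (-1, -1)]        # nine-point large diamond
SDSP = [(0, 0), (1, 0), (-1, 0), (0, 1), (0, -1)]  # five-point small diamond

def diamond_search(cost, start=(0, 0)):
    """cost((x, y)) -> block distortion at displacement (x, y)."""
    center = start
    while True:
        pts = [(center[0] + dx, center[1] + dy) for dx, dy in LDSP]
        best = min(pts, key=cost)
        if best == center:        # MBD at the center: switch to the SDSP
            break
        center = best             # Step 2: re-center the LDSP and repeat
    pts = [(center[0] + dx, center[1] + dy) for dx, dy in SDSP]
    return min(pts, key=cost)     # Step 3: final motion vector
```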

6.7 Enhanced Predictive Zonal Search

The enhanced predictive zonal search for single and multiple frame motion estimation (EPZS) [31] can be considered an improvement of the predictive motion vector field adaptive search technique (PMVFAST) for enhancing block-based motion estimation [30] and of fast block matching motion estimation using the advanced predictive diamond zonal search (APDZS) [29]. The EPZS improves upon these algorithms by introducing an additional set of predictors, and its early stopping criteria are more efficiently selected. Furthermore, due to the enhanced reliability of the predictors, only one search pattern is used, considerably reducing the associated overhead of the algorithm. The checking pattern, depending on the implementation requirements, can be either a diamond or a square. The algorithm is otherwise similar to other zonal-type algorithms.

Fig. 6.10 Large diamond pattern for PMVFAST

Fig. 6.11 Small diamond pattern for PMVFAST

The PMVFAST algorithm managed to significantly improve upon the performance of the motion vector field adaptive search technique (MVFAST) in terms of both speedup and PSNR by enhancing several aspects of the algorithm. Even though both algorithms make use of the two diamond patterns shown in Figs. 6.10 and 6.11, they differ significantly in several other aspects. More specifically, in PMVFAST, instead of initially examining the (0,0) motion vector as is done in MVFAST, the median predictor, also used for motion vector encoding, is examined first. This is done because the median predictor is more reliable and has a higher probability of being the true predictor, especially for nonzero-biased sequences.

Fig. 6.12 Use of acceleration information as a motion vector predictor

The EPZS algorithm improves upon PMVFAST, and also upon APDZS, by considering several additional predictors in the generalized predictor selection phase of these algorithms, together with a more robust and efficient adaptive thresholding calculation. Due to the high efficiency of the prediction stage, the search pattern can be considerably simplified.
The EPZS algorithm also considers an accelerator motion vector (Fig. 6.12): the differentially increased/decreased motion vector obtained by considering not only the motion vector of the collocated block in the previous frame but also that in the frame before it. The concept behind this predictor is that a block may not be following a constant velocity but may instead be accelerating.
The EPZS additionally uses, for the current block, the motion vectors of adjacent blocks in the current frame as well as those of the collocated block and its neighbors in the previous frame, as in Fig. 6.13.
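A minimal sketch of this extrapolation:

```python
def accelerator_mv(mv_t1, mv_t2):
    # Extrapolate assuming the motion changes by the same increment each
    # frame: MV(t) ~ MV(t-1) + (MV(t-1) - MV(t-2)).
    return (2 * mv_t1[0] - mv_t2[0], 2 * mv_t1[1] - mv_t2[1])

# A block that moved (4, 0) two frames ago and (6, 0) in the previous
# frame is predicted to move (8, 0) in the current frame.
print(accelerator_mv((6, 0), (4, 0)))   # -> (8, 0)
```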

6.8 Test Zone Search

The TZS algorithm is a mixture of zonal search and raster search patterns. The flowchart of the complete algorithm is shown in Fig. 6.14. The algorithm can be broadly divided into four steps, described in the following:
Motion vector prediction: The TZS algorithm employs the median, left, up, and right predictors. The predictor with the minimum distortion is selected as the starting location for the further search steps.
Initial grid search: In this step, the algorithm searches the search window using diamond or square patterns with stride lengths ranging from 1 through 64, in multiples of 2.

Fig. 6.13 The MV of the current block may have a stronger relationship with the motion vectors of the blocks around the collocated block in the previous frame. (a) Frame t - 1; (b) current frame

The pattern used is either an eight-point diamond search or an eight-point square search, as selected. A sample grid with stride length 8 for the diamond is shown in Fig. 6.15a. The motion vector with the minimum SAD is taken as the center search point for the further steps. The stride length for this minimum-distortion point is stored in the variable uiBestDistance. The maximum number of search points for this step, $n_1$, is given by

$$n_1 = P\left(1 + \lfloor \log_2 S \rfloor\right) \tag{6.1}$$

where S is the size of the search window, P is the number of search points per grid (eight for the diamond, six for the hexagon, etc.), and $\lfloor\cdot\rfloor$ denotes the floor function.
Raster search: The raster search is a simple full search on a down-sampled version of the search window. A predefined value iRaster for the raster scan is set before compilation of the code [10]. This value is used as a sampling factor for the search window. The search window (for a 16 × 16 search window) for a raster scan with iRaster value 3 is shown in Fig. 6.15b. As shown in the flowchart in Fig. 6.14, the condition for performing this raster search is that uiBestDistance (obtained from the previous step) must be greater than iRaster; if this condition is not satisfied, the algorithm skips this step.
76 6 Fast Prediction Techniques

Fig. 6.14 Flowchart of the TZS algorithm [26]

If the raster step is processed, uiBestDistance is set to the iRaster value. As seen from Fig. 6.15b, the number of search points in each row/column is $\lceil S/R \rceil$, where R denotes the iRaster value.

Fig. 6.15 (a) Diamond search pattern and (b) hexagonal search pattern with stride length 8

Thus, the maximum number of search points in this step, $n_2$, is given by

$$n_2 = \lceil S/R \rceil^{2} \tag{6.2}$$

Raster/star refinement: This step is a fine refinement of the motion vectors obtained from the previous step. As shown in the flowchart in Fig. 6.14, either the raster refinement or the square/diamond (star) pattern refinement can be enabled. In general, only one of the refinement methods is enabled, for fast computation. Both refinements use either an eight-point square pattern or an eight-point diamond pattern; the two methods differ in their search operation. The raster refinement searches by halving the uiBestDistance value (obtained from the raster search) in every step of the loop, until uiBestDistance equals zero. The star refinement is similar to step 2 except for small changes in the starting location. The whole refinement process starts only if uiBestDistance is greater than zero. After every loop, the new stride length is stored in the variable uiBestDistance, and the search stops when uiBestDistance equals zero. The total number of search points in this step ($n_3$) depends on the video sequence and is not constant for each iteration.
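A short sketch reproducing the counts of Eqs. (6.1) and (6.2), e.g., for a search range of 64 with the eight-point diamond and iRaster = 5 (illustrative values):

```python
import math

def tzs_grid_points(S, P=8):
    # Eq. (6.1): n1 = P * (1 + floor(log2 S)) for the initial grid search.
    return P * (1 + math.floor(math.log2(S)))

def tzs_raster_points(S, R):
    # Eq. (6.2): n2 = ceil(S / R)^2 for the down-sampled raster search.
    return math.ceil(S / R) ** 2

print(tzs_grid_points(64))       # 8 * (1 + 6) = 56 points
print(tzs_raster_points(64, 5))  # 13^2 = 169 points
```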

6.9 Fixed Search Patterns

In this category, most of the methods are based on the assumption that ME matching
error decreases monotonically as the search moves toward the position of the global
minimum error. The motion vector of each block is searched independently by using

fixed search patterns. Examples are displacement measurement and its application in interframe image coding (2-D LOG), motion-compensated interframe coding for video conferencing (TSS), the novel four-step search algorithm for fast block motion estimation (4SS), the block-based gradient descent search algorithm for block motion estimation in video coding (BBGDS), the hexagon-based search pattern for fast block motion estimation (HEXBS) [40], DS, and UMHexagonS. These algorithms reduce the number of search points; however, they trade off complexity reduction against image quality.
The 4SS and TSS are efficient for fast-motion video sequences because the MVs in fast-motion sequences are far from the macroblock center point. However, in other cases, such as medium- and slow-motion sequences, they can be trapped in local minima. Also, the TSS uses a constantly allocated checking-point pattern in its first step, which becomes inefficient for the estimation of slow motion. The new three-step search algorithm for block motion estimation (NTSS) [17], the efficient three-step search algorithm for block motion estimation (ETSS) [9], and the simple and efficient search algorithm for block matching motion estimation (SES) [20] have been proposed to improve the performance of simple fixed-search-pattern algorithms.

6.10 Search Patterns Based on Block Correlation

Instead of using predetermined search patterns, these methods exploit the correlation between the current block and its adjacent blocks in the spatial and/or temporal domains to predict the candidate MVs. The predicted MVs are obtained by calculating a statistical average (such as the mean, the median, or the weighted mean/median) of the neighboring MVs [21] or by selecting one of the neighboring MVs according to certain criteria. In addition, one such candidate, named the accelerator MV, is the differentially increased/decreased MV obtained by considering not only the motion vector of the collocated block in the previous frame but also that in the frame before it.
The concept behind this predictor is that a block may not be following a constant velocity but may be accelerating. This kind of approach uses spatial and/or temporal correlation to calculate the predictor, as in ARPS and EPZS. These algorithms set pattern sizes or estimate positions from the MVs of the previous frame and/or of neighboring blocks in the current frame. The EPZS and ARPS preserve the peak signal-to-noise ratio (PSNR) of FS while reducing the consumed time at a similar bit rate. However, they incur considerable memory overhead, since they use spatio-temporal information.

6.11 Search Patterns Based on Motion Classification

Apart from the abovementioned search patterns (fixed or variable), another kind of approach to the block matching algorithm uses the motion activity of the video sequence. Video sequences can be broadly divided into three categories based on the motion activity in successive frames: slow-, medium-, and fast-motion sequences. Some algorithms use different schemes to classify video sequences.
The search pattern switching algorithm for block motion estimation (SPS) [23] combines two previously proposed approaches to motion estimation. The first approach uses a coarse-to-fine technique to reduce the number of search points, as in 2-D LOG and TSS; this approach is efficient for fast-motion video sequences because the search points are evenly distributed over the search window, so global minima far from the window center can be located more efficiently. The second approach utilizes the center-biased characteristic of MVs, as in algorithms such as N3SS, 4SS, BBGDS, and DS, using center-biased search patterns to exploit the center-biased distribution of global minima. Compared with the first approach, a substantial reduction of search points can be achieved for slow motion. The SPS algorithm combines the advantages of the above two approaches by using different search patterns according to the motion content of a block. The performance of such an adaptive algorithm depends on the accuracy of its motion content classification.
In real video sequences, content with slow, medium, and fast motion frequently coexists. The adaptive fast block matching algorithm that switches search patterns for sequences with wide-range motion content (A-TDB) can efficiently remove the temporal redundancy of such sequences. Based on the characteristics of a predicted profit list, the A-TDB can adaptively switch search patterns among TSS, DS, and BBGDS according to the motion content [8].
In an adaptive motion estimation scheme for video coding (NUMHexagonS), the statistics of the MV distribution were analyzed. The algorithm puts forward a method of predicting the MV distribution, makes full use of the MV characteristics, and combines MV distribution prediction with new search patterns to make the search position more accurate [19].

6.12 Prediction-Based Fast Algorithms

A good number of papers have reported efficient prediction techniques; this is arguably one of the most effective ways to build a fast algorithm in HEVC.
The fast encoder decision algorithm called FEN has been included in the HM software and can greatly reduce complexity. The main idea of FEN is that further CU calculation is skipped when the current CU selects SKIP mode as the best mode and its rate-distortion cost is smaller than the average rate-distortion cost of the previously SKIP-encoded CUs. The average rate-distortion cost of previously skipped CUs is multiplied by a fixed weighting factor to increase the number of CUs that can be encoded as SKIP mode; the weighting factor of FEN is 1.5 (a sketch of this test is given below). In [36], a novel algorithm was proposed for scalable H.264/AVC using a Bayesian framework.
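A minimal sketch of the FEN test (an assumed simplification of the HM logic):

```python
class FenGate:
    """Further CU processing is skipped when SKIP wins at the current CU
    and its RD cost is below W times the running average RD cost of
    earlier SKIP-coded CUs."""
    W = 1.5   # fixed weighting factor of FEN

    def __init__(self):
        self.total = 0.0
        self.count = 0

    def should_terminate(self, best_mode, rd_cost):
        if best_mode != "SKIP":
            return False
        # No history yet: record the cost but do not terminate (an assumption).
        terminate = (self.count > 0 and
                     rd_cost < self.W * self.total / self.count)
        self.total += rd_cost
        self.count += 1
        return terminate
```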
In [11], an adaptive coding unit early termination has been proposed based on an early SKIP detection technique. In this paper, three tests were performed to find the statistical characteristics of the SKIP mode; they show that the current CU and its neighboring CUs are highly correlated. Hence, the paper proposes an adaptive weighting-factor adjustment method using these correlations. The initial weighting factor is fixed at one, and the weighting factor is then adjusted between 1.0 and 2.0. The experimental results show that the average coding time can be reduced by up to 54 % using this technique. In natural pictures, neighboring blocks usually hold similar textures; consequently, the optimal intra-prediction of the current block may have strong correlation with its neighboring blocks. Based on this consideration, in [39], conditional probabilities have been estimated for the optimal intra-direction of the current block, and a most probable mode (MPM) is derived from its neighboring blocks. The statistical results show that the MPM of the current block has a high probability of being the best mode under both test conditions, and this probability fluctuates only slightly between sequences.
In [16], it is shown that large CUs can be considered very efficient for high-resolution, slow-motion, or large-QP video sequences. A larger CU requires less side information and fewer motion vectors; apart from that, it can also predict the smooth and slow-motion parts of a sequence more accurately, so mode correlation exists among consecutive frames. The authors provide two key ideas in this context: frame-level and CU-level decisions. Encoding time savings of 45 % are possible using this technique. In [25], the authors take the reference software HM 0.9 as a benchmark and develop their own system based on hierarchical block-based coding and a block-adaptive translational model in interframe coding. In [32], a low-complexity intra-mode prediction algorithm has been proposed that combines most-probable-mode flag signaling and intra-mode signaling in one elegant solution. Using this algorithm, a 33 % bit-rate reduction can be obtained; the algorithm takes neighboring intra-modes into account to obtain a prioritization of the different modes. In most video coding, chroma sample prediction is performed after the luma samples are taken.
In [3], the authors have proposed a reversed prediction structure that makes luma predictions after the chroma samples are taken. In the conventional structure, the intra-prediction has to be calculated 341 (256 + 64 + 16 + 4 + 1) times for luma intra-prediction when the maximum CU is set to 64 × 64 and the maximum allowed partition depth is 4. The proposed structure, however, requires only 85 (64 + 16 + 4 + 1) calculations on the chroma samples. Experimental results show that the proposed algorithm achieves approximately 30 % time savings on average, with 0.03 and 0.05 BD-PSNR losses in the chroma components and an unnoticeable increment in bit rate.

Generally, bi-prediction is effective when the video has scene changes, camera panning, zoom-in/out, or very fast scenes. In [12], it is observed that the RD costs of forward and backward prediction increase when bi-prediction is the best prediction mode. That paper presents a bi-prediction skipping method that can efficiently reduce the computational complexity of bi-prediction. The assumption is that if bi-prediction is selected as the best prediction mode, the RD costs of the blocks included in each list (forward and backward) will be larger than the average RD cost of the previous blocks coded by forward and backward prediction.
Bi-prediction consumes almost 20 % of the total encoding time, and the proposed method can reduce nearly half of the total bi-prediction time with negligible quality loss. In [14], another efficient bi-prediction algorithm has been proposed based on overlapped block motion compensation (OBMC). It views the received motion data as a source of information about the motion field and forms a better prediction of a pixel's intensity based on its own and nearby block MVs.
On the other hand, the prediction modes in HEVC can be divided into three categories: inter, skip, and merge. When a PU is coded in either skip or merge mode, no motion information is transmitted except the index of the selected candidate; for skip, the residual signal is also omitted. Based on this observation, three novel techniques have been proposed in [18] for efficient merging of candidate blocks; these three coding tools were adopted in HEVC and integrated from HM 3.0 onward. In [28], a fast algorithm for the residual quadtree mode decision has been proposed based on the merge and split decision process; experimental results show that it gives a 42–55 % encoding time reduction. In [24], an early merge mode decision algorithm has been reported that uses all-zero-block (AZB) detection and the motion estimation information of the inter 2N × 2N CU.
The abovementioned literature relates to inter-prediction. A good amount of work has also been reported on fast intra-prediction and transform unit (TU) termination. In [38], variance values of coding-mode costs are used to terminate the current CU mode decision as well as the TU size selection. A novel adaptive intra-mode skipping algorithm has been reported in [33] based on the statistical properties of the neighboring reference samples.

6.13 Improved RD Cost-Based Algorithms

Apart from fast mode decision algorithms, researchers have tried to improve the rate-distortion calculation itself. In this context, in [15], a mixture-of-Laplacians-based RD cost calculation scheme has been proposed. That work shows and analyzes that the inter-predicted residues exhibit different statistical characteristics for CU blocks at different depth levels. The experimental results show that rate and distortion models based on the mixture Laplacian distribution estimate the actual rates and distortions better than models based on a single Laplacian distribution.

In order to reduce the total rate-distortion (RD) cost, a set of transform pairs that minimize the total RD cost has been proposed in [41]. The proposed transforms are trained offline using several video sequences and are applied by matrix multiplication. The scheme provides a set of rate-distortion optimized transforms that achieve 2.0 % and 3.2 % bit-rate savings in the intra-HE and intra-LoCo settings, respectively. In [27], the number of full R-D checks for the intra-prediction mode decision is reduced, while the residual quadtree (RQT) check is always done for all intra-prediction modes that undergo R-D checks. That is, fewer intra-prediction modes are tested, but for each tested mode, a thorough search for the optimal transform tree is carried out.

6.14 Efficient Filter-Based Algorithms

The video codec under development still relies on transform-domain quantization and includes the same in-loop deblocking filter adopted in the H.264/AVC standard to reduce quantization blocking artifacts. This deblocking filter provides two offsets to vary the amount of filtering for each image area.
In [22], a perceptual optimization technique for these offsets has been proposed, based on a quality metric able to quantify the impact of blocking artifacts on perceived video quality. The implementation complexity of adaptive loop filtering (ALF) for luma at the decoder is analyzed in [2]; the analysis covers not only computations but also memory bandwidth and memory size. The proposed filters reduce memory bandwidth and size requirements by 25 % and 50 %, respectively, with minimal impact on coding efficiency.
The sample adaptive offset (SAO) has been proposed in [6] to reduce the distortion between reconstructed pixels and original pixels. The proposed SAO achieves 1.3, 2.2, 1.8, and 3.0 % bit-rate reductions. The encoding time is roughly unchanged, and the decoding time increases by 1–3 %.

6.15 Improved Transform-Based Algorithms

Applying mode-dependent separable transforms is an effective method for improving the transform coding of intra-prediction residuals. In [37], an orthogonal four-point integer discrete sine transform (DST) has been proposed that has a multiplier-less implementation consisting of only additions and bit shifts. These properties make the proposed implementation suitable for low-complexity architectures. Experimental results show that the proposed implementation matches the coding performance of a fixed-point arithmetic implementation of the integer odd type-3 discrete sine transform (ODST-3) and closely approaches the performance of fixed-point arithmetic implementations of trained KLTs.

In [34], the new transform coding techniques in the HEVC Test Model are described, including the residual quadtree (RQT) approach and coded block pattern signaling. Experimental results are presented showing the advantage of using larger block-size transforms, especially for high-resolution video material.

References

1. X. Artigas, et al., The DISCOVER codec: architecture, techniques and evaluation. In: Picture
Coding Symposium, vol. 17(9), Lisbon, Portugal, 2007
2. M. Budagavi, V. Sze, M. Zhou, HEVC ALF decode complexity analysis and reduction. In:
International Conference on Image Processing (ICIP), 2011
3. W.J. Chen, J. Su, B. Li, T. Ikenaga, Reversed Intra Prediction Based On Chroma Extraction
In HEVC, International Symposium on Intelligent Signal Processing and Communications
Systems (ISPACS), 2011
4. Z. Chen, et al., Fast integer-pel and fractional-pel motion estimation for H.264/AVC. J. Vis. Commun. Image Represent. 17(2), 264–290 (2006)
5. K. Choi, S.-H. Park, E.S. Jang, Coding tree pruning based CU early termination, document
JCTVC-F092. JCT-VC, July 2011
6. C.-M. Fu, C.-Y. Chen, Y.-W. Huang, S. Lei, Sample adaptive offset for HEVC. In: International
Workshop on Multimedia Signal Processing (MMSP), 2011
7. R.H. Gweon, Y.-L. Lee, J. Lim, Early termination of CU encoding to reduce HEVC complexity,
document JCTVC-F045. JCT-VC, July 2011
8. S.-Y. Huang, C.-Y. Cho, J.-S. Wang, Adaptive fast block-matching algorithm by switching search patterns for sequences with wide-range motion content. IEEE Trans. Circ. Syst. Video Technol. 15(11), 1373–1384 (2005)
9. X. Jing, L.-P. Chau, An efficient three-step search algorithm for block motion estimation. IEEE Trans. Multimedia 6(3), 435–438 (2004)
10. JVT of ISO/IEC MPEG, ITU-T VCEG, MVC software Reference Manual-JMVC 8.2, May
2010
11. J. Kim, S. Jeong, S. Cho, J.S. Choi, Adaptive coding unit early termination algorithm for
HEVC. In: International Conference on Consumer Electronics (ICCE), Las Vegas, 2012
12. J. Kim, S. Jeong, S. Cho, J.S. Choi, An efficient bi-prediction algorithm for HEVC. In:
International Conference on Consumer Electronics (ICCE), Las Vegas, 2012
13. G. Laroche, T. Poirier, P. Onno, Encoder speed-up for the motion vector predictor cost
estimation, document JCTVC-H0178. JCT-VC, Feb. 2012
14. C.-L. Lee, C.-C. Chen, Y.-W. Chen, M.-H. Wu, C.-H. Wu, W.-H. Peng, Bi-prediction combined
template and block motion compensations. In: International Conference on Image processing
(ICIP), 2011
15. B. Lee, M. Kim, Modeling rates and distortions based on a mixture of laplacian distributions
for inter-predicted residues in quadtree coding of HEVC. IEEE Signal Process. Lett. 18(10),
571–574 (2011)
16. J. Leng, L. Sun, T. Ikenaga, S. Sakaida, Content based hierarchical fast coding unit decision
algorithm for HEVC. In: International Conference on Multimedia and Signal Processing, 2011
17. R. Li, B. Zeng, M.L. Liou, A new three-step search algorithm for block motion estimation. IEEE Trans. Circ. Syst. Video Technol. 4(4), 438–442 (1994)
18. J.-L. Lin, Y.-W. Chen, Y.-P. Tsai, Y.-W. Huang, S. Lei, Motion vector coding techniques for
HEVC. In: International Workshop on Multimedia Signal Processing (MMSP), 2011
19. P. Liu, Y. Gao, K. Jia, An adaptive motion estimation scheme for video coding. Scientific World
J. 2014 (2014)
20. J. Lu, M.L. Liou, A simple and efficient search algorithm for block-matching motion estimation. IEEE Trans. Circ. Syst. Video Technol. 7(2), 429–433 (1997)

21. L. Luo, et al., A new prediction search algorithm for block motion estimation in video coding. IEEE Trans. Consumer Electron. 43(1), 56–61 (1997)
22. M. Naccari, C. Brites, J. Ascenso, F. Pereira, Low complexity deblocking filter perceptual
optimization for the HEVC codec. In: International Conference on Image Processing (ICIP),
2011
23. K.-H. Ng, et al., A search patterns switching algorithm for block motion estimation. IEEE Trans. Circ. Syst. Video Technol. 19(5), 753–759 (2009)
24. Z. Pan, S. Kwong, M.T. Sun, J. Lei, Early merge mode decision based on motion estimation and hierarchical depth correlation for HEVC. IEEE Trans. Broadcasting 60(2), 405–412 (2014)
25. X. Peng, J. Xu, F. Wu, Exploiting inter-frame correlations in compound video coding. In:
International Conference on Visual Communications and Image Processing (VCIP), 2011
26. N. Purnachand, L.N. Alves, A. Navarro, Improvements to TZ search motion estimation
algorithm for multiview video coding. In: 19th International Conference on Systems, Signals
and Image Processing (IWSSIP), 2012. IEEE, 2012
27. Y.H. Tan, C. Yeo, H.L. Tan, Z. Li, On residual quad-tree coding in HEVC. In: International
Workshop on Multimedia Signal Processing (MMSP), 2011
28. S.-W. Teng, H.-M. Hang, Y.-F. Chen, Fast mode decision algorithm for residual quadtree
coding. In: International Conference on Visual Communications and Image Processing (VCIP),
2011
29. A.M. Tourapis, et al., Fast block-matching motion estimation using advanced predictive diamond zonal search (APDZS). In: ISO/IEC JTC1/SC29/WG11 MPEG2000, M5865 (2000)
30. A.M. Tourapis, O.C. Au, M.L. Liou, Predictive motion vector field adaptive search technique
(PMVFAST)-enhancing block based motion estimation. In: Proceedings of SPIE., vol. 4310,
2001
31. A.M. Tourapis, Enhanced predictive zonal search for single and multiple frame motion
estimation. Electronic Imaging 2002. In: International Society for Optics and Photonics, 2002
32. S. Van Leuven, J. De Cock, P. Lambert, R. Van de Walle, J. Barbarien, A. Munteanu, Improved
intra mode signaling for HEVC. In: International Conference on Multimedia and Expo (ICME),
2011
33. L.L. Wang, W.C. Siu, Novel adaptive algorithm for intra prediction with compromised modes
skipping and signalling process in HEVC, IEEE Trans. Circuits Syst. Video Technol., 23(10),
1686–1694 (2013)
34. M. Winken, P. Helle, D. Marpe, H. Schwarz, T. Wiegand, Transform coding in the HEVC test
model. In: International Conference on Image processing (ICIP), 2011
35. J. Yang, J. Kim, K. Won, H. Lee, B. Jeon, Early SKIP detection for HEVC, document JCTVC-
G543. JCT-VC, Geneva, Switzerland, Nov. 2011
36. C.H. Yeh, K.J. Fan, M.J. Chen, G.L. Li, Fast Mode Decision Algorithm for Scalable Video
Coding Using Bayesian Theorem Detection and Markov Process, IEEE Trans. Circuits Syst.
Video Technol. 20(4), 536–574 (2010)
37. C. Yeo, Y.H. Tan, Z. Li, Low complexity mode dependent KLT for block based intra coding.
In: International Conference on Image Processing (ICIP), 2011
38. H. Zhang, Z. Ma, Early Termination Schemes for Fast Intra Mode Decision in High Efficiency
Video Coding, IEEE Inter. Symposium Circuits Syst., Beijing, China, 19–23 (2013)
39. L. Zhao, L. Zhang, S. Ma, D. Zhao, Fast mode decision algorithm for intra prediction in HEVC.
In: International Conference on Visual Communications and Image Processing (VCIP), 2011
40. C. Zhu, X. Lin, L.-P. Chau, Hexagon-based search pattern for fast block motion estimation. IEEE Trans. Circ. Syst. Video Technol. 12(5), 349–355 (2002)
41. F. Zou, O.C. Au, C. Pang, J. Dai, Rate distortion optimized transform for intra block coding for
HEVC. In: International Conference on Visual Communications and Image Processing (VCIP),
2011
