Basic Prediction Techniques in Modern Video Coding Standards

Byung-Gyu Kim
Kalyan Goswami

SpringerBriefs in Electrical and Computer Engineering
Byung-Gyu Kim
Department of IT Engineering
Sookmyung Women’s University
Seoul, Republic of Korea

Kalyan Goswami
Visual Media Research Section
Broadcasting and Media Research Laboratory
Electronics and Telecommunications Research Institute (ETRI)
Daejeon, Republic of Korea
This book is intended as a basic technical guide for the latest video coding standard
with general descriptions of the latest video compression standard technologies. The
H.264/advanced video coding (AVC) scheme as a video compression standard has
been applied in a variety of multimedia services over the last 10 years. As the latest
video coding standard, High Efficiency Video Coding (HEVC) standard technology
is also expected to be used in a variety of ultra-high-definition (UHD) multimedia
and immersive media services over the next 10 years.
The structure of the H.264/AVC standard scheme is explained in contrast with
earlier technologies, and the HEVC video compression technology is presented.
The history and background of the overall video coding technology and the hybrid
video codec structure are explained in the Introduction. A detailed explanation of the
modules and functions of the hybrid video codec is presented in Chap. 2. Detailed
descriptions of the intra-prediction, inter-prediction, and RD optimization techniques
of the H.264/AVC standard modules of the video codec follow. The high video
quality achieved by this standard comes at the cost of computational complexity in
the video encoding system. Thus, fast algorithms and schemes for reducing the
computational complexity of the HEVC encoding system are presented and analyzed in Chap. 6.
A complete and exhaustive analysis of HEVC and the
H.264/AVC video codec is beyond the scope of this book. However, the latest
technologies used in the codec are presented in an attempt to gain an understanding
of both structure and function. Basic principles of video data compression based on
removal of correlations between data are presented and explained. Therefore, this
book will help interested readers to gain an understanding of the latest video codec
technology.
Contents

1 Introduction
  1.1 Background and Need for Video Compression
  1.2 Classifications of the Redundancies
    1.2.1 Statistical Redundancy
    1.2.2 Psycho-Visual Redundancy
  1.3 Hybrid Video Codec
  1.4 Brief History About Compression Standards
  1.5 About This Book
  References
2 Hybrid Video Codec Structure
  2.1 Picture Partitioning
    2.1.1 High-Level Picture Partitioning
  2.2 Block Partitioning
    2.2.1 H.264/AVC Block Partitioning
    2.2.2 HEVC Block Partitioning
  2.3 Prediction Modes
  2.4 In-Loop Filters
    2.4.1 Deblocking Filter
    2.4.2 Sample Adaptive Offset
  2.5 Entropy Coding
    2.5.1 Huffman Coding
    2.5.2 Arithmetic Coding
    2.5.3 CABAC
3 Intra-prediction Techniques
  3.1 Background
  3.2 Intra-prediction Modes in H.264/AVC
  3.3 Intra-prediction Modes in HEVC
    3.3.1 Angular Prediction
    3.3.2 DC and Planar Prediction
    3.3.3 Reference Sample Smoothing and Boundary Value Smoothing
  3.4 Lossless Intra-prediction Using DPCM
  References
4 Inter-prediction Techniques
  4.1 Motion Estimation
  4.2 Uni- and Bidirectional Predictions
  4.3 Complexity in the Inter-prediction
  4.4 Different Inter-prediction Modes
  4.5 Merge and Skip Modes
  4.6 Motion Vector Prediction
5 RD Cost Optimization
  5.1 Background
  5.2 Classical Theory of RD Cost
  5.3 Distortion Measurement Technique
    5.3.1 Mean of Squared Error
    5.3.2 Mean of Absolute Difference
    5.3.3 Sum of Absolute Difference
  5.4 Calculating the RD Cost Function
  Reference
6 Fast Prediction Techniques
  6.1 Need for the Fast Prediction Algorithms
  6.2 Fast Options in HEVC Encoder
    6.2.1 Early CU Termination
    6.2.2 Early Skip Detection
    6.2.3 CBF Fast Mode Setting
    6.2.4 Fast Decision for Merge RD Cost
  6.3 Block Matching Algorithm
  6.4 Full Search
  6.5 Unsymmetrical-Cross Multihexagon-Grid Search
  6.6 Diamond Search
  6.7 Enhanced Predictive Zonal Search
  6.8 Test Zone Search
  6.9 Fixed Search Patterns
  6.10 Search Patterns Based on Block Correlation
  6.11 Search Patterns Based on Motion Classification
  6.12 Prediction-Based Fast Algorithms
  6.13 Improved RD Cost-Based Algorithms
  6.14 Efficient Filter-Based Algorithms
  6.15 Improved Transform-Based Algorithms
  References
Chapter 1
Introduction
1.1 Background and Need for Video Compression

The field of video processing is concerned with information processing activity for
which the input and output signals are video sequences. A wide range of emerging
applications, such as videophone, video conferencing through wired and wireless
medium, streaming video, digital TV/HDTV broadcast, video database service,
CD/DVD storage, etc., demand a significant amount of video compression to store
or transmit the video efficiently. Recently, a drastic change has taken place in video
communication technology, from lower-resolution video to the ultra-high-definition
(UHD) video format. In modern society, there is huge consumer demand for UHD
video in real-time systems.
Now, in order to transmit or store video data, compression of the raw file is
essential. Video compression refers to the tools and techniques that operate on
video sequences to reduce the quantity of data. Today, modern data compression
techniques can store or transmit the vast amount of data needed to represent a video
sequence in an efficient and robust way. One question arises at this point: why is
video compression needed, when we could simply store a raw file instead of a
compressed one? The answer lies in the amount of data. Generally, an uncompressed
video signal generates a huge quantity of data, which is difficult to store and to
transmit through a channel. For this reason, raw video data needs to be compressed
for everyday applications. A second question then arises: how is video data
compressed? Over the last few decades, a large body of research has been reported
in the domain of video compression. In a nutshell, most natural video sequences
contain a huge amount of redundant data, which can be exploited using
statistical models and the psycho-visual limitations of the human eye. Algorithms for
video compression are mainly based on a statistical model of input data or psycho-
visual limitations of the human eye, which reduce the raw video sequence to a
compressed data sequence. The act of discarding data introduces distortion in the
decompressed data sequence. However, the compression is done in such a way that
the introduced distortion is not noticeable to the human eye.
This introductory chapter starts with a brief explanation of the different redundancies,
which are fundamental concepts in video compression theory. It continues with
a description of the modern hybrid video codec used in High Efficiency Video
Coding (HEVC). A brief history of compression standards is given next. Finally,
the chapter ends with the organization of the book.
1.2 Classifications of the Redundancies

In the previous section, we introduced the term “redundancy.” Informally, it can
be thought of as the repetition of data in a data set. For example,
if we consider a pixel in an image, then most of its neighboring pixels have similar
intensity values. Moreover, if it is a homogeneous region, then there is a high chance
that most of its neighboring pixels have the same value. This kind of similarity of
data is generally named as redundancy. Broadly, the redundancies can be divided
into two categories: statistical and psycho-visual redundancies.
1.2.1 Statistical Redundancy

Statistical redundancy occurs because pixels within an image tend to have intensity
values similar to those of their neighbors and, for video, because intensities at the
same pixel position across successive frames tend to be very similar. For this reason,
statistical redundancy can be subdivided into two categories: spatial and temporal
redundancies.
For an image, it can be easily observed that most of the pixels have almost the same
intensity level as those in their neighborhood. Only at the boundary of an object, the
intensity changes significantly. Hence, there is a considerable amount of redundancy
present in an image which can be exploited for significant data compression. This
kind of redundancy is called spatial redundancy. The spatial redundancy can be
exploited by using lossless and lossy compression techniques. Lossless compression
algorithms operate on a statistical model of input data. The general concept of
lossless compression is to assign shorter code words to more frequently occurring
symbols and longer code words to less frequently occurring symbols. Run-length
coding, entropy coding, and Lempel-Ziv coding are some of the examples of lossless
compression technique. Lossy compression algorithms, on the other hand, employ
psycho-visual limitations of the human eye to discard redundant data. The human
eye is more responsive to slow and gradual changes of illumination than perceiving
finer details and rapid change of intensities. Exploitation of such psycho-visual
characteristic has been incorporated within the multimedia standards like JPEG,
MPEGs, and H.26x.
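The shorter-codes-for-frequent-symbols idea is easiest to see with run-length coding, the simplest of the lossless schemes listed above. A minimal sketch (the function names are ours, not from any standard):

```python
def rle_encode(symbols):
    """Run-length encode a sequence into (symbol, run_length) pairs."""
    runs = []
    for s in symbols:
        if runs and runs[-1][0] == s:
            runs[-1][1] += 1          # extend the current run
        else:
            runs.append([s, 1])       # start a new run
    return [(s, n) for s, n in runs]

def rle_decode(runs):
    """Invert rle_encode exactly -- the scheme is lossless."""
    out = []
    for s, n in runs:
        out.extend([s] * n)
    return out

# A homogeneous scan line collapses from 20 samples to 3 pairs.
line = [128] * 12 + [130] * 3 + [128] * 5
runs = rle_encode(line)
assert rle_decode(runs) == line
print(runs)   # [(128, 12), (130, 3), (128, 5)]
```

Long runs of identical intensity values, common in flat image regions, are exactly what this scheme rewards.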
In Fig. 1.1, a frame from the Foreman video sequence is shown. In this frame,
there are many regions where neighboring pixels are very similar to one another.
Some of these similar patches are marked in the figure, and it is clear that the
pixels within the marked blocks have very similar intensity values. This is a basic
and fundamental example of spatial redundancy. In the next section, we discuss
temporal redundancy.
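This left-neighbor similarity is exactly what differential coding exploits; the DPCM-based lossless intra-prediction of Sect. 3.4 is built on the same idea. A toy one-dimensional sketch, with names of our own choosing:

```python
def dpcm_encode(row):
    """Predict each pixel from its left neighbor; store the small residuals."""
    residuals = [row[0]]                     # first pixel is sent as-is
    for i in range(1, len(row)):
        residuals.append(row[i] - row[i - 1])
    return residuals

def dpcm_decode(residuals):
    """Rebuild the row by accumulating residuals -- lossless."""
    row = [residuals[0]]
    for r in residuals[1:]:
        row.append(row[-1] + r)
    return row

row = [100, 101, 101, 102, 104, 103, 103]    # a smooth scan line
res = dpcm_encode(row)
assert dpcm_decode(res) == row               # exact reconstruction
print(res)   # [100, 1, 0, 1, 2, -1, 0]
```

The residuals cluster near zero, so an entropy coder can represent them with far fewer bits than the raw intensities.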
A video sequence can be considered a sequence of frames, so spatial redundancy is
present within each frame. Apart from that, between successive frames only a
limited amount of object movement is possible; hence, most of the pixels do not
change at all between successive frames. This is called
temporal redundancy, which is exploited through the prediction of the current frame
using the stored information of the past frames. The temporal prediction is based
on the assumption that consecutive frames in a video sequence have a very close
similarity. This assumption is mostly valid except for the frames having significant
change of content or appearance of new objects in a frame. The prediction technique
is applied on the current frame with respect to the previous frame(s). Hence,
redundancies are not only present within a frame (spatial redundancy) but also
between successive frames (temporal redundancy) for a video. To compress a video
sequence efficiently, both of these redundancies need to be exploited and reduced as
much as possible.
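The benefit of exploiting temporal redundancy can be seen numerically: subtracting a prediction (here, simply the previous frame) leaves a residual whose values cluster near zero and are therefore much cheaper to code. A toy sketch with made-up sample values:

```python
def variance(xs):
    """Population variance, used here as a rough proxy for coding cost."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

# Two "frames" of a toy scene: spatially busy, but temporally almost static.
frame_prev = [20, 80, 25, 75, 20, 80, 25, 75]
frame_curr = [20, 80, 25, 75, 23, 80, 25, 74]   # only two samples changed

# Temporal prediction: use the previous frame as the prediction.
residual = [c - p for c, p in zip(frame_curr, frame_prev)]

print(residual)                                  # [0, 0, 0, 0, 3, 0, 0, -1]
print(variance(frame_curr) > variance(residual)) # True
```

The raw frame has large spatial variation, but the temporal residual is almost entirely zeros, which is the redundancy the prediction loop removes.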
In Fig. 1.2, an example of temporal redundancy is shown: the first ten frames of the
Akiyo sequence. Looking closely, it is clear that, apart from the lip and eye regions
of the speaker’s face, the rest of the scene is static across these ten frames. Hence,
from the very first frame it is possible to predict the tenth frame, given information
about the motion of the lip and eye regions. This is the fundamental idea behind
temporal redundancy. In the next section, we discuss psycho-visual redundancy.
1.3 Hybrid Video Codec

A video codec is a device capable of encoding and decoding a video stream. Since
the modern video codec uses a combination of predictive and transform-domain
techniques, it is generally referred to as a hybrid codec. Simplified block diagrams of
a hybrid video encoder and decoder are shown in Figs. 1.3 and 1.4, respectively.
In this codec, current frame is predicted using temporal and spatial redundancies
from the previously encoded reference frame(s). The temporal prediction is based
on the assumption that the consecutive frames in a video sequence have a very close
similarity.

[Fig. 1.3: Block diagram of a hybrid video encoder. The predicted frame is
subtracted from the current frame to give the residual image; the residual passes
through the transform (DCT), quantizer (Q), variable-length coder (VLC), and
buffer to produce the output bit stream. An inverse quantizer, inverse transform,
and motion-compensated predictor, driven by motion estimation and motion
vectors, form the reconstruction loop.]

[Fig. 1.4: Block diagram of a hybrid video decoder. The input bit stream passes
through a buffer, inverse VLC, inverse quantizer, and inverse transform; the
motion-compensated predictor, driven by the received motion vectors, adds the
prediction to reconstruct the decoded video.]

This assumption is mostly valid except for the frames having significant
change of content or some significant scene change. For this kind of scenario, spatial
redundancy of the new region (scene) is needed to be exploited.
In a hybrid video codec, when a frame F_N (the Nth frame in a sequence) comes
as an input, it is first compared with its predicted frame F̂_N. Generally, the
predicted frame F̂_N is subtracted from the current frame F_N, and the error
image is called the residual image ΔF. Since the current and predicted frames are
very similar (depending on the prediction technique), the residual image generally
exhibits considerable spatial redundancy. Moreover, from the residual image and
the predicted frame, the current frame can be reconstructed by a simple addition,
without any error. In Fig. 1.3, the residual image is shown in black because, in the
ideal case (when the current and predicted frames are identical), each pixel in the
residual image has the value 0, which produces a black image.
Since the residual image has significant spatial redundancy, this redundancy should
be exploited properly. For this reason, the residual is transformed into the frequency
domain; generally, the discrete cosine transform (DCT) is used in the hybrid codec.
One question may arise at this point: to move into the frequency domain, why not
use the discrete Fourier transform (DFT)? The main advantage of the DCT over the
DFT is its energy compaction: after transformation into the frequency domain, the
DCT packs the signal energy into fewer coefficients and therefore requires fewer
bits than the DFT.
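This compaction advantage can be checked directly on a small example. The sketch below implements an (unnormalized) type-II DCT and a plain DFT from their textbook definitions and compares how much of the signal energy each packs into the low-frequency half of the coefficients, for a smooth ramp of the kind natural images are rich in:

```python
import cmath
import math

def dct2(x):
    """Unnormalized type-II DCT, the transform family used in hybrid codecs."""
    N = len(x)
    return [sum(x[n] * math.cos(math.pi * k * (2 * n + 1) / (2 * N))
                for n in range(N)) for k in range(N)]

def dft(x):
    """Plain discrete Fourier transform, for comparison."""
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * math.pi * k * n / N)
                for n in range(N)) for k in range(N)]

def energy_in_first_half(coeffs):
    """Fraction of total squared magnitude held by the low-frequency half."""
    mags = [abs(c) ** 2 for c in coeffs]
    return sum(mags[:len(mags) // 2]) / sum(mags)

# A smooth intensity ramp, typical of natural image content.
x = [10, 12, 14, 16, 18, 20, 22, 24]

print(energy_in_first_half(dct2(x)) > energy_in_first_half(dft(x)))  # True
```

For this ramp, the DCT leaves almost nothing in its upper coefficients, while the DFT (whose basis wraps the signal periodically, creating an artificial jump) spreads noticeably more energy into high frequencies.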
The compression schemes applied in the hybrid codec so far are based on a
statistical approach. After the DCT, a quantization operation, which is based on
psycho-visual redundancy, is performed on the transformed residual. Conceptually,
it is a matrix operation over the DCT output that eliminates the high-frequency
terms. As mentioned earlier, the human eye is more sensitive to low-frequency
components than to high-frequency ones. Hence, if we drop the high-frequency
terms from the DCT output and reconstruct the video signal, a human observer
perceives no significant change from the original. The quantization parameter (QP)
is one of the most important features of the hybrid video codec, because
quantization is the only stage at which error is introduced into the output bit stream.
The quantization matrices are fixed for a particular video codec, and they are
constructed after rigorous psycho-visual experiments on human observers.
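A simplified sketch of the quantization step follows. Real codecs apply frequency-dependent scaling matrices tuned by the psycho-visual experiments mentioned above; here a single uniform step is used, with the step size doubling every 6 QP units in the spirit of H.264/HEVC (the function names are ours):

```python
def quantize(coeffs, qp):
    """Uniform quantization: a larger QP means a coarser step and more zeros."""
    step = 2 ** (qp / 6)                 # step doubles every 6 QP units
    return [round(c / step) for c in coeffs]

def dequantize(levels, qp):
    """Approximate inverse: reconstruction error is the quantization error."""
    step = 2 ** (qp / 6)
    return [l * step for l in levels]

# Toy DCT-like coefficients: one large DC term, decaying AC terms.
coeffs = [120.0, -31.0, 9.5, -4.2, 1.8, -0.9, 0.4, -0.1]

for qp in (12, 24, 36):
    print(qp, quantize(coeffs, qp))
```

At low QP most coefficients survive; at QP = 36 only the DC term remains here. The zeroed high-frequency terms are exactly the controlled loss that quantization introduces.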
After the quantization, the output data is again compressed by the lossless
entropy coding or variable length coding (VLC), and the final output is sent to
the channel after proper buffering. Generally, arithmetic coding-based approaches
are used in the modern hybrid video codec for the entropy coding. In Fig. 1.3, a
feedback loop is added from the buffer to the quantizer. This loop signifies the
adaptive quantization parameter setting technique which is generally used in the
modern codec.
In the hybrid video codec, a decoder is also embedded in the encoder side. The
block diagram of a decoder is shown in Fig. 1.4. A decoder generally consists of
an inverse quantizer, inverse transform, and motion-compensated predictor. If we
notice carefully, then it is quite easy to observe the same decoder block in the
encoder side (Fig. 1.3). The reason to embed a decoder in the encoder is that the
encoder must form its predictions from exactly the same reconstructed reference
pictures that the end user’s decoder will observe; otherwise, the encoder and
decoder predictions would drift apart.
The motion estimation and motion-compensated prediction are the most impor-
tant parts of the hybrid video codec. From the reference frame, the current frame is
predicted using the motion vectors. The detailed description of this technique will
be discussed in the next chapter.
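As a preview of the motion estimation described in the next chapter, the sketch below performs exhaustive (full-search) block matching with the sum of absolute differences (SAD) as the matching cost, on a toy frame pair; all names and sizes are illustrative:

```python
def sad(block_a, block_b):
    """Sum of absolute differences between two equally sized blocks."""
    return sum(abs(a - b) for ra, rb in zip(block_a, block_b)
                          for a, b in zip(ra, rb))

def block(frame, y, x, size):
    """Extract a size-by-size block whose top-left corner is (y, x)."""
    return [row[x:x + size] for row in frame[y:y + size]]

def full_search(cur, ref, y, x, size, search_range):
    """Test every candidate displacement; return (best_cost, dy, dx)."""
    target = block(cur, y, x, size)
    best = None
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            ry, rx = y + dy, x + dx
            if 0 <= ry <= len(ref) - size and 0 <= rx <= len(ref[0]) - size:
                cost = sad(target, block(ref, ry, rx, size))
                if best is None or cost < best[0]:
                    best = (cost, dy, dx)
    return best

# Reference frame with a bright 2x2 patch; in the current frame the
# patch has moved one sample right and one sample down.
ref = [[0] * 6 for _ in range(6)]
ref[1][1] = ref[1][2] = ref[2][1] = ref[2][2] = 200
cur = [[0] * 6 for _ in range(6)]
cur[2][2] = cur[2][3] = cur[3][2] = cur[3][3] = 200

print(full_search(cur, ref, 2, 2, 2, 2))   # (0, -1, -1): zero SAD at dy = dx = -1
```

The motion vector (-1, -1) says the best match for the current block lies one sample up and to the left in the reference frame, matching the patch's motion. The exhaustive loop is why full search is so expensive and why the fast search patterns of Chap. 6 exist.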
In Fig. 1.5, the basic block diagram of H.264/AVC is shown. To date, it is the
most widely used commercial encoder. It is a block-based encoding technique in
which the block size is fixed at 16 × 16. These fixed-size blocks in H.264/AVC
are generally referred to as macroblocks (MBs). The main
goals of the H.264/AVC standardization effort have been enhanced compression
performance and provision of a “network-friendly” video representation addressing
“conversational” (video telephony) and “non-conversational” (storage, broadcast, or
streaming) applications [1].
For more than a decade, the hybrid video coding techniques discussed above have
been in commercial use. Moreover, the latest video standard, HEVC, has also
adopted the same techniques. The block diagram of the HEVC encoder is shown in
Fig. 1.6. The HEVC standard is designed to achieve multiple goals, including
coding efficiency, ease of transport system integration, and data loss resilience, as
well as the ability to be implemented using parallel processing architectures. A
detailed description of this latest hybrid codec is given in the next chapter.
Now, one question may appear in your mind: what is the need for standardizations
such as H.264 and HEVC? The answer is very simple. To decode a compressed
video sequence, you have to know the encoding scheme. In the absence of any
standardization, anybody could compress a video sequence with a private
algorithm, which other users would find very difficult to decode. Hence, more
formally, we can say that standardization of video encoding is required to ensure
interoperability between encoders and decoders from different manufacturers and
to minimize patent conflicts [3]. In the next section, we will discuss a brief history
of the different compression standards.
1.4 Brief History About Compression Standards

Efforts to standardize video encoders have been actively in progress since the
early 1980s. An expert group, named the Moving Picture Experts Group (MPEG),
was established in 1988 in the framework of the Joint ISO/IEC Technical Committee.
The first standard produced by this team, in 1992, was known as MPEG-1. Today,
MPEG-1 is used in the video CD (VCD), which is supported by most DVD players,
with video quality at 1.5 Mbit/s and 352 × 288/240-pixel resolution.
After that in 1993, the next version of the standard was introduced by the same
team named as MPEG-2. The MPEG-2 added improved compression tools and
interlace support and ushered in the era of digital television (DTV) and DVD. To
date, most DVD players, all DTV systems, and some digital camcorders use
MPEG-2 [4].
In 1994, the MPEG committee introduced a new standardization phase, called
MPEG-4, which finally became a standard in 2000. In MPEG-4, many novel coding
concepts were introduced such as interactive graphics, object and shape coding,
wavelet-based still image coding, face modeling, scalable coding, and 3D graphics.
Very few of these techniques have found their way into commercial products. Later,
standardization efforts have focused more narrowly on compression of regular video
sequences [5].
Apart from the MPEG committee, the International Telecommunication Union—
Telecommunication Standardization Sector (ITU-T) also evolved the standards for
the multimedia communications in parallel. In 1988–1990, the H.261 standard was
developed by this group; it was a forerunner of MPEG-1. The target was to transmit
video over ISDN lines, at multiples of 64 kbit/s data rates and at CIF
(352 × 288-pixel) or QCIF (176 × 144-pixel) resolution.
The H.263 standard (1995) developed by the ITU was a big step forward and is
today’s dominant video conferencing and cell phone codec [5]. H.263 built upon
MPEG-1, MPEG-2, and H.261 (an earlier video teleconferencing standard) and
added new coding tools optimized for very low bit-rate applications [4].
The need for further improvement in coding efficiency led the Video Coding
Experts Group (VCEG) of the ITU-T, in 1998, to invite proposals for a new video
coding project named H.26L. The goal was to compress video at twice the rate of
the previous video standards while retaining the same picture quality. In December
2001, these two leading groups (VCEG and MPEG) merged together and formed a
Joint Video Team (JVT). Their combined effort was originally known as H.264/AVC
[1]. Due to its improved compression quality, H.264 is quickly becoming the leading
standard; it has been adopted in many video coding applications such as the iPod
and the PlayStation Portable, as well as in TV broadcasting standards such as
DVB-H and DMB. Portable applications primarily use the Baseline Profile up to
SD resolutions, while high-end video coding applications such as set-top boxes,
Blu-ray, and HD-DVD use the Main or High Profile at HD resolutions. The Baseline
Profile does not support interlaced content; the higher profiles do [4].
The increased commercial interest in video communication calls forth the need for
international video coding standards. Such standardization requires collaboration
between regions and countries with different infrastructures (both academic and
industrial), with different technical background, and with different political and
commercial interests [6]. The primary goal of most video coding standards is the
ability to minimize the bit rate necessary for representation of video content to
reach a given level of video quality [7]. However, international standards do not
necessarily represent the best technical solutions but rather attempt to achieve
a trade-off between the amount of flexibility and efficiency supported by the
standard and the complexity of the implementation required for the standard [6].
Recently, the ITU-T Video Coding Experts Group (VCEG) and the ISO/IEC
Moving Picture Experts Group (MPEG) joined together in a partnership known as
the Joint Collaborative Team on Video Coding (JCT-VC) [8]. In January 2013, this
joint standardization organization finalized the latest video coding standard named
the High Efficiency Video Coding (HEVC) [2]. This new standard is designed
to achieve multiple goals, including bit-rate reduction over the previous standard
(H.264/MPEG-4 AVC [9]) while maintaining the same picture quality, ease of
transport system integration, and data loss resilience, as well as the ability to
implement it using parallel processing architectures [2]. The major motivation for
this new standard is the growing popularity of HD video and the demand of the
UHD format in commercial video transmission.
ITU-T VCEG (Q6/16) and ISO/IEC MPEG (JTC 1/SC 29/WG 11) are studying
the potential need for standardization of future video coding technology with a
compression capability that significantly exceeds that of the current HEVC standard
(including its current extensions and near-term extensions for screen content coding
and high-dynamic-range coding). Such future standardization action could either
take the form of additional extension(s) of HEVC or an entirely new standard. The
groups are working together on this exploration activity in a joint collaboration
effort known as the Joint Video Exploration Team (JVET) to evaluate compression
technology designs proposed by their experts in this area. The description of
encoding strategies used in experiments for the study of the new technology is
referred to as Joint Exploration Model (JEM). The first meeting was held on October
19–21, 2015.
1.5 About This Book

In this book, we focus on the basic prediction techniques that are widely used in
modern video codecs. The hybrid codec structure and the inter- and intra-prediction
techniques of MPEG-4, H.264/AVC, and HEVC are discussed together. When we
started our own research on video codecs, we spent a lot of time understanding
the basic algorithms behind each step. We gained this knowledge from the
specification documents and the research papers, which was very time-consuming
and tedious. For this reason, we believe a textbook is essential in this domain, so
that new researchers can understand the basic algorithms of the video codec easily.
Moreover, this book also summarizes the latest research trends, which can help
readers pursue further research in this area.
The book is organized as follows:
• Chapter 2 explains the hybrid video codec in detail. The picture partitioning
techniques are discussed here. The basic concepts of the intra- and inter-prediction
modes are also highlighted. Moreover, the in-loop filters, DCT, quantization, and
entropy coding techniques are explained in detail.
• Chapter 3 is mainly focused on the intra-prediction techniques in the latest video
codecs. In this context, angular, planar, and DC intra-prediction techniques are
explained in detail. After that, smoothing algorithms and DPCM-based lossless
intra-prediction are also explained.
• Chapter 4 highlights inter-prediction techniques. Unidirectional and bidirectional
prediction techniques are discussed here. Different inter-prediction modes are
explained in detail. Moreover, motion vector prediction is also covered.
• Chapter 5 explains the RD cost optimization theory. The background and the
classical RD theory are also discussed here.
• Chapter 6 is dedicated for the researchers in this domain. In this chapter, the
latest works in the fast prediction techniques are discussed in detail.
Chapter 2
Hybrid Video Codec Structure
In the previous chapter, a brief description of the latest hybrid video codec was given. The hybrid video encoder is basically a block-based encoder, which breaks a picture into blocks and processes each of them either independently or with dependencies. Generally, the hybrid video codec uses a two-layered high-level system design for picture partitioning: the video coding layer (VCL) and the network abstraction layer (NAL). The VCL covers the low-level processing, such as picture prediction, transform coding, entropy coding, and in-loop filtering. The NAL, on the other hand, provides the high-level partitioning by encapsulating the coded data and the associated information into a logical data packet format, which is useful for video transmission over various transport layers. This kind of high-level partitioning is needed for parallel processing and packetization. In the next subsection, we will discuss the high-level picture partitioning in detail.
As mentioned earlier, high-level picture partitioning is required for parallel processing and packetization. The latest video standard, HEVC, as well as its predecessors, uses slices for this kind of high-level picture partitioning.
2.1.1.1 Slice
Fig. 2.1 Partitioning of a picture into slices; a slice may consist of an independent slice segment and one or more dependent slice segments
Fig. 2.2 Encoding structures of the three slice types (e.g., I0 P1 P2 P3 for a P-slice and I0 B1 B2 P3 for a B-slice)
1. I-slice: Here, all the elements (coding units) of the slice are encoded using intra-picture prediction.
2. P-slice: In this case, in addition to the intra-prediction modes, some of the elements (coding units) of the slice are predicted using inter-picture prediction from only one reference picture.
3. B-slice: Finally, the concept of the B-slice is quite similar to that of the P-slice; the only difference is that more than one reference picture (generally two) is used, so the bi-prediction method is applied here.
All three slice encoding structures are shown in Fig. 2.2. For P- and B-slices, the first element should be of intra-type. Moreover, for a B-slice, the second reference element should be inter-predicted (P-type).
2.1.1.2 Tile
The picture partitioning mechanism of tiles is quite similar to that of slices, but here only rectangular-shaped partitioning is allowed, as shown in Fig. 2.3. A slice, on the other hand, is not restricted to a rectangular shape. Tiles are independently decodable regions of a picture. The main advantage of tiles is that they can enhance parallel processing, and they can also be used for spatial random access. In terms of error resilience, tiles are not very attractive, whereas for coding efficiency, tiles provide superior performance over slices.
Wavefront parallel processing (WPP) is another recent feature of the HEVC encoder. When the WPP option is enabled, a slice is divided into rows of elements (coding tree units or CTUs). The first row is processed in an ordinary way. The magic starts from the second row onward: after processing the second element of the first row, the processing of the second row can start. Similarly, after processing the second element of the second row, the third row can be processed, and so on.
Fig. 2.3 Tile partitioning of a picture into four rectangular tiles
Fig. 2.4 Wavefront parallel processing: threads T1, T2, and T3 process consecutive CTU rows of a slice
The pictorial representation of the WPP is shown in Fig. 2.4. The first thread (T1) starts normally. After T1 finishes its second element (element number 2 in this figure), the second thread T2 starts. Similarly, after T2 finishes its second element (element number 10 in this figure), T3 starts working. The WPP provides excellent parallel processing within a slice. Moreover, it may provide better compression performance than tiles.
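The WPP dependency rule described above can be sketched in a few lines of Python. This is an illustrative model, not encoder code: we assume one thread per CTU row, one CTU processed per time step, and that a CTU can start once its left neighbor and the top-right neighbor in the row above are finished.

```python
def wpp_start_times(rows, cols):
    """Earliest time step at which each CTU can start under WPP.

    A CTU waits for its left neighbor and for the top-right neighbor
    of the row above; hence each row trails the previous one by two
    CTUs, as in Fig. 2.4."""
    start = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            deps = []
            if c > 0:
                deps.append(start[r][c - 1] + 1)      # left neighbor done
            if r > 0 and c + 1 < cols:
                deps.append(start[r - 1][c + 1] + 1)  # top-right neighbor done
            elif r > 0:
                deps.append(start[r - 1][c] + 1)      # last column: top neighbor
            start[r][c] = max(deps, default=0)
    return start
```

Running this on a 3-row slice shows each thread starting two steps after the one above it, exactly the staggering described in the text.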
The modern hybrid encoders divide a frame into different blocks and process each of them separately. By "processing," we mean the prediction, transform, in-loop filtering, and so on. The block sizes may or may not be fixed. In this context, we will discuss the block partitioning techniques of the H.264/AVC and HEVC encoders separately.
The basic building units of the H.264/AVC are the macroblocks (MBs). An MB consists of a fixed-size 16×16 block of luma samples and the two corresponding blocks of chroma samples. Now, why is the size 16×16? In the literature, it has been shown that this size gives a good trade-off between memory requirement and coding efficiency up to the HD format, whereas, for higher resolutions, the 16×16 size is not a good option.
For inter-prediction, each MB can be processed in a two-stage hierarchical manner. An MB can be predicted as one 16×16, two 16×8, two 8×16, or four 8×8 partitions. If it is partitioned into four 8×8 blocks, each of them can undergo a second level of partitioning: each 8×8 block can be partitioned as one 8×8, two 8×4, two 4×8, or four 4×4 partitions. The above partitioning style for inter-mode prediction is shown in Fig. 2.5.
Unlike inter-mode prediction, for intra-mode prediction, only 4×4, 8×8, and 16×16 partitions are allowed for an MB. On the other hand, only 4×4 and 8×8 partitioning is used for transform coding.
Fig. 2.6 (a) Block partitioning of inter-mode and (b) intra-mode for H.264/AVC
In Fig. 2.6, the inter- and intra-partition modes of the H.264/AVC are shown. In Fig. 2.6a, the segmentations of the macroblock for motion compensation are described: the top part shows the segmentation of the macroblock, and the bottom part shows the segmentation of the 8×8 partitions. In Fig. 2.6b, the different intra-partitionings of the H.264/AVC are shown. A detailed description of each intra-partitioning mode will be given in Chap. 3.
Unlike the fixed partitioning structure of the MB concept, the HEVC uses more flexible and efficient block partitioning techniques. The HEVC introduces four different block concepts: CTU, CU, PU, and TU. Each CTU consists of a luma coding tree block (CTB) and two chroma CTBs. A similar relationship holds for the CU, PU, and TU. A detailed description of each block is given below.
The CTU is basically the analogue of the macroblock in the H.264/AVC. Each slice contains an integer number of CTUs. A CTU has a flexible size of 64×64, 32×32, 16×16, or 8×8, which can be specified at encoding time. Since it supports block sizes up to 64×64, it provides better coding efficiency for high-resolution video contents.
A CTU of size 64×64 pixels can be decomposed into four 32×32-pixel CUs. Further, each 32×32-pixel CU can be divided into four CUs of 16×16 pixels. This decomposition can continue down to CUs of 8×8 pixels; that is, the 8×8 pixel block is the smallest possible CU. For the different combinations of CU structures, different CTBs are generated for a single CTU. For each CTB, the RD cost value is calculated, and the CTB with the minimum RD cost is considered the best one. The illustration of the CTB structure for a CTU is given in Fig. 2.7a, where a 64×64-pixel CTU block is shown divided into smaller CUs. Upon calculating the RD cost for every combination, the CUs under the red dotted part of Fig. 2.7a give the minimum RD value. The corresponding CTU partitioning and CTB structure for this particular (best) combination are shown in Fig. 2.7b.
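The recursive CU decision just described can be sketched as a simple quadtree search. This is a conceptual Python sketch, not HM encoder code; `rd_cost` is a hypothetical callback standing in for the encoder's full rate-distortion evaluation of a single CU.

```python
def best_partition(x, y, size, rd_cost, min_size=8):
    """Choose between coding the CU at (x, y) whole or splitting it
    into four quadrants, keeping whichever has the lower total RD cost.

    `rd_cost(x, y, size)` is a stand-in for the encoder's RD evaluation
    of one CU; `min_size` is the smallest allowed CU (8x8 in HEVC)."""
    whole = rd_cost(x, y, size)
    if size <= min_size:
        return whole, [(x, y, size)]
    half = size // 2
    split_cost, split_cus = 0, []
    for dx in (0, half):
        for dy in (0, half):
            c, cus = best_partition(x + dx, y + dy, half, rd_cost, min_size)
            split_cost += c
            split_cus += cus
    if split_cost < whole:
        return split_cost, split_cus      # the quadtree split wins
    return whole, [(x, y, size)]          # coding the CU whole wins
```

With a cost function that favors small blocks, the search returns the full 8×8 decomposition; with a flat cost, it keeps the single 64×64 CU.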
The CTB is an efficient representation of variable block sizes, so that regions of different sizes can be coded with fewer bits while maintaining the same quality. It is possible to encode stationary or homogeneous regions with a larger block size, resulting in a smaller side-information overhead. On the other hand, the CTB structure dramatically increases the computational complexity. As an example, if a frame has dimensions of 704×576 pixels, then it will be decomposed into 99 (11×9) CTUs, and a separate CTB will be created for each CTU. For each CTB, 85 calculations are involved for the different CU sizes. As a result, 8415 CU calculations are required for the CTB structure, whereas only 1584 calculations are needed for the 16×16 macroblocks, as used in the previous standard (H.264/AVC).
Let O(n) be the total number of operations when the maximum depth of the coding tree is set to n, and let P_i be the number of operations required for the given CU size at the i-th level. The computational complexity based on variable CU sizes can be described as Eq. 2.1:

O(n) = O(n−1) + 4^n · P_n,   O(0) = P_0,   P_i = (1/4)^i · P_{i−1}     (2.1)
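The numbers quoted above follow directly from the recursion in Eq. 2.1: with the maximum depth set to 3, there are 4^0 + 4^1 + 4^2 + 4^3 = 85 CU evaluations per CTB. A minimal sketch of the counting:

```python
def cu_evaluations(max_depth):
    # One CU at depth 0, four at depth 1, ..., 4^n at depth n (Eq. 2.1).
    return sum(4 ** d for d in range(max_depth + 1))

def ctu_count(width, height, ctu_size=64):
    # Number of CTUs a frame is decomposed into (assuming exact division).
    return (width // ctu_size) * (height // ctu_size)

# 704x576 frame: 99 CTUs with 85 CU evaluations each -> 8415 in total,
# against 1584 fixed 16x16 macroblocks in H.264/AVC.
total_cus = ctu_count(704, 576) * cu_evaluations(3)
total_mbs = (704 // 16) * (576 // 16)
```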
Fig. 2.7 (a) CTB structure which provides the lowest RD cost for CTU and (b) corresponding
CTU partitioning for the best CTB
Fig. 2.8 Coding tree block (CTB) structure and the corresponding CUs for a benchmark video sequence (Blowing Bubbles)
Fig. 2.9 PU partitioning modes: skip mode, intra mode, inter mode with square and rectangular (non-AMP) partitions, and the asymmetric partitions 2N×nU, 2N×nD, nL×2N, and nR×2N
Motion vectors (MVs) are calculated separately for each inter-PB using two reference pictures, one from list-0 and one from list-1. For each MV, the RD cost is calculated using the original and the generated predicted blocks.
The prediction residual is coded using block transforms. A TU tree structure has its root at the CU level. The luma CB residual may be identical to the luma transform block (TB) or may be further split into smaller luma TBs. The same applies to the chroma TBs. Integer basis functions similar to those of a discrete cosine transform (DCT) are defined for the square TB sizes 4×4, 8×8, 16×16, and 32×32. For the 4×4 transform of luma intra-picture prediction residuals, an integer transform derived from a form of the discrete sine transform (DST) is alternatively specified.
The prediction technique is used to temporally and spatially predict the current frame from the previously stored frame(s). Temporal prediction is based on the assumption that consecutive frames in a video sequence exhibit very close similarity, except that objects or parts of a frame may in general get somewhat displaced in position. This assumption is mostly valid, except for frames with significant changes of content. The predicted frame generated by exploiting the temporal redundancy is subtracted from the incoming video frame, pixel by pixel, and the difference is the error image, which will in general still exhibit considerable spatial redundancy. The detailed description of the inter-picture prediction techniques in the hybrid video codec is given in Chap. 4.
On the other hand, the intra-picture prediction technique is based on the spatial redundancy. It follows a concept similar to still image compression. However, in the modern hybrid codec, sophisticated algorithms are applied for the intra-mode decision. We have dedicated a full chapter (Chap. 3) to this topic in this book.
The basic concept of the deblocking filter is quite similar in H.264/AVC and HEVC. This filter is intended to reduce the blocking artifacts caused by block-based coding. Moreover, it is only applied to the samples located at block boundaries.
The operation of the deblocking filter can be divided into three main steps: filter strength computation, filter decision, and filter implementation.
Let us consider two blocks (P and Q) adjacent to each other. In Fig. 2.10, the two adjacent blocks are shown for a vertical edge; the concept is quite similar for a horizontal edge. The amount of filtering is computed with the help of a parameter called the boundary strength (Bs). The boundary strength of the filter depends on the current quantizer, the block type, the motion vectors, and other parameters. In the HEVC, the boundary strength is calculated using the algorithm shown as a simplified flowchart in Fig. 2.11: Bs = 2 if P or Q is intra-coded; Bs = 1 if P or Q has non-zero coefficients or if P and Q use different reference pictures; otherwise Bs = 0. If the boundary strength is greater than zero, then the deblocking filtering is applied to the blocks.
Fig. 2.10 Four sample segments of the vertical block boundary between adjacent blocks P and Q
Fig. 2.11 Boundary strength (Bs) calculation for two adjacent blocks P and Q
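The Bs decision of Fig. 2.11 can be written down as a few lines of Python. This is a simplified sketch; the real HEVC rule also compares the motion vectors and reference indices in more detail.

```python
def boundary_strength(p_intra, q_intra, p_has_coeffs, q_has_coeffs,
                      different_refs_or_mvs):
    """Simplified boundary strength (Bs) following Fig. 2.11."""
    if p_intra or q_intra:
        return 2        # intra-coded neighbor: strongest boundary
    if p_has_coeffs or q_has_coeffs or different_refs_or_mvs:
        return 1        # coded residual or mismatched motion
    return 0            # no filtering needed on this boundary
```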
Two kinds of filtering decisions are taken in the HEVC encoder:
• Is filtering required or not?
• If filtering is required, is it a normal filtering or a strong filtering?
The condition for the first decision can be formulated as Eq. 2.3; the conditions for selecting the strong filter are given in Eqs. 2.4–2.6:

|P2,i − 2P1,i + P0,i| + |Q2,i − 2Q1,i + Q0,i| < β/8     (2.4)
|P3,i − P0,i| + |Q3,i − Q0,i| < β/8                     (2.5)
|P0,i − Q0,i| < 2.5 · tc                                (2.6)

These conditions are applied for i = 0 and i = 3. In Eq. 2.6, tc is another threshold, generally referred to as the clipping parameter. The algorithm for the filtering decision is shown as a flowchart in Fig. 2.12.
When the normal deblocking filter is selected, one or two samples are modified in block P or Q based on some conditions. On the other hand, the strong deblocking filter is applied to smooth flat areas, where artifacts are more visible. This filtering mode modifies three samples from the block boundary and enables strong low-pass filtering.
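The strong/normal choice of Eqs. 2.4–2.6 can be sketched as below. P[j][i] and Q[j][i] denote the sample j positions away from the boundary on line i; this array layout is our own illustrative convention, not the standard's notation.

```python
def strong_filter_chosen(P, Q, beta, tc):
    """Check the strong-filter conditions (Eqs. 2.4-2.6) for one boundary.

    P[j][i] / Q[j][i]: sample j rows away from the boundary, line i.
    The conditions are tested for lines i = 0 and i = 3."""
    for i in (0, 3):
        # Eq. 2.4: second derivative across the boundary must be small.
        if not (abs(P[2][i] - 2 * P[1][i] + P[0][i])
                + abs(Q[2][i] - 2 * Q[1][i] + Q[0][i]) < beta / 8):
            return False
        # Eq. 2.5: the signal must be flat within each block.
        if not (abs(P[3][i] - P[0][i]) + abs(Q[3][i] - Q[0][i]) < beta / 8):
            return False
        # Eq. 2.6: the step across the boundary must be limited.
        if not abs(P[0][i] - Q[0][i]) < 2.5 * tc:
            return False
    return True
```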
Sample adaptive offset (SAO) is the second-level in-loop filtering in the HEVC, which attenuates ringing artifacts. Ringing artifacts generally appear for large transform sizes. SAO is applied on the output of the deblocking filter. The HEVC includes two kinds of SAO types:
• Edge offset (EO)
• Band offset (BO)
Fig. 2.12 Flowchart of the deblocking filtering decision: no filtering is applied unless the boundary is aligned with the 8×8 sample grid, lies on a PU or TU boundary, Bs > 0, and condition (2.3) is true; the strong filter is then chosen if the further conditions hold
Fig. 2.13 The four one-directional patterns (horizontal, vertical, 135° diagonal, and 45° diagonal) used for the edge offset classification of a sample p against its neighbors n0 and n1
Edge offset is based on the comparison between the current sample and its neighboring samples. EO uses four one-directional patterns for the edge offset classification within the CTB. These patterns are horizontal, vertical, 135° diagonal, and 45° diagonal, as shown in Fig. 2.13. Each sample in the CTB is classified into one of five categories by comparing it with its neighboring values. The categories are defined by an index called EdgeIdx. The meaning of the different EdgeIdx values and the corresponding conditions are given in Table 2.1.
Depending upon the EdgeIdx, an offset value from a transmitted lookup table is added to the sample value: for EdgeIdx = 1 and 2 a positive offset, and for EdgeIdx = 3 and 4 a negative offset, is added to the samples for smoothing.
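The EdgeIdx classification can be sketched as follows. The five categories follow the usual HEVC definition (local minimum, two corner cases, local maximum, and "none"), which the document's Table 2.1 tabulates.

```python
def edge_idx(p, n0, n1):
    """Classify sample p against its two neighbors n0, n1 along the
    chosen EO direction (horizontal, vertical, or diagonal)."""
    if p < n0 and p < n1:
        return 1  # local minimum -> positive offset
    if (p < n0 and p == n1) or (p == n0 and p < n1):
        return 2  # concave corner -> positive offset
    if (p > n0 and p == n1) or (p == n0 and p > n1):
        return 3  # convex corner -> negative offset
    if p > n0 and p > n1:
        return 4  # local maximum -> negative offset
    return 0      # monotonic region: no offset applied
```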
In this kind of SAO, the same offset is added to all samples whose values belong to the same band; here the amplitude of a sample is the key factor for the offset. In this mode, the full sample amplitude range is uniformly divided into 32 bands, and the sample values belonging to four of these bands are modified by adding band offsets.
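A band offset sketch, assuming 8-bit samples (so the band index is simply sample >> 3) and that the encoder signals a starting band plus four offsets:

```python
def band_offset(sample, band_start, offsets, bit_depth=8):
    """Band offset: the amplitude range is split into 32 equal bands;
    samples in the four bands starting at `band_start` get the
    corresponding transmitted offset added."""
    band = sample >> (bit_depth - 5)        # 32 bands; >> 3 for 8-bit
    if band_start <= band < band_start + 4:
        return sample + offsets[band - band_start]
    return sample                           # outside the four bands
```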
After the in-loop filtering, the next step in the hybrid codec is the entropy coding of
the transformed data. Here, lossless compression schemes are applied. In the modern
hybrid video codec, context-based adaptive binary arithmetic coding (CABAC)
is used. But before describing the CABAC, some preliminary knowledge about
entropy coding is required. So, some basic entropy coding algorithms, like Huffman
coding and arithmetic coding, are discussed first, followed by CABAC.
Huffman coding is a popular lossless variable length coding scheme, based on the
following principles:
• Shorter code words are assigned to more probable symbols.
• No code word of a symbol is a prefix of another code word.
• Every source symbol must have a unique code word assigned to it.
It is better to explain Huffman coding using an example. Let us consider six symbols a1, a2, a3, a4, a5, and a6. Moreover, before applying Huffman coding, we also know the probability of occurrence of each symbol. Let us consider the probabilities to be 0.4, 0.3, 0.1, 0.1, 0.06, and 0.04, respectively.
The steps of the Huffman coding are given below:
step 1: Arrange the symbols in decreasing order of their probabilities.
step 2: Combine the two lowest-probability symbols into a single compound symbol that replaces them in the next source reduction. In this example, a5 and a6 are combined into a compound symbol of probability 0.1.
step 3: Continue the source reductions of step 2 until only two symbols are left. This is shown in Fig. 2.14. The second symbol in this table indicates a compound symbol of probability 0.4. We are now in a position to assign codes to the symbols.
step 4: Assign the codes 0 and 1 to the last two symbols.
step 5: Work backward along the table to assign codes to the elements of the compound symbols. Continue until codes are assigned to all the elementary symbols. This is shown in Fig. 2.15.
Hence, after applying the Huffman coding, the corresponding code words are a1 = 1, a2 = 00, a3 = 011, a4 = 0100, a5 = 01010, and a6 = 01011. If we calculate it properly, the average length of this code is 2.2 bits per symbol. Huffman's procedure creates the optimal code for a set of symbols and probabilities subject to the constraint that the symbols are coded one at a time.
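The five steps above can be implemented with a priority queue. This sketch builds the code table by repeatedly merging the two least probable groups; the individual code words depend on how ties are broken, but for the example probabilities the average length always comes out to 2.2 bits per symbol.

```python
import heapq

def huffman_codes(probabilities):
    """Build a Huffman code table for {symbol: probability}."""
    # Each heap entry: (probability, tie-breaker, {symbol: code-so-far}).
    heap = [(p, i, {sym: ""}) for i, (sym, p) in enumerate(probabilities.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        p1, _, group1 = heapq.heappop(heap)   # two least probable groups
        p2, _, group2 = heapq.heappop(heap)
        # Prefix one group's codes with '0' and the other's with '1'.
        merged = {s: "0" + c for s, c in group1.items()}
        merged.update({s: "1" + c for s, c in group2.items()})
        heapq.heappush(heap, (p1 + p2, counter, merged))
        counter += 1
    return heap[0][2]
```

By construction no code word is a prefix of another, matching the principles listed above.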
Arithmetic coding is also a variable length coding (VLC) scheme requiring a priori knowledge of the symbol probabilities. The basic steps of this algorithm are given below:
step 1: Consider the range of real numbers [0, 1). Subdivide this range into a number of subranges equal to the total number of symbols in the source alphabet. Each subrange spans a real interval equal to the probability of the corresponding source symbol.
step 2: Consider a source message and take its first symbol. Find the subrange to which this source symbol belongs.
step 3: Subdivide this subrange into the next-level subranges, according to the probabilities of the source symbols.
step 4: Now parse the next symbol in the given source message and determine the next-level subrange to which it belongs.
step 5: Repeat step 3 and step 4 until all the symbols in the source message are parsed. The message may be encoded using any real value in the last subrange so formed. A special end-of-message symbol is reserved to indicate the final symbol of the message.
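The interval-narrowing steps can be sketched as follows. This is a toy model using floating-point arithmetic; real codecs use fixed-point intervals with renormalization.

```python
def arithmetic_interval(message, probabilities):
    """Follow steps 1-5: repeatedly narrow [low, high) according to the
    cumulative probability of each symbol; any value in the final
    interval encodes the whole message."""
    # Cumulative range start for each symbol, in dictionary order.
    cum, start = 0.0, {}
    for sym, p in probabilities.items():
        start[sym] = cum
        cum += p
    low, high = 0.0, 1.0
    for sym in message:
        width = high - low
        high = low + width * (start[sym] + probabilities[sym])
        low = low + width * start[sym]
    return low, high
```

For the alphabet {a: 0.5, b: 0.5}, the message "ab" narrows the interval to [0.25, 0.5), so any value in that range (e.g., 0.25) encodes it.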
2.5.3 CABAC
Chapter 3
Intra-prediction Techniques
3.1 Background
In intra-prediction, a block is predicted only with the help of the current frame; no reference frames are required. Only the spatial redundancy is exploited in this prediction. The main concept behind it is that the neighboring pixels of a block should be highly correlated. As an example, let us consider the Foreman sequence shown in Fig. 3.1, in which a block is enlarged from the sequence (Fig. 3.1a). The blocks above and to the left of the enlarged block are already encoded and are marked accordingly in the diagram, while the other blocks are not yet encoded. The top neighboring pixels of this block are shown in Fig. 3.1b. Let us consider that the current block is predicted from these top neighboring pixels; that means, in the predicted block, all the pixels in a column have the same value as the vertically neighboring pixel corresponding to that column. This kind of prediction is generally referred to as padding. In Fig. 3.1b, the vertically padded block (the predicted block) for the current block is shown.
Now, one question might appear in your mind: should an error be produced by this prediction? The answer is yes. For this reason, the corresponding residual block is also generated. We discussed the residual block in detail in the previous chapter; in a nutshell, it is basically just the difference between the predicted block and the current block.
In this example, only the vertical padding is considered. No doubt, if we consider other orientations of padding, the prediction can be more accurate. Let us consider three orientations of padding, as shown in Fig. 3.2: apart from the vertical padding, horizontal padding and diagonal padding are also considered. So, for this example, all three predictions are performed (vertical, horizontal, and diagonal), the corresponding residual blocks are generated, and the rate-distortion cost values are calculated. The prediction which provides the minimum rate-distortion cost is considered the best one. This is the basic background of the modern intra-prediction technique. All the latest hybrid encoders use this approach for intra-coding. Depending upon the encoder, the angular prediction modes vary. We will discuss the different intra-modes of H.264/AVC and HEVC in the next subsections.
Fig. 3.1 Conceptual diagram of a block and the correlation with its neighboring pixels. (a) Enlarged version of a block from the Foreman sequence, (b) the neighboring pixels above the block and the corresponding vertical padding with these pixels
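The padding-and-residual procedure described above can be sketched in plain Python, with the sum of absolute residuals standing in for the full rate-distortion cost:

```python
def vertical_padding(top_row, n):
    # Each row of the predicted block repeats the top neighboring pixels.
    return [list(top_row) for _ in range(n)]

def horizontal_padding(left_col, n):
    # Each column of the predicted block repeats the left neighboring pixels.
    return [[left_col[r]] * n for r in range(n)]

def residual(block, predicted):
    # Residual block: difference between the current and predicted block.
    return [[b - p for b, p in zip(br, pr)] for br, pr in zip(block, predicted)]

def best_prediction(block, top_row, left_col):
    """Pick the padding orientation with the smallest absolute residual
    (a stand-in for the minimum rate-distortion cost)."""
    n = len(block)
    candidates = {
        "vertical": vertical_padding(top_row, n),
        "horizontal": horizontal_padding(left_col, n),
    }
    cost = lambda pred: sum(abs(v) for row in residual(block, pred) for v in row)
    return min(candidates, key=lambda name: cost(candidates[name]))
```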
In the H.264/AVC, intra-predictions are only made for square-shaped blocks. The size of the square-shaped block can vary from 4×4 to 16×16 for the luma component. The 8×8 luma block is a special case which is used only in the high profiles. The 4×4 and 8×8 blocks are considered the smaller blocks, and the 16×16 is considered the larger block. In the H.264/AVC, nine modes are assigned for the smaller blocks, and four modes are assigned for the larger block.
Fig. 3.3 A 4×4 block (samples a–p) with its already encoded neighboring samples (M and A–H above, I–L to the left) and the directional orientations of the nine intra-prediction modes (mode 2 is the DC mode)
In Fig. 3.3, a 4×4 block and its corresponding neighboring pixels are shown. The neighboring green-colored pixels represent the pixels which are already encoded, and the 4×4 block is intra-predicted with the help of these neighboring pixels. As mentioned earlier, nine intra-modes are supported by the H.264/AVC encoder for this block size, and the angular directions of these nine modes are shown in Fig. 3.3. A brief description of each mode is given in Table 3.1, and the corresponding pictorial representation is shown in Fig. 3.4. Comparing Fig. 3.4 with Table 3.1 makes it quite easy to understand the different angular intra-prediction modes for the smaller blocks.
The prediction technique of mode 0, mode 1, and mode 2 is very straightforward: only the simple padding concept and the average function are used in these three modes. On the other hand, the remaining six modes have a slightly more complex way of calculating the predicted pixels, and the pixels in the block need not all have the same predicted value. To make this easier to understand, Fig. 3.5 provides a pictorial view of the predicted value of each pixel in the block. Figure 3.5 is self-explanatory, and we hope the readers will understand the calculation technique of the intra-prediction for each pixel.
So far, we have discussed the intra-prediction only for the smaller blocks (4×4 and 8×8). For the 16×16 luma blocks, the intra-prediction is simpler.
Fig. 3.4 Pictorial representation of the nine intra-prediction modes for a smaller macroblock
Only four modes are available for these larger blocks: mode 0, mode 1, mode 2, and mode 4. Conceptually, the description of these modes is the same as in Table 3.1 [1].
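Modes 0, 1, and 2 for a 4×4 block can be sketched directly from the description above: vertical padding, horizontal padding, and a DC average with the usual rounding. A minimal Python sketch (the remaining directional modes need the weighted predictors shown in Fig. 3.5):

```python
def intra_4x4(mode, top, left):
    """Modes 0 (vertical), 1 (horizontal), and 2 (DC) for a 4x4 block,
    predicted from the four top (A-D) and four left (I-L) neighbors."""
    if mode == 0:                       # vertical: pad the top samples down
        return [list(top) for _ in range(4)]
    if mode == 1:                       # horizontal: pad the left samples across
        return [[left[r]] * 4 for r in range(4)]
    if mode == 2:                       # DC: rounded average of all 8 neighbors
        dc = (sum(top) + sum(left) + 4) // 8
        return [[dc] * 4 for _ in range(4)]
    raise ValueError("only modes 0-2 are sketched here")
```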
The intra-prediction operates according to the transform block (TB) size, which varies from 4×4 to 32×32. In the HEVC, 35 different intra-prediction modes are allowed. Among these, 33 are directional, one is DC, and the last one is planar. We will discuss the DC and planar modes in the next subsection. All the modes and their directional orientations in the HEVC encoder are shown in Fig. 3.6.
Fig. 3.5 Predictor values of the pixels of a 4×4 block for three of the directional modes, computed as two- and three-tap weighted averages of the neighboring samples, e.g., (Q+I+1)/2, (I+2Q+A+2)/4, and (Q+2A+B+2)/4
Fig. 3.6 Modes and directional orientation in the HEVC encoder of the intra-prediction [2]
The 33 angular modes are generally referred to as Intra_Angular[k], where k is a mode number from 2 to 34. The angles are intentionally designed to provide denser coverage for near-horizontal and near-vertical angles and coarser coverage for near-diagonal angles, for the effectiveness of the signal prediction processing [2]. Generally, the Intra_Angular prediction targets regions which have strong directional edges.
In Eqs. 3.1 and 3.2, f represents the fractional part of the projected displacement on the same row or column.
Conceptually, these prediction techniques are quite similar to those of the H.264/AVC. Intra-DC prediction uses the average value of the reference samples located immediately to the left of and above the block to be predicted. On the other hand, the average values of two linear predictions using four corner reference samples are used in intra-planar prediction to prevent discontinuities along the block boundaries. Generally, the planar prediction is able to predict a region without discontinuities at the block boundaries: the planar prediction is calculated by averaging a vertical and a horizontal linear prediction. For example, a sample p[x][y] is predicted from a horizontal prediction p_h[x][y] and a vertical prediction p_v[x][y].
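The planar equations are given in the HEVC specification; since they are not reproduced in this excerpt, the following sketch restates them from that definition: p_h interpolates horizontally toward the top-right reference sample, p_v interpolates vertically toward the bottom-left one, and the two are averaged with rounding.

```python
def planar_predict(n, left, top, top_right, bottom_left):
    """Planar prediction for an n x n block (HEVC-style definition).

    left[y] / top[x]: reference samples along the block's left column
    and top row; top_right / bottom_left: the two corner references."""
    shift = n.bit_length() - 1              # log2(n)
    pred = [[0] * n for _ in range(n)]
    for y in range(n):
        for x in range(n):
            # Horizontal interpolation toward the top-right reference.
            ph = (n - 1 - x) * left[y] + (x + 1) * top_right
            # Vertical interpolation toward the bottom-left reference.
            pv = (n - 1 - y) * top[x] + (y + 1) * bottom_left
            pred[y][x] = (ph + pv + n) >> (shift + 1)   # rounded average
    return pred
```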
In the HEVC, a three-tap [1 2 1]/4 smoothing filter is used for the reference samples in intra-prediction. The reference sample smoothing is adaptive in nature in the HEVC. For different block sizes, it is applied as follows [2]:
• For 8×8 blocks, only the diagonal directions, Intra_Angular[k] with k = 2, 18, or 34, use the reference sample smoothing.
• For 16×16 blocks, the reference samples are filtered for most directions except the near-horizontal and near-vertical directions, k in the range of 9–11 and 25–27.
• For 32×32 blocks, all directions except the exactly horizontal (k = 10) and exactly vertical (k = 26) directions use the smoothing filter.
To remove discontinuities along block boundaries, boundary value smoothing is used. This smoothing technique is used for three modes: Intra_DC (mode 1) and Intra_Angular[k] with k = 10 (exactly horizontal) or k = 26 (exactly vertical).
In the previous sections, we discussed the intra-prediction for both H.264/AVC and HEVC in detail. The differential pulse code modulation (DPCM)-based approach is a special technique proposed in [3], which is efficient enough to improve the lossless intra-coding efficiency to a good extent.
Let us consider a 4×4 block which is intra-predicted horizontally. In Fig. 3.7, the corresponding 4×4 block and its reference pixels are shown. In normal horizontal intra-prediction, the residuals of the first row are calculated as

r0 = p0 − q0
r1 = p1 − q1
r2 = p2 − q2        (3.10)
r3 = p3 − q3
Fig. 3.7 A 4×4 block with pixels p0–p15 and the reference pixels q0–q3 in the neighboring column to the left
In this equation, r0, r1, r2, and r3 are the corresponding residual values of the first row. According to the DPCM-based approach, the residuals can instead be calculated as

r0 = p0 − q0
r1 = p1 − p0
r2 = p2 − p1        (3.11)
r3 = p3 − p2
The encoder sends r0, r1, r2, and r3 as part of a residual block, and the decoder can then decode the residuals as a block and apply them for reconstruction. In the decoder, the reconstruction of p0, p1, p2, and p3 is also quite simple. The generalized relationship for the first row of the 4×4 block is

p_i = q0 + Σ_{k=0}^{i} r_k,   0 ≤ i ≤ 3        (3.12)
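A round-trip sketch of Eqs. 3.11 and 3.12 for one row of the block:

```python
def dpcm_encode_row(pixels, q0):
    # Eq. 3.11: first residual against the reference pixel, the rest
    # against the previous pixel in the same row.
    res = [pixels[0] - q0]
    res += [pixels[i] - pixels[i - 1] for i in range(1, len(pixels))]
    return res

def dpcm_decode_row(residuals, q0):
    # Eq. 3.12: p_i = q0 + sum of r_0 .. r_i.
    pixels, acc = [], q0
    for r in residuals:
        acc += r
        pixels.append(acc)
    return pixels
```

Encoding a row and decoding it again reproduces the original pixels exactly, which is why the scheme is lossless.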
References
1. T. Wiegand, G.J. Sullivan, G. Bjontegard, A. Luthra, Overview of the H.264/AVC video coding
standard. IEEE Trans. Circ. Syst. Video Technol. 13(7), 560–576 (2003)
2. G.J. Sullivan, J.R. Ohm, W.J. Han, T. Wiegand, Overview of the High Efficiency Video Coding
(HEVC) standard. IEEE Trans. Circ. Syst. Video Technol. 22(12), 1649–1668 (2012)
3. Y.-L. Lee, K.-H. Han, G.J. Sullivan, Improved lossless intra coding for H.264/MPEG-4 AVC. IEEE Trans. Image Process. 15(9), 2610–2615 (2006)
Chapter 4
Inter-prediction Techniques
In the first two chapters, an overall description of the latest hybrid video codec was given. As discussed earlier, there are mainly two kinds of prediction techniques used in the modern hybrid video codec: inter- and intra-prediction. Generally, temporal and spatial redundancies are exploited in these prediction techniques, respectively.
Temporal prediction is based on the assumption that consecutive video frames exhibit very close similarity. This technique is used in the motion estimation block, which computes the difference between a current frame and a reference frame. Generally, the immediately preceding frame is considered as the reference frame. The difference in position between a candidate block and its closest match in the reference frame is called the motion vector. After determining the motion vectors, one can predict the current frame using the reference frame.
Motion estimation is one of the most important operations involved in any
video processing system. The ultimate goal is to minimize the total number of
bits used for coding the motion vectors and the prediction errors. Depending on the
temporal order of the current and reference frames, motion estimation can be divided into
two categories, forward and backward motion estimation, as shown in Fig. 4.1. In
backward motion estimation, the current frame is considered as the candidate frame,
and the reference frame is a past frame, which implies the search is backward. In
forward motion estimation, the exact opposite scenario occurs, as shown in Fig. 4.1.
A general problem in both kinds of motion estimation is how to
parameterize the motion field. Usually, there are multiple objects in a video frame
that can move in different directions. Hence, a global parameterized model is usually
not adequate to solve this problem. The basic approaches of motion estimation are
as follows:
• Pixel-based representation
• Block-based representation
• Mesh-based representation
However, in the hybrid video codec, block-based motion estimation techniques
are applied. For this reason, in this book, we will discuss only the block-based
motion estimation technique.
In the block-based motion estimation, a picture or frame is partitioned into small
nonoverlapping blocks (detailed description given in Chap. 2). Motion variation
within each nonoverlapping block can be characterized well, and motion vectors
can be estimated independently. This method provides a good compromise between
accuracy and complexity. In this technique, the motion vector is calculated for each
block independently. The main challenge in this method is how to specify the search
area of a block. That is, for a block at a given position in the current frame, one
has to define the tentative positions in the reference frame at which the corresponding
match is searched.
The main disadvantage in block-based representation is that the resulting motion
is often discontinuous across block boundaries. Unless the motion vectors of
adjacent blocks vary smoothly, the estimated motion fields may be discontinuous
and sometimes chaotic. This effect causes boundary artifacts.
Let us consider that frame t in Fig. 4.2 is the current frame and that the blocks in this
current frame are predicted from a previously decoded frame t − 1, which is referred
to as the reference frame. As shown in Fig. 4.2, first of all a search region is defined
in the reference frame for a particular block. After that, for all the possible positions
in this search range, a cost function is calculated for this block. This algorithm is
generally referred to as full search block motion (FSBM) estimation, which is
computationally quite expensive. A good number of efficient and popular fast
block matching algorithms (BMAs) are available that give satisfactory results in terms
of both quality and speed.
Now one question may arise in your mind: how is the cost function calculated?
In this context, by the term "cost function" we mean a matching criterion used to
obtain the motion vector. Different techniques can be used for this purpose, and in
the hybrid codec the user can change the cost function by modifying the configuration
file. However, the most efficient and computationally cheapest cost function is the
sum of absolute differences (SAD). Suppose the block size is N×N; then the SAD
between two blocks in frames t and t − 1 can be calculated as
SAD(i, j) = Σ_{x=0..N−1} Σ_{y=0..N−1} | B_t(x, y) − B_{t−1}(x + i, y + j) |     (4.1)
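As a sketch, Eq. 4.1 can be written out directly in Python. Frames are plain lists of rows here, and all names are illustrative:

```python
# The SAD of Eq. 4.1 between an n x n block of the current frame at
# (bx, by) and the block displaced by (i, j) in the reference frame.

def sad(cur, ref, bx, by, i, j, n):
    total = 0
    for y in range(n):
        for x in range(n):
            total += abs(cur[by + y][bx + x] - ref[by + j + y][bx + i + x])
    return total

frame = [[0, 1, 2, 3],
         [4, 5, 6, 7],
         [8, 9, 10, 11],
         [12, 13, 14, 15]]
assert sad(frame, frame, 1, 1, 0, 0, 2) == 0   # zero displacement matches
assert sad(frame, frame, 1, 1, 1, 0, 2) == 4   # shift right: |diff| = 1 each
```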
To explain the motion estimation concept more clearly, let us use a toy example.
In Fig. 4.3, the search range in the reference frame is shown as the green-colored
box. First, the SAD value is calculated at the origin (as shown in Fig. 4.3). After
that, the box moves to the right by one pixel, and the corresponding SAD value is
calculated. In this way, the corresponding SAD values are calculated for all possible
positions in the search region. Suppose that, after calculating all possible SAD
values, the SAD value obtained at each position is as shown in Fig. 4.3b. The
minimum SAD value for this example
is 22. Hence, the corresponding motion vector (MV), for this example, will be
the vector from the origin to the position which provides the minimum SAD
value (as shown in Fig. 4.3). So, mathematically, it can be written as
Motion Vector (MV) = [d1, d2] = arg min_{(i, j)} SAD(i, j)     (4.2)
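Building on the SAD cost, the full search of Eq. 4.2 simply evaluates every displacement in the search range and keeps the arg-min. A self-contained sketch, under the same illustrative conventions as before:

```python
# Full-search block matching (FSBM) sketch for Eq. 4.2: evaluate the SAD
# of Eq. 4.1 at every displacement in +/-search and return the arg-min
# as the motion vector. Illustrative only, not production code.

def full_search(cur, ref, bx, by, n, search):
    best_cost, best_mv = None, (0, 0)
    for j in range(-search, search + 1):
        for i in range(-search, search + 1):
            # skip displacements that fall outside the reference frame
            if not (0 <= by + j and by + j + n <= len(ref)
                    and 0 <= bx + i and bx + i + n <= len(ref[0])):
                continue
            cost = sum(abs(cur[by + y][bx + x] - ref[by + j + y][bx + i + x])
                       for y in range(n) for x in range(n))
            if best_cost is None or cost < best_cost:
                best_cost, best_mv = cost, (i, j)
    return best_mv, best_cost

# Toy frames: each row of `cur` is the same row of `ref` shifted by one
# pixel, so the block at (2, 2) matches the reference at displacement (1, 0).
ref = [[x * x + 3 * y for x in range(8)] for y in range(8)]
cur = [[ref[y][min(x + 1, 7)] for x in range(8)] for y in range(8)]
mv, cost = full_search(cur, ref, 2, 2, 3, 2)
assert mv == (1, 0) and cost == 0
```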
Fig. 4.3 Search region in the reference frame and the corresponding motion vector
Now, the significance of the motion vector is that this vector provides a region in
the reference frame which is the most similar region to the corresponding region in
the current frame within the defined search zone. One point needs to be clarified
here: the predicted region in the reference frame may not have exactly the same
luminance or chrominance characteristics as the corresponding region in the current
frame. Hence, the predicted frame constructed using motion estimation may differ
noticeably from the actual current frame. This difference between the predicted and
the actual current frame is called the residual frame. In the hybrid video codec
system, the motion vectors and the residual frame are sent to the decoder side. In the
decoder, the reference frame is already present while decoding the current frame.
So, by using the reference frame, the motion vectors, and the residual frame, the
corresponding current frame can be reconstructed
without any error. In Fig. 4.4, the reconstruction of the current frame in the decoder
side with the residual frame, motion vector, and the reference frame is shown.
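The decoder-side reconstruction can be sketched as follows: the current block is rebuilt exactly as the motion-compensated reference block plus the residual. Names and values are illustrative:

```python
# Sketch of decoder-side reconstruction (Fig. 4.4): current block =
# motion-compensated reference block + residual.

def reconstruct_block(ref, mv, residual, bx, by, n):
    i, j = mv
    return [[ref[by + j + y][bx + i + x] + residual[y][x]
             for x in range(n)] for y in range(n)]

ref = [[x + 2 * y for x in range(6)] for y in range(6)]
mv = (1, 0)
# Pretend the true block is the displaced reference block plus a small
# illumination change; the encoder sends exactly that difference.
true_block = [[ref[2 + y][3 + x] + 1 for x in range(2)] for y in range(2)]
residual = [[true_block[y][x] - ref[2 + y][3 + x] for x in range(2)]
            for y in range(2)]
assert reconstruct_block(ref, mv, residual, 2, 2, 2) == true_block
```

Since the residual carries whatever the motion-compensated prediction missed, the reconstruction is exact, as the text notes.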
For the hybrid video codec, two kinds of inter-prediction techniques are generally
used nowadays: unidirectional and bidirectional prediction. The concept of each is
quite straightforward: only one reference picture is used in unidirectional
prediction, whereas two reference pictures are used in bidirectional prediction.
Let us consider a toy example where eight frames are present in a group of
pictures (GOP). Among these, the first and the last are intra-predicted and the
rest are inter-predicted. The intra-predicted frames are shown as I-frames in Figs. 4.5
and 4.6. For unidirectional prediction, the predicted frames are shown as P-frames
in Fig. 4.5. From this diagram, it is quite clear that a P-frame is predicted from a
single reference frame. The reference frame used to predict a picture unidirectionally
need not be an I-frame; it can also be a P-frame. In Fig. 4.5a, only I-frames are shown,
Fig. 4.4 Reconstruction of the current frame in the decoder side with the residual frame, motion
vectors, and the reference frame
Fig. 4.5 Unidirectional prediction for (a) I-frame and (b) P-frame
and in Fig. 4.5b both P-frames and I-frames are shown for the GOP where the first
P-frame is predicted from an I-frame and the second one from a P-frame.
Fig. 4.6 Bidirectional prediction for (a) only one B-frame and (b) total GOP
Generally, inter-prediction is the most complex part of the hybrid video codec. For
this reason, this module is one of the key contributors to the time consumed in
encoding a video stream. In this section, let us build a time profile of the H.264/AVC
and HEVC encoders for the different coding modules.
First, consider H.264/AVC. The H.264/AVC video standard has very high
complexity in order to improve video quality and compression gain. Figure 4.7 shows
the encoding time profile for H.264/AVC; by "time profile" we mean the average
time consumption of the different modules. From this diagram, it is very clear that
inter-prediction dominates the other modules: it takes over 57 % of the encoding
time on average, and sometimes over 70 %.
Fig. 4.8 The average consumed time profile for encoding HEVC video (%)
From this analysis, it is quite clear that the prediction modes are the most
important parts of the video codec. Analyzed more deeply, inter-prediction takes
more encoding time than intra-prediction. Hence, in terms of fast encoding
techniques, these modules have the highest priority for exploration.
The coding unit (CU) in the HEVC or a macroblock (MB) in the H.264/AVC can
be predicted in different modes in the inter-prediction. However, the concepts of
the prediction modes in these two standards are quite similar. In the HEVC, the
prediction unit (PU) is treated separately with a different abstraction. Now the
different prediction modes for the HEVC standard are shown in Fig. 4.9. For inter-
prediction, there are three kinds of modes available for the prediction in the HEVC.
These are:
1. Skip mode
2. Square- and rectangular-shaped modes
3. Asymmetric modes
Let us consider that the CU size is 2N×2N. As shown in Fig. 4.9, only
PART_2N×2N PU splitting is allowed for a skipped CU. Other than the skip
mode, eight different PU modes are available for inter-prediction. Among these
eight modes, two are square shaped, PART_2N×2N and PART_N×N,
and two are rectangular shaped, PART_N×2N and PART_2N×N. These
four types of prediction modes (square and rectangular) are symmetric in nature.
The symmetric prediction modes are calculated for all CU sizes (64×64 to 8×8).
On the other hand, the remaining four inter-prediction modes are grouped as
asymmetric prediction modes (AMP modes). The AMP modes are PART_2N×nU,
PART_2N×nD, PART_nL×2N, and PART_nR×2N. For the CU size 8×8,
AMP modes are not calculated.
For a CB with dimension 8×8, nine PU calculations are required (PART_2N×2N
+ 2*PART_N×2N + 2*PART_2N×N + 4*PART_N×N), whereas for CBs with larger
dimensions, 13 PU calculations are required (PART_2N×2N + 2*PART_N×2N +
2*PART_2N×N + 2*4 AMPs). Moreover, the bidirectional prediction technique is
also adopted in HEVC. Hence, two motion vectors (MVs) are calculated separately
for each inter-PB using two reference pictures, from list-0 and list-1. For each MV,
the RD cost is calculated using the original and the generated predicted blocks.
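The PU counts quoted above can be reproduced with a few lines of arithmetic. The function name is ours; merge/skip checks and per-reference-list MV evaluations are deliberately not counted here:

```python
# A quick check of the PU counts: symmetric partitions plus PART_NxN for
# an 8x8 CB, symmetric partitions plus the four AMP modes for larger CBs.

def pu_calculations(cb_size):
    count = 1 + 2 + 2        # PART_2Nx2N + 2*PART_Nx2N + 2*PART_2NxN
    if cb_size == 8:
        count += 4           # 4 * PART_NxN, allowed only for the 8x8 CB
    else:
        count += 2 * 4       # the four AMP modes, two PUs each
    return count

assert pu_calculations(8) == 9
assert pu_calculations(16) == 13
assert pu_calculations(64) == 13
```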
In order to get the best mode, the HEVC encoder uses a cost function for
evaluating all the possible structures which are coming from the quadtree splitting.
Similar to the previous standard, the rate-distortion (RD) cost is also used in the
HEVC. In this process, a CTB is initially encoded as intra- or inter-prediction,
and then forward transform (T) and quantization (Q) are performed on it which
produces the encoded bit stream. This encoded bit rate (R) is considered as the rate
function in the final cost calculation. From the encoded bit stream, using the inverse
quantization (Q⁻¹) and transform (T⁻¹), a reconstructed CTU is generated. A
distortion function (D) is calculated using the original and the reconstructed blocks,
and the RD cost (J) is computed as the sum of the distortion and a Lagrangian
weighted (λ) rate (R) function:

J = D + λ · R     (4.3)
Two motion vectors (MVs) are calculated separately for each inter-PB using two
reference pictures, from list-0 and list-1. For each MV, the RD cost is calculated
using the original and the generated predicted blocks. The number of CB, PB, and
corresponding RD cost calculations for a CTB of size 64×64 is given in
Table 4.1. In this table, we consider only the inter-mode prediction together with
the merge/skip prediction. According to Table 4.1, 1864 RD cost calculations are
required for a 64×64 CU to predict its correct inter-mode. The RD cost of
intra-mode is not considered in this table.
HEVC includes a merge mode which is conceptually similar to the direct and skip
modes in H.264. Whenever a CB is to be encoded in merge mode, its motion
information is derived from spatially or temporally neighboring blocks.
Unlike in the previous standards, the skip mode is considered a special case
of the merge mode in which there is no motion vector to encode and all coded block
flags are equal to zero. When a CU is encoded in skip mode, the following two
conditions are satisfied:
1. the motion vector difference between the current 2N×2N PU and the neighboring PU is zero (since it is merge-skip);
2. the residuals are all quantized to zero.
Since only the skip flag and the corresponding merge index are transmitted to the
decoder side, skip mode requires the minimum number of bits to transmit.
Generally, homogeneous and motionless regions in a video sequence are encoded
in skip mode. In short, a stationary region is one that is both homogeneous and
motionless. In Fig. 4.10, the CTU structure of a video frame from the Traffic
sequence is shown, together with the CBs of this frame that are finally encoded in
skip mode. It is quite clear from Fig. 4.10 that most of the stationary regions of the
video frame are finally encoded in skip mode. Hence, in order to detect the skip mode
before the RD cost calculation process, it would be beneficial to identify the
stationary regions of a video sequence in advance.
We have analyzed the proportion of skip modes in different benchmark video
sequences. Table 4.2 shows the percentage of CUs that are finally encoded in
skip mode by the HEVC encoder for six different sequences. Benchmark sequences
with different resolutions and motion activities are considered. For example, the
Traffic and Park Scene sequences have relatively motionless backgrounds. On the
other hand, the Basketball Pass sequence has a good amount of foreground motion,
and the BQ Terrace sequence has a camera movement that affects the whole video
frame.
From Table 4.2, it is quite clear that, in the best case, more than 80 % of CUs are
skipped when the CU size is 64×64, and in the worst case, that is, for CU size 8×8, over
Fig. 4.10 The CTB structure and the corresponding CUs which are finally encoded as skip mode
in the Traffic video sequence for QP = 37. (a) The CTB structure of frame no. 5 and (b) the CUs
which are encoded as skip mode of frame no. 5 are shown here using blue color
32 % of CUs are encoded as skip. Considering the overall scenario (average over all
QPs and CU sizes), more than 58 % of CUs are encoded in skip mode.
Apart from that, there are two observations from this table that we want to
highlight for all the sequences:
1. the percentage of skip is higher for larger CU sizes than for smaller ones;
2. generally, for larger QP values, more CUs are encoded in skip mode.
The distribution of the skip percentage over different QP values and CU sizes for
these benchmark video sequences is shown in Fig. 4.11, which clearly supports both
observations from Table 4.2.
Table 4.2 Percentage of CUs that are encoded as skip for different benchmark
video sequences with different QP values

                                     % Skip mode for different CU size
Sequence                        QP    64×64   32×32   16×16   8×8
Traffic (2560×1600)             22    79      60      46      35
                                27    85      64      51      36
                                32    89      68      52      33
                                37    93      70      48      28
                                avg   86.50   65.50   49.25   33.00
Park Scene (1920×1080)          22    70      64      45      30
                                27    82      67      51      35
                                32    88      69      54      37
                                37    92      69      51      40
                                avg   83.00   67.25   50.25   35.50
BQ Terrace (1920×1080)          22    68      47      64      26
                                27    82      68      51      39
                                32    91      71      60      46
                                37    94      74      66      50
                                avg   83.75   65.00   60.25   40.25
Party Scene (832×480)           22    87      57      33      20
                                27    86      64      43      25
                                32    86      66      46      27
                                37    90      64      46      28
                                avg   87.25   62.75   42.00   25.00
Blowing Bubbles (416×240)       22    51      46      36      23
                                27    56      53      42      28
                                32    72      59      48      30
                                37    75      62      54      28
                                avg   63.50   55.00   45.00   27.25
Basketball Pass (416×240)       22    82      82      61      33
                                27    84      83      66      36
                                32    84      83      69      35
                                37    87      85      72      32
                                avg   84.25   83.25   67.00   34.00
Total average                         81.37   66.46   52.29   32.50
Generally, the motion vector of a block is correlated with the motion vectors of
its neighboring blocks in the current frame or in the earlier encoded pictures. The
reason behind this phenomenon is that the neighboring blocks likely correspond to
the same moving object. Therefore, if we send the difference between the motion
Fig. 4.11 The distribution of skip percentage for different QP values and CU sizes (average of all
six benchmark video sequences which are given in Table 4.2)
vectors to the decoder side, we can achieve higher data compression. This technique
is generally known as motion vector prediction.
In HEVC, when a PU in an inter picture is not encoded in skip or merge mode, its
motion vector is differentially coded using motion vector prediction. In Fig. 4.12,
five spatial candidates are shown, of which only two are chosen. The first is
chosen from {a0, a1}, the set of left positions, and the second from the set of
above positions, {b0, b1, b2}. When the number of spatial candidates is less than
two, the temporal motion vector predictor is used as well.
In HEVC, a new concept called advanced motion vector prediction (AMVP) is
included. According to this, a scaled version of the motion vector is used when the
reference index of the neighboring PU is not equal to that of the current PU. The
scaling is done according to the temporal distances between the current picture and
the reference pictures.
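The temporal-distance scaling behind AMVP can be sketched as follows. This is a simplified floating-point illustration with names of our own choosing; the actual HEVC scaling uses clipped fixed-point arithmetic:

```python
# Simplified sketch of AMVP-style motion vector scaling: the candidate MV
# is scaled by the ratio of temporal (POC) distances.

def scale_mv(mv, cur_poc, cur_ref_poc, cand_poc, cand_ref_poc):
    td = cand_poc - cand_ref_poc   # temporal distance of the candidate MV
    tb = cur_poc - cur_ref_poc     # temporal distance of the current PU
    if td == 0:
        return mv
    factor = tb / td
    return (round(mv[0] * factor), round(mv[1] * factor))

# The candidate MV spans 2 pictures; the current PU's reference is 4
# pictures away, so the predictor is doubled.
assert scale_mv((3, -2), 8, 4, 8, 6) == (6, -4)
```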
Chapter 5
RD Cost Optimization
5.1 Background
In the previous two chapters, we have discussed different prediction modes.
In a hybrid video codec, a reconstructed image is created for each possible
combination of modes. One question arises in this context: which mode should the
encoder choose among them all? Generally, the hybrid encoder uses a cost function
to measure the effectiveness of a prediction mode, called the rate-distortion cost,
or RD cost for short. The RD cost values are calculated for all possible prediction
modes, and the mode that provides the minimum cost value is chosen by the encoder
as the best mode. This is no doubt an optimization problem, and it is referred to as
RD optimization, or RDO for short.
Let us consider the HEVC encoder. In order to get the best mode, the HEVC
encoder uses an RD cost for evaluating all the possible structures which are
coming from the quadtree splitting. A simplified RD cost calculation technique
is shown in Fig. 5.1. In this process, a CTB is initially encoded as intra- or inter-
prediction, and then forward transform (T) and quantization (Q) are performed on
it which produces the encoded bit stream. This encoded bit rate (R) is considered
as the rate function in the final cost calculation. From the encoded bit stream,
using the inverse quantization (Q⁻¹) and transform (T⁻¹), a reconstructed CTU
is generated. The reconstructed frame provides the same visual quality as on the
decoder side. To evaluate the compression error on the decoder side, a distortion
function (D) is calculated from the original and the reconstructed frames as the sum
of squared differences (SSD), weighted SSD, or a Hadamard-based measure,
according to the specification file. The RD cost (J) is calculated as the sum of the
distortion (D) and a Lagrangian weighted (λ) rate (R) function, as shown in Eq. 5.1.

J = D + λ · R     (5.1)
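The selection rule of Eq. 5.1 can be sketched in a few lines. The distortion, rate, and λ values below are made-up illustrative numbers, not encoder output:

```python
# Mode decision by Eq. 5.1: evaluate J = D + lambda * R for every
# candidate mode and keep the one with the minimum cost.

def best_mode(candidates, lam):
    """candidates: list of (mode_name, distortion, rate_bits) tuples."""
    return min(candidates, key=lambda c: c[1] + lam * c[2])

modes = [("skip",        900,  2),   # poor prediction, almost free to code
         ("inter_2Nx2N", 250, 40),
         ("intra",       200, 90)]
# J values with lambda = 10: 920, 650, 1100, so inter_2Nx2N wins.
assert best_mode(modes, 10.0)[0] == "inter_2Nx2N"
```

Note how λ steers the trade-off: a very large λ penalizes rate and would push the decision toward the cheap skip mode instead.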
HEVC includes a merge mode to derive the motion information from spatially
and temporally neighboring blocks. This is conceptually similar to the direct and
skip modes in H.264/MPEG-4 AVC. The skip mode is considered a special
case of the merge mode: in skip mode, all coded block flags (CBF), the motion
vector difference, and the coded quantized transform coefficients are equal to zero.
Moreover, the bidirectional prediction technique is also adopted in HEVC. Hence,
two motion vectors (MVs) are calculated separately for each inter-PB using two
reference pictures, from list-0 and list-1. For each MV, the RD cost is calculated
using the original and the generated predicted blocks. The number of CB, PB, and
corresponding RD cost calculations for a CTB of size 64×64 is given in
Table 5.1. In this table, we consider only the inter-mode prediction together with
the merge/skip prediction. According to Table 5.1, 1864 RD cost calculations are
required for a 64×64 CU to predict its correct inter-mode. The RD cost of
intra-mode is not considered in this table.
From this analysis, we want to emphasize that a tremendous number of RD cost
calculations take place in a hybrid encoder. So, we should understand this
process in more detail.
Rate-distortion (RD) theory concerns the minimum number of bits needed to
achieve a given reproduction quality. Suppose we have an input raw video sequence
which we want to compress and transmit to a receiver. In this example, the input raw
video sequence can be considered the source. RD theory then addresses the problem
of determining the minimal number of bits per symbol so that the source (input
video) can be approximately reconstructed at the receiver (output video) without
exceeding a given amount of distortion.
Compression can be of two types: lossless and lossy. In the case of lossless
compression, as the name suggests, the decompressed data is an exact copy of
the original source data. This kind of compression scheme is important
where one needs perfect reconstruction of the source. However, it is impractical
for applications where the source information is voluminous or the channel
bandwidth is limited. On the other hand, lossy compression is more effective in
terms of compression ratio, at the cost of an imperfect source representation.
Generally, the properties of the human visual system are exploited in lossy
compression; for this reason, the decompressed video sequence and the source
video sequence are often indistinguishable to the human eye.
In lossy compression, a fundamental trade-off thus arises between how much
fidelity of the representation we are willing to give up (distortion) in order to
reduce the number of bits in the representation (rate). This trade-off between source
fidelity and coding rate is exactly the rate-distortion trade-off [1].
For a given system, source, and all possible quantization choices, we can plot
the distortion achieved by the encoder/decoder pair for different rate values. This
is generally called the operational rate-distortion curve. A conceptual operational
rate-distortion curve is shown in Fig. 5.2. In this curve, a boundary is always present
that distinguishes the best achievable operating points from suboptimal or
unachievable points. The boundary between achievable and unachievable is defined
by the convex hull of the set of operating points.
Equation 5.1 shows that the RD cost is a linear combination of the rate and the
distortion. The calculation of the rate is very straightforward: it can be computed
directly from the actual encoded bits of the video stream. The distortion
measurement, on the other hand, can use different algorithms. The most common
distortion measures are described below.
The mean squared error (MSE) at displacement (i, j) is defined as

MSE(i, j) = (1/N²) Σ_{n1=0..N−1} Σ_{n2=0..N−1} [ s(n1, n2, k) − s(n1 + i, n2 + j, k − l) ]²     (5.2)
Fig. 5.2 A conceptual operational rate-distortion curve: distortion (D) versus rate (R), with the
achievable operating points bounded by their convex hull
Like the MSE criterion, the mean absolute difference (MAD) also makes the error
values positive, but instead of summing the squared differences, the absolute
differences are summed. The MAD measure at displacement (i, j) is defined as
MAD(i, j) = (1/N²) Σ_{n1=0..N−1} Σ_{n2=0..N−1} | s(n1, n2, k) − s(n1 + i, n2 + j, k − l) |     (5.4)
This is the least computationally expensive criterion. The sum of absolute
differences (SAD) is quite similar to the MAD, but instead of averaging over the
block dimension, only the sum is calculated. The calculation of the SAD is given
below.

SAD(i, j) = Σ_{n1=0..N−1} Σ_{n2=0..N−1} | s(n1, n2, k) − s(n1 + i, n2 + j, k − l) |     (5.6)
Just like the MAD, the SAD criterion requires N² subtractions with absolute
values and N² additions for each candidate block at each search position; the
absence of averaging and multiplication operations makes this criterion the most
cost-effective of all.
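The three measures can be compared side by side in a short sketch. Blocks are given as lists of rows, the helper names are ours, and the frame indices k and k − l of Eqs. 5.2–5.6 are abstracted away into two plain blocks:

```python
# MSE, MAD, and SAD of Eqs. 5.2-5.6 for two N x N blocks.

def sad(a, b):
    n = len(a)
    return sum(abs(a[y][x] - b[y][x]) for y in range(n) for x in range(n))

def mad(a, b):
    return sad(a, b) / len(a) ** 2    # SAD averaged over the N^2 pixels

def mse(a, b):
    n = len(a)
    return sum((a[y][x] - b[y][x]) ** 2
               for y in range(n) for x in range(n)) / n ** 2

a = [[10, 12], [14, 16]]
b = [[11, 10], [14, 13]]
assert sad(a, b) == 6        # |-1| + |2| + |0| + |3|
assert mad(a, b) == 1.5
assert mse(a, b) == 3.5      # (1 + 4 + 0 + 9) / 4
```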
The mode decision itself minimizes a Lagrangian cost of the form
J_MODE = SSD + λ_MODE · R_MODE. Here, SSD denotes the sum of squared
differences between the original block and its reconstruction, and MODE indicates
a mode out of the set of potential modes of the block (MB or CTU). The
computation of the Lagrangian costs for the inter-modes is much more demanding
than for the intra and SKIP modes. This is because of the block motion
estimation step [1].
Given the Lagrange parameter λ_MOTION and the decoded reference picture,
rate-constrained motion estimation for a block S_i is performed by minimizing a
Lagrangian cost function of the same form, with the prediction error of the
motion-compensated block as the distortion term and the bits spent on the motion
information as the rate term.
A final remark concerns the choice of the Lagrange parameters λ_MODE and
λ_MOTION. An in-depth study of this parameter selection is given in [1]. Selected
rate-distortion curves and bit-rate-saving plots for video streaming, video-
conferencing, and entertainment-quality applications are given in Figs. 5.3, 5.4,
and 5.5.
Fig. 5.3 Selected rate-distortion curves and bit-rate saving plots for videoconferencing applica-
tions [1]
Fig. 5.4 Selected rate-distortion curves and bit-rate saving plots for video streaming applica-
tions [1]
Fig. 5.5 Selected rate-distortion curves and bit-rate saving plots for video entertainment applica-
tions [1]
Reference
1. T. Wiegand, H. Schwarz, A. Joch, F. Kossentini, G.J. Sullivan, Rate-constrained coder control
and comparison of video coding standards. IEEE Trans. Circ. Syst. Video Technol. 13(7), 688–703
(2003)
Chapter 6
Fast Prediction Techniques
Table 6.1 The performance analysis (QP 25, 832×480 sequences) on JM 18.0 and HM 3.0

                     JM 18.0                     HM 3.0                      Differential
Sequence             PSNR    Bit rate    Time    PSNR    Bit rate    Time    PSNR     B%     T%
Basketball Drill     38.92    5916.11    592     38.70    2345.11    5345    −0.22    60.3    902
Flower Vase          43.67     820.04    451     43.61     323.83    4087    −0.06    60.3    906
Keiba                39.80    5519.28    392     38.11    2265.61    3735    −1.69    58.9    952
Mobisode2            44.62     583.23    239     44.85     254.02    2422    +0.23    56.4   1013
Party Scene          37.44   17228.56    448     35.92    6198.63    3772    −1.52    64.0    841
Average                                                                      −0.65    60.0    923
Table 6.2 The performance analysis (QP 35, 832×480 sequences) on JM 18.0 and HM 3.0

                     JM 18.0                     HM 3.0                      Differential
Sequence             PSNR    Bit rate    Time    PSNR    Bit rate    Time    PSNR     B%     T%
Basketball Drill     32.77    1218.43    459     33.16     552.74    4388    +0.39    54.6    955
Flower Vase          36.91     126.47    459     33.16     552.74    4388    −1.30    33.5   1018
Keiba                33.15    1304.86    331     32.41     562.24    2981    −0.74    56.9    900
Mobisode2            41.23     158.65    226     41.64      62.15    2249    +0.41    60.8    995
Party Scene          29.01    3107.84    349     29.13    1461.27    2866    +0.12    53.0    821
Average                                                                      −0.30    51.8    938
The CU is the basic unit of region splitting used for inter-/intra-prediction. The
CU concept allows recursive splitting into four equally sized blocks, starting from
the quadtree root block. In [5], a fast CU depth decision algorithm is proposed that
is commonly known as ECU. According to the ECU, no further processing of sub-
trees is required when the current CU selects SKIP mode as the best prediction mode
at the current CU depth. The algorithm is depicted in Fig. 6.1.
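A minimal sketch of the ECU pruning follows, with placeholder names (`split4`, `evaluate_modes`) standing in for the real quadtree and RD-cost machinery:

```python
# ECU early termination sketch: the recursive quadtree CU split stops as
# soon as SKIP is selected as the best mode at the current depth.

def split4(cu):
    """Split a CU given as (x, y, size) into its four sub-CUs."""
    x, y, s = cu
    h = s // 2
    return [(x, y, h), (x + h, y, h), (x, y + h, h), (x + h, y + h, h)]

def encode_cu(cu, depth, max_depth, evaluate_modes):
    best = evaluate_modes(cu, depth)
    if best == "SKIP":                 # ECU: prune the whole sub-tree
        return [(depth, best)]
    out = [(depth, best)]
    if depth < max_depth:              # otherwise recurse as usual
        for sub in split4(cu):
            out += encode_cu(sub, depth + 1, max_depth, evaluate_modes)
    return out

# Toy decision rule: everything below the root is best coded as SKIP.
def eval_stub(cu, depth):
    return "SKIP" if depth >= 1 else "INTER"

visited = encode_cu((0, 0, 64), 0, 3, eval_stub)
assert len(visited) == 5    # root + 4 sub-CUs; 85 CUs without the pruning
```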
To decide the best PU mode, the HEVC encoder computes the RD costs of all
the possible inter-PU and intra-PU modes. Since each of them entails high
computational complexity, it is highly desirable for the encoder to decide the best
PU mode at the earliest possible stage, without exhaustively checking all possible
modes.
In [35], an early detection of SKIP mode (ESD) is proposed to reduce the
encoding complexity of HEVC by simply checking the differential motion vector
(DMV) and the coded block flag (CBF) after searching the best inter 2N×2N mode.
The flowchart of the ESD method is depicted in Fig. 6.2. As shown in Fig. 6.2, in
this method the current CU searches the inter 2N×2N modes (AMVP and merge)
before checking the SKIP mode. After selecting the best inter 2N×2N mode, that
is, the one with the minimum RD cost, the method checks its DMV and CBF. If the
DMV and CBF of the best inter 2N×2N mode are equal to (0, 0) and zero,
respectively (these two conditions are called the "early SKIP conditions"), the best
mode of the current CU is determined early as the SKIP mode. In other words, the
remaining PU modes are not investigated any further. The method can thus omit
the RD calculation for the other modes, reducing the encoding complexity without
sizable coding efficiency loss.
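The early-SKIP test itself reduces to two comparisons. In this sketch, a plain dictionary stands in for the encoder's real mode data structures:

```python
# The "early SKIP conditions" of the ESD method: DMV equal to (0, 0) and
# CBF equal to zero for the best inter 2Nx2N mode.

def early_skip(best_inter_2Nx2N):
    dmv_is_zero = best_inter_2Nx2N["dmv"] == (0, 0)
    cbf_is_zero = best_inter_2Nx2N["cbf"] == 0
    return dmv_is_zero and cbf_is_zero

assert early_skip({"dmv": (0, 0), "cbf": 0}) is True    # stop: choose SKIP
assert early_skip({"dmv": (1, 0), "cbf": 0}) is False   # keep testing PUs
```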
When a CU is encoded in an inter picture, the RD costs of a total of six PU types,
inter 2N×2N, inter 2N×N, inter N×2N, inter N×N, intra 2N×2N, and intra N×N,
are examined. The RD costs for inter N×N and intra N×N are examined only for
the 8×8 CU.
According to [7], if the CBF of an inter-PU other than an inter N×N PU in a CU is
zero (CBF = 0) for luma and the two chromas (CBF luma, CBF U, CBF V), the next
PU encoding process of the CU is terminated. This algorithm is generally referred
to as the CFM, and the corresponding flowchart is shown in Fig. 6.3.
This early termination rule deals with the computation of the rate-distortion cost of
the motion vector predictors at the encoder side. More precisely, a termination rule
is proposed to avoid estimating the rate-distortion costs of all the merge candidates.
In [13], an efficient fast decision for merge RD cost algorithm is proposed, commonly
referred to as FDM.
Figure 6.4 presents the proposed change to the encoder algorithm to avoid some
rate-distortion cost evaluations for some merge candidates. Instead of systematically
computing the rate-distortion cost of each candidate, an early termination rule
is applied. The diagram in Fig. 6.4 uses a Boolean variable to signal the early
termination for merge (ETM). When the condition is reached, i.e., (ETM ==
TRUE), the computation of the rate-distortion cost of the merge mode for a given
candidate is not performed.
Fig. 6.4 Fast decision for merge (FDM) RD cost algorithm [13]
The current frame is divided into pixel blocks, and motion estimation is per-
formed independently for each pixel block. Motion estimation is done by identifying
a matching pixel block in the reference frame. The displacement is given by the MV,
which consists of a pair (x, y) of horizontal and vertical displacement values. There
are various criteria available for block matching.
The reference pixel blocks are generated only from a region known as the search
area. The search range defines the boundary for the motion vectors and limits the
number of blocks to evaluate. The height and width of the search range depend on
the motion in the video sequence; the available computing power also determines
the search range. A bigger search range requires more computation due to the
increased number of evaluated candidates. Typically the search range is kept wider
(i.e., the width is larger than the height), since many video sequences exhibit panning
motion. The search region can also be changed adaptively depending on the detected
motion. The horizontal and vertical search parameters, Sx and Sy, define the search
range (±Sx and ±Sy), as shown in Figs. 6.5 and 6.6.
The H.264/AVC and HEVC standards both adopt a block-based encoding structure. For inter-prediction, motion estimation is the core of video compression and of various video processing applications; it extracts the motion information from the video sequence. Typically, motion estimation generates a motion vector for each block (MB or CU) in the video compression standard. The motion vector indicates the displacement of a block of pixels from its current location due to motion of an object or of the camera. This information
is used to find the best matching block in the reference frame to minimize the rate-distortion cost. This technique is known as the block matching algorithm (BMA).
We have studied various motion estimation algorithms used in H.264/AVC and HEVC. According to our survey, the existing BMAs can be classified into the following categories: full search, unsymmetrical-cross multihexagon-grid search, diamond search, enhanced predictive zonal search, test zone search, fixed search patterns, search patterns based on block correlation, and prediction-based fast algorithms.
The FS block matching algorithm evaluates every possible pixel block in the search range [1]. Hence, it generates the best matching motion vector, and this type of BMA gives the least possible residue for video compression. However, the required computation is prohibitively high due to the large number of search points in the defined search region: the number of points to evaluate is (2Sx + 1) × (2Sy + 1), which is far higher than for any of the fast search algorithms. Several fast BMAs reduce the number of search points while trying to keep good block matching accuracy. Note that since these algorithms test only a limited set of candidates, they may select a candidate corresponding to a local minimum, unlike full search, which always finds the global minimum.
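The full search procedure can be sketched as follows (an illustrative Python sketch over nested lists, not an encoder implementation; the function name and toy frames are ours). It evaluates all (2Sx + 1) × (2Sy + 1) displacements and keeps the one with minimum SAD:

```python
def full_search(ref, cur, bx, by, B, sx, sy):
    """Exhaustive search over every displacement in [-sx, sx] x [-sy, sy].

    ref, cur: 2-D pixel arrays (lists of lists); (bx, by) is the top-left corner
    of the current BxB block. Returns the MV (dx, dy) with minimum SAD and its cost.
    """
    h, w = len(ref), len(ref[0])
    best, best_cost = (0, 0), float("inf")
    for dy in range(-sy, sy + 1):
        for dx in range(-sx, sx + 1):
            x, y = bx + dx, by + dy
            if x < 0 or y < 0 or x + B > w or y + B > h:
                continue  # candidate block falls outside the reference frame
            cost = sum(
                abs(cur[by + j][bx + i] - ref[y + j][x + i])
                for j in range(B)
                for i in range(B)
            )
            if cost < best_cost:
                best_cost, best = cost, (dx, dy)
    return best, best_cost

# A toy frame with unique pixel values; the "current" frame is the reference
# shifted one pixel to the left, so the true MV of any inner block is (1, 0).
ref = [[(7 * x + 13 * y) % 256 for x in range(8)] for y in range(8)]
cur = [row[1:] + [0] for row in ref]
mv, cost = full_search(ref, cur, 2, 2, 2, 2, 2)
print(mv, cost)  # -> (1, 0) 0
```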
The unsymmetrical-cross multihexagon-grid search (UMHexagonS) was proposed for fast integer-pel and fractional-pel motion estimation in H.264/AVC [4]. UMHexagonS conducts the overall search in four steps from an initially predicted start search point: step one, a sparse uneven cross search; step two, a fine full search within a small rectangle; step three, a sparse uneven hexagon-grid search, where the grid becomes sparser and larger as the search points move away from the hexagon center; and step four, a refinement with hexagon or diamond search. Figure 6.7 demonstrates a typical search procedure in a search window with a search range of 16 (assuming the start search point is the (0,0) vector).
Compared to FS, the UMHexagonS algorithm is reported to reduce motion estimation time by about 90 %, with a PSNR drop of less than 0.05 dB and a similar bit rate, because the prediction step places the initial search point close to the best prediction point. The UMHexagonS search strategy begins with a coarse search pattern and then turns to more elaborate search patterns. With multiple patterns, it avoids the weakness of traditional fast algorithms, which are easily trapped in local minima. However, compared to ARPS and EPZS, the computational complexity of UMHexagonS is high, because its search pattern shapes contain more search candidates.
Fig. 6.7 A typical UMHexagonS search procedure within a search window of range 16 (steps 1, 2, 3, 4-1, and 4-2)

6.6 Diamond Search

A new diamond search (DS) algorithm for fast block matching motion estimation employs two search patterns. The first pattern, called the large diamond search pattern (LDSP) and illustrated in Fig. 6.8, comprises nine checking points, of which eight surround the center point to compose a diamond shape. The second pattern consists of five checking points that form a smaller diamond, called the small diamond search pattern (SDSP), as illustrated in Fig. 6.9.
In the searching procedure of the DS algorithm, the LDSP is applied repeatedly until the minimum block distortion (MBD) occurs at the center point. The search pattern is then switched from LDSP to SDSP for the final search stage. Among the five checking points of the SDSP, the position yielding the MBD provides the motion vector of the best matching block.
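The two-pattern procedure above can be sketched as follows (an illustrative Python sketch with our own function name; `cost` stands for any block distortion measure, such as the SAD of the block displaced by the candidate MV):

```python
# LDSP: center plus eight points forming a large diamond; SDSP: center plus four neighbors.
LDSP = [(0, 0), (0, -2), (1, -1), (2, 0), (1, 1), (0, 2), (-1, 1), (-2, 0), (-1, -1)]
SDSP = [(0, 0), (0, -1), (1, 0), (0, 1), (-1, 0)]

def diamond_search(cost, start=(0, 0)):
    """DS: repeat the LDSP until the minimum stays at the center, then one SDSP pass."""
    cx, cy = start
    while True:
        best = min((cost(cx + dx, cy + dy), (cx + dx, cy + dy)) for dx, dy in LDSP)
        if best[1] == (cx, cy):      # minimum block distortion at the LDSP center
            break
        cx, cy = best[1]             # re-center the large pattern and repeat
    best = min((cost(cx + dx, cy + dy), (cx + dx, cy + dy)) for dx, dy in SDSP)
    return best[1]

# Toy distortion with its minimum at (2, 2): DS converges there from the origin.
print(diamond_search(lambda x, y: abs(x - 2) + abs(y - 2)))  # -> (2, 2)
```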
6.7 Enhanced Predictive Zonal Search

The enhanced predictive zonal search (EPZS) for single and multiple frame motion estimation [31] can be considered an improvement of the predictive
motion vector field adaptive search technique (PMVFAST)-enhancing block-based
motion estimation [30] and fast block matching motion estimation using advanced
predictive diamond zonal search (APDZS) [29]. The EPZS improves upon these
algorithms by introducing an additional set of predictors; the early stopping criteria
are more efficiently selected. Furthermore, due to the enhanced reliability of
the predictors, only one search pattern is used, thus considerably reducing any
associated overhead of the algorithm. The checking pattern, depending on the
implementation requirements, could be either a diamond or square. The algorithm
is similar to other zonal type algorithms.
The EPZS algorithm improves upon PMVFAST, and also upon APDZS, by considering several additional predictors in the generalized predictor selection phase of these algorithms and by selecting a more robust and efficient adaptive thresholding calculation. Due to the high efficiency of the prediction stage, the search pattern can be considerably simplified.
The EPZS algorithm also considers the accelerator motion vector (Fig. 6.12), a differentially increased/decreased motion vector obtained by considering not only the motion vector of the collocated block in the previous frame but also that of the frame before. The concept behind this predictor is that a block may not be moving at a constant velocity but may instead be accelerating. EPZS also uses the adjacent blocks of the current block together with the collocated block and its adjacent blocks in the previous frame, as in Fig. 6.13.
6.8 Test Zone Search

The TZS algorithm is a mixture of zonal search and raster search patterns. The flowchart of the complete algorithm is shown in Fig. 6.14. The algorithm can be broadly classified into four steps, as described in the following:
Motion vector prediction: The TZS algorithm employs the median, left, up, and right predictors. The predictor with the minimum cost is selected as the starting location for the subsequent search steps.
Initial grid search: In this step, the algorithm searches the search window using diamond or square patterns with stride lengths ranging from 1 through 64, in multiples of 2. The pattern used is either an eight-point diamond search or an eight-point square search. A sample grid with stride length
8 for diamond is shown in Fig. 6.15a. The motion vector with minimum SAD
is taken as the center search point for further steps. The stride length for this
minimum distortion point is stored in variable uiBestDistance. The maximum
number of search points for this step, n1, is given by

n1 = P × (floor(log2(S)) + 1) (6.1)

where S is the size of the search window, P is the number of search points per grid (eight for diamond, six for hexagon, etc.), and floor represents the floor function.
Raster search: The raster search is a simple full search on a down-sampled version of the search window. A predefined value iRaster for the raster scan is set before compilation of the code [10]; this value is used as a sampling factor for the search window. The raster-scan search window (for a 16 × 16 search window) with an iRaster value of 3 is shown in Fig. 6.15b. As shown in the flowchart in Fig. 6.14, the condition for performing this raster search is that uiBestDistance (obtained
from the previous step) must be greater than iRaster. If this condition is not satisfied, the algorithm skips this step. If this step is processed, then uiBestDistance is changed to the iRaster value.

Fig. 6.14 Flowchart of the raster and refinement stages of TZS: raster search with length iRaster is performed when uiBestDistance > iRaster; raster refinement, when enabled, repeats around the new search center with uiBestDistance halved on each pass while uiBestDistance > 0; star refinement, when enabled, searches the new search center with all possible stride lengths before the algorithm stops

As seen from Fig. 6.15b, the number of search points in
each row/column would be ceil(S/R), where ceil represents the ceiling function and R represents the iRaster value. Thus, the maximum number of search points in this step, n2, is given by

n2 = (ceil(S/R))^2 (6.2)

Fig. 6.15 (a) Diamond search pattern and (b) hexagonal search pattern with stride length 8
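The initial grid scan described above can be sketched as follows (an illustrative Python sketch; the function name and the unit pattern are ours, and the actual HM patterns differ slightly). The pattern is evaluated at strides 1, 2, 4, ..., up to the search range, tracking the best point and the stride at which it was found (uiBestDistance):

```python
from math import floor, log2

# An illustrative eight-point pattern of unit radius, scaled by the stride.
PATTERN8 = [(0, -1), (1, -1), (1, 0), (1, 1), (0, 1), (-1, 1), (-1, 0), (-1, -1)]

def tzs_initial_grid(cost, search_range, pattern=PATTERN8):
    """TZS-style initial grid: doubling strides, P points per stride level."""
    best_cost, best_mv, best_dist = cost(0, 0), (0, 0), 0
    stride = 1
    while stride <= search_range:
        for dx, dy in pattern:
            c = cost(dx * stride, dy * stride)
            if c < best_cost:
                best_cost = c
                best_mv, best_dist = (dx * stride, dy * stride), stride
        stride *= 2
    return best_mv, best_dist

# Point count: P points per stride level, with floor(log2(S)) + 1 levels (1, 2, ..., 64).
S, P = 64, 8
print(P * (floor(log2(S)) + 1))  # -> 56

# Toy distortion minimized at (8, 0): found at stride 8, so uiBestDistance = 8.
print(tzs_initial_grid(lambda x, y: abs(x - 8) + abs(y), 64))  # -> ((8, 0), 8)
```

A large uiBestDistance signals that the minimum lies far from the predicted center, which is exactly the condition (uiBestDistance > iRaster) that triggers the raster stage.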
6.9 Fixed Search Patterns

In this category, most of the methods are based on the assumption that ME matching
error decreases monotonically as the search moves toward the position of the global
minimum error. The motion vector of each block is searched independently by using
fixed search patterns. Examples are displacement measurement and its application
in interframe image coding (2-LOGS), motion-compensated interframe coding for
video conferencing (TSS), novel four-step search algorithm for fast block motion
estimation (4SS), block-based gradient descent search algorithm for block motion
estimation in video coding (BBGDS), hexagon-based search pattern for fast block
motion estimation (HEXBS) [40], DS, and UMHexagonS. These algorithms reduce the number of search points; however, they involve a trade-off between complexity reduction and image quality.
The 4SS and TSS are efficient for fast-motion video sequences because the MVs in fast-motion sequences are far away from the center point of the macroblock. However, in other cases, such as medium- and slow-motion sequences, they can be trapped in local minima. The TSS also uses a fixed checking-point pattern in its first step, which is inefficient for estimating slow motion. The new three-step search algorithm for block motion estimation (NTSS) [17], the efficient three-step search algorithm (ETSS) [9], and the simple and efficient search algorithm (SES) [20] have been proposed to improve the performance of simple fixed search pattern algorithms.
Instead of using predetermined search patterns, these methods exploit the correlation between the current block and its adjacent blocks in the spatial and/or temporal domain to predict candidate MVs. The predicted MVs are obtained by calculating a statistical average (such as the mean, the median, or a weighted mean/median) of the neighboring MVs [21] or by selecting one of the neighboring MVs according to certain criteria. In addition, one such candidate, named the accelerator MV, is a differentially increased/decreased MV obtained by considering not only the motion vector of the collocated block in the previous frame but also that of the frame before.
The concept behind this predictor is that a block may not be moving at a constant velocity but may be accelerating. Approaches of this kind, such as ARPS and EPZS, use spatial and/or temporal correlation to calculate the predictor. These algorithms set pattern sizes or estimate positions from the MVs of the previous frame and/or the neighboring blocks of the current block. EPZS and ARPS preserve a peak signal-to-noise ratio (PSNR) close to that of FS while reducing the consumed time at a similar bit rate. However, they incur considerable overhead in terms of memory resources, since they use spatio-temporal information.
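The two predictor types described above can be sketched as follows (an illustrative Python sketch; function names are ours, and MVs are simple (x, y) integer pairs):

```python
def median_mv(mv_a, mv_b, mv_c):
    """Component-wise median of three neighboring MVs (a common spatial predictor)."""
    xs, ys = zip(mv_a, mv_b, mv_c)
    return (sorted(xs)[1], sorted(ys)[1])

def accelerator_mv(mv_prev, mv_prev2):
    """Accelerator predictor: extrapolate from the collocated MVs of the previous
    frame (t-1) and the frame before that (t-2), assuming the block accelerates."""
    return (2 * mv_prev[0] - mv_prev2[0], 2 * mv_prev[1] - mv_prev2[1])

print(median_mv((4, 0), (6, 2), (5, -1)))  # -> (5, 0)
print(accelerator_mv((6, 2), (4, 1)))      # -> (8, 3)
```

The accelerator example continues the per-frame increase in displacement: the MV grew from (4, 1) to (6, 2), so the extrapolated candidate is (8, 3).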
6.12 Prediction-Based Fast Algorithms
Apart from the abovementioned search patterns (fixed or variable), another kind of attempt has been reported for the block matching algorithm that uses the motion activity of the video sequence. Video sequences can be broadly divided into three categories based on the motion activity in successive frames: slow-, medium-, and fast-motion sequences. Various algorithms use different schemes to classify video sequences.
The search pattern switching algorithm for block motion estimation (SPS) [23] combines two approaches to motion estimation. The first approach uses a coarse-to-fine technique to reduce the number of search points, as in 2-DLOG and TSS; this approach is efficient for fast-motion video sequences, because the search points are evenly distributed over the search window, so global minima far away from the window center can be located more efficiently. The second approach utilizes the center-biased characteristic of MVs, as in algorithms such as N3SS, 4SS, BBGDS, and DS; it uses center-biased search patterns to exploit the center-biased distribution of global minima. Compared with the first approach, a substantial reduction in search points can be achieved for slow motion. The SPS algorithm combines the advantages of these two approaches by using different search patterns according to the motion content of a block. The performance of such an adaptive algorithm depends on the accuracy of its motion content classification.
In real video sequences, content with slow, medium, and fast motion frequently coexists. The adaptive fast block matching algorithm that switches search patterns for sequences with wide-range motion content (A-TDB) can efficiently remove the temporal redundancy of such sequences. Based on the characteristics of a predicted profit list, A-TDB adaptively switches search patterns among TSS, DS, and BBGDS according to the motion content [8].
In an adaptive motion estimation scheme for video coding (NUMHexagonS), the statistics of the MV distribution were analyzed. The algorithm predicts the MV distribution, makes full use of the MV characteristics, and combines the MV distribution prediction with new search patterns to make the search positions more accurate [19].
A good number of papers have reported efficient prediction techniques; this can be considered one of the most effective ways to build a fast algorithm in HEVC.

The fast encoder decision algorithm called FEN has been included in the HM software and can greatly reduce complexity. The main idea of FEN is that the subsequent CU calculation is skipped when the current CU selects SKIP mode as the best mode and its rate-distortion cost is smaller than the average rate-distortion
cost of the previously encoded SKIP-mode CUs. The average rate-distortion cost of previously skipped CUs is multiplied by a fixed weighting factor to increase the number of CUs that can be encoded as SKIP mode; the weighting factor of FEN is 1.5. In [36], a fast mode decision algorithm based on a Bayesian framework was proposed for scalable H.264/AVC.
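The FEN skip test described above can be sketched as follows (an illustrative Python sketch; the function name and the way the history is passed are ours, and in the encoder this check runs only when SKIP is the current best mode):

```python
def fen_early_termination(current_skip_rd_cost, previous_skip_rd_costs, weight=1.5):
    """Terminate further CU evaluation when the current CU's SKIP-mode RD cost is
    below `weight` times the average RD cost of previously SKIP-coded CUs
    (weight = 1.5, as reported for FEN)."""
    if not previous_skip_rd_costs:
        return False  # no history yet: never terminate early
    avg = sum(previous_skip_rd_costs) / len(previous_skip_rd_costs)
    return current_skip_rd_cost < weight * avg

# Average of the history is 1000, so the threshold is 1500.
print(fen_early_termination(1200.0, [900.0, 1000.0, 1100.0]))  # -> True
print(fen_early_termination(2000.0, [900.0, 1000.0, 1100.0]))  # -> False
```

Raising the weighting factor makes the test pass more often, trading a little rate-distortion performance for more skipped CU evaluations.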
In [11], an adaptive coding unit early termination algorithm has been proposed based on an early SKIP detection technique. In this paper, three tests were performed to find the statistical characteristics of SKIP mode. These tests show that the current CU and its neighboring CUs are highly correlated; hence, an adaptive weighting factor adjustment method is proposed using these correlations. The initial weighting factor of the proposed method is fixed at one, and the weighting factor is then adjusted between 1.0 and 2.0. The experimental results show that the average coding time can be reduced by up to 54 % using this technique. In natural pictures, neighboring blocks usually hold similar textures; consequently, the optimal intra-prediction of the current block may have a strong correlation with those of its neighboring blocks. Based on this consideration, in [39], conditional probabilities have been estimated for the optimal intra-direction of the current block, from which a most probable mode (MPM) is defined from its neighboring blocks. The statistical results show that the MPM of the current block has a large ratio of being the best mode under both test conditions, and this ratio fluctuates only a little between different sequences.
In [16], it is shown that large CUs can be considered very efficient for high-resolution, slow-motion, or large-QP video sequences. A larger CU carries less side information and fewer motion vectors; apart from that, it can also predict the smooth and slow-motion parts of a sequence more accurately, and mode correlation exists among consecutive frames. In this context, the authors provide two key ideas: one at the frame level and one at the CU level. A 45 % encoding time saving is possible using this technique. In [25], the authors take the reference software HM0.9 as a benchmark and develop their own system based on hierarchical block-based coding and a block-adaptive translational model for interframe coding. In [32], a low-complexity intra-mode prediction algorithm has been proposed that combines most-probable-mode flag signaling and intra-mode signaling in one elegant solution; using this algorithm, a 33 % bit-rate reduction can be obtained. The algorithm takes neighboring intra-modes into account to obtain a prioritization of the different modes. In most video coding, chroma sample prediction is performed after the luma samples are taken.
In [3], the authors proposed a reversed prediction structure that makes the luma predictions after the chroma samples are taken. In the conventional structure, the intra-prediction has to be calculated 341 (256 + 64 + 16 + 4 + 1) times for the luma when the maximum CU is set to 64 × 64 and the maximum allowed partition depth is 4. The proposed structure, however, calculates only 85 (64 + 16 + 4 + 1) times on the chroma samples. Experimental results show that the proposed algorithm achieves approximately 30 % time savings on average, with 0.03 and 0.05 dB BD-PSNR losses in the chroma components and an unnoticeable increase in bit rate.
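The counts 341 and 85 can be checked with a short sketch (our own helper, for illustration): they are the number of nodes in a full quadtree of depth 4 and depth 3, respectively, consistent with the 4:2:0 chroma planes having one less partition level than luma.

```python
def num_blocks(max_depth):
    """Number of blocks in a full quadtree of the given depth: sum of 4**d for d = 0..max_depth."""
    return sum(4 ** d for d in range(max_depth + 1))

print(num_blocks(4))  # -> 341 luma candidates (1 + 4 + 16 + 64 + 256)
print(num_blocks(3))  # -> 85 chroma candidates (1 + 4 + 16 + 64)
```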
6.13 Improved RD Cost-Based Algorithms
Generally, bi-prediction is effective when the video has scene changes, camera panning, zoom-in/out, or very fast scenes. In [12], it is observed that the RD costs of forward and backward prediction increase when bi-prediction is the best prediction mode. That paper presents a bi-prediction skipping method that efficiently reduces the computational complexity of bi-prediction. The assumption is that if bi-prediction is selected as the best prediction mode, the RD costs of the blocks included in each list (forward and backward) will be larger than the average RD costs of the previous blocks coded by forward and backward prediction.
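That assumption can be turned into a skipping test, sketched as follows (an illustrative Python sketch of one reading of [12]; the function name and argument layout are ours):

```python
def skip_biprediction(rd_list0, rd_list1, avg_list0, avg_list1):
    """Skip evaluating bi-prediction unless BOTH uni-directional RD costs exceed
    the running averages of previously coded forward (list 0) and backward
    (list 1) blocks -- the situation in which bi-prediction tends to win."""
    return not (rd_list0 > avg_list0 and rd_list1 > avg_list1)

# Both uni-directional costs are above average: bi-prediction may win, so evaluate it.
print(skip_biprediction(120.0, 130.0, 100.0, 100.0))  # -> False
# The forward cost is already low: skip the expensive bi-prediction evaluation.
print(skip_biprediction(80.0, 130.0, 100.0, 100.0))   # -> True
```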
The time consumed by bi-prediction is almost 20 % of the total encoding time; the proposed method can reduce nearly half of the total bi-prediction time with negligible loss of quality. In [14], another efficient bi-prediction algorithm has been proposed based on overlapped block motion compensation (OBMC). It views the received motion data as a source of information about the motion field and forms a better prediction of a pixel's intensity based on its own and nearby block MVs.
On the other hand, the prediction modes in HEVC can be divided into three categories: inter, skip, and merge. When a PU is coded in either skip or merge mode, no motion information is transmitted except the index of the selected candidate; for skip, the residual signal is also omitted. Based on this observation, three novel techniques have been proposed in [18] for efficient merging of candidate blocks; these three coding tools were adopted in HEVC and integrated from HM-3.0 onward. In [28], a fast algorithm for the residual quadtree mode decision has been proposed based on the merge and split decision process; experimental results show that it gives a 42–55 % encoding time reduction. In [24], an early merge mode decision algorithm has been reported that uses all-zero-block (AZB) and motion estimation information of the inter 2N×2N CU.
The abovementioned literature relates to inter-prediction. On the other hand, a good amount of work has been reported on fast intra-prediction and transform unit (TU) termination. In [38], the variances of coding mode costs are used to terminate the current CU mode decision as well as the TU size selection. A novel adaptive intra-mode skipping algorithm has been reported in [33], based on the statistical properties of the neighboring reference samples.
Apart from fast mode decision algorithms, researchers are also trying to improve the rate-distortion calculation itself. In this context, a mixture-of-Laplacians-based RD cost calculation scheme has been proposed in [15]. This work shows that the inter-predicted residues exhibit different statistical characteristics for CU blocks at different depth levels. The experimental results show that the rate and distortion models based on the mixture Laplacian distribution estimate the actual rates and distortions better than those based on a single Laplacian distribution.
In order to reduce the total rate-distortion (RD) cost, a set of transform pairs that minimizes the total RD cost has been proposed in [41]. The proposed transforms are trained offline using several video sequences and are applied by matrix multiplication. The scheme provides a set of rate-distortion optimized transforms, which achieves 2.0 % and 3.2 % bit-rate savings in the intra-HE and intra-LoCo settings, respectively. In [27], the number of full R-D checks for the intra-prediction mode decision is reduced, while residual quadtree (RQT) checking is always done for all intra-prediction modes that undergo R-D checks. That is, fewer intra-prediction modes are tested, but for each mode tested, a thorough search for the optimal transform tree is carried out.
The video codec under development still relies on transform domain quantization
and includes the same in-loop deblocking filter adopted in the H.264/AVC standard
to reduce quantization blocking artifacts. This deblocking filter provides two offsets
to vary the amount of filtering for each image area.
In [22], a perceptual optimization technique for these offsets has been proposed, based on a quality metric able to quantify the impact of blocking artifacts on the perceived video quality. The implementation complexity of adaptive loop filtering (ALF) for luma at the decoder is analyzed in [2]; the analysis covers not only computations but also memory bandwidth and memory size. The proposed filters reduce memory bandwidth and size requirements by 25 % and 50 %, respectively, with minimal impact on coding efficiency.
Sample adaptive offset (SAO) has been proposed in [6] to reduce the distortion between the reconstructed pixels and the original pixels. The proposed SAO achieves 1.3, 2.2, 1.8, and 3.0 % bit-rate reductions; the encoding time is roughly unchanged, and the decoding time increases by 1–3 %.
In [34], the new transform coding techniques in the HEVC Test Model are described, including the residual quadtree (RQT) approach and coded block pattern signaling. Experimental results are presented showing the advantage of using larger transform block sizes, especially for high-resolution video material.
References
1. X. Artigas, et al., The DISCOVER codec: architecture, techniques and evaluation. In: Picture
Coding Symposium, vol. 17(9), Lisbon, Portugal, 2007
2. M. Budagavi, V. Sze, M. Zhou, HEVC ALF decode complexity analysis and reduction. In:
International Conference on Image Processing (ICIP), 2011
3. W.J. Chen, J. Su, B. Li, T. Ikenaga, Reversed intra prediction based on chroma extraction in HEVC. In: International Symposium on Intelligent Signal Processing and Communications Systems (ISPACS), 2011
4. Z. Chen, et al., Fast integer-pel and fractional-pel motion estimation for H.264/AVC. J. Vis. Commun. Image Represent. 17(2), 264–290 (2006)
5. K. Choi, S.-H. Park, E.S. Jang, Coding tree pruning based CU early termination, document
JCTVC-F092. JCT-VC, July 2011
6. C.-M. Fu, C.-Y. Chen, Y.-W. Huang, S. Lei, Sample adaptive offset for HEVC. In: International
Workshop on Multimedia Signal Processing (MMSP), 2011
7. R.H. Gweon, Y.-L. Lee, J. Lim, Early termination of CU encoding to reduce HEVC complexity,
document JCTVC-F045. JCT-VC, July 2011
8. S.-Y. Huang, C.-Y. Cho, J.-S. Wang, Adaptive fast block-matching algorithm by switching
search patterns for sequences with wide-range motion content. IEEE Trans. Circ. Syst. Video
Technol. 15(11), 1373–1384 (2005)
9. X. Jing, L.-P. Chau, An efficient three-step search algorithm for block motion estimation. IEEE
Trans. Multimedia 6(3), 435–438 (2004)
10. JVT of ISO/IEC MPEG, ITU-T VCEG, MVC software Reference Manual-JMVC 8.2, May
2010
11. J. Kim, S. Jeong, S. Cho, J.S. Choi, Adaptive coding unit early termination algorithm for
HEVC. In: International Conference on Consumer Electronics (ICCE), Las Vegas, 2012
12. J. Kim, S. Jeong, S. Cho, J.S. Choi, An efficient bi-prediction algorithm for HEVC. In:
International Conference on Consumer Electronics (ICCE), Las Vegas, 2012
13. G. Laroche, T. Poirier, P. Onno, Encoder speed-up for the motion vector predictor cost
estimation, document JCTVC-H0178. JCT-VC, Feb. 2012
14. C.-L. Lee, C.-C. Chen, Y.-W. Chen, M.-H. Wu, C.-H. Wu, W.-H. Peng, Bi-prediction combined
template and block motion compensations. In: International Conference on Image processing
(ICIP), 2011
15. B. Lee, M. Kim, Modeling rates and distortions based on a mixture of laplacian distributions
for inter-predicted residues in quadtree coding of HEVC. IEEE Signal Process. Lett. 18(10),
571–574 (2011)
16. J. Leng, L. Sun, T. Ikenaga, S. Sakaida, Content based hierarchical fast coding unit decision
algorithm for HEVC. In: International Conference on Multimedia and Signal Processing, 2011
17. R. Li, B. Zeng, M.L. Liou, A new three-step search algorithm for block motion estimation.
IEEE Trans. Circ. Syst. Video Technol. 4(4), 438–442 (1994)
18. J.-L. Lin, Y.-W. Chen, Y.-P. Tsai, Y.-W. Huang, S. Lei, Motion vector coding techniques for
HEVC. In: International Workshop on Multimedia Signal Processing (MMSP), 2011
19. P. Liu, Y. Gao, K. Jia, An adaptive motion estimation scheme for video coding. Scientific World
J. 2014 (2014)
20. J. Lu, M.L. Liou, A simple and efficient search algorithm for block-matching motion
estimation. IEEE Trans. Circ. Syst. Video Technol. 7(2), 429–433 (1997)
21. L. Luo, et al., A new prediction search algorithm for block motion estimation in video coding.
IEEE Trans. Consumer Electron. 43(1), 56–61 (1997)
22. M. Naccari, C. Brites, J. Ascenso, F. Pereira, Low complexity deblocking filter perceptual
optimization for the HEVC codec. In: International Conference on Image Processing (ICIP),
2011
23. K.-H. Ng, et al., A search patterns switching algorithm for block motion estimation. IEEE
Trans. Circ. Syst. Video Technol. 19(5), 753–759 (2009)
24. Z. Pan, S. Kwong, M.T. Sun, J. Lei, Early merge mode decision based on motion estimation
and hierarchical depth correlation for HEVC, IEEE Trans. broadcasting 60(2), 405–412 (2014)
25. X. Peng, J. Xu, F. Wu, Exploiting inter-frame correlations in compound video coding. In:
International Conference on Visual Communications and Image Processing (VCIP), 2011
26. N. Purnachand, L.N. Alves, A. Navarro, Improvements to TZ search motion estimation
algorithm for multiview video coding. In: 19th International Conference on Systems, Signals
and Image Processing (IWSSIP), 2012. IEEE, 2012
27. Y.H. Tan, C. Yeo, H.L. Tan, Z. Li, On residual quad-tree coding in HEVC. In: International
Workshop on Multimedia Signal Processing (MMSP), 2011
28. S.-W. Teng, H.-M. Hang, Y.-F. Chen, Fast mode decision algorithm for residual quadtree
coding. In: International Conference on Visual Communications and Image Processing (VCIP),
2011
29. A.M. Tourapis, et al., Fast block-matching motion estimation using advanced predictive diamond zonal search (APDZS). In: ISO/IEC JTC1/SC29/WG11, MPEG2000 M5865 (2000)
30. A.M. Tourapis, O.C. Au, M.L. Liou, Predictive motion vector field adaptive search technique
(PMVFAST)-enhancing block based motion estimation. In: Proceedings of SPIE., vol. 4310,
2001
31. A.M. Tourapis, Enhanced predictive zonal search for single and multiple frame motion estimation. In: Electronic Imaging 2002, International Society for Optics and Photonics, 2002
32. S. Van Leuven, J. De Cock, P. Lambert, R. Van de Walle, J. Barbarien, A. Munteanu, Improved
intra mode signaling for HEVC. In: International Conference on Multimedia and Expo (ICME),
2011
33. L.L. Wang, W.C. Siu, Novel adaptive algorithm for intra prediction with compromised modes
skipping and signalling process in HEVC, IEEE Trans. Circuits Syst. Video Technol., 23(10),
1686–1694 (2013)
34. M. Winken, P. Helle, D. Marpe, H. Schwarz, T. Wiegand, Transform coding in the HEVC test
model. In: International Conference on Image processing (ICIP), 2011
35. J. Yang, J. Kim, K. Won, H. Lee, B. Jeon, Early SKIP detection for HEVC, document JCTVC-
G543. JCT-VC, Geneva, Switzerland, Nov. 2011
36. C.H. Yeh, K.J. Fan, M.J. Chen, G.L. Li, Fast mode decision algorithm for scalable video coding using Bayesian theorem detection and Markov process. IEEE Trans. Circuits Syst. Video Technol. 20(4), 536–574 (2010)
37. C. Yeo, Y.H. Tan, Z. Li, Low complexity mode dependent KLT for block based intra coding.
In: International Conference on Image Processing (ICIP), 2011
38. H. Zhang, Z. Ma, Early termination schemes for fast intra mode decision in high efficiency video coding. In: IEEE International Symposium on Circuits and Systems, Beijing, China, 2013
39. L. Zhao, L. Zhang, S. Ma, D. Zhao, Fast mode decision algorithm for intra prediction in HEVC.
In: International Conference on Visual Communications and Image Processing (VCIP), 2011
40. C. Zhu, X. Lin, L-P. Chau, Hexagon-based search pattern for fast block motion estimation.
IEEE Trans. Circ. Syst. Video Technol. 12(5), 349–355 (2002)
41. F. Zou, O.C. Au, C. Pang, J. Dai, Rate distortion optimized transform for intra block coding for
HEVC. In: International Conference on Visual Communications and Image Processing (VCIP),
2011