
Deep Learning-Based Technology in Responses to the Joint Call for Proposals on Video Compression with Capability beyond HEVC

Dong Liu, Senior Member, IEEE, Zhenzhong Chen, Senior Member, IEEE, Shan Liu, and Feng Wu, Fellow, IEEE

Abstract—Deep learning has achieved great success in the past decade, especially in the fields of computer vision and image processing. After witnessing such success, video coding experts are motivated to consider whether deep learning can also benefit video coding, and if so, they seek to discover why and how. Indeed, a number of research studies have been conducted to explore deep learning for image and video coding, which has been an active and fast-growing research area especially since the year 2015. These prior arts can be divided into two categories: new coding schemes that are built solely upon deep networks (deep schemes), and deep network-based coding tools that are embedded into traditional coding schemes (deep tools). Moreover, in the responses to the joint call for proposals on video compression with capability beyond High Efficiency Video Coding (HEVC), a number of deep tools have been proposed, and some of them are further studied for the upcoming Versatile Video Coding (VVC). In this paper, we summarize the ongoing efforts in the Joint Video Experts Team about the proposed deep tools, and we discuss several promising tools in much detail, including neural network-based intra prediction, convolutional neural network (CNN) based in-loop filtering, and CNN-based block-adaptive-resolution coding. A series of experimental results are provided to demonstrate the capability of these tools in achieving higher compression efficiency than the VVC or HEVC anchor. These results shed light on the promising direction of deep learning-based future video coding, towards which a lot of open problems call for further study.

Index Terms—Convolutional neural network (CNN), deep learning, High Efficiency Video Coding (HEVC), neural network (NN), Versatile Video Coding (VVC), video coding.

Date of current version September 30, 2019. This work was supported by the Natural Science Foundation of China under Grants 61772483, 61425026, and 61771348.
D. Liu and F. Wu are with the CAS Key Laboratory of Technology in Geo-Spatial Information Processing and Application System, University of Science and Technology of China, Hefei 230027, China (e-mail: dongeliu@ustc.edu.cn; fengwu@ustc.edu.cn).
Z. Chen is with the School of Remote Sensing and Information Engineering, Wuhan University, Wuhan 430079, China (e-mail: zzchen@whu.edu.cn).
S. Liu is with the Tencent Media Lab, Palo Alto, CA 94301, USA (e-mail: shanl@tencent.com).
Copyright © 2019 IEEE. Personal use of this material is permitted. However, permission to use this material for any other purposes must be obtained from the IEEE by sending an email to pubs-permissions@ieee.org.

I. INTRODUCTION

In the past decade, the most impactful breakthrough in computing technology is probably the boom of deep learning, which has been increasingly adopted in the hope of promoting the development of artificial intelligence [1]. Deep learning belongs to machine learning technology, and it is distinct from other machine learning techniques due to its computational model, known as the deep artificial neural network, or deep network for short. A deep network, whose structure is inspired by biological neural systems, is composed of multiple (usually more than three) processing layers, each of which is further composed of multiple simple, non-linear, basic computational units. A deep network is designed so that it can process data with multiple levels of abstraction and convert data into different kinds of representations. Such representations can be sophisticated if using many processing layers (i.e. deep networks). The entire deep network is learned from a massive amount of data using a general machine learning procedure. One benefit of deep learning is the capacity for “automating” the generation of representations and eliminating the necessity of handcrafted representations, also known as feature engineering, which were necessary in previous machine learning techniques. As feature engineering has been a longstanding difficulty for natively unstructured data, such as acoustic and visual signals, deep learning is especially useful for such data.

Specifically for processing images and videos, deep learning using convolutional neural networks (CNNs) has achieved great success in the fields of computer vision and image processing. In 2012, Krizhevsky et al. presented an eight-layer CNN and won the image classification challenge with a surprisingly low error rate compared with previous works [2]. In 2014, Girshick et al. proposed regions with CNN features and improved the performance of object detection by a large margin [3]. In 2014, Dong et al. presented a three-layer CNN for single image super-resolution (SR) known as SRCNN, which outperformed previous works not only in reconstruction quality but also in computational speed [4]. In 2017, Zhang et al. presented a deep CNN for image denoising, which demonstrated that a single CNN model may tackle several different image restoration tasks including Gaussian denoising, single image SR, and compression artifact reduction [5].

Witnessing such success, video coding experts cannot help but consider whether deep learning can also benefit video coding, and if it can, they seek to uncover why and how. In fact, the computational model of deep learning (i.e., the neural network, or NN) is not strange to the image/video coding community. From the 1980s to the 1990s, a number of research studies were already conducted on NN-based image coding [6], [7]. In that period, the networks were not deep, and the compression efficiency was not high, either. However, the situation is quite different nowadays. Thanks to the advances in computing power, efficient algorithms, and huge data, it is now possible to train very deep networks with even more than 1000 layers [8]. From this point of view, an exploration of using deep learning for image/video coding is worth reconsidering.


Recently, especially since the year 2015, researchers have been exploring deep learning for image and video coding with certain success. There are in general two approaches for exploration: one approach is to base new coding schemes solely on deep networks [9]–[21], which will be referred to as deep schemes in this paper; the other approach is to design new coding tools based on deep networks and to integrate the tools into traditional coding schemes, such as High Efficiency Video Coding (HEVC) [22]–[29], which will be referred to as deep tools in this paper. For image coding, deep schemes have currently achieved compression efficiency comparable with state-of-the-art non-deep schemes, e.g. HEVC intra; however, for video coding, deep schemes do not reach the compression efficiency of HEVC. On the other hand, several different kinds of deep tools have all achieved noticeable coding gains on top of HEVC.

In 2017 and 2018, ISO/IEC MPEG and ITU-T VCEG issued a joint call for proposals (JCfP) on video compression with capability beyond HEVC, and organizations from both industry and academia responded with their technical proposals. In these proposals, several deep tools were reported to achieve higher compression efficiency. Following the JCfP, ISO/IEC MPEG and ITU-T VCEG appointed the Joint Video Experts Team (JVET) to develop a new video coding standard, informally known as Versatile Video Coding (VVC), which is anticipated to become the successor of HEVC. During the development of VVC, several deep tools are further studied and are deemed quite promising to be adopted in future standards.

In this paper, we summarize the ongoing efforts of the JVET with regard to deep learning-based technology for video compression that appeared in response to the JCfP and afterwards. In addition to the general introduction, we select and discuss three promising deep tools in greater detail, i.e. NN-based intra prediction, CNN-based in-loop filtering, and CNN-based block-adaptive-resolution coding. A series of experimental results are provided to demonstrate the capability of these tools.

The remainder of this paper is organized as follows. In Section II we give a brief overview of related works. In Section III we specifically review the proposed deep tools in response to the JCfP and afterwards. In Section IV, we try to interpret the underlying presumptions of the benefits of using deep learning for image/video coding. In Section V, we focus on the three selected deep tools and present the experimental results regarding the three deep tools. Open problems are discussed in Section VI and conclusions are drawn in Section VII.

II. RELATED WORK

In this section, we briefly review some representative works about using deep learning for image and video coding. These works are divided into the two previously mentioned categories: deep schemes and deep tools. It is worth noting that the deep tools that appeared in response to the JCfP will be discussed in Section III.

A. Deep Schemes

Most of the existing deep schemes are designed for image coding. In these schemes, there are in general two kinds of methods: the first is pixel probability modeling, and the second is auto-encoding. These two kinds of methods are usually combined in the current deep schemes.

Pixel probability modeling originates from the fundamental idea of estimating the probability of an image x in a progressive manner,

p(x) = \prod_{i=1}^{m \times n} p(x_i \mid x_1, \ldots, x_{i-1})    (1)

where x is assumed to be a grayscale image with m × n pixels. The pixels are arranged in a specific order, e.g. the raster scan order, and x_i is the i-th pixel under the specific order. To fulfill lossless coding, the only difficulty is to estimate the conditional probability p(x_i | x_1, ..., x_{i-1}) given the previous pixels x_1, ..., x_{i-1}. Deep learning is adopted for this purpose. For example, in [30], recurrent neural network (RNN) and CNN-based methods are proposed, known as PixelRNN and PixelCNN, respectively. A combination of deep network-based probability estimation and entropy coding (e.g. arithmetic coding) provides a deep lossless coding scheme. For example, in [9], the deep scheme is reportedly better than TIFF, GIF, PNG, JPEG-LS, and JPEG2000-LS for lossless grayscale image compression.

Auto-encoders were developed from the pioneering work of dimensionality reduction with neural networks [31], wherein a network consists of two sub-networks. The encoder sub-network converts a datum to features, and the decoder sub-network converts features back into the data domain. Such a network, trained with images, is ready for image coding as long as the features are quantized to digits (the quantization can be embedded into the encoder sub-network). For example, in [10], [11] RNN-based methods are proposed for variable rate image coding, where variable rates are achieved by iterating the RNN multiple times, making it analogous to a scalable coding scheme. In [12] a CNN-based method is proposed, where the network features a new nonlinear function known as generalized divisive normalization. The network is optimized with a loss function that resembles the joint rate-distortion cost; in addition, the quantized features are further compressed by context-adaptive binary arithmetic coding. Several other CNN-based methods are proposed in [13]–[18], and the variational auto-encoder is also studied for image compression [19].

Several deep schemes combine PixelCNN or similar methods with auto-encoder methods. Similar to pixel probability modeling, PixelCNN can be used for modeling the probabilities of the quantized features [11], [17]. Even without PixelCNN, other entropy coding tools are widely adopted in auto-encoder-based methods, and they are very important for the final compression efficiency [12]. Nowadays, a few deep schemes [15], [18] have achieved higher compression efficiency than BPG, which is an image coding scheme using the HEVC intra coding technology¹.

¹ https://bellard.org/bpg/
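To make the factorization in (1) concrete, the following sketch (written for this overview and not taken from [30] or [9]; all layer widths are arbitrary, and a full PixelCNN additionally distinguishes two mask types) shows how a masked convolution restricts each output to depend only on previously scanned pixels, so that a network can emit the conditional probabilities p(x_i | x_1, ..., x_{i-1}) needed by an arithmetic coder.

```python
# Minimal sketch of autoregressive pixel probability modeling as in Eq. (1).
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskedConv2d(nn.Conv2d):
    """3x3 convolution whose kernel is zeroed at and after the center pixel,
    so the output at pixel i depends only on previously scanned pixels."""
    def __init__(self, in_ch, out_ch):
        super().__init__(in_ch, out_ch, kernel_size=3, padding=1)
        mask = torch.ones_like(self.weight)
        mask[:, :, 1, 1:] = 0   # center pixel and pixels to its right
        mask[:, :, 2, :] = 0    # the row below
        self.register_buffer("mask", mask)

    def forward(self, x):
        return F.conv2d(x, self.weight * self.mask, self.bias, padding=1)

class TinyPixelModel(nn.Module):
    """Predicts a 256-way distribution for every pixel of a grayscale image."""
    def __init__(self, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            MaskedConv2d(1, hidden), nn.ReLU(),
            MaskedConv2d(hidden, hidden), nn.ReLU(),
            nn.Conv2d(hidden, 256, kernel_size=1),
        )

    def forward(self, x):                 # x in [0, 1], shape (B, 1, H, W)
        return self.net(x)                # logits, shape (B, 256, H, W)

# Training with cross-entropy minimizes -log2 p(x), i.e. the number of bits an
# arithmetic coder would spend under this model, which links probability
# modeling to lossless coding.
model = TinyPixelModel()
img = torch.randint(0, 256, (1, 1, 32, 32))
logits = model(img.float() / 255.0)
bits = F.cross_entropy(logits, img[:, 0], reduction="sum") / torch.log(torch.tensor(2.0))
```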


More recently, a few deep schemes have been proposed for video coding. In [20], a video coding scheme is presented, where the intra (I) frames are first compressed by a deep image coding scheme [11]. The other frames are interpolated recursively as in the hierarchical bipredictive (B) coding structure, and the interpolation is also done by trained deep networks. In [21], the proposed scheme is conceptually similar to a hybrid video coding scheme, and for each block, a PixelMotionCNN is proposed to predict the block from previously encoded frames and blocks. The prediction residue is then compressed by an auto-encoder. Both [20] and [21] report compression efficiency on par with H.264.

B. Deep Tools

A hybrid video coding scheme, such as HEVC, is composed of multiple coding tools, including prediction tools, transform tools, entropy coding tools, in-loop filtering tools, and encoding optimization tools. In recent years, a great number of research studies have redesigned individual coding tools based on trained deep networks. Here, we review some representative studies.

Prediction tools include methods for intra-picture prediction, inter-picture prediction, and cross-channel prediction. For intra-picture prediction, both fully-connected network [22] and CNN [32] based methods are proposed. For inter-picture prediction, several works focus on the task of fractional-pixel interpolation [23], [33], [34], while other works deal with bi-directional motion compensation [35], motion compensation refinement [36], combined intra/inter prediction [37], or directly extrapolating a frame for reference [38]. For cross-channel prediction, a multiple hypothesis method is proposed to predict chroma components from luma components [39], and a hybrid network-based method is presented to combine hints of collocated luma and neighboring chroma [24].

Transform tools are less studied. In [25], a CNN-based DCT-like transform is studied for image compression, where the network is virtually an auto-encoder. In [40], an auto-encoder is proposed for compressing the motion predicted residue in video coding, and the quantized features are further compressed by Huffman coding.

Entropy coding tools adopt deep networks to estimate the probabilities of predefined syntax elements, just like the estimation of pixel probability in PixelCNN. The syntax elements that have been studied include the intra prediction mode [26], the quantized coefficients [41], and the transform index wherein the coding scheme enables choosing from multiple transforms [42].

Post-processing and in-loop filtering tools occupy a majority of the studied deep video coding tools, and they are inspired by the confirmed success of deep learning-based compression artifact reduction [43] and image denoising [5]. Post-processing tools are applied solely at the decoder side to improve reconstruction quality, as studied in [44]–[47]. In-loop filtering tools are applied inside the coding loop, i.e. filtered frames are used as references for later frames, as studied in [27], [48]–[50]. The network structure has been a focus of study in these works.

Down- and up-sampling-based coding tools are inspired by the success of deep learning-based super-resolution [4]. Traditionally, down-sampling prior to encoding and up-sampling after decoding is known to be better than direct coding at very low bit-rates. With the help of trained deep networks for down-sampling and/or up-sampling, the performance of down- and up-sampling-based coding is further enhanced [51]. In addition, block-adaptive-resolution coding (BARC) is proposed, where some blocks are down-sampled before encoding but others are directly encoded, so as to suit different local characteristics. BARC together with deep networks is studied for intra frames in [28], [52], for inter frames in [53], and for motion predicted residues in [54], respectively.

Encoding optimization tools include deep learning-based methods that are used only at the encoder side, such as for fast mode decision, rate control, deciding regions of interest, and so on. Fast mode decision forecasts the most probable modes according to a video's characteristics, so as to avoid multi-pass trials of different modes. For fast mode decision, deep learning-based methods have shown remarkable success [29], [55]. Rate control is highly dependent on rate modeling. In the case of the rate-lambda (R-λ) model, a CNN-based method is proposed to estimate the model parameters for different image blocks [56]. In [57], a CNN-based method is adopted to decide the salient regions of an image, followed by rate allocation to improve the quality of the salient regions.

III. OVERVIEW OF DEEP TOOLS IN RESPONSES TO JCFP

In response to the JCfP, as well as during the development of VVC, JVET has witnessed a number of coding tools based on learning deep networks. These tools are proposed for two different objectives: one is to improve compression efficiency with increased encoding/decoding complexity, and the other is to reduce encoding complexity while maintaining compression efficiency. At the current stage, the first objective is more urgent. In this section, we briefly introduce the related deep tools. Some tools will be discussed in Section V in more detail.

A. Intra Prediction

In the proposal by Albrecht et al. [58], a neural network-based intra prediction tool is described [59]. As proposed, a fully connected neural network is used to generate prediction signals for the current to-be-coded block based on its neighboring reconstructed pixels. The network has three hidden layers. There are multiple such networks, known as different modes, and an encoder chooses one mode for each block and signals the choice to a decoder. The mode signaling is also based on a neural network. Another fully connected network is used to predict the probabilities of different modes for the current block based on its neighboring reconstructed pixels. The modes are sorted by the predicted probabilities, and the chosen mode is signaled with the sorted mode list. As a further development of the proposed tool, the prediction signal can be generated in the transformed domain, and a set of non-separable transforms can be applied to compress the prediction residues for different modes.
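As a rough illustration of the tool just described (the layer widths, number of modes, and block size below are made-up placeholders, not the values in [58], [59]), one fully connected network per mode maps the neighboring reconstructed pixels to a block prediction, while a separate network ranks the modes so that the chosen one can be signaled via the sorted mode list:

```python
# Sketch of the two fully connected networks in the JCfP intra tool.
import torch
import torch.nn as nn

NUM_MODES, N_REF, BLOCK = 35, 64, 16 * 16   # hypothetical sizes

predictors = nn.ModuleList([                 # one predictor per mode, 3 hidden layers
    nn.Sequential(nn.Linear(N_REF, 128), nn.ReLU(),
                  nn.Linear(128, 128), nn.ReLU(),
                  nn.Linear(128, 128), nn.ReLU(),
                  nn.Linear(128, BLOCK))
    for _ in range(NUM_MODES)
])
mode_ranker = nn.Sequential(nn.Linear(N_REF, 128), nn.ReLU(),
                            nn.Linear(128, NUM_MODES))

ref = torch.rand(1, N_REF)                                    # neighboring reconstructed pixels
ranking = mode_ranker(ref).argsort(dim=1, descending=True)    # sorted mode list for signaling
best_mode = int(ranking[0, 0])               # a real encoder would pick by RD cost instead
prediction = predictors[best_mode](ref).reshape(16, 16)
```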


B. In-Loop Filtering

Right before the responses to the JCfP, a CNN-based in-loop filtering tool was proposed by Zhou et al. [60]. As proposed, a trained CNN is used as the only in-loop filter before the adaptive loop filter. The CNN replaces the bilateral filter, deblocking filter, and sample adaptive offset for intra frames. Input to the CNN includes not only the reconstructed pixel values, but also the quantization parameter (QP) map. The network consists of eight convolutional layers equipped with rectified linear units (ReLU). A following proposal [61] extends the CNN-based filter to also deal with inter frames, which are different from intra frames. For inter frames, a binary flag is signaled for each 64 × 64 block to indicate whether the block uses the CNN-based filter or not.

In the proposal entitled Deep Learning Video Coding (DLVC), which was proposed by Wu et al. [62]², an alternative CNN-based in-loop filter is described. The filter is applied after the deblocking filter and before sample adaptive offset. For different QPs, different CNN models are prepared. The model to use is selected according to the QP of the current frame, i.e. the model that has the most similar QP. The CNN has 34 convolutional layers organized into 16 blocks with residue connections. A binary flag is signaled for each coding tree unit (CTU) to control the on/off switch for the CNN-based filter.

² The source code of DLVC has been published at https://github.com/FVC2018/DLVC and http://dlvc.bitahub.com/.

In the proposal by Hsu et al., another CNN-based in-loop filter is described [63]. The CNN-based filter is applied after the adaptive loop filter. Input to the CNN includes not only the reconstructed pixel values, but also the prediction signal and the compressed residue signal. One network has eight layers. Different networks are trained for luma and chroma respectively, but their inputs all include both luma and chroma. Multi-level flags are signaled to control the on/off switch for the CNN-based filter: at the slice level there are three modes: slice-all-on, slice-all-off, and CTU-level-decision. If the slice-level mode is CTU-level-decision, then at the CTU level there are three modes: CTU-all-on, CTU-all-off, and block-level-decision. If block-level-decision is selected, then the on/off flags are further signaled for each 32 × 32 block.

C. Down- and Up-Sampling-Based Coding

In the proposal by Bull et al., a CNN-based down- and up-sampling-based coding tool is described [51]. As proposed, the decision to perform down-sampling-based coding is based on some designed features extracted from video frames, and it is applied at the frame level and the group of pictures (GOP) level for low-delay and random-access configurations, respectively. In addition to decreasing spatial resolution, the proposed tool also enables decreasing bit-depth (e.g. from ten bits to eight bits). The spatial down-sampling is performed via a fixed linear kernel, and the bit-depth reduction is achieved by a right shift. On the decoder side, multiple CNN models are trained for spatial up-sampling, bit-depth up-sampling, or joint up-sampling. The CNN consists of 20 layers equipped with ReLU.

In the DLVC proposal [62], a different CNN-based down- and up-sampling-based coding tool is described. As proposed, the tool is applied at the CTU level, and a binary flag is signaled for each CTU to control the on/off switch for the tool. If the tool is on, then the CTU is down-sampled by a factor of 2 and compressed. Then, it is up-sampled to its original resolution. In the proposed tool, both down- and up-sampling are performed by trained CNN models. There is only one down-sampling CNN but multiple up-sampling CNNs for different QPs.

D. Fast Algorithm for Block Partitioning

In the proposal by Bordes et al. [64], a CNN-based tool for quickly determining the block partitioning structure is described [65]. Besides quad-tree-based partitions, as in HEVC, the proposal [64] also introduces binary tree and triple tree-based partitions, which makes it more complex to determine the optimal block partition. The CNN-based tool takes a CTU as input, considers all the possible borders inside the CTU due to block partitioning, and outputs the probability that each possible border is a designated border. With the probabilities, the most probable partitioning modes are derived for the CTU, while a lot of less probable modes are filtered out. In this way, the tool achieves encoding time savings while maintaining compression efficiency.

IV. WHY DOES DEEP LEARNING BENEFIT VIDEO CODING?

Witnessing the successes of deep schemes and deep tools for image/video coding as mentioned before, one may be partially convinced by the following claim: the successes are not accidental. There should be some fundamental reasons for why deep learning benefits video coding. In this section, we give our interpretations of this question. We hereby define deep learning as a class of techniques belonging to statistical machine learning technology, and the techniques have a unique feature in their computational models, i.e. deep networks. In fact, most of our analyses are also valid for general statistical machine learning.

A. Interpretation from the Statistical Learning Paradigms

Statistical learning includes different tasks, such as regression, classification, and density estimation [66]. First let us consider density estimation, i.e. given a dataset of observations {x_i}_{i=1}^{N} about a random variable x, we need to estimate the probability distribution p(x). According to the maximum likelihood criterion, an estimator can be

q^*(x) = \arg\min_{q(x)} \sum_{i=1}^{N} -\log q(x_i)    (2)

Note that density estimation is also the key to the success of source coding [67]. The optimal code length for the information source x is −log p(x). That is to say, when we have a set of signals, we can use a statistical learning technique to estimate the probability distribution of that set. As long as the estimator is optimal in the maximum likelihood sense, it leads to the optimal lossless coding for that set of signals.
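A small numerical experiment illustrates the link between (2) and lossless coding (the four-symbol source below is hypothetical): the maximum-likelihood estimate of a discrete distribution is simply the empirical histogram, and coding with −log₂ q(x) bits per symbol approaches the source entropy.

```python
# Toy illustration of the connection between Eq. (2) and lossless coding.
import numpy as np

rng = np.random.default_rng(0)
true_p = np.array([0.5, 0.25, 0.125, 0.125])        # hypothetical 4-symbol source
samples = rng.choice(4, size=10_000, p=true_p)

# Maximum-likelihood estimator: the q minimizing sum_i -log q(x_i) is the histogram.
q = np.bincount(samples, minlength=4) / samples.size

ideal_bits_per_symbol = -np.log2(q[samples]).mean()  # achievable by arithmetic coding
entropy = -(true_p * np.log2(true_p)).sum()
print(f"estimated rate {ideal_bits_per_symbol:.3f} bits vs entropy {entropy:.3f} bits")
```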


Then, let us consider classification, where we are given a dataset {(x_i, y_i)}_{i=1}^{N} for two related variables, x and y, and y_i ∈ Y ⊂ N, where N is the set of natural numbers. Our goal is to identify a function ŷ = f(x) to characterize the relationship between x and y. For coding, let x_i be a signal and y_i be its corresponding code; note that it is not difficult to convert any binary sequence into an integer. If we somehow have a dataset of signals plus their codes, we can use a statistical learning technique to achieve a function that converts a signal into its code.

From the above analyses, it is clear now that statistical learning can indeed be utilized for coding, at least in theory. In the following, we extend our discussions to consider practical video coding solutions.

B. Interpretation from the Coding Strategies

The current mainstream video coding schemes, such as HEVC, are known as hybrid coding schemes. These schemes consist of multiple coding tools, but fundamentally, the tools follow either one of two strategies: predictive coding or transform coding.

Predictive coding can be described as a supervised learning task, or more specifically, a regression or classification task, depending on whether the variables to predict are regarded as continuous or discrete. For example, intra-picture prediction, inter-picture prediction, and cross-channel prediction can all be considered as regression tasks [22]–[24]. Entropy coding requires probability estimation, which is also a typical regression task [26]. Post-processing, or in-loop filtering, is regression, too [27]. Fast mode decision can be regarded as a classification task [29]. In deep schemes, pixel probability modeling can also be viewed as predictive coding [9]. Currently, supervised learning is the paradigm wherein deep learning exhibits its capability most profoundly [1]. Therefore, currently, deep predictive coding schemes (e.g. [9]) and deep predictive coding tools (e.g. [22]) receive more attention than the others.

Transform coding, on the other hand, is actually an unsupervised learning task, similar to dimensionality reduction and clustering. In these tasks, we are given merely a dataset {x_i}_{i=1}^{N}, but we want to find a function of interest y = f(x) without knowing any examples y_i. Auto-encoding is a typical transform coding scheme [10]–[19]. In fact, auto-encoders were invented for dimensionality reduction, which is a typical unsupervised learning task [31]. So far, deep transform tools in the literature are all similar to the auto-encoder [25], [40]. In addition, if we consider down- and up-sampling jointly, the down-sampling operation can be viewed as an encoder and the up-sampling operation can be viewed as a decoder. Then, the joint training of down- and up-sampling resembles an auto-encoder [52], and can be viewed as a transform coding tool.

There is virtually no module in a hybrid coding scheme that cannot be viewed as performing a learning task (of course excluding arithmetic coding, which actually converts a symbol/number together with its probability value into bits). Thus, we can claim that deep learning is by all means useful for video coding.

C. Benefits of Deep Networks

Now we can safely claim that learning-based video coding is feasible, or in other words, video coding can be viewed as a machine learning problem, or a set of machine learning sub-problems. We now turn to the case of deep learning, considering why we need deep networks, and how we can achieve these deep networks.

Neural networks are universal approximators, i.e. they are capable of approximating any practically useful function [68], which makes them appealing in machine learning tasks. To be a universal approximator, a network does not need to be deep; instead, the network can have only one hidden layer and as many neurons as required, i.e. the network is shallow but wide. However, recent studies favor deep networks with limited widths, which are believed to provide increased capability in modeling nonlinearity [1]. In other words, if the function to be learned is presumed to be very complex, then the adoption of a deep network is worthwhile. Video coding is a highly nonlinear process if we consider the function as mapping a video into a number of bits. Therefore, we believe that using deep networks is going to be very useful for video coding in the future.

As noted, the more nonlinear the target model is, the deeper and more complex the network needs to be [1]. Given the highly non-linear nature of video coding, a very large and deep neural network may be needed. Such a network is difficult to train due to its sheer size. It may need to be trained with more data than is presently available. This may be a reason why it is currently difficult for a deep scheme, designed from scratch, to outperform traditional non-deep coding schemes. The hybrid video coding framework has been developed for more than three decades, and it has combined the wisdom of many video coding experts. As such, we believe the current video coding scheme is a good starting point for investigating deep learning-based video coding. More specifically, we keep the entire framework unchanged, but we may consider each module inside the framework, and try to replace a handcrafted module with a deep network that performs the same task. This way, we can have many deep tools and “deepen” the framework. We prefer this way to directly building up deep schemes.

V. DISCUSSION OF SELECTED DEEP TOOLS

There are four kinds of deep tools as reviewed in Section III. The former three kinds relate to video coding standards, and the last kind is about fast encoding algorithms. In the following, we select one tool from each of the former three kinds to discuss.

A. NN-Based Intra Prediction

As mentioned before, an NN-based intra prediction tool was proposed in response to the JCfP [59]. The tool has been continuously researched and improved during the development of VVC. The focus of improvement is to reduce the number of parameters in the networks and to reduce the computational complexity. In the initial version [59], each network has three hidden layers.

In a later version [69], each network has only one layer without nonlinear activation, i.e. each network reduces to a matrix plus a bias vector. Thus, the tool was formally named matrix-based intra prediction (MIP). MIP has been adopted into VVC and integrated into VTM (the VVC reference software). Here, we describe the MIP tool that was included in VTM version 5.0 in detail.

1) Method: In VVC, block partitioning is more flexible, so there are blocks of different sizes. We take the case of 8 × 4 blocks as an example. The process of MIP for 8 × 4 blocks is depicted in Fig. 1. The processes for other block sizes are similar. The process in Fig. 1 consists of three steps: reference pixel averaging, affine prediction, and linear interpolation.

Fig. 1. Illustration of the matrix-based intra prediction (MIP) [69] for 8 × 4 blocks. Here, small blocks with different colors denote pixels of different groups.

In the first step, the reference pixels are processed. Here, the reference pixels include the row above the target block (i.e. t_1, ..., t_8) and the column to the left (i.e. l_1, ..., l_4). t_1, ..., t_8 are locally averaged into r_1, ..., r_4, which are then combined with l_1, ..., l_4 to form an 8-dimensional (8D) vector. In the second step, an affine operation is conducted to calculate a 16D vector from the input 8D vector, i.e.

[p_1, \ldots, p_{16}]^T = A_k [r_1, \ldots, r_4, l_1, \ldots, l_4]^T + b_k    (3)

where A_k is a 16 × 8 weight matrix and b_k is a 16D bias vector, both of which are achieved by offline training, and k indicates the prediction mode. There are multiple modes; for each mode, a weight matrix and a bias vector are prepared. The calculated vector is then used to fill in the pixels at the odd columns of the target block. In the third step, a linear interpolation is performed by using the values of l_1, ..., l_4 and p_1, ..., p_16 to generate the values of q_1, ..., q_16, i.e. the pixels at the even columns. This completes the intra prediction.

2) Comparison: Compared to the initial version [59], MIP is greatly simplified. First, a network with three hidden layers is replaced by a network with only one layer. Second, the number of reference pixels is reduced. In [69], only one row and one column of the reconstructed picture are used as reference pixels, while in [59], there are multiple rows and columns of reference pixels. Third, the number of pixels predicted by the network (or the affine operation) is reduced. For 8 × 4 blocks, the affine operation generates 16 pixels and the linear interpolation generates the remaining 16 pixels. This is different from [59], where the network generates all pixels. It is worth noting that, in the affine operation case, the weight matrix size should be d_out × d_in and the bias vector size should be d_out × 1, where d_in and d_out are the input and output dimensions. Thus, reducing the number of reference pixels, performing local averaging of reference pixels, and reducing the number of predicted pixels all help reduce the required number of parameters of the network itself.

3) Results: In the proposal [69], multiple tests are performed under two settings. In the first setting, blocks of different sizes use different weight matrices and bias vectors. In total, these weights and biases require 6,336 parameters and occupy 7.92 kilobytes of memory at 10-bit precision. In the second setting, the same set of weight matrices and bias vectors is used for all blocks. It then needs 5,760 parameters and occupies 7.20 kilobytes of memory at 10-bit precision. Note that all these parameters are integers, and the network uses integer arithmetic only.

Typical results of MIP on top of VTM version 4.0 are summarized in Table I. Under the all-intra configuration, the BD-rate reductions are 0.79% and 0.82% for the two settings, respectively. The decoding time change is negligible, showing that MIP is efficient in controlling the computational complexity of intra prediction (network inference). The encoding time increase is due to additional mode selection, which may be addressed by a fast mode decision algorithm in the future.

MIP is a machine learning-based coding tool, in the sense that the weights and biases are trained. However, it is not deep, as the network has only one layer. It demonstrates that sometimes neural network implementations can show the way to non-deep solutions, especially when computational complexity is considered.
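To make the three steps of MIP concrete, the following sketch reproduces the data flow of Fig. 1 for an 8 × 4 block; the matrix A_k and vector b_k are random placeholders rather than the trained VTM parameters, and the exact sample positions and integer arithmetic of the VTM implementation are simplified.

```python
# Sketch of MIP for an 8x4 block: reference averaging, affine prediction (Eq. 3),
# and linear interpolation. All parameters are made up for illustration.
import numpy as np

def mip_predict_8x4(top, left, A_k, b_k):
    """top: 8 reconstructed pixels above the block; left: 4 pixels to its left."""
    # Step 1: average pairs of the 8 top neighbours down to 4 values.
    r = top.reshape(4, 2).mean(axis=1)
    ref = np.concatenate([r, left])                  # 8D input vector

    # Step 2: affine prediction of 16 pixels, placed in alternating columns.
    p = A_k @ ref + b_k                              # shape (16,)
    pred = np.zeros((4, 8))
    pred[:, 1::2] = p.reshape(4, 4)

    # Step 3: linear interpolation for the remaining columns, using the left
    # neighbours as the column "before" the block.
    anchors = np.concatenate([left[:, None], pred[:, 1::2]], axis=1)  # (4, 5)
    pred[:, 0::2] = (anchors[:, :-1] + anchors[:, 1:]) / 2
    return pred

# Hypothetical mode-k parameters; in VTM these are trained integer matrices.
A_k = np.random.default_rng(1).normal(scale=0.1, size=(16, 8))
b_k = np.zeros(16)
block = mip_predict_8x4(np.arange(8, dtype=float), np.arange(4, dtype=float), A_k, b_k)
```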


TABLE I
RESULTS OF MIP COMPARED TO VTM 4.0 UNDER ALL-INTRA CONFIGURATION

Setting 1    Y        U        V        EncT   DecT
Class A1    −1.19%   −0.69%   −0.68%   138%   98%
Class A2    −0.63%   −0.09%   −0.16%   137%   98%
Class B     −0.67%   −0.16%   −0.11%   139%   100%
Class C     −0.71%   −0.21%   −0.23%   138%   99%
Class E     −0.82%   −0.30%   −0.36%   137%   100%
Overall     −0.79%   −0.27%   −0.28%   138%   99%
Class D     −0.81%   −0.34%   −0.38%   138%   101%

Setting 2    Y        U        V        EncT   DecT
Class A1    −1.15%   −0.60%   −0.64%   141%   97%
Class A2    −0.67%   −0.07%   −0.18%   139%   98%
Class B     −0.73%   −0.12%   −0.21%   143%   100%
Class C     −0.72%   −0.25%   −0.25%   141%   100%
Class E     −0.90%   −0.25%   −0.39%   140%   99%
Overall     −0.82%   −0.24%   −0.32%   141%   99%
Class D     −0.81%   −0.20%   −0.44%   141%   101%

B. CNN-Based In-Loop Filtering

Several CNN-based in-loop filters were proposed in response to the JCfP [60]–[63]. During the development of VVC, several new filters were proposed and researched. In this paper, we choose the dense residual network-based in-loop filter (DRNLF) as an example to discuss. In an earlier version [70], DRNLF is reported to achieve impressive coding gains. In a later version [71], multiple modifications are proposed to trade off computational efficiency and performance. This study is quite typical.

1) Proposed Scheme: There are multiple in-loop filters in VVC, including the deblocking filter (DF), sample adaptive offset (SAO), and adaptive loop filter (ALF). DRNLF is introduced into VVC as an additional filter placed after DF and before SAO and ALF, as shown in Fig. 2. DRNLF uses the principle of rate-distortion optimization (RDO) to decide whether to apply the CNN-based filter at the CTU level, and binary flags are written into the bitstream accordingly [70], [71].

Fig. 2. The proposed decoding scheme based on VVC and integrating DRNLF [71].

2) Network Structure:

a) Dense residual unit: The network contains several units called dense residual units (DRUs), which were first introduced in [70] and then revised in [71]. As shown in Fig. 3, the DRU in [70] consists of one 1 × 1 convolutional layer and two 3 × 3 convolutional layers with ReLU between them. There are two shortcuts in the DRU. The inner shortcut performs residual learning by adding the output of the 1 × 1 convolutional layer to that of the last 3 × 3 convolutional layer. The outer shortcut directly passes the original input to the next unit. Let x be the input of the DRU; the output of the DRU can be formulated as:

F_{DRU}(x) = H(x, g(x) + f(g(x)))    (4)

where H(·) denotes the concatenation operation, g(·) represents the 1 × 1 convolutional layer, and f(·) represents the two 3 × 3 convolutional layers. The outer shortcut in the DRU allows the signal to bypass the convolutional layers and directly propagate to the next DRU, which enables subsequent units to explore new features from preceding inputs. However, such shortcuts may cause the subsequent units to have too much input data. Then, the 1 × 1 convolutional layer serves as a bottleneck layer. It effectively reduces the number of channels that otherwise increases due to the concatenation. By reducing the number of channels, the number of parameters in the network is reduced, and hence the computational cost is also reduced. Note that the bottleneck layer is not used in the first DRU since its input has fewer channels. No activation is appended after the bottleneck layer, so that the layer simply generates a linear combination of the inputs. Since the input and the target in the task of in-loop filtering are highly similar, most of the residuals between the input and the target are small or zero. Employing residual learning contributes to faster convergence and easier training. In [71], the 3 × 3 (conventional) convolutional layer is replaced by a 3 × 3 depth-wise separable convolutional (DSC) layer [72] in each DRU, as depicted in Fig. 5. This greatly reduces the number of parameters.

Fig. 3. The dense residual unit (DRU) in [70], where ‘+’ stands for addition and ‘×’ stands for concatenation.

b) Dense residual network (DRN): The network structure of the DRN proposed in [70] is shown in Fig. 4, where N and M denote the number of DRUs and the number of convolution kernels in each layer, respectively. The revised DRN proposed in [71] is depicted in Fig. 5. Compared with Fig. 4, there are two important changes in Fig. 5. First, besides the multiple DRUs, there are four conventional layers in Fig. 4, but there are only two in Fig. 5. This change contributes to slightly better performance [71]. Second, a normalized QP map is concatenated with the reconstructed frame as the input to the DRN in Fig. 5, whereas in Fig. 4, only the reconstructed frame is used as the input. In accordance with this change, there are multiple network models trained for different QPs in [70], but there is only one model that is used for multiple QPs in [71]. Moreover, in order to reduce the computational cost of the network, the number of DRUs is reduced from eight to four and the number of convolution kernels is reduced from 64 to 32, respectively. The total number of model parameters is 22K in [71], compared to 812K in [70].

Fig. 4. Network structure of DRN in [70].

Fig. 5. Network structure of DRN in [71].
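A rough PyTorch rendering of (4) and of the revised DRN is given below; it is our own sketch rather than the proposal's source code, the exception for the first DRU (no bottleneck) and the depth-wise separable variant are omitted, and the channel counts merely follow the N = 4, M = 32 setting discussed later.

```python
# Sketch of the dense residual unit of Eq. (4) and a tiny DRN-like network.
import torch
import torch.nn as nn

class DenseResidualUnit(nn.Module):
    def __init__(self, in_ch, ch=32):
        super().__init__()
        self.bottleneck = nn.Conv2d(in_ch, ch, 1)           # g(.), no activation
        self.f = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(),
            nn.Conv2d(ch, ch, 3, padding=1),
        )

    def forward(self, x):
        g = self.bottleneck(x)
        y = g + self.f(g)                                    # inner shortcut
        return torch.cat([x, y], dim=1)                      # outer shortcut H(x, .)

class TinyDRN(nn.Module):
    """Reconstructed frame + normalized QP map in, filtered frame out."""
    def __init__(self, n_units=4, ch=32):
        super().__init__()
        self.head = nn.Conv2d(2, ch, 3, padding=1)           # luma + QP map
        units, width = [], ch
        for _ in range(n_units):
            units.append(DenseResidualUnit(width, ch))
            width += ch                                      # concatenation grows channels
        self.units = nn.Sequential(*units)
        self.tail = nn.Conv2d(width, 1, 3, padding=1)

    def forward(self, rec, qp_map):
        x = torch.cat([rec, qp_map], dim=1)
        return rec + self.tail(self.units(self.head(x)))     # residual learning

out = TinyDRN()(torch.rand(1, 1, 64, 64), torch.full((1, 1, 64, 64), 32 / 63))
```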


3) Training: A natural image dataset, known as DIV2K³, is utilized to generate the training set and validation set for the proposed network. There are 800 images in the training set and 100 images in the validation set. The images in the DIV2K dataset are converted from the RGB color space to the YUV color space before compression. The images are compressed by VTM using different QPs under the all-intra configuration. The compressed images and their corresponding QPs are used to train the network, with the images before compression used as the labels. Given a set of images with compression artifacts {X_i} and the corresponding labels {Y_i}, the goal of the training is to minimize the following loss:

L(\Theta) = \frac{1}{N} \sum_{i=1}^{N} L_i(\Theta)    (5)

where

L_i(\Theta) = \sum_{c=1}^{3} w_c \cdot \| F(X_i; \Theta)^{(c)} - Y_i^{(c)} \|^2    (6)

where Θ denotes the parameters of the proposed DRN, N is the number of training samples, c is the index of components (e.g. RGB or YUV), w_c is the weight of the c-th component, ‖·‖² denotes the mean-squared error (MSE), and the superscript (c) is to select the c-th component.

³ https://data.vision.ee.ethz.ch/cvl/DIV2K/

There are two important changes from [70] to [71] regarding training. First, multiple models are trained for different QPs in [70], i.e. the training data generated with different QPs {22, 27, 32, 37} are separated. But only one model is trained in [71] with all the training data generated with different QPs. Second, the training data are in the RGB color space in [70], but in the YUV color space with weights 10:1:1 in [71].

4) Results: Some results of DRNLF on top of VTM version 2.0.1, reported in [71], are presented in Table II. Note that the network inference was performed on CPU only. Under the all-intra configuration, the BD-rate reduction is 2.34%, which is significant. However, the decoding time increase is huge, which is a common problem in CNN-based in-loop filters. The encoding time increase is marginal.

TABLE II
RESULTS OF DRNLF COMPARED TO VTM 2.0.1 UNDER ALL-INTRA CONFIGURATION

            Y        U        V        EncT   DecT
Class A1   −1.56%   −2.21%   −4.20%   124%   6611%
Class A2   −1.89%   −1.24%   −0.83%   110%   3818%
Class B    −1.45%   −1.71%   −2.47%   108%   5120%
Class C    −3.08%   −1.92%   −1.75%   107%   6914%
Class E    −4.06%   −0.81%   −1.39%   117%   11210%
Overall    −2.34%   −1.61%   −2.14%   112%   6198%
Class D    −3.74%   −2.13%   −2.41%   105%   6635%

To validate the proposed technique, multiple tests are conducted in the proposal [71]. Table III gives a summary of the tests. More details and comparison results are described in the following.

TABLE III
DIFFERENT TEST CONDITIONS FOR DRNLF. N AND M DENOTE THE NUMBER OF DENSE RESIDUAL UNITS AND THE NUMBER OF CONVOLUTIONAL KERNELS, RESPECTIVELY. C DENOTES THE COLOR SPACE OF THE TRAINING DATA. W DENOTES THE WEIGHTS OF COMPONENTS IN (6).

Test ID   N   M    C     W        Network
Test 1    8   64   RGB   1:1:1    Fig. 4
Test 2    8   64   YUV   10:1:1   Fig. 5 w/o DSC
Test 3    4   32   RGB   1:1:1    Fig. 5 w/o DSC
Test 4    4   32   YUV   1:1:1    Fig. 5 w/o DSC
Test 5    4   32   YUV   4:1:1    Fig. 5 w/o DSC
Test 6    4   32   YUV   10:1:1   Fig. 5 w/o DSC
Test 7    4   64   YUV   10:1:1   Fig. 5 w/o DSC
Test 8    8   32   YUV   10:1:1   Fig. 5 w/o DSC
Test 9    4   32   YUV   10:1:1   Fig. 5 w/ DSC

Table IV shows the overall BD-rate results of Test 1 and Test 2. In Test 1, the network shown in Fig. 4 and the training method described in [70] are employed. In Test 2, the network shown in Fig. 5 (but using normal convolution rather than DSC) and the training method described in [71] are employed. Comparison between Test 1 and Test 2 shows that the proposed technique in [71] is better than that in [70].

TABLE IV
COMPARISON OF TEST 1 AND TEST 2

          Y        U         V
Test 1   −3.06%   −12.09%   −13.24%
Test 2   −4.42%   −9.90%    −10.98%
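The component-weighted loss of (5)–(6) can be written compactly as follows (an assumed form using a per-component mean rather than a sum over pixels; the 10:1:1 weights are those recommended in [71]):

```python
# Sketch of the component-weighted MSE of Eqs. (5)-(6) with Y:U:V = 10:1:1.
import torch

def drnlf_loss(pred, target, weights=(10.0, 1.0, 1.0)):
    """pred/target: tensors of shape (N, 3, H, W) holding the Y, U, V planes."""
    loss = 0.0
    for c, w_c in enumerate(weights):
        loss = loss + w_c * torch.mean((pred[:, c] - target[:, c]) ** 2)
    return loss
```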


Tests 3–6 are conducted to investigate the influences of the training data and the loss function. In these four tests, the network structure is the same, and the networks are trained with different training sets and different weights in the loss function. To be specific, Test 3 is trained on RGB images with weights of 1:1:1, and Tests 4–6 are trained on YUV images with weights of 1:1:1, 4:1:1, and 10:1:1, respectively. The results are shown in Table V. It can be observed that training with the YUV color space and weights of 10:1:1 achieves a balanced coding gain among Y, U, and V. This setting is recommended in [71].

TABLE V
COMPARISON OF TRAINING DATA AND LOSS FUNCTION FOR DRNLF

          C     W        Y        U        V
Test 3   RGB   1:1:1    −1.84%   −8.17%   −9.18%
Test 4   YUV   1:1:1    −1.70%   −8.01%   −9.33%
Test 5   YUV   4:1:1    −2.49%   −7.00%   −7.79%
Test 6   YUV   10:1:1   −2.84%   −3.97%   −4.13%

Tests 2 and 6–9 are conducted to investigate the influence of the network structure. In these five tests, the training data and loss function are the same, and the networks are configured with different hyper-parameters (N and M) and with conventional convolution or DSC. Table VI shows the overall BD-rate results of these tests. In addition, the number of parameters is calculated for each network; the ratio is calculated with respect to the number of parameters of Test 2. The numbers and ratios are also summarized in Table VI. It can be observed that the BD-rate saving in general decreases as the number of parameters decreases. Fig. 6 shows the relationship more explicitly. If we use the ratio of BD-rate saving (Y channel) over the number of parameters as a metric to evaluate the networks, then Test 9 achieves the best result. Thus, the setting of Test 9 is recommended in [71], which corresponds to the results shown in Table II.

TABLE VI
COMPARISON OF NETWORKS FOR DRNLF

          N   M    #Params   Ratio   Y        U         V
Test 1    8   64   812035    110%    −3.06%   −12.09%   −13.24%
Test 2    8   64   738755    100%    −4.42%   −9.90%    −10.98%
Test 6    4   32   85347     12%     −2.84%   −3.97%    −4.13%
Test 7    4   64   336579    46%     −3.30%   −8.06%    −9.09%
Test 8    8   32   186083    25%     −3.68%   −6.94%    −8.41%
Test 9    4   32   22371     3%      −2.34%   −1.61%    −2.14%

Fig. 6. The relationship between BD-rate saving (Y channel) and the number of parameters (shown in ratio), see also Table VI.

C. CNN-Based Block-Adaptive-Resolution Coding

Down- and up-sampling-based coding with CNN was proposed in response to the JCfP [51], [62]. The proposal [51] performs down- and up-sampling at the frame level and the GOP level, while the proposal [62] performs down- and up-sampling at the block level. The latter proposal is named block-adaptive-resolution coding (BARC). BARC is more appealing for pursuing a higher compression ratio, because natural video contains regions with different characteristics, and down/up-sampling-based coding may be more suitable for regions with less detail, but less suitable for highly textured regions. We discuss the CNN-based BARC in the following.

A number of prior studies have used CNNs for better up-sampling, known as super-resolution (SR) [4]. Conversely, it is also possible to carry out down-sampling by CNN, known as compact-resolution (CR) [52]. The BARC scheme for intra frames with CRCNN and CNN-SR was proposed in [62]. The scheme is depicted in Fig. 7.

Fig. 7. The proposed CNN-based block-adaptive-resolution coding scheme based on HEVC. Note the blue colored blocks.

1) CNN for Down-Sampling: To train a CNN for down-sampling (i.e. CRCNN), the usual supervised learning strategy is not applicable because there is no ground-truth for the low-resolution images. To tackle this difficulty, an unsupervised learning strategy is proposed, as shown in Fig. 8. Here, the key idea is to compare the original image and the low-resolution (i.e. compact-resolved) image. Since these two images have different resolutions, two loss terms are designed to compare them at the two resolutions, respectively. First, the compact-resolved image is up-sampled to the original resolution for comparison, resulting in the reconstruction loss,

J_{rec} = \| g(f(x)) - x \|^2    (7)

where x is the original image, and f and g denote the mapping functions for CRCNN and up-sampling, respectively. For the up-sampling, we may adopt either a predefined up-sampling operation or a CNN-based up-sampling that can even be trained. Second, the original image is down-sampled by another predefined down-sampling operation to the low resolution for comparison, which results in the regularization loss:

J_{reg} = \| f(x) - F(x) \|^2    (8)

where F is the predefined down-sampling operation. In practice, the function F used is a low-pass filter plus a decimation function.

Fig. 8. The proposed scheme for training CRCNN with two loss terms.

Combining (7) and (8) leads to the entire objective function for training:

J = J_{rec} + \lambda \cdot J_{reg} = \| g(f(x)) - x \|^2 + \lambda \| f(x) - F(x) \|^2    (9)

where λ is a parameter that controls the relative weight of the regularization loss. When λ is larger, the compact-resolved image will be smoother and contain less high-frequency information. Meanwhile, less information of the original image will be preserved.
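The objective in (9) can be prototyped as below; this is a simplified sketch in which both the up-sampling g(·) and the predefined down-sampling F(·) are replaced by bicubic resampling, whereas [62] pairs the CRCNN with a trainable CNN-SR and a low-pass filter plus decimation.

```python
# Sketch of the unsupervised CRCNN objective of Eq. (9).
import torch
import torch.nn as nn
import torch.nn.functional as F

crcnn = nn.Sequential(                      # f(.): learned 2x down-sampling
    nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 1, 3, stride=2, padding=1),
)

def barc_loss(x, lam=0.1):
    y = crcnn(x)                                            # compact-resolved image f(x)
    up = F.interpolate(y, scale_factor=2, mode="bicubic")   # g(.): fixed up-sampling here
    ref = F.interpolate(x, scale_factor=0.5, mode="bicubic")  # F(.): predefined down-sampling
    j_rec = F.mse_loss(up, x)                               # Eq. (7)
    j_reg = F.mse_loss(y, ref)                              # Eq. (8)
    return j_rec + lam * j_reg                              # Eq. (9)

loss = barc_loss(torch.rand(1, 1, 64, 64))
loss.backward()
```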


Coding Parameters
Setting

Low-Resolution
CRCNN CNN-SR
Intra Coding

Simple DS/CRCNN Selection


Input Image
Low-Resolution
Simple DS CNN-SR
CTU
U Intra Coding
Full/Low-Resolution
Coding Selection
Split Into CTUs Full-Resolution
Intra Coding

Reconstructed Image

Deblocking Second Stage


& SAO Up-Sampling

Fig. 7. The proposed CNN-based block-adaptive-resolution coding scheme based on HEVC. Note the blue colored blocks.

A CNN-SR is specifically designed by balancing its performance and complexity. The designed CNN-SR has several key ingredients and is shown in Fig. 9. First, multi-scale kernels (in the second and fourth layers) are used in hopes of effectively aggregating multi-scale information [73], [74]. Second, the resolution increase is embedded into CNN-SR using a trainable deconvolutional layer. In this way, the mapping from an LR (low-resolution) image to an HR (high-resolution) image is learned without involving any manual interpolation. Third, residual learning [8] is borrowed by labeling the LR image with the difference between the original block and its degraded version generated by the discrete cosine transform based interpolation filter (DCTIF). In addition to the above network designed for up-sampling the luma component, another network is developed for up-sampling chroma components by augmenting the luma network with two features: incorporating LR luma information for inference and jointly learning Cb and Cr. Readers may refer to [28] for further details about the luma and chroma CNN-SR.

Fig. 8. The proposed scheme for training CRCNN with two loss terms. (Diagram: the original image x passes through CRCNN to give the CR image f(x), which is up-scaled to the reconstructed image g(f(x)), yielding the loss $\|g(f(x)) - x\|_2^2$; a naïve DS branch F(x) yields the loss $\|f(x) - F(x)\|_2^2$.)

Fig. 9. The network structure of CNN-SR for up-sampling the luma component. The numbers shown inside the blocks indicate the kernel sizes, and the numbers above the blocks indicate the numbers of output channels. Note that there are two layers with variable kernel sizes: in the second layer, there are 16 kernels with 5 × 5 and 32 kernels with 3 × 3, so the number of output channels is 16 + 32 = 48; in the fourth layer, there are 16 kernels with 3 × 3 and 32 kernels with 1 × 1. (Blocks in the diagram: Down-Sampled CTU (luma), Convolution for Feature Extraction, Deconvolution for Up-Sampling, Convolution for Reconstruction, a parallel DCTIF Up-Sampling path, and the Reconstructed CTU (luma).)
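To make the design in Fig. 9 more concrete, the following is a minimal PyTorch sketch of such an up-sampler. The channel widths, kernel sizes, deconvolution stride, and the bicubic interpolation standing in for DCTIF are illustrative assumptions; the exact configuration of the authors' network is given in [28].

# Hypothetical sketch of a CNN-SR-style luma up-sampler (not the exact network of [28]):
# multi-scale convolutions, a trainable deconvolution for the 2x resolution increase,
# and residual learning on top of a fixed interpolation (bicubic as a stand-in for DCTIF).
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiScaleConv(nn.Module):
    """Two parallel convolution branches with different kernel sizes, concatenated."""
    def __init__(self, in_ch, ch_a, k_a, ch_b, k_b):
        super().__init__()
        self.branch_a = nn.Conv2d(in_ch, ch_a, k_a, padding=k_a // 2)
        self.branch_b = nn.Conv2d(in_ch, ch_b, k_b, padding=k_b // 2)

    def forward(self, x):
        return torch.cat([self.branch_a(x), self.branch_b(x)], dim=1)

class CNNSRSketch(nn.Module):
    def __init__(self):
        super().__init__()
        self.feat = nn.Conv2d(1, 64, 5, padding=2)                      # feature extraction
        self.ms1 = MultiScaleConv(64, 16, 5, 32, 3)                     # 16@5x5 + 32@3x3 -> 48
        self.up = nn.ConvTranspose2d(48, 48, 12, stride=2, padding=5)   # trainable 2x up-sampling
        self.ms2 = MultiScaleConv(48, 16, 3, 32, 1)                     # 16@3x3 + 32@1x1 -> 48
        self.rec = nn.Conv2d(48, 1, 3, padding=1)                       # reconstructs the residue

    def forward(self, lr_ctu):
        # A fixed interpolation provides the base; the network only learns the residue.
        base = F.interpolate(lr_ctu, scale_factor=2, mode="bicubic", align_corners=False)
        x = F.relu(self.feat(lr_ctu))
        x = F.relu(self.ms1(x))
        x = F.relu(self.up(x))
        x = F.relu(self.ms2(x))
        return base + self.rec(x)

# Usage example: a 32x32 down-sampled luma CTU is brought back to 64x64.
# y = CNNSRSketch()(torch.randn(1, 1, 32, 32))   # y.shape == (1, 1, 64, 64)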
3) Coding Parameter Setting: In BARC, blocks are coded at different resolutions, so their coding parameters are different, too. Note that the final quality is evaluated at the original resolution; thus, for low-resolution coded blocks, we need to consider the coding distortion not at the low resolution but at the full resolution during the RDO process. Nonetheless, if we want to use full-resolution distortion to decide modes based on RDO, we need to up-sample a block many times during the mode selection process, which is hard to implement. The full-resolution distortion (i.e. $D_{full}$) is a mixture of the distortion incurred by low-resolution coding (i.e. $D_{low}$) and the distortion incurred by down- and up-sampling. An accurate model of $D_{full}$ and $D_{low}$ would be complex in theory. Fortunately, an approximately linear relationship exists between $D_{full}$ and $D_{low}$ for a block [28], i.e.

$D_{full} = \alpha \times D_{low} + \beta$    (10)

where α and β are two parameters dependent on the content of the block. Thus, during the mode selection process in low-resolution coding, $D_{low} + \lambda_{low} R$ can be used to replace $D_{full} + \lambda_{full} R$, where $\lambda_{low} = \lambda_{full} / \alpha$ and $\lambda_{full}$ is the default value for RDO. In [28], the α values are calculated for a large number of blocks when the down-sampling ratio is 1/2 × 1/2. Most of the values are between 3.0 and 6.0, and the mode of the distribution is around 4.0. Thus, for simplicity, $\lambda_{low} = \lambda_{full} / 4$ is used for all low-resolution coded blocks [62]. It is possible to apply content-adaptive λ values in future work.
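As an illustration only, the RD-cost substitution described above amounts to a sketch like the following during mode selection for a low-resolution coded block; the function and parameter names are hypothetical, and β is a per-block constant that does not change the ranking among candidate low-resolution modes, so it is omitted from the cost.

# Hypothetical sketch of the adjusted RD cost for low-resolution coded blocks:
# D_low + lambda_low * R stands in for D_full + lambda_full * R,
# with lambda_low = lambda_full / alpha and alpha fixed to 4 as in [62].
def low_resolution_rd_cost(d_low, rate, lambda_full, alpha=4.0):
    lambda_low = lambda_full / alpha
    return d_low + lambda_low * rate

def estimated_full_resolution_distortion(d_low, alpha=4.0, beta=0.0):
    # Linear model of (10); alpha and beta would be content-dependent in general.
    return alpha * d_low + beta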
4) Second-Stage Up-Sampling: As mentioned in Section V-C2, the first stage of up-sampling is performed right after the down-sampled CTU is coded and reconstructed.


However, the performance of this up-sampling is compromised due to the absent bottom and right boundaries, which have not been compressed yet. Hence, a second stage of up-sampling is performed after compressing the whole frame to refine each up-sampled CTU around its bottom and right boundaries. Note that the second stage of up-sampling is performed for only the CTUs coded in the low-resolution mode. The same process is conducted in both the encoder and the decoder, so no overhead bit is required.

5) Results: The proposed CNN-BARC is implemented using the HEVC reference software HM, version 12.1. For evaluation, both PSNR and structural similarity (SSIM) [75] are used, as the latter is believed to be more consistent with subjective quality. The test sequences include 20 video sequences that are divided into Classes A, B, C, D, and E, as well as five sequences at 4K (3840 × 2160) resolution from the SJTU dataset [76].

The BD-rate results are summarized in Table VII. At low bit rates, i.e. QP 32–47, the proposed scheme improves the coding efficiency significantly, leading to average BD-rate reductions of 6.9%, 6.4%, and 3.0% for Y, U, and V, respectively, on the HEVC test sequences (Classes A–E). As for the UHD test sequences, the scheme achieves even higher coding gains, i.e. 10.4%, 4.5%, and 4.9% BD-rate reductions for Y, U, and V. In addition, when using SSIM as the quality metric, the BD-rate reductions are more significant, i.e. 11.0% and 12.0% reductions for Y in the HEVC and UHD sequences, respectively. Thus, it is claimed that down/up-sampling-based coding is more favorable in terms of subjective quality at low bit rates [62]. In addition to QPs 32–47, QPs 22 and 27 are also tested according to the HEVC common test conditions, and the corresponding BD-rate results are included in Table VII. As the QP increases, the BD-rate reductions become increasingly significant, which demonstrates that CNN-BARC is especially useful at low bit rates. Note that it is now common practice to use reduced resolution for transmission at low bit rates, i.e. to down-sample the entire video sequence. In contrast, BARC provides the flexibility of down-sampling selected blocks, i.e. it enables resolution switching inside the encoder/decoder.

VI. OPEN PROBLEMS FOR FUTURE RESEARCH

In this section, we discuss open problems worth further study.

Framework: In most cases, deep tools have typically been developed to replace modules of hybrid video coding schemes. Once we have multiple deep tools, it is possible to connect them into a larger network. Moreover, the larger network can be further fine-tuned. In addition, some end-to-end deep coding schemes [12], [13], [15] have shown inspiring results. Building a novel and competitive deep video coding scheme is an important problem that is worth further research. For example, deep learning has achieved remarkable results in semantic analysis, like image/video captioning. It is worth investigating how to improve coding efficiency by employing semantic meanings that provide richer information. Besides coding efficiency, various functions are required in future video services, such as manipulating, searching, and interacting with semantic-level objects.

Optimization: Few existing works have considered specific optimization for perceptual quality. It is not easy to manually design a non-deep tool for perceptual quality optimization. However, for deep tools, optimization for perceptual quality is much easier: we can simply switch to another loss function that is calibrated with perceptual quality and train the deep networks accordingly. In addition, traditional video compression algorithms are agnostic to the data being compressed, and they do not degrade gracefully. The concept of generative compression, i.e. the compression of data using deep generative models, introduces a direction that is worth pursuing in order to produce visually pleasing reconstructions at much higher compression levels for both image and video data.

Computation: One drawback of the existing work is the very high computation time. New computing infrastructures are being investigated for deep networks, but deep networks should be carefully designed to achieve better trade-offs between compression efficiency and computational efficiency. In addition, most of the existing neural networks use floating-point operations, which may cause different rounding errors on different hardware or software platforms and hinder interoperability. Thus, integer-arithmetic-only networks are necessary for video coding. Readers may refer to [77] and the references therein for some recent studies on integer-arithmetic-only networks.

VII. CONCLUSIONS

In this paper, we have summarized the ongoing efforts in the JVET to use deep learning to improve compression efficiency on top of state-of-the-art hybrid video coding schemes like HEVC and VVC. We have discussed three promising coding tools, i.e. NN-based intra prediction, CNN-based in-loop filtering, and CNN-based block-adaptive-resolution coding. Building novel network structures, adopting appropriate loss functions, and integrating deep tools into video coding schemes seamlessly are the major themes of this research. It is observed that these deep tools can improve the compression efficiency by a significant margin.

In the future, many open problems need to be addressed to advance deep learning-based video coding. First, the coding tools inside the traditional hybrid coding framework can all be "deepened," leading to a complete deep scheme. Second, optimization of deep tools or deep schemes for perceptual quality rather than MSE is worthy of further study. Third, deep networks should be enhanced to achieve both higher compression efficiency and lower computational complexity, where automatic searches for optimal network structures [78] may be quite useful.

ACKNOWLEDGMENT

The authors thank the two anonymous reviewers for their meticulous reviews and many suggested revisions and edits. D. Liu and F. Wu thank the following colleagues and collaborators: Jizheng Xu, Bin Li, Houqiang Li, Zhibo Chen, Li Li, Fangdong Chen, Yuanying Dai, Lei Guo, Ye Li, Yue Li, Jianping Lin, Changyue Ma, Ning Yan, and Haitao Yang. Z. Chen and S. Liu thank the following colleagues and collaborators: Yingbin Wang, Yiming Li, Liang Zhao, and Xiang Li.


TABLE VII
RESULTS OF CNN-BARC COMPARED TO HM 12.1 UNDER ALL-INTRA CONFIGURATION

Each row lists the BD-rate of Y, U, V, and Y (SSIM), in that order, for the QP ranges 32–47, 27–42, and 22–37.

Class / Sequence | QP 32–47: Y, U, V, Y(SSIM) | QP 27–42: Y, U, V, Y(SSIM) | QP 22–37: Y, U, V, Y(SSIM)
Class A:
Traffic −11.8% −6.3% 3.1% −14.8% −6.4% −7.6% −3.8% −12.1% −1.9% −4.9% −3.4% −6.5%
PeopleOnStreet −11.7% −23.0% −23.4% −14.5% −5.5% −20.3% −21.5% −9.7% −1.5% −10.1% −11.3% −5.2%
Nebuta −1.8% −14.5% −1.8% −4.8% −0.3% −4.4% −0.4% −1.6% −0.0% −0.5% −0.2% −0.0%
SteamLocomotive −2.3% −23.7% −17.3% −7.6% −0.8% −9.7% −7.3% −3.5% −0.1% −1.1% −1.5% −0.4%
Class B:
Kimono −9.0% −7.3% 19.7% −10.8% −8.2% −7.4% 8.9% −10.1% −6.0% −7.7% 0.2% −8.0%
ParkScene −8.3% −17.9% 0.8% −13.0% −4.1% −13.6% −5.5% −10.8% −0.9% −6.0% −2.7% −4.6%
Cactus −8.5% −5.2% 3.1% −12.9% −4.2% −5.0% −1.8% −8.0% −1.1% −2.5% −2.6% −3.3%
BQTerrace −4.8% −8.7% −10.3% −11.3% −1.4% −2.5% −3.2% −4.2% −0.2% −0.6% −0.8% −0.9%
BasketballDrive −8.0% −1.2% −1.4% −13.0% −4.8% −1.9% −3.0% −9.7% −1.5% −2.7% −2.7% −4.3%
Class C:
BasketballDrill −7.5% 2.5% 3.4% −10.5% −3.6% −1.5% −1.3% −4.7% −1.2% −2.3% −2.1% −2.1%
BQMall −3.9% −7.7% −9.6% −8.0% −1.1% −3.7% −2.9% −3.3% −0.0% −1.5% −0.6% −0.2%
PartyScene −1.9% −3.3% −0.7% −6.5% −0.2% −0.3% 1.2% −1.9% 0.1% 0.3% 0.4% 0.2%
RaceHorsesC −8.2% 2.0% 9.7% −13.0% −3.5% 0.9% 6.2% −9.4% −0.7% −0.6% 1.8% −2.4%
Class D:
BasketballPass −4.6% −1.8% 5.0% −9.1% −1.4% −2.1% −0.2% −4.1% −0.1% −0.6% −0.2% −0.1%
BQSquare −1.8% 1.5% −19.1% −3.6% −0.6% 0.2% −4.7% −0.7% −0.2% −0.1% −1.4% −0.4%
BlowingBubbles −4.2% 8.8% −5.7% −8.9% −1.7% 2.5% −3.5% −5.1% −0.6% 0.6% −1.0% −1.7%
RaceHorses −13.0% 8.0% 15.3% −18.0% −5.6% 4.8% 2.1% −14.0% −1.2% 1.9% −0.8% −4.8%
Class E:
FourPeople −9.1% −10.2% −8.6% −14.5% −4.8% −9.0% −10.0% −10.3% −1.7% −5.2% −6.8% −4.7%
Johnny −10.2% −6.7% −7.1% −11.8% −5.6% −6.1% −7.8% −9.3% −2.0% −3.7% −4.5% −5.4%
KristenAndSara −7.9% −13.4% −14.3% −14.0% −4.2% −8.0% −10.2% −9.7% −1.5% −5.4% −5.1% −4.0%
Class UHD:
Fountains −4.9% −17.4% −11.0% −8.7% −2.6% −18.2% −8.5% −6.4% −0.8% −12.9% −5.5% −2.4%
Runners −11.9% 15.5% −5.0% −13.1% −7.9% 2.7% −9.0% −11.1% −2.4% −1.5% −6.2% −6.7%
Rushhour −10.1% 4.0% 1.0% −11.5% −8.4% −1.7% −3.8% −10.0% −5.6% −5.2% −6.7% −7.6%
TrafficFlow −14.7% −12.5% −7.5% −14.8% −10.3% −16.1% −11.0% −11.2% −4.9% −14.6% −10.5% −6.9%
CampfireParty −10.4% −12.4% −2.0% −11.9% −4.9% −11.1% −2.0% −7.7% −1.1% −6.5% −1.4% −3.2%
Average of Classes A–E −6.9% −6.4% −3.0% −11.0% −3.4% −4.7% −3.4% −7.1% −1.1% −2.6% −2.3% −2.9%
Average of Class UHD −10.4% −4.5% −4.9% −12.0% −6.8% −8.9% −6.9% −9.3% −3.0% −8.1% −6.1% −5.4%

REFERENCES

[1] Y. LeCun, Y. Bengio, and G. Hinton, "Deep learning," Nature, vol. 521, no. 7553, pp. 436–444, May 2015.
[2] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "Imagenet classification with deep convolutional neural networks," in NIPS, 2012, pp. 1097–1105.
[3] R. Girshick, J. Donahue, T. Darrell, and J. Malik, "Rich feature hierarchies for accurate object detection and semantic segmentation," in CVPR, 2014, pp. 580–587.
[4] C. Dong, C. C. Loy, K. He, and X. Tang, "Learning a deep convolutional network for image super-resolution," in ECCV. Springer, 2014, pp. 184–199.
[5] K. Zhang, W. Zuo, Y. Chen, D. Meng, and L. Zhang, "Beyond a Gaussian denoiser: Residual learning of deep CNN for image denoising," IEEE Transactions on Image Processing, vol. 26, no. 7, pp. 3142–3155, 2017.
[6] R. D. Dony and S. Haykin, "Neural network approaches to image compression," Proceedings of the IEEE, vol. 83, no. 2, pp. 288–303, 1995.
[7] J. Jiang, "Image compression with neural networks – A survey," Signal Processing: Image Communication, vol. 14, no. 9, pp. 737–760, 1999.
[8] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in CVPR, 2016, pp. 770–778.
[9] M. Li, S. Gu, D. Zhang, and W. Zuo, "Enlarging context with low cost: Efficient arithmetic coding with trimmed convolution," arXiv:1801.04662, 2018.
[10] G. Toderici, S. M. O'Malley, S. J. Hwang, D. Vincent, D. Minnen, S. Baluja, M. Covell, and R. Sukthankar, "Variable rate image compression with recurrent neural networks," arXiv:1511.06085, 2015.
[11] G. Toderici, D. Vincent, N. Johnston, S. J. Hwang, D. Minnen, J. Shor, and M. Covell, "Full resolution image compression with recurrent neural networks," in CVPR, 2017, pp. 5306–5314.
[12] J. Ballé, V. Laparra, and E. P. Simoncelli, "End-to-end optimized image compression," arXiv:1611.01704, 2016.
[13] L. Theis, W. Shi, A. Cunningham, and F. Huszár, "Lossy image compression with compressive autoencoders," arXiv:1703.00395, 2017.
[14] T. Dumas, A. Roumy, and C. Guillemot, "Image compression with stochastic winner-take-all auto-encoder," in ICASSP. IEEE, 2017, pp. 1512–1516.
[15] O. Rippel and L. Bourdev, "Real-time adaptive image compression," in ICML, 2017, pp. 2922–2930.
[16] E. Agustsson, F. Mentzer, M. Tschannen, L. Cavigelli, R. Timofte, L. Benini, and L. V. Gool, "Soft-to-hard vector quantization for end-to-end learning compressible representations," in NIPS, 2017, pp. 1141–1151.
[17] M. Li, W. Zuo, S. Gu, D. Zhao, and D. Zhang, "Learning convolutional networks for content-weighted image compression," in CVPR, 2018, pp. 673–681.
[18] J. Ballé, "Efficient nonlinear transforms for lossy image compression," in PCS. IEEE, 2018, pp. 248–252.
[19] K. Gregor, F. Besse, D. J. Rezende, I. Danihelka, and D. Wierstra, "Towards conceptual compression," in NIPS, 2016, pp. 3549–3557.
[20] C.-Y. Wu, N. Singhal, and P. Krahenbuhl, "Video compression through image interpolation," in ECCV, 2018, pp. 416–431.
[21] Z. Chen, T. He, X. Jin, and F. Wu, "Learning for video compression," IEEE Transactions on Circuits and Systems for Video Technology, DOI: 10.1109/TCSVT.2019.2892608, 2019.
[22] J. Li, B. Li, J. Xu, R. Xiong, and W. Gao, "Fully connected network-based intra prediction for image coding," IEEE Transactions on Image Processing, vol. 27, no. 7, pp. 3236–3247, 2018.
[23] N. Yan, D. Liu, H. Li, B. Li, L. Li, and F. Wu, "Convolutional neural network-based fractional-pixel motion compensation," IEEE Transactions on Circuits and Systems for Video Technology, vol. 29, no. 3, pp. 840–853, 2019.
[24] Y. Li, L. Li, Z. Li, J. Yang, N. Xu, D. Liu, and H. Li, "A hybrid neural network for chroma intra prediction," in ICIP, 2018, pp. 1797–1801.
[25] D. Liu, H. Ma, Z. Xiong, and F. Wu, "CNN-based DCT-like transform for image compression," in MMM. Springer, 2018, pp. 61–72.
[26] R. Song, D. Liu, H. Li, and F. Wu, "Neural network-based arithmetic coding of intra prediction modes in HEVC," in VCIP, 2017, pp. 1–4.
[27] W.-S. Park and M. Kim, "CNN-based in-loop filtering for coding efficiency improvement," in IVMSP. IEEE, 2016, pp. 1–5.


[28] Y. Li, D. Liu, H. Li, L. Li, F. Wu, H. Zhang, and H. Yang, "Convolutional neural network-based block up-sampling for intra frame coding," IEEE Transactions on Circuits and Systems for Video Technology, vol. 28, no. 9, pp. 2316–2330, 2018.
[29] Z. Liu, X. Yu, Y. Gao, S. Chen, X. Ji, and D. Wang, "CU partition mode decision for HEVC hardwired intra encoder using convolution neural network," IEEE Transactions on Image Processing, vol. 25, no. 11, pp. 5088–5103, 2016.
[30] A. Van Oord, N. Kalchbrenner, and K. Kavukcuoglu, "Pixel recurrent neural networks," in ICML, 2016, pp. 1747–1756.
[31] G. E. Hinton and R. R. Salakhutdinov, "Reducing the dimensionality of data with neural networks," Science, vol. 313, no. 5786, pp. 504–507, 2006.
[32] W. Cui, T. Zhang, S. Zhang, F. Jiang, W. Zuo, Z. Wan, and D. Zhao, "Convolutional neural networks based intra prediction for HEVC," in DCC. IEEE, 2017, p. 436.
[33] N. Yan, D. Liu, H. Li, B. Li, L. Li, and F. Wu, "Invertibility-driven interpolation filter for video coding," IEEE Transactions on Image Processing, vol. 28, no. 10, pp. 4912–4925, 2019.
[34] J. Liu, S. Xia, W. Yang, M. Li, and D. Liu, "One-for-all: Grouped variation network based fractional interpolation in video coding," IEEE Transactions on Image Processing, vol. 28, no. 5, pp. 2140–2151, 2019.
[35] Z. Zhao, S. Wang, S. Wang, X. Zhang, S. Ma, and J. Yang, "Enhanced bi-prediction with convolutional neural network for high efficiency video coding," IEEE Transactions on Circuits and Systems for Video Technology, DOI: 10.1109/TCSVT.2018.2876399, 2018.
[36] S. Huo, D. Liu, F. Wu, and H. Li, "Convolutional neural network-based motion compensation refinement for video coding," in ISCAS, 2018, pp. 1–4.
[37] Y. Wang, X. Fan, C. Jia, D. Zhao, and W. Gao, "Neural network based inter prediction for HEVC," in ICME. IEEE, 2018, pp. 1–6.
[38] J. Lin, D. Liu, H. Li, and F. Wu, "Generative adversarial network-based frame extrapolation for video coding," in VCIP, 2018, pp. 1–4.
[39] M. H. Baig and L. Torresani, "Multiple hypothesis colorization and its application to image compression," Computer Vision and Image Understanding, vol. 164, pp. 111–123, 2017.
[40] T. Chen, H. Liu, Q. Shen, T. Yue, X. Cao, and Z. Ma, "DeepCoder: A deep neural network based video compression," in VCIP. IEEE, 2017, pp. 1–4.
[41] C. Ma, D. Liu, X. Peng, L. Li, and F. Wu, "Convolutional neural network-based arithmetic coding for HEVC intra-predicted residues," IEEE Transactions on Circuits and Systems for Video Technology, DOI: 10.1109/TCSVT.2019.2927027, 2019.
[42] S. Puri, S. Lasserre, and P. Le Callet, "CNN-based transform index prediction in multiple transforms framework to assist entropy coding," in EUSIPCO. IEEE, 2017, pp. 798–802.
[43] C. Dong, Y. Deng, C. C. Loy, and X. Tang, "Compression artifacts reduction by a deep convolutional network," in ICCV, 2015, pp. 576–584.
[44] Y. Dai, D. Liu, and F. Wu, "A convolutional neural network approach for post-processing in HEVC intra coding," in MMM. Springer, 2017, pp. 28–39.
[45] T. Wang, M. Chen, and H. Chao, "A novel deep learning-based method of improving coding efficiency from the decoder-end for HEVC," in DCC. IEEE, 2017, pp. 410–419.
[46] R. Yang, M. Xu, T. Liu, Z. Wang, and Z. Guan, "Enhancing quality for HEVC compressed videos," IEEE Transactions on Circuits and Systems for Video Technology, vol. 29, no. 7, pp. 2039–2054, 2019.
[47] R. Yang, M. Xu, Z. Wang, and T. Li, "Multi-frame quality enhancement for compressed video," in CVPR, 2018, pp. 6664–6673.
[48] Y. Dai, D. Liu, Z.-J. Zha, and F. Wu, "A CNN-based in-loop filter with CU classification for HEVC," in VCIP, 2018, pp. 1–4.
[49] C. Jia, S. Wang, X. Zhang, S. Wang, J. Liu, S. Pu, and S. Ma, "Content-aware convolutional neural network for in-loop filtering in high efficiency video coding," IEEE Transactions on Image Processing, vol. 28, no. 7, pp. 3343–3356, 2019.
[50] Y. Zhang, T. Shen, X. Ji, Y. Zhang, R. Xiong, and Q. Dai, "Residual highway convolutional neural networks for in-loop filtering in HEVC," IEEE Transactions on Image Processing, vol. 27, no. 8, pp. 3827–3841, 2018.
[51] D. Bull et al., "Description of SDR video coding technology proposal by University of Bristol," JVET, Tech. Rep. JVET-J0031, 2018.
[52] Y. Li, D. Liu, H. Li, L. Li, Z. Li, and F. Wu, "Learning a convolutional neural network for image compact-resolution," IEEE Transactions on Image Processing, vol. 28, no. 3, pp. 1092–1107, 2019.
[53] J. Lin, D. Liu, H. Yang, H. Li, and F. Wu, "Convolutional neural network-based block up-sampling for HEVC," IEEE Transactions on Circuits and Systems for Video Technology, DOI: 10.1109/TCSVT.2018.2884203, 2018.
[54] K. Liu, D. Liu, H. Li, and F. Wu, "Convolutional neural network-based residue super-resolution for video coding," in VCIP, 2018, pp. 1–4.
[55] M. Xu, T. Li, Z. Wang, X. Deng, R. Yang, and Z. Guan, "Reducing complexity of HEVC: A deep learning approach," IEEE Transactions on Image Processing, vol. 27, no. 10, pp. 5044–5059, 2018.
[56] Y. Li, B. Li, D. Liu, and Z. Chen, "A convolutional neural network-based approach to rate control in HEVC intra coding," in VCIP. IEEE, 2017, pp. 1–4.
[57] A. Prakash, N. Moran, S. Garber, A. DiLillo, and J. Storer, "Semantic perceptual image compression using deep convolution networks," in DCC. IEEE, 2017, pp. 250–259.
[58] M. Albrecht et al., "Description of SDR, HDR and 360° video coding technology proposal by Fraunhofer HHI," JVET, Tech. Rep. JVET-J0014, 2018.
[59] J. Pfaff et al., "Intra prediction modes based on neural networks," JVET, Tech. Rep. JVET-J0037, 2018.
[60] L. Zhou et al., "Convolutional neural network filter (CNNF) for intra frame," JVET, Tech. Rep. JVET-I0022, 2018.
[61] J. Yao et al., "AHG9: Convolutional neural network filter for inter frame," JVET, Tech. Rep. JVET-J0043, 2018.
[62] F. Wu et al., "Description of SDR video coding technology proposal by University of Science and Technology of China, Peking University, Harbin Institute of Technology, and Wuhan University," JVET, Tech. Rep. JVET-J0032, 2018.
[63] C.-W. Hsu et al., "Description of SDR video coding technology proposal by MediaTek," JVET, Tech. Rep. JVET-J0018, 2018.
[64] P. Bordes et al., "Description of SDR, HDR and 360° video coding technology proposal by Qualcomm and Technicolor – medium complexity version," JVET, Tech. Rep. JVET-J0022, 2018.
[65] F. Galpin et al., "AHG9: CNN-based driving of block partitioning for intra slices encoding," JVET, Tech. Rep. JVET-J0034, 2018.
[66] V. Vapnik, The Nature of Statistical Learning Theory. Springer Science & Business Media, 2013.
[67] C. E. Shannon, "A mathematical theory of communication," Bell System Technical Journal, vol. 27, no. 3, pp. 379–423, 1948.
[68] K. Hornik, M. Stinchcombe, and H. White, "Multilayer feedforward networks are universal approximators," Neural Networks, vol. 2, no. 5, pp. 359–366, 1989.
[69] J. Pfaff et al., "CE3: Affine linear weighted intra prediction (CE3-4.1, CE3-4.2)," JVET, Tech. Rep. JVET-N0217, 2019.
[70] Y. Wang, Z. Chen, Y. Li, L. Zhao, S. Liu, and X. Li, "Dense residual convolutional neural network based in-loop filter," JVET, Tech. Rep. JVET-K0391, 2018.
[71] Y. Wang, Z. Chen, Y. Li, L. Zhao, S. Liu, and X. Li, "AHG9: Dense residual convolutional neural network based in-loop filter," JVET, Tech. Rep. JVET-L0242, 2018.
[72] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam, "MobileNets: Efficient convolutional neural networks for mobile vision applications," arXiv:1704.04861, 2017.
[73] T. Serre, L. Wolf, S. Bileschi, M. Riesenhuber, and T. Poggio, "Robust object recognition with cortex-like mechanisms," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 29, no. 3, pp. 411–426, 2007.
[74] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, "Going deeper with convolutions," in CVPR, 2015, pp. 1–9.
[75] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli, "Image quality assessment: From error visibility to structural similarity," IEEE Transactions on Image Processing, vol. 13, no. 4, pp. 600–612, 2004.
[76] L. Song, X. Tang, W. Zhang, X. Yang, and P. Xia, "The SJTU 4K video sequence dataset," in QoMEX, 2013, pp. 34–35.
[77] B. Jacob, S. Kligys, B. Chen, M. Zhu, M. Tang, A. Howard, H. Adam, and D. Kalenichenko, "Quantization and training of neural networks for efficient integer-arithmetic-only inference," in CVPR, 2018, pp. 2704–2713.
[78] B. Zoph, V. Vasudevan, J. Shlens, and Q. V. Le, "Learning transferable architectures for scalable image recognition," in CVPR, 2018, pp. 8697–8710.


Dong Liu (M'13–SM'19) received the B.S. and Ph.D. degrees in electrical engineering from the University of Science and Technology of China (USTC), Hefei, China, in 2004 and 2009, respectively. He was a Member of Research Staff with Nokia Research Center, Beijing, China, from 2009 to 2012. He joined USTC as an Associate Professor in 2012.
His research interests include image and video coding, multimedia signal processing, and multimedia data mining. He has authored or co-authored more than 100 papers in international journals and conferences. He has 16 granted patents. He has one technical proposal adopted by AVS. He received the 2009 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY Best Paper Award and the VCIP 2016 Best 10% Paper Award. He and his team were winners of four technical challenges held in ACM MM 2018, ECCV 2018, CVPR 2018, and ICME 2016. He is a Senior Member of CSIG, and an elected member of the MSA Technical Committee of the IEEE CAS Society. He served as a Registration Co-Chair for ICME 2019 and a Symposium Co-Chair for WCSP 2014.

Feng Wu (M'99–SM'06–F'13) received the B.S. degree in electrical engineering from Xidian University in 1992, and the M.S. and Ph.D. degrees in computer science from Harbin Institute of Technology in 1996 and 1999, respectively. He is a Professor and the Assistant to the President at the University of Science and Technology of China, Hefei, China. Previously, he was a Principal Researcher and Research Manager with Microsoft Research Asia, Beijing, China.
His research interests include various aspects of video technology and artificial intelligence. He has authored or co-authored over 120 journal papers (including several dozens of IEEE Transactions papers) and top conference papers at MOBICOM, SIGIR, CVPR, and ACM MM. He has 80 granted US patents. His 15 techniques have been adopted into international video coding standards. He serves or has served as the Deputy Editor-in-Chief of IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY and as an Associate Editor of IEEE TRANSACTIONS ON IMAGE PROCESSING and IEEE TRANSACTIONS ON MULTIMEDIA. He also served as the General Chair of ICME 2019 and the TPC Chair of MMSP 2011, VCIP 2010, and PCM 2009. He received best paper awards from IEEE TCSVT in 2009, VCIP 2016, PCM 2008, and VCIP 2007, and the Best Associate Editor Award of IEEE TRANSACTIONS ON IMAGE PROCESSING in 2018.

Zhenzhong Chen (S'02–M'07–SM'15) received the B.Eng. degree from Huazhong University of Science
and Technology and the Ph.D. degree from the
Chinese University of Hong Kong, both in electrical
engineering. He is currently a Professor at Wuhan
University.
His current research interests include image and
video processing and analysis, computational vision
and human-computer interaction, multimedia data
mining, photogrammetry, and remote sensing. He
has been a VQEG board member and Immersive
Media Working Group Co-Chair, a Selection Committee Member of ITU
Young Innovators Challenges, an Associate Editor of IEEE TRANSACTIONS
ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, Journal of the
Association for Information Science and Technology (JASIST), Journal of
Visual Communication and Image Representation, and Editor of IEEE IoT
Newsletters. He was a recipient of the CUHK Young Scholars Dissertation
Award, the CUHK Faculty of Engineering Outstanding Ph.D. Thesis Award,
and the Microsoft Fellowship.

Shan Liu obtained her B.Eng. degree in Electronics Engineering from Tsinghua University, Beijing,
China and M.S. and Ph.D. degrees in Electrical
Engineering from University of Southern California,
Los Angeles, USA. She is a Distinguished Scientist
and General Manager at Tencent where she heads
the Tencent Media Lab. Prior to joining Tencent she
was the Chief Scientist and Head of America Media
Lab at Futurewei Technologies. She was formerly
Director of Multimedia Technology Division at Me-
diaTek USA. She was also formerly with MERL,
Sony and IBM.
Dr. Liu is the inventor of more than 200 US and global patent applications
and the author of more than 60 journal and conference articles. She actively
contributes to international standards such as VVC, H.265/HEVC, DASH,
OMAF, and served as a co-Editor of H.265/HEVC v4 and VVC. Dr. Liu served on the
Industrial Relationship Committee of the IEEE Signal Processing Society in
2014–2015 and was the VP of Industrial Relations and Development of the Asia-Pacific
Signal and Information Processing Association (APSIPA) in 2016–2017. She was
named APSIPA Industrial Distinguished Leader in 2018.
