A 5.3 Gpixels/s Frame Memory Recompression Method for QHD Video Coding
Chenhao Gu1, Xiaoyang Zeng1, and Yibo Fan1*
1 State Key Laboratory of ASIC and System, Fudan University, Shanghai, 200433, China
* Email: fanyibo@fudan.edu.cn

978-1-5386-4441-6/18/$31.00 ©2018 IEEE
Authorized licensed use limited to: University of Portsmouth. Downloaded on December 26,2020 at 16:24:01 UTC from IEEE Xplore. Restrictions apply.

Abstract

4K (3840×2160) and 8K (7680×4320) video applications enhance the visual experience, but at the expense of a much more complicated video encoder. The High-Efficiency Video Coding (HEVC) standard is a new video compression standard that can double the compression ratio with respect to the previous H.264/AVC standard at the same level of video quality. H.265/HEVC adopts the quad-tree coding structure, leading to serious design problems such as high computational complexity, high energy consumption and a huge external memory bandwidth requirement. Frame memory recompression is an efficient way to reduce both system power and external memory traffic. This work emphasizes the hardware implementation of a high-throughput reference frame recompression scheme. Compared to no compression, the proposed algorithm can reach an average lossless compression rate of 51.0% for evaluated video sequences. The hardware architecture is implemented with 29.5k gates for the compressor and 14.4k gates for the decompressor at 500MHz using TSMC 65nm CMOS standard-cell libraries. The throughput of this architecture is 5.3 Gpixels/s under 4:2:0 sampling, which is able to process Quad High Definition (QHD) at 60 fps.

Keywords

H.265/HEVC, VLSI, high throughput, frame memory recompression.

1. Introduction
During the past decade, as 4K (3840×2160) and 8K (7680×4320) video applications have gradually entered public view, video coding technology has faced great challenges. In 2013, the JCT-VC team, set up by the ITU-T VCEG and ISO/IEC MPEG standardization organizations, released the next generation of video coding standard, H.265/HEVC [1]. H.265/HEVC can achieve a 50% coding efficiency gain over AVC. However, this improvement comes at the expense of huge external memory traffic, high energy consumption and much higher hardware resource cost, due to the efficient coding tools integrated in HEVC.
Motion estimation (ME) is the main source of memory communication. Limited memory bandwidth can be the bottleneck for video quality because it constrains the amount of reference pixels that can be loaded for the ME process. From the experiment results in [2], the average number of reference pixels read for motion estimation is about six times the frame size. Hence, the throughput requirement of the ME process is about 3.0 Gpixels/s for 3840×2160@60fps.
Frame recompression is a technique that compresses reference frames before the deblocking filter process stores them into external DRAM and decompresses the data when ME/MC fetches them back, as shown in Fig. 1.

[Figure: block diagram. Block labels: Codec, Deblocking Filter, ME/MC, FMR1), Compressor, Memory Organization, Decompressor, DRAM.]
1) FMR: Frame memory recompression.
Fig. 1. High-level block diagram for FMR system

Previous works on frame compression can be roughly divided into two categories: 1) lossy compression [4] and 2) lossless compression [3][5]. However, most of them are not able to process Quad High Definition (QHD) in real time. For instance, Lian [3] proposes a pixel-grain directional prediction algorithm for lossless frame memory compression and claims this method is suitable for QHD encoding. Their conclusion is based on the Level-D reference data reuse scheme, and this throughput is not sufficient to encode QHD video sequences when random and unpredictable search algorithms, such as [2], are used in the IME process. In [4], a novel algorithm called mixed lossy and lossless (MLL) reference frame recompression is proposed. Truncated pixels are compressed for integer motion estimation (IME), and truncated residuals are fetched from external memory to reconstruct lossless data for fractional motion estimation (FME) and motion compensation (MC). The decoding throughput is 0.76 pixels per clock cycle, which is also not enough for QHD videos.
In this paper, a high-throughput and low-complexity frame recompression algorithm for reducing external memory traffic is proposed. The rest of this paper is organized
as follows. Section 2 presents the proposed reference frame compression algorithm, while the hardware implementation of this method is provided in Section 3. Section 4 shows the results and comparison with related work. Finally, Section 5 concludes this paper.

2. Proposed Frame Compression Algorithm

Most of the process units in previous works are larger than 4×4, such as 8×8 [4], 16×16 [3], and 32×32 [6]. However, to be compatible with some existing deblocking filter algorithms [7], the height of processing units is supposed to be no larger than 4. We define the basic partition size as 8×4 in this work. The format of compressed data is shown in Fig. 2, where F denotes “first pixel”, G denotes “regrouping mode”, M denotes “coding modes”, D denotes “compressed residuals” and T denotes “trailing bits”.

Ori Data (length[7:0] = 0): Original Pixel Values (256 bits)
Not All Zero (8 < length[7:0] < 256): F (8 bits) | G (2 bits) | M (2~48 bits) | D (0~256 bits) | T (0~8 bits)
All Zero1) (length[7:0] = 8): F (8 bits)
1) All Zero: all the residuals after DPCM are equal to 0.
Fig. 2. Format of compressed data

In this design, the final compressed data of each 8×4 partition can take one of three patterns: Ori Data, Not All Zero, and All Zero. The pattern of the final compressed data is indicated by its length, as shown in Fig. 2. The 8-bit length is stored in on-chip memory or external DRAM. Before compressed data are fetched back, the length needs to be read in advance in order to decide the pattern. When all the 8×4 residuals are equal to zero, only the value of the top-left pixel is needed and the length of the compressed data is 8. Furthermore, our algorithm cannot guarantee that the original pixels can always be compressed to less than 256 bits. When the length of the compressed data is no less than 256 (≥256), the original pixels are stored and the final length stored is equal to 0.

For each partition, a DPCM scan is first performed on the original pixels, as shown in Fig. 3. The value of the top-left pixel is put into the “F” part of the final compressed data. The residuals for the next 3 samples of the 1st column are the difference between the current sample and the sample in the previous row. Then the residuals for the remaining 28 samples of the 2nd-8th columns are the difference between the current sample and the sample in the previous column.

Fig. 3. DPCM scan direction

For the entropy coding process, the semi-fixed length coding proposed in [8] is used and modified for small-value optimization, as shown in Table 1 and Table 2. The 8×4 residuals are first divided into eight 2×2 sub-blocks. Then the coding mode (M) of each sub-block is decided according to the maximum absolute value of the residuals in that sub-block. If any of these residuals equals -2^(M-1) or 2^(M-1), an extra trailing bit (T) is needed to denote its sign.

Table 1. Semi-fixed length coding
MAV1)        0    1    2    3~4   5~8    9~16    17~32    >=33
Coding Mode  0    1    2    3     4      5       6        7
0                 0    00   000   0000   00000   000000
±1                1    S1   SS1   SSS1   SSSS1   SSSSS1
±2                     10   S10   SS10   SSS10   SSSS10
±3                          SS1   SSS1   SSSS1   SSSSS1
±4                          100   S100   SS100   SSS100
±5                                SSS1   SSSS1   SSSSS1
±8                                1000   S1000   SS1000
…
±16                                      10000   S10000
…
±32                                              100000
1) MAV: maximum absolute value.
2) “S” is the sign of the residual and “S̄” is the logical negation of S. The overbars that distinguish the ±3 and ±5 rows from the ±1 row are not preserved in this text.

Table 2. Coding mode for residuals
Coding Mode  0      1       2        3
M            00     01      10       110
Coding Mode  4      5       6        7
M            1110   11110   111110   111111

Adjacent sub-blocks which have the same coding mode can be regrouped together. This method decreases the bits used for the “M” part of the final compressed data, since sub-blocks within one group share a common coding mode. The proposed four regrouping schemes, which are indicated by the regrouping mode (G), are shown in Fig. 4.

[Figure: G = 00: eight groups (G0 to G7); G = 01: four groups (G0 to G3); G = 10: two groups (G0, G1); G = 11: one group (G0).]
Fig. 4. Regrouping modes

3. Hardware Implementation
The pipeline stages of the compressor and decompressor

are shown in Fig. 5 and Fig. 6, respectively. Each 8×4 partition requires only two cycles for either the compressor or the decompressor. A throughput of 16 samples per cycle is achieved.

[Figure: two pipelined partitions, each passing through the stages DPCM & Regroup, Encode, and Merge Data1); annotated “2 cycles”.]
1) Merge Data: Merge F, G, D, T parts for final compressed data.
Fig. 5. Pipeline stage of compressor

[Figure: two pipelined partitions, each passing through the stages Split1), Divide2), Decode, and Reconstruct3); annotated “2 cycles”.]
1) Split: Split F, G, D, T parts from compressed data.
2) Divide: Divide D part into 8x4 pixels.
3) Reconstruct: Reverse process of DPCM.
Fig. 6. Pipeline stage of decompressor

4. Results and discussion
The architecture of the proposed frame recompression method is synthesized targeting TSMC 65nm standard-cell libraries.
Results show that the proposed algorithm can reach an average lossless compression rate of 51.0% for evaluated video sequences. Table 3 shows the comparison of the proposed algorithm with previous frame recompression algorithms.

Table 3. Comparison with related works
                             Lian’s [3]   Fan’s [4]    Guo’s [8]   This work
Block Size                   16×16        8×8          8×8         8×4
Compression Rate (%)         68.5         45.4~52.9    57.6        51.0
CMOS tech. (nm)              65           130          90          65
Gate Count (k)               71.2         91.2         79.58       43.98
Max. freq. (MHz)             578          250          300         500
Throughput1) (pixels/cycle)  1.33         0.76         14.2        10.67
Throughput (Gpixels/s)       0.78         0.19         4.3         5.3
1) Decompressor, YUV 4:2:0.

Among these works, the proposed architecture presents the smallest gate count and the highest throughput in Gpixels/s. However, the compression rate of the proposed algorithm is relatively low due to the small process unit (8×4) and the simple prediction method.

5. Conclusion
A high-throughput and hardware-friendly frame memory recompression design is presented in this paper, which can support up to 4K@60fps video compression. The results provided in Section 4 indicate that the proposed architecture offers remarkable improvement in both hardware resource cost and throughput.

Acknowledgement
This work was supported in part by the National Natural Science Foundation of China under Grant 61674041, in part by Alibaba Group through Alibaba Innovative Research (AIR) Program, in part by the STCSM under Grant 16XD1400300, in part by the pioneering project of academy for engineering and technology and Fudan-CIOMP joint fund.

References
[1] G. J. Sullivan, J. R. Ohm, W. J. Han and T. Wiegand, "Overview of the High Efficiency Video Coding (HEVC) Standard," in IEEE Trans. Circuits Syst. Video Technol., vol. 22, no. 12, pp. 1649-1668, Dec. 2012.
[2] D. Zhou, J. Zhou, G. He and S. Goto, "A 1.59 Gpixel/s Motion Estimation Processor With -211 to +211 Search Range for UHDTV Video Encoder," in IEEE Journal of Solid-State Circuits, vol. 49, no. 4, pp. 827-837, April 2014.
[3] X. Lian, Z. Liu, W. Zhou and Z. Duan, "Lossless Frame Memory Compression Using Pixel-Grain Prediction and Dynamic Order Entropy Coding," in IEEE Trans. Circuits Syst. Video Technol., vol. 26, no. 1, pp. 223-235, Jan. 2016.
[4] Y. Fan, Q. Shang and X. Zeng, "In-Block Prediction-Based Mixed Lossy and Lossless Reference Frame Recompression for Next-Generation Video Encoding," in IEEE Trans. Circuits Syst. Video Technol., vol. 25, no. 1, pp. 112-124, Jan. 2015.
[5] Y. Lee, C. Chen, and Y. You, "Design of VLSI Architecture of Autocorrelation-Based Lossless Recompression Engine for Memory Efficient Video Coding Systems," in Springer CSSP, vol. 33, no. 2, pp. 459-482, Feb. 2014.
[6] D. Silveira, G. Povala, L. Amaral, B. Zatt, L. Agostini and M. Porto, "A real-time architecture for reference frame compression for high definition video coders," 2015 IEEE Int. Symposium on Circuits and Systems (ISCAS), Lisbon, 2015, pp. 842-845.
[7] P. K. Hsu and C. A. Shen, "The VLSI Architecture of a Highly Efficient Deblocking Filter for HEVC Systems," in IEEE Trans. Circuits Syst. Video Technol., vol. 27, no. 5, pp. 1091-1103, May 2017.
[8] L. Guo, D. Zhou and S. Goto, "A New Reference Frame Recompression Algorithm and Its VLSI Architecture for UHDTV Video Codec," in IEEE Trans. on Multimedia, vol. 16, no. 8, pp. 2323-2332, Dec. 2014.
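
As a rough illustration of the DPCM scan and the coding-mode decision described in Section 2, the following Python sketch processes one 8×4 partition in software. It is not the authors' hardware design. The function names, the raster ordering of the 2×2 sub-blocks, and the ramp test block are assumptions made for this example; only the prediction directions and the MAV ranges of Table 1 come from the text.

```python
ROWS, COLS = 4, 8  # one 8x4 partition: 8 samples wide, 4 samples high

# Largest MAV covered by coding modes 0..6 (Table 1); anything above is mode 7.
MAV_UPPER = [0, 1, 2, 4, 8, 16, 32]


def dpcm_forward(block):
    """Return (first_pixel, residuals) for a block given as 4 rows of 8 samples."""
    res = [[0] * COLS for _ in range(ROWS)]
    for r in range(ROWS):
        for c in range(COLS):
            if c > 0:    # columns 2 to 8: predict from the previous column
                res[r][c] = block[r][c] - block[r][c - 1]
            elif r > 0:  # rest of column 1: predict from the previous row
                res[r][c] = block[r][c] - block[r - 1][c]
    return block[0][0], res  # the top-left sample is kept as the "F" part


def dpcm_inverse(first_pixel, res):
    """Rebuild the original samples from the first pixel and the residuals."""
    block = [[0] * COLS for _ in range(ROWS)]
    for r in range(ROWS):
        for c in range(COLS):
            if c > 0:
                block[r][c] = block[r][c - 1] + res[r][c]
            elif r > 0:
                block[r][c] = block[r - 1][c] + res[r][c]
            else:
                block[r][c] = first_pixel
    return block


def coding_mode(mav):
    """Map a sub-block's maximum absolute value to coding mode 0..7."""
    for mode, upper in enumerate(MAV_UPPER):
        if mav <= upper:
            return mode
    return 7


def sub_block_modes(res):
    """Coding mode of each 2x2 sub-block, visited in raster order (assumed)."""
    modes = []
    for r0 in range(0, ROWS, 2):
        for c0 in range(0, COLS, 2):
            mav = max(abs(res[r][c]) for r in (r0, r0 + 1) for c in (c0, c0 + 1))
            modes.append(coding_mode(mav))
    return modes


# Hypothetical test block: a smooth ramp, +2 per column and +1 per row.
block = [[100 + r + 2 * c for c in range(COLS)] for r in range(ROWS)]
first, res = dpcm_forward(block)
assert dpcm_inverse(first, res) == block  # the scan is lossless
print(first, sub_block_modes(res))
```

For this ramp every sub-block has a maximum absolute residual of 2, so all eight sub-blocks select coding mode 2 and could share a single group. A flat block would give all-zero residuals, which corresponds to the All Zero pattern of Fig. 2.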

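
The throughput figures quoted in the Introduction, Section 3 and Table 3 can be cross-checked with a few lines of arithmetic. The sketch below assumes that the 16 samples per cycle count luma and chroma samples together and that a 4:2:0 pixel corresponds to 1.5 samples; that conversion factor is an interpretation made here to relate the two numbers, not a statement from the paper. The variable names are illustrative.

```python
# Requirement: ME reads about six times the frame size per frame [2].
WIDTH, HEIGHT, FPS = 3840, 2160, 60
ME_READ_FACTOR = 6
required = ME_READ_FACTOR * WIDTH * HEIGHT * FPS  # pixels per second

# Capability: one 8x4 partition (32 samples) every 2 cycles at 500 MHz.
SAMPLES_PER_CYCLE = 32 / 2
SAMPLES_PER_PIXEL = 1.5  # assumed for YUV 4:2:0: 1 luma + 2 x 0.25 chroma
CLOCK_HZ = 500e6
pixels_per_cycle = SAMPLES_PER_CYCLE / SAMPLES_PER_PIXEL
achieved = pixels_per_cycle * CLOCK_HZ  # pixels per second

print(f"required : {required / 1e9:.2f} Gpixels/s")
print(f"per cycle: {pixels_per_cycle:.2f} pixels")
print(f"achieved : {achieved / 1e9:.2f} Gpixels/s")
```

Under these assumptions the requirement comes to about 3.0 Gpixels/s, and the design delivers about 10.67 pixels per cycle, or roughly 5.3 Gpixels/s, which matches the values reported in Table 3.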