
Improved SIMD Architecture for High Performance Video Processors


Wing-Yee Lo, Daniel Pak-Kong Lun, Member, IEEE, Wan-Chi Siu, Senior Member, IEEE, Wendong
Wang, and Jiqiang Song, Senior Member, IEEE

Abstract—SIMD execution is undoubtedly an efficient way to exploit the data level parallelism in image and video applications. However, SIMD execution bottlenecks must be tackled in order to achieve high execution efficiency. We first analyze in this paper the implementation of two major kernel functions of H.264/AVC, namely SATD and subpel interpolation, in conventional SIMD architectures to identify the bottlenecks in traditional approaches. Based on the analysis results, we propose a new SIMD architecture with two novel features: (1) a parallel memory structure with variable block size and word length support; and (2) a configurable SIMD structure. The proposed parallel memory structure allows great flexibility for programmers to perform data access of different block sizes and different word lengths. The configurable SIMD structure allows almost "random" register file access and slightly different operations in the ALUs inside the SIMD unit. The new features greatly benefit the realization of H.264/AVC kernel functions. For instance, the fractional motion estimation, particularly the half to quarter pixel interpolation, can now be executed with minimal or no additional memory access. When compared with conventional SIMD systems, the proposed SIMD architecture achieves a further speedup of 2.1X to 4.6X when implementing H.264/AVC kernel functions. Based on Amdahl's law, the overall speedup of the H.264/AVC encoding application is projected to be 2.46X. We expect significant improvement can also be achieved when applying the proposed architecture to other image and video processing applications.

Index Terms—Configurable SIMD, parallel memory structure, SIMD bottlenecks, video codec processor

Manuscript received September 23, 2009; revised April 27, 2010 and December 13, 2010. This work was supported in part by the Hong Kong Polytechnic University under grant no. 1-BB9B. Most of the research work and implementation development were done in the Hong Kong Applied Science and Technology Research Institute (ASTRI) and Beijing SimpLight Nanoelectronics Ltd.
Wing-Yee Lo, Daniel Pak-Kong Lun and Wan-Chi Siu are with the Centre for Signal Processing of the Department of Electronic and Information Engineering of the Hong Kong Polytechnic University, Hung Hom, Kowloon, Hong Kong (e-mail: winnielowingyee@gmail.com; enpklun@polyu.edu.hk; enwcsiu@polyu.edu.hk).
Wendong Wang is with SimpLight Nanoelectronics Ltd., Beijing, China (e-mail: wending.wang@simplnano.com).
Jiqiang Song is with the Intel Lab, Beijing, China (e-mail: jiqiangsong@gmail.com).

Copyright (c) 2011 IEEE. Personal use of this material is permitted. However, permission to use this material for any other purposes must be obtained from the IEEE by sending an email to pubs-permissions@ieee.org.

I. INTRODUCTION

With the extensive use of image and video information in modern computer applications, the development of high performance image and video processing units has attracted much interest from both academic researchers and VLSI system designers. Among the image and video processing operations performed in general computer applications, video coding is the most computation intensive and is often used as the benchmark to measure the performance of a video processor. For the rest of this paper, we shall focus on the realization of the state-of-the-art video coding standard H.264/AVC [1] and use it as an example to illustrate the merit of the proposed video processor design.

To deal with the extremely high computational complexity of video coding, one common approach is to exploit the data level parallelism (DLP) in the execution. Unlike application specific ASIC designs, a general purpose video processor should provide great flexibility for programmers while exploiting the parallelism in the execution. For this reason, the Single Instruction Multiple Data (SIMD) architecture is most suitable and is widely adopted. Two popular examples are Intel's MMX/SSE1/SSE2/SSE3 [2] and Motorola's AltiVec [3], where multimedia SIMD instruction set extensions have been added for efficient realization of video processing applications.

In recent years, many researchers have studied how much performance can be gained by using SIMD instructions in modern video codecs [4]-[9]. Simulation results using reference models demonstrate speedups of at least 2-12X. A basic requirement for employing SIMD instructions is to feed multiple data elements perfectly into vector registers so that the same computation operation can be applied. Although much research effort [6], [10]-[12] has been made to address the problem, there are often overheads and performance bottlenecks when aligning the multiple data to feed into vector registers. Extra memory loads and stores, unpacking, packing and shuffling are often required, which prevent SIMD execution from achieving the peak performance. Besides, memory mis-alignment, stride memory access, memory latency, random register file access and branch mis-prediction also prevent the processor from fetching data in a timely fashion to achieve peak throughput [13]-[15].

To address the aforementioned problems, our team has designed and implemented a new SIMD based video processor with the architecture shown in Fig. 1. Our video processor is a 5-stage pipeline multi-threaded multi-issue semi out-of-order superscalar processor. It supports a maximum of 4 threads of execution simultaneously. A maximum of 4 instructions can be

issued in every clock cycle. The multi-thread and multi-issue features hide the memory latency and branch mis-prediction penalty bottlenecks. The processor is implemented with TSMC 0.13 µm technology. Based on our simulation and performance data, it is capable of encoding and decoding video sequences of CIF resolution at 90 MHz and 30 MHz respectively. The die area is about 8 mm², including a 32KB instruction cache, 48KB internal SRAM memory and 8KB LUT memory. A breakdown of the area used for various functional units is shown in TABLE I.

Fig. 1. Proposed SIMD architecture.

TABLE I
PROPOSED VIDEO PROCESSOR AREA BREAKDOWN

              Synthesized   Local Storage              Total
              Logic         Memory    Register File
Area   mm²    2.7           3.9       1.4              8.0
       %      33.6          48.6      17.8             100

In our design, two novel features are introduced to reduce the overheads and performance bottlenecks when executing SIMD instructions for video coding operations. Firstly, a parallel memory structure with variable block size and word length support is proposed. The structure resolves the unaligned and stride memory access SIMD bottleneck problems. Our design accounts for the fact that video coding tasks operate on byte or word data and generate word and double-word results. Using a memory interleave scheme, the proposed parallel memory structure can load a block of data with size up to double-word in no more than 4 cycles. Compared with previous parallel memory schemes, the proposed approach provides higher flexibility in block size and data length selection with low hardware complexity. The second proposed feature is the introduction of a configurable SIMD (CSIMD) structure using a look-up table (LUT). It allows almost random register access and slightly different operations in the ALUs inside the SIMD unit. It eliminates the problem in some video processors that sub-word data elements resulting from previous SIMD operations cannot be retrieved and used directly without packing, unpacking, and shuffling. It is common for these video processors to perform additional memory stores and re-loads to solve the problem. The proposed feature mitigates the performance impact by successfully reducing such additional memory accesses.

With these features, the performance of the processor is significantly improved. First of all, no packing and unpacking instructions are required in our video instruction set extension. Data shuffling within the SIMD registers can be accomplished in one cycle. These features greatly benefit the realization of many major H.264/AVC kernel functions. For instance, the fractional motion estimation, particularly the half to quarter pixel interpolation, can now be executed by the proposed SIMD structure with minimal or no additional memory accesses. When compared with conventional SIMD structures, the proposed SIMD architecture achieves a further speedup of 2.1X to 4.6X when implementing H.264/AVC encoder kernel functions. As these kernel functions are often used in other image and video processing operations, the proposed SIMD architecture can be generally applied to different image and video applications with significantly improved performance.

The paper is organized as follows. In Section II, we briefly discuss the previous works on SIMD architectural bottleneck analysis. In Section III, we analyze where the SIMD structure bottlenecks are and how the SIMD structure can be enhanced. Then the two proposed features are described in Section IV. The experimental results are shown in Section V. Finally, the conclusion is drawn in Section VI.

II. RELATED WORKS

To allow SIMD architectures to achieve the peak throughput performance, many researchers have sought ways to increase the efficiency of loading data from memory and aligning them within a vector register file for SIMD execution before the data are used. A SIMD processor, MediaBreeze, was proposed in [13][15] to alleviate the SIMD bottleneck problem. It introduces a multi-dimensional vector instruction, the Breeze instruction, to speed up the nested loop operations that are often found in video applications. However, the Breeze instruction structure is very complicated. It needs a dedicated instruction memory and decoder to store and decode the Breeze instruction before execution. It can only fully exploit 5-level looping in very regular execution functions such as the full search algorithm in motion estimation. For most fast search algorithms, the block can be in any arbitrary location. In this case, only 2-level looping in the Breeze instruction can be applied.

Although some SIMD architectures claim to support unaligned memory access, their approaches often have different limitations [14]. They include the need for multiple aligned loads followed by data shift and OR operations, not being thread safe, extra latency for crossing cache boundaries, etc. To deal with these problems, it is shown in [14] that an alignment network can be added after a two-bank interleaved L1 cache to reduce the cache line boundary penalty. However, such an approach only handles the unaligned memory access problem among the many SIMD bottleneck problems.

Multi-bank vector memory is also used to reduce the SIMD architecture overheads [16]-[20]. The image data are

interleaved and loaded into multiple memory modules sequentially. In [16], a modulo addressing mode was introduced to allow part of the bytes in a word to be accessed from both ends of a circular buffer to reduce external memory bandwidth. Chang et al. [17] proposed adding one extra memory module in addition to the number of ALUs in the SIMD processor to solve the possible memory module conflict problem. However, the number of memory modules must be relatively prime to the supported stride values, resulting in larger hardware cost in address generation and shuffling logic. In [18], a scalable data alignment scheme was proposed for rectangular block data access using simpler memory address generation. It is achieved by using a two-dimensional notation for both pixel location and memory module number. However, it is not flexible enough to support variable block sizes. It is noted that a block based data access approach is often used in many image and video processing applications, while the block size can differ between algorithms. Flexibility should be provided when designing a general purpose video processor to allow data access of variable block size and word length without greatly increasing hardware complexity.

In [19]-[20], a video signal processor with a read-permuter and a write-transposer placed, respectively, before and after the vector register file was described. They facilitate data reorganization in the SIMD register before execution, but it still needs N cycles to do an NxN transpose operation. Seo et al. [21], on the other hand, introduced diagonal memory organization and programmable crossbars in their SIMD architecture. The diagonal memory organization allows horizontal and vertical memory access without any conflict. Due to the data access complexity of the H.264 algorithm, 3 programmable crossbar shuffle networks are added such that any data shuffle pattern required by the H.264 algorithm can be supported. However, in order to accommodate complex data access patterns, only predefined fixed pattern crossbars are implemented. This limitation requires the crossbar patterns to be pre-designed based on the algorithm. They may not be flexible enough to realize future algorithm enhancements or support new video coding standards efficiently. Besides, the 3 shuffle networks make the SIMD pipeline longer, which may increase the branch mis-prediction penalty and the execute-to-consume latency between pipeline stages.

Another deficiency of the traditional approaches is that they do not have direct support for major kernel functions in image and video processing. This will be discussed in the next section.

III. ANALYSIS

As mentioned above, we use video coding as an example to illustrate the deficiency of the traditional SIMD architectures in supporting image and video processing kernel functions. It is well known that motion estimation is the most computation intensive function in H.264/AVC encoders. It contributes more than 50% of all computations [5][22]. If four reference frames are used, motion estimation alone accounts for more than 70% of the computation [22]. The next most intensive function is DCT/IDCT, which contributes about 10-20% of the computation. For H.264/AVC decoders, the most intensive functions are interpolation and inverse transform. They contribute about 20% and 5-10% of the computation respectively [7][11]. For intra-frame coders, the most complicated functions are the SATD transform for cost generation and mode selection, intra prediction, and DCT/Q/IDCT [23]. They contribute about 57%, 20% and 16% of the computation respectively. Therefore, if we can enhance the SIMD execution in motion estimation, transform and mode decision, the overall performance can be improved significantly. Among these functions, SAD, SATD, DCT/IDCT, and subpel interpolation are the main targets. In this section, we analyze two video encoding kernel functions in detail in order to demonstrate where the conventional SIMD architectures can be further enhanced. Example codes from the VideoLAN X264 open source project [23] are used to illustrate our findings. The VideoLAN X264 source uses the popular Intel MMX/SSE1-3 instructions to realize SIMD functions.

A. 4x4 Block SATD

We first analyze the 4x4 SATD function in H.264/AVC. The function comprises several smaller sub-functions: memory load, subtraction, two-dimensional (2-D) Hadamard transform, transpose and summation. We went through the source code of SATD in the VideoLAN X264 source [23]. The numbers of instructions used to complete these sub-functions with different block sizes are listed in TABLE II. The operations under the "Others" column are more easily improved by other techniques such as enhancing the SIMD instruction set extension; hence they are not discussed here.

TABLE II
INSTRUCTION COUNT BREAKDOWN OF SATD, IDCT AND DCT IN VIDEOLAN X264

Block    Memory  Subtrac-  1-D Trans-  4x4 Trans-  1-D Trans-  Others  Total  MMX or
Size     Load    tion      form        pose        form                       SSE2
SATD4x4 (pixel_satd_<blk_size> functions)
4x4      8       12        12          12          12          19      75     MMX
4x8      16      24        24          24          24          38      150    MMX
8x4      8       12        12          18          12          19      81     SSE2
8x8      16      24        24          36          24          38      162    SSE2
8x16     32      48        48          72          48          76      324    SSE2
16x8     32      48        48          72          48          76      324    SSE2
16x16    64      96        96          144         96          152     648    SSE2
DCT4x4DC (dct4x4dc functions)
4x4      4       0         12          12          12          9       49     MMX
IDCT4x4DC (idct4x4dc functions)
4x4      4       0         12          12          12          0       40     MMX
DCT4x4Residual (sub<blk_size>_dct functions)
4x4      8       12        14          12          14          0       60     MMX
8x8      16      24        28          36          28          3       135    SSE2
16x16    64      96        112         144         112         9       537    SSE2
IDCT4x4Residual (add<blk_size>_idct functions)
4x4      8       0         15          12          15          18      68     MMX
8x8      24      0         30          36          30          38      158    SSE2
16x16    96      0         120         144         120         140     620    SSE2

It can be seen that in the VideoLAN X264 source, the SIMD instructions used to realize the memory load, subtraction, the two 1-D Hadamard transforms and the transpose contribute about 75% of the total instructions of the 4x4 SATD function across the different block sizes.

In fact, as can be seen in TABLE II, these sub-functions are equally important in functions such as DCT and IDCT. Their efficient realization is obviously decisive in improving the overall performance.

Although these sub-functions are very simple, conventional SIMD architectures often cannot achieve the peak throughput due to the following 4 reasons:
1. lack of memory block load with different data length support;
2. limited support for data shuffling;
3. requirement of carrying out the same operations by all ALUs in the SIMD unit for each SIMD instruction execution; and
4. inability to support cross bank data access in a SIMD register file.

In the VideoLAN X264 4x4 SATD function, the MOVD instruction is used to load 4 pixel data bytes from memory to the lower double word of a 64-bit MMX register while filling the upper double word with zeros. Two PUNPCKLBW instructions are then used to unpack 8 data bytes from the two lower double words of the MMX registers, 4 in each, into the destination register (see Fig. 2). The instructions convert four packed data bytes to four packed words before subtraction. The unpack instructions prevent the execution result from overflowing in subsequent operations. It can be seen that the number of cycles spent just on memory load and subtraction can consume more than 30% of the total execution cycles of the sub-functions. This inefficient SIMD execution can be improved by loading data bytes from memory and extending them to data words before writing the packed words into the register.

Fig. 2. Packed word subtraction from packed bytes.

Conventional SIMD architectures often have limited support for data shuffling. As can be seen in TABLE II, the number of instructions for the implementation of the matrix transpose can be as high as 22% of the total instructions for computing SATD and other kernel functions. Note that a matrix transpose involves no arithmetic operations but only data shuffles. Most of these instructions are not required if a dedicated hardware construct is provided for data shuffling. In fact, data shuffling is required in many other parts of a video codec, which further justifies the need for an efficient data shuffling unit. Fig. 3 shows the basic operations as carried out by the VideoLAN X264 source for implementing the matrix transpose of a 4x4 block. Since there is no dedicated hardware for matrix transpose in MMX, the most efficient way to perform the transpose is to use the different unpack instructions PUNPCKLWD, PUNPCKHWD, PUNPCKLDQ and PUNPCKHDQ. It is seen that 8 instructions are required to implement the matrix transpose, most of which are unnecessary if a dedicated data shuffling unit is available. In the actual code of the VideoLAN X264 source, 12 instructions instead of 8 are used for each 4x4 block matrix transpose. Extra instructions are required to store the temporary results generated in the computation due to the insufficient number of registers.

Since most operations in H.264/AVC are performed in block mode, it is obvious that the efficiency of SIMD operations can be significantly uplifted by having all data in a block loaded into the register before the SIMD operations take place. Assume the bit-width of the registers is large enough such that all data in a block can be loaded into a register. Intuitively we expect more data can be processed at the same time. However, it is not the case, since very often different data in a block may need to undergo slightly different operations. More commonly, data of a block may need to work with other data in the block. Let us take the computation of the 2-D Hadamard transform in SATD as an example. The transform can be realized by applying a 1-D Hadamard transform to all columns and then all rows of a data block. A length-4 1-D Hadamard transform is defined as:

Y = H·X                                                       (1)

where X is the input 4x4 data block and Y is the transformed output. H is the transform matrix and is given by

        | 1  1  1  1 |
    H = | 1  1 -1 -1 |
        | 1 -1 -1  1 |
        | 1 -1  1 -1 |

Its application to the columns of a 4x4 data block can be implemented with the steps shown in (2)-(7):

A(i, j) = X(i, j) + X(i+1, j)   for i = 0, 2; j = 0-3.        (2)
B(i, j) = X(i, j) - X(i-1, j)   for i = 1, 3; j = 0-3.        (3)

Fig. 3. Basic operations of 4x4 matrix transpose.

Fig. 4. Basic operations in 1-D Hadamard transforms.

Fig. 5. 1-D Hadamard transforms in a 256-bit register.

Y(i, j) = A(i, j) + A(i+2, j)   for i = 0; j = 0-3.           (4)
Y(i, j) = B(i, j) + B(i+2, j)   for i = 1; j = 0-3.           (5)
Y(i, j) = A(i, j) - A(i-2, j)   for i = 2; j = 0-3.           (6)
Y(i, j) = B(i, j) - B(i-2, j)   for i = 3; j = 0-3.           (7)

Fig. 4 shows the basic operations as carried out by the VideoLAN X264 source for implementing the 1-D Hadamard transforms. Since each of Intel's MMX registers can only handle 4 data words at the same time, 8 PADDW/PSUBW instructions are required to complete the transform for all columns of a 4x4 data block. In fact, 12 instructions rather than 8 are used in the actual source code. It is again the insufficient number of registers that requires extra instructions to deal with temporary result storage.

To speed up the computation, an intuitive solution is to increase the bit-width of the register such that all data in a block can be processed at the same time. Assume now we have a 256-bit register as shown in the upper part of Fig. 5 such that all 16 data words of a block can be loaded into this register. We expect that all 16 data can be processed at the same time, but in fact they cannot. Based on (2)-(7), the operations to be performed for implementing the 1-D Hadamard transforms are shown in the lower part of Fig. 5. We can see that each data word requires adding itself to, or subtracting itself from, another word lane in the register. This requirement deviates from the operations performed by traditional SIMD structures, which require all ALUs in the SIMD unit to perform exactly the same operation whenever a SIMD instruction is executed. Besides, the ALUs must retrieve the operands from their own register banks. Although later multimedia extensions add new instructions to support cross bank operand retrieval, these instructions are either limited to several retrieval patterns or require additional instructions to configure the retrieval pattern in another register before use. The above means that even if we can throw in extra resources to provide long bit-width registers, the problem cannot be resolved without a redesign of the SIMD architecture. We show in Section IV how the proposed SIMD architecture handles these problems.

B. Fractional Motion Estimation

Fractional motion estimation is one of the new features of H.264/AVC. To illustrate the difficulty of realizing fractional motion estimation using SIMD structures, let us first recall the procedures for half and quarter pixel interpolation. As shown in Fig. 6, the half pixels located between two adjacent integer pixels (e.g. k) are interpolated by applying a 6-tap filter using three upper and three lower integer pixels (A, C, G, M, R, T), or three left and three right integer pixels (E, F, G, H, I, J). After all half pixels adjacent to integer pixels are generated, the half pixels located between half pixels (e.g. e) are interpolated by applying the 6-tap filter using either three upper and three lower half pixels (a, b, d, f, g, h) or three left and three right half pixels (i, j, k, m, n, q). Once all half pixels are generated, the quarter pixels can be interpolated. Those located adjacent to integer and half pixels (e.g. 0), and between half pixels (e.g. 4), are estimated by linear interpolation with the corresponding horizontal and vertical pixels. The remaining quarter pixels (e.g. 13) are linearly interpolated with two diagonally adjacent half pixels (e.g. d, k).

The generation of quarter pixels requires the strict order of pixel generation described previously. Arbitrarily storing the previous pixels will easily introduce much difficulty for later pixel interpolation using SIMD instructions without memory loads and stores, packing or shuffling beforehand. For example, as shown in Fig. 6, half pixel d works with half pixels k and m to generate quarter pixels 13 and 14, respectively. If, say, k and m are stored in the same row, then they must be in different banks, and it is thus impossible for d to be stored in the same bank as both k and m at the same time. In general, it is highly likely that some of the quarter pixel interpolations have to be performed with half pixels stored in different rows of different register banks. Such nearly "random" register access imposes great difficulty on traditional SIMD executions. Note that at the time the half pixels are generated and stored, the quarter pixels required to be generated are still unknown. It is thus very difficult to devise an optimized storage plan for the half pixel data to solve the abovementioned problem. For this reason, the conventional way to interpolate the quarter pixels is to store the interpolated half pixels back to memory and then reload them with unpacking, packing and shuffling before further execution. This greatly affects the SIMD execution throughput. In the following section, we show how the novel features of the proposed SIMD architecture handle the problems.
M f N
G d H
13 14
RgS
kem
IV. PROPOSED FEATURES
15 16
ThU MfN
A. New Parallel Memory Structure
Fig. 6. Subpel interpolation. Most video applications process the acquired video data in
the unit of block. Hence to avoid frequent memory access, it
is always desirable to load all data in a block to registers
before

further operations take place. However, traditional memory storages only allow sequential memory access. For this reason, multi-bank or parallel memory structures were proposed to allow multiple data to be accessed concurrently [16]-[20]. Similar to the previous approaches, the proposed SIMD architecture is equipped with a 32KB parallel memory structure that serves as a buffer between the external memory and the register file, as shown in Fig. 7. The parallel memory is divided into 16 modules, each of which has a size of 2K bytes and has a separate data bus connected to one of the 16 banks of a register file. Each register bank has 32 rows and each element of a register bank can store a 16-bit word (in fact, the register file is constructed from 32 256-bit registers).

Fig. 7. Proposed parallel memory structure (external memory; parallel memory structure of 16 banks x 2048 x 8 bits; register file of 16 banks x 32 x 16 bits and 16 ALUs).

Fig. 8 shows the relationship between the logical offset address and the physical address defined in the proposed parallel memory structure. In the proposed architecture, the logical address is unique for a memory location, and the physical address is the real address generated for every memory module for data access. The data from external memory are interleaved and loaded into the internal parallel memory modules such that when accessing a data block, only one datum from each memory module needs to be retrieved, no matter where the block is located. Fig. 9 shows how the pixels of an image are stored in different memory modules to facilitate data retrieval of different block sizes. In the figure, the characters grouped inside a dashed square block refer to the 16 memory modules that can be accessed by the same physical address. The numbers 0-9 and letters a-f denote the memory bank number (a-f stand for banks 10-15 respectively). To show that there is no access conflict when loading a block of data from the parallel memory to the register, a few examples are shown in Fig. 9. The characters grouped inside a solid square block refer to the data blocks to be retrieved to the register. It can be seen that for both 4x4 (Fig. 9a) and 8x2 (Fig. 9b) block accesses, one datum will be retrieved from each parallel memory module no matter where the block is located (the data loading method for a 2x8 block is similar to that in Fig. 9b except that the memory module assignment is transposed). However, it can also be seen that the required data may be stored in different physical addresses (the pixels occupy different dotted boxes). An efficient address generator is needed to determine the required physical address for each memory module.

Fig. 9. Memory interleave to allow block access: (a) 4x4 blocks; (b) 8x2 blocks.

In fact, the data loading from external memory to the internal memory modules is performed by a direct memory access (DMA) unit following the mapping functions described below. Let As be the starting address of the part of a video frame to be loaded into the parallel memory and Af be the address of a pixel within that part of the video frame. Then

Af = As + Aoff                                                (8)

where Aoff is the offset of the address of that pixel from the starting address. Assume that the video frame has the size of Nx columns and Ny rows. Then Aoff can always be written as

Aoff = y·Nx + x   for x = 0 to Nx-1 and y = 0 to Ny-1.        (9)

Alternatively, the indices x and y can be obtained from Aoff by:

y = ⌊Aoff / Nx⌋ and x = ⟨Aoff⟩Nx                              (10)

where ⌊·⌋ is the floor function and ⟨a⟩b stands for a modulo b. Let {m, p} be the module number and the physical address respectively of the parallel memory structure as shown in Fig. 8. By inspecting Fig. 9, the mapping functions that the DMA unit should use for loading the data to the parallel memory are:

For 4x4 block loading,

p = ⌊y/4⌋·(Nx/4) + ⌊x/4⌋                                      (11)
m = ⟨x⟩4 + 4·⟨y⟩4                                             (12)

For 8x2 block loading,

p = ⌊y/2⌋·(Nx/8) + ⌊x/8⌋                                      (13)
m = ⟨x⟩8 + 8·⟨y⟩2                                             (14)

Similarly, for 2x8 block loading,

p = ⌊y/8⌋·(Nx/2) + ⌊x/2⌋                                      (15)
m = ⟨x⟩2 + 2·⟨y⟩8                                             (16)

Once the block size is known, the DMA unit will load the data from external memory to the parallel memory following the respective mapping functions. Then data can be retrieved from the parallel memory to the register efficiently. Assume that a 4x4 block whose first pixel has indices {xs, ys} is to be retrieved. The pixels in the block can be described by {xs+xo, ys+yo}, where xo, yo = 0 to 3. Following from (11),

p = ⌊(ys+yo)/4⌋·(Nx/4) + ⌊(xs+xo)/4⌋                          (17)

Fig. 8. Parallel memory logical offset address and physical address.

Let ys = ys' + ⟨ys⟩4 and xs = xs' + ⟨xs⟩4 such that ys' and xs' are always divisible by 4. (17) can then be written as

p = (ys'/4 + ⌊(⟨ys⟩4 + yo)/4⌋)·(Nx/4) + xs'/4 + ⌊(⟨xs⟩4 + xo)/4⌋   (18)

The two floor functions in (18) can only be equal to 0 or 1. Therefore, for any 4x4 block whose first pixel has indices {xs, ys}, all data of the block are stored in at most 4 different physical addresses in the parallel memory. So if the first datum is stored at ps, the rest must not be stored at addresses other than {ps+1, ps+Nx/4, ps+Nx/4+1}. More specifically, let

qy = ⌊(⟨ys⟩4 + yo)/4⌋ and qx = ⌊(⟨xs⟩4 + xo)/4⌋                    (19)

then

    | ps              if qy = 0 and qx = 0
p = | ps + 1          if qy = 0 and qx = 1                         (20)
    | ps + Nx/4       if qy = 1 and qx = 0
    | ps + Nx/4 + 1   if qy = 1 and qx = 1

The following further shows that with only the indices of the first pixel {xs, ys}, we can easily determine the physical address for each module. Again we use the 4x4 block retrieval as an example. From (12), it can be seen that:

m = ⟨xs + xo⟩4 + 4·⟨ys + yo⟩4                                      (21)

⟨m⟩4 = ⟨xs + xo⟩4                                                  (22)

Note that m = 0 to 15 is the index of the 16 parallel memory modules. Substituting (22) into (21), we have

⌊m/4⌋ = ⟨ys + yo⟩4                                                 (23)

... negligible as far as a video processor is concerned. The above derivation can also be applied to 2x8 and 8x2 block retrieval:

a. 8x2 block access

qy = ⌊(⟨ys⟩2 + ⟨⌊m/8⌋ - ys⟩2)/2⌋ = ⌊qy'/2⌋                         (30)
qx = ⌊(⟨xs⟩8 + ⟨m - xs⟩8)/8⌋ = ⌊qx'/8⌋                             (31)

    | ps              if qy' < 2 and qx' < 8
p = | ps + 1          if qy' < 2 and qx' ≥ 8                       (32)
    | ps + Nx/8       if qy' ≥ 2 and qx' < 8
    | ps + Nx/8 + 1   if qy' ≥ 2 and qx' ≥ 8

b. 2x8 block access

qy = ⌊(⟨ys⟩8 + ⟨⌊m/2⌋ - ys⟩8)/8⌋ = ⌊qy'/8⌋                         (33)
qx = ⌊(⟨xs⟩2 + ⟨m - xs⟩2)/2⌋ = ⌊qx'/2⌋                             (34)

    | ps              if qy' < 8 and qx' < 2
p = | ps + 1          if qy' < 8 and qx' ≥ 2                       (35)
    | ps + Nx/2       if qy' ≥ 8 and qx' < 2
    | ps + Nx/2 + 1   if qy' ≥ 8 and qx' ≥ 2

For actual implementation, each parallel memory module is installed with an address generation unit, as shown in Fig. 10, for the implementation of (29), (32), or (35) based on the selected block size. An additional address generation unit is responsible for pre-computing the 4 possible physical addresses in each case. A 4-to-1 multiplexer is installed in each address generation unit for selecting one of the supplied addresses based on the results of (29), (32), or (35).
y   Add. Add. Gen. m=1 Add. Gen. m=15

4 m  m 4 / 4  m / 4  (24) Gen.
xs ysbsNxdsdsb
Hence
m=0
Total 16 Memory modules

y s y o  m / 4  (25)
4

yo 4 y o m / 4  y s4 (26) xs ys bs xs ys bs xs ys bs
4

Substitute (26) and (22) to (19), we can express qx and qy in Note: bs  block size; refer (36) for the definition of ds and dsb
terms of m, xs and ys as follows: Fig. 10. Address generation for parallel memory structure.

q y  q y ' 4  qy ' ys 4  m / 4 ys 4 (27) Fig. 11 shows an example of loading 352x40 pixels of a CIF
4
where 

qx  qx ' 4 where qx ' xs 4  m  xs 4 (28)


image from external memory into 16 internal memory modules.
4 Again we assume the 4x4 block access is selected hence the
Since qx and qy must be equal to {0, 1}, (20) can be written as: way of data loading follows (11) and (12). The numbers in the
p q y'  4 and q x'  4 figure are the physical addresses. To load the 16 pixels of the
s

 ps 1 q '4 and q x '  4 (29) upper 4x4 block in Fig. 11, physical addresses 174, 174, 174,
y

p  N x/ 4 q y '  4 and q x'  4 173, 174, 174, 174, 173, 86, 86, 86, 85, 86, 86, 86, 85 are
 s generated by the address generation units installed with
 ps  Nx /4  q y '  4 and q x '  4
 memory module 0 to 15, respectively, following (27), (28), and
1
The evaluation of qy’ and qx’ is very simple. The modulo (29). The lower 4x4 box in Fig. 11 depicts the actual
function . can be implemented by extracting the last 2 bits
4 implementation when the block access crosses the last row
of the number. m / 4 can be implemented by shifting m to (assume the last physical address in memory module is 879).
It can be seen that the data access are “wrapped” back to the
right by 2 bits. The addition and subtraction can be
beginning of the buffer. The physical addresses generated for
implemented by a small adder. And finally the comparison
each memory module in this case are 3, 3, 3, 2, 795, 795,
between qy’ and qx’ with a constant 4 can be implemented by
795,
checking the output carrier bit of the small adder. All the
794, 795, 795, 795, 794, 795, 795, 795, 794.
above can be implemented by less than 20 logic gates, which
It is known that the dominant word lengths in video
is
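The per-module address computation in (26)-(29) can be checked against the direct mapping (11). The Python sketch below (image width N_x = 352, illustrative) mimics one address generation unit: from the module index m and the block origin {xs, ys} it derives q_y' and q_x' using only shifts, 2-bit masks and small additions, selects one of the four pre-computed addresses, and the result always agrees with evaluating (11) for the actual pixel held by that module.

```python
NX = 352                                   # image width (CIF, illustrative)

def direct_address(x, y):
    """Mapping (11): physical address of pixel (x, y) inside its module."""
    return (y // 4) * (NX // 4) + (x // 4)

def generated_address(m, xs, ys):
    """Address produced by module m's generation unit for the 4x4 block
    whose first pixel is (xs, ys), following (26)-(29)."""
    yo = ((m >> 2) - (ys & 3)) & 3         # (26): row offset inside block
    xo = ((m & 3) - (xs & 3)) & 3          # column offset inside block
    qy_p = (ys & 3) + yo                   # (27): q_y = 1 iff qy_p >= 4
    qx_p = (xs & 3) + xo                   # (28): q_x = 1 iff qx_p >= 4
    ps = direct_address(xs, ys)            # address of the first pixel
    # (29): choose one of {ps, ps+1, ps+NX/4, ps+NX/4+1}
    return ps + (NX // 4) * (qy_p >= 4) + (qx_p >= 4)

# Every module's generated address matches the direct mapping for the
# pixel that the module actually stores, for any block origin.
for xs in range(40):
    for ys in range(40):
        for m in range(16):
            yo = ((m >> 2) - (ys & 3)) & 3
            xo = ((m & 3) - (xs & 3)) & 3
            assert generated_address(m, xs, ys) == direct_address(xs + xo, ys + yo)
```

The same pattern, with the shift and mask widths changed, reproduces the 8x2 and 2x8 cases of (30)-(35).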
Fig. 11. Memory module physical addresses for CIF image pixels.

It is known that the dominant word lengths in video applications are 8 and 16 bits [26]. This is because most video kernel functions take in pixel operands (8 bits in size) for computation and generate 16-bit data words. Temporary results with the size of a word or even a double word may be stored back to the parallel memory structure from the registers, and retrieved later for further processing. Traditional parallel memory structures do not explicitly support word or double word data access; extra packing, unpacking and shifting instructions may be required to achieve it, which lowers the efficiency of execution. The vector memory access in our architecture can easily be extended to support word and double word memory access. When storing a data word from the registers back to the parallel memory structure, its two bytes are stored at successive physical addresses of the same module. The case for double word access is similar: the four bytes of a double word are stored at four consecutive physical addresses of the same module. For example, if the first byte of a double word is stored at physical address 0x002 (logical address 0x22) of module #2 in Fig. 8, the other 3 bytes will be stored at addresses 0x003, 0x004, and 0x005 (logical addresses 0x32, 0x42 and 0x52) of module #2. In general, when accessing vector data of different sizes in the parallel memory (either storing or retrieving), we use the following equations, which are modified from (29) (again the 4x4 case is used as an example without loss of generality):

    p = ps + dsb,                          if q_y' < 4 and q_x' < 4
        ps + 1 + dsb,                      if q_y' < 4 and q_x' >= 4
        ps + (ds \cdot N_x)/4 + dsb,       if q_y' >= 4 and q_x' < 4
        ps + (ds \cdot N_x)/4 + 1 + dsb,   if q_y' >= 4 and q_x' >= 4    (36)

Here ds stands for the data size, which has value 1, 2 or 4 for data access of bytes, words, or double words, respectively; dsb is an index to the byte to be accessed within a byte, word or double word; and ps is the physical address of the first pixel. Note that there is no change to the evaluation of q_x' and q_y'. It means that the 16 address generators associated with the parallel memory modules remain the same; only the additional address generator is involved to take care of the changes in word length. This greatly simplifies the hardware structure for address generation. Overall, a 4x4 block of bytes, words or double words can be retrieved from, or stored back to, the parallel memory modules in 1, 2 and 4 cycles, respectively. The same applies to the 2x8 and 8x2 block access cases.

Compared with previous parallel memory structures for video processing, such as [18], the proposed approach allows more flexibility in data access. Besides data of different word lengths, the proposed approach also allows data access of different block sizes, such as 2x8, 4x4 and 8x2. This would be difficult to achieve in [18] since the memory modules there are hardwired to a specific two-dimensional form. Although more address generators are needed in the proposed approach, each of them is so simple that its complexity is negligible as far as a video processor is concerned. The support of word and double word accesses is a unique feature that cannot be found in the previous parallel memory structures [16]-[20].

B. Configurable SIMD

The second new feature of the proposed SIMD architecture is the configurable SIMD [24], which provides 2 useful functions: almost random RF (register file) access and MIMD-like (Multiple Instruction Multiple Data) execution support. The SIMD register file in the proposed SIMD architecture is viewed as 16 banks, each of which has 32 entries specified by the row address. The almost random RF access in the proposed SIMD architecture is supported by a mux control unit placed between the registers and the ALUs, as shown in Fig. 12, together with the RF row address control unit. Inside the mux control unit there is a crossbar switch by which each ALU can retrieve a datum from any register bank. The full crossbar switch supports any operand shuffling pattern used in matrix transpose, matrix multiplication, subpel interpolation, luma intra prediction and other operations in video applications. The RF row address control unit is used to generate 16 row addresses, one to each RF bank. With both the mux control unit and the RF row address control unit, almost random RF access is supported.

Fig. 12. The proposed SIMD architecture with switching control.

Although two operands can be read from a register bank at the same time, we impose several architectural constraints in order to save hardware cost. Firstly, the crossbar switch supports the shuffling of one operand only; that is, if a SIMD operation requires two operands, one of them must still be retrieved from the ALU's own bank. Secondly, the crossbar switch allows a maximum of one datum coming from another register bank, to prevent the register bank conflict problem and to limit the number of register bank output ports to two. Thirdly, the crossbar only shuffles word-size operands; if a double-word operand shuffle is needed, two SIMD instructions are used to shuffle the whole operand. Finally, the computed result is restricted to be written back to the ALU's own RF bank. Because of these hardware constraints, configurable SIMD allows only almost, but not fully, random RF access.
To support random RF access in configurable SIMD, a look-up table named the configurable SIMD look-up table (CSLUT) is introduced. The CSLUT is made of 5 memory modules; their logical and physical addresses, as well as their structure, are shown in Fig. 13. There are three major types of configuration data in the table. The 80-bit row address configuration data and the 64-bit bank configuration data specify, respectively, the register row addresses and register bank numbers of the 16 operands to be retrieved. The mux control unit takes in the 64-bit bank configuration data from the CSLUT so that each ALU operand can be retrieved from any bank. The RF row address control unit takes in the 80-bit row address configuration data from the CSLUT and generates 16 row addresses, one to each RF bank. If only the bank number configuration data in the CSLUT is used, 16 operands on the same row, from the different banks specified in the bank configuration data, are retrieved. If only the row address configuration data in the CSLUT is used, each ALU in the SIMD takes one operand, at any row address specified in the row configuration data, from its own RF bank. Using both row and bank configuration data, an operand at any row address can be retrieved from any bank to achieve random RF access.

Fig. 13. CSLUT for configurable SIMD instruction.

Besides near random RF access, the proposed configurable SIMD also provides MIMD-like execution support, i.e. it allows a minor difference in operation among the ALUs. To accommodate this, a 16-bit miscellaneous (misc) column is introduced into the CSLUT for indicating the slightly different operations to be performed among the ALUs. For example, we can use this column to define whether an addition or a subtraction is to be performed by each ALU in a SIMD processor. It is useful in many fast transform algorithms, including the Hadamard transform used in the 4x4 SATD function, as will be shown in the next section.

To access the table, a set of so-called CSIMD (Configurable SIMD) instructions is provided in the instruction set. These instructions have a particular field to store the address for accessing a particular entry in the CSLUT. The formats of a typical instruction and a CSIMD instruction are shown in Fig. 14 for comparison. In the figure, CMD is the instruction opcode. The MISC field specifies execution controls such as operand shift bits, zero or sign extension options, etc. For typical instructions, the register row addresses of the two sources and the destination are specified by RS1, RS2 and RD, respectively. Each ALU gets its two operands at row addresses RS1 and RS2 from its own register bank, and writes the execution result to row address RD of its own register bank. That is, typical instructions allow neither cross-register-bank data access nor different-row data access. For CSIMD instructions, the CSLUT address is specified in the CSLUT_ADDR field. Each ALU can get one of its operands at any row address of any register bank specified in the CSLUT. Furthermore, slightly different operations are allowed to be executed among the ALUs, as mentioned above. These provide great flexibility that fully addresses the problems of traditional SIMD architectures discussed above.

Fig. 14. Instruction fields of a typical and a CSIMD instruction.

A methodology similar to configurable SIMD is also proposed in a patent application [25]. Compared to this patent, the proposed configurable SIMD has additional advantages. First of all, the configuration data in the patent are stored not in an SRAM-based LUT but in a programmable logic array (PLA); hence the extent of reconfiguration is limited. That is also why the patent design needs extra data called Pseudo Static Control Information (PSCI), in addition to the configuration data retrieved from the instruction field, to generate the reconfiguration data. The PSCI dictates aspects of the functionality and behavior of the execution unit and the crossbar interconnect. It cannot be dynamically reconfigured on a cycle basis via instructions; instead, a dedicated PSCI-setting instruction is used to update the PSCI data from time to time. On the other hand, the proposed configurable SIMD uses SRAM as the configuration data storage, which allows a much larger extent of reconfiguration. The reconfiguration can be done dynamically on a cycle basis by getting the look-up table entry address from the instruction.

Fig. 15. 4x4 1-D Hadamard transform using only two CSIMD instructions.
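To make the mechanism concrete, the following Python sketch models one CSIMD instruction functionally (operand A from the ALU's own bank, operand B routed through the crossbar from the bank given by the CSLUT entry, the misc bit selecting add or subtract, write-back to the own bank) and then replays the two CSLUT entries of TABLE III, checking that they compute four length-4 Hadamard transforms as claimed for Fig. 15. The register layout (element X_rc of the 4x4 block in bank 4r+c of one row) follows the ALU0 example in the text; port-conflict constraints are not modeled, and the output signs may differ from the paper's convention, which is harmless for SATD since absolute values are summed.

```python
import random

def csimd(src, bank, misc):
    """One CSIMD instruction over a 16-bank register row: ALU i combines
    the value in its own bank with the value in bank[i]; misc[i] = 1
    selects addition, 0 selects subtraction; results go back to bank i."""
    return [src[i] + src[bank[i]] if misc[i] else src[i] - src[bank[i]]
            for i in range(16)]

# TABLE III rewritten in ALU order 0..15 (the table lists ALUs f..0).
BANK1 = [4, 5, 6, 7, 0, 1, 2, 3, 12, 13, 14, 15, 8, 9, 10, 11]
MISC1 = [1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0]
BANK2 = [8, 9, 10, 11, 12, 13, 14, 15, 0, 1, 2, 3, 4, 5, 6, 7]
MISC2 = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]

X = [[random.randint(-128, 127) for _ in range(4)] for _ in range(4)]
row5 = [X[i // 4][i % 4] for i in range(16)]   # X_rc assumed in bank 4r+c
row6 = csimd(row5, BANK1, MISC1)               # first CSIMD instruction
row7 = csimd(row6, BANK2, MISC2)               # second CSIMD instruction

# Reference: direct 4-point Hadamard transform applied to each column.
H4 = [[1, 1, 1, 1], [1, -1, 1, -1], [1, 1, -1, -1], [1, -1, -1, 1]]
for c in range(4):
    col = [X[r][c] for r in range(4)]
    for k in range(4):
        ref = sum(H4[k][r] * col[r] for r in range(4))
        assert abs(row7[4 * k + c]) == abs(ref)  # equal up to sign
```

Two instructions thus replace the eight-instruction MMX sequence described earlier, with the operand shuffling entirely absorbed by the CSLUT configuration.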
Besides, the crossbar in the patent design is controlled totally by the PSCI data. It only provides shuffling on operands read from the register file location specified by the source operand address instruction field; it does not allow random register file access as the proposed approach does.

In the following subsections, we demonstrate how the implementation of H.264/AVC kernel functions is made simple by the new CSIMD structure. We particularly use SATD and fractional motion estimation as examples, although similar improvements can also be achieved in other kernel functions such as intra prediction. To simplify our discussion, we assume that video data are accessed in the form of 4x4 blocks. Operations involving larger data blocks are composed by combining the results of the constituent 4x4 blocks.

1) SATD Computation:
A SATD computation consists of data loads, subtraction, a 2-D 4x4 Hadamard transform, matrix transposes, taking the absolute values of the transformed data, and summation. Let us first consider the realization of the 2-D 4x4 Hadamard transform. As discussed above, a 2-D 4x4 Hadamard transform can be implemented by four length-4 1-D Hadamard transforms applied to the rows, followed by another four applied to the columns. Fig. 4 shows that at least 8 instructions are needed to perform each set of four 1-D Hadamard transforms using the MMX instruction set, due to insufficient register bit-width. We have also shown in Fig. 5 that even with the resources to install registers of sufficient bit-width, such that all data of a block can be loaded into one register, we still cannot easily implement the 1-D Hadamard transforms using SIMD instructions, since different operations are performed in different register banks and they may require operands from different register banks. With the proposed SIMD architecture, we use only 2 CSIMD instructions to realize each set of four 1-D Hadamard transforms, as shown in Fig. 15. Before execution, the 4x4 input data are placed in a 256-bit SIMD register in, say, row 5. Each CSIMD instruction takes one operand from its own RF bank and one operand from another bank to perform either an addition or a subtraction. For example, in the first CSIMD instruction, ALU0 (the rightmost one) takes data X00 from its own bank and data X10 in bank 4 of row 5 to perform an addition, while ALU3 (the fourth one from the right) takes data X10 from its own bank and data X00 in bank 0 of row 5 to perform a subtraction. All configuration information is specified in the row, bank and misc memory content of the CSLUT; the misc configuration in the CSLUT specifies whether an addition (e.g. "1") or a subtraction (e.g. "0") is performed. The complete configuration data in the CSLUT to perform each set of four length-4 1-D Hadamard transforms is shown in TABLE III. Such a feature provides great flexibility in program design and in turn leads to a reduction of SIMD instructions in the program.

TABLE III
ROW, BANK AND MISC CONFIGURATION IN CSLUT FOR THE IMPLEMENTATION OF THE HADAMARD TRANSFORMS

ALU                   f e d c b a 9 8 7 6 5 4 3 2 1 0
First CSIMD   ROW     5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5
              BANK    b a 9 8 f e d c 3 2 1 0 7 6 5 4
              MISC    0 0 0 0 1 1 1 1 0 0 0 0 1 1 1 1
Second CSIMD  ROW     6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6
              BANK    7 6 5 4 3 2 1 0 f e d c b a 9 8
              MISC    0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1

The full crossbar switch also greatly enhances the performance of matrix transpose. Referring to TABLE II, the instruction counts to perform a 4x4 block transpose in VideoLAN X264 are 12 and 9 when using MMX and SSE2, respectively. With CSIMD, a 4x4 block transpose can be carried out in one clock cycle; it is simply one of the shuffling operations supported by the full crossbar switch. In fact, in the actual implementation of the SATD function, the matrix transpose operation is embedded into the second 1-D 4x4 Hadamard transform. That is, we do not need to dedicate a CSIMD instruction to the transpose operation; it is done together with the second 1-D 4x4 Hadamard transform. Overall, the proposed SIMD architecture takes only 4 instructions to perform the first two steps of a 4x4 SATD function (from memory load to 2-D Hadamard transform), while VideoLAN X264 takes 56 instructions to do the same. In fact, the proposed CSIMD structure can also greatly benefit the implementation of a few other similar functions of H.264/AVC and MPEG-4, such as the 4x4 IDCT/DCT and 4x4 matrix multiplication. In both cases, the proposed SIMD architecture takes only 2 instructions to finish.

2) Efficient H.264/AVC fractional motion estimation:
To compute the fractional motion estimation for a 4x4 block, a maximum of 10x10 integer pixels is needed, which can be loaded into 9 register rows with row addresses 1 to 9, respectively, as shown in Fig. 16. The square boxes in the figure represent the integer pixels, and the number inside each square box is the register bank where the pixel data are stored.

Fig. 16. Subpel interpolation by CSIMD.
The 6-tap filtering operation for the integer-to-half interpolation is done by six multiply-and-accumulate (MAC) instructions. In Fig. 16 (upper left hand side), the solid-line triangle half pixel c is generated by multiplying the integer pixels in row 2 of banks 4, 8 and c, and the integer pixels in row 9 of banks 0, 4 and 8, by the 6 filter taps and summing the results. It can be seen that the operations require nearly random access to different rows of different register banks. For instance, the circle quarter pixel 0 is interpolated from the half pixels in row 10 of bank 0 and in row 0 of bank c, while the quarter pixel 5 is interpolated from the half pixels in row 0 of bank 1 and row 10 of bank 5. The lower part of Fig. 16 shows how the half and/or quarter pixels are retrieved randomly from any bank of any row before execution. Note that, for clarity, the multipliers and adders in the figure only show the operations required for the interpolation; they do not represent the real hardware. Also, the second operand (the filter tap) of each multiplier is not shown in the figure. As mentioned, the integer-to-half interpolation is performed by 6 MAC instructions. Each MAC takes one operand from the location specified by the CSLUT before it is multiplied by a filter tap and added to the previous MAC result. While such random register access would introduce much difficulty to traditional SIMD execution, the proposed CSIMD structure handles it easily with the use of the CSLUT and the crossbar switch. TABLE IV shows the related information stored in the CSLUT for the interpolation of the solid-line and dotted-line triangle half pixels, as well as the circle quarter pixels, in Fig. 16. The "B" and "R" columns refer to the register bank and the row number of the pixels to be retrieved and sent to the ALU to perform one MAC operation. For each quarter pixel interpolation, a CSIMD instruction is issued and the required entries in the CSLUT are retrieved. The related register access information is sent to the register file and the crossbar switch. With the help of the crossbar switch, one of the operands required in the interpolation can be obtained from any row of any register bank. The whole fractional motion estimation can thus be evaluated efficiently without extra memory loads and stores, and without the redundant packing and unpacking operations.

TABLE IV
REGISTER ROW NUMBER AND BANK INFORMATION IN CSLUT FOR SUBPEL INTERPOLATION.
(Entries are B,R = bank,row pairs.)

ALU | Integer to Half (dotted)      | Integer to Half (solid)       | Half to Quarter
0   | 1,4 2,4 3,4 0,9 1,9 2,9       | 8,2 c,2 0,9 4,9 8,9 c,9       | 0,0 c,a
1   | 2,4 3,4 0,9 1,9 2,9 3,9       | 9,2 d,2 1,9 5,9 9,9 d,9       | 1,0 d,a
2   | 3,4 0,9 1,9 2,9 3,9 0,5       | a,2 e,2 2,9 6,9 a,9 e,9       | 2,0 e,a
3   | 0,9 1,9 2,9 3,9 0,5 1,5       | b,2 f,2 3,9 7,9 b,9 f,9       | 3,0 f,a
4   | 5,4 6,4 7,4 4,9 5,9 6,9       | c,2 0,9 4,9 8,9 c,9 0,7       | 4,0 0,a
5   | 6,4 7,4 4,9 5,9 6,9 7,9       | d,2 1,9 5,9 9,9 d,9 1,7       | 5,0 1,a
6   | 7,4 4,9 5,9 6,9 7,9 4,5       | e,2 2,9 6,9 a,9 e,9 2,7       | 6,0 2,a
7   | 4,9 5,9 6,9 4,5 5,5 6,5       | f,2 3,9 7,9 b,9 f,9 3,7       | 7,0 3,a
8   | 9,4 a,4 b,4 8,9 9,9 a,9       | 0,9 4,9 8,9 c,9 0,7 4,7       | 8,0 4,a
9   | a,4 b,4 8,9 9,9 a,9 b,9       | 1,9 5,9 9,9 d,9 1,7 5,7       | 9,0 5,a
a   | b,4 8,9 9,9 a,9 b,9 8,5       | 2,9 6,9 a,9 e,9 2,7 6,7       | a,0 6,a
b   | 8,9 9,9 a,9 b,9 8,5 9,5       | 3,9 7,9 b,9 f,9 3,7 7,7       | b,0 7,a
c   | d,4 e,4 f,4 c,9 d,9 e,9       | 4,2 8,2 c,2 0,9 4,9 8,9       | c,0 8,a
d   | e,4 f,4 c,9 d,9 e,9 f,9       | 5,2 9,2 d,2 1,9 5,9 9,9       | d,0 9,a
e   | f,4 c,9 d,9 e,9 f,9 c,5       | 6,2 a,2 e,2 2,9 6,9 a,9       | e,0 a,a
f   | c,9 d,9 e,9 f,9 c,5 d,5       | 7,2 b,2 f,2 3,9 7,9 b,9       | f,0 b,a

V. EXPERIMENTAL RESULTS
Extensive simulations have been performed to evaluate the performance of the proposed CSIMD architecture in two aspects: memory accesses and cycle counts for computing major H.264 kernel functions. To evaluate the performance in memory accesses, two Baseline Profile C models were used in our experiments for comparison. One is the Optimized JM Encoder, which is optimized from the JM7.4 reference model by removing the Main Profile features, dynamic memory allocation and release, and rate-distortion optimization. The other one is our CSIMD H.264 Encoder, which is based on the Optimized JM Encoder and further enhanced by all the proposed features described in this paper, namely the advanced parallel memory structure with variable block size and word length support, and the CSIMD structure that allows nearly random register access. We use the number of memory accesses as a yardstick for performance evaluation because they directly affect, to a large extent, the overall computation time. The proposed CSIMD H.264 Encoder is equipped with a 16-module parallel memory structure plus the efficient address generation units. The memory accesses here refer to accesses to the parallel memory. Note that for the proposed CSIMD H.264 Encoder, data accesses to external memory are achieved using a hardware DMA unit, similar to other traditional parallel memory systems.

Based on the above, the numbers of memory accesses for computing integer and fractional motion estimation (IME and FME) required by the two models are evaluated. The motion estimation is done on a CIF resolution image. TABLE V shows the results obtained in the simulation. In the table, LS and VLS stand for the numbers of load/store and vector load/store instructions, respectively.

TABLE V
MEMORY ACCESS AND INSTRUCTION COUNT REDUCTION.

      Optimized JM               CSIMD Encoder
ME    LS          Instr. Cnt     LS       VLS      LS+(16*VLS)  Instr. Cnt
IME   10,565,010  35,446,064     476,452  507,295  8,593,172    3,668,072
FME   23,956,652  96,534,062     166,954  102,155  1,801,434    789,070

Since our algorithm uses a bottom-up approach, the vector LS in the CSIMD Encoder mainly refers to 4x4 block loads or stores in our simulation. Note that other block sizes, such as 2x8 or 8x2, can also be easily implemented using the proposed parallel memory structure and the address generation unit. The number of instructions required for loading or storing a 4x4 block by the Optimized JM Encoder varies from block to block, depending on whether the block is aligned in memory. Since one vector LS instruction can replace at most 16 scalar LS instructions, if the Optimized JM Encoder and the CSIMD Encoder differed only in the parallel memory structure, the scalar LS instructions required by the Optimized JM Encoder (Opt.JM_LS) should be close to
the sum of the scalar LS (CSIMD_LS) instructions and 16 times the vector LS (CSIMDvec_LS) instructions required by the CSIMD Encoder. However, it can be seen in TABLE V that

    Opt.JM_LS >> CSIMD_LS + (16 \cdot CSIMDvec_LS)    (37)

This is particularly true in fractional motion estimation. It shows that while the parallel memory structure helps to reduce the memory accesses, the introduction of the other features in the CSIMD Encoder, in particular the "random" register access feature, gives a further saving in memory accesses. This is especially the case for fractional motion estimation. In fact, when using the proposed SIMD architecture for computing motion estimation, less than 10% of the SIMD instructions in integer motion estimation are CSIMD instructions, while more than 90% of the SIMD instructions in fractional ME are CSIMD instructions. This explains why the improvement for fractional ME is so significant. As a result, the total number of memory accesses for integer motion estimation is reduced by ~10.7 times, and that for fractional motion estimation is reduced by ~89.0 times. The table also shows that the total instruction counts to perform the integer and fractional motion estimation are reduced by ~9.7 and ~122.4 times, respectively, compared with the Optimized JM Encoder.

To give an idea of how the proposed SIMD architecture compares with the state-of-the-art SSE/MMX SIMD architecture, the execution cycles to perform 4x4 SATD and IDCT/DCT using the proposed CSIMD Encoder model and VideoLAN X264 are estimated. We developed a performance simulator to emulate our CSIMD Encoder; the simulator is a cycle-accurate model. Since there is no VideoLAN X264 performance simulator, we modified our performance simulator to emulate the SSE/MMX instructions in VideoLAN. TABLE VI shows the speedup of the proposed CSIMD Encoder over VideoLAN X264 for the computation of SATD and IDCT/DCT of different block sizes. It is seen that an improvement of 2.1X to 4.6X can be achieved. Note that the speedup of SATD for block size 4x8 is exceptionally high. This is because Intel's SSE/MMX does not support strided loads, so one row of 4 pixels from each of the upper and lower 4x4 blocks inside the 4x8 block cannot be loaded into one SSE register in VideoLAN; hence the two 4x4 blocks can only be processed separately in MMX registers.

TABLE VI
EXECUTION CYCLES SPEEDUP VERSUS VIDEOLAN X264.

Function    SATD4x4                               DCT4x4                IDCT4x4
                                                  DC    Residual        DC    Residual
Block Size  4x4  4x8  8x4  8x8  8x16  16x8  16x16 4x4   4x4  8x8  16x16 4x4   4x4  8x8  16x16
Speedup     2.9  4.6  2.6  2.4  2.5   2.5   2.3   2.7   2.6  2.4  2.7   2.1   3.5  3.3  3.7

TABLE VII further shows the simulation results when computing SATD and IDCT/DCT in an H.264 encoding process. In this simulation, one second of the video sequence Stefan (25 frames, 1I+24P) with CIF resolution was used. The table shows the cycle count reduction obtained by using the proposed CSIMD Encoder model as compared with the SIMD implementation using the VideoLAN X264 source code. It can be seen that more than 60% of the execution cycles can be saved using the proposed CSIMD Encoder model. All the improvements mentioned above stem from the advanced parallel memory and CSIMD structures.

TABLE VII
CYCLE COUNT REDUCTION FOR IMPLEMENTING SOME H.264 KERNEL FUNCTIONS WHEN ENCODING 1 SECOND OF CIF SEQUENCE.

Function                           SATD         DCT4x4RES   IDCT4x4RES
Times /        I Frame             75,655       11,120      4,952
Frame          P Frame             70,916       6,480       1,694
Cycle Count Reduction / Second     95,992,506   6,498,960   2,645,264
(percentage)                       (65.6%)      (61.9%)     (71.6%)

Based on Amdahl's Law [28], we can project the speedup of the entire H.264/AVC encoding application from the kernel function speedups, with respect to adopting the proposed parallel memory structure and configurable SIMD feature in a conventional SIMD architecture. Let T be the execution time (measured in execution cycles) of the original H.264/AVC encoding application, T_{ker} be the execution time of a kernel function, and T_{csimd} be the execution time of the kernel function performed by our CSIMD Encoder. Amdahl's Law states that the overall speedup S of the application is

    S = T / ( T(1 - \alpha) + T_{csimd} ) = 1 / ( (1 - \alpha) + \alpha / s )    (38)

where \alpha = T_{ker}/T is the proportion of the kernel function in the entire application and s = T_{ker}/T_{csimd} is the speedup of the kernel function execution with respect to our proposed features. It is easy to extend this to the overall application speedup when there are multiple kernel functions:

    S = 1 / ( (1 - \sum_i \alpha_i) + \sum_i \alpha_i / s_i )    (39)

Several kernel functions are taken into our calculation: integer motion estimation (IME), fractional motion estimation (FME), SATD, DCT and IDCT. TABLE VIII shows the kernel functions' speedups and their corresponding percentage proportions in the application based on our profiling results. The speedups of IME and FME mainly come from the instruction count reductions shown in TABLE V, which are 9.7 and 122.4, respectively. It should be noted that the SATD in this table refers only to inter mode decision, not to motion estimation, because the SATD speedup there is already accounted for in the FME speedup. The speedups of SATD, DCT and IDCT are from TABLE VI. According to equation (39), the overall speedup of the H.264/AVC encoding application is 2.46X.

TABLE VIII
PROPORTION AND SPEEDUP OF KERNEL FUNCTIONS.

Kernel          IME   FME    SATD  DCT  IDCT
Proportion (%)  12    33     7     7    13
Speedup         9.7   122.4  2.9   2.7  2.1

Besides video coding functions, the new SIMD architecture is very generic and flexible, and is also useful for many other image and video applications.
> REPLACE THIS LINE WITH YOUR PAPER IDENTIFICATION NUMBER (DOUBLE-CLICK HERE TO EDIT) < 13

To illustrate this, we have applied the proposed SIMD architecture to the implementation of several general video and image processing functions (e.g. de-interlacing, scaling, transform, color space conversion, etc.). Due to the flexibility provided by the proposed parallel memory structure, we can support image and video applications of different block sizes and word lengths, and by redefining the CSLUT table entries, we can realize these applications efficiently using the CSIMD instructions. TABLE IX shows the number of predefined entries in the CSLUT table for the implementation of the major kernel functions in each application. It can be seen that for implementing the 6 listed applications, only 689 entries are required. This shows that the memory required for the storage of the CSLUT table is insignificant as far as a general purpose video/image processor is concerned. As such, the proposed SIMD architecture can support multiple video applications well by simply using different entries of the CSLUT table for different applications. The proposed features increase the area of the video processor by not more than 5% of the total area. As a brief account, the CSIMD LUT contributes an about 4% increase in area, while the crossbar switch and the CSIMD control contribute 0.23% and 0.6% increases in area, respectively.

TABLE IX
NUMBER OF CSLUT CONFIGURATION ENTRIES FOR DIFFERENT IMAGE AND VIDEO APPLICATIONS.

Video Application     SATD   Transform   Fractional      Data
                                         Interpolation   Shuffle
H.264/AVC Encoder     147    82          8               8
H.264/AVC Decoder     88     52          0               8
AVS-M Decoder         62     50          0               4
AVS Decoder           56     27          0               4
MPEG4 Decoder         32     28          0               4
Image Processor       0      21          0               8

VI. CONCLUSION

In this paper, we have proposed a novel SIMD architecture with two new features, namely a parallel memory structure with variable block size and word length support, and a configurable SIMD (CSIMD) structure using a look-up table. When applied to block-based image or video applications, the proposed parallel memory structure provides extra flexibility in supporting data access of multiple block sizes and multiple word lengths by changing only a few parameters in the address generation units. The hardware complexity of implementing these address generation units is negligible as far as a general purpose image and video processor is concerned. By using the proposed parallel memory structure, a vector of 16 bytes, words or double words can be retrieved from (or stored to) the memory in 1, 2 and 4 cycles respectively. On the other hand, the proposed CSIMD structure allows nearly "random" data access to the SIMD registers by means of a crossbar switch. Programmers can specify the row number and the register bank to be accessed in the CSLUT table, which we have shown to require only a small amount of internal memory for its implementation. Programmers can also define, using the CSLUT table, slightly different operations among the ALUs. With these features, the SIMD performance when implementing matrix transpose, DCT/IDCT transform and SATD can be significantly improved. The H.264/AVC fractional motion estimation can also be implemented efficiently, and the number of memory accesses can be greatly reduced. In fact, the proposed CSIMD structure can also greatly benefit the implementation of other kernel functions such as the Luma 4x4 intra prediction, which, due to page limitation, has not been explained in detail in this paper.

REFERENCES

[1] T. Wiegand, G. J. Sullivan, G. Bjontegaard, and A. Luthra, "Overview of the H.264/AVC Video Coding Standard," IEEE Transactions on Circuits and Systems for Video Technology, vol. 13, no. 7, pp. 560-576, Jul. 2003.
[2] Intel 64 and IA-32 Architectures Software Developer's Manual, Volume 1: Basic Architecture [Online]. Available: http://www.intel.com/products/processor/manuals
[3] K. Diefendorff, P. K. Dubey, R. Hochsprung, and H. Scales, "AltiVec Extension to PowerPC Accelerates Media Processing," IEEE Micro, vol. 20, no. 2, pp. 85-95, Mar.-Apr. 2000.
[4] Y.-H. Kim, J.-W. Yoo, S.-W. Lee, J. Paik, and B. Choi, "Optimization of H.264 Encoder Using Adaptive Mode Decision and SIMD Instructions," Proc. International Conference on Consumer Electronics, pp. 289-290, Jan. 2005.
[5] S. Yu, Z. Chen, and Z. Zhuang, "Instruction-Level Optimization of H.264 Encoder Using SIMD Instructions," Proc. International Conference on Communications, Circuits and Systems, vol. 1, pp. 126-129, Jun. 2006.
[6] M. Raggio, M. Bariani, I. Barbieri, and D. Brizzolara, "H.264 Implementation on SIMD VLIW Cores," STreaming Day 07, Genova, Sep. 2007.
[7] J. Lee, S. Moon, and W. Sung, "H.264 Decoder Optimization Exploiting SIMD Instructions," Proc. IEEE Asia-Pacific Conference on Circuits and Systems, vol. 2, pp. 1149-1152, Dec. 2004.
[8] H. Lv, L. Ma, and H. Liu, "Analysis and Optimization of the UMHexagonS Algorithm in H.264 Based on SIMD," Proc. Communication Systems, Networks and Applications, pp. 239-244, Jun.-Jul. 2010.
[9] A. R. Iranpour and K. Kuchcinski, "Evaluation of SIMD Architecture Enhancement in Embedded Processors for MPEG-4," Proc. Euromicro Symposium on Digital System Design, pp. 262-269, Aug. 2004.
[10] J. Ye and J. Liu, "Fast Parallel Implementation of H.264/AVC Transform Exploiting SIMD Instructions," Proc. International Symposium on Intelligent Signal Processing and Communication Systems, pp. 870-873, Nov. 2007.
[11] J. Lee, G. Jeon, S. Park, T. Jung, and J. Jeong, "SIMD Optimization of the H.264/SVC Decoder with Efficient Data Structure," Proc. IEEE International Conference on Multimedia and Expo, pp. 69-72, 2008.
[12] S. Warrington, H. Shojania, S. Sudharsanan, and W.-Y. Chan, "Performance Improvement of the H.264/AVC Deblocking Filter Using SIMD Instructions," Proc. IEEE International Symposium on Circuits and Systems, pp. 21-24, May 2006.
[13] D. Talla, L. K. John, and D. Burger, "Bottlenecks in Multimedia Processing with SIMD Style Extensions and Architectural Enhancements," IEEE Transactions on Computers, vol. 52, no. 8, pp. 1015-1031, Aug. 2003.
[14] M. Alvarez, E. Salami, A. Ramirez, and M. Valero, "Performance Impact of Unaligned Memory Operations in SIMD Extensions for Video Codec Applications," Proc. IEEE International Symposium on Performance Analysis of Systems and Software, pp. 62-71, Apr. 2007.
[15] D. Talla, Architectural Techniques to Accelerate Multimedia Applications on General-Purpose Processors, Ph.D. dissertation, University of Texas at Austin, 2001.
[16] J. K. Tanskanen, T. Sihvo, and J. Niittylahti, "Byte and Modulo Addressable Parallel Memory Architecture for Video Coding," IEEE Transactions on Circuits and Systems for Video Technology, vol. 14,

no. 11, pp. 1270-1276, Nov. 2004.
[17] H. Chang, J. Cho, and W. Sung, "Performance Evaluation of an SIMD Architecture with a Multi-bank Vector Memory Unit," Proc. IEEE Workshop on Signal Processing Systems Design and Implementation, pp. 71-76, Oct. 2006.
[18] G. Kuzmanov, G. Gaydadjiev, and S. Vassiliadis, "Multimedia Rectangularly Addressable Memory," IEEE Transactions on Multimedia, vol. 8, no. 2, pp. 315-322, Apr. 2006.
[19] Z. Zhang, X. Yan, and X. Qin, "An Efficient Programmable Engine for Interpolation of Multi-Standard Video Coding," Proc. IEEE International Conference on ASIC, pp. 750-753, Oct. 2007.
[20] K. Liu, X. Qin, X. Yan, and L. Quan, "A SIMD Video Signal Processor with Efficient Data Organization," Proc. IEEE Asian Solid-State Circuits Conference, pp. 115-118, 2006.
[21] S. Seo, M. Woh, S. Mahlke, T. Mudge, S. Vijay, and C. Chakrabarti, "Customizing Wide-SIMD Architectures for H.264," Proc. IEEE International Symposium on Systems, Architectures, Modeling and Simulation, pp. 172-179, Jul. 2009.
[22] Y.-W. Huang, B.-Y. Hsieh, T.-C. Chen, and L.-G. Chen, "Analysis, Fast Algorithm, and VLSI Architecture Design for H.264/AVC Intra Frame Coder," IEEE Transactions on Circuits and Systems for Video Technology, vol. 15, no. 3, pp. 378-401, Mar. 2005.
[23] x264 Free H.264/AVC Encoder [Online]. Available: http://www.videolan.org/developers/x264.html
[24] W.-Y. Lo and S. Moy, "Configurable SIMD Processor Instruction Specifying Index to LUT Storing Information for Different Operation and Memory Location for Each Processing Unit," U.S. Patent 7,441,099 B2, filed Oct. 2006, granted Oct. 21, 2008.
[25] S. Knowles, "Apparatus and Method for Configurable Processing," U.S. Patent Application 2006/0253689 A1, published Nov. 9, 2006.
[26] K. Diefendorff and P. K. Dubey, "How Multimedia Workloads Will Change Processor Design," Computer, vol. 30, no. 9, pp. 43-45, Sep. 1997.
[27] H.264/AVC JM Software Reference Model [Online]. Available: http://iphome.hhi.de/suehring/html
[28] D. A. Patterson and J. L. Hennessy, Computer Architecture: A Quantitative Approach. Morgan Kaufmann Publishers, Inc., 1996.

Wing-Yee Lo received the B.Eng. (Hons) degree from Northumbria University, UK, and the MPhil degree from the Chinese University of Hong Kong, both in Electronics Engineering. She has more than 10 years of ASIC design experience in Motorola Semiconductors Hong Kong Ltd., VTech Communications Ltd., Hong Kong Applied Science and Technology Institute, and Beijing SimpLight Nanoelectronics Ltd. She is very familiar with the ASIC design flow and has been working on various SoC chips for mobile and consumer products, video processor architectural analysis and parallel processor designs. She joined a Shenzhen startup company as Director of ASIC Engineering in 2009. She is currently a doctoral candidate at the Hong Kong Polytechnic University.

Daniel Pak-Kong Lun (M'91) received the B.Sc. (Hons.) degree from the University of Essex, Essex, U.K., and the Ph.D. degree from the Hong Kong Polytechnic University, Hong Kong, in 1988 and 1991, respectively. He is now an Associate Professor and Associate Head of the Department of Electronic and Information Engineering, the Hong Kong Polytechnic University. His research interests include digital signal processing, wavelets, and multimedia technology. Dr. Lun is a Chartered Engineer and a corporate member of the IET and HKIE. (Home Page: http://www.eie.polyu.edu.hk/~enpklun)

Wan-Chi Siu (M'77, SM'90) received the MPhil degree from The Chinese University of Hong Kong and the PhD degree from Imperial College of Science, Technology & Medicine in October 1977 and 1984, respectively. He joined The Hong Kong Polytechnic University as a Lecturer in 1980 and has been Chair Professor in the Department of Electronic and Information Engineering since 1992. He was Head of the same department and subsequently Dean of the Engineering Faculty between 1994 and 2002. He is now Director of the Centre for Signal Processing of the same university. He is an expert in digital signal processing, specializing in fast algorithms and video coding, and has published 380 research papers, over 160 of which have appeared in international journals such as the IEEE Transactions on CSVT. His research interests also include transforms, image coding, wavelets, and computational aspects of pattern recognition. Professor Siu has been Guest Editor, Associate Editor and Member of the editorial board of a number of journals, including IEEE Transactions on Circuits and Systems, Pattern Recognition, Journal of VLSI Signal Processing Systems for Signal, Image, Video Technology, and the EURASIP Journal on Applied Signal Processing. He is a very popular lecturing staff member within the University, while outside the University he has been a keynote speaker at over 10 international/national conferences in the recent 10 years, and an invited speaker at numerous professional events, such as IEEE CPM'2002 (keynote speaker, Taipei, Taiwan), IEEE ISIMP'2004 (keynote speaker, Hong Kong), IEEE ICICS'07 (invited speaker, Singapore) and IEEE ICNNSP'2008 (keynote speaker, Zhenjiang). He is the organizer of many international conferences, including MMSP'08 (Australia) as General Co-Chair, and three IEEE Society sponsored flagship conferences: ISCAS'1997 as Technical Program Chair, ICASSP'2003 as General Chair, and recently ICIP'2010 as General Chair (2010 IEEE International Conference on Image Processing, held in Hong Kong, 26-29 September 2010). Prof. Siu is also the President-Elect (2011-13) of a new professional association, the "Asia-Pacific Signal and Information Processing Association", APSIPA. He is a member (2010-2012) of the Engineering Panel and was also a member of the Physical Sciences and Engineering Panel (1991-1995) of the Research Grants Council (RGC), Hong Kong Government. In 1994, he chaired the first Engineering and Information Technology Panel of the Research Assessment Exercise (RAE) to assess the research quality of 19 departments from all universities in Hong Kong. (Home Page: http://www.eie.polyu.edu.hk/~wcsiu/mypage.htm)

Wendong Wang received the B.S. degree in electrical engineering from Shandong University, China, and the M.S. degree in computer science from Beijing University of Technology, China, in 1997 and 2004, respectively. He is a senior software engineer at SimpLight Nanoelectronics Ltd., Beijing, and focuses on computer architecture analysis and video processing algorithm development.

Jiqiang Song (M'01, SM'07) received the B.Sc. and Ph.D. degrees from Nanjing University, China, in 1996 and 2001, respectively, both in Computer Science and Application. He worked in the Department of Computer Science and Engineering of the Chinese University of Hong Kong as a Postdoctoral Fellow from 2001 to 2004. After that, he joined Hong Kong Applied Science and Technology Institute as Algorithm Lead in a video processor project. In 2006, he worked in SimpLight Nanoelectronics Ltd., Beijing, as R&D Director of Multimedia and engaged in multimedia SIMD processor development. He joined Intel Labs China as a Staff Research Scientist in 2008. His research interests include graphics recognition, video encoding, and image and video processing. He has published over 30 research papers in international journals and conferences.
