Area Optimized High Speed Parallel Architecture With Internal Pipelined Structure For Fic On Fpga
Abstract
1. Introduction
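\[
w_k \begin{bmatrix} x \\ y \\ z \end{bmatrix}
= \begin{bmatrix} c_k & d_k & 0 \\ e_k & f_k & 0 \\ 0 & 0 & s_k \end{bmatrix}
\begin{bmatrix} x \\ y \\ z \end{bmatrix}
+ \begin{bmatrix} g_k \\ h_k \\ o_k \end{bmatrix}. \tag{1}
\]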
The affine transformation w_k is applied to the domain blocks; the
reflections and rotations are set by the parameters c_k, d_k, e_k and f_k,
and the translation by g_k and h_k. The contrast and brightness at every
(x, y) location are controlled by the parameters s_k and o_k, respectively.
As the domain block is larger than the range block, the domain block is first
resized to the size of the range block and then affine transformed so as to
match the range block. The error between the range and domain blocks is
computed for every value of brightness and contrast as the root mean square
(RMS) value given by equation (2), where N is the number of pixels in the
range and domain blocks:
\[
R(d_j, r_k) = \frac{N \sum_l b_l^2 - \left(\sum_l b_l\right)^2}{N}
- \frac{N \sum_l a_l^2 - \left(\sum_l a_l\right)^2}{N} \cdot s^2. \tag{2}
\]
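As a rough illustration of equation (2), the following Python sketch evaluates the error for a single range/domain pair. It assumes that a_l denotes the domain pixels and b_l the range pixels, and that s and o are the least-squares contrast and brightness values used in the standard fractal coding formulation; none of this is stated explicitly in the text, and the function name block_error is hypothetical.

```python
def block_error(domain, rng):
    """Evaluate equation (2) for two equally sized lists of pixels.

    Assumptions (not stated in the text): a_l = domain pixels,
    b_l = range pixels, s and o = least-squares contrast and brightness.
    """
    n = len(domain)
    sum_a = sum(domain)
    sum_b = sum(rng)
    sum_aa = sum(a * a for a in domain)
    sum_bb = sum(b * b for b in rng)
    sum_ab = sum(a * b for a, b in zip(domain, rng))

    var_a = n * sum_aa - sum_a ** 2   # N * sum(a^2) - (sum a)^2
    var_b = n * sum_bb - sum_b ** 2   # N * sum(b^2) - (sum b)^2

    # A flat domain block carries no contrast information, so s is set to 0.
    s = 0.0 if var_a == 0 else (n * sum_ab - sum_a * sum_b) / var_a
    o = (sum_b - s * sum_a) / n       # brightness offset
    r = var_b / n - (var_a / n) * s * s
    return s, o, r
```

Under these assumptions, a range block that is an exact scaled and shifted copy of the domain block gives an error of zero, and the error grows as the two blocks become less similar.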
Each domain block (identity) and its eight transformations are compared
with the range block, so nine RMS errors are computed. The candidate with the
minimum RMS error is chosen for the corresponding position of the domain
block, and the transformation applied to the domain block is recorded.
Because the image is subdivided into overlapping domain blocks and every
domain block must be matched against the range blocks, the computation time
is large. In this work, an efficient hardware architecture is proposed that
performs the search operations by comparisons and thereby makes fractal
image compression faster.
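The search over transformed domain blocks can be sketched in Python as follows. The set of transformations is not listed in the text, so the sketch assumes the eight symmetries of a square (four rotations, each with and without a mirror) that are commonly used in the FIC literature, and it uses a plain sum of squared differences in place of equation (2) to stay short. All function names are hypothetical.

```python
def rotate90(block):
    """Rotate a square block (list of rows) by 90 degrees clockwise."""
    return [list(row) for row in zip(*block[::-1])]


def flip(block):
    """Mirror a block left to right."""
    return [row[::-1] for row in block]


def candidates(block):
    """Yield (index, block) for the 8 assumed symmetries of a square block.

    Indices 0..3 are rotations by 0, 90, 180 and 270 degrees;
    indices 4..7 are the mirrored versions of those rotations.
    """
    current = [list(row) for row in block]
    for k in range(4):
        yield k, current
        yield k + 4, flip(current)
        current = rotate90(current)


def squared_error(a, b):
    """Sum of squared pixel differences between two equally sized blocks."""
    return sum((x - y) ** 2 for ra, rb in zip(a, b) for x, y in zip(ra, rb))


def best_transform(domain, rng):
    """Return (index, error) of the candidate closest to the range block."""
    scored = ((idx, squared_error(c, rng)) for idx, c in candidates(domain))
    return min(scored, key=lambda item: item[1])
```

For example, calling best_transform with a range block that is a 90 degree clockwise rotation of the domain block selects index 1 with zero error. Only the winning index needs to be recorded, which mirrors the way the transformation is stored in the codebook.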
3. Proposed Architecture
The major challenge in FIC is the delay in search time, which is
constrained by the range and domain blocks. To overcome this drawback, many
researchers are looking for suitable means of reducing the encoding time.
In [17], the authors designed hardware accelerators that store the input
image, partitioned into range blocks and domain blocks, in an internal
on-chip memory and use simple schemes for block classification. The
classification method employed is a simple pyramidal decomposition. The
simplicity and regularity of the method make it easy to implement on
programmable logic devices such as FPGAs or in custom VLSI integrated
circuits. Soft computing techniques such as genetic algorithms and swarm
techniques have been used to improve the speed [18, 19]. In [20, 21],
frequency information is extracted from the image using the DCT and DWT to
reduce the computational complexity. A parallel architecture based on a
single instruction multiple data array of processors was designed for FIC by
Acken et al. [22]; the design requires less than 83.6 ms for a 256 × 256
image. In [23], an ASIC implementation of an integer-based architecture that
minimizes the computational complexity by using squaring, subtraction and
accumulation operations is presented. In [24], a parallel matching algorithm
that operates on a quadtree structure is designed and implemented on an FPGA;
it requires less than 8.36 ms. In [24], real-time FIC based on characteristic
vector matching is proposed and implemented on a Virtex-4 FPGA, which
computes FIC in less than 0.8 ms. The architectures proposed in [22-24]
improve the processing frequency at the cost of hardware complexity.
Image compression is one of the modules that need to be integrated with
several other signal processing and communication modules when an
application is developed. Integrating the modules on a single chip requires
optimizing the area as well as meeting the computation time requirements of
real-time applications. Domain blocks of size 8 × 8 and range blocks of
size 4 × 4 are the standard block sizes reported in the literature for FIC.
In this work, domain blocks of size 4 × 4 and range blocks of size 2 × 2 are
used to improve the quality of compression. An image of
size N × N when subdivided into domain block of size 4 × 4, there would
54 S. Padmashree and Rohini Nagapadma
A. Image processing
The input image that needs to be compressed is first preprocessed to a
standard size of N × N, where N is a multiple of 2. The input image I,
consisting of N_1 rows and N_2 columns as represented in equation (3), is
resized to N × N by padding with zeros and is then represented by I_1. As the algorithm
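\[
I = X(i, j), \quad
\begin{cases} i = 1, 2, \ldots, N_1, \\ j = 1, 2, \ldots, N_2, \end{cases} \quad
\frac{N_1}{8} \neq \text{INTEGER}, \ \frac{N_2}{8} \neq \text{INTEGER}, \tag{3}
\]
\[
I_1 = \{X(i, j)\} + \{\mathrm{Zero}(n, m)\}, \quad
\begin{cases}
i = 1, \ldots, N_1, & \dfrac{N_1 + N_3}{8} = \text{INTEGER}, \\
j = 1, \ldots, N_2, & \dfrac{N_1 + N_3}{8} = \text{INTEGER}, \\
n = 1, \ldots, N_3, & \dfrac{N_2 + N_4}{8} = \text{INTEGER}, \\
m = 1, \ldots, N_4, & \dfrac{N_2 + N_4}{8} = \text{INTEGER}.
\end{cases} \tag{4}
\]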
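The zero padding described by equations (3) and (4) may be sketched in Python as below. The sketch assumes that N_3 is the number of padded rows and N_4 the number of padded columns, each chosen as the smallest value that makes the padded dimension a multiple of 8. It does not force the result to be square, and the function name pad_to_multiple is hypothetical.

```python
def pad_to_multiple(image, block=8):
    """Zero pad a 2-D list so that both dimensions are multiples of `block`.

    Assumption: N_3 extra rows and N_4 extra columns of zeros are appended
    at the bottom and right, as small as possible.
    """
    rows = len(image)
    cols = len(image[0]) if rows else 0
    pad_rows = (-rows) % block   # N_3 in equation (4)
    pad_cols = (-cols) % block   # N_4 in equation (4)

    # Extend every existing row with zeros, then append all-zero rows.
    padded = [list(row) + [0] * pad_cols for row in image]
    padded += [[0] * (cols + pad_cols) for _ in range(pad_rows)]
    return padded
```

With this choice, an image whose dimensions are already multiples of 8 is returned unchanged, which matches the case excluded by the conditions in equation (3).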
The preprocessed image is further divided into 8 × 8 subblocks. The
preprocessed input image is set to a standard size of 256 × 256 and is stored
in a memory of size 64 × 8. The subblocks are created by reading each
8 × 8 block from the main memory into one of eight separate memories, each of
size 64 × 8. The FSM control unit generates the address, clock and control
signals for read and write operations without loss of data. The subblock (SB)
stored in the intermediate memory is represented as in
equation (5):
\[
SB \equiv Y_i \equiv I_1(l + i, l + i), \quad
\begin{cases} l, k = 1, 2, \ldots, 8, \\ i = 1, 2, \ldots, N, \end{cases} \quad
N = \frac{N_1 + N_3}{8} \ \text{or} \ \frac{N_2 + N_4}{8}. \tag{5}
\]
Data stored in separate memories are read into accumulator unit that
computes the average of 8 × 8 blocks as in equation (6):
\[
X_{avg} = \frac{1}{64} \sum_{l, k = 1}^{8} Y_i(l, k), \quad i = 1, 2, \ldots, N. \tag{6}
\]
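The subblock extraction and the averaging of equation (6) can be illustrated with the short Python sketch below. It assumes non-overlapping 8 × 8 subblocks taken in raster order and an integer (truncating) average, which is what an accumulator followed by a 6-bit right shift would produce; the function names are hypothetical and the hardware accumulator may behave differently.

```python
def subblocks(image, size=8):
    """Yield non-overlapping size x size subblocks in raster order."""
    for top in range(0, len(image), size):
        for left in range(0, len(image[0]), size):
            yield [row[left:left + size] for row in image[top:top + size]]


def block_average(block):
    """Truncated mean of an 8 x 8 block.

    Dividing by 64 equals a right shift by 6 bits because 64 = 2**6;
    this shortcut is only valid for blocks of exactly 64 pixels.
    """
    total = sum(sum(row) for row in block)
    return total >> 6
```

Using a shift instead of a divider is the usual reason for choosing block sizes that are powers of two in hardware implementations.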
Figure 3.2 shows the quantization operation. The data from the intermediate
memory is shifted into a barrel shifter, which is controlled by the average
number (multiple of 2), and the normalized data is stored in the output
memory. As a result of the normalization, input data represented with 8 bits
is represented with 5 bits after quantization, so the number of bits per
pixel is reduced from 8 to 5.
B. Eight stage parallel processing architecture
The parallel processing architecture shown in Figure 3.3 consists of eight
intermediate memories, each of size 64 × 8. Their contents are read
simultaneously into the quantizer unit and quantized to 5-bit data.
Operations such as the computation of a suitable domain block for a range
block through proper affine transforms, search operations and comparisons
are carried out in parallel to generate the fractal codebook.
Figure 3.3. Eight stage parallel processing architecture for FIC codebook
generation.
Figure 3.5 shows the proposed greater than and less than comparator unit
that operates on two input pixels and finds the maximum and minimum of
them simultaneously.
The down sampler unit reads four neighbouring pixels and averages them.
The FSM counter1 reads the four neighbouring pixels in four clock cycles; the
adder unit accumulates the four pixels and right shifts the sum by two
(division by 4 for the averaging operation). The accumulator output is stored
in the output memory controlled by FSM counter2, whose clock frequency is
four times lower than that of the main clock. Each 8 × 8 block, consisting of
four 4 × 4 blocks, is affine transformed and down sampled to four 2 × 2
blocks, and each of them is compared with the corresponding range block to
obtain four RMS errors. The RMS error is computed as shown in equation (7),
which consists of a square root operator, an accumulator and a difference:
\[
E_{mod} = \text{right shift operation} \left[ \sum \left| R_i(l, k) - D_i(l, k) \right| \right], \quad
\sum \left| R_i(l, k) - D_i(l, k) \right| = X_E. \tag{8}
\]
Figure 3.9 shows the modified RMS calculator architecture, which consists
of a difference unit, an accumulation unit and a right shift operation. Four
pixels of the domain block and the range block are subtracted using a
two-stage subtraction unit built from difference units; the modulus operator
and the right shift operator then compute the final RMS error from the
difference output.
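The down sampling step and the modified error of equation (8) can be sketched in Python as follows. The sketch assumes that the four neighbouring pixels form a 2 × 2 neighbourhood, that the modulus operator takes the absolute value of each difference, and that the shift amount is passed in as a parameter with an arbitrary default of 2. The function names are hypothetical, and the sketch is not a model of the pipelined hardware.

```python
def downsample(block):
    """Halve a block in each dimension by averaging 2 x 2 neighbourhoods.

    Each output pixel is the sum of four inputs right shifted by 2,
    i.e. a truncating division by 4.
    """
    out = []
    for r in range(0, len(block), 2):
        row = []
        for c in range(0, len(block[0]), 2):
            total = (block[r][c] + block[r][c + 1]
                     + block[r + 1][c] + block[r + 1][c + 1])
            row.append(total >> 2)
        out.append(row)
    return out


def modified_error(rng, domain, shift=2):
    """Sum of absolute differences (X_E) followed by a right shift.

    The default shift of 2 is an assumption made for illustration only.
    """
    x_e = sum(abs(r - d)
              for range_row, domain_row in zip(rng, domain)
              for r, d in zip(range_row, domain_row))
    return x_e >> shift
```

Replacing the square root and squaring of a true RMS value with absolute differences and a shift avoids multipliers, which is consistent with the area-oriented goal of the design.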
The right shift operation is performed using the novel architecture shown
in Figure 3.11. The output of the mod operator, which is represented using
7 bits, is loaded into an intermediate register. The 4th, 5th, 6th and 7th
bit positions are compared for priority: if the 4th bit is ‘1’, a right shift
by 2 is performed; if the 5th bit is ‘1’ and the 4th bit is ‘0’, a right
shift by 3 is performed; and so on. The right shifter output is selected by
the four MSBs to compute the RMS error.
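One possible reading of this priority logic is sketched below in Python. Several details are assumptions rather than facts from the text: bit positions are counted from 1 at the least significant bit, the rules for the 6th and 7th bits (shifts by 4 and 5) are extrapolated from the two cases that are given, and a value with none of the four bits set is passed through unshifted. The function name priority_shift is hypothetical.

```python
def priority_shift(value):
    """Right shift a 7-bit value by an amount selected from bits 4 to 7.

    Assumed rules, checked in priority order:
      bit 4 set -> shift by 2   (from the text)
      bit 5 set -> shift by 3   (from the text)
      bit 6 set -> shift by 4   (extrapolated)
      bit 7 set -> shift by 5   (extrapolated)
    """
    value &= 0x7F  # keep only the 7 bits held in the intermediate register
    rules = [(4, 2), (5, 3), (6, 4), (7, 5)]
    for position, shift in rules:
        if value & (1 << (position - 1)):  # bit 1 is taken as the LSB
            return value >> shift
    return value  # none of bits 4..7 set: left unshifted (assumption)
```

Under this reading, the input 0011000 in binary has both the 4th and 5th bits set, so the 4th bit takes priority and the value is shifted right by 2.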
The simulated results show that, among the four domain blocks chosen for an
8 × 8 image, each of size 4 × 4 and labeled domain 1, 2, 3 and 4, domain
block 2 is not selected as the suitable domain block for any of the range
blocks, so the information contained in domain block 2 can be neglected.
Hence, the size is reduced, since one of the four domain blocks to be matched
with the range blocks is left unselected.
4.1. Implementation results
The proposed architecture was simulated and synthesized using Xilinx ISE
with a Virtex-5 FPGA as the target device, and appropriate constraints were
set for area optimization and power optimization. The input image of size
256 × 256 and the compressed codebook are stored in the internal memory, and
functional verification is carried out by reading the codebook into the
simulation window. The operating frequency of the design on the FPGA is
139.9 MHz. The proposed architecture is found to be effective when compared
with conventional fractal image compression. FIC encoding is completed in
less than 8.3 ns, the memory usage is 357512 Kilobytes and the power consumed
is less than 0.46 W when operating at the maximum frequency. The design is
optimized to consume 9% of the total FPGA slice registers, 3% of the LUTs and
34% of the LUT-FF pairs. The design requires 348 IOBs, as the generated
codebook is read out for functional verification. Table 4.1 compares the
performance of the proposed architecture with other references.
5. Conclusion
Acknowledgement
The authors thank the anonymous referees for their valuable suggestions,
which led to the improvement of the manuscript.
References