Area Optimized High Speed Parallel Architecture With Internal Pipelined Structure For Fic On Fpga


Far East Journal of Electronics and Communications

© 2016 Pushpa Publishing House, Allahabad, India


Published Online: January 2016
http://dx.doi.org/10.17654/EC016010049
Volume 16, Number 1, 2016, Pages 49-70 ISSN: 0973-7006

AREA OPTIMIZED HIGH SPEED PARALLEL
ARCHITECTURE WITH INTERNAL PIPELINED
STRUCTURE FOR FIC ON FPGA

S. Padmashree and Rohini Nagapadma


Department of Electronics and Communication Engineering
GSSS Institute for Engineering and Technology for Women
Mysore, India
Department of Electronics and Communication Engineering
The National Institute of Engineering
Mysore, India

Abstract

The goal of image compression is to remove the redundancy present in the
data while leaving sufficient information for proper image reconstruction.
There are numerous lossless and lossy compression techniques. Lossless
techniques compress the image by reducing the redundancy in the data such
that the decompressed data is an exact copy of the original. Lossy
compression, of which JPEG is an example, sacrifices exact reproduction of
the original image. Fractal image compression (FIC) is a lossy technique
that is also used for medical image compression. In this paper, a novel
architecture for FIC is proposed, modeled and implemented on an FPGA
platform. The input image is grouped into 8 × 8 blocks, and eight blocks
are processed simultaneously using a parallel architecture. The nine
isometries are realized using an interleaver technique, and the
search/comparison operation is carried out using a parallel architecture.
The two parallel architectures have an internal pipelined structure for
arithmetic operations. An FSM control unit synchronizes the data movement
and ensures codebook generation for 256 × 256 × 8 input data. The design
operates at a maximum frequency of 139.9 MHz, uses less than 30% of the
FPGA resources and consumes 0.46 W of power. The design is suitable for
real time medical image compression.

Received: June 25, 2015; Revised: August 6, 2015; Accepted: August 20, 2015
Keywords and phrases: parallel architecture, internal pipelined structure, low power,
interleaver structure, multiplexer logic.

1. Introduction

Compression of images improves transmission speed and reduces storage
space. With the increased capacity of communication channels and the
growing demand for image and video based services, there is a need for
image compression that leads to faster services in medical image
processing, internet services and video conferencing. Storage space for
video data also needs to be reduced, as applications such as video
surveillance require large storage space. Among the several algorithms for
image and video compression, fractal image compression works on the
principle of local self-similarity, whereas traditional encoding techniques
operate on pixel data and statistical data. FIC has been extensively used
in image processing, pattern recognition and facial recognition. FIC was
proposed by Barnsley and applied to gray scale images by Barnsley and
Jacquin [1]. Fractal image compression is also called fractal image
encoding (FIE), as the compressed image is represented by contractive
transforms and mathematical functions which are required for
reconstruction of the image [2]. In FIC, the input image is partitioned
into subblocks that are grouped into domain blocks and range blocks. The
data in each domain block is reduced to the size of a range block and
searched for a match with the appropriate range block; the matched domain
blocks for the corresponding range blocks are stored in the codebook. The
limitation of FIC is that the search time is large, which poses challenges
for real time image compression.

Several algorithms for searching and matching have been reported in the
literature for FIC [3-10]. In [3], fast schemes for searching and matching
that operate on adjacent domain blocks were proposed. In [4], searching
algorithms that operate on adjacent range blocks are used to improve the
search speed. Classification algorithms that sort blocks based on features
were proposed in [5, 6] to improve the encoding schemes. In [7], the
encoding scheme is reported to take 20 seconds, and in [8] the encoding
time is 2.8 seconds. In [9], variance based block classification is
proposed, and in [10] iterated transform based FIC is proposed. The
reconstructed images discussed in [9, 10] exhibit poor PSNR compared with
traditional techniques. In [11], FIC based on affine similarity and
Pearson’s correlation coefficient is proposed that requires less than
117 ms to encode data. In [12, 13], searchless algorithms have been
reported to reduce encoding time, but the reconstructed images suffer from
poor PSNR. In [14], a searchless iterated function system is proposed and
implemented on an FPGA platform, reducing the encoding time of a 3D image
to 8.36 ms on an Altera APEX device with an operating frequency of 32 MHz.
In [15], the wavelet transform is combined with fractal image compression
and an improvement in PSNR is demonstrated.

However, the computation complexity increases with the use of DWT. In
[16], Fisher’s method of FIC is implemented on a Virtex-5 FPGA operating at
100 MHz, but the area resources occupied limit the use of the architecture
in real time applications. Delay in search algorithms and computation
complexity on hardware platforms are the two major challenges that have
limited the use of fractal image compression.

In this paper, a novel architecture for hardware implementation of fractal
image compression on an FPGA platform is proposed, designed and
implemented. An effective architecture of quadtree partitioned iterated
function systems with affine transforms has been validated on a Virtex-5
FPGA using a 256 × 256 8-bit gray scale medical image.

2. Fractal Image Compression

The fractal image encoding technique works on the principle of self
similarities that exist in an image. The image is partitioned into
subunits, and self similarities are searched iteratively to prepare the
codebook. An input image is first partitioned into non-overlapping blocks
of size R × R, termed range blocks. The input image is also grouped into
overlapping domain blocks of size D × D (D > R). The domain block is
affine transformed with 8 isometries (identity, 90° clockwise rotation,
180° clockwise rotation, 270° clockwise rotation, x reflection, y
reflection, y = x reflection and y = −x reflection) and is searched for the
best match with the range block as in equation (1):

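\[
w_k \begin{bmatrix} x \\ y \\ z \end{bmatrix}
= \begin{bmatrix} c_k & d_k & 0 \\ e_k & f_k & 0 \\ 0 & 0 & s_k \end{bmatrix}
\begin{bmatrix} x \\ y \\ z \end{bmatrix}
+ \begin{bmatrix} g_k \\ h_k \\ o_k \end{bmatrix}. \tag{1}
\]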
The affine transformation w_k is performed on the domain blocks by
controlling reflections and rotations through the parameters c_k, f_k, g_k
and h_k. Image contrast and brightness at every (x, y) location are
controlled by the s_k and o_k parameters, respectively. As the domain block
is larger than the range block, the domain block is resized to the size of
the range block and affine transformed so as to match the range block. The
error between the range and domain blocks is computed for every specific
value of brightness and contrast as the root mean square (RMS) value given
by equation (2), where N represents the number of pixels in the range and
domain blocks:

\[
R(d_j, r_k) = \frac{N \sum_l b_l^2 - \left( \sum_l b_l \right)^2}{N}
- \frac{N \sum_l a_l^2 - \left( \sum_l a_l \right)^2}{N} \cdot s^2. \tag{2}
\]
Each domain block (identity) and its transformations (8 transformations)
are compared with the range block, and nine RMS errors are computed. The
minimum RMS error is chosen for the corresponding position of the domain
block, and the transformation performed on the domain block is recorded. As
the image is subdivided into overlapping domain blocks, and range block
matching must be performed for each domain block, a large computation time
is required. In this work, an efficient hardware architecture is designed
and proposed that performs the search operations by comparisons, making
fractal image compression faster.
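
The search described in this section can be illustrated with a short software sketch. It is only a model under stated assumptions, not the hardware design: it enumerates the eight isometries listed above for a square block, compares each with a range block using a plain RMS difference (the contrast and brightness fitting of equation (2) is left out), and keeps the isometry with the smallest error. The function names and the mapping of "x reflection" and "y = x reflection" to row reversal and transpose are assumptions made for the example.

```python
import math


def isometries(block):
    """Return eight isometries of a square block given as a list of rows."""
    n = len(block)
    idx = range(n)
    return [
        [row[:] for row in block],                               # identity
        [[block[n - 1 - c][r] for c in idx] for r in idx],       # 90 deg clockwise
        [[block[n - 1 - r][n - 1 - c] for c in idx] for r in idx],  # 180 deg
        [[block[c][n - 1 - r] for c in idx] for r in idx],       # 270 deg clockwise
        [row[:] for row in reversed(block)],                     # rows reversed (assumed x reflection)
        [list(reversed(row)) for row in block],                  # columns reversed (assumed y reflection)
        [[block[c][r] for c in idx] for r in idx],               # transpose (assumed y = x)
        [[block[n - 1 - c][n - 1 - r] for c in idx] for r in idx],  # anti-transpose (assumed y = -x)
    ]


def rms_error(a, b):
    """Plain RMS difference between two equally sized blocks."""
    diffs = [(x - y) ** 2 for ra, rb in zip(a, b) for x, y in zip(ra, rb)]
    return math.sqrt(sum(diffs) / len(diffs))


def best_match(domain_block, range_block):
    """Return (isometry index, error) of the best matching isometry."""
    candidates = (
        (i, rms_error(t, range_block))
        for i, t in enumerate(isometries(domain_block))
    )
    # min() keeps the first candidate on ties (an assumed tie-break rule).
    return min(candidates, key=lambda pair: pair[1])
```

For example, `best_match([[1, 2], [3, 4]], [[3, 1], [4, 2]])` selects index 1, the 90° clockwise rotation, with zero error.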
3. Proposed Architecture

The major challenge in FIC is the delay in search time, which is
constrained by the range and domain blocks. To overcome this demerit of
FIC, many researchers are working to find suitable means of reducing the
encoding time. In [17], the authors have designed hardware accelerators
that store the input image, partitioned into range blocks and domain
blocks, in an internal on-chip memory using simple schemes for block
classification. The classification method employed is a simple pyramidal
decomposition method. The simplicity and regularity of the method make it
easy to implement on programmable logic devices such as FPGAs or in custom
VLSI integrated circuits. Soft computing techniques such as genetic
algorithms and swarm techniques have been used to improve the speed
[18, 19]. In [20, 21], frequency information is extracted from the image
using DCT and DWT to reduce computation complexity. A parallel architecture
based on a single instruction multiple data array of processors is designed
for FIC by Acken et al. [22]; the design requires less than 83.6 ms for a
256 × 256 image. In [23], an ASIC implementation of an integer based
architecture that minimizes computation complexity by squaring, subtracting
and accumulation operations is designed. In [24], a parallel matching
algorithm that operates on a quadtree structure is designed and implemented
on FPGA and requires less than 8.36 ms. In [24], real time FIC based on
characteristic vector matching is proposed and implemented on a Virtex-4
FPGA, which computes FIC in less than 0.8 ms. The architectures proposed in
[22-24] improve processing frequency at the cost of hardware complexity.

Image compression is one of the modules that needs to be integrated with
several other signal processing and communication modules in application
development. Integrating the modules on a single chip requires optimizing
area complexity as well as meeting computation time requirements for real
time applications. Domain blocks of size 8 × 8 and range blocks of size
4 × 4 are the standard block sizes reported in the literature for FIC. In
this work, domain blocks of size 4 × 4 and range blocks of size 2 × 2 are
used to improve the quality of compression. When an image of size N × N is
subdivided into domain blocks of size 4 × 4, there are N^2 domain blocks;
with each domain block subdivided into 2 × 2 range blocks, there are 4N^2
range blocks.

The proposed algorithm for FIC computation is shown in Figure 3.1. The
input image is first preprocessed to set the image size to a standard size
that is a multiple of 2. The preprocessed image is subdivided into 8 × 8
blocks for parallel processing. First, a set of eight 8 × 8 blocks is
selected and quantized with a known number. The eight quantized subblocks
are stored in separate memories. The maximum and minimum pixel intensities
of each 8 × 8 subblock are identified and their difference is computed. If
the difference is greater than the set threshold, the 8 × 8 block is
selected for FIC. If the difference is less than the set threshold, the
current 8 × 8 block is discarded and a new set of 8 × 8 blocks is chosen
from the memory. The eight selected 8 × 8 blocks are further subgrouped
into overlapping domain blocks of 4 × 4 and range blocks of 2 × 2. Each
domain block is down sampled, and nine isometries are applied to the down
sampled domain block. The transformed domain block is then compared with
each range block. From the difference, the RMS error is computed for each
down sampled transformed domain block and range block. The minimum RMS
error obtained for the corresponding domain is identified and inserted in
the codebook, completing the process of FIC. The process is repeated until
all the 8 × 8 subblocks of the input image are completed.

In the proposed work, a parallel processing architecture that operates
independently on each 8 × 8 subblock is designed to reduce the computation
time. Further, the affine transforms and the 9 comparisons are carried out
in parallel. Thus, there is parallelism at two stages. In the first stage,
the parallel processing operates on eight 8 × 8 subblocks; in the second
stage, nine processing units operate in parallel, computing affine
transformation and comparison. The input image is stored in main memory,
and the FSM control unit reads 8 × 8 subblocks from main memory into
separate intermediate memories for FIC, based on control signals from the
main control unit designed using FSM logic.

Figure 3.1. Proposed flowchart for quadtree partitioned FIC implementation.
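
The block selection step of the flowchart can be sketched in software as follows. This is a minimal illustration that assumes each 8 × 8 block is a list of rows of integer pixel values; the function name and the treatment of a difference exactly equal to the threshold (discarded here) are assumptions, since the text only covers the greater-than and less-than cases.

```python
def select_blocks(blocks, threshold):
    """Return the indices of blocks whose (max - min) exceeds the threshold."""
    selected = []
    for index, block in enumerate(blocks):
        pixels = [p for row in block for p in row]
        spread = max(pixels) - min(pixels)  # difference of extreme intensities
        if spread > threshold:              # equality is treated as "discard" (assumption)
            selected.append(index)
    return selected
```

A flat block (all pixels equal) has a spread of zero and is therefore never selected for any non-negative threshold.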

A. Image processing
The input image to be compressed is first preprocessed to a standard size
of N × N, where N is a multiple of 2. The input image I, consisting of N_1
rows and N_2 columns as represented in equation (3), is resized to N × N by
padding with zeros and is represented by I_1. As the proposed algorithm
operates on 8 × 8 subblocks, the resizing operation is carried out only if
N_1/8 and N_2/8 are not integers. If the condition in equation (3) is
satisfied, extra zeros are padded on both sides of the input image I to
obtain the preprocessed image I_1 as in equation (4):

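\[
I = X(i, j), \quad
\begin{cases} i = 1, 2, \ldots, N_1, \\ j = 1, 2, \ldots, N_2, \end{cases}
\qquad \frac{N_1}{8} \neq \text{INTEGER}, \quad \frac{N_2}{8} \neq \text{INTEGER}, \tag{3}
\]

\[
I_1 = \{X(i, j)\} + \{\mathrm{Zero}(n, m)\}, \quad
\begin{cases}
i = 1, \ldots, N_1, & \frac{N_1 + N_3}{8} = \text{INTEGER}, \\
j = 1, \ldots, N_2, & \frac{N_1 + N_3}{8} = \text{INTEGER}, \\
n = 1, \ldots, N_3, & \frac{N_2 + N_4}{8} = \text{INTEGER}, \\
m = 1, \ldots, N_4, & \frac{N_2 + N_4}{8} = \text{INTEGER}.
\end{cases} \tag{4}
\]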
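
The padding step can be illustrated with a small sketch. It assumes the image is a list of rows, pads each dimension up to the next multiple of 8 only when needed, and splits the extra zeros between the two sides of each dimension. The function name and the exact split (the smaller half first) are assumptions; the text only states that zeros are padded on both sides.

```python
def pad_to_multiple_of_8(image):
    """Zero-pad a 2-D image (list of rows) so both dimensions are multiples of 8."""
    rows, cols = len(image), len(image[0])
    pad_rows = (-rows) % 8          # zeros needed vertically (0 if already aligned)
    pad_cols = (-cols) % 8          # zeros needed horizontally
    top, left = pad_rows // 2, pad_cols // 2
    bottom, right = pad_rows - top, pad_cols - left
    width = cols + pad_cols

    padded = [[0] * width for _ in range(top)]
    for row in image:
        padded.append([0] * left + list(row) + [0] * right)
    padded.extend([0] * width for _ in range(bottom))
    return padded
```

An image whose dimensions are already multiples of 8 is returned with the same contents, which matches the condition that resizing is carried out only when N_1/8 or N_2/8 is not an integer.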
The preprocessed image is further subgrouped into 8 × 8 subblocks. The
preprocessed input image is set to a standard size of 256 × 256 and is
stored in a memory of size 64 × 8. The 8 × 8 subblocks are created by
reading each 8 × 8 block from the main memory into one of eight separate
memories of size 64 × 8. The FSM control unit generates the address, clock
and control signals for read and write operations without loss of data. The
subblock (SB) stored in the intermediate memory is represented as in
equation (5):

\[
SB \equiv Y_i \equiv I_1(l + i, l + i), \quad
\begin{cases} l, k = 1, 2, \ldots, 8, \\ i = 1, 2, \ldots, N, \end{cases}
\qquad N = \frac{N_1 + N_3}{8} \ \text{or} \ \frac{N_2 + N_4}{8}. \tag{5}
\]

Data stored in the separate memories are read into an accumulator unit that
computes the average of each 8 × 8 block as in equation (6):

\[
X_{avg} = \frac{1}{64} \sum_{l=1}^{8} \sum_{k=1}^{8} Y_i(l, k), \quad i = 1, 2, \ldots, N. \tag{6}
\]

An FSM is designed to read the data from the 64 memory locations and
accumulate it. After every accumulation, the 6 least significant bits are
discarded and the result is stored in the memory for the successive
accumulation process, and the average is computed. The input data is
represented by 8 bits and hence would require arithmetic units that operate
on 8-bit data. In order to reduce hardware complexity, the input image is
normalized to a maximum of M, where M is less than 256. The average value
computed from each subblock is used as the quantization number for
normalization of the subblock. The average of each subblock is checked to
be a multiple of 2; if it is not, the average value is rounded off to a
multiple of 2. This operation is carried out by checking the LSB: if the
LSB is ‘1’, the value is not a multiple of 2 and hence the LSB is set to
‘0’. After the average value is set to a multiple of 2, the data from the
intermediate memory, consisting of 64 elements, is processed using a barrel
shifter that shifts the data to the right B times.

Figure 3.2. Quantization operation.

Figure 3.2 shows the quantization operation. The data from the intermediate
memory is shifted into the barrel shifter, which is controlled by the
average number (a multiple of 2), and the normalized data is stored in the
output memory. Due to the normalization process, input data represented by
8 bits is represented by 5 bits after the quantization operation; thus the
bits per pixel are reduced from 8 to 5.
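
The averaging and quantization steps can be modelled roughly as below. The sketch assumes that the average of 64 pixels is obtained by dropping the 6 least significant bits of the accumulated sum, that the average is made even by clearing its LSB, and that an 8-bit to 5-bit reduction corresponds to a right shift by 3. The text does not spell out how the average controls the shift count B, so the `shift` parameter here is a hypothetical stand-in, and all function names are illustrative.

```python
def block_average(block):
    """Average of an 8 x 8 block: accumulate 64 pixels, then drop 6 LSBs."""
    total = sum(p for row in block for p in row)
    return total >> 6  # division by 64


def force_even(value):
    """Clear the LSB so the value becomes a multiple of 2."""
    return value & ~1


def quantize(block, shift=3):
    """Right-shift every pixel; shift=3 maps 8-bit data to 5-bit data (assumed)."""
    return [[p >> shift for p in row] for row in block]
```

With `shift=3`, the largest 8-bit value 255 maps to 31, the largest 5-bit value.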
B. Eight stage parallel processing architecture
The parallel processing architecture shown in Figure 3.3 consists of eight
intermediate memories, each of size 64 × 8, whose contents are read
simultaneously into the quantizer unit and quantized to 5-bit data.
Operations such as computing a suitable domain block for a range block
through proper affine transforms, search operations and comparisons are
carried out in parallel to generate the fractal codebook.

Figure 3.3. Eight stage parallel processing architecture for FIC codebook
generation.

The FSM control unit (not shown in the figure) is designed to synchronize
data movement across the pipelined stages; the clock signal and handshake
signals are generated by the FSM unit to avoid loss of data and speed up
the processing operation. The quantization operation reduces 8-bit data to
5-bit data, after which the maximum number, the minimum number and their
difference (Y_1) are found for each 8 × 8 block. The process is faster as
only 5 bits need to be verified. As the affine transform is performed for
all nine isometries before the search operation is performed, delay is
introduced.
A novel algorithm is proposed to reduce the computation time and is
discussed in the next section. In order to perform encoding as described in
the flowchart, domain blocks that have information above a set threshold
must be chosen. To find the information in a domain block, its maximum and
minimum numbers need to be determined. As there are 64 pixels of 5 bits
each, finding the maximum and minimum numbers through an iterative process
introduces delay.
In this work, a pipelined architecture consisting of five stages of
processing, shown in Figure 3.4, is designed to compute the maximum and
minimum numbers. In the first stage, the input data is grouped into 32
groups of two pixels each, and the pixels of each group are compared to
choose the maximum or minimum number of the group. In the second stage, the
32 pixels that represent the maximum numbers are again grouped into pairs
and compared to obtain 16 pixels; similarly, after five stages of
comparison, the maximum number is computed. The minimum number computation
is performed in parallel with the maximum number computation.

Figure 3.4. Five stages greater than/lesser than computation unit.
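
The staged comparison can be modelled in software as a pairwise reduction tree in which every level halves the number of candidates and the maximum and minimum are tracked side by side. The sketch below is a behavioural model only, with hypothetical function names; the level count it reports follows from plain pairwise halving and is not meant to reproduce how Figure 3.4 partitions the comparisons into its stages.

```python
def compare_stage(max_values, min_values):
    """One comparator level: pair up neighbours, keep the larger and the smaller."""
    next_max = [max(max_values[i], max_values[i + 1])
                for i in range(0, len(max_values), 2)]
    next_min = [min(min_values[i], min_values[i + 1])
                for i in range(0, len(min_values), 2)]
    return next_max, next_min


def max_min_tree(pixels):
    """Return (maximum, minimum, number of comparator levels used)."""
    count = len(pixels)
    if count == 0 or count & (count - 1):
        raise ValueError("the number of pixels must be a power of two")
    max_values, min_values = list(pixels), list(pixels)
    levels = 0
    while len(max_values) > 1:
        max_values, min_values = compare_stage(max_values, min_values)
        levels += 1
    return max_values[0], min_values[0], levels
```

The difference between the two returned extremes corresponds to the quantity compared against the threshold when a block is selected.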

Figure 3.5 shows the proposed greater than and less than comparator unit
that operates on two input pixels and finds the maximum and minimum of
them simultaneously.

Figure 3.5. Single stage greater than/lesser than computation unit.

As the design operates on 5-bit data, each bit of data is compared using
the architecture shown in Figure 3.5. The greater than and less than
operators are realized using multiplexers, and hence the complete
architecture is realized using a multiplexer structure that is compatible
with the FPGA LUT structure. The computed Y_1 is used to choose the domain
block for FIC encoding. Each 8 × 8 image is grouped into 4 × 4 domain
blocks; each 4 × 4 domain block is down sampled, affine transformed and
compared with the range block. Each 8 × 8 subblock is regrouped into
sixteen 2 × 2 range blocks. Each domain block is affine transformed by the
9 isometries, and each result needs to be compared with the sixteen range
blocks; thus the total number of search and comparison operations is
16 × 9 = 144.
C. Search and comparison operations
Affine transformations such as rotation and reflection need to be performed
on the 4 × 4 domain block. By observation, an affine transformation changes
the pixel positions according to the 9 isometries (one of the operations is
shown in Figure 3.6). In this work, a novel method for affine
transformation is carried out using the interleaver technique.

Figure 3.6. Affine transformation (transpose operation).

Figure 3.7 shows the interleaver operation that is required to perform the
affine transformation. As the positions of the pixels need to be
interleaved according to the positions shown in Figure 3.6, an FSM based
counter is designed to read the data from the input memory in the indicated
sequence and store the pixels in the output memory. The first pixel, from
address location 0000, is stored in the first location (0000) of the output
memory, the second pixel, read from location 0001, is stored in the fifth
location of the output memory, and so on.

Figure 3.7. Interleaver FSM control unit.
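
The address mapping described for the transpose operation can be written out as a short sketch. It assumes a 4 × 4 block stored row by row at addresses 0 to 15, so that the pixel at (row, column) is written to (column, row); the function names are illustrative and the FSM counter is modelled as a simple loop.

```python
def transpose_address(addr, size=4):
    """Map a row-major read address to the write address of the transposed block."""
    row, col = divmod(addr, size)
    return col * size + row


def interleave(memory, size=4):
    """Copy input memory to output memory using the transpose address sequence."""
    output = [None] * (size * size)
    for addr in range(size * size):   # stands in for the FSM based counter
        output[transpose_address(addr, size)] = memory[addr]
    return output
```

Here read address 0 maps to write address 0 and read address 1 maps to write address 4, the fifth location, which agrees with the sequence described above. Other isometries would use a different address function with the same copy loop.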

Similarly, the nine isometries are computed using the interleaver operation
for the comparison and search operations. After affine transformation, the
4 × 4 domain block is down sampled to 2 × 2 for comparison with the range
block; the down sampler unit is shown in Figure 3.8.

Figure 3.8. Down sampler unit.

The down sampler unit reads four neighbouring pixels and averages them. FSM
counter1 reads the four neighbouring pixels in four clock cycles; the adder
unit accumulates the four pixels and right shifts the sum by two (dividing
by 4 for the averaging operation), and the accumulator output is stored in
the output memory controlled by FSM counter2, whose clock frequency is four
times slower than the main clock. The four 4 × 4 blocks of each 8 × 8 block
are affine transformed and down sampled to four 2 × 2 blocks, each of which
is compared with the corresponding range block to obtain four RMS errors.
The RMS errors are computed as shown in equation (7), which consists of a
square root operator, an accumulator and a difference:

\[
E_i = \sqrt{\sum_{l, k} \left[ R_i(l, k) - D_i(l, k) \right]^2}, \quad i = 1, 2, 3, 4, \tag{7}
\]

where R_i(l, k) is the ith range block (2 × 2) and D_i(l, k) is the ith
down sampled domain block (2 × 2).
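
A software sketch of the down sampler and of the error in equation (7) is given below. It assumes a 4 × 4 block stored as a list of rows, averages each 2 × 2 neighbourhood by adding four pixels and shifting right by two, and evaluates the error as the square root of the accumulated squared differences. The function names are illustrative only.

```python
import math


def downsample(block4):
    """Reduce a 4 x 4 block to 2 x 2 by averaging each 2 x 2 neighbourhood."""
    result = []
    for r in (0, 2):
        result.append([
            (block4[r][c] + block4[r][c + 1]
             + block4[r + 1][c] + block4[r + 1][c + 1]) >> 2  # divide by 4
            for c in (0, 2)
        ])
    return result


def error_eq7(range_block, domain_block):
    """Square root of the summed squared differences of two 2 x 2 blocks."""
    total = sum(
        (range_block[l][k] - domain_block[l][k]) ** 2
        for l in range(2) for k in range(2)
    )
    return math.sqrt(total)
```

Because the division is a right shift, the average is truncated rather than rounded, as in an integer datapath.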

In this work, the complexity of the RMS calculation is minimized with the
modified RMS computation in equation (8). The square root and square
operations are replaced with a modulus operation, and the result is
compared with a known constant and quantized. The comparison and
quantization operations are performed using a right shift operation:

\[
E_{mod} = \text{right shift operation} \left[ \sum \left| R_i(l, k) - D_i(l, k) \right| \right],
\qquad \sum \left| R_i(l, k) - D_i(l, k) \right| = X_E. \tag{8}
\]

Figure 3.9 shows the modified RMS calculator architecture, which consists
of a difference unit, an accumulation unit and a right shift operation.
Four pixels of the domain block and range block are subtracted using a two
stage subtraction unit consisting of difference units; the modulus operator
and right shift operator compute the final RMS error from the difference
output.

Figure 3.9. Modified RMS computation.
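
The modified computation can be illustrated with the sketch below, which accumulates absolute differences in place of squares and a square root and then applies a right shift. The fixed shift of 2 is an assumption made only to keep the example short; in the architecture the shift amount is chosen by dedicated logic, and the function names are hypothetical.

```python
def sum_abs_diff(range_block, domain_block):
    """Accumulate |R - D| over all pixel pairs of two equally sized blocks."""
    return sum(
        abs(r - d)
        for r_row, d_row in zip(range_block, domain_block)
        for r, d in zip(r_row, d_row)
    )


def modified_error(range_block, domain_block, shift=2):
    """Right-shifted sum of absolute differences (shift amount assumed fixed)."""
    return sum_abs_diff(range_block, domain_block) >> shift
```

Only subtraction, accumulation and shifting are needed, which is what makes this form attractive for a multiplexer and LUT based datapath.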

The difference unit shown in Figure 3.10 computes the difference between
the two operands R_i and D_i. The MSBs of the two operands are used to
change the operands into positive numbers using the EX-OR operation. The
N-bit subtractor unit performs the subtraction of R_i and D_i, and the
computed difference is stored in the output register, to which an MSB of
‘0’ is appended, thus performing the modulus operation (MOD operation).

Figure 3.10. Difference units with modular operation.

Figure 3.11. RMS error computation unit.

The right shift operation is performed using the novel architecture shown
in Figure 3.11. The output of the mod operator, represented using 7 bits,
is loaded into an intermediate register. The 4th, 5th, 6th and 7th bit
positions are compared for priority: if the 4th bit is ‘1’, a right shift
by 2 is performed; if the 5th bit is ‘1’ and the 4th bit is ‘0’, a right
shift by 3 is performed; and so on. The right shifter output is selected by
the four MSBs to compute the RMS error.

Figure 3.12. Codebook generation logic.

Each of the four 2 × 2 domain blocks compared with the range block
generates an RMS error, giving four RMS errors. The RMS error that gives
the best match is identified for generation of the codebook. The four
errors (E_i) are compared, and the minimum error is selected by the
multiplexer logic shown in Figure 3.12. This process simultaneously
generates the x, y coordinates of the domain block with the minimum error,
and the coordinates are stored in the two-bit output register labeled add
register. The proposed architecture is the first of its kind for finding
the x, y coordinates and address generation. The coordinates obtained are
appended to the codebook iteratively until all the subblocks are processed
to complete FIC.
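
The minimum selection can be sketched as a two-level comparison that returns a two-bit index alongside the minimum error. This is a behavioural model with hypothetical names; ties are resolved in favour of the lower index, which is an assumption, and the index merely stands in for the contents of the add register.

```python
def select_minimum(errors):
    """Return (index, value) of the smallest of four errors via two compare levels."""
    if len(errors) != 4:
        raise ValueError("exactly four errors are expected")
    # First level: one comparison per pair.
    first = 0 if errors[0] <= errors[1] else 1
    second = 2 if errors[2] <= errors[3] else 3
    # Second level: compare the two winners; the index fits in two bits.
    best = first if errors[first] <= errors[second] else second
    return best, errors[best]


def build_codebook(error_sets):
    """Append the selected index for each set of four errors."""
    return [select_minimum(errors)[0] for errors in error_sets]
```

For instance, `select_minimum([7, 8, 2, 6])` selects index 2 with error 2.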
The proposed architecture for FIC is modeled using Verilog HDL. A
hierarchical approach is adopted in developing the FIC architecture. The
lower level modules are modeled using the behavioral modeling technique,
and the modules are integrated into the top module using the structural
coding style. ModelSim simulation results are cross verified for various
test cases, and the functionally correct HDL code is synthesized using
Xilinx ISE targeting the Virtex-5 FPGA.

4. FPGA Implementation of Proposed FIC

In this work, the proposed FIC algorithm is implemented on a Virtex-5 FPGA
development kit. The Xilinx Virtex-5 device xc5vlx110t, with approximately
110,000 logic cells, is used. Its configurable logic blocks (CLBs) are
arranged in a 160 × 54 array that provides 17,280 Virtex-5 slices and a
maximum distributed random access memory (RAM) of 1,120 Kb. The device also
has 64 DSP48E slices and block RAMs of varied sizes. An input image of size
256 × 256 is stored in the input memory of the FPGA by mapping it as a
coefficient file, which is stored in ROM. The input image is partitioned
into 8 × 8 subblocks that are stored in the internal memory of the FPGA.
Parallel processing of all eight blocks is then carried out with four
quadtree fractal image compression units, and the fractal codebook is
generated. After computation, the fractal codebook is stored in RAM through
a FIFO. The top level module is controlled using the FSM control unit, as
shown in Figure 4.1.

Figure 4.1. Top module of FIC architecture with control unit.



From the simulated results, it is seen that among the four domain blocks
chosen for an 8 × 8 image, each of size 4 × 4 and labeled domain 1, 2, 3
and 4, domain block 2 has not been selected as the suitable domain block
for any of the range blocks, so the information contained in domain block 2
can be neglected. Hence, there is a reduction in size, since one of the 4
domain blocks to be matched with the range blocks is left unselected.
4.1. Implementation results
The proposed architecture was simulated and synthesized using Xilinx ISE
with the Virtex-5 FPGA as the target, and appropriate constraints were set
for area and power optimization. The input image of size 256 × 256 and the
compressed codebook are stored in the internal memory, and functional
verification is carried out by reading the codebook into the simulation
window. The operating frequency of the design is 139.9 MHz on the FPGA. The
proposed architecture is found to be effective when compared with
conventional fractal image compression. FIC encoding is completed in less
than 8.3 ns, the memory usage is found to be 357512 Kilobytes and the power
consumed is less than 0.46 W when operating at maximum frequency. The
design is optimized to consume 9% of the total FPGA slice registers, 3% of
the LUTs and 34% of the LUT-FF pairs. The design requires 348 IOBs, as the
generated codebook is read out for functional verification. Table 4.1
compares the performance of the proposed architecture with other
references.

Table 4.1. Performances of the proposed architecture with references

| Parameters | Proposed method | Samavi et al. [24] | Jackson et al. [14] | Acken et al. [22] | Vidya et al. [23] |
|---|---|---|---|---|---|
| Input size | 256 × 256 × 8 | 256 × 256 × 8 | 512 × 512 × 8 | 512 × 512 × 8 | 128 × 128 × 8 |
| Clock rate | 139.9 MHz | 240 MHz | 32.05 MHz | 200 MHz | - |
| Total cycles | 124,000 | 197,000 | 268,052 | 262,000,000 | - |
| Execution time | 66.4 ns | 0.8 ms | 8.38 ms | 1.31 s | 14.15 s |
| Power dissipation | 0.46 W | 400 μW | - | - | - |
| Platform | Virtex-5 | Virtex-4 | FPGA | SIMD | ASIC |
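
As a rough cross-check of Table 4.1, execution time can be estimated as total cycles divided by clock rate. The sketch below applies that relation to the rows that list both quantities; the relation is an assumption about how the table entries relate, and the row labels are simply copied from the table.

```python
def execution_time_seconds(total_cycles, clock_hz):
    """Estimated execution time, assuming one clock period per counted cycle."""
    return total_cycles / clock_hz


# (total cycles, clock rate in Hz) as listed in Table 4.1
rows = {
    "Proposed method": (124_000, 139.9e6),
    "Samavi et al. [24]": (197_000, 240e6),
    "Jackson et al. [14]": (268_052, 32.05e6),
    "Acken et al. [22]": (262_000_000, 200e6),
}

for name, (cycles, clock) in rows.items():
    millis = execution_time_seconds(cycles, clock) * 1e3
    print(f"{name}: {millis:.3f} ms")
```

Under this assumption the three reference rows come out close to their tabulated times (about 0.82 ms, 8.36 ms and 1.31 s), whereas the proposed-method row evaluates to roughly 0.89 ms. The tabulated 66.4 ns is therefore presumably measured over a different interval than the total cycle count; the text does not say which.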

The proposed design operates at a frequency of 139.9 MHz and completes the
encoding process in less than 66 ns, and hence is faster than the existing
techniques; however, the power dissipation of the proposed work is higher
than that of the other techniques. The power dissipation is due to the
parallel processing of eight subblocks; power saving techniques such as
block enabling and clock gating can be adopted to reduce it. The same
architecture could be extended to a 512 × 512 image and implemented on
FPGA. The speedup factor (SF) of a particular method is defined as the
ratio of the time taken by full search to that taken by the method. The
results reveal that the encoding time for fractal image compression is
reduced, and the speedup factor compared with conventional fractal image
compression is increased 109 times. As the compression ratio is lower for
FIC on FPGA than for the conventional method of implementing FIC, the
quality of the image is preserved. The major demerit of FIC is its long
encoding time; FIC implemented on FPGA overcomes this problem, as the
encoding time is much lower than that of the conventional FIC method.

5. Conclusion

The computation time and complexity of fractal image compression, due to
the search and comparison operations in codebook generation, have limited
the use of FIC in real time applications. Many works reported in the
literature have attempted to reduce the computation time at the expense of
area complexity. In this work, a computation time of less than a few
milliseconds is achieved together with area optimization by designing the
architecture using parallel processing logic and realizing the arithmetic
units using multiplexers. The two pipelined architectures, one for
processing data in subblocks and one for affine transformation with search,
have been modeled and implemented on FPGA. The parallel processing
architecture has an internal pipelined structure for computation of
arithmetic operations. The synthesis results obtained for a 256 × 256 × 8
input image demonstrate that the design can perform compression in less
than 66 ns and is suitable for real time applications.

Acknowledgement

The authors thank the anonymous referees for their valuable suggestions
which led to the improvement of the manuscript.

References

[1] M. F. Barnsley and A. E. Jacquin, Application of recurrent iterated function
systems to images, Proc. SPIE, Vol. 1001, 1988, pp. 122-131.
[2] Mario Polvere and Michele Nappi, Speed-up in fractal image coding: comparison
of methods, IEEE Trans. Image Process. 9(6) (2000), 1002-1009.
[3] T. K. Truong, C. M. Kung, J. H. Jeng and M. L. Hsieh, Fast fractal image
compression using spatial correlation, Chaos Solitons Fractals 22(5) (2004),
1071-1076.
[4] X. Wang, Y. Wang and J. Yun, An improved fast fractal image compression using
spatial texture correlation, Chin. Phys. B 20(10) (2001), 104202-1-104202-11.
[5] H. N. Chen, K. L. Chung and J. E. Hung, Novel fractal image encoding algorithm
using normalized one-norm and kick-out condition, Image Vis. Comput. 28(3)
(2010), 518-525.
[6] C. He, X. Xu and G. Li, Improvement of fast algorithm based on correlation
coefficients for fractal image encoding, Comput. Simul. 12(4) (2005), 60-63.
[7] R. Distasi, M. Nappi and D. Riccio, A range/domain approximation error-based
approach for fractal image compression, IEEE Trans. Image Process. 15(1)
(2006), 89-97.
[8] Y. Zhou, C. Zhang and Z. Zhang, An efficient fractal image coding algorithm
using unified feature and DCT, Chaos Solitons Fractals 39(4) (2009), 1823-1830.
[9] C. He, S. Yang and X. Huang, Variance-based accelerating scheme for fractal
image encoding, Electron. Lett. 40(2) (2004), 1052-1053.
[10] Y. Fisher, E. W. Jacobs and R. D. Boss, Fractal image compression using iterated
transforms, Image Text Compress. 176 (1992), 35-61.
[11] Jianji Wang and Nanning Zheng, A novel fractal image compression scheme with
block classification and sorting based on Pearson’s correlation coefficient, IEEE
Trans. Image Process. 22(9) (2013), 3690-3702.
[12] X. Wu, D. J. Jackson and H. Chen, A new searchless two-level IFS fractal image
encoding method, Proceedings of the 19th International Conference on Computers
and their Applications, 2004a, pp. 6-10.
[13] X. Wu, D. J. Jackson, H. Chen, W. A. Stapleton and K. G. Ricks, A new deeper
quadtree searchless IFS fractal image encoding method, Proceedings of the
International Conference on Imaging Science, Systems, and Technology, 2004b,
pp. 324-329.
[14] David Jeff Jackson, Haichen Ren, Xianwei Wu and Kenneth G. Ricks, A
hardware architecture for real-time image compression using a searchless fractal
image coding method, J. Real-Time Image Proc. 1 (2007), 225-237.
DOI: 10.1007/s11554-007-0024-2.
[15] Jyoti Bhola and Simarpreet Kaur, Encoding time reduction method for the wavelet
based fractal image compression, Inter. J. Computer Engineering Science (IJCES)
2(5) (2012), 28-35.
[16] Thai Nam Son, Ong Manh Hung, Dang Thi Xuan, Van Long, T. Nguyen Tien
Dzung and Thang Manh Hoang, Implementation of fractal image compression on
FPGA, Published in Fourth International Conference on Communications and
Electronics (ICCE), 2012, pp. 339-344.
[17] M. Ramirez, A. D. Sanchez, M. L. Aranda and J. Vega-Pineda, Simple and fast
fractal image compression for VLSI circuits, Proc. of the 3rd International
Symposium on Image and Signal Processing and Analysis, 2003, pp. 112-116.
[18] M. S. Wu, J. H. Jeng and J. G. Hsieh, Schema genetic algorithm for fractal image
compression, Eng. Appl. Artificial Intelligence 20 (2007), 531-538.
[19] Yi-Ming Zhou, Chao Zhang and Zeng-Ke Zhang, Fast hybrid fractal image
compression using an image feature and neural network, Chaos Solitons Fractals
37(2) (2008), 623-631.
[20] J. Li and C. C. Jay Kuo, Image compression with a hybrid wavelet-fractal coder,
IEEE Trans. Image Process. 8(6) (1999), 868-873.
[21] D. J. Duh, J. H. Jeng and S. Y. Chen, DCT based simple classification scheme for
fractal image compression, Image Vis. Comput. 23(13) (2005), 1115-1121.
[22] K. P. Acken, M. J. Irwin and R. M. Owens, A parallel ASIC architecture for
efficient fractal image coding, J. VLSI Signal Process. 19(1) (1998), 97-113.
[23] D. Vidya, R. Parthasarathy, T. C. Bina and N. G. Swaroopa, Architecture for
fractal image compression, J. System Architecture 46 (2000), 1275-1291.
[24] S. Samavi, M. Habibi, S. Shirani and N. Rowshanbin, Real time fractal image
coder based on characteristic vector matching, J. Image and Vision Computing
28(11) (2010), 1557-1568.
