
X. Lan et al.

Low-Power and High-Speed VLSI Architecture For Lifting-Based Forward and Inverse Wavelet Transform


Low-Power and High-Speed VLSI Architecture For Lifting-Based Forward and Inverse Wavelet Transform
Xuguang Lan, Nanning Zheng, Senior Member, IEEE, Yuehu Liu
Abstract: A low-power, high-speed architecture that performs the two-dimensional forward and inverse discrete wavelet transform (DWT) for the set of filters in JPEG2000 is proposed, using a line-based lifting scheme. It consists of one row processor and one column processor, each of which contains four sub-filters. The row processor is time-multiplexed and operates in parallel with the column processor. Optimized shift-add operations are substituted for multiplications, and edge extension is implemented by an embedded circuit. The whole architecture, which is pipelined to increase speed and achieve higher hardware utilization, has been demonstrated in an FPGA. Two pixels per clock cycle can be encoded at 100 MHz. The architecture can be used as a compact and independent IP core for JPEG2000 VLSI implementations and various real-time image/video applications.

Index Terms: 2-D DWT, Parallel Architecture, JPEG2000, Lifting Scheme, VLSI.

This work was supported in part by the National High-Tech Program 863 of P.R.China under Grant No. 2002AA103011 and Creative Foundation of Nature Science No.60021302. Xuguang Lan is with the Institute of Artificial Intelligence and Robotics, Xian JiaoTong University, Xian, P.R.China (e-mail: xglan@aiar.xjtu.edu.cn). Nanning Zheng is with the Institute of Artificial Intelligence and Robotics, Xian JiaoTong University, Xian, P.R.China (e-mail: nnzheng@mail.xjtu.edu.cn). Yuehu Liu is with the Institute of Artificial Intelligence and Robotics, Xian JiaoTong University, Xian, P.R.China (e-mail: liuyh@aiar.xjtu.edu.cn). Contributed Paper. Manuscript received December 3, 2004.

0098 3063/05/$20.00 © 2005 IEEE

IEEE Transactions on Consumer Electronics, Vol. 51, No. 2, MAY 2005

I. INTRODUCTION

Digital image compression engines are increasingly needed to store or transmit large volumes of image and video data in multimedia applications. As a new-generation image compression standard [1]-[3], JPEG2000 supports progressive transmission by quality and resolution, region of interest (ROI) coding and other features. These capabilities stem from the discrete wavelet transform (DWT), which is used for the spatial decomposition and supports the scalability of the code stream. The DWT is implemented by the lifting scheme in JPEG2000. Compared with traditional convolution, the computational complexity of the lifting scheme [4][5] is reduced by about half. The two-dimensional DWT is usually implemented directly, as in [6]: the image is first processed along the rows and then along the columns. This approach needs a large memory to store intermediate results, whereas the line-based architectures [7]-[9] minimize the memory. W. Chang [9] describes a line-based architecture for the 2-D DWT using the lifting scheme, which consists of one row filter, one column filter and on-chip memory. It requires 9N storage cells and O(N^2) clock cycles (ccs) to compute an N×N image using the 9/7 wavelet. K. Andra [8] generalizes the lifting-based architecture with two row processors, two column processors and two memory modules, but the memory control logic of that architecture is complex and O(N^2) ccs are needed for an N×N image. G. Dillen [7] describes a combined line-based architecture for the IDWT which needs two row processors, so computing an N×N image in O(N^2/2) ccs comes at the cost of increased resource requirements.

This paper describes a new line-based, low-power, parallel architecture for computing the 2-D DWT/IDWT of JPEG2000 using the lifting scheme, including the filters (5,3) (the high-pass and the low-pass filter have five taps and three taps, respectively), (9,7), (13,7), (2,2), (2,6), (2,10), (7,5), (10,18) and (6,10) [11]. Since JPEG2000 Part I is the core system of the standard, we focus on the configuration of the default filters, (5,3) and (9,7). The proposed architecture contains one row processor to compute along the rows and one column processor to compute along the columns, and performs the multilevel decomposition of the DWT and IDWT one level at a time, in row-column fashion. The multilevel DWT can be further optimized through memory control to speed it up and reduce the memory needed for DWT coefficients. Both the row and the column processor consist of sub-filters, and each sub-filter is equivalent to one lifting step of the lifting scheme. The row processor is time-multiplexed. Two lines are computed at a time, and the horizontal and vertical filtering are performed in parallel. The edge extension is implemented by an embedded circuit. Shift-add operations that exploit common factors are substituted for the multiplications. The whole architecture is pipelined and its control circuit is simple, computing an N×N image in O(N^2/2) ccs.

This paper is organized as follows. Section II presents the theoretical concepts required for our architecture and describes the lifting scheme. Section III describes the DWT and IDWT architecture, including the row processor and the column processor. Experimental results are presented in Section IV. The conclusion is given in Section V.

II. LIFTING SCHEME

The lifting scheme is called the second-generation wavelet [5] because it constructs biorthogonal wavelets without requiring the Fourier transform. The lifting scheme,

which leads to a faster, fully in-place implementation of the wavelet transform, is attractive for both high-throughput and low-power applications. Every finite wavelet transform can be factorized into lifting steps [4]. The basic principle is to factorize the polyphase matrix of the wavelet or subband filters into a sequence of alternating upper and lower triangular matrices and a diagonal matrix by using the Euclidean algorithm. This leads to a wavelet implementation by means of banded-matrix multiplications. Let the polyphase matrix of the wavelet be

$$\tilde{P}(z) = \begin{bmatrix} \tilde{h}_e(z) & \tilde{h}_o(z) \\ \tilde{g}_e(z) & \tilde{g}_o(z) \end{bmatrix}$$

where $\tilde{h}_e(z)$ and $\tilde{h}_o(z)$ denote the even part and odd part of the low-pass analysis filter $\tilde{h}(z)$, and $\tilde{g}_e(z)$ and $\tilde{g}_o(z)$ are the even part and odd part of the high-pass analysis filter $\tilde{g}(z)$ of the wavelet filter, respectively. The polyphase matrix can be factorized by the Euclidean algorithm into

$$\tilde{P}(z) = \left( \prod_{i=1}^{m} \begin{bmatrix} 1 & s_i(z) \\ 0 & 1 \end{bmatrix} \begin{bmatrix} 1 & 0 \\ t_i(z) & 1 \end{bmatrix} \right) \begin{bmatrix} K & 0 \\ 0 & 1/K \end{bmatrix}$$

The dual polyphase matrix is given by [4]

$$P(z) = \left( \prod_{i=1}^{m} \begin{bmatrix} 1 & 0 \\ -s_i(z^{-1}) & 1 \end{bmatrix} \begin{bmatrix} 1 & -t_i(z^{-1}) \\ 0 & 1 \end{bmatrix} \right) \begin{bmatrix} 1/K & 0 \\ 0 & K \end{bmatrix}$$

where K is a constant. Therefore, the DWT is factorized into the lifting scheme. The lifting scheme of the wavelet filter for a one-dimensional signal is shown in Fig.1 and consists of four steps:
1) Split step, where the signal is split into even and odd points, because the maximum correlation between adjacent pixels can be utilized for the next predict step.
2) Predict step, where the even samples are multiplied by the time-domain equivalent of t(z) and are added to the odd samples.
3) Update step, where the odd samples multiplied by the time-domain equivalent of s(z) are added to the even samples.
4) Scaling step, where the even samples are multiplied by 1/K and the odd samples by K.

From Fig.1, we can derive that the DWT and IDWT are symmetrical. The IDWT is obtained simply by traversing in the inverse direction, changing the factor K to 1/K and the factor 1/K to K, and reversing the signs of the coefficients in t(z) and s(z).

A biorthogonal wavelet, for example the Daubechies 9/7 wavelet filter [1], is factorized into the lifting scheme

$$\tilde{P}(z) = \begin{bmatrix} 1 & \alpha(1+z^{-1}) \\ 0 & 1 \end{bmatrix} \begin{bmatrix} 1 & 0 \\ \beta(1+z) & 1 \end{bmatrix} \begin{bmatrix} 1 & \gamma(1+z^{-1}) \\ 0 & 1 \end{bmatrix} \begin{bmatrix} 1 & 0 \\ \delta(1+z) & 1 \end{bmatrix} \begin{bmatrix} K & 0 \\ 0 & 1/K \end{bmatrix}$$

and the lifting scheme is as follows.

Forward DWT:
step1: Y(2n+1) = Xext(2n+1) + α(Xext(2n) + Xext(2n+2)),   i0 - 3 ≤ 2n+1 < i1 + 3
step2: Y(2n) = Xext(2n) + β(Y(2n-1) + Y(2n+1)),   i0 - 2 ≤ 2n < i1 + 2
step3: Y(2n+1) = Y(2n+1) + γ(Y(2n) + Y(2n+2)),   i0 - 1 ≤ 2n+1 < i1 + 1
step4: Y(2n) = Y(2n) + δ(Y(2n-1) + Y(2n+1)),   i0 ≤ 2n < i1
step5: Y(2n+1) = K × Y(2n+1),   i0 ≤ 2n+1 < i1
step6: Y(2n) = Y(2n) / K,   i0 ≤ 2n < i1

Inverse DWT:
step1: X(2n) = K × Xext(2n),   i0 - 3 ≤ 2n < i1 + 3
step2: X(2n+1) = Xext(2n+1) / K,   i0 - 2 ≤ 2n+1 < i1 + 2
step3: X(2n) = X(2n) - δ(X(2n-1) + X(2n+1)),   i0 - 3 ≤ 2n < i1 + 3
step4: X(2n+1) = X(2n+1) - γ(X(2n) + X(2n+2)),   i0 - 2 ≤ 2n+1 < i1 + 2
step5: X(2n) = X(2n) - β(X(2n-1) + X(2n+1)),   i0 - 1 ≤ 2n < i1 + 1
step6: X(2n+1) = X(2n+1) - α(X(2n) + X(2n+2)),   i0 - 1 ≤ 2n+1 < i1

where i0 and (i1 - 1) denote the indices of the first sample and the last sample of one row or column. The implementation of the D9/7 forward and inverse DWT is thus factorized into a six-step lifting scheme, as shown in Fig.2, with the lifting coefficients α = -1.586134342, β = -0.052980118, γ = 0.882911075, δ = 0.443506852 and K = 1.230174105.

Fig.2. Lifting scheme of Daubechies 9/7 wavelet
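The forward and inverse lifting steps of this section can be exercised in software. The following is a minimal floating-point sketch, not the hardware described in this paper: the function names (`forward_97`, `inverse_97`, `_lift`, `_reflect`) are illustrative assumptions, and the boundary handling folds the symmetric extension into the index computation instead of building an extended signal.

```python
# Lifting coefficients of the Daubechies 9/7 filter as listed with Fig.2.
ALPHA = -1.586134342
BETA = -0.052980118
GAMMA = 0.882911075
DELTA = 0.443506852
K = 1.230174105


def _reflect(i, n):
    """Map any index onto [0, n) by whole-sample symmetric extension."""
    if n == 1:
        return 0
    period = 2 * (n - 1)
    i %= period
    return i if i < n else period - i


def _lift(y, parity, coeff):
    """In place: y[i] += coeff * (y[i-1] + y[i+1]) for every i of one parity."""
    n = len(y)
    for i in range(parity, n, 2):
        y[i] += coeff * (y[_reflect(i - 1, n)] + y[_reflect(i + 1, n)])


def forward_97(x):
    """Forward 1-D transform: even outputs are low-pass, odd are high-pass."""
    y = [float(v) for v in x]
    _lift(y, 1, ALPHA)   # step 1: predict odd samples
    _lift(y, 0, BETA)    # step 2: update even samples
    _lift(y, 1, GAMMA)   # step 3: predict odd samples
    _lift(y, 0, DELTA)   # step 4: update even samples
    # steps 5 and 6: scale odd samples by K and even samples by 1/K
    return [v * K if i % 2 else v / K for i, v in enumerate(y)]


def inverse_97(y):
    """Inverse 1-D transform: undo the scaling, then the lifting steps in reverse."""
    x = [v / K if i % 2 else v * K for i, v in enumerate(y)]
    _lift(x, 0, -DELTA)
    _lift(x, 1, -GAMMA)
    _lift(x, 0, -BETA)
    _lift(x, 1, -ALPHA)
    return x


if __name__ == "__main__":
    signal = [3, 7, 1, 8, 2, 9, 4, 6, 5, 0]
    restored = inverse_97(forward_97(signal))
    print(max(abs(a - b) for a, b in zip(signal, restored)))
```

Because every lifting step only modifies samples of one parity using neighbours of the other parity, the inverse can undo each step exactly by flipping the sign of its coefficient, which is the symmetry between DWT and IDWT noted above.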

Fig.1. Lifting Scheme (forward DWT and inverse DWT)

III. VLSI ARCHITECTURE OF 2-D DWT

The architecture of the 2-D DWT is shown in Fig.3. Two lines of image data or LL subband data are routed by the control unit (CU). After the data are processed by the row processor (RP) and the column processor (CP), the address generator generates the addresses of the output data of the CP. The function of the CU is to route the input data and to control the next level of decomposition according to the current level and the number of decomposition levels required. Time-multiplexing the row processor not only increases the utilization but also minimizes the number of storage cells. The CP outputs two pixels per clock: LL and LH, or HL and HH subband data. The address generator produces the addresses, and the output data are stored in memory by the memory control. The memory control unit is outside the scope of this article and is not discussed. The architecture not only runs the high-pass and low-pass filters of the RP and CP in parallel, but also performs the horizontal and vertical decompositions in parallel.
Fig.3. The Architecture of 2-D DWT

A. Optimizing Multiplication Using Shift-Add Operations

Because multipliers occupy a large area and many hardware resources, which is undesirable in chip design, and because the coefficients of the wavelet filter are constants, we substitute shift-add operations for the multiplications to optimize the implementation. First, the coefficients of the wavelet filters, α, β, γ, δ, K and 1/K, are quantized to binary, and the bits with value 1 in their positive representation are identified, because each 1 yields a term to be summed. Then the common factor is found. For example, δ = (0.443506852)10 = (0.0111000110001)2, and the multiplication of δ and X is equivalent to

δX = X>>2 + X>>3 + X>>4 + X>>8 + X>>9 + X>>13.

This equation is simplified by sharing the common factor (X>>1 + X>>2):

δX = X>>2 + X>>13 + (X>>1 + X>>2)>>2 + (X>>1 + X>>2)>>7.

The multiplication is optimized by using shift and add operations in this way, and the other coefficients are processed in the same way. The architecture employs a 24-bit 2's complement fixed-point data representation, with 13 bits for the fractional part and 11 bits for the integral part. Compared with floating-point data, the effect of the fixed-point representation can be ignored, as shown in Fig.4. The two test images are goldhill and baboon, each of size 512×512×8 bits.

Fig.4. PSNR and Rate (PSNR in dB versus rate in bpp; curves: goldhill_float, goldhill_fix, baboon_float, baboon_fix)

B. Embedded Periodic Extension at the Boundaries

The finite length of the signal processed by the wavelet filter leads to an edge effect [10]. The JPEG2000 standard employs symmetric extension at the boundaries to eliminate it, see Table I and Fig.5a. In the implementation, the extension of the signal is embedded into the row processor and the column processor. In other words, the extension points are not actually calculated, because their computations are the same as those of original image samples. In each lifting step, multiplexers route to the adder two input points whose values equal those of the original samples. In Fig.5b, for example, the forward extension samples numbered 4, 3, 2 and 1 to the left of sample 0 are not computed in step 1, and only the odd original samples are predicted. In step 2, the computed original sample 1 is added to itself to update the even sample 0. The embedded scheme is implemented by a finite state machine (FSM) and multiplexers, as shown in Fig.6, Fig.7 and Fig.8. The scale and power of the circuit are greatly reduced in this way. The FSM has four states: IDLE, FORWARD extension, NORMAL and LAST extension. The signals that trigger the state transitions are derived from the line-enable wire of the pixels. The forward extension signals ext_en1 and ext_en2, which enable the multiplexers in the corresponding sub-filters of the RP and CP, are generated in the state FORWARD. The last extension signals ext_en3 and ext_en4, which enable the multiplexers in their corresponding sub-filters, are produced in the state LAST extension. These signals are shown in Table II, Fig.7 and Fig.8.

Fig.5. Periodic extension at signal boundaries: (a) Periodic symmetric extension; (b) Embedded scheme of periodic extension (only the samples in the rectangle are computed)

TABLE I
PERIODIC EXTENSION OF WAVELET AT THE BOUNDARIES

Filter   ileft (i0 even)   ileft (i0 odd)   iright (i1 odd)   iright (i1 even)
5/3      1                 2                1                 2
9/7      3                 4                3                 4

ileft denotes the number of extension samples towards the left; iright denotes the number of extension samples towards the right.
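The four-state extension controller described in this subsection can be sketched as a small software state machine. The output values follow Table II; the transition out of the LAST state back to IDLE is an assumption, since the text does not state it, and the names `ExtState`, `EXT_OUTPUTS` and `next_state` are purely illustrative.

```python
from enum import Enum


class ExtState(Enum):
    IDLE = 0
    FORWARD = 1   # forward (first) extension at the start of a line
    NORMAL = 2
    LAST = 3      # last extension at the end of a line


# (ext_en1, ext_en2, ext_en3, ext_en4) asserted in each state, per Table II.
EXT_OUTPUTS = {
    ExtState.IDLE: (0, 0, 0, 0),
    ExtState.FORWARD: (1, 1, 0, 0),
    ExtState.NORMAL: (0, 0, 0, 0),
    ExtState.LAST: (0, 0, 1, 1),
}


def next_state(state, reset=False, initial_en=False, normal_en=False, last_en=False):
    """Return the next state for the given trigger signals."""
    if reset:
        return ExtState.IDLE
    if state is ExtState.IDLE and initial_en:
        return ExtState.FORWARD
    if state is ExtState.FORWARD and normal_en:
        return ExtState.NORMAL
    if state is ExtState.NORMAL and last_en:
        return ExtState.LAST
    if state is ExtState.LAST:
        # Assumption: the controller returns to IDLE once the line has ended.
        return ExtState.IDLE
    return state


if __name__ == "__main__":
    state = ExtState.IDLE
    for trigger in ("initial_en", "normal_en", "last_en", None):
        state = next_state(state, **({trigger: True} if trigger else {}))
        print(state.name, EXT_OUTPUTS[state])
```

In hardware the four outputs would drive the multiplexer select inputs directly; the dictionary lookup here only mirrors that mapping.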
TABLE II
CONTROL SIGNAL OF THE EXTENSION OF FORWARD AND BACK END

State       Ext_en1   Ext_en2   Ext_en3   Ext_en4
IDLE        0         0         0         0
First_ext   1         1         0         0
normal      0         0         0         0
Last_ext    0         0         1         1

Fig.6. Finite state machine (FSM) of extension (states: Idle, FORWARD, NORMAL, LAST; signals: reset, Initial_en, Normal_en, Last_en)

C. The Architecture of the Row Processor

The row processor consists of four sub-filters α, β, γ and δ, as shown in Fig.7. The sub-filters γ and δ are similar to the sub-filters α and β, respectively. All the hardware resources of the row processor can be time-multiplexed, see Fig.7. Two lines are calculated at a time, and the pixels of each line are not divided into even and odd samples when they are input into the row processor continuously. That is, two pixels are encoded per clock. This reduces the number of storage cells and increases the speed of the row processor. Compared with a row processor whose input lines are partitioned into even and odd samples (which needs two parallel RPs), the proposed row processor is time-multiplexed at the cost of several registers. Therefore, hardware resources are greatly reduced and utilization is high.

The pixels of the two lines are first input into sub-filter α of the row processor, and the even samples of each line are summed by using delay flip-flops (DFFs) and multiplexers. After the sums are multiplied by the coefficient α of the sub-filter (implemented as an optimized shift-add operation), the products are added to the odd samples of each line. Relative to the first input line (even line), the computation of the second input line (odd line) is delayed by one clock. Even-line pixels are computed in odd clocks, and odd-line pixels in even clocks. Pipelined computation is achieved via the registers. The two lines of output samples of sub-filter α are routed into sub-filter β by the multiplexers, where the odd samples are summed by using the DFFs. The sums are multiplied by the coefficient β and added to the even samples, using the same optimizations as in sub-filter α. The computations of the sub-filters γ and δ are similar to those of the sub-filters α and β, respectively. The even and odd samples of each output line of sub-filter δ are multiplied by 1/K and K, respectively. Filtering of the image along the line is finished in this way.

Time-multiplexing of the row processor is implemented by computing the even lines at odd clocks and the odd lines at even clocks. The control signal sel_en of the router, which triggers the multiplexers to route data to the adders, is generated by a counter. The row processor is pipelined, and the samples are encoded continuously as they are input. Hardware utilization reaches approximately 100%, and the control logic is simple. The schedule of sub-filter α is shown in Table III. Xi,j denotes the sample of line i and column j; X0,2, for example, is the third sample of the first line. SA-i denotes the product of the wavelet filter coefficient and the sum formed at the previous clock, where i is the line number: 0 denotes the even line of the two input lines and 1 the odd one. For instance, at clock 3, X0,2 is added to X0,0 and the sum is stored in a register. At clock 4, X1,2 is added to X1,0 and, at the same time, the product SA-0 = α(X0,2 + X0,0) is computed and stored in registers. The product SA-1 = α(X1,2 + X1,0) is computed at clock 5.

Fig.7. Row Processor of the Forward DWT: (a) Prediction step 1, Sub-Filter α; (b) Update step 1, Sub-Filter β; (c) Prediction step 2, Sub-Filter γ; (d) Update step 2, Sub-Filter δ. Legend: multiplexer, register, adder, shift-add; α, β, γ, δ and K are coefficients of the wavelet filter.

From Fig.7, only eight adders and six shift-add operations are needed to implement the row filter of D9/7. The shifts do not increase the hardware consumption, because they can be implemented by wiring and combinational logic. Therefore, the proposed architecture of the row processor greatly reduces the hardware consumption and complexity.
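To make one of these shift-add operations concrete, the δ multiplication of Section III-A can be checked in software. This is a sketch on non-negative Python integers rather than a model of the 24-bit datapath; the helper names are illustrative assumptions. Each right shift truncates, so the direct and the shared-factor forms can differ by a few least significant bits unless the input is a multiple of 2^13.

```python
FRACTION_BITS = 13
DELTA = 0.443506852


def shift_terms(coefficient, bits=FRACTION_BITS):
    """Right-shift amounts whose sum approximates a coefficient in [0, 1)."""
    quantized = int(coefficient * (1 << bits))  # truncate to `bits` fractional bits
    return [k for k in range(1, bits + 1) if (quantized >> (bits - k)) & 1]


def mul_delta_direct(x):
    """One shifted term per 1 bit of the quantized coefficient (six terms)."""
    return sum(x >> k for k in shift_terms(DELTA))


def mul_delta_shared(x):
    """Same product with the common factor (x>>1 + x>>2) computed once."""
    common = (x >> 1) + (x >> 2)
    return (x >> 2) + (x >> 13) + (common >> 2) + (common >> 7)


if __name__ == "__main__":
    print(shift_terms(DELTA))
    x = 5 << 13
    print(mul_delta_direct(x), mul_delta_shared(x), DELTA * x)
```

The shared form needs four additions in total (one for the common factor, three to combine the terms) instead of the five additions of the direct form, which is the saving the common-factor search is after.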
TABLE III
SCHEDULE OF THE SUB-FILTER α IN RP

Clock   Adder        Shift-Add   Adder
1       X0,0         -           -
2       -            -           -
3       X0,2+X0,0    -           -
4       X1,2+X1,0    SA-0        -
5       X0,4+X0,2    SA-1        SA-0+X0,1
6       X1,4+X1,2    SA-0        SA-1+X1,1
7       X0,6+X0,4    SA-1        SA-0+X0,3
8       X1,6+X1,4    SA-0        SA-1+X1,3
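The interleaving in Table III follows a simple rule: from clock 3 onward, the first adder alternates between the even and the odd input line, the shift-add stage runs one clock behind it, and the second adder two clocks behind. The sketch below regenerates the table entries from that rule; the function name `alpha_schedule` and the string labels are illustrative assumptions.

```python
def alpha_schedule(clock):
    """Return (adder, shift_add, adder) labels active at a 1-based clock.

    None means that pipeline stage has nothing scheduled at that clock.
    """
    def slot(c):
        # Line (0 = even line, 1 = odd line) and pair index of the sum formed at clock c.
        return (c - 3) % 2, (c - 3) // 2

    first_adder = shift_add = second_adder = None
    if clock >= 3:
        line, k = slot(clock)
        first_adder = f"X{line},{2 * k + 2}+X{line},{2 * k}"
    if clock >= 4:
        line, _ = slot(clock - 1)          # product of the sum from the previous clock
        shift_add = f"SA-{line}"
    if clock >= 5:
        line, k = slot(clock - 2)          # product is added to the odd sample in between
        second_adder = f"SA-{line}+X{line},{2 * k + 1}"
    return first_adder, shift_add, second_adder


if __name__ == "__main__":
    for clock in range(3, 9):
        print(clock, alpha_schedule(clock))
```

Every stage is busy on every clock once the pipeline has filled, which is how a single set of adders serves both lines.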

D. The Architecture of the Column Processor

The column processor performs the wavelet transform along the columns on the pixels delivered by the row processor, which is what distinguishes it from the row processor. Line buffers, which hold the samples required by the column processor, replace the registers of the row processor, as shown in Fig.8. The two output lines of the row processor are the even line and the odd line for the column processor; in other words, the pixels are naturally separated into even samples and odd samples along the column. This reduces the number of storage cells and simplifies the design of the column processor. Six line buffers are required for D9/7 and three for 5/3. Each line buffer works as a FIFO (first in, first out) of size W×N, where W is the width of the data path and N is the image width; the size does not depend on the image height.

The column processor begins to calculate after the first two lines have been computed in the row processor. First, the two lines are processed in sub-filter α. Then the outputs of sub-filter α, an even line and an odd line, are computed in sub-filters β, γ and δ in turn. The computations of the sub-filters γ and δ are similar to those of the sub-filters α and β, respectively. The even and odd samples of each output line of sub-filter δ are multiplied by 1/K and K, respectively. Filtering along the column is finished in this way. The column processor is also pipelined to increase the speed. The schedule of sub-filter α is shown in Table IV. At clock 2, for example, X2,0 is added to X0,0 and the sum is stored in registers. At clock 3, the sum is read from the registers and the product SA1 = α(X2,0 + X0,0) is calculated. At clock 4, SA1 is added to X1,0 and the sum is stored in registers. The schedule of the sub-filters β, γ and δ is similar to that of sub-filter α. Two pixels, LL and LH subband pixels or HL and HH subband pixels, are output per clock by the column processor. In Fig.8, only eight adders and six shift-add operations are needed, and the column processor operates in parallel with the row processor (delayed by only one line of clock cycles).

Fig.8. Column Processor of the Forward DWT: (a) Prediction step 1, Sub-Filter α; (b) Update step 1, Sub-Filter β; (c) Prediction step 2, Sub-Filter γ; (d) Update step 2, Sub-Filter δ

TABLE IV
THE SCHEDULE OF THE SUB-FILTER α IN CP

Clock   Adder         Shift-Add   Adder
1       X0,0          -           -
2       X2,0+X0,0     -           -
3       X4,0+X2,0     SA1         -
4       X6,0+X4,0     SA2         SA1+X1,0
5       X8,0+X6,0     SA3         SA2+X3,0
6       X10,0+X8,0    SA4         SA3+X5,0
7       -             -           -
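The role of the line buffers can be illustrated with a streaming version of a single predict step along the columns. This sketch holds at most two rows at a time, standing in for two line buffers of N samples each; it covers only one lifting step, whereas the paper's column processor chains four sub-filters and uses six line buffers for D9/7. The function name and the use of Python lists as buffers are illustrative assumptions.

```python
def column_predict_stream(rows, coeff):
    """Predict step along columns: odd_row += coeff * (even_above + even_below).

    `rows` yields image rows from top to bottom. Predicted odd rows are
    yielded as soon as the even row below them has arrived.
    """
    even_buf = None  # line buffer: most recent even row
    odd_buf = None   # line buffer: odd row waiting for its lower neighbour
    for index, row in enumerate(rows):
        if index % 2 == 0:
            if odd_buf is not None:
                yield [o + coeff * (a + b)
                       for o, a, b in zip(odd_buf, even_buf, row)]
                odd_buf = None
            even_buf = list(row)
        else:
            odd_buf = list(row)
    if odd_buf is not None:
        # Bottom boundary: symmetric extension mirrors the last even row,
        # so it is simply added to itself and no extra row is stored.
        yield [o + coeff * (a + a) for o, a in zip(odd_buf, even_buf)]


if __name__ == "__main__":
    image = [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]]
    for predicted in column_predict_stream(image, -0.5):
        print(predicted)
```

The buffer occupancy depends only on the row length, not on the number of rows, which matches the statement that the line buffer size is independent of the image height.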

E. Modularity

Two lifting steps are required for the (5,3) wavelet [1], as follows:

$$\text{Step1: } Y(2n+1) = X_{ext}(2n+1) - \left\lfloor \frac{X_{ext}(2n) + X_{ext}(2n+2)}{2} \right\rfloor$$

$$\text{Step2: } Y(2n) = X_{ext}(2n) + \left\lfloor \frac{Y(2n-1) + Y(2n+1) + 2}{4} \right\rfloor$$

Therefore, to perform this transform we need only use the sub-filters α and β of the row and column processors of D9/7, replacing the coefficients of the D9/7 wavelet with those of the (5,3) wavelet. Because the coefficients of the (5,3) wavelet are α = -1/2 = -2^(-1) and β = 1/4 = 2^(-2), the multiplication is reduced to shift operations, and the three-step algorithm (add, multiply, add) of each sub-filter is shown in Fig.9.

Fig.9. The Row Processor (a and b) and Column Processor (c and d) of (5,3) Lifting Scheme: (a), (c) Prediction step, Sub-Filter α; (b), (d) Update step, Sub-Filter β

F. The Architecture of IDWT

The architectures of the IDWT and DWT are symmetric because of the symmetry of the lifting scheme for the forward DWT and the inverse DWT. The IDWT filters first along the columns and then along the rows, see Fig.10.

Fig.10. The Architecture of IDWT

The subband data are input to the control unit, which outputs two columns of data. After the data are computed by the column processor and the row processor, the synthesized image or LL data are stored in the memory via the memory control. Multilevel decomposition is implemented by the control unit. The filtering order of the IDWT in the row and column processors is the sub-filters δ, γ, β and α in turn, as shown in Fig.11 and Fig.12. The architectures of the sub-filters are the same as those of the corresponding sub-filters of the forward DWT. Therefore, the IDWT can utilize the hardware resources of the forward DWT. The proposed architecture is universal for the 2-D DWT based on the lifting scheme.

Fig.11. Column Processor of the IDWT

Fig.12. Row Processor of the IDWT

IV. IMPLEMENTATION

We have developed an RTL (register transfer level) model of our architecture, which performs the forward and inverse DWT of D9/7 and 5/3 in an FPGA. Two pixels per clock cycle can be encoded at 100 MHz. For (5,3) and (9,7), 25% of the total area of the main chip, which has about 25000 logic elements, is needed for the multilevel wavelet decomposition, which is controlled by the FSM one level at a time with a 24-bit data path. Only 5% of the total area is needed for the implementation of the (5,3) wavelet. The memory requirement depends on the width of the image to be processed and on the width of the data path. For a 512×512×8-bit image with three levels of decomposition using D9/7, the speed is 558 frames/s and the required memory is 72 Kbits.
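The reported memory figure can be reproduced from the line-buffer description in Section III-D, and the throughput can be bounded from the two-pixels-per-clock rate. The sketch below is back-of-the-envelope arithmetic under stated assumptions: six line buffers of W = 24 bits by N = 512 samples, and an idealized cycle count that ignores pipeline latency and any per-line or per-level overhead. The function names are illustrative.

```python
def line_buffer_bits(n_buffers, width_bits, image_width):
    """Total line-buffer storage in bits."""
    return n_buffers * width_bits * image_width


def ideal_cycles(size, levels, pixels_per_clock=2):
    """Clock cycles for a multilevel transform of a size x size image,
    assuming every level streams at full rate with no overhead."""
    return sum((size >> level) ** 2 // pixels_per_clock for level in range(levels))


if __name__ == "__main__":
    bits = line_buffer_bits(n_buffers=6, width_bits=24, image_width=512)
    print(bits // 1024, "Kbits")            # matches the 72 Kbits reported above

    cycles = ideal_cycles(size=512, levels=3)
    print(cycles, "cycles")
    print(100e6 / cycles, "frames/s upper bound at 100 MHz")
```

The idealized bound comes out somewhat above the reported 558 frames/s. The paper does not itemize the difference; pipeline fill and control overhead are plausible contributors, but that is an inference rather than a reported result.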


V CONCLUSION

A low-power, high-speed architecture that performs the 2-D DWT/IDWT for JPEG2000 is proposed in this paper. The main advantages of the proposed architecture are as follows.
1) The time-multiplexed row processor and the line-based design minimize the on-chip storage, reduce the hardware cost, and increase the utilization.
2) Two pixels are computed per clock cycle, and the row processor operates in parallel with the column processor.
3) Optimized shift-add operations replace multiplications, periodic extension at the boundaries is implemented by an embedded circuit, and the architecture is pipelined. These measures reduce the amount of computation and the power consumption.
4) The architecture is general for lifting-based 2-D DWT/IDWT, and multilevel decomposition is supported by cascading the proposed architecture.

REFERENCES
[1] ISO/IEC JTC 1/SC 29/WG 1 N1646R, JPEG2000 Part 1 Final Committee Draft Version 1.0, 2000.
[2] D. Taubman, High performance scalable image compression with EBCOT, IEEE Transactions on Image Processing, vol.9, no.7, pp.1158-1170, 2000.
[3] C. Christopoulos, A. Skodras, The JPEG2000 still image coding system: An overview, IEEE Transactions on Consumer Electronics, vol.46, pp.1103-1127, 2000.
[4] I. Daubechies, W. Sweldens, Factoring wavelet transforms into lifting schemes, J. Fourier Anal. Appl., vol.4, pp.247-269, 1998.
[5] W. Sweldens, The lifting scheme: a new philosophy in biorthogonal wavelet constructions, Proc. SPIE, 1995.
[6] M. Vishwanath, R.M. Owens, M.J. Irwin, VLSI architecture for the discrete wavelet transform, IEEE Trans. on Circuits and Systems II, vol.42, no.5, pp.305-316, 1995.
[7] G. Dillen, B. Georis, Combined line-based architecture for the 5-3 and 9-7 wavelet transform for JPEG2000, IEEE Transactions on Circuits and Systems for Video Technology, vol.13, no.9, pp.944-950, 2003.
[8] K. Andra, C. Chakrabarti, A VLSI architecture for lifting-based forward and inverse wavelet transform, IEEE Trans. on Signal Processing, vol.50, no.4, pp.966-977, 2002.
[9] Wei-Hsin Chang, Yew-San Lee, Wen-shiaw Peng, Chen-Yi Lee, A line-based, memory efficient and programmable architecture for 2D DWT using lifting scheme, IEEE International Symposium on Circuits and Systems, 2001.
[10] K. Seth, S. Srinivasan, VLSI implementation of 2-D DWT/IDWT cores using 9/7-tap filter banks based on the non-expansive symmetric extension scheme, Proceedings of the 15th International Conference on VLSI Design, 2002.
[11] ISO/IEC 15444-2:2004(E), Information technology: JPEG 2000 image coding system: Extensions, 2004.

Xuguang Lan received the BS degree from the College of Automobile Engineering, Shandong University of Science and Technology, in 1999, and the MS degree from the College of Transportation Engineering, Kunming University of Science and Technology, in 2002. He is now a PhD student at the Institute of Artificial Intelligence and Robotics, Xi'an Jiaotong University. His research interests include image/video processing, hardware implementation of intelligent systems, and VLSI design.

Nanning Zheng (SM'93) graduated in 1975 from the Department of Electrical Engineering, Xi'an Jiaotong University, Xi'an, China, received the ME degree in information and control engineering from Xi'an Jiaotong University in 1981, and the PhD degree in electrical engineering from Keio University, Japan, in 1985. He is currently a professor and the director of the Institute of Artificial Intelligence and Robotics at Xi'an Jiaotong University. His research interests include computer vision, pattern recognition, computational intelligence, image processing, and hardware implementation of intelligent systems. He served as the general chair for the International Symposium on Information Theory and Its Applications in 2002, and as the general co-chair for the International Symposium on Nonlinear Theory and Its Applications in 2002. Since 2000, he has been the Chinese representative on the Governing Board of the International Association for Pattern Recognition. He presently serves as executive editor of Chinese Science Bulletin. He became a member of the Chinese Academy of Engineering in 1999. He is a senior member of IEEE.

Yuehu Liu received the BE and ME degrees in computer science and engineering from Xi'an Jiaotong University, China, in 1984 and 1989, respectively, and the PhD degree in electrical engineering from Keio University, Japan, in 2000. He is currently a professor and the vice director of the Institute of Artificial Intelligence and Robotics at Xi'an Jiaotong University. His research interests include computer vision, pattern recognition, computational intelligence, and image processing.
