Low-Power and High-Speed VLSI Architecture For Lifting-Based Forward and Inverse Wavelet Transform
Xuguang Lan, Nanning Zheng, Senior Member, IEEE, Yuehu Liu
This work was supported in part by the National High-Tech Program 863 of P.R. China under Grant No. 2002AA103011 and Creative Foundation of Nature Science No. 60021302. Xuguang Lan is with the Institute of Artificial Intelligence and Robotics, Xi'an Jiaotong University, Xi'an, P.R. China (e-mail: xglan@aiar.xjtu.edu.cn). Nanning Zheng is with the Institute of Artificial Intelligence and Robotics, Xi'an Jiaotong University, Xi'an, P.R. China (e-mail: nnzheng@mail.xjtu.edu.cn). Yuehu Liu is with the Institute of Artificial Intelligence and Robotics, Xi'an Jiaotong University, Xi'an, P.R. China (e-mail: liuyh@aiar.xjtu.edu.cn). Contributed Paper. Manuscript received December 3, 2004.

Abstract: A low-power, high-speed architecture that performs the two-dimensional forward and inverse discrete wavelet transform (DWT) for the set of filters in JPEG2000 is proposed, based on a line-based lifting scheme. It consists of one row processor and one column processor, each of which contains four sub-filters. The row processor, which is time-multiplexed, works in parallel with the column processor. Optimized shift-add operations are substituted for multiplications, and edge extension is implemented by an embedded circuit. The whole architecture, which is pipelined to increase speed and hardware utilization, has been demonstrated in an FPGA. Two pixels per clock cycle can be encoded at 100 MHz. The architecture can be used as a compact and independent IP core for JPEG2000 VLSI implementations and for various real-time image/video applications.

Index Terms: 2-D DWT, Parallel Architecture, JPEG2000, Lifting Scheme, VLSI.

I. INTRODUCTION

Digital image compression engines are increasingly needed to store and transmit large image and video data in multimedia applications. As a new-generation image compression standard [1]-[3], JPEG2000 supports progressive transmission by quality and by resolution, region of interest (ROI) coding, and other features. These are possible because the discrete wavelet transform (DWT) is used for the spatial decomposition, which supports the scalability of the code stream. The DWT is implemented by the lifting scheme in JPEG2000. Compared with traditional convolution, the lifting scheme [4], [5] reduces the computational complexity by about half.

The two-dimensional DWT is usually implemented directly, as in [6]: the image is first computed along the rows and then along the columns. This approach needs a large memory to store intermediate results, whereas the line-based architectures [7]-[9] minimize the memory. W. Chang [9] describes a line-based architecture for the 2-D DWT using the lifting scheme, which consists of one row filter, one column filter and on-chip memory. It requires 9N storage cells and $O(N^2)$ clock cycles (ccs) to compute an N×N image with the 9/7 wavelet. K. Andra [8] tries to generalize the lifting-based architecture with two row processors, two column processors and two memory modules, but the memory control logic of that architecture is complex, and $O(N^2)$ ccs are needed for an N×N image. G. Dillen [7] describes a combined line-based architecture for the IDWT which needs two row processors. It has to increase the resource requirement in order to compute an N×N image in $O(N^2/2)$ ccs.

This paper describes a new line-based, low-power, parallel architecture for computing the 2-D DWT/IDWT of JPEG2000 using the lifting scheme. It covers the filters (5,3) (the low-pass and the high-pass analysis filters have five taps and three taps, respectively), (9,7), (13,7), (2,2), (2,6), (2,10), (7,5), (10,18) and (6,10) [11]. Since JPEG2000 Part 1 is the core system of the standard, we focus on the configuration of its default filters, (5,3) and (9,7). The proposed architecture contains one row processor to compute along the rows and one column processor to compute along the columns. It performs the multilevel decomposition of the DWT and IDWT one level at a time, in row-column fashion. The multilevel DWT can be further optimized through memory control, both to speed it up and to reduce the memory needed for DWT coefficients. Both the row processor and the column processor consist of sub-filters, and each sub-filter is equivalent to one lifting step of the lifting scheme. The row processor is time-multiplexed. Two lines are computed at a time, and the horizontal and vertical filtering are performed in parallel. The edge extension is implemented by an embedded circuit. Shift-add operations that exploit common factors are substituted for the multiplications. The whole architecture is pipelined, its control circuit is simple, and it computes an N×N image in $O(N^2/2)$ ccs.

This paper is organized as follows. Section II presents the theoretical concepts required for our architecture and describes the lifting scheme. Section III describes the DWT and IDWT architecture, including the row processor and the column processor. Experimental results are presented in Section IV. The conclusion is given in Section V.

II. LIFTING SCHEME
The lifting scheme is called the second-generation wavelet [5] because it constructs biorthogonal wavelets without requiring the Fourier transform. The lifting scheme, which leads to a faster, fully in-place implementation of the wavelet transform, is attractive for both high-throughput and low-power applications. Every finite wavelet transform can be factorized into lifting steps [4]. The basic principle is to factorize the polyphase matrix of the wavelet or subband filters into a sequence of alternating upper and lower triangular matrices and a diagonal matrix using the Euclidean algorithm. This leads to a wavelet implementation by means of banded-matrix multiplications. Let the polyphase matrix of the wavelet be

$$P(z) = \begin{bmatrix} \tilde{h}_e(z) & \tilde{h}_o(z) \\ \tilde{g}_e(z) & \tilde{g}_o(z) \end{bmatrix}$$

where $\tilde{h}_e(z)$ and $\tilde{h}_o(z)$ denote the even part and the odd part of the low-pass analysis filter $\tilde{h}(z)$, and $\tilde{g}_e(z)$ and $\tilde{g}_o(z)$ are the even part and the odd part of the high-pass analysis filter $\tilde{g}(z)$. The polyphase matrix can be factorized by the Euclidean algorithm into

$$P(z) = \begin{bmatrix} \tilde{h}_e(z) & \tilde{h}_o(z) \\ \tilde{g}_e(z) & \tilde{g}_o(z) \end{bmatrix} = \prod_{i=1}^{m} \begin{bmatrix} 1 & s_i(z) \\ 0 & 1 \end{bmatrix} \begin{bmatrix} 1 & 0 \\ t_i(z) & 1 \end{bmatrix} \begin{bmatrix} K & 0 \\ 0 & 1/K \end{bmatrix}$$

Each factor corresponds to one step of the lifting scheme. The lifting scheme of the wavelet filter for a one-dimensional signal is shown in Fig. 1 and consists of four steps:
1) Split step, where the signal is split into even and odd samples, so that the strong correlation between adjacent pixels can be exploited by the following predict step.
2) Predict step, where the even samples are multiplied by the time-domain equivalent of $t(z)$ and added to the odd samples.
3) Update step, where the odd samples, multiplied by the time-domain equivalent of $s(z)$, are added to the even samples.
4) Scaling step, where the even samples are multiplied by 1/K and the odd samples by K.

From Fig. 1, we can see that the DWT and the IDWT are symmetrical. The IDWT is obtained simply by traversing the structure in the inverse direction, changing the factor K to 1/K and the factor 1/K to K, and reversing the signs of the coefficients in $t(z)$ and $s(z)$.

A biorthogonal wavelet, for example the Daubechies 9/7 (D9/7) wavelet filter [1], is factorized into the lifting scheme as

$$P(z) = \begin{bmatrix} 1 & \alpha(1+z^{-1}) \\ 0 & 1 \end{bmatrix} \begin{bmatrix} 1 & 0 \\ \beta(1+z) & 1 \end{bmatrix} \begin{bmatrix} 1 & \gamma(1+z^{-1}) \\ 0 & 1 \end{bmatrix} \begin{bmatrix} 1 & 0 \\ \delta(1+z) & 1 \end{bmatrix} \begin{bmatrix} K & 0 \\ 0 & 1/K \end{bmatrix}$$

and the lifting steps are as follows.

Forward DWT:
step 1: $Y(2n+1) = X_{ext}(2n+1) + \alpha\,[X_{ext}(2n) + X_{ext}(2n+2)]$, for $i_0 - 3 \le 2n+1 < i_1 + 3$
step 2: $Y(2n) = X_{ext}(2n) + \beta\,[Y(2n-1) + Y(2n+1)]$, for $i_0 - 2 \le 2n < i_1 + 2$
step 3: $Y(2n+1) = Y(2n+1) + \gamma\,[Y(2n) + Y(2n+2)]$, for $i_0 - 1 \le 2n+1 < i_1 + 1$
step 4: $Y(2n) = Y(2n) + \delta\,[Y(2n-1) + Y(2n+1)]$, for $i_0 \le 2n < i_1$
step 5: $Y(2n+1) = K\,Y(2n+1)$, for $i_0 \le 2n+1 < i_1$
step 6: $Y(2n) = Y(2n)/K$, for $i_0 \le 2n < i_1$

Inverse DWT:
step 1: $X(2n) = K\,X_{ext}(2n)$, for $i_0 - 3 \le 2n < i_1 + 3$
step 2: $X(2n+1) = X_{ext}(2n+1)/K$, for $i_0 - 2 \le 2n+1 < i_1 + 2$
step 3: $X(2n) = X(2n) - \delta\,[X(2n-1) + X(2n+1)]$, for $i_0 - 3 \le 2n < i_1 + 3$
step 4: $X(2n+1) = X(2n+1) - \gamma\,[X(2n) + X(2n+2)]$, for $i_0 - 2 \le 2n+1 < i_1 + 2$
step 5: $X(2n) = X(2n) - \beta\,[X(2n-1) + X(2n+1)]$, for $i_0 - 1 \le 2n < i_1 + 1$
step 6: $X(2n+1) = X(2n+1) - \alpha\,[X(2n) + X(2n+2)]$, for $i_0 \le 2n+1 < i_1$

where $\alpha = -1.586134342$, $\beta = -0.052980118$, $\gamma = 0.882911075$, $\delta = 0.443506852$ and $K = 1.230174105$, $X_{ext}$ denotes the extended signal, and $i_0$ and $i_1$ are the indices that delimit the first and the last sample of one row or column. The D9/7 forward and inverse DWT are thus each implemented as a six-step lifting scheme, as shown in Fig. 2.
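The six lifting steps of Section II can be sketched in software as follows. This is a minimal Python sketch and not the hardware described in this paper: the function names (`mirror`, `lift`, `forward_97`, `inverse_97`) are illustrative, the constants are the D9/7 lifting coefficients of JPEG2000, the signal is assumed to have at least two samples, and the symmetric extension is assumed to be applied by mirroring indices in every step instead of building an extended array.

```python
# D9/7 lifting coefficients (alpha, beta, gamma, delta) and scaling factor K.
ALPHA, BETA, GAMMA, DELTA, K = (-1.586134342, -0.052980118,
                                0.882911075, 0.443506852, 1.230174105)


def mirror(i, n):
    """Map an index outside 0..n-1 to its whole-sample symmetric mirror."""
    period = 2 * (n - 1)
    i = abs(i) % period
    return period - i if i >= n else i


def lift(y, coef, parity):
    """Add coef * (left + right neighbour) to every sample of one parity."""
    n = len(y)
    for i in range(parity, n, 2):
        y[i] += coef * (y[mirror(i - 1, n)] + y[mirror(i + 1, n)])


def forward_97(x):
    """Forward lifting: odd outputs are high-pass, even outputs low-pass."""
    y = [float(v) for v in x]
    lift(y, ALPHA, 1)                  # step 1: predict the odd samples
    lift(y, BETA, 0)                   # step 2: update the even samples
    lift(y, GAMMA, 1)                  # step 3
    lift(y, DELTA, 0)                  # step 4
    for i in range(len(y)):            # steps 5 and 6: scaling
        y[i] = y[i] * K if i % 2 else y[i] / K
    return y


def inverse_97(y):
    """Inverse lifting: undo the scaling, then the steps in reverse order."""
    x = [float(v) for v in y]
    for i in range(len(x)):            # steps 1 and 2
        x[i] = x[i] / K if i % 2 else x[i] * K
    lift(x, -DELTA, 0)                 # step 3
    lift(x, -GAMMA, 1)                 # step 4
    lift(x, -BETA, 0)                  # step 5
    lift(x, -ALPHA, 1)                 # step 6
    return x
```

A round trip `inverse_97(forward_97(x))` reproduces `x` up to floating-point rounding, because every step changes the samples of one parity using only the samples of the other parity and can therefore be undone by flipping the sign of its coefficient.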
Fig.1. Lifting Scheme (forward DWT: split of the input X with downsampling by 2 into Xe and Xo, t(z), s(z), scaling by K and 1/K, outputs HP and LP; inverse DWT: the reverse path with upsampling by 2 and merge)

III. VLSI ARCHITECTURE OF 2-D DWT

The architecture of the 2-D DWT is shown in Fig. 3. Two lines of image data or LL subband data are routed by the control unit (CU). After the data are processed by the row processor (RP) and the column processor (CP), the address generator produces the addresses of the CP output data, and the output data are stored in memory by the memory control. The CU routes the input data and controls the next level of decomposition according to the current level and the number of levels required. Time-multiplexing the row processor not only increases the utilization but also minimizes the storage cells. The CP outputs two pixels per clock, either LL and LH or HL and HH subband data. The memory control unit is outside the scope of this article and is not discussed. The architecture performs not only the high-pass and low-pass filters of the RP and CP in parallel, but also the horizontal and vertical decomposition in parallel.
[Fig.3: block diagram of the 2-D DWT architecture. Blocks: input image data or input LL subband data, control unit, row processor, column processor, address generator, memory control, memory. Even and odd lines pass between the blocks, and the column processor outputs LL/HL and LH/HH.]

A. Optimizing Multiplication Using Shift-Add Operations

Multipliers occupy a large amount of area and hardware resources, which makes them unsuitable for chip design, and the coefficients of the wavelet filter are constants. We therefore replace the multiplications with shift-add operations to optimize the implementation. First, the coefficients of the wavelet filters, α, β, γ, δ, K and 1/K, are quantized and written as sums of powers of two, one term for each bit with value 1 in the binary representation of their magnitude, because each 1 yields a term to be summed. Then the common factors are found. For example, δ = (0.443506852)_10 ≈ (0.0111000110001)_2, so the multiplication of δ and X is equivalent to

$$\delta X = (X \gg 2) + (X \gg 3) + (X \gg 4) + (X \gg 8) + (X \gg 9) + (X \gg 13).$$

Using the common factor $(X \gg 1) + (X \gg 2)$, this equation is simplified to

$$\delta X = (X \gg 2) + (X \gg 13) + ((X \gg 1) + (X \gg 2)) \gg 2 + ((X \gg 1) + (X \gg 2)) \gg 7.$$

The multiplication is optimized by shift and add operations in this way, and the other coefficients are processed in the same way. The architecture employs a 24-bit two's complement fixed-point data representation, with 13 bits for the fractional part and 11 bits for the integral part. Compared with floating-point data, the effect of the fixed-point representation can be ignored, as shown in Fig. 4. The two test images are goldhill and baboon, each of size 512×512×8 bits.

[Fig.4: plot with vertical axis PSNR (dB), ticks from 10 to 40]

B. Embedded Periodic Extension at the Boundaries

The finite length of the signal processed by the wavelet filter leads to an edge effect [10]. The JPEG2000 standard employs symmetric extension at the boundaries to eliminate it, see Table I and Fig. 5a. In our implementation, the extension of the signal is embedded into the row processor and the column processor. In other words, the extension samples are not actually calculated, because their computations are the same as those of the original image samples. In each lifting step, multiplexers route to the adder two input samples whose values equal those of the original samples. In Fig. 5b, for example, the forward extension samples numbered 4, 3, 2 and 1 to the left of sample 0 are not computed in step 1, and only the odd original samples are predicted. In step 2, the computed sample 1 is added to itself to update the even sample 0. The embedded scheme is implemented by a finite state machine (FSM) and multiplexers, as shown in Fig. 6, 7 and 8. The size and the power of the circuit are greatly reduced in this way. The FSM has four states: IDLE, FORWARD extension, NORMAL and LAST extension. The signals that trigger the state transitions are derived from the line-enable wire of the pixels. The forward extension signals ext_en1 and ext_en2, which enable the multiplexers in the sub-filters of the RP and CP, are generated in the FORWARD state. The last extension signals ext_en3 and ext_en4, which enable the multiplexers in the corresponding sub-filters, are produced in the LAST extension state. These signals are shown in Table II, Fig. 7 and Fig. 8.

TABLE I

                      5/3    9/7
ileft     i0 odd       1      3
          i0 even      2      4
iright    i1 even      1      3
          i1 odd       2      4

Fig.5. Periodic extension at signal boundaries: (a) symmetric extension E D C B A B C D E F E D C of the signal A B C D E F between i0 and i1, with ileft samples on the left and iright samples on the right; (b) embedded scheme of periodic extension (only the samples in the rectangle are computed)
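The shift-add substitution of Section III-A can be checked numerically. The following Python sketch is an illustration only: the helper names (`to_fixed`, `delta_times_direct`, `delta_times_shared`) are hypothetical, and the samples are assumed to be integers holding a fixed-point value with 13 fractional bits, as in the data format stated in Section III-A.

```python
FRAC_BITS = 13            # fractional bits of the fixed-point format
DELTA = 0.443506852       # lifting coefficient used as the example


def to_fixed(value):
    """Convert a real number to an integer with FRAC_BITS fractional bits."""
    return int(round(value * (1 << FRAC_BITS)))


def delta_times_direct(x):
    """delta * x as one shifted term per 1 bit of 0.0111000110001 (binary)."""
    return (x >> 2) + (x >> 3) + (x >> 4) + (x >> 8) + (x >> 9) + (x >> 13)


def delta_times_shared(x):
    """Same product, with the common factor (x >> 1) + (x >> 2) formed once."""
    common = (x >> 1) + (x >> 2)
    return (x >> 2) + (x >> 13) + (common >> 2) + (common >> 7)


if __name__ == "__main__":
    x = to_fixed(100.0)
    approx = delta_times_shared(x) / (1 << FRAC_BITS)
    print(approx, DELTA * 100.0)   # the two values agree to about 0.003
```

The shared form needs four additions plus one for the common factor, which is the same count as the direct form for this single coefficient; the saving grows when the common factor is reused, and the two forms can differ by at most two least significant bits because the truncation happens at different points.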
TABLE II
CONTROL SIGNAL OF THE EXTENSION OF FORWARD AND BACK END

State        ext_en1   ext_en2   ext_en3   ext_en4
First_ext       1         1         0         0
normal          0         0         0         0
Last_ext        0         0         1         1

[Fig.6: state diagram of the extension FSM with the states Idle, FORWARD, NORMAL and LAST and the transition signals reset, Initial_en, Normal_en and Last_en]

C. The Architecture of the Row Processor

The row processor consists of four sub-filters α, β, γ and δ, as shown in Fig. 7. The sub-filters γ and δ are similar to the sub-filters α and β, respectively. All the hardware resources of the row processor can be time-multiplexed, see Fig. 7. Two lines are calculated at a time, and the pixels of each line are not divided into even and odd samples when they are input continuously into the row processor. That is, two pixels are encoded per clock. This reduces the storage cells and increases the speed of the row processor. Compared with a row processor whose input lines are partitioned into even and odd samples (which needs two parallel RPs), the proposed row processor is time-multiplexed at the cost of several registers. Therefore, the hardware resources are greatly reduced and the utilization is high.

Two lines of pixels are first input into sub-filter α of the row processor, and the even samples of each line are summed by means of the delay flip-flops (DFFs) and multiplexers. After the sums are multiplied by the coefficient of the sub-filter (a multiplication that is optimized into shift-add operations), the products are added to the odd samples of each line. Compared with the first input line (even line), the computation of the second input line (odd line) is delayed by one clock. Even-line pixels are computed in odd clocks, and odd-line pixels in even clocks. Pipelined computation is achieved via the registers. The two lines of output samples of sub-filter α are routed into sub-filter β by the multiplexers, where the odd samples are summed using the DFFs. The sums are multiplied by the coefficient β and added to the even samples. The same optimizations as in sub-filter α are used. The computations of sub-filters γ and δ are similar to those of sub-filters α and β, respectively. The even and odd samples of each output line of sub-filter δ are multiplied by 1/K and K, respectively. Filtering of the image along the line is finished in this way.

Time-multiplexing of the row processor is implemented by computing the even lines at odd clocks and the odd lines at even clocks. The control signal sel_en of the router, which triggers the multiplexers to route data to the adders, is generated by a counter. The row processor is pipelined, and the samples are encoded continuously as they are input. Hardware utilization reaches approximately 100%, and the control logic is simple.

The schedule of sub-filter α is shown in Table III. Xi,j denotes the sample of line i and column j; X0,2, for example, is the third sample of the first line. SAα-i denotes the product of the wavelet filter coefficient and the sum formed at the previous clock, where i is the line number: 0 denotes the even line of the two input lines and 1 the odd one. For instance, at clock 3, X0,2 is added to X0,0 and the sum is stored in a register. At clock 4, X1,2 is added to X1,0 and, at the same time, the product SAα-0 = α(X0,2 + X0,0) is computed and stored in registers. The product SAα-1 = α(X1,2 + X1,0) is computed at clock 5.

From Fig. 7, only eight adders and six shift-add operations are needed to implement the row filter of D9/7. The shift registers do not increase the hardware consumption, because they can be implemented by combinational logic of wires and AND operations. Therefore, the proposed architecture of the row processor greatly reduces the hardware consumption and complexity.

Fig.7. [row processor: even line and odd line inputs, control signals sel_en and ext_en1 to ext_en4, outputs scaled by 1/K and K; legend: multiplexer, register, adder, shift-add]

TABLE III
SCHEDULE OF THE SUB-FILTER α IN RP

Clock   Adder         Shift-Add   Adder
1
2
3       X0,2+X0,0
4       X1,2+X1,0     SAα-0
5       X0,4+X0,2     SAα-1       SAα-0+X0,1
6       X1,4+X1,2     SAα-0       SAα-1+X1,1
7       X0,6+X0,4     SAα-1       SAα-0+X0,3
8       X1,6+X1,4     SAα-0       SAα-1+X1,3
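The time-multiplexed schedule of the first sub-filter in the row processor can be reproduced with a few lines of code. This is a sketch under stated assumptions and not the paper's control logic: the function name `rp_schedule` and the dictionary keys are hypothetical, the labels follow the form `Xline,column` and `SA-line` used in the schedule, and the pipeline is assumed to have one clock between the first adder, the shift-add unit and the second adder, as described in the text.

```python
def rp_schedule(num_clocks):
    """List, for each clock, the operation issued by each unit of the sub-filter.

    Odd clocks serve line 0 (the even input line) and even clocks serve
    line 1 (the odd input line), which arrives one clock later.
    """
    rows = []
    for clock in range(1, num_clocks + 1):
        line = (clock + 1) % 2
        row = {"clock": clock, "adder1": None, "shift_add": None, "adder2": None}
        if clock >= 3:
            # First adder: sum of the two neighbouring even samples.
            j = clock - 1 - line
            row["adder1"] = f"X{line},{j}+X{line},{j - 2}"
        if clock >= 4:
            # Shift-add unit: multiplies the sum formed one clock earlier,
            # which belongs to the other line.
            row["shift_add"] = f"SA-{1 - line}"
        if clock >= 5:
            # Second adder: adds the product to the odd sample in between.
            row["adder2"] = f"SA-{line}+X{line},{clock - 4 - line}"
        rows.append(row)
    return rows


if __name__ == "__main__":
    for row in rp_schedule(8):
        print(row)
```

Every unit is busy on every clock once the pipeline is filled, which illustrates why the utilization of the time-multiplexed row processor approaches 100%.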
D. The Architecture of the Column Processor

The column processor performs the wavelet transform along the columns, and the pixels it processes come from the row processor. This is what distinguishes the column processor from the row processor. Line buffers, which hold the samples required by the column processor, replace the registers of the row processor, as shown in Fig. 8. The two output lines of the row processor are the even line and the odd line for the column processor. In other words, the pixels are naturally separated into even and odd samples along the column. This reduces the storage cells and simplifies the design of the column processor. Six line buffers are required for D9/7 and three for 5/3. Each line buffer works as a FIFO (first in, first out) of size W×N, where W is the width of the data path and N is the image width; the size does not depend on the image height.

The column processor begins to calculate after the first two lines have been computed by the row processor. First, the two lines are processed in sub-filter α. Then the outputs of sub-filter α, an even line and an odd line, are computed in sub-filters β, γ and δ in turn. The computations of sub-filters γ and δ are similar to those of sub-filters α and β, respectively. The even and odd samples of each output line of sub-filter δ are multiplied by 1/K and K, respectively. Filtering along the column is finished in this way. We also pipeline the column processor to increase the speed. The schedule of sub-filter α is shown in Table IV. At clock 2, for example, X2,0 is added to X0,0 and the sum is stored in registers. At clock 3, the sum is read from the registers and the product SA1 = α(X2,0 + X0,0) is calculated. At clock 4, SA1 is added to X1,0 and the sum is stored in registers. The schedule of sub-filters β, γ and δ is similar to that of sub-filter α. Two pixels, either LL and LH or HL and HH subband pixels, are output per clock by the column processor. As Fig. 8 shows, only eight adders and six shift-add operations are needed, and the column processor runs in parallel with the row processor, delayed by only one line of clock cycles.

[Fig.8: column processor with even line and odd line inputs, line buffers, registers, adders and shift-add units, extension signals ext_en3 and ext_en4, and outputs scaled by 1/K and K]
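The line-based idea behind the column processor, namely that vertical filtering needs only a few buffered lines and not the whole image, can be sketched for a single vertical predict step. This Python sketch is an illustration and not the circuit of Fig. 8: the function name `column_predict` is hypothetical, the rows are assumed to arrive as (even line, odd line) pairs as delivered by the row processor, and the sketch keeps two buffered lines for this one step, which is a simplification and not the buffer allocation of the paper.

```python
def column_predict(row_pairs, coef):
    """Apply one vertical predict step to a stream of (even, odd) line pairs.

    Each odd line gets coef * (line above + line below) added. Only the
    previous pair is buffered, so the memory depends on the image width
    and not on the image height. At the bottom edge the missing line
    below is replaced by its mirror image, the line above.
    """
    out = []
    prev_even = prev_odd = None            # the buffered lines
    for even, odd in row_pairs:
        if prev_odd is not None:           # the line below is now available
            out.append(prev_even)
            out.append([o + coef * (a + b)
                        for o, a, b in zip(prev_odd, prev_even, even)])
        prev_even, prev_odd = even, odd
    if prev_odd is not None:               # bottom edge: symmetric extension
        out.append(prev_even)
        out.append([o + coef * 2 * a for o, a in zip(prev_odd, prev_even)])
    return out


if __name__ == "__main__":
    image = [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]]
    pairs = [(image[0], image[1]), (image[2], image[3])]
    for line in column_predict(pairs, -0.5):
        print(line)
```

With the coefficient -0.5, the first odd line of this linear ramp is predicted exactly and becomes all zeros, while the last odd line shows the effect of the mirrored boundary.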
E. Modularity

Two lifting steps are required for the (5,3) wavelet [1], as follows:

Step 1: $Y(2n+1) = X_{ext}(2n+1) - \left\lfloor \dfrac{X_{ext}(2n) + X_{ext}(2n+2)}{2} \right\rfloor$

Step 2: $Y(2n) = X_{ext}(2n) + \left\lfloor \dfrac{Y(2n-1) + Y(2n+1) + 2}{4} \right\rfloor$

Therefore, to perform this transform we need only the sub-filters α and β of the D9/7 row and column processors, with the coefficients of the D9/7 wavelet replaced by those of the (5,3) wavelet. Because the coefficients of the (5,3) wavelet are $\alpha = -\frac{1}{2} = -2^{-1}$ and $\beta = \frac{1}{4} = 2^{-2}$, the multiplication is reduced to shift operations, and the three-step algorithm (add, multiply, add) of each sub-filter is shown in Fig. 9.

Fig.9. The Row Processor (a and b) and Column Processor (c and d) of (5,3) Lifting Scheme

F. The Architecture of IDWT

The architectures of the IDWT and the DWT are symmetric because of the symmetry of the lifting scheme for the forward and inverse DWT. The IDWT first performs the transform along the columns and then along the lines, see Fig. 10. The subband data are input to the control unit, which outputs two columns of data. After the data are computed by the column processor and the row processor, the synthesized image or LL data are stored in the memory via the memory control. The multilevel transform is implemented by the control unit. The filtering order of the IDWT in the row and column processors is sub-filters δ, γ, β and α in turn, as shown in Fig. 11 and 12. The architectures of the sub-filters are the same as those of the corresponding sub-filters of the forward DWT. Therefore, the IDWT can reuse the hardware resources of the forward DWT. The proposed architecture is universal for the 2-D DWT based on the lifting scheme.

[Fig.10: block diagram of the IDWT architecture. Blocks: input LL, HL, LH, HH subband data, control unit, column processor, row processor, address generator, memory control, memory. Even and odd columns pass between the blocks, and the output is LL or image data.]

[Fig.11 and Fig.12: sub-filters (a) to (d) of the IDWT processors, built from registers, line buffers, multiplexers, adders and shift-add units, with even column and odd column inputs scaled by K and 1/K]

IV. IMPLEMENTATION
We have developed an RTL (register transfer level) model of our architecture, which performs the forward and inverse DWT of D9/7 and 5/3 in an FPGA. Two pixels per clock cycle can be encoded at 100 MHz. For (5,3) and (9,7) together, 25% of the total area of the main chip, which has about 25000 logic elements, is needed for the multilevel wavelet decomposition. The decomposition is controlled by the FSM, one level at a time, and the data path is 24 bits wide. Only 5% of the total area is needed to implement the (5,3) wavelet alone. The memory requirement depends on the width of the image to be processed and on the width of the data path. For a 512×512×8-bit image and three levels of decomposition using D9/7, the speed is 558 frames/s and the required memory is 72 Kbits.
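The reported figures can be cross-checked with simple arithmetic. The Python sketch below is a back-of-the-envelope estimate and not a model of the implementation: the function names are hypothetical, the memory is assumed to consist only of the six W×N line buffers of the D9/7 column processor, and the frame rate is an idealized upper bound that assumes exactly two pixels per clock at every level with no pipeline fill or control overhead.

```python
def line_buffer_bits(num_buffers, data_width, image_width):
    """Total size in bits of num_buffers FIFOs of data_width x image_width."""
    return num_buffers * data_width * image_width


def ideal_frame_rate(image_size, levels, clock_hz, pixels_per_clock=2):
    """Upper bound on frames/s for a square image, one level at a time.

    Level 0 processes the full image and each further level processes
    the LL subband, which has half the width and half the height.
    """
    clocks = sum((image_size >> level) ** 2 // pixels_per_clock
                 for level in range(levels))
    return clock_hz / clocks


if __name__ == "__main__":
    bits = line_buffer_bits(num_buffers=6, data_width=24, image_width=512)
    print(bits, bits / 1024)                      # 73728 bits = 72.0 Kbits
    print(ideal_frame_rate(512, 3, 100_000_000))  # about 581 frames/s
```

Six 24-bit line buffers for a 512-pixel line give exactly 72 Kbits, which matches the reported memory. The idealized rate of about 581 frames/s lies slightly above the reported 558 frames/s; the difference is presumably due to overhead that this estimate ignores.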
V. CONCLUSION

A low-power and high-speed architecture which performs the 2-D DWT/IDWT for JPEG2000 is proposed in this paper. The main advantages of the proposed architecture are as follows.
1) The time-multiplexed row processor and the line-based design minimize the on-chip storage cells, reduce the hardware consumption and increase the utilization.
2) Two pixels are computed per clock, and the row processor works in parallel with the column processor.
3) Optimized shift-add operations are substituted for multiplications, periodic extension at the boundaries is implemented by an embedded circuit, and the architecture is pipelined. These measures reduce the amount of computation and the power.
4) The architecture is universal for the 2-D DWT/IDWT based on the lifting scheme, and multilevel decomposition is possible by cascading the proposed architecture.

REFERENCES
[1] ISO/IEC JTC 1/SC 29/WG 1 N1646R, "JPEG2000 part 1 final committee draft version 1.0," 2000.
[2] D. Taubman, "High performance scalable image compression with EBCOT," IEEE Transactions on Image Processing, vol. 9, no. 7, pp. 1158-1170, 2000.
[3] C. Christopoulos and A. Skodras, "The JPEG2000 still image coding system: An overview," IEEE Transactions on Consumer Electronics, vol. 46, pp. 1103-1127, 2000.
[4] I. Daubechies and W. Sweldens, "Factoring wavelet transforms into lifting schemes," J. Fourier Anal. Appl., vol. 4, pp. 247-269, 1998.
[5] W. Sweldens, "The lifting scheme: a new philosophy in biorthogonal wavelet constructions," Proc. SPIE, 1995.
[6] M. Vishwanath, R. M. Owens and M. J. Irwin, "VLSI architecture for the discrete wavelet transform," IEEE Trans. on Circuits and Systems II, vol. 42, no. 5, pp. 305-316, 1995.
[7] G. Dillen and B. Georis, "Combined line-based architecture for the 5-3 and 9-7 wavelet transform for JPEG2000," IEEE Transactions on Circuits and Systems for Video Technology, vol. 13, no. 9, pp. 944-950, 2003.
[8] K. Andra and C. Chakrabarti, "A VLSI architecture for lifting-based forward and inverse wavelet transform," IEEE Trans. on Signal Processing, vol. 50, no. 4, pp. 966-977, 2002.
[9] Wei-Hsin Chang, Yew-San Lee, Wen-Shiaw Peng and Chen-Yi Lee, "A line-based, memory efficient and programmable architecture for 2D DWT using lifting scheme," IEEE International Symposium on Circuits and Systems, 2001.
[10] K. Seth and S. Srinivasan, "VLSI implementation of 2-D DWT/IDWT cores using 9/7-tap filter banks based on the non-expansive symmetric extension scheme," Proceedings of the 15th International Conference on VLSI Design, 2002.
[11] ISO/IEC 15444-2:2004(E), "Information technology - JPEG 2000 image coding system: Extensions," 2004.
Xuguang Lan received the BS degree from the College of Automobile Engineering, Shandong University of Science and Technology, in 1999, and the MS degree from the College of Transportation Engineering, Kunming University of Science and Technology, in 2002. He is now a PhD student at the Institute of Artificial Intelligence and Robotics, Xi'an Jiaotong University. His research interests include image/video processing, hardware implementation of intelligent systems, and VLSI design.

Nanning Zheng (SM'93) graduated in 1975 from the Department of Electrical Engineering, Xi'an Jiaotong University, Xi'an, China, received the ME degree in information and control engineering from Xi'an Jiaotong University in 1981, and the PhD degree in electrical engineering from Keio University, Japan, in 1985. He is currently a professor and the director of the Institute of Artificial Intelligence and Robotics at Xi'an Jiaotong University. His research interests include computer vision, pattern recognition, computational intelligence, image processing, and hardware implementation of intelligent systems. He served as the general chair for the International Symposium on Information Theory and Its Applications in 2002, and as the general co-chair for the International Symposium on Nonlinear Theory and Its Applications in 2002. Since 2000, he has been the Chinese representative on the Governing Board of the International Association for Pattern Recognition. He presently serves as executive editor of Chinese Science Bulletin. He became a member of the Chinese Academy of Engineering in 1999. He is a senior member of IEEE.

Yuehu Liu received the BE and ME degrees in computer science and engineering from Xi'an Jiaotong University, China, in 1984 and 1989, respectively, and the PhD degree in electrical engineering from Keio University, Japan, in 2000. He is currently a professor and the vice director of the Institute of Artificial Intelligence and Robotics at Xi'an Jiaotong University. His research interests include computer vision, pattern recognition, computational intelligence, and image processing.