A Multiplierless 2-D Convolver Chip For Real-Time Image Processing

Journal of VLSI Signal Processing 38, 63–71, 2004. © 2004 Kluwer Academic Publishers. Manufactured in The Netherlands.

MYUNG H. SUNWOO AND SEONG K. OH
School of Electrical and Computer Engineering, Ajou University, San 5, Wonchun-Dong, Yeongtong-Ku, Suwon, 442-749, Korea

Received May 16, 2002; Revised July 9, 2003; Accepted September 9, 2003

Abstract. This paper proposes a new real-time 2-D convolver chip with no multiplier. Commercial 2-D convolver chips contain many multipliers, and existing multiplierless architectures contain many shift-and-accumulators, in order to meet the real-time image processing requirement, i.e., the CCIR601 standard. The proposed architecture uses only one shift-and-accumulator, yet it still meets the real-time requirement. Furthermore, because it controls the input data sequence, the proposed chip does not require row buffers to store two adjacent rows, as commercial chips do, which further reduces the gate count. The proposed architecture reduces the gate count by more than 70% and 90% compared to HSP48901 and HSP48908, respectively, and the gate count of the computation block itself by more than 70% compared to existing multiplierless architectures. We have implemented the chip using the Samsung™ 0.8 µm SOG cell library (KG60K). The implemented filter chip consists of only 3,893 gates, operates at 125 MHz and meets the real-time image processing requirement. The proposed architecture is especially suitable for larger size convolutions because of its small gate count.

Keywords: convolution, multiplier, VLSI architecture, image processing, VLSI design, digital filter

1. Introduction

In general, 2-D convolution filters are widely used in image processing applications, such as guidance, surveillance, inspection, machine vision systems, etc. [1–7]. Real-time image processing requires hundreds of millions of multiplications and additions per second, but conventional DSP chips and microprocessors may not be capable of such massive computations and are relatively expensive to use just for a convolution. Hence, commercial image filter chips having many multipliers [8] are used as coprocessors to meet the real-time requirement, that is, 720 × 480 pixels per frame and 30 frames per second (10.4 Mpixels/second), which is the standard of CCIR601 (MPEG-2) [9, 10].

Filter chips having multipliers, such as HSP48901 and HSP48908 [8], use nine 8-b × 8-b multipliers to perform a 3 × 3 convolution. These architectures require a large area for the nine multipliers. To reduce the overall size of the multipliers, a filter architecture in which the values of each coefficient are restricted to −4, −2, −1, 0, 1, 2 and 4 has been proposed [4]. However, the application of this filter is mainly limited to edge detection because of its restricted coefficients.

Since multipliers occupy a large area [8, 11, 12], a filter chip without any multipliers can be significantly smaller. Several multiplierless convolution filter architectures [13–19] have been proposed. For a 3 × 3 convolution, multiplierless architectures replace nine multipliers with nine shift-and-accumulation blocks. Although these architectures reduce the area compared with architectures having multipliers, they require shift-and-add operations in all taps and many adders to sum the results of the shift-and-accumulation blocks. Therefore, they still require a relatively large area because they have a shift-and-accumulation block in each of the nine taps. Because of their large areas, the above-mentioned architectures may not be suitable for the larger size convolutions that are common in pattern recognition and machine vision applications [20, 21].

To reduce area, this paper proposes a new multiplierless architecture for a 3 × 3 2-D convolution filter chip using only one shift-and-accumulation block [22, 23]. Whereas existing architectures either with multipliers [4, 8] or without multipliers [13–19] use eight 16-b adders for the summation of nine products from nine taps, the proposed architecture uses an adder tree of half the width, i.e., eight 8-b adders. Furthermore, the existing architecture [4] and HSP48908 [8] use row buffers to store the previous and next rows, whereas the proposed architecture eliminates the need for row buffers by controlling the input data sequence.

We have simulated the proposed architecture using VHDL (VHSIC Hardware Description Language) models and have performed logic synthesis using the Samsung™ 0.8 µm SOG (sea-of-gate) cell library (KG60K). We have implemented a filter chip whose function and timing have been fully verified by simulation and which provides a maximum data rate of 15.625 Mpixels/second, which meets the real-time image processing requirement. The proposed filter chip, consisting of only 3,893 gates, reduces the gate count by more than 70% and 90% compared to HSP48901 and HSP48908 [8], respectively, and the size of the computation block itself by more than 70% compared to existing multiplierless architectures [13–19].

The remainder of the paper is organized as follows. Section 2 introduces the modified algorithm used for the computation unit in the proposed filter and compares it with existing algorithms. Section 3 presents architectural details of the filter. Section 4 describes the implementation of the filter chip, evaluates its performance and compares it with other filter chips. Finally, Section 5 contains concluding remarks.
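The real-time figures quoted above can be checked with a few lines of arithmetic. The sketch below is illustrative only: the frame format, clock rate and data rate are taken from the text, while the value of eight clock cycles per output pixel is an assumption inferred from the 8-b coefficients and the eight accumulations per sample described in Section 2.

```python
# Pixel-rate budget for the real-time requirement quoted in the Introduction.
WIDTH, HEIGHT, FPS = 720, 480, 30      # CCIR601 frame format from the text
CLOCK_HZ = 125_000_000                 # reported clock rate of the chip
CYCLES_PER_PIXEL = 8                   # assumption: one output per 8 clock cycles

required_rate = WIDTH * HEIGHT * FPS           # pixels per second needed
chip_rate = CLOCK_HZ / CYCLES_PER_PIXEL        # pixels per second delivered
headroom = chip_rate / required_rate

print(f"required: {required_rate / 1e6:.3f} Mpixels/s")
print(f"chip:     {chip_rate / 1e6:.3f} Mpixels/s")
print(f"headroom: {headroom:.2f}x")
```

Under that assumption the chip rate works out to the 15.625 Mpixels/second stated in the text, which exceeds the roughly 10.4 Mpixels/second requirement.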
2. The Modified 2-D Convolution Algorithm

This section describes the modified algebraic algorithm used in the design of the proposed filter chip. Assume that the mask size of a convolution filter is N × M, the coefficients are H(x, y), the input sequence is F(x, y) and the output sequence is G(x, y). Then the relationship between input and output sequences is represented as

\[ G(x, y) = \sum_{n=0}^{N-1} \sum_{m=0}^{M-1} H(n, m)\, F(x - n, y - m) \tag{1} \]

where x and y indicate a pixel position in the image data. If the mask size is 3 × 3, then we can specify (1) as (2)

\[ G(x, y) = \sum_{n=0}^{2} \sum_{m=0}^{2} H(n, m)\, F(x - n, y - m). \tag{2} \]

Figure 1 shows the direct implementation of (2), which requires nine 8-b × 8-b multipliers and a 16-b tree adder (eight 16-b adders). The architectures based on (2) include HSP48901 and HSP48908 [8]. The area of an 8-b × 8-b multiplier is about six times larger than that of an 8-b adder [24]. Since nine multipliers require a large VLSI area, the algorithm and its architecture have been modified to reduce the area, as described in [13–19].

Figure 1. The filter architecture using multipliers based on (2).

Since coefficients are, in general, 8-b binary numbers, we can convert (2) into (3)

\[ G(x, y) = \sum_{n=0}^{2} \sum_{m=0}^{2} \left\{ \sum_{k=0}^{7} F(x - n, y - m)\, h_k(n, m)\, 2^k \right\}. \tag{3} \]

In the above equation, the binary coefficients are $H(n, m) = \sum_{k=0}^{7} h_k(n, m)\, 2^k$, where $h_k(n, m)$ is 1-b data, i.e., either 0 or 1, and k is the weight of each partial product. Hence, the summation over k can be computed by shift-and-accumulation operations instead of multiplications. Several multiplierless architectures based on (3) have been proposed [13–19].
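As a quick sanity check of the step from (2) to (3), the following minimal sketch evaluates both forms on one 3 × 3 window. It assumes unsigned 8-b data and coefficients; the function names and the sample values are illustrative and do not come from the paper.

```python
# Sketch: check that the bit-serial form (3) matches the direct form (2)
# for one 3x3 window. All names here are illustrative, not from the paper.

def direct_form(window, coeffs):
    """Equation (2): nine full multiplications followed by a sum."""
    return sum(coeffs[n][m] * window[n][m] for n in range(3) for m in range(3))

def bit_serial_form(window, coeffs, bits=8):
    """Equation (3): each tap accumulates its own weighted partial products."""
    total = 0
    for n in range(3):
        for m in range(3):
            acc = 0  # one shift-and-accumulator per tap
            for k in range(bits):
                h_k = (coeffs[n][m] >> k) & 1          # k-th coefficient bit, 0 or 1
                partial = window[n][m] if h_k else 0   # 8-b data ANDed with a 1-b coefficient
                acc += partial << k                    # weight the partial product by 2^k
            total += acc                               # tree adder over the nine taps
    return total

# window[n][m] stands for F(x - n, y - m), already aligned with the mask.
window = [[12, 200, 7], [255, 0, 91], [33, 64, 128]]
coeffs = [[1, 2, 1], [2, 4, 2], [1, 2, 1]]   # unsigned 8-b example coefficients

print(direct_form(window, coeffs), bit_serial_form(window, coeffs))
```

The inner loop over k is the part that a hardware shift-and-accumulator performs, one bit per clock cycle.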
Figure 2. The filter architecture using nine SAs based on (3).

Figure 3. The proposed filter architecture using only one SA based on (4).

Figure 2 shows the direct implementation of (3), which requires nine 8-b × 1-b two-input AND gate blocks, nine 16-b shift-and-accumulators (SAs) and a 16-b tree adder (eight 16-b adders). Each 8-b × 1-b AND gate block is composed of eight two-input AND gates, and thus 72 two-input AND gates are required. The convolution operation is as follows. First, the AND gates make nine partial products of F (data) and H (coefficients). Second, the nine SAs accumulate each 8-b partial product to perform the summations in the brace of (3), and the result of each SA is 16-b. Third, the 16-b tree adder sums the nine 16-b SA results. Instead of a multiplier as in Fig. 1, one SA consisting of an accumulator and a 16-b adder makes one product of F and H in eight cycles. Hence, the architecture based on (3) requires a much smaller area than the architecture based on (2). To reduce the area further, we modify (3) to (4) by using the distributive law

\[ G(x, y) = \sum_{k=0}^{7} \left\{ \sum_{n=0}^{2} \sum_{m=0}^{2} F(x - n, y - m)\, h_k(n, m) \right\} 2^k. \tag{4} \]

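Equation (4) can be evaluated with a single accumulator by processing the coefficient bits from MSB to LSB and shifting the running result left by one bit before each addition, which is Horner's rule applied to the outer sum over k. The sketch below is a minimal illustration under that reading; the function name and test values are hypothetical, and because Python integers do not overflow, the width limits of the 8-b tree adder and the 16-b accumulator are not modelled.

```python
# Sketch of equation (4): sum the nine same-weight partial products first,
# then shift-and-accumulate once per coefficient bit, MSB first.

def single_sa_convolve(window, coeffs, bits=8):
    acc = 0
    for k in reversed(range(bits)):              # k = 7 (MSB) down to 0 (LSB)
        tree_sum = 0
        for n in range(3):
            for m in range(3):
                h_k = (coeffs[n][m] >> k) & 1    # bit k of H(n, m)
                tree_sum += window[n][m] if h_k else 0
        acc = (acc << 1) + tree_sum              # 1-b left shift, then accumulate
    return acc

window = [[12, 200, 7], [255, 0, 91], [33, 64, 128]]   # F(x - n, y - m) values
coeffs = [[1, 2, 1], [2, 4, 2], [1, 2, 1]]             # unsigned 8-b coefficients

direct = sum(coeffs[n][m] * window[n][m] for n in range(3) for m in range(3))
result = single_sa_convolve(window, coeffs)
print(direct, result)
```

Only one accumulator variable is needed here, compared with one per tap when (3) is evaluated directly.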
Note that the result inside the brace in (3) is 16-b, but the result inside the brace in (4) is 8-b. Therefore, (4) requires eight 8-b adders instead of the eight 16-b adders in (3). Figure 3 shows the proposed filter architecture based on (4), which consists of 72 two-input AND gates, an 8-b adder tree (eight 8-b adders) and only one SA. Since all the partial products of the nine AND gate blocks have the same weight, the 8-b adder tree simply adds all the partial products. Then the result of the 8-b tree is added directly to the k-b left-shifted value in the SA. To prevent overflow, a wider adder should be used instead of the 8-b adder tree, or the image data should be scaled down before addition.

The 3 × 3 convolution operation based on (4) is as follows. First, the AND gates make nine partial products. Second, the tree adder consisting of eight 8-b adders sums the nine partial products in the brace of (4). Third, the SA accumulates the result of the tree adder sequentially with the 1-b left-shifted previous result. The coefficient bits are shifted to the left so that the partial products are generated sequentially from the MSB (most significant bit) to the LSB (least significant bit). Hence, eight accumulations make one filtered output sample, which is the same number of accumulations as in Fig. 2. The proposed architecture in Fig. 3 thus uses one 16-b SA instead of the nine 16-b SAs in Fig. 2. In addition, the 8-b tree adder in Fig. 3, instead of the 16-b tree adder in Fig. 2, sums the nine partial products before performing shift-and-accumulation (in terms of k). Therefore, this architecture requires only 72 two-input AND gates, one 8-b tree adder and one SA.

Table 1 shows comparisons among computation unit architectures based on (2), (3) and (4). The gate counts are taken from the Samsung™ 0.8 µm SOG cell library (KG60K) data book [24]. The 8-b × 8-b multiplier has 525 gates, the two-input AND gate has 2 gates, and the 16-b SA has 320 gates. The 16-b SA is composed of an 8-b full adder, an 8-b half adder, and two 16-b registers. The 16-b adder has 186 gates and the 8-b adder has 88 gates. As shown in Table 1, the proposed computation unit reduces the gate count by approximately 80% compared to the computation unit based on (2) [8] and by approximately 70% compared to computation units based on (3) [13–19].

Table 1. Comparisons among computation units based on (2), (3) and the proposed unit of (4).

| Modules | Based on (2) [8] | Based on (3) [13–19] | The proposed unit (4) |
|---|---|---|---|
| Multipliers | Nine 8-b × 8-b multipliers | - | - |
| AND gates | - | 72 two-input AND gates | 72 two-input AND gates |
| Tree adder | Eight 16-b adders | Eight 16-b adders | Eight 8-b adders |
| Shift-and-accumulators | - | Nine 16-b SAs | One 16-b SA |
| Total gate count | 6,213 | 4,512 | 1,168 |
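The totals in Table 1 follow directly from the per-cell gate counts quoted in the text. The short script below recomputes them; the dictionary keys and the helper function are illustrative names, while the numeric cell counts are those given in the text.

```python
# Recompute the Table 1 totals from the per-cell gate counts quoted in the text.
GATES = {"mult_8x8": 525, "and2": 2, "sa_16": 320, "adder_16": 186, "adder_8": 88}

def total(parts):
    """Sum gate counts for a {cell name: instance count} description."""
    return sum(GATES[cell] * count for cell, count in parts.items())

based_on_2 = total({"mult_8x8": 9, "adder_16": 8})
based_on_3 = total({"and2": 72, "adder_16": 8, "sa_16": 9})
proposed = total({"and2": 72, "adder_8": 8, "sa_16": 1})

for name, gates in [("(2)", based_on_2), ("(3)", based_on_3), ("(4)", proposed)]:
    saving = 100 * (1 - proposed / gates)
    print(f"based on {name}: {gates:>5} gates, proposed unit saves {saving:.0f}%")
```

The computed savings are consistent with the "approximately 80%" and "approximately 70%" figures stated above.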
3. The Architecture of the Proposed Filter Chip

We have designed the proposed chip to be reconfigurable and expandable so that it can be used both as a 2-dimensional filter having an N × M mask and as a 1-dimensional filter having NM taps. Hence, the architecture concept can be used for a filter, a convolver, a correlator, etc. Figure 4 shows the block diagram of the proposed 3 × 3 filter, which consists of nine register units (RUs), a computation unit (CU) and an input control unit (ICU). Each RU makes a partial product of the pixel data and one bit of a coefficient in a clock cycle. Hence, nine RUs compute nine partial products in a cycle. Each RU makes eight partial products of the data and all bits of the coefficient in eight clock cycles. Therefore, the RUs make 9 partial products per clock cycle and 72 partial products in eight cycles. The 8-b tree adder in the CU performs a summation of nine partial products in a cycle, and the CU requires eight clock cycles to compute eight shift-and-accumulations. More details of each unit are as follows.

Figure 4. The block diagram of the proposed filter.

Figure 5 shows the RU structure, which consists of an 8-b data register, an 8-b coefficient register, and eight two-input AND gates. The value in the data register is shifted to the next RU after eight clock cycles, and the coefficient bits are rotated left every cycle. The AND gates perform the multiplication of the 8-b data and a 1-b coefficient. When the MSB of the coefficient register is 1, the output of the AND gates is the pixel data itself. Otherwise, the output of the AND gates is all zeros, that is, the partial product is zero. Hence, each RU sequentially makes eight 8-b partial products of a pixel and a coefficient in eight clock cycles without using multipliers.

Figure 5. The RU structure.

The CU structure shown in Fig. 6 consists of an 8-b tree adder and one SA, which is composed of a 16 + 8-b adder, an accumulator and an output register. The 8-b tree adder, consisting of eight group CLAs (carry lookahead adders), is pipelined and sums nine partial products from nine RUs in a cycle. We used CLAs because the implementation of the 8-b tree adder is relatively easy: it can be implemented by simply repeating one CLA eight times. The gate count can be reduced by about 25% if a carry save adder tree is used for the 8-b tree adder. The 16 + 8-b adder, consisting of an 8-b full adder and an 8-b half adder, adds the 8-b value from the tree adder to the 16-b accumulator value shifted left by 1 b to adjust the weight. In other words, the SA performs the shift-and-accumulation operation using the 16 + 8-b adder and the accumulator. The shift-left operation is carried out by a direct connection rather than by an accumulator shift operation, and thus there is no timing overhead for the shift. In eight clock cycles, the SA performs eight accumulations and finally the output register contains a filtered output sample. Since the CU performs the shift-and-accumulation after a summation of nine partial products as in (4), the operand size of the tree adder is reduced from 16-b in Fig. 2 to 8-b in Fig. 6. Moreover, the number of SAs is reduced from nine in Fig. 2 to one in Fig. 6. Hence, the proposed architecture can dramatically reduce the gate count compared with existing architectures [8, 13–16].

Figure 6. The CU structure.

In a 2-D convolution, the previous row and the next row are required to convolve the current row. Hence, HSP48908 [8] has an on-chip row buffer which can store two rows (2 × 1024 pixels). In contrast, HSP48901 [8] requires an off-chip row buffer. The architecture in [4] requires a memory for the entire image, and the convolver is used as a coprocessor of the DSP processor. The source image data is provided from the memory using the DMA (Direct Memory Access) controller in the DSP processor. As in the architecture of [4], if the source image data is stored in memory and more than one pixel can be provided in a cycle, the row buffer is not required. In such a system, the proposed ICU, which is a simple finite state machine (FSM), can be used. As shown in Fig. 7, there are three states based on the connections in the ICU. Assume that all rows are stored in memory in an interleaved fashion using the DMA or the DSP processor. First, state1 maps input ports I1, I2, and I3 in the ICU to output ports O1, O2, and O3, respectively. Therefore, the three input rows are row1, row2, and row3, and the filtered output is the result for row2. Second, state2 maps input ports I2, I3, and I1 to O1, O2, and O3. In state2, the three input rows are row2, row3, and row4. Third, state3 maps input ports I3, I2, and I1 to O1, O2, and O3. In state3, the input rows are row3, row4, and row5. After every state, the row change (RC) signal is asserted to change states. As a result, the right sequence of rows is loaded into the chip without using any row buffer. Hence, we do not need row buffers to store the previous row and the next row if rows are stored in memory in the interleaved fashion shown in Fig. 7. Moreover, if we use an off-chip row buffer as in [8, 25], the proposed filter can process raster scan image data.

Figure 7. The ICU control scheme. (a) The data sequence from memory to the chip and (b) The switching states.
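The RU and CU datapath described earlier in this section can be sketched at register level as follows. This is only a behavioural illustration under stated assumptions: the class and function names are hypothetical, data and coefficients are treated as unsigned 8-b values, the adder tree is modelled without a width limit (corresponding to the wider-adder option mentioned in Section 2), and the accumulator wraps at 16 b. The ICU and the shifting of data between RUs are not modelled.

```python
# Register-level sketch of one output sample: nine RUs feeding one CU.

class RegisterUnit:
    def __init__(self, data, coeff):
        self.data = data & 0xFF     # 8-b data register
        self.coeff = coeff & 0xFF   # 8-b coefficient register

    def step(self):
        """Return this cycle's partial product, then rotate the coefficient left."""
        msb = (self.coeff >> 7) & 1
        partial = self.data if msb else 0                  # eight two-input AND gates
        self.coeff = ((self.coeff << 1) | msb) & 0xFF      # rotate left by one bit
        return partial


class ComputationUnit:
    def __init__(self):
        self.acc = 0

    def step(self, partials):
        tree_sum = sum(partials)                           # adder tree over nine RUs
        self.acc = ((self.acc << 1) + tree_sum) & 0xFFFF   # hard-wired shift, 16-b wrap
        return self.acc


def filter_sample(pixels, coeffs):
    rus = [RegisterUnit(p, c) for p, c in zip(pixels, coeffs)]
    cu = ComputationUnit()
    out = 0
    for _ in range(8):                                     # eight clock cycles
        out = cu.step([ru.step() for ru in rus])
    # After eight rotations every coefficient register holds its original value.
    assert [ru.coeff for ru in rus] == [c & 0xFF for c in coeffs]
    return out


pixels = [12, 200, 7, 255, 0, 91, 33, 64, 128]
coeffs = [1, 2, 1, 2, 4, 2, 1, 2, 1]
print(filter_sample(pixels, coeffs))
```

Rotating rather than shifting the coefficient register means the coefficients are back in place for the next pixel after eight cycles, so they do not need to be reloaded.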
4. Implementation and Comparisons

The proposed filter has been simulated using VHDL models, and logic synthesis has been performed using the SYNOPSYS™ Design Compiler. We have used the Samsung™ 0.8 µm SOG cell library (KG60K) and have fully verified function and timing through simulation. Figure 8 shows the implemented filter chip, which has 47 signal pins and 17 power pins. The implemented chip operates at 125 MHz even with the 0.8 µm SOG technology, provides up to 15.625 Mpixels/second and meets the real-time requirement (10.4 Mpixels/second).

Figure 8. The implemented filter chip. (a) Photomicrograph and (b) Packaged chip.

The proposed filter can be easily expanded to larger size filters because the input data is passed through the RUs in Fig. 4. Moreover, the filter can be used as a 1-D FIR filter because the input data of the previous row can be shifted to the next row using the MUXs in Fig. 4. We can compose a 36-tap 1-D FIR filter by serially cascading four filters as shown in Fig. 9(a). In addition, Fig. 9(b) shows a 2-D filter using four filters. This cascadability has been verified by gate-level simulation through HDL coding and logic synthesis.

Figure 9. Cascading multiple filters. (a) The extended 36-tap FIR filter and (b) The extended 2-D filter.

The implemented chip significantly reduces the gate count compared with the other architectures [8, 13–16]. Table 2 shows comparisons among HSP48901, HSP48908, and the implemented chip. The gate counts for the HSP48901 and HSP48908 chips are given in the datasheets [8]. HSP48901 has nine multipliers, does not have row buffers, and has a total gate count of 13,594 [8]. HSP48908 has nine multipliers and row buffers, and its gate count is 47,500 [8]. In contrast, the implemented chip has only 3,893 gates. In addition, compared with the existing multiplierless architectures [13–16], the proposed filter architecture reduces the tree adder size from 16-b to 8-b and the number of SAs from 9 to 1. The gate count of the multiplierless architecture [13] is estimated at about 6,900, assuming that the 1-b CSA (Carry Save Adder) is 9 gates and the 2 × 1 MUX is 3 gates [15]. The proposed filter architecture reduces the gate count by about 50%, 70%, and 90% compared to the multiplierless architecture [13], HSP48901 [8] and HSP48908 [8], respectively. In general, pattern recognition and machine vision applications require much larger size 2-D convolutions, and thus the proposed architecture with its small gate count is especially valuable for these applications [20, 21].

Table 2. Comparisons among HSP48901, HSP48908 and the implemented chip.

| Models spec. | HSP48901 | HSP48908 | The proposed chip |
|---|---|---|---|
| System clock rate | 30 MHz | 32 MHz | 125 MHz |
| Total gate count | 13,594 gates | 47,500 gates | 3,893 gates |
| Max. data rate/gate count | 2.21 Kpixels/(sec · gate) | 0.67 Kpixels/(sec · gate) | 4.01 Kpixels/(sec · gate) |
| Row buffer | Off-chip row buffer | On-chip row buffer | FSM |
| Max. data rate | 30 Mpixels/second | 32 Mpixels/second | 15.6 Mpixels/second |

5. Conclusions

This paper proposes a new VLSI architecture for a multiplierless 2-D convolution filter for real-time image processing and presents its chip design and implementation. We have derived the architecture from the modified algorithm. The architecture reduces the gate count by more than 70% and 90% compared to HSP48901 and HSP48908, respectively. In addition, compared with the existing multiplierless architectures [13–19], the proposed architecture reduces the gate count of the computation block itself by about 70% because of the reduction in tree adder size from 16-b to 8-b and in the number of SAs from 9 to 1. Moreover, the architecture does not require row buffers, because it controls the input sequence in the ICU. We implemented the filter chip using the Samsung™ 0.8 µm SOG cell library (KG60K). The implemented filter chip consists of 3,893 gates, operates at 125 MHz and can process real-time image data. With a more advanced process technology, the clock speed is expected to be much higher. The proposed architecture is especially valuable for the larger size 2-D convolutions that are common in image processing because of its reduced gate count.

Acknowledgment

This work was supported in part by the NRL (National Research Laboratory) program of MOST (Ministry of Science & Technology), in part by the HY-SDR Research Center under the ITRC Program of MIC and in part by IDEC (IC Design Education Center).

References

1. R.C. Gonzalez and R.E. Woods, Digital Image Processing, Addison Wesley, June 1993.
2. M. Sonka, V. Hlavac, and R. Boyle, Image Processing, Analysis and Machine Vision, Chapman and Hall, 1993.
3. E.R. Dougherty, Digital Image Processing Methods, Marcel Dekker, 1994.
4. B. Bosi, G. Bois, and Y. Savaria, "Reconfigurable Pipelined 2-D Convolvers for Fast Digital Signal Processing," IEEE Trans. Very Large Scale Integration (VLSI) Syst., vol. 7, 1999, pp. 299–308.
5. M.S. Andrews, "Architectures for Generalized 2D FIR Filtering Using Separable Filter Structures," in Proc. IEEE Int. Conf. Acoustics, Speech, and Signal Process., vol. 4, 1999, pp. 2215–2218.
6. F. Marino, "A Two-Level Interleaving Architecture for Convolvers," IEEE Trans. Signal Process., vol. 47, 1999, pp. 1481–1486.
7. H. Lee, J. Chung, and G.E. Sobelman, "FPGA-Based Digit-Serial CSD FIR Filter for Image Signal Format Conversion," in Proc. Int. Conf. Signal Process. Appl. & Tech., vol. 1, 1998, pp. 689–693.
8. HARRIS Semiconductor Inc., Digital Signal Processing, 1994.
9. ISO-IEC/JTC1/SC29/WG11, MPEG92/229 (revised), Information on Requirements for MPEG-2 Video, July 1992.
10. B.G. Haskell, A. Puri, and A.N. Netravali, Digital Video: Introduction to MPEG-2, ITP, 1997.
11. V.K. Madisetti, VLSI Digital Signal Processor, Butterworth-Heinemann, 1995.
12. N.H.E. Weste and K. Eshraghian, Principles of CMOS VLSI Design, Addison Wesley, 1993.
13. W.J. Oh and Y.H. Lee, "Implementation of Programmable Multiplierless FIR Filters With Power-of-Two Coefficient," IEEE Trans. Circuits Syst., vol. 42, 1995, pp. 553–555.
14. R. Jain, R.T. Yang, and T. Yoshino, "FIRGEN: A Computer-Aided Design System for High Performance FIR Filter Integrated Circuit," IEEE Trans. Signal Process., vol. 39, 1991, pp. 1655–1668.
15. T. Yoshino, R. Jain, P.T. Yang, H. Davis, W. Gass, and A.H. Shah, "A 100-MHz 64-tap FIR Digital Filter in 0.8-µm BiCMOS Gate Array," IEEE J. Solid-State Circuits, vol. 25, 1990, pp. 1494–1501.
16. M. Ishikawa et al., "Automatic Layout Synthesis for FIR Filters Using a Silicon Computer," in Proc. IEEE Int. Symp. Circuits Syst., May 1990, pp. 2588–2591.
17. D. Li, "Minimum Number of Adders for Implementing a Multiplier and its Application to the Design of Multiplierless Digital Filters," IEEE Trans. Circuits Syst. II: Analog and Digital Signal Process., vol. 42, 1995, pp. 453–460.
18. S. Samadi, H. Iwakura, and A. Nishihara, "Multiplierless and Hierarchical Structures for Maximally Flat Half-Band FIR Filters," IEEE Trans. Circuits Syst. II: Analog and Digital Signal Process., vol. 46, 1999, pp. 1225–1230.
19. S. Sriranganathan, D.R. Bull, and D.W. Redmill, "Optimization of Multiplierless Two-Dimensional Digital Filters," Visual Commun. Image Process. 96-SPIE, vol. 2727, part 3/3, 1996, pp. 1280–1287.
20. L.J. Siegel, H.J. Siegel, and A.E. Feather, "Parallel Processing Approaches to Image Correlation," IEEE Trans. Comput., vol. C-31, 1982, pp. 208–218.
21. N. Ranganathan and S. Venugopal, "An Efficient VLSI Architecture for Template Matching Based on Moment Preserving Pattern Matching," in Proc. ICPR-D, 1994, pp. 388–390.
22. H.M. Chang and M.H. Sunwoo, "An Efficient Programmable 2-D Convolver Chip," in Proc. Int. Symp. Circuits and Syst. (ISCAS98), June 1998, WAA1410.
23. S.Y. Eun and M.H. Sunwoo, "An Efficient 2-D Convolver Chip for Real-Time Image Processing," in Proc. Asia and South Pacific Design Automation Conf., Feb. 1998, pp. 329–330.
24. Samsung Electronics, SEC KGL 60K Cell Library Data Book, 1995.
25. B. Arambepola, V.B. Patel, and G. Cheung, "Cascadable One/Two-Dimensional Digital Convolver," IEEE J. Solid-State Circuits, vol. 23, 1988, pp. 351–357.

Myung H. Sunwoo received the B.S. degree in electronic engineering from Sogang University in 1980, the M.S. degree in electrical and electronics from Korea Advanced Institute of Science and Technology in 1982, and the Ph.D. in electrical and computer engineering from The University of Texas at Austin in 1990. He worked for Electronics and Telecommunications Research Institute (ETRI) in Taejon, Korea from 1982 to 1985 and Digital Signal Processor Operations Division, Motorola U.S.A. from 1990 to 1992. Since 1992, he has been a Professor with the School of Electrical and Computer Engineering, Ajou University in Suwon, Korea. His research interests include VLSI architectures, SOC design for multimedia and communications, and application-specific DSP chip design. He is the author of more than 110 journal and conference papers. He served as a Technical Program Chair of the IEEE Workshop on SIGNAL PROCESSING SYSTEMS in 2003, has been a Technical Committee member of the IEEE Circuits and Systems VSATC since 1996, and has served as a Program Committee member of the IEEE Workshop on SIGNAL PROCESSING SYSTEMS and the IEEE International ASIC/SOC Conference. He served as an Associate Editor for the IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION SYSTEMS from 2001. He is a Senior Member of IEEE.
sunwoo@madang.ajou.ac.kr

Seong Keun Oh received the B.S. degree in electronics engineering from Kyungpook National University, Taegu, Korea, in 1983, and the M.S. and Ph.D. degrees in electrical engineering from Korea Advanced Institute of Science and Technology (KAIST), Taejon, Korea, in 1985 and 1990, respectively. From 1988 to 1993, he was with Transmission Systems Lab, Samsung Electronics Inc., Seoul, Korea, as a Senior Researcher. Since 1993, he has been with Ajou University, Suwon, Korea, where he is currently a Full Professor at the School of Electronics Engineering, leading the Communication Systems Research Group. During 1996–1997, he was a Visiting Professor at Simon Fraser University, Burnaby, BC, Canada. His research interests include smart antennas, space-time coding, MIMO systems, OFDM, and digital transmission technologies.
