
IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 14, NO. 4, APRIL 2006

Design and Implementation of a High-Speed Matrix Multiplier Based on Word-Width Decomposition


Sangjin Hong, Senior Member, IEEE, Kyoung-Su Park, Student Member, IEEE, and Jun-Hee Mun, Student Member, IEEE
Abstract—This paper presents a flexible 2 × 2 matrix multiplier architecture. The architecture is based on word-width decomposition for flexible but high-speed operation. The elements in the matrices are successively decomposed so that a set of small multipliers and simple adders are used to generate partial results, which are combined to generate the final results. An energy reduction mechanism is incorporated in the architecture to minimize the power dissipation due to unnecessary switching of logic. Two types of decomposition schemes are discussed, which support 2's complement inputs, and the overall functionality is verified and designed with a field-programmable gate array (FPGA). The architecture can be easily extended to a reconfigurable matrix multiplier. We provide results on the performance of the proposed architecture from FPGA post-synthesis results. We summarize design factors influencing the overall execution speed and complexity.

Index Terms—Field-programmable gate array (FPGA) implementation, matrix multiplier, power reduction, reconfigurable architecture, word-width decomposition.

I. INTRODUCTION

IN MANY multimedia applications, fast and efficient matrix multipliers are very critical to overall operations. Matrix multiplication is an often-used core operation in a wide variety of graphics, image, robotics, and signal processing applications. Traditional approaches for matrix multiplication are carried out either by software on fast processors or by hardware with multiple scalar multipliers. Software operation of matrix multiplications can be very slow and become a bottleneck in overall system operations. On the other hand, hardware multipliers incorporate high-speed logic, but may be very costly in terms of complexity and power consumption. Moreover, these multipliers are usually designed for a specific wordlength and often lack flexibility, since the size of the scalar multipliers must correspond to the input of the matrix multiplier. The objective of this paper is to propose a flexible and energy-efficient matrix multiplier, which can be extended to a reconfigurable high-speed processing implementation. A key concept is to apply the divide-and-conquer approach by decomposition, both at the matrix level and at the word level. Many studies have been done on matrix multiplication. Most notably, [1] introduced an efficient matrix multiplication algorithm for linear arrays with reconfigurable pipelined bus systems.

Manuscript received April 26, 2005; revised September 12, 2005 and December 29, 2005. This work was supported by the National Science Foundation under Award CCR-0325584. The authors are with the Department of Electrical and Computer Engineering, State University of New York (SUNY) at Stony Brook, Stony Brook, NY 11794-2350 USA (e-mail: snjhong@ece.sunysb.edu; kyopark@ece.sunysb.edu; jmun@ece.sunysb.edu). Digital Object Identifier 10.1109/TVLSI.2006.874302

This architecture also follows the divide-and-conquer approach for high-speed parallel processing. However, their decomposition is limited to the matrix level. A matrix multiplier for sparse matrices, which reduces the input/output (I/O) and the number of trivial multiplications, is introduced in [2]. This architecture is focused on the data elements of the matrices. However, the architecture is highly beneficial only when the matrices are sparse. As we will discuss in this paper, the proposed architecture also reduces computation energy by exploiting zeros during the actual computations. Most of the previous implementations of matrix multiplication on field-programmable gate arrays (FPGAs) are focused mainly on the tradeoff between area and maximum running frequency [3]-[7]. A bit-serial matrix multiplier using Booth encoding was implemented on an FPGA [3]. The multiplier discussed in [4] improved the design in [3] using modified Booth-encoded multiplication along with Wallace tree addition, where the running frequency has been doubled from the design described in [3]. These designs are further improved in terms of area and speed in [7]. All of their designs consider the input word-width directly, and the size of the scalar multipliers can be significantly large for practical VLSI implementation when the input word-width increases. Moreover, their designs are mainly focused on reducing FPGA resources by incorporating static memory in the implementation. Our design is focused on both computational and energy efficiency, as well as structural flexibility. In this paper, we consider high-speed matrix multiplication based on two-level matrix decomposition. The idea of decomposition has been applied in many arithmetic computations to reduce the amount of computation and latency [8]-[10]. We extend the idea of decomposition further into the word level. Initially, a matrix multiplication with a higher dimension is decomposed into several 2 × 2 matrix multiplications.
Then, these decomposed 2 × 2 matrix multiplications are further decomposed into several matrix multiplications where each element is represented with a smaller word-width. Then, a set of simple but fast operators is used concurrently to generate partial results. The structure and size of the simple operators are fixed irrespective of the input element word-width. The final outputs are generated by combining (composing) the partial results through accumulation with adder trees. The proposed matrix multiplications support 2's complement input with energy-efficient computation. The design methodology of the proposed architecture can utilize varying numbers of simple operators to support various degrees of time multiplexing to reduce complexity with a low latency penalty. Our presentation is focused on an architecture for a 2 × 2 matrix multiplier based on word-width decomposition, which can

1063-8210/$20.00 © 2006 IEEE

Fig. 1. Illustration of the balanced word-width decomposition with W = 16 and p = 4.

be a building block for computing larger dimension matrix multiplications. In this paper, we demonstrate its performance and complexity with a design and implementation on a commercially available FPGA. The remainder of this paper has four sections. In Section II, we describe the concept of matrix multiplication via two-level decomposition. Two different word-width decomposition schemes are discussed, as well as an issue with 2's complement data representation in the proposed architecture. In Section III, a set of key units comprising a flexible matrix multiplier architecture is described in detail. A key set of design tradeoff parameters is also discussed. In Section IV, the implementation and evaluation of the design are presented. We summarize our contribution in Section V.

II. MATRIX MULTIPLICATION DESIGN OVERVIEW

A. Theory of Operation

The overall operation is based on two basic assumptions: 1) matrix multiplication is performed on N × N matrices, where N is a power of two; and 2) each element in the matrices is a fixed-point integer with a word-width of W. In the design, we adopt two-level decomposition. Initially, the N × N matrix multiplication is decomposed into several 2 × 2 matrix multiplications. For N = 4, this process is illustrated with

C = AB    (1)

C_ij = A_i1 B_1j + A_i2 B_2j,  i, j ∈ {1, 2}    (2)

where A_ik, B_kj, and C_ij are all 2 × 2 matrices, and the products A_ik B_kj represent the decomposed 2 × 2 matrix multiplications. After the initial decomposition, all matrix multiplications are 2 × 2 matrix multiplications where the word-width of each element is W. The two matrices of each product are decomposed further until each element in all the decomposed matrices is represented with p-bit precision, where p < W. Upon completion of the decomposition process, all matrix multiplications with p-bit precision are computed in parallel with scalar multipliers much smaller than W-bit multipliers. The outputs from this computation are accumulated to generate the final outputs for one W-bit 2 × 2 matrix multiplication. After repeating the same process, the outputs for all of the 2 × 2 matrix multiplications are constructed. Throughout the paper, our presentation is focused on word-width decomposition.

Then, each matrix multiplication (i.e., eight such matrix multiplications for the example above) is decomposed in terms of word-width. For example, for the value of p equal to 4

A = A_3 · 2^12 + A_2 · 2^8 + A_1 · 2^4 + A_0    (3)

B = B_3 · 2^12 + B_2 · 2^8 + B_1 · 2^4 + B_0    (4)

AB = (Σ_{i=0}^{3} A_i · 2^{4i})(Σ_{j=0}^{3} B_j · 2^{4j})    (5)

Thus, 2 × 2 matrix multiplication is represented as

AB = Σ_{i=0}^{3} Σ_{j=0}^{3} A_i B_j · 2^{4(i+j)}    (6)

where each A_i and B_j is a 2 × 2 matrix with p-bit elements.

B. Word Width Decomposition

There are two ways to decompose the matrices. The first approach is to divide the width of the original elements successively

Fig. 2. Illustration of the skewed word-width decomposition with W = 16 and p = 4.

in half. This approach, called balanced word-width decomposition, is illustrated in Fig. 1. Let a finite number p represent the decomposed data width. Then, a multiplication by 2^p becomes a simple p-bit shift. Initially, the matrix multiplication with W-bit input elements is decomposed into two submatrix multiplications with (W/2)-bit elements. Then, the decomposed matrix multiplication is further decomposed until the element word length is equal to p. After the decomposition, there will be many, but smaller, submatrix multiplications which can be performed with simple arithmetic units. The depth of the decomposition tree depends on the word length of the original data element. The restriction with this approach is that the size of W must be p · 2^k, where k is an integer. For example, for p = 4, the only possible word lengths are 4, 8, 16, 32, and so on. The second approach, skewed word-width decomposition, relaxes this restriction: we can decompose the input elements p bits at a time, starting from either the least significant bits or the most significant bits of the element. This skewed word-width decomposition is illustrated in Fig. 2. Thus, the original word length W can be any multiple of p.

The illustrations shown in Figs. 1 and 2 may seem to suggest that the decomposition process takes multiple stages of operations. Actually, the decomposition of the original matrix multiplication results in 16 submatrix multiplications (i.e., for W = 16 and p = 4). We will show that the actual decomposition can be done directly from the input elements and that the decomposition processes illustrated above are handled during the composition, where shiftings and summations are performed. Basically, the decomposition process merely divides the original values through interconnection distribution. The difference between the two decomposition approaches is the ordering of the input data for the basic operators.

C. Composition

The composition is the reverse operation of the decomposition. Consider two matrices A and B with input word-width W which are multiplied as

C = AB    (7)

A = Σ_{i=0}^{W/p-1} A_i · 2^{ip}    (8)

B = Σ_{j=0}^{W/p-1} B_j · 2^{jp}    (9)

Consider the balanced decomposition with W = 16 and p = 4. Then, after the decomposition, the matrix multiplication is represented as

(10)

Similarly, after the skewed decomposition, the matrix multiplication can be represented as

(11)

Note that, with the exception of the multiplication (shift) factors, the original matrix multiplication consists of many smaller matrix multiplications, which can be computed with the same hardware. The results from these smaller matrix multiplications are the partial results, and they are accumulated by an adder tree to generate the outputs of the matrix multiplication. Hence, the adder tree is executed four times to generate a complete 2 × 2 output matrix C. A similar equation can be derived for the skewed decomposition case. Figs. 3 and 4 illustrate the composition schemes for the balanced and skewed decomposition schemes, respectively. Note that each adder in the adder tree has two inputs, where one of the inputs is scaled by a power of 2^p, which is equivalent to a multiple-of-p-bit shift operation. Thus, in the adder tree, the output of each adder is appropriately shifted to reverse the decomposition operations.

Fig. 3. Illustration of the 2-input adder tree structure in the balanced composition process with W = 16 and p = 4.

III. DESIGN APPROACH

A. Overall Architecture

The proposed overall architecture is shown in Fig. 5. The architecture consists of a decomposition unit, basic operators, and a composition unit. In this architecture, a small number of (p + 1)-bit multipliers is used several times to generate W-bit results. Both the decomposition and composition units are operated at the same clock frequency, which is equal to the I/O rate. However, the operating clock frequency of the basic operators is higher than that of the decomposition unit and the composition unit, and the

actual operating clock frequency of the basic operators depends on the degree of time multiplexing (i.e., the number of computations performed by a single basic operator). The units are separated by networks for data routing. The input elements of matrices A and B, which have a word-width of W, enter through Ports A and B. Each input is segmented into p-bit elements and a 2's complement conversion is made. The converted elements are stored in the decomposition registers. The decomposed elements are then ordered and stored in the basic operator registers, to be executed consecutively by the basic operators via time-multiplexing. The outputs from the basic operators are then stored in the composition registers for final accumulation. For the matrix multiplication, the total required register size is the sum of the decomposition, basic operator, and composition registers. In this section, we describe each unit in detail. In addition, an analysis of key parameters that affect the design is presented.

B. Decomposition Unit

Fig. 6 illustrates the decomposition unit and the interface with the basic operators. We assume that the inputs enter the decomposition unit as pairs of elements of matrices A and B. In the decomposition unit, a direct decomposition of W-bit elements is accomplished. The complex decomposition processes illustrated

Fig. 4. Illustration of the 2-input adder tree structure in the skewed word-width composition process with W = 16 and p = 4.

Fig. 5. Overall architecture with three main units. The figure is illustrated with the number of basic operators equal to 4.

in Figs. 1 and 2 are performed through data ordering in the register and composition unit.
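The direct decomposition described here amounts to bit-slicing the input elements, and the composition to shift-and-add over the cross products. The identity can be checked with a short Python model (a behavioral sketch with our own helper names, not the hardware structure):

```python
def segments(x, W=16, p=4):
    """Slice an unsigned W-bit value into W//p unsigned p-bit segments, LSB first."""
    return [(x >> (i * p)) & ((1 << p) - 1) for i in range(W // p)]

def product_via_segments(a, b, W=16, p=4):
    """Multiply by summing all p-bit cross products, shifted by p*(i+j) bits,
    as in the composition step; 16 cross products for W = 16 and p = 4."""
    total = 0
    for i, ai in enumerate(segments(a, W, p)):
        for j, bj in enumerate(segments(b, W, p)):
            total += (ai * bj) << (p * (i + j))
    return total

assert product_via_segments(0xBEEF, 0xCAFE) == 0xBEEF * 0xCAFE
```

The nested loop mirrors the 16 concurrent submatrix multiplications; the shifts correspond to the multiplication factors 2^{4(i+j)} applied in the adder tree.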

The decomposition unit is responsible for two main tasks. The first task is to segment the W-bit elements into several p-bit elements and to make a conversion to support 2's complement number representation. Otherwise, incorrect results are generated when one of the inputs is a negative value. For example, consider the value -21. This element can be decomposed into (0, 0, -1, -5). Its 2's complement binary representation is 1111111111101011. This can be decomposed into four segmented data 1111, 1111, 1110, and 1011. The first two data represent -1 in decimal and the third data represents -2 in decimal. While the last data is represented correctly, the overall segmented data representation is wrong (i.e., they must be 0, 0, -1, and -5). The decomposition unit must resolve such a discrepancy to support word-width decomposition of 2's complement numbers. This is accomplished by incorporating an adder for each p-bit segment of the original element as shown in Fig. 7. If the original input element is a negative number, a sign bit of the original element is appended to each p-bit segment to form a (p + 1)-bit segment. Then, 1 is added to this value and its overflow bit is ignored. Such an addition is not necessary for the least significant p bits. For this segment, however, the sign bit is overwritten if the p-bit element is all zero. No conversion is necessary if the original elements are positive. A multiplexor is used for selection.
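The conversion rule can be modeled in a few lines of Python. This is a behavioral sketch under our reading of the rule (sign-extend every segment of a negative input, then add 1 to all but the least significant segment; the all-zero special case is folded into the arithmetic), with hypothetical function names:

```python
def signed_segments(x, W=16, p=4):
    """Decompose a 2's complement W-bit value into W//p signed digits, LSB first,
    such that x == sum(d << (i*p)). Mirrors the element conversion described in
    the text: for a negative input, each segment is sign-extended and 1 is added
    to every segment except the least significant one."""
    u = x & ((1 << W) - 1)                       # raw two's complement bits
    segs = [(u >> (i * p)) & ((1 << p) - 1) for i in range(W // p)]
    if x >= 0:
        return segs                              # no conversion for positives
    return [segs[0] - (1 << p)] + [s - (1 << p) + 1 for s in segs[1:]]

digits = signed_segments(-21)                    # -21 = 1111 1111 1110 1011
assert digits == [-5, -1, 0, 0]                  # i.e., 0, 0, -1, -5 MSB first
assert sum(d << (4 * i) for i, d in enumerate(digits)) == -21
```

The recomposition check at the end confirms that the converted digits sum back to the original value, matching the paper's -21 example.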


Fig. 6. Illustration of the decomposition unit and interface (decomposition network) with the basic operators.

The total number of registers in the decomposition unit is given by

(12)

where the first factor represents the minimum depth of registers needed to support a continuous stream of input elements. Note that when the zero-flag bits are not counted, the width term of (12) reduces accordingly. The register depth is a sum of two factors, the data ready factor and the data saving factor. The data ready factor represents the data separation of the input elements that is necessary to generate one output. For example, in order to generate c_11, the required inputs are a_11, b_11, a_12, and b_21, where a_11 and b_21 are separated by two input cycles. The data saving factor represents an additional number of registers necessary to keep the existing current elements. With these additional registers, the basic operators can get the elements from the decomposition unit such that the new set of elements for the next matrix can enter the decomposition unit without overwriting the previous elements. The data saving factor is equal to 3. After the decomposition, each element is stored in the decomposition registers, where the size of each element is (p + 2) bits. Then, these elements are distributed via the decomposition network. The network configuration depends on the word-width W of the input elements, the number of basic operators, and the basic operator width p.

C. Radix-p Basic Operator

Fig. 8 illustrates one radix-p basic operator. A basic operator consists of a set of registers, two (p + 1)-bit multipliers, one adder, and a multiplexor. Each operation takes four inputs. The outputs from the multipliers are added for the final output. The (p + 2)-bit elements (one bit for sign extension and the other bit for the zero flag) of the working submatrix are loaded in order into the registers. Once ordered, the data will propagate through a chain

Fig. 7. Illustration of the 2's complement element conversion with W = 16 and p = 4 as an example. Zero check logic is implemented with OR gates.

At the output of the multiplexor, the p-bit data are checked for zero using NOR gates. The flag bit becomes one if the data is zero, and zero otherwise. The zero-flag bit is appended in addition to the sign bit. Thus, the width of the data segment is (p + 2) bits after the conversion. The purpose of the zero-flag bit is to save power consumption by reducing unnecessary operation in the basic operators. When a large element is segmented, it is likely that some of the segments have values of zero. For example, it is obvious that for a small positive number represented in W bits, there will be many zeros including the extended sign bits. Consider a small negative number. Then, most of the data will be filled with extended sign bits of one, and some of the p-bit segments will consist of all ones. In the conversion, such a segment is incremented by one, resulting in all zeros, and the zero-flag bit will be set. Hence, no scalar multiplication is necessary with this segment.

The second task is basically ordering the data so that output elements are generated at the same rate as the input rate with low overall latency. Once the data are properly ordered by the decomposition unit, all basic operators can start the processing as early as possible without any stoppage. This is discussed in the next section. The decomposition unit is designed to take a continuous stream of 2 × 2 matrices.
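The zero-flag computation on the converted segments can be sketched as follows (a behavioral model with our own names; segment digits are listed LSB first, as the converted digits of -21 would be):

```python
def zero_flags(digits):
    """One flag bit per converted segment: 1 when the digit is zero, meaning
    the corresponding p-bit multiplication can be skipped entirely."""
    return [int(d == 0) for d in digits]

# A small-magnitude negative operand converts to mostly-zero digits
# (e.g., -21 -> [-5, -1, 0, 0]), so half of its segments need no
# multiplier activity at all.
assert zero_flags([-5, -1, 0, 0]) == [0, 0, 1, 1]
```

This is the mechanism by which sign-extension runs, which become all-zero segments after conversion, translate directly into skipped multiplications.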


Fig. 8. Illustration of the basic operator. It consists of two multipliers and an adder. A multiplexor is used to minimize the switching.

of registers as shown in Fig. 8. This ordering is established to produce the four elements corresponding to c_11, c_12, c_21, and c_22. At the input of the multipliers, there are three-state buffers, where these buffers are controlled by the zero-flag bits of the corresponding operands of the left and right multipliers. Note that, for the FPGA implementation, SR latches are used instead of the three-state buffers. Each buffer is activated only when the data element is not zero. When either zero-flag bit for the left multiplier is set, the multiplication is not carried out and a zero output is selected at the multiplexor. This mechanism eliminates unnecessary switching due to the multiplier. The same is true for the right multiplier. The adder is active only when all elements going through the multipliers are nonzero. Four possible outputs, namely zero, the left product, the right product, and their sum, are selected by the multiplexor, where the selection is controlled by multiplexor select inputs derived from the zero-flag bits. It is important to note that, in the case of a pipelined implementation of the multipliers comprising the basic operator, the energy reduction scheme (i.e., zero-flag bits, tri-state buffers, multiplexor) is less effective.

Given W and p, the total number of basic operations is (W/p)^2. We assume that this total is a multiple of the number of basic operators such that

(13)

where m is an integer and represents the degree of time multiplexing. The number of basic operators is chosen such that m is an integer. For example, for a word-width W of 16 and p of 4, there will be a total of 16 concurrent submatrix multiplications. With four basic operators executing in parallel, each output is generated in 4 iterations. Then, the decomposition clock period is

(14)

The total number of registers in the basic operator unit is given as

(15)

where the depth factor is the register depth for a basic operator. This depth depends on the data transfer rate from the decomposition unit and the fetch rate of the basic operators. If we define the fetch rate as the rate of transfers from registers in the decomposition unit to registers in the basic operators, then the required depth is affected by this rate. The registers in the basic operators are filled at the rate of

(16)

and

(17)

If the fill rate is higher than the fetch rate, we need more registers at the basic operators (i.e., an increased register depth). On the other hand, if the two rates match, we just need a register depth of 1 by fetching only the element that is needed by the basic operator. Fig. 8 illustrates the basic operator for this case.

D. Composition Unit: 0-mp Adder Tree

Figs. 9 and 10 illustrate the composition adder tree. The composition unit consists of registers and an adder tree with 0-mp adders. It is interesting to observe that the inputs of a 0-mp adder are skewed by mp bits with respect to each other. Thus, a certain range of bits is known before the addition (i.e., zero bits at the least significant bits of one input and extended sign bits at the most significant bits of the other). Hence, the worst-case delay is less than that of a normal adder of the same width. In the implementation, the adders are constructed as carry-select adders, where each segment of the adder is designed with a p-bit ripple-carry adder [12]. At the input of the composition unit, a finite set of data (i.e., equal to the number of basic operators) is available at each decomposition clock period. At the interface between the basic operators and the composition unit, the number of registers needed is equal to the total number of basic operators.
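The gating behavior of one basic operator can be sketched as follows (a behavioral Python model with hypothetical names; the returned activity count stands in for multiplier switching, and the 4-way multiplexor choice is folded into the sum):

```python
def basic_operator(a1, b1, a2, b2):
    """Model of one radix-p basic operator: two multipliers and an adder.
    A zero operand gates off the corresponding multiplier, standing in for
    the zero-flag/tri-state mechanism; returns (output, multiplier activations)."""
    mults = 0
    left = 0
    if a1 != 0 and b1 != 0:      # left multiplier fires only on nonzero operands
        left = a1 * b1
        mults += 1
    right = 0
    if a2 != 0 and b2 != 0:      # likewise for the right multiplier
        right = a2 * b2
        mults += 1
    # multiplexor selects: zero, left product, right product, or their sum
    return left + right, mults

out, activity = basic_operator(3, 5, 0, 7)   # right multiplier gated off
assert out == 15 and activity == 1
```

The adder in the model is implicitly active only when both products are nonzero, matching the selection logic described above.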


Fig. 9. Illustration of the aligned adder tree by merging 0-mp adders for the balanced composition. W = 16 and p = 4 for the illustration.

Fig. 10. Illustration of the aligned adder tree by merging 0-mp adders for the skewed composition. W = 16 and p = 4 for the illustration.

The number of registers required for composition is given as

(18)

The output from each basic operator is stored in the composition registers through the composition network. The composition network configuration depends on the number of basic operators. The width of each register accommodates a basic operator output, where one bit is for the sign bit. The values stored in these registers represent the partial results.


Fig. 11. Illustration of execution timing and latency of the proposed matrix multiplier. The figure is illustrated for a degree of time multiplexing of 2 (i.e., the basic operator clock period is 1/2 of the decomposition clock period, and the composition unit runs at the decomposition rate).

At the end of the adder tree, the output word-width is 2W bits. It is clear that the composition adder tree is critical for high-throughput operation. The composition unit is pipelined to reduce the critical path to maintain high throughput at the expense of latency. Thus, the target clock period of the composition unit should be made equal to that of the decomposition unit through pipelining. The pipelining of the adder tree is greatly influenced by the data ordering. The data are ordered at the decomposition unit and executed at the basic operators such that partial results for the lower significant values are generated first. If there is multiplexing, the speed of the basic operators must be faster by a factor equal to the degree of time multiplexing. The pipelining depends on the speed of the basic operators and the desired speed of the composition unit.

For constant word-width composition, where the input word-width is equal to the output word-width, truncation or rounding is necessary. Consider the composition adder tree for the constant word-width operations. Possible ranges of truncation of the adder trees are illustrated in Figs. 9 and 10 with two lines. Truncating along the first line eliminates almost half of all adders in the composition unit, and many results from the basic operators are no longer necessary. However, it will cause serious truncation error. On the other hand, the adder tree can be truncated along the second line. In this scheme, all of the results from the basic operators are used and precision can be well maintained. The widths of the adders are incrementally reduced at the adders on the right side of the composition adder tree. These truncations reduce both the complexity of the composition unit and the critical path by avoiding long adder chains. Depending on the truncation method (i.e., between the two lines), a significant number of the total adders can be saved. In our implementation, we follow successive truncation by maintaining two additional bits of adder in the previous row of adders. This is indicated by shaded adders (i.e., unshaded adders are not implemented). Based on this implementation, we were able to reduce at least 25% of the resources (adders and wires) for a 16-bit output composition unit.
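The accuracy cost of the two truncation choices can be illustrated with a small scalar model (hypothetical names; truncation is modeled as dropping low-order cross products and keeping the top W bits, rather than the paper's shaded-adder scheme):

```python
def truncated_product(a, b, W=16, p=4, keep_from=0):
    """Constant word-width product model: accumulate only the cross products
    with i + j >= keep_from, then keep the top W bits of the result. A larger
    keep_from models the aggressive truncation line, which discards some
    basic-operator results at the cost of a larger error."""
    n = W // p
    seg = lambda x, k: (x >> (k * p)) & ((1 << p) - 1)
    total = 0
    for i in range(n):
        for j in range(n):
            if i + j >= keep_from:
                total += (seg(a, i) * seg(b, j)) << (p * (i + j))
    return total >> W          # constant word-width output: top W bits

a, b = 0xF00D, 0xBEEF
exact = (a * b) >> 16
assert truncated_product(a, b, keep_from=0) == exact           # all partials kept
assert abs(truncated_product(a, b, keep_from=3) - exact) <= 4  # aggressive cut
```

With every partial product kept, the top-W-bit output is exact; dropping the low-significance partial products perturbs only the discarded tail, so the error in the retained bits stays small, which is the tradeoff the two truncation lines represent.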

E. Execution Throughput and Latency

Throughput is defined as the rate of completion of W-bit 2 × 2 matrix multiplications per second. This rate ultimately depends on the clock period of the decomposition unit, as discussed in (14). Hence, the throughput can be represented as

(19)

where the multiplicity of 4 comes from the number of elements in 2 × 2 matrices and m represents the degree of time multiplexing of the basic operators. Thus, doubling the number of basic operators reduces the decomposition unit period by half. Fig. 11 illustrates the relative timing of each unit. Note that the rates of the decomposition unit and composition unit are equal, while the rate of the basic operators is faster than either of the two. The initial latency of the architecture is the sum of the latencies in the decomposition, basic operator, and composition units. Thus, the total latency is given as

(20)

where the first component of the total latency is due to the data ready factor in the decomposition unit; this value is the same for all implementations (i.e., the factor does not influence the latency). The second component is due to the multiplexing of the basic operators. The third and fourth components are due to the pipelining of the basic operators and the composition unit, with the respective numbers of pipeline stages in each.

F. Effects of the Degree of Time Multiplexing

We now generalize the complexity of the matrix multiplier and represent this complexity in terms of the degree of time multiplexing m. For W = 16, we consider the complexity for p = 4. In this analysis, the number of basic operators is equal to (W/p)^2 / m. Thus, when m = 16, we employ one basic operator. Table I illustrates the number of registers needed as a function of the number of basic operators (i.e., the reciprocal of the degree of time multiplexing). The necessary numbers of registers are reduced for the decomposition unit and the composition unit for a larger value of m. For the basic operators, the number of

TABLE I
NUMBER OF NECESSARY REGISTERS AS A FUNCTION OF THE NUMBER OF BASIC OPERATORS FOR W = 16. THE NUMBER OF REGISTERS IN THE BASIC OPERATORS IS NORMALIZED FOR ONE BASIC OPERATOR

TABLE II
RESOURCE AND DELAY OF EACH UNIT FROM FPGA IMPLEMENTATION FOR W = 16 AND p = 4. LOGIC RESOURCE IS MEASURED IN THE NUMBER OF CLBS. THE DELAYS ARE MEASURED FOR LOGIC BETWEEN THE REGISTERS IN CASE OF PIPELINE STRUCTURES

registers per basic operator is increased, but the total number of registers for the basic operators is reduced for a larger degree of time multiplexing. When a single basic operator is employed, the entire architecture reduces to the traditional matrix multiplier, where two multipliers and an adder are employed.

IV. IMPLEMENTATION AND EVALUATION

A. Implementation

In this paper, we design a matrix multiplier with W = 16 and p = 4 for illustration based on the proposed architecture. We use the Xilinx ISE 7.1i development tool and ModelSim for our design. Our target device is a Xilinx Virtex-II XCV1500 series FPGA. Even though our architecture is better suited, in terms of area, for application-specific integrated circuit (ASIC) implementation due to the decomposition and composition network wirings and switches, we provide an evaluation with FPGA synthesis. The accuracy of the designs is verified through post-synthesis simulation, and all the results given in this section are based on post-synthesis data. Both constant word-width and full word-width architectures are chosen for the design, and the number of basic operators is chosen with time multiplexing. We assume that a pair of input elements arrives at one time. The skewed decomposition scheme is used in the design. Both the basic operator and composition unit clock periods are designed to be less than or equal to the target period. The adder tree in the composition unit is pipelined to meet the target speed. The fetch period is chosen to be equal to the basic operator clock period to minimize the number of registers in the basic operators.

B. Evaluation

Table II tabulates the resource requirements and execution delays for each unit. The resource requirement for each unit is represented in the number of CLBs used in the implementation, and the delay represents the maximum clock speed that the logic can achieve. Both the decomposition and composition units include resources and delays due to the decomposition network and composition network, respectively. The basic operators are implemented with and without the pipeline stage.
Two 5-bit embedded multipliers are incorporated in each basic operator. Note that the FPGA provides 18-bit embedded multipliers and only 5-bit multipliers are synthesized. The amount of resources used in the design of a basic operator is 7 CLBs without the pipeline and 11 CLBs with the pipeline. Their respective delays are 10.43 ns and 3.186 ns. Additional resources are necessary when the multipliers are implemented using FPGA logic elements. An implementation of the array multipliers based on the Baugh-Wooley algorithm [13] in place of the embedded multipliers requires an additional 6 CLBs. Tri-state buffers are located near the I/O of the Virtex-II FPGA. Because of this position, synthesis of the tri-state buffers requires more CLBs and the design is slower. In order to minimize the complexity and improve the speed, a set of SR latches with enable control is used to deactivate the multipliers when at least one of the inputs is zero. The effect of this mechanism is illustrated later when energy consumption is described. The decomposition unit is the fastest among the different units, even without pipelining, because of the simple logic involved in the design. The unit uses 23 CLBs of resources and its corresponding delay is 2.458 ns. Thus, the decomposition unit will not be a bottleneck of the overall architecture; it is faster than the fastest basic operator. The resources required by the decomposition unit include the sequencing logic for controlling the time-multiplexing of the basic operators and the routing of data. For the composition units, both constant word-width and full word-width outputs are designed, targeted for 4, 8, and 16 basic operators. The composition units are appropriately pipelined so as to match the operation speed of the basic operators. The constant word-width composition units require 18, 20, and 37 CLBs and have delays of 7.457, 5.4, and 3.302 ns for 4, 8, and 16 basic operators, respectively.
On the other hand, the full word-width composition units require 24, 27, and 52 CLBs and have delays of 7.645, 5.62, and 3.386 ns for 4, 8, and 16 basic operators, respectively. The constant word-width composition units are faster and less complex than the full word-width composition units. These reductions are due to fewer adders and shorter adder propagation. While the decomposition unit is not the bottleneck, the basic operators and the composition unit can be, depending on the number of basic operators used.

Table III tabulates the design complexity and operating frequency of the proposed matrix multipliers. The total resource usage is represented by the number of CLBs. The resources for the basic operators are multiplied by the number of basic operators employed. The overall complexity includes CLBs for the additional routing required to implement the architecture with degrees of time multiplexing of 1, 2, and 4. In general, an implementation with a low degree of time multiplexing requires more resources because of the required number of basic operators. In the table, we have included the 4 × 4 matrix multiplier designed by Jang and Prasanna as a reference [7] for comparison. All designs use the same FPGA device (i.e., Xilinx XCV1500). This design is one of the most efficient matrix multipliers


TABLE III DESIGN COMPLEXITY AND OPERATING FREQUENCY OF THE DESIGN IN TERMS OF CLBS. IN THE PROPOSED DESIGN, THE FIRST FREQUENCY IS FOR THE BASIC OPERATORS AND THE SECOND FREQUENCY IS FOR THE DECOMPOSITION AND COMPOSITION UNITS

implemented in an FPGA [3]–[5], [7]. Our proposed design has two operating frequencies: one for the basic operators and the other for both the decomposition unit and the composition unit. While the frequency of the basic operators is constant at 314 MHz, the speed of the decomposition and composition units corresponds to the input and output rate, which depends on the degree of time multiplexing of the basic operators. Hence, the frequency of the decomposition unit and the composition unit corresponds to the input/output clock period. All of the designs in the table use embedded multipliers. The complexity of the embedded multipliers is translated into effective CLBs: the design by Jang and Prasanna uses 27.5 CLBs for each 16-bit multiplier (i.e., their design employs 4 multipliers, for a total of 110 CLBs) [7]. To make a fair comparison and follow their translation method, our designs use 2.5 effective CLBs for each multiplier. This is smaller than a direct synthesis of array multipliers based on the Baugh-Wooley algorithm [13]. However, our design does not utilize any memory, whereas the design by Jang and Prasanna uses memory with an effective number of CLBs equal to 25. Note that our proposed design is implemented for 2 × 2 matrix multiplications, so multiple executions are necessary for a 4 × 4 matrix multiplication. Our design runs at a much higher frequency (i.e., 314 MHz) than the designs described in [3]–[5], [7], whose running frequencies range from 33 to 166 MHz. One drawback of our design is that it requires more resources with a low degree of time multiplexing. However, no embedded memory is needed in the proposed designs.

Table IV illustrates the throughput, in terms of the number of matrix multiplications per second, with the fastest basic operators (i.e., the pipelined version), with degrees of time multiplexing of 1, 2, and 4, and with an appropriate composition unit.
The execution time represents the amount of time involved in executing one matrix multiplication for a given size of matrices. Data for higher order matrix multiplications are derived from the 2 × 2 post-synthesis results. The inverse of the execution time represents the throughput in terms of the number of matrix computations per second. The I/O rates are 190 and 368 MHz for the configurations with degrees of time multiplexing of 4, 2, and 1, respectively. Note that the speedup of the configuration with a degree of time multiplexing of 1 is less than a factor of two over the configuration with a degree of time multiplexing of 2. This is because the composition unit becomes the bottleneck for the case with a degree of time multiplexing of 1. Table V tabulates both the latencies and the effective execution times of the designs. As illustrated in the table, all designs have latencies due to internal pipelines.

TABLE IV THROUGHPUT AND EXECUTION TIME FOR EXECUTING LARGER MATRIX MULTIPLICATION USING THE PROPOSED 2 × 2 MULTIPLIER FOR = 16 AND THE DEGREES OF TIME MULTIPLEXING OF 1, 2, AND 4. THE TIMES ARE MEASURED BY MULTIPLYING THE TOTAL NUMBER OF 2 × 2 MATRIX MULTIPLICATIONS BY THE THROUGHPUT OF THE ARCHITECTURE

The values of the latencies are computed with respect to the input clocks. In both designs, the effective execution time of one 4 × 4 matrix multiplication is the sum of the latency and the total execution time. With our design, the execution time of a 4 × 4 matrix multiplication is computed from the number of 2 × 2 matrix multiplications necessary and the 2 × 2 results. Fig. 12 illustrates the energy consumption of the designs. Different designs are compared with an energy consumption factor defined in (21), which is computed from the total number of CLBs used in the design, the operating frequency of the corresponding units, and the time it takes to compute a single 4 × 4 matrix multiplication. Note that our design is for 2 × 2 matrix multiplications; thus, the 4 × 4 time is derived from consecutive 2 × 2 matrix multiplications. The energy efficiency factor is not the actual energy dissipation, but it is a good indicator of energy dissipation for comparison purposes. We assume the input patterns are random. In Fig. 12, each design is denoted by the number of basic operators used and whether the zero flag bit mechanism is employed (i.e., B4-Z represents the design with 4 basic operators and the zero flag bit mechanism, whereas B4-NZ represents the same design without the zero flag bit mechanism). In the proposed designs, an activity factor of 0.73 is applied to the CLBs required by the basic operators. Even though random input bit patterns were generated, the basic operators are inactive for 27% of the time due to zero data segments. With the random input bit patterns, it is possible to reduce the total energy efficiency factor by up to 21%. We expect that more energy saving can be achieved when the bit patterns within the inputs are sparse. In the total CLB determination, we have included effective CLBs for both embedded multipliers and memory, based on the transformation used in the respective designs [7], [11].
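The zero-flag deactivation responsible for this inactivity can be modeled in software as skipping any partial product whose operand segment is zero. The function below is a hypothetical model of the behavior, not the SR-latch RTL; the segment width and names are illustrative.

```python
def gated_partial_products(seg_a, seg_b, w=4):
    """Software model of the zero-flag mechanism: a small multiplier is
    only activated when both of its input segments are nonzero, so sparse
    data leave many multipliers idle (saving switching energy).  The
    skipped products contribute zero, so the result is still exact."""
    total, activations = 0, 0
    for i, x in enumerate(seg_a):
        for j, y in enumerate(seg_b):
            if x == 0 or y == 0:   # zero flag set: multiplier stays deactivated
                continue
            activations += 1       # this multiplier actually switches
            total += (x * y) << (w * (i + j))
    return total, activations
```

With segment lists `[3, 0, 0, 1]` and `[0, 2, 0, 5]`, only 4 of the 16 segment pairs activate a multiplier, yet the product remains exact.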


TABLE V LATENCY AND EXECUTION TIME FOR EXECUTING ONE 4 × 4 MATRIX MULTIPLICATION FOR = 16 AND THE DEGREES OF TIME MULTIPLEXING OF 1, 2, AND 4. THE TIMES ARE MEASURED BY MULTIPLYING THE TOTAL NUMBER OF 2 × 2 MATRIX MULTIPLICATIONS BY THE THROUGHPUT OF THE ARCHITECTURE
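The bookkeeping behind Tables IV and V assumes a larger product is assembled from repeated 2 × 2 multiplications. A software sketch of a 4 × 4 product built this way (all names hypothetical; `mul2x2` stands in for one invocation of the hardware unit):

```python
def mul2x2(a, b):
    """Stand-in for one invocation of the hardware 2x2 matrix multiplier."""
    return [[a[0][0]*b[0][0] + a[0][1]*b[1][0], a[0][0]*b[0][1] + a[0][1]*b[1][1]],
            [a[1][0]*b[0][0] + a[1][1]*b[1][0], a[1][0]*b[0][1] + a[1][1]*b[1][1]]]

def add2x2(x, y):
    return [[x[i][j] + y[i][j] for j in range(2)] for i in range(2)]

def block(m, r, c):
    """Extract the 2x2 block of m whose top-left corner is (r, c)."""
    return [row[c:c + 2] for row in m[r:r + 2]]

def mul4x4_by_blocks(a, b):
    """4x4 product from 2x2 blocks: each of the four output blocks needs
    two 2x2 multiplications, i.e., eight invocations of mul2x2 in total."""
    out = [[0] * 4 for _ in range(4)]
    for r in (0, 2):
        for c in (0, 2):
            s = add2x2(mul2x2(block(a, r, 0), block(b, 0, c)),
                       mul2x2(block(a, r, 2), block(b, 2, c)))
            for i in range(2):
                for j in range(2):
                    out[r + i][c + j] = s[i][j]
    return out
```

The eight `mul2x2` calls are why the 4 × 4 execution times in the tables scale with the number of 2 × 2 multiplications.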

The proposed architecture can be extended to support reconfigurability. Reconfigurability is possible because the basic operators do not depend on the input word-width. Most matrix multipliers require multipliers whose sizes correspond to the input word-width; in our design, however, the same basic operators can be used for inputs of different word-widths. Reconfiguration can be supported by a programmable interface between the units and an accumulator at the composition unit.

V. CONCLUSION

We have presented a flexible 2 × 2 matrix multiplier architecture in which a set of small but fast multipliers is used. The architecture is based on word-width decomposition for flexibility with high-speed operation. The architecture supports energy-efficient operation with zero flag tagging, and the energy reduction scheme is effective even in an FPGA implementation. We have designed the proposed architecture in an FPGA, where each unit in the architecture is optimized for minimum resource usage and high speed. We have achieved input and output rates of 78, 157, and 295 MHz with time multiplexing factors of 4, 2, and 1, respectively. When the throughput is improved with a lower degree of time multiplexing, the composition unit becomes the bottleneck. We have analyzed the design tradeoffs between the degree of time multiplexing and the speed requirement of each unit. We have also demonstrated that larger matrix multiplications can be computed using the proposed architecture.

REFERENCES

[1] K. Li, Y. Pan, and S. Q. Zheng, "Fast and processor efficient parallel matrix multiplication algorithms on a linear array with a reconfigurable pipelined bus system," IEEE Trans. Parallel Distrib. Syst., vol. 9, no. 8, pp. 705-720, Aug. 1998.
[2] C. I. Brown and R. B. Yates, "VLSI architecture for sparse matrix multiplication," Electron. Lett., vol. 32, no. 10, pp. 891-893, May 1996.
[3] O. Mencer, M. Morf, and M. Flynn, "PAM-Blox: High performance FPGA design for adaptive computing," in Proc. IEEE Symp. FPGAs Custom Computing Machines, 1998, pp. 167-174.
[4] A. Amira, A. Bouridane, and P. Milligan, "Accelerating matrix product on reconfigurable hardware for signal processing," in Proc. 11th Int. Conf. Field-Programmable Logic Appl. (FPL), 2001, pp. 101-111.
[5] J. Jang, S. Choi, and V. K. Prasanna, "Energy-efficient matrix multiplication on FPGAs," in Proc. Int. Conf. Field Programmable Logic Appl., 2002, pp. 534-544.
[6] V. K. Prasanna and Y. Tsai, "On synthesizing optimal family of linear systolic arrays for matrix multiplication," IEEE Trans. Comput., vol. 40, no. 6, pp. 770-774, Jun. 1991.
[7] J.-W. Jang, S. Choi, and V. K. Prasanna, "Area and time efficient implementations of matrix multiplication on FPGAs," in Proc. IEEE Int. Conf. Field Programmable Technol., 2002, pp. 93-100.
[8] R. Lin, "Bit-matrix decomposition and dynamic reconfiguration: Unified arithmetic processor architecture, design, and test," in Proc. Reconfigurable Arch. Workshop (RAW), 2002, p. 83.
[9] R. Lin, "A reconfigurable low-power high-performance matrix multiplier architecture with borrow parallel counters," in Proc. Reconfigurable Arch. Workshop (RAW), 2003, p. 182.

Fig. 12. Energy consumption factors of the designs. The design B4-Z represents the proposed design with four basic operators and the zero flag bit mechanism. JP4 represents the design by Jang and Prasanna.
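As a rough illustration of how such a comparison metric can be computed: the sketch below is only one plausible reading of the factor described in the text (CLBs weighted by operating frequency, scaled by the 4 × 4 computation time); it is not the paper's exact (21), and all numbers in the example are made up.

```python
def energy_factor(unit_clbs_and_freqs, t_4x4):
    """Hypothetical energy consumption factor: effective CLBs of each
    unit weighted by that unit's operating frequency, scaled by the time
    to complete one 4x4 matrix multiplication.  Not the exact Eq. (21)."""
    return sum(clbs * freq for clbs, freq in unit_clbs_and_freqs) * t_4x4

# Made-up example: decomposition unit, 16 basic operators, composition
# unit.  With the zero-flag mechanism, a 0.73 activity factor is applied
# to the basic operators' CLBs; without it, the full CLB count is used.
units_z  = [(23, 190e6), (11 * 16 * 0.73, 314e6), (37, 190e6)]
units_nz = [(23, 190e6), (11 * 16 * 1.00, 314e6), (37, 190e6)]
```

The zero-flag configuration yields the smaller factor, matching the B4-Z versus B4-NZ comparison in Fig. 12.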

C. Discussion

We can improve the speed of the proposed matrix multiplier by pipelining the multipliers within the basic operators. In our design, the multipliers are not pipelined, for energy efficiency. When the multipliers are pipelined, the area overhead will be small, since the multipliers are only 5 bits wide. However, when a traditional matrix multiplier is pipelined, the full word-width multiplier must be pipelined, which leads to a large overhead; moreover, the speed improvement is usually limited by the word-width. Our FPGA implementations can be further improved in terms of speed and complexity when they are synthesized in an ASIC. The first improvement comes from the registers and routing. The use of registers for the decomposition in an FPGA is expensive in terms of logic resources because of the routing of logic [5]. In an ASIC implementation, however, a set of registers can be compact in terms of physical area, since they can be implemented with a few transistors and multiple metal layers for the routing of wires. The basic operator can also be improved. In our design, we used SR latches for the energy reduction scheme instead of tri-state buffers; in an ASIC implementation, tri-state buffers can be easily incorporated with only a few transistors. Thus, we expect the ASIC implementation to be much more efficient, yielding a more compact design. When the multipliers are pipelined to enhance performance, the effectiveness of the energy reduction scheme is degraded. We were able to design the basic operators without pipelining the multipliers and still achieve up to 500 MHz. The speed of the basic operators is critical, since the speed of the overall architecture depends on it. Moreover, we used many multiplexors in the design, and ASICs implement multiplexors very efficiently.
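The Baugh-Wooley two's complement array multiplication considered as an alternative to the embedded multipliers can be modeled at the bit level. This is a software sketch of the algorithm in [13], not the synthesized array; the function name and bit-list representation are ours.

```python
def baugh_wooley(a, b, n):
    """n x n two's complement multiply in the Baugh-Wooley style [13]:
    partial-product bits involving exactly one sign bit are complemented,
    and constant correction terms are added, so the whole product reduces
    to an unsigned addition of partial products (as in a parallel array)."""
    A = [(a >> i) & 1 for i in range(n)]   # bits of a, LSB first
    B = [(b >> i) & 1 for i in range(n)]   # (Python's >> sign-extends)
    acc = 0
    for i in range(n):
        for j in range(n):
            bit = A[i] & B[j]
            if (i == n - 1) != (j == n - 1):  # exactly one operand sign bit
                bit ^= 1                      # complemented partial product
            acc += bit << (i + j)
    acc += (1 << n) + (1 << (2 * n - 1))      # correction constants
    return acc & ((1 << (2 * n)) - 1)         # 2n-bit two's complement result
```

For 4-bit inputs, `baugh_wooley(-3, 2, 4)` yields 250, the 8-bit two's complement encoding of -6, using only AND, NOT, and unsigned addition.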


[10] R. Lin, "Reconfigurable parallel inner product processor architectures," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 9, no. 2, pp. 261-272, Apr. 2001.
[11] S. Choi, R. Scrofano, V. K. Prasanna, and J.-W. Jang, "Energy-efficient signal processing using FPGAs," in Proc. ACM/SIGDA Int. Symp. Field-Programmable Gate Arrays, 2003, pp. 225-234.
[12] J. M. Rabaey, A. Chandrakasan, and B. Nikolic, Digital Integrated Circuits: A Design Perspective, 2nd ed. Englewood Cliffs, NJ: Prentice-Hall, 2003.
[13] C. R. Baugh and B. A. Wooley, "A two's complement parallel array multiplication algorithm," IEEE Trans. Comput., vol. C-22, no. 12, pp. 1045-1047, Dec. 1973.

Kyoung-Su Park (S'04) received the B.S. and M.S. degrees in electrical engineering from Chonbuk National University, Chonju, Korea, in 1999 and 2001, respectively, and the M.S. degree in electrical and computer engineering from Stony Brook University, Stony Brook, NY, in 2005. He is currently pursuing the Ph.D. degree at Stony Brook University. He was with the Research Center of Hyosung Corporation, Changwon, Korea, from 2001 to 2002, where he was an RF Circuit and System Designer. His research interests include circuits for image processing and architecture optimization for high-performance DSP systems design.

Sangjin Hong (SM04) received the B.S. and M.S. degrees in electrical engineering and computer science from the University of California, Berkeley, in 1986 and 1992, respectively, and the Ph.D. degree in electrical engineering and computer science from the University of Michigan, Ann Arbor, in 1999. He is currently with the Department of Electrical and Computer Engineering at Stony Brook University, Stony Brook, NY. Before joining Stony Brook University, he worked at Ford Aerospace Corporation in the Computer Systems Division as a Systems Engineer. He also worked at Samsung Electronics, in Korea, as a Technical Consultant. His current research interests are in the areas of low-power VLSI design of multimedia wireless communications and digital signal processing systems, recongurable SoC design and optimization, VLSI signal processing, and low-complexity digital circuits. Prof. Hong served on numerous technical program committees for IEEE conferences.

Jun-Hee Mun (S'05) received the B.S. degree in electrical engineering from Ajou University, Suwon, Korea. He is currently pursuing the Ph.D. degree in electrical and computer engineering at Stony Brook University, Stony Brook, NY. His research interests include low-power digital circuits, reconfigurable architecture for DSP, and wireless communications interconnect architecture optimization.
