Low-Power Multiple-Precision Iterative Floating-Point Multiplier with SIMD Support


Dimitri Tan, Member, IEEE, Carl E. Lemonds, Senior Member, IEEE, and Michael J. Schulte, Senior Member, IEEE
Abstract: The demand for improved SIMD floating-point performance on general-purpose x86-compatible microprocessors is rising. At the same time, there is a conflicting demand in the low-power computing market for a reduction in power consumption. Along with this, there is the absolute necessity of backward compatibility for x86-compatible microprocessors, which includes the support of x87 scientific floating-point instructions. The combined effect is that there is a need for low-power, low-cost floating-point units that are still capable of delivering good SIMD performance while maintaining full x86 functionality. This paper presents the design of an x86-compatible floating-point multiplier (FPM) that is compliant with the IEEE-754 Standard for Binary Floating-Point Arithmetic [12] and is specifically tailored to provide good SIMD performance in a low-cost, low-power solution while maintaining full x87 backward compatibility. The FPM efficiently supports multiple precisions using an iterative rectangular multiplier. The FPM can perform two parallel single-precision multiplies every cycle with a latency of two cycles, one double-precision multiply every two cycles with a latency of four cycles, or one extended-double-precision multiply every three cycles with a latency of five cycles. The iterative FPM also supports division, square root, and transcendental functions. Compared to a previous design with similar functionality, the proposed iterative FPM has 60 percent less area and 59 percent less dynamic power dissipation.

Index Terms: Computer arithmetic, rectangular multiplier, floating-point arithmetic, low-power, multiplying circuits, multimedia, very-large-scale integration.

D. Tan and C.E. Lemonds are with Advanced Micro Devices Inc., PCS-3, 9500 Arboretum Blvd, Suite 400, Austin, TX 78759. E-mail: {Dimitri.Tan, Carl.Lemonds}@amd.com.

M.J. Schulte is with the University of Wisconsin-Madison, 4619 Engineering Hall, 1415 Engineering Drive, Madison, WI 53706-1691. E-mail: schulte@engr.wisc.edu.

Manuscript received 21 July 2007; revised 28 Feb. 2008; accepted 18 Sept. 2008; published online 23 Oct. 2008. Recommended for acceptance by P. Kornerup, P. Montuschi, J.-M. Muller, and E. Schwarz. Digital Object Identifier no. 10.1109/TC.2008.203.

1 INTRODUCTION

Ever since the introduction of SIMD extensions to general-purpose processors, there has been a rising demand for improved SIMD performance to accommodate 3D graphics, video conferencing, and other multimedia applications [1], [2], [3], [4], [5]. At the same time, the low-power computing market is demanding a reduction in power consumption despite an increase in performance. In general, these two requirements are conflicting, since increased performance is typically achieved with a corresponding increase in power consumption due to increased frequency, increased hardware resources, or a combination of these. Backward compatibility of the x86 microprocessors has enabled the survival of this Complex Instruction Set Computer (CISC) architecture and is therefore an absolute requirement for future microprocessors. In the area of floating-point, backward compatibility includes support for x87 floating-point instructions [6]. These instructions are used in scientific computing and are not generally used in multimedia applications [7]. In current x86 processors, the SIMD floating-point extensions include SSE, SSE2, and SSE3 [5]. These instructions are heavily used in multimedia applications, and in particular, single-precision (SP) operations occur very frequently [7]. In recent x86 floating-point units, the SIMD extensions and x87 instructions are mapped onto the same hardware to save resources. In the AMD-K7 and AMD-K8 microprocessors and derivatives, the hardware is optimized for x87 instructions [8], [9]. An alternative approach, presented in this paper, is to optimize for SIMD extensions and provide x87 functionality with a reduction in the performance of the latter. The advantage of this alternative approach is a reduction in hardware resources and power, and an improvement in the performance of the SIMD extensions.

This paper presents the design of an x86-compatible floating-point multiplier (FPM) that is optimized for SP SSE instructions. The FPM can perform two parallel 24-bit × 24-bit SP multiplies each cycle with a latency of two cycles, one 53-bit × 53-bit double-precision (DP) multiply every two cycles with a latency of four cycles, or one 64-bit × 64-bit extended-double-precision (EP) multiply every three cycles with a latency of five cycles. In addition to performing multiplication, the FPM is used to perform division and square root, and provides support for the x87 transcendental functions. Two internal multiplier significand precisions of 68 bits and 76 bits are required to support divide, square-root, and transcendental functions. The FPM is based on a rectangular significand multiplier tree that performs DP and EP multiplies through iteration. A rectangular multiplier is of the form N × M, where the multiplicand width N is greater than the multiplier width M [10]. The rectangular FPM uses significantly less hardware than a fully pipelined multiplier. Furthermore, the rectangular FPM reduces the latency of SP multiplies, and the wider multiplicand conveniently accommodates two parallel SP (packed) multiplies. The rectangular multiplier is also used to decrease the latency of divide and square-root operations, as described in [11]. The combination of these effects has the potential to reduce power dissipation for multimedia applications.

The main contribution of this paper is the presentation of an iterative rectangular FPM that is optimized for packed SP multiplies and efficiently supports DP and EP multiplies. Several of the individual techniques presented in this paper have been previously published, but the manner in which they have been combined in this design has not, to the authors' knowledge, been previously published. Specifically, this is the only multiplier that uses multiple passes for DP and EP multiplies to reduce area and power while supporting two packed SP multiplies in a single pass. This paper also presents a new rounding scheme that efficiently supports multiple iterations, multiple precisions, and multiple rounding boundaries for EP. The proposed FPM complies with the IEEE-754 Standard for Binary Floating-Point Arithmetic [12] with some external hardware and microcode support, and it supports the SSE and x87 floating-point multiply, divide, square-root, and transcendental function instructions specified in [6]. As demonstrated in Section 7, the proposed FPM reduces area and dynamic power by roughly 60 percent compared to a previous FPM with similar functionality.

The remainder of this paper is organized as follows: Section 2 gives a brief overview of the main ideas and the theory behind the techniques used in the FPM. Section 3 presents the hardware architecture of the FPM. Section 4 describes the iterative multiplication algorithm. Section 5 describes the rounding algorithm and hardware. Section 6 gives an overview of previous x86 FPMs and iterative FPMs. Section 7 provides area and power estimates for the proposed design and compares it to a previous design with similar functionality. Section 8 gives our conclusions.

2 MAIN IDEAS AND THEORY

According to [13], "Many FP/multimedia applications have a fairly balanced set of multiplies and adds. The machine can usually keep busy interleaving a multiply and an add every two clock cycles at much less cost than fully pipelining all the FP/SSE execution hardware." Multiplication readily lends itself to iterative algorithms and can accommodate numerous configurations, which enable various area-versus-latency trade-offs. As noted in [7], "Most graphics multimedia applications use 32-bit floating-point operations." Therefore, a reasonable approach is to optimize for SP operations.

Before describing the multiplier architecture, it is worthwhile to briefly review some of the techniques that it uses. The multiplier presented in this paper uses both recursion and iteration to trade off performance (i.e., throughput) against area and power. A recursive multiplier algorithm divides a single wider multiplication into multiple narrower multiplications and sums the resulting products. For example,

A × B = (A_H + A_L) × (B_H + B_L) = A_H × B_H + A_H × B_L + A_L × B_H + A_L × B_L,

where A is the multiplicand, B is the multiplier, and A_H, A_L, B_H, and B_L are the suitably weighted upper and lower parts of A and B. A and B can be divided into an arbitrary number of parts of different widths. This partitioning gives different design choices and trade-offs. The maximum widths dictate the hardware requirements. The recursive algorithm can be applied iteratively by reusing the same hardware and performing each of the narrower multiplications in different cycles. For example,

A × B = A × B_H (iteration 1) + A × B_L (iteration 2).

Typically, in an iterative-recursive multiplier algorithm, the product from the previous iteration is fed back to the current iteration in redundant form to avoid the delay of carry propagation in the critical path. The redundant product is typically merged into the partial product addition tree without adding delay.

Typically, FPMs assume normalized inputs and attempt to combine the addition and rounding stages to avoid the delay of two carry propagations in series. It is possible to do this if rounding is performed before normalization. If we assume normalized inputs, rounding in an FPM must deal with two distinct cases: rounding overflow and no rounding overflow. Rounding overflow refers to the case in which the unrounded product is in the range [2.0, 4.0), and no rounding overflow refers to the case in which the unrounded product is in the range [1.0, 2.0). These two cases can be computed separately using dedicated adder circuits and then selected once the overflow outcome is known [8]. In this scheme, a constant is added to the intermediate product to reduce all rounding modes to round-to-zero, i.e., truncation. The constant is rounding-mode and precision dependent, and thus can accommodate multiple rounding modes and precisions. Alternatively, injection-based rounding also adds (injects) a constant, but then uses a compound adder to compute the sum and the sum plus one [14]. This allows both the rounding overflow and no rounding overflow cases to be handled simultaneously with only one adder. Accommodating multiple rounding positions in injection-based rounding becomes problematic because the use of the compound adder assumes a fixed rounding position.

The multiplier presented in this paper uses recursive-iterative multiplication to perform DP and EP multiplies by taking multiple passes through a rectangular multiplier. It also has the ability to perform two SP multiplies in parallel. Rounding results to different precisions is implemented using two separate rounding paths: one that takes one cycle and is highly optimized for two parallel SP operations, and another that takes two cycles and handles higher precision operations.
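To make the recursion and iteration concrete, the following Python sketch (our own illustration, with an assumed 32-bit split and a 27-bit iteration step; the hardware accumulates sub-products in redundant carry-save form rather than with full carry-propagate additions) shows both decompositions:

def recursive_mul(a: int, b: int, half: int = 32) -> int:
    # One wide multiply becomes four narrower ones:
    # A*B = AH*BH + AH*BL + AL*BH + AL*BL (with the appropriate shifts).
    mask = (1 << half) - 1
    a_h, a_l = a >> half, a & mask
    b_h, b_l = b >> half, b & mask
    return (((a_h * b_h) << (2 * half))
            + ((a_h * b_l + a_l * b_h) << half)
            + (a_l * b_l))

def iterative_mul(a: int, b: int, step: int = 27) -> int:
    # Reuse one narrow multiplier, consuming `step` multiplier bits per cycle
    # and accumulating the shifted sub-products (the feedback path in hardware).
    product, shift = 0, 0
    while b:
        product += (a * (b & ((1 << step) - 1))) << shift
        b >>= step
        shift += step
    return product

x, y = 0xDEADBEEFCAFEF00D, 0x0123456789ABCDEF
assert recursive_mul(x, y) == iterative_mul(x, y) == x * y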


Fig. 1. FPM significand data path.

Fig. 2. FPM pipeline diagrams. (a) SSE-SP scalar (one SP multiply). (b) SSE-SP packed (two SP multiplies). (c) SSE-DP. (d) x87 EP or internal precision (IP68, IP76).

3 RECTANGULAR FLOATING-POINT MULTIPLIER ARCHITECTURE

A block diagram of our proposed FPM, illustrating the details of the significand data path, is shown in Fig. 1. To simplify Fig. 1, the additional hardware for exception processing, exponent computations, and divide/square-root support is not shown. The significand data path consists of three pipeline stages. The first pipeline stage consists of a 76-bit × 27-bit multiplier, which uses modified radix-4 Booth recoding [15] and a partial product reduction tree consisting of 4-2 compressors [16]. The 76-bit × 27-bit multiplier accepts a feedback product in redundant carry-save form to facilitate iteration, and a 76-bit addend specifically to support divide and square-root operations. The addend is needed because the iterations for divide and square-root use a restricted form of the multiply-add operation. The details of the Goldschmidt-based divide algorithm are explained in [11] and [17]. The operand width of 76 bits is required at the microarchitectural level to support division at the internal precision of 68 bits for transcendental functions [8]. The second and third pipeline stages consist of combined addition and rounding followed by result selection, formatting for different precisions, and forwarding of the result to the register file and bypass networks. There are two identical copies of the SP rounding unit to support packed SP multiply operations, and a single combined DP/EP rounding unit that handles rounding for all other precisions and for divide and square-root operations. The SP rounders take one cycle and the DP/EP rounder takes two cycles. The outputs of the two SP rounders are combined, formatted, and multiplexed with the output from the DP/EP rounder to select the final result. The final result is written to the register file and forwarded back to the inputs of the FPM and other FP units via the bypass networks to enhance the performance of dependent operations. With such a configuration, a scalar SP multiplication takes one iteration, two parallel (packed) SP multiplications take one iteration, a scalar DP multiplication takes two iterations, and a scalar EP multiplication takes three iterations. Fig. 2 shows the pipeline diagrams for each precision supported by the FPM.

The significand multiplier consists of a 76-bit × 27-bit rectangular tree multiplier, which performs 76-bit × 76-bit multiplications over multiple cycles, as shown in Fig. 3. This saves considerable area compared to a fully parallel 76-bit × 76-bit multiplier, but penalizes the performance of the higher precision (DP and EP) multiply instructions because the multiplier must stall subsequent multiply instructions. However, the multiplier is fully pipelined for SP operations. The multiplier accepts a 76-bit multiplicand input, a 76-bit multiplier input, and a 76-bit addend input. These inputs are held for the duration of the operation. The 76-bit multiplier input is supplied to alignment multiplexing, which outputs two 27-bit values. Each 27-bit value is then recoded using a set of modified radix-4 Booth encoders. Two separate 27-bit multiplier values are required to support the packed SP mode. The outputs of the Booth encoders are used to select the multiples of the multiplicand to form fourteen 81-bit partial products. One of the 27-bit multiplier values controls the generation of the upper 38 bits of each partial product, while the other 27-bit multiplier value controls the generation of the lower 38 bits of each partial product. In unpacked modes, the two 27-bit multiplier values are identical.

Fig. 3. 76-bit × 27-bit rectangular multiplier.

In parallel with the partial product generation, two 76-bit feedback terms are combined with a 76-bit addend using a 3-2 carry-save adder (CSA). The 3-2 carry-save addition is computed in parallel with the Booth encoding and multiplexing and does not add to the critical path. The 14 partial products plus the two combined terms are summed using a compression tree consisting of three levels of 4-2 compressors to produce a 103-bit product in redundant carry-save representation. The 103-bit carry-save product is then stored in two 103-bit registers.

A diagram of the partial product array for the 76-bit × 27-bit multiplication is shown in Fig. 4. This diagram also shows the alignment of the two 76-bit feedback terms and the 76-bit addend. The two feedback terms are needed to support iterations and are aligned to the right. The addend is needed to support division and square root and is aligned to the left. The division algorithm that exploits this multiplier hardware is described in [11]. To avoid unnecessary hardware, the additional terms are inserted into the unused portions of the array wherever possible. Fig. 4 shows how the partial product terms are partitioned into groups of four, corresponding to the first level of 4-2 compressors shown in Fig. 3. Note that, in certain bit positions, a 4-2 compressor cell is not required, since some of the inputs are zeros. In these cases, the 4-2 compressor cell can be replaced by a full-adder (FA) cell (i.e., a 3-2 CSA), a half-adder (HA) cell, or a buffer cell, depending on the number of inputs that are zero. The subsequent levels of the compression tree can also benefit from these optimizations to save area.

Fig. 4. Radix-4 Booth-encoded 76-bit × 27-bit partial product array.

Although the multiplier is unsigned, a sign extension term is required to accommodate the sign embedded in the uncompressed feedback terms from the previous iteration. This is an artifact of the signed nature of the Booth encoding and the use of sign encoding for each individual partial product instead of sign extension [15]. Each partial product also requires hot-ones, which are used to account for the increment term required when taking the two's complement of negatively weighted partial products [18]. For a given partial product, the hot-ones are appended to the subsequent partial product. For positively weighted partial products, the hot-ones are zeroes. As shown in Fig. 3, the two feedback terms and the addend are compressed using a 3-2 CSA into two terms, for a total of 16 values to be summed.

In order to support two parallel SP multiplications, the two SP multiplications are mapped onto the array simultaneously. The superposition of two 24-bit × 24-bit multiplier partial product arrays onto a 76-bit × 27-bit partial product array is shown in Fig. 5. Since the lower array ends at bit 48, the significant bits of the upper array and the lower array are separated by seven bits. The reduction tree has three levels of 4-2 compressors. Therefore, the lower array can propagate a carry at most three bit positions and will not interfere with the upper array. Hence, no additional hardware is required to kill any potential carries propagating from the lower array into the upper array. However, in order to accommodate the sign encoding bits and the hot-ones, an additional multiplexer is inserted after the Booth multiplexers and prior to the 4-2 compressor tree, as indicated in Fig. 3. The multiplexing after the Booth multiplexing is only required for the sign encoding bits of the lower array and the hot-ones of the upper array, so the additional hardware required is small. This hardware, however, is on the critical path and adds the delay of a 2-1 multiplexer. An alternative to multiplexing in the sign-encoding bits and hot-one bits after the Booth multiplexing is to insert these bits into the feedback terms, which are all zeros for the first iteration.
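The radix-4 Booth recoding used in the first pipeline stage can be illustrated in isolation. The sketch below is our own model: it recodes a 27-bit multiplier into the fourteen signed digits in {-2, -1, 0, +1, +2} that select the multiplicand multiples, and checks that the shifted multiples sum to the plain product. The hardware instead keeps each partial product sign-encoded with hot-ones and sums them in the 4-2 compressor tree.

def booth4_digits(b: int, width: int = 27) -> list:
    # Examine overlapping 3-bit groups b[2j+1], b[2j], b[2j-1] (with b[-1] = 0)
    # and emit one digit per two multiplier bits: 14 digits for width 27.
    c = b << 1                                   # makes the implicit b[-1] = 0 explicit
    digits = []
    for j in range((width + 1) // 2):
        group = (c >> (2 * j)) & 0b111
        digits.append((group & 1) + ((group >> 1) & 1) - 2 * ((group >> 2) & 1))
    return digits

def booth4_product(a: int, b: int, width: int = 27) -> int:
    # Sum the signed multiples a * d * 4^j, as the compressor tree would.
    return sum((d * a) << (2 * j) for j, d in enumerate(booth4_digits(b, width)))

a, b = (1 << 75) | 0x123456789ABCDEF, 0x5F0F0F3   # 76-bit and 27-bit operands
assert len(booth4_digits(b)) == 14
assert booth4_product(a, b) == a * b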

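The seven-bit separation argument for packed SP mode can also be checked numerically. The following sketch is a simplification we constructed: it superposes the two 24-bit × 24-bit products in one wide integer, with the low product ending at bit 48 and the high product starting seven bits above, and shows that both products are recovered exactly because the 48-bit low product can never carry into the upper field.

LOW_END, GAP = 48, 7   # the low array ends at bit 48; seven-bit separation

def packed_sp_products(a_hi: int, a_lo: int, b_hi: int, b_lo: int):
    assert max(a_hi, a_lo, b_hi, b_lo) < (1 << 24)
    packed = ((a_hi * b_hi) << (LOW_END + GAP)) + (a_lo * b_lo)
    lo = packed & ((1 << LOW_END) - 1)    # low 24x24 product, bits [0, 48)
    hi = packed >> (LOW_END + GAP)        # high 24x24 product, above the gap
    return hi, lo

hi, lo = packed_sp_products(0xFFFFFF, 0xABCDEF, 0xFFFFFF, 0x123456)
assert hi == 0xFFFFFF * 0xFFFFFF and lo == 0xABCDEF * 0x123456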
4 ITERATIVE 76 × 27 MULTIPLICATION ALGORITHM

The iterative multiplication algorithm for the rectangular multiplier is given in Fig. 6. For each multiply iteration, the appropriate multiplier bits are selected for the high and low multiplier values, and the product is computed in redundant carry-save form. For SSE-SP multiplies and the first iteration of all other precisions, the two feedback terms are set to zero. For the second iteration of SSE-DP multiplies and the second and third iterations of EP multiplies, the two feedback terms are set to the upper 76 bits of the product from the previous iteration and are then added to the lower 76 bits of the current product. SP multiplies require only a single iteration, DP multiplies require two iterations, and EP multiplies require three iterations.

Fig. 6. Iterative multiply algorithm.

The alignment of the unrounded product and the position of the rounding points within the 103-bit carry-save multiplier output are shown in Fig. 7. This diagram shows the position of the rounding overflow bit V, the most-significant bit of the product M, the least-significant bit of the product L, the round bit R, the remaining result significand bits, and the sticky region. For packed SP multiplies, the unrounded products are aligned such that the high subword product is fully left aligned and the low subword product is fully right aligned. To help simplify the rounding, the DP and EP multiplies align the final product such that the number of unique rounding points is reduced without adding more precision multiplexer inputs. For EP multiplies that are to be rounded to SP (EP24), the unrounded product is aligned such that the LSB of the product is in the same position as the LSB of the DP product and of the EP product to be rounded to DP (EP53). This has the added benefit of reducing the size of the sticky region compared to its size if the product is instead fully left aligned. It is also possible to align the EP64 and IP68 rounding points, but this would require an additional precision multiplexer input in the multiplier stage. The 76-bit internal precision product (IP76) is used for intermediate results in division and square root. No rounding is needed for this mode since truncation is sufficient [11].

Fig. 7. Unrounded product alignment.

As an example, the multiplication algorithm for EP rounded to SP (EP24) is shown graphically in Fig. 8. To align the LSB of the EP24 product with the LSBs of the SSE-DP and EP53 products, the multiplicand and multiplier are aligned to the right as far as possible. For the first pass, the lower 27 multiplier bits are selected for the multiplier operand; for the second pass, the next 27 bits are selected; and for the third pass, the upper 10 bits are selected with 17 zeros prepended to form the 27-bit multiplier operand supplied to the Booth encoders. Fig. 8 also shows the 103-bit product generated from each pass, how the 103-bit product is partitioned into feedback, sticky, and carry regions, and the final result extraction. During the first two passes, the feedback term is sent back to the multiplier, and the bits in the sticky and carry regions are sent to the DP/EP rounder discussed in Section 5. During the third pass through the multiplier, all of the product bits in carry-save format are sent to the DP/EP rounder. In this pass, the 48 lower product bits correspond to the sticky and carry regions, the next 24 product bits make up the significand if overflow does not occur, and the 29 upper product bits are discarded.

Fig. 8. EP multiply rounded to SP (EP24).
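The pass structure of Figs. 6 and 8 can be modeled behaviorally. In the sketch below (ours: Python integers stand in for the 103-bit carry-save registers, and the sticky computation is collapsed to one OR per pass), each pass consumes 27 multiplier bits, retires the low 27 bits of the pass product, and feeds the upper 76 bits back into the next pass:

def multi_pass_multiply(a: int, b: int, passes: int, step: int = 27):
    # passes = 1 for SP, 2 for DP, 3 for EP.
    low_mask = (1 << step) - 1
    feedback, retired, sticky = 0, 0, False
    for i in range(passes):
        b_slice = (b >> (i * step)) & low_mask
        p = a * b_slice + feedback                # one 103-bit pass product
        if i < passes - 1:
            retired |= (p & low_mask) << (i * step)   # bits below the feedback region
            sticky |= (p & low_mask) != 0         # real design ORs the sticky region of Fig. 7
            feedback = p >> step                  # upper 76 bits fed back (carry-save in hardware)
        else:
            return ((p << (i * step)) | retired), sticky

product, _ = multi_pass_multiply(0xF123456789ABCDEF, 0xFEDCBA9876543210, passes=3)
assert product == 0xF123456789ABCDEF * 0xFEDCBA9876543210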

5 ROUNDING

Before describing the details of the proposed rounding scheme, the rounding scheme used in the AMD-K7/AMD-K8 FPM is briefly explained [8], [9]. In this rounding scheme, the product is computed using three separate 152-bit carry-propagate adders (CPAs). The first CPA computes the unrounded result for denormals and determines the significand product overflow bit. The second CPA computes a rounded result with the assumption that the unrounded result will not have an overflow, i.e., the unrounded product is assumed to be in the range [1.0, 2.0). The third CPA computes a rounded result with the assumption that the unrounded result will have an overflow, i.e., the unrounded product is assumed to be in the range [2.0, 4.0). Rounding is achieved by selecting a rounding constant which, when added to the product, reduces all rounding modes to a simple truncation with a possible LSB fix-up for round-to-nearest-even (RTNE). To avoid an extra carry-propagate addition, the rounding constant is first combined with the redundant carry-save form of the product using a 3-2 CSA before being passed to the CPA. The 3-2 CSA also provides support for the divide and square-root operations by computing the back-mul step [8]. For RTNE, the rounding constant consists of a single one in the round bit position (i.e., the half-ULP position). Therefore, if the round bit is one, the product is incremented. This achieves round-to-nearest-up, and in the case of a tie, the LSB is set to zero to keep the result even. For round-to-infinity, when the result is of the appropriate sign, the round constant consists of a string of ones starting from the round bit and ending at the LSB of the fully precise product. Therefore, any 1 located in that region causes the product to be incremented. The AMD-K7/AMD-K8 rounding scheme is fast and easily supports multiple rounding precisions, but it consumes a considerable amount of hardware and is therefore undesirable in low-cost and low-power systems.

The proposed rounding circuitry takes as input the product in redundant carry-save form and rounds the result according to the appropriate control word (FCW for x87 instructions or MXCSR for SSE instructions). The rounding circuitry contains separate rounding units for the SSE-SP high and SSE-SP low results, and a combined rounding unit that rounds for SSE-DP, x87-EP, and divide/square-root results. Each of the rounding units is based on a compound adder rounding scheme, which is more power and area efficient than the rounding scheme used in the AMD-K7/AMD-K8 multiplier [8]. It should be noted that the AMD-K8 rounding scheme is inherently faster than the rounding scheme presented here, but at the cost of increased area and power. The microarchitecture requires that the FPM be able to produce the unrounded, normalized result for support of denormalized results, as described at the end of this section. This complicates the use of injection-based rounding, described in [19], [20], and [21], which could potentially simplify the rounding units.

The SSE-SP rounder performs SSE-SP rounding only. It is a highly optimized and compact rounder compared to the DP/EP rounder, since it only has to deal with one precision. This unit has two identical instances: one for the lower SSE-SP result and one for the upper SSE-SP result. A block diagram of the SP rounder is given in Fig. 9. In the proposed SP rounding scheme, the upper 24 bits are passed through one level of HAs, which compresses the lower two bits to one bit, X_s[1]. The lower bits are denoted as a0 = P_s[23], b0 = P_c[23], and a1 = X_s[1]. The sum of these bits is denoted as sum[1:0] = {a1, a0} + {0, b0}. These three bits (a0, a1, b0) are passed to a set of 2-bit constant adders which compute sum[1:0] plus zero (sum0[1:0]), sum[1:0] plus one (sum1[1:0]), sum[1:0] plus two (sum2[1:0]), and sum[1:0] plus three (sum3[1:0]). The 2-bit constant adders also compute the carry-out from bit 1 into bit 2 for each summation case (c2p0, c2p1, c2p2, and c2p3). The upper 23 bits are passed to a two-way compound adder that computes their sum plus zero, S0 = X_s[24:2] + X_c[24:1], and their sum plus one, S1 = X_s[24:2] + X_c[24:1] + 1. Each of these results is then normalized based on the significand product overflow bits (V0 for S0 and V1 for S1). In parallel with the upper data path, the lower 24 bits are passed to a carry-tree and to sticky-bit computation logic. The carry-tree computes the unrounded LSB L, the round bit R, and the carry-out from the R-bit, Rcout. In parallel, the sticky-bit computation logic performs the logical OR of the lower 22 bits to produce the sticky bit S. Two sets of rounding selects are then determined using L, R, Rcout, S, the product's sign, and the rounding mode. One set of rounding selects assumes that overflow of the product does not occur (V = 0), or equivalently, that the unrounded significand product is in the range [1.0, 2.0). The other set of rounding selects assumes that overflow of the product does occur (V = 1), or equivalently, that the unrounded significand product is in the range [2.0, 4.0). This is similar to the approach described in [22], except that all possibilities are computed in parallel to reduce delay. The two LSBs are selected for each condition (V = 0 and V = 1), and based on Rcout, the unrounded overflow bit V is determined. The V-bit is then used to select the appropriate rounding increment determination and to select S0 or S1. Finally, for the RTNE rounding mode, the LSB may need to be set to zero.

Fig. 9. SP rounder.

The rounding algorithm is described in pseudocode in Fig. 10. It should be noted that the particular ordering of steps described was chosen for ease of description; in the actual hardware implementation, the order of each step is best determined by examining the specific timing paths and ensuring a balance between the upper path and the lower path. For instance, the order of the round-increment selection step and the normalization step can be swapped. It should also be noted that originally the SP and DP/EP rounding algorithms both used two consecutive HA rows to accommodate all rounding possibilities. However, analysis during formal verification efforts revealed that it was possible to reduce this to one HA row.

Fig. 10. SP rounding algorithm.

The combined DP/EP rounder performs rounding for SSE-DP, x87-SP, x87-DP, x87-EP, IP68 (for transcendental functions), and for divide and square-root operations. A block diagram of the DP/EP rounder is shown in Fig. 11. Due to the large number of different precisions that must be supported, the DP/EP rounder is split over two cycles, as it is in the AMD-K8 processor. However, unlike the AMD-K8 FPM, the combined DP/EP rounder is based on a compound adder rounding scheme that is more area and power efficient than the AMD-K8 rounding scheme. The DP/EP rounding scheme is similar to the SP rounding scheme, except that it is necessary to perform a right shift to prealign the rounding point to the same significance prior to the compound addition, and to perform a left shift to postalign the MSB to the same significance after the compound addition. This is the cost of having to support multiple rounding points in the same data path. The second difference is that the carry-tree and sticky logic need to include the carry-out and sticky from previous iterations. The third difference is that for each target precision there is a pair of 2-1 multiplexers that are used to insert the two rounded LSBs into the correct positions within the final rounded significand. The DP/EP rounder also provides a bypass path for divide and square root to allow the compound adder to be reused for other additions, such as computing the intermediate quotient plus 1 ULP, instead of adding dedicated hardware. For simplicity, Fig. 11 does not show the rounding circuitry required for divide and square root.

Fig. 11. Combined DP/EP rounder.
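The decision logic of the SP rounder can be summarized in a short behavioral model. The sketch below is ours and takes one shortcut the hardware avoids: it forms the product with a single full addition, whereas the real design uses the compound adder to produce the sum and the sum plus one in parallel and then selects between them. The V/L/R/sticky handling and the RTNE tie fix-up follow the description above.

def round_sp_rtne(p_sum: int, p_carry: int):
    # Round the 48-bit product of two normalized 24-bit significands
    # (given in carry-save form) to 24 bits, round-to-nearest-even.
    p = p_sum + p_carry                        # hardware: compound adder, not a full CPA
    v = (p >> 47) & 1                          # overflow: unrounded product in [2.0, 4.0)
    lsb_pos = 23 + v                           # position of the result LSB
    keep = p >> lsb_pos                        # unrounded 24-bit result
    r = (p >> (lsb_pos - 1)) & 1               # round bit
    s = (p & ((1 << (lsb_pos - 1)) - 1)) != 0  # sticky: OR of all lower bits
    l = keep & 1                               # unrounded LSB
    keep += r & (int(s) | l)                   # RTNE: increment above 1/2 ULP, or on a tie with odd LSB
    if keep >> 24:                             # rounding overflow: renormalize
        keep, v = keep >> 1, v + 1
    return keep & 0xFFFFFF, v                  # 24-bit significand, exponent adjustment

# Squaring the largest SP significand (about 1.9999999) stays just below 4.0:
assert round_sp_rtne(0xFFFFFF * 0xFFFFFF, 0) == (0xFFFFFE, 1)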

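For contrast, the AMD-K7/AMD-K8-style rounding-constant scheme described at the beginning of this section reduces every rounding mode to truncation. A minimal model (ours; the mode names are illustrative, and renormalization on rounding overflow is omitted):

def round_by_constant(p: int, lsb_pos: int, mode: str, negative: bool) -> int:
    # `p` is the exact product; `lsb_pos` is the bit position of the result LSB.
    half_ulp = 1 << (lsb_pos - 1)
    ulp_mask = (1 << lsb_pos) - 1
    if mode == "nearest_even":
        tie = (p & ulp_mask) == half_ulp       # round bit set, sticky clear
        p += half_ulp                          # constant: a single one at the round bit
        result = p >> lsb_pos                  # all modes now truncate
        return result & ~1 if tie else result  # LSB fix-up keeps ties even
    if (mode == "toward_neg" and negative) or (mode == "toward_pos" and not negative):
        p += ulp_mask                          # ones from the round bit down: any low 1 increments
    return p >> lsb_pos                        # round-to-zero (and the remaining cases) truncate

# A tie: 0b101.10 rounds to the even neighbor 0b110.
assert round_by_constant(0b10110, lsb_pos=2, mode="nearest_even", negative=False) == 0b110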
In order to fully support the IEEE-754 standard, the FPM requires some external assistance in dealing with denormals. First, the FPM assumes denormal inputs are first normalized, with the exponent sufficiently extended to accommodate the normalization shift amount. In this manner, the FPM can operate directly on the operands without needing any additional normalization or correction hardware. Second, in the case of denormal results, the FPM produces the normalized, unrounded result with the exponent falling out of range (below Emin), along with sticky information. This is fed to an external unit, which performs denormalization and rounding according to the IEEE-754 standard. To support this system, the floating-point registers are represented as normalized numbers with an extended exponent field in the register file. The internal representation is converted from memory format during loads and to memory format during stores. This approach for handling denormals is also used in the AMD-K8 processor.

6 RELATED MULTIPLIER ARCHITECTURES

Previous x86 FPMs have taken various forms. The Cyrix multiplier includes a 17-bit × 69-bit rectangular significand multiplier that uses radix-8 signed encoding, a signed-digit summation tree, and signed-digit redundant feedback [10]. This design is very area efficient. In contrast, the AMD-K7/AMD-K8 multiplier includes a fully pipelined 76-bit × 76-bit significand multiplier with a latency of four cycles and is optimized for EP operations [8]. The Intel Pentium 4 multiplier is fully pipelined for DP and takes two iterations for EP [13]. Both the AMD-K7/AMD-K8 multiplier and the Intel Pentium 4 multiplier can execute two parallel SP (packed) multiplies every clock cycle.

Iterative FPMs have also been described in the literature. For example, Anderson et al. [17] describe an iterative tree multiplier that generates only six partial products per cycle and requires five cycles to assimilate the 56-bit multiplier significand. In [14], a dual-mode iterative FPM is described that executes an SP multiply in two clock cycles at a throughput of one multiplication per clock cycle, or a DP multiply in three clock cycles at a throughput of one multiplication per two clock cycles. That multiplier consists of a 27-bit × 53-bit tree multiplier coupled with an injection-based rounder. In [18], a single-pass fused-multiply-add (FMA) floating-point unit is compared to a dual-pass FMA floating-point unit. Both FMA units support SP and DP operations. The dual-pass FMA unit is again based on an iterative rectangular multiplier and executes an SP FMA operation in one pass and a DP FMA operation in two passes. None of these iterative designs supports simultaneous (packed) SP operations. Lastly, Akkas and Schulte [23] describe an iterative FPM that supports two DP multiplies without iteration, or a quadruple-precision multiply using two iterations. In this design, the quadruple-precision multiply is achieved using an iterative algorithm. Alternative methods for achieving packed integer multiplies are described in [24] and [25], and an application to packed FMA is described in [26]. A dual-mode FPM which supports one DP multiply or two parallel SP multiplies is described in [22]. This multiplier uses radix-8 Booth encoding and handles the packed multiplies in a fashion similar to the proposed design, except that the generation and compression of partial products is performed in multiple pipeline stages and EP multiplies are not supported. That multiplier is fully pipelined and operates without stalling. It therefore requires a full DP significand multiplier.

7 RESULTS, COMPARISON, AND TESTING

The proposed rectangular multiplier was implemented in a 65-nm SOI technology using static CMOS logic and a data-path-oriented, cell-based methodology. The cell library used consisted of typical static CMOS cells in addition to some specialized cells, such as the 4-2 compressor, the Booth encoder, and the Booth multiplexer. To provide a point of comparison, a design similar to the AMD-K8 FPM (AMD-K8-FPM) described in [8] and [9] was also implemented in the same technology. The implementation results are shown in Tables 1, 2, and 3. The dynamic power was measured by applying random input patterns and measuring the average current using a SPICE-like circuit simulator with the transistor netlist and extracted parasitics. Both designs were measured using the same clock frequency f_typical and the same supply voltage V_typical. The proposed design consumes significantly less area and dynamic power compared to the baseline design (AMD-K8-FPM). The AMD-K8-FPM is a highly aggressive design that is specifically targeted toward high performance. In contrast, the proposed design is intended to be a low-cost and low-power solution with similar functionality. The implementation results reflect the two different design objectives.

TABLE 1. Area/Power Comparison for Significand Multipliers.

TABLE 2. Area/Power Comparison for Rounders.

TABLE 3. Area/Power Comparison for Entire Significand Data Path.

Functional testing was performed using a mixture of random data patterns and directed data patterns by simultaneously applying the same stimulus to the proposed iterative FPM unit and the AMD-K8 reference FPM unit. The results from each unit were captured and compared. A comparison of multiply instruction latencies and throughputs is given in Table 4.

TABLE 4. Latency/Throughput Comparison.

Performance modeling studies were performed to measure the estimated instructions per cycle (IPC) for a range of benchmarks. The AMD-K8 performance model configured with the original AMD-K8 FPM instruction latencies and throughputs served as the baseline model, while the AMD-K8 performance model configured with the proposed iterative FPM instruction latencies and throughputs served as the comparison model. As expected, performance studies using SSE-SP-dominated target applications demonstrated an increase in performance compared to the baseline design. For instance, a set of SSE-SP-dominant traces extracted from the SPEC CPU2006 benchmark demonstrated a range of improvements from 1.1 percent to 10.5 percent relative to the baseline design. For x87-dominant applications, there was a corresponding decrease in performance. However, since those applications are mainly limited by memory throughput, the difference was not significant on average, and other microarchitectural choices, such as load bandwidth and instruction window size, are more important. For example, on average, the SPECfp2000 benchmark, which contains a significant percentage of x87 instructions, demonstrated a performance loss of 2.5 percent.

The x87 architecture requires that the multiplication be carried out in EP and then rounded to the target precision of SP, DP, or EP. Therefore, it is necessary to perform a full EP multiply even if the operands only contain significant bits that fit within the SP region or within the DP region. To reduce the latency of some x87 multiplies, it is possible to detect the number of significant bits in the multiplier, determine whether this quantity falls within the range of SP, DP, or EP, and then perform the multiplication only to that precision. The multiplicand does not need to be examined, since it does not contribute to the number of passes through the 76-bit × 27-bit multiplier array. For instance, if the multiplier significand contains fewer than 28 leading significant bits, then only a single pass through the multiplier array is required; the latency of the EP multiply is reduced from five cycles to three cycles, and the throughput is increased from 1/3 to 1. To make use of this feature, it is necessary to use an instruction scheduler that can either accommodate data-dependent instruction latencies or keep track of the number of significant bits in the data. This feature relies on the assumption that, for certain applications, the operands have SP or DP ranges. Furthermore, if it can be arranged that the multiplier always contains fewer significant bits than the multiplicand, then this will increase the extent to which this feature can be used. Using this feature can recover some of the performance loss introduced by the pipeline stalls due to the iterative nature of the EP multiplies.
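The latency optimization just described reduces to a ceiling division on the count of leading significant bits of the multiplier. A sketch under the assumption of a normalized 64-bit significand (the function name is ours):

def array_passes(multiplier_sig: int, step: int = 27) -> int:
    # Passes through the 76x27 array for an x87 multiply: 1 if the multiplier
    # has at most 27 leading significant bits (trailing zeros contribute no
    # partial products), otherwise 2 or 3.
    if multiplier_sig == 0:
        return 1
    while multiplier_sig & 1 == 0:             # discard trailing zeros
        multiplier_sig >>= 1
    leading = multiplier_sig.bit_length()      # count of leading significant bits
    return -(-leading // step)                 # ceiling division

assert array_passes(0x8000000000000000) == 1   # a power of two needs one pass
assert array_passes(0xFFFFFFFFFFFFFFFF) == 3   # a full 64-bit significand needs three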

8 CONCLUSION

This paper has presented an x86-compatible FPM that is based on a 76-bit × 27-bit rectangular multiplier and is optimized for packed SSE-SP multiplies. The multiplier is compared to a design with similar functionality that was optimized instead for the largest precision. The proposed design consumes significantly less area and power while achieving improved performance for the target applications and only slightly reduced performance for x87-dominated applications. The rectangular multiplier also facilitates efficient algorithms for divide and square root with a small amount of additional hardware.

ACKNOWLEDGMENTS
We would like to thank Peter Seidel for suggesting optimizations to the rounding circuitry based on analysis derived from formal verification efforts, Albert Danysh and Eric Quinnell for their excellent work on the multiplier and rounding circuitry implementation, and Raj Desikan for his excellent work on the performance modeling and analysis. We also thank the anonymous reviewers for their helpful comments.

REFERENCES

[1] P. Ranganathan, S. Adve, and N. Jouppi, "Performance of Image and Video Processing with General-Purpose Processors and Media ISA Extensions," Proc. 26th Ann. Int'l Symp. Computer Architecture (ISCA '99), vol. 27, pp. 124-135, May 1999.
[2] S.K. Raman, V. Pentkovski, and J. Keshava, "Implementing Streaming SIMD Extensions on the Pentium III Processor," IEEE Micro, vol. 20, pp. 47-57, July 2000.
[3] M.-L. Li, R. Sasanka, S. Adve, Y.-K. Chen, and E. Debes, "The ALPBench Benchmark Suite for Complex Multimedia Applications," Proc. IEEE Int'l Symp. Workload Characterization (IISWC '05), pp. 34-45, Oct. 2005.
[4] H. Nguyen and L.K. John, "Exploiting SIMD Parallelism in DSP and Multimedia Algorithms Using the AltiVec Technology," Proc. 13th Int'l Conf. Supercomputing (ICS '99), pp. 11-20, June 1999.
[5] Advanced Micro Devices, AMD64 Architecture Programmer's Manual Volume 4: 128-Bit Media Instructions, rev. 3.07 ed., Dec. 2005.
[6] Advanced Micro Devices, AMD64 Architecture Programmer's Manual Volume 5: 64-Bit Media and x87 Floating-Point Instructions, rev. 3.06 ed., Dec. 2005.
[7] J. Hennessy and D. Patterson, Computer Architecture: A Quantitative Approach, third ed., ch. 2, p. 119, Morgan Kaufmann, May 2002.
[8] S. Oberman, "Floating-Point Division and Square Root Algorithms and Implementation in the AMD-K7 Microprocessor," Proc. 14th IEEE Symp. Computer Arithmetic (ARITH '99), pp. 106-115, Apr. 1999.
[9] C. Keltcher, K. McGrath, A. Ahmed, and P. Conway, "The AMD Opteron Processor for Multiprocessor Servers," IEEE Micro, vol. 23, pp. 66-76, Mar. 2003.
[10] W. Briggs and D. Matula, "A 17 × 69 Bit Multiply and Add Unit with Redundant Binary Feedback and Single Cycle Latency," Proc. 11th IEEE Symp. Computer Arithmetic (ARITH '93), pp. 163-170, July 1993.
[11] M. Schulte, C. Lemonds, and D. Tan, "Floating-Point Division Algorithms for an x86 Microprocessor with a Rectangular Multiplier," Proc. IEEE Int'l Conf. Computer Design (ICCD '07), pp. 304-310, Oct. 2007.
[12] ANSI and IEEE, IEEE-754 Standard for Binary Floating-Point Arithmetic, 1985.
[13] G. Hinton, M. Upton, D. Sager, D. Boggs, D. Carmean, P. Roussel, T. Chappell, T. Fletcher, M. Milshtein, M. Sprague, S. Samaan, and R. Murray, "A 0.18-um CMOS IA-32 Processor with a 4-GHz Integer Execution Unit," IEEE J. Solid-State Circuits, vol. 36, pp. 1617-1627, Nov. 2001.
[14] G. Even, S.M. Mueller, and P.-M. Seidel, "A Dual Mode IEEE Multiplier," Proc. Second Ann. IEEE Int'l Conf. Innovative Systems in Silicon (ISIS '97), pp. 282-289, Oct. 1997.
[15] S. Vassiliadis, E. Schwarz, and B. Sung, "Hard-Wired Multipliers with Encoded Partial Products," IEEE Trans. Computers, vol. 40, pp. 1181-1197, Nov. 1991.
[16] A. Weinberger, "4:2 Carry-Save Adder Module," IBM Technical Disclosure Bull., vol. 23, pp. 3811-3814, Jan. 1981.
[17] S. Anderson, J. Earle, R. Goldschmidt, and D. Powers, "The IBM System/360 Model 91: Floating-Point Execution Unit," IBM J. Research and Development, vol. 11, pp. 34-53, Jan. 1967.
[18] R.M. Jessani and M. Putrino, "Comparison of Single- and Dual-Pass Multiply-Add Fused Floating-Point Units," IEEE Trans. Computers, vol. 47, pp. 927-937, Sept. 1998.
[19] M.R. Santoro, G. Bewick, and M. Horowitz, "Rounding Algorithms for IEEE Multipliers," Proc. Ninth IEEE Symp. Computer Arithmetic (ARITH '89), pp. 176-183, Sept. 1989.
[20] G. Even and P.-M. Seidel, "A Comparison of Three Rounding Algorithms for IEEE Floating-Point Multiplication," IEEE Trans. Computers, vol. 49, pp. 638-650, July 2000.
[21] N.T. Quach, N. Takagi, and M. Flynn, "Systematic IEEE Rounding Method for High-Speed Floating-Point Multipliers," IEEE Trans. Very Large Scale Integration (VLSI) Systems, vol. 12, pp. 511-521, May 2004.
[22] A. Enriques and K. Jones, "Design of a Multi-Mode Pipelined Multiplier for Floating-Point Applications," Proc. IEEE Nat'l Aerospace and Electronics Conf. (NAECON '91), vol. 1, pp. 77-81, May 1991.
[23] A. Akkas and M. Schulte, "A Quadruple Precision and Dual Double Precision Floating-Point Multiplier," Proc. Euromicro Symp. Digital System Design (DSD '03), pp. 76-81, Sept. 2003.
[24] D. Tan, A. Danysh, and M. Liebelt, "Multiple-Precision Fixed-Point Vector Multiply-Accumulator Using Shared Segmentation," Proc. 16th IEEE Symp. Computer Arithmetic (ARITH '03), pp. 12-19, June 2003.
[25] S. Krithivasan and M.J. Schulte, "Multiplier Architectures for Media Processing," Proc. IEEE 37th Asilomar Conf. Signals, Systems, and Computers (ACSSC '03), vol. 2, pp. 2193-2197, Nov. 2003.
[26] L. Huang, L. Shen, K. Dai, and Z. Wang, "A New Architecture for Multiple-Precision Floating-Point Multiply-Add Fused Unit Design," Proc. 18th IEEE Symp. Computer Arithmetic (ARITH '07), pp. 69-76, June 2007.


Dimitri Tan received the BSEE degree from the University of Adelaide, Australia. He was previously with Motorola Inc. and Freescale Semiconductor Inc., where he worked on various microprocessor and SoC designs. He is currently with Advanced Micro Devices Inc., Austin, Texas, working on x86 microprocessor design. His research interests include computer architecture, computer arithmetic, and reconfigurable computing. He is a member of the IEEE.

Carl E. Lemonds received the BSEE and MSEE degrees from the University of Missouri, Columbia. He worked in corporate R&D at Texas Instruments, where he designed arithmetic circuits and algorithms for various DSP test chips. After a brief stint at Cyrix, he joined Intel in 1999. At Intel, he worked on the FPU for the Tejas project (a Pentium 4-class processor). In January 2004, he joined Advanced Micro Devices (AMD) Inc., Austin, Texas, where he is currently a principal member of the technical staff. His interests include computer arithmetic, floating-point, and DSP. His current research is in vector floating-point processors. He is a senior member of the IEEE and a member of the ACM.

Michael J. Schulte received the BS degree in electrical engineering from the University of Wisconsin-Madison and the MS and PhD degrees in electrical engineering from the University of Texas, Austin. He is currently an associate professor at the University of Wisconsin-Madison, where he leads the Madison Embedded Systems and Architectures Group. His research interests include high-performance embedded processors, computer architecture, domain-specific systems, and computer arithmetic. He is a senior member of the IEEE.
