
A 5GHz+ 128-bit Binary Floating-Point Adder for the POWER6 Processor

Xiao Yan Yu, Yiu-Hing Chan, Michael Kelly, Eric Schwarz, Brian Curran
IBM System and Technology Group Poughkeepsie, USA {xianyu, chanyiu, mrkelly, eschwarz, curranb}@us.ibm.com
Bruce Fleischer
IBM T. J. Watson Research Center Yorktown Heights, USA fleischr@us.ibm.com

Abstract: A fast 128-bit end-around carry adder is designed and fabricated as part of the POWER6 floating-point unit in a 65nm SOI process technology. Efficient use of static circuits and careful balancing of the look-ahead tree enable our floating-point design to operate beyond 5GHz with a 1.1V supply.

I. INTRODUCTION

Addition is often the timing-critical path of modern microprocessors. A number of high-performance adders have been proposed in the past [5],[6],[7]. All of them implement a parallel adder structure using dynamic logic in order to achieve the performance target. These techniques all result in significantly higher power consumption. Designing solely for higher frequency will only yield a less power-efficient design. The design space should be fully explored before a design choice is made.

Recent silicon technologies have led to an increase in subthreshold leakage. This is especially true for low-Vt transistors, which have been used intensively in high-performance designs. One must fully exploit high-Vt devices before inserting low-Vt devices. As circuits approach the ultimate power limit, careful circuit and layout implementations are required in order to produce a power-efficient design.

Interconnect delay has become more significant in each generation [8]. Currently, wire delays contribute a large percentage of cycle time. Designs with long wires will suffer from a significant increase in area, delay and power. The performance impact of the physical implementation must be analyzed at design time in order to guarantee the optimality of a design. As a result, adders with dense prefix trees are not desirable due to their massive signal communication.

This paper presents a fast 128-bit adder implemented in static circuits using a 65nm SOI technology [2] with nominal-Vt devices. The adder is realized in a 7-cycle multiply-add pipeline that is part of the POWER6 microprocessor [1]. Several adders have been proposed in the past for use in the multiply-add operation [16],[17],[18],[19],[20]. They utilize various adder schemes based on the delay profile of the multiply compression tree. These designs are power efficient only when the final addition is performed right after the compression tree and when the end-around carry computation is not needed.

Figure 1. POWER6 floating-point dataflow [1]

Figure 1 shows a block diagram of our floating-point unit. The shaded boxes indicate the latch points of each pipeline stage. Addition is partitioned across three cycles in order to provide the best floating-point performance. The bit-wise propagate and generate terms are produced at the end of the third pipeline cycle. The end-around carry computation and the generation of the 32-b group conditional sums are accomplished during the fourth cycle. The final sums are selected prior to normalization at the beginning of the fifth cycle. This floating-point unit requires a high-performance adder design since the carry signal is on the critical path. Our adder is implemented using a prefix-2 Kogge-Stone tree, which is described in Section II. Chip measurements demonstrate an operating frequency beyond 5GHz at 1.1V. Storage elements are clock-gated when they are not in use to save active power. The structure is tuned using a slack-based transistor-level timing methodology, which allows us to produce a power-efficient design.

II. ADDER IMPLEMENTATION

A. Preliminary

Descriptions of a binary floating-point unit with multiply-add dataflow can be found in [2] and [4]. This implementation allows the realization of the fused multiply-add operation:

$$T = B + A \times C \qquad (1)$$

The end-around carry adder performs the final addition after the multiply counter tree. Its carry chains are of equal length and wrap around for effective subtraction. Assuming the adder is divided into four groups, the carry for each group can be expressed as:

$$
\begin{aligned}
C_0 &= G_0 + P_0 G_1 + P_0 P_1 G_2 + P_0 P_1 P_2 G_3 + P_0 P_1 P_2 P_3 \\
C_1 &= G_1 + P_1 G_2 + P_1 P_2 G_3 + P_1 P_2 P_3 G_0 + P_0 P_1 P_2 P_3 \\
C_2 &= G_2 + P_2 G_3 + P_2 P_3 G_0 + P_2 P_3 P_0 G_1 + P_0 P_1 P_2 P_3 \\
C_3 &= G_3 + P_3 G_0 + P_3 P_0 G_1 + P_3 P_0 P_1 G_2 + P_0 P_1 P_2 P_3
\end{aligned}
\qquad (2)
$$

Interested readers can refer to [4] for more details on the end-around carry adder.

B. Our Adder Structure

In order to minimize communication overhead within the floating-point unit, we created an O-shaped floorplan. Data flows clockwise along the right stack, through the adder, and up the left stack back to the registers, as shown in Figure 2(a). This floorplan limits the wire resources that the adder can use internally. For this reason, we decided to use a non-uniform sparse adder scheme. Uniformly sparse adder schemes would occupy many more wire tracks for intermediate carries and are not feasible in our case. We use a denser prefix tree for blocks with relatively short wires and a sparser prefix tree for blocks with long wires. In this way, we are able to route our critical signals with better wire width and spacing without allocating dedicated routing areas for them in our design.

Figure 2. Binary floating-point unit floorplan [1], panels (a) and (b)

In order to fit the entire floating-point unit in the given area, the adder is separated into two sections placed side by side. The carry computation and the final sum selection areas are shown in Figure 2(b). The end-around carries and the conditional sum signals are required to turn 180° before entering the final sum selection block. Our intention is not to produce an adder with the best stand-alone performance but one that can provide the best overall floating-point performance. With this configuration, the carry path becomes the most critical.
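The wrap-around pattern of the group carry equations (2) can be written generically. The following is a minimal Python sketch, assuming group 0 is the most significant group and the chain wraps from group 0 back to the last group; the function name `eac_group_carries` and its list-based interface are illustrative assumptions and are not taken from the POWER6 design.

```python
def eac_group_carries(g, p):
    """Carry-out of every group of an end-around carry adder.

    g[i] and p[i] are the group generate and propagate signals.
    Group 0 is assumed to be the most significant group.
    """
    n = len(g)
    all_propagate = all(p)            # the P0 P1 ... P(n-1) term
    carries = []
    for i in range(n):
        carry = all_propagate
        prefix = True                 # running product P_i P_(i+1) ...
        for k in range(n):
            j = (i + k) % n           # walk the groups cyclically
            carry = carry or (prefix and g[j])
            prefix = prefix and p[j]
        carries.append(bool(carry))
    return carries


# Group 2 generates, groups 0 and 1 propagate, group 3 does neither.
print(eac_group_carries([0, 0, 1, 0], [1, 1, 0, 0]))
# -> [True, True, True, False]
```

For four groups the loop expands to the four sum-of-products expressions in (2). In the usage line, the carry generated in group 2 ripples through groups 1 and 0 and is stopped at group 3, which neither generates nor propagates.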

Figure 3(a) shows the block diagram of our adder. It is divided into three sub-blocks: the 32-bit adder block, the end-around carry generation block and the final sum selection block. Circuit blocks are placed optimally to speed up the carry paths. Figure 3(b) shows the block diagram of the 32-bit adder with the critical path marked by a thick line. It is partitioned into three sub-components. The first sub-component is the 8-bit prefix-2 Kogge-Stone tree [9] with sparseness of 2 that generates the 8-bit carry, carry+1 and propagate terms as well as conditional sums. These are needed later for sum selection. The second sub-component is the prefix-2 Kogge-Stone tree with sparseness of 8 that generates the 32-bit carry and propagate terms as well as the 32-bit conditional sums. The carry+1 term is only propagated within the 32-bit group. Since carry-out is our critical path, we have isolated this path so that the fan-out of each net on it is 1. The carry path is replicated in order to generate intermediate carries.

There are several ways to implement the carry propagation. The first way is to propagate the carry assuming the carry-in of the group is 0. Carry+1 at the ith bit can then be generated by an OR operation:

$$(\mathrm{Carry}+1)_i = \mathrm{Carry}_i + P_i \qquad (3)$$

This way does not require the propagation of both carry and carry+1; the carry+1 signal can be produced in an additional stage. The second way is to propagate both carry signals and use carry+1 as the propagate term, since $P_i$ implies $(\mathrm{Carry}+1)_i$. Because the carry signal is critical in our timing, we chose the second scheme. It requires one less stage on the critical path than the first scheme. Using this technique, we are able to meet the timing requirement of the intermediate carry paths without disturbing the critical paths.

Our sum selection block is implemented using transmission gate multiplexers with buffers to drive the 180° wire turn. The end-around carry logic block implements equations (2) from Section II.A. The carry blocks are placed to ensure balanced wire delays at each stage. The final sum selection is implemented using a structure similar to the sum selection in the 32-b adder block.
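The second carry-propagation scheme can be illustrated with a small behavioral sketch in Python. Each bit or group is represented by a pair (carry, carry+1), and the merge step uses the carry+1 value of the more significant part in place of its propagate term, which is valid because carry+1 equals carry OR propagate by (3). The function names are hypothetical, and the sketch folds the bits serially for brevity; it models only the Boolean relations, not the Kogge-Stone tree topology or the circuit.

```python
def bit_pairs(a, b, width):
    """Per-bit (carry, carry_plus_1) pairs, least significant bit first."""
    pairs = []
    for i in range(width):
        x, y = (a >> i) & 1, (b >> i) & 1
        # carry-out with carry-in 0 is x AND y; with carry-in 1 it is x OR y
        pairs.append((x & y, x | y))
    return pairs


def combine(hi, lo):
    """Merge a more significant pair with a less significant pair.

    carry+1 of the upper part stands in for its propagate term.
    """
    c0_hi, c1_hi = hi
    c0_lo, c1_lo = lo
    return (c0_hi | (c1_hi & c0_lo), c0_hi | (c1_hi & c1_lo))


def group_carries(a, b, width):
    """(carry, carry_plus_1) out of a width-bit group."""
    acc = None
    for pair in bit_pairs(a, b, width):
        acc = pair if acc is None else combine(pair, acc)
    return acc


print(group_carries(0xF0, 0x0F, 8))   # -> (0, 1)
```

In the usage line the two operands sum to 0xFF, so the group produces no carry for a carry-in of 0 but does produce one for a carry-in of 1, consistent with (3) when every bit propagates.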

Figure 3. (a) Block diagram of our adder (b) Diagram of 32-b block

III. COMPARISON WITH CONVENTIONAL DESIGNS

We have compared our design against the Ladner-Fischer (LFA) design [10], [11] and a prefix-2 Kogge-Stone adder with sparseness of 8 (Sparse-8). The LFA design was used in our first-pass test chip. Its 8-b sub-component is implemented using a full Ladner-Fischer tree. The 32-b sub-component is implemented using the same prefix scheme as its 8-b sub-component, with sparseness of 8. All designs use only nominal-Vt transistors. The optimization points of each design are obtained by varying the power-performance tradeoff factor using our in-house formal static tuner, EinsTuner, with constrained input size [12]. The performance of each point is simulated using our in-house transistor-level static timer, EinsTLT [13], and is presented as a slack number. This slack number expresses the speed of a design relative to the cycle time target. The average power dissipation of a design at each performance point is simulated using our in-house power simulator, CPAM [14]. Each output is loaded with the equivalent capacitive load calculated at unit level. All slack numbers are normalized to the fanout-of-4 (FO4) inverter delay, which is independent of technology and environment [15]. Figure 4(a) shows the power-performance curves of these designs with 20% input switching activity. All designs are optimized under the same range of power-performance tradeoff factors.

Figure 4. (a) Average Power vs. Slack (b) Leakage Power vs. Slack
[Plots: (a) average power (mW) at 20% switching and (b) leakage power (mW), each versus slack (FO4), for Our Implementation, LFA and Sparse-8.]

All three designs behave similarly at a slack of -6 FO4. Around that slack, each topology can be implemented through circuit tuning to get to an efficient design. We begin to see differences between these topologies as the cycle time approaches our target. LFA reaches its highest performance at -0.4 FO4 slack. The performance of our design and the Sparse-8 design, on the other hand, continues to improve beyond our target. Our design has performance similar to that of the Sparse-8 design at each trade-off point. However, the power-performance curve of our design lies below that of the Sparse-8 design and crosses over that of the LFA design. Figure 4(b) shows the trend of leakage power as a function of performance for each design. The leakage power curves of our design and the LFA design coincide with each other. The Sparse-8 curve, however, is shifted upwards by approximately 2 mW. This shift reduces the power efficiency of the Sparse-8 design in the power-performance space. Therefore, our design is more power efficient than the Sparse-8 design and the LFA design.

At the highest performance points of all designs, the carry and sum paths become equally critical. Optimizing a design with an even higher tradeoff factor yields diminishing returns. Since we are only interested in a design point that is able to achieve the cycle time target, the point where the curve just crosses the 0 slack boundary is desired. This design point is about 0.5 FO4 faster than LFA with only a 5% power increase and a 6% area increase. This shows that balancing the prefix tree in a design according to its critical path improves the overall performance.

Our adder is implemented and fabricated using a 65nm SOI technology. Figure 5 shows a part of our floating-point unit. Boxes on the figure indicate the positions of our adder blocks. The chip measurements show that the adder is fully functional at 5GHz with a 1.1V supply voltage.

Figure 5. Adder layout

IV. CONCLUSION

A fast 128-bit floating-point adder is implemented and fabricated as part of the POWER6 processor in a 65nm SOI technology [2]. We used a non-uniform sparse Kogge-Stone tree and carefully balanced the prefix tree according to its critical path. This new design has met the stringent timing requirement by improving the slack by 0.5 FO4 compared to the Ladner-Fischer scheme, which was used in the first test chip. Compared to the Ladner-Fischer design, our design incurs only a 6% area overhead and a 5% power increase. The measurements demonstrate operation of this adder beyond 5GHz with a 1.1V supply.

ACKNOWLEDGEMENT

The authors would like to thank Kevin Nowka, Victor Zyuban and Mary Jo Saccamango for reviewing this paper.

REFERENCES

[1] B. Curran, et al., "4GHz+ Low-Latency Fixed-Point and Binary Floating-Point Execution Units for the POWER6 Processor," Digest of 2006 IEEE International Solid-State Circuits Conference, February 8, 2006.
[2] E. Leobandun, et al., "High Performance 65 nm SOI Technology with Dual Stress Liner and Low capacitance SRAM cell," Digest of 2005 Symposium on VLSI Technology, 2005.
[3] R. K. Montoye, et al., "Design of the IBM RISC System/6000 floating-point execution unit," IBM Journal of Research and Development, vol. 34, no. 1, pp. 59.
[4] E. Schwarz, "Binary Floating-Point Unit Design," book chapter in High Performance Energy Efficient Microprocessor Design, Springer, edited by R. Krishnamurthy and V. G. Oklobdzija, March 2006.
[5] J. Park, et al., "470ps 64-bit Parallel Binary Adder," Digest of 2000 Symposium on VLSI Circuits, 2000.
[6] S. Mathew, et al., "Sub-500-ps 64-b ALUs in 0.18-μm SOI/bulk CMOS: design and scaling trends," IEEE Journal of Solid-State Circuits, Volume 11, November 2001.
[7] B. Zeydel, et al., "Efficient Mapping of Addition Recurrence Algorithms in CMOS," 17th IEEE Symposium on Computer Arithmetic, June 27-29, 2005.
[8] "Interconnect," International Technology Roadmap for Semiconductors (ITRS) 2005.
[9] P. M. Kogge and H. S. Stone, "A parallel algorithm for the efficient solution of a general class of recurrence equations," IEEE Trans. Computers, Vol. C-22, No. 8, 1973, pp. 786-793.
[10] R. E. Ladner, et al., "Parallel Prefix Computation," J. ACM, vol. 27, no. 4, pp. 831-838, 1980.
[11] S. Knowles, "A Family of Adders," Proceedings of the 14th IEEE Symposium on Computer Arithmetic, Adelaide, Australia, April 14-16, 1999.
[12] A. R. Conn, et al., "Gradient-Based Optimization of Custom Circuits Using a Static-Timing Formulation," Proceedings of the Design Automation Conference, June 1999, pp. 452-459.
[13] V. Rao, et al., "EinsTLT: Transistor Level Timing With EinsTimer," ACM/IEEE International Workshop on Timing Issues in the Specification and Synthesis of Digital Systems, March 8-9, 1999.
[14] J. S. Neely, et al., "CPAM: a common power analysis methodology for high-performance VLSI design," in Proceedings, IEEE 9th Topical Meeting on Electrical Performance of Electronic Packaging, October 2000, pp. 303-306.
[15] M. Horowitz, "VLSI Scaling for Architects," Presentation slides, Computer Systems Laboratory, Stanford University.
[16] V. G. Oklobdzija and D. Villeger, "Multiplier Design Utilizing Improved Column Compression Tree And Optimized Final Adder In CMOS Technology," Proceedings of the 1993 International Symposium on VLSI Technology, Systems and Applications, pp. 209-212, 1993.
[17] V. G. Oklobdzija and D. Villeger, "Improving Multiplier Design By Using Improved Column Compression Tree And Optimized Final Adder In CMOS Technology," IEEE Transactions on VLSI Systems, Vol. 3, No. 2, June 1995.
[18] V. G. Oklobdzija and P. Stelling, "Design Strategies for the Final Adder in a Parallel Multiplier," Twenty-Ninth Annual Asilomar Conference on Signals, Systems and Computers, Pacific Grove, California, October 29 - November 1, 1995.
[19] P. Stelling and V. G. Oklobdzija, "Design Strategies for Optimal Hybrid Final Adders in a Parallel Multiplier," special issue on VLSI Arithmetic, Journal of VLSI Signal Processing, Kluwer Academic Publishers, Vol. 14, No. 3, December 1996.
[20] B. R. Zeydel, V. G. Oklobdzija, S. Mathew, R. K. Krishnamurthy, S. Borkar, "A 90nm 1GHz 22mW 16x16-bit 2's Complement Multiplier for Wireless Baseband," Proceedings of the 2003 Symposium on VLSI Circuits, Kyoto, Japan, June 12-14, 2003.