Low-Power Design Techniques for High-Performance CMOS Adders

Uming Ko, Poras T. Balsara, and Wai Lee

Abstract-A high-performance adder is one of the most critical components of a processor which determines its throughput, as it is used in the ALU, the floating-point unit, and for address generation in case of cache or memory access. In this paper, low-power design techniques for various digital circuit families are studied for implementing high-performance adders, with the objective to optimize performance per watt or energy efficiency as well as silicon area efficiency. While the investigation is done using 100 MHz, 32 b carry lookahead (CLA) adders in a 0.6 µm CMOS technology, most techniques presented here can also be applied to other parallel adder algorithms such as carry-select adders (CSA) and other energy efficient CMOS circuits. Among the techniques presented here, the double pass-transistor logic (DPL) is found to be the most energy efficient while the single-rail domino and complementary pass-transistor logic (CPL) result in the best performance and the most area efficient adders, respectively. The impact of transistor threshold voltage scaling on energy efficiency is also examined when the supply voltage is scaled from 3.5 V down to 1.0 V.

Index Terms-Low power, digital CMOS, high performance, adder.

I. INTRODUCTION

With the advent of battery operated applications like portable computing and personal communication systems (PCS) [1], it has become imperative to develop integrated circuits and systems that use less energy without greatly sacrificing computational throughput. Furthermore, such energy efficient circuits are also needed in high-performance desktop, AC powered systems in which sinking large amounts of heat through packages is becoming a difficult problem. Thus, designing a low-power processor is becoming as important as designing a high-performance one. This trend will benefit desktop as well as portable systems as it will allow greater integration at the silicon level with less expensive device packaging, which in turn will lead to a further reduction in system power dissipation.

A high-performance adder has been one of the most critical components in determining the throughput of a processor's execution unit, floating-point unit, and memory address generation unit. Recently, Nagendra et al. [2] presented power-delay characteristics of various adder architectures using the full static CMOS circuit style. They presented a comparison of ripple carry, block carry lookahead and signed digit adders. A similar study was also reported by Callaway et al. [3] in which they compared dynamic power dissipation of different adder architectures. Besides considering different adder architectures, another approach is to employ different CMOS circuit styles to design energy efficient, high-performance adder circuits for a given architecture. Conventional static CMOS has been the technique of choice in most processor designs. Alternatively, static pass-transistor circuits, in particular, have also been suggested for low-power applications [1]. Dynamic circuits, when clocked judiciously, can also be used in low-power microprocessors [4]. However, several other design techniques need to be applied and evaluated along with these circuit styles for low-power/voltage processor applications.

Manuscript received April 22, 1994; revised November 3, 1994.
U. Ko and W. Lee are with Texas Instruments, Inc., Dallas, TX 75265
USA.
P. T. Balsara is with the Department of Electrical Engineering, University
of Texas at Dallas, Richardson, TX 75083 USA.
IEEE Log Number 9410845.
1063-8210/95$04.00 © 1995 IEEE
328 IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 3, NO. 2, JUNE 1995
In this paper, we present a study and evaluation of various digital circuit families and techniques to optimize performance per watt or energy efficiency as well as the silicon area of an adder. A 32 b carry lookahead adder architecture is implemented and evaluated using five different CMOS circuit styles: full static CMOS [5], complementary pass-transistor logic (CPL) [6], double pass-transistor logic (DPL) [7], dual-rail Domino dynamic logic [8], [9], and single-rail Domino [10], [11] dynamic logic with supply voltages varying from 3.5 V to 1.5 V. These techniques are characterized in terms of power, performance, and area for each circuit family. Lastly, we present the impact of threshold voltage scaling on power and energy efficiencies of these circuit families.

Fig. 1. XOR using high-performance static CMOS.
Fig. 2. XOR using high-performance CPL.
Fig. 3. High-performance CPL XOR with pMOSFET feedback.
Fig. 4. Low-power CPL XOR with pMOSFET feedback.

II. 32 BIT CLA ADDER IMPLEMENTATIONS

In order to compare the different circuit techniques mentioned in Section I, five different 32 b carry lookahead (CLA) adders are designed. The CLA adders are designed using a two-bit group approach with six carry generation stages [12]. The circuit families are briefly described below using the XOR gate, which is the most basic gate in an adder circuit.

A. Full Static CMOS

A schematic diagram of a full static CMOS XOR gate is depicted in Fig. 1. While it is a common CMOS design practice since it involves minimum design risk, the serial nMOSFET or pMOSFET transistors (Fig. 1) tend to demand that their width be increased to obtain a reasonable conducting current to drive capacitive loads. This results in significant area overhead, which also causes high gate input capacitance and therefore high power dissipation. Furthermore, the high input capacitance also loads the previous stage, increasing its delay.

B. Complementary Pass-Transistor Logic (CPL)

Yano et al. [6] described a logic technique in which a boolean function is implemented using a network of nMOSFET pass transistors. A CPL XOR gate is depicted in Fig. 2. When compared to the full static XOR gate, the CPL implementation utilizes only half as many transistors and it avoids the design problem of appropriately sizing series connected transistors in the pull-up or pull-down planes [5]. In CPL, internal dual rail signals ($p$ and $\bar{p}$ in Fig. 2) are typically generated simultaneously to minimize the propagation delay through each circuit block. However, when the outputs of the nMOSFET pass-transistor network are logically high, they are at $V_{dd} - V_t$, which results in the incomplete turn-off of the pMOSFET's in the inverters. This results in a high static through current. To minimize the through current in CPL, a weak pMOSFET feedback device, as depicted in Fig. 3, is added across the output inverter stage to pull the output node of the pass-transistor network to full $V_{dd}$. However, this pMOSFET pull-up increases the propagation delay of the CPL XOR gate. A low-power CPL can be created by combining the two CPL gates from Fig. 3 into one gate, as shown in Fig. 4. However, this leads to an increase in delay of one of the outputs.

C. Double Pass-Transistor Logic (DPL)

Suzuki et al. [7] proposed a double pass-transistor logic (DPL) which is a modified version of CPL. DPL alleviates the CPL problems of noise margin and speed degradation at reduced supply voltages. A high-performance DPL XOR gate is depicted in Fig. 5. A DPL implementation avoids the series sizing (and high gate capacitance) issues of the full static circuits as well as the nMOSFET threshold voltage drop issue of the CPL design, by having both pMOSFET and nMOSFET pass gates. Similar to Fig. 4, a low-power DPL can be created by combining the two gates shown in Fig. 5 into a single gate (except that DPL does not need the feedback pMOSFET).

D. Dual-Rail Domino Logic

Besides the above three static techniques, precharged circuit techniques like Domino [8], [9] are also known for their high throughput. A dual-rail domino XOR gate is illustrated in Fig. 6. The precharging occurs when the CLK signal is at a logic low, and when CLK is at a logic high the evaluation phase starts. A feedback pMOSFET device is added to enhance the noise margin as well as to maintain the logic state. One major advantage of this dynamic, precharged circuit over the previous three static families is that it eliminates all the spurious transitions and the corresponding power dissipation which is inherent in any static logic implementation. However, in dual-rail domino circuits, additional power is dissipated by the gating devices and their clock drivers and distribution network.

Fig. 5. XOR using high-performance DPL.
Fig. 6. XOR using high-performance dual-rail domino.
Fig. 7. NOR using high-performance single-rail domino.
Fig. 8. Single-rail domino preconditioning and carry block diagram.

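The precharge and evaluation phases described for the dual-rail domino XOR can be illustrated with a purely behavioural model. The sketch below is an idealisation and not the authors' circuit: the function names are hypothetical, and it ignores the feedback pMOSFET, charge sharing, and all timing. It only shows that both output rails sit low during precharge and that exactly one rail rises during evaluation, so each output makes at most one transition per clock cycle.

```python
from itertools import product


def dual_rail_domino_xor(clk, a, b):
    """Idealised behavioural model of a dual-rail domino XOR gate.

    Returns the two output rails (q, q_bar) as seen after the output inverters.
    """
    if clk == 0:
        # Precharge: both dynamic nodes are pulled high, so both outputs are low.
        return 0, 0
    # Evaluate: the dual-rail inputs steer exactly one pull-down network.
    a_bar, b_bar = 1 - a, 1 - b
    q = (a & b_bar) | (a_bar & b)        # true rail: a XOR b
    q_bar = (a & b) | (a_bar & b_bar)    # complement rail: a XNOR b
    return q, q_bar


def rising_edges_per_cycle(a, b):
    """Count the output rails that rise when CLK goes from low to high."""
    before = dual_rail_domino_xor(0, a, b)
    after = dual_rail_domino_xor(1, a, b)
    return sum(1 for lo, hi in zip(before, after) if lo == 0 and hi == 1)


if __name__ == "__main__":
    for a, b in product((0, 1), repeat=2):
        q, q_bar = dual_rail_domino_xor(1, a, b)
        print(f"a={a} b={b}: q={q} q_bar={q_bar} "
              f"rising edges={rising_edges_per_cycle(a, b)}")
```

In this model one rail rises in every evaluation phase regardless of the input values, and that rail must be precharged again in the next cycle. This is consistent with the observation above that glitches are eliminated while the clocked devices add their own power cost.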
E. Single-Rail Domino Logic

A closer examination of the domino design revealed that the dual-rail inputs are only needed in the XOR implementation. Other functions, such as NAND and NOR, can be implemented with single-rail inputs only in order to reduce power dissipation. The single-rail architecture is very similar to the dual-rail domino, except that group generation and propagation blocks in a CLA can be implemented using single-rail inputs and outputs. Since dual-rail carry signals are still required for the final sum (XOR) stage, an additional bit needs to be produced and fed to the carry stage to generate dual-rail carry signals from the single-rail propagation and generation inputs. This additional bit, called $z_i$, is defined in [11] as follows

$$z_i = \overline{a_i + b_i}. \tag{1}$$

Fig. 7 depicts a single-rail domino implementation for this additional bit $z_i$. This signal $z_i$ along with the propagate ($p_i$) and generate ($g_i$) signals at bit position $i$ is used to generate the dual-rail carry signals as follows. First, a merge with the subsequent stage is performed in the following manner

$$Z_i = z_i + (z_{i-1} \cdot p_i), \quad \text{where } P_i = p_i \cdot p_{i-1}. \tag{2}$$

Later, in the last stage, the dual-rail carry signal at bit position $i$ is generated as follows

$$c_i = g_i + (c_{i-1} \cdot P_i), \quad \text{and} \quad \bar{c}_i = Z_i + (\bar{c}_{i-1} \cdot P_i). \tag{3}$$

Therefore, the single-rail domino adder needs a modified preconditioning circuit (Fig. 8) in order to produce three signals $p_i$, $g_i$, and $z_i$ instead of the usual two ($p_i$ and $g_i$). The modified merge block will merge the three signals $p_i$, $g_i$, and $z_i$ from the previous stages to create $P_i$ and $Z_i$. The carry cell accepts dual-rail carry-in signals and single-rail $g_i$, $P_i$, and $Z_i$ signals, as shown in Fig. 8. The sum circuit remains the same as in the previous dual-rail domino adder. This implementation reduces the internal signals from four (dual-rail $p_i$ and $g_i$) to three (single-rail $p_i$, $g_i$, and $z_i$), and hence cuts down the total gate capacitance. More importantly, it also reduces the loading of the clock signal because of the smaller number of clocked gates.

Another domino circuit technique, called a multiple-output domino circuit, was proposed by Hwang et al. [10]. This circuit performs well at a 5 V supply voltage, but may suffer in performance at lower supply voltages due to the multiple series nMOSFET's used to implement the multiple-output functions. Furthermore, as demonstrated later in Section III, since all domino adders are far less energy efficient compared to the DPL adder, the multiple-output domino was not added in our study on energy efficient adders.

III. POWER, DELAY, AND AREA COMPARISONS OF CLA ADDERS

For the sake of having a regular physical layout, the adders described above are designed using a two-bit group approach with six carry generation stages [12]. Inputs to all five adders are single-rail signals, and receiving inverters were used to generate complementary signals. Single-rail outputs to drive a capacitive load of 0.2 pF were produced by all five adders. These adders were designed using a 0.6 µm, two level metal CMOS technology. The bit pitch was 25 µm wide for all five adders and the interconnect metal capacitances were estimated using a value of 0.25 pF/mm. Interconnect lengths were
estimated based on size and placements of individual building blocks. These were then used to calculate the interconnect capacitances which were added to appropriate circuit nodes for the SPICE simulation. Power estimation was done by carrying out detailed circuit simulation using SPICE under nominal conditions at 3.3 V, 25°C with an input signal frequency of 100 MHz. Since power dissipation depends on the input and its history, for the sake of our evaluation we used a uniformly distributed randomly generated sequence of input patterns, as used in the available literature [1], [3]. Propagation delay was measured from the carry-in signal to the most significant bit of the sum output, under the same SPICE conditions.

Table I gives a comparison of various CPL methods for a 32 b CLA. The second and third columns, CPL and CPL with pMOSFET#1, indicate that adding the pMOSFET feedback transistor reduces power dissipation to 48% of that in CPL. However, there is a 10% performance degradation and a 2% area overhead due to the feedback device. As shown in the fourth column, which is CPL with pMOSFET#2, a stronger feedback pMOSFET transistor size can also be utilized to further reduce the power with a nominal speed penalty; however, the reduction in power is not as significant. In order to further reduce power dissipation without degrading performance, we re-examined our adder circuit to isolate the speed critical paths. These paths include the nets from carry-in to carries $C_7$, $C_{15}$, $C_{23}$, and $C_{31}$. For the gates on these paths we used the high-performance CPL gates (Fig. 3), whereas for the gates on the other nonspeed critical paths the low-power CPL gates (Fig. 4) were used. Further reduction in power dissipation is still possible, if the power reduction technique suggested in [13] is applied. Using this technique we can reduce the overall power dissipation of a circuit by judiciously choosing drive strengths (i.e., device sizes) of the gates on nonspeed critical paths in that circuit. In this method we first characterize the power dissipation of each gate for various drive strengths using different values of input signal slew and output fanout load. From this characterization we can determine the most suitable drive strength for a gate (i.e., the one that results in the lowest power dissipation) by considering its input slew and output fanout when it is used in a circuit. Since device sizes are only changed on the nonspeed critical paths in a circuit, this method yields an overall power reduction without adversely affecting the performance of that circuit. The fifth column in Table I depicts the results of this optimized CPL CLA adder. Compared to the fourth column of Table I, this results in a power reduction of 31%, an area reduction of 63%, and enhances performance by 6%.

TABLE I
CPL POWER, DELAY, AND AREA EFFICIENCIES FOR DIFFERENT TECHNIQUES
Parameters (Vdd = 3.3 V)
[Table body not recoverable from the source; surviving fragments: "230 237 2.24", "110 113 107", "Percentage 262 139 1 139".]

A summary of results for various DPL implementations is given in Table II. In the table, the optimized DPL was obtained by applying a method similar to the one described above for CPL. The third column of Table II indicates that optimized DPL yields a 46% power reduction with a 7% speed improvement and a 58% area reduction compared to the original DPL. DPL's two separate current paths in charging and discharging output capacitive loads are the main reason for a 13% improvement in speed compared to the optimized CPL implementation from Table I.

TABLE II
DPL POWER, DELAY, AND AREA EFFICIENCIES FOR DIFFERENT TECHNIQUES

Parameters (Vdd = 3.3 V)    DPL       Optimized DPL   Units
Power @ 100 MHz             40.1      27.5            mW
  Percentage                146       100             %
Delay of critical path      2.11      1.98            ns
  Percentage                107       100             %
Energy @ 100 MHz            84.6      54.5            pJ
  Percentage                155       100             %
Area: total tr. width       30242     19116.5         µm
  Percentage                158       100             %

A comparison of the three static styles with the dynamic domino families is depicted in Table III. The columns in this table correspond to the optimized versions of the CLA adder circuits for each of the families. Among the three static circuit styles, full static, CPL, and DPL, the DPL adder is 42%-47% more energy efficient than the other two. As for the silicon area, Table III shows that the DPL adder occupies 47% less area compared to the full static version, and 10% more area compared to the CPL circuit. Compared to the best static style, i.e., the DPL, the dual-rail domino consumes three times as much power but is only 12% faster. This high power dissipation in the domino circuit is due to the additional gating transistors and clock signal drivers required in that circuit. The single-rail domino, on the other hand, consumes about 2.19 times as much power as DPL, and is 21% faster than DPL. Among all the design styles evaluated in Table III we can see that the static families are more energy efficient compared to the domino circuits. Furthermore, the DPL family is most suited for energy efficient, high-performance adder designs.

IV. EFFECT OF SCALING SUPPLY VOLTAGE AT CONSTANT THRESHOLD VOLTAGE

SPICE simulations for different supply voltages using the same CMOS device models with constant threshold voltage were conducted and the results are shown in Figs. 9-11. At lower supply voltages (< 2.5 V), the insufficient pull-up strength of the nMOSFET severely compromises the speed of CPL, and it fails to respond to the 100 MHz input pattern rate at 1.5 V. Hence, except for CPL, the dependency of delay $T_d$ on supply voltage $V_{dd}$ is well modeled by (4) given in [1], [14]

$$T_d \propto \frac{V_{dd}}{(V_{dd} - V_t)^{\alpha}} \tag{4}$$

with $\alpha = 1.6$ which is consistent with the device models used in our SPICE simulations.

After applying the low-power design techniques described in Section III, the DPL adder consumes the lowest power and the single-rail domino adder is the fastest at all supply voltages as shown in Figs. 9-10, respectively. The power-delay product, which is equivalent to the energy required to perform additions, is the lowest in the DPL implementation. Fig. 11 shows that relative to the DPL adder, for a supply voltage range of 3.5 V-1.5 V, the single-rail domino and full static adders consume 81%-71% and 47%-25% excess energy, respectively. On average, a factor of 2.1 improvement in energy efficiency is achieved by scaling the supply voltage from 3.5 V to 1.5 V, without any change in the device technology.
Fig. 9. 32 b adders average power versus supply voltage.
Fig. 10. 32 b adder delay versus supply voltage.

TABLE III

Parameters (Vdd = 3.3 V)    Static    CPL       DPL       Dual-Rail   Single-Rail   Units
                                                          Domino      Domino
Power @ 100 MHz             34.3      34.5      27.5      82.5        60.2          mW
  Percentage                125       125       100       300         219           %
Delay of critical path      2.33      2.24      1.98      1.78        1.64          ns
  Percentage                142       137       121       109         100           %
Energy @ 100 MHz            79.9      77.3      54.5      146.9       98.7          pJ
  Percentage                147       142       100       270         181           %
Area: pMOSFET width         17700     8295.6    11935.5   14716.4     12064.4       µm
      nMOSFET width         9655      9141.5    7181      17728       14862         µm
      total tr. width       27355     17437.1   19116.5   32444.4     26926.4       µm
  Percentage                157       100       110       186         154           %

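The energy rows of Table III follow from the power and delay rows, since power in mW multiplied by delay in ns gives energy in pJ. The sketch below recomputes them as a consistency check. The dictionary and function names are hypothetical, and the DPL power is taken as the 27.5 mW reported for the optimized DPL in Table II.

```python
# (power in mW at 100 MHz, critical-path delay in ns) per circuit family.
TABLE_III = {
    "static": (34.3, 2.33),
    "CPL": (34.5, 2.24),
    "DPL": (27.5, 1.98),
    "dual-rail domino": (82.5, 1.78),
    "single-rail domino": (60.2, 1.64),
}


def energy_pj(power_mw, delay_ns):
    """Power-delay product; mW * ns = pJ."""
    return power_mw * delay_ns


def energy_relative_to(reference="DPL"):
    """Energy of every family as a percentage of the reference family."""
    ref = energy_pj(*TABLE_III[reference])
    return {name: 100.0 * energy_pj(*pd) / ref for name, pd in TABLE_III.items()}


if __name__ == "__main__":
    percentages = energy_relative_to("DPL")
    for name, (power, delay) in TABLE_III.items():
        print(f"{name:20s} {energy_pj(power, delay):7.1f} pJ "
              f"{percentages[name]:5.0f} %")
```

The same power-delay product is what the paper uses as its energy-efficiency metric, so a lower value here means a more energy efficient adder.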
V. EFFECT OF SCALING THRESHOLD VOLTAGE WITH SUPPLY VOLTAGE

The amplitudes of most internal noises, such as those generated from interconnect coupling and ground bounce, scale proportionally with supply voltages. Therefore, the threshold voltage of MOS devices can be scaled proportionally with supply voltages to reduce the logic noise margins at low supply voltages, thereby maintaining the same signal to noise margin ratio. Assuming that the threshold voltage can be kept at 1/5 of the supply voltage while using the same 0.6 µm technology, a series of SPICE simulations were carried out and the results are shown in Figs. 12-14. With this particular scaling rule, the threshold voltage will be 0.2 V when the supply voltage is 1 V. This low value of threshold voltage will lead to a high leakage current even when the gate bias is at 0 V. The contribution from this high off-current to the total power dissipation of these adders running at 100 MHz is around 11%-18% for the five CLA adders. In applications where the duty cycle of the circuit is much longer, the contribution of the off-current will increase and an active substrate bias will be needed to reduce the leakage current when the device is in the off state.

By simultaneous $V_t$ scaling, the CPL adder benefits significantly and is fully functional at 100 MHz down to 1 V supply voltage (Fig. 12). The speed degradation for all circuit families at low supply voltages is also eased considerably, as evident in Fig. 13. This can be predicted by (4). By keeping a constant $V_t/V_{dd}$ ratio, the delay $T_d \propto V_{dd}^{(1-\alpha)}$. For $\alpha = 1.6$, $T_d \propto V_{dd}^{-0.6}$, which is a weaker $V_{dd}$ dependency than the previous case of scaling supply voltage without corresponding $V_t$ scaling. Consequently, an adder delay under 4.5 ns is achievable at 1 V supply. Further scaling of FET channel length to keep the devices operating in the velocity saturation regime ($\alpha$ approaches unity) will bring a less severe speed degradation at low voltages. At 1 V supply voltage or 0.2 V threshold voltage, single-rail and dual-rail domino adders are 27% and 8% faster than the DPL adder, respectively. For 3.5 V-1.0 V supply voltage, DPL is the most energy efficient, and dual-rail and single-rail domino consume about 2.78-2.63 times and 1.8-1.62 times as much energy as DPL, respectively. Full static and CPL consume 1.48-1.45 and 1.24-1.23 times as much energy relative to DPL. On the whole, with simultaneous scaling of supply voltage and threshold voltage, energy efficiency is improved by a factor of 4.4 when the supply voltage is reduced from 3.5 V to 1.0 V.
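The weaker supply dependence under simultaneous threshold scaling follows from the delay model once the threshold tracks the supply. A short derivation, assuming (4) has the alpha-power form of [14] and writing the fixed ratio as $k = V_t/V_{dd}$ (the symbol $k$ is introduced here only for illustration):

```latex
\begin{align*}
T_d &\propto \frac{V_{dd}}{(V_{dd} - V_t)^{\alpha}}
     = \frac{V_{dd}}{(V_{dd} - k V_{dd})^{\alpha}}
     = \frac{V_{dd}^{\,1-\alpha}}{(1-k)^{\alpha}}, \\
\alpha = 1.6,\ k = \tfrac{1}{5}: \quad
T_d &\propto V_{dd}^{-0.6}, \qquad
\frac{T_d(1.0\,\mathrm{V})}{T_d(3.5\,\mathrm{V})} = 3.5^{0.6} \approx 2.1 .
\end{align*}
```

Under this idealised model the factor $(1-k)^{\alpha}$ is a constant, so only the $V_{dd}^{\,1-\alpha}$ term changes with the supply, and the delay grows by roughly a factor of two between 3.5 V and 1.0 V. The simulated results in Fig. 13 remain the reference for the actual adders.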
Fig. 11. 32 b adders energy versus supply voltage.
Fig. 12. Dependency of power on Vt scaling (horizontal axis: threshold voltage, 1/5 of supply voltage, in volts).
Fig. 13. Dependency of delay on Vt scaling (horizontal axis: threshold voltage, 1/5 of supply voltage, in volts).
Fig. 14. Dependency of energy on Vt scaling (horizontal axis: threshold voltage, 1/5 of supply voltage, in volts).

VI. CONCLUSION

Although CPL uses fewer transistors than the other four static and dynamic circuit styles to implement the same logic functions, the partial swing at the intermediate nodes results in more than 50% of the power being wasted. Reduction in the $V_t$ of the nMOSFET pass transistors has been proposed to ease this problem [1], but it will reduce the noise margin to an unacceptable level at low supply voltages. An additional feedback pMOSFET device in the inverter stage, combined with both the low-power CPL and the high-performance version and the technique in [13], yields an improvement by a factor of 2.62 in energy efficiency and a 59% reduction in area, compared to the original CPL. Because of the presence of both nMOSFET and pMOSFET devices, all nodes in DPL have a full voltage swing and there is no static short-circuit current problem. Dual current paths in the DPL implementation also improve performance. With similar techniques to those applied in CPL (except for the feedback pMOSFET), the optimized DPL's energy efficiency is increased by a factor of 1.55 with a 58% area reduction when compared to the original DPL.

Precharged circuits like domino offer a performance advantage over DPL (the fastest of the three static circuit families investigated) of 12%-21%, but at the expense of burning 300%-219% power relative to DPL. Compared to DPL, dual-rail and single-rail domino consume 270% and 181% energy and require 76% and 44% more silicon area, respectively.

In summary, the improved CPL uses the least amount of silicon area, and single-rail domino offers the fastest performance. The improved DPL is the most energy efficient circuit and is suitable for applications in which power and performance are equally important.

ACKNOWLEDGMENT

The authors would like to thank the anonymous referees for their helpful comments in improving the presentation of this paper.

REFERENCES

[1] A. Chandrakasan, S. Sheng, and R. Brodersen, "Low-power CMOS digital design," IEEE J. Solid-State Circuits, vol. 27, pp. 473-484, 1992.
[2] C. Nagendra, R. M. Owens, and M. J. Irwin, "Power-delay characteristics of CMOS adders," IEEE Trans. VLSI Syst., vol. 2, pp. 377-381, 1994.
[3] T. Callaway and E. Swartzlander, "Estimating the power consumption of CMOS adders," in Proc. IEEE Symp. Comput. Arith., 1993, pp. 210-219.
[4] Preliminary Product Information, "MIPS technology R4200 microprocessor," 1993.
[5] N. Weste and K. Eshraghian, Principles of CMOS VLSI Design. Reading, MA: Addison-Wesley, 1993.
[6] K. Yano, T. Yamanaka, T. Nishida, M. Saito, K. Shimohigashi, and A. Shimizu, "A 3.8-ns CMOS 16 x 16-bit multiplier using complementary pass-transistor logic," IEEE J. Solid-State Circuits, vol. 25, pp. 388-395, 1990.
[7] M. Suzuki, K. Shinbo, T. Yamanaka, A. Shimizu, K. Sasaki, and Y. Nakagome, "A 1.5-ns 32-b CMOS ALU in double pass-transistor logic," IEEE J. Solid-State Circuits, vol. 28, pp. 1145-1151, 1993.
[8] R. H. Krambeck, C. M. Lee, and H. S. Law, "High-speed compact circuits with CMOS," IEEE J. Solid-State Circuits, vol. SC-17, no. 3, pp. 614-619, 1982.
[9] J. Yetter, B. Miller, W. Jaffe, and E. DeLano, "A 100 MHz superscalar PA-RISC CPU/coprocessor chip," in Proc. Symp. VLSI Circ. Dig. Tech. Papers, 1992, pp. 12-13.
[10] I. Hwang and A. Fisher, "Ultrafast compact 32 b CMOS adder in multiple-output domino logic," IEEE J. Solid-State Circuits, vol. 24, pp. 358-369, 1989.
[11] E. Hokenek, R. Montoye, and P. Cook, "Second-generation RISC floating point with multiply-add fused," IEEE J. Solid-State Circuits, vol. 25, pp. 1207-1212, 1990.
[12] R. Brent and H. Kung, "A regular layout for parallel adders," IEEE Trans. Comput., vol. C-31, pp. 260-264, 1982.
[13] U. Ko and P. T. Balsara, "Short-circuit power driven gate sizing technique for reducing power dissipation," IEEE Trans. VLSI Syst., to be published.
[14] T. Sakurai and A. R. Newton, "Delay analysis of series-connected MOSFET circuits," IEEE J. Solid-State Circuits, vol. 26, pp. 122-131, 1991.