Low-Power Digital VLSI Design

1
LOW-POWER VLSI DESIGN: AN OVERVIEW
1.1 WHY LOW-POWER?

Historically, VLSI designers have used circuit speed as the "performance" metric. Large gains, in terms of performance and silicon area, have been made for digital processors, microprocessors, DSPs (Digital Signal Processors), ASICs (Application Specific ICs), etc. In general, "small area" and "high performance" are two conflicting constraints. The IC designers' activities have been involved in trading off these constraints. Power dissipation was not a design criterion but an afterthought. In fact, power considerations have long been the ultimate design criterion only in special portable applications such as wristwatches and pacemakers. The objective in these applications was minimum power for maximum battery lifetime.
Recently, power dissipation has become an important design constraint. Several reasons underlie the emergence of this issue. Among them we cite: Battery-powered systems such as laptop/notebook computers, electronic organizers, etc. The need for low power in these systems arises from the need to extend battery life. Many portable electronics use rechargeable Nickel Cadmium (NiCd) batteries. Although the battery industry has been making efforts to develop batteries with higher energy capacity than that of NiCd, a substantial increase does not seem imminent. The expected improvement of the energy density is 40% by the turn of the century. With recent NiCd batteries, the energy density is around 20 Watt-hour/pound and the voltage is around 1.2 V. So, for example, for a notebook consuming a typical power of 10 Watts and using 1.5 pounds of batteries, the time of operation between recharges is 3 hours. Even with advanced battery technologies, such as Nickel-Metal Hydride (Ni-MH), which provide large energy density characteristics (~30 Watt-hour/pound), the lifetime of the battery is still low. Since battery technology has offered limited improvement, low-power design techniques are essential for portable devices.
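The notebook example above is a simple energy budget: stored energy (energy density times battery weight) divided by the load power. The short Python sketch below reproduces the 3-hour figure. The function name is ours, and applying the same 1.5 pound, 10 Watt notebook to the Ni-MH density is an assumption made only for comparison.

```python
def operating_hours(energy_density_wh_per_lb, battery_weight_lb, load_watts):
    """Hours between recharges for a constant load, ignoring conversion losses."""
    stored_energy_wh = energy_density_wh_per_lb * battery_weight_lb
    return stored_energy_wh / load_watts

# NiCd figures quoted in the text: 20 Wh/lb, 1.5 lb of batteries, 10 W load.
nicd_hours = operating_hours(20.0, 1.5, 10.0)

# Assumption: the same notebook with Ni-MH cells at about 30 Wh/lb.
nimh_hours = operating_hours(30.0, 1.5, 10.0)

print(f"NiCd:  {nicd_hours:.1f} h")
print(f"Ni-MH: {nimh_hours:.1f} h")
```

Under these assumptions the better chemistry adds only about an hour and a half, which is why reducing the load power itself is the more effective lever.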
Low-power design is needed not only for portable applications but also to reduce the power of high-performance systems. With large integration density and improved speed of operation, systems with high clock frequencies are emerging. These systems use high-speed products such as microprocessors. The cost associated with the packaging, cooling and fans required by these systems to remove the heat is increasing significantly. Table 1.1 shows the power consumption of various microprocessors that operate in the frequency range of 66 to 300 MHz. This table demonstrates that, at higher frequencies, the power dissipation is excessive.
Another issue related to high power dissipation is reliability. With the generation of high on-chip temperatures, failure mechanisms are provoked [8]. Among them, we cite silicon interconnect fatigue, package related failure, electrical parameter shift, electromigration, junction fatigue, etc.
In addition, there is a trend to keep computers from using more than a 5% share of the total US power budget [9]. Note that 50% of office power is used by PCs. Since processor frequency is increasing, which results in increased power, low-power design techniques are prerequisites.
The power dissipation issues and the devices' reliability problems, when they are scaled down to 0.5 µm and below, have driven the electronics industry to adopt a supply voltage lower than the old standard, 5 V. The new industry standard for IC operating voltage is 3.3 V (±10%). The effect of lowering the voltage to much lower values can be impressive in terms of power saving. Not only the power is reduced, but also the weight and volume associated with batteries in battery-operated systems.
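To give a feeling for the saving obtained by lowering the supply voltage, the sketch below uses the standard first-order expression for CMOS dynamic power, P = a C VDD^2 f. This expression is not stated in this section and is used here as an assumption. The function names and the 1 pF, 100 MHz example values are ours and purely illustrative.

```python
def dynamic_power(c_farads, vdd_volts, f_hz, activity=1.0):
    """First-order dynamic power: activity * C * VDD^2 * f (assumed model)."""
    return activity * c_farads * vdd_volts ** 2 * f_hz

def relative_saving(vdd_old, vdd_new):
    """Fraction of dynamic power saved at fixed C, f and activity."""
    return 1.0 - (vdd_new / vdd_old) ** 2

# Moving from the old 5 V standard to 3.3 V at unchanged frequency and load.
saving_5_to_3v3 = relative_saving(5.0, 3.3)
print(f"Saving from 5 V to 3.3 V: {saving_5_to_3v3:.1%}")

# Illustrative absolute numbers: a hypothetical 1 pF node switching at 100 MHz.
p_5v = dynamic_power(1e-12, 5.0, 100e6)
p_3v3 = dynamic_power(1e-12, 3.3, 100e6)
print(f"{p_5v * 1e3:.2f} mW at 5 V, {p_3v3 * 1e3:.2f} mW at 3.3 V")
```

Because of the quadratic dependence, the 5 V to 3.3 V step alone removes more than half of the dynamic power in this model, provided the clock frequency can be maintained.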
1.2 LOW-POWER APPLICATIONS

Low-power design is becoming a new era in VLSI technology, as it impacts many applications, such as: Battery-powered portable systems, for example notebooks, palmtops, CDs, language translators, etc. These systems represent an important growing market in the computer industry. High-performance capabilities, comparable to those of desktops, are demanded. Several low-power microprocessors have been designed for these computers. Table 1.2 shows some examples of these low-power processors. However, these circuits still consume significant power, on the order of 1 to 3 Watts. These systems have their power dissipation dominated by I/O devices such as hard disk drives and LCD displays. The total expected power dissipation of notebooks is 2 Watts with 4 pounds weight and daily recharge.

Table 1.2 Examples of low-power microprocessors.
Processor      Clock (MHz)   Technology (µm)   VDD (V)   Power (W)   Ref.
PowerPC 603    80            0.5               3.3       2.2         [10]
IBM 486SLC2    66            0.8               3.3       1.8         [11]
MIPS R4200     80            0.64              3.3       1.8         [12]

Electronic pocket communication products such as cordless and cellular telephones, PDAs (Personal Digital Assistants), pagers, etc. Table 1.3 shows a battery analysis for a handheld cellular system. Low power is crucial for extending the battery life of these systems. Also, battery improvement is needed. PDAs require a large amount of data processing with multimedia capabilities. The expected power of PDAs is around 0.5 Watt with 0.5 pound weight. Also, the expected power for pagers is 10 mW with 0.125 pound weight.
Table 1.3 Battery analysis for a handheld cellular system.
Example            Handheld cellular (Motorola Microtac)
RF power           600 mW
Battery            750 mAh secondary NiCd
Battery life       75 minutes talk time, 20 hours standby
Total power load   650 mA × 6 V = 3900 mW
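The entries of Table 1.3 can be cross-checked with a few lines of arithmetic. The sketch below is an idealized estimate: it assumes a constant current draw and ignores the dependence of usable capacity on discharge rate, so it is only expected to land near the listed talk time. The function names are ours.

```python
def load_power_mw(current_ma, voltage_v):
    """Power drawn from the battery in mW (mA times V)."""
    return current_ma * voltage_v

def runtime_minutes(capacity_mah, current_ma):
    """Idealized runtime at constant current, ignoring rate-dependent losses."""
    return 60.0 * capacity_mah / current_ma

talk_power_mw = load_power_mw(650, 6)        # total power load in talk mode
talk_minutes = runtime_minutes(750, 650)     # estimate from the 750 mAh capacity

# Average standby current implied by 20 hours of standby on the same battery.
standby_current_ma = 750 / 20

print(f"Talk power: {talk_power_mw} mW")
print(f"Estimated talk time: {talk_minutes:.0f} min (table lists 75 min)")
print(f"Implied standby current: {standby_current_ma:.1f} mA")
```

The estimate of roughly 69 minutes is of the same order as the 75 minutes listed, and the 600 mW of RF power is only a fraction of the 3900 mW total load.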
SubGHz processors for high-performance workstations and computers. Systems of 100 MHz and over are emerging, and 500 MHz and higher will be common by the end of the decade. Since the power consumed increases with frequency, processors with new architectures and circuits optimized for low power are crucial.
Other applications such as WLANs (Wireless Local Area Networks) and electronic goods (calculators, hearing aids, watches, etc.).
1.3 LOW-POWER DESIGN METHODOLOGY

In order to optimize the power dissipation of digital systems, a low-power methodology should be applied throughout the design process, from system level to process level, while realizing that performance is still essential. During optimization, it is very important to know the power distribution within a processor. Thus, the parts or blocks consuming an important fraction of the power are properly optimized for power saving. Fig. 1.1 shows the different design levels of an integrated system. The process technology is under the control of the device/process designer. However, the other levels are controlled by the circuit designer.
1.3.1 Power Reduction Through Process Technology

Figure 1.1 Power reduction design space.

One way to reduce the power dissipation is to reduce the power supply voltage. However, the delay increases significantly, particularly when VDD approaches the threshold voltage. To overcome this problem, the devices should be scaled properly. The advantages of scaling for low-power operation are the following:
- Improved device characteristics for low-voltage operation. This is due to the improvement of the current drive capabilities;
- Reduced capacitances through small geometries and junction capacitances;
- Improved interconnect technology;
- Availability of multiple and variable threshold devices. This results in good management of the active and standby power trade-off; and
- Higher density of integration. It was shown that the integration of a whole system into a single chip provides orders of magnitude in power savings.
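The delay penalty mentioned at the start of this subsection can be illustrated with a common first-order (long-channel) approximation in which gate delay is proportional to VDD / (VDD - VT)^2. This model, the 0.7 V threshold and the 5 V reference are assumptions chosen for illustration and are not taken from this chapter.

```python
def relative_delay(vdd, vt=0.7, vdd_ref=5.0):
    """Gate delay relative to operation at vdd_ref.

    Assumed first-order model: delay ~ VDD / (VDD - VT)^2.
    """
    if vdd <= vt or vdd_ref <= vt:
        raise ValueError("supply voltage must exceed the threshold voltage")

    def raw(v):
        return v / (v - vt) ** 2

    return raw(vdd) / raw(vdd_ref)

for vdd in (5.0, 3.3, 2.0, 1.5, 1.0):
    print(f"VDD = {vdd:.1f} V -> delay x{relative_delay(vdd):.1f}")
```

In this model the slowdown is mild between 5 V and 3.3 V but grows rapidly as VDD comes within a few hundred millivolts of VT, which is why a lower supply has to be accompanied by properly scaled devices.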
Table 1.4 shows the effect of scaling on microprocessor performance [14]. The power dissipation can be reduced by one order of magnitude at a fixed frequency of operation.
Table 1.4 Effect of scaling on microprocessor performance [14].
L (µm)        0.50        0.35      0.25        0.15
Leff (µm)     0.35        0.25      0.15        0.10
VDD (V)       3.3         2.5       1.8         1.5
Area (mm²)    8 × 10      5.6 × ?   4 × 5       2.5 × 3
Clock (MHz)   100         150       225         330
Power (W)     5.0         3.3       2.35        1.5
At a fixed clock of 100 MHz:
Area (mm²)    6.4 × 8.4   4.5 × 6   3.2 × 4.2   2 × 2.5
Power (W)     5.0         2.2       1           0.45
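The fixed-frequency power figures in Table 1.4 are consistent with rescaling each full-speed power value linearly to the 100 MHz clock of the first generation. The sketch below performs this check. The linear dependence of power on clock frequency is our assumption, and the numbers are the clock and power entries as read from the table.

```python
# Clock (MHz) and power (W) per technology generation, as read from Table 1.4.
clock_mhz = [100, 150, 225, 330]
power_w = [5.0, 3.3, 2.35, 1.5]

def power_at_fixed_clock(power, clock, ref_clock=100.0):
    """Rescale power to ref_clock, assuming power is proportional to frequency."""
    return power * ref_clock / clock

normalized_w = [power_at_fixed_clock(p, f) for p, f in zip(power_w, clock_mhz)]
reduction = normalized_w[0] / normalized_w[-1]

for f, p in zip(clock_mhz, normalized_w):
    print(f"{f} MHz design run at 100 MHz: {p:.2f} W")
print(f"Overall reduction at fixed clock: about {reduction:.0f}x")
```

The rescaled values (5.0, 2.2, about 1.0 and about 0.45 W) match the fixed-frequency row of the table and correspond to the one order of magnitude reduction quoted in the text.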
1.3.2 Power Reduction Through Circuit/Logic Design

To minimize the power at the circuit/logic level, many techniques can be used, such as:
- Use of more static style over dynamic style;
- Reduce the switching activity by logic optimization;
- Optimize clock and bus loading;
- Clever circuit techniques that minimize device count and internal swing;
- Custom design may improve the power, however, the design cost increases;
- Reduce VDD in non-critical paths and proper transistor sizing;
- Use of multi-VT logic circuits; and
- Re-encoding of sequential circuits.
1.3.3 Power Reduction Through Architectural Design

At the architecture level, several approaches can be applied to the design:
- Power management techniques where unused blocks are shut down;
- Low-power architectures based on parallelism, pipelining, etc.;
- Memory partition with selectively enabled blocks;
- Reduction of the number of global busses; and
- Minimization of instruction set for simple decoding and execution.
1.3.4 Power Reduction Through Algorithm Selection

Among the techniques to minimize the power at the algorithmic level, we cite:
- Minimizing the number of operations and hence the number of hardware resources; and
- Data coding for minimum switching activity.
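As a small illustration of data coding for minimum switching activity, the sketch below counts the bit transitions on a 4-bit bus that steps through the values 0 to 15, once in plain binary and once in Gray code. Gray code is our choice of example, since the text does not name a particular code, and the helper names are ours.

```python
def to_gray(n):
    """Binary-reflected Gray code of a non-negative integer."""
    return n ^ (n >> 1)

def bit_transitions(sequence):
    """Total number of bit flips between consecutive words of a sequence."""
    return sum(bin(a ^ b).count("1") for a, b in zip(sequence, sequence[1:]))

counts = list(range(16))                       # a 4-bit counter sequence
binary_toggles = bit_transitions(counts)
gray_toggles = bit_transitions([to_gray(n) for n in counts])

print(f"Binary coding: {binary_toggles} bit transitions")
print(f"Gray coding:   {gray_toggles} bit transitions")
```

For sequential data such as addresses, the Gray-coded bus flips exactly one bit per step, so the number of transitions (and the switched capacitance that goes with them) drops from 26 to 15 in this example. For random data the benefit would not be the same.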
1.3.5 Power Reduction in System Integration

The system level is also important to the whole process of power optimization. Some techniques are:
- Utilize low system clocks. Higher frequencies are generated with on-chip phase locked loops; and
- High level of integration. Integrate off-chip memories (ROM, RAM, etc.) and other ICs such as digital and analog peripherals.
1.4 THIS BOOK
This book is an early contribution to the field of low-power digital VLSI circuit and system design. It targets two types of audiences: senior undergraduate and postgraduate university students, and VLSI circuit and system designers working in industry. In this book we have tried to cover the basics of VLSI systems, from the process technologies and device modeling to the architecture level. The fundamentals of power dissipation in CMOS circuits are presented to provide the readers with sufficient background to be familiar with the low-power design world. Several practical circuit examples and low-power techniques, mainly in CMOS technology, are discussed. Also, low-voltage issues for digital CMOS and BiCMOS circuits are emphasized. This book also provides an extensive study of advanced CMOS subsystem design. Various power minimization techniques, at the circuit, logic, architecture and algorithm levels, are presented. Finally, the book includes a rich list of references, treating advanced topics, at the end of each chapter. This allows the readers to study, in depth, any topics they find interesting. This book is organized into eight chapters. The first chapter is an introduction to low-power design. The other chapters are presented in the following sections.
1.4.1 Low-Voltage Process Technology

Chapter 2 deals with CMOS bulk, bipolar, BiCMOS and CMOS Silicon On Insulator (SOI) process technologies. Several CMOS technologies (N-well and twin-tub) and low-voltage CMOS enhancements are reviewed. Bipolar technology, with emphasis on advanced structures, is considered. The topic of the isolation techniques used for both bipolar and CMOS is addressed. Three BiCMOS technologies, with different performance/cost, are presented. A complementary BiCMOS structure, where a vertical isolated PNP transistor is merged with an NPN transistor in a CMOS process, is introduced. The design rules of a 0.8 µm BiCMOS process are supplied. Finally, SOI technology is reviewed for low-voltage and low-power applications.
1.4.2 Low-Voltage Device Modeling

Chapter 3 addresses the topic of device modeling. This topic is of interest to those readers who need to analyze, design and/or simulate circuits. It introduces commonly used models of both MOS and bipolar devices. In this chapter we consider simple analytical models which can be used for circuit analysis and design of deep-submicrometer MOSFETs at low voltage. Also, a simple model to compute the leakage current, hence the static power dissipation, of MOSFETs is discussed. The SPICE device models of a 0.8 µm CMOS/BiCMOS process are also presented. This should help the reader to appreciate the meaning of the model parameters as well as to analyze the power and delay of the low-voltage circuits presented throughout the book. Supply voltage scaling, due to reliability and power dissipation issues, is presented.
1.4.3 Low-Voltage Low-Power VLSI CMOS Circuit Design

Chapter 4 focuses on CMOS logic circuit design. The sources of power dissipation in these circuits are reviewed. Simple models for delay and power dissipation estimation are presented. The concept of switching activity is introduced and examples are given. The power dissipation due to spurious transitions is described. Several CMOS design styles, such as pseudo-NMOS, dynamic and NO RAce (NORA) logics, are studied. Guidelines for low-power physical design are presented. Other circuit variations of the static complementary CMOS, which are suitable for low-power applications, are discussed. This includes the pass-transistor logic family such as Complementary Pass-transistor Logic (CPL), Dual Pass-transistor Logic (DPL), and Swing Restored Pass-transistor Logic (SRPL). An overview of clocking strategy in VLSI systems is also covered. Included in this chapter is one important area, the I/O circuits, whose power dissipation is also analyzed. Finally, techniques to reduce static and dynamic power components for CMOS design are reviewed. This chapter is intended to provide the readers with sufficient background in low-power circuit design.
1.4.4 Low-Voltage VLSI BiCMOS Circuit Design

A variety of BiCMOS logic circuits suitable for 3.3 V and sub-3.3 V operation are presented in Chapter 5. The chapter starts with the introduction of the conventional BiCMOS (totem-pole) gate which was used in 5 V applications. The degradation of this gate with supply voltage scaling is demonstrated. The BiNMOS family suitable for low-voltage applications (3.3 to 2 V range) is introduced. It is shown that it provides better performance and delay-power product than CMOS at these voltages, even at low fan-out. Other logic families for low power supply voltage operation are also discussed. Finally, this chapter presents several low-voltage applications of BiCMOS.
¹ SPICE is the most commonly used circuit simulator.
1.4.5 Low-Power CMOS Random Access Memory Circuits

The objective of Chapter 6 is two-fold. It is intended to present circuit techniques for active and standby power reduction in static and dynamic RAMs, and to apply the concepts behind these techniques to other applications, because RAMs have seen remarkable and rapid progress in power reduction. These techniques are applied at the architectural and circuit levels. Several advanced circuit structures and memory organizations are described. Circuits operating at a power supply as low as 1 V are also discussed. The Voltage Down Converters (VDCs) used as DC-DC converters are also treated. Their low-power aspects are investigated.
1.4.6 VLSI CMOS SubSystem Design

Chapter 7 presents a subsystem view of CMOS design. A variety of building blocks of VLSI systems such as adders, multipliers, ALUs, data paths, ROMs, PLAs, etc. are discussed. Several options for each subsystem are presented with emphasis on power dissipation. The use of PLLs in high-speed CMOS systems for deskewing the internal clock is also examined. Low-power issues of CMOS subsystems are also included.
1.4.7 Low-Power VLSI Design Methodology

In Chapter 8, advanced techniques to reduce the dynamic power component at several levels of design are presented. Lowering the power supply voltage while maintaining the performance is one technique for power reduction addressed extensively in this chapter. It is shown that low-power techniques at the high level (algorithmic and architectural) of the design lead to a power saving of several orders of magnitude. Several examples are included to give the reader a clear picture of low-power design aspects. In addition, the power estimation techniques, at the circuit, logical, architectural and behavioral levels, are overviewed. The goal of power estimation is to optimize power, meet requirements and know the power distribution through the chip.
REFERENCES
[1] Special Report, "The New Contenders," IEEE Spectrum, pp. 20-25, December 1993.
[2] D. W. Dobberpuhl et al., "A 200-MHz 64-b Dual-Issue CMOS Microprocessor," IEEE J. Solid-State Circuits, vol. 27, no. 11, pp. 1555-1567, November 1992.
[3] W. J. Bowhill et al., "A 300MHz 64b Quad-Issue CMOS RISC Microprocessor," IEEE International Solid-State Circuits Conf., Tech. Dig., pp. 182-183, February 1995.
[4] "Technology 1995: Solid State," IEEE Spectrum, pp. 35-39, January 1995.
[5] D. Bearden et al., "A 133 MHz 64b Four-Issue CMOS Microprocessor," IEEE International Solid-State Circuits Conf., Tech. Dig., pp. 174-175, February 1995.
[6] MIPS Press release, 1994.
[7] A. Charnas et al., "A 64b Microprocessor with Multimedia Support," IEEE International Solid-State Circuits Conf., Tech. Dig., pp. 178-179, February 1995.
[8] C. Small, "Shrinking Devices Put the Squeeze on System Packaging," EDN, vol. 39, no. 4, pp. 41-46, February 1994.
[9] P. Verhofstadt, "Keynote Address," IEEE Symposium on Low Power Electronics, Tech. Dig., October 1994.
[10] G. Gerosa et al., "A 2.2 W 80 MHz Superscalar RISC Microprocessor," IEEE Journal of Solid-State Circuits, vol. 29, no. 12, pp. 1440-1454, December 1994.
[11] R. Bechade et al., "A 32b 66MHz Microprocessor," IEEE International Solid-State Circuits Conference, Tech. Dig., pp. 208-209, February 1994.
[12] N. K. Yeung, Y-H. Sutu, T. Y-F. Su, E. T. Pak, C-C. Chao, S. Akki, D. D. Yau, and R. Ladenquai, "The Design of a 55SPECint92 RISC Processor under 2W," IEEE International Solid-State Circuits Conference, Tech. Dig., pp. 206-207, February 1994.
[13] S. Lipoff and A. D. Little, "Evaluation of New Battery Technology in Selected Applications," IEEE Workshop on Low-Power Electronics, Phoenix, AZ, August 1993.
[14] J. M. C. Stork, "Technology Leverage for Ultra-Low Power Information Systems," IEEE Symposium on Low Power Electronics, Tech. Dig., pp. 52-55, October 1994.
2
LOW-VOLTAGE PROCESS TECHNOLOGY
This chapter serves as an introduction to IC fabrication of CMOS bulk, bipolar, BiCMOS and CMOS SOI devices, including sub-micron devices for low-voltage applications. Section 2.1 is a review of CMOS process technologies. Examples of an N-well CMOS process and a twin-tub CMOS process are considered. Section 2.2 deals with bipolar technology with emphasis on advanced bipolar structures. The topic of the isolation techniques used for both bipolar and CMOS is addressed in Section 2.3. In Section 2.4 we discuss the similarities between advanced CMOS and advanced bipolar transistor structures to demonstrate how both technologies are indeed converging. The BiCMOS technologies are introduced in Section 2.5, with emphasis on CMOS-based processes. Three BiCMOS technologies, with different performance/cost, are presented. Section 2.6 introduces a complementary BiCMOS structure, where a vertical isolated PNP transistor is merged with an NPN transistor in a CMOS process. In Section 2.7, a table with the design rules of a generic 0.8 µm BiCMOS process is supplied. Finally, in Section 2.8, SOI technology is reviewed for low-voltage applications.
2.1 CMOS PROCESS TECHNOLOGY

The idea of CMOS was first proposed by Wanlass and Sah [1]. In the 1980's, it was widely acknowledged that CMOS is the technology for VLSI because of its unique advantages, such as low power, high noise margin, wider temperature and voltage operation range, overall circuit simplification and layout ease. The development of VLSI in the 80's has driven the integration density to millions of transistors on a single chip.
In this section we review two CMOS bulk technologies: N-well and twin-tub processes. Other processes, such as retrograde-well technology, are not discussed.
2.1.1 N-well CMOS Process

In the N-well CMOS process, the P-channel transistor is formed in the N-well itself and the N-channel transistor in the P-substrate. Fig. 2.1 illustrates cross-sectional views and process steps of a typical N-well process.
The process starts by growing an oxide on the wafer. The oxide is then patterned to open N-well windows. Phosphorus atoms are implanted into the silicon, followed by a high-temperature annealing to diffuse the well [Fig. 2.1(a)]. The LOCOS (Local Oxidation of Silicon)¹ technique is used to isolate the different active areas. After removing the nitride used in the LOCOS process, a photoresist layer is deposited and is then patterned by a P-well mask (new mask). This is followed by low energy ion implantation of boron (B I/I) to adjust the threshold voltage of the N-channel transistor [Fig. 2.1(b)]. A second ion implantation can be applied to eliminate punchthrough in the short channel device. Similarly, the threshold voltage of the P-channel transistor is adjusted [Fig. 2.1(c)]. A thin gate oxide is then grown and a layer of polysilicon is deposited and doped with phosphorus. The polysilicon is patterned to form the gates of all the transistors and the interconnection layer [Fig. 2.1(d)]. The source and drain regions are then implanted by using a photoresist mask. Boron is used for the P+ regions of the P-channel transistors and arsenic for the N-channel transistors [Fig. 2.1(e)]. The N+ and P+ regions are also used as N- and P-well contacts, respectively. The photoresist is removed and a thick oxide is deposited by Chemical Vapor Deposition (CVD) as an isolation layer between the polysilicon layer and the subsequent metal layer. Contact holes are opened in the oxide layer and metal (usually aluminum) is deposited on the whole wafer. At this stage, the metal is patterned and annealed at relatively low temperature (450 °C) [Fig. 2.1(f)]. One or two other metal layers are usually added. At the end, the wafer is passivated and windows are patterned over the metal bonding pads to provide electrical contacts with pins.
¹ For more details on the LOCOS isolation see Section 2.3.1.
Figure 2.1 Cross-sectional views and process steps of a typical N-well CMOS process. Step labels recoverable from the later panels: strip resist/oxide, grow gate oxide, deposit polysilicon, apply photoresist and pattern, strip resist; apply photoresist, pattern S/D regions for P-channel, implant P+ S/D, strip photoresist, repeat for N+ S/D, strip photoresist; grow oxide, etch contact holes, deposit metal, pattern metal, metal anneal.
2.1.2 Twin-Tub CMOS Process

An alternative approach for CMOS device fabrication is to use two separate wells (tubs) for N- and P-channel transistors in a lightly doped N- or P-type substrate. This "twin-tub" CMOS technology uses a single mask that allows it to form two independently doped and self-aligned tubs [2]; hence both CMOS device types are optimized independently. This flexibility in selecting the substrate type with no change in the process flow is the major advantage of twin-tub CMOS. This technology is also more attractive when the devices are scaled down to submicron dimensions.
Fig. 2.2 shows the major steps involved in a typical twin-tub process. The starting material is a lightly doped P-epitaxial material over a P+ substrate to reduce latch-up. In addition to the conventional N-tub process, another N-type (arsenic) shallow implant is used to increase the surface doping of the N-tub to prevent punchthrough (for short channel devices). It is also used to form the channel-stoppers¹ for the P-channel transistors [Fig. 2.2(a)]. The photoresist is stripped and a selective oxidation of the N-tub is performed. The nitride/pad oxide layers are removed to implant boron, which is driven in to form the P-tub. This is followed by a second boron ion implantation for the channel-stoppers for the N-channel device [Fig. 2.2(b)]. The N-tub oxide is then stripped. So far only one mask (N-tub mask, MASK#1) is required for the self-aligned wells and channel-stopper processes. Both tubs are driven in. LOCOS isolation is developed to isolate between the devices using MASK#2, which defines the active areas. After the LOCOS process, boron is implanted through the pad oxide (used in the LOCOS) to reduce the threshold voltage of the P-channel transistor using MASK#3. This process results in a buried-channel PMOS transistor. The pad oxide is then removed. The remaining steps are similar to those used in the N-well process, where MASK#4 is needed to pattern the polysilicon [Fig. 2.2(c)]. MASK#5 and MASK#6 are required to form the N+ and P+ sources/drains (S/D), respectively. MASK#7 is used for contact openings, and MASK#8 for patterning the metal [Fig. 2.2(d)].
The fabrication of submicron MOS transistors requires additional process steps to avoid hot carrier effects. Fig. 2.3 illustrates a CMOS twin-tub structure with Lightly Doped Drain (LDD). Both NMOS and PMOS devices have lightly doped extensions to the source and drain regions. The electric field near the drain is reduced due to its light doping. This prevents the generation of hot carriers. The major process steps to fabricate the LDD structure are shown in Fig. 2.4.
2.1.3 Low-Voltage CMOS Technology

Scaled CMOS has been recognized as the technology suitable for low-power battery operated systems demanding high-speed operations. Conventional scaled CMOS technology undergoes a drastic reduction in speed when the power supply is reduced to 1 V and sub-1 V. If the threshold voltage is scaled aggressively, the subthreshold leakage current increases drastically, which causes limitations for battery applications. Hence, high-performance low-power scaled CMOS technology is needed for ultra-low voltage operation. One key in achieving low-power CMOS devices is the reduction of the junction capacitances as well as other parasitic capacitances. Also, the subthreshold current should be reduced when a low threshold voltage (VT ≤ 0.3 V) is used.

¹ For more details on the channel-stoppers refer to Section 2.3.

Figure 2.2 Twin-tub process sequence.
Extensions and variations of the standard CMOS process have been proposed to enhance the performance of devices at low voltage [3, 4]. These devices have good short channel behavior, low junction capacitance and reduced parasitic resistance. The power supply choice depends on performance/reliability/power trade-offs. Reduced power supply is needed for low-power applications, but a deep-submicron CMOS device with ultrathin gate oxide and low threshold voltage should be used to improve performance. Table 2.1 shows the speed achieved at low voltages using deep-submicron processes.
Table 2.1 Performance comparison at low voltage.
Name [Ref.]    CMOS process    Voltage (V)    Delay (ps)
IBM [3]        0.10 µm
AT&T [4]       0.10 µm
NEC [5]        0.15 µm
Fujitsu [6]    0.10 µm
               0.15 µm
Toshiba [8]    0.35 µm
Delay entries surviving without a row assignment: 21.0, 50.0, 52.0.
Figure 2.5 Cross-sectional view of the CMOS-SJET process (P MOSFET and N MOSFET on an N-substrate).

An example of improved performance CMOS technology suitable for low voltage is the one proposed by Toshiba [8], called CMOS Shallow Junction Well FET (SJET). Fig. 2.5 shows the cross-sectional view of the CMOS-SJET process. The N-well and P-well depths are very shallow and comparable to the maximum depletion layer width in the channel. With this CMOS-SJET structure, the depletion layer of the NMOS device, for example, is extended compared to the original one and reaches the depletion layer of the P-well and the N-type substrate. As a result, the total depletion layer width is increased and a low depletion capacitance, CD, is obtained. This leads to the reduction of the subthreshold slope (see Section 3.3.2). Thus, the threshold voltage can be reduced at low power supply voltage compared to the conventional CMOS process. Furthermore, the wells are designed to reduce the junction capacitance of the S/D regions by 40 to 55% compared to the conventional one. The structure of Fig. 2.5 also uses dual polysilicon gates, N+ and P+, to optimize the threshold voltages of the MOS devices. Also, W-polycide gates are used to reduce the poly sheet resistance. The delay of the CMOS-SJET inverter is 2.5 times better than that of conventional CMOS using the same gate size (0.5 µm technology) at 1.5 V power supply. The power-delay product of a CMOS-SJET gate at 1.5 V using 0.35 µm technology is 1.3 fJ, which is 1/3 of that for conventional CMOS devices. However, the main drawback of the CMOS-SJET is the large body effect due to its retrograde doping profile.
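Section 2.1.3 links the threshold voltage, the subthreshold slope and the leakage current. A common first-order way to see this link is to assume that the off-state current changes by one decade for every S volts of threshold shift, where S is the subthreshold swing. The sketch below uses this model. The model itself, the 0.7 V starting threshold and the swing values of 100 and 80 mV/decade are assumptions for illustration, and only the 0.3 V threshold comes from the text.

```python
def leakage_ratio(delta_vt_volts, swing_v_per_decade):
    """Factor by which off-state leakage grows when VT is lowered by delta_vt.

    Assumed model: I_off is proportional to 10 ** (-VT / S).
    """
    return 10.0 ** (delta_vt_volts / swing_v_per_decade)

# Lowering VT from a hypothetical 0.7 V to the 0.3 V mentioned in the text.
delta_vt = 0.7 - 0.3

for swing in (0.100, 0.080):   # assumed swings, in V/decade
    ratio = leakage_ratio(delta_vt, swing)
    print(f"S = {swing * 1e3:.0f} mV/decade -> leakage x{ratio:.0e}")
```

In this model a 0.4 V threshold reduction costs four to five decades of leakage. The same expression shows why a steeper subthreshold characteristic (smaller S), as targeted by the CMOS-SJET structure, permits a lower threshold voltage for a given off-state current.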
2.2 BIPOLAR PROCESS TECHNOLOGY

The technology of epitaxial growth gave rise to the economical manufacturing of monolithic bipolar ICs, as it allows a high-quality thin film of semiconductor to be grown on top of a substrate. Junction-isolation and epitaxy techniques triggered the progress of bipolar technology. Although most of the focus has been on the development of CMOS for the last ten years, we find that bipolar technology has achieved significant progress as well. Impressive high-speed results were demonstrated at the 1985 ISSCC (International Solid-State Circuits Conference) and thereafter. ECL (Emitter Coupled Logic) gate delays of 15 ps have been reported [9]. It was shown that advanced silicon bipolar technologies, although quite complex, could be integrated at the LSI level and operate at frequencies above those of CMOS circuits. Since then, the interest in advanced bipolar processes has increased. The key features for such technologies are: i) self-aligned base, ii) advanced isolation techniques such as deep-trench, and iii) polysilicon emitter contact.
Figure 2.7 Cross-sectional view of the SICOS bipolar device structure [11].
have been replaced by the side wall base electrodes. This allows the base area to be almost as large as the emitter. The SICOS structure is suitable for VLSI applications because of its density and low parasitics.
One of the features of advanced bipolar transistors is the replacement of aluminum by polysilicon for the contact of the emitter. This step has led to a noticeable improvement in the current gain of bipolar transistors. For further reading on polysilicon emitter BJTs refer to [10, 12, 13].
In this section, we introduce a typical Double-Polysilicon Self-Aligned (DPSA) process technology as an example of the advanced bipolar technologies¹.
Any bipolar process typically starts with creating the buried layers and the epitaxial layer. Fig. 2.8 illustrates the major steps of the epitaxial growth with an N+ buried layer (BL). This buried layer is introduced to reduce the collector resistance of a bipolar transistor, while the epitaxial layer offers the high-quality silicon host for the bipolar transistor. The steps involved in Fig. 2.8 are the following. First, an oxide layer is grown on the substrate and is then patterned using the buried layer mask. The photoresist on the oxide serves as a mask against etching and ion implantation. After etching the oxide, the exposed regions of the silicon surface are implanted by arsenic or antimony to form the N+ buried layers. The photoresist is then removed and an annealing step is carried out. All oxide is then stripped. An N-epitaxial layer is grown on the substrate as shown in Fig. 2.8(b). The thickness of this epitaxial layer can be as low as 0.8 µm for advanced digital bipolar technology. The problems limiting the scaling down of the thickness of the epitaxial layer are the autodoping and out-diffusion of the buried layer.

¹ A review of conventional bipolar technology using the junction isolation technique can be found in [18].

Figure 2.8 Major steps of the epitaxial growth with an N+ buried layer. Step labels: grow oxide, apply photoresist, pattern/etch N+ BL mask, implant Sb; strip resist, anneal, strip oxide, epitaxy (intrinsic layer).
Fig. 2.9 illustrates the sequence of a DPSA process assuming a starting structure with N+ buried layer, N-epitaxial layer and isolation oxide as shown in Fig. 2.9(a). First, photoresist is deposited and patterned to define the collector contact region (deep N+ collector sink). This region is then implanted with phosphorus to increase its doping level. The photoresist is stripped and the P-type base is implanted through a pre-implantation oxide as shown in Fig. 2.9(b). The resist and the oxide are then removed. A combination of P+ polysilicon and oxide layers is deposited over the wafer. These layers are then etched as shown in Fig. 2.9(c). A CVD oxide is deposited over the wafer. The oxide is then dry etched using reactive ion etching (RIE). The P+ polysilicon is walled with the oxide (called the sidewall spacer) [Fig. 2.9(d)]. The second level of polysilicon is deposited and implanted with phosphorus that will ultimately form the diffused emitter junction. At this stage, the wafer is annealed to drive the dopants from the P+ and N+ polysilicon layers. Fig. 2.9(e) illustrates the structure after patterning the N+ polysilicon. The P+ diffusion under the polysilicon forms the extrinsic base. The contact openings to the P+ and N+ polysilicon, and to the collector, are etched. This is followed by the metallization step. At the end, the metal is patterned as shown in Fig. 2.9(f).
Figure 2.9 (process steps): initial structure with oxide isolation, apply photoresist, pattern photoresist (N+ collector mask), P I/I for the N+ sink; strip photoresist/oxide, deposit P+ polysilicon/oxide, pattern/etch oxide/polysilicon; deposit CVD oxide, RIE etch of oxide; deposit the second level of polysilicon, P I/I (N+ poly), anneal; pattern/etch N+ polysilicon; deposit oxide, open contact holes, deposit metal, pattern/etch metal.
The advantage of bipolar devices is their high-speed performance. However, they are not suitable for battery backup systems because they consume high DC current. Many logic circuit techniques have been proposed for low-power and low-voltage operation, particularly for telecommunications applications [15, 16].
2.3 ISOLATION IN CMOS AND BIPOLAR TECHNOLOGIES
2.3.1 CMOS Device Isolation Techniques

Isolation in an integrated circuit means electrically isolating similar or different transistors. In a CMOS chip, where more than one million transistors can be integrated, 1 µA/transistor of leakage current due to bad isolation can lead to a few watts of DC power consumption. Moreover, this leakage current increases the susceptibility to latch-up, as will be discussed in Section 3.1.6. Isolation in CMOS is required to separate the devices electrically by eliminating the inversion layers that might be induced by the interconnection layer between the transistors. The principle of isolation in CMOS is based on the formation of a field oxide between two active areas [Fig. 2.10]. The width of the isolation region should be minimized to attain a dense layout, particularly for VLSI circuits.
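The leakage figure quoted above can be checked with a short calculation. The following is a minimal sketch in Python: the transistor count and the per-transistor leakage come from the text, while the supply voltages (3.3 V and 5 V) and the function name are assumptions chosen only for illustration.

```python
def leakage_power_w(n_transistors: int, leak_per_transistor_a: float, vdd_v: float) -> float:
    """Static power (W) drawn through leaky isolation: P = N * I_leak * V_DD."""
    return n_transistors * leak_per_transistor_a * vdd_v


# One million transistors leaking 1 uA each (figures from the text).
# The supply voltages below are illustrative assumptions, not from the text.
for vdd in (3.3, 5.0):
    p = leakage_power_w(1_000_000, 1e-6, vdd)
    print(f"VDD = {vdd} V: about {p:.1f} W of DC leakage power")
```

Under these assumptions the total leakage current is 1 A, so the static power is numerically equal to the supply voltage, which is consistent with the "few watts" stated above.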
Figure 2.10 Field oxide isolation in MOS integrated circuits.
Several isolation techniques have been proposed and used. The most popular are LOCOS (Local Oxidation of Silicon) [17], trench isolation [18, 19, 20, 21], and selective epitaxy [22]. Selective epitaxy is not studied in this chapter.
2.3.1.1 Local Oxidation of Silicon (LOCOS)

LOCOS is a relatively simple process for the isolation of active devices in CMOS technology. It is realized by forming a thick field oxide (FOX) between the active areas. FOX is very thick (0.4 - 0.6 µm), hence the corresponding field threshold voltage is high. The condition for preventing an inversion layer under FOX and between two active regions is that this field threshold voltage should be higher than the highest power supply voltage used on chip. The field threshold voltage can be further increased by raising the doping level under the FOX. This can be achieved by selectively implanting the regions over which the FOX is subsequently grown. These regions are commonly known as channel-stoppers.
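The isolation condition stated above reduces to a single comparison. The sketch below expresses it in Python; the function name and the numeric values are hypothetical and serve only to illustrate the rule that the field threshold voltage must exceed the highest on-chip supply.

```python
from typing import Sequence


def field_isolation_holds(field_vt_v: float, supplies_v: Sequence[float]) -> bool:
    """True if the field threshold voltage exceeds the highest on-chip supply voltage."""
    return field_vt_v > max(supplies_v)


# Illustrative numbers only (assumed, not taken from the text).
supplies = [3.3, 5.0]
print(field_isolation_holds(10.0, supplies))  # field threshold well above the supplies
print(field_isolation_holds(4.0, supplies))   # field threshold below the 5 V supply
```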
The steps of the LOCOS process are illustrated in Fig. 2.11. A pad oxide of 40 nm is grown and is followed by chemical vapor deposition of a 100 nm thick nitride layer, which masks the active region. The pad oxide is called stress-relief oxide (SRO) because it protects the silicon from stress caused by the nitride during subsequent high-temperature processes. Silicon nitride is used as a mask to protect the active region from oxidation. A layer of photoresist is applied to the wafer and then patterned using the mask of the active areas. The nitride/oxide layers are etched [Fig. 2.11(a)]. A P-type dopant is implanted to form the channel-stoppers [Fig. 2.11(b)]. The photoresist, which is used for protection against ion implantation, is stripped and a thick thermal oxide, i.e. the FOX, is grown. Only local oxidation is realized because the nitride masks the regions beneath it. At the end, the nitride/oxide layers are removed [Fig. 2.11(c)]. During this LOCOS process, 56% of the FOX thickness is under the silicon surface because the oxidation consumes some of the silicon. This process is called semi-recessed LOCOS isolation. One problem associated with this process is the lateral extension of the field oxide under the nitride during the oxidation, forming what is called bird's beak encroachment [Fig. 2.11(c)]. A typical value of this encroachment is 0.5 µm/side. This encroachment limits the scaling of the active areas and the channel width of the MOS device. Moreover, the bird's beak introduces imprecise channel widths.
Figure 2.12 Poly buffered LOCOS process.
The Poly Buffered LOCOS process was developed to reduce the bird's beak encroachment [23]. In this modified LOCOS process, the nitride mask thickness has been increased to 240 nm and a polysilicon stress relief buffer layer of 50 nm has been added between the nitride and a 10 nm pad oxide [Fig. 2.12(a)]. This arrangement prevents deep lateral extension of the field oxide under the nitride layer [Fig. 2.12(b)]. A 0.8 µm field oxide thickness results in 0.15 µm/side of encroachment and 2.2 µm minimum isolation pitch. Other techniques to solve the problem of the bird's beak encroachment can be found in [24, 25, 26].
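To see what the encroachment figures mean for a device, the following sketch compares the active width left after bird's beak encroachment for conventional LOCOS (0.5 µm/side) and Poly Buffered LOCOS (0.15 µm/side). It assumes a first-order model in which the encroachment simply subtracts from both sides of the drawn active width; the 2 µm drawn width and the function name are illustrative assumptions, not values from the text.

```python
def effective_width_nm(drawn_width_nm: int, encroachment_per_side_nm: int) -> int:
    """First-order active width after bird's beak encroachment on both sides."""
    return max(drawn_width_nm - 2 * encroachment_per_side_nm, 0)


DRAWN_WIDTH_NM = 2000  # assumed drawn active width (2 um), for illustration only

# Encroachment values per side, as quoted in the text.
ENCROACHMENT_NM = {"LOCOS": 500, "Poly Buffered LOCOS": 150}

for process, beak in ENCROACHMENT_NM.items():
    w = effective_width_nm(DRAWN_WIDTH_NM, beak)
    print(f"{process}: {w} nm of {DRAWN_WIDTH_NM} nm drawn width remains")
```

Lengths are kept in integer nanometres so that the subtraction is exact.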
2.3.1.2 Trench Isolation
Trench isolation is another alternative to the LOCOS isolation process. This technology has been accepted relatively quickly by the industry [27]. It addresses the isolation problem between opposite type devices (like N-channel and P-channel MOSFETs in CMOS technology). The advantages of trench isolation are: i) no bird's beak encroachment, ii) latch-up free structure, and iii) planar surface. Fig. 2.13 illustrates the steps of the trench isolation process. First, the pad oxide, the nitride and the thick oxide layers are patterned using the mask of the active areas. The thick oxide serves as a mask in the trench processing [Fig. 2.13(a)]. A deep trench is formed by dry etching (RIE). This is followed by a boron implant to create the P+ channel-stoppers at the bottom of the trench. The top thick oxide is removed, and the trench sidewalls are oxidized [Fig. 2.13(b)]. The polysilicon is deposited over the whole wafer, filling the trenches. The polysilicon is used as the trench dielectric because it fills the trenches more uniformly than other dielectrics. The surface polysilicon is then etched to yield the structure shown in Fig. 2.13(c). The wafer is oxidized using the nitride as a mask. The nitride is finally removed as illustrated in Fig. 2.13(d). At this stage, conventional processing can be used to integrate the CMOS devices. Although trench isolation permits reduction of the separation between the active regions, it has several drawbacks: i) it is a costly process because of the large number of processing steps, and ii) it cannot be used as an isolation region for the inactive parts of the chip. In this case, LOCOS is usually used. The description of other trench isolation processes can be found in [28].
2.3.2 Bipolar Device Isolation Techniques

The first technique used for bipolar isolation was based on collector/substrate junction isolation [Fig. 2.14]. The N-wells (N collectors) of the adjacent transistors were separated by P+ isles, which are deeply diffused to reach the P-type substrate. By tying these isles and the substrate to the most negative voltage, the junctions between them and the N-type collectors are reverse biased. Thus, the components in different N-wells (N collectors) are isolated. The area consumed by the isolation isles is large relative to the transistor area.
Figure 2.13 (process steps): grow oxide/nitride/oxide, pattern active region; RIE trench, implant boron; remove thick oxide, oxidize trench walls; oxidize, remove nitride.
Figure 2.15 Cross-sectional view of an NPN bipolar transistor with LOCOS isolation.
The packing density of the bipolar technology can be improved by replacing the junction isolation with LOCOS isolation. An additional advantage of LOCOS isolation is the reduction of the parasitic collector-substrate capacitance. Fig. 2.15 illustrates the cross-sectional view of an NPN bipolar transistor with LOCOS isolation. The area occupied by the oxide isolation is proportional to the epitaxial layer thickness. As the epitaxial thickness is reduced for higher device performance, the oxide isolation area becomes smaller, which means that LOCOS may become a practical isolation technique for advanced bipolar and BiCMOS technologies. Fig. 2.16 illustrates the process steps for oxide isolation in a bipolar process. After epitaxy growth, a thin layer of SiO2 is grown and a layer of Si3N4 is deposited. A photoresist layer is applied and patterned with an isolation mask [Fig. 2.16(a)]. Then the nitride/pad oxide layers and approximately half of the epitaxial layer are dry etched. A boron implant is performed to form the channel-stopper [Fig. 2.16(b)]. The photoresist is then removed and the wafer is oxidized to grow the thick isolation oxide. This oxide is called recessed oxide. The Si3N4 and the pad oxide are stripped at this stage. The resulting structure is almost planar. In this structure the bird's beak is formed as in the MOS case [Fig. 2.16(c)].
In the early 1980s, new isolation techniques such as grooves and trenches [29, 30, 31] were demonstrated. These techniques reduced the collector-substrate capacitance and increased the packing density. Hence they improve circuit speeds. The fabrication process is the same as the one described for CMOS trench isolation.
2.4 CMOS AND BIPOLAR PROCESSES CONVERGENCE

An interesting exchange of process technology know-how between the CMOS and the bipolar domains has taken place over the years. We have seen that epitaxial and buried layers have been used for CMOS to reduce latch-up. At the same time LOCOS, which was originally developed for CMOS, has been used for isolating bipolar transistors. The use of polysilicon for creating self-aligned MOS transistors was later adapted for self-aligned poly emitter bipolar transistors. Another example of the convergence between bipolar and CMOS is the use of oxide spacers: in CMOS they are used for the formation of LDD regions, while in bipolar they have been used to reduce the separation between the base contact and the emitter. The convergence of both technologies made the attractive idea of merging bipolar and CMOS seem more rational and feasible than ever.
Figure 2.16 (process steps): N+ BL process, grow epi-layer (N-type), grow pad oxide, deposit nitride/resist, pattern resist; strip resist, grow selective oxide, remove nitride/oxide.
Many of the steps of the advanced CMOS and bipolar processes are similar; hence, they can be shared for the fabrication of MOS and bipolar transistors when they are integrated in a BiCMOS process. Some examples of these steps are:
1. The N-well, which can be used as the body of the PMOS transistor and as the N-collector of the NPN transistor;
2. The N+ buried layer of the NPN can be used to form a retrograde well for the PMOS to reduce the latch-up susceptibility;
3. The polysilicon can be used for the CMOS gates and for the emitter contacts;
4. The shallow P-type implantation can be shared by the PMOS S/D and the self-aligned extrinsic base of the NPN transistor;
5. The shallow N-type implantation can be shared by the NMOS S/D and the emitter of the NPN transistor; and
6. The final annealing steps match.
However, as more steps are shared by the different devices, the device characteristics have to be compromised. There is a tradeoff between the process complexity and device quality.
2.5 BICMOS TECHNOLOGY
Although the idea of merging bipolar and CMOS on the same chip originated 20 years ago [32], it was not feasible from a practical point of view because of the lack of adequate process technology. With the technological progress achieved in recent years, this idea has been revived. There are many techniques to merge bipolar and CMOS devices as reported in the literature [33, 34, 35, 36, 37, 38]. There are two ways of classifying BiCMOS processes. One way is to classify them according to the baseline process. A CMOS-based BiCMOS process is a CMOS baseline process to which a bipolar transistor is added. Similarly, a bipolar-based BiCMOS process is a bipolar baseline process to which CMOS transistors are added. In both cases, the added device would have to be compromised, which means that its characteristics cannot be optimized. Alternatively, BiCMOS processes can be classified according to their cost/performance. In this regard, three categories can be identified:
1. Low-cost;
2. Medium-performance; and
3. High-performance (high-speed).
In this section, we present three examples of BiCMOS processes. The first one represents a low-cost process. It needs only one mask to incorporate the bipolar device in a CMOS-based process. The second example shows a medium-performance BiCMOS process, which adds 3 extra masks to a CMOS process. The third example illustrates a high-performance process in which polysilicon emitter and self-aligned structures are used.
2.5.1 Example 1: Low-Cost BiCMOS Process

In a low-cost BiCMOS process, a bipolar transistor is added to a CMOS process with minimum additional process steps. A typical N-well CMOS/bipolar process sequence is listed in Fig. 2.17(a). The N-well of the PMOS is used for the collector of the vertical NPN. The base is implanted in a separate step using an additional mask. The P+ S/D and the extrinsic base share the same implantation step. The emitter and the N+ S/D of the NMOS are also implanted in the same step. Fig. 2.17(b) illustrates the cross-section of an N-well BiCMOS structure. The process complexity is comparable to that of CMOS. However, there are many trade-offs in designing the emitter, base, and collector of the NPN. If the CMOS process is optimized, some of the bipolar device parameters, such as the breakdown voltage and the gain, may be satisfactory, but many others are degraded. For example, due to the absence of the buried layer and the deep N+ collector in the NPN, the collector resistance is high. Hence, the cut-off frequency is low, the current drive is poor, and the collector-emitter saturation voltage is high.
2.5.2 Example 2: Medium-Performance BiCMOS Process

Fig. 2.18 shows a cross-sectional view of a BiCMOS structure, which can be realized by adding an NPN to a baseline twin-tub CMOS process. This structure has an N+ buried layer and a deep N+ collector sink which enhance the collector conductivity. The N+ buried layer under the PMOS, together with the uniform N-well, forms a desired retrograde N-well. Similarly, the P+ buried layer creates a retrograde P-well for the NMOS transistor. It also acts as an isolation region between the N+ buried layers. A thin epitaxial layer (1 µm - 2 µm) is used to increase the cutoff frequency of the NPN transistor and to reduce the required width of the isolation isles between the bipolar transistors. The N collector is formed at the same time as the N-well of the PMOS transistor. After the formation of LOCOS a deep N+ sink is implanted and driven in. The P+ extrinsic base is implanted at the same time as the P+ S/D regions of the PMOS transistor. The N+ emitter and the N+ S/D share the same implantation step. In this process an aluminum emitter contact is used. Therefore, the size of the emitter is larger compared to the case where a self-aligned polysilicon emitter contact is used. This process uses only 3 extra masks to form the bipolar transistor. The first mask is needed for the N+ buried layer. The second mask is used to implant the N+ deep collector, and the third one is for the base implantation. The BiCMOS process described above can be optimized for use in high-performance circuits. The collector resistance is low in comparison to the low-cost process (example 1). For a 0.8 µm process, the cut-off frequency (fT) of a bipolar transistor can be as high as 5 GHz.
Figure 2.17 (a) process sequence. CMOS (base): P-substrate, N-well, LOCOS isolation, NMOS channel implantation, PMOS channel implantation, gate oxide, polysilicon gate, S/D N+ implantation, S/D P+ implantation, contact opening, metallization. Bipolar (addition): collector, base P implantation, P extrinsic base. (b) Cross-section labels: NPN, NMOS, PMOS.
2.5.3 Example 3: High-Performance BiCMOS Process
A high-performance BiCMOS process can be achieved by replacing the N+ S/D implant, used to form the emitter in example 2, by a doped polysilicon emitter. One extra mask is required to open the emitter window of the bipolar transistor. The ion implantation of the poly emitter and MOS gates is performed simultaneously. As shown in Fig. 2.19, four additional mask levels (N+ buried layer, N+ deep collector, P-base, and emitter window) are required to obtain an advanced BiCMOS. After the formation of the N+/P+ buried layers, the conventional twin-tub process is carried out. LOCOS is used to isolate the devices. The deep collector N+ is implanted and driven in, and the P-base is then patterned and implanted. The threshold voltages of the MOS transistors are adjusted by additional ion implantations. After the gate oxide growth, a thin polysilicon layer is deposited as shown in Fig. 2.20(a). The emitter window is then patterned and a second polysilicon layer is deposited [Fig. 2.20(b)]. The polysilicon is then doped by implantation and patterned to define the CMOS gates and polysilicon emitter [Fig. 2.20(c)]. Next, implants are selectively carried out to form the LDD regions for CMOS. Before implanting the N+/P+ S/D regions, a sidewall oxide is formed near the emitter and gate edges. Fig. 2.19(b) shows the final cross-section of this BiCMOS process. The BJTs realized in the presented high-performance BiCMOS process have low collector resistance (because of the buried layer and deep sink), high current gain (because of the poly emitter contact) and low parasitic capacitances (because of the self-alignment). With this BiCMOS process fT's greater than 5 GHz can be achieved. BiCMOS technology has a relatively high cost and complexity, because it requires a total of 15 masks for a submicron process. Several solutions have been proposed to reduce the number of process steps and so lower process complexity and cost. Recently one idea [40] has resulted in a low-cost 0.35 µm BiCMOS technology which needs only 11 masks by using a W-plug trench collector sink. This technology is suitable for a 3.3 V power supply voltage and is promising for low-power mixed-signal applications. Recently BiCMOS technologies with high NPN transistor fT's, from 10 to 30 GHz, have been reported [38, 40, 41]. The applications of these technologies are, for example, low-voltage (3 V and sub-3 V) and high-speed logic circuits. Another application of BiCMOS is mixed analog/digital ICs ranging from telecommunication circuits and high-speed networks to wireless systems. Among these applications, BiCMOS can be used for low-power high-frequency portable systems. Bipolar devices can be used for high-frequency and high-speed parts with low-power innovative circuits, and CMOS can be used for low-speed ultra-low-power parts.
Figure 2.20 (process steps): apply photoresist, pattern emitter, etch poly/oxide; strip resist, deposit LPCVD poly (250 nm), 2nd part of split poly; implant As, apply photoresist, pattern poly, dry etch poly, strip resist, anneal. Layer labels: polysilicon, poly-emitter, NPN, P-base, N-well.
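The mask overhead of the three example processes can be summarized in a small data structure. The sketch below is a hedged illustration in Python: the mask names and counts are taken from the descriptions of examples 1 to 3, while the dictionary layout and the helper function are hypothetical.

```python
# Extra masks added to a baseline CMOS flow, as quoted in the three examples.
EXTRA_MASKS = {
    "low-cost": ["P-base"],
    "medium-performance": ["N+ buried layer", "N+ deep collector", "P-base"],
    "high-performance": ["N+ buried layer", "N+ deep collector", "P-base", "emitter window"],
}


def added_masks(from_process: str, to_process: str) -> list:
    """Masks used by to_process that from_process does not already have, in flow order."""
    have = set(EXTRA_MASKS[from_process])
    return [mask for mask in EXTRA_MASKS[to_process] if mask not in have]


for name, masks in EXTRA_MASKS.items():
    print(f"{name}: {len(masks)} extra mask(s): {', '.join(masks)}")

# Moving from example 2 to example 3 costs one more mask level.
print(added_masks("medium-performance", "high-performance"))
```

Listing the masks rather than only their counts makes the step from example 2 to example 3 explicit: the only addition is the emitter window.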
2.6 COMPLEMENTARY BICMOS TECHNOLOGY

In a Complementary BiCMOS (CBiCMOS) process both vertical NPN and PNP transistors are merged with CMOS on the same chip. Recent investigations indicate that CBiCMOS allows for improving the performance of BiCMOS gates at low supply voltages [42, 43, 44]. Moreover, for wireless applications, where high-speed and low-power characteristics are required, CBiCMOS technology is one of the solutions. The PNP device added to conventional BiCMOS can be used to efficiently design low-voltage circuits. Further discussion on CBiCMOS circuits can be found in Section 5.3.2. Although, to date, the NPN has shown superior performance to that of the PNP, the trend indicates that PNP performance is approaching that of the NPN. Some of the problems associated with the PNP transistor are its high collector resistance, low current gain, and high base transit time. It has been recently reported that CBiCMOS processes can offer NPNs with fT's of 8-20 GHz and PNPs with fT's of 2-7 GHz [45, 46, 47, 48, 49, 50]. Fig. 2.21 shows a cross-sectional view and process flow of a CBiCMOS [46]. The N+ buried layer of the NPN transistor creates a retrograde well for the PMOS transistor. The P+ buried layer is only used for isolation isles between NPN transistors. After the epitaxial layer growth, twin-well and LOCOS processes are performed. The P-well of the NMOS device is used as the collector of the PNP transistor. A second high energy (600 keV) boron ion implantation is carried out to form the retrograde well (2nd P-well) for the NMOS and the P+ buried layer for the PNP device. The S/D implants of the MOS transistors are used simultaneously for the extrinsic bases of the NPN and the PNP transistors. The emitters of the NPN and the PNP are formed by the self-aligned contact doping technique to simplify the process flow. Finally, the metal is deposited and patterned.
Complementary BiCMOS offers a technology with versatile devices. It adds flexibility for mixed bipolar/MOS circuit design. The CBiCMOS technology promises further improvements to the performance of BiCMOS circuits.
2.7 BICMOS DESIGN RULES

In this section, a set of lambda-based design rules of a typical BiCMOS process⁶ (for 0.8 µm, λ = 0.4 µm) is presented. The corresponding device parameters are presented in Chapter 3. The minimum length of the MOS gate is 2λ and the minimum length and width of the bipolar emitter contact is 2λ and 4λ respectively. Table 2.2 describes the basic masks used in the layout design of BiCMOS devices. The rest of the masks are generated automatically. Table 2.3 lists the design rules for the (design) masks only of a typical BiCMOS technology in terms of the parameter λ. The corresponding graphical representation of the design rules is illustrated in Plate I. Plate II shows the layouts of minimum size PMOS, NMOS and bipolar transistors in a 0.8 µm BiCMOS technology.
⁶The given design rules are typical of a generic 0.8 µm high-performance BiCMOS process.
Figure 2.21 (a) Fabrication process flow; (b) Cross-sectional view of CBiCMOS [48].
Process flow of Fig. 2.21(a):
P-substrate
N+/P+ buried layer
N-type epitaxial layer
N-well/twin-well (1st P-well for PNP)
Field isolation
Collector deep N+
Deep P+ I/I for NMOS retrograde well and 2nd P-well for PNP (P+ buried layer)
Gate (CMOS)
NMOS S/D (N+ extrinsic base for PNP)
PMOS S/D (P+ extrinsic base for NPN)
NPN base, PNP base
Contact holes
N+ and P+ emitter implant
Metallization
Table 2.2 Basic BiCMOS Design Masks.
N-well (NW): The NW mask is used to define the N substrate (bulk) of the PMOS and the N-collector of the NPN transistor.
N+ deep collector (CN): The CN mask defines the area which is exposed for the N+ sink implantation.
P base (CP): The CP mask defines the region which is to receive a P-implant to create the base diffusion.
Polysilicon (PO): The PO mask defines the gate and the emitter electrodes, and the polysilicon interconnect layer.
Emitter window (EW): The EW mask defines the opening for the emitter window.
N+ and P+ (DN and DP): The DN (DP) mask defines the N+ (P+) source and drain regions of the N-channel (P-channel) device within the P-well (N-well), and the body contact regions in the N-well (P-well) respectively.
Contact (CO): The CO mask defines the contact openings.
Metal 1 (M1): The M1 mask defines the metal 1 interconnects.
Via (VIA): The VIA mask defines the openings of the via that connects metal 1 to metal 2.
Metal 2 (M2): The M2 mask defines the metal 2 interconnects.
Table 2.3 Design rules (in terms of λ).
1. N-well (NW)
   1.1 minimum width                               12λ
   1.2 minimum spacing                             12λ
2. N+ diffusion (DN)
   2.1 minimum width                               3λ
   2.2 minimum spacing                             3λ
   2.3 minimum NW overlap of DN                    0λ
   2.4 minimum NW to external DN spacing           6λ
3. P+ diffusion (DP)
   3.1 minimum width                               3λ
   3.2 minimum spacing                             3λ
   3.3 minimum NW overlap of DP                    4λ
   3.4 minimum NW to external DP spacing           4λ
   3.5 minimum space to DN (same potential)        0λ
   3.6 minimum space to DN (different potential)   3λ
4. N-collector plug (CN)
   4.1 minimum width                               4λ
   4.2 minimum spacing                             12λ
   4.3 minimum space to NW                         10λ
   4.4 minimum NW overlap of CN                    3λ
   4.5 minimum space to DN                         6λ
   4.6 minimum space to DP                         5λ
5. P-base diffusion (CP)
   5.1 minimum width                               4λ
   5.2 minimum spacing                             4λ
   5.3 minimum NW overlap of CP                    3λ
   5.4 minimum space to CN                         5λ
   5.5 minimum space to DN                         3λ
   5.6 minimum space to DP                         3λ
6. Polysilicon (PO)
   6.1 minimum width                               2λ
   6.2 minimum spacing                             3λ
   6.3 minimum space to DP or DN                   2λ
   6.4 gate overhang of DP or DN                   2λ
   6.5 minimum space to CN or CP                   1λ
7. Emitter window (EW)
   7.1 minimum width                               2λ
   7.2 minimum length                              4λ
   7.3 minimum spacing                             3λ
   7.4 minimum CP overlap of EW                    2λ
   7.5 minimum poly overlap of EW                  2λ
8. Contact (CO) (values as extracted for this group: 1λ, 1λ, 2λ, 2λ)
   8.1 minimum size (single)
   8.2 minimum size (double)
   8.3 minimum spacing
   8.4 minimum DN or DP overlap of CO
   8.5 minimum space to gate
   8.6 minimum PO overlap of CO
   8.7 minimum CN or CP overlap of CO
   8.8 minimum PO to CO spacing in P base
   8.9 minimum poly emitter CO to CP spacing
9. Metal 1 (M1)
   9.1 minimum width                               2λ
   9.2 minimum spacing                             3λ
   9.3 minimum M1 overlap of CO                    1λ
   9.4 maximum current density                     1 mA/µm
10. Metal 2 (M2)
   10.1 minimum width
   10.2 minimum spacing
   10.3 maximum current density
11. Via (VIA)
   11.1 minimum size
   11.2 minimum spacing
   11.3 minimum M1 or M2 overlap of VIA
   11.4 minimum VIA to CO spacing
   11.5 minimum PO to VIA spacing
   11.6 minimum PO overlap of VIA
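Because the rules are expressed in λ, they can be scaled to physical dimensions once λ is fixed. The following is a minimal sketch in Python, assuming λ = 0.4 µm (the value quoted for the 0.8 µm process). Only a few rules are included, namely the 2λ gate length and the 12λ N-well width and spacing; the dictionary keys, function names and the simple width check are hypothetical and do not represent a real design-rule checker.

```python
LAMBDA_NM = 400  # lambda = 0.4 um for the 0.8 um process, in integer nanometres

# A small subset of rules in units of lambda (keys are hypothetical names).
RULES_IN_LAMBDA = {
    "PO.min_width": 2,     # minimum MOS gate length
    "NW.min_width": 12,    # minimum N-well width
    "NW.min_spacing": 12,  # minimum N-well spacing
}


def rule_nm(rule: str, lambda_nm: int = LAMBDA_NM) -> int:
    """Convert a lambda-based rule to nanometres."""
    return RULES_IN_LAMBDA[rule] * lambda_nm


def violates_min(rule: str, drawn_nm: int, lambda_nm: int = LAMBDA_NM) -> bool:
    """True if a drawn dimension is smaller than the minimum the rule allows."""
    return drawn_nm < rule_nm(rule, lambda_nm)


for rule in RULES_IN_LAMBDA:
    print(f"{rule}: {RULES_IN_LAMBDA[rule]} lambda = {rule_nm(rule)} nm")

print(violates_min("PO.min_width", 600))   # a 0.6 um gate is below the 0.8 um minimum
print(violates_min("NW.min_width", 5000))  # a 5 um well is wide enough
```

The same rule set could be reused for a different technology node by changing only the λ value, which is the main appeal of lambda-based rules.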
Plate I: Design Rules of Table 2.3.
Plate II: Layouts of minimum size PMOS, NMOS and bipolar transistors.
2.8 SILICON ON INSULATOR

Silicon On Insulator (SOI) has recently received renewed interest for low-voltage and low-power applications. This is due to the reduction of its cost and the improvement of its performance at lower voltage. The emergence of thin-film SOI CMOS processes has demonstrated excellent characteristics for deep submicron ULSI applications.
Many techniques exist to grow silicon on insulator [51]. The most mature technique is the epitaxial growth of Silicon On Sapphire (SOS). Many LSI/VLSI circuits have been fabricated using SOS technology. SOI can also be produced by using what is called SIMOX (Separation by IMplanted Oxygen) [52] technology. It is fabricated simply by the formation of a buried oxide (SiO2) by implantation of oxygen underneath the surface of the silicon as illustrated in Fig. 2.22. The dose and energy of the oxygen ions are as high as 2 x 10^18 cm^-2 and 200 keV respectively. A subsequent thermal annealing at high temperature is performed to improve the quality of the silicon overlayer. The buried oxide can be several hundreds of nm thick and the thin silicon layer can have a thickness of several tens of nm. Compared to SOS, SOI SIMOX materials have better defect density and thin silicon layer control. The dislocation density can be lower than 10^? cm^-2. One important phenomenon which exists in CMOS SOI devices is the kink effect. It consists of a "kink" which appears in the output characteristics of an SOI MOSFET, as illustrated in Fig. 2.23. It is due mainly to the floating substrate of an NMOS device. An explanation of this phenomenon can be found in [51].
Figure 2.23 Kink effect in the output characteristics of an SOI MOS device.
The SOI SIMOX is now a mature material and represents a potential technology for low-power applications. Several LSI/VLSI circuits have been fabricated in SOI/SIMOX, particularly for low-power applications. Such circuits include a PLL (Phase Locked Loop) for wireless terminal applications [54], and a 1.2 GHz frequency divider under a 1-V power supply [55]. The SOI technology was also applied to design a fully pipelined 512-Kb SRAM [53]. This SRAM worked successfully down to 0.? V with an access time less than 5 ns.
Fig. 2.24 shows a thin film SOI/SIMOX CMOS process cross-section. The process starts with the formation of the buried oxide in the silicon wafer as explained above [Fig. 2.24(a)]. Then, an oxide is grown on the surface silicon and a nitride layer is deposited. Silicon nitride is used as a mask to protect the active region from oxidation. The nitride/oxide layers are patterned and a LOCOS isolation is applied [Fig. 2.24(b)]. At the end, the nitride/oxide layers are removed. This is followed by a P I/I to adjust the threshold voltage of the N-channel transistor. Similarly, the threshold voltage of the P-channel transistor is adjusted by I/I. A thin gate oxide is then grown and a layer of polysilicon is deposited and doped with phosphorus. Then the P+ source and drain regions of the PMOS are patterned and implanted with boron [Fig. 2.24(c)]. Similarly, the N+ S/D regions of the NMOS are patterned and implanted with phosphorus. A thick oxide is then deposited as an isolation layer between the polysilicon and the subsequent metal layer. The oxide is etched at contact locations. Next, the metal layer (aluminum) is deposited over the whole surface. Finally, the metal is etched and annealed. This simple process description shows that the SOI process is much simpler than bulk CMOS. For instance, the wells are no longer needed, and the punchthrough implants are also unnecessary if thin-film SOI is used. Fig. 2.25 shows a ...
Figure 2.24 Process steps of CMOS thin film SOI/SIMOX devices. Step labels: strip nitride and oxide; P-ch VT implant; N-ch VT pattern; N-ch VT implant; grow gate oxide; deposit polysilicon and pattern.
Due to the dielectric isolation, the MOS devices have several advantages over bulk CMOS such as: absence of latch-up, high packing density and lower parasitic capacitances. SOI reduces the circuit capacitance by 30% [57]. It has been discovered that if the silicon (containing the devices) is made sufficiently thin (< 100 nm), the MOSFET devices are fully depleted [51] even when VGS = 0. Fully depleted thin film SOI MOS devices offer attractive characteristics for CMOS applications such as immunity from short channel effect, absence of kink effect, superior subthreshold leakage and high drain saturation current (due to low channel doping) [58, 59, 60].
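The 30% capacitance reduction translates directly into switching power. The sketch below assumes the standard first-order relation for CMOS dynamic power, P = C * VDD^2 * f, with the supply voltage and clock frequency held constant; the capacitance, voltage and frequency values are illustrative assumptions, and only the 30% figure comes from the text.

```python
def dynamic_power_w(c_switched_f: float, vdd_v: float, freq_hz: float) -> float:
    """First-order CMOS switching power: P = C * VDD^2 * f."""
    return c_switched_f * vdd_v ** 2 * freq_hz


# Illustrative operating point (assumed values, not from the text).
C_BULK_F = 100e-12  # switched capacitance of the bulk CMOS circuit
VDD_V = 3.3
FREQ_HZ = 50e6

p_bulk = dynamic_power_w(C_BULK_F, VDD_V, FREQ_HZ)
# SOI version of the same circuit with 30% less capacitance (figure from the text).
p_soi = dynamic_power_w(0.70 * C_BULK_F, VDD_V, FREQ_HZ)

print(f"bulk: {p_bulk * 1e3:.2f} mW, SOI: {p_soi * 1e3:.2f} mW, ratio: {p_soi / p_bulk:.2f}")
```

Under this model the power ratio equals the capacitance ratio, so the saving does not depend on the assumed voltage or frequency.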
Unfortunately, the technology hsr minor disadvantages such sr floating body effects which rault in i) floating body induced threshold voltage lowering and ii) low drain-tusauce breakdown voltage. For 1 V power supply this is not a problem. However for 3 V operation this could be an important limitation. Also, the threshold voltage is very sensitive to the thickness uniformity of the superficial silicon. In addition. the low thermal conductivity of the oxide underneath the thin film silicon layer is II severe problsrn when the SO1 circuit is operating at high-frequency. Therefore technological improvements are still needed to mlve there Limitations.
2.9 CHAPTER SUMMARY

In this chapter, we have studied the process technologies of CMOS and bipolar devices. We have shown that the advanced CMOS and bipolar processes are converging, and many process techniques can be shared for the fabrication of both devices. The different options for merging bipolar and CMOS devices are then discussed. Three examples of BiCMOS processes with different complexities are presented. The complementary BiCMOS process is also considered. A table of design rules for a state-of-the-art BiCMOS technology is given for layout exercises. Several advanced technologies such as CMOS SOI/SIMOX and CMOS-SJET are reviewed for low-voltage operation.
REFERENCES
[1] F.M. Wanlass and C.T. Sah, "Nanowatt Logic using Field-Effect MOS Triodes," International Solid-State Circuits Conference Tech. Dig., pp. 32-33, 1963.
[2] L.C. Parrillo, R.S. Payne, R.E. Davis, G.W. Reutlinger, and R.L. Field, "Twin-Tub CMOS: A Technology for VLSI Circuits," International Electron Devices Meeting Tech. Dig., pp. 752-755, December 1980.
[3] Y. Taur et al., "High-Performance 0.1 µm CMOS Devices with 1.5 V Power Supply," International Electron Devices Meeting Tech. Dig., pp. 127-130, December 1993.
[4] K.F. Lee et al., "Room Temperature 0.1 µm CMOS Technology with 11.8 ps Gate Delay," International Electron Devices Meeting Tech. Dig., pp. 131-134, December 1993.
[5] K. Takeuchi et al., "0.15 µm CMOS with High Reliability and Performance," International Electron Devices Meeting Tech. Dig., pp. 883-886, December 1993.
[6] T. Yamazaki, K. Goto, T. Fukano, Y. Nara, T. Sugii, and T. Ito, "21 ps Switching 0.1 µm-CMOS at Room Temperature using High Performance Co Salicide Process," International Electron Devices Meeting Tech. Dig., pp. 906-908, December 1993.
[7] A. Oyamatsu, K. Kinugawa, and M. Kakumu, "Design Methodology of Deep Submicron CMOS Devices for 1 V Operation," Symposium on VLSI Technology Tech. Dig., pp. 89-90, 1993.
[8] B. Yoshimura, F. Matsuoka, and M. Kakumu, "New CMOS Shallow Junction Well FET Structure (CMOS-SJET) for Low Power-Supply Voltage," International Electron Devices Meeting Tech. Dig., pp. 909-912, December 1992.
[9] T. Uchino, T. Shiba, T. Kikuchi, Y. Tamaki, A. Watanabe, Y. Kiyota, and M. Honda, "15-ps ECL/74-GHz fT Bipolar Technology," International Electron Devices Meeting Tech. Dig., pp. 67-70, December 1993.
[10] T.H. Ning and D.D. Tang, "Bipolar Trends," Proc. IEEE, vol. 74, no. 12, pp. 1669-1671, December 1986.
[11] T. Nakamura, T. Miyazaki, S. Takahashi, T. Kure, T. Okabe, and M. Nagata, "Self-Aligned Bipolar Transistor with Polysilicon Sidewall Base Electrode for High Packing Density and High Speed," IEEE Journal of Solid-State Circuits, vol. 17, no. 2, pp. 226-230, April 1982.
[12] T.H. Ning and R.D. Isaac, "Effect of Emitter Contact on Current Gain of Silicon Bipolar Devices," IEEE Electron Device Letters, ED-27, pp. 2051-2055, November 1980.
[13] A.K. Kapoor and D.J. Roulston, "Polysilicon Emitter Bipolar Transistors," IEEE Press Book, 1989.
[14] M.I. Elmasry, "Digital Bipolar Integrated Circuits," John Wiley & Sons, New York, 1983.
[15] B. Razavi, Y. Ota, and R.G. Swartz, "Design Techniques for Low-Voltage High-Speed Digital Bipolar Circuits," IEEE J. Solid-State Circuits, vol. 29, no. 3, pp. 332-339, March 1994.
[16] W. Wilhelm and P. Weger, "Low-Power Bipolar Logic," International Solid-State Circuits Conf. Tech. Dig., pp. 94-95, February 1994.
[17] E. Kooi, J.G. Van Lierop, and J.A. Appels, "Formation of Silicon Nitride at a Si-SiO2 Interface during Local Oxidation of Silicon and During Heat Treatment of Oxidized Silicon in NH3 Gas," J. Electrochem. Soc., vol. 123, p. 1117, 1976.
[18] R.D. Rung, H. Momose, and Y. Nagakubo, "Deep-Trench Isolated CMOS Devices," International Electron Devices Meeting Tech. Dig., pp. 6-9, December 1982.
[19] T. Yamaguchi, S. Morimoto, G. Kawamoto, H.K. Park, and G.C. Eiden, "High-Speed Latch-up Free 0.5 µm-Channel CMOS using Self-Aligned TiSi and Deep-Trench Isolation Technologies," International Electron Devices Meeting Tech. Dig., pp. 522-525, December 1983.
[20] R.D. Rung, "Trench Isolation Prospects for Application in CMOS VLSI," International Electron Devices Meeting Tech. Dig., pp. 574-577, December 1984.
[21] A. Mikashiba, T. Homma, and K. Hamano, "A New Trench Isolation Technology as a Replacement for LOCOS," International Electron Devices Meeting Tech. Dig., pp. 578-581, December 1984.
[22] P. Singer, "Selective Epitaxial Growth Finds New Applications," Semiconductor International, p. 15, January 1988.
[23] R.A. Chapman et al., "An 0.8 µm CMOS Technology for High-Performance Logic Applications," International Electron Devices Meeting Tech. Dig., pp. 362-365, December 1987.
[24] K.Y. Chiu, R. Fang, J. Lin, and J.L. Moll, "The SWAMI - A Defect Free and Near-Zero Bird's Beak Local Oxidation Technology for VLSI," Symp. on VLSI Technology Tech. Dig., pp. 28-29, 1982.
[25] K.Y. Chiu, J.L. Moll, and J. Manoliu, "A Bird's Beak Free Local Oxidation Technology Feasible for VLSI Circuits Fabrication," IEEE Trans. on Electron Devices, vol. ED-29, pp. 536-540, 1982.
[26] J. Hui, P. Vande Voorde, and J. Moll, "Scaling Limitations of Submicron Local Oxidation Technology," International Electron Devices Meeting Tech. Dig., pp. 392-395, December 1985.
[27] H.B. Pogge, "Trench Isolation Technology," Bipolar Circuits and Technology Meeting Tech. Dig., pp. 18-25, September 1990.
[28] Y. Niitsu, "Latch-up Free CMOS Structure using Shallow Trench Isolation," International Electron Devices Meeting Tech. Dig., pp. 509-512, December 1985.
[29] H. Yamamoto, O. Mizuno, T. Kubota, M. Nakamae, A. Shiraki, and Y. Ikushima, "High-Speed Performance of a Basic ECL Gate with 1.25 Micron Design Rule," Symp. on VLSI Technology Tech. Dig., pp. 38-39, 1981.
[30] Y. Tamaki, T. Shiba, N. Honma, S. Mizuo, and A. Hayasaka, "New U-Groove Isolation Technology for High-Speed Bipolar Memory," Symp. on VLSI Technology Tech. Dig., pp. 24-25, 1983.
[31] D.D. Tang, P.M. Solomon, T.H. Ning, R.D. Isaac, and R.E. Burger, "1.25 µm Deep-Groove-Isolated Self-Aligned Bipolar Circuits," IEEE Journal of Solid-State Circuits, vol. SC-17, pp. 925-931, 1982.
[32] H.C. Lin, J.C. Ho, R.R. Iyer, and K. Kwong, "CMOS-Bipolar Transistor Structure," IEEE Trans. Electron Devices, vol. ED-16, no. 11, pp. 945-951, November 1969.
[33] T. Ikeda, A. Watanabe, Y. Nishio, I. Masuda, N. Tamba, M. Okada, and K. Ogiue, "High-Speed BiCMOS Technology with a Buried Twin Well Structure," IEEE Trans. on Electron Devices, vol. ED-34, no. 6, pp. 1304-1309, June 1987.
[34] H. Momose, K.M. Cham, C.I. Drowley, H.R. Grinolds, and R.S. Fu, "0.5 Micron BiCMOS Technology," International Electron Devices Meeting Tech. Dig., pp. 838-840, December 1987.
[35] A.R. Alvarez, J. Teplik, D.W. Schucker, T. Hulseweh, H.B. Liang, M. Dydyk, and I. Rahim, "Second Generation BiCMOS Gate Array Technology," Bipolar Circuits and Technology Meeting Tech. Dig., pp. 113-117, 1987.
[36] B. Bastani, C. Lage, L. Wong, J. Small, R. Lahri, L. Bouknight, T. Bowman, J. Manoliu, and T. Tuntasood, "Advanced 1 Micron BiCMOS Technology for High Speed 256k SRAM's," Symp. on VLSI Technology Tech. Dig., pp. 41-42, 1987.
[37] T. Yamaguchi and T.H. Yuzuriha, "Process Integration and Device Performance of a Submicron BiCMOS with 16-GHz fT Double Poly-Bipolar Devices," IEEE Trans. on Electron Devices, vol. 36, no. 5, pp. 890-896, May 1989.
[38] C.K. Lau, C-H Lin, and D.L. Packwood, "Sub-micron BiCMOS Process Design for Manufacturing," Bipolar/BiCMOS Circuits and Technology Meeting Tech. Dig., pp. 76-83, 1992.
[39] C.H. Wang and J. Van Der Velden, "A Single-Poly BiCMOS Technology with a 30 GHz Bipolar fT," Bipolar/BiCMOS Circuits and Technology Meeting Tech. Dig., pp. 234-237, October 1994.
[40] H. Yoshida, H. Suzuki, Y. Kinoshita, K. Imai, T. Akimoto, K. Tokashiki, and T. Yamazaki, "Process Integration Technology for Low Process Complexity BiCMOS using Trench Collector Sink," Bipolar/BiCMOS Circuits and Technology Meeting Tech. Dig., pp. 230-233, October 1994.
[41] J.M. Sung et al., "BEST2 - A High Performance Super-Aligned 3V/5V BiCMOS Technology, with Extremely Low Parasitics for Low-Power Mixed-Signal Applications," IEEE Custom Integrated Circuits Conf. Tech. Dig., pp. 15-18, May 1994.
[42] H.J. Shin, "Performance Comparison of Driver Configurations and Full-Swing Techniques for BiCMOS Logic Circuits," IEEE Journal of Solid-State Circuits, vol. 25, no. 3, pp. 863-865, June 1990.
[43] S.H.K. Embabi, A. Bellaouar, M.I. Elmasry, and R.A. Hadaway, "New Full-Voltage-Swing BiCMOS Buffers," IEEE Journal of Solid-State Circuits, vol. SC-26, pp. 150-153, February 1991.
[44] M. Hiraki, K. Yano, M. Minami, K. Sato, N. Matsuzaki, A. Watanabe, T. Nishida, K. Sasaki, and K. Seki, "A 1.5-V Full-Swing BiCMOS Logic Circuit," IEEE Journal of Solid-State Circuits, vol. 27, no. 11, pp. 1568-1574, November 1992.
[45] Y. Kobayashi, C. Yamaguchi, Y. Amemiya, and T. Sakai, "High Performance LSI Process Technology: SST CBiCMOS," International Electron Devices Meeting Tech. Dig., pp. 760-763, December 1988.
[46] K. Higashitani, H. Honda, K. Ueda, M. Hatanaka, and S. Nagao, "A Novel CBi-CMOS Technology by DIIP Process," Symp. on VLSI Technology Tech. Dig., pp. 77-78, 1990.
[47] T. Maeda, K. Ishimaru, and H. Momose, "Lower Submicron FCBiMOS (Fully Complementary BiMOS) Process with RTP and MeV Implanted 5 GHz Vertical PNP Transistor," Symp. on VLSI Technology Tech. Dig., pp. 79-80, 1990.
[48] W.R. Burger, C. Lage, B. Landau, M. DeLong, and J. Small, "An Advanced 0.8 Micron Complementary BiCMOS Technology for Ultra-High Speed Circuit Performance," Bipolar Circuits and Technology Meeting Tech. Dig., pp. 78-81, December 1990.
[49] S.W. Sun et al., "A Fully Complementary BiCMOS Technology for Sub-Half-Micrometer Microprocessor Applications," IEEE Trans. Electron Devices, vol. 39, no. 12, pp. 2733-2739, December 1992.
[50] T. Ikeda, T. Nakashima, S. Kubo, A. Jouba, and M. Yamawaki, "A High Performance CBiCMOS with Novel Self-Aligned Vertical PNP," Bipolar/BiCMOS Circuits and Technology Meeting Tech. Dig., pp. 238-240, October 1994.
[51] J.P. Colinge, "SOI Technology: Materials to VLSI," Kluwer Academic Publishers, 1991.
[52] K. Izumi, M. Doken, and H. Ariyoshi, "CMOS Device Fabricated on Buried SiO2 Layers Formed by Oxygen Implanted into Silicon," Electron. Lett., vol. 14, pp. 593-594, 1978.
[53] G.G. Shahidi, T.H. Ning, R.H. Dennard, and B. Davari, "SOI for Low-Voltage and High-Speed CMOS," International Conf. SSDM, Japan, pp. 265-267, 1994.
[54] Y. Kado, T. Ohno, M. Harada, K. Deguchi, and T. Tsuchiya, "Enhanced Performance of Multi-GHz PLL LSIs using Sub-1/4-micron Gate Ultrathin Film CMOS/SIMOX Technology with Synchrotron X-ray Lithography," IEDM Tech. Digest, pp. 243-246, December 1993.
[55] M. Fujishima, K. Asada, Y. Omura, and K. Izumi, "Low-Power 1/2 Frequency Dividers using 0.1-µm CMOS Circuits Built with Ultrathin SIMOX Substrate," IEEE Journal of Solid-State Circuits, vol. 28, no. 4, pp. 510-512, April 1993.
[56] T. Ohno, Y. Kado, M. Harada, and T. Tsuchiya, "A High-Performance Ultra-Thin Quarter-Micron CMOS/SIMOX Technology," IEEE Symposium on VLSI Technology Tech. Dig., pp. 25-26, 1993.
[57] Y. Yamaguchi, A. Ishibashi, M. Shimizu, T. Nishimura, K. Tsukamoto, K. Horie, and Y. Akasaka, "A High-Speed 0.6-µm 16K CMOS Gate Array on a Thin SIMOX Film," IEEE Trans. Electron Devices, vol. 40, no. 1, pp. 179-186, January 1993.
[58] J.P. Colinge, "Subthreshold Slope of Thin Film SOI MOSFETs," IEEE Electron Device Letters, pp. 274-276, September 1988.
[59] J.C. Sturm, K. Tokunaga, and J.P. Colinge, "Increased Drain Saturation Current in Ultrathin SOI MOS Transistors," IEEE Electron Device Letters, vol. 9, no. 9, pp. 460-?, September 1988.
[60] Y. Omura, S. Nakashima, K. Izumi, and T. Ishii, "0.1-µm Gate Ultrathin Film CMOS Devices using SIMOX Substrate with 80-nm Thick Buried Oxide Layer," IEDM Tech. Dig., pp. 675-678, December 1991.
3
LOW-VOLTAGE DEVICE MODELING
The objective of this chapter is two-fold. It is intended to review the basics of the MOS transistor, which is a prerequisite for Chapters 4 to 7, and to introduce commonly used models of both MOS and bipolar devices [Sections 3.1, 3.2, and 3.6]. In this chapter we consider simple analytical models which can be used for circuit analysis and design of deep-submicrometer MOSFETs at low voltage. Also, a simple model to compute the leakage current of MOSFETs is presented [Section 3.3]. The more sophisticated SPICE device models are also presented to allow the reader to appreciate the meaning of the model parameters as well as the capabilities and limitations of these models. The SPICE parameters for the 0.8 µm CMOS/BiCMOS process presented in Chapter 2 are included in this chapter for readers who are interested in designing and simulating low-voltage CMOS circuits as well as BiCMOS circuits. In Section 3.4, supply voltage scaling due to reliability and power dissipation issues is presented.
3.1 MOSFET STRUCTURE AND OPERATION

Fig. 3.1 shows cross-sections and views of an N-channel MOS transistor. By applying a positive voltage on the gate, $V_{GS}$, a depletion layer is introduced in the channel. Further increase in $V_{GS}$ results in a surface inversion layer. The surface charge of the semiconductor ($Q_S$, in coul/cm²) is equal in magnitude to the charge of the gate electrode ($Q_G$, in coul/cm²). Thus, we have
$$Q_S = -Q_G = -(V_{GS} - V_{FB} - \phi_s)\,C_{ox} \tag{3.1}$$
where $V_{GS}$ is the gate-source voltage and $\phi_s$ is the semiconductor surface potential. $C_{ox}$ is the gate oxide capacitance per unit area and is given by
$$C_{ox} = \frac{\epsilon_{ox}}{t_{ox}} \tag{3.2}$$
where $\epsilon_{ox}$ is the oxide permittivity and $t_{ox}$ is the gate oxide thickness. The flatband voltage $V_{FB}$ is given by
$$V_{FB} = \phi_{ms} - \frac{Q_o}{C_{ox}} \tag{3.3}$$
$Q_o$ is the total of all charges in the oxide and near the oxide/silicon interface. This charge is positive. The work function difference between the gate electrode and the semiconductor, $\phi_{ms}$, depends on the type of the electrode and the doping concentration of the semiconductor. For an aluminum electrode, we have
$$\phi_{ms} = -0.61 + \phi_f \tag{3.4}$$
For an N+ polysilicon electrode, we have
$$\phi_{ms} = -0.55 + \phi_f \tag{3.5}$$
The Fermi potential $\phi_f$ in Equations (3.4) and (3.5) is given by
$$\phi_{fp} = -V_t \ln\left(\frac{N_a}{n_i}\right) \quad \text{for P-type Si} \tag{3.6}$$
$$\phi_{fn} = +V_t \ln\left(\frac{N_d}{n_i}\right) \quad \text{for N-type Si} \tag{3.7}$$
where $V_t = kT/q$. The charge $Q_S$ is the sum of the charge in the depletion layer $Q_B$ and the inversion layer $Q_I$. Therefore,
$$V_{GS} = V_{FB} + \phi_s - \frac{Q_B + Q_I}{C_{ox}} \tag{3.8}$$
The bulk depletion charge (per unit area) consists of ionized acceptors (P-type substrate) or donors (N-type substrate). The depletion charge of a P-type bulk, with zero bulk-source bias voltage ($V_{BS} = 0$), is given by
$$Q_{B0} = -q N_a W_D \tag{3.9}$$
Figure 3.1 (a) The layout and cross-sectional view of an NMOS transistor; (b) Symbols of different types of MOS transistors (NMOS enhancement mode, NMOS depletion mode, PMOS enhancement mode).
where $q$ is the electron charge and $N_a$ is the acceptor concentration. The width of the depletion layer in the bulk ($W_D$) is given by
$$W_D = \sqrt{\frac{2\epsilon_{si}\,|\phi_s|}{q N_a}} \tag{3.10}$$
The turn-on (or threshold) voltage of an NMOS transistor is defined as the gate-source voltage at which the surface potential $\phi_s$ is equal to $2|\phi_f|$. This condition also defines what is known as strong inversion. At the onset of strong inversion we can assume that $Q_S \approx Q_B$. Using Equation (3.8), we can write the following expression for the threshold voltage
$$V_{T0} = V_{FB} + 2|\phi_f| - \frac{Q_{B0}}{C_{ox}} \tag{3.11}$$
$Q_{B0}$ is equal to $-qN_aW_{Dm}$, where $W_{Dm} = W_D(\phi_s = 2|\phi_f|)$. Thus, the threshold voltage can be rewritten as
$$V_{T0} = V_{FB} + 2|\phi_f| + \frac{\sqrt{2q\epsilon_{si}N_a\,(2|\phi_f|)}}{C_{ox}} \tag{3.12}$$
If the bulk-source junction is reverse biased ($|V_{BS}| > 0$), the threshold voltage becomes
$$V_T = V_{FB} + 2|\phi_f| + \frac{\sqrt{2q\epsilon_{si}N_a\,(|V_{BS}| + 2|\phi_f|)}}{C_{ox}} \tag{3.13}$$
This equation can be rewritten
$$V_T = V_{T0} + \gamma\left(\sqrt{|V_{BS}| + 2|\phi_f|} - \sqrt{2|\phi_f|}\right) \tag{3.14}$$
where the body effect coefficient $\gamma$ is given by
$$\gamma = \frac{\sqrt{2q\epsilon_{si}N_a}}{C_{ox}} \tag{3.15}$$
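As a rough numerical illustration of Equations (3.2) and (3.15), the following Python sketch evaluates the oxide capacitance $C_{ox} = \epsilon_{ox}/t_{ox}$ and the body effect coefficient $\gamma = \sqrt{2q\epsilon_{si}N_a}/C_{ox}$. The relative permittivities (3.9 for SiO2, 11.7 for silicon), the doping level of $10^{16}$ cm⁻³ and the function names are assumptions chosen for this example only.

```python
import math

Q = 1.602e-19          # electron charge (C)
EPS0 = 8.854e-14       # vacuum permittivity (F/cm)
EPS_OX = 3.9 * EPS0    # SiO2 permittivity (F/cm); relative value 3.9 is assumed
EPS_SI = 11.7 * EPS0   # Si permittivity (F/cm); relative value 11.7 is assumed


def oxide_capacitance(t_ox_cm):
    """Eq. (3.2): gate oxide capacitance per unit area (F/cm^2)."""
    return EPS_OX / t_ox_cm


def body_effect_coefficient(n_a, c_ox):
    """Eq. (3.15): gamma (V^0.5) for acceptor doping n_a (cm^-3)."""
    return math.sqrt(2.0 * Q * EPS_SI * n_a) / c_ox


# Hypothetical device: 17.5 nm gate oxide, substrate doping 1e16 cm^-3.
c_ox = oxide_capacitance(17.5e-7)            # thickness expressed in cm
gamma = body_effect_coefficient(1e16, c_ox)
print(f"C_ox = {c_ox:.3e} F/cm^2, gamma = {gamma:.3f} V^0.5")
```

Working in centimetre-based units keeps the result directly comparable with doping concentrations quoted in cm⁻³.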
This value is negative and is not suitable for digital circuits, where a positive $V_{T0}$ is required for switching. To get a reasonable $V_{T0}$, the device surface is implanted with boron. The implanted dose $D_I$ causes $V_{T0}$ to increase by the amount $qD_I/C_{ox}$. The threshold voltage is hence given by
$$V_{T0} = V_{FB} + 2|\phi_f| + \gamma\sqrt{2|\phi_f|} + \frac{qD_I}{C_{ox}} \tag{3.16}$$
Consider now the previous example. With $D_I = 1.725 \times 10^{12}$ cm⁻² and $\gamma = 0.238\ V^{1/2}$, we find that $V_T$ is equal to 0.7 V when $|V_{BS}| = 0$ V and is equal to 0.98 V when $|V_{BS}| = 3.3$ V. The symbols of the NMOS and PMOS transistors are shown in Fig. 3.1(b). Typical values of $V_T$ are -2.5 V to -4 V for depletion-mode NMOS devices. For low-voltage CMOS they are 0.3 V to 0.8 V for enhancement-mode NMOS devices and -0.3 V to -0.8 V for enhancement-mode PMOS devices. When $V_{GS} < V_{T0}$, the transistor is in the cutoff region, since no inversion layer exists, as shown in Fig. 3.2(a). The drain current is, therefore, approximately zero. When $V_{GS} > V_{T0}$, the channel is formed and a drain current flows from the drain to the source [Fig. 3.2(b)]. The transistor is in the linear region (also called the ohmic region) when $V_{GD}$ (i.e., $V_{GS} - V_{DS}$) $\ge V_T$. When $V_{GS} > V_T$ and $V_{DS} > V_{GS} - V_T$ (i.e., $V_{GD} < V_T$), the channel is pinched off as illustrated in Fig. 3.2(c) and the device enters the saturation region. The drain-source voltage which causes the channel to pinch off at the drain edge is commonly known as the saturation drain-source voltage $V_{DS,sat}$ and is equal to $V_{GS} - V_T$.
The voltage drop between the pinchoff point and the source is $V_{DS,sat}$. Any $V_{DS}$ higher than $V_{DS,sat}$ will appear between the pinchoff point and the drain. If we assume that the distance between the pinchoff point and the drain is extremely small compared with the overall length, then for $V_{DS} > V_{DS,sat}$ the drain current is constant. The carriers which reach the pinchoff point are swept across to the drain by the potential ($V_{DS} - V_{DS,sat}$) between the drain and the end of the channel.
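The body effect example and the region boundaries described above can be checked with a short Python sketch. It evaluates Equation (3.14), $V_T = V_{T0} + \gamma(\sqrt{|V_{BS}| + 2|\phi_f|} - \sqrt{2|\phi_f|})$, and classifies the operating region of an enhancement-mode NMOS device. The value $2|\phi_f| = 0.7$ V is an assumption made here (it is not stated in the example), and the function names are illustrative.

```python
import math


def threshold_voltage(vt0, gamma, vbs_abs, two_phi_f):
    """Eq. (3.14): threshold voltage with a reverse-biased bulk-source junction."""
    return vt0 + gamma * (math.sqrt(vbs_abs + two_phi_f) - math.sqrt(two_phi_f))


def nmos_region(vgs, vds, vt):
    """Classify the operating region of an enhancement-mode NMOS transistor."""
    if vgs < vt:
        return "cutoff"          # no inversion layer
    if vds > vgs - vt:
        return "saturation"      # channel pinched off at the drain edge
    return "linear"              # V_GD >= V_T, channel extends to the drain


# gamma = 0.238 V^0.5 and V_T0 = 0.7 V are taken from the example in the text;
# 2|phi_f| = 0.7 V is an assumed value.
vt = threshold_voltage(vt0=0.7, gamma=0.238, vbs_abs=3.3, two_phi_f=0.7)
print(f"V_T at |V_BS| = 3.3 V: {vt:.2f} V")
print(nmos_region(vgs=3.3, vds=0.1, vt=0.7))
print(nmos_region(vgs=3.3, vds=3.3, vt=0.7))
```

With the assumed surface potential, the computed threshold agrees with the 0.98 V quoted in the example to within rounding.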
3.2 SPICE MODELS OF THE MOS TRANSISTOR
3.2.1 The Simple MOS DC Model

Let us now analyze the simple DC model describing the I-V characteristics of an MOS transistor. From Fig. 3.3 it can be shown that the element $dx$ has a resistance
$$dR = -\frac{dx}{W\mu\,Q_I(x)} \tag{3.17}$$
We assume that the mobility ($\mu$) of the electrons in the channel of an NMOS device is constant. A current $I_{DS}$ crossing the incremental resistance $dR$ causes a voltage drop of
$$dV = I_{DS}\,dR \tag{3.18}$$
Substituting from Equation (3.17) in Equation (3.18) and integrating from the source to the drain, we obtain
$$I_{DS}\int_0^{L_{eff}} dx = -W\mu\int_0^{V_{DS}} Q_I(x)\,dV \tag{3.19}$$
To solve this integral, we need to express the electron inversion charge density $Q_I(x)$ in terms of $V$. From Equation (3.8), we have
$$V_{GS} - V_{FB} - \phi_s = -\frac{Q_{B0}}{C_{ox}} - \frac{Q_I}{C_{ox}} \tag{3.20}$$
The surface potential $\phi_s$ at any point $x$ along the channel is equal to $2|\phi_f| + V(x)$. By substituting $V_{T0}$ for $V_{FB} - Q_{B0}/C_{ox} + 2|\phi_f|$ [Equation (3.11)] in Equation (3.20), we get
$$Q_I(x) = -\left[V_{GS} - V_{T0} - V(x)\right]C_{ox} \tag{3.21}$$
The surface potential at the drain is larger than that at the source by $V_{DS}$. Therefore, the magnitude of $Q_I$ decreases with the distance across the channel. This is why the inversion layer is triangular, as illustrated in Fig. 3.3. Assuming that $Q_{B0}$ is constant across the channel and substituting for $Q_I$ from Equation (3.21) into Equation (3.19), we obtain
$$I_{DS} = k_p\frac{W}{L_{eff}}\left[(V_{GS} - V_{T0})V_{DS} - \frac{V_{DS}^2}{2}\right] \tag{3.24}$$
where $k_p$ is a process-dependent parameter defined as $k_p = \mu C_{ox}$. Equation (3.24) is valid only for $V_{DS} \le V_{DS,sat}$ (ohmic region). When $V_{DS}$ exceeds $V_{DS,sat}$, the drain-source current saturates. The saturation current can be found by substituting $V_{DS,sat} = V_{GS} - V_{T0}$ for $V_{DS}$ in Equation (3.24) and is hence given by
$$I_{DS} = \frac{k_p}{2}\frac{W}{L_{eff}}(V_{GS} - V_{T0})^2 \tag{3.25}$$
The characteristics of an MOS transistor based on Equations (3.24) and (3.25) are shown in Fig. 3.4. The current equations (3.24) and (3.25) have to be modified if the bulk-source voltage is greater than zero by replacing $V_{T0}$ by $V_T$ [see Equation (3.14)]. Note that when $V_{DS}$ is small (say 60 mV), Equation (3.24) can be approximated by
$$I_{DS} \approx k_p\frac{W}{L_{eff}}(V_{GS} - V_{T0})V_{DS} \tag{3.26}$$
This equation expresses a linear relationship between $I_{DS}$ and $V_{GS}$. Using linear extrapolation, $V_{T0}$ and $k_p$ can be determined as shown in Fig. 3.4(b).
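The measured I-V characteristics show that the drain current, in the saturation region, is a weak function of $V_{DS}$. This is due to the channel length modulation phenomenon, which can be explained as follows. Let us define
$$L'_{eff} = L_{eff} - \Delta L \tag{3.27}$$
where $\Delta L$ is the width of the depletion layer between the pinchoff point and the drain, as shown in Fig. 3.5. The voltage across this depletion layer is $V_{DS} - V_{DS,sat}$, therefore $\Delta L$ can be written as
$$\Delta L = \sqrt{\frac{2\epsilon_{si}}{qN_a}\left(V_{DS} - V_{DS,sat}\right)} \tag{3.28}$$
The corrected saturation current becomes
$$I_{DS} = \frac{k_p}{2}\frac{W}{L_{eff} - \Delta L}(V_{GS} - V_{T0})^2 \tag{3.29}$$
If we assume that $\Delta L / L_{eff} \ll 1$, then we can rewrite the current as
$$I_{DS} = \frac{k_p}{2}\frac{W}{L_{eff}}(V_{GS} - V_{T0})^2\left(1 + \frac{\Delta L}{L_{eff}}\right) \tag{3.30}$$
The ratio $\Delta L / L_{eff}$ can be related to $V_{DS}$ by the following empirical relation
$$\frac{\Delta L}{L_{eff}} = \lambda V_{DS} \tag{3.31}$$
The channel modulation factor $\lambda$ is very small. A typical value of $\lambda$ is 0.01 V⁻¹.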
The drain current model described so far is known as the LEVEL 1 (MOS1) model in SPICE (SPICE2G6, 3B1 or 3C1). This model is also called the Shichman-Hodges model. However, this model, which was used in the 1970s, is too simple to account for state-of-the-art CMOS devices and might lead to a 100% error in the current, particularly for low-voltage deep-submicrometer CMOS devices. Nevertheless, $k_p$ (or $\mu$) can be used as a fitting parameter to reduce this error. This model is most suitable for preliminary analysis.
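For hand calculations, the LEVEL 1 equations can be collected into a single function. The Python sketch below is a minimal illustration of Equations (3.24), (3.25) and the $(1 + \lambda V_{DS})$ correction of Equations (3.30)-(3.31); the function name and the numerical values are hypothetical and are not taken from the process of Chapter 2.

```python
def level1_ids(vgs, vds, vt, kp, w, l, lam=0.0):
    """Drain current (A) of an NMOS device from the simple LEVEL 1 equations.

    kp is mu*C_ox (A/V^2), w and l are in the same length unit,
    lam is the channel modulation factor (1/V).
    """
    vov = vgs - vt
    if vov <= 0.0:
        return 0.0                                   # cutoff region
    beta = kp * w / l
    if vds <= vov:                                   # ohmic region, Eq. (3.24)
        return beta * (vov * vds - 0.5 * vds * vds)
    # Saturation, Eq. (3.25), with the channel length modulation term
    # applied only in saturation as in Eqs. (3.30)-(3.31). With lam > 0 this
    # leaves a small step in the current at V_DS = V_DS,sat.
    return 0.5 * beta * vov * vov * (1.0 + lam * vds)


# Hypothetical device: kp = 100 uA/V^2, W/L = 10, V_T = 0.7 V.
for vds in (0.06, 1.3, 3.3):
    ids = level1_ids(vgs=2.0, vds=vds, vt=0.7, kp=100e-6, w=10.0, l=1.0, lam=0.01)
    print(f"V_DS = {vds:4.2f} V -> I_DS = {ids * 1e6:7.1f} uA")
```

Setting `lam` to zero recovers the ideal characteristics, which are continuous at $V_{DS} = V_{DS,sat}$.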
3.2.2 Semi-Empirical Short-Channel Model (LEVEL 3)

The MOS3 model (or MOS LEVEL 3) has been developed for short- and narrow-channel MOS devices ($L \le 2$ µm, $W \le 2$ µm) [1]. The MOS3 model has the following features (compared to MOS1):
■ A model for mobility degradation with the vertical and the horizontal electric fields;
■ A model for the threshold voltage of short- and narrow-channel devices (the Drain Induced Barrier Lowering (DIBL) effect is accounted for);
■ An improved model for the channel length modulation phenomenon;
■ Weak inversion conduction (subthreshold conduction).
The threshold voltage expression is given by [1]
$$V_T = V_{FB} + 2|\phi_f| - \sigma V_{DS} + \gamma F_S\sqrt{2|\phi_f| + |V_{BS}|} + F_N\left(2|\phi_f| + |V_{BS}|\right) \tag{3.32}$$
$\gamma$ in this expression is given by Equation (3.15). This expression includes:
CHAPTER 3
.
m
The static Ceedback effect codficient (r (Due to DIBL effect) [2]

(3.33)
where 1is an empirical coefficient;
The correction factor for short-channel &eft is based on a modified trapeaoidal approach for calculating the charge Q B [Fig. 3.61. The correction factor can be obtained from [3]
where W,, the depletion layer width of a cylindricsl junction and is given by
We = 0.0831353+ 0.8013929m
W D
2,
- 0.0111077(-)W D
2,
(3.35)
The correction factor for narrow-&-el
MOS is given by
3.2.2.1 Mobility degradation:

The mobility degradation due to the vertical electric field is modeled by a simple equation [4] in terms of $\theta$, an empirical constant which depends on the oxide thickness. A typical value of $\theta$ is 0.05. To account for the effect of the lateral average electric field, the effective mobility is related to the drain-source voltage and the channel length [4] [Equation (3.38)]. In this expression, when the device operates in saturation, $V_{DS}$ is replaced by $V_{DS,sat}$.
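To make the two mobility corrections concrete, the sketch below assumes the forms commonly quoted for the SPICE LEVEL 3 model, namely $\mu_s = \mu_0/[1 + \theta(V_{GS} - V_T)]$ for the vertical field and $\mu_{eff} = \mu_s/[1 + \mu_s V_{DS}/(v_{max}L_{eff})]$ for the lateral field. These expressions are an assumption made for this illustration, and the numerical values are hypothetical.

```python
def surface_mobility(u0, theta, vgs, vt):
    """Vertical-field degradation (assumed LEVEL 3 form), mobility in cm^2/(V s)."""
    return u0 / (1.0 + theta * (vgs - vt))


def effective_mobility(mu_s, vds, vmax, l_eff):
    """Lateral-field degradation (assumed LEVEL 3 form).

    vmax is the maximum carrier drift velocity (cm/s), l_eff is in cm.
    In saturation, vds should be replaced by V_DS,sat.
    """
    return mu_s / (1.0 + mu_s * vds / (vmax * l_eff))


# Hypothetical NMOS device: U0 = 500 cm^2/(V s), theta = 0.05 1/V,
# vmax = 1.5e7 cm/s, L_eff = 0.8 um.
mu_s = surface_mobility(u0=500.0, theta=0.05, vgs=2.7, vt=0.7)
mu_eff = effective_mobility(mu_s, vds=1.0, vmax=1.5e7, l_eff=0.8e-4)
print(f"mu_s = {mu_s:.1f}, mu_eff = {mu_eff:.1f} cm^2/(V s)")
```

Both corrections reduce the mobility, and the lateral term grows as the channel length shrinks, which is why it matters for short-channel devices.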
3.2.2.2 Channel length modulation

When $V_{DS} \ge V_{DS,sat}$, the channel length is modulated by an amount $\Delta L$. This channel length reduction is formulated in MOS3 by Baum's model [5]. In this model the voltage across the depletion surface of length $\Delta L$ is modeled by $\kappa(V_{DS} - V_{DS,sat})$, where $\kappa$ is a fitting parameter.
3.2.2.3 Drain current

In the LEVEL 1 model of SPICE, the drain current in the weak inversion region was assumed to be zero. The modeling of the subthreshold current in LEVEL 3 is based on the analysis by Swanson and Meindl [6]. The drain current in weak inversion, which is basically a diffusion current, is given by
$$I_{DS} = I_{on}\,e^{(V_{GS} - V_{on})/(nV_t)} \tag{3.39}$$
where
$$V_{on} = V_T + nV_t \tag{3.40}$$
and
$$n = 1 + \frac{qN_{FS}}{C_{ox}} + \frac{C_d}{C_{ox}} \tag{3.41}$$
where
$$C_d = \frac{dQ_B}{dV_{BS}} \tag{3.42}$$
and $N_{FS}$ is a curve fitting parameter. $V_{on}$ marks the point between the weak and strong inversion modes. Typical values of $n$ range from 1.0 to 2.5. $I_{on}$ is related to the current of Equation (3.39) by taking $V_{GS} = V_{on}$. Fig. 3.7 illustrates the transfer characteristics of the weak inversion and drift model. The voltage $V_{on}$ ensures the continuity of the current, but it is clear from the figure that at $V_{GS} = V_{on}$ a discontinuity exists in the derivative. Therefore, the MOS3 model is not precise in simulating the intermediate region where the diffusion and drift currents are comparable.
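The exponential dependence in Equation (3.39) is easier to appreciate numerically. The following Python sketch evaluates $I_{DS} = I_{on}\exp[(V_{GS} - V_{on})/(nV_t)]$ and the gate voltage change needed for one decade of current, $nV_t\ln 10$. The thermal voltage of 25.85 mV (about 300 K) and the device values are assumptions made for this example.

```python
import math


def weak_inversion_ids(vgs, i_on, v_on, n, v_t=0.02585):
    """Eq. (3.39): weak inversion drain current; v_t is kT/q in volts."""
    return i_on * math.exp((vgs - v_on) / (n * v_t))


def subthreshold_swing(n, v_t=0.02585):
    """Gate voltage change per decade of drain current (V/decade)."""
    return n * v_t * math.log(10.0)


# Hypothetical device: I_on = 1 uA at V_on = 0.75 V, n = 1.5.
swing = subthreshold_swing(n=1.5)
i_off = weak_inversion_ids(vgs=0.0, i_on=1e-6, v_on=0.75, n=1.5)
print(f"swing = {swing * 1e3:.1f} mV/decade, I_DS(V_GS = 0) = {i_off:.3e} A")
```

A larger $n$ gives a larger swing, so the current at $V_{GS} = 0$ falls off more slowly below $V_{on}$.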
In strong inversion, the drain current can be expressed as in Equation (3.43). The threshold voltage along the channel is given by
$$V_T(x) = V_T + \gamma F_S\left(\sqrt{2|\phi_f| + |V_{BS}| + V(x)} - \sqrt{2|\phi_f| + |V_{BS}|}\right) + F_N V(x) \tag{3.44}$$
Using Taylor series expansion, we have
$$V_T(x) = V_T + (1 + F_B)V(x) \tag{3.45}$$
By substituting for $V_T$ from Equation (3.45) in Equation (3.43), and integrating, we obtain the following expression for the drain current
$$I_{DS} = \mu_{eff}C_{ox}\frac{W_{eff}}{L_{eff}}\left[V_{GS} - V_T - \frac{1 + F_B}{2}V_{DS}\right]V_{DS} \tag{3.47}$$
The saturation voltage, which takes into account the carrier velocity saturation effect, is given by
$$V_{DS,sat} = V_{sat} + V_c - \sqrt{V_{sat}^2 + V_c^2} \tag{3.48}$$
where
$$V_{sat} = \frac{V_{GS} - V_T}{1 + F_B} \tag{3.49}$$
$$V_c = \frac{v_{max}L_{eff}}{\mu_s} \tag{3.50}$$
Table 3.1 shows the CMOS device and HSPICE parameters correspondence. Typical values for parameters of LEVEL 3 are shown in Table 3.2 for MOS devices of the 0.8 µm BiCMOS process described in Chapter 2.
The LEVEL 3 model approximates the device physics and relies on the proper choice of the empirical parameters to accurately reproduce the device characteristics.
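The effect of velocity saturation on the saturation voltage in Equations (3.48)-(3.50) can be seen with a few lines of Python. This is a minimal sketch; the function name and the device values are hypothetical, with lengths in cm and mobility in cm²/(V s).

```python
import math


def level3_vdsat(vgs, vt, f_b, vmax, l_eff, mu_s):
    """Eqs. (3.48)-(3.50): saturation voltage including velocity saturation."""
    v_sat = (vgs - vt) / (1.0 + f_b)      # Eq. (3.49)
    v_c = vmax * l_eff / mu_s             # Eq. (3.50)
    return v_sat + v_c - math.sqrt(v_sat * v_sat + v_c * v_c)


# Hypothetical device: V_GS - V_T = 2.5 V, F_B = 0.2, mu_s = 450 cm^2/(V s).
short = level3_vdsat(vgs=3.3, vt=0.8, f_b=0.2, vmax=1.5e7, l_eff=0.8e-4, mu_s=450.0)
long_ = level3_vdsat(vgs=3.3, vt=0.8, f_b=0.2, vmax=1.5e7, l_eff=10e-4, mu_s=450.0)
print(f"V_DS,sat (0.8 um) = {short:.2f} V, V_DS,sat (10 um) = {long_:.2f} V")
```

As $V_c$ becomes large (long channel or high $v_{max}$), $V_{DS,sat}$ tends to $V_{sat}$; for short channels it is noticeably lower.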
3.2.3 BSIM Model (LEVEL 4)

BSIM (Berkeley Short-Channel IGFET Model) is a simple and accurate short-channel MOS transistor model [7]. It is implemented in SPICE as LEVEL 4. The model was tested for effective channel lengths down to 1 µm. This model includes:
■ Vertical field dependence of carrier mobility;
■ Carrier velocity saturation;
■ Drain-induced barrier lowering effect;
■ Non-uniform doping in the channel surface and sub-surface regions effect;
Table 3.1 CMOS device parameters and HSPICE correspondence

SPICE Keyword   Description
LEVEL           Model level
VTO             Zero-bias threshold voltage
TOX             Gate oxide thickness
NSUB            Substrate doping
NFS             Surface fast state density
UO              Surface mobility
VMAX            Maximum drift velocity of carriers
ETA             Static feedback on threshold voltage
KAPPA           Saturation field factor
THETA           Mobility degradation factor
DELTA           Width effect on threshold voltage
XJ              Junction depth
CJ              Zero-bias bulk junction capacitance
JS              Bulk junction saturation current
JSW             Sidewall bulk junction saturation current
MJ              Bulk junction grading coefficient
PB              Junction potential
CJSW            Zero-bias sidewall capacitance
MJSW            Sidewall capacitance grading coefficient
CGDO            Gate-drain overlap capacitance
CGSO            Gate-source overlap capacitance
CGBO            Gate-bulk overlap capacitance
RD              Drain ohmic resistance
RS              Source ohmic resistance
LD              Lateral diffusion from drain or source
WD              Lateral diffusion along the width
XL              Masking and etching effects on L
XW              Masking and etching effects on W
ACM             Area calculation method
LDIF            Lateral diffusion beyond the gate
Table 3.2 HSPICE MOSFET model parameters (LEVEL=3) (0.8 µm BiCMOS process)
SPICE Keyword LEVEL VTO TOX NSUB NFS
N.Channel
3 0.8 17.5 Y 10-9 3.23 x 10" 820 Y 10s 503 150 x lo8 45 Y lo-*
PChannel 3
-0.9 17.5 x 10-9 3.37 Y 1 0 ' 6 764 Y 10' 165 190 x 108 121 x 10-8 1.45 135 x 10-3 0.336 230 x 450 x lo-' 5 x 10-4 5.5 Y 10-8 0.50 0.92 212 x 10-'1 0.30 215 x lo-" 215 Y lo-'> 571 x lo-'' 1189 1189 0.
0.
Units
uo
VMAX ETA KAPPA THETA DELTA XJ
6.7
63.4 x
10-3
lo-'
fl 728
C J
JS JSW MJ PB CJSW MJSW CGDO CGSO CGBO RD
275 x lorQ 250 x lo-' 5 10-4 5.5 x 10-0
n.m ...
0.92 205 x lo-'' 0.30 274 x 274 x 10-12 571 x 10-l' 596 596 59.5 x 10-9 0.
0.
RS
LD WD XL
xw
ACM LDIF
0.
0. 2 1 x 10-8
0.
2 940 x 10Wo
■ Depletion charge sharing by the drain and source;
■ Channel-length modulation;
■ Dependence of some electrical parameters on drain and substrate biases;
■ Better modeling of weak-, medium-, and strong-inversion regions and elimination of the discontinuity problem in the drain current; and
■ Geometric dependencies.
3.2.3.1 Threshold voltage:

The threshold voltage is given by
$$V_T = V_{FB} + \phi_s + K_1\sqrt{\phi_s + |V_{BS}|} - K_2\left(\phi_s + |V_{BS}|\right) - \eta V_{DS} \tag{3.51}$$
The two parameters, $K_1$ and $K_2$, model the effect of non-uniform doping of the substrate on the threshold voltage. Typical values for $K_1$ and $K_2$ are 1 $V^{1/2}$ and 0.12, respectively. The factor $\eta$ models the DIBL effect and accounts for the channel-length modulation effect. It is a function of $V_{DS}$ and $V_{BS}$.
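As a small worked example, the Python sketch below evaluates the BSIM threshold expression, assuming Equation (3.51) has the form $V_T = V_{FB} + \phi_s + K_1\sqrt{\phi_s + |V_{BS}|} - K_2(\phi_s + |V_{BS}|) - \eta V_{DS}$. The values $K_1 = 1\ V^{1/2}$ and $K_2 = 0.12$ are the typical values quoted above; $V_{FB}$, $\phi_s$ and $\eta$ are hypothetical, and $\eta$ is treated as a constant here although it is bias dependent in the model.

```python
import math


def bsim_threshold(v_fb, phi_s, k1, k2, eta, vbs_abs, vds):
    """Threshold voltage in the assumed form of Eq. (3.51)."""
    x = phi_s + vbs_abs
    return v_fb + phi_s + k1 * math.sqrt(x) - k2 * x - eta * vds


# Hypothetical values: V_FB = -0.9 V, phi_s = 0.7 V, eta = 0.02.
vt_low = bsim_threshold(-0.9, 0.7, 1.0, 0.12, 0.02, vbs_abs=0.0, vds=0.1)
vt_high = bsim_threshold(-0.9, 0.7, 1.0, 0.12, 0.02, vbs_abs=0.0, vds=3.3)
print(f"V_T(V_DS = 0.1 V) = {vt_low:.3f} V, V_T(V_DS = 3.3 V) = {vt_high:.3f} V")
```

The difference between the two results is the DIBL-induced threshold lowering, $\eta\,\Delta V_{DS}$.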
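3.2.3.2 Drain current:

When $V_{DS} \le V_{DS,sat}$, we have
$$I_{DS} = \frac{\mu_0}{1 + U_0(V_{GS} - V_T)}\cdot\frac{C_{ox}\dfrac{W_{eff}}{L_{eff}}}{1 + \dfrac{U_1}{L_{eff}}V_{DS}}\left[(V_{GS} - V_T)V_{DS} - \frac{a}{2}V_{DS}^2\right] \tag{3.52}$$
where
$$a = 1 + \frac{gK_1}{2\sqrt{\phi_s + |V_{BS}|}} \tag{3.53}$$
and
$$g = 1 - \frac{1}{1.744 + 0.836\left(\phi_s + |V_{BS}|\right)} \tag{3.54}$$
The parameters $U_0 = U_0(V_{BS})$, $U_1 = U_1(V_{BS})$ and $\mu_0 = \mu_0(V_{DS}, V_{BS})$ are bias sensitive. For $V_{DS} > V_{DS,sat}$, the drain current is given by Equation (3.55), where
$$K = \frac{1 + v_c + \sqrt{1 + 2v_c}}{2} \tag{3.56}$$
and $v_c$ is given by Equation (3.57). The drain-source saturation voltage is given by Equation (3.58).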
3.2.3.3 Subthreshold current:

In BSIM, the total drain current is modeled as the linear sum of a strong-inversion component and a weak-inversion component $I_w$. $I_w$ is expressed by Equations (3.59) to (3.61). The factor $e^{1.8}$ is empirical and is chosen to achieve the best fit. The subthreshold parameter $n$ is a function of $V_{DS}$ and $V_{BS}$.
3.2.3.4 Sensitivity Factors of Model Parameters:

BSIM uses a formula [Equation (3.62)] to account for the sensitivity of each parameter to the width and length of the channel, where $P_0$ is an arbitrary parameter, and $L_{P0}$ and $W_{P0}$ are the L and W sensitivity factors of $P_0$.
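The idea of size-dependent parameters can be sketched in Python as follows. The form used here, $P = P_0 + L_{P0}/L_{eff} + W_{P0}/W_{eff}$, is an assumption based on the usual description of the BSIM sensitivity factors and should be checked against Equation (3.62); the function name and numbers are purely illustrative.

```python
def size_adjusted_parameter(p0, lp, wp, l_eff, w_eff):
    """Parameter value for a device of size (l_eff, w_eff).

    Assumed form: P = P0 + LP0 / L_eff + WP0 / W_eff, with l_eff and w_eff
    in the same length unit as the sensitivity factors (e.g. um).
    """
    return p0 + lp / l_eff + wp / w_eff


# Hypothetical parameter: P0 = 1.0 with L and W sensitivities 0.1 and 0.2 (um).
p_small = size_adjusted_parameter(1.0, 0.1, 0.2, l_eff=1.0, w_eff=10.0)
p_large = size_adjusted_parameter(1.0, 0.1, 0.2, l_eff=100.0, w_eff=100.0)
print(f"P (1 um x 10 um) = {p_small:.3f}, P (100 um x 100 um) = {p_large:.3f}")
```

In this form the parameter tends to $P_0$ for large devices, so $P_0$ can be read as the large-geometry value.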
Another deep-submicrometer MOSFET model called BSIM3 [8] has been developed for circuit simulation. It uses improved threshold voltage, drain current and channel-length modulation models. The model is also simple and has a small number of parameters (≈ 25).
3.2.4 MOS Capacitances

In transient simulation, MOS capacitances are very important for CMOS and BiCMOS circuit analysis. The MOS capacitances can be divided into two types of lumped capacitors:
■ the depletion capacitors of the bulk-drain and bulk-source pn junctions ($C_{BD}$ and $C_{BS}$) [Fig. 3.8];
■ the capacitors associated with the gate ($C_{GS}$, $C_{GD}$, $C_{GB}$, $C_{GSov}$, $C_{GDov}$ and $C_{GBov}$) [see Fig. 3.8, except for $C_{GBov}$].

3.2.4.1 Junction Depletion Capacitances

The bulk-source and the bulk-drain junctions have a bottom area $A_S$ and $A_D$, respectively, and a sidewall with a perimeter $P_S$ and $P_D$, respectively. Both the bottom area and the sidewall contribute to the total depletion capacitance. The bottom area capacitance is measured per unit area, while the sidewall capacitance is measured per unit perimeter. Both of these components are voltage dependent. As these junctions are normally reverse biased, we will consider the case when the bulk-source and bulk-drain voltages ($V_{BS}$ and $V_{BD}$) are less than or equal to $0.5\phi_j$ ($\phi_j$ is the junction built-in potential). The total bulk-source and bulk-drain capacitances can be expressed by the relations given in [1].
The exponential factors $M_j$ and $M_{jsw}$ are in the order of 0.3-0.5. $C_j$ is the zero-bias capacitance of the bottom junction per unit area and $C_{jsw}$ is the zero-bias capacitance per unit perimeter.
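A minimal Python sketch of the voltage dependence is given below. It assumes the usual SPICE form for each junction, $C = C_jA/(1 - V/\phi_j)^{M_j} + C_{jsw}P/(1 - V/\phi_j)^{M_{jsw}}$, valid for $V \le 0.5\phi_j$; this expression, the function name and the numerical values are assumptions made for the example.

```python
def junction_capacitance(v, area, perimeter, cj, cjsw, mj, mjsw, phi_j):
    """Total depletion capacitance of a bulk-source or bulk-drain junction.

    v is the junction voltage V_BS or V_BD (negative for reverse bias).
    Assumed SPICE-like form, restricted to v <= 0.5 * phi_j.
    """
    if v > 0.5 * phi_j:
        raise ValueError("expression is only used for v <= 0.5 * phi_j")
    bottom = cj * area / (1.0 - v / phi_j) ** mj
    sidewall = cjsw * perimeter / (1.0 - v / phi_j) ** mjsw
    return bottom + sidewall


# Hypothetical junction: 2 x 5 um diffusion, Cj = 0.25 fF/um^2, Cjsw = 0.2 fF/um.
c_zero = junction_capacitance(0.0, 10.0, 14.0, 0.25, 0.2, 0.5, 0.3, 0.9)
c_rev = junction_capacitance(-3.3, 10.0, 14.0, 0.25, 0.2, 0.5, 0.3, 0.9)
print(f"C(0 V) = {c_zero:.2f} fF, C(-3.3 V) = {c_rev:.2f} fF")
```

Reverse bias widens the depletion region, so the capacitance at -3.3 V is smaller than the zero-bias value.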
3.2.4.2 Gate Capacitances

The gate capacitances can be divided into two categories:
■ The fixed overlap capacitances: gate-drain ($C_{GDov}$), gate-source ($C_{GSov}$), and gate-bulk ($C_{GBov}$) overlap capacitances. Both $C_{GSov}$ and $C_{GDov}$ exist due to the lateral diffusion of the source and drain under the gate. They are usually given per unit width as $C_{GSO}$ and $C_{GDO}$. The total gate-source and gate-drain overlap capacitances are given by
$$C_{GSov} = C_{GSO} W_{eff} \tag{3.65}$$
$$C_{GDov} = C_{GDO} W_{eff} \tag{3.66}$$
where $C_{GSO}$ and $C_{GDO}$ are equal to $C_{ox} L_D$. The capacitor $C_{GBov}$ is due to the overlap of the gate oxide and the bulk along the channel length at both ends of the active area of the transistor. This capacitance is typically normalized to the effective channel length; the total $C_{GBov}$ is hence given by
$$C_{GBov} = C_{GBO} L_{eff} \tag{3.67}$$
where $C_{GBO}$ is equal to $C_{ox} W_d$.
■ The nonlinear capacitance due to the charge of the bulk or the channel. This capacitance is actually distributed but can be modeled by lumped capacitances. In the case when the channel does not exist, the capacitance can be expressed as
$$C_{GB} = C_{ox} W_{eff} L_{eff} \tag{3.68}$$
When the device is in the linear region the channel extends uniformly to the drain. The channel shields the bulk and the capacitance exists only between the gate and the channel. The gate-bulk capacitance goes to zero. The gate-channel capacitance can be expressed in terms of two equal lumped capacitances, a gate-source and a gate-drain capacitance, which are denoted $C_{GS}$ and $C_{GD}$ and are given by
$$C_{GS} = C_{GD} = \frac{1}{2} C_{ox} W_{eff} L_{eff} \tag{3.69}$$
Finally, when the device enters saturation, the channel at the drain pinches off and hence the gate-drain capacitance component becomes zero while the gate-source capacitance can be expressed by
$$C_{GS} = \frac{2}{3} C_{ox} W_{eff} L_{eff} \tag{3.70}$$
Fig. 3.9 depicts the change of the capacitance components as a function of the gate-source voltage (assuming that the source-bulk voltage is zero). The total gate-source capacitance is given by the sum of $C_{GSov}$ and $C_{GS}$ and, similarly, the total gate-drain capacitance is given by the sum of $C_{GDov}$ and $C_{GD}$. The capacitance model described above can be used for circuit analysis and circuit design. SPICE uses a charge-control model, which was developed by Ward and Dutton [9]. This model is based on the actual distribution of charge in the MOS structure and its conservation.
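The region-by-region gate capacitance model above can be sketched as a small lookup, shown here for hand estimates only. The function name, the region labels and the choice to omit the gate-bulk overlap term are assumptions made for this illustration.

```python
def gate_capacitances(region, w_eff, l_eff, c_ox, c_gso=0.0, c_gdo=0.0):
    """Return total (C_GS, C_GD, C_GB) for one operating region.

    region       -- "cutoff", "linear" or "saturation"
    w_eff, l_eff -- effective channel width and length
    c_ox         -- gate-oxide capacitance per unit area
    c_gso, c_gdo -- overlap capacitances per unit width
    The gate-bulk overlap capacitance is left out of this sketch.
    """
    c_gate = c_ox * w_eff * l_eff              # total gate-oxide capacitance
    if region == "cutoff":                     # Eq. (3.68): no channel
        c_gs, c_gd, c_gb = 0.0, 0.0, c_gate
    elif region == "linear":                   # Eq. (3.69): channel shields the bulk
        c_gs, c_gd, c_gb = c_gate / 2, c_gate / 2, 0.0
    elif region == "saturation":               # Eq. (3.70): drain end pinched off
        c_gs, c_gd, c_gb = 2 * c_gate / 3, 0.0, 0.0
    else:
        raise ValueError("region must be 'cutoff', 'linear' or 'saturation'")
    # Add the fixed overlap terms of Eqs. (3.65) and (3.66).
    return c_gs + c_gso * w_eff, c_gd + c_gdo * w_eff, c_gb
```

This lumped picture only switches abruptly between regions; the charge-based SPICE model mentioned above is what a simulator would actually use.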
3.3 CMOS LOW-VOLTAGE ANALYTICAL MODEL

The MOS models discussed previously have been developed for circuit simulators. These models (e.g. BSIM) involve large numbers of parameters whose values must be derived from device measurements. With these models it is difficult to develop an intuitive understanding of the device behavior. Therefore, an analytical drain current model valid for submicrometer MOSFETs operating at low voltage is needed for hand calculation and first order circuit analysis, with reasonable accuracy.
3.3.1 Threshold Voltage Definitions
The threshold voltage, $V_T$, has several definitions which are important for the estimation of the static power dissipation. The first definition is the extrapolated threshold voltage obtained from the $I_{DS}$-$V_{GS}$ characteristic [see Section 3.2.1]. Another one is the constant-current (i.e., 10 nA per unit width) threshold voltage. These voltages do not have the same value [10, 11]. The extrapolated $V_T$ is approximately 0.2 V higher than the constant-current one [11]. The extrapolated threshold voltage should be scaled down proportionally to the supply voltage. This is because the drive (saturation) current depends on ($V_{DD} - V_T$(extrapolated)).
3.3.2 Subthreshold Current

When the threshold voltage is scaled for low power supply voltage operation, the subthreshold current increases significantly. This current is a limiting factor for battery-operated circuits. As shown in Fig. 3.10, the drain current in the subthreshold region can be modeled by
$$I_{DS,sub} = \frac{W_{eff}}{W_0} I_0 \, 10^{(V_{GS} - V_T)/S} \tag{3.71}$$
where $V_T$ here is the constant-current threshold voltage. $I_0$ and $W_0$ are the drain current and the gate width used to define $V_T$. $S$ is the subthreshold swing parameter, which is the gate voltage swing required to reduce the drain current by one decade. The current $I_0$ is related to $V_{DS}$ by
$$I_0 = I_0' \left(1 - e^{-V_{DS}/V_t}\right) \tag{3.72}$$
The subthreshold swing is given by [12]
$$S \approx 2.3\, V_t \left(1 + \frac{C_d}{C_{ox}}\right) \quad \text{V/decade} \tag{3.73}$$
where $C_d$ is the depletion-layer capacitance of the source/drain junctions. Thus, $S$ has a theoretical minimum limit of 60 mV/decade.
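As a quick check of the 60 mV/decade limit, the sketch below evaluates the swing expression of Equation (3.73), taking $V_t$ to be the thermal voltage $kT/q$ and using $\ln 10$ in place of the rounded factor 2.3. The function name and the example capacitance ratio of 0.26 are assumptions chosen for illustration.

```python
import math

BOLTZMANN = 1.380649e-23           # J/K
ELECTRON_CHARGE = 1.602176634e-19  # C

def subthreshold_swing(temp_kelvin, cd_over_cox):
    """Subthreshold swing in V/decade: ln(10) * (kT/q) * (1 + C_d/C_ox)."""
    v_t = BOLTZMANN * temp_kelvin / ELECTRON_CHARGE   # thermal voltage
    return math.log(10.0) * v_t * (1.0 + cd_over_cox)

# Ideal limit (C_d/C_ox -> 0) at 300 K, and a ratio of 0.26 that gives
# roughly the 75 mV/decade figure used later in this section.
ideal = subthreshold_swing(300.0, 0.0)
typical = subthreshold_swing(300.0, 0.26)
```

With a vanishing capacitance ratio the swing at 300 K comes out just under 60 mV/decade, which is the limit quoted in the text.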
The leakage current, due to the subthreshold conduction, is computed from $I_{DS,sub}$ when $V_{GS} = 0$. Then
$$I_{leak} = \frac{W_{eff}}{W_0} I_0 \, 10^{-V_T/S} \tag{3.74}$$
Using the examples of Fig. 3.10, typical values for the constant-current and extrapolated threshold voltages are 0.3 V and 0.5 V respectively. The parameter $S$ is equal to 75 mV/decade and the leakage current is equal to 1 pA/µm. When estimating the static power dissipation, the worst-case leakage current has to be evaluated. In this case, the worst-case threshold voltage, $V_{T,wc}$, has to be used, where
$$V_{T,wc} = V_T - \Delta V_T \tag{3.75}$$
$\Delta V_T$ is the variation of the threshold voltage due to fluctuations of process parameters such as the oxide thickness, doping profile, junction depth, gate length and width, etc. $\Delta V_T$ can be as high as 50 mV on the same wafer and 150 mV for different wafers. This results in an increase of almost two decades in the leakage current. The temperature effect also has to be considered when the leakage current is computed. The temperature affects both $V_T$ and $S$. A typical value of the temperature coefficient of the threshold voltage is a 1.6 mV decrease per degree Celsius. The subthreshold swing $S$ increases by 0.25 mV/(decade·°C) [see Equation 3.73]. For example, if the temperature increases from 25 °C to 75 °C, the threshold voltage decreases by 80 mV and the leakage current equals 30 pA/µm (initial extrapolated $V_T$ = 0.5 V). This value is 30 times higher than that at 25 °C. Both the temperature and process effects can result in a drastic increase of the worst-case static power dissipation. Note that this variation of $V_T$ greatly affects the delay of CMOS circuits at low supply voltage, since the drive current is proportional to ($V_{DD} - V_T$).
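The numbers in this example can be reproduced with a few lines of arithmetic. The sketch below assumes a reference current of 10 nA per µm of width, which is the value implied by a 1 pA/µm leakage at $V_T$ = 0.3 V and $S$ = 75 mV/decade; the function names and default arguments are illustrative.

```python
def leakage_per_um(vt_const, swing, i0_per_um=10e-9):
    """Leakage per unit width, Eq. (3.74): (I_0 / W_0) * 10**(-V_T / S).

    vt_const  -- constant-current threshold voltage in volts
    swing     -- subthreshold swing in V/decade
    i0_per_um -- assumed reference current per micrometre of width, in amperes
    """
    return i0_per_um * 10.0 ** (-vt_const / swing)

def leakage_at_temperature(temp_c, vt_25=0.3, s_25=0.075,
                           dvt_dt=-1.6e-3, ds_dt=0.25e-3):
    """Apply the linear temperature coefficients quoted in the text."""
    delta_t = temp_c - 25.0
    return leakage_per_um(vt_25 + dvt_dt * delta_t, s_25 + ds_dt * delta_t)

room = leakage_at_temperature(25.0)      # about 1 pA/um
hot = leakage_at_temperature(75.0)       # about 30 pA/um
process = leakage_per_um(0.3 - 0.15, 0.075) / leakage_per_um(0.3, 0.075)
```

The last line shows the process effect on its own: a 150 mV drop in $V_T$ at 75 mV/decade raises the leakage by two decades.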
3.3.3 Low-Voltage Drain Current

A part of this model is based on the one proposed in [13]. For long-channel devices, the carrier drift velocity $v$ is related to the horizontal electric field $E$ by a simple linear relation ($v = \mu E$) where the carrier mobility is constant. For short-channel devices, the mobility is no longer a constant and is a function of the vertical electric field in the inversion layer. At this point we prefer to use the symbol $\mu_{eff}$ for the mobility to denote its dependence on the vertical electric field. Also, the velocity ($v$) is no longer proportional to $E$ but is given by the following two-region piecewise empirical model [14]
$$v = \begin{cases} \dfrac{\mu_{eff} E}{1 + E/E_c} & E < E_c \\ v_{sat} & E \ge E_c \end{cases} \tag{3.76}$$
where
$$E_c = \frac{2 v_{sat}}{\mu_{eff}} \tag{3.77}$$
and the saturation velocity $v_{sat}$ is equal to $8 \times 10^6$ cm/s for electrons (NMOS device) and $6.5 \times 10^6$ cm/s for holes (PMOS device). The drain current in the triode region ($V_{DS} \le V_{DS,sat}$) is given by [13]
$$I_{DS} = \frac{W_{eff}}{L_{eff}} \frac{\mu_{eff} C_{ox}}{1 + V_{DS}/(E_c L_{eff})} \left(V_{GS} - V_T - \frac{V_{DS}}{2}\right) V_{DS} \tag{3.78}$$
The saturation current can be expressed by
$$I_{DS,sat} = v_{sat} C_{ox} W_{eff} (V_{GS} - V_T - V_{DS,sat}) \tag{3.79}$$
By equating (3.78) and (3.79) we can derive the following expression for $V_{DS,sat}$
$$V_{DS,sat} = (1 - K)(V_{GS} - V_T) \tag{3.80}$$
where
$$K = \frac{1}{1 + E_c L_{eff}/(V_{GS} - V_T)} \tag{3.81}$$
The drain current in saturation can be rewritten as
$$I_{DS,sat} = K v_{sat} C_{ox} W_{eff} (V_{GS} - V_T) \tag{3.82}$$
Note that $V_T$, in the current equation, is the extrapolated threshold voltage. The mobility $\mu_{eff}$ for electrons can be expressed as [15]
fin = 240\/0.06tO./(Vcs
+vT)
f m NC ply-gate
fm ' P
fop
(3 83)
and far holes
..=(
65[O.O6t,/(V~s - V T ) ] " ~ 65 [0.06t,/(T'as VT - I)]"'
POlY- gate
N i p l y - gate
(3.84)
where $t_{ox}$ is in Å and the mobility is in cm²/(V·s). The analytical model can be used for gate lengths down to the deep-submicron range.
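For hand calculation, the saturation model of Equations (3.77) to (3.82) reduces to a few lines. In the sketch below the mobility is passed in as a plain number rather than computed from the gate-field expressions, and $K$ is taken as $1/(1 + E_c L_{eff}/(V_{GS} - V_T))$, the form obtained by equating the triode and saturation currents; the function name and the sample device values are assumptions.

```python
def saturation_current(v_gs, v_t, w_eff, l_eff, c_ox, mu_eff, v_sat):
    """Return (I_DSsat, V_DSsat) for a short-channel MOSFET.

    Units are cm-based: w_eff and l_eff in cm, c_ox in F/cm^2,
    mu_eff in cm^2/(V*s), v_sat in cm/s, voltages in volts.
    """
    v_ov = v_gs - v_t                        # gate overdrive
    if v_ov <= 0.0:
        return 0.0, 0.0                      # device is off in this simple model
    e_c = 2.0 * v_sat / mu_eff               # Eq. (3.77), critical field
    k = 1.0 / (1.0 + e_c * l_eff / v_ov)     # assumed form of Eq. (3.81)
    v_dsat = (1.0 - k) * v_ov                # Eq. (3.80)
    i_dsat = k * v_sat * c_ox * w_eff * v_ov # Eq. (3.82)
    return i_dsat, v_dsat

# Hypothetical NMOS: W = 10 um, L = 0.5 um, C_ox = 3.45e-7 F/cm^2,
# mobility 400 cm^2/(V*s), v_sat = 8e6 cm/s, V_GS = 3.3 V, V_T = 0.7 V.
i_sat, v_sat_drain = saturation_current(3.3, 0.7, 10e-4, 0.5e-4,
                                        3.45e-7, 400.0, 8e6)
```

For a very long channel $K$ tends to $(V_{GS} - V_T)/(E_c L_{eff})$ and the expression falls back to the familiar square law, which is a useful sanity check on any implementation.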
3.4 CMOS POWER SUPPLY VOLTAGE SCALING
Scaling device feature size has been used to increase packing density and speed. MOSFET scaling can follow three theories:
1. Constant Electric Field (CE) scaling [16].
2. Constant Voltage (CV) scaling [17].
3. Quasi-Constant Voltage (QCV) scaling [17].
[Table 3.3: scaling expressions for the dimensions, gate oxide, doping, voltage, capacitance, current, gate delay, dynamic power and dynamic energy]
In the CE scheme all horizontal and vertical dimensions and voltages scale linearly with the same factor. In the CV scheme, the dimensions are scaled, while the voltages are kept constant. This scenario has been the most commonly used. While constant electric field scaling is natural from the device physics point of view, constant voltage scaling is more practical from the systems standpoint. Changing the supply voltage every technology generation (when the feature sizes are scaled) is too expensive because multiple power-supply generators will be required for each PC board. However, as the channel length scales below about 0.6 µm, the 5 V supply voltage must be reduced for reliability reasons (e.g. hot carrier effects, breakdown, etc.). The quasi-constant voltage scaling is an intermediary scheme between the CE and CV views. The scaling factors of the horizontal dimensions and the voltages are denoted by $k_h$ and $k_v$, respectively. Table 3.3 summarizes the scaling of the important device parameters according to the three theories as a function of the horizontal scaling factor ($k_h$). Note that in the QCV scheme, the dimensions scale more aggressively than the voltage ($k_v = k_h^{1/2}$).
For the drain current, the following average value is used
$$I_{DS} \propto \frac{W}{L} C_{ox} (V_{GS} - V_T)^{1.5} \tag{3.85}$$
This expression is not far from the one proposed by [El. Table 3.3 shows the effect of device scaling on the delay, power and energy. It is assumed that a gate drives other gates, where the load is mainly the gate capacitance. The threshold voltage is scaled in proportion to the $V_{DD}$ scaling. The gate delays improve with scaling for all the scenarios, but at a better rate in the CV scheme. However, the dynamic power of the gate, at maximal frequency, increases by a factor $k_h$ in the case of CV. For the CE scheme, the power is reduced by a large factor equal to $k_h^{1.5}$. Also in this table, the dynamic energy dissipated by a gate is reported. This is independent of frequency. For all schemes, it improves significantly, particularly for the CE case.
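The trends described above can be derived mechanically from the stated assumptions: all dimensions shrink as $1/k_h$, the load is a gate capacitance, the threshold voltage follows the supply, and the current follows Equation (3.85). The sketch below works with exponents of $k_h$ only. It is a derivation under those assumptions, not a transcription of Table 3.3, and the function and dictionary names are illustrative.

```python
from fractions import Fraction

def scaling_exponents(voltage_exp):
    """Exponents of k_h when voltages scale as k_h**(-voltage_exp).

    Dimensions (W, L, t_ox) are assumed to scale as k_h**(-1).
    A negative exponent means the quantity shrinks with scaling.
    """
    v = -Fraction(voltage_exp)           # voltage exponent
    dim = Fraction(-1)                   # every linear dimension
    c_ox = -dim                          # oxide capacitance per area ~ 1/t_ox
    cap = c_ox + 2 * dim                 # gate capacitance ~ C_ox * W * L
    current = c_ox + Fraction(3, 2) * v  # Eq. (3.85), W/L unchanged
    delay = cap + v - current            # delay ~ C * V / I
    energy = cap + 2 * v                 # energy ~ C * V**2
    power = energy - delay               # power at maximal frequency ~ energy / delay
    return {"current": current, "delay": delay,
            "power": power, "energy": energy}

SCHEMES = {"CE": 1, "CV": 0, "QCV": Fraction(1, 2)}
results = {name: scaling_exponents(exp) for name, exp in SCHEMES.items()}
```

Under these assumptions the CV scheme has the fastest delay improvement but a power that grows with $k_h$, while the CE scheme gives the largest energy reduction, in line with the discussion above.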
Scaling the supply voltage is an efficient way to reduce the power consumption. However, to get better performance at low voltage, the device sizes and the threshold voltage have to be properly scaled. For a fixed sub-micron technology, the supply voltage cannot be reduced aggressively, otherwise the speed is degraded. However, for each fixed technology generation, there is a lower limit on the power supply voltage, $V_{DD,min}$ [18]. For values of $V_{DD}$ higher than this minimum limit the speed does not improve significantly. Typical values for $V_{DD,min}$ are 3.3 V and 2.5 V for $L_{eff}$ of 0.5 µm and 0.3 µm, respectively. On the other hand, the upper limit of $V_{DD}$ is driven by the reliability and the power dissipation limitations. The value of this $V_{DD}$ is proportional to the square root of the design rules ($\delta$) [19]. For 0.6 µm and 0.3 µm design rules with LDD structure, these upper limits are 4.5 V and 3.3 V, respectively.
3.5 MODELING OF THE BIPOLAR TRANSISTOR
3.5.1 BJT Structure and Operation

Fig. 3.11 shows a cross-sectional view of an NPN bipolar junction transistor with the geometrical layout and the corresponding symbols for NPN and PNP. To understand the basic operation of the bipolar transistor, a one-dimensional representation of the active region can be used. Fig. 3.12(a) illustrates a typical profile of the one-dimensional section of the active region [Fig. 3.12(b)]. The N+PN- sandwich forms the heart of the BJT. Consider an NPN transistor with $V_{BE} > 0.5$ V and $V_{BC} < 0$ V (forward-active mode). The corresponding energy band diagram is shown in Fig. 3.12(c). When the N+P (emitter-base) junction is forward-biased, electrons are injected from the emitter into the base (current $I_{nE}$). A small fraction of these electrons recombine in the neutral base ($I_{rB}$). The rest of the electrons, which constitute the current $I_{nC}$, diffuse through the base towards the reverse-biased base-collector junction, where they are swept by the electric field into the base-collector depletion layer. On the other hand, some of the holes in the base are injected into the N+ emitter region, resulting in a current $I_{pE}$. This component is small compared to $I_{nE}$ because the hole concentration in the base is much smaller than the electron concentration in the emitter. The emitter-base depletion layer can be a site for recombination between the injected electrons and holes, resulting in a current $I_{rd}$. Moreover, some holes are swept into the base due to generation in the base-collector depletion region, but this component is very small ($\approx 10^{-17}$ A/µm²). The terminal currents can be written as follows
$$I_C = I_{nC} \tag{3.86}$$
$$I_B = I_{rB} + I_{rd} + I_{pE} \tag{3.87}$$
$$I_E = I_{nE} + I_{rd} + I_{pE} \tag{3.88}$$
Note that it has been assumed that the base and collector currents are flowing into the device, while the emitter current is flowing out of it [Fig. 3.12]. The emitter injection efficiency, which is defined as the ratio of the electron current injected into the base to the total emitter current, is given by
$$\gamma = \frac{I_{nE}}{I_E} \tag{3.89}$$
This ratio has to be near unity; that is, the emitter current should mostly be due to electrons for an NPN transistor. The ratio
$$\beta = \frac{I_C}{I_B} \tag{3.90}$$
is defined as the DC current gain.
When the emitter-base junction is reverse-biased and the collector-base junction is forward-biased, the transistor is in the inverse region, where the roles of the emitter and collector are exchanged. When both junctions are reverse-biased the transistor is in the cutoff region. When both are forward-biased, the device is said to be in the saturation region. In this situation, both junctions are injecting into the base, and the small electric fields in the two depletion regions sweep the carriers into the emitter and collector regions. Both junctions collect as well as emit.
3.5.2 Ebers-Moll Model

In this section, we present the Ebers-Moll (EM) model, which is a simple DC model of the bipolar transistor. The Ebers-Moll model can be used for hand calculations and first order circuit analysis. The derivation of the model equations in this section is based on the analysis by Roulston [20]. In Section 3.5.1, we discussed the device operation in the forward active region only. For a general analysis, we assume that both the base-emitter and the base-collector junctions are forward biased. In the following discussion we will neglect the currents due to recombination in the space charge layers and in the base. This implies that $I_{nC} = I_{nE}$; hence, Equation (3.88) reduces to
$$I_E = I_{nC} + I_{pE} \tag{3.91}$$
The current due to the holes injected from the base into the emitter is given by [20]
$$I_{pE} = \frac{q A_E D_{pE}\, p_{nE0}}{W_E}\left[e^{V_{BE}/V_t} - 1\right] \tag{3.92}$$
where $p_{nE0}$ is the equilibrium hole concentration in the emitter and $W_E$ is the neutral emitter width. The current $I_{nC}$ is dominated by the diffusion current in the base and is proportional to the gradient of the minority carriers (electrons) in the neutral base. Because the neutral base width ($W_B$) is very thin, this gradient is approximately constant. Therefore, we can write $I_{nC}$ as [20]
$$I_{nC} = q A_E D_{nB}\left[\frac{n_B(0) - n_B(W_B)}{W_B}\right] \tag{3.93}$$
where $n_B(0)$ and $n_B(W_B)$ are the electron concentrations at the edges of the emitter-base and collector-base depletion regions respectively [see Fig. 3.13]. Note that the slope of the electron concentration in the base is given by the term between the brackets, as demonstrated by Fig. 3.13.
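*By applying KCL ($I_B + I_C - I_E = 0$), we can see that $I_{rB}$ is the difference between $I_{nE}$ and $I_{nC}$. If the recombination in the base is neglected ($I_{rB} = 0$), we can see that $I_{nE} = I_{nC}$.
[Figure 3.13: electron concentration profile in the neutral base, with the Emitter, Base and Collector regions labeled]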
Using the junction law, the electron concentrations $n_B(0)$ and $n_B(W_B)$ can be expressed in terms of $V_{BE}$ and $V_{BC}$ respectively. The current $I_{nC}$ can hence be given by [20]
$$I_{nC} = \frac{q A_E D_{nB}\, n_i^2}{N_B W_B}\left[e^{V_{BE}/V_t} - e^{V_{BC}/V_t}\right] \tag{3.94}$$
where $N_B$ is the base impurity concentration. The collector current is given by
$$I_C = I_{nC} - I_{pC} \tag{3.95}$$
The current $I_{pC}$ is due to the holes injected from the base into the collector. (Note that $I_{pC}$ was not included in Equation (3.86) because, in deriving that equation, we assumed that the collector-base junction was reverse biased.) The base-collector junction is basically a P+NN+ structure as shown in Fig. 3.12(a). An expression for $I_{pC}$ can be derived from the analysis of a P+NN+ diode. The reader is advised to consult reference [20] for the details of this analysis. The current $I_{pC}$ is given by
where $p_{nC0}$ is the equilibrium hole concentration in the collector, $W_C$ is the epitaxial thickness under the base and $\tau_{pC}$ is the hole lifetime in the epitaxial layer. By substituting from Equations (3.92) and (3.94) in Equation (3.91) and from Equations (3.94) and (3.96) in Equation (3.95) we get the following equations for $I_E$ and $I_C$
$$I_E = I_f - \alpha_r I_r \tag{3.97}$$
$$I_C = -I_r + \alpha_f I_f \tag{3.98}$$
Equations (3.97) and (3.98) are called the Ebers-Moll equations. Fig. 3.14 shows the equivalent circuit of the BJT based on the Ebers-Moll equations. The Ebers-Moll model described above is general and can be used for any region of operation by substituting the appropriate values for $V_{BE}$ and $V_{BC}$. In the forward active region, assuming that $V_{BE} = 0.8$ V and $V_{BC} < 0.3$ V, the emitter and collector currents of Equations (3.97) and (3.98) reduce to
$$I_E = I_f \approx I_{ES}\, e^{V_{BE}/V_t} \tag{3.102}$$
$$I_C \approx \alpha_f I_f \tag{3.103}$$
where the reverse saturation current of the base-emitter junction, $I_{ES}$, can be derived from Equation (3.99) and is given by
Figure 3.14 Equivalent DC circuit of the BJT based on the Ebers-Moll model
It can easily be shown that the base current can be expressed as
$$I_B = \frac{1 - \alpha_f}{\alpha_f} I_C \tag{3.105}$$
Equations (3.102), (3.103) and (3.105) are the well-known current equations of a forward-biased bipolar transistor. Note that Equation (3.105) yields the famous relation between $\alpha_f$ and the DC forward current gain $\beta$ [$\beta = \alpha_f/(1 - \alpha_f)$].
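The Ebers-Moll equations lend themselves to a direct numerical check of the gain relation. The sketch below assumes the usual diode forms $I_f = I_{ES}(e^{V_{BE}/V_t} - 1)$ and $I_r = I_{CS}(e^{V_{BC}/V_t} - 1)$ for the two junction currents; the function name, the symbol $I_{CS}$ and all numerical values are illustrative assumptions.

```python
import math

def ebers_moll(v_be, v_bc, i_es, i_cs, alpha_f, alpha_r, v_t=0.02585):
    """Return (I_E, I_C, I_B) from the Ebers-Moll equations.

    i_es, i_cs -- saturation currents of the base-emitter and
                  base-collector junctions (assumed diode forms)
    v_t        -- thermal voltage in volts (about 25.85 mV at 300 K)
    """
    i_f = i_es * (math.exp(v_be / v_t) - 1.0)   # forward diode current
    i_r = i_cs * (math.exp(v_bc / v_t) - 1.0)   # reverse diode current
    i_e = i_f - alpha_r * i_r                   # Eq. (3.97)
    i_c = alpha_f * i_f - i_r                   # Eq. (3.98)
    i_b = i_e - i_c                             # KCL at the device
    return i_e, i_c, i_b

# Forward-active bias with hypothetical parameters; i_cs is chosen so that
# alpha_f * i_es equals alpha_r * i_cs.
i_e, i_c, i_b = ebers_moll(0.8, -2.0, 1e-16, 1.98e-16, 0.99, 0.5)
beta = i_c / i_b
```

In forward-active operation the computed ratio $I_C/I_B$ settles at $\alpha_f/(1 - \alpha_f)$, which is 99 for the value of $\alpha_f$ used here.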
The simple Ebers-Moll model lacks accuracy for the following three reasons:
1. It does not account for the parasitic resistors of the emitter, base and collector.
2. It does not account for the Early effect, which causes the collector current to increase as the collector-emitter voltage increases.
3. It does not account for the effect of high collector currents on the current gain.
Next, we will discuss the modeling of each phenomenon separately.
3.5.2.1 The Parasitic Resistors of a Bipolar Transistor

Fig. 3.15 shows the modification of the EM model by the addition of the base resistance $R_B$, the collector resistance $R_C$ and the emitter resistance $R_E$. These extrinsic components represent the transistor's parasitic resistances from the active region to the base, collector and emitter terminals, respectively. The effect of the parasitic resistances is important because the voltage drops across them contribute to the external base-emitter and collector-emitter voltages, $V_{B'E'}$ and $V_{C'E'}$ respectively, as shown by the following two equations
$$V_{B'E'} = V_{BE} + R_B I_B + R_E I_E \tag{3.106}$$
$$V_{C'E'} = V_{CE} + R_C I_C + R_E I_E \tag{3.107}$$
The drop across the parasitic resistors has to be accounted for to get more accurate results from the EM model. Neglecting these drops may actually lead to erroneous results. For example, if the external collector-emitter voltage is found to be equal to 2 V, one may deduce that the BJT operates in the active region. However, if $R_C$ = 1.8 kΩ, $R_E$ = 0.1 kΩ and $I_C \approx I_E$ = 1 mA, then the intrinsic collector-emitter voltage ($V_{CE}$) is 0.1 V. This implies that the bipolar transistor is actually saturated. This phenomenon is known as quasi-saturation.
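The quasi-saturation example amounts to solving Equation (3.107) for the intrinsic voltage. A minimal sketch follows; the function name is illustrative, and the 0.2 V saturation threshold in the last line is an assumed rule of thumb, not a value from the text.

```python
def intrinsic_vce(v_ce_external, i_c, i_e, r_c, r_e):
    """Intrinsic collector-emitter voltage from Eq. (3.107).

    v_ce_external -- voltage measured at the device terminals (V)
    i_c, i_e      -- collector and emitter currents (A)
    r_c, r_e      -- parasitic collector and emitter resistances (ohms)
    """
    return v_ce_external - r_c * i_c - r_e * i_e

# Values from the example: 2 V external, 1.8 kohm and 0.1 kohm, 1 mA.
v_ce = intrinsic_vce(2.0, 1e-3, 1e-3, 1800.0, 100.0)
looks_saturated = v_ce < 0.2   # assumed threshold for illustration only
```

The external 2 V shrinks to about 0.1 V across the intrinsic device, which is why the transistor is saturated despite the apparently comfortable terminal voltage.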
3.5.2.2 The Early Effect
The Early effect refers to the base width modulation due to the change of the collector-base reverse voltage (in the forward active region). As the collector-base reverse voltage increases, the base-collector depletion layer widens. The resulting reduction in the neutral base width causes the current gain to increase which, in turn, leads to an increase in the collector current [see Fig. 3.16]. This effect can be modeled by introducing the Early voltage ($V_{Af}$) in the expression of the collector current as follows
(3.108)
The inverse of the forward Early voltage, $1/V_{Af}$, is analogous to the coefficient $\lambda$ in an MOS transistor. A typical value of $V_{Af}$ is 50 V. The AC output resistance of the BJT in the forward active region is related to the Early voltage and is given by
$$r_o \approx \frac{V_{Af}}{I_C} \tag{3.109}$$
The Early effect in the inverse active region can be modeled by using the reverse Early voltage ($V_{Ar}$), which characterizes the slope of the collector current in that region.
Figure 3.16 The I-V characteristics of a BJT
3.5.2.3 High Current Effects
The current gain and the cut-off frequency are degraded at high collector current. Fig. 3.17 shows the effect of the collector current on the gain. This degradation can be attributed to high-level injection in the base (Webster effect) and/or base pushout (Kirk effect). For a detailed discussion of these phenomena, the reader is advised to consult reference [20]. In the case where the injection level in the base is high (Webster effect), the collector current can be expressed as [23]
$$I_C = \sqrt{I_S I_{Kf}}\; e^{V_{BE}/2V_t} \tag{3.110}$$
where the forward knee current $I_{Kf}$ is defined as the collector current at which the slope in the Gummel plot changes from 1 to 1/2 [see Fig. 3.18]*. This current marks the onset of high-level injection. The degradation of the current gain, when $I_C > I_{Kf}$, can be described by the following relation [20]
$$\beta = \frac{I_C}{I_B} = \beta_0 \frac{I_{Kf}}{I_C} \tag{3.111}$$
where $\beta_0$ is the value of the gain when $I_C < I_{Kf}$. The modeling of the Kirk effect is very complex. However, a simple model for the current gain, which can be used in first order circuit analysis, is given below [21]
(3.112)
The accuracy of the simple EM model can be enhanced by accounting for the parasitic resistors, the Early effect and the high current effects, which can be modeled by simple analytical expressions as shown above.
*A typical value of $I_{Kf}$ is 1 mA/µm².
3.5.3 Bipolar Models in SPICE
Two BJT models are implemented in SPICE: the Ebers-Moll model and a more sophisticated one, which is based on the Gummel-Poon (GP) model [22]. The second model includes the following second order effects:
■ Very low current effect on the gain.
■ Base width modulation effect.
■ High-level injection effects (the Kirk effect is not included).
■ Base resistance variation with current.
The GP model is based on one-dimensional analysis. It is valid for all regions of operation: cutoff, forward-active, inverse-active, and saturation. The GP-based bipolar model is illustrated by the equivalent circuit shown in Fig. 3.19.
The two back-to-back diodes on the right represent the intrinsic base-emitter and base-collector junctions and their currents are given by [23]
$$I_{CC} = \frac{I_S}{q_b}\left(e^{V_{BE}/n_f V_t} - 1\right) \tag{3.113}$$
$$I_{EC} = \frac{I_S}{q_b}\left(e^{V_{BC}/n_r V_t} - 1\right) \tag{3.114}$$
where $I_S$ is given by [23]
(3.115)
The forward and reverse current emission coefficients ($n_f$ and $n_r$), which are introduced in Equations (3.113) and (3.114), are used to model the low currents. The parameter $q_b$ (base charge factor) accounts for the high current and base width modulation effects. It is given by [23]
$$q_b = \frac{q_1}{2} + \sqrt{\left(\frac{q_1}{2}\right)^2 + q_2} \tag{3.116}$$
Figure 3.19 The GP-based model of a bipolar transistor
$q_1$ models the effects of base width modulation and can be expressed as
The general expression of $q_b$ [Equation (3.116)] can be simplified for low-level and high-level injection conditions:
$$q_b \approx \begin{cases} q_1 & \text{if } q_2 \ll q_1^2/4 \ \text{(low-level injection)} \\ \sqrt{q_2} & \text{if } q_2 \gg q_1^2/4 \ \text{(high-level injection)} \end{cases} \tag{3.119}$$
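The two injection limits can be checked numerically. The sketch below assumes the base charge factor has the form $q_b = q_1/2 + \sqrt{(q_1/2)^2 + q_2}$, which is the form consistent with the $q_1^2/4$ thresholds quoted for the two regimes; the function name is illustrative.

```python
import math

def base_charge_factor(q1, q2):
    """Normalized base charge q_b (assumed form of Eq. (3.116)).

    q1 -- base-width-modulation term (close to 1)
    q2 -- high-injection term (grows with collector current)
    """
    return q1 / 2.0 + math.sqrt((q1 / 2.0) ** 2 + q2)

low = base_charge_factor(1.0, 0.0)     # low-level injection: equals q1
high = base_charge_factor(1.0, 1.0e6)  # high-level injection: close to sqrt(q2)
```

With `q2` set to zero the result is exactly `q1`, and for a very large `q2` it approaches `sqrt(q2)`, matching the two simplified cases.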
The two back-to-back diodes on the left [Fig. 3.19] account for the currents caused by the recombination of carriers in the emitter-base and the collector-base space-charge layers and other recombinations. These currents can be modeled by [23]
$$C_2 I_S \left(e^{V_{BE}/n_E V_t} - 1\right) \tag{3.120}$$
$$C_4 I_S \left(e^{V_{BC}/n_C V_t} - 1\right) \tag{3.121}$$
where $C_2$, $C_4$, $n_E$ and $n_C$ have been introduced to fit the measured currents. Further improvements to this model are possible by the inclusion of three parasitic resistances ($R_C$, $R_E$, $R_B$); three junction capacitances ($C_E$, $C_C$, $C_S$); and two diffusion capacitances ($C_{DE}$, $C_{DC}$) as shown in Fig. 3.19. The model of the base resistance takes into account the effect of the current (current crowding) through the following expression [24]
$$R_B(I) = R_{Bm} + 3(R_B - R_{Bm}) \frac{\tan(z) - z}{z \tan^2(z)} \tag{3.122}$$
where the variable $z$ is given by
$$z = \frac{-1 + \sqrt{1 + (144/\pi^2)(I_B/I_{RB})}}{(24/\pi^2)\sqrt{I_B/I_{RB}}} \tag{3.123}$$
$R_B$ represents the low-current maximum resistance and $R_{Bm}$ the high-current minimum resistance. The junction depletion capacitance is a function of the junction voltage ($V$). This function can be approximated by the following two expressions
$$C_{j,dep} = C_j \left(1 - \frac{V}{\phi_j}\right)^{-M_j} \quad \text{if } V < FC\,\phi_j \tag{3.124}$$
$$C_{j,dep} = \frac{C_j}{(1 - FC)^{1 + M_j}} \left[1 - FC\,(1 + M_j) + M_j \frac{V}{\phi_j}\right] \quad \text{if } V \ge FC\,\phi_j \tag{3.125}$$
The empirical factor FC has a value between 0 and 1. Its default value in SPICE is 0.5. Note that Equations (3.124) and (3.125) apply for a reverse and a forward biased junction respectively. The diffusion capacitances model the charge associated with injected carriers. For example, the electrons injected into the base have a corresponding storage charge
$$Q_{DE} = \tau_f I_{CC} \tag{3.126}$$
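The role of FC is easiest to see numerically: the power-law expression is used up to $V = FC\,\phi_j$, and a straight-line extrapolation takes over beyond it so that the capacitance stays finite in forward bias. The sketch below assumes the standard SPICE forward-bias expression for the second region; the function name and the sample parameter values are illustrative.

```python
def junction_capacitance(v, c_j0, phi, m, fc=0.5):
    """Depletion capacitance of a pn junction versus junction voltage.

    v    -- junction voltage (positive = forward bias)
    c_j0 -- zero-bias capacitance
    phi  -- built-in potential
    m    -- grading factor
    fc   -- forward-bias coefficient, between 0 and 1
    """
    if v < fc * phi:
        # Reverse bias and weak forward bias: Eq. (3.124).
        return c_j0 * (1.0 - v / phi) ** (-m)
    # Stronger forward bias: linear extrapolation (assumed SPICE form).
    return c_j0 * (1.0 - fc) ** (-(1.0 + m)) * (1.0 - fc * (1.0 + m) + m * v / phi)

# Hypothetical junction: 10 fF zero-bias capacitance, phi = 0.7 V, m = 0.33.
c_reverse = junction_capacitance(-2.0, 10e-15, 0.7, 0.33)
c_forward = junction_capacitance(0.6, 10e-15, 0.7, 0.33)
```

The two branches meet with the same value at $V = FC\,\phi_j$, so the curve has no jump where the expressions change over.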
The forward transit time $\tau_f$ is current-dependent and is given by an empirical expression [24]
where $V_{TF}$ is a fitting parameter to model the change of $\tau_f$ as a function of $V_{BC}$ (or $V_{CB}$), $I_{TF}$ models the change due to $I_C$ and $X_{TF}$ controls the increase of $\tau_f$. $I_{CC}$ is the collector current in the absence of the high-current effects, which corresponds to that of the Ebers-Moll model. The diffusion capacitance (associated with the electrons injected from the emitter into the base, when the base-emitter junction is forward biased) is given by
$$C_{DE} = \frac{\partial Q_{DE}}{\partial V_{BE}} \tag{3.128}$$
Similarly, the base-collector junction has a diffusion capacitance, which is given by
$$C_{DC} = \frac{\partial Q_{DC}}{\partial V_{BC}} \tag{3.129}$$
where
$$Q_{DC} = \tau_r I_{EC} \tag{3.130}$$
Although the SPICE models account for most of the first and second order effects, they are not highly accurate. This originates from some weaknesses in the theory on which the models are based. As the device features are scaled down, the currently available models become less accurate. The physics and the theory of the scaled devices are more complex. Hence, accurate modeling becomes very difficult. One way around that problem is to choose the model parameters such that the simulated device characteristics agree with measurements. In practice, the model parameters are extracted automatically using parameter analyzers with software tools to obtain the best fit. As a result, the values of the extracted parameters may not correspond to their actual values. For example, it is common to find a discrepancy of 20% between the measured current gain of a bipolar transistor and that listed in the SPICE file. Another approach, which is equivalent to tweaking the parameters, is to use empirical models (e.g. the BSIM model), in which the empirical (fitting) parameters can be optimized to get the best fit between simulation and measurements. Typical GP parameters for the 0.8 µm BiCMOS process presented in Chapter 2 are shown in Tables 3.4 and 3.5.
Table 3.4 Bipolar device parameters and HSPICE correspondence

SPICE Keyword   Description
IS              Saturation current
BF              Ideal maximum forward gain
BR              Ideal maximum reverse gain
NF              Forward current-emission coefficient
NR              Reverse current-emission coefficient
VAF             Forward Early voltage
VAR             Reverse Early voltage
IKF             Forward-knee current
IKR             Reverse-knee current
ISE             Base-emitter leakage saturation current
ISC             Base-collector leakage saturation current
NE              Base-emitter leakage emission coefficient
NC              Base-collector leakage emission coefficient
RE              Emitter resistance
RC              Collector resistance
RB              Base resistance at zero current
IRB             Base current where $R_B = R_B(0)/2$
RBM             Minimum high-current base resistance
CJE             Base-emitter zero-bias depletion cap.
VJE             Base-emitter built-in potential
MJE             Base-emitter junction grading factor
CJC             Base-collector zero-bias depletion cap.
VJC             Base-collector built-in potential
MJC             Base-collector junction grading factor
CJS             Collector-substrate zero-bias cap.
VJS             Collector-substrate built-in potential
MJS             Collector-substrate junction grading factor
XCJC            Internal base fraction of base-collector cap.
FC              Coefficient for forward-bias depletion cap.
TF              Forward transit time
XTF             TF bias-dependent coefficient
VTF             TF base-collector voltage dependence coeff.
ITF             TF high current parameter
TR              Reverse transit time
XTB             Forward and reverse beta temperature exponent
XTI             Saturation current temperature exponent
EG              Energy gap
KF              Flicker noise coefficient
AF              Flicker noise exponent
Table 3.5 HSPICE BJT model parameters (0.8 µm BiCMOS process)
SPICE Keyword
IS BF BR NF NR VA P VAR IKF IKR ISE
Vdue
Units
A
Zx
100 1 1
1
sn . .
5 5n 10P
0.
0.
A A
A
Table 3.5 (continued)
RE RC RB
IRB
30 87
RBM CJE
VJE MJE CJC VJC
650 0 650 1 . 5 1 ~lo-'' 0.87 0 265 1.15~10-14 o 713
n n n A
62 F V
F V
FC
TF XTF VTF
ITF
TR
XTB
XTI EG
0.5 12.5~ 916.2 1.6 a.7x 10-2 4 x 10W8 1 . 4 3.5 1.11

2.9x10-e
ev
-
XF
AF
2.0
3.5.4 Chapter Summary
In this chapter, we have reviewed the fundamentals of the MOS and bipolar devices. The most common device models used in SPICE have been presented. The key device parameters of each model have been defined and explained, so that the reader is familiar with the details of these models and can appreciate the importance of the different model parameters. The reader is given a list of model parameters, for a typical 0.8 µm BiCMOS process, that can be used for circuit simulations. These models can be used even at low-voltage operation. Moreover, a simple analytical model valid for submicrometer MOSFETs has been discussed.
REFERENCES
[I] A. Vlrudimirescu, and S. Lio, "The simulation of MOS Integrated Circaits using SPICEZ," M m o . No. UCB/ERL M80/7, Univ. Cdifomia, Berkeley, October 1980. [Z] H. Masuda, M. Nakai and M, Kubo, "Characteristics and Limitations of Scaled Down MOSFET's Due to Two Dimensional Field Effect," IEEE Trans. on Electron Devices. Vol. ED-26, pp. 980-986, 1979.
[3] R.L.M. D u g , "A Simple Current Model for Short-Channel IGFET and Its Application to Circuit Simulation," IEEE Journal of Solid-State Circuits, vol. SC-14, pp. 358-367,1979.
(41 G. Merkd, J . Bore1 and N.Z. Cupces. "An Accurate Large Signal MOS Transistor Model for Use in Computer-Aided Design," IEEE Trans. an
Electron Devices, vol. ED-IS, 1972. [5] G. Baum and 8 . Beneking, 'Drift Velocity Saturation in MOS Tranristors," IEEE Trans. on Electron Devices, YOI. ED-17, pp. 481-482, 1970.
[6] R.M. Swanson and J.D. Meindl, "Ion-Implanted Complementary MOS Transistors in Lou-Voltage Circuits," IEEE Journal of Solid-state Circuits, vol. SC-7, pp. 146-153, 1972.
171 B.J. Sheu, D.L. Scharfetter, P.-K. KO, and M.C. Jeng, "BSIM Berkeley Short-Channel IGFET Model for MOS Transistors," IEEE Journal of Solid-state Circuits, vol. SC-22, pp. 558-566, 1987.
[8] J. 8. Huang,
Z. H. Liu, M. C. Jeng, P. K. KO, and C. Ha, "A Robust physical and Predictive Model for Deep-Snbmicmmeter MOS Circuit Simulation," IEEE Custom Integrated Circuits Conf., Tech. Dig., pp. 14.2.114.2.4, May 1993.
[9] D.E. Ward and R.W. Dutton, "A Chargeoriented Model for MOS Transistors Capacitances," IEEE Journal of Solid-State Circuits, vol. SC-13, pp. 703-707, 1978.
112
LOW-POWER DIGITALVLSI DESIGN
[10] Y.P. Tsividis, "Operation and Modeling of the MOS Transistor," McGraw-Hill, 1988.
[11] T. Sakata et al., "Subthreshold-Current Reduction Circuits for Multi-Gigabit DRAM's," IEEE Journal of Solid-State Circuits, vol. 29, no. 7, pp. 761-769, July 1994.
[12] S.M. Sze, "Physics of Semiconductor Devices," John Wiley & Sons, 1981.
[13] C.G. Sodini, P.-K. Ko, and J.L. Moll, "The Effect of High Fields on MOS Device and Circuit Performance," IEEE Trans. on Electron Devices, vol. ED-31, no. 10, pp. 1386-1393, October 1984.
[14] B. Hoefflinger, H. Sibbert, and G. Zimmer, "Model and Performance of Hot-Electron MOS Transistor for VLSI," IEEE Trans. on Electron Devices, vol. ED-26, pp. 513, 1979.
[15] C. Hu, "Low-Voltage CMOS Device Scaling," IEEE International Solid-State Circuits Conf., Tech. Dig., pp. 86-87, 1994.
[16] R.H. Dennard, et al., "Design of Ion-Implanted MOSFETs with Very Small Physical Dimensions," IEEE Journal of Solid-State Circuits, vol. SC-9, pp. 256-266, October 1974.
[17] P.K. Chatterjee, et al., "The Impact of Scaling Laws on the Choice of N-Channel or P-Channel for MOS VLSI," IEEE Electron Device Letters, vol. EDL-1, pp. 220-223, October 1980.
[18] M. Kakumu, "Process and Device Technologies of CMOS Devices for Low-Voltage Operation," IEICE Trans. Electron., vol. E76-C, no. 5, pp. 672-680, May 1993.
[19] M. Kakumu, M. Kinugawa, and K. Hashimoto, "Choice of Power-Supply Voltage for Half-Micrometer and Lower Submicrometer CMOS Devices," IEEE Trans. Electron Devices, vol. 37, no. 6, pp. 1334-1342, May 1990.
[20] D.J. Roulston, "Bipolar Semiconductor Devices," McGraw-Hill Publishing Company, 1990.
[21] K. Nakazato, et al., "Characteristics and Scaling Properties of n-p-n Transistors with a Sidewall Base Contact Structure," IEEE Trans. on Electron Devices, vol. ED-32, no. 2, pp. 328-332, February 1985.
[22] H.K. Gummel and H.C. Poon, "An Integral Charge Control Model of Bipolar Transistors," Bell Syst. Tech. J., vol. 49, 1970.
[23] I. Getreu, "Modeling the Bipolar Transistor," Tektronix, Inc., 1976.
[24] P. Antognetti and G. Massobrio, "Semiconductor Device Modeling with SPICE," McGraw-Hill, 1988.
4
LOW-VOLTAGE LOW-POWER VLSI CMOS CIRCUIT DESIGN
In this chapter we introduce the CMOS logic gate with the development of simple models for delay and power dissipation estimation. These analyses permit us to understand the mechanisms that control the performance, particularly the power dissipation, of a logic circuit. Several CMOS design styles, such as pseudo-NMOS, dynamic logic and NORA, are presented. Other circuit variations of the static complementary CMOS, which are suitable for low-power applications, are discussed. These include the pass-transistor logic families such as Complementary Pass-transistor Logic (CPL), Dual Pass-transistor Logic (DPL), and Swing Restored Pass-transistor Logic (SRPL). An overview of clocking strategy in VLSI systems is also covered. Included in this chapter is one important area, the I/O circuits, whose power dissipation is also analyzed. Finally, low-power techniques for CMOS design are reviewed at the transistor level. We will cover the low-power issues at the subsystem, system and architecture levels in Chapters 6, 7 and 8 in more detail. Several books treat other CMOS circuit design aspects in detail [1, 2, 3]; the reader can refer to them.

Many issues existing in today's advanced CMOS circuit structures are considered, such as:
- Power dissipation components of a CMOS gate and their importance;
- Concept of switching activity;
- Power dissipation in I/O circuits;
- Single-phase clocking strategy;
- Clock skew issue;
- Clock distribution in VLSI systems;
- Ground bouncing; and
- Low-power circuit techniques and design guidelines.
4.1 CMOS INVERTER DC CHARACTERISTICS

Fig. 4.1 shows the basic complementary MOS inverter. Before deriving the DC transfer characteristics of this inverter (the output voltage versus the input voltage), let us understand the operation of this circuit.

When the input is HIGH, which means at $V_{DD}$, we have

$$V_{GSn} = V_{in} = V_{DD} \tag{4.1}$$

$$V_{GSp} = V_{in} - V_{DD} = 0 \tag{4.2}$$

In this case, $V_{GSn} > V_{Tn}$ and $|V_{GSp}| < |V_{Tp}|$. The PMOS is OFF and the NMOS is ON. The NMOS transistor N provides a current path to ground. The final stable value of the output voltage $V_o$ is

$$V_o = 0 \tag{4.3}$$

At the steady state, the DC current from $V_{DD}$ to ground is controlled by the subthreshold current of the PMOS P, since this device is OFF and the NMOS N has a $V_{DS}$ equal to zero. We assume that the junction leakage is negligible. If $V_{Tp}$ (1) is low enough (lower, for example, than -0.5 V), the subthreshold current is negligible (< 1 pA/µm width). If $V_{Tp}$ (negative) is high, the subthreshold current is not negligible and can be as high as 1 µA/µm for $V_{Tp}$ = -0.05 V [see Section 3.3.2]. In this case the output is not exactly at zero and can have a value of tens of mV. In this section we assume that the subthreshold current is not important. Low-$V_T$ CMOS circuits are treated in Section 4.10.

(1) Extrapolated threshold voltage.

Similarly, when $V_{in}$ is low (0 V), $V_{GSn} < V_{Tn}$ and $|V_{GSp}| > |V_{Tp}|$. The PMOS transistor is ON and the NMOS transistor is OFF. The output voltage is given by

$$V_o = V_{DD} \tag{4.4}$$

Here too we assume that the leakage current is negligible.
Figure 4.1 A CMOS inverter
The logic levels of the CMOS inverter are close to $V_{DD}$ and ground, and the logic swing is equal to $V_{DD}$. This is a main feature of CMOS gates.
4.1.1 Transfer Characteristics

In this section we discuss the DC characteristics of the CMOS inverter of Fig. 4.1. Fig. 4.2 shows the DC transfer characteristic with the different regions of operation. For simplicity we use, for the MOS devices, the simple current models presented in Section 3.2.1. The circuit operation can be divided into five regions:

Region (A): $0 \le V_{in} < V_{Tn}$. The NMOS transistor is operating in the subthreshold region and the current is assumed zero. Hence the PMOS current is also zero. The PMOS transistor is in the linear region. Thus, $V_o = V_{DD}$.
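Region (B): $V_{Tn} < V_{in} < V_{inv}$. $V_{inv}$ is defined as the input voltage at which the gain of the inverter is maximum and is also called the gate threshold voltage. In this region, the NMOS transistor is operating in the saturation region and the PMOS is in the linear region. Since the current in both devices is the same (in absolute value), we have

$$I_{DSn} = -I_{DSp} \tag{4.5}$$

The PMOS current is given by

$$I_{DSp} = -\beta_p \left[ (V_{in} - V_{DD} - V_{Tp})(V_o - V_{DD}) - \tfrac{1}{2}(V_o - V_{DD})^2 \right] \tag{4.6}$$

where

$$\beta_p = k_p \frac{W_{eff}}{L_{eff}} \tag{4.7}$$

$$V_{GSp} = V_{in} - V_{DD} \tag{4.8}$$

$$V_{DSp} = V_o - V_{DD} \tag{4.9}$$

The saturation current of the NMOS is given by

$$I_{DSn} = \frac{\beta_n}{2} (V_{GSn} - V_{Tn})^2 \tag{4.10}$$

where

$$\beta_n = k_n \frac{W_{eff}}{L_{eff}} \tag{4.11}$$

and

$$V_{GSn} = V_{in} \tag{4.12}$$

Using Equations (4.5), (4.6) and (4.10), the output voltage is given by

$$V_o = (V_{in} - V_{Tp}) + \sqrt{ (V_{in} - V_{Tp})^2 - 2\left(V_{in} - \frac{V_{DD}}{2} - V_{Tp}\right) V_{DD} - \frac{\beta_n}{\beta_p} (V_{in} - V_{Tn})^2 } \tag{4.13}$$

This equation of $V_o$ versus $V_{in}$ is plotted in Fig. 4.2, region (B).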

Region (C): $V_{in} = V_{inv}$. Both the NMOS and PMOS transistors are in the saturation region. In this case, the PMOS current is given by

$$I_{DSp} = -\frac{\beta_p}{2} (V_{in} - V_{DD} - V_{Tp})^2 \tag{4.14}$$

The NMOS saturation current is given in Equation (4.10). By equating the absolute values of the two drain currents we have

$$V_{inv} = \frac{V_{DD} + V_{Tp} + V_{Tn}\sqrt{\beta}}{1 + \sqrt{\beta}} \tag{4.15}$$

where

$$\beta = \frac{\beta_n}{\beta_p} \tag{4.16}$$

This equation is very useful from a design point of view. Note, from this equation, that the logic threshold voltage of this gate is set by the designer, since the parameters $\beta_n$ and $\beta_p$ depend on $W_{eff}$ and $L_{eff}$. Moreover, region (C) is defined for only one point of $V_{in}$. For symmetrical NMOS and PMOS devices we have

$$V_{Tn} = -V_{Tp} \tag{4.17}$$

If the designer sets

$$\beta_n = \beta_p \tag{4.18}$$

This ratio is a typical example. The designer should set the size ratio as

(4.20)

We obtain

$$V_{in} = V_{inv} = \frac{V_{DD}}{2} \tag{4.21}$$

An inverter with this $V_{inv}$ is sometimes called a symmetrical gate. The output voltage in this case is not necessarily equal to $V_{DD}/2$ and is given by the following inequality

$$V_{in} - V_{Tn} < V_o < V_{in} - V_{Tp} \tag{4.22}$$

In reality, $V_o$ is set by the slight dependence of $I_{DS}$ on $V_{DS}$.

Region (D): $V_{inv} < V_{in} \le V_{DD} + V_{Tp}$. In this region the NMOS is in the linear region while the PMOS is in the saturation region. An analysis similar to the one used in region (B) can be applied. The output voltage is given by

$$V_o = (V_{in} - V_{Tn}) - \sqrt{ (V_{in} - V_{Tn})^2 - \frac{\beta_p}{\beta_n} (V_{in} - V_{DD} - V_{Tp})^2 } \tag{4.23}$$
Region (E): $V_{DD} + V_{Tp} < V_{in} \le V_{DD}$. In this region the NMOS transistor is ON, and in the linear region, and the PMOS is operating in the subthreshold region. If we assume that this current is negligibly small, then

$$V_o = 0 \tag{4.24}$$

The current flowing from $V_{DD}$ to ground, versus the input voltage, is plotted in Fig. 4.2(b). It reaches its maximum when both MOS transistors are in saturation. It is important to note that the DC power dissipation is maximal for $V_{in} = V_{inv}$.
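The five regions can be collected into one piecewise function. The following is a minimal sketch under the square-law assumptions of this section; the function name, the default parameter values and the choice of returning the midpoint of the allowed range at the single point of region (C) are ours, not taken from the text.

```python
import math

def inverter_vtc(v_in, vdd=3.3, vtn=0.5, vtp=-0.5, beta_n=1.0, beta_p=1.0):
    """Output voltage of a CMOS inverter from the square-law region equations.

    vtp is negative, as in the text. Default values are illustrative only.
    """
    beta = beta_n / beta_p
    root = math.sqrt(beta)
    # Gate threshold: input voltage at which both devices are saturated.
    v_inv = (vdd + vtp + vtn * root) / (1.0 + root)

    if v_in < vtn:            # region (A): NMOS off, output held at VDD
        return vdd
    if v_in > vdd + vtp:      # region (E): PMOS off, output held at ground
        return 0.0
    if v_in < v_inv:          # region (B): NMOS saturated, PMOS linear
        disc = (v_in - vdd - vtp) ** 2 - beta * (v_in - vtn) ** 2
        return (v_in - vtp) + math.sqrt(max(disc, 0.0))
    if v_in > v_inv:          # region (D): NMOS linear, PMOS saturated
        disc = (v_in - vtn) ** 2 - (v_in - vdd - vtp) ** 2 / beta
        return (v_in - vtn) - math.sqrt(max(disc, 0.0))
    # region (C): any value between v_in - vtn and v_in - vtp is allowed by
    # the ideal model; the midpoint is returned as an arbitrary choice.
    return v_in - 0.5 * (vtn + vtp)

if __name__ == "__main__":
    for v in (0.0, 0.5, 1.0, 1.5, 1.8, 2.3, 2.8, 3.3):
        print(f"V_in = {v:.2f} V  ->  V_o = {inverter_vtc(v):.3f} V")
```

The discriminant in region (B) is written in its factored form, which is algebraically the same as the expanded expression under the square root in the region (B) equation.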
Figure 4.3 Effect of the ratio β on the (a) DC transfer characteristic and (b) threshold voltage of the CMOS inverter
4.1.2 Effect of β

As we discussed before, the ratio β controls the threshold voltage of the CMOS inverter. This parameter is set by the circuit designer through the transistor sizes. Other parameters such as the mobility and the threshold voltage of devices are set during fabrication and the circuit designer cannot change them. Fig. 4.3 illustrates the dependence of the DC transfer characteristics and the threshold voltage of the CMOS inverter on the ratio β. Increasing β decreases the voltage $V_{inv}$. $V_{inv}$ has a practical maximum less than $V_{DD} + V_{Tp}$ and a practical minimum greater than $V_{Tn}$. Practical values mean that β cannot be zero or infinite. In general, the circuit designer tries to set β = 1 for symmetrical operation unless the gate is used to switch an input swing different from a CMOS swing (from ground to $V_{DD}$).
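The trend described above can be checked numerically. This sketch assumes the closed form obtained by equating the two square-law saturation currents, $V_{inv} = (V_{DD} + V_{Tp} + V_{Tn}\sqrt{\beta})/(1 + \sqrt{\beta})$; the function name and the parameter values are illustrative.

```python
import math

def gate_threshold(beta, vdd=3.3, vtn=0.5, vtp=-0.5):
    """Inverter threshold voltage for beta = beta_n / beta_p (vtp is negative)."""
    root = math.sqrt(beta)
    return (vdd + vtp + vtn * root) / (1.0 + root)

if __name__ == "__main__":
    for beta in (0.1, 0.5, 1.0, 2.0, 10.0):
        print(f"beta = {beta:5.1f}   V_inv = {gate_threshold(beta):.3f} V")
```

For any finite positive β the result stays between $V_{Tn}$ and $V_{DD} + V_{Tp}$, and it moves toward $V_{Tn}$ as β grows.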
4.1.3 Noise Margins

Noise margin is an important parameter in logic design. It is defined as the allowable noise voltage on the input so that the output is not affected. In other words, we would define the valid logic levels such that they are restored when they propagate through a digital circuit. The logic levels can be extracted from the DC characteristic. As illustrated in Fig. 4.4 we define the levels at the input by

- Logic 0: for $0 \le V_{in} \le V_{IL}$
- Logic 1: for $V_{IH} \le V_{in} \le V_{DD}$

and at the output by

- Logic 0: for $0 \le V_o \le V_{OL}$
- Logic 1: for $V_{OH} \le V_o \le V_{DD}$

The LOW noise margin is defined by

$$NM_L = |V_{IL} - V_{OL}| \tag{4.25}$$

and the HIGH noise margin is defined by

$$NM_H = |V_{OH} - V_{IH}| \tag{4.26}$$

The $V_{IL}$ and $V_{IH}$ levels can be defined as the points where the slope of the DC transfer characteristic is -1, i.e.,

$$\frac{dV_o}{dV_{in}} = -1 \tag{4.27}$$

These values can be deduced using Equations (4.13) and (4.23). To have good noise margins, it is desirable to have $V_{IL}$ and $V_{IH}$ close to each other, around the point $V_{DD}/2$.
For CMOS circuits, the output voltage levels can be defined by letting $V_{OH} = V_{DD}$ and $V_{OL} = 0$. The CMOS logic inverter has a fairly ideal transfer function and it tends to have very good noise margins. In some applications, either $NM_H$ or $NM_L$ is compromised to have good speed of operation.
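The unity-gain points can be located numerically. The sketch below assumes an idealized symmetric inverter (β = 1, $V_{Tn} = -V_{Tp} = V_T$) described by the square-law region equations, and finds the two inputs where the slope equals -1 by bisection on a finite-difference slope. All function names are ours. Because the model ignores short-channel effects, its numbers should not be expected to match values simulated with real device parameters.

```python
import math

def vtc(v_in, vdd, vt):
    """Square-law transfer curve of a symmetric inverter (beta = 1)."""
    half = vdd / 2.0
    if v_in <= vt:
        return vdd
    if v_in >= vdd - vt:
        return 0.0
    if v_in < half:   # NMOS saturated, PMOS linear
        disc = (v_in - vdd + vt) ** 2 - (v_in - vt) ** 2
        return (v_in + vt) + math.sqrt(max(disc, 0.0))
    if v_in > half:   # NMOS linear, PMOS saturated
        disc = (v_in - vt) ** 2 - (v_in - vdd + vt) ** 2
        return (v_in - vt) - math.sqrt(max(disc, 0.0))
    return half

def unity_gain_input(vdd, vt, lo, hi, step=1e-6, iterations=60):
    """Bisection for the input in [lo, hi] where dVo/dVin crosses -1."""
    def slope_plus_one(v):
        return (vtc(v + step, vdd, vt) - vtc(v - step, vdd, vt)) / (2.0 * step) + 1.0

    f_lo = slope_plus_one(lo)
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        f_mid = slope_plus_one(mid)
        if (f_mid > 0.0) == (f_lo > 0.0):
            lo, f_lo = mid, f_mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def noise_margins(vdd, vt, guard=1e-3):
    """Return (NM_L, NM_H) with V_OL = 0 and V_OH = VDD."""
    v_il = unity_gain_input(vdd, vt, vt + guard, vdd / 2.0 - guard)
    v_ih = unity_gain_input(vdd, vt, vdd / 2.0 + guard, vdd - vt - guard)
    return abs(v_il - 0.0), abs(vdd - v_ih)

if __name__ == "__main__":
    for supply in (3.3, 1.5):
        nm_l, nm_h = noise_margins(supply, 0.5)
        print(f"VDD = {supply} V: NM_L = {nm_l:.3f} V, NM_H = {nm_h:.3f} V")
```

For this symmetric model the two margins come out equal, and $V_{IL}$ agrees with the closed form $(3V_{DD} + 2V_T)/8$ that follows from differentiating the region (B) expression.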
4.1.4 Minimum Power Supply

To obtain the maximum power saving in CMOS logic circuits, the power supply voltage should be reduced. So, what is the lowest practical supply voltage at which CMOS will operate? In 1972, Swanson and Meindl [4] demonstrated that the minimum supply voltage is given by

$$V_{DD,min} = \frac{8kT}{q} \tag{4.28}$$

At room temperature this value is equal to 0.2 V. This demonstrates that CMOS is a good candidate for ultra-low-power applications.
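As a quick numerical check, the sketch below evaluates the limit assuming it reads $8kT/q$, which is consistent with the 0.2 V quoted at room temperature. The function name and the choice of 300 K for room temperature are ours.

```python
K_BOLTZMANN = 1.380649e-23    # J/K
Q_ELECTRON = 1.602176634e-19  # C

def min_supply_voltage(temp_kelvin=300.0, factor=8.0):
    """Minimum supply voltage factor * kT/q, in volts."""
    return factor * K_BOLTZMANN * temp_kelvin / Q_ELECTRON

if __name__ == "__main__":
    for temp in (250.0, 300.0, 400.0):
        print(f"T = {temp:.0f} K: V_DD,min = {min_supply_voltage(temp):.3f} V")
```

The limit scales linearly with absolute temperature, so it rises at high junction temperature.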
4.1.5 Example of Noise Margins

For an inverter with $W_p = 2W_n$ = 4 µm (in 0.8 µm CMOS technology), and using a threshold voltage $V_T = V_{Tn} = |V_{Tp}|$ = 0.5 V, we have the following values for $NM_L$ and $NM_H$. At 3.3 V power supply voltage, $NM_L$ = 1.15 V and $NM_H$ = 1.45 V. However, at 1.5 V, $NM_L$ = 0.60 V and $NM_H$ = 0.65 V. So the noise level should be kept low, particularly at low power supply voltage.
Figure 4.5 CMOS inverter and switching characteristic
4.2 CMOS INVERTER SWITCHING CHARACTERISTICS

In this section, we present the transient behavior of the CMOS inverter. A very simple analytic model for delay is developed. The objective of this analysis is to understand the parameters that affect the speed of the gate. We assume that the input has a step waveform. The delay $t_d$ is the time difference between the midpoint of the input swing and the midpoint of the swing of the output signal. Referring to Fig. 4.5,

- $t_{dr}$ is the 50% delay when the output is rising; and
- $t_{df}$ is the 50% delay when the output is falling.

The power dissipation issue during the switching is considered in Section 4.3.
4.2.1 Analytic Delay Models

The load capacitance shown in Fig. 4.5 at the output of the CMOS inverter represents the total of the input capacitance of driven gates, the parasitic capacitance at the output of the gate itself and the wiring capacitance. In Section 4.4, we discuss the estimation of this load capacitance. For simplicity we assume, for the 50% delay, that the MOS current is averaged and is equal to the saturation current. The equation of the saturation current used in this section is the one given by Equation (3.82) in Section 3.3.3. This saturation current is well modeled for short-channel devices.
4.2.1.1 Fall Delay

When the input goes from low (ground) to high ($V_{DD}$), initially the output is at $V_{DD}$ and the pull-down NMOS of Fig. 4.5 is in the saturation region. We assume that when the output falls to $V_{DD}/2$, the NMOS drain current is approximated by the saturation current $I_{DSn,sat}$. Referring to the equivalent circuit of Fig. 4.6(a), the delay is computed from the following differential equation

$$C_L \frac{dV_o}{dt} = -I_{DSn,sat} \tag{4.29}$$

where

$$I_{DSn,sat} = K_n v_{sat} C_{ox} W_{eff,n} (V_{GSn} - V_{Tn}) \tag{4.30}$$

We assume that the factor $K_n$ does not change. By integrating Equation (4.29) from $t = t_1$, corresponding to $V_o = V_{DD}$, to $t = t_2$, corresponding to $V_o = V_{DD}/2$, and substituting (4.30) into (4.29) with $V_{GSn} = V_{DD}$, we obtain

$$t_{df} = \frac{C_L V_{DD}}{2 K_n v_{sat} C_{ox} W_{eff,n} (V_{DD} - V_{Tn})} \tag{4.31}$$

Note from this equation that the delay is inversely proportional to the width of the MOS transistor. So by sizing the gate we can reduce the delay of the gate alone.
4.2.1.2 Rise Delay

When the input goes from high ($V_{DD}$) to low (ground), initially the output is at zero. The pull-up PMOS transistor operates in the saturation region. Similarly, using the equivalent circuit of Fig. 4.6(b), the rise delay is given by

$$t_{dr} = \frac{C_L V_{DD}}{2 K_p v_{sat} C_{ox} W_{eff,p} (V_{DD} - |V_{Tp}|)} \tag{4.32}$$

[Fig. 4.6 annotations: at $t = t_1$, $V_o = V_{DD}$; at $t = t_3$, $V_o = 0$; at $t = t_2$ and $t = t_4$, $V_o = V_{DD}/2$]

From the above equation we can deduce that the rise delay is greater than the fall delay for equally sized MOS transistors. So $W_{eff,p}$ should be sized such that the two saturation currents are almost equal in order to get symmetrical rise and fall delays.
4.2.1.3 Delay Time

By definition, the delay time (sometimes called propagation delay) is given by

$$t_d = \frac{1}{2} (t_{dr} + t_{df}) \tag{4.33}$$

Hence, for $V_{Tn} = -V_{Tp} = V_T$ the delay is given by

(4.34)

Or the equation can be written as

(4.35)

The constant is slightly affected by $V_{DD}$ through the parameter K. This equation shows a simple analytic expression for the delay time. We can observe that the delay is linearly proportional to the total load capacitance. Secondly, the delay increases when the power supply is scaled down. When $V_{DD}$ approaches the threshold voltage of the device, the delay increases drastically. If the threshold voltage is scaled down with the supply voltage and the oxide thickness is scaled down too, then the delay can improve with $V_{DD}$ scaling. From the CMOS circuit designer's point of view, the only parameters that can be controlled to optimize the speed of CMOS gates are:

- The width of the MOS transistor;
- The load capacitances (input of the next stage, wiring, etc.); and
- The supply voltage $V_{DD}$.
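The dependence of the delay on these three parameters can be made explicit by averaging the two 50% delays. The derivation below is a reconstruction, not the book's own typesetting: it assumes that each delay is the charge $C_L V_{DD}/2$ divided by a saturation current of the form of Equation (4.30), with $V_{Tn} = -V_{Tp} = V_T$, and the constant $A$ is a name introduced here.

```latex
\begin{align*}
t_{df} &\approx \frac{C_L V_{DD}}{2 K_n v_{sat} C_{ox} W_{eff,n} (V_{DD} - V_T)}, \qquad
t_{dr} \approx \frac{C_L V_{DD}}{2 K_p v_{sat} C_{ox} W_{eff,p} (V_{DD} - V_T)} \\[4pt]
t_d &= \tfrac{1}{2}\,(t_{dr} + t_{df})
     = \frac{C_L V_{DD}}{4 v_{sat} C_{ox} (V_{DD} - V_T)}
       \left( \frac{1}{K_n W_{eff,n}} + \frac{1}{K_p W_{eff,p}} \right) \\[4pt]
    &= A \, \frac{C_L V_{DD}}{V_{DD} - V_T}, \qquad
A = \frac{1}{4 v_{sat} C_{ox}}
    \left( \frac{1}{K_n W_{eff,n}} + \frac{1}{K_p W_{eff,p}} \right)
\end{align*}
```

In this form the delay is linear in $C_L$, inversely related to the transistor widths through $A$, and grows without bound as $V_{DD}$ approaches $V_T$.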
Fig. 4.7(a) shows the simulated effect of the power supply voltage on the delay of an inverter with fanout = 3, using the device parameters given in Chapter 3. We buffer the input voltage with one inverter stage to obtain accurate results. The delay is almost stable at high $V_{DD}$; however, when $V_{DD}$ approaches the threshold voltage of the NMOS and PMOS devices, it increases drastically, as expected from Equation (4.35). Therefore, the threshold voltage should be reduced to overcome this problem. In Fig. 4.7(b), the delay of the inverter is plotted versus the ratio $V_T/V_{DD}$ at $V_{DD}$ = 2.5 V. For $V_T/V_{DD}$ > 0.5, the delay increases rapidly. In order to maintain improvement in circuit performance at reduced power supply voltage, $V_T/V_{DD}$ must be ≤ 0.2.
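The shape of this curve can be illustrated with a few lines of code. The sketch assumes that the delay varies as $V_{DD}/(V_{DD} - V_T)$ with the load capacitance and the multiplying constant normalized to one, and it ignores the weak dependence of that constant on $V_{DD}$. The function name and the 0.5 V threshold are illustrative.

```python
def normalized_delay(vdd, vt):
    """Delay shape V_DD / (V_DD - V_T), with C_L and the constant set to 1."""
    if vdd <= vt:
        raise ValueError("the model only applies for V_DD above V_T")
    return vdd / (vdd - vt)

if __name__ == "__main__":
    VT = 0.5
    reference = normalized_delay(3.3, VT)
    for vdd in (3.3, 2.5, 1.5, 1.0, 0.7):
        ratio = normalized_delay(vdd, VT) / reference
        print(f"V_DD = {vdd:.1f} V  V_T/V_DD = {VT / vdd:.2f}  delay ratio = {ratio:.2f}")
```

The printed ratios stay close to one at high supply and rise steeply once $V_T/V_{DD}$ exceeds about one half.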
4.2.2 Delay Characterization with SPICE

A data sheet for the delay of a cell (e.g., a CMOS inverter) can be easily prepared using SPICE. For example, the load capacitance or the fanout of a CMOS inverter is swept during the simulation, and a relation of the type $t_d = a + b \cdot C_L$ (or fanout) can be obtained. Fig. 4.8 shows the delay vs. the external load capacitance $C_L$. Other parameters can be extracted also.
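Once the sweep has been run, the coefficients a and b follow from an ordinary least-squares line fit. The sketch below uses made-up delay data (not values read from Fig. 4.8) and a hand-written fit so that it has no external dependencies; the function name and the data are hypothetical.

```python
def fit_delay_line(loads_ff, delays_ns):
    """Least-squares fit of t_d = a + b * C_L; returns (a, b)."""
    n = len(loads_ff)
    mean_c = sum(loads_ff) / n
    mean_t = sum(delays_ns) / n
    sxx = sum((c - mean_c) ** 2 for c in loads_ff)
    sxy = sum((c - mean_c) * (t - mean_t) for c, t in zip(loads_ff, delays_ns))
    b = sxy / sxx
    a = mean_t - b * mean_c
    return a, b

# Hypothetical sweep results: load in fF, 50% delay in ns.
loads_ff = [0.0, 50.0, 100.0, 200.0, 400.0]
delays_ns = [0.10, 0.20, 0.30, 0.50, 0.90]

a, b = fit_delay_line(loads_ff, delays_ns)

if __name__ == "__main__":
    print(f"intrinsic delay a = {a:.3f} ns, slope b = {b * 1000:.2f} ps/fF")
```

The intercept a is the unloaded (intrinsic) delay of the cell and the slope b is its drive strength expressed as delay per unit load.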
4.3 POWER DISSIPATION

To minimize the power consumption of a CMOS circuit, the various power components and their effect must be identified. There are two types of power dissipation. One is the maximum power dissipation, which is related to the peak of the instantaneous current, and the other is the average power dissipation. The peak current has an effect on the supply voltage noise due to the power line resistance. It can cause heating of the device, thus resulting in performance degradation. From the battery lifetime point of view, the average power dissipation is more important.

There are three power dissipation components within the CMOS inverter. These are:

1. Static power caused by the leakage current and other static current $I_{st}$ due to the value of the input voltage;
2. Dynamic power caused by the total output capacitance $C_L$; and
3. Dynamic power caused by the short-circuit current $I_{sc}$ during the switching transient.

Sometimes components (2) and (3) are merged as total dynamic power.
4.3.1 Static Power

This component is sometimes split into two other components. The sources of static power dissipation in a complementary CMOS inverter are leakage currents ($P_{s1}$) and current drawn from the supply due to the input voltage ($P_{s2}$). Hence the total static power is given by

$$P_s = P_{s1} + P_{s2} \tag{4.36}$$

Leakage current consists of MOS junction leakage currents. Fig. 4.9 shows the parasitic diodes in a CMOS inverter. The body ties in this structure are such that the parasitic diodes are not conducting (i.e., reverse biased and/or at zero voltage). The current in a diode is given by

$$I_d = I_s \left( \exp\frac{q V_d}{n k T} - 1 \right) \tag{4.37}$$

where n is the emission coefficient of the diode (sometimes equal to 1) and $V_d$ is the voltage applied to the diode. Note that the current parameter $I_s$ increases with temperature. The total power dissipation due to these leakage currents is given by

$$P_{s1} = \sum_i I_{d,i} V_{DD} \tag{4.38}$$

A typical value of this leakage current $I_d$ is 1 fA per device junction. This value is too small to have any effect on the static power: if we have one million devices, the total contribution to the power would be 0.01 µW. This first component of the static power is neglected in the analysis throughout the chapters of this book, except in Chapter 6 in the case of memory design.
We consider now the second component of the static power, which is a function of the input voltage $V_{in}$. Assume that the input of the pull-down NMOS of the inverter is at a voltage $0 \le V_{in} < V_T$. In this case the current is given by the subthreshold expression (Fig. 4.10)

$$I_{DS} = I_0 \frac{W_{eff}}{W_0} 10^{(V_{in} - V_T)/S} \tag{4.39}$$

where $V_T$ is the constant-current threshold voltage. For $V_{in} > V_T$ the current is given by expressions discussed in Chapter 3. The corresponding static power dissipation is given by

$$P_{s2} = I_{DS,mean} V_{DD} \tag{4.40}$$

The mean value of the current is taken over both the PMOS and NMOS transistors. For example, if $V_{in}$ = 0, $V_T$ = 0.15 V, $W_{eff}$ = 10 µm and S = 75 mV/decade, this current is 1 nA. For 1 million devices integrated, the total static power would be important (1 mA of current). Note that this current increases drastically with the increase of temperature [see Section 3.3.2]. This value, in standby mode, is not permitted for battery-operated applications. CMOS circuits have been known to consume energy only during switching. But this is not true now, since low-$V_T$ CMOS is used for low-voltage operation. Some CMOS circuits, which exhibit a high DC current, are discussed in Section 4.6.
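The numbers in this example can be reproduced with a short calculation. The sketch assumes the exponential form $I = I_{th} \cdot 10^{(V_{GS} - V_T)/S}$, where $I_{th}$ lumps together the prefactor and the width ratio. The value of 100 nA used for $I_{th}$ is our assumption, backed out so that the 10 µm device gives 1 nA as stated; it is not a parameter quoted in the text.

```python
def subthreshold_current(v_gs, v_t, swing, i_at_threshold):
    """Subthreshold current in amperes.

    swing is the subthreshold swing S in volts per decade; i_at_threshold is
    the current at v_gs == v_t (prefactor and width ratio lumped together).
    """
    return i_at_threshold * 10.0 ** ((v_gs - v_t) / swing)

I_AT_VT = 100e-9  # A, assumed value for a 10 um wide device (hypothetical)

i_off_low_vt = subthreshold_current(0.0, 0.15, 0.075, I_AT_VT)
i_off_high_vt = subthreshold_current(0.0, 0.50, 0.075, I_AT_VT)
standby_current = 1_000_000 * i_off_low_vt

if __name__ == "__main__":
    print(f"off current at V_T = 0.15 V: {i_off_low_vt * 1e9:.3f} nA")
    print(f"off current at V_T = 0.50 V: {i_off_high_vt * 1e15:.1f} fA")
    print(f"standby current, one million devices: {standby_current * 1e3:.3f} mA")
```

Each reduction of $V_T$ by one swing S multiplies the off current by ten, which is why low-$V_T$ processes make standby power a design concern.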
4.3.2 Dynamic Power of the Output Load

In this section we estimate the power dissipation due to the total output load capacitance $C_L$. This power is due to the currents needed to charge and discharge $C_L$ as shown in Fig. 4.11 and 4.12. We assume a step input so that the PMOS and NMOS are never on simultaneously. The average dynamic power $P_d$ required to charge and discharge a capacitance $C_L$ at a switching frequency f = 1/T (Fig. 4.12) is given by

(4.41)

The output current is given during the charging phase by

$$i_o = i_p = C_L \frac{dv_o}{dt} \tag{4.42}$$

and during the discharge phase by

$$i_o = i_n = -C_L \frac{dv_o}{dt} \tag{4.43}$$

Then Equation (4.41) becomes

(4.44)

Finally the dynamic power dissipation is

$$P_d = C_L V_{DD}^2 f \tag{4.45}$$
133
VDD
vDD
T h i s equation shows that the power dissipation is proportiond to the operating frequency. Moreover, the ieduction of the power supply d r a s t i d y reducer the power dissipation. Ideally, 3.3 V ~npply voltage rednces the power dissipation by 56% compared to that of 5 V. Moreover, at 1 V the power is reduced by 96% compared to 5 V. The expression of dynamic power in Equation (4.45) is valid only for an inverter. However, for E. complex gate the concept ofswitching activity is introduced [see Section 4.5.31.
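The quoted savings follow directly from the quadratic dependence on the supply. The sketch below assumes the familiar form $P_d = C_L V_{DD}^2 f$, which reproduces the 56% and 96% figures; the function names and the 100 fF, 100 MHz operating point are illustrative.

```python
def dynamic_power(c_load, vdd, freq):
    """Average switching power in watts for a load c_load (F) toggled at freq (Hz)."""
    return c_load * vdd ** 2 * freq

def saving_vs_reference(vdd, vdd_ref=5.0):
    """Fractional power reduction obtained by lowering the supply from vdd_ref to vdd."""
    return 1.0 - (vdd / vdd_ref) ** 2

if __name__ == "__main__":
    p = dynamic_power(100e-15, 3.3, 100e6)
    print(f"100 fF at 3.3 V and 100 MHz: {p * 1e6:.1f} uW")
    for vdd in (3.3, 2.5, 1.0):
        print(f"V_DD = {vdd} V: saving vs 5 V = {saving_vs_reference(vdd) * 100:.0f}%")
```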
During the first output transition (charging) from 0 → $V_{DD}$, the energy drawn from the power supply is $E_d = C_L V_{DD}^2$. For this transition, the energy stored in the load capacitor is

$$E_C = \frac{1}{2} C_L V_{DD}^2 \tag{4.46}$$

This means that during the output transition 0 → $V_{DD}$, half of the energy drawn from the supply is stored in the capacitor and the other half is consumed by the pull-up PMOS transistor. For the output transition $V_{DD}$ → 0, the energy ($\frac{1}{2} C_L V_{DD}^2$) stored in the capacitor is consumed by the pull-down NMOS transistor and no current is drawn from the supply.
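The half-and-half split can be verified with a small time-domain integration. In this sketch the pull-up PMOS is replaced by a fixed resistor, which is our simplification: the real device is nonlinear, but the energy split does not depend on the resistance. The function name, the forward-Euler scheme and the component values are ours.

```python
def charge_through_resistor(c_load, vdd, r_on, steps=200_000, n_tau=20.0):
    """Forward-Euler integration of an RC charge-up from 0 V.

    Returns (energy drawn from the supply, energy dissipated in the resistor,
    energy stored on the capacitor), all in joules.
    """
    dt = n_tau * r_on * c_load / steps
    v = 0.0
    e_supply = 0.0
    e_resistor = 0.0
    for _ in range(steps):
        i = (vdd - v) / r_on          # current through the pull-up path
        e_supply += vdd * i * dt      # energy delivered by the supply
        e_resistor += i * i * r_on * dt
        v += i * dt / c_load
    return e_supply, e_resistor, 0.5 * c_load * v * v

if __name__ == "__main__":
    c_load, vdd = 100e-15, 3.3
    e_sup, e_res, e_cap = charge_through_resistor(c_load, vdd, r_on=10e3)
    print(f"C*VDD^2          = {c_load * vdd ** 2 * 1e15:.1f} fJ")
    print(f"drawn from supply = {e_sup * 1e15:.1f} fJ")
    print(f"dissipated        = {e_res * 1e15:.1f} fJ")
    print(f"stored on C_L     = {e_cap * 1e15:.1f} fJ")
```

Changing r_on changes how long the transition takes but leaves the three energies unchanged, which is why the dynamic energy per transition depends only on $C_L$ and $V_{DD}$.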
4.3.2.1 Energy vs. Power

It is important to distinguish between energy and power. If, for example, we reduce the clock rate of a CMOS gate, its power consumption will be reduced by the same proportion. However, its energy will still be the same. Assume that the gate is powered with a battery to perform computations. The time required to complete the computation, with a low clock rate, will be increased. Therefore, after the computation the battery will be just as dead as if the computation had been performed at a high clock rate. So low-energy design is more important than low-power design. The figure of merit in this case can be defined as the product of energy times the delay. The conventional term, low-power, is used throughout this book to mean that we design for low energy.
4.3.3 Short-circuit Power Dissipation

Even if there were no load capacitance on the output of the inverter and the parasitics were negligible, the gate would still dissipate switching energy. If the input changes slowly, both the NMOS and PMOS transistors are ON for part of the transition, and excess power is dissipated due to the short-circuit current. Fig. 4.13 shows the short-circuit currents as the inverter switches, as a function of the rise time of the input. We are assuming that the rise time of the input is equal to the fall time. The short-circuit power dissipation is given by

$$P_{sc} = I_{mean} V_{DD} \tag{4.47}$$

To estimate $I_{mean}$, we use the simple model of the short-circuit current of Fig. 4.14 [5]. Also we assume that the inverter has symmetrical devices, which means that $\beta_n = \beta_p = \beta$ and $V_{Tn} = -V_{Tp} = V_T$. We also assume that the rise time is equal to the fall time of the input signal ($\tau_r = \tau_f = \tau$). The mean short-circuit current in the unloaded inverter is

$$I_{mean} = \frac{2}{T} \left[ \int_{t_1}^{t_2} i(t)\,dt + \int_{t_2}^{t_3} i(t)\,dt \right] \tag{4.48}$$

Due to symmetry we have

$$I_{mean} = \frac{4}{T} \int_{t_1}^{t_2} i(t)\,dt \tag{4.49}$$

Figure 4.13 Short-circuit current as a function of the input slope (horizontal axis: time in ns)

The NMOS transistor is operating in saturation, hence the above equation becomes

$$I_{mean} = \frac{4}{T} \int_{t_1}^{t_2} \frac{\beta}{2} \left( V_{in}(t) - V_T \right)^2 dt \tag{4.50}$$

The input voltage is given by

$$V_{in}(t) = \frac{V_{DD}}{\tau}\, t$$

It can be derived from Fig. 4.14 that

$$t_1 = \frac{V_T}{V_{DD}}\, \tau \tag{4.51}$$

and

$$t_2 = \frac{\tau}{2} \tag{4.52}$$

Then the integral leads to

$$P_{sc} = \frac{\beta}{12} (V_{DD} - 2V_T)^3 \frac{\tau}{T} \tag{4.53}$$

Figure 4.14 Input voltage and short-circuit current model
This equation shows that the short-circuit power dissipation is also proportional to the frequency. The only parameters that can be controlled by the circuit designer, at a given frequency and power supply, to reduce $P_{sc}$ are β and τ. Power supply scaling greatly reduces the short-circuit power dissipation. Note that this analysis was done for an unloaded inverter. For a loaded gate, if the output signal and input signal have equal rise/fall times, the short-circuit power dissipation will be less than 20% of the total power [5]. So it is very important to keep the edges fast to have negligible $P_{sc}$, or at least it is desirable to have equal input and output rise/fall times.

If the load capacitance is high, the output rise/fall times become larger than the input ones. In this case, the input changes completely before the output changes significantly. Therefore, the short-circuit current is near zero. Note that if $V_{DD}$ approaches ($V_{Tn} + |V_{Tp}|$) or is less, the short-circuit current can be eliminated because both devices cannot conduct simultaneously.
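The ramp-input model can be checked by integrating the saturation current numerically and comparing with the closed form that the integral yields, $P_{sc} = (\beta/12)(V_{DD} - 2V_T)^3\,\tau/T$. The sketch assumes a symmetric, unloaded inverter and a linear input ramp $V_{in}(t) = V_{DD}\,t/\tau$; the function names and the numerical values of β, τ and T are illustrative.

```python
def mean_short_circuit_current(beta, vdd, vt, tau, period, steps=100_000):
    """Midpoint-rule value of (4/T) * integral from t1 to t2 of (beta/2)(V_in - V_T)^2 dt."""
    if vdd <= 2.0 * vt:
        return 0.0                    # both devices never conduct together
    t1 = vt / vdd * tau               # input reaches V_T
    t2 = tau / 2.0                    # input reaches V_DD / 2
    dt = (t2 - t1) / steps
    total = 0.0
    for k in range(steps):
        t = t1 + (k + 0.5) * dt
        v_in = vdd * t / tau
        total += 0.5 * beta * (v_in - vt) ** 2 * dt
    return 4.0 / period * total

def short_circuit_power_closed_form(beta, vdd, vt, tau, period):
    """Closed-form short-circuit power for the same model, in watts."""
    if vdd <= 2.0 * vt:
        return 0.0
    return beta / 12.0 * (vdd - 2.0 * vt) ** 3 * tau / period

if __name__ == "__main__":
    beta, vt, tau, period = 100e-6, 0.5, 1e-9, 10e-9   # hypothetical values
    for vdd in (5.0, 3.3, 1.5, 1.0):
        numeric = mean_short_circuit_current(beta, vdd, vt, tau, period) * vdd
        closed = short_circuit_power_closed_form(beta, vdd, vt, tau, period)
        print(f"V_DD = {vdd} V: numeric = {numeric * 1e6:.3f} uW, closed form = {closed * 1e6:.3f} uW")
```

The cubic dependence on $(V_{DD} - 2V_T)$ is what makes supply scaling so effective against this component, and the result vanishes once the supply no longer exceeds the sum of the two threshold magnitudes.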
4.3.4 Other Power Issues

The total power dissipation of a CMOS gate is given by

$$P_{total} = P_s + P_d + P_{sc} \tag{4.54}$$

It represents the total power of a gate when it is switching at the same rate as the operating frequency. In Chapter 8, we will discuss how to estimate the power dissipation of a complex circuit.

Other power dissipation issues exist, such as worst-case power estimation and temperature effects. The worst-case conditions are: maximum $V_{DD}$ and junction temperature, and a fast-fast process. Static power dissipation (subthreshold current) is increased by increased temperature and increased power supply. Dynamic power is not sensitive to the temperature but it is affected greatly by the worst-case $V_{DD}$. Short-circuit power dissipation depends on the temperature just as the short-circuit current does. It is also dependent on the power supply. The mobility and threshold voltage decrease with increasing temperature. Each of these two parameters has an opposite effect on the current. So it is important to consider the worst-case power consumption evaluation in any design.

The simulated average total power dissipation can be easily measured by the SPICE simulator using POWER MEASUREMENT commands. In addition, several papers in the literature have introduced a "power meter" in circuit simulation to measure the power dissipation [6, 7, 8].
4.4 CAPACITANCE ESTIMATION

Previously we saw that the speed and power dissipation of CMOS gates depend strongly on the total output load capacitance. This capacitance is the sum of three components as shown in Fig. 4.15:

- Total input capacitance of the N driven gates, noted $C_{in}$;
- Parasitic output capacitance of the drive gate, noted $C_p$; and
- Wiring capacitance, noted $C_w$.

For simplicity we estimate, in this section, the average value of $C_L$ over the range of the output swing. This approach is used only for initial estimation of the design. More circuit simulation, layout extraction and post-layout simulation are needed for more accuracy. Moreover, it is sometimes interesting to derive a simple expression for the load capacitance to see the impact of important parameters on the speed and the power dissipation. We first examine the different components of the output load capacitance; then we illustrate the estimation approach by example.

4.4.1 Estimation of $C_{in}$

The total capacitance of the driven gates can be evaluated by summing the input capacitance of all the receiving gates and we have

$$C_{in} = \sum_{j=1}^{N} C_{gate,j} \tag{4.55}$$

The gate capacitance of the receiving gate can be approximated by

$$C_{gate} = C_{ox} \sum_{i=1}^{n} (WL)_i \tag{4.56}$$

where n is the number of transistors of the gate. This expression sums the gate capacitances of all the transistors composing the driven circuit. For a CMOS inverter it is given by

$$C_{gate} = C_{ox} (W_n L_n + W_p L_p) \tag{4.57}$$
Figure 4.16 shows an example of the equivalent gate capacitance of the receiving gate. The driven inverter has the following drawn sizes: $W_p = W_n$ = 20 µm and L = 0.8 µm. This gate can be replaced by an equivalent capacitance $C_{gate}$ ≈ 50 fF, which is approximately the same as the one calculated from Equation (4.57).
4.4.2 Parasitic Capacitances

Fig. 4.17 shows the main contributions to the output parasitic capacitance of a CMOS inverter. Thus, it is estimated by

$$C_p = C_{gdp} + C_{gdn} + C_{jp} + C_{jn} \tag{4.58}$$

The drain overlap capacitance for NMOS and PMOS is given by

$$C_{gd} = C_{gdo} W \tag{4.59}$$

$C_{gdo}$ is defined in the SPICE parameters of Chapter 3 as CGDO. The drain junction capacitance is a function of the reverse applied voltage during the switching of the inverter. The average value of this capacitance over the range of output swing is defined by

$$C_j = C_{ja} A_D + C_{jsw} P_D \tag{4.60}$$

where $A_D$ and $P_D$ are the area and the perimeter of the drain junction as shown in Fig. 4.18. The average bottom junction capacitance is

(4.61)

The average side-wall capacitance is

(4.62)
4.4.3 Wiring Capacitance

The simple model of wiring capacitance is based on the parallel-plate model [Fig. 4.19] given by

$$C_{pa} = \frac{\varepsilon_{ox}}{H} \tag{4.63}$$

where H is the thickness of the insulator layer (oxide), and $C_{pa}$ is the capacitance per unit area. The total capacitance of the wire is

$$C_w = l\, W\, C_{pa} \tag{4.64}$$

where W is the width of the wire (metal or poly), and l is the length of the wire. Table 4.1 gives some values of the wiring capacitance per area for the 0.8 µm process presented in Chapter 2. This capacitance cannot be known in the early design stage but can be known after layout extraction. When the thickness of the insulator becomes comparable to that of the wire, T, the fringing fields at the edge of the wire become important. The effect of the fringing fields is manifested by the increase of the effective area of the plates [Fig. 4.19]. Many approximations have been proposed to compute the effect of fringing capacitance. One relatively accurate empirical approximation is given by [9]

$$C_{wl} = \varepsilon_{ox} \left[ \frac{W}{H} + 0.77 + 1.06 \left( \frac{W}{H} \right)^{0.25} + 1.06 \left( \frac{T}{H} \right)^{0.5} \right] \tag{4.65}$$

where $C_{wl}$ is the total capacitance of the wire per unit length. The contribution of the fringing effect is in many cases important. Table 4.2 shows the fringing capacitance per unit of length.

Table 4.1 Typical 0.8-µm CMOS wiring capacitance per area

Layer                         Area Capacitance (10^-3 fF/µm^2)
Metal2 to substrate           11
Metal2 to Metal1              25
Metal1 to substrate           19
Metal1 to poly                28
Metal1 to diffusion           27
Gate poly over field oxide    58

Table 4.2 Typical 0.8-µm CMOS wire fringing capacitance

Layer                         Perimeter Capacitance (10^-3 fF/µm)
Metal2 to substrate           38
Metal2 to Metal1              47
Metal1 to substrate           44
Metal1 to poly                48
Metal1 to diffusion           47
Gate poly over field oxide    44
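To see how much the fringing terms add, the empirical expression can be evaluated next to the plain parallel-plate value. The sketch assumes the formula has the form $\varepsilon_{ox}[W/H + 0.77 + 1.06(W/H)^{0.25} + 1.06(T/H)^{0.5}]$, an SiO2 insulator with relative permittivity 3.9, and hypothetical wire dimensions; none of the dimensions come from the process tables above.

```python
EPS_0 = 8.854e-12    # F/m, permittivity of free space
EPS_R_SIO2 = 3.9     # relative permittivity of SiO2 (assumed insulator)

def wire_cap_per_length(width, thickness, height, eps_r=EPS_R_SIO2):
    """Total wire capacitance per unit length in F/m (dimensions in metres)."""
    w_h = width / height
    t_h = thickness / height
    return eps_r * EPS_0 * (w_h + 0.77 + 1.06 * w_h ** 0.25 + 1.06 * t_h ** 0.5)

def parallel_plate_per_length(width, height, eps_r=EPS_R_SIO2):
    """Parallel-plate part only, in F/m."""
    return eps_r * EPS_0 * width / height

if __name__ == "__main__":
    # Hypothetical wire: 2 um wide, 1 um thick, 1 um above the substrate.
    w, t, h = 2e-6, 1e-6, 1e-6
    total = wire_cap_per_length(w, t, h)
    plate = parallel_plate_per_length(w, h)
    # 1 F/m equals 1e9 fF/um.
    print(f"total           = {total * 1e9:.4f} fF/um")
    print(f"parallel plate  = {plate * 1e9:.4f} fF/um")
    print(f"fringing share  = {(1.0 - plate / total) * 100:.0f}%")
```

For narrow wires the fringing share dominates, which matches the observation that the perimeter term cannot be neglected.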
4.4.4 Example

Consider an inverter with $W_p = 2W_n$ = 20 µm with 3 µm length of each drain and source. This inverter is driving a line of metal1 of 100 µm length by 2 µm width and an inverter with $W_p = 2W_n$ = 20 µm, operating at $V_{DD}$ = 3.3 V. The total load capacitance is computed using the 0.8 µm device parameters presented in Chapter 3 as follows:

- The gate capacitance of the driven inverter is

  $C_g = [W_p L_p + W_n L_n]\, C_{ox}$ = [20 × 0.8 + 10 × 0.8] µm² × 2 fF/µm² ≈ 48 fF

- The total overlap capacitance at the output is

  $C_{gd} = CGDO_p W_p + CGDO_n W_n$

  Then

  $C_{gd}$ = 20 × 215 × 10^-3 + 10 × 214 × 10^-3 = 4.30 + 2.14 ≈ 7 fF

- The total drain junction capacitances can be approximated at the mid-voltage of 1.65 V (1/2 of $V_{DD}$) instead of computing integrals. We have for one drain junction

  The drain areas are 60 µm² and 30 µm² for PMOS and NMOS respectively. The drain perimeters are 46 µm and 26 µm for the PMOS and NMOS transistors respectively. The total junction capacitance can be easily calculated and is $C_j$ ≈ 32 fF. Note that this capacitance increases with the power supply voltage reduction.

- The wire capacitance is estimated by adding the two components, parallel-plate and fringing capacitances. The area of the wire is 200 µm² while its perimeter is 204 µm. We have

  $C_w$ = W × l × $C_w$(per area) + 2(W + l) × $C_w$(per length) = 200 µm² × 19 × 10^-3 fF/µm² + 204 µm × 44 × 10^-3 fF/µm = 3.8 + 9.0 ≈ 13 fF

  Note that the fringing capacitance is an important portion of the total wire capacitance.

Hence the total capacitance at the output is 100 fF. Note that the contribution of the junction capacitance is important. The contribution of each component varies from one circuit to another and it depends on the layout style used. Before starting any circuit layout, it is important to keep in mind an estimation of capacitances such as the gate and output capacitance of a 1-unit-size inverter and the wire capacitance of, for example, a 100 µm poly line and a 100 µm metal1 line. With these data, when starting the design, it is possible to size different transistors correctly.
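The bookkeeping of this example is easy to script, which helps when the same estimate has to be repeated for other sizes. The sketch below reuses the numbers of the example; the junction capacitance is entered as the 32 fF quoted above rather than recomputed, and the helper names are ours.

```python
def gate_capacitance(widths_um, length_um, cox_ff_per_um2):
    """Gate capacitance in fF of a driven gate whose transistors share one length."""
    return sum(widths_um) * length_um * cox_ff_per_um2

def wire_capacitance(width_um, length_um, area_cap, fringe_cap):
    """Wire capacitance in fF from area (fF/um^2) and perimeter (fF/um) coefficients."""
    area = width_um * length_um
    perimeter = 2.0 * (width_um + length_um)
    return area * area_cap + perimeter * fringe_cap

c_gate = gate_capacitance([20.0, 10.0], 0.8, 2.0)        # driven inverter
c_overlap = 20.0 * 215e-3 + 10.0 * 214e-3                # drain overlap, PMOS + NMOS
c_junction = 32.0                                        # value quoted in the text
c_wire = wire_capacitance(2.0, 100.0, 19e-3, 44e-3)      # metal1 over substrate
c_total = c_gate + c_overlap + c_junction + c_wire

if __name__ == "__main__":
    for name, value in (("gate", c_gate), ("overlap", c_overlap),
                        ("junction", c_junction), ("wire", c_wire),
                        ("total", c_total)):
        print(f"{name:9s} {value:7.2f} fF")
```

Without intermediate rounding the sum is slightly below the 100 fF obtained from the rounded terms, which is well within the accuracy of this kind of early estimate.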
4.5 CMOS STATIC LOGIC DESIGN

From the CMOS inverter we can realize any static logic function by using complementary NMOS and PMOS transistors. In this section we present the design of NAND/NOR, complex and transmission gates. The fanin of any complex gate is defined as the number of inputs of this gate. The fanout of a complex logic gate is the number of driven inputs attached to the output of this gate.
4.5.1 NAND/NOR Gates

Fig. 4.20 shows a 2-input NAND gate (NAND2) and a 2-input NOR gate (NOR2). Each input requires a complementary pair. In the NAND gate, the PMOS transistors are connected in parallel, while the NMOS transistors are connected in series. In the NOR gate, the NMOS devices are connected in parallel, while the PMOS devices are connected in series. These gates consume only dynamic power; the DC power dissipation is zero (if the $V_T$'s are high) because there is no DC path between $V_{DD}$ and ground for any logic combination of the inputs. For the NAND and NOR gates of Fig. 4.20, for any input combination (AB = 00, 01, 11, or 10) there is no path between the two rails. The design of these gates, or of any CMOS static gate, follows that of an inverter. As discussed in Sections 4.1 and 4.2, an inverter is designed to meet a given DC and transient performance, and then $(W/L)_p$ and $(W/L)_n$ are determined. The $(W/L)_p$ and $(W/L)_n$ of the devices of a logic gate are determined as follows. Suppose, for example, that we want to design a 3-input NAND (Fig. 4.21(a)) to have the same DC and transient performance as an inverter driving the same $C_L$ (Fig. 4.21(b)).
We assume that
$$W_n = W_{n1} = W_{n2} = W_{n3} \qquad (4.66)$$
and
$$W_p = W_{p1} = W_{p2} = W_{p3} \qquad (4.67)$$
The first thing to do is to approximate the gate by an equivalent inverter where the effective $\beta$ is given by
$$\frac{1}{\beta_{neff}} = \frac{1}{\beta_{n1}} + \frac{1}{\beta_{n2}} + \frac{1}{\beta_{n3}} = \frac{3}{\beta_n} \qquad (4.68)$$
and
$$\beta_{peff} = \beta_p \qquad (4.69)$$
To have the logic threshold of the gate midway between the supply rails in the DC characteristics, the following condition should be satisfied for the 3-input NAND gate (see Equation 4.18)
$$\beta_{peff} = \beta_{neff} \qquad (4.70)$$
which means that
$$\beta_p = \frac{\beta_n}{3} \qquad (4.71)$$
To have the same delay as an inverter with determined sizes $W_{pi}$ and $W_{ni}$, we should have (assuming that $L$ is the same)
$$W_p = W_{peff} = W_{pi} \qquad (4.72)$$
and
$$W_n = 3W_{neff} = 3W_{ni} \qquad (4.73)$$
But in practice the size of the transistors composing the 3-input NAND gate should be increased, because the output parasitic capacitance of the NAND gate (or any complex gate) is larger than that of the inverter. Hence
$$W_p > W_{pi} \qquad (4.74)$$
and
$$W_n > 3W_{ni} \qquad (4.75)$$
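The first-order sizing rule of Equations 4.72 and 4.73 can be written as a small helper. This is a sketch only: it generalizes the 3-input case to an m-input NAND by assuming that m identical series NMOS devices must each be m times wider than the reference inverter's NMOS, and it returns the starting point before the extra widening that Equations 4.74 and 4.75 call for. The function name is hypothetical.

```python
def nand_first_order_widths(m, w_ni, w_pi):
    """Return (W_n, W_p) for an m-input NAND matched to a reference inverter.

    w_ni, w_pi: NMOS and PMOS widths of the reference inverter (same L assumed).
    """
    w_n = m * w_ni  # m NMOS devices in series: each is made m times wider
    w_p = w_pi      # worst case pull-up is a single PMOS, as in the inverter
    return w_n, w_p


if __name__ == "__main__":
    # Reference inverter of the earlier example: W_p = 20 um, W_n = 10 um.
    print(nand_first_order_widths(3, w_ni=10, w_pi=20))
```

In practice the returned widths are a lower bound, since the larger output parasitic capacitance of the NAND gate has to be compensated, typically with the help of circuit simulation.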
Note that by circuit simulation, we can properly size the transistors. Moreover, it should be noted that the back-gate bias effect has to be taken into consideration in the design of the series NMOS devices in a NAND gate (or the series PMOS devices in a NOR gate). The series-connected MOSFETs, during switching, exhibit a threshold voltage increase due to a non-null source-substrate voltage, as shown in the simulation example of Fig. 4.22. In Fig. 4.22(a), the transistor $N_1$ of the first NAND3 gate, near the output out1, is driven by the latest signal because $N_2$ and $N_3$ are already ON. Therefore, the node $a_1$ is at the ground level and the source of the transistor $N_1$ is not subject to the body effect. In the other NAND3 gate, the transistors $N_4$ and $N_5$ are ON, while $N_6$ receives the input signal. In this case, the nodes $a_2$ and $b_2$ are at a certain voltage level. Hence, during the discharging period the transistors $N_4$ and $N_5$ are subject to the body effect. This effect slows the discharge of the output, as shown in Fig. 4.22(b): the output out1 is discharged more rapidly than the output out2. One way to reduce the body effect at the logic level is to put the transistor driven by the latest arriving signal near the output. The early arriving signals should be used to discharge the nodes susceptible to the body effect. For example, in an adder circuit, the transistor driven by the carry is placed near the output.
Let us derive the output parasitic capacitance of the m-input NAND gate and compare it to that of the CMOS inverter of Fig. 4.21(b). We have
$$C_o = m W_p C_{jp} + W_n C_{jn} + m C_{GDp} + C_{GDn} \qquad (4.76)$$
The NMOS overlap capacitance $C_{GDn}$ of the m-input gate is larger than that of the CMOS inverter by the ratio $W_n/W_{ni}$. From the above equation it is obvious that $C_o$ of the m-input NAND gate is larger than that of the CMOS inverter. Note that, for the same performance and the same number of inputs, the NAND gate consumes less silicon area than a NOR gate because of the smaller area taken by the NMOS devices. Hence, CMOS NAND gates are more widely used than NOR gates. Moreover, the NOR gate consumes more power than the NAND gate.
4.5.2 Complex CMOS Logic Gates

The strategy used to build NAND/NOR gates can be extended to build more complex logic gates. Complex logic functions can be realized by connecting several NAND, NOR and INVERTER gates. However, they can also be efficiently realized using a single CMOS logic gate. Any complex CMOS gate is formed by two logic blocks, N and P, as shown in Fig. 4.23(a). The two blocks have the same number of transistors. Fig. 4.23(b) shows a three-input complex CMOS gate and its logic equivalent symbol. The topology of the block N is the dual of the block P, i.e., parallel connections become series and vice versa. In either the P or the N logic block, the parallel combination is placed far from the output to minimize the output capacitance, which improves the speed and possibly the dynamic power dissipation. For example, the contribution of the N block to the output capacitance in Fig. 4.23(b) is less than that of Fig. 4.23(c). There is no direct DC path between $V_{DD}$ and ground for any logic input combination. In practice, complex CMOS gates are used for a maximum fanin of 5 to 6.
Figure 4.23 CMOS complex logic gates built from N and P logic blocks; panels (a) to (c).
4.5.3 Switching Activity Concept

So far, we have discussed the dynamic power dissipation of an inverter due to the load capacitance. What about a CMOS complex gate driving a load capacitance? The dynamic power dissipation has two components in a complex gate: the internal cell power, $P_{int}$, and the capacitive load power. The internal cell power consists of the power dissipated by the internal capacitive nodes. Sometimes the internal short-circuit power is added to the internal cell dynamic power. The dynamic power for a complex gate cannot be estimated by the simple expression $C_L V_{DD}^2 f$, because the gate might not always switch when the clock is switching. The switching activity determines how often this switching occurs on a capacitive node. For $N$ periods of $0 \to V_{DD}$ and $V_{DD} \to 0$ transitions, the switching activity $\alpha$ determines how many $0 \to V_{DD}$ transitions² occur at the output. In other words, the activity $\alpha$ represents the probability³ that a transition $0 \to V_{DD}$ will occur during the period $T = 1/f$, where $f$ is the frequency (periodicity) of the inputs of the gate. The average dynamic power of a complex gate due to the output load capacitance is
$$P_d = \alpha C_L V_{DD}^2 f \qquad (4.77)$$
The internal power dissipation, due to the internal capacitive nodes, can be characterized by simulation. Fig. 4.24 illustrates an example of a complex gate with internal nodes. The internal dynamic power of a cell is given by
$$P_{int} = \sum_{i=1}^{n} \alpha_i C_i V_i V_{DD} f \qquad (4.78)$$
where $n$ is the number of internal nodes, $\alpha_i$ is the switching activity of each node $i$, $C_i$ is the parasitic capacitance of the internal node, and $V_i$ is the internal voltage swing of each node $i$. The parasitic capacitance at the output is included with the load $C_L$. Note that the internal voltage swing can be different from $V_{DD}$.
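Equations 4.77 and 4.78 translate directly into code. The sketch below is illustrative: the function names are made up, and the numbers in the usage line (activity 3/16, 100 fF load, 3.3 V supply, 50 MHz) are assumed values chosen only to exercise the formulas.

```python
def load_power(alpha, c_load, vdd, freq):
    """Average dynamic power at the output node, Eq. 4.77 (SI units)."""
    return alpha * c_load * vdd ** 2 * freq


def internal_power(nodes, vdd, freq):
    """Internal dynamic power of a cell, Eq. 4.78.

    nodes: iterable of (alpha_i, C_i, V_i) tuples, one per internal node,
    where V_i is the voltage swing of that node.
    """
    return sum(alpha * cap * swing * vdd * freq for alpha, cap, swing in nodes)


if __name__ == "__main__":
    p_load = load_power(alpha=3 / 16, c_load=100e-15, vdd=3.3, freq=50e6)
    print(f"load power = {p_load * 1e6:.2f} uW")
```

For an internal node that swings rail to rail ($V_i = V_{DD}$), the internal term has the same form as the load term, which is why the output parasitic capacitance can simply be lumped into $C_L$.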
4.5.4 Switching Activity of Static CMOS Gates

In this section we consider the computation of the switching activity of static CMOS gates. We will discuss the case of dynamic gates and other circuit styles in the next sections. First we consider the case of a NOR gate. Then we treat several static gates. Table 4.3 illustrates the truth table of the NOR gate. From the table, the probability that the output is at zero is 3/4 and that it is at one is 1/4. The probability for a $0 \to V_{DD}$ transition is computed by multiplying the probability that the output will be at zero, $P_0$, by the probability that it will be at one, $P_1$:
$$P_{NOR} = P_0 \cdot P_1 = \frac{3}{4} \times \frac{1}{4} = \frac{3}{16} \qquad (4.79)$$
We assume that the inputs are uniformly distributed (i.e., the probabilities $P(A=1) = P(B=1) = 1/2$).

² During this transition the energy $C_L V_{DD}^2$ is drawn from the supply.
³ We assume that the gate does not experience glitching.
We can show that for any boolean function, the activity of a static gate is given by
$$\alpha = P(0 \to 1) = P_0 \cdot P_1 \qquad (4.80)$$
where $P_0$ is computed by dividing the number of zeros in the output column of the truth table by the total number of input combinations ($N = 2^n$ for an n-input gate) and $P_1$ is computed by dividing the number of ones by $N$. $P_0$ is also equal to $(1 - P_1)$. Fig. 4.25 shows the probability that the output makes a $0 \to 1$ transition for several static gates. The inputs are assumed to be uniformly distributed.
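The counting procedure behind Equation 4.80 can be sketched as follows, assuming independent and equiprobable inputs as in the text. The helper enumerates the truth table of an arbitrary boolean function; the gate functions and names are illustrative.

```python
from fractions import Fraction
from itertools import product


def transition_probability(func, n_inputs):
    """Return P(0->1) = P0 * P1 for a static gate with uniform inputs."""
    outputs = [func(*bits) for bits in product((0, 1), repeat=n_inputs)]
    p1 = Fraction(sum(outputs), len(outputs))  # fraction of ones in the table
    p0 = 1 - p1                                # fraction of zeros
    return p0 * p1


def nor2(a, b):
    return 1 - (a | b)


def nand2(a, b):
    return 1 - (a & b)


def xor2(a, b):
    return a ^ b


def nand6(*bits):
    return 1 - int(all(bits))


if __name__ == "__main__":
    for name, gate, n in (("NOR2", nor2, 2), ("NAND2", nand2, 2),
                          ("XOR2", xor2, 2), ("NAND6", nand6, 6)):
        print(name, transition_probability(gate, n))
```

Exact fractions are used so that results such as 3/16 for the NOR gate can be compared with the text without rounding.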
Figure 4.25 Output activities for static logic gates with uniformly distributed inputs.

4.5.4.1 Example

As an example of a logic decision for low power, consider different implementations of a 6-input AND gate driving a 0.1 pF load. As shown in Fig. 4.26, we may compare the following implementations:

• Implementation 1: a 6-input NAND and an inverter;
• Implementation 2: two 3-input NANDs and one 2-input NOR; and
• Implementation 3: three 2-input NANDs and one 3-input NOR.

The library used for this comparison is a high-performance standard cell library optimized for speed. Table 4.4 shows some characteristics of the library. The average delay reported is the average value of the rise and fall delay times. $W_p = 2W_n = 10$ µm is set for all the transistors composing the different gates. The delay is a function of the output load capacitance⁴ $C_L$ in pF. The area is expressed in terms of a unit area called the cell grid; each unit area for a cell has a certain height and width. Also included in this table are the input capacitance of a gate and the output parasitic capacitance in fF. We make, for this example, the following assumptions:

• We neglect the wiring capacitance between the different cells; and
• We also neglect the internal power of each gate.

⁴ This capacitance does not include the output parasitic one.

Figure 4.26 Implementations of the 6-input AND gate (node transition probabilities annotated in the figure, e.g., P = 63/4096).

Table 4.4 Characteristics of the high-performance standard cell library

Gate type   Area (cell unit)   Output cap. (fF)   Input cap. (fF)   Average delay (ns)
INV         2                  85                 48                0.22 + 1.00 C_L
NAND2       3                  105                48                0.30 + 1.24 C_L
NAND3       4                  132                48                0.37 + 1.50 C_L
NAND6       7                  200                48                0.65 + 2.30 C_L
NOR2        3                  101                48                0.27 + 1.50 C_L
NOR3        4                  117                48                0.31 + 2.00 C_L

First we compare the delay and the area of the different implementations. Using the data of Table 4.4, the results are reported in Table 4.5. The delay may be computed or simulated by SPICE, as illustrated in Table 4.5. Implementations 2 and 3 offer the best speed compared to the first one. However, they require more area.
Table 4.5 Area and delay of the three implementations

                      Implem. 1   Implem. 2   Implem. 3
Area (cell unit)      9           11          13
Computed delay (ns)   1.1         0.85        0.87
SPICE delay (ns)      1.1         0.86        0.83

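The area and computed-delay rows of Table 4.5 can be retraced from the library data. The sketch below assumes the gate-to-delay assignment reconstructed in Table 4.4, a two-stage path (first-stage NAND driving one input of the output gate, output gate driving the 0.1 pF load), and no wiring capacitance; all names are illustrative.

```python
# Table 4.4 data per gate: (area in cell units, input cap in fF,
# delay intercept in ns, delay slope in ns per pF of load).
LIBRARY = {
    "INV":   (2, 48, 0.22, 1.00),
    "NAND2": (3, 48, 0.30, 1.24),
    "NAND3": (4, 48, 0.37, 1.50),
    "NAND6": (7, 48, 0.65, 2.30),
    "NOR2":  (3, 48, 0.27, 1.50),
    "NOR3":  (4, 48, 0.31, 2.00),
}
LOAD_PF = 0.1  # external load on the final output

# Per implementation: (first-stage gate, number of such gates, output gate).
IMPLEMENTATIONS = {
    1: ("NAND6", 1, "INV"),
    2: ("NAND3", 2, "NOR2"),
    3: ("NAND2", 3, "NOR3"),
}


def gate_delay(gate, load_pf):
    """Average delay of one gate for a given external load in pF."""
    _, _, intercept, slope = LIBRARY[gate]
    return intercept + slope * load_pf


def path_delay(implementation):
    """Delay of the two-stage path, in ns."""
    first, _, last = IMPLEMENTATIONS[implementation]
    stage_load_pf = LIBRARY[last][1] / 1000.0  # input cap of the driven gate
    return gate_delay(first, stage_load_pf) + gate_delay(last, LOAD_PF)


def total_area(implementation):
    """Total area in cell units."""
    first, count, last = IMPLEMENTATIONS[implementation]
    return count * LIBRARY[first][0] + LIBRARY[last][0]


if __name__ == "__main__":
    for k in IMPLEMENTATIONS:
        print(k, total_area(k), round(path_delay(k), 2))
```

Under these assumptions the areas come out as 9, 11 and 13 cell units and the delays as about 1.08, 0.86 and 0.87 ns, close to the values tabulated above.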
Let us now compare the power dissipation using the power cost function. It is defined by
$$\text{Power cost} = \sum_i P_{0 \to 1,i}\, C_i \qquad (4.86)$$
where $P_{0 \to 1,i}$ is the probability of a $0 \to 1$ transition at node $i$ and $C_i$ is the total capacitance at node $i$. We assume that the inputs A, B, C, D, E, and F are uncorrelated and random (i.e., $P_1 = 0.5$). For the implementations of Fig. 4.26, we compute the transition probabilities. Table 4.6 summarizes the computation of the probabilities at the different nodes in the circuit.
Table 4.6 Transition probabilities at the circuit nodes

Implementation 1    O1        Z
P_1                 63/64     1/64
P_0 = 1 - P_1       1/64      63/64
P_0->1              63/4096   63/4096

Implementation 2    O1     O2     Z
P_1                 7/8    7/8    1/64
P_0 = 1 - P_1       1/8    1/8    63/64
P_0->1              7/64   7/64   63/4096

Note that the node O1, in implementation 1, has a lower switching activity compared to the other two. To compute the power cost function we do not include the primary inputs. Table 4.7 illustrates the results of this calculation. The results indicate that implementation 1 has the lowest power. So technology mapping is important for low-power applications.
We consider now another example using a low-area 0.8 µm CMOS standard cell library for the 6-input AND implementation. Some characteristics of this library are shown in Table 4.8. Compared to the library presented in Table 4.4, this library uses small transistors with $W_p = W_n = 4$ µm. Compared to the case of the high-performance library, the cell area unit, in the low-area case, is smaller by a factor of 1.5. Note that the delays of the different gates are higher. However, the input gate and output parasitic capacitances are lower. Thus, this library can be used for low-power function implementation.
Table 4.8 Characteristics of a low-area 0.8 µm CMOS library

Gate type   Area (cell unit)   Output cap. (fF)   Input cap. (fF)   Average delay (ns)
INV         -                  35                 13                0.23 + 3.73 C_L
NAND2       3                  60                 13                0.28 + 4.40 C_L
NAND3       4                  65                 13                0.34 + 6.00 C_L
NAND6       7                  81                 13                0.53 + 7.13 C_L
NOR2        3                  62                 13                0.35 + 6.27 C_L
NOR3        4                  69                 13                0.47 + 8.84 C_L

Table 4.9 Power cost for the low-area library

                  Implem. 1   Implem. 2   Implem. 3
Power cost (fF)   3.5         19.5        43.7

The delays reported in Table 4.8 do not include the effect of the input voltage slope. The delay of the different implementations was simulated with SPICE and is almost the same for all the configurations: 1.5 ns. Using the same reasoning discussed earlier, we can compute the power cost function using this library. The transition probabilities are the same; only the total node capacitances are different. The results of the power cost evaluation are illustrated in Table 4.9. The power cost, in the case of the low-area library, is almost half of that of the high-performance library. Still, implementation 1 has a low-power characteristic while its speed is almost the same as that of the others. Its area is also lower than that of the other implementations. This example shows that the power dissipation can be reduced at the gate level. Even if we take into account the wire capacitances between the cells, the conclusion is still valid. The topic of low power at the gate level is discussed further in Chapter 8. Keep in mind that, in this comparison, the internal power of the gates has not been considered.
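The power cost of Equation 4.86 can be evaluated for both libraries with a few lines of code. This is a sketch under the stated assumptions of the example: primary inputs are excluded, wiring capacitance and internal power are neglected, each internal node carries the NAND output parasitic capacitance plus one input capacitance of the output gate, and the output node carries the output parasitic capacitance plus the 100 fF load. The dictionaries simply restate the capacitance columns of Tables 4.4 and 4.8; the names are illustrative.

```python
from fractions import Fraction

LOAD_FF = 100  # external load of 0.1 pF, expressed in fF

# Per gate: (output parasitic capacitance, input capacitance), both in fF.
HIGH_PERFORMANCE = {"INV": (85, 48), "NAND2": (105, 48), "NAND3": (132, 48),
                    "NAND6": (200, 48), "NOR2": (101, 48), "NOR3": (117, 48)}
LOW_AREA = {"INV": (35, 13), "NAND2": (60, 13), "NAND3": (65, 13),
            "NAND6": (81, 13), "NOR2": (62, 13), "NOR3": (69, 13)}

# Per implementation: (first-stage NAND, number of them, its fan-in, output gate).
IMPLEMENTATIONS = {
    1: ("NAND6", 1, 6, "INV"),
    2: ("NAND3", 2, 3, "NOR2"),
    3: ("NAND2", 3, 2, "NOR3"),
}


def nand_activity(fan_in):
    """P(0->1) = P0 * P1 at a NAND output with uniform, independent inputs."""
    p0 = Fraction(1, 2 ** fan_in)  # output is 0 only when all inputs are 1
    return p0 * (1 - p0)


# The final output Z is the AND of all six inputs in every implementation.
Z_ACTIVITY = Fraction(1, 64) * Fraction(63, 64)


def power_cost(library, implementation):
    """Sum of P(0->1) * C over the internal and output nodes, in fF."""
    first, count, fan_in, last = IMPLEMENTATIONS[implementation]
    internal_cap = library[first][0] + library[last][1]
    output_cap = library[last][0] + LOAD_FF
    cost = count * nand_activity(fan_in) * internal_cap + Z_ACTIVITY * output_cap
    return float(cost)


if __name__ == "__main__":
    for k in IMPLEMENTATIONS:
        print(k, round(power_cost(HIGH_PERFORMANCE, k), 1),
              round(power_cost(LOW_AREA, k), 1))
```

For the low-area library this gives roughly 3.5, 19.6 and 43.7 fF, in line with Table 4.9, and about half of the corresponding high-performance values (roughly 6.7, 42.5 and 89.4 fF).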
4.5.5 Glitching Power

Note that in the probabilities discussed so far, we assumed that the gates had zero delay. In that case, we are not taking into account the glitches and we consider only the transitions between stable states. Glitches must be considered if we assume non-zero delay at gates. Thus the total dynamic power of a circuit is the sum of the dynamic power with zero delays and the glitching power. So what is the glitching phenomenon?
In a static logic gate, the output or internal nodes can switch before the correct logical value becomes stable. To illustrate this spurious transition, Fig. 4.27 shows an example of a circuit with a cascaded configuration. When the inputs ABC make the transition $100 \to 111$, the output, with zero-delay gates, should stay high. However, considering a unit delay for each gate, the output O1 is delayed compared to the input C, causing the output Z to evaluate with the new value of C and the old value of O1. In that case, the output experiences a dynamic hazard (glitch). This transition increases the dynamic power of the circuit and adds a dynamic component to the switching activity.
Another example is shown in Fig. 4.28(a). The cascaded circuit exhibits a glitching problem. However, the same function can be implemented using a balanced-delay implementation as shown in Fig. 4.28(b). These are some rules to avoid this problem:

• Balance delay paths, particularly on highly loaded nodes;
• Insert, if possible, buffers to equalize the fast path;
• Avoid, if possible, the cascaded implementation; and
• Redesign the logic when the power due to the glitches is an important component.
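The glitch mechanism described in this section can be reproduced with a tiny unit-delay simulation. The circuit used here is an assumption, not necessarily that of Fig. 4.27: two cascaded NAND gates, O1 = NAND(A, B) and Z = NAND(O1, C), chosen because they show the described behavior for the input transition 100 to 111. All names are hypothetical.

```python
def nand(a, b):
    return 1 - (a & b)


def simulate_unit_delay(a, b, c, o1, z, steps):
    """Apply inputs (a, b, c) to a circuit whose nodes start at (o1, z).

    Each gate output at step t+1 is computed from the node values at step t,
    which models one unit of delay per gate. Returns the trace of Z.
    """
    trace = [z]
    for _ in range(steps):
        # Tuple assignment: both gates see the old value of O1.
        o1, z = nand(a, b), nand(o1, c)
        trace.append(z)
    return trace


def count_rising(trace):
    """Number of 0 -> 1 transitions in a trace."""
    return sum(1 for prev, cur in zip(trace, trace[1:]) if prev == 0 and cur == 1)


if __name__ == "__main__":
    # Steady state for ABC = 100: O1 = NAND(1, 0) = 1, Z = NAND(1, 0) = 1.
    z_trace = simulate_unit_delay(1, 1, 1, o1=1, z=1, steps=3)
    print(z_trace, "spurious rising edges:", count_rising(z_trace))
```

With zero-delay gates Z is 1 before and after the transition, so the extra falling and rising edge in the trace is pure glitching activity that charges the output capacitance once more than the stable-state analysis predicts.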
4.5.6 Basic Physical Design

To implement simple gates, the physical layout should be performed. It is usually easy to draw a layout of a gate with well arranged transistors. For example, for the inverter, Fig. 4.29(a) shows a possible layout implementation. Metal1 is used for the power lines. Many variations can be drawn, depending on the use of the gate. Fig. 4.29(b) shows another layout variation of the inverter where metal2 is used for the power lines. For clarity the wells and body ties are not shown in these layouts.
Similarly, the schematics of NAND2 and NOR2 gates can be converted to layouts. Fig. 4.30(a) shows one possible layout of a two-input NAND gate. The layout can also be arranged to draw the input poly lines vertically. The layout artist should draw the gate taking into consideration the environment of this cell (the connectivity to others). Fig. 4.30(b) shows the layout of a two-input NOR gate. Note that the junction areas should be optimized during the layout to reduce the power dissipation and improve the speed of the cell. An implementation of a 2-input NOR gate with a high output drain junction capacitance is shown in Fig. 4.31.
To lay out a complex gate (i.e., several tens of transistors), the following general layout guidelines can be used:

• Set the sizing of the transistors composing the gate;
• Run $V_{DD}$ and $V_{SS}$ in metal (1 or 2) horizontally, for example with $V_{DD}$ at the top and $V_{SS}$ at the bottom of the cell, in semi-rectangular form;
• Define the orientation of the polysilicon gate lines and order them for maximum active area cross-over to form the gate regions;
• Place the N-block (NMOS transistors) near $V_{SS}$ and the P-block (PMOS transistors) near $V_{DD}$. The PMOS devices should be located in a common N-well if they use the same bulk potential;
• Adhere to the design rules and use, if possible, an interactive DRC (Design Rule Checker);
• Keep the internal junction and wire capacitances to a minimum to minimize the power and the delay; and
• Complete the connection of the different nodes inside the cell using the different layers available (metal1, poly, etc.).

Note that the power line widths are drawn taking into consideration the current consumed by the cell, because the electromigration phenomenon sets the minimum width of conductors.
For low-power design, these are some layout guidelines:

• Identify, in your circuit, the high switching activity nodes;
• Use low-capacitance layers such as metal2, metal3, etc. for these high activity nodes;
• Keep the wires of high activity nodes short;
• Use low-capacitance layers for high capacitive nodes and busses;
• For large width devices, use a special layout, such as interdigitated fingers [3] or donut (round transistor), to achieve a low drain junction capacitance; and
• Design complex cells or blocks using, as much as possible, a custom approach.

4.5.7 Physical Design Methodologies

There are many layout methodologies for the physical implementation of a complex circuit. The first methodology is called full-custom design, where the layout of each transistor is optimized. The layout of a complex block is performed by custom design for reasons of speed. However, this style leads to low design productivity and is rarely used in ASICs and digital processors. But when low power is an issue, full-custom design can be used to minimize the power of the circuit. Another design methodology is the standard-cell approach (or semi-custom design). That is, several gates and functions are created in the library, such as:

• NAND, NOR, XOR, AOI, OAI, latches, buffers, multiplexers, full-adders, flip-flops, etc.;
• Linear cells: low-battery detector, power-up reset, etc.;
• MSI/LSI functions: ALU (Arithmetic and Logic Unit), counters, magnitude comparators, etc.;
• Compiled macrocells: register file, FIFO (First In First Out), ROM (Read Only Memory), parallel multiplier, etc.; and
• Macrocells: 8/16-bit microcontroller, 16-b fixed point DSP, UART (Universal Asynchronous Receiver/Transmitter), etc.

A circuit is designed by capturing the schematic or the functional model (VHDL, Verilog, etc.) of the cells. The layout is generated by automatic placement and routing. An example of a CMOS standard cell library can be found in [10]. In the standard cell approach, the logic cells have the same height and the width is variable. In many libraries, the cells are available in two layout styles. In the area-optimized style, the cells are made as small as possible. In the performance-optimized style, cells are optimized for high-speed performance and, as a result, occupy more area than the small cells. Even the height of the cells in the two styles is different. A typical standard cell layout for a NAND gate is shown in Fig. 4.32. This methodology provides lower cost and higher productivity than the full-custom one. For low-power applications, the small and large cells for the same function can be carefully chosen to optimize the power in a complex design without degrading the timing requirement.
The third layout methodology is the gate array⁶. Gate arrays consist of implemented cells and need only the personalization steps. Fig. 4.33 illustrates an example of a gate-array core using the Sea-Of-Gates structure. It consists of I/O and internal cell areas. The I/O cell area contains pads with input/output buffers. The internal cell array contains a continuous array of NMOS and PMOS transistors. Hence, the transistors and interconnects are already predefined. The design of a logic gate consists of wiring the different transistors using metallization and contacts. The isolation of a logic gate is performed by tying the polysilicon gates of the limiting transistors to $V_{SS}$ or $V_{DD}$, depending on the type of gate diffusion. Routing channels are routed over unused transistors. This methodology permits the reduction of the design cost at the expense of area, power and performance.

⁶ One recent gate array architecture was based on multiplexers with small size transistors to maintain low-power characteristics [11].
Figure 4.32 An example of standard cell layout (NAND2).
Figure 4.33 Gate-array core using a Sea-Of-Gates structure (labels: I/O cell area, $V_{DD}$ (metal), P-diffusion, polysilicon gates, N-diffusion, $V_{SS}$ (metal)).
Comparing these layout approaches, the full-custom methodology offers the best approach to minimize the power dissipation. However, for a complex design, it is costly to use such a design strategy. The standard cell approach provides good performance and an improved design time. However, in many libraries the devices are oversized for performance purposes and, consequently, the power dissipation is high. To use the standard cell technique efficiently for low-power applications, the library should be expanded to include several versions of the same function with different driving capabilities. In that case, powerful synthesis tools are needed to optimize the power while maintaining the timing specifications. Moreover, both the standard cell and gate array styles require new place and route tools for low-power design.
4.5.8 Conventional CMOS Pass-Transistor Logic

Figure 4.34 (a) CMOS transmission gate; (b) and (c) schematic symbols.
Another alternative to CMOS static complementary logic is the conventional pass-transistor logic based on MOS switches. Fig. 4.34 shows a CMOS transmission gate (TG) as the primitive element. It consists of a complementary pair connected in parallel. It acts as a switch, with the logic variable A as the control input. If A is low, the gate is OFF and presents a high resistance between the terminals. If A is high, the gate is ON and acts as a switch with an on resistance of $R_n$ and $R_p$ in parallel. The equivalent resistance of the TG is $R_{TG} = R_n \parallel R_p$. This resistance is always less than the smaller of $R_n$ and $R_p$. This permits a fast switching characteristic. When the input I is at $V_{DD}$, the output F is quickly charged, initially by the NMOS and at the end by the PMOS transistor, as illustrated by the equivalent resistances of Fig. 4.35. In this figure, we assume that at $V_{out} = 0$, A and $\bar{A}$ are set to their final values. During this transient switching phase the NMOS is subject to the body effect while the PMOS is not. When a zero at the input I is to be transmitted, the PMOS is subject to the body effect. The PMOS and NMOS transistors should be sized such that they charge and discharge the output symmetrically. If $V_{Tn} = |V_{Tp}|$ and the body effect is symmetrical, then we can size the devices such that $\beta_n = \beta_p$. Sometimes, equal sized NMOS and PMOS devices can be used. It is easy to see that the delay of the TG gate is approximately independent of the input level. This is not the case if the pass-logic uses a single-channel transistor.
Figure 4.35 Equivalent resistances of the transmission gate during switching (regions: NMOS ON, PMOS ON; horizontal axis: time).
A drawback of the CMOS TG is that it consumes more area than a single-channel transmission gate (NMOS TG or PMOS TG). Thus, if the area is of prime concern, NMOS TGs are used. Any CMOS TG logic function (we call it here conventional pass-transistor logic) can be implemented using the TG primitive element described above. In such an implementation the transistor count, and hence the silicon area, is low compared to a standard static CMOS implementation. This is highlighted in the implementation of such functions as multiplexing, demultiplexing, decoding and addition. Fig. 4.36 shows a 4:1 multiplexer, where the data lines A, B, C and D are controlled by $S_1$ and $S_2$ such that
$$F = A \cdot \bar{S}_1 \cdot \bar{S}_2 + B \cdot \bar{S}_1 \cdot S_2 + C \cdot S_1 \cdot \bar{S}_2 + D \cdot S_1 \cdot S_2 \qquad (4.87)$$
This form of logic is used when the inputs and their logic complements are available. The implementation does not need $V_{DD}$ or ground lines. However, the implementation suffers from a number of drawbacks: the driving capability of the circuit is limited and the delay increases with long TG chains. Moreover, the circuit does not provide a restoration of the logic levels, i.e., the logic gates are passive with no gain elements. Fig. 4.37 shows an example of how to restore the voltage levels in chained TGs. When 8 TGs are put in series, the output signal changes very slowly. However, when an inverter stage is added every 4 TG stages, the level is restored as shown in the SPICE voltage waveforms of Fig. 4.37.
The CMOS TG logic can be used in CMOS circuit design, offering an extra degree of circuit design freedom. An example is the full-adder. The adder circuits will be discussed in detail in Chapter 7. Fig. 4.38 shows the schematic of the XOR gate which is used by the adder. When the input A is low, $\bar{A}$ is high. The transmission gate TG is closed, so the output is equal to B. When A is high, $\bar{A}$ is low. The inverter formed by the transistors N and P is enabled, so the output is equal to $\bar{B}$. The TG gate is open in this case. To implement an adder, let us first review its functions. The boolean functions of a full-adder are:
$$S_{out} = A \oplus B \oplus C_{in} \qquad (4.88)$$
$$C_{out} = A \cdot B + C_{in}(A + B) \qquad (4.89)$$
A and B are the inputs, $C_{in}$ is the carry input, $S_{out}$ is the sum output, and $C_{out}$ is the carry output. The truth table of an adder is shown in Table 4.10.
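The adder functions of Equations 4.88 and 4.89, together with the behavior of the TG-based XOR described above, can be checked by enumeration. This is a behavioral sketch only: it models logic values, not the transistor-level circuits, and the function names are illustrative.

```python
from itertools import product


def tg_xor(a, b):
    """Behavior of the TG XOR: pass B when A is low, pass NOT B when A is high."""
    return b if a == 0 else 1 - b


def full_adder(a, b, c_in):
    """Return (S_out, C_out) according to Eqs. 4.88 and 4.89."""
    s_out = tg_xor(tg_xor(a, b), c_in)
    c_out = (a & b) | (c_in & (a | b))
    return s_out, c_out


def truth_table():
    """All rows (A, B, C_in, S_out, C_out) of the one-bit full adder."""
    return [(a, b, c, *full_adder(a, b, c)) for a, b, c in product((0, 1), repeat=3)]


if __name__ == "__main__":
    print("A B Cin | S Cout")
    for a, b, c, s, c_out in truth_table():
        print(a, b, c, " |", s, c_out)
```

Each row satisfies A + B + C_in = S_out + 2 C_out, which is a convenient sanity check on both equations.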
The CMOS implementation of a one-bit full-adder is shown in Fig. 4.39(a). It requires 28 transistors and has two gate delays. In this circuit the transistors controlled by the carry signal $C_{in}$ should be placed close to the output. This will offset the body effect problem, since the carry is the latest arriving signal. An optimized implementation of the full-adder is shown in Fig. 4.39(b). It uses only 18 transistors and is based on the XOR function shown in Fig. 4.38 and the TG gates. Hence, this adder is more compact, faster, and consumes less power than the complementary static one.
Figure 4.38 TG XOR gate.

Table 4.10 Adder Truth Table

A   B   C_in   S_out   C_out
0   0   0      0       0
0   0   1      1       0
0   1   0      1       0
0   1   1      0       1
1   0   0      1       0
1   0   1      0       1
1   1   0      0       1
1   1   1      1       1

4.5.9 CMOS Static Latch

Fig. 4.40 shows a cross-coupled CMOS static latch. In the storage mode (input LD = 0), when the node A is high, B is low, $P_1$ and $N_2$ are ON while $P_2$ and $N_1$ are OFF. Similarly, when A is low, B is high, $P_1$ and $N_2$ are OFF while $P_2$ and $N_1$ are ON. The standby power dissipation of the cell is very small. The state of the latch is changed by turning the two transmission gates ON (LD high) and applying the input and its complement.
Figure 4.40 CMOS cross-coupled static latch.
4.6 CMOS LOGIC STYLES

CMOS logic has been known to have a negligible static power dissipation. However, this is valid only as long as $V_T$ is not too low. Moreover, it has low speed and consumes large area because, for n inputs, twice that number of transistors is required. As a result, it is sometimes desirable to have faster and smaller logic gates, possibly at the cost of parameters such as noise margins, power dissipation, etc. This section discusses many CMOS logic alternatives to complementary CMOS and also the clocking issues in a VLSI system.
4.6.1 Pseudo-NMOS CMOS Logic

The gate area of complementary CMOS can be reduced if CMOS circuits are designed in a way similar to the NMOS circuit family [12]. A PMOS device is used to replace the depletion-type device of the NMOS family. This type of circuit is referred to as pseudo-NMOS, as shown in the inverter of Fig. 4.41. When the input A is low, the output is high and at $V_{DD}$. When A goes to a high level, N turns ON while P is still ON. In this case, the output never reaches zero and takes a value $V_{OL}$ determined by the ratio $\beta_p/\beta_n$; the logic is therefore called ratioed. To examine $V_{OL}$, we use a simple analysis. When A is at $V_{DD}$, N is in the linear region while P is saturated, and $V_{OL}$ is obtained by equating the two currents using simple models. Thus $V_{OL}$ depends strongly on the ratio $\beta_p/\beta_n$. For example, if we need $V_{OL} = 0.04\,V_{DD}$ with $V_T = 0.2\,V_{DD}$, then the ratio $\beta_p/\beta_n$ should be about 0.1 at most. If the NMOS transistor is minimum size, the PMOS should be weak to provide adequate noise margins (low $V_{OL}$). In this case, the rise time of the gate is too slow. If we improve the rise time, the ratio condition tends to increase the gate area and hence the input capacitance. Although this circuit offers a reduction in total transistor count and ease of layout, it has the disadvantage of non-zero static power dissipation. Since the pull-up PMOS is always ON, a current flows from $V_{DD}$ to ground whenever the pull-down section of the pseudo-NMOS gate is turned ON. This current is the source of the static power dissipation. When a pseudo-NMOS gate with its output at $V_{OL}$ drives another one, the driven gate, with its pull-down section OFF, leaks a high subthreshold current, but this current is still lower than the one flowing when the pull-down is ON. For an n-input pseudo-NMOS gate there are (n + 1) transistors. Fig. 4.42 illustrates an example of a complex gate implemented in pseudo-NMOS style. This logic has been used in many applications such as decoding logic for memories and PLAs. Because of its high static power, it is not suitable for low-power applications.
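The dependence of $V_{OL}$ on the ratio $\beta_p/\beta_n$ can be illustrated numerically. The sketch below is not taken from the text: it assumes the usual long-channel square-law models with equal threshold magnitudes $V_T$ for both devices, so that equating the linear-region NMOS current $\beta_n[(V_{DD}-V_T)V_{OL} - V_{OL}^2/2]$ with the saturated PMOS current $(\beta_p/2)(V_{DD}-V_T)^2$ gives $V_{OL} = (V_{DD}-V_T)\,[1 - \sqrt{1 - \beta_p/\beta_n}]$. Function names are illustrative.

```python
import math


def v_ol(beta_ratio, vdd, vt):
    """Output low level for beta_ratio = beta_p / beta_n (must not exceed 1).

    Assumes square-law devices with equal threshold magnitudes vt.
    """
    return (vdd - vt) * (1.0 - math.sqrt(1.0 - beta_ratio))


def required_beta_ratio(v_ol_target, vdd, vt):
    """Largest beta_p / beta_n that still achieves the target V_OL."""
    x = v_ol_target / (vdd - vt)
    return 2.0 * x - x * x


if __name__ == "__main__":
    vdd = 3.3
    ratio = required_beta_ratio(0.04 * vdd, vdd, 0.2 * vdd)
    print(f"beta_p / beta_n <= {ratio:.4f}")
```

Under these assumptions the example of the text ($V_{OL} = 0.04\,V_{DD}$, $V_T = 0.2\,V_{DD}$) yields a ratio of about 0.0975, consistent with the quoted value of 0.1.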
4.6.2 Dynamic CMOS Logic

Figure 4.42 Pseudo-NMOS complex logic gate.

To reduce the area and improve the speed of CMOS circuits, another popular style called dynamic logic is used. Fig. 4.43 shows a dynamic CMOS gate. This logic is referred to as domino CMOS logic [13]. The domino gate shown in Fig. 4.43(a) consists of a dynamic CMOS circuit followed by a static CMOS buffer. The dynamic circuit consists of a PMOS precharge transistor $P_1$, an evaluation NMOS transistor $N_1$, a storage capacitor $C$, and an N-logic block, which is a series-parallel combination of NMOS transistors activated by the inputs and implementing the required logic. The storage capacitance represents the parasitic capacitance at node A. This circuit uses a single clock phase clk. During the precharge phase (clk = 0), the storage capacitance is charged through the PMOS pull-up $P_1$ to $V_{DD}$ and the inputs have no effect since there is no path to ground. The output of the buffer is precharged to ground. During the evaluation phase (clk = 1), $N_1$ is ON and, depending on the logic performed by the N-logic block, the node A is either discharged or stays precharged. Fig. 4.43(b) shows an example of a complex gate. In a cascaded set of domino logic stages, as shown in Fig. 4.44, the first stage evaluates and causes the next one to evaluate (like dominoes falling). The number of cascaded stages is limited by the evaluation clock phase. Compared to pseudo-NMOS, domino logic has the same input capacitance and an improved rise time. However, the fall time is affected since there is one more transistor in the pull-down section. Also, the gate is suitable for high-fanout operation because of the CMOS buffer. Moreover, it is efficient in area for high fanin because n + 4 transistors are required, compared to 2n for a CMOS static gate.
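The transistor counts quoted for the three logic styles can be compared with a short helper. This is a sketch based only on the counts stated in the text (2n for static CMOS, n + 1 for pseudo-NMOS, n + 4 for domino); reading the domino count as n logic devices plus precharge, evaluation and two buffer transistors is an interpretation, and the names are illustrative.

```python
def transistor_count(style, n_inputs):
    """Transistors needed for an n-input gate in a given logic style."""
    counts = {
        "static": 2 * n_inputs,       # complementary NMOS and PMOS per input
        "pseudo_nmos": n_inputs + 1,  # N-logic block plus one PMOS load
        "domino": n_inputs + 4,       # N-logic, precharge, evaluate, 2-transistor buffer
    }
    return counts[style]


if __name__ == "__main__":
    for n in (2, 4, 6, 8):
        print(n, [transistor_count(s, n) for s in ("static", "pseudo_nmos", "domino")])
```

Domino only uses fewer transistors than static CMOS once the fanin exceeds 4, which matches the remark that it is area efficient for high fanin.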
Some limitations of the gate are:
Figure 4.44 Domino logic chain (stages 1 to 3 driven by the common clock clk).
■ The domino gate has a problem called charge sharing or redistribution. Fig. 4.45 gives an example to explain this problem. During the precharge, the node A is at VDD and a charge C·VDD is stored on the capacitance C. We assume (worst case) that the parasitic capacitances of nodes B and C, C1 and C2 respectively, hold zero charge. During the evaluation, the node A should stay at VDD; however, due to C1 and C2, charge sharing takes place. Using the charge conservation principle before and after redistribution, we have

$$C V_{DD} = (C + C_1 + C_2)\, V_A \tag{4.92}$$

Hence the final voltage of node A is

$$V_A = \frac{C}{C + C_1 + C_2}\, V_{DD} \tag{4.93}$$

If, for example, C1 = C2 = 0.5C, then this voltage would be VDD/2. This voltage can alter the logic and cause the CMOS buffer to dissipate high static power.
■ If the clock frequency is too low, the node A leaks the charge stored on C due to the leakage currents. The dynamic node can leak its charge in a time of a few hundred µs to a few ms, depending on the temperature, the storage capacitance and the leakage current. When using power-down techniques, the dynamic nodes should not be left floating for a long time. If the leakage is high with low-VT devices, the charge can be depleted in a time as low as 100 ns. This problem is similar to charge sharing. Fig. 4.46 shows two alternatives to solve the problems of charge sharing and leakage. In Fig. 4.46(a), a weak PMOS (low W/L) is added as a pull-up transistor. This circuit operates like pseudo-NMOS during the evaluation phase. Hence it dissipates some static power. If the circuit operates at high frequency, the added weak PMOS has no role because it does not have enough time to operate. Note that this weak PMOS increases the output capacitance and thus slows the dynamic gate. To eliminate the DC path during evaluation, the gate of the weak PMOS can be driven from the output of the CMOS buffer as shown in Fig. 4.46(b). This circuit adds another capacitance at the output of the inverter. A third alternative circuit, which solves only the problem of charge sharing, is shown in Fig. 4.47. In this circuit configuration, the intermediate nodes of the complex gate are precharged with additional precharge PMOS devices.

Figure 4.45 Charge sharing in dynamic CMOS logic.
■ Another limitation of the domino logic gate is that it implements only noninverting logic functions. However, this is not a serious limitation and can be overcome, if the need arises, by using static CMOS gates. The designer can mix both static and dynamic CMOS logic circuits in a given design to optimize the overall performance.
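The charge-sharing relation (4.93) in the first limitation above is easy to check numerically. The sketch below is a direct transcription of that equation; the function name and the 3.3 V supply are assumptions chosen for illustration.

```python
def shared_node_voltage(c, c1, c2, vdd):
    """Final voltage of node A after charge redistribution, Eq. (4.93)."""
    return c * vdd / (c + c1 + c2)


# Capacitances expressed relative to the storage capacitance C (C = 1).
v_a = shared_node_voltage(c=1.0, c1=0.5, c2=0.5, vdd=3.3)
print(f"V_A = {v_a:.2f} V")
```

With C1 = C2 = 0.5C the dynamic node settles at half the supply, close to the switching point of the output buffer, which is why both the logic value and the static power of the buffer are at risk.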
Historically, dynamic design styles have been devised for low-power characteristics because of the reduced device count. Moreover, dynamic gates do not experience the short-circuit power dissipation and glitching problems of static circuits. However, to drive the clocked transistors, a large clock distribution network is needed. This highly loaded network consumes a significant amount of dynamic power, particularly at high frequency of operation. The switching activities of dynamic gates are higher than those of static gates. In a dynamic gate the output makes a 0 → 1 transition during the precharge cycle only if the N-block discharges the output during the evaluation phase. Hence, the probability of a 0 → 1 transition is given by

$$P_{0 \to 1} = P_0 \tag{4.94}$$

where P0 is the probability that the output is "0". For a two-input NAND dynamic gate, the output is zero for only one of the 4 input states. So,

$$P_{0 \to 1} = P_0 = \frac{1}{2^2} = \frac{1}{4} \tag{4.95}$$

For a NOR2 gate, we have

$$P_{0 \to 1} = P_0 = \frac{3}{2^2} = \frac{3}{4} \tag{4.96}$$
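The probabilities in (4.94) to (4.96) can be reproduced by enumerating the input states. The sketch below assumes independent, equiprobable inputs (an assumption that is not stated explicitly in the text) and describes each gate by a hypothetical function that is true when its pull-down network conducts.

```python
from itertools import product


def dynamic_p01(pull_down, n_inputs):
    """P(0->1) of a dynamic gate: fraction of input states that discharge the output."""
    states = list(product((0, 1), repeat=n_inputs))
    discharged = sum(1 for s in states if pull_down(*s))
    return discharged / len(states)


p_nand2 = dynamic_p01(lambda a, b: a and b, 2)   # series NMOS devices
p_nor2 = dynamic_p01(lambda a, b: a or b, 2)     # parallel NMOS devices
print(p_nand2, p_nor2)
```

The same function can be applied to gates with more inputs, where the enumeration makes the growth of the NOR switching activity with fan-in easy to see.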
Another refinement of the domino CMOS logic is shown in Fig. 4.48 [14], where the CMOS buffer is removed. N and P logic blocks are alternated and each drives the other. When clk is low (0), the first and third stages are precharged high and the second stage is precharged low.
Fig. 4.49 shows another NP domino logic called NORA (No Race) [15]. Two sections, clk and $\overline{clk}$, are shown in Fig. 4.49. Each is constructed by cascading N and P blocks followed by a C²MOS (clocked CMOS) latch. CMOS buffers (inverters) are used to provide logic inversion. When clk = 1 (evaluation phase in the clk section), the C²MOS latch³ operates like an inverter. When clk = 0, the latch moves into the hold state because the output NMOS and PMOS transistors are OFF. In this case, the old data is latched at the output. This latch is used to avoid signal races. A NORA pipeline is shown in Fig. 4.50 and it consists of alternating clk and $\overline{clk}$ sections. Signal races do not occur in this structure because of the use of C²MOS latches. Another logic has been proposed to overcome charge sharing by using additional clocking signals. It is called Zipper CMOS logic. For more details refer to [16].
³ See the example of the DEC Alpha chip in Section 4.8.4.
Figure 4.48 NP domino logic.
An example of a pipelined full-adder (FA) NORA circuit is shown in Fig. 4.51. This cell can be used in many designs such as a pipelined multiplier. The output C²MOS latches can use only three transistors rather than four. The PMOS and NMOS transistors P2 and N2, respectively, can be removed from the output C²MOS latches. The reason is that during the precharge phase (clk = 0), the output nodes A and B are set to ground and VDD respectively. Thus, the transistors P1 and N1 are turned OFF. Hence, the clocked transistors P2 and N2 can be removed, and the FA cell is still isolated from other sections during precharge.
4.6.3 Design Style Comparison

If we compare the design styles discussed above, static CMOS logic is the slowest circuit, but its power efficiency is the best, particularly if minimum-size devices are used. Hence, it is suitable for low-power, medium-speed applications. Note that static CMOS logic occupies the largest chip area because complementary functions are needed. The circuit designer can include pass-transistor logic in static logic to improve the speed and area. The pseudo-NMOS logic style can be faster than static CMOS logic; however, its rise time is long. This is limited by the low output logic level. Moreover, the most serious drawback of pseudo-NMOS logic is the high power dissipation in the standby mode. N-P domino logic is fast, because it has a small input capacitance like pseudo-NMOS logic and an improved rise time. The power dissipated by this logic is high due to the high switching activity of the clock, even if the circuit is not used. However, power-down techniques can be used to control the clock of the logic. Using this style requires the designer to spend more design effort than with the static style, to solve all the problems of dynamic logic such as charge sharing, clock skew, precharging, etc. Finally, we note that pass-transistor logic is very promising for high-performance low-voltage low-power applications.

Figure 4.49 (a) NORA clk section.

Figure 4.50 NORA pipelined logic.

Figure 4.51 Pipelined full-adder NORA circuit.
Figure 4.52 Clock skew.
4.6.4 Clock Skew in Dynamic Logic
Clock skew is a critical design parameter in high-speed circuits. Fig. 4.52 shows the clock skew in single complementary-phase clock signals. If $\overline{clk}$ is generated from clk, clock skew is possible. The time skew is measured between the half-VDD points of the clk and $\overline{clk}$ signals. In the presence of clock skew, a glitch can be transmitted from one section to another as illustrated in the example of Fig. 4.53(b). This structure contains one stage between the two C²MOS latches, and a glitch can be transmitted to the last C²MOS latch. The example of Fig. 4.53(c) does not have this problem. It has been shown that, to eliminate the signal race in N-P domino logic, an even number of inversions should be used between stages [17]. Moreover, the clock skew problem should be minimized to improve the speed of dynamic circuits. One possible solution for single complementary-phase clock generation, with minimal skew and low process sensitivity, is the one shown in Fig. 4.54 [18]. The delays clk_i → clk and clk_i → $\overline{clk}$ are equalized with special buffer sizing.
4.7 CLOCKING
One way to synchronize thousands of signals in a VLSI system is to employ a clocking strategy. The clock controls the flow of data in the digital system and reduces the complexity of design.
Figure 4.55 Clocked pipeline system (input register and pipeline registers driven by a common clock signal).
Most VLSI processors are constructed using a set of functional blocks (ALU, shifter, register file, etc.) connected via pipeline registers as shown in the example of Fig. 4.55. The clock signal can be split into one, two, three or four phases. Typically the phases are non-overlapping.
First we present the different storage elements (latches, registers), then we treat two clocking strategies: single-phase and two-phase, with emphasis on the former, which is usually the main option available in standard-cell and gate-array approaches. The clock distribution issues are discussed in Section 4.9.4.
4.7.1 Storage Elements
There are many types of storage elements. Some of the ones used in VLSI design are the following:
4.7.1.1 D-Latch
Sometimes called a level-sensitive latch. Its operation is shown in Fig. 4.56. The output changes with the input when the clock is high (case of a positive level-sensitive latch). The D input must be stable within a time window around the positive transition of the clock (Fig. 4.57). The input data is passed to the output within a delay t_d. The time window is defined by two times, called the setup time t_s and the hold time t_h. The setup time, t_s, is the time needed for the D input to be stable prior to the clock edge. More specifically, it is the delay between the input of the latch and the storage node. The hold time, t_h, is the time needed for the D input to be stable after the clock edge. This time relates to the delay between the clock input and the storage point. There are a variety of implementations for this D-latch. Fig. 4.58 reviews some of the static versions. The circuit of Fig. 4.58(a) has a weak inverter used as a feedback path for the latch mode. The voltage at node A is not changed by noise or leakage because the feedback inverter keeps the level. The feedback inverter should have a low W/L for its NMOS and PMOS (weak inverter) compared to the transmission gate and the forward inverter. This ensures that the transmission gate is capable of overdriving the feedback inverter when data is being written to the latch. The feedback inverter should be carefully sized to guarantee switching for all process corners and for the maximum fanout condition.
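The setup and hold requirement described above amounts to a simple check on the times at which D changes. The following sketch is only an illustration of that window; the function name, the time unit (ns) and the numerical values are assumptions.

```python
def violates_window(d_transitions, clock_edge, t_setup, t_hold):
    """True if any D transition falls strictly inside (edge - t_setup, edge + t_hold)."""
    return any(clock_edge - t_setup < t < clock_edge + t_hold
               for t in d_transitions)


# Clock edge at 10 ns, with an assumed 0.3 ns setup time and 0.1 ns hold time.
print(violates_window([9.9], 10.0, t_setup=0.3, t_hold=0.1))        # D changes too late
print(violates_window([9.5, 10.2], 10.0, t_setup=0.3, t_hold=0.1))  # D is stable in the window
```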
The problem of ratioed design in Fig. 4.58(a) can be avoided by using the modified version in Fig. 4.58(b), where a transmission gate is added in the feedback path. When clk = 1, the data is passed to the storage node and the feedback path is disconnected. When clk = 0, the feedback loop is closed, and the latch is in store (latch) mode. Fig. 4.58(c) shows another version of Fig. 4.58(b), where the outputs are buffered. This latter latch is found in the cell libraries of standard-cell and gate-array approaches. All the static latches described here store their state even if the clock is stopped. Note that these latches do not dissipate any DC power.
To reduce the size of the static latches, dynamic versions can be used as illustrated in Fig. 4.59, Fig. 4.60 and Fig. 4.61. Fig. 4.59 shows a simple dynamic latch, where the storage node A temporarily stores the data. Note that latches have a property called "transparency": the output follows the input when the clock is asserted. Otherwise they are "opaque". Fig. 4.60 shows two other latches [19]. The circuit of Fig. 4.60(a) is transparent when the clock, clk, is high and latches the data (opaque) when the clock is low. This latch is positive level-sensitive. The negative level-sensitive version is shown in Fig. 4.60(b). Note that these latches use one clock line (clk). The circuits of Fig. 4.60 have reduced noise immunity. For example, for the circuit of Fig. 4.60(a), when the latch is opaque (clk = 0), the node A may be tristated high with Q tristated low. The node A is isolated and may be susceptible to noise which reduces its voltage. The reduced voltage of node A can cause the PMOS P2 to leak current, thereby destroying the output Q. This problem was addressed with latches designed for the DEC Alpha microprocessor [21]. For example, the circuit of Fig. 4.61 is an improved version of the latch of Yuan and Svensson [19]. A weak PMOS device P3 is added to solve the problem of noise in the positive level-sensitive latch. The operation of this latch is as follows. When clk is high, P1, N1 and N3 function like an inverter. P2, N2 and N4 also function like an inverter. Therefore the latch passes the input D to the output Q. If D falls low, then A is high and Q is low. When clk is low, N3 and N4 are OFF. If D goes high, P1 is OFF, while the nodes A and Q are tristated high and low respectively. The added P3, in this case, is ON and holds P2 OFF. This device supplies current to node A and counters any noise.

Figure 4.59 Simple dynamic CMOS single-clock latch.
Figure 4.61 Non-inverting dynamic latch with improved noise immunity.
For reliability reasons, many latches have been designed for the DEC Alpha chip [21]. Some are illustrated in Fig. 4.62. These latches have been designed for all process corners and circuit conditions (supply voltage, temperature, rise/fall times, etc.). The results showed no appreciable evidence of race-through for clk rise/fall times at or below 0.8 ns. With 1-ns rise/fall times, the latches showed some signs of failure. A rise/fall time of 0.5 ns was set for the clock in this chip.
4.7.1.2 Edge-Triggered D-Flip-Flop (ETDFF)

Sometimes this flip-flop is called an edge-triggered register. Fig. 4.63 shows a static (buffered) version of the positive edge-triggered D flip-flop, together with the voltage waveforms. It is constructed by using two latches. The first one, called the master, is transparent when the clock is low. The second one, called the slave, is transparent when the clock is high. When the clock is low, the storage node A follows the input, while the node B stores the old data and is disconnected. Then, when the clock makes a transition from 0 to 1, the node A stores the input value during the transition, then ceases to sample any input data. When clk = 1, the master is in the hold mode and the node A passes the data to the storage node B of the slave latch, from where it is passed to the outputs Q and $\overline{Q}$. In this case, the output is disconnected from the input D. Hence, the flip-flop does not have the transparency property of the latch. When the clock returns to 0, the slave is in hold mode. By reversing the two latches, a negative edge-triggered flip-flop can be constructed. This circuit can be found in standard-cell and gate-array libraries and represents an important cell in synchronized design. At high operating frequencies, it is desirable to balance the delays of clk and $\overline{clk}$ locally, to reduce the clock skew problem. The clock skew, in the single-phase strategy, can lead to invalid data storage.
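The master-slave operation described above can be sketched with two behavioural latches: node A follows D while the clock is low, and node B follows A while the clock is high. The class names are hypothetical and the model ignores all delays, so it illustrates only the loss of transparency, not the timing.

```python
class Latch:
    """Level-sensitive latch, transparent when clk equals transparent_level."""

    def __init__(self, transparent_level):
        self.transparent_level = transparent_level
        self.q = 0

    def update(self, clk, d):
        if clk == self.transparent_level:
            self.q = d              # transparent: output follows input
        return self.q               # otherwise hold the stored value


class PosEdgeDFF:
    """Positive edge-triggered flip-flop built from two latches."""

    def __init__(self):
        self.master = Latch(transparent_level=0)   # node A follows D when clk = 0
        self.slave = Latch(transparent_level=1)    # node B follows A when clk = 1

    def step(self, clk, d):
        node_a = self.master.update(clk, d)
        return self.slave.update(clk, node_a)


ff = PosEdgeDFF()
trace = [ff.step(clk, d) for clk, d in [(0, 1), (1, 1), (1, 0), (0, 0), (1, 0)]]
print(trace)
```

In the trace, the change of D while the clock is high does not reach Q until the next rising edge, which is the non-transparent behaviour described in the text.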
A dynamic version of the positive ETDFF is shown in Fig. 4.64 [19]. The operation of this circuit is illustrated by the voltage waveforms. The value of the hold time of this flip-flop is close to zero [20]. This dynamic flip-flop, compared to the static one, needs only 9 transistors and one clock line. The negative ETDFF is shown in Fig. 4.65.
4.7.1.3 Miscellaneous
Many other latches and flip-flops are available, for example in gate-array libraries, such as the JK flip-flop and the toggle (T) flip-flop. Fig. 4.66 shows the T flip-flop with reset control. When clk = 1, the output Q is complemented, whereas when clk = 0, Q keeps its old state. This T flip-flop provides a divide-by-2 operation. A JK flip-flop is shown in Fig. 4.67. When the J and K inputs are low, the outputs are maintained on the positive edge of the clock. If J = 0 and K = 1, the output Q is set to 0, whereas when J = 1 and K = 0, the output Q is set to 1. When both J and K are high, the outputs are complemented.
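The JK behaviour described above can be summarized as a next-state function evaluated at each positive clock edge. The function name is hypothetical; the toggle case (J = K = 1) is also used here to mimic the divide-by-2 operation of the T flip-flop.

```python
def jk_next(q, j, k):
    """Next value of Q at the positive clock edge of a JK flip-flop."""
    if j and k:
        return 1 - q    # both high: output is complemented
    if j:
        return 1        # J = 1, K = 0: set
    if k:
        return 0        # J = 0, K = 1: reset
    return q            # both low: output is maintained


# Toggling on every active edge halves the clock frequency.
q, waveform = 0, []
for _ in range(4):
    q = jk_next(q, 1, 1)
    waveform.append(q)
print(waveform)
```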
4.7.2 Single-Phase Clocking
A generic single-phase finite-state machine (FSM) is shown in Fig. 4.68. The storage element can be either a latch or a register (flip-flop). The latch case demands a more constrained design because of the transparency property of the latch. When the latch is transparent, the state signals can pass through the logic block more than once during one clock cycle. To avoid a race condition in this FSM, the clock width (of transparency) has to satisfy a two-sided constraint [22]. Hence, single-phase clocking with latches, in the case of an FSM, is insidiously complex. To reduce the complexity of the timing constraints, single-phase ETDFFs can be used. The flip-flop is never transparent. At the clock edge, the state is stored and it cannot pass through the logic more than once during one clock cycle. Designing and synchronizing VLSI circuits with ETDFFs is rather simple and straightforward, particularly when using static flip-flops. For high-speed CMOS applications it is necessary that the storage elements be carefully designed with minimum delay, setup time and clock skew. In this case, tristate dynamic latches can be used efficiently. Fig. 4.69 shows an example of using dynamic latches [21]. Notice that L1 and L2 are transparent latches separated by random logic and are not simultaneously active. When clk is high, L1 is transparent, whereas when clk is low, L2 is transparent. The minimum number of logic gates between latches can be zero and the maximum is constrained by the cycle time.

Figure 4.67 JK flip-flop.
Fig. 4.70 shows another example of a single-phase system using ETDFFs. This system is edge based and the minimum cycle time is given by [22]

$$t_{cycle,min} = t_{ff,max} + t_{logic,max} + t_{setup,max} + t_{skew,max} \tag{4.97}$$

where t_ff,max, t_logic,max, t_setup,max and t_skew,max are the worst-case delays of the flip-flop, the combinational logic block, the setup time and the clock skew. When designing with gate-array and/or standard-cell approaches, the single-phase clocking scheme using static ETDFFs is the only option available to the designer.
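The cycle-time constraint (4.97) is a plain sum of worst-case terms, so the maximum clock frequency follows directly. In the sketch below the function name and all numerical values are assumptions chosen only to illustrate the arithmetic.

```python
def min_cycle_time(t_ff_max, t_logic_max, t_setup_max, t_skew_max):
    """Minimum cycle time of an edge-triggered single-phase system, Eq. (4.97)."""
    return t_ff_max + t_logic_max + t_setup_max + t_skew_max


# Assumed worst-case values in ns: flip-flop delay, logic delay, setup time, clock skew.
t_min = min_cycle_time(t_ff_max=0.5, t_logic_max=3.0, t_setup_max=0.3, t_skew_max=0.2)
f_max_mhz = 1e3 / t_min          # a period in ns corresponds to 1000 / period in MHz
print(f"t_cycle,min = {t_min:.2f} ns, f_max = {f_max_mhz:.0f} MHz")
```

With these assumed numbers the clock skew alone costs about 5% of the cycle, which shows why the skew term is treated as a first-class design parameter.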
4.7.3 Two-Phase Clocking
The two-phase non-overlapping clocking strategy removes many constraints existing in the single-phase discipline. However, the use of two-phase (or multiple-phase) non-overlapping clock structures becomes more difficult as clock frequencies and chip size increase. This is because of the increase in clock skew and clock interconnect wiring. For high-speed applications, the single-phase strategy is preferred and tends to be widely used in many VLSI system designs. Fig. 4.71 shows an example of a two-phase non-overlapping clocking scheme. The first latch L1 is transparent when the clock clk1 is high, whereas L2 is transparent when clk2 is high. The example of Fig. 4.71 is not the only way to build a two-phase system. Latches can be replaced by two-phase master-slave flip-flops where the master latch is clocked by clk1 and the slave latch by clk2. This latter structure does not have the transparency property.
4.8 PASS-TRANSISTOR LOGIC FAMILIES

Several pass-transistor logic families, for logic circuit design, have been proposed for improving the speed of CMOS circuits. Such families are: the conventional CMOS pass-transistor logic, the Complementary Pass-transistor Logic (CPL) [23], the Dual Pass-transistor Logic (DPL) [24], and the Swing Restored Pass-transistor Logic (SRPL) [25]. In this section, CPL, DPL, and SRPL logics are presented and compared.
4.8.1 CPL
The main concept behind CPL is shown in the block diagram of Fig. 4.72. It consists of an NMOS pass-transistor logic network driven by two sets of complementary inputs, and two CMOS inverters used as buffers.
Figure 4.72 Basic CPL logic circuit.

Fig. 4.73 illustrates an example of an AND/NAND gate built in CPL logic. At the node Q, for example, we have

$$Q = A \cdot B + B \cdot \overline{B} = A \cdot B \tag{4.98}$$

At the output of the corresponding inverter we have the NAND function. The NMOS pass-transistor logic network forms the pull-up and pull-down functions. When the inputs (A B) have the combination (11), the node Q is at a voltage given by

$$V_Q = V_{DD} - V_{Tn}(V_Q) \tag{4.99}$$

where VTn is the threshold voltage subject to the body effect. So the inverting buffers translate the swing of the output, from ground to VDD − VTn, to a full-rail logic swing (ground to VDD). The logic threshold voltage of the inverting buffers should be shifted to a voltage lower than VDD/2. Hence the β ratio of the inverter in this case should be higher than unity. This inverting buffer also permits driving a large load capacitance efficiently. When the outputs of the logic network are at VDD − VTn, all the output inverters are driven by a reduced swing, as shown in Fig. 4.74. Hence, the DC power of the inverter increases because the pull-up PMOS device is not completely OFF. The VGS of the pull-up PMOS is equal to −VTn. Moreover, the drive capability of the pull-down NMOS transistor is reduced, particularly if the power supply voltage is reduced. The noise margins are also affected. To solve the problem of DC power dissipation we can design NMOS transistors with a lower VT than that of the PMOS transistor. Also, the body effect should be controlled. Another way to solve all the problems associated with the reduced high level is to add a PMOS latch to the CPL, as shown in the case of the AND/NAND circuit of Fig. 4.75. In this case, the two added PMOS transistors can be sized to be minimum, as long as the high level reaches VDD in the given cycle time. We call this style PMOS latch CPL. Careful design is needed when the NMOS network has minimum-size devices. Otherwise the high level stored in the latch cannot be discharged.

Fig. 4.76 shows examples of CPL arrays for OR/NOR and XOR/XNOR functions. With only 4 transistors we can produce many two-input functions with their complements. More examples are shown in Fig. 4.77 for 3-input AND/NAND and OR/NOR gates. In these examples 8 NMOS transistors are needed to generate the 3-input functions. Any complex logic function can be constructed easily using this principle of NMOS transistor networks. For example, the full-adder circuit can be constructed using wired CPL as shown in Fig. 4.78. The circuit is constructed using the basic CPL primitives discussed before.
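Equation (4.99) is implicit in VQ, because the threshold voltage itself depends on the source voltage of the pass transistor. A minimal sketch of how it might be solved numerically is given below. The body-effect model and the values of vt0, gamma and phi2 (zero-bias threshold, body-effect coefficient and surface potential term) are illustrative assumptions, not parameters taken from this chapter.

```python
import math


def vt_body(v_sb, vt0=0.7, gamma=0.5, phi2=0.7):
    """Threshold voltage with body effect (standard textbook model, assumed here)."""
    return vt0 + gamma * (math.sqrt(phi2 + v_sb) - math.sqrt(phi2))


def cpl_high_level(vdd, vt0=0.7, gamma=0.5, phi2=0.7):
    """Solve V_Q = VDD - V_Tn(V_Q) by bisection; the bulk is assumed grounded."""
    lo, hi = 0.0, vdd
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        if mid + vt_body(mid, vt0, gamma, phi2) < vdd:
            lo = mid            # node can still rise: solution lies above mid
        else:
            hi = mid            # pass transistor already cut off: solution lies below
    return 0.5 * (lo + hi)


v_q = cpl_high_level(3.3)
print(f"V_Q = {v_q:.2f} V, zero-bias estimate = {3.3 - 0.7:.2f} V")
```

Under these assumptions the body effect lowers the high level noticeably below VDD − VT0, which is the reduced swing that the output inverters or the PMOS latch have to restore.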
Figure 4.76 CPL logic arrays: (a) OR/NOR; (b) XOR/XNOR.

Figure 4.77 CPL 3-input logic arrays: (a) AND/NAND; (b) OR/NOR.
Also, the sizes of the transistors are shown in this figure for fast operation. The transistors of the NMOS network far from the output have a larger size than those closer to the output. This is because the NMOS devices closer to the output pass a reduced swing. The sizing of the transistors depends on the circuit type, the layout and the device parameters. Compared to a full-adder implemented in standard static CMOS style, the adder of Fig. 4.78 is much faster and dissipates less power due to the low internal swing. Also, the schematic of this CPL adder is structured, resulting in a simplified layout.
One drawback associated with CPL logic is its limited driving capability: the delay increases with long pass-transistor chains. So buffering is needed to restore the transmitted level and improve the driving capability.
4.8.2 DPL
The DPL is a modified version of CPL suitable for low-voltage applications. It avoids the problems of CPL associated with the reduced high level. An example of an AND/NAND gate is illustrated in the schematic of Fig. 4.79. It consists of NMOS and PMOS pass transistors, in contrast to the CPL gate, where only NMOS devices are used. In the example of the AND/NAND gate, the NMOS transistors are used to pass the ground while the PMOS transistors are used to pass the high level (VDD). The output of the DPL has a full rail-to-rail swing owing to the addition of the PMOS devices. However, this addition results in increased input capacitance compared to CPL. This will not limit the performance of DPL, as will be explained.

Figure 4.79 DPL AND/NAND gate.
Fig. 4.80 shows a comparison between the switching characteristics of CPL, conventional pass-transistor CMOS and DPL XOR gates. In the truth tables, the column labeled "Pass" shows which signals are passed and perform the XOR function. Some features of DPL are:
■ The DPL gate has a balanced input capacitance. This reduces the dependence of the delay on the input data, contrary to CPL and conventional CMOS pass-transistor logic, where the input capacitances for the signals A and B are not the same.
■ In DPL, for any input combination, there are always two current paths driving the output. This compensates for any reduction in speed due to the additional PMOS. For example, when the inputs A and B are low, A is passed by a PMOS while B is passed by an NMOS.
A DPL full-adder implementation is shown in Fig. 4.81. When all the inputs A, B and C are low, for example, there are two current paths to the output buffer. This implementation uses DPL primitives such as AND/NAND, OR/NOR, XOR/XNOR and MUX to generate the carry and sum signals.
Figure 4.80 Comparison of CPL, conventional CMOS pass-transistor and DPL logic for XOR gates (circuit and truth table with "Pass" column for each style).
4.8.3 Modified CPL

Another technique which uses a CPL-like style suitable for low-power/low-voltage operation is the Swing Restored Pass-transistor Logic (SRPL) [25]. Figure 4.82 shows the basis of the SRPL logic gate. One part is the NMOS network with the CPL style discussed previously, and the second part is a CMOS latch. The cross-coupled CMOS inverters (latch) restore the logic levels. So, any logic function in SRPL can be implemented using a CPL network and a CMOS latch at the output. The sizing of such a logic is critical for speed and power dissipation. Fig. 4.83 shows an example of an AND/NAND gate using SRPL. Increasing the size of the NMOS transistors in the network improves the speed, as shown in the simulation curves of Fig. 4.84. It has been found that the size of the latch should be minimum, for fast operation, using the 0.8 µm device parameters of Chapter 3. If the size of the NMOS transistors in the network is small, the output of the SRPL gate fails to switch to ground because the equivalent impedance of the network is higher than the one seen by the output to VDD. This problem becomes worse when many gates are cascaded. Fig. 4.85 illustrates this problem in 2 cascaded AND/NAND gates. When the input goes from VDD to ground, the nodes A and B, initially at VDD, cannot be completely discharged.

Figure 4.81 DPL full-adder.
4.8.4 Pass-Transistor Logics Comparison
The speed and power dissipation of the different pass-logic styles presented so far depend on the circuit type and the application of the circuit (cascaded gates, driving a fixed load, etc.). For the case of a full-adder used in a multiplier array, a comparison is given in Chapter 7. In general, SRPL has the lowest power dissipation, but careful design is needed when small device sizes are used. The DPL consumes more power than SRPL and PMOS latch CPL because of the higher transistor count. Both CPL and SRPL circuits have the smallest area and the fastest speed. In summary, CPL-like styles are promising for low-power and high-speed applications.
4.9 I/O CIRCUITS
I/O circuits connect the on-chip logic circuitry to the external world. They play an important role in limiting the speed and power dissipation of the whole chip. In this section many I/O circuits are discussed, such as input and output buffers, clock distribution, clock buffering and low-swing I/O. The power dissipation issues related to these circuits are also studied. Layout techniques for I/O circuits are not covered in this chapter.
4.9.1 Input Circuits
To distribute an input signal to the internal circuitry of a chip, an input buffer is needed. It has its gate connected to the input pad. Excessive electrostatic charge on the input pad can break down the oxide and destroy the transistors of the input buffer. For an oxide thickness of 100 Å, the breakdown voltage is about 7 V. The voltage built up on the gate by the electrostatic charge can be as high as 300 V [26]. Fig. 4.86 shows an example of electrostatic discharge protection. If the voltage at the node N goes above VDD or below ground, then the clamping diodes D1 and D2 limit the voltage excursion of the node N to within −VBE and VDD + VBE. The role of the resistance R is to limit the peak current that flows in the diodes. Typical values of R are a few hundred ohms, and the resistance is realized using the diffusion layers. The input protection circuit has a parasitic RC time constant which can limit high-speed operation. It ranges from a few tens of ps to a few hundred ps. The input buffer, connected to this input pad, consists in general of a number of inverter stages to drive the internal circuitry. The input buffer for clock distribution needs special care and design and is discussed in Section 4.9.4.
4.9.1.1 Static Power Dissipation

When the input signal has TTL (Transistor-Transistor Logic) levels, the conventional CMOS buffer is used to translate these levels to CMOS levels. The TTL interface has historically specified input voltage levels of 0.8 V for the low-level input maximum, and 2.0 V for the high-level input minimum. The recently passed 3.3 V Low-Voltage TTL (LVTTL) standard is shown in Table 4.11. The individual input inverters are designed by setting their W/L ratio such that the switching point of the buffer is near 1.4 V (the middle of VIL and VIH). To have this switching point of 1.4 V at a 5 V power supply voltage, the ratio Wn/Wp of the input inverter of the buffer should be 2.9 using 0.8 µm CMOS technology. At 3.3 V, this ratio should only be equal to 0.7. However, since the TTL voltage swing is limited to 1.2 V, the input buffer is always dissipating DC power, as shown in Fig. 4.87, particularly if the VT of the devices is low. If the first inverter does not fully translate the input TTL levels, then the second stage dissipates some DC power. The static power dissipated by a TTL input buffer is

$$P_{TTL} = V_{DD}\, I_{DTTL} \tag{4.100}$$

where

$$I_{DTTL} = I_{DTTLL} + I_{DTTLH} \tag{4.101}$$

I_DTTL is the average dissipated current for the cases when the input is at low and high levels. At VDD = 3.3 V, the input buffer dissipates more static power when the input is high than when it is low. Fig. 4.88 shows the characteristics of the static power dissipation of the input buffer. Note that when VDD is scaled down the DC current is reduced because the VGS of the pull-up PMOS of the input buffer is reduced. If the number of TTL input pads is large, then the DC power of the input buffers could be an important and limiting factor. A static power-saving input buffer for reducing I_DTTL at a 5 V power supply voltage has been proposed in [27].

Table 4.11 LVTTL levels (rows: minimum high output, minimum high input, maximum low output, maximum low input).

Figure 4.87 TTL input buffer.
Figure 4.88 Simulated static power dissipation of input buffer.
4.9.1.2 Dynamic Power Dissipation

The dynamic power dissipation of the input pad is mainly internal power. The total dynamic power of all the input pads (of the same type, for example) is

$P_I = A N_I E_{ii} f$ (4.102)

where $A$ is the switching activity, $N_I$ the number of input pads and $E_{ii}$ the internal energy of the input pad in Watt/Hz.

When the input signal has ECL levels, an ECL input buffer with an ECL-CMOS converter is used. In general these are implemented in BiCMOS technology and consume DC power. An ECL-CMOS converter can be designed in full CMOS ps].
4.9.2 Schmitt Trigger
When the input signal to a chip is changing slowly, a hysteresis circuit is needed at the input pad to generate a clean edge. A circuit called a Schmitt trigger can be used for this function. Schmitt triggers are often found at the on-chip inputs. Fig. 4.89 illustrates the transfer characteristic of an ideal Schmitt inverter with hysteresis voltage $V_H = V_{T+} - V_{T-}$. For a 3.3 V power supply, with 3.6 V for the fast process and 3.0 V for the slow process, typical values are $V_{T+}$ = 1.7 V and $V_{T-}$ = 1.0 V. The Schmitt circuit switches at different thresholds. When the input is rising, it switches when $V_{in} = V_{T+}$, and when the input is falling, it switches when $V_{in} = V_{T-}$. Fig. 4.90 shows an example of how the Schmitt trigger turns a signal with a very slow transition into a signal with a sharp transition.
[Figure 4.90: input and output waveforms versus time, with the $V_{T+}$ and $V_{T-}$ levels marked.]

A CMOS version of the Schmitt trigger is shown in Fig. 4.91. When the input is rising, initially the NMOS transistors are OFF. The $V_{GS}$ of the transistor $N_2$ is given by

$V_{GS2} = V_{in} - V_{FN}$ (4.103)

Figure 4.91 The CMOS Schmitt trigger schematic.

When $V_{in} = V_{T+}$, $N_2$ enters conduction mode, which means $V_{GS2} = V_{Tn}$ (neglecting the body effect of $N_2$). Then

$V_{FN} = V_{T+} - V_{Tn}$ (4.104)

The voltage $V_{FN}$ is controlled by $N_1$ and $N_3$. These transistors operate in saturation because

$V_{GS1} = V_{T+}$ (4.105)
$V_{DS1} = V_{FN} = V_{T+} - V_{Tn}$ (4.106)

and

$V_{GS3} = V_{DD} - V_{FN}$ (4.107)
$V_{DS3} = V_{DD} - V_{FN}$ (4.108)

The drain currents flowing in $N_1$ and $N_3$ are equal. Then using a simple MOS model we have

$\dfrac{\beta_1}{2}(V_{T+} - V_{Tn})^2 = \dfrac{\beta_3}{2}(V_{DD} - V_{T+})^2$ (4.109)

Solving for $V_{T+}$, we have

$V_{T+} = \dfrac{V_{DD} + \sqrt{\beta_{Rn}}\,V_{Tn}}{1 + \sqrt{\beta_{Rn}}}$ (4.110)

where

$\beta_{Rn} = \dfrac{\beta_1}{\beta_3}$ (4.111)

This equation shows that the trigger point depends only on the ratio of the device $\beta$s and is independent of the process parameters except for $V_{Tn}$. By symmetry, the trigger point for the falling transition can also be deduced from the pull-up section. We have

$V_{T-} = \dfrac{\sqrt{\beta_{Rp}}\,(V_{DD} - |V_{Tp}|)}{1 + \sqrt{\beta_{Rp}}}$ (4.112)

where $\beta_{Rp}$ is the corresponding $\beta$ ratio of the pull-up PMOS transistors (4.113).

If $\beta_{Rn} = \beta_{Rp} = 1$ and $V_{Tn} = -V_{Tp} = V_T$, then

$V_{T+} = \dfrac{V_{DD} + V_T}{2}$ (4.114)
$V_{T-} = \dfrac{V_{DD} - V_T}{2}$ (4.115)
$V_H = V_{T+} - V_{T-} = V_T$ (4.116)

In this case the hysteresis voltage can be made equal to $V_T$. The short-circuit power dissipation of the Schmitt trigger can be very important since the rise/fall time of the input signal is very long.
Fig. 4.92 shows a SPICE simulation of the circuit of Fig. 4.91 in 0.8 μm technology. In this example, the load capacitance is 0.1 pF and the total power dissipation is 0.85 mW. The dynamic power dissipation, due to the load and parasitic capacitances, is 0.40 mW. Therefore, the power due to the short-circuit current is 0.45 mW, which represents 53% of the total power dissipation.
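As a numerical check of the trigger-point derivation, the following minimal sketch solves the saturation-current balance of Eq. (4.109) for $V_{T+}$ and applies the mirror-image argument to the pull-up section for $V_{T-}$. The function name, the parameter names and the 0.7 V threshold are illustrative assumptions rather than values from the text, and body effect is neglected.

```python
import math


def schmitt_thresholds(vdd, vtn, vtp_abs, beta_ratio_n=1.0, beta_ratio_p=1.0):
    """Return (V_T+, V_T-, V_H) of the CMOS Schmitt trigger.

    beta_ratio_n is beta(N1)/beta(N3); beta_ratio_p is the matching ratio
    for the pull-up PMOS devices (hypothetical parameter names).
    """
    r_n = math.sqrt(beta_ratio_n)
    r_p = math.sqrt(beta_ratio_p)
    # From beta1*(VT+ - VTn)^2 = beta3*(VDD - VT+)^2, Eq. (4.109)
    vt_plus = (vdd + r_n * vtn) / (1.0 + r_n)
    # Same balance applied by symmetry to the pull-up section
    vt_minus = r_p * (vdd - vtp_abs) / (1.0 + r_p)
    return vt_plus, vt_minus, vt_plus - vt_minus


# Assumed example: 3.3 V supply, |V_T| = 0.7 V, unit beta ratios
vt_plus, vt_minus, v_h = schmitt_thresholds(vdd=3.3, vtn=0.7, vtp_abs=0.7)
print(f"VT+ = {vt_plus:.2f} V, VT- = {vt_minus:.2f} V, VH = {v_h:.2f} V")
```

With unit beta ratios the hysteresis equals the threshold voltage, in agreement with Eq. (4.116). Other ratios move both trigger points.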
4.9.3 CMOS Buffer Sizing
When a gate is intended to drive a large load capacitance (larger than the input capacitance of the gate), the driving capability is limited and the delay is large. If we increase the size of the gate (driver configuration), we improve the rise/fall times, but the delay can still be improved by putting several stages of buffering between the first gate and the load. The objective in a buffer configuration is to get the input signal to the load as quickly as possible. Each stage in the buffer chain should have its transistor widths larger than the previous
Question: What are the values of the size ratio $a$ and the number of stages $n$ that optimize the delay?

By differentiating the equation with respect to $a$ and then setting it equal to zero, we have

$a_{opt} = e \approx 2.7$ (4.124)

The optimum number of stages is

$n_{opt} = \ln(C_L/C_{in})$ (4.125)
In this analysis, we have neglected the parasitic output capacitance of each stage. Other studies [30, 31, 32, 33] illustrate that the size ratio $a$ depends on the ratio of the parasitic output capacitance and the load capacitance. In [34] a new approach for CMOS tapered buffers, with large $C_L/C_{in}$ ratio, was proposed. It uses a variable size ratio between the stages.
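The first-order sizing result can be turned into a small calculator. This is a minimal sketch under the same assumption as above (parasitic output capacitance neglected). The rounding to an integer stage count and the adjusted taper that lands exactly on the load are practical additions of this example, not steps from the text, and the 50 pF and 50 fF values are assumed.

```python
import math


def optimum_buffer(c_load, c_in):
    """First-order tapered-buffer sizing: a = e, n = ln(C_L / C_in)."""
    a_opt = math.e
    n_real = math.log(c_load / c_in)
    # A real chain needs an integer number of stages (assumption of this sketch)
    n_int = max(1, round(n_real))
    # Taper that makes a**n equal the capacitance ratio for that stage count
    a_int = (c_load / c_in) ** (1.0 / n_int)
    return a_opt, n_real, n_int, a_int


# Assumed example: 50 fF gate input driving a 50 pF pad
a_opt, n_real, n_int, a_int = optimum_buffer(50e-12, 50e-15)
print(f"ideal n = {n_real:.2f}, use n = {n_int} stages with a = {a_int:.2f}")
```

Because the delay minimum is shallow around $a = e$, using the nearest integer stage count changes the taper only slightly.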
The power dissipation of a CMOS buffer is mainly dominated by dynamic power dissipation for large $V_T$. The short-circuit power dissipation can be neglected in a first-order analysis [34]. If we include the parasitic output capacitance, stage $i$ has a total output capacitance

$C_i = a^i C_{in} + a^{i-1} C_p$ (4.126)

where we assume that the parasitic capacitance of stage $i$ is proportional to the size ratio $a$. The dynamic power dissipation at the output of gate $i$ is

$P_i = C_i V_{DD}^2 f = V_{DD}^2 f\,(a^i C_{in} + a^{i-1} C_p)$ (4.127)

or

$P_i = V_{DD}^2 f\,a^{i-1}(a C_{in} + C_p)$ (4.128)

The total power is

$P_T = \sum_{i=1}^{n} P_i$ (4.129)

Hence

$P_T = V_{DD}^2 f\,(a C_{in} + C_p)\,\dfrac{a^n - 1}{a - 1}$ (4.130)
The power efficiency of the buffer can then be defined as

$\eta = \dfrac{P_L}{P_T}$ (4.131)

where $P_L$ is the power dissipated due to the load $C_L$, which is simply $C_L V_{DD}^2 f$, and $P_T$ is the total power dissipated, given by Equation (4.130). This power efficiency, for a given $C_L$, $C_{in}$ and $C_p$, is a function of only the factor $a$. The term $1 - \eta$ characterizes the additional power dissipation overhead needed by the buffer chain to drive the load $C_L$. For high values of $a$, the power efficiency of the buffer increases. In practice $a$ can be in the range of 2 to 10. The value of $a$ can be set depending on speed, delay and power dissipation constraints.
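To see how the efficiency moves with the size ratio, the sketch below evaluates the ratio of load power to the total power of Eq. (4.130) for a fixed $C_L/C_{in}$. It assumes the efficiency is the plain ratio $P_L/P_T$ and that the last stage drives $C_L = a^n C_{in}$. The helper name and the choice $C_p = C_{in}$ are illustrative.

```python
def buffer_power_efficiency(a, n, c_in, c_p):
    """Efficiency P_L / P_T of an n-stage tapered buffer (a > 1).

    V_DD**2 * f cancels in the ratio, so only the capacitances matter.
    """
    c_load = (a ** n) * c_in                       # load seen by the last stage
    c_switched = (a * c_in + c_p) * (a ** n - 1) / (a - 1)   # from Eq. (4.130)
    return c_load / c_switched


# Assumed example: C_L / C_in = 1000, parasitic C_p equal to C_in
ratio, c_in, c_p = 1000.0, 1.0, 1.0
for n in (3, 4, 6, 7):
    a = ratio ** (1.0 / n)
    eff = buffer_power_efficiency(a, n, c_in, c_p)
    print(f"n = {n}, a = {a:.2f}, efficiency = {eff:.2f}")
```

Fewer, more steeply tapered stages waste less power in intermediate nodes, which is the trade-off against the delay-optimal taper near $e$.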
4.9.4 Clock Drivers and Clock Distribution
Usually when the clock is to be distributed on-chip, input buffers are needed. The clock circuit has to drive a very high internal load with extremely fast fall/rise times. For example, in the case of the DEC Alpha chip [21] the clock load is 3.2 nF. If this load has to be driven by a large driver, in rise/fall times of 0.5 ns when the clock frequency is 200 MHz [$T_{clock}$ = 5 ns], then the average transient current would be

$I = C\,\dfrac{\Delta V}{\Delta t} = 3.2 \times 10^{-9} \times \dfrac{3.3}{0.5 \times 10^{-9}} = 21$ A (4.132)

at $V_{DD}$ = 3.3 V power supply. The corresponding dynamic power dissipation due to this clock loading is

$P = C V_{DD}^2 f = 3.2 \times 10^{-9} \times 3.3^2 \times 200 \times 10^{6} \approx 7$ W (4.133)
This example shows how important clocking is as a design issue. A clocking strategy should be used to distribute the clock to the different functional blocks of the chip with minimum clock skew and low power dissipation.
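The clock-load arithmetic above is easy to reproduce and to rerun for other loads. The following is a minimal sketch using the 3.2 nF, 3.3 V, 0.5 ns and 200 MHz values quoted in the example; the function name is illustrative.

```python
def clock_driver_numbers(c_load, vdd, t_edge, f_clk):
    """Average edge current and dynamic power of a clock load."""
    i_avg = c_load * vdd / t_edge        # I = C * dV/dt, Eq. (4.132)
    p_dyn = c_load * vdd ** 2 * f_clk    # P = C * VDD^2 * f, Eq. (4.133)
    return i_avg, p_dyn


i_avg, p_dyn = clock_driver_numbers(3.2e-9, 3.3, 0.5e-9, 200e6)
print(f"I = {i_avg:.1f} A, P = {p_dyn:.2f} W")   # about 21 A and 7 W
```

Since the power scales with $V_{DD}^2$ and linearly with the load, the same function shows quickly how much a lower swing or a lighter clock net would save.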
The clock skew problem is due mainly to two issues:

• The difference in RC interconnect time constants: For example, in Fig. 4.94 node A and node B have two different branch lengths to node C. In this case, the delays of the signals at node A and node B with respect to node C are different. Therefore, the clock skew is equal to the time difference between these two signals.

• Unbalanced loads at different nodes: As shown in the example of Fig. 4.95, if the loads at the nodes A and B, $C_A$ and $C_B$ respectively, are different, then skew exists between the signals at these nodes.
Figure 4.95 Clock skew due to the unbalanced load at block A and block B.
Several strategies have been proposed to minimize clock skew. The first approach is to use cascaded inverters (a buffer) to drive a large load and feed all blocks, as shown in Fig. 4.96. The buffer chain is designed by the approach presented in Section 4.9.3. In another approach, the clock distribution is accomplished by using a tree of well-sized clock buffers, as illustrated by Fig. 4.97. Identical buffers are used in each level and each buffer sees the same load capacitance. Equalizing clock buffer loads is possible by: 1) equalizing the interconnect lengths between the buffers of different levels, and 2) adding dummy buffers at the lightly loaded buffer outputs. The last distribution level has buffers which drive the functional elements such as registers. This structure results in very reduced skew, and the only skew that exists is the one produced by variations in process parameters. To further minimize the skew, an identical layout should be used for all the buffers. As an example of the tree approach, consider the following case. To distribute the clock signal to 64 elements (for example registers), 3 stages (levels) of buffering with a 1-to-4 tree structure are required. A variety of software packages have been developed for clock tree synthesis [35, 36].
To reduce the high dynamic power dissipation (a few Watts) in clock distribution at a fixed power supply, many techniques can be used, such as:

1. Using a low capacitance clock routing line such as metal3. This layer of metal can be, for example, dedicated to clock distribution only.
2. Using low-swing drivers at the top level of the tree or in intermediate levels.

Figure 4.97 Clock tree distribution.
For the second approach, a half-swing clocking scheme has been proposed [37]. Fig. 4.98 shows the half-swing clock driver, which generates half-$V_{DD}$ clock signals (four phases) for the elements (e.g., latches). Using the charge sharing principle, the voltage of the half-$V_{DD}$ node (H-VDD) is set by a capacitive division of $V_{DD}$, with one expression for the case when clk is low (4.134) and another for the case when clk is high (4.135). In these expressions $C_A$ and $C_B$ are capacitors added to the power lines, and $C_1$ through $C_4$ are the load capacitances of the driver. When $C_A$ is equal to $C_B$ and both are large enough compared to $C_1$ through $C_4$, the H-VDD node is stabilized at $V_{DD}/2$.
Fig. 4.99 shows the clocking schemes of the latches driven by the clock driver. Compared to the conventional scheme, which uses two clock phases, the half-swing scheme requires four clock phases. Two phases are for the PMOS devices and two are for the NMOS devices, as shown in Fig. 4.99(b). This scheme reduces the power by 75%. However, the delay of the latch is increased by the new clocking scheme, which can be acceptable [37].
4.9.5 Output Circuits

To drive the output pad, a high drive capability driver is needed to achieve adequate rise and fall times. In this case, an inverter chain is used to handle the large load of the pad, package wiring, and off-chip load. This capacitance can be a few tens of pF. A typical value of this capacitance is 50 pF. There are many types of output pads such as tristate, bidirectional, low-$V_{DD}$ (3.3 V) to high-$V_{DD}$ (5 V) output buffer and low-swing output.
4.9.5.1 Tristate and Bidirectional Circuits

Fig. 4.100 shows a tristate circuit to drive a large pad capacitance. When the output enable signal is high, the output data is the same as the input data. When the output enable signal is low, the output of the pad is in the high impedance state (Z): both the output NMOS and PMOS transistors are cut off. Fig. 4.101 shows the bidirectional I/O circuit, which is quite useful when we need to save on the number of I/O pads. Sometimes an input buffer is included in the bidirectional pad. The operation of this circuit is obvious.
4.9.5.2 Power Dissipation of Output Circuit

The total power dissipation at the output pads can be divided into the static power dissipation and the dynamic power dissipation. The static power dissipation is due mainly to the leakage currents (junction and subthreshold) if the output pads are driving CMOS logic. If the $V_T$ of the devices is large enough, then the static power dissipation of the output pads is neglected. However if $V_T$ is small, then the DC power, due to the subthreshold current, for the output pads is

$P_s = N_o I_{DS,mean} V_{DD}$ (4.136)

where $N_o$ is the number of output pads and $I_{DS,mean}$ is the average subthreshold current for both cases when the input is low and high. For low $V_T$ the $I_{DS,mean}$ value would be important, because the devices in the output buffer have large sizes, particularly the output transistors. $I_{DS,mean}$ should be computed in the worst case, where $V_T$ has its minimum value. Thus for future technologies where the threshold voltage is low and the number of output pads is large, the static power dissipation would be very important and can be a limiting factor for low-power applications. Hence low-power circuit techniques are needed for output buffers.

Figure 4.101 Bidirectional pad.
If the CMOS output buffer is intended to drive bipolar TTL inputs (not CMOS TTL inputs), then an important current is sunk. Fig. 4.102 shows the final stage of the buffer driving TTL logic. Since bipolar TTL inputs can source significant amounts of current, a CMOS output buffer must sink this current. For a 3.3 V power supply, this current can be in the range of 1 mA to 12 mA depending on the strength of the output driver. The static power dissipated by one output pad driving bipolar TTL inputs is

$P = V_{OL} I_{OL}$ (4.137)

Figure 4.102 TTL output buffer.

where $I_{OL}$ is the current sunk by the output buffer and is equal to the sum of the currents from all the bipolar TTL inputs. $V_{OL}$ = 0.4 V for a low TTL output. This dissipated power is due to the output NMOS pull-down transistor and can be an important issue as far as the chip heat is concerned. Note that the corresponding energy is not drawn from the internal power supply. Another component of the total power dissipated at the output pads is the dynamic power. It is given by

$P_D = A\,(N_o E_{io} + N_o C_o V_{DD}^2)\,f$ (4.138)

where $E_{io}$ is the internal switching energy of the output pad, and $C_o$ is the average output load capacitance (including the pad load). As an example, 64 output pads switching with an activity of 10% at 200 MHz dissipate 0.8 W ($V_{DD}$ = 3.3 V, $E_{io}$ = 70 μW/MHz and $C_o$ = 50 pF). This value is very important to take into account. The total power dissipation of the bidirectional pads can be evaluated using the approaches developed for the input and output circuits.
4.9.5.3 3.3-to-5 V Output Interface
When a 3.3 V chip is connected to a 5 V chip, zero DC power dissipation interfaces are needed. If the conventional CMOS is used to interface the 3.3 V logic to 5 V logic, the DC power would be large. Fig. 4.103 illustrates this problem. For example, if the 3.3 V inverter drives high into the 5 V inverter, the $|V_{GS}|$ of the PMOS transistor of the 5 V inverter is equal to 1.7 V. This value is larger than the $V_T$ of the device and thus results in large DC power dissipation, in the range of milliwatts. Since this power is for every I/O, for a whole ASIC chip it could be hundreds of mW. This situation is unacceptable for low-power applications.

The circuit of Fig. 4.104 defines a solution to the problem of DC power dissipation [38]. The circuit has two power supplies, denoted $V_{DDL}$ and $V_{DDH}$, corresponding to low-$V_{DD}$ (for example 3.3 V) and high-$V_{DD}$ (for example 5 V), respectively. For low input data, node A is at $V_{DDL}$ and node B is at zero. The NMOS transistor N is conducting and the output is at $V_{SS}$. Since the output is zero, the feedback PMOS transistor, $P_f$, is also conducting. The pass NMOS transistor, $N_p$, is cut off, thus node C is pulled up to $V_{DDH}$. Then the PMOS transistor P is completely OFF. Hence there is no leakage in this state except the junction leakage currents and the subthreshold currents. For high input data, node A is at zero and node B is at $V_{DDL}$. In this case the NMOS transistor N is OFF and the pass transistor $N_p$ is conducting. Initially the feedback PMOS transistor $P_f$ is ON, and since $N_p$ is conducting, proper sizing of $P_f$ and $N_p$ (higher conductance of $N_p$) will permit node C to be discharged through $N_p$. This causes P to conduct, which in turn charges the output to $V_{DDH}$. Then the feedback device $P_f$ is completely OFF. Thus this interface results in very limited leakage current and solves the problem of the interface.
As mentioned, the transistors $P_f$ and $N_p$ should be sized properly so that the circuit does not latch the previous data. $P_f$ should be much smaller than $N_p$. We use a simple analysis to find the relationship between the sizes of the two transistors. For high input data, initially node C is at $V_{DDH}$. Thus the NMOS $N_p$ is in saturation and the PMOS $P_f$ is in the linear region. By assuming that the drain current of $N_p$ is much higher than that of $P_f$, we obtain the sizing relation of Equation (4.140), where $\beta_n$ and $\beta_p$ are the $\beta$s of the NMOS transistor $N_p$ and the PMOS transistor $P_f$, respectively. The low-to-high voltage converter has a negligible DC current when the input is stable since all the devices are completely OFF. This technique can be used to interface any low voltage to a higher voltage.
4.9.6 Ground Bounce
When a high drive current CMOS driver switches, it generates high current spikes. This current can generate noise, as shown in Fig. 4.105. The current flows through the impedance between the pad and the supply node and produces a voltage noise. This noise is often called $L\,di/dt$ noise or ground bounce. The $L$ is due to the package inductance. The ground bounce is given by

$V_L = L\,\dfrac{di}{dt}$ (4.141)

[Figure 4.105: $V_{in}$ and current waveforms versus time, with the noise voltage $V_L = L\,di/dt$.]

This noise problem can also occur on the power lead, where it is termed power bounce. We will use only one name to refer to this problem. Consider a CMOS output driver driving an output pad of 50 pF at 3.3 V in 2 ns rise/fall times. It can be shown [39] that $di/dt$ is related to the fall/rise time $t_r$ by

$\dfrac{di}{dt} = \dfrac{4\,C_L V_{DD}}{t_r^2}$ (4.142)

The $di/dt$ can be as high as 165 mA/ns. If, for example, 8 drivers are allowed to switch simultaneously per $V_{DD}/V_{SS}$ pad pair, the resulting ground bounce for $L$ = 1 nH is 1320 mV. This value can be a problem, particularly for low-voltage applications, since this ground bounce consumes a large fraction of the digital noise margins. Some of the problems encountered are 1) false triggering, 2) double clocking, and/or 3) missing clocked pulses.
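The ground-bounce figures quoted above can be reproduced with a few lines of Python. This sketch assumes a triangular output current pulse (peak $2CV/t$ reached at mid-transition), which gives $di/dt = 4CV/t^2$ and matches the 165 mA/ns figure for 50 pF, 3.3 V and 2 ns. The function names are illustrative.

```python
def di_dt(c_load, vdd, t_edge):
    """Current slope of one driver, assuming a triangular current pulse."""
    return 4.0 * c_load * vdd / t_edge ** 2      # in A/s


def ground_bounce(l_supply, n_drivers, slope_per_driver):
    """V = L * di/dt for n drivers sharing one supply pad pair."""
    return l_supply * n_drivers * slope_per_driver


slope = di_dt(50e-12, 3.3, 2e-9)
v_bounce = ground_bounce(1e-9, 8, slope)
print(f"di/dt = {slope * 1e-6:.0f} mA/ns, bounce = {v_bounce * 1e3:.0f} mV")
```

Because the slope goes as $1/t^2$, relaxing the edge from 2 ns to 4 ns cuts the bounce by a factor of four, which is why slew-rate control is an effective countermeasure.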
I/O buffers are not the only source of ground bounce in CMOS circuits. Clock buffers and, to a lesser extent, the core logic can also cause serious ground bounce in the supply leads when driving large loads. Careful power supply routing is needed when powering large buffers. The resistance of the metal should be minimized so that the voltage drop due to the current spike is reduced. There are many techniques to reduce the ground bounce. One simple approach is to use separate supply pins for the output buffers. Some approaches, based on reducing $L$ and $di/dt$, are the following:

• Using multiple supply pads and pins is one way to reduce the inductance of the supply. A recent chip uses 121 power/ground pins out of a total of 293 pins [40]. Placing power and ground pins adjacent to each other reduces the effective inductance of the power and ground pins through mutual inductance. This approach causes an increase in chip size and cost.

• Circuit techniques can reduce the $di/dt$ of the output and clock buffers while maintaining adequate performance. The simplest way is to control the rise/fall times while maintaining the timing requirement. However, this approach has a serious problem, since the worst-case slow process dictates the buffer sizing (worst-case delay), while the best-case fast process dictates the ground bounce level. Hence the buffer design is constrained by the two extremes of process variations. Once the buffer is sized to satisfy the worst-case delay, the worst-case ground bounce may exceed the fixed level. This problem can be solved by controlling the signal slope at the input of the output transistors of the buffer [41].

• For clock buffers, and in high-performance designs, on-chip bypass capacitances are added between the power bus and the substrate, as shown in Fig. 4.106. This capacitance lowers the impedance of the power supply. On-chip bypass capacitance does not reduce the noise produced by output buffers.

• Another approach is to reduce the output voltage swing of the large buffer.

In conclusion, to reduce the ground bounce, all the techniques can be combined to reduce $L$ and $di/dt$. The reader can refer to many other techniques to reduce the ground bounce [42, 43, 44, 45].
[Figure 4.106: on-chip bypass capacitance on the $V_{DD}$ bus.]

4.9.7 Low-Swing Output Circuit
With the advent of high-performance VLSI chips, which operate beyond 100 MHz and have over 100 I/Os on the same chip, high data rate CMOS I/O interfaces with low-swing signals are needed, such as ECL (Emitter Coupled Logic) [46, 47, 48], BTL [49], GTL [50], and CMTL (Current Mode Transceiver Logic) [51]. Conventional unterminated interconnects (between VLSI chips) for CMOS-level signals usually have poor signal quality with severe overshoot and ringing, accompanied by EMI (electromagnetic interference) and the possibility of triggering latch-up.
Fig. 4.107 shows two chips connected to the bidirectional transmission line (50 Ω termination resistors) through GTL I/O (Gunning I/O) transceivers. Both ends of the transmission line are terminated to prevent reflections. The load seen by each driver is 25 Ω. The termination voltage $V_{TM}$ is about 1.2 V. The output driver is an open-drain NMOS pull-down transistor, and when it is inactive the output is at the high-level signal $V_{OH}$, equal to $V_{TM}$. The input receiver uses a differential comparator with an external reference voltage of 0.8 V.

Figure 4.107 GTL I/O with two chips connected to a transmission line.
Fig. 4.108 shows an output driver in open-drain configuration which includes circuitry to reduce overshoot and the turn-off $di/dt$. When $V_{in}$ is low, $P_1$ turns ON, which itself turns $N_3$ and $N_4$ ON. In this case, the maximum output voltage is $V_{OL,max}$ = 0.4 V. The power dissipated by the pull-down NMOS is maximum and mainly static. The static current is equal to $(V_{TM} - V_{OL})/R$ = 0.8/25 = 32 mA (note that this current is supplied by $V_{TM}$ and not $V_{DD}$). Hence, the maximum static power dissipated on-chip is P = 32 mA × 0.4 V = 12.8 mW for each I/O. A typical value of $V_{OL}$ is 0.24 V, thus the nominal power dissipated by each active driver is 9.2 mW. When the input goes from low to high, $N_1$ turns ON and $N_3$ is still ON because the signal through the two inverters $I_1$ and $I_2$ is delayed by about 1 ns. The transistor $N_1$ is weak, hence the output discharge is controlled by $N_2$ and $N_3$. These transistors keep the drain of $N_4$ connected to its gate as long as $V_{GS}$ is higher than $V_T$. When $N_3$ turns OFF, $N_1$ discharges the gate of $N_4$ to ground. Thus, the turn-off of $N_4$ is controlled. In this case, there is no DC power dissipated.

Fig. 4.109 shows the input buffer, which employs a differential comparator. This circuit switches $V_{out}$ to high (low) when the differential input voltage is > 50 mV (< -50 mV), respectively, over process, power supply and junction temperature variations. The input $V_{in}$ is at GTL levels. The average power dissipated by this input receiver is 5.5 mW at 5 V power supply.
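The GTL driver numbers follow from Ohm's law across the effective 25 Ω load. The sketch below recomputes both the worst-case and the typical on-chip static power from the values given in the text; the function name is illustrative.

```python
def gtl_driver_static(v_term, v_ol, r_load):
    """Static current and on-chip power of an active open-drain GTL driver."""
    i_static = (v_term - v_ol) / r_load   # current supplied by the termination
    p_chip = i_static * v_ol              # dissipated in the pull-down NMOS
    return i_static, p_chip


for label, v_ol in (("maximum V_OL", 0.4), ("typical V_OL", 0.24)):
    i_static, p_chip = gtl_driver_static(1.2, v_ol, 25.0)
    print(f"{label}: I = {i_static * 1e3:.1f} mA, P = {p_chip * 1e3:.1f} mW")
```

Note that a lower $V_{OL}$ raises the current (38.4 mA at 0.24 V) but still lowers the on-chip power, because the voltage across the pull-down device drops faster than the current rises.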
4.10 LOW-POWER CIRCUIT TECHNIQUES

Remember that the total power dissipated by a circuit has three components. The two most important are: 1) the static power ($P_s$), and 2) the dynamic power ($P_d$). This section treats some of the circuit techniques for achieving low power while maintaining performance. Techniques to reduce the power at the subsystem/system and architecture levels will be discussed in Chapters 6, 7 and 8.
4.10.1 Low Static Power Techniques

One important source of static power dissipation is the use of a low threshold voltage. With device scaling, the power supply voltage is scaled. If the threshold voltage is not scaled, and is equal to or greater than one half of $V_{DD}$, the gate delay increases drastically [52]. The threshold voltage should be less than 20% of $V_{DD}$ in order to maintain performance at low supply voltage. At 1 V power supply, the threshold voltages can be as low as 0.1 V. However, reducing $V_T$ causes a serious increase in standby subthreshold current, due to the exponential relation between the current and $V_T$. With low $V_T$, process fluctuations can increase this current further. For VLSI integration and future ULSI, the total standby current can be high and not acceptable for low-power applications.
There are many techniques to reduce this subthreshold current associated with low-$V_T$ devices. These techniques are based on the principle of reverse biasing the $V_{GS}$ voltage of the MOS device (in the case of NMOS) in the standby mode of operation, as shown in Fig. 4.110. With a negative $V_{GS}$ applied, the standby state of the device moves from its original state to a new state on the characteristic of Fig. 4.110. We cite two techniques using this principle:
4.10.1.1 Self-Reverse Biasing
This technique has been used mainly to reduce the static power dissipation in standby mode of the memory decoder-driver [53]. The drivers in a memory comprise a large number of circuits, arranged repeatedly, but only a few of them operate simultaneously. The circuit of Fig. 4.111 can drastically reduce the subthreshold current of the drivers. The technique simply consists of inserting a PMOS transistor $P_c$ with a size $W_c$ between the power supply $V_{DD}$ and the common source node A. All the PMOS transistors ($P_{d1}$, $P_{d2}$, ..., $P_{dn}$) of the drivers have, in this example, the same size $W_d$ and a common source (node A). The number of drivers $n$ can be between a few hundred and a few thousand. The MOS transistors in the drivers have a low $|V_{Td}|$ (e.g., 0.1 V). The PMOS transistor $P_c$ has a threshold voltage $|V_{Tc}|$ slightly higher than $|V_{Td}|$ (e.g., 0.2 to 0.4 V).

In active mode, the input S is low and the transistor $P_c$ is ON. For the drivers, only one circuit is ON. In order that the PMOS transistor $P_c$ does not affect the drive current of the drivers, its size $W_c$ should be larger than $W_d$, depending on the capacitance of the common source, which is huge for high $n$. In standby mode, the input S is high and the PMOS transistor $P_c$ is OFF. The inputs of all drivers are set to high ($V_{DD}$). Without the PMOS transistor $P_c$, the total subthreshold current would be $n$ times the current of each driver, which makes this current very high. Hence $P_c$ reduces and limits the subthreshold current. The voltage of the common source node A is reduced by an amount $\Delta V_{SRB}$ (a few hundred mV). This causes the PMOS transistors of all drivers to have a self-reverse-biased gate-source voltage, which drastically reduces the subthreshold current. The time needed for the node to stabilize to $V_{DD} - \Delta V_{SRB}$ (or the time needed to switch from the active to the standby mode) is called the evolution time and can be very long (on the order of 1 ms) compared to the delay of the driver. The reason is that only the leakage and subthreshold currents discharge node A in this mode. This time can be insignificant for low-power operation if the standby mode time is large enough, as in the case of many low-power applications. When the input S is turned low (active mode), the time needed for the common source A to recover (reach almost $V_{DD}$) is very short and can be shorter than the delay time. Hence, it does not interrupt the start of normal operation.

Figure 4.111 Subthreshold current reduction by self-reverse biasing (active mode and standby mode).
Let us now derive the subthreshold current expressions before and after reduction by the SRB technique. The total subthreshold current without the self-reverse-biasing technique is given by

$I_{sub1} = n\,I_0\,\dfrac{W_d}{W_0}\,\exp\!\left(\dfrac{-|V_{Td}|}{S/\ln 10}\right)$ (4.143)

With the transistor $P_c$, the subthreshold current is given by

$I_{sub2} = I_0\,\dfrac{W_c}{W_0}\,\exp\!\left(\dfrac{-|V_{Tc}|}{S/\ln 10}\right)$ (4.144)
We assume that the devices have the same $I_0$, $W_0$ and $S$. By dividing the current equations (4.143) and (4.144), we have, for the subthreshold current, a reduction factor

$\gamma = \dfrac{I_{sub1}}{I_{sub2}} = n\,\dfrac{W_d}{W_c}\,\exp\!\left(\dfrac{|V_{Tc}| - |V_{Td}|}{S/\ln 10}\right)$ (4.145)

For example, for $n$ = 512, $W_c = 10\,W_d$ (with this ratio the speed is not affected), $|V_{Tc}|$ = 0.3 V, $|V_{Td}|$ = 0.1 V and $S$ = 90 mV/decade, the factor $\gamma = 8.5 \times 10^3$. So the saving in subthreshold current is substantial. The parameter $\Delta V_{SRB}$ can be easily deduced. Note that this technique needs a multi-$V_T$ technology.
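The reduction factor is simple to evaluate numerically. The sketch below divides Eq. (4.143) by Eq. (4.144) and uses the identity $\exp(x/(S/\ln 10)) = 10^{x/S}$. It takes the ratio $W_c/W_d = 10$, which is the value that reproduces the quoted factor of about $8.5 \times 10^3$; the function and parameter names are illustrative.

```python
def srb_reduction_factor(n, wc_over_wd, vtc_abs, vtd_abs, s_per_decade):
    """Ratio I_sub1 / I_sub2 of Eqs. (4.143) and (4.144).

    s_per_decade is the subthreshold swing S in volts per decade.
    """
    decades = (vtc_abs - vtd_abs) / s_per_decade   # exponent in powers of ten
    return (n / wc_over_wd) * 10.0 ** decades


gamma = srb_reduction_factor(n=512, wc_over_wd=10.0,
                             vtc_abs=0.3, vtd_abs=0.1, s_per_decade=0.090)
print(f"reduction factor = {gamma:.2e}")
```

Every additional 90 mV of threshold difference buys one more decade of reduction, while widening $P_c$ for speed costs only a linear factor.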
4.10.1.2 Multi-$V_T$ Technique
This technique is similar to the one discussed above, but it can be applied to any CMOS logic [54, 56]. The basic idea is shown in the example of the NAND gate of Fig. 4.112. Here the MOS transistors P and N have high $V_T$ (e.g., 0.6 V extrapolated) for 1 V power supply applications. The logic gate itself has MOSFETs with low $V_T$ (≤ 0.3 V). The signal SL is used to switch the gate into active or sleep (standby) mode. The virtual supply lines $V_{DDV}$ and $V_{SSV}$ are common for many gates. We call this logic multi-threshold CMOS logic (MT-CMOS).

In the active mode, the signal SL is low, P and N are ON, so the virtual supply lines $V_{DDV}$ and $V_{SSV}$ can be set to almost $V_{DD}$ and ground, respectively. Hence, the low-$V_T$ logic can switch efficiently, but care should be taken in the sizing of the P/N devices compared to the logic. Fig. 4.113 shows the effect of sizing the high-$V_T$ devices on the delay of the gate. The width of P/N should be at least 10 times larger than that of the logic cells. This condition depends greatly on the parasitic capacitances of the virtual supply lines, $C_1$ and $C_2$ [see Fig. 4.112]. If $C_1$ and $C_2$ are large then the width of the P and N transistors can be reduced, because these capacitances tend to suppress the bouncing of $V_{DDV}$ and $V_{SSV}$ and hence improve the speed. The high-$V_T$ MOSFETs can be common to several logic gates (e.g., 10).
In the standby (sleep) mode, the signal SL is high, so P and N are OFF. Hence, the subthreshold current is limited by that of these high-$V_T$ devices. In this case, the static power dissipation is dramatically reduced in the sleep mode. The subthreshold reduction factor can be deduced using the analysis presented in the previous section. One problem associated with this MT logic is that the evolution and recovery times can be large.
Figure 4.113 Effect of high-$V_T$ MOS width on the performance of MT-CMOS (horizontal axis: high-$V_T$ transistor gate width, normalized).
The measured delay, as a function of the supply voltage, for a 2-input NAND gate with FO = 3 and a wiring load of 1 mm (0.25 pF) is shown in Fig. 4.114. The technology is 0.5 μm CMOS with low $V_{Tn}$ = 0.25 V, low $V_{Tp}$ = -0.35 V, high $V_{Tn}$ = 0.55 V and high $V_{Tp}$ = -0.65 V. The MT-CMOS logic has almost the same speed as the full low-$V_T$ logic. The logic delay time is reduced by 70% at 1 V as compared with that of the high-$V_T$ one.
For holding the level of the output during the sleep mode, a level holder is necessary, as shown in Fig. 4.115. It consists only of cross-coupled inverters with high-$V_T$ devices powered from the power supply $V_{DD}$.
Figure 4.115 CMOS gate with level holder.

Low-$V_T$ devices are not the only source of static power dissipation. Several other issues contribute to static power increase. These are some circuit design guidelines to reduce the static power dissipation:

• Avoid the use of pseudo-NMOS circuits in your design.

• Avoid the use of TTL-compatible I/O, or devise low-DC-current level converters.

• Do not use low-$V_T$ devices in the I/O buffers, otherwise the DC power increases remarkably because the MOS transistors of the I/O buffers have large sizes. If you do not have any option, then use the subthreshold reduction techniques.
4.10.2 Low Dynamic Power Techniques

ASIC and VLSI processor clocks are improving rapidly, reaching the sub-GHz range [21, 56]. The power dissipation of CMOS digital circuits operating at these high frequencies increases drastically, and it can be the main performance-limiting factor. Therefore, low-power circuit techniques are needed to reduce the dynamic power of digital circuits. Moreover, low chip power consumption is extremely important in order to extend the battery life of portable systems [57].
In general the dynamic power dissipation of B gate (i) is given by:

Pas = rriC,v.VDDf
(4.146)
where (I,is the gate activity, V, is the voltage swing, C, is the load and parasitic capacitances and f is the operating frequency of the system. Equation (4.146) demonstrates that there m e several ways to reduce P,:
1. Reduce the power supply voltage. Scaling $V_{DD}$ from 3.3 V to 1 V results in a power reduction factor of 11. However, this approach leads to speed degradation for a given technology. But if device scaling is applied, in a next-generation technology, the delay will improve and hence the operating frequency. In a complex digital system, local supply reductions can be used for non-critical circuits.
2. Reduce, temporarily, the clock frequency of unused blocks on a VLSI chip using an on-chip power management unit, or reduce the gate activity. These can be done at the architectural level.
3. Reduce the output capacitance $C_i$. As a first-order approximation, this capacitance is composed of the interconnect capacitance $C_{int}$ and the total input capacitance of the driven gates $C_{in}$. The latter can be reduced using a low-input-capacitance logic family [60] such as a CPL-like one. Also, using minimum-size logic gates in non-critical parts of the design can reduce the dynamic power significantly.
When $C_{int}$ dominates, as in busses and high-capacitance interconnections (interblock wires), then circuit techniques based on low-swing signals, while maintaining the power supply voltage, can lead to power dissipation reduction [58, 59]. With increasing chip dimensions and integration density, the capacitances of wires will dominate. It is expected that the power dissipation associated with the busses and the interconnections in future ULSI chips will reach half of the total power dissipation [58].
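The scaling arguments above can be checked numerically. The following Python sketch evaluates Equation (4.146) for a full-swing gate at 3.3 V and at 1 V, and for a 1 V swing on an unchanged 3.3 V supply. The function name and the activity, capacitance and frequency values are illustrative assumptions, not data from the text.

```python
def dynamic_power(alpha, c_load, v_swing, v_dd, freq):
    """Dynamic power of one gate per Eq. (4.146): alpha * C * Vs * VDD * f."""
    return alpha * c_load * v_swing * v_dd * freq


# Assumed operating point (illustrative only): 25% activity, 100 fF, 100 MHz.
ALPHA, C_LOAD, FREQ = 0.25, 100e-15, 100e6

p_3v3 = dynamic_power(ALPHA, C_LOAD, 3.3, 3.3, FREQ)        # full swing, 3.3 V supply
p_1v0 = dynamic_power(ALPHA, C_LOAD, 1.0, 1.0, FREQ)        # full swing, 1 V supply
p_low_swing = dynamic_power(ALPHA, C_LOAD, 1.0, 3.3, FREQ)  # 1 V swing, 3.3 V supply

supply_scaling_factor = p_3v3 / p_1v0     # equals (3.3 / 1)^2
low_swing_factor = p_3v3 / p_low_swing    # equals 3.3 / 1

print(f"supply scaling reduces power by {supply_scaling_factor:.1f}x")
print(f"low-swing signaling reduces power by {low_swing_factor:.1f}x")
```

Supply scaling gains the square of the voltage ratio because both the swing and the supply term shrink, whereas low-swing signaling with a fixed supply gains only the linear ratio. This is why low-swing techniques are mainly attractive where the wiring capacitance dominates.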
These are some guidelines for the design of low-dynamic-power circuits:
• Choose the technology that has low junction and oxide capacitances for the same performance.
• Avoid, if possible, the use of the dynamic logic design style.
• For any logic design, reduce the switching activity by logic reordering and by balancing delays through the gate tree to avoid glitching problems.
• Use a low-input-capacitance logic family.
• In non-critical paths, use minimum-size devices whenever it is possible without degrading the overall performance requirements.
• If the pass-transistor logic style is used, careful design should be considered.
4.11 ADIABATIC COMPUTING

As discussed in Section 4.3.2, the energy provided by the supply to a load $C_L$ of a driver during charging and discharging is

$$E = C_L V^2 \qquad (4.147)$$

where $V$ is the power supply voltage as shown in Fig. 4.116(a). Half of the energy is dissipated by the resistor of the pull-up PMOS device during the charging phase. A similar argument applies to the discharge resistor of the pull-down NMOS transistor. This analysis is valid even if a step power supply voltage, $V$, is applied to the network. From Fig. 4.116(b), the voltage drop $V_R$ across the resistor $R_P$ varies from $V$ (supply voltage) to zero. Hence, the energy dissipated by $R_P$ is given by

$$E_R = \int V_R \, dQ = \int V_R C_L \, d(V - V_R) \qquad (4.148)$$

then

$$E_R = \frac{1}{2} C_L V^2 \qquad (4.149)$$

$$E_R = C_L V \bar{V} \qquad (4.150)$$

where $\bar{V}$ is the average voltage drop across the resistor of the pull-up PMOS.

If the power supply voltage has two half steps, as shown in Fig. 4.116(c), the energy dissipated by the resistor is

$$E_R = \frac{1}{4} C_L V^2 \qquad (4.151)$$

So less energy is dissipated by the resistor when the average voltage is reduced, while keeping the swing and load capacitance constant. This is the principle of Adiabatic Switching [61, 62, 63].

For a multi-step power supply voltage, as shown in Fig. 4.116(d), the total energy dissipated is given by [61]

$$E = \frac{E_{conventional}}{N} = \frac{C_L V^2}{N} \qquad (4.152)$$

and the one dissipated by the resistor is

$$E_R = \frac{1}{2N} C_L V^2 \qquad (4.153)$$
where $N$ is the number of uniformly distributed voltage steps. Fig. 4.117 shows an example of a driver with uniformly distributed supplies which are switched in successively. The voltage $V_i$ is given by

$$V_i = \frac{i}{N} V$$

To charge the load, $V_1$ through $V_N$ are connected to the load in succession (by closing switch 1, opening switch 1, closing switch 2, etc.). To discharge the load, $V_{N-1}$ through $V_1$ are switched in the same way, and then switch 0 is closed, connecting the output to ground. Note that the supply voltage, with multiple steps, needs a longer time period than the conventional case to charge up the load capacitance. This technique has been used for large loads. Another variation is to use a supply voltage with a "ramp form" [62]. In this case, the energy is drastically reduced if a long time period is used. For the inverter, for example, pulsed power supplies (PPS) are applied to the circuit. Adiabatic computing becomes attractive only when the delay is not critical, because in that technique the energy is traded for delay. The energy-delay product of the adiabatic circuit is much worse than that of conventional CMOS gates [64].
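The dependence of the resistive loss on the number of steps can be illustrated with a short energy balance. The Python sketch below assumes ideal supplies at $V_i = iV/N$ and full settling of the load at each step. The function name and the load and voltage values are assumptions chosen for illustration. For each step, the loss is the energy drawn from the active supply minus the increase in the energy stored on the capacitor.

```python
def stepwise_charge_loss(c_load, v_final, n_steps):
    """Energy dissipated in the switch resistance while charging c_load
    from 0 to v_final through n_steps uniformly spaced supplies."""
    loss = 0.0
    v_cap = 0.0
    for i in range(1, n_steps + 1):
        v_supply = i * v_final / n_steps                       # V_i = i * V / N
        charge = c_load * (v_supply - v_cap)                   # charge moved in this step
        drawn = charge * v_supply                              # energy taken from supply i
        stored = 0.5 * c_load * (v_supply ** 2 - v_cap ** 2)   # gain in capacitor energy
        loss += drawn - stored                                 # the rest is dissipated
        v_cap = v_supply                                       # assume full settling
    return loss


# Assumed example values: 1 pF load charged to 3.3 V.
C_LOAD = 1e-12
V = 3.3
single_step = stepwise_charge_loss(C_LOAD, V, 1)

for n in (1, 2, 4, 8):
    loss = stepwise_charge_loss(C_LOAD, V, n)
    print(f"N = {n}: loss = {loss:.3e} J ({loss / single_step:.3f} of the single-step loss)")
```

The loss scales as 1/N, in line with Equations (4.149), (4.151) and (4.153). The sketch says nothing about the charging time, which grows with N and is the price paid for the energy saving.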
4.12 CHAPTER SUMMARY
This chapter has provided an introduction to low-power CMOS design. The power dissipation components of a CMOS gate have been discussed. Techniques to reduce the different components, at the physical and circuit levels, were presented. Novel CMOS design styles such as CPL, DPL, and SRPL were examined. Several issues in CMOS circuit design, such as clock distribution, ground bouncing, etc., were reviewed. This chapter represents a base for Chapters 6, 7, and 8, where subsystems and low-power architectures are discussed.
REFERENCES
[1] N. H. E. Weste and K. Eshraghian, "Principles of CMOS VLSI Design: A Systems Perspective," second edition, Addison-Wesley, Reading, MA, 1993.
[2] J. P. Uyemura, "Circuit Design for CMOS VLSI," Kluwer Academic Publishers, Norwell, MA, 1992.
[3] M. I. Elmasry, "Digital MOS Integrated Circuits II," IEEE Press Book, 1993.
[4] R. M. Swanson and J. D. Meindl, "Ion-Implanted Complementary MOS Transistors in Low-Voltage Circuits," IEEE J. Solid-State Circuits, vol. 7, no. 2, pp. 146-153, April 1972.
[5] H. J. M. Veendrick, "Short-Circuit Dissipation of Static CMOS Circuitry and Its Impact on the Design of Buffer Circuits," IEEE J. Solid-State Circuits, vol. 19, no. 4, pp. 468-473, August 1984.
[6] S. M. Kang, "Accurate Simulation of Power Dissipation in VLSI Circuits," IEEE J. Solid-State Circuits, vol. 21, no. 5, pp. 889-891, October 1986.
[7] G. J. Fisher, "An Enhanced Power Meter for SPICE2 Circuit Simulation," IEEE Trans. Computer-Aided Design, vol. 7, pp. 641-643, May 1988.
[8] G. Y. Yacoub and W. H. Ku, "An Enhanced Technique for Simulating Short-Circuit Power Dissipation," IEEE J. Solid-State Circuits, vol. 24, no. 3, pp. 844-847, June 1989.
[9] N. Meijs and J. T. Fokkema, "VLSI Circuit Reconstruction From Mask Topology," Integration, vol. 2, no. 2, pp. 85-119, 1984.
[10] D. V. Heinbruch, "CMOS3 Cell Library," Addison-Wesley, Reading, MA, 1988.
[11] R. J. Landers and S. Mahant-Shetti, "Multiplexer-Based Architecture for High-Density, Low-Power Gate Arrays," in Symposium on VLSI Circuits, Tech. Dig., Honolulu, pp. 33-34, June 1994.
[12] M. I. Elmasry, "Digital MOS Integrated Circuits I," IEEE Press Book, 1981.
[13] R. H. Krambeck, C. M. Lee and H-F S. Law, "High-Speed Compact Circuits with CMOS," IEEE J. Solid-State Circuits, vol. 17, no. 3, pp. 614-619, June 1982.
[14] V. Friedman and S. Liu, "Dynamic Logic CMOS Circuits," IEEE J. Solid-State Circuits, vol. 19, no. 2, pp. 263-266, April 1984.
[15] N. F. Goncalves and H. J. De Man, "NORA: A Race Free Dynamic CMOS Technique for Pipelined Logic Structures," IEEE J. Solid-State Circuits, vol. 18, no. 3, pp. 261-266, June 1983.
[16] C. M. Lee and E. W. Szeto, "Zipper CMOS," IEEE Circuits and Devices Mag., vol. 2, no. 3, pp. 10-17, May 1986.
[17] N. Weste and K. Eshraghian, "Principles of CMOS VLSI Design: A Systems Perspective," Addison-Wesley, Reading, MA, 1985.
[18] F. Lu and H. Samueli, "A 200-MHz CMOS Pipelined Multiplier-Accumulator Using a Quasi-Domino Dynamic Full-Adder Cell Design," IEEE J. Solid-State Circuits, vol. 28, no. 2, pp. 123-132, February 1993.
[19] J. Yuan and C. Svensson, "High-Speed CMOS Circuit Technique," IEEE J. Solid-State Circuits, vol. 24, no. 1, pp. 62-71, February 1989.
[20] M. Afghahi and C. Svensson, "A Unified Single-Phase Clocking Scheme for VLSI Systems," IEEE J. Solid-State Circuits, vol. 25, no. 1, pp. 225-233, February 1990.
[21] D. W. Dobberpuhl et al., "A 200-MHz 64-b Dual-Issue CMOS Microprocessor," IEEE J. Solid-State Circuits, vol. 27, no. 11, pp. 1555-1567, November 1992.
[22] H. B. Bakoglu, "Circuits, Interconnects, and Packaging for VLSI," Addison-Wesley, Reading, MA, 1990.
[23] K. Yano et al., "A 3.8-ns CMOS 16x16 Multiplier Using Complementary Pass-Transistor Logic," IEEE J. Solid-State Circuits, vol. SC-25, no. 2, pp. 388-394, April 1990.
[24] M. Suzuki et al., "A 1.5-ns 32-b CMOS ALU in Double Pass-Transistor Logic," IEEE J. Solid-State Circuits, vol. SC-28, no. 11, pp. 1145-1151, November 1993.
[25] A. Parameswar, H. Hara, and T. Sakurai, "A High-Speed, Low-Power, Swing Restored Pass-Transistor Logic Based Multiply and Accumulate Circuit for Multimedia Applications," IEEE Custom Integrated Circuits Conference, Tech. Dig., San Diego, CA, pp. 278-281, May 1994.
[26] L. A. Glasser and D. W. Dobberpuhl, "The Design and Analysis of VLSI Circuits," Addison-Wesley, Reading, MA, 1985.
[27] T. Kobayashi et al., "A Current-Controlled Latch Sense Amplifier and a Static Power-Saving Input Buffer for Low-Power Architecture," IEEE J. Solid-State Circuits, vol. SC-28, no. 4, pp. 523-527, April 1993.
[28] M. S. J. Steyaert et al., "ECL-CMOS and CMOS-ECL Interface in 1.2-µm CMOS for 150-MHz Digital ECL Data Transmission Systems," IEEE J. Solid-State Circuits, vol. SC-26, no. 1, pp. 18-24, January 1991.
[29] C. Mead and L. Conway, "Introduction to VLSI Systems," Addison-Wesley, Reading, MA, 1980.
[30] N. C. Li, G. L. Haviland and A. A. Tuszynski, "CMOS Tapered Buffer," IEEE J. Solid-State Circuits, vol. SC-25, no. 4, pp. 1005-1008, August 1990.
[31] M. Nemes, "Driving Large Capacitances in MOS LSI Systems," IEEE J. Solid-State Circuits, vol. SC-19, no. 1, pp. 159-161, February 1984.
[32] N. Hedenstierna and K. O. Jeppson, "CMOS Circuit Speed and Buffer Optimization," IEEE Trans. Computer-Aided Design, vol. CAD-6, no. 2, pp. 276-281, March 1987.
[33] A. J. Al-Khalili, Y. Zhu and D. Al-Khalili, "A Module Generator for Optimized CMOS Buffer," IEEE Trans. Computer-Aided Design, vol. CAD-9, no. 10, pp. 1028-1046, October 1990.
[34] S. R. Vemuru and A. R. Thorbjornsen, "Variable-Taper CMOS Buffer," IEEE J. Solid-State Circuits, vol. SC-26, no. 9, pp. 1265-1269, September 1991.
[35] J. Burkis, "Clock Tree Synthesis for High Performance ASICs," in IEEE ASIC Intern. Conf. and Exhibit, Rochester, NY, pp. PS-8.1-PS-8.3, September 1991.
[36] P. D. Ta and K. Do, "A Low-Power Clock Distribution Scheme for Complex IC System," in IEEE ASIC Intern. Conf. and Exhibit, Rochester, NY, pp. P1-5.1-P1-5.4, September 1991.
[37] H. Kojima, S. Tanaka, and K. Sasaki, "Half-Swing Clocking Scheme for 75% Power Saving in Clocking Circuitry," Symposium on VLSI Circuits, Tech. Dig., Honolulu, pp. 23-24, June 1994.
[38] J. S. Caravella and J. H. Quigley, "Three Volt to Five Volt Interface Circuit with Device Leakage Limited DC Power Dissipation," in IEEE ASIC Intern. Conf. and Exhibit, Rochester, NY, pp. 448-451, September 1993.
[39] M. Shoji, "CMOS Digital Circuit Technology," Prentice Hall Inc., Englewood Cliffs, NJ, 1988.
[40] F. Abu-Nofal et al., "A Three-Million Transistor Microprocessor," in IEEE International Solid-State Circuits Conf., pp. 108-109, February 1992.
[41] T. Gabara and D. Thompson, "Ground Bounce Control in CMOS Integrated Circuits," in IEEE International Solid-State Circuits Conf., pp. 88-89, February 1988.
[42] T. Gabara, "Ground Bounce Control and Improved Latch-up Suppression Through Substrate Conduction," IEEE J. Solid-State Circuits, vol. 23, no. 5, pp. 1224-1232, October 1988.
[43] M. Hashimoto and O-K Kwon, "Low dI/dt Noise and Reflection Free CMOS Signal Driver," in IEEE Custom Integrated Circuits Conf., Tech. Dig., pp. 14.4.1-14.4.4, 1989.
[44] T. Wada, M. Eino and K. Anami, "Simple Noise Model and Low-Noise Data-Output Buffer for Ultra-High-Speed Memories," IEEE J. Solid-State Circuits, vol. 25, no. 6, pp. 1586-1588, December 1990.
[45] R. Senthinathan and J. L. Prince, "Application Specific CMOS Output Driver Circuit Design Techniques to Reduce Simultaneous Switching Noise," IEEE J. Solid-State Circuits, vol. 28, no. 12, pp. 1383-1388, December 1993.
[46] T. Knight and A. Krymm, "A Self-Terminating Low-Voltage-Swing CMOS Output Driver," IEEE J. Solid-State Circuits, vol. 23, no. 2, pp. 457-464, April 1988.
[47] H-J Schumacher, J. Dikken and E. Seevinck, "CMOS Subnanosecond True-ECL Output Buffer," IEEE J. Solid-State Circuits, vol. 25, no. 1, pp. 150-154, February 1990.
[48] M. Pedersen and P. Metz, "A CMOS to 100K ECL Interface Circuit," in IEEE International Solid-State Circuits Conf., Tech. Dig., pp. 226-227, February 1989.
[49] J. Martinez, "BTL Transceivers Enable High-Speed Bus Design," EDN, August 1992.
[50] B. Gunning, L. Yuan, T. Nguyen and T. Wong, "A CMOS Low-Voltage-Swing Transmission-Line Transceiver," in IEEE International Solid-State Circuits Conf., Tech. Dig., pp. 58-59, February 1992.
[51] J. A. Quigley, J. S. Caravella and W. J. Neil, "Current Mode Transceiver Logic (CMTL) for Reduced Swing CMOS, Chip to Chip Communication," in IEEE International ASIC Conference and Exhibit, Rochester, NY, Tech. Dig., pp. 452-457, September 1993.
[52] M. Kakumu, "Process and Device Technologies of CMOS Devices for Low-Voltage Operation," IEICE Trans. Electron., vol. E76-C, no. 5, pp. 672-680, May 1993.
[53] T. Kawahara et al., "Subthreshold Current Reduction for Decoded-Driver by Self-Reverse-Biasing," IEEE J. Solid-State Circuits, vol. 28, no. 11, pp. 1136-1144, November 1993.
[54] S. Mutoh et al., "1 V High-Speed Digital Circuit Technology with 0.5-µm Multi-Threshold CMOS," in IEEE International ASIC Conference and Exhibit, Rochester, NY, Tech. Dig., pp. 186-189, September 1993.
[55] M. Horiguchi et al., "SSI CMOS Circuit for Low-Standby Subthreshold Current Giga-Scale LSI's," IEEE J. Solid-State Circuits, vol. 28, no. 11, pp. 1131-1135, November 1993.
[56] R. W. Badeau et al., "A 100-MHz Macropipelined VAX Microprocessor," IEEE J. Solid-State Circuits, vol. 27, no. 11, pp. 1585-1597, November 1992.
[57] R. Brodersen, A. Chandrakasan and S. Sheng, "Design Techniques for Portable Systems," in IEEE International Solid-State Circuits Conf., Tech. Dig., pp. 168-169, February 1993.
[58] Y. Nakagome et al., "Sub-1-V Swing Internal Bus Architecture for Future Low-Power ULSI's," IEEE J. Solid-State Circuits, vol. 28, no. 4, pp. 414-419, April 1993.
[59] A. Bellaouar, I. S. Abu-Khater, and M. I. Elmasry, "Low-Power CMOS/BiCMOS Drivers and Receivers for On-Chip Interconnects," IEEE J. Solid-State Circuits, vol. 30, no. 1, May 1995.
[60] A. Chandrakasan et al., "Low-Power CMOS Digital Design," IEEE J. Solid-State Circuits, vol. 27, no. 4, pp. 473-484, April 1992.
[61] L. J. Svensson and J. G. Koller, "Driving a Capacitive Load without Dissipating fCV²," IEEE Symposium on Low Power Electronics, Tech. Dig., San Diego, pp. 100-101, October 1994.
[62] T. Gabara, "Pulsed Power Supply CMOS - PPS CMOS," IEEE Symposium on Low Power Electronics, Tech. Dig., San Diego, pp. 98-99, October 1994.
[63] J. S. Denker, "A Review of Adiabatic Computing," IEEE Symposium on Low Power Electronics, Tech. Dig., San Diego, pp. 94-97, October 1994.
[64] M. Horowitz, T. Indermaur, and R. Gonzalez, "Low-Power Digital Design," IEEE Symposium on Low Power Electronics, Tech. Dig., San Diego, pp. 8-11, October 1994.
5
LOW-VOLTAGE VLSI BICMOS CIRCUIT DESIGN
BiCMOS technology offers enhanced performance compared to CMOS at 5 V power supply voltage. Many high-speed BiCMOS SRAMs, gate arrays, ASICs, etc. have been fabricated [1]. In this chapter, we present a variety of BiCMOS logic circuits suitable for 3.3 V and sub-3.3 V operation. The potential gates for digital applications are identified. The chapter starts with the introduction of the conventional BiCMOS (totem-pole) gate which is used in 5 V applications. The degradation of this gate, with supply voltage scaling, is demonstrated. In Section 5.2, we introduce the BiNMOS family suitable for low-voltage applications. Other logic families, for low power supply voltage operation, are discussed in Section 5.3. Low-voltage digital applications of BiCMOS are identified. The reader is referred to BiCMOS books [2, 3] to get more familiar with BiCMOS circuits.
5.1 CONVENTIONAL BICMOS LOGIC

In this section, the conventional BiCMOS logic family is introduced. This family has been used successfully in many applications at 5 V power supply voltage. The reason for the speed advantage of BiCMOS compared to CMOS is explained. At low voltage, the performance degradation of conventional BiCMOS is shown.
The CMOS inverter of Fig. 5.1 suffers from limited current drive when the load capacitance is large. To increase the drive capability of CMOS, a bipolar driver can be added at the output of the CMOS inverter. Fig. 5.2 shows one possible configuration to construct what is called a conventional BiCMOS inverter. The addition of the bipolar driver stage to the basic CMOS inverter is responsible for the high current driving capability of BiCMOS over CMOS. As a result, BiCMOS offers lower delay compared to that of CMOS, especially at high loading capacitance.

The operation of this gate is straightforward. When the input is low, the PMOS P is ON and its drain current turns the transistor $Q_1$ ON. The collector current of $Q_1$ charges the output load capacitance. As the output reaches $V_{DD} - V_{BE,on}$, where $V_{BE,on}$ is the turn-on voltage of the bipolar transistor and is about 0.7 V, $Q_1$ gradually turns OFF. During this period, the NMOS transistor $N_{d2}$ is ON. Since $N_{d2}$ is conducting, $Q_2$ is in the cutoff region. Transistor $N_{d2}$ can also be controlled by the output node. However, using the base node results in faster operation because the base of $Q_1$ is pulled up faster than the output node and because the voltage level of the base node is larger. If the input is high, the NMOS transistors N and $N_{d1}$ are ON. $Q_1$ is OFF while $Q_2$ turns ON to discharge the output node. As a result, the load capacitance is pulled down. As the output $V_o$ reaches $V_{BE}$, transistor $Q_2$ turns OFF and the output stays at this level. The conventional BiCMOS gate provides high drive capability, zero static power dissipation and high input impedance. More discussions on this gate are given in the following sections.
Figure 5.2 Conventional BiCMOS inverter.
5.1.1 DC Characteristics
Fig. 5.3 shows the DC transfer characteristic of the conventional BiCMOS inverter of Fig. 5.2. When the input voltage to the BiCMOS inverter is zero, both bipolar transistors are OFF. The PMOS device P operates in the linear region with zero drain-source voltage. Due to the subthreshold current of the transistor N (~10 pA), the base-emitter voltage of $Q_1$ is around 0.45 V. As a result, the output voltage is $V_o$ = 4.55 V (at $V_{DD}$ = 5 V). The base of the bipolar transistor $Q_2$ is at zero voltage because $N_{d2}$ is ON.
As the input voltage increases, the subthreshold current of N increases, causing $V_{BE,Q_1}$ to rise and the output voltage to fall. When the input voltage is around mid-$V_{DD}$, both the P and N MOSFETs are ON and operate in the saturation region. Also, the bipolar devices are ON. At this point, the BiCMOS inverter is in the high-gain region and the output voltage drops sharply towards its low level.
Figure 5.3 The DC transfer characteristic of the conventional BiCMOS at 5 V.
As the input voltage increases again, the base of $Q_2$ follows the voltage of the output since N is ON. When the input voltage reaches $V_{DD}$, the PMOS P is OFF. The discharge device, $N_{d1}$, is ON and the base of $Q_1$ is at zero. Also, the output is completely discharged and N is ON. Then, the base of $Q_2$ is at zero. In this case, the output voltage is zero and both base-emitter voltages are zero.
5.1.2 Transient Switching Characteristics

In this section we study the transient behavior of the conventional inverter of Fig. 5.2. The purpose of this analysis is threefold: i) to understand the transient switching behavior of the gate, ii) to develop a simple analytic model, and iii) to show the superiority of BiCMOS compared to CMOS. The objective of the delay analysis is to point out the important device and circuit parameters that affect the response of the gate. The developed model is very simple and can be used as a first-order approximation. We start with the analysis of the pull-up section. Then we show the difference in the case of the pull-down section. We assume a step input.
5.1.2.1 Transient Behavior

Fig. 5.4 shows the transient behavior of the BiCMOS inverter of Fig. 5.2. When the input falls to ground, transistor P turns ON and operates initially in the saturation region. Its drain current charges the parasitic capacitances at the base, and when $V_{BE,Q_1} = V_{BE,on}$, $Q_1$ turns ON. The emitter current increases in a relatively short time to its peak to charge the output load $C_L$, as shown in Fig. 5.4(b). The output voltage is pulled up following the base voltage of $Q_1$, as shown in Fig. 5.4(a). As the base of $Q_1$ exceeds $V_{Tn}$, $N_{d2}$ turns ON to discharge the base of $Q_2$ to ground. But due to capacitive coupling, $V_{BE,Q_2}$ tends to be pulled up. When the base voltage is higher than $V_{DD} - V_{DS,sat}$, where $V_{DS,sat}$ is the saturation voltage of P, the PMOS transistor P enters the linear region and the drain (base) current drops gradually. Consequently, the emitter current of $Q_1$ starts falling. As the output voltage $V_o$ approaches the theoretical limit of $V_{DD} - V_{BE,on}$, $Q_1$ is expected to turn gradually OFF. However, due to the capacitive coupling between the base and the output node, $V_o$ exceeds this limit, as shown in Fig. 5.4(a). The same reasoning can be applied when the input rises to $V_{DD}$.
5.1.2.2 Analytic Delay Model

A simple delay analysis is carried out in this section. The reader can refer to [4, 5, 6] for other detailed models. We take into account the parasitic capacitances and the bipolar high-current effects. We do not take into account the parasitic resistances since they have no appreciable effect with advanced bipolar technology. This model is based on the model of [7].
Fig. 5.5 illustrates the transient equivalent circuit of the pull-up section (Fig. 5.2) of the conventional BiCMOS gate driving a load capacitance $C_L$. As we are interested in the 50% rise time, the PMOS current can be modeled by the saturation current of the device. This current is given by Equation (3.82) in Chapter 3

$$I_{DS,sat} = K_p C_{ox} v_{sat,p} W_p \left(|V_{GS}| - |V_{Tp}|\right) \qquad (5.1)$$

where $V_{GS}$ is equal to $(V_{in,l} - V_{DD})$, where $V_{in,l}$ is the low level of the input. The capacitance $C_{p1}$ accounts for the parasitic capacitances of the MOS devices P, $N_{d1}$ and $N_{d2}$ at the base of the pull-up bipolar transistor. Therefore, it is given by

$$C_{p1} = C_{d,P} + C_{d,N_{d1}} + C_{g,N_{d2}} \qquad (5.2)$$

where $C_{d,P}$ and $C_{d,N_{d1}}$ are the drain junction capacitances of P and $N_{d1}$, and $C_{g,N_{d2}}$ is the gate oxide capacitance of $N_{d2}$. The overlap capacitances of P and $N_{d1}$ are assumed negligible. The bipolar parasitic capacitance $C_{pQ}$ of Fig. 5.5(a) is given by

$$C_{pQ} = C_{C,Q_1} + C_{E,Q_1} \qquad (5.3)$$

The total load capacitance, $C_o$, shown in Fig. 5.5(b), is given by

$$C_o = C_L + C_{S,Q_2} + C_{C,Q_2} \qquad (5.4)$$

where $C_L$ is the external load capacitance, $C_{S,Q_2}$ is the average collector-substrate capacitance of $Q_2$ and $C_{C,Q_2}$ is the average base-collector capacitance of $Q_2$. Recall from Section 3.5.3 that the base-emitter diffusion capacitance is given by

$$C_{D,Q_1} = \tau_f \frac{dI_{C,Q_1}}{dV_{BE}} \qquad (5.5)$$

where $\tau_f$ is the forward transit time subject to high-level effects.

The delay can be divided into three components:
1. The first component, $t_1$, is defined as the time required to turn $Q_1$ ON. The model of Fig. 5.5(a) can be used in this case. Writing the current equation at the base node of $Q_1$, we have

$$(C_{p1} + C_{pQ}) \frac{dV_{BE}}{dt} = I_{DS,sat} \qquad (5.6)$$

Solving that equation and assuming that initially the base-emitter voltage of $Q_1$ is zero, we have

$$t_1 = (C_{p1} + C_{pQ}) \frac{V_{BE,on}}{I_{DS,sat}} \qquad (5.7)$$

If the initial $V_{BE}$ is not zero, then the above expression should be corrected. A typical value of $t_1$ is 17.5 ps for a total parasitic capacitance at the base node of 50 fF, $V_{BE,on}$ = 0.7 V, and $I_{DS,sat}$ = 2 mA.
2. The second component, $t_2$, is defined as the time required to charge the diffusion capacitance, $C_{D,Q_1}$. Starting from $t_1$, the collector current begins to rise quickly and then reaches its peak value, $I_{CP}$. The output voltage changes slowly (see waveforms of Fig. 5.4). So $t_2$ is then defined as the time required for the collector current to reach its peak. This delay component is given by

$$t_2 I_{DS,sat} = \tau_f I_{CP} \qquad (5.8)$$

which means that the charge furnished by the PMOS is needed to charge the diffusion capacitance. Therefore,

$$t_2 = \frac{\tau_f I_{CP}}{I_{DS,sat}} \qquad (5.9)$$

The peak collector current of $Q_1$ can be approximated using Equation (3.111) [Section 3.5.2]. So we have

$$I_{CP} = \sqrt{\beta_0 I_{KF} I_{DS,sat}} \qquad (5.10)$$

where $\beta_0$ is the value of the gain for low-level injection and $I_{KF}$ is the forward knee current. Note that $\tau_f$ is increased by the collector current [see Equation (3.127), Section 3.5.3]. Hence, an average value of the forward transit time should be used in the above delay expression. The initial value of $\tau_f$ is 12 ps and it can reach 50 ps when the collector current reaches, for example, 5 mA. For $I_{DS,sat}$ = 2 mA, a typical value for $t_2$ is 78 ps (the average forward transit time is 31 ps).
3. The third component, $t_3$, is defined as the time required to charge the total load capacitance to the middle point of the output swing. If we assume that the voltage across the base-emitter of $Q_1$ is almost constant, then we have the following approximation
(5.11)
If we assume that $I_{C,Q_1}$ is constant at $I_{CP}$ during this time [see Fig. 5.4], and the mid-point of the output is $V_{DD}/2$, then we have

$$t_3 = \frac{C_o V_{DD}}{2 \left(I_{DS,sat} + I_{CP}\right)} \qquad (5.12)$$

The value of this delay varies by more than an order of magnitude depending on the device sizes and the load capacitance. For example, for a load $C_L$ of 1 pF, this delay, $t_3$, has a typical value of 400 ps at 5 V power supply voltage, while for a load of 100 fF a typical value is 70 ps.
Hence, the total delay $t_d$ can be written as

$$t_d = t_1 + t_2 + t_3 \qquad (5.13)$$
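The three components can be evaluated with the typical numbers quoted above. The Python sketch below is only a first-order illustration. The function names are hypothetical, and the values of $\beta_0$ and $I_{KF}$ are assumptions chosen so that the peak collector current comes out at 5 mA. The expression used for $t_3$ is the constant-current form implied by the delay-load slope quoted later in this section.

```python
import math


def t_turn_on(c_base, v_be_on, i_ds_sat):
    """t1, Eq. (5.7): charge the base-node parasitics up to V_BE,on."""
    return c_base * v_be_on / i_ds_sat


def peak_collector_current(beta0, i_kf, i_ds_sat):
    """I_CP, Eq. (5.10): high-injection estimate sqrt(beta0 * I_KF * I_DS,sat)."""
    return math.sqrt(beta0 * i_kf * i_ds_sat)


def t_diffusion(tau_f_avg, i_cp, i_ds_sat):
    """t2, from Eq. (5.8): t2 * I_DS,sat = tau_f * I_CP."""
    return tau_f_avg * i_cp / i_ds_sat


def t_load(c_out, v_dd, i_ds_sat, i_cp):
    """t3 (assumed form): charge c_out to VDD/2 with a constant current I_DS,sat + I_CP."""
    return c_out * v_dd / (2.0 * (i_ds_sat + i_cp))


# Values quoted in the text.
I_DS_SAT = 2e-3      # PMOS saturation current, A
C_BASE = 50e-15      # total parasitic capacitance at the base node, F
V_BE_ON = 0.7        # bipolar turn-on voltage, V
TAU_F_AVG = 31e-12   # average forward transit time, s
V_DD = 5.0           # supply voltage, V
# Assumed values (not from the text).
BETA0 = 100.0        # low-level current gain
I_KF = 0.125e-3      # forward knee current, A
C_OUT = 1e-12        # total output capacitance, F

i_cp = peak_collector_current(BETA0, I_KF, I_DS_SAT)
t1 = t_turn_on(C_BASE, V_BE_ON, I_DS_SAT)
t2 = t_diffusion(TAU_F_AVG, i_cp, I_DS_SAT)
t3 = t_load(C_OUT, V_DD, I_DS_SAT, i_cp)
total = t1 + t2 + t3

for name, value in (("t1", t1), ("t2", t2), ("t3", t3), ("td", total)):
    print(f"{name} = {value * 1e12:.1f} ps")
```

With these inputs, $t_1$ and $t_2$ reproduce the 17.5 ps and roughly 78 ps quoted above. The $t_3$ value is of the same order as the quoted 400 ps for a 1 pF load, but it should be read as an estimate rather than as a reproduction of the simulated figure.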
The first delay is associated with the parasitics at the base, the second one with the forward transit time, and the last one is a function of the load capacitance. For small loads, $t_1$ and $t_2$ dominate. However, for large output loads, the third delay term, $t_3$, dominates. The expression of the pull-down time is similar to that of the pull-up time except for the value of the drain current of the transistor N [see Fig. 5.2]. The saturation current of this device is given by

$$I_{DS,sat} = K_n C_{ox} v_{sat,n} W_n \left(V_{GS} - V_{Tn}\right) \qquad (5.14)$$

The $V_{GS}$ for the NMOS during the switching is affected by the $V_{BE}$ drop, while that of the PMOS is not. This voltage is given by

$$V_{GS} = V_{in,h} - V_{BE} \qquad (5.15)$$

So the effective gate-source voltage of the NMOS is lower than that of the PMOS. The sizing of the NMOS and PMOS devices does not follow the rule used for CMOS. It can only be determined from circuit simulation to get symmetrical rise/fall delay times.
The slope of the delay-versus-load characteristic of the BiCMOS gate is smaller than that of CMOS, since it is equal to $V_{DD}/[2(I_{DS,sat} + I_{CP})]$. For a CMOS gate, the slope is simply $V_{DD}/(2 I_{DS,sat})$. The saturation current in the CMOS is slightly higher than that of BiCMOS because the CMOS inverter has a slightly wider PMOS device [see next Section]. However, the slope of the BiCMOS inverter is smaller due to the large $I_{CP}$. Therefore, the BiCMOS gate has a higher drivability than CMOS.
5.1.3 CMOS and BiCMOS Comparison
Let us compare the delay of the BiCMOS gate to that of the CMOS gate, both having the same input capacitance. We consider the case of inverters with the following sizes. For the BiCMOS inverter, we have: $W_p = W_n$ = 10 µm, $W_{N_{d1}} = W_{N_{d2}}$ = 2 µm, and the emitter area is x2 the minimum area. For the CMOS inverter, we have $W_p$ = 15 µm and $W_n$ = 7 µm. For unloaded inverters, and from the delay expression of the BiCMOS inverter discussed above, $t_{d,CMOS} < t_{d,BiCMOS}$ because the BiCMOS circuit has more parasitics and requires an initial delay to turn ON the bipolar device. For large loads, $t_{d,CMOS} > t_{d,BiCMOS}$, as explained previously. Fig. 5.6 shows the simulated delays of the CMOS and BiCMOS inverters as a function of the fanout. Fanout is defined here as the ratio of the load seen by the gate to the input capacitance. In other words, fanout is equal to the number of gates connected to the output of the driving gate, all having the same input capacitance. The inputs are driven by a small-size inverter of the same type to have typical input waveform fall/rise times. For low fanout, 1 to 2, CMOS outperforms BiCMOS at 5 V power supply voltage. However, when the fanout is greater than 3, BiCMOS outperforms CMOS, particularly for high loads. In Fig. 5.6, the crossover capacitance (or fanout), denoted $C_x$, is typically in the order of 100 fF. This crossover value is critical for the performance of BiCMOS, particularly when the supply voltage is scaled down.
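The existence of a crossover load follows directly from the first-order delay expressions: the BiCMOS gate pays a fixed turn-on overhead but has a smaller delay-versus-load slope. The Python sketch below solves for the load at which the two delays are equal. It is a simplified model, not the simulation behind Fig. 5.6. The function names are hypothetical, all current and overhead values are assumed, and the intrinsic (unloaded) CMOS delay is neglected.

```python
def cmos_delay(c_load, v_dd, i_sat):
    """First-order CMOS delay to VDD/2 with a constant drive current."""
    return c_load * v_dd / (2.0 * i_sat)


def bicmos_delay(c_load, v_dd, i_ds_sat, i_cp, t_fixed):
    """First-order BiCMOS delay: fixed turn-on overhead plus a load-dependent term."""
    return t_fixed + c_load * v_dd / (2.0 * (i_ds_sat + i_cp))


def crossover_load(v_dd, i_cmos, i_ds_sat, i_cp, t_fixed):
    """Load capacitance at which both first-order delays are equal.
    Requires i_cmos < i_ds_sat + i_cp so that the CMOS slope is the larger one."""
    slope_cmos = v_dd / (2.0 * i_cmos)
    slope_bicmos = v_dd / (2.0 * (i_ds_sat + i_cp))
    return t_fixed / (slope_cmos - slope_bicmos)


# All numbers below are assumed for illustration only.
V_DD = 5.0         # supply voltage, V
I_CMOS = 3e-3      # CMOS drive current, A
I_DS_SAT = 2e-3    # BiCMOS MOS drive current, A
I_CP = 5e-3        # BiCMOS peak collector current, A
T_FIXED = 95e-12   # BiCMOS turn-on overhead (t1 + t2), s

c_x = crossover_load(V_DD, I_CMOS, I_DS_SAT, I_CP, T_FIXED)
print(f"crossover load = {c_x * 1e15:.0f} fF")

for c_load in (c_x / 4, c_x, 4 * c_x):
    t_c = cmos_delay(c_load, V_DD, I_CMOS)
    t_b = bicmos_delay(c_load, V_DD, I_DS_SAT, I_CP, T_FIXED)
    print(f"C = {c_load * 1e15:6.0f} fF: CMOS {t_c * 1e12:6.1f} ps, BiCMOS {t_b * 1e12:6.1f} ps")
```

Below the crossover the CMOS gate is faster, and above it the BiCMOS gate wins. Because the fixed overhead and the currents depend on the supply voltage, the crossover moves as the supply is scaled, which is the concern raised at the end of the paragraph above.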
5.1.4 Power Dissipation
As discussed, the BiCMOS gate of Fig. 5.2 has no DC current path from $V_{DD}$ to $V_{SS}$ if the input has rail-to-rail swing. Hence the static power dissipation is negligible if $V_T$ of the MOS devices is high. The dynamic power dissipation of the gate can be estimated from the circuit diagram of Fig. 5.7.
It is estimated by

$$P_d = C_{p1} V_{DD}^2 f + C_{p2} V_{BE,max}^2 f + C_o V_{DD} (V_H - V_L) f \qquad (5.16)$$

The first term is due to the total parasitic capacitance at the base node of $Q_1$, where the swing is $V_{DD}$. The second term is due to the parasitic capacitance at the base node of $Q_2$. The swing at this node is limited to $V_{BE,max}$, which is reached when the collector current reaches its peak. Finally, the third term is related to the output load capacitance, $C_L$, and the parasitic capacitance at the output. The swing is only $V_H - V_L$, where $V_H$ and $V_L$ are the high level and the low level of the output, respectively. These levels are affected by the output load.
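To see how the three terms of Equation (5.16) compare, the following Python sketch evaluates them separately. The function name and every capacitance and voltage value are assumptions chosen only to give plausible orders of magnitude. They are not taken from the text or from Table 5.1.

```python
def bicmos_dynamic_power(c_p1, c_p2, c_out, v_dd, v_be_max, v_high, v_low, freq):
    """Return the three terms of Eq. (5.16) in watts."""
    base_q1 = c_p1 * v_dd ** 2 * freq                # base node of Q1, swing VDD
    base_q2 = c_p2 * v_be_max ** 2 * freq            # base node of Q2, swing V_BE,max
    output = c_out * v_dd * (v_high - v_low) * freq  # output node, swing VH - VL
    return base_q1, base_q2, output


# Assumed example: 50 fF on each base node, 400 fF at the output,
# 5 V supply, output swinging between 0.7 V and 4.3 V, 100 MHz.
base_q1, base_q2, output = bicmos_dynamic_power(
    c_p1=50e-15, c_p2=50e-15, c_out=400e-15,
    v_dd=5.0, v_be_max=0.9, v_high=4.3, v_low=0.7, freq=100e6,
)
total = base_q1 + base_q2 + output

print(f"base of Q1: {base_q1 * 1e3:.4f} mW")
print(f"base of Q2: {base_q2 * 1e3:.4f} mW")
print(f"output    : {output * 1e3:.4f} mW")
print(f"total     : {total * 1e3:.4f} mW")
```

Under these assumptions the base node of $Q_2$ contributes very little because its swing is limited to about one $V_{BE}$, while the full-swing base node of $Q_1$ adds an overhead that a plain CMOS gate does not have. This is consistent with the BiCMOS power penalty being most visible at small loads.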
For small loads the power of BiCMOS is greater than that of CMOS, while for large loads they have almost the same dynamic power. Table 5.1 shows the simulation results of the power dissipation for both gates at 5 V power supply. At a fanout of 1, CMOS consumes much lower power than BiCMOS and it is faster. However, at a fanout of 10, the BiCMOS is faster (37.5% delay reduction) and it dissipates only 24% more power than CMOS.

When a BiCMOS gate is driving another BiCMOS gate, or a CMOS gate, the driven gate exhibits a DC power dissipation. This is due to the reduced swing at the output of the first gate. This DC current is not acceptable, particularly when the circuit is in standby mode. Fig. 5.8 shows an example of a BiCMOS gate driving a CMOS gate. If, for example, the output of the first gate (BiCMOS) is at $V_{BE}$, the $V_{GS}$ of the driven NMOS would be higher than zero and around $V_T$, resulting in appreciable DC power. Furthermore, the drive current of the driven gate would be reduced, particularly at low power supply voltage. Another disadvantage of the reduced swing is the noise margin reduction.
Table 5.1 CMOS/BiCMOS power dissipation versus load at VDD = 5 V and f = 100 MHz

Driver         Fanout=1    Fanout=5    Fanout=10
CMOS (mW)      0.23        0.58        1.02
BiCMOS (mW)    0.67        0.83        1.26
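The power overhead of BiCMOS relative to CMOS can be read off Table 5.1 with a few lines of Python. The row assignment used below (CMOS being the lower value in each pair) is an assumption inferred from the statements in the text that CMOS consumes much less at a fanout of 1 and that BiCMOS dissipates about 24% more at a fanout of 10.

```python
# Power in mW from Table 5.1, keyed by fanout (row assignment assumed, see above).
power_mw = {
    "CMOS": {1: 0.23, 5: 0.58, 10: 1.02},
    "BiCMOS": {1: 0.67, 5: 0.83, 10: 1.26},
}

# Extra power of BiCMOS relative to CMOS, in percent, for each fanout.
overhead_percent = {
    fanout: (power_mw["BiCMOS"][fanout] / power_mw["CMOS"][fanout] - 1.0) * 100.0
    for fanout in power_mw["CMOS"]
}

for fanout, extra in overhead_percent.items():
    print(f"fanout {fanout:2d}: BiCMOS uses {extra:.0f}% more power than CMOS")
```

The overhead shrinks steadily as the fanout grows, since the load term of the dynamic power is common to both gates while the extra internal node capacitance of the BiCMOS gate is fixed.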
5.1.5 Full-Swing with Shunting Devices

Previously we have seen that BiCMOS circuits exhibit reduced output swing. To overcome these shortcomings, various types of BiCMOS gates have been devised. These are based on the conventional BiCMOS circuits with base-emitter or collector-emitter shunting techniques, or on other logic circuits which will be discussed in the following sections. Figure 5.9 shows some of the circuits based on shunting devices. Fig. 5.9(a) illustrates one full-swing (FS) configuration called the "FS type" gate [8], which uses MOS devices to achieve full swing. For the charging phase, as the output exceeds $V_H$, $Q_1$ ceases to source current to the load, and the load capacitance is charged through the shunting PMOS transistor $P_s$. When the input goes HIGH, the load is discharged through $Q_2$ and $N_s$. When $V_o$ falls below $V_L$, $Q_2$ ceases to sink current from the load capacitance. Then the output is discharged to ground through only the MOS transistors N and $N_s$. The final charging and discharging phases occur through the shunting devices. Hence, these phases can be slow because the MOS shunting devices have low drive capabilities. When this FS BiCMOS gate is operating at high frequency, the output swing can be reduced. Another drawback of this circuit is that part of the current supplied by P (N) is wasted through the shunting transistors, which weakens the bipolar drive. The shunting transistors $P_s$ and $N_s$ can be minimum size.

The problem of the base drive inherent in the "FS type" BiCMOS gate can be overcome by using feedback (FB) from the output through an inverter, as shown in Fig. 5.9(b). This circuit is called "FB type" [9]. During the pull-up transition, the shunting device $P_s$ is initially OFF and the PMOS transistor P supplies all its current to the base of $Q_1$. When $V_o$ is approaching its high level, the inverter I turns ON $P_s$, which itself charges the output node to $V_{DD}$. The pull-down transition can be explained similarly. The shunting devices $P_s$ and $N_s$ and the inverter I can be sized properly to achieve greater speed than the other configurations, even the conventional BiCMOS gate.

Figure 5.8 DC power dissipation: Gate 1 (BiCMOS) driving Gate 2 (CMOS).
Figure 5.9 Full-swing BiCMOS gate types: (a) "FS type"; (b) "FB type"; (c) "CE shunting type".
Another full-swing configuration is the one shown in Fig. 5.9(c). It uses a parallel inverter from the input to shunt the collector-emitter (CE) of the output transistors Q1 and Q2. The disadvantage of this gate is the increased input capacitance.
5.1.6 Power Supply Voltage Scaling
The output bipolar stage introduces VBE voltage losses at the output node, as discussed earlier. When a BiCMOS gate is driving another BiCMOS gate, the conventional BiCMOS gate loses its superior performance over CMOS at lower power supply voltage. The major cause of this problem is the pull-down section of the BiCMOS gate. The VGS voltage of the driving NMOS transistor of the pull-down section is equal to VDD - 2VBE. As VDD is reduced, VGS is significantly reduced, resulting in degradation of the drain current and hence of the driving capability of the conventional BiCMOS gate. Fig. 5.10 shows the delay of a BiCMOS inverter in comparison to that of a CMOS inverter as the supply voltage is scaled down. The reported delay times were extracted from SPICE simulation by measuring the delay of the second gate in a chain of identical inverters. All gates were equally loaded by a load CL = 0.25 pF and one fanout. All the circuits have the same input capacitance. The BiCMOS inverter fails to operate at a 2 V power supply. The BiCMOS outperforms CMOS, but at 3 V and below it loses its superior performance. The limit of operation of the conventional BiCMOS gate with the power supply voltage is determined by the NMOS device of the pull-down section. The drive current of this NMOS device depends on (VDD - 2VBE - VTn). Hence, VDD,min ≈ 2.2 V. Therefore, high-performance BiCMOS circuits are needed at low voltage that minimize:
- technology/process complexity;
- circuit complexity, by using a lower device count;
- area occupied by the gate; and
- power dissipation.
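
The supply-voltage limit of the conventional BiCMOS pull-down section can be illustrated with a short calculation. This is a minimal sketch: the values VBE = 0.7 V and VTn = 0.8 V are assumptions chosen to be consistent with the 2.2 V limit quoted in this section, and the function names are hypothetical.

```python
def nmos_overdrive(vdd: float, vbe: float = 0.7, vtn: float = 0.8) -> float:
    """Gate overdrive (V) of the pull-down NMOS: VDD - 2*VBE - VTn.

    vbe and vtn are assumed typical values, not data from the text.
    """
    return vdd - 2.0 * vbe - vtn


def min_supply(vbe: float = 0.7, vtn: float = 0.8) -> float:
    """Supply voltage at which the pull-down NMOS overdrive reaches zero."""
    return 2.0 * vbe + vtn


if __name__ == "__main__":
    for vdd in (5.0, 3.3, 2.5, 2.0):
        print(f"VDD = {vdd:.1f} V -> overdrive = {nmos_overdrive(vdd):+.2f} V")
    print(f"VDD,min = {min_supply():.1f} V")
```

With these assumed values the overdrive falls from 2.8 V at a 5 V supply to a negative value at 2 V, which is why the conventional gate stops operating there.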
5.2 BINMOS LOGIC FAMILY

BiCMOS technology can gain much of its performance edge over CMOS with circuit techniques that minimize or eliminate the effects of VBE losses. To overcome the problem of delay degradation in conventional BiCMOS with supply voltage, many novel circuits were proposed. In this section, a practical family suitable for the 3.3 V and sub-3.3 V operation regime is outlined. Fig. 5.11 shows the BiNMOS family of BiCMOS circuits. The basic circuit technique used in BiNMOS [10] is the use of the NPN bipolar transistor only in the pull-up section of the output stage [Fig. 5.11(a)]. The pull-down section is kept as CMOS. In CMOS circuits, the PMOS transistor is two to three times slower than an NMOS transistor when the same sizes are compared. In the BiNMOS circuit, the use of the PMOS with the bipolar driver in the pull-up section will balance the unsymmetrical response of CMOS.
In the basic circuit of Fig. 5.11(a), the output reaches only the VDD - VBE level. This increases the delay and power dissipation of the subsequent gates. If a resistor (in this case the gate is called BiRNMOS) or a grounded-gate PMOS transistor is inserted between the emitter and the base of the pull-up bipolar transistor, the output achieves full swing. However, this will degrade the speed of the gate because part of the base current is bypassed by the inserted element and hence the base drive is reduced.
Many alternatives have been proposed, such as BiPNMOS [11] and PBiNMOS [12], to realize full-swing output. The BiPNMOS is shown in Fig. 5.11(c). A small-size PMOS transistor and an inverter are added to the basic BiNMOS gate. The PMOS device realizes full-swing output when the output changes from low to high. The added PMOS, Ps, turns ON only when the output reaches the threshold voltage of the feedback inverter. Hence, the base current supplied by the pull-up PMOS transistor is not affected by this added PMOS transistor. Consequently, the BiPNMOS gate has higher performance than conventional BiNMOS and BiRNMOS. One drawback of the BiPNMOS is the increased output load capacitance due to the inverter I. The PBiNMOS gate configuration, shown in Fig. 5.11(d), uses a small-size PMOS device in parallel with the bipolar pull-up transistor to realize full-swing output. This configuration results in better performance compared to the other circuit structures but slightly increases the input capacitance of the gate. In this section, we show that a properly optimized PBiNMOS gate is faster than CMOS, even at low power supply and load.
5.2.1 BiNMOS Gate Design
In this section we discuss the effect of the circuit parameters available to the designer to optimize the PBiNMOS gate for low-fanout fast operation, using the 0.8 µm BiCMOS device parameters discussed in Chapter 3. We optimize the design of the inverter. Then, the technique can be extended to more complex gates.
Finding the proper sizing of the input MOSFETs P and N (Wp and Wn, respectively) is not trivial. The sizing of the remaining MOS devices, N2 and the small parallel PMOS [see Fig. 5.11(d)], is not critical. For typical applications, it is enough to use near-minimum-size devices. When the delay of the PBiNMOS is plotted versus the width of one of the devices P or N, for different fanouts, a common optimum width exists, as shown in Fig. 5.12(a), with a flattened region. This optimum is due to the fact that when the size is increased, the drivability of the gate increases. However, the equivalent output load also increases. Then, at a certain size, an optimum delay exists. From this figure, the optimum Wp is 9 µm and Wn = 11 µm (particularly for low fanout). Note that in Fig. 5.12(a) we have chosen Wp = 0.8 Wn. This is explained in more detail below.
When the BiNMOS inverter is used as a driver of a fixed load (e.g., a bus) instead of driving gates, we should consider the delay of the driver including the delay of the stage that drives it. In Fig. 5.12(b), the total delay of the PBiNMOS driver and the CMOS inverter that drives it is plotted for two fixed loads: 0.2 pF and 0.5 pF. The CMOS stage has a minimum size. The minimum delay is around the point determined previously for the fanout case. The choice of the emitter area in this gate depends on the technology and the load. For the 0.8 µm BiCMOS at 3.3 V power supply voltage, it was found that using the minimum emitter area (AE × 1 = 0.8 × 4 µm²) gives the minimum delay for the range of loads ≤ 1 pF.
Fig. 5.13 shows that the optimal Wp/Wn ratio is the same for different fanouts and is equal to 0.8. This point also gives almost symmetrical fall/rise delays. So even if the fanout is unknown, the optimum gate is fixed and the sizes depend only on the device parameters. This result is very important for standard cells and gate arrays, where the cells are designed with unknown loads.
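
The sizing rule above (a fixed total input width split with Wp/Wn = 0.8) can be turned into a small helper. This is only a sketch under stated assumptions: the total width of 20 µm is the value annotated on Fig. 5.13, and the function name is hypothetical.

```python
def split_widths(total_um: float, ratio: float) -> tuple[float, float]:
    """Split a total input width between P and N so that Wp / Wn == ratio.

    total_um is Wp + Wn in micrometres; returns (Wp, Wn) in micrometres.
    """
    if total_um <= 0 or ratio <= 0:
        raise ValueError("total width and ratio must be positive")
    wn = total_um / (1.0 + ratio)
    wp = ratio * wn
    return wp, wn


if __name__ == "__main__":
    wp, wn = split_widths(20.0, 0.8)
    print(f"Wp = {wp:.2f} um, Wn = {wn:.2f} um")
```

For a 20 µm total and a ratio of 0.8 the result rounds to the 9 µm and 11 µm widths quoted in the text.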
Figure 5.13 The delay of the PBiNMOS inverter versus the ratio Wp/Wn for a fixed input capacitance (VDD = 3.3 V, Wp + Wn = 20 µm).
Figure 5.14 Comparison of the CMOS and PBiNMOS delays, for the same input capacitance, as a function of the fanout.
5.2.2 CMOS and BiNMOS Comparison
Fig. 5.14 shows the delay of CMOS and PBiNMOS inverters as a function of the fanout. Both gates have the same input capacitance. The important result of this plot is that the PBiNMOS gate is always faster than CMOS, although for a fanout of 1 it is only slightly faster. For a fanout of 3, which is a typical value in many designs, the delay is reduced by 20%. For a higher fanout, the delay is reduced by 25-40%. This result is quite different from the case of conventional BiCMOS, where a high fanout (or load) is required for BiCMOS to outperform CMOS.
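
A delay reduction expressed in percent is easy to misread as a speedup factor. As a small illustration (the helper name is an assumption introduced here), a reduction of r percent corresponds to a speed ratio of 1 / (1 - r/100):

```python
def speedup_from_delay_reduction(reduction_percent: float) -> float:
    """Convert a delay reduction (percent) into a speed ratio t_old / t_new."""
    if not 0.0 <= reduction_percent < 100.0:
        raise ValueError("reduction must be in the range [0, 100)")
    return 1.0 / (1.0 - reduction_percent / 100.0)


if __name__ == "__main__":
    for r in (20.0, 25.0, 40.0):
        print(f"{r:.0f}% less delay -> {speedup_from_delay_reduction(r):.2f}x faster")
```

So the 20% reduction at a fanout of 3 corresponds to a gate that is 1.25 times faster, and the 25-40% range at higher fanout corresponds to roughly 1.33 to 1.67 times.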
Let us compare the power dissipation of the gates for different fanouts. Table 5.2 shows this comparison for small fanouts. The power dissipations of both gates are comparable, and nearly the same for a fanout of 3 or more. The small-size additional bipolar transistor in the BiNMOS gate does not result in significant power dissipation overhead. This result shows that the BiNMOS family is an excellent choice for low-power and high-speed operation. However, for a fanout of 1-2, CMOS can still be used.
Table 5.2 CMOS/PBiNMOS power dissipation versus fanout at VDD = 3.3 V and f = 100 MHz

Driver         Fanout=2   Fanout=3   Fanout=5
CMOS (µW)      149        192        277
PBiNMOS (µW)   171        203        287
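
If the dissipation in Table 5.2 is treated as purely dynamic, P = C·VDD²·f, each entry can be converted into an effective switched capacitance and an energy per cycle. This is a rough sketch only: it assumes one full charge/discharge per clock period and neglects short-circuit and static currents, and the function names are hypothetical.

```python
def effective_capacitance(power_w: float, vdd: float, freq_hz: float) -> float:
    """Effective switched capacitance (F), assuming P = C * VDD^2 * f."""
    return power_w / (vdd ** 2 * freq_hz)


def energy_per_cycle(power_w: float, freq_hz: float) -> float:
    """Average energy drawn from the supply per clock cycle (J)."""
    return power_w / freq_hz


if __name__ == "__main__":
    vdd, freq = 3.3, 100e6
    # CMOS and PBiNMOS at the lowest fanout listed in Table 5.2 (values in W).
    for name, power in (("CMOS", 149e-6), ("PBiNMOS", 171e-6)):
        c_eff = effective_capacitance(power, vdd, freq)
        e_cyc = energy_per_cycle(power, freq)
        print(f"{name}: C_eff = {c_eff * 1e15:.0f} fF, E = {e_cyc * 1e12:.2f} pJ/cycle")
```

Under these assumptions the difference between the two gates at low fanout amounts to a few tens of femtofarads of extra switched capacitance.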
5.2.3 BiNMOS Logic Gates

Since the PBiNMOS is used extensively in 3.3 V digital integrated circuits, some logic gates are presented. Combinational PBiNMOS logic circuits are easily constructed using the basic PBiNMOS inverter of Fig. 5.11(d). Two-input NOR and NAND gates are shown in Fig. 5.15(a) and Fig. 5.15(b). The logic function is implemented using the PMOS and NMOS blocks, as in CMOS technology. The bipolar device Q1 is used as a current driver. More complex functions can be implemented using standard CMOS gate formation theory. The layout of the PBiNMOS inverter is shown in Fig. 5.16. The BJT consumes area in the PBiNMOS gate. However, when complex gates are implemented with more MOS devices, the relative extra area of the BJT is reduced.
Figure 5.15 Circuit schematics of: (a) PBiNMOS NOR2; (b) PBiNMOS NAND2.
One technique to reduce the area penalty of the BJT is to use merged N-well bipolar and PMOS devices.
5.2.4 Power Supply Voltage Scaling
For future technologies, the power supply voltage will be scaled below 3.3 V. Fig. 5.17 shows the delay of PBiNMOS and CMOS inverters, for a fanout of 3, versus the power supply voltage. The reported delay times were extracted from SPICE simulation by measuring the delay of the second gate in a chain of identical inverters. In this case, the full-swing operation at the input of a PBiNMOS inverter is provided by an identical gate, where a shunting PMOS is used. Fig. 5.17 shows that PBiNMOS is faster than CMOS down to 2.5 V. At 2.5 V the delay reduction is 15%. The crossover power supply voltage between PBiNMOS and CMOS is around 2.15 V. Note that in this comparison we used a 0.8 µm BiCMOS technology optimized for 5 V operation. For a fair comparison of BiNMOS to CMOS at low voltage, a deep-submicron technology should be used. From the device-level point of view, scaled technology is expected to improve the performance of BiNMOS at low voltage. However, 2 V is the limit of the use of BiNMOS, since almost half of the swing at sub-2 V is provided by the poor shunting PMOS device. In summary, the BiNMOS family provides the following advantages:
- a simple gate compared to other BiCMOS logic circuits;
- good performance at the 3.3 and 2.5 V power supply voltage generations, even at low fanout; and
- the need for only a simple BiCMOS process.
Figure 5.16 Layout of the PBiNMOS inverter. Legend: N-well, N-plug, N+ diffusion, P+ diffusion, gate, P-base, metal 1, metal 2, emitter, contact, via 1.
The only disadvantage of BiNMOS is its poor performance for sub-2 V operation. The small area penalty of BiNMOS is not a problem, since for complex gates the overhead of the bipolar device is minimized.
5.3 LOW-VOLTAGE BICMOS FAMILIES

In this section, several BiCMOS logic circuits proposed for low-voltage, high-speed digital applications are reviewed [13]. Many of these circuits have not been widely used in BiCMOS products. However, some of the logic circuits presented in this section exhibit high performance at low voltage, down to 1 V.
For fast operation at low voltage, the full-swing operation should be realized with bipolar devices. Otherwise, the techniques based on shunting devices do not provide high drivability.
5.3.1 Merged and Quasi-Complementary BiCMOS Logic
In this section two circuit techniques to overcome the shortcomings of the conventional BiCMOS gate are discussed and compared. These gates are intended to be used for sub-3.3 V operation. They were also devised to solve the problem of using a PNP transistor (see the next section on Complementary BiCMOS). In all these circuits, the improvement is made mainly to the pull-down section of the conventional BiCMOS, since it is the major cause of speed degradation at low voltage.
5.3.1.1 Merged BiCMOS (MBiCMOS)

To improve the performance of the pull-down section of the conventional BiCMOS circuit with power supply scaling, a PMOS/NPN pull-down BiCMOS gate [14] has been proposed, as shown in Fig. 5.18. In this pull-down configuration, a PMOS transistor, P2, is used to drive the NPN bipolar transistor, Q2. The gate of the PMOS P2 is tied to the base of Q1. The CMOS inverter formed by the transistors P1 and N1 supplies a rail-to-rail voltage swing to the pull-down PMOS. Hence, the VGS voltage of the driving PMOS transistor is not affected by the VBE loss, as it is in the case of conventional BiCMOS. This gate is called Merged BiCMOS (MBiCMOS) because it offers the possibility of merging the PMOS/NPN devices. The pull-up section is similar to the one in conventional BiCMOS. The operation of the pull-down section is as follows. When the input is high, N1 pulls the base of Q1 down to ground and P2 turns ON. The transistor P2 supplies the base current to Q2. The bipolar transistor Q2 discharges the load capacitance to a low voltage equal to or less than VBE,on. Still, this structure suffers from the 2VBE losses. The only improvement in MBiCMOS, compared to conventional BiCMOS, is the higher drive current of the pull-down section. If the N-well of the pull-down PMOS transistor is tied to the VDD rail, its threshold voltage will experience a degradation due to the body effect during the pull-down transient. As a result, the drivability of the pull-down PMOS transistor is degraded. A simple solution to eliminate this problem is to shunt the source and the substrate of the PMOS transistor, P2.
Figure 5.18 The MBiCMOS gate.
It was shown that this configuration (with shunted source/substrate) is faster than its CMOS counterpart down to a 2.2 V supply voltage using sub-0.5 µm BiCMOS technology [15, 16].
5.3.1.2 Quasi-Complementary BiCMOS

Another variation of the MBiCMOS is called "Quasi-complementary BiCMOS" (QCBiCMOS) [17]. A "quasi-PNP" connection is generated in the pull-down section of the conventional BiCMOS, as shown in Fig. 5.19. It consists of PMOS and NPN transistors [Fig. 5.19(b)]. This configuration resembles the MBiCMOS gate of Fig. 5.18. The QCBiCMOS has two attractive features. The first one is that the drain current of the pull-down section does not suffer from the 2VBE losses, as in the case of conventional BiCMOS. The second one is that the pull-down waveform is steep, due to the good charge retention capability of the bipolar transistor. The feedback circuit formed by the two cross-coupled inverters, I1 and I2, permits the discharge of the base of the pull-down transistor immediately after the pull-down transition. The QCBiCMOS gate keeps its superiority over CMOS down to 2 V. At 2 V it has better performance than the BiNMOS logic circuit. However, for sub-2 V operation it loses its performance. Furthermore, it consumes a large area and needs a relatively large fanout to outperform CMOS.
5.3.2 Emitter Follower Complementary BiCMOS Circuits
Full-swing operation can also be achieved by using what is called the Complementary BiCMOS (CBiCMOS). The use of complementary BiCMOS has been encouraged by the recent advances in bipolar technology, which have led to high-performance PNP transistors. It is expected that the NPN and PNP transistors will exhibit close performance when the devices are scaled down and the base doping increases. In this section, we study the emitter-follower (EF) CBiCMOS. Fig. 5.20 shows the use of a complementary bipolar output stage to form the basic complementary BiCMOS circuit [18, 19]. The pull-up section is similar to the conventional BiCMOS. The pull-down section is symmetrical to the pull-up. The current of the NMOS transistor N does not suffer from the VGS reduction due to Q2, as in conventional BiCMOS. The static swing varies between VBE and VDD - VBE. However, as explained in Section 5.1.2, the actual swing might be larger than the static design value. The balanced transconductance of the PMOS/NPN and NMOS/PNP pairs makes it easier to obtain symmetrical fall and rise times. Hence this circuit eliminates the degradation of the pull-down delay with power supply voltage seen in the conventional BiCMOS.
Figure 5.20 Schematic of the basic CBiCMOS.
The gate of Fig. 5.20 can be modified to achieve full-swing operation by using emitter-base shunting devices. Fig. 5.21(a) shows the EF CBiCMOS with the shunting technique. The MOS transistors shunting the base-emitter junctions permit restoration of the full logic level of the output. But the full swing is still achieved with the two slow MOS devices. Some of the base current can be consumed by the shunting devices, which weakens the drive of Q1 and Q2. To overcome this problem, the feedback technique can be used, as shown in the circuit of Fig. 5.21(b). The turn-ON of the shunting devices is delayed by the feedback inverter, I. These CBiCMOS circuits have two drawbacks: poor performance at 2 V power supply voltage and below, and high processing cost because of the high-performance PNP device needed. The low performance at low voltage is due mainly to the fact that 2VBE of the output swing is generated by the two shunting transistors.
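
The cost of relying on shunting devices grows quickly as the supply is scaled, because the part of the swing they must provide is fixed by the junction drops. The sketch below estimates that fraction as n·VBE/VDD; the value VBE = 0.7 V and the function name are assumptions made for illustration, not figures from the text.

```python
def shunt_fraction(vdd: float, vbe: float = 0.7, junction_drops: int = 2) -> float:
    """Fraction of the output swing supplied by the MOS shunting devices.

    junction_drops is 2 for the EF CBiCMOS gate (one VBE lost at each rail)
    and 1 for a gate that loses a single VBE. vbe is an assumed value.
    """
    if vdd <= 0:
        raise ValueError("vdd must be positive")
    return min(1.0, junction_drops * vbe / vdd)


if __name__ == "__main__":
    for vdd in (5.0, 3.3, 2.5, 2.0, 1.5):
        print(f"VDD = {vdd:.1f} V: shunts supply {100 * shunt_fraction(vdd):.0f}% of the swing")
```

With the assumed 0.7 V drop, the shunting transistors already provide 70% of the swing at 2 V, which is consistent with the poor performance of these gates at 2 V and below.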
5.3.3 Full-Swing Common-Emitter Complementary BiCMOS Circuits

So far all the presented full-swing circuits, such as PBiNMOS, CBiCMOS, MBiCMOS and QCBiCMOS, achieve the rail-to-rail swing by using resistors or MOSFETs that operate in the linear region. These techniques are effective only when the operating frequency is low, where the gate can complete its full-swing operation, and/or when the load capacitance is small [20]. Full-swing circuits with full bipolar drive are needed. In this section, a CBiCMOS variation suitable for sub-2 V operation, called Transient Saturation (TS), is presented.
Figure 5.21 Schematic of EF CBiCMOS gates with shunting devices.
Fig. 5.22 shows the basic common-emitter complementary BiCMOS (CE CBiCMOS) circuit. The circuit is symmetrical and has symmetrical fall and rise times. When the input goes high, N turns ON to sink the current from the base of the PNP transistor Q2. When the base voltage of Q2 falls to VDD - VBE,on, Q2 turns ON to source the current to the output load capacitance. Q2 eventually saturates and the output node is pulled up to VDD - VCE,sat. At the end of charging, the MOS device is still consuming current. The operation of the pull-down section can be explained similarly. Hence, the operation of CE CBiCMOS is non-inverting, and the gate needs an extra CMOS inverter at the input to achieve the complement function. In this circuit, the MOS transistors operate in saturation, hence they supply high current to the bipolar transistors. Furthermore, the output has a near rail-to-rail swing (VCE,sat to VDD - VCE,sat). This circuit offers high speed at low voltage, but has two drawbacks: (i) the high static power dissipation, due to the DC current flowing through the base of either Q1 or Q2, and (ii) the excess delay due to the slow process of turning the saturated BJTs OFF.
Figure 5.22 Common-emitter CBiCMOS gate.
These two problems have been solved with several implementations [21, 22]. One possible implementation is shown in Fig. 5.23. It is called Transient Saturation Full-Swing (TS-FS) BiCMOS. This logic uses the principle of the CE CBiCMOS described in Fig. 5.22. When the input falls, and assuming that the output is charged high, P3 is ON. P2 turns ON and the base of Q1 is charged through P2 and P3 [Fig. 5.23(b)]. Consequently, Q1 discharges the output (load). When the output voltage approaches zero, the inverter I1 turns P3 OFF and N4 ON [Fig. 5.23(c)]. The base voltage of Q1 falls below VBE, causing it to turn OFF. Although Q1 saturates, this does not slow the next pull-up transition, because the excess minority carriers of Q1 are discharged immediately after the pull-down operation. Thus, the bipolar transistor saturates only transiently. The circuit is symmetrical, hence the operation of the pull-up section can be explained similarly. The PMOS transistor P3 cuts off the DC current path during the pull-down transition to avoid any static power dissipation. The small-size output latch, composed of the inverters I1 and I2, holds the output level, because in steady state there is no path between the output and the supply lines. Compared to the BiCMOS logic circuits presented so far, TS-FS is faster below a 2 V supply when the load is relatively large (~1 pF). At 1.5 V it is twice as fast as CMOS for large loads. Although this circuit solves the problem of speed degradation of BiCMOS at a 1.5 V power supply, it still has several drawbacks:
- process complexity due to the PNP bipolar transistor;
- large area;
- relatively high crossover point with CMOS (~0.4 pF); and
- it is a non-inverting circuit.
Figure 5.23 (a) Circuit configuration of TS-FS BiCMOS; (b) and (c) transient saturation operation for the pull-down section.
5.3.4 Bootstrapped BiCMOS
An alternative way to avoid the negative effect of the VBE loss in BiCMOS is simply to use a second supply voltage equal to (VDD + VBE). However, this approach is costly because of the additional wires needed to distribute the second supply across the chip and the need to generate it. Another approach is to use a bootstrapping technique to pull up the base of the pull-up bipolar transistor to (VDD + VBE), and hence the output to VDD. The generation of voltages higher than the power supply at the gate level adds an extra degree of freedom to BiCMOS. Schottky BiNMOS/BiCMOS circuit configurations using bootstrapping have been proposed to overcome the negative effect of the VBE loss [20]. The full-swing operation is performed by saturating the bipolar transistor of the pull-up section with a base current pulse, after which the base is isolated and bootstrapped to a voltage higher than VDD. These Schottky circuits outperform all existing BiCMOS families in the sub-3 V regime down to 2 V, but they need a BiCMOS technology with a good integrated Schottky diode. Other examples of such a technique are the bootstrapped BiCMOS circuits published in [23, 24, 25]. The main advantage of the bootstrapped circuits is that they can be realized in a conventional BiCMOS process with CMOS and NPN transistors only. In this section, we present one bootstrapped circuit which overcomes many drawbacks of the BiCMOS logic families discussed previously.
5.3.4.1 Basic Concept of Operation

The Bootstrapped Full-swing BiCMOS (BFBiCMOS) inverter is shown in Fig. 5.24. It consists only of CMOS and NPN transistors. Hence, it can be built in a non-complementary BiCMOS technology. The pull-down circuitry is identical to that of TS-FS and was explained previously. The operating principle of the pull-up section can be explained as follows. When the input is high and the output is low, the PMOS transistor Pd is ON. In this case, the base voltage of Q1 is precharged to VTP, which is less than VBE,on but close to it. The precharge PMOS transistor Pp is ON to charge the bootstrapped capacitor Cboot to the level VDD (precharge cycle). When the input goes low and P1 turns ON, the bipolar transistor Q1 turns ON almost instantaneously because its base-emitter junction is precharged near its turn-on voltage. Consequently, the initial turn-on delay of the pull-up section is reduced. This has an impact on the minimum fanout required by BFBiCMOS to outperform CMOS. Once Q1 turns ON, the output node starts to charge the load capacitor CL toward VDD. Since Pp is OFF, the node n1 is disconnected from VDD and is floating. Thus, as the output voltage Vo rises to VDD, the voltage at node n1 also rises, towards VDD + VBE,on (bootstrapping cycle). When the input is low, the PMOS transistor Pp is turned OFF (almost instantaneously) during the bootstrapping cycle to prevent the bootstrapped node from discharging through a reverse current from n1 to VDD. This is achieved through the use of the pseudo-inverter formed by Pi and Ni. During the bootstrapping cycle (the input is low), Pi turns ON and the gate of the precharge transistor Pp is pulled up towards the voltage of n1. Thus, Pp is completely OFF when the voltage at n1 exceeds VDD. Furthermore, the PMOS transistor Pd is completely OFF because its gate is driven by the boosted voltage through Pi.
Figure 5.24 The bootstrapped full-swing BiCMOS inverter (BFBiCMOS).
Compared to the Bootstrapped BiCMOS (BS-BiCMOS) [23] of Fig. 5.25, the BFBiCMOS has several advantages. First, the bootstrapped capacitor is driven by the output rather than by the input, as in the BS-BiCMOS. Second, in BS-BiCMOS the gate of the precharge transistor Pp is driven to VDD and the node n1 to VDD + VBE. Hence, when VT is lower than VBE, the bootstrapped node leaks its charge, which results in less efficient bootstrapping. Third, a PMOS transistor Pd is used to discharge the base to a precharged level VT, resulting in improved performance. Furthermore, the BS-BiCMOS has a higher crossover capacitance and lower performance than the BFBiCMOS.
Figure 5.25 The BS-BiCMOS inverter.
The simulated waveforms of the BFBiCMOS inverter at a 1.5 V power supply are shown in Fig. 5.26. The base of Q1 goes to (VDD + VBE) when the input is low. Note that when the input is high the base voltage falls to VT.
5.3.4.2 Design Issues

As a first-order analysis, the minimum value of $C_{boot}$ necessary for the bootstrapping condition can be obtained as follows. During the precharge cycle, the charge on the bootstrapped capacitor is $V_{DD} C_{boot}$ and the charge on $C_p$, the parasitic capacitance on the node n1, is $V_{DD} C_p$. The total charge on n1 during the precharge cycle is
$$Q_{n1} = V_{DD} C_{boot} + V_{DD} C_{p} \qquad (5.17)$$
In order for $V_o$ to reach $V_{DD}$, $V_{n1}$ must reach $V_{DD} + V_{BE,on}$ (during the bootstrapping cycle). Thus the charge on $C_p$ is $(V_{DD} + V_{BE,on}) C_p$ and the charge on $C_{boot}$ is $V_{BE,on} C_{boot}$. The new charge is given by
$$Q'_{n1} = V_{BE,on} C_{boot} + (V_{DD} + V_{BE,on}) C_{p} \qquad (5.18)$$
The charge necessary for the base is
$$Q_b = Q_{n1} - Q'_{n1} \qquad (5.19)$$
As an approximation, $Q_b$ can be given by
$$Q_b = I_b t_r \qquad (5.20)$$
where $I_b$ is the average base current of Q1 and $t_r$ is the rise time of the output. From Equations (5.17)-(5.20) we find that
$$C_{boot} = \frac{I_b t_r + V_{BE,on} C_p}{V_{DD} - V_{BE,on}} \qquad (5.21)$$
This equation indicates that $C_{boot}$ has to be increased as the power supply is scaled down. When power supply scaling is accompanied by device scaling, $t_r$ improves and as a result $C_{boot}$ can be kept small. At 3.3 V, a typical value of $C_{boot}$ is 100 fF, while at 1.5 V, without technology scaling, it is equal to 250 fF. The bootstrapped capacitance can be implemented using an NMOS transistor with its source and drain connected together. In this case, the capacitance is related to the area and gate oxide thickness of the MOS transistor. Simulations have shown that for a 1.5 V power supply voltage, the width and length of this bootstrapped NMOS are equal to 13 µm and 6 µm, respectively. A typical area increase for a two-input NAND gate due to $C_{boot}$ is 10%. As shown in Fig. 5.24 for the BFBiCMOS inverter, the N-wells of the PMOS devices Pp, P1 and Pi are connected to the bootstrapped node n1. This prevents their source/drain-well junctions from turning ON during the bootstrapping cycle. It also prevents any latch-up which might be caused by the parasitic SCR when the drain/source-well junctions are forward-biased. The PMOS transistor Pd also has its well connected to its source. This eliminates the body effect of the transistor and prevents any leakage during the bootstrapping.
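
The charge balance of Equations (5.17)-(5.20) can be evaluated numerically. The sketch below solves it for the minimum bootstrap capacitance and, separately, estimates the gate capacitance of an NMOS capacitor from its drawn area. All numerical inputs in the example (base current, rise time, parasitic capacitance, VBE and an oxide thickness of 10 nm) are hypothetical values chosen only for illustration, and the function names are assumptions.

```python
EPS_0 = 8.854e-12   # vacuum permittivity, F/m
EPS_R_SIO2 = 3.9    # relative permittivity of silicon dioxide


def min_boot_cap(vdd: float, vbe: float, i_b: float, t_r: float, c_p: float) -> float:
    """Minimum bootstrap capacitance (F) from the charge balance on node n1.

    Solves I_b * t_r = (VDD - VBE) * C_boot - VBE * C_p for C_boot.
    """
    if vdd <= vbe:
        raise ValueError("VDD must exceed VBE for bootstrapping to work")
    return (i_b * t_r + vbe * c_p) / (vdd - vbe)


def mos_gate_cap(width_m: float, length_m: float, tox_m: float) -> float:
    """Parallel-plate estimate of a MOS capacitor: eps_ox * W * L / tox (F)."""
    return EPS_0 * EPS_R_SIO2 * width_m * length_m / tox_m


if __name__ == "__main__":
    # Hypothetical operating point: 100 uA average base current,
    # 1 ns output rise time, 20 fF parasitic capacitance, VBE = 0.8 V.
    for vdd in (3.3, 1.5):
        c = min_boot_cap(vdd, vbe=0.8, i_b=100e-6, t_r=1e-9, c_p=20e-15)
        print(f"VDD = {vdd} V -> C_boot >= {c * 1e15:.0f} fF")
    # 13 um x 6 um NMOS capacitor with an assumed 10 nm gate oxide.
    c_gate = mos_gate_cap(13e-6, 6e-6, 10e-9)
    print(f"gate capacitance = {c_gate * 1e15:.0f} fF")
```

With the operating point held fixed, the required capacitance grows as VDD approaches VBE, which is the trend stated in the text. The parallel-plate estimate for the 13 µm by 6 µm device lands in the few-hundred-femtofarad range when a 10 nm oxide is assumed.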
5.3.4.3 BiNMOS Configuration

Fig. 5.27 shows the BiNMOS version of the bootstrapped circuit. The pull-down section uses an NMOS transistor (N1), as in CMOS.
The pull-up section of this BFBiNMOS configuration is slightly different from that of BFBiCMOS, in that a small-size PMOS transistor (Pf) is added. Without this PMOS device, the base-emitter voltage of Q1 would be equal to VBE,on when the output reaches VDD. For a low output load, if the input goes high, the pull-down NMOS device, N1, discharges the output faster than the PMOS transistor Pd discharges the base. Thus, the bipolar transistor Q1 can turn ON to supply the output. This results in a high fall-time delay. The added small-size PMOS transistor, Pf, in the pull-up section solves this problem. Through the use of inverter I1, it sets the voltage of nodes n1 and B1 to VDD at the end of the bootstrapping. Hence, the base-emitter voltage of Q1 is almost equal to zero at the end of the bootstrapping. When the base is discharged from (VDD + VBE,on) to VDD by the PMOS Pf, inverter I2 holds the output level at VDD. Without this inverter, the output falls to a level equal to (VDD - VBE) due to the base-emitter coupling capacitance. The simulated waveforms of the different voltages are shown in Fig. 5.28.
Figure 5.27 Bootstrapped full-swing BiNMOS inverter (BFBiNMOS).
For an n-input gate implementation, the BFBiNMOS requires 4n input transistors, whereas the BFBiCMOS and the BS-BiCMOS require 5n and 6n input transistors, respectively. The crossover load capacitance represents one of the important parameters in circuit comparison. It is a measure of the load at which BiCMOS circuits start to have a speed advantage over CMOS. In the range 1.2-3.3 V, BFBiCMOS/BFBiNMOS circuits require almost an equivalent minimum fanout of 5. The BS-BiCMOS has a higher crossover capacitance.
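
The input-transistor counts quoted above scale linearly with the number of gate inputs, so the difference between the families widens for wider gates. A minimal sketch of that bookkeeping follows; the table and function names are illustrative assumptions, and only the per-input counts come from the text.

```python
# Input transistors needed per gate input, as quoted in the text.
INPUT_TRANSISTORS_PER_INPUT = {
    "BFBiNMOS": 4,
    "BFBiCMOS": 5,
    "BS-BiCMOS": 6,
}


def input_transistors(family: str, n_inputs: int) -> int:
    """Number of input transistors for an n-input gate of the given family."""
    if n_inputs < 1:
        raise ValueError("a gate needs at least one input")
    return INPUT_TRANSISTORS_PER_INPUT[family] * n_inputs


if __name__ == "__main__":
    for family in INPUT_TRANSISTORS_PER_INPUT:
        counts = [input_transistors(family, n) for n in (2, 3, 4)]
        print(f"{family}: {counts} input transistors for 2-, 3- and 4-input gates")
```

For a two-input NAND gate this gives 8, 10 and 12 input transistors for BFBiNMOS, BFBiCMOS and BS-BiCMOS, respectively.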
Figure 5.28 Voltage waveforms of the input (in), the output (out), and the base of Q1.
5.3.5 Comparison of BiCMOS Logic Circuits

In this section, a brief comparison of several BiCMOS logic circuits is presented using a generic 0.35 µm BiCMOS technology given in Table 5.3. For a more detailed comparison, the reader can refer to [25].
A two-input NAND gate configuration was chosen to evaluate and compare the performance of the circuits shown in Fig. 5.29. The logic families compared are: CMOS [Fig. 5.29(a)], PBiNMOS [Fig. 5.29(b)], TS-FS [Fig. 5.29(c)], BS-BiCMOS [Fig. 5.29(d)], BFBiNMOS [Fig. 5.29(e)], and BFBiCMOS [Fig. 5.29(f)].
Table 5.3 Key device parameters for the 0.35 µm BiCMOS process
0.35 µm
0.33 µm
4.9 mA @ V. = V n F
0.35 µm 0.34 µm 24mA = 3.3 V, W = 10 µm
52 fF
73 fF
30 5 l
28
37 Ω
31 Ω 280 Ω
265 Ω
The simulations were carried out using a chain of gates. The reported 50% delay times are those of an intermediate gate. Table 5.4 shows the delay, the average power dissipation and the power-delay product of the different NAND gates at two supplies: 3.3 and 1.5 V. The simulation was carried out at a typical load capacitance of 1 pF. The bootstrapped family consumes more power than CMOS because of the higher internal node capacitance. However, it provides a high speed of operation, particularly the BFBiCMOS, which has a factor of 3 speed advantage compared to CMOS at 1.5 V. Moreover, the delay-power product of the bootstrapped family is lower than that of CMOS. Notice that at 3.3 V, PBiNMOS has the lowest delay-power product and less delay than CMOS. BiNMOS at 1.5 V is slower than CMOS and is not reported in the table. These results also indicate that the use of the bootstrapped BiCMOS/BiNMOS gate would improve the delay-power product when VDD is scaled down to 1.5 V.
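Table 5.4 Delay, power dissipation and delay-power product of the two-input NAND gates at 3.3 V and 1.5 V supplies.
[Table body not recoverable from the source. Column headings: Logic Type, Delay (ps), Power (µW/MHz), Delay×Power (fJ/MHz). Surviving row labels: TS-FS, BS-BiCMOS, BFBiNMOS. Surviving entries: 20.0, 18.5, 26.4, 7.6, 962, 1175, 686, 3.84, 3.1, 4.60, 3.50, 3.2, 4.1.]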
5.3.6 Conclusion
We have demonstrated, throughout the previous sections, that the best family to use for a fanout higher than 5 is the bootstrapped BiCMOS, for power supplies in the range of 1 to 3.3 V. However, because it occupies a larger area, it can be used mainly in high-speed digital applications. Note that when the load is large, in the range of 1 pF, the bootstrapped family provides a high speed and a good delay-power product. One drawback of this family, besides the large area, is that the bootstrapping is sensitive to the shape of the input voltage. One practical gate which can be used in several applications, even when the fanout is low, is the BiNMOS family. It has good performance for 3.3 and 2.5 V power supplies. It also provides a better delay-power product than CMOS. In the next section, many digital applications based on the BiNMOS family are outlined.
5.4 LOW-VOLTAGE BICMOS APPLICATIONS

In this section, we present the applications of BiCMOS digital circuits in the implementation of digital building blocks, microprocessors, memories, digital signal processors, and gate arrays. The BiNMOS family and its utilization in practical design at 3.3 V are emphasized. Many of the circuits cited are discussed in detail in Chapters 4, 6 and 7.
5.4.1 Microprocessors and Logic Circuits
BiNMOS logic has been used in several microprocessors [26, 27]. In this application, BiNMOS can be used for critical path delay reduction without increasing chip area, since BiNMOS needs only a low fanout to outperform CMOS. Among the critical paths, we cite:

• Decoders in the register file and the cache memory;
• Sense amplifiers and output buffers in the register file and the cache;
• Booth's encoder, Wallace tree, and the final adder in a multiplier;
• Arithmetic and logic unit in a microprocessor data path; and
• Critical path of the control unit.
In the microprocessor of [26], the PBiNMOS logic family is used at a 3.3 V power supply. The critical path of the control unit is reduced by 36% over CMOS. The BiNMOS gates keep their speed advantage even in the worst case (VDD = 2.7 V and T = 125 °C).
BiCMOS logic is not limited to conventional gates; many other logic styles can be devised. One such example is the pass-transistor BiNMOS used in the design of a 64-bit adder [28], similar to the CMOS CPL logic family discussed in Chapter 4. Fig. 5.30 shows an exclusive OR/NOR gate using the double-rail pass-transistor BiNMOS gate (abbreviated PT-BiCMOS). The outputs of the pass-transistor network are connected to the bases of the bipolar transistors Q1 and Q2 to reduce the intrinsic delay. The PMOS transistors P1 and P3 are cross-coupled to restore the high level of the pass logic to full VDD. The PMOS transistors P2 and P4 charge the output to full swing. These transistors are subject to body effect, hence they turn ON later during transitions.
[Figure 5.30: exclusive OR/NOR gate using PT-BiCMOS; labeled block: pass-transistor network.]

Fig. 5.31(a) compares the delay of exclusive OR and NOR gates using PT-BiCMOS, TG-type CMOS, and CPL-type CMOS in a 0.5 µm BiCMOS process at 3.3 V power supply voltage. A fanout of 1 is equivalent to a capacitance of 35 fF. The PT-BiCMOS gate is faster than the CMOS gates for any fanout. The power-delay product is also shown in Fig. 5.31(b). The TG gate has the best delay-power product for a fanout lower than 3. However, for a fanout greater than 3, the PT-BiCMOS gate is better.
This PT-BiCMOS has been used in the design of a 64-bit adder [28]. It is used mainly in the P, sum and carry blocks. A delay time of 3.5 ns was obtained for the 64-bit adder at 3.3 V, which is 25% better than the CMOS version. The area and power dissipation penalties of the PT-BiCMOS adder, compared to the CMOS one, were 13% and 14% respectively. The speed advantage is kept down to almost 2 V.
5.4.2 Random Access Memories (RAMs)
One of the largest applications of BiCMOS is in RAM design, particularly Static RAMs (SRAMs). The first BiCMOS SRAM was proposed in 1985 [29]; since then, many BiCMOS SRAMs have been reported [30, 31, 32, 33, 34, 35, 36, 37]. The major applications of fast BiCMOS SRAMs are cache for workstations and main memory for supercomputers. Many BiCMOS SRAMs are in production
[Figure residue: plot with horizontal axis "Load Capacitance (pF)"; figure number and caption not recoverable from the source.]
complexity. BiCMOS was limited to some periphery circuits due to layout pitch matching. It was used in the I/O buffers, decoders and drivers, main sense amplifier and voltage down converter. In general, BiCMOS SRAMs and DRAMs are not suitable for low-power applications.
5.4.3 Digital Signal Processors
High-performance DSPs are needed in many applications such as video signal processors, convolvers, filters, etc. BiCMOS technology has been used successfully in DSPs operating at a frequency of 300 MHz [41, 42]. These DSPs operate at 3.3 V power supply voltage using the BiNMOS logic family. Among the characteristics of these BiCMOS DSPs, we cite:

• Parallel, pipelined architecture;
• High performance and high density of integration. In this case, critical data-path functional blocks are customized; and
• BiNMOS is used in blocks such as SRAM, ROM (Read Only Memory), ALU (Arithmetic Logic Unit), multiplier, clock driver, etc.
Fig. 5.33 shows a block diagram of a DSP [41]. This architecture can process any signal processing operation. The BiNMOS inverters are used as clock buffers to reduce the clock skew at 300 MHz clock frequency. The clock is distributed to about 1000 registers. A high clock frequency drastically increases the power and reduces the power supply voltage due to power noise (the effect of the high dissipated current). The BiNMOS inverter used in the clock distribution is the conventional one, which has a high level of VDD - VBE. Hence, the dynamic power of the clock network is reduced by 17% compared to CMOS when using BiNMOS. The BiNMOS logic is also used as:

• Output buffer of the Booth encoder of the convolver/multiplier block;
• Decoder driver of the register file; and
• Other drivers.
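The reason a reduced high level lowers the clock power can be seen from the usual first-order expression for dynamic power, P = C x Vswing x VDD x f. The Python sketch below is only an illustration: the base-emitter drop of 0.7 V and the total clock capacitance are assumed values that do not come from the text, and the idealized saving it computes applies to the swing term alone, so it need not match the 17% reported for the complete clock network.

```python
def clock_dynamic_power_w(c_load_f, v_swing, v_supply, f_clk_hz):
    """First-order dynamic power drawn from the supply: C * Vswing * VDD * f."""
    return c_load_f * v_swing * v_supply * f_clk_hz


V_DD = 3.3          # supply voltage quoted in the text (V)
F_CLK = 300e6       # clock frequency quoted in the text (Hz)
V_BE = 0.7          # ASSUMED base-emitter drop (V), not given in the text
C_CLOCK = 100e-12   # HYPOTHETICAL total clock-network capacitance (F)

# Full-swing CMOS clock buffer versus a conventional BiNMOS buffer whose
# high level is limited to VDD - VBE.
p_cmos = clock_dynamic_power_w(C_CLOCK, V_DD, V_DD, F_CLK)
p_binmos = clock_dynamic_power_w(C_CLOCK, V_DD - V_BE, V_DD, F_CLK)
saving = 1.0 - p_binmos / p_cmos

if __name__ == "__main__":
    print(f"CMOS clock power   : {p_cmos * 1e3:.1f} mW")
    print(f"BiNMOS clock power : {p_binmos * 1e3:.1f} mW")
    print(f"idealized saving   : {saving * 100:.1f} %")
```

In this idealized model the saving equals VBE/VDD and is independent of the capacitance and of the frequency.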
5.4.4 Gate Arrays
Gate arrays became very popular for a wide spectrum of applications because of their low cost and short turn-around time. Gate array chips consist of a large number of identical sites or basic cells which are usually placed in rows. The rows are separated by routing channels. The core of rows and channels is surrounded by I/O cells at the chip periphery, as illustrated in Fig. 5.34. Each of the basic cells is typically made up of a number of transistors which can be connected to form a two-input NAND or NOR gate or a simple latch. The only processing step that can be customized is the metallization. The user of a gate array can implement the system by specifying the required connections between the devices in each cell and then the connections between the various cells. This is done automatically using CAD tools. The number of metal levels used for wiring varies from 2 to 4. The first one or two levels are used for internal wiring of the cell and the upper levels (e.g. third and fourth) for wiring between the cells in the horizontal and vertical directions [43].
BiCMOS technology has been used extensively for building gate arrays and channelless gate arrays (sea-of-gates) [43, 44, 45, 46]. At 3.3 V power supply voltage, the BiNMOS logic family has been used [10, 11]. In [11], a BiPNMOS logic gate has been proposed for the channelless gate array. Fig. 5.35 shows a layout of a BiPNMOS basic cell in 0.5 µm BiCMOS technology. A bipolar transistor and a small-size MOS transistor are added to the pure CMOS basic cell. These transistors are used not only to implement BiPNMOS gates but also flip-flops, memory macros (RAM, ROM, and CAM), etc. A BiPNMOS two-input NAND gate has a 36% delay reduction compared to a similar CMOS gate for a fanout of 7. The speed advantage is maintained down to 2.5 V.
Figure 5.34 Gate array floorplan.
5.4.5 Application Specific ICs (ASICs)
In order to realize high-performance ASICs, fast standard cell library macros for rapid design are important. This library contains custom functional macros such as adder, Programmable Logic Array (PLA), register file, RAM, cache, Translation Look-aside Buffer (TLB), controller, etc. PBiNMOS logic has been used for such a standard cell library [12]. The cells of logic gates are designed in CMOS and PBiNMOS for the same logic functions. The PBiNMOS gates are used for a relatively high fanout and load, whereas CMOS gates are used for a small fanout. A CAD tool can be utilized to choose the most appropriate cells in the design.
[Figure 5.35: layout of the BiPNMOS basic cell; labeled regions: bipolar, resistor, PMOS, NMOS.]
5.5 CHAPTER SUMMARY
In this chapter, we have demonstrated the advantage of using BiCMOS over CMOS in terms of speed. We have shown the historical evolution of the different BiCMOS logic families. A variety of alternative circuit techniques for low-voltage operation have been outlined and compared to the conventional BiCMOS. We have also shown how optimized BiNMOS gates are faster than CMOS even if the fanout is low (greater than 1). The design techniques can be extended to more complex gates and building blocks such as flip-flops, adders, etc. A variety of applications where BiCMOS, particularly BiNMOS, can be used at low voltage have been reviewed. The addition of the bipolar transistor to CMOS to devise new structures enhances the performance of ICs. This feature improves the access time of memories, register files, ALUs, DSPs, etc. Notice that a large portion of a BiCMOS IC is implemented in CMOS, while bipolar transistors represent a small portion (0.5-4%) for driving or sensing purposes. The power dissipation of BiCMOS circuits, compared to their CMOS counterparts, increases drastically if ECL is used because of the DC current. However, if only BiCMOS logic gates are used, the power increase is not significant compared to the speed enhancement. In some cases, such as the clock distribution network, the power dissipation is reduced when using BiNMOS.
REFERENCES

[1] A. R. Alvarez, "BiCMOS Technology and Applications," Kluwer Academic Pub., MA, Second Edition, 1993.
[2] S. H. K. Embabi, A. Bellaouar and M. I. Elmasry, "BiCMOS Digital Integrated Circuit Design," Kluwer Academic Pub., MA, 1993.
[3] M. I. Elmasry, "Design and Analysis of BiCMOS ICs," IEEE Press, 1994.
[4] G. P. Rosseel and R. W. Dutton, "Influence of Device Parameters on the Switching Speed of BiCMOS Buffers," IEEE Journal of Solid-State Circuits, vol. 24, no. 1, pp. 90-99, February 1989.
[5] P. Raje, K. Cham, and K. Saraswat, "BiCMOS Gate Performance Optimization using Unified Delay Model," Symposium on VLSI Technology, Tech. Dig., pp. 91-92, 1990.
[6] S. H. K. Embabi, A. Bellaouar, and M. I. Elmasry, "Analysis and Optimization of BiCMOS Digital Circuit Structures," IEEE Journal of Solid-State Circuits, vol. 26, no. 4, pp. 676-679, April 1991.
[7] P. A. Raje, K. C. Saraswat and K. M. Cham, "Performance-driven Scaling of BiCMOS Technology," IEEE Trans. on Electron Devices, ED-39, no. 3, pp. 685-693, March 1992.
[8] J. Gallia, et al., "High-Performance BiCMOS 100K-Gate Array," IEEE Journal of Solid-State Circuits, vol. 25, no. 1, pp. 142-149, February 1990.
[9] Y. Nishio, et al., "A BiCMOS Logic Gate with Positive Feedback," International Solid-State Circuits Conference, Tech. Dig., pp. 116-117, February 1989.
[10] A. E. Gamal et al., "BiNMOS a Basic Cell for BiCMOS Logic Circuits," in Custom Integrated Circuits Conf., Tech. Dig., pp. 8.3.1-8.3.4, 1989.
[11] H. Hara et al., "0.5-µm 2M-Transistor BiPNMOS Channelless Gate Array," IEEE Journal of Solid-State Circuits, vol. 26, no. 11, pp. 1615-1620, November 1991.
[12] H. Hara et al., "0.5-µm 3.3-V BiCMOS Standard Cells with 32-kb Cache and Ten-Port Register File," IEEE Journal of Solid-State Circuits, vol. 27, no. 11, pp. 1579-1584, November 1992.
[13] M. I. Elmasry and A. Bellaouar, "BiCMOS at Low-Supply Voltage," in IEEE Bipolar/BiCMOS Circuits and Technology Meeting, pp. 89-96, October 1993.
[14] P. Raje, et al., "MBiCMOS: A Device and Circuit Technique for Submicron, sub-2 V Regime," International Solid-State Circuits Conference, Tech. Dig., pp. 150-151, 1991.
[15] P. G. Y. Tsui et al., "Study of BiCMOS Logic Gate Configurations for Improved Low-Voltage Performance," IEEE Journal of Solid-State Circuits, vol. 28, no. 3, pp. 371-374, March 1993.
[16] S. W. Sun et al., "A Fully Complementary BiCMOS Technology for Sub-Half-Micrometer Microprocessor Applications," IEEE Trans. Electron Devices, vol. 39, no. 12, pp. 2733-2739, December 1992.
[17] K. Yano et al., "Quasi-Complementary BiCMOS for Sub-3V Digital Circuits," IEEE Journal of Solid-State Circuits, vol. 26, no. 11, pp. 1708-1719, November 1991.
[18] A. Watanabe et al., "Future BiCMOS Technologies for Scaled Supply Voltage," International Electron Devices Meeting, Tech. Dig., pp. 429-433, December 1989.
[19] A. J. Shin et al., "Full-Swing CBiCMOS Logic Circuits," in IEEE Bipolar/BiCMOS Circuits and Technology Meeting, Tech. Dig., pp. 229-233, September 1989.
[20] A. Bellaouar, I. S. Abu-Khater, M. I. Elmasry, and A. Chikima, "Full-Swing Schottky BiCMOS/BiNMOS and the Effects of Operating Frequency and Supply Voltage Scaling," IEEE Journal of Solid-State Circuits, vol. 29, no. 6, pp. 693-700, June 1994.
[21] S. H. K. Embabi, A. Bellaouar, M. I. Elmasry, and R. A. Hadaway, "New Full-Voltage-Swing BiCMOS Buffers," IEEE Journal of Solid-State Circuits, vol. 26, no. 2, pp. 150-153, February 1991.
[22] M. Hiraki et al., "A 1.5-V Full-Swing BiCMOS Logic Circuit," IEEE Journal of Solid-State Circuits, vol. 27, no. 11, pp. 1568-1574, November 1992.
[23] R. Y. V. Chik and C. A. T. Salama, "1.5 V Bootstrapped BiCMOS Logic Gate," IEE Electronics Letters, vol. 29, no. 3, pp. 301-309, February 1993.
[24] S. H. K. Embabi, A. Bellaouar, and K. Islam, "A Bootstrapped Bipolar CMOS (B2CMOS) Gate for Low Voltage Applications," IEEE Journal of Solid-State Circuits, vol. 30, no. 1, pp. 47-53, January 1995.
[25] A. Bellaouar, M. I. Elmasry, and S. H. K. Embabi, "Bootstrapped Full-Swing BiCMOS/BiNMOS Logic Circuits for 1.2-3.3 V Supply Voltage Regime," IEEE Journal of Solid-State Circuits, vol. 30, no. 6, June 1995.
[26] J. Schutz, "A 3.3 V 0.6 µm BiCMOS Superscalar Microprocessor," IEEE International Solid-State Circuits Conference, Tech. Dig., pp. 202-203, 1994.
[27] F. Murabayashi, et al., "3.3 V, Novel Circuit Techniques for a 2.8-Million-Transistor BiCMOS RISC Microprocessor," IEEE Custom Integrated Circuits Conference, Tech. Dig., pp. 12.1.1-12.1.4, May 1993.
[28] K. Ueda, H. Suzuki, K. Suda, Y. Tsujihashi, H. Shinohara, "A 64-bit Adder By Pass Transistor BiCMOS Circuit," IEEE Custom Integrated Circuits Conference, Tech. Dig., pp. 12.2.1-12.2.4, May 1993.
[29] K. Ogiue, et al., "A 15 ns/250 mW 64K Static RAM," in ICCD, Tech. Dig., pp. 17-20, 1985.
[30] H. Tran et al., "An 8-ns 1-Mb ECL BiCMOS SRAM with a Configurable Memory Array Size," International Solid-State Circuits Conf., Tech. Dig., pp. 36-37, February 1989.
[31] M. Matsui et al., "An 8-ns 1-Mb ECL BiCMOS SRAM," International Solid-State Circuits Conf., Tech. Dig., pp. 38-39, February 1989.
[32] Y. Maki et al., "A 6.5-ns 1-Mb BiCMOS ECL SRAM," International Solid-State Circuits Conf., Tech. Dig., pp. 136-137, February 1990.
[33] M. Takada et al., "A 5-ns 1-Mb ECL BiCMOS SRAM," IEEE Journal of Solid-State Circuits, vol. 25, no. 5, pp. 1057-1062, October 1990.
[34] A. Ohba et al., "A 7-ns 1-Mb BiCMOS ECL SRAM with Program-Free Redundancy," in Symp. VLSI Circuits, Tech. Dig., pp. 41-42, May 1990.
[35] Y. Okajima et al., "A 7-ns 4-Mb BiCMOS SRAM with a Parallel Testing Circuit," International Solid-State Circuits Conf., Tech. Dig., pp. 54-55, February 1991.
[36] N. Tamba et al., "A 1.5 ns 256Kb BiCMOS SRAM with 11K 60 ps Logic Gates," International Solid-State Circuits Conf., Tech. Dig., pp. 246-247, February 1993.
[37] K. Nakamura et al., "A 200-MHz Pipelined 16-Mb BiCMOS SRAM with PLL Proportional Self-Timing Generator," IEEE Journal of Solid-State Circuits, vol. 29, no. 11, pp. 1317-1322, November 1994.
[38] G. Kitsukawa, et al., "An Experimental 1-Mb BiCMOS DRAM," IEEE Journal of Solid-State Circuits, vol. SC-22, no. 5, pp. 657-662, October 1987.
[39] S. Watanabe, et al., "BiCMOS Circuit Technology for High Speed DRAMs," Symposium on VLSI Circuits, Tech. Dig., pp. 79-80, 1987.
[40] G. Kitsukawa, et al., "Design of ECL 1-Mb BiCMOS DRAM," Electronics and Communications in Japan, Part 2, vol. 76, no. 5, pp. 89-102, 1992.
[41] M. Nomura et al., "A 300-MHz, 16-bit, 0.5-µm BiCMOS Digital Signal Processor Core LSI," IEEE Custom Integrated Circuits Conference, Tech. Dig., pp. 12.6.1-12.6.4, May 1993.
[42] T. Inoue, et al., "A 300-MHz 16-bit BiCMOS Video Signal Processor," IEEE Journal of Solid-State Circuits, vol. 28, no. 12, pp. 1321-1329, December 1993.
[43] F. Murabayashi, et al., "A 0.5 micron BiCMOS Channelless Gate Array," IEEE Custom Integrated Circuits Conference, Tech. Dig., pp. 8.7.1-8.7.4, May 1989.
[44] H. Hara, et al., "A 350 ps 50K 0.8 micron BiCMOS Gate Array with Shared Bipolar Cell Structure," IEEE Custom Integrated Circuits Conference, Tech. Dig., pp. 8.5.1-8.5.4, May 1989.
[45] J. D. Gallia, et al., "High-Performance BiCMOS 100K-Gate Array," IEEE Journal of Solid-State Circuits, vol. 25, no. 1, pp. 142-149, February 1990.
[46] T. Hanibuchi, et al., "A Bipolar-PMOS Merged Basic Cell for 0.8 micron BiCMOS Sea of Gates," IEEE Journal of Solid-State Circuits, vol. 26, no. 3, pp. 427-431, March 1991.
6
LOW-POWER CMOS RANDOM ACCESS MEMORY CIRCUITS
Low-power Random Access Memory (RAM) has seen remarkable and rapid progress in power reduction. Many circuit techniques for active and standby power reduction in static and dynamic RAMs have been devised. In this chapter we study low-power memory circuit techniques which are also of interest for several other applications. Among these circuits, we examine memory cells, sense amplifiers, precharging circuits, etc. Circuit techniques for 1.x V power supplies are also discussed. The voltage targets using NiCd and Mn batteries are 1.2 and 1.5 V respectively. The minimum voltage of a NiCd cell is 0.9 V. We also consider the Voltage Down Converters (VDCs) which are used in memories and processors. No consideration is given to the details of designing a complete memory chip, because a single configuration would require an entire book.
6.1 STATIC RAM (SRAM)

Today, workstations, computers and supercomputers are demanding high-speed and high-density SRAMs, e.g., cache memories. These systems started to use 4-Mb fast SRAMs and will require, in the future, higher-density memories with faster access times. Many 1-to-4-Mb BiCMOS SRAMs [1, 2, 3, 4, 5, 6] have achieved access times of 5 to 10 ns. In these SRAMs, the power dissipation is 275 to 1000 mW, which is not acceptable in many applications. On the other hand, high-density, low-power SRAMs are needed for applications such as hand-held terminals, laptops, notebooks and IC memory cards. Table 6.1 shows examples of high-density SRAMs with low-power characteristics. The standby current is in the order of 1 µA and sub-µA, which is suitable for battery-backup operation.
Table 6.1 High-density SRAMs with low-power characteristics.

Memory size (Ref.)   Power supply   CMOS technology   Access time   Power dissipation
1-Mb [7]             3.0 V          0.35 µm           7 ns          140 mW @ 100 MHz
4-Mb [8]             5.0 V          0.50 µm           23 ns         100 mW @ 10 MHz
4-Mb [9]             3.0 V          0.60 µm           68 ns         21 mW @ 10 MHz
16-Mb [10]           2.5 V          0.25 µm           15 ns         120 mW @ 20 MHz
16-Mb [11]           3.0 V          0.40 µm           15 ns         165 mW @ 30 MHz
16-Mb [12]           3.3 V          0.35 µm           9 ns          238 mW @ 30 MHz
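Because the power figures of Table 6.1 are quoted at different operating frequencies, they are easier to compare as energy per cycle. The Python sketch below is an illustrative normalization that is not part of the original table; it uses only the power/frequency pairs listed there, and the function name is hypothetical.

```python
# (power in mW, operating frequency in MHz) pairs as listed in Table 6.1.
POWER_AT_FREQUENCY = [
    (140, 100),
    (100, 10),
    (21, 10),
    (120, 20),
    (165, 30),
    (238, 30),
]


def energy_per_cycle_nj(power_mw: float, freq_mhz: float) -> float:
    """Energy per operating cycle in nJ (1 mW / 1 MHz = 1 nJ)."""
    return power_mw / freq_mhz


if __name__ == "__main__":
    for power_mw, freq_mhz in POWER_AT_FREQUENCY:
        energy = energy_per_cycle_nj(power_mw, freq_mhz)
        print(f"{power_mw:4d} mW @ {freq_mhz:3d} MHz -> {energy:5.2f} nJ/cycle")
```

This normalization ignores the standby component of the power, so it is only a rough figure of merit.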
The power dissipation reduction in SRAMs is due not only to power supply voltage reduction, but also to low-power circuit techniques. In this section we review some of these circuit techniques for low-power applications.

SRAMs have several advantages over Dynamic RAMs (DRAMs), such as:

• No refresh operation of the memory cells is needed.
• The speed of an SRAM is higher because of the differential pair of bit-lines.
• The operational modes are simpler because the row and column address signals are loaded simultaneously.
• A low data retention current, which is required by battery applications.

However, SRAMs have the great disadvantage of a large memory cell compared to DRAMs. For this reason, their capacities are smaller than those of DRAMs.
6.1.1 Basics of SRAMs

In order to treat the different circuit parts of an SRAM, it is important to understand some characteristics of these memories. In general the pins of an SRAM are:

1. Addresses (A0 ... An), which define the memory location;
2. Write Enable (WE), which selects between the read and write modes;
3. Chip Select (CS), which selects one memory out of several within a system;
4. Output Enable (OE), which is used to enable the output buffer;
5. Input/Output data (I/O); and
6. Power supply pins.

A timing diagram during the read cycle is shown in Fig. 6.1(a). During this time the data stored in a specific SRAM location (defined by the address) is read out. For a read cycle, two times are shown in the figure: the read cycle time, tRC, and the address access time, tAA. Fig. 6.1(b) shows the write cycle, which permits a change to the data in an SRAM. Two times are indicated: the write cycle time, tWC, and the write recovery time, tWR. Some of this information is used in this chapter. For more detail on the timing, the reader can refer to any memory data book.

A typical SRAM architecture is shown in Fig. 6.2. The memory array contains the memory cells, which are readable and writable. The row decoder (X-decoder) selects 1 out of n rows, while the column decoder (Y-decoder) selects l out of m columns, where n, l and m are powers of two. The addresses (row and column) are not multiplexed as in the case of a DRAM. Sense amplifiers detect small voltage variations on the complementary bit-lines of the memory, which reduces the reading time. The conditioning circuit permits the precharge of the bit-lines. The access time is determined by the critical path from the address input to the data output, as shown in Fig. 6.3. This path contains the address input buffer, row decoder, memory cell array, sense amplifier and output buffer circuits. The word-line decoding and bit-line sensing delay times are critical delay components. To reduce the sensing time during a read operation, the swing on the bit-lines should be as small as possible.
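The non-multiplexed addressing described above can be sketched in Python as follows. The bit assignment (upper address bits to the row decoder, lower bits to the column decoder) and the function names are assumptions made for illustration; an actual SRAM may partition its address differently.

```python
def split_address(address: int, row_bits: int, col_bits: int):
    """Split a flat address into (row, column) indices.

    ASSUMED convention: the upper row_bits feed the X-decoder and the
    lower col_bits feed the Y-decoder. Both parts are presented at the
    same time, since SRAM addresses are not multiplexed.
    """
    if not 0 <= address < (1 << (row_bits + col_bits)):
        raise ValueError("address out of range")
    row = address >> col_bits
    col = address & ((1 << col_bits) - 1)
    return row, col


def one_hot(index: int, width: int):
    """Model a decoder output: exactly one of `width` lines is active."""
    return [1 if i == index else 0 for i in range(width)]


if __name__ == "__main__":
    row, col = split_address(0b101101, row_bits=3, col_bits=3)
    print("row index:", row, "word-lines:", one_hot(row, 8))
    print("col index:", col, "column selects:", one_hot(col, 8))
```

With 3 row bits and 3 column bits, the row decoder drives 1 of 8 word-lines and the column decoder selects 1 of 8 columns.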
For an asynchronous SRAM (one that is not clocked externally), a special circuit called Address Transition Detection (ATD) permits the generation of internal pulses. These pulses are of two types: activation and equalization. Activation pulses selectively activate particular circuits, while equalization pulses permit the reduction of the delay by restoring and equalizing differential nodes prior to their selection. In this section we treat only asynchronous SRAMs.
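The principle of address transition detection can be illustrated with a small behavioral model in Python. This is a sketch only: it works on discrete samples of the address, the pulse width is a hypothetical parameter, and it does not represent the actual ATD circuit, which is typically built from delay elements and XOR gates on each address line.

```python
def atd_pulses(address_samples, pulse_width=2):
    """Return a 0/1 pulse train that goes high for `pulse_width` samples
    each time the sampled address differs from the previous sample."""
    pulses = []
    remaining = 0      # samples left in the current pulse
    previous = None    # address seen at the previous sample
    for address in address_samples:
        if previous is not None and address != previous:
            remaining = pulse_width   # (re)start a pulse on any transition
        pulses.append(1 if remaining > 0 else 0)
        if remaining > 0:
            remaining -= 1
        previous = address
    return pulses


if __name__ == "__main__":
    samples = [3, 3, 5, 5, 5, 5, 6, 6]
    print(atd_pulses(samples))
```

Each address change produces one internal pulse, which can then serve as an activation or equalization pulse.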
Figure 6.1 Typical timing of an SRAM: (a) read cycle; (b) write cycle.
318
CHAPTER 6
Input
address
Addmr mpnt buffer
Row decoder
idnver
Memory
cell
6.1.2 Static RAM Cells

The memory cell is an important circuit in the design of low-power and high-density SRAMs because the memory size is dominated by the cell area. There are various static memory cells. The cell of Fig. 6.4 has six transistors, in the form of two cross-coupled inverters and two pass-transistors connected to two complementary bit-lines, BL and its complement. The pass-transistors are controlled by the signal WL (word-line). During the read cycle, the bit-lines are held high (precharged). Assume that "0" is stored at node A and "1" is stored at node B. When the cell is selected, i.e., WL is set to "1", BL is discharged through N1 and N3.
To write in the cell, one of the bit-lines is pulled low and the other high, and then the cell is selected by WL. Assume that BL is set to "0" while initially "1" is stored at node A ("0" at B). N1 and P1 should be sized such that node A is pulled down enough to turn P2 ON. This in turn causes node B to be pulled up. The cross-coupled inverter pair has a high gain, which causes the nodes A and B to switch to opposite voltages. The data retention (standby) current of this cell can be as low as 10^-15 A. Although this full-CMOS cell has a low retention current, the cell area is so large that it does not allow high-density SRAMs. A typical cell area using 0.8 µm design rules is 75 µm².

Figure 6.4 CMOS memory cell with PMOS load.

The stability of the memory cell is its ability to hold a stable state. Fig. 6.5(a) shows the transfer curves of full CMOS SRAMs. The box between the two characteristics (I and II) defines the Static Noise Margin (SNM). Static noise is a DC disturbance, such as offsets and mismatches, due to the processing and variations in process conditions. The SNM is defined as the maximum value of Vn (the static noise source shown in Fig. 6.5(b)) that can be tolerated by the cross-coupled inverters before altering state. An important parameter in SNM is the memory cell ratio, r, defined by

$$ r = \frac{(W/L)_{N_d}}{(W/L)_{N_a}} \qquad (6.1) $$

where transistors Na and Nd are the access and driver NMOS transistors shown in Fig. 6.4. An analysis of SNM for memory cells is given in [13]. This static noise margin parameter increases with the ratio r. However, it is limited by the cell area constraint. The stability of the cell is maintained even if VDD is scaled down.

Another memory cell configuration is shown in Fig. 6.6. This cell is similar to the full CMOS memory cell, except that the PMOS pull-up devices are replaced by high-resistance polysilicon loads. The memory cell area can be about 30% to 40% smaller than that of the CMOS six-transistor memory cell, because the two polysilicon resistances can be formed on top of the two NMOS driver transistors. The High Resistive Load (HRL) memory cell has been used in several SRAM generations from 4 Kb. The high-state storage node of Fig. 6.6 will be pulled down with time due to two kinds of leakage current: the leakage current of the drain junction and the subthreshold current. The voltage drop across the resistance R prevents regular cell operation if the leakage current reaches the level of the poly-Si resistor current. In several SRAM generations using the HRL memory cell, the total standby current was set to 1 µA per chip at room temperature for battery-backup applications. Thus, for each memory generation with quadrupled density, the poly-Si resistance value is also quadrupled. For a 4-Mb chip which has a total standby current of less than 1 µA, typical values of resistance are in the teraohm range and the resistance current is limited to 10^-13 A. This current should be much larger than the total leakage current of the storage node of the cell to improve the data retention margin. The leakage current cannot be scaled because, first, the subthreshold current per channel width tends to increase, particularly with the trend to decrease the threshold voltage for low-voltage operation. Second, the leakage current of the drain junction per unit area tends to increase with technology scaling. Moreover, the junction area shrinks at a rate lower than the SRAM density increase rate. In [14], it was determined that the maximum SRAM capacity for low-power applications using an HRL memory cell is 4 Mb, where the retention current is 1 µA.

Note that the high-level node voltages of all poly-Si load memory cells are (VDD - VT) after a write cycle, where VT is the threshold voltage of the access transistor, subject to body effect. These nodes need a time of several ms to charge up to VDD. The SNM of the poly-Si load memory cell is more sensitive to the cell ratio r than that of the full CMOS cell [13]. A typical value of r is 3. The cell stability is also drastically degraded when VDD is 3 V or less. The transfer curves in the read mode can be easily plotted for different values of VDD to find that the cell cannot store the data at a certain low voltage.
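The standby-current budget of the HRL cell can be checked with a few lines of Python. The 1 µA chip budget and the 4-Mb density come from the text; the 3 V supply, the assumption that the whole budget flows through one load resistor per cell, and the function names are illustrative only.

```python
def per_cell_current_a(total_standby_a: float, n_cells: int) -> float:
    """Standby current available to each cell for a fixed chip budget."""
    return total_standby_a / n_cells


def load_resistance_ohm(v_supply: float, cell_current_a: float) -> float:
    """Load resistance that limits one cell to the given current."""
    return v_supply / cell_current_a


STANDBY_BUDGET_A = 1e-6   # 1 uA per chip, as stated in the text
V_SUPPLY = 3.0            # ASSUMED supply voltage (V)
MBIT = 2 ** 20

if __name__ == "__main__":
    for density_mb in (1, 4, 16):
        i_cell = per_cell_current_a(STANDBY_BUDGET_A, density_mb * MBIT)
        r_load = load_resistance_ohm(V_SUPPLY, i_cell)
        print(f"{density_mb:2d} Mb: {i_cell:.2e} A per cell, "
              f"R = {r_load:.2e} ohm")
```

For 4 Mb the budget is about 2.4 x 10^-13 A per cell, consistent with the order of magnitude quoted above, and each quadrupling of the density quadruples the required resistance.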
[Figure 6.7: cross section and circuit diagram of the poly-Si PMOS load memory cell; surviving label: p-substrate.]

For 4 Mb and higher density SRAMs, the polysilicon load cell starts to be replaced by a polysilicon PMOS load called the PMOS Thin Film Transistor (TFT) for low-power applications [8, 9, 15]. Fig. 6.7 shows a cross section and circuit diagram of the poly-Si PMOS load memory cell [8]. The TFT device is fabricated from amorphous silicon (a-Si). This material has a grain size of 2 µm, while that of the conventional poly-Si material is 0.03 µm. The thickness of this a-Si is 100 nm and the gate oxide thickness of the TFT is 40 nm. This technology results in improved ON/OFF currents compared to the one using poly-Si. The N+ drain area of the NMOS transistor is used as the gate electrode for the PMOS TFT. To obtain a small area, the polysilicon PMOS must be stacked on the NMOS driver. The second polysilicon layer forms the channel regions. The TFT memory cell area is more than 40% smaller than that of the full CMOS cell.

Fig. 6.8 shows the drain current of a PMOS TFT used in a 4-Mb SRAM as a function of the gate voltage. An ON current of more than 10^-7 A is obtained at a supply voltage of 3 V, while an OFF current of 10^-13 A is attained. The ON current is larger by more than six orders of magnitude than the memory cell leakage currents, which is much better than the current of the HRL cell. Thus, it results in an excellent data retention characteristic. Moreover, the very low OFF current results in a standby current of less than 1 µA for a 4-Mb SRAM. This current is low enough for battery back-up operation. At a 1.2 V power supply, the current flowing in the PMOS TFT is more than one-and-a-half orders of magnitude larger than the OFF current. This demonstrates the ability of this technology for low-voltage operation.
Afier write cyde, the hgh-storage node voltage i n the cell becomes VDD - VT. The time needed for charging up this node to VDD is t,h = -
C,VT
(6.2)
where 4 ir the current flowing in tho load device and C , is the total parasitic capacitance of the node. Using 4-Mb data for TFT memory cell, VT = 1 V , C , = 10 fF and 4 = 10 p A the to&is around 1 me. For poly-Si load this chage-np time is larger than 100 m i because h k low i y ~0.1 PA. The average interval time between two word-line selections (for the same word-line) is given by
$$t_s = \frac{N\, t_{cycle}}{M} \qquad (6.3)$$

where N is the number of memory cells per SRAM chip, M is the number of memory cells per word-line, and $t_{cycle}$ (also noted $t_{RC}$) is the operating cycle time. For 4-Mb, a typical value of $t_s$ is 4.5 ms when the cycle time is 70 ns and M equals 64 cells/word-line. Comparing $t_s$ to $t_{ch}$ for the poly-Si load and the PMOS TFT we have

$$t_{ch} < t_s \quad \text{for PMOS TFT} \qquad (6.4)$$

$$t_{ch} > t_s \quad \text{for poly-Si load} \qquad (6.5)$$
Thus, the high-storage node, in the case of the PMOS TFT cell, is charged up quickly to $V_{DD}$. For this reason, the Soft Error Rate (SER) of the PMOS TFT cell is much lower than that of the poly-Si cell [8].
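The comparison between the charge-up time of Eq. (6.2) and the word-line selection interval of Eq. (6.3) can be checked numerically. The following is a minimal sketch using the values quoted in the text; the function names are hypothetical, and "4-Mb" is assumed to mean 4 x 2^20 cells.

```python
# Sketch only: function names are illustrative, values are those quoted in the text.

def charge_up_time(c_node, v_t, i_load):
    """Eq. (6.2): time to pull the high node from VDD - VT up to VDD (seconds)."""
    return c_node * v_t / i_load

def selection_interval(n_cells, cells_per_wordline, t_cycle):
    """Eq. (6.3): average time between two selections of the same word-line."""
    return n_cells * t_cycle / cells_per_wordline

t_ch_tft = charge_up_time(10e-15, 1.0, 10e-12)     # PMOS TFT load, 10 pA
t_ch_poly = charge_up_time(10e-15, 1.0, 0.1e-12)   # poly-Si load, 0.1 pA
t_s = selection_interval(4 * 2**20, 64, 70e-9)     # assumed 4 x 2^20 cells, 70 ns cycle

print(f"t_ch (PMOS TFT) = {t_ch_tft * 1e3:.2f} ms")
print(f"t_ch (poly-Si)  = {t_ch_poly * 1e3:.2f} ms")
print(f"t_s             = {t_s * 1e3:.2f} ms")
print("TFT node recovers before the next selection:", t_ch_tft < t_s)
print("poly-Si node recovers before the next selection:", t_ch_poly < t_s)
```

Under these assumptions the TFT cell restores its high node well within one selection interval, while the poly-Si cell does not, which is the ordering stated by Eqs. (6.4) and (6.5).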
6.1.3 Read/Write Operation
Fig. 6.9 shows a simplified readout circuitry for an SRAM. The circuit has static bit-line loads composed of pull-up NMOS devices $N_1$ and $N_2$. The bit-lines are pulled up to a voltage ($V_{DD} - V_T$), where $V_T$ is the threshold voltage subject to body effect. When the word-line WL is asserted, one word is selected. At this time, the bit-line BL is pulled down to a level determined by the pull-up NMOS $N_1$, the word-line transistor $N_a$, and the driver NMOS transistor $N_d$, as shown in Fig. 6.9(b). The voltage at the node A should be low (near ground) so as not to alter the RAM content during this read operation. A small swing on BL is desirable to achieve high-speed readout, particularly if $C_{BL}$ is high. The Sense Amplifier (SA) amplifies the small swing, $\Delta V$, on the bit-line. A typical value of $\Delta V$ is 100 mV. It should be noted that the SA should provide a wide operating margin over all process, temperature, and voltage corners.

If the WL signal stays asserted, all selected columns consume a DC current flowing through the NMOS devices $N_1$, $N_a$ and $N_d$. Thus, shortening the read mode duration is necessary to reduce the power dissipation during this active mode. This is possible by pulsing WL for just enough time to read the cell, as shown in Fig. 6.10. The generation of the pulsed WL signal is possible owing to the Address Transition Detection (ATD) technique, as will be discussed in Section 6.1.5.

Figure 6.10 Power reduction by pulsing the word line.
Fig. 6.11(a) shows a simplified circuit configuration for SRAM write operation. For a write operation the memory cell state should be flipped. When the write signal WE is asserted, the input data and its complement are placed on the bit-lines. If, for example, a zero has to be stored in the node A initially at $V_{DD}$, the voltage at this node should be below the threshold voltage of the cell, as shown in the equivalent circuit of Fig. 6.11(b). The bit-line in this case is pulled down to almost 0 V. The design of the write circuitry should provide a wide operating margin over all process, temperature, and voltage corners. Note that a DC current is consumed during a write mode, hence the WE signal should also be short to cut this current at the end of the write operation.

In high-speed SRAMs, write recovery time is an important component of the write cycle time. It is defined as the time necessary to recover from the write cycle to the read state after the WE signal is disabled. Note that the swing on the bit-lines after a write operation is large. Thus, an equalizer circuit is needed to reduce this swing, so that the read operation is performed quickly. Fig. 6.12 illustrates a simplified schematic of an SRAM with read/write circuitry. At the end of the memory cycle a differential voltage exists on the bit-lines. A PMOS equalizing device is used to equalize the bit-lines after each read and write operation. The differential voltages on the bit-lines are restored
• The decoders (row and column);

• The memory array. If m memory cells are connected to the word-line, the active power of the memory array (in read mode) is given by

$$P_{mem\text{-}array} = m\, P_{act} + (n - 1)\, m\, P_{ret} + m\, I_{DC}\, \Delta t\, f\, V_{DD} \qquad (6.6)$$

where $P_{act}$ is the power dissipated in active mode when selecting the m cells and $P_{ret}$ is the data retention (standby) power of the unselected memory cells in the m x n array. The second term is negligible. The third term is due to the DC current, $I_{DC}$, during the read operation. $\Delta t$ is the activation time of the DC consuming parts and f is the operating frequency ($f = 1/t_{RC}$). An example of such a current is the DC current flowing from the bit-line load to the ground through the memory cell;

• Sense amplifiers. They are dominated mainly by a DC current; and

• Remaining periphery such as input/output buffer, write circuitry, etc.

Note that the power dissipated by the pads is not included. The power dissipation of the components, other than the memory array, depends on the total capacitances, the operating frequency and the internal voltage swing. It can include a DC component with a major contribution from the sense amplifier.
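To see how the three terms of Eq. (6.6) compare, the sketch below evaluates them separately, reading the equation as a selected-cell term, a retention term and a DC term. All numerical values are hypothetical operating points chosen only for illustration; they are not taken from the text.

```python
# Sketch only: every numeric value below is an assumed, illustrative operating point.

def array_active_power(m, n, p_act, p_ret, i_dc, dt, f, vdd):
    """Return the three terms of Eq. (6.6) in watts."""
    selected = m * p_act               # m cells on the selected word-line
    retention = (n - 1) * m * p_ret    # unselected rows, data retention only
    dc = m * i_dc * dt * f * vdd       # DC column current, flowing for dt each cycle
    return selected, retention, dc

t_rc = 70e-9                           # assumed cycle time
args = dict(m=128, n=1024, p_act=1e-6, p_ret=1e-12,
            i_dc=100e-6, f=1.0 / t_rc, vdd=3.0)

static_wl = array_active_power(dt=t_rc, **args)    # word-line held for the whole cycle
pulsed_wl = array_active_power(dt=10e-9, **args)   # word-line pulsed for 10 ns

for name, terms in (("static WL", static_wl), ("pulsed WL", pulsed_wl)):
    print(name, [f"{p * 1e3:.4f} mW" for p in terms])
```

With these assumed numbers the retention term is negligible and the DC term dominates, so shortening the activation time reduces the array power almost in proportion to the pulse width.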
To reduce the active power consumption many techniques can be used and are summarized as follows:

• Reducing the capacitances of the word-line and the number m of cells connected to it. This is possible by using Hierarchical Word-Line (HWL) techniques.

• Reducing the DC current by using the pulse operation technique for the word-line and the periphery circuits (including the sense amplifier).

• Use of multi-stage static CMOS decoding to reduce the AC current.

• Lowering the operating power supply voltage.
The standby power (sometimes expressed as the retention current) of an SRAM has a major contribution from the memory cells in the array if the sense amplifiers are disabled in this mode. It is given by

$$P_{standby} = m\, n\, P_{ret} \qquad (6.7)$$
One way to reduce the standby current is to reduce the operating voltage. However, note that the data-retention current will increase with memory capacity. Moreover, the leakage current, per cell, tends to increase because the threshold voltage is expected to be reduced for low-voltage operation.

In the following sections, many key circuits in an SRAM are reviewed. The circuit techniques and memory organization to reduce the active and data-retention currents are presented.
6.1.5 Address Transition Detector (ATD) Circuit

To generate the different timing signals for word-lines, equalization and sensing, an on-chip pulse generator, which detects the address change, is needed. It is based on the address transition detection technique. The ATD is a key technique to reduce the active power of memories. Fig. 6.14(a) shows the schematic diagram of an ATD pulse generator. Short pulses are generated with XOR circuits when the address changes from "L" to "H" or "H" to "L"; they are then summed through an OR gate. The overall pulse width is controlled by the RC delay line shown in Fig. 6.14(b). The corresponding waveforms are shown in Fig. 6.14(c). The ATD pulse is usually stretched out with a delay circuit to generate the different pulses needed in the SRAM. Note that the CS signal is also included as an input to the ATD generator.
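The XOR-and-OR behaviour of the ATD generator can be illustrated with a small discrete-time model. This is a behavioural sketch only: the RC delay line is assumed to be an ideal delay of a whole number of time steps, and the function name `atd_pulse` is hypothetical.

```python
# Behavioural sketch: the RC delay is modelled as an ideal integer-step delay (assumption).

def atd_pulse(address_samples, delay):
    """Return the ATD output (0/1) for each time step.

    address_samples: address word (int) sampled at each time step.
    delay: delay-line length in time steps; it sets the pulse width.
    """
    pulses = []
    for t, addr in enumerate(address_samples):
        delayed = address_samples[t - delay] if t >= delay else address_samples[0]
        # Bitwise XOR flags every address bit that differs from its delayed copy;
        # testing the result against zero ORs all the per-bit pulses together.
        pulses.append(int((addr ^ delayed) != 0))
    return pulses

# Address 00 -> 01 at step 3 (an L-to-H change), then 01 -> 11 at step 8.
samples = [0b00] * 3 + [0b01] * 5 + [0b11] * 5
print(atd_pulse(samples, delay=2))
# -> [0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0]
```

Each address change, in either direction and on any bit, produces one pulse whose width equals the delay, which is the property the word-line and sense-amplifier timing relies on.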
6.1.6 Decoders
Usually the decoding in an SRAM is performed by using complementary CMOS. Two kinds of decoders are used: the row and the column decoders. Fast static decoders are based on OR/NOR and AND/NAND gates. Fig. 6.15 shows an example of a two-bit input address row decoder. The input buffers have to drive the interconnect capacitance of the address lines and the input capacitance of the NAND gates. To match the pitch of the memory cell and to perform decoding for several blocks, two-stage decoders are used. The first stage performs predecoding and the second one performs the final decoding function [Fig. 6.16]. The two-stage decoder circuit has other advantages over the one-stage decoder, such as reducing the number of transistors and the fan-in. It also reduces the loading on the address input buffers. This predecoding technique optimizes both speed and power. In the last stage an additional signal, $\phi$, is included in the AND gate. This signal is generated from an ATD pulse generator to enable the decoder and ensure the pulse-activated word-line. There are several ways to build row decoders, depending on the RAM architecture division.
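The two-stage decoding idea can be sketched at the logic level as follows. The example assumes a 4-bit row address split into two 2-bit predecoders, which is an illustrative choice rather than the configuration of Fig. 6.16; the names `predecode`, `row_decode` and `phi` are hypothetical.

```python
# Logic-level sketch; the 4-bit address and the 2+2 bit split are assumptions.

def predecode(two_bits):
    """First stage: 2-to-4 predecoder, returns four one-hot lines."""
    return [int(two_bits == k) for k in range(4)]

def row_decode(addr, phi=1):
    """Second stage: AND one line from each predecoder with the enable pulse phi."""
    low = predecode(addr & 0b11)
    high = predecode((addr >> 2) & 0b11)
    # Word-line index 4*h + l is driven by a 3-input AND (high[h], low[l], phi).
    return [high[h] & low[l] & phi for h in range(4) for l in range(4)]

word_lines = row_decode(0b1001)
print(word_lines.index(1))        # -> 9
print(sum(row_decode(0b1001, phi=0)))   # -> 0, no word-line without the enable pulse
```

In this sketch every final gate has a fan-in of three, whereas a single-stage decoder for the same 4-bit address with an enable input would need five-input gates, and each address buffer drives only its predecoder instead of all sixteen final gates.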
The column decoder permits the selection of 1 out of m bits of the accessed row. Fig. 6.17(a) shows the circuits involved for column selection using an example of 4 columns. The selected gate permits the transfer of the data from the bit-lines to the common data-lines I/O. The signals $Y_i$ are controlled by the AND/NAND column decoder as shown in Fig. 6.17(b).
6.1.7 Bit-line Conditioning Circuitry

The NMOS bit-line loads [Fig. 6.18] have been used in many SRAMs at 5 V power supply. They provide a precharge level on the bit-lines of $V_{DD} - V_T$. The threshold voltage of the load, $V_T$, is subject to the body effect. A typical value of this precharge level for 5 V power supply is 3.5 V. This level is suitable for voltage-type sense amplifiers to provide large gain and fast sensing delay.

To reduce the DC current during the write cycle, a variable bit-line load technique can be employed [Fig. 6.19]. It realizes fast sensing in the read cycle and a short write pulse width in the write cycle. For fast sensing, the voltage swing of the bit-line should be small. To achieve this, the load impedance should be low. On the other hand, to obtain a low current during the write cycle, the load impedance of the bit-lines should be high. As shown in Fig. 6.19, during the read operation, all four NMOS transistors $N_1$, $N_2$, $N_3$, and $N_4$ are turned ON. The bit-lines are switched into a low-impedance state so that the voltage swing of the bit-lines is limited to a small value (e.g., 100 mV). During the write operation, the two larger NMOS devices are switched OFF and only the two small-size transistors are turned ON.
Figure 6.19 Variable load bit-lines.
As the power supply voltage is scaled down to 3 V, the precharge level can be lower than 2 V. Thus, during the read operation the high-level node of the memory cell can become equal to the bit-line voltage. Hence, the noise margin of the memory cell is drastically degraded and consequently the cell stability and soft error immunity are degraded. Therefore, at 3 V power supply voltage, a PMOS transistor can be used as the bit-line load [Fig. 6.20]. The bit-line precharge voltage is then $V_{DD}$. With the bit-lines precharged to this level at low supply voltage, special sense amplifiers should be used because conventional sensing circuits have poor voltage gain (less than 10). A variable impedance bit-line load, using PMOS transistors, can also be implemented.
6.1.8 Sense Amplifier
When reading a memory cell, the bit-lines are initially precharged, then one of the two bit-lines goes down, while the other stays high. The operation of pulling down the bit-line is very slow because the discharging MOS device in the memory cell is small and the bit-line capacitance is high. This results in a very slow memory read time. Sense amplifiers are used to detect the small variation on the bit-lines and amplify it to obtain a full-swing signal at the end. A simple unbalanced inverter with a high logic threshold voltage can be used. Since its input is single-ended and has a very small noise margin, it is very sensitive to noise on the bit-line. Thus, sense amplification for the data-lines is a key to achieving fast access time and low power dissipation. In general, the delay of a sense amplifier (from the time of word-line activation) represents 30 to 40% of the whole read access time.

Various kinds of sense amplifiers have been devised for fast sensing operation and low power dissipation. Fig. 6.21(a) shows a single-end sense amplifier with an active current-mirror. This structure forms the basis for many SRAM sense amplifier circuits. It has two differential inputs, DL and $\overline{DL}$. Noise affects both inputs equally and only the difference is detected. One NMOS transistor acts as a current source. Before the signal $\phi_{SA}$ is asserted, the data-lines DL and $\overline{DL}$ are high. All the nodes, A, B and C, are high. The signal $\phi_{SA}$ is asserted when DL starts, for example, to drop slowly. In this case, the current-source transistor is ON. The output voltage (node C) drops suddenly to a certain voltage. Thus, the input signal is amplified by the gain of this differential amplifier.
Fig. 6.21(b) shows the voltage waveforms of the single-end sense amplifier using SPICE simulation. The signal $\phi_{SA}$ is generated with an ATD pulse. It is asserted for a time long enough to amplify the small variation (a few hundred mV) on the data-lines¹, then it is deactivated. In this scheme the DC current consumed by the sense amplifier is cut off. Usually the sense amplifier is common to many columns through the common data-lines. The small-signal gain of this amplifier is given by

$$A = \frac{g_{mn}}{g_o} \qquad (6.8)$$

where $g_{mn}$ is the transconductance of the driver NMOS $N_d$ and $g_o$ is the combined output conductance of the PMOS load and the NMOS driver.
In many SRAMs multi-stage sense amplifiers are needed to attain large voltage gain. In this case, the double-end sense amplifier is used as shown in Fig. 6.22. This circuit has often been used in many SRAMs. To attain high-speed data sense, a two- or three-stage sense amplifier technique can be adopted. Fig. 6.23 shows a two-stage amplifier structure. An equalization technique is used for the data-lines, using the equalization pulse $\phi_{eq}$, which is generated with an ATD pulse. It is indispensable, not only to attain faster data transfer during read operation, but also to suppress incorrect data before the correct data appears in the sense amplifier [17].

¹ The output of the sense amplifier is then latched.

For low-power applications, and also due to the plastic packaging limitations of static memories, this type of sense amplifier can result in high power dissipation for high-density memories even if the current source is pulsed. Many circuits have been proposed to reduce the power of the sense amplifier while improving the sensing delay time. One of them is the PMOS cross-coupled amplifier [18] shown in Fig. 6.24. The PMOS loads, $P_1$ and $P_2$, are cross-coupled and the differential outputs $S$ and $\overline{S}$ are connected to their gates. The positive feedback in this latch amplifier permits much faster sense speed than the conventional one. In this circuit the equalization technique is used for the reasons discussed above.

Figure 6.24 PMOS cross-coupled sense amplifier.

Fig. 6.25 shows the sense delays of both the PMOS cross-coupled amplifier and the double-end current-mirror amplifier as a function of the average current of the amplifier. The input voltages simulate the common data-line voltages, and the sense delay $t_d$ is defined as the delay time from the crossover point of the input voltages to the point when the output reaches 1 V difference. The PMOS cross-coupled amplifier has less than half the delay of the conventional current-mirror sense amplifier. Moreover, this latch amplifier consumes less than one-fifth of the power of a current-mirror amplifier. The PMOS cross-coupled latch amplifier requires much more accurate timing for $\phi_{SA}$ to optimize the sensing delay [18]. This circuit also has a low-power property compared to the current-mirror amplifier since it has nearly full-swing outputs with positive feedback.

Figure 6.25 Sense delay versus average current for the PMOS cross-coupled and conventional current-mirror sense amplifiers (0.6 µm CMOS).
When the power supply voltage is scaled to 3 V, the data-line voltage is near $V_{DD}$, so a level shifting can be performed. Fig. 6.26 shows a two-stage sense amplifier used for 3.3 V supply. The first stage is a cross-coupled NMOS amplifier which also performs level shifting of the common data-line voltage. In the second stage, a conventional sense amplifier is used which operates at the maximum-gain point since the levels on SA and $\overline{SA}$ are medium levels.

Fig. 6.27 shows another sense amplifier developed for low-voltage power supply [19]. This circuit is used when the bit-lines are close to $V_{DD}$, where the gain of a conventional current-mirror amplifier is poor. The circuit is composed of a level-shift circuit and a conventional current-mirror amplifier. The level-shifter shifts the bit-line voltage to a medium voltage, 0.6 to 0.7 V (at 1 V power supply voltage), where the gain is maximum. Low-$V_T$ NMOS devices $N_1$ and $N_2$ are used to provide these medium levels. These devices are subject to the body effect. Recently, current sense amplifiers have been proposed to overcome the gain reduction of voltage amplifiers at low power supply [7, 12]. They also reduce the power dissipation of the sensing operation compared to voltage sense amplifiers at the same delay. These circuits require very careful design.
6.1.9 Output Latch
In low-power SRAM, the pulse technique for the word-line and sense amplifier is indispensable in order to reduce the DC current. In such a pulse mode, a data-latch circuit is required to store the data amplified by the sense amplifier from the memory cell for the data output circuitry. Fig. 6.28 shows an example of an output latch placed after the sense amplifier. The requirements of such an output latch are the following:

• The latch circuit must not delay the read access time. Such a requirement is attained by connecting the latch with the data-bus lines in parallel. One input transmission gate, controlled by $\phi_I$, is used to enter the data into the latch. Another transmission gate, controlled by $\phi_O$, is used to put the data back onto the data-bus.

• The latched data must not be destroyed by the noise entering the SRAM. A noise in an SRAM is generated and propagated by the following mechanism. On the system board, a ground noise can enter the SRAM. When the peak level of the ground noise becomes large enough for the first gate of the address buffer to change the logic value of the address input, an ATD pulse noise is generated. This noise pulse could turn on the word-line and the sense amplifier for a short time, resulting in an unexpected signal on the data-bus. Therefore, the latched data could be destroyed if the input gate is ON. To avoid such a problem, two circuit techniques are included in the circuit of Fig. 6.28. The first one is the generation of $\phi_I$ only when the pulse width of the ATD is large enough, compared to that of the noise. The other circuit technique is to place latch-protecting inverters [Fig. 6.28] in front of the output gates. The inverters prevent noise from entering the output gates.

• The new data must be quickly latched into the data-latch. The circuit of Fig. 6.28 can be optimized for fast operation.
6.1.10 Hierarchical Word-Line for Low-Power Memory

With the increased memory size, the word-line delay and the column power increase. To solve this problem, a Divided Word-Line (DWL) structure was proposed [20]. The concept of DWL is shown in Fig. 6.29. The cell array and the word-line are divided into $n_B$ blocks (sub-arrays). If the SRAM has $n_C$ columns, each block has $n_C/n_B$ columns. The divided word-line of each block is activated by the main word-line and the corresponding block select signal. Consequently, only the memory cells connected to one divided word-line within a selected block are accessed in a cycle. Hence, the column current is reduced, since only the selected columns switch.

Figure 6.29 Divided Word-Line (DWL) concept [20].

Moreover, the word-line selection delay, which is the delay time from the address input to the divided word-line, is reduced. This delay is composed of the main word-line select delay and the divided word-line select delay. The main word-line selection delay is reduced compared to the conventional one, because the total capacitance of the connected transistors is reduced. In a conventional SRAM, the word-line has the gates of all the memory cells of a row connected to it. The main word-line delay increases as the number of blocks increases because the number of block select gates increases. On the other hand, the divided word-line delay decreases as the number of connected cells is reduced with the increasing number of blocks. Consequently, the word-line selection delay has a minimum for a certain number of blocks.
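The trade-off described above can be illustrated with a toy model. It assumes that the main word-line delay grows linearly with the number of blocks and that the divided word-line delay grows linearly with the number of cells per block; both coefficients are arbitrary illustration values, not fitted to any measured data, and the function names are hypothetical.

```python
# Toy model: linear delay terms and their coefficients are assumptions for illustration.

def active_columns(n_c, n_b):
    """Columns that switch per access when n_c columns are split into n_b blocks."""
    return n_c // n_b

def wordline_delay(n_c, n_b, k_main=1.0, k_div=0.25):
    """Main word-line term (grows with n_b) plus divided word-line term (shrinks with n_b)."""
    return k_main * n_b + k_div * n_c / n_b

n_c = 256                                  # assumed number of columns
candidates = [1, 2, 4, 8, 16, 32]
for n_b in candidates:
    print(n_b, active_columns(n_c, n_b), wordline_delay(n_c, n_b))

best = min(candidates, key=lambda n_b: wordline_delay(n_c, n_b))
print("block count with minimum delay in this toy model:", best)
```

The column activity falls monotonically with the number of blocks, while the delay passes through a minimum; where that minimum lies in a real design depends on the actual capacitances and driver sizes.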
Fig. 6.30 shows the effect of the number of blocks in the DWL structure on the word-line select delay and the column power for a 64-Kb SRAM [10]. In this example, a number of blocks of eight can be chosen. The area penalty for this case is only 5%, compared to the conventional memory. As an example, for a 1-Mb SRAM, the cell array is divided into 16 blocks and each block consists of 512 rows by 128 columns. A 9-bit address ($A_0 \ldots A_8$) is used to select a row within a block using a two-stage row decoder. Global block selection is done using a 4-bit address.

The DWL structure has been widely used in high-density SRAMs for its low-power, high-speed characteristics. However, in high-density SRAMs with a capacity of more than 4 Mb, the number of blocks in the DWL structure will have to increase. Therefore, the capacitance of the global word-line increases, causing the delay and power to increase. To solve this problem, the concept of Hierarchical Word Decoding (HWD) was proposed in [21], as shown in Fig. 6.31. The word select line is divided into more than two levels. The number of levels (hierarchy) is determined by the total load capacitance of the word select line so as to distribute it efficiently. Hence, the delay and the power are reduced. For 4-Mb, three levels of hierarchy have been used with 32 blocks, each block having 128 columns by 1024 rows. Fig. 6.32 compares the delay time and the total capacitance of the word decoding path for the optimized DWL and HWD structures of 256-Kb, 1-Mb, and 4-Mb SRAMs. For a 256-Kb SRAM there is no significant advantage of HWD over DWL. However, for high-density SRAMs the advantage of HWD, in terms of power and delay, becomes clear. The three-level scheme can be used efficiently for 16-Mb SRAMs.
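The 1-Mb organization quoted above (16 blocks, each 512 rows by 128 columns) can be expressed as an address split. The sketch below assumes a 1-bit-wide organization and an arbitrary ordering of the address fields; neither is stated in the text, and the function name is hypothetical.

```python
# Sketch: field ordering and x1 organization are assumptions, the field widths follow the text.

ROW_BITS = 9      # 512 rows per block, A0...A8
BLOCK_BITS = 4    # 16 blocks
COL_BITS = 7      # 128 columns per block

def split_address(addr):
    """Split a 20-bit cell address into (block, row, column)."""
    row = addr & ((1 << ROW_BITS) - 1)
    block = (addr >> ROW_BITS) & ((1 << BLOCK_BITS) - 1)
    col = (addr >> (ROW_BITS + BLOCK_BITS)) & ((1 << COL_BITS) - 1)
    return block, row, col

total_cells = 16 * 512 * 128
print(total_cells == 2 ** (ROW_BITS + BLOCK_BITS + COL_BITS))   # -> True
print(split_address(512))          # -> (1, 0, 0)
print(split_address(2**20 - 1))    # -> (15, 511, 127)
```

Only the block selected by the 4-bit field has its divided word-line raised, so one access activates 128 columns rather than the 2048 columns that a single undivided word-line across all 16 blocks would drive.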
6.1.11 Low-Voltage SRAM Operation and Circuitry

There are several applications which need a 1.2 V battery power supply. For such applications 1 V SRAMs are needed. At 1 V power supply, a stable operation is targeted and it is very important that the noise is reduced. Moreover, the active and standby powers should be reduced to meet the requirement of battery operation. For 1 V power supply, a full CMOS memory cell has a lower power dissipation in standby mode and greater immunity to transient noise and voltage variation than other cells. It can also operate at the lowest supply voltages. Although a full CMOS cell operates well at ultralow voltage, its area is almost double that of the PMOS TFT cell. Hence it is not suitable for high-density memories (size > 4 Mb). When the full CMOS memory cell is operated at 1 V power supply, a typical cell ratio is 3 for stable operation. The SNM of this cell, at 1 V, can be almost the same as for a poly-Si load memory cell at 5 V.

When using the full CMOS cell, no boosting of the word-line is needed to write a high voltage level in the cell. However, the PMOS TFT cell requires a boosted voltage ($V_{ch} > V_{DD}$) on the word-line during the write cycle [19]. If the voltage of the word-line is raised only to $V_{DD}$ in the write cycle, the high node B of Fig. 6.33 is initially at $V_{DD} - V_T$, where $V_T$ is the threshold voltage of the access device subject to the body effect. This low level ($V_{DD} - V_T$) of the node B cannot charge up to $V_{DD}$ because of the poor drivability of the PMOS TFT device. When the boosted word-line technique is applied to the PMOS TFT cell during a write cycle, a problem can arise. The unselected cells connected to the boosted common word-line suffer from an instability problem because a large current flows through the low node of the cell. This large current is due to the high voltage on the access transistor. Consequently, this technique is not suitable for 1 V operation.
Figure 6.34 Two-step technique for 1 V operation [19].

Figure 6.35 (a) TSW and write circuits [19].
A Two-Step Word (TSW) voltage technique has been proposed by Ishibashi et al. [19] to solve the cited problem. Fig. 6.34 shows the block diagram of the proposed memory. The boosted-level generator¹ generates a voltage $V_{ch}$ = 1.5 V for $V_{DD}$ = 1 V. The word-line voltage has two steps: one is $V_{DD}$ and the other is $V_{ch}$. The circuitry for the TSW method is shown in Fig. 6.35(a). When $\phi_x$ goes to zero, the signal WL is raised to $V_{WL} = V_{DD}$. Then, when $\phi_{ch}$ is asserted with a high level equal to $V_{ch}$, the transistor $P_1$ turns ON and the WL level is increased to $V_{WL} = V_{ch}$. In this case, the low-threshold-voltage NMOS device turns OFF and the word-driver inverter is isolated to reduce any leakage current. Fig. 6.35(b) shows the voltage waveforms for the TSW circuitry in read/write modes. During the write cycle, the high node A is first charged to a low voltage, then raised to $V_{DD}$. The bit-lines are initially floating, then precharged at the end of the write cycle. In the next read cycle, the bit-lines are floating. Before the word-line voltage rises to $V_{ch}$, the cell discharges BL through the low node B. Thus, when the word-line has risen to $V_{ch}$, current does not flow in the cell and the node B stays at a low voltage level. Note that this technique requires multi-$V_T$ CMOS devices and causes a delay in writing because the bit-lines are discharged before writing.

¹ The boosted-level generator is presented in Section 6.2.11.
However, the low-voltage SRAMs discussed above require a relatively high threshold voltage, $V_T \geq 0.5$ V. Thus, their speed is quite slow. As an example, a 256-Kb SRAM with full CMOS memory cells attained 3 µs access time at 1 V power supply using 0.8 µm CMOS technology [22]. The active power at 0.1 MHz is 0.2 mW and the standby power is 5 nW. Another example is a 1-Mb SRAM with full CMOS memory cells which achieves 200 ns access time at 1 V power supply using 0.5 µm CMOS technology [23]. The active power at 1 MHz is 0.1 mW and the standby power is 10 nW. Note that if the threshold voltage is too low for ultra-low-voltage applications, all the circuits composing the SRAM will suffer from subthreshold current leakage. Thus, the retention current increases drastically, causing a serious problem for low-power applications. Moreover, the temperature effect and the threshold voltage variation enhance this current. So far, no practical solution has been proposed.
6.2 DYNAMIC RAM

The first dynamic RAM (DRAM) was introduced in 1970 with a capacity of 1-Kb. Since then, the density has quadrupled every three years (one generation). Recently, some experimental 256-Mb DRAMs were reported [24, 25, 26]. At present, low-voltage 16-Mb DRAMs run in high-volume production. The development of these higher densities has made DRAMs the cheapest per bit compared with other types of memories. They are widely used as the main memory of mainframes, PCs, and workstations. The access time has been decreased from a few hundred ns for 4-Kb DRAMs to less than 50 ns for 256-Mb. Also the power dissipation has been reduced by an order of magnitude from 4-Kb capacity to 256-Mb capacity, reaching 50 mW at 1.5 V power supply. The area of the memory cell has been reduced from more than 100 µm² for 64-Kb DRAM to 1.28 µm² for 64-Mb DRAM.

In addition to the trend for higher-density standard DRAMs, there are two other trends: Low-Power (LP) DRAMs, and high-speed DRAMs. The high-speed DRAMs sacrifice the retention current as well as density for faster access time. Low-voltage low-power DRAMs are becoming important particularly for battery operation. LP DRAMs extend the time of battery operation as well as battery back-up operation. The active current of LP DRAMs has been lowered. The data-retention current has also been reduced but it is still about one order of magnitude higher than that of SRAMs¹.

¹ This comparison is made for 1-Mb memories.

The 5 V power supply standard has been used externally for many DRAM generations from 64-Kb to 16-Mb. This was followed by 64-Mb DRAM powered with external 3.3 V, not only to reduce the power dissipation, but also to ensure reliability. The gate oxide reliability limits the maximum voltage, which is related to the boosted voltage inside the chip. Regarding the internal voltage, 5 V can be used up to a maximum DRAM capacity of 4-Mb. At the 16-Mb generation, the internal voltage is 3.3 V while maintaining external 5 V with an on-chip voltage down converter [see Section 6.3]. However, the 3.3 V external power supply will dominate.

Figure 6.36 Trends of DRAM supply [28].
Recently, activities to realize 1.5 V battery-operated DRAMs are accelerating the trend in low-voltage operation [27, 28, 29]. Fig. 6.36 shows the trend of DRAM supply [28]. In battery operation, the chip must be operated on a variety of batteries with various supply voltages for a long term and under supply fluctuations.
6.2.1 Basics of a DRAM
In general the pins of a DRAM are:

• Address; which is separated in time into two separate fields. These fields are the row and column address.

• Row Address Strobe ($\overline{RAS}$). The row address is clocked by this signal.

• Column Address Strobe ($\overline{CAS}$). The column address on the multiplexed pins is clocked by this signal.

• Write Enable ($\overline{WE}$).

• Input/output data pins.

• External power supply pins.
It is clear that the multiplexed address penalizes the access delay, so for fast DRAMs separate address input pins can be used. The multiplexing permits the reduction of the pin count and the cost of packaging. An example of DRAM timing, using the address multiplexing during read mode, is shown in Fig. 6.37. Some important times are shown, such as the access time from row, $t_{RAC}$, the row address strobe cycle time (or cycle time), $t_{RC}$, and the row address strobe low-state time, $t_{RAS}$. Fig. 6.38 shows a general 4-Mb DRAM architecture. It uses almost the same circuit techniques as SRAM except for the memory array. Some additional circuits are needed, such as a Back Bias Generator (BBG), a Half-Voltage Generator (HVG), an optional Voltage-Down Converter (VDC), a Reference Voltage Generator (RVG), and a boosted voltage generator circuit. The substrate back-bias voltage is indispensable for stable operation of the DRAM array. The half-voltage generator permits generation of the precharge level for the bit-lines at half $V_{DD}$, as explained in the following sections. The reference voltage generator is needed for the VDC. The boosted voltage generator uses a charge-pump circuit and permits overdriving of the word-line WL to a voltage higher than $V_{DD}$. More details on these circuits composing the DRAM are given in the following sections.
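The address multiplexing described above can be sketched as a simple split of the cell address into the two fields presented on the shared pins. The example assumes a 4-Mb, 1-bit-wide device with 11 multiplexed address pins and assumes that the upper half of the address is used as the row field; both are illustrative choices, and the function names are hypothetical.

```python
# Sketch: 4M x 1 organization, 11 shared pins and the row/column bit ordering are assumptions.

ADDRESS_PINS = 11

def multiplex(addr, pins=ADDRESS_PINS):
    """Split a cell address into the two fields sent over the shared address pins."""
    mask = (1 << pins) - 1
    row = (addr >> pins) & mask    # presented first, clocked in by the row address strobe
    col = addr & mask              # presented second, clocked in by the column address strobe
    return row, col

def demultiplex(row, col, pins=ADDRESS_PINS):
    """Rebuild the full cell address inside the chip from the two latched fields."""
    return (row << pins) | col

addr = 0x2ABCDE                    # some address below 2**22
row, col = multiplex(addr)
print(row, col, demultiplex(row, col) == addr)
print("address pins needed:", ADDRESS_PINS, "instead of", 2 * ADDRESS_PINS)
```

The pin count is halved at the cost of presenting the address in two sequential phases, which is the access-delay penalty mentioned in the text.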
6.2.2 DRAM Memory Cell
CMOS DRAMs, with three-transistor and four-transistor cells, were used in the 1- and 4-Kb generations. The one-transistor (1T) cell offers smaller chip size and low cost. These justify the process complexity needed to fabricate the 1T cell, particularly its capacitor. A schematic of a 1T DRAM cell is illustrated in Fig. 6.39(a). The charge is stored in the capacitor $C_s$. To prevent loss of the stored information, the capacitor must be refreshed within a specific time with special circuitry. The bit-line has a capacitance $C_{BL}$ including the parasitic load of the connected circuits. Typical values for the storage and the bit-line capacitors are 30 fF and 250 fF, respectively. The ratio $R = C_{BL}/C_s$ is very important for the sensing operation.
During the read operation (WL is selected) the bit-line voltage changes by

$$\Delta V_{BL} = \frac{V_{MC} - V_{BL}}{1 + R} \qquad (6.9)$$

where (VMC - VBL) is the difference between the memory cell voltage and the bit-line voltage before the selection of the cell. A typical value of the difference is VDD/2. Hence, we have for the bit-line sense signal

$$V_s = \frac{V_{DD}}{2(1 + R)} \qquad (6.10)$$

For a 3.3 V supply voltage, and using a ratio R = 8 for a 16-Mb DRAM, the sense signal is Vs = 180 mV. This small voltage change of the bit-line requires sensing circuits. For low-voltage operation, Vs decreases, thus a low ratio R is required. This is possible by reducing CBL and increasing Cs.
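The sense-signal arithmetic can be checked with a short Python sketch. It assumes the usual charge-sharing form, in which the bit-line swing is the initial cell-to-bit-line difference divided by (1 + R); the function names are hypothetical and chosen only for illustration.

```python
def bitline_swing(v_cell, v_bitline, c_bl, c_s):
    """Bit-line voltage change after charge sharing, (V_MC - V_BL) / (1 + R)."""
    r = c_bl / c_s  # capacitance ratio R = C_BL / C_s
    return (v_cell - v_bitline) / (1.0 + r)


def sense_signal(vdd, r):
    """Sense signal V_s = V_DD / (2 (1 + R)) for a V_DD/2 initial difference."""
    return vdd / (2.0 * (1.0 + r))


if __name__ == "__main__":
    # 3.3 V supply with R = 8, the 16-Mb example quoted in the text
    print(f"V_s = {sense_signal(3.3, 8) * 1e3:.0f} mV")
    # Typical capacitances quoted in the text: C_s = 30 fF, C_BL = 250 fF
    dv = bitline_swing(3.3, 1.65, 250e-15, 30e-15)
    print(f"swing with 30 fF and 250 fF = {dv * 1e3:.0f} mV")
    # Lower supply voltage gives a smaller signal for the same R
    print(f"V_s at 1.5 V = {sense_signal(1.5, 8) * 1e3:.0f} mV")
```

The last line illustrates why a lower R is needed at low voltage: with R fixed, the sense signal scales directly with VDD.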
Cs was implemented using a simple planar-type capacitor, as shown in the structure of Fig. 6.39(b). This structure was used in DRAMs with capacity up to 1-Mb. With the increased density, many three-dimensional approaches were used for DRAMs with capacity higher than 1-Mb. One approach is to stack the capacitor over the access transistor (STC cell). Another approach is to use a trench capacitor. For more details on advanced cell structures the reader can consult [30, 31].
The signal charge (Qsig = Cs ΔVs) transferred to the bit-line during a read operation should have enough margin against noise. The sources of noise are the following:
■ bit-line noise, which is caused by capacitive couplings and other sources;
■ leakage charge, which is mainly due to the leakage in the junction of the NMOS transistor of a 1T memory cell; and
■ α-particle-induced soft errors.
In the early DRAMs, the plate of the capacitor was grounded to reduce the noise injection from the VDD power supply. However, for multi-Mb DRAMs, a VDD/2 bias of the cell plate was used. This scheme has several advantages, such as the reduction of the stress on the thinner oxide of the storage capacitor and the reduction of supply voltage noise. Many 1-Mb DRAMs have used this cell biasing scheme.
For Gb DRAM cell design with reduced VDD, the ratio R should be reduced. This is possible by reducing the bit-line capacitance, CBL, and increasing the storage capacitance Cs. On the other hand, the area occupied by Cs should be reduced to increase the chip capacity. One solution for reducing the area of Cs is the use of a capacitor insulator with extremely high permittivity ε, such as a film of a ferroelectric material like BaSrTiO3. Consequently, a simple planar-type capacitor can be used in that case.
6.2.3 Read/Write Circuitry
Fig. 6.40 illustrates the different circuits for read, write, precharge, and equalization functions. The read operation is performed as follows. Initially both the bit-lines (BL and BL-bar) are precharged to Vp, which is equal to VDD/2, and equalized before the data reading operation. This half-VDD precharge technique permits the reduction of the active power dissipation, as discussed in Section 6.2.9. The signal WL is selected by the row decoder. The high level of the word-line voltage has to be greater than VDD to increase the stored charge in the memory cell. The selected memory cell is connected to one bit-line. Then ΔVBL (100 to 200 mV) appears between the bit-lines immediately after the word-line rises. It is then amplified by the latch-type CMOS sense amplifier which is connected to both bit-lines. After the sensing and the restoring operations, the voltage levels of the bit-lines are in a full-swing condition. The bit-line differential voltage signal is transferred to the differential output-lines (O and O-bar) through a read circuit. The signal YR is selected almost at the same time as WL. The parasitic capacitance of the output-line is large (a typical value is 2 pF for a 4-Mb DRAM), and the readout circuit would need a long time to amplify the output-line signal. A main sense amplifier is used to read the output-lines, then the data is selected among several main SAs connected to different sub-arrays. Finally it is transferred to the output buffer.
The DRAM cell readout mechanism is destructive, and hence the same data must be rewritten to the cell on every read access. Consequently, on each bit-line pair, a CMOS amplifier is needed to amplify and restore the level. This mechanism is not needed in SRAMs since the read operation is non-destructive.
In the write mode, the YW signal is selected by a column decoder as shown in Fig. 6.40. In this case, the write control signal is activated. The selected bit-lines are connected to a pair of write-lines W and W-bar, and the data are transferred to the memory cell when WL goes HIGH.
6.2.4 Low-Power Techniques

Fig. 6.38 can be used to identify the different sources of power dissipation in a DRAM. For simplicity we assume that the internal supply voltage is the same as the external one. The total power dissipated is the sum of two components: the active power and the data-retention power. The active power is the sum of the power dissipated by the following components:
■ The decoders (row and column);
■ The memory array. This is the dominant one. If m memory cells are connected to the word-line, the active power of the memory array is given by

$$P_{memarray} = m \times P_{actm} \qquad (6.11)$$

where Pactm is the power dissipated in active mode by each of the m selected cells. It is given by

$$P_{actm} = C_{BL}\,\Delta V_{BL}\,V_{DD}\,f \qquad (6.12)$$

■ The sense amplifier;
■ Other circuits such as the refresh circuit, substrate back-bias generator, boosted level generator, a voltage reference circuit, and a half-VDD generator. These circuits also dissipate a DC current;
■ The rest of the periphery, such as the main sense amplifier, input/output buffers, write circuitry, etc.
Note that the power dissipated by the pads is not included.
To reduce this active power, many techniques can be used. They are summarized as follows:
■ Reducing all capacitances, particularly the bit-line and word-line capacitances. As seen from Equations (6.11) and (6.12), m × CBL should be reduced. Techniques which permit this are partial activation of a multi-divided bit-line and shared I/O [see Section 6.2.7]. Also, to reduce the word-line capacitance, a technique such as partial activation of a multi-divided word-line can be used [see Section 6.2.8];
■ Lowering the internal VDD. This includes the generation of half-VDD for precharging the bit-lines and reducing the external supply voltage; and
■ Reducing the DC power required by periphery circuits. This is possible by using static CMOS decoders and the pulse operation technique using an ATD circuit (as in SRAMs).
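The effect of the first two techniques on the array power of Equations (6.11) and (6.12) can be illustrated with a small Python sketch. The operating point (number of cells, frequency) is hypothetical and not taken from the text, and the function names are illustrative only.

```python
def cell_active_power(c_bl, dv_bl, vdd, f):
    """Per-bit-line active power, C_BL * dV_BL * V_DD * f (form of Eq. 6.12)."""
    return c_bl * dv_bl * vdd * f


def array_active_power(m, c_bl, dv_bl, vdd, f):
    """Array active power for m cells on the selected word-line (Eq. 6.11)."""
    return m * cell_active_power(c_bl, dv_bl, vdd, f)


if __name__ == "__main__":
    # Hypothetical operating point, chosen only to show the trends
    m, c_bl, vdd, f = 4096, 250e-15, 3.3, 10e6
    cases = {
        "full-VDD swing": array_active_power(m, c_bl, vdd, vdd, f),
        "half-VDD swing": array_active_power(m, c_bl, vdd / 2, vdd, f),
        "half-VDD, C_BL / 4": array_active_power(m, c_bl / 4, vdd / 2, vdd, f),
    }
    for name, power in cases.items():
        print(f"{name:>20}: {power * 1e3:.2f} mW")
```

Because the expression is linear in CBL and in the bit-line swing, halving the swing or dividing the bit-line capacitance scales the array power by the same factor.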
The data retention power in a DRAM is mainly due to the refresh operation and to the DC power (IDC) of peripheral circuits such as BBG, BVG, RVG and HVG. The refresh process is performed by reading the m cells connected to each word-line and restoring them. Thus, n refresh cycles are needed for an n × m DRAM. It can be estimated by

$$P_{retention} = \frac{P_{active}}{f}\,\frac{n}{t_{refresh}} + P_{DC} \qquad (6.13)$$

where Pactive/f is the total dynamic energy of one cycle (f is the operating frequency) and n/trefresh is the refresh rate, trefresh being the refresh time of the m cells of a word-line. To reduce the power dissipation due to the refresh mode, one obvious technique is to increase trefresh and decrease n.

PDC is the AC and DC power dissipated by the other circuits such as VDC, BBG, RVG, HVG, and boosted level generator. To reduce this power many techniques can be used. One of them is to reduce, in data retention mode, the frequency of operation of the circuits which dissipate high power during active mode. Another one is to reduce the DC current of these circuits using, for example, a dynamic concept.

Figure 6.41 Static CMOS word-line driver.
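As a rough numerical illustration of the refresh trade-off, the sketch below assumes that the retention power is the energy of one refresh cycle multiplied by the refresh rate (n cycles every refresh time), plus a constant peripheral term. All numbers are hypothetical and the function name is illustrative.

```python
def retention_power(e_cycle, n, t_refresh, p_dc):
    """Retention power: n refresh cycles of energy e_cycle per t_refresh, plus p_dc."""
    return e_cycle * n / t_refresh + p_dc


if __name__ == "__main__":
    # Hypothetical values: 10 nJ per refresh cycle, 4096 word-lines, 50 uW of DC power
    e_cycle, n, p_dc = 10e-9, 4096, 50e-6
    for t_refresh in (64e-3, 128e-3, 256e-3):
        p = retention_power(e_cycle, n, t_refresh, p_dc)
        print(f"t_refresh = {t_refresh * 1e3:5.0f} ms -> {p * 1e6:6.1f} uW")
```

Under this assumption, doubling the refresh time halves the refresh component, after which the peripheral DC term becomes the dominant contribution.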
In the following sections, the circuit techniques to reduce the active and data-retention power dissipation are presented. Also, the different circuits constituting a DRAM are described and the low-power issues of these circuits are discussed.
6.2.5 Decoder
In a DRAM, static CMOS NAND decoders are used. The power is reduced by using the predecoding technique. This topic is discussed further in Section 6.1.6 for SRAMs. Fig. 6.41 shows a static CMOS word-line driver. The boosted level, Vch, generated by an internal charge pump circuit, is used in the output stage. When node A is high at (VDD - VT), the output inverter leaks a high DC current because this level is lower than Vch by at least two threshold voltages, subject to body effect. Therefore, a small-size PMOS transistor P1 is used to restore the level of node A to the Vch level. This transistor also permits the latching of the low output level (ground). The Xi signal, when selected, is normally at VDD. The unselected Xi is discharged to ground in the selected block before the row decoder selection.
6.2.6 Sense Amplifier

The main sense amplifier is the main source of DC current during the active mode. It employs the same sense amplifier discussed in Section 6.1.8 for SRAMs. The DC current can be shut down using the ATD technique.
6.2.7 Bit-Line Capacitance Reduction

Reducing the bit-line capacitance not only reduces the power dissipation but also improves the signal-to-noise ratio of the memory cell. This is possible by two approaches:
1. Reducing the number of memory cells n per bit-line. In this case, the multi-divided bit-line technique is used.
2. Reducing the junction capacitances of connected transistors such as access devices. One possible solution is the back-bias of the substrate containing these devices. A negative voltage on the substrate reduces the junction capacitance. In addition, the use of the trench isolation technique for CMOS devices rather than the LOCOS isolation results in almost 50% reduction in capacitance.
Fig. 6.42 shows the principle of the multi-divided bit-line architecture for the memory array. The m × n array is now divided into m columns by k subarrays. Each subarray contains n/k word-lines. In this scheme the bit-line capacitance CBL is reduced by dividing it into k sections. The signal-to-noise ratio of the cell is also improved. Fig. 6.43 illustrates an example of a 1-Mb DRAM [32]. The memory is divided into two parts, upper and lower. One part is divided into N = 16 sub-arrays and the total number of subarrays is k = 32. Two sub-bit-lines share one amplifier and are selected by the isolation signals ISO and ISO-bar. Thus, a partial activation is performed by selecting only one SA along the bit-line. The switch SW is controlled by the Y signal from the shared column decoder. This signal runs in parallel to the bit-lines and uses metal-2. Thus, the I/O is shared by two sub-bit-lines. This principle results in reduced power dissipation and chip size. It has been used for many DRAM generations up to 16 Mb.
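The benefit of dividing the bit-line can be sketched numerically. The Python example below assumes, as an idealisation, that the bit-line capacitance scales linearly with the number of attached cells, so that each of the k sections carries CBL/k; the undivided capacitance value is hypothetical.

```python
def divided_bitline(c_bl_total, c_s, k, vdd):
    """Return (R, V_s) for one of k equal bit-line sections.

    Assumes each section carries C_BL / k, which ignores the fixed
    capacitance added by the extra sense amplifiers and switches.
    """
    r = (c_bl_total / k) / c_s
    v_s = vdd / (2.0 * (1.0 + r))
    return r, v_s


if __name__ == "__main__":
    # Hypothetical undivided bit-line of 960 fF with a 30 fF storage capacitor
    for k in (1, 2, 4, 8):
        r, v_s = divided_bitline(960e-15, 30e-15, k, 3.3)
        print(f"k = {k}: R = {r:5.1f}, V_s = {v_s * 1e3:6.1f} mV")
```

Both trends described above appear together: a smaller R raises the sense signal, and the smaller section capacitance lowers the charging power per access.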
6.2.8 Multi-Divided Word-Line
Figure 6.43 Multi-divided bit-line architecture with shared SA, I/O and column decoder [21].
Fig. 6.44 shows the hierarchical word-line structure proposed for a 256-Mb DRAM [26]. This scheme resembles the one used in the SRAM. The DRAM cell array is divided into several blocks and each one is itself divided into subarrays. The Sub-Word-Line (SWL) circuitry is embedded in the subarray. Only one SWL is activated by the Main-Word-Line (MWL) and the row select signal. It is common to two sub-arrays, as shown in Fig. 6.44. Thus, only two cell subarrays are activated, which represents a very small portion of the total cell array. In the case of the 256-Mb DRAM, the active cell array size is 1/1024 of the total. This structure results in reduced active current and ground bounce.
6.2.9 Half-voltage Generator

One efficient technique to reduce the memory array operating current is the half-VDD bit-line precharge [33, 34]. During the sensing operation, one bit-line switches from VDD/2 to VDD and the other switches to zero. This reduces the power by almost half compared to the full-VDD precharge case, and reduces the peak current as well. Note that the reduction in peak current leads to suppression of noise. In addition, the precharge time is reduced and the cycle time is shortened. This precharging technique has been used starting from the 1-Mb DRAM generation.
A simple circuit which permits the generation of this half-VDD is shown in Fig. 6.45. The HVG circuit is composed of two stages. One stage is a bias generator which generates two voltage levels: (VDD/2 + VT) and (VDD/2 - VT). The second one is the push-pull output stage which generates the level VDD/2 distributed to the memory array. The load capacitance seen by the push-pull output stage is huge. A typical value is a few tens of nF. A typical response time when the circuit is powered up is a few tens of µs at 3.3 V power supply voltage for a 16-Mb DRAM. This HVG circuit has many disadvantages such as
duty ratio of the HVGE signal in the data-retention mode. To solve the other problems cited, an HVG circuit was proposed in [28], but this circuit dissipates a DC current.
6.2.10 Back-Bias Generator

The back-bias voltage VBB is utilised in a DRAM to reduce the subthreshold current and the junction capacitances, to improve device isolation, to enhance latch-up immunity, and to protect the circuit against voltage undershoots of the input signals. This voltage can also be used to compensate for some device parameter variations.
For NMOS devices with P-well (substrate) a negative VBB is generated by pumping electrons out of the ground node and into the substrate. A typical VBB generator configuration is shown in Fig. 6.47. This circuit is known as a charge pump. The node A oscillates between VT and (VT - VDD). During the high side of the cycle, the node A must be at least at VT to pump the charge from the ground. On the low side of the cycle, the node A must be a VT drop below VBB. The output node VBB stabilizes at a voltage level equal to (2VT - VDD), since the load capacitance is huge. The clock (clk) is generated by a ring oscillator with N stages (N is an odd number). The frequency of oscillation, f, is approximately 1/(2N td), where td is the delay of one inverter. The buffer is needed to drive the huge Cpump capacitance. The average current pumped out of the substrate is approximated by
$$I_{pump} = (V_{BB} - V_{BBmin})\,C_{pump}\,f \qquad (6.14)$$
where VBBmin is the back-bias voltage when no current is pumped and is equal to (2VT - VDD) (optimum value). During the start-up a large current is pumped, equal to (-VBBmin Cpump f). Another version of the charge-pump circuit, using PMOS devices, is shown in Fig. 6.48. Since the gate voltage of P1 only reaches -VDD, VBB is pumped to a limit of (VT - VDD). For VDD = 5 V, the NMOS and PMOS charge pump circuits generate typical voltages of -3 and -4 V, respectively. For a 3.3 V power supply, the PMOS version can generate a negative voltage of -2.5 V, which is lower than the one generated by the NMOS version at this power supply. Fig. 6.49 shows a pumping circuit which avoids the VT losses and hence is suitable for low-voltage operation [35]. When the clock (clk) is low, the voltage of the node A reaches (|VTp| - VDD), and the PMOS transistor P1 clamps the voltage of node B to the ground level. The VBB level is, in that case, (|VTp| - VDD - VTn). When clk goes to a high level, the voltage of A rises to |VTp| and the voltage of B, by capacitive coupling, becomes -VDD, causing VBB to be equal to -VDD. Therefore VBB will be
$$V_{BB} = \max\{-V_{DD},\; |V_{Tp}| - V_{DD} - V_{Tn}\} \qquad (6.15)$$
This circuit needs a special triple-well structure to avoid minority carrier injection from the NMOS transistor N, as discussed in [35]. To reduce the power dissipation of the BBG circuit while the DRAM is not in an active mode, the BBG can be operated at low frequency. Fig. 6.50 shows a simplified circuit diagram of the BBG circuits for low-power operation [36]. In the normal mode, the ring oscillator works all the time to retain the VBB level. In the data retention mode, the BBG Enable (BBGE) signal is clocked with a low duty ratio. The ring oscillator then operates at low frequency to refresh the pumping circuit.
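The back-bias levels and pump current discussed in this section can be reproduced with a few lines of Python. This is a sketch under stated assumptions: the threshold voltage of 1 V, the pump capacitance, and the inverter delay are hypothetical values, and the function names are illustrative only.

```python
def ring_oscillator_frequency(n_stages, t_delay):
    """Oscillation frequency of an N-stage ring oscillator, f = 1 / (2 N t_d)."""
    if n_stages % 2 == 0:
        raise ValueError("a ring oscillator needs an odd number of stages")
    return 1.0 / (2.0 * n_stages * t_delay)


def vbb_min_nmos(vdd, vt):
    """No-load back-bias level of the NMOS pump, 2 V_T - V_DD."""
    return 2.0 * vt - vdd


def vbb_min_pmos(vdd, vt):
    """Pumping limit of the PMOS version, V_T - V_DD."""
    return vt - vdd


def vbb_lossless(vdd, vtp_abs, vtn):
    """Level of the V_T-loss-free pump, max{-V_DD, |V_Tp| - V_DD - V_Tn}."""
    return max(-vdd, vtp_abs - vdd - vtn)


def pump_current(vbb, vbb_min, c_pump, f):
    """Average pumped current, (V_BB - V_BBmin) * C_pump * f."""
    return (vbb - vbb_min) * c_pump * f


if __name__ == "__main__":
    vt = 1.0  # assumed threshold voltage in volts
    print("NMOS pump at 5 V:", vbb_min_nmos(5.0, vt), "V")
    print("PMOS pump at 5 V:", vbb_min_pmos(5.0, vt), "V")
    f = ring_oscillator_frequency(5, 10e-9)  # assumed 5 stages, 10 ns per stage
    # Start-up current with V_BB = 0 and an assumed 100 pF pump capacitor
    i_start = pump_current(0.0, vbb_min_nmos(5.0, vt), 100e-12, f)
    print(f"f = {f / 1e6:.0f} MHz, start-up current = {i_start * 1e3:.1f} mA")
```

With VT = 1 V the two expressions give the typical levels of -3 V and -4 V quoted for a 5 V supply. Lowering f in data retention mode scales the pumped current, and therefore the pump power, proportionally.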
6.2.11 Boosted Voltage Generator

A boosted level circuit is needed to generate a voltage level above VDD by at least VT. The word-line driver is powered with this voltage, Vch. A simple boosted voltage generator is shown in Fig. 6.51(a). It uses the charge pump circuit technique discussed in Section 6.2.10. The output of this circuit switches between (VDD - VT) and (2VDD - VT). The clock φ is generated by a simple ring oscillator. Another circuit, which switches between VDD and 2VDD, is shown in Fig. 6.51(b). It uses two non-overlapping clock phases. This second circuit configuration uses feedback NMOS devices, N1 and N2, to eliminate the threshold voltage loss and boost the voltage to a higher level. This circuit is not sensitive to power supply voltage reduction. The boosted level cannot be directly used to drive the load. Thus a pass transistor is needed to isolate the switching boosted level from the load, as shown in the example circuit of Fig. 6.52(a) [28]. The charge pump circuit CP1 generates, at node A, a boosted signal switching between VDD and 2VDD. To control the pass transistor N, two pump circuits CP2 and CP3 and an inverter INV are needed. The pump circuit CP2 generates, at node B, a signal switching between 2VDD and 3VDD and uses the boosted voltage Vch. The other pump circuit, CP3, controls the inverter INV. The output of this inverter (node D) switches between VDD and 3VDD. The output of this BVG circuit is Vch = 2VDD and it is stable since the output load capacitance is large. The voltage waveforms are shown in Fig. 6.52(b). This circuit is insensitive to VDD reduction and can work down to a sub-1 V power supply.
6.2.12 Self-Refresh Technique

Standard DRAMs require an external DRAM controller to control the refresh process of the memory cells. The stored charge in the memory cell decreases due to the leakage current, at a high rate at high temperature. The refresh time (period) trefresh is determined from the time needed for the stored charge in the memory cell to keep enough margin against leakage at high temperature. This indicates that trefresh can be lower than what is expected at room temperature. One way to increase this time, and hence reduce the data retention power dissipation, is to control the refresh period as a function of the chip temperature. Fig. 6.53 shows an on-chip self-refresh control circuit with a memory-cell leakage monitoring scheme. A refresh clock φrefresh is generated automatically with a period of trefresh. The monitor cell, which has a leakage current, controls the refresh period. Initially node A is high, the NMOS transistor N is OFF, and node B is low. When the charge on node A has decreased to the point that the PMOS transistor P turns ON, node B rises. Then, during the time τ, a high pulse is generated at node C, which in turn charges node A back up to the high level.
6.2.13 Low-Voltage DRAM Operation and Circuitry

Low-voltage operation is required to reduce the power dissipation and to assure the reliability of deep-submicrometer MOS devices in future DRAMs. The power supply voltage can be as low as 1 V to meet the requirement of battery operation for portable applications. To get high performance in a high-density DRAM at low supply voltage, the threshold voltage of MOS devices should be reduced. This results in an increased subthreshold current, and hence circuit techniques are needed to reduce the standby current. In this section, circuit techniques to reduce the subthreshold current for the DRAM array (equalizer, precharge and sense amplifier) circuits, memory-cell access, and word-line driver are described.
6.2.13.1 DRAM Array Circuits

Fig. 6.54 shows the conventional DRAM array circuit with the half-VDD bit-line precharging technique. This circuit has already been discussed in Section 6.2.3. When VDD is scaled down, this half-VDD scheme causes several problems with respect to the CMOS latch-type SA and the equalizer. For example, for the NMOS transistor NS1 of the N-type SA (N-SA) the following problem can exist. When the signal φsn is pulled down during the readout operation, the sensing operation starts when the voltage VGS1 [see Fig. 6.54] becomes larger than the VT of the NMOS transistor of the SA. However, if VDD/2 is low enough, approaching the value of VT, then the sensing operation is very slow due to the low value of VGS1. Note that VT is subject to the body effect when the common source of the N-SA is falling to ground. Another problem arises during the equalization period. The equalization is carried out by the NMOS device NEQ when the signal φp is activated. In the final stage of equalization, the drive current of the NMOS equalizer decreases drastically, particularly when VDD/2 is not higher than VT. Note that the threshold voltage of the equalizer is also subject to the body effect.
One solution to these problems is the use of low-VT devices in the DRAM array for the CMOS SA, precharge and equalizing circuits. However, this leads to a drastic increase in the leakage current during the active period. The leakage current paths are shown in Fig. 6.55. To significantly reduce this leakage current, the Well-Synchronized Sensing and Equalizing (WSSE) concept was proposed [37]. It is based on the following two concepts:
■ The voltage levels of the transistor sources and the well are equalized during the sensing, the restoring, and the equalizing periods. This eliminates the body effect.
■ A negative (positive) bias, VBB (VPP), is applied to the P-well (N-well) during the active period. Thus, the leakage current is reduced because VT increases due to the body effect.
Fig. 6.56(a) shows the WSSE circuits using a triple-well structure. The N-well and the P-well control voltages, Vwn and Vwp, respectively, are controlled by a special logic. Fig. 6.56(b) illustrates the voltage waveforms. Before the word-line is activated, the bit-lines and the signals φsn and φsp are equalized to half-VDD. The P-well and N-well levels are precharged to (VDD/2 - VTn) and (VDD/2 + |VTp|), respectively. These voltage levels avoid any forward-biasing of the drain-well junctions during the initial time after WL activation. During this initial time, one bit-line is different from VDD/2. In the sensing and restoring period, the signals φsn and Vwp are pulled down while the signals φsp and Vwn are pulled up; each pair is synchronized. After this period, the bit-lines BL and BL-bar are in full-swing condition. Then, the level Vwp is pulled below GND to VBB and isolated from φsn, while the level Vwn is pulled above VDD to VPP and isolated from φsp.
6.2.13.2 Memory Cell

First, let's discuss the requirements for the memory cell, particularly at low voltage. Fig. 6.57 shows the memory cell in the restoring operation. To restore the high level, VDD, from the bit-line to the storage capacitor, the word-line must be boosted to a level Vch. This level has the following requirement

$$V_{ch} > V_{DD} + V_T(V_{DD}) + \alpha \qquad (6.16)$$

where α is the voltage margin and VT(VDD) is the threshold voltage of the access NMOS transistor when its source is at VDD. Note that the NMOS device has (VDD + |VBB|) as an effective back-bias voltage. For transistor reliability, Vch should be as small as possible. This means that VT(VDD) is required to be small. This threshold voltage is given by

$$V_T(V_{DD}) = V_{T0} + \gamma\left(\sqrt{V_{DD} + |V_{BB}| + 2\phi_F} - \sqrt{2\phi_F}\right) \qquad (6.17)$$

where VT0 is the threshold voltage at zero source and substrate bias, γ is the body effect coefficient and φF is the Fermi potential.
Fig. 6.58 shows the unselected memory cell in long cycle operation. The bit-line has completed the sensing operation and is at ground level (GND). In this situation, the memory cell is exposed to the worst-case leakage condition. The charge stored in the cell leaks rapidly due to the subthreshold current. This situation sets the lower limit of the threshold voltage. Note that the access transistor of the memory cell has |VBB| as back-bias voltage. The threshold voltage in this mode is given by

$$V_T(0) = V_{T0} + \gamma\left(\sqrt{|V_{BB}| + 2\phi_F} - \sqrt{2\phi_F}\right) \qquad (6.18)$$
To meet these two requirements on the threshold voltage, the substrate should have a sufficient back-bias voltage to suppress the body effect.
For example, when the internal supply voltage is VDD = 1.5 V, VBB is set to -1 V. The extrapolated threshold voltages are VT(1.5 V) = 1 V and VT(0) = 0.75 V, and S = 90 mV/decade. Therefore, the leakage current of a transistor with W = 1 µm is 10 fA. In this case, (VDD + VT(VDD)) is 2.5 V, so Vch must be about 3 V once the margin α of Equation (6.16) is included.
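The two threshold conditions can be explored with a short Python sketch. It assumes the standard body-effect expression, VT = VT0 + γ(√(VSB + 2φF) - √(2φF)); the device parameters (VT0, γ, φF) are hypothetical and are not the ones behind the 1 V and 0.75 V figures quoted above.

```python
import math


def threshold_voltage(vt0, gamma, phi_f, v_sb):
    """Threshold voltage with body effect for a source-to-substrate bias v_sb."""
    return vt0 + gamma * (math.sqrt(v_sb + 2.0 * phi_f) - math.sqrt(2.0 * phi_f))


def min_wordline_level(vdd, vt_at_vdd, margin):
    """Lower bound on the boosted word-line level, V_DD + V_T(V_DD) + alpha."""
    return vdd + vt_at_vdd + margin


if __name__ == "__main__":
    vdd, vbb = 1.5, -1.0                   # supply and back-bias from the example
    vt0, gamma, phi_f = 0.5, 0.4, 0.35     # assumed device parameters
    vt_restore = threshold_voltage(vt0, gamma, phi_f, vdd + abs(vbb))  # source at V_DD
    vt_hold = threshold_voltage(vt0, gamma, phi_f, abs(vbb))           # source at GND
    print(f"V_T(V_DD) = {vt_restore:.2f} V, V_T(0) = {vt_hold:.2f} V")
    print(f"V_ch > {min_wordline_level(vdd, vt_restore, 0.5):.2f} V with a 0.5 V margin")
```

The sketch shows the tension described in the text: a larger |VBB| raises VT(0), which limits leakage, but it also raises VT(VDD) and therefore the boosted level that the word-line must reach.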
When the VT of the memory cell is reduced, the leakage current increases drastically. The concept of Boosted Sense Ground (BSG) [38] was proposed to shut down the subthreshold current in the memory cell access transistor. This is achieved by slightly boosting the low-level voltage of the bit-line. This level is called the BSG level, and is set at 0.5 V. During a long cycle operation, the gate-source voltage of an unselected cell is negative (-0.5 V), so the subthreshold current is reduced by 6 orders of magnitude (for S = 80 mV/decade). Fig. 6.59 shows the BSG circuit applied to a memory cell. The BSG line is common to all N-channel sense amplifiers. The BSG level is generated by a circuit similar to the VDC circuit [see Section 6.3]. In active mode, the differential amplifier and N1 are activated and the voltage of the sense ground settles at the BSG level. The N2 transistor has a large width and is activated by the signal SE at the beginning of the sensing period to suppress an unnecessary rise in the BSG level caused by the sensing current. In the standby mode, the differential amplifier is made inactive to reduce the standby current, as are N1 and N2. The BSG level is clamped to the threshold voltage of N3. Note that the boosted level, Vch, is reduced compared to the conventional scheme because VT is reduced.
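The quoted reduction follows directly from the subthreshold swing: each S volts of gate-source reduction lowers the current by one decade. A minimal Python check, with illustrative function names:

```python
def leakage_reduction_decades(delta_vgs, swing):
    """Decades of subthreshold-current reduction for a gate-source shift delta_vgs."""
    return delta_vgs / swing


def leakage_reduction_factor(delta_vgs, swing):
    """Multiplicative reduction of the subthreshold current."""
    return 10.0 ** leakage_reduction_decades(delta_vgs, swing)


if __name__ == "__main__":
    # BSG level of 0.5 V and S = 80 mV/decade, as quoted in the text
    decades = leakage_reduction_decades(0.5, 0.080)
    factor = leakage_reduction_factor(0.5, 0.080)
    print(f"{decades:.2f} decades, a factor of about {factor:.2e}")
```

The result is slightly more than six decades, consistent with the figure of 6 orders of magnitude.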
6.2.13.3 Word-Line Driver

Scaling the threshold voltage down increases the subthreshold current of a DRAM, particularly for iterative circuits such as word-line drivers or decoders. If the DRAM is divided into k blocks and each block has n drivers, then the total number of word drivers is k·n. Fig. 6.60 shows an example of DRAM drivers. During the active mode, one driver out of the k·n drivers is selected by the row decoder and the word-line is at the boosted level Vch, generated by the internal charge pump circuit.
When the threshold voltage is low, the subthreshold current of each driver is significant. Then for a DRAM the total subthreshold current of the drivers is

$$I_{subdr} = k \cdot n \cdot I_{sub} \qquad (6.19)$$

where Isub is the subthreshold current of the NMOS and PMOS transistors (assumed the same). For a high-capacity DRAM, the current Isubdr would be huge. For example, if a multi-giga-bit DRAM has 1 million drivers, and each driver has a subthreshold current of 10 nA at room temperature, then the total subthreshold current would be 10 mA. At 75 °C, this current can be hundreds of mA. This high DC current destroys the Vch level because the charge pump circuit cannot handle such a DC current. Note that this current should always be evaluated in the worst case: maximum temperature and the lowest value of VT. In the standby mode, all the drivers are turned OFF. The current Isubdr is still the same. To solve this problem, the Self-Reverse-Biasing (SRB) scheme can be used [24]. This concept has already been discussed in Section 4.10 [Chapter 4]. Fig. 6.61 shows the application of the SRB scheme to word-line drivers. During the active mode, the control signal φ is low and the node SL is equal to Vch. Only one word-line is selected. When φ goes high (standby mode), the PMOS device Ps limits the subthreshold current. In this mode, all drivers are OFF, even the selected one. Fig. 6.62 shows the technique to turn off the selected driver in standby mode. When φ is low, node Ai is high, then the selected word driver is low.

Figure 6.59 Boosted Sense Ground (BSG) circuit.
One problem associated with the SRB scheme is that during the active mode, after one selected word-line driver is activated, all the other drivers are leaking, thereby substantially contributing to the active current. This problem is solved by the partial activation of a hierarchical power-line scheme [39]. Fig. 6.63 shows the principle of the 2-D selection scheme. In this scheme, the array of k blocks by n drivers is divided into k sub-blocks in columns and l sub-blocks in rows. The total number of sub-blocks, each containing a set of drivers, is k × l. During the active mode, only one sub-block is activated. Thus the subthreshold current in the active mode is drastically reduced.
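The driver leakage figures can be checked with a short Python sketch. The 2-D selection estimate is an idealisation that assumes the inactive sub-blocks contribute negligible leakage; the sub-block counts are hypothetical and the function names are illustrative.

```python
def total_driver_leakage(k, n, i_sub):
    """Total subthreshold current of k * n word-line drivers (form of Eq. 6.19)."""
    return k * n * i_sub


def active_leakage_2d(k, n, i_sub, cols, rows):
    """Leakage when only one of cols * rows sub-blocks is powered.

    Idealised: the cut-off sub-blocks are assumed to leak nothing.
    """
    return total_driver_leakage(k, n, i_sub) / (cols * rows)


if __name__ == "__main__":
    # One million drivers at 10 nA each, the example quoted in the text
    i_total = total_driver_leakage(1000, 1000, 10e-9)
    print(f"all drivers leaking: {i_total * 1e3:.1f} mA")
    # Hypothetical partition into 32 x 16 sub-blocks
    i_active = active_leakage_2d(1000, 1000, 10e-9, 32, 16)
    print(f"one sub-block active: {i_active * 1e6:.1f} uA")
```

The first result reproduces the 10 mA total. The second shows how the active leakage falls in proportion to the number of sub-blocks under the stated idealisation.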
6.3 ON-CHIP VOLTAGE DOWN CONVERTER

Chip makers prefer to scale down VDD to enhance the device reliability, while the users prefer to keep the same power supply voltage and dislike frequent changes. The reduction of VDD is also important to achieve low-power characteristics. The strategy to meet these contradictory requirements is to use an on-chip Voltage Down Converter (VDC). A VDC can be used to convert the old power supply voltage standard of 5 V to 3.3 V to power CMOS circuits using 0.5 µm and sub-0.5 µm technology. For the state-of-the-art 0.25 µm CMOS technology, the power supply voltage must be 2.5 V. However, the new standard is becoming 3.3 V and is likely to stay that way for many years. Thus a 3.3/2.5-V VDC is required.
On-chip VDCs are used for DRAMs as well as SRAMs, ASICs and digital processors. They are employed in commercial 16-Mb DRAMs to reduce the external 5 V to an internal voltage of 3.3 V. For SRAMs, they have not been as commonly used as in DRAMs, particularly in commercial ones. SRAMs can operate over a wide range of power supply. Moreover, they already have a low data retention current, sufficient for battery-operated applications. In this section, we discuss the VDC circuit techniques for DRAMs, which are basically the same as for SRAMs and other circuits.
Numerous papers have reported designs of the VDC circuit for a DRAM [32, 40, 41, 42, 43, 44, 45] and for an SRAM [46]. Fig. 6.64 shows one approach using a VDC to reduce the internal voltage for a DRAM. The memory cell array and the periphery circuits are powered from the internal supply voltage, while the I/O buffers are powered with the external voltage to maintain compatibility. However, the VDC, in this situation, should be stable when supplying a large current to the periphery and the memory array. When the VDC is used for battery-operated applications, the standby current should be less than 1 µA over a wide range of temperature (0 to 70 °C).

Figure 6.62 Detail of word-driver with voltage shifter.
Fig. 6.65 shows a schematic of the VDC structure for a DRAM, used to convert 5 V to 3.3 V. It is composed of a Reference Voltage Generator (RVG), a driver circuit and a time-dependent load. The buffer circuit consists of a differential amplifier [Fig. 6.66] and a common-source drive PMOS transistor Pb. The current load has a peak, for the memory array, of more than 100 mA in 10 to 30 ns and more than 100 mA in a few ns for the periphery circuit. To deliver such a large current, the width of the PMOS Pb of the output stage should be large. Moreover, when the output current changes rapidly, the output voltage VDD decreases by ΔVDD. To minimize ΔVDD, the gate control voltage, VG, has to change quickly. This is possible by increasing the tail current of the differential amplifier. A separate current source is needed to clamp the output voltage VDD when the load current becomes almost zero.

Figure 6.66 Schematic of the differential amplifier.
A VDC circuit is one of the keys for achieving a DRAM with a data-retention current that can be used in battery-based applications. The requirements for low power are the following:
■ The standby current must be less than 1 µA over a wide range of temperature, process and power supply voltage variations; and
■ The output impedance of the VDC should be low.
6.3.1 Driver Design Issues

The internal voltage generated by the VDC can have many sources of fluctuation, which are as follows:
• DC changes in the reference voltage due to process and temperature variations.
• Transient variations caused by noise in the external power supply and by the load current.
The variation of the internal voltage with respect to the reference voltage should be less than 3%. The variation with respect to the load has to be less than 10%, and with respect to the power supply less than 1%.

The stability of this circuit is essential for the operation of the VDC. To study the stability, an ac small-signal analysis is carried out. Fig. 6.67 shows the simplified equivalent circuit using the MOS small-signal techniques [47]. The gate capacitance of the output PMOS, C_G, is large and is taken into account. g_m1 and g_m2 are the transconductances of the differential amplifier and the output stage, respectively. r_1 and r_2 are their respective equivalent output resistances. C_L is the output load capacitance, composed of the wire capacitance C_w (typically 100 pF) and the switched capacitance of the memory core C_m (typically 1200 pF). The frequency response of this circuit is expressed by
(6.20)
The circuit has two poles: p_1 = 1/(C_G r_1) for the differential amplifier and p_2 = 1/(C_L r_2) for the output stage. The two poles must be sufficiently separated from each other to assure a good phase margin [48]. For a DRAM application, the pole p_2 varies drastically because of the load variation. Thus, the circuit can fail to ensure a sufficient phase margin, and hence it can generate ringing or oscillation. Therefore, phase compensation has to be applied. One possible compensation technique is shown in Fig. 6.68(a); it is called the Miller compensation technique. The compensation capacitor C_c is connected between the input and the output of the second stage. It shifts the pole p_1 towards a lower frequency p_1', as shown in Fig. 6.68(b). Thus, the phase margin is improved. The stabilization condition is defined at the point of 0 dB loop gain, where the phase margin must be larger than 45 degrees. Using the small-signal analysis with the compensation capacitor C_c, this condition can be extracted. The capacitor is a function of g_m1, g_m2, C_L and C_G. To determine it, g_m2 has to be known, using large-signal analysis. The PMOS driver Pb has to be sized to satisfy the condition on ΔV_DD/V_DD (less than 10%) due to the transient load current variation. Hence g_m2 can be determined from the size of Pb. For a 16-Mb DRAM, the width of the output PMOS Pb can be as high as 30,000 µm and C_c equals 200 pF. This is for 3.3 V internal power supply generation from 5 V. The tail current of the differential amplifier can be high (a few mA) in active mode. In standby mode, the driver can be deactivated by the Chip Select (CS) signal so that it consumes only a very small current. In this case, the internal voltage can be supplied by a low-power voltage follower [46]. The voltage follower has the same configuration as the driver, but its tail current is in the sub-µA range.
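To make the pole-separation argument concrete, the following Python sketch estimates the phase margin of a generic two-pole loop gain. The model A(s) = A0/((1 + s/p1)(1 + s/p2)), the DC gain and the pole values are illustrative assumptions; they are not the book's Eq. (6.20) nor values taken from Fig. 6.67.

```python
import math

def phase_margin_deg(a0, p1, p2):
    """Phase margin (degrees) of A(s) = a0 / ((1 + s/p1)(1 + s/p2)).

    a0 is the DC loop gain (> 1); p1 and p2 are pole frequencies in rad/s.
    This generic two-pole model is an assumption made for illustration.
    """
    if a0 <= 1:
        raise ValueError("a0 must exceed 1 for a 0 dB crossing to exist")
    # |A(jw)|^2 = 1  <=>  a*x^2 + b*x - (a0^2 - 1) = 0  with x = w^2
    a = 1.0 / (p1 * p1 * p2 * p2)
    b = 1.0 / (p1 * p1) + 1.0 / (p2 * p2)
    k = a0 * a0 - 1.0
    # Numerically stable form of the positive root of the quadratic
    x = 2.0 * k / (b + math.sqrt(b * b + 4.0 * a * k))
    w = math.sqrt(x)  # 0 dB crossover frequency
    return 180.0 - math.degrees(math.atan(w / p1) + math.atan(w / p2))

# Hypothetical numbers: the same DC gain with three pole placements.
close_poles = phase_margin_deg(100.0, 1e6, 1e6)     # poles coincide
compensated = phase_margin_deg(100.0, 1e4, 1e6)     # p1 shifted lower
well_separated = phase_margin_deg(100.0, 1e3, 1e9)  # poles far apart
```

With these assumed numbers, coincident poles leave a margin below 15 degrees, while shifting p1 two decades below p2 (the effect attributed to Miller compensation) brings the margin above the 45 degree criterion.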
6.3.2 Reference Voltage Generator

The Reference Voltage Generator (RVG) must provide high accuracy over a wide variation of V_DD, process, and temperature. So far, RVGs have been based on the band-gap reference and on the threshold voltage generator.
The former consumes a DC current which is not low enough for low-power applications. The latter is more suitable for a CMOS technology. Fig. 6.69(a) shows a PMOS V_T-difference generator with an output voltage ΔV_T = |V_TP1| - |V_TP2| (V_TP1 < V_TP2 < 0). The equivalent circuit is shown in Fig. 6.69(b). This circuit needs a PMOS device with a high threshold voltage. A typical value for the threshold voltage difference is about 1 V. PMOS transistors are chosen for the threshold voltage difference generator because they sit in N-wells, and therefore the difference is independent of the back-bias (V_BB). The circuit of Fig. 6.69(a) does not suffer much from supply-voltage bounce. The temperature dependency of the V_T difference is expressed by [49]
(6.21)
where N_s1 and N_s2 are the surface impurity concentrations of P1 and P2, respectively. For a temperature-stable design, the concentration ratio N_s1/N_s2, and therefore the threshold voltage difference, should not be excessively large. A typical value of the temperature dependency is 0.4 mV/°C, which is small for the VDC circuit. Since ΔV_T is around 1 V, the circuit of Fig. 6.70 is used to convert this difference to the required internal supply voltage. The voltage-up converter amplifies ΔV_T to:
$$V_{ref} = \Delta V_T \left(1 + \frac{R_1}{R_2}\right) \qquad (6.22)$$
The mismatch between the two PMOS devices P1 and P2 of Fig. 6.69 can be minimized by using large channel widths and lengths. Still, the deviation of V_T due to the fabrication process has to be eliminated. This can be done by using a fuse trimming technique to control the ratio of the resistors R_1 and R_2. The total current consumed by this RVG circuit is the sum of three components: 3I, the current consumed by the voltage regulator [see Fig. 6.69(a)]; I_S, the current of the differential amplifier; and I_R = V_ref/(R_1 + R_2), the current of the output stage. I can be made less than 1 µA; however, I_S and I_R cannot be made smaller, particularly I_S. The resistors are implemented, for example, using doped polysilicon. Typical values of the resistances are of the order of 100 kΩ. They cannot be increased excessively, otherwise the area of the RVG becomes significantly large. Moreover, the substrate noise can affect the reference voltage through the coupling capacitances of the resistors. The total current of this type of RVG is of the order of a few tens of µA.
To reduce the current of the RVG to the sub-µA range for battery-operated DRAMs, the concept of a dynamic RVG can be used [50], as shown in Fig. 6.71. A PMOS transistor with a low |V_T| is used. During the sampling period (φ_1 is high), all switches S_1-S_4 are closed. The threshold voltage difference, ΔV_T, between the two PMOS devices, P1 and P2, appears across the resistor R_R. If the transistor dimensions of the pair P1 and P2, and of the second transistor pair, are identical, the reference current is given by
$$I = \frac{\Delta V_T}{R_R} \qquad (6.24)$$
This current is mirrored to the output node. If the output mirror transistor has the same dimensions as P2, the output voltage V_ref is given by
$$V_{ref} = \Delta V_T \, \frac{R_L}{R_R} \qquad (6.25)$$
This shows that the reference voltage can be adjusted to any voltage. Moreover, with a trimming technique, V_ref can be adjusted against process variation effects (ΔV_T variation). The output voltage is sampled on the hold capacitor C_H. When φ_1 is low, the circuit is in hold mode. Clock φ_2 is delayed with respect to clock φ_1 to minimize fluctuation of the output voltage. These clocks are generated from the self-refresh clock circuit in a DRAM. The circuit consumes a DC current only when φ_1 is applied. The average current consumed by this circuit is
$$I_{av} = 3I \times \gamma = 3\,\frac{\Delta V_T}{R_R}\,\gamma \qquad (6.26)$$
where γ is the duty ratio of φ_1. The current of this circuit can be reduced to a low level, in the sub-µA range, by controlling the duty ratio. For example, to generate a reference voltage of 2.4 V from an external power supply voltage of 3.3 V, R_R and R_L are 9 kΩ and 72 kΩ, respectively. ΔV_T has a typical value of 0.3 V. The total DC current is 100 µA. So with a duty ratio lower than 1/100, the average current can be reduced below 1 µA. It can be easily shown that this circuit has a low sensitivity to power supply voltage and temperature variations.
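The numerical example above can be checked with a few lines of Python. This is a minimal sketch: the function name and the `branches` parameter are hypothetical, and the factor of three simply mirrors the 3I term of Eq. (6.26).

```python
def dynamic_rvg_currents(delta_vt, r_r, duty, branches=3):
    """Return (I, total DC current, average current) in amperes.

    delta_vt : threshold-voltage difference in volts
    r_r      : resistance R_R in ohms
    duty     : duty ratio of the sampling clock (0..1)
    branches : number of branches carrying I; 3 matches the 3I term of Eq. (6.26)
    """
    i_ref = delta_vt / r_r      # Eq. (6.24)
    i_dc = branches * i_ref     # current drawn while the sampling clock is high
    i_avg = i_dc * duty         # Eq. (6.26)
    return i_ref, i_dc, i_avg

# Values quoted in the text: 0.3 V across 9 kOhm, duty ratio 1/100.
i_ref, i_dc, i_avg = dynamic_rvg_currents(delta_vt=0.3, r_r=9e3, duty=1 / 100)
print(f"I = {i_ref * 1e6:.1f} uA, DC = {i_dc * 1e6:.1f} uA, average = {i_avg * 1e6:.2f} uA")
```

The sketch reproduces the 100 µA total DC current and the roughly 1 µA average at a duty ratio of 1/100; any smaller duty ratio scales the average down linearly.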
6.4 CHAPTER SUMMARY

Low-power architectures and circuit techniques for SRAMs, DRAMs and VDCs were reviewed. The obvious technique to reduce the power dissipation is voltage scaling. The reduction of the power supply voltage to the 1 V and sub-1 V range requires new circuit innovations and breakthroughs, particularly when low threshold voltage devices are used. It was shown that not only does power supply voltage scaling contribute to the reduction of power consumption, but so does the reduction of capacitances and DC currents using sophisticated techniques. Many of the techniques presented for memories can be useful in other applications such as ASICs, DSPs, etc. Design issues for stable operation of a VDC and low-standby-current techniques were investigated.
REFERENCES
[1] H. Tran et al., "An 8-ns 1-Mb ECL BiCMOS SRAM with Configurable Memory Array Size," International Solid-State Circuits Conf. Tech. Dig., pp. 36-37, February 1989.
[2] M. Matsui et al., "An 8-ns 1-Mb ECL BiCMOS SRAM," International Solid-State Circuits Conf. Tech. Dig., pp. 38-39, February 1989.
[3] Y. Maki et al., "A 6.5-ns 1 Mb BiCMOS ECL SRAM," International Solid-State Circuits Conf. Tech. Dig., pp. 136-137, February 1990.
[4] M. Takada et al., "A 5-ns 1-Mb ECL BiCMOS SRAM," IEEE Journal of Solid-State Circuits, vol. 25, no. 5, pp. 1057-1062, October 1990.
[5] A. Ohba et al., "A 7-ns 1-Mb BiCMOS ECL SRAM with Program-Free Redundancy," in Symp. VLSI Circuits Conf. Tech. Dig., pp. 41-42, May 1990.
[6] Y. Okajima et al., "A 7-ns 4-Mb BiCMOS SRAM with a Parallel Testing Circuit," International Solid-State Circuits Conf. Tech. Dig., pp. 54-55, February 1991.
[7] K. Sasaki et al., "A 7-ns 140-mW 1-Mb CMOS SRAM with Current Sense Amplifier," IEEE Journal of Solid-State Circuits, vol. 27, no. 11, pp. 1511-1518, November 1992.
[8] T. Ootani et al., "A 4-Mb CMOS SRAM with a PMOS Thin-Film Transistor Load Cell," IEEE Journal of Solid-State Circuits, vol. 25, no. 5, pp. 1082-1092, October 1990.
[9] S. Murakami et al., "A 21-mW 4-Mb CMOS SRAM for Battery Operation," IEEE Journal of Solid-State Circuits, vol. 26, no. 11, pp. 1563-1570, November 1991.
[10] K. Sasaki et al., "16-Mb CMOS SRAM with a 2.3-µm² Single-Bit-Line Memory Cell," IEEE Journal of Solid-State Circuits, vol. 28, no. 11, pp. 1125-1130, November 1993.
[11] M. Matsumiya et al., "A 15-ns 16-Mb CMOS SRAM with Interdigitated Bit-Line Architecture," IEEE Journal of Solid-State Circuits, vol. 27, no. 11, pp. 1497-1503, November 1992.
[12] K. Seno et al., "A 9-ns 16-Mb CMOS SRAM with Offset-Compensated Current Sense Amplifier," IEEE Journal of Solid-State Circuits, vol. 28, no. 11, pp. 1119-1124, November 1993.
[13] E. Seevinck, F. J. List, and J. Lohstroh, "Static-Noise Margin Analysis of MOS SRAM Cells," IEEE Journal of Solid-State Circuits, vol. SC-22, no. 5, pp. 748-754, October 1987.
[14] H. Kato et al., "Consideration of Poly-Si Loaded Cell Capacity Limits for Low-Power and High-Speed," IEEE Journal of Solid-State Circuits, vol. 27, no. 4, pp. 683-685, April 1992.
[15] K. Sasaki et al., "A 23-ns 4-Mb CMOS SRAM with 0.2-µA Standby Current," IEEE Journal of Solid-State Circuits, vol. 25, no. 5, pp. 1075-1081, October 1990.
[16] K. Ishibashi, T. Yamanaka, and K. Shimohigashi, "An α-Immune, 2-V Supply Voltage SRAM using a Polysilicon PMOS Load Cell," IEEE Journal of Solid-State Circuits, vol. 25, no. 1, pp. 55-60, February 1990.
[17] K. Sasaki et al., "A 15-ns 1-Mbit CMOS SRAM," IEEE Journal of Solid-State Circuits, vol. 23, no. 5, pp. 1067-1072, October 1988.
[18] K. Sasaki et al., "A 9-ns 1-Mbit CMOS SRAM," IEEE Journal of Solid-State Circuits, vol. 24, no. 5, pp. 1219-1225, October 1989.
[19] K. Ishibashi, K. Takasugi, T. Yamanaka, T. Hashimoto, K. Sasaki, "A 1-V TFT-Load SRAM using a Two-Step Word-Voltage Method," IEEE Journal of Solid-State Circuits, vol. 27, no. 11, pp. 1519-1524, May 1992.
[20] M. Yoshimoto, K. Anami, H. Shinohara, T. Yoshihara, H. Takagi, S. Nagao, S. Kayano, and T. Nakano, "A Divided Word-Line Structure in the Static RAM and its Application to a 64K Full CMOS RAM," IEEE Journal of Solid-State Circuits, vol. SC-18, no. 5, pp. 479-485, October 1983.
[21] T. Hirose, H. Kuriyama, S. Murakami, K. Yuzuriha, T. Mukai, K. Tsutsumi, Y. Nishimura, Y. Kohno, and K. Anami, "A 20-ns 4-Mb CMOS SRAM with Hierarchical Word Decoding Architecture," IEEE Journal of Solid-State Circuits, vol. 25, no. 5, pp. 1068-1074, October 1990.
[22] A. Sekiyama, T. Seki, S. Nagai, A. Iwase, N. Suzuki, and M. Hayasaka, "A 1-V Operating 256-Kb Full-CMOS SRAM," IEEE Journal of Solid-State Circuits, vol. 27, no. 5, pp. 776-782, May 1992.
[23] T. Yabe, et al., "High-Speed and Low-Standby-Power Circuit Design of 1 to 5 V Operating 1 Mb Full CMOS SRAM," Symposium on VLSI Circuits Tech. Dig., pp. 107-108, May 1993.
[24] G. Kitsukawa, et al., "256-Mb DRAM Circuit Technologies for File Applications," IEEE Journal of Solid-State Circuits, vol. 28, no. 11, pp. 1105-1113, November 1993.
[25] T. Hasegawa, et al., "An Experimental DRAM with a NAND-Structured Cell," IEEE Journal of Solid-State Circuits, vol. 28, no. 11, pp. 1099-1104, November 1993.
[26] T. Sugibayashi, et al., "A 30-ns 256-Mb DRAM with a Multidivided Array Structure," IEEE Journal of Solid-State Circuits, vol. 28, no. 11, pp. 1092-1099, November 1993.
[27] M. Aoki, J. Etoh, K. Itoh, S-I. Kimura, and Y. Kawamoto, "A 1.5-V DRAM for Battery-Based Applications," IEEE Journal of Solid-State Circuits, vol. 24, no. 6, pp. 1206-1212, October 1989.
[28] Y. Nakagome, et al., "An Experimental 1.5-V 64-Mb DRAM," IEEE Journal of Solid-State Circuits, vol. 26, no. 4, pp. 465-471, April 1991.
[29] H. Yamauchi, et al., "A Circuit Technology for High-Speed Battery-Operated 16-Mb CMOS DRAMs," IEEE Journal of Solid-State Circuits, vol. 28, no. 11, pp. 1084-1091, November 1993.
[30] N. C. C. Lu, "Advanced Cell Structures for Dynamic RAMs," IEEE Circuits and Devices Magazine, no. 1, pp. 21-36, January 1989.
[31] M. Takada, "DRAM Technology for Giga-bit Age," International Conf. Solid State Devices and Materials, Tech. Dig., pp. 874-876, 1993.
[32] K. Itoh, et al., "An Experimental 1-Mb DRAM with On-Chip Voltage Limiter," in International Solid-State Circuits Conf., Tech. Dig., pp. 282-283, 1984.
[33] N. C-C. Lu, and H. H. Chao, "Half-VDD Bit-Line Sensing Scheme in CMOS DRAMs," IEEE Journal of Solid-State Circuits, vol. SC-19, no. 5, pp. 451-454, August 1984.
[34] H. Kawamoto, T. Shinoda, Y. Yamaguchi, S. Shimizu, K. Ohishi, N. Tanimura, T. Yasui, "A 288K CMOS Pseudostatic RAM," IEEE Journal of Solid-State Circuits, vol. SC-19, no. 5, pp. 619-625, October 1984.
[35] Y. Tsukikawa et al., "An Efficient Back-Bias Generator with Hybrid Pumping Circuit for 1.5 V DRAMs," in Symposium on VLSI Circuits, Tech. Dig., pp. 85-86, May 1993.
[36] Y. Konishi, et al., "A 38-ns 4-Mb DRAM with a Battery-Backup (BBU) Mode," IEEE Journal of Solid-State Circuits, vol. 25, no. 5, pp. 1112-1117, October 1990.
[37] T. Ooishi, et al., "A Well-Synchronized Sensing/Equalizing Method for Sub-1 V Operating Advanced DRAMs," in Symposium on VLSI Circuits, Tech. Dig., pp. 81-82, May 1993.
[38] M. Asakura, et al., "An Experimental 256-Mb DRAM with Boosted Sense-Ground Scheme," IEEE Journal of Solid-State Circuits, vol. 29, no. 11, pp. 1303-1309, November 1994.
[39] T. Sakata et al., "Subthreshold-Current Reduction Circuits for Multi-Gigabit DRAMs," in Symposium on VLSI Circuits, Tech. Dig., pp. 45-46, May 1993.
[40] T. Furuyama, et al., "A New On-Chip Voltage Converter for Submicrometer High-Density DRAMs," IEEE Journal of Solid-State Circuits, vol. 22, no. 3, pp. 437-441, June 1987.
[41] M. Takada, et al., "A 4-Mb DRAM with Half Internal Voltage Bit-Line Precharge," IEEE Journal of Solid-State Circuits, vol. 21, no. 5, pp. 612-617, October 1986.
[42] M. Horiguchi, et al., "Dual-Operation-Voltage Scheme for a Single 5-V, 16-Mb DRAM," IEEE Journal of Solid-State Circuits, vol. 23, no. 5, pp. 1128-1132, October 1988.
[43] G. Kitsukawa, et al., "A 1-Mb BiCMOS DRAM Using Temperature-Compensation Circuit Techniques," IEEE Journal of Solid-State Circuits, vol. 24, no. 3, pp. 597-602, June 1989.
[44] M. Horiguchi, et al., "A Tunable CMOS-DRAM Voltage Limiter with Stabilized Feedback Amplifier," IEEE Journal of Solid-State Circuits, vol. 25, no. 5, pp. 1129-1135, October 1990.
[45] M. Horiguchi, et al., "Dual-Regulator Dual-Decoding-Trimmer DRAM Voltage Limiter for Burn-in Test," IEEE Journal of Solid-State Circuits, vol. 26, no. 11, pp. 1544-1549, November 1991.
[46] K. Ishibashi, K. Sasaki, and H. Toyoshima, "A Voltage Down Converter with Submicroampere Standby Current for Low-Power Static RAMs," IEEE Journal of Solid-State Circuits, vol. 27, no. 6, pp. 920-926, June 1992.
[47] P. E. Allen, and D. R. Holberg, "CMOS Analog Circuit Design," Holt, Rinehart and Winston Publisher, 1987.
[48] P. R. Gray, and R. G. Meyer, "Analysis and Design of Analog Integrated Circuits," 2nd Edition, Wiley Publisher, 1984.
[49] R. A. Blauschild et al., "A New NMOS Temperature Stable Voltage Reference," IEEE Journal of Solid-State Circuits, vol. SC-13, pp. 767-774, December 1978.
[50] H. Tanaka, Y. Nakagome, J. Etoh, E. Yamasaki, M. Aoki, and K. Miyazawa, "Sub-1-µA Dynamic Reference Voltage Generator for Battery-Operated DRAMs," in Symp. VLSI Circuits, Tech. Dig., pp. 87-88, May 1993.
7
VLSI CMOS SUBSYSTEM DESIGN
In this chapter, we study the application of the circuit techniques developed through Chapter 4 in the implementation of CMOS building blocks such as adders, multipliers, ALUs, data-paths, and regular structures. The power dissipation constraint is also included through the several options presented for each circuit. The use of a Phase-Locked Loop (PLL) in high-speed CMOS systems for deskewing the internal clock is also examined. Low-power issues of the circuits presented are also discussed.
7.1 PARALLEL ADDERS

Parallel adders are the most important elements used in arithmetic operations of microprocessors, DSPs, etc. As in any logic design, they are constrained by parameters such as speed, area, and power dissipation. The adder cell is also an element of multipliers, dividers, multiplier-accumulators (MACs), etc. Among the various adder implementations used in many designs, we can cite the following classes:
• Ripple Carry Adders (RCA);
• Carry Look-Ahead Adders (CLA);
• Carry Select Adders (CS); and
• Conditional Sum Adders (CSA).
This section is devoted to describing all these adder classes.
7.1.1 Ripple Carry Adders

In Chapter 4, a description of the functionality of an adder cell was presented. In an n-bit adder, a propagation of the carry always occurs. This propagation limits the speed of the adder. The simplest way to construct an n-bit adder is to cascade n 1-bit adders as shown in Fig. 7.1. This adder is called a Ripple Carry Adder (RCA). Because the carry ripples through the n stages, the sum of the nth bit cannot be performed until the carry C_{n-1} is evaluated. The delay of an n-bit addition is given by
$$t = (n - 1)\,t_c + t_s \qquad (7.1)$$
where t_c is the carry delay and t_s is the sum delay. Since the carry propagation path is the critical stage for the delay, the full-adder cell should be optimized. The sum and carry out are given by
$$S = A \oplus B \oplus C_{in} \qquad (7.2)$$
$$C_{out} = A \cdot B + (A + B) \cdot C_{in} \qquad (7.3)$$
The schematic of Fig. 7.2 can be generated to efficiently implement the adder cell. Compared to the conventional CMOS full-adder implementation, there is no inverter stage. Therefore, the carry delay is reduced. To optimize the cell, the transistors in the carry path can be sized up [see Fig. 7.2]. The other devices can be kept small to reduce the load on the carry and the power dissipation. The transistors driven by the carry in, C_in, are placed close to the output. This reduces the body effect, since the carry signal is the latest one in an adder chain. The schematic of Fig. 7.2 is symmetrical and leads to a better layout and a small area. Since the outputs are complemented, the configuration of Fig. 7.3 can be used to implement an RCA circuit. In this case, many cells use inverted inputs.
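As a behavioural illustration of Eqs. (7.1) to (7.3), the following Python sketch models the full-adder cell and the ripple chain at the logic level. It says nothing about the transistor-level circuit of Fig. 7.2; the function names and the delay values used in any call are assumptions chosen for the example.

```python
def full_adder(a, b, c_in):
    """One-bit full adder following Eqs. (7.2) and (7.3)."""
    s = a ^ b ^ c_in
    c_out = (a & b) | ((a | b) & c_in)
    return s, c_out

def ripple_carry_add(a_bits, b_bits, c_in=0):
    """Add two equal-length bit lists (LSB first); return (sum_bits, carry_out)."""
    sum_bits = []
    carry = c_in
    for a, b in zip(a_bits, b_bits):
        # Each stage waits for the carry produced by the previous one.
        s, carry = full_adder(a, b, carry)
        sum_bits.append(s)
    return sum_bits, carry

def rca_delay(n, t_c, t_s):
    """Worst-case delay of an n-bit RCA, Eq. (7.1)."""
    return (n - 1) * t_c + t_s

def to_bits(value, n):
    """Integer to LSB-first list of n bits."""
    return [(value >> i) & 1 for i in range(n)]

def from_bits(bits):
    """LSB-first bit list to integer."""
    return sum(bit << i for i, bit in enumerate(bits))
```

For instance, `rca_delay(32, t_c, t_s)` grows linearly with the word length, which is the limitation the later adder classes try to remove.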
Note that an n-bit RCA circuit is subject to the glitching problem. Fig. 7.4 shows a static simulation of a 4-bit adder, with the inputs A_i set to zero (0) and the inputs B_i and C_in rising from 0 to 1. The outputs S_i should stay at 0; however, due to the delay of the carry signal through the chain of full-adders, the outputs exhibit spurious transitions (glitching). These dynamic transitions dissipate extra power and can represent an important portion of the total power. With careful design this glitching problem can be minimized. One advantage of the RCA is its low-power characteristic. However, its speed is very limited, particularly when the adder is wide.
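The glitching mechanism can be reproduced with a very coarse unit-delay model, sketched below in Python. The assumption that every full-adder stage takes exactly one time step is ours, made purely for illustration; it is not the simulation behind Fig. 7.4, and the function names are hypothetical.

```python
def full_adder(a, b, c_in):
    return a ^ b ^ c_in, (a & b) | ((a | b) & c_in)

def settle(a_bits, b_bits, c_in):
    """Steady-state sums and carries (carries[i] is the carry into bit i)."""
    carries, sums = [c_in], []
    for a, b in zip(a_bits, b_bits):
        s, c = full_adder(a, b, carries[-1])
        sums.append(s)
        carries.append(c)
    return sums, carries

def glitch_trace(old_inputs, new_inputs):
    """Sum outputs over time after the inputs switch from old to new.

    Each element of the returned list is the sum vector (LSB first) at one
    time step; every stage sees the carry its neighbour produced one step ago.
    """
    a_new, b_new, cin_new = new_inputs
    n = len(a_new)
    sums, carries = settle(*old_inputs)
    trace = [list(sums)]
    for _ in range(n + 1):
        seen = [cin_new] + carries[1:]   # carries visible during this step
        new_sums, new_carries = [], [cin_new]
        for i in range(n):
            s, c = full_adder(a_new[i], b_new[i], seen[i])
            new_sums.append(s)
            new_carries.append(c)
        sums, carries = new_sums, new_carries
        trace.append(list(sums))
    return trace

def count_transitions(trace):
    """Number of output toggles per bit position over the whole trace."""
    n = len(trace[0])
    return [sum(trace[t][i] != trace[t - 1][i] for t in range(1, len(trace)))
            for i in range(n)]

# Scenario from the text: A = 0000, while B and the carry-in rise from 0 to 1.
zeros, ones = [0, 0, 0, 0], [1, 1, 1, 1]
trace = glitch_trace((zeros, zeros, 0), (zeros, ones, 1))
toggles = count_transitions(trace)
```

In this model the sum outputs start and end at 0, yet every bit except the least significant one toggles up and back down while it waits for its carry, and the higher bits stay wrong for longer.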
Another efficient full-adder cell is based on Transmission Gates (TGs). Fig. 7.5 shows an optimized version of the full-adder cell using TGs, already discussed in Chapter 4. The carry in this cell propagates only through one TG. Hence, an n-bit RCA would be faster and more compact than the conventional one. Fig. 7.6 shows the construction of an n-bit adder. In practice, an inverter is added every four stages to reduce the degradation of the carry signal due to the distributed RC effect. When the carry signal is inverted after four 1-bit stages, complementary carry path adders are used for the next 4-bit stage. This adder structure is sometimes called a Manchester adder. This circuit is faster than the RCA and may have lower power dissipation.
7.1.2 Carry Look-Ahead Adders

To avoid the linear growth of the carry delay, we use a Carry Look-Ahead Adder (CLA) in which the carries can be generated in parallel. The carry of each bit is generated from the propagate and generate signals (P_i, G_i) as well as the input carry (C_0). The propagate and generate signals (P_i, G_i) are derived from the operands A_i and B_i by
$$G_i = A_i \cdot B_i \qquad (7.4)$$
$$P_i = A_i + B_i \qquad (7.5)$$
The carries of the four stages are given by
$$C_1 = G_0 + P_0 C_0 \qquad (7.6)$$
$$C_2 = G_1 + P_1 G_0 + P_1 P_0 C_0 \qquad (7.7)$$
$$C_3 = G_2 + P_2 G_1 + P_2 P_1 G_0 + P_2 P_1 P_0 C_0 \qquad (7.8)$$
$$C_4 = G_3 + P_3 G_2 + P_3 P_2 G_1 + P_3 P_2 P_1 G_0 + P_3 P_2 P_1 P_0 C_0 \qquad (7.9)$$
Fig. 7.7 shows the block diagram of a 4-bit CLA adder. The carry generator blocks (CLG1 to CLG4) generate the carries C_1 to C_4, in parallel, from the carry-in signal C_0. The different P_i and G_i signals are implemented following the expressions given by Equations (7.4) and (7.5). The sum generator blocks (SG1 to SG4) generate the sums. The sum, S_i, is generated by
$$S_i = C_{i-1} \oplus A_i \oplus B_i \qquad (7.10)$$
or
$$S_i = C_{i-1} \oplus P_i \qquad (7.11)$$
if the propagate signal is given by
$$P_i = A_i \oplus B_i \qquad (7.12)$$
In general, an n-bit CLA adder can be implemented efficiently using 4-bit blocks.
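The carry expressions (7.6) to (7.9) can be checked at the logic level with a short Python sketch. Two assumptions are made here: the propagate signal is taken as the OR of the operand bits (the XOR form of Eq. (7.12) yields the same carries), and bits are indexed from 0, so the sum of bit i uses the carry entering that bit. The function names are hypothetical.

```python
def cla4(a_bits, b_bits, c0):
    """4-bit carry look-ahead block; bit lists are LSB first (index 0..3)."""
    g = [a & b for a, b in zip(a_bits, b_bits)]   # generate signals
    p = [a | b for a, b in zip(a_bits, b_bits)]   # propagate signals (OR form)
    # All four carries depend only on g, p and c0, never on each other.
    c1 = g[0] | (p[0] & c0)
    c2 = g[1] | (p[1] & g[0]) | (p[1] & p[0] & c0)
    c3 = (g[2] | (p[2] & g[1]) | (p[2] & p[1] & g[0])
          | (p[2] & p[1] & p[0] & c0))
    c4 = (g[3] | (p[3] & g[2]) | (p[3] & p[2] & g[1])
          | (p[3] & p[2] & p[1] & g[0])
          | (p[3] & p[2] & p[1] & p[0] & c0))
    carries_in = [c0, c1, c2, c3]                 # carry entering each bit
    sums = [a ^ b ^ c for a, b, c in zip(a_bits, b_bits, carries_in)]
    return sums, c4

def to_bits(value, n):
    return [(value >> i) & 1 for i in range(n)]

def from_bits(bits):
    return sum(bit << i for i, bit in enumerate(bits))
```

Unlike a ripple chain, no carry expression here waits for another carry; the price is the growing number of product terms, which is what limits the block size to about four bits.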
Fig. 7.8(a) and 7.8(b) show the first and the fourth CMOS carry look-ahead generator circuits, respectively. The generate and propagate signals are generated in parallel and are fed to all carry generators along with the input carry signal C_0. The carry signals are generated simultaneously. However, because the number of stacked MOS transistors increases, the delay of the fourth carry is greater than that of the first and limits the adder speed. The sum generator of the CMOS adder of Fig. 7.2 can be used in this case. The same circuit is used for all four bits. This implementation is slow because of the large number of stacked MOS transistors, which represent a high equivalent resistance in the pull-up and pull-down paths.
Another CLA circuit implementation in static CMOS design, which improves the critical carry path delay, is shown in Fig. 7.9(a). In this circuit, the number of stacked devices is reduced. The same cell of Fig. 7.9(a) can be used to generate each carry within a 4-bit block. P and G are the global propagate and generate signals, respectively. The inverter of the circuit of Fig. 7.9(a) is used to reduce the load on the fourth carry, C_4, when it is used to drive the next fourth CLG circuit. The output of this inverter drives many blocks, such as the next first-bit, second-bit and third-bit CLGs, and the next sum blocks. For the fourth bit stage, P and G are given by
$$P = P_{i+3} P_{i+2} P_{i+1} P_i \qquad (7.13)$$
$$G = G_{i+3} + P_{i+3} G_{i+2} + P_{i+3} P_{i+2} G_{i+1} + P_{i+3} P_{i+2} P_{i+1} G_i \qquad (7.14)$$
The circuits of Fig. 7.9(b) and Fig. 7.9(c) show the implementations of the global functions P and G. Similarly, the P and G signals for the third, second and first bit stages can be constructed. For an n-bit adder, all the P and G signals are computed in parallel. Hence, the critical path is the carry path from C_i to C_{i+4}, except for the first 4-bit adder block, where the critical path can be from one of the inputs (A_0 or B_0) to the carry out C_4.
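A small Python sketch can illustrate how the global signals of Eqs. (7.13) and (7.14) shorten the carry path. It assumes the block carry relation C_{i+4} = G + P·C_i, which follows from factoring Eq. (7.9), and uses the OR form of the bit-level propagate signal; the helper names are hypothetical.

```python
def to_bits(value, n):
    return [(value >> i) & 1 for i in range(n)]

def group_pg(a_bits, b_bits):
    """Global propagate P and generate G of one 4-bit block (LSB first)."""
    g = [a & b for a, b in zip(a_bits, b_bits)]
    p = [a | b for a, b in zip(a_bits, b_bits)]
    big_p = p[3] & p[2] & p[1] & p[0]                        # Eq. (7.13)
    big_g = (g[3] | (p[3] & g[2]) | (p[3] & p[2] & g[1])
             | (p[3] & p[2] & p[1] & g[0]))                  # Eq. (7.14)
    return big_p, big_g

def block_carries(a, b, width=32, c_in=0):
    """Carries C_4, C_8, ... obtained by chaining C_{i+4} = G + P*C_i."""
    carries = []
    carry = c_in
    for i in range(0, width, 4):
        # P and G of every block depend only on the operands, so in hardware
        # they are all available before the carry arrives.
        big_p, big_g = group_pg(to_bits(a >> i, 4), to_bits(b >> i, 4))
        carry = big_g | (big_p & carry)
        carries.append(carry)
    return carries
```

Because each block contributes only one AND-OR step to the carry chain, a 32-bit adder built this way has eight such steps on its critical path instead of thirty-two ripple stages.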
The sum generator is implemented using the propagate signal P_i and its complement. Fig. 7.10(a) illustrates one possible circuit using a static CMOS implementation. Another circuit, more compact and faster, is shown in Fig. 7.10(b). It uses transmission gates and needs only 6 transistors.

Figure 7.10 Sum generator circuits: (a) static CMOS; (b) transmission gate version.
Many circuit techniques for high-speed carry look-ahead adders have been proposed. One of them uses a pseudo-NMOS-like style [1]. The adder was used in a multiplier and achieved high-speed static operation. However, it consumes a DC current and is not suitable for low-power applications.
Other CLA implementations that improve the carry path delay are based on the transmission gate and CPL families. In this section we present the one based on CPL. The TG version is left to the reader to design. Fig. 7.11 shows the block diagram of a 32-bit PMOS latch CPL carry look-ahead adder using 4-bit blocks. The carry generators (CLGs) of each 4-bit block generate the carries C_{i+1} through C_{i+4} in parallel from the carry in, C_i. The different P_i and G_i signals required by each 4-bit block are not shown for clarity. When the carry C_{i+4} is fed to the next 4-bit block, it uses a buffer to distribute this carry to the other CLGs and SGs. Therefore, the carry path is not significantly loaded. This results in fast operation. Fig. 7.12 shows the CPL implementation of the CLG of the fourth bit. This circuit is located in the critical path of the carry signal. It is compact and uses only NMOS pass transistors. P and G are the global propagate and generate signals, respectively. The fourth carry is generated from the carry in or the G signal through only one NMOS device. The P signal block is implemented using the AND/NAND CPL style. After every four CLG blocks of the critical path, the carry is buffered and restored using PMOS latch buffers. The PMOS latch restores the reduced high level to full swing to avoid any DC leakage current, as shown in Fig. 7.11. Fig. 7.13 shows the G signal block for the fourth-bit CLG as an example. The same circuit style can be used to generate the G signal for the third-bit, second-bit, and first-bit CLGs. In addition, the output inverter uses a PMOS latch to restore the swing. The PMOS latch circuit is incorporated only when dual-rail signals are available. For a single-ended signal, however, a feedback PMOS transistor is added to restore the full-rail high level, as in the case of the sum generator of Fig. 7.14.
7.1.3 Carry-Select Adder

Figure 7.13 G block in CPL logic.

Another adder implementation which improves the speed of the RCA is the Carry Select adder (CS). It provides a regular layout, as in the case of an RCA. A CS adder basically consists of blocks, each executing two additions. One assumes that the carry in is "1"; the other assumes that the carry in is "0". The real carry in is computed from the previous block and selects one of the two sum outputs with a simple TG multiplexer. Fig. 7.15 shows an example of an 8-bit carry select adder implementation with 4-4 staging. The carry signal, C_4, selects the next four sums and the carry C_8. The 4-bit adder blocks usually use an RCA with a transmission gate implementation. For a 32-bit adder, the use of the normal staging, 4-4-4-4-4-4-4-4, does not lead to an optimum delay. This is due to the multiplexing delay of the next carry. Optimal staging depends on the technology. For example, for the 0.8 µm CMOS device parameters presented in Chapter 3, simulations show that the optimal staging of a 32-bit CS adder using TGs is 4-4-7-9-8 at a 3.3 V power supply voltage. This implementation is regular and easy to lay out; however, it occupies a larger area than the RCA.
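The carry-select principle can be sketched behaviourally in Python as follows. The block widths are passed as a `staging` tuple; the function names are hypothetical, and for simplicity this sketch computes both candidate results in every block, including the first one.

```python
def ripple_add(a_bits, b_bits, c_in):
    """Ripple-carry addition of LSB-first bit lists."""
    sums, carry = [], c_in
    for a, b in zip(a_bits, b_bits):
        sums.append(a ^ b ^ carry)
        carry = (a & b) | ((a | b) & carry)
    return sums, carry

def carry_select_add(a_bits, b_bits, c_in=0, staging=(4, 4)):
    """Carry-select addition of LSB-first bit lists.

    staging lists the block widths from the LSB side; it must sum to the
    operand width.
    """
    if len(a_bits) != len(b_bits) or sum(staging) != len(a_bits):
        raise ValueError("staging must cover the operand width")
    sums, carry, start = [], c_in, 0
    for width in staging:
        a_blk = a_bits[start:start + width]
        b_blk = b_bits[start:start + width]
        # Both candidate results exist before the real carry is known.
        sums0, carry0 = ripple_add(a_blk, b_blk, 0)
        sums1, carry1 = ripple_add(a_blk, b_blk, 1)
        # The incoming carry only drives the multiplexer select.
        sums.extend(sums1 if carry else sums0)
        carry = carry1 if carry else carry0
        start += width
    return sums, carry

def to_bits(value, n):
    return [(value >> i) & 1 for i in range(n)]

def from_bits(bits):
    return sum(bit << i for i, bit in enumerate(bits))
```

A call such as `carry_select_add(to_bits(a, 32), to_bits(b, 32), 0, staging=(4, 4, 7, 9, 8))` uses the staging quoted in the text. The wider later blocks have time to finish their two ripple additions while the select carry is still travelling through the earlier multiplexers.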
7.1.4 Conditional Sum Adders
In 1960, Sklansky considered the Conditional Sum Adder (CSA) as the fastest one from a theoretical point of view [2, 3]. The concept behind this architecture is explained using the basic circuit of Fig. 7.16. This example is for a 4-bit conditional sum adder. It uses two types of cells: i) the conditional cell, and ii) the multiplexer. For each bit there is one conditional cell circuit. It computes two sums and two carries: S^0 and C^0 are calculated for a carry in of zero, and S^1 and C^1 are calculated for a carry in of one. The selection of the true sum is done with the first carry in and the previous carries. The true final carry out (see Fig. 7.16) is also selected.
A possible implementation of the conditional sum adder is shown in Fig. 7.17 for the case of a 4-bit adder [4]. The conditional cell can be implemented with the compact logic elements of Fig. 7.17(b). The different signals of the conditional cell are constructed using the following relations
$$S_i^0 = A_i \cdot \bar{B}_i + \bar{A}_i \cdot B_i \qquad (7.15)$$
$$S_i^1 = A_i \cdot B_i + \bar{A}_i \cdot \bar{B}_i \qquad (7.16)$$
$$C_i^0 = A_i \cdot B_i \qquad (7.17)$$
$$C_i^1 = A_i + B_i \qquad (7.18)$$
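The conditional cell and the multiplexer selection can be sketched at the logic level in Python. The cell below assumes the usual definitions (sum and carry of A_i + B_i for a carry in of 0 and of 1), and the pairwise merging order is one possible arrangement chosen for illustration, not necessarily the exact multiplexer tree of Fig. 7.17; the function names are hypothetical.

```python
def conditional_cell(a, b):
    """Return ((sums, carry) for carry-in 0, (sums, carry) for carry-in 1)."""
    s0, c0 = a ^ b, a & b          # results assuming carry-in = 0
    s1, c1 = s0 ^ 1, a | b         # results assuming carry-in = 1
    return ([s0], c0), ([s1], c1)

def merge(low, high):
    """Combine two adjacent blocks; `low` holds the less significant bits.

    For each assumed carry-in of the low block, its conditional carry out
    selects which of the high block's two precomputed results is kept.
    """
    merged = []
    for assumed in (0, 1):
        low_sums, low_carry = low[assumed]
        high_sums, high_carry = high[low_carry]   # multiplexer selection
        merged.append((low_sums + high_sums, high_carry))
    return tuple(merged)

def conditional_sum_add(a_bits, b_bits, c_in=0):
    """Conditional-sum addition of LSB-first bit lists."""
    blocks = [conditional_cell(a, b) for a, b in zip(a_bits, b_bits)]
    while len(blocks) > 1:
        paired = [merge(blocks[i], blocks[i + 1])
                  for i in range(0, len(blocks) - 1, 2)]
        if len(blocks) % 2:
            paired.append(blocks[-1])             # odd block carried forward
        blocks = paired
    return blocks[0][c_in]                        # true carry-in picks the result

def to_bits(value, n):
    return [(value >> i) & 1 for i in range(n)]

def from_bits(bits):
    return sum(bit << i for i, bit in enumerate(bits))
```

Each merging level doubles the block width, so an n-bit addition needs about log2(n) levels of multiplexers after the conditional cells, which is the source of the speed advantage claimed for this architecture.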
The adder mainly uses transmission gates for the multiplexers, as shown in Fig. 7.17(c). Note that the architecture uses the signals and their complements (dual-rail architecture) to avoid the use of inverters for the multiplexers. Otherwise the delay of the carry path would be penalized by the addition of inverters. To design an n-bit (e.g., 32-bit) adder, one possible technique for fast operation is to use staged blocks of constant or variable width. In this case, all the conditional sum blocks compute their respective double sums and double output carries in parallel. The true sum and carry out signals of each block are then selected by the carry in generated by the previous stage. The architecture at the block level uses a carry-select-like technique where the carry in of each block is the true carry out of the previous block. The optimal staging can be determined from circuit simulation. The architecture has two critical delay paths within a block. One is from the carry in to the carry out, which is affected by the layout routing since the carry in of a block is distributed to all the final multiplexers. The other critical delay path is the one from the LSB input of a block to the carry out. To reduce the power dissipation and the delay of the CSA adder, a CPL-like circuit style can be used. Fig. 7.18 shows the different circuit cells needed to implement such an adder. In Fig. 7.18(a), the conditional cell schematic is shown. The output signals have a high-level voltage equal to V_DD - V_T. Fig. 7.18(b) shows the compact multiplexer using NMOS pass-transistors. The control signals of the multiplexers should have a full-rail swing. When using these reduced-swing circuits in the adder, whenever a full-rail swing is needed it can be generated with the double-rail swing-restoring circuit of Fig. 7.18(c). The output inverter of the sum signal is shown in Fig. 7.18(d). The feedback PMOS transistor is needed to restore the high level when only a single rail exists.
The layout of such an adder is regular. Only three cells, those of the first, second and third bits, have to be drawn. Fig. 7.19 illustrates the layout of a 4-bit block in 0.8 µm design rules.
7.1.5 Adder Architectures Comparison

The ripple adder has the smallest area compared to the other classes, and the lowest power in many cases. So it should be limited to applications where the area and/or the power must be minimized, while the speed is not important. For fast adders, the CLA circuit is usually used; however, its power dissipation can be relatively high. The carry select adders are widely used as the optimum compromise between the high-speed operation of the CLAs and the small area of RCAs. The conditional sum adder, with variable block staging combined with a carry-select-like style, can result in the fastest adder if well optimized. The power dissipation of this adder can be comparable to, or maybe less than, that of the RCA because it uses a reduced internal swing and a relatively small transistor count if the CPL-like style is used. When considering all the criteria, such as the power, the area and the speed, a tool can be developed to select the adder class which satisfies the specified requirements.

Figure 7.19 4-bit conditional sum adder layout.
For wide adders, with operand sizes of more than 64 bits, the different architectures can still be utilized. However, to optimize the speed and power of such a wide adder, several additional algorithms can be combined. Examples of wide adders can be found in [5, 6].
7.2 PARALLEL MULTIPLIERS

High-speed parallel multipliers are becoming one of the keys in RISCs (Reduced Instruction Set Computers), DSPs (Digital Signal Processors), graphics accelerators and so on. Parallel multipliers are used in data processors as well as digital signal processors. For example, for multimedia applications, fast 16 × 16 multipliers are needed. For a floating-point unit using double-precision multiplication (IEEE-754 standard), the mantissa data has 52 bits. A 54 × 54 multiplier is then required for such an operation. The two added bits are the sign bit and the guard bit. In this section we discuss several parallel multiplier algorithms which have been used in VLSI. The reader can consult references [7, 8] for more details on array multiplication algorithms.
7.2.1 Braun Multiplier
Consider two unsigned numbers X = X_{n-1}...X_1X_0 and Y = Y_{n-1}...Y_1Y_0:

    X = \sum_{i=0}^{n-1} X_i 2^i    (7.19)

    Y = \sum_{j=0}^{n-1} Y_j 2^j    (7.20)

The product P = P_{2n-1}...P_1P_0, which results from multiplying the multiplicand X by the multiplier Y, can be written in the following form

    P = \sum_{i=0}^{n-1} \sum_{j=0}^{n-1} (X_i Y_j) 2^{i+j}    (7.21)

Each of the partial product terms P_k = X_i Y_j is called a summand. Fig. 7.20(a) shows an example of 4 x 4 multiplication. The summands are generated in parallel with AND gates. Fig. 7.20(b) shows Braun's array multiplier [7]. Such an n x n multiplier requires n(n - 1) adders and n^2 AND gates. The adder array can be implemented efficiently by arranging it for a regular layout. Fig. 7.21 shows a regular 4 x 4 array implementation of the multiplier of Fig. 7.20 using three different cells. The first cell contains an AND gate [Fig. 7.21(b)]. The second cell, shown in Fig. 7.21(c), contains a full-adder and an AND gate. The routing lines are also illustrated in these cells. The last cell represents a full-adder composing the final carry propagate adder.

The multiplier array uses what are called carry-save adders. The delay of such a multiplier depends on the delay of the full-adder cell and on the final adder in the last row. In the multiplier array, an adder with balanced carry and sum delays is desirable because sum and carry signals are both on the critical path. This is different from the case of a parallel adder, where the carry path should be optimized and sped up compared to the sum path. For large arrays, the speed and power of the full-adder are very important. CPL-like styles discussed in Chapter 4 can result in reduced power dissipation and high-speed operation. The final adder in the last row can use the techniques presented in Section 7.1.
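The summand formulation above can be checked with a short behavioral sketch. It models only the arithmetic (AND-gate summands weighted by 2^(i+j)), not the carry-save array wiring or its timing; the function names `braun_multiply` and `braun_cell_count` are illustrative and do not come from the text.

```python
def braun_multiply(x, y, n):
    """Multiply two unsigned n-bit integers by accumulating AND-gate summands."""
    if not (0 <= x < (1 << n) and 0 <= y < (1 << n)):
        raise ValueError("operands must be unsigned n-bit values")
    product = 0
    for i in range(n):
        x_i = (x >> i) & 1
        for j in range(n):
            y_j = (y >> j) & 1
            # Summand X_i AND Y_j, weighted by 2^(i+j)
            product += (x_i & y_j) << (i + j)
    return product


def braun_cell_count(n):
    """Cell counts quoted in the text: n^2 AND gates and n(n - 1) adders."""
    return {"and_gates": n * n, "adders": n * (n - 1)}


print(braun_multiply(13, 11, 4))
print(braun_cell_count(4))
```

For a 4 x 4 array the counts are 16 AND gates and 12 adders.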
7.2.2 Baugh-Wooley Multiplier

It was noted that the Braun multiplier performs multiplication of unsigned numbers. The Baugh-Wooley technique [7] was developed to design regular direct multipliers for two's complement numbers. This direct approach does not need any two's complementing operations prior to multiplication. Let us consider two numbers X and Y with the following form

    X = -X_{n-1} 2^{n-1} + \sum_{i=0}^{n-2} X_i 2^i    (7.22)

    Y = -Y_{n-1} 2^{n-1} + \sum_{i=0}^{n-2} Y_i 2^i    (7.23)

The product P = XY is given by the following equation

    P = XY = X_{n-1} Y_{n-1} 2^{2n-2} + \sum_{i=0}^{n-2} \sum_{j=0}^{n-2} X_i Y_j 2^{i+j}
             - X_{n-1} \sum_{i=0}^{n-2} Y_i 2^{n+i-1} - Y_{n-1} \sum_{i=0}^{n-2} X_i 2^{n+i-1}    (7.24)

In order to avoid the use of subtractor cells and use only adders, the negative terms should be transformed. So

    -X_{n-1} \sum_{i=0}^{n-2} Y_i 2^{n+i-1} = X_{n-1} \left( -2^{2n-2} + 2^{n-1} + \sum_{i=0}^{n-2} \bar{Y}_i 2^{n+i-1} \right)    (7.25)

Using this property in Equation (7.24), the product P becomes

    P = XY = -2^{2n-1} + (\bar{X}_{n-1} + \bar{Y}_{n-1} + X_{n-1} Y_{n-1}) 2^{2n-2} + \sum_{i=0}^{n-2} \sum_{j=0}^{n-2} X_i Y_j 2^{i+j}
             + (X_{n-1} + Y_{n-1}) 2^{n-1} + \sum_{i=0}^{n-2} X_{n-1} \bar{Y}_i 2^{n+i-1} + \sum_{i=0}^{n-2} Y_{n-1} \bar{X}_i 2^{n+i-1}    (7.26)

Using the above relation, an n x n multiplier using only adders can be implemented. The schematic circuit diagram of a 4 x 4 two's complement multiplier based on the Baugh-Wooley algorithm is shown in Fig. 7.22. The different cells composing the array are also shown. In this scheme n(n - 1) + 3 full-adders are required. So for the case of n = 4 the array needs 15 adders. When n is relatively large, the final adder stage in the multiplier array can be implemented with the techniques discussed in Section 7.1.

Figure 7.22 4 x 4 Baugh-Wooley two's complement array (FA: Full-Adder).

This type of multiplier is suitable for applications where operands with less than 16 bits are to be processed. Applications for such a multiplier are, for example, digital filters where small operands are used (e.g., 6, 8 and 12 bits). For low-power and high-speed operation, the array uses a CPL-like adder as mentioned previously in Section 7.2.1, while a CSA scheme, combined with carry select, can be utilized in the final adder. For operands equal to or greater than 16 bits, the Baugh-Wooley scheme becomes too area-consuming and slow. Hence, techniques to reduce the size of the array, while maintaining the regularity, are required.
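As a sanity check on the Baugh-Wooley rearrangement, the following sketch evaluates the adder-only form of the product (complemented cross terms plus constant correction terms) and compares it with ordinary signed multiplication. It is an arithmetic model under the two's complement definitions (7.22) and (7.23), not a description of the array of Fig. 7.22; the function names are hypothetical.

```python
def baugh_wooley_product(x, y, n):
    """Signed n-bit product built only from positively weighted summands
    and constant corrections, as in the adder-only Baugh-Wooley form."""
    mask = (1 << n) - 1
    xb = [((x & mask) >> i) & 1 for i in range(n)]
    yb = [((y & mask) >> i) & 1 for i in range(n)]
    xs, ys = xb[n - 1], yb[n - 1]            # sign bits X_{n-1}, Y_{n-1}

    p = -(1 << (2 * n - 1))                   # constant term -2^(2n-1)
    p += ((1 - xs) + (1 - ys) + xs * ys) << (2 * n - 2)
    p += (xs + ys) << (n - 1)
    for i in range(n - 1):
        for j in range(n - 1):
            p += (xb[i] & yb[j]) << (i + j)   # magnitude-bit summands
    for i in range(n - 1):
        p += (xs & (1 - yb[i])) << (n + i - 1)   # X_{n-1} * NOT(Y_i)
        p += (ys & (1 - xb[i])) << (n + i - 1)   # Y_{n-1} * NOT(X_i)
    return p


def full_adder_count(n):
    """Full-adder count quoted in the text: n(n - 1) + 3."""
    return n * (n - 1) + 3


print(baugh_wooley_product(-8, 7, 4), full_adder_count(4))
```

In hardware the constant -2^(2n-1) is equivalent, modulo 2^(2n), to adding a 1 at the most significant bit position, which is why no subtractor cell is needed.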
7.2.3 The Modified Booth Multiplier

For operands equal to or greater than 16 bits, the modified Booth algorithm [8] has been used in almost all the designed multipliers. It is based on recoding the two's complement operand (i.e., the multiplier) in order to reduce the number of partial products to be added. This makes the multiplier faster and reduces the hardware (area). For example, the modified Radix-2 algorithm is based on partitioning the multiplier into overlapping groups of 3 bits, and each group is decoded to generate the correct partial product.

Let us write the multiplier, Y, in two's complement

    Y = -Y_{n-1} 2^{n-1} + \sum_{i=0}^{n-2} Y_i 2^i    (7.27)

It can be rewritten as follows

    Y = \sum_{i=0}^{n/2-1} \left[ Y_{2i-1} + Y_{2i} - 2 Y_{2i+1} \right] 2^{2i},  with Y_{-1} = 0    (7.28)

In this equation, the terms in brackets have values in the set {-2, -1, 0, +1, +2}. The recoding of Y, using the modified Booth algorithm, generates another number with the following five signed digits: -2, -1, 0, +1, +2. Each recoded digit in the multiplier performs a certain operation on the multiplicand, X, as illustrated in Table 7.1.

Table 7.1 Partial product selection

Y_{2i+1}   Y_{2i}   Y_{2i-1}   Recoded digit   Operation on X
   0         0         0            0             0 x X
   0         0         1           +1            +1 x X
   0         1         0           +1            +1 x X
   0         1         1           +2            +2 x X
   1         0         0           -2            -2 x X
   1         0         1           -1            -1 x X
   1         1         0           -1            -1 x X
   1         1         1            0             0 x X
So the bits of the multiplier are partitioned into overlapping groups of 3 bits, and each group permits generation of a certain partial product. The five possible multiples of the multiplicand are relatively easy to generate following the explanation given in Table 7.2.

The generated partial product is related to the multiplicand for each recoded digit by the relationships presented in Table 7.3. PP_i is the ith bit of the partial product and PP_n is the sign bit of the partial product, with PP_n = PP_{n-1} when no shifting of the partial product is performed. Note that the partial product is represented on n + 1 bits.
Table 7.2

Recoded digit   Operation on X
     0          Add 0 to the partial product
    +1          Add X to the partial product
    +2          Shift left X one position and add it to the partial product
    -1          Add two's complement of X to the partial product
    -2          Take two's complement of X and shift left one position

Table 7.3 Partial product generation relations.

Recoded digit   Operation on X                              Added to LSB
     0          PP_i = 0              for i = 0, ..., n          0
    +1          PP_i = X_i            for i = 0, ..., n          0
    +2          PP_i = X_{i-1}        for i = 0, ..., n          0
    -1          PP_i = \bar{X}_i      for i = 0, ..., n          1
    -2          PP_i = \bar{X}_{i-1}  for i = 0, ..., n          1
To clarify this algorithm, an example is presented in Fig. 7.23. Let X = 10010101 and Y = 01101001. The recoded digits of Y are

    Y (with Y_{-1} = 0 appended):    0 1 1 0 1 0 0 1 (0)
    3-bit groups (MSB to LSB):       011    101    100    010
    Recoded digits:                   +2     -1     -2     +1

The bits are grouped into 3-bit groups overlapped by one bit, and a bit with a value of zero is added on the right side of Y as Y_{-1}. So the multiplication of two 8-bit numbers generates only 4 partial products. The number of partial products is thus reduced by half. Each partial product in this example is represented on 9 bits. For a correct addition of the partial products, the signs are extended as shown in Fig. 7.23. The shape of the multiplier is then trapezoidal due to the sign extension.
Figure 7.23 Multiplication example with X = 10010101 (-107) and Y = 01101001 (+105): the recoded bit groups 010, 100, 101 and 011 select the operations +1, -2, -1 and +2, and P = 1101010000011101 (-11235).
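The recoding rule of Table 7.1 can be sketched in a few lines. The code below is a behavioral illustration only: it recodes an n-bit two's complement multiplier (n even) into signed digits, least significant digit first, and sums the digit-weighted multiples of X. The names `booth_recode` and `booth_multiply` are assumptions made for this sketch.

```python
def booth_recode(y, n):
    """Modified Booth digits of the n-bit two's complement number y.

    Digit i is Y_{2i-1} + Y_{2i} - 2*Y_{2i+1}, with Y_{-1} = 0.
    Returns the digits least significant first; n must be even.
    """
    if n % 2:
        raise ValueError("n must be even")
    bits = [(y >> i) & 1 for i in range(n)]   # also valid for negative y
    digits = []
    previous = 0                               # Y_{-1} = 0
    for i in range(0, n, 2):
        digits.append(previous + bits[i] - 2 * bits[i + 1])
        previous = bits[i + 1]
    return digits


def booth_multiply(x, y, n):
    """Sum the n/2 partial products digit * X * 4^i."""
    return sum(d * x * 4 ** i for i, d in enumerate(booth_recode(y, n)))


print(booth_recode(0b01101001, 8))      # digits for Y = +105, LSB digit first
print(booth_multiply(-107, 105, 8))
```

For the example of the text, the digits read from the most significant group are +2, -1, -2, +1, and the product is -11235.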
In order to make the array rectangular, and thus more regular for VLSI implementation, the problem of sign extension must be addressed. This problem is more crucial when the operands are wide, since each partial product must be sign-extended to the length of the product. In this section we will not deal with all the techniques to solve the problem of the sign extension, but we will discuss one technique, which is shown in Fig. 7.24 for the example of Fig. 7.23. The basic idea is to use two extra bits in the partial product. For the first partial product, the two additional bits, PP_{n+1} and PP_{n+2}, are equal to the sign bit of the partial product

    PP_{n+2} = PP_{n+1} = PP_n    (7.29)

For the second partial product, if the first partial product was positive, then the two additional bits for this second partial product are given by the expression above; otherwise we have two cases

    PP_{n+2} = PP_{n+1} = 1                 if PP_n = 0    (7.30)

and

    PP_{n+2} = \overline{PP_{n+1}} = 1      if PP_n = 1    (7.31)

So it is more interesting to use a third bit, F, as a flag to indicate whether there is, from the previous partial products, a negative sign bit to be propagated. F_1 is the flag generated by the first partial product for the next one. For the example of Fig. 7.24, F_0 = 0 (no PP before the first one), and F_1 = F_2 = F_3 = 1. So from the first partial product there is a sign propagation to all the others. This flag is expressed by the following Boolean equation

    F_{j+1} = F_j + PP_{n,j}    (7.32)

where PP_{n,j} is the sign bit of the jth partial product.

Figure 7.24 The previous example of Figure 7.23 with simplified sign extension.
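The two-extra-bit scheme can be exercised numerically. The sketch below is an assumption-laden behavioral model, not the circuit of Fig. 7.28: each partial product is an (n + 1)-bit field built as in Table 7.3 (one's complement plus a 1 added at the LSB for negative digits), the two extra bits follow Equations (7.29) to (7.31), and the flag follows Equation (7.32). All function names are hypothetical.

```python
def booth_digits(y, n):
    """Modified Booth digits of n-bit y (n even), least significant first."""
    bits = [(y >> i) & 1 for i in range(n)]
    digits, previous = [], 0
    for i in range(0, n, 2):
        digits.append(previous + bits[i] - 2 * bits[i + 1])
        previous = bits[i + 1]
    return digits


def partial_product(x, digit, n):
    """Return (field, add): the (n+1)-bit field and the bit added at the LSB."""
    mask = (1 << (n + 1)) - 1
    multiple = x * abs(digit)                 # 0, X or 2X
    if digit < 0:
        return (~multiple) & mask, 1          # one's complement, +1 at the LSB
    return multiple & mask, 0


def multiply_with_flags(x, y, n):
    """Booth multiplication using two extension bits per row and the flag F."""
    total, flag, flags = 0, 0, []             # F_0 = 0
    for j, digit in enumerate(booth_digits(y, n)):
        field, add = partial_product(x, digit, n)
        sign = (field >> n) & 1               # PP_n of row j
        flags.append(flag)
        if flag == 0:
            e1 = e2 = sign                    # Eq. (7.29)
        elif sign == 0:
            e1 = e2 = 1                       # Eq. (7.30)
        else:
            e1, e2 = 0, 1                     # Eq. (7.31)
        row = field | (e1 << (n + 1)) | (e2 << (n + 2))
        total += (row + add) << (2 * j)
        flag |= sign                          # Eq. (7.32)
    total &= (1 << (2 * n)) - 1               # keep 2n product bits
    if total >> (2 * n - 1):
        total -= 1 << (2 * n)                 # interpret as two's complement
    return total, flags


print(multiply_with_flags(-107, 105, 8))
```

For X = -107 and Y = +105 the flags come out as F_0 = 0 and F_1 = F_2 = F_3 = 1, matching the values quoted for Fig. 7.24.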
Let us now see the implementation of the n x n modified Booth multiplier. Fig. 7.25 shows the block diagram of the multiplier. It also gives an idea of the floorplan of this subsystem. It is composed of the following blocks:

■ The multiplier array containing the partial product generators and 1-bit adders;
■ The Booth encoder and the sign extension bits (PP_{n+2}, PP_{n+1}, F). The Booth encoder generates the five signals (0, +1x, +2x, -1x, and -2x) for each group of 3 bits of Y; and
■ The final stage adder, which performs a 2n-bit addition.

Figure 7.25 Block diagram of the n x n multiplier using the modified Booth algorithm.

For the sake of simplicity, we treat the case of a 6 x 6 multiplier. All the cells described in this example are the basic cells of any multiplier size. Fig. 7.26 shows the implementation of such a multiplier. Four types of cells are used plus the final adder. These cells are:
■ The ADD cell, which generates 0 or 1 [see Table 7.3]. The schematic circuit of this cell is shown in Fig. 7.27(a). Two implementations are possible: one using pass-transistors controlled by the five signals defining the recoded digit code, and the other an AND2 gate of the two signals -1x and -2x.
■ The partial product MUX (PP-MUX), which generates the partial product. Fig. 7.27(b) shows the schematic of the PP-MUX using CPL type logic. The feedback PMOS devices, P_f in this figure or in the one of Fig. 7.27(a), are used to restore the high level and eliminate any DC current. This implementation permits fast and low-power operation.
■ The PP-FA (PP-HA) cells. They merge the PP-MUX circuit and a full-adder (half-adder), respectively. A CPL-like adder can be utilized for fast and low-power operation.
■ The Booth Encoder (BE). It generates the five control signals 0x, +1x, +2x, -1x, and -2x from a group of three bits of the multiplier Y. Fig. 7.28 shows the schematic of the different circuits involved in the BE block. The additional circuits of the two bits PP_{n+1,j} and PP_{n+2,j} of the jth PP are also illustrated. F_j and F_{j+1} are the previous and the next flags, respectively. PP_{n,j} is the sign bit of the jth PP. Note that F_0 is 0.

Figure 7.27 Booth multiplier cells: (a) ADD; (b) PP-MUX; (c) PP-FA (or PP-HA).
The Booth multiplier exhibits a lot of unnecessary glitches. The main reason for the glitches is the race condition between the multiplicand and the multiplier, caused by the Booth encoder. The power dissipation associated with the glitches can be an important portion of the total power and hence needs to be reduced by signal synchronization techniques.
7.2.4 Wallace Tree
By applying the Booth algorithm, the number of partial products is halved. However, for large multipliers, 32-bit and over, the number of partial products is 16 or more. In this case, the performance of the modified Booth algorithm is limited. One technique to improve the performance of these multipliers is to adopt the Wallace tree using 4-2 compressors. A 4-2 compressor accepts 4 numbers and a carry in, and sums them to produce 2 numbers and a carry out (really it is a 5-3 compressor). Fig. 7.29(a) shows an example of such a tree on the partial products of an unsigned 8 x 8 multiplier. Eight partial products are produced. Using 4-2 compressors, two levels of addition (stages) are needed. The final two summands are added using a fast 16-bit adder. Some zeros are added to the array. This example shows that the bits which are not used in the first stage (level) jump to the next one to be combined with the ones produced by the compressors. Fig. 7.29(b) shows the architecture of the 8 x 8 multiplier. For the first stage of the tree, two blocks, A and B, are required. The block A (B) of compressors groups the first (last) four partial products, respectively.
Figure 7.28 Logic schematic of the Booth encoder including the sign extension logic.

Fig. 7.30 shows how the 4-2 compressor can be implemented by 2 full-adders or by custom static CMOS logic [9]. Four bits, I_1, ..., I_4, are added to produce two sums, S and C. Hence, 4 bits of the partial products are compressed to produce two new partial products. The compressor is implemented, using carry-save adder construction, by two cascaded full-adders as shown in Fig. 7.30(b). Notice that the carry-out is never generated by the carry-in. Fig. 7.31 shows the 4-2 compressor circuit using a compact structure of multiplexers [10]. This structure is faster than the static complementary version. Fig. 7.32 shows the interconnection of the 4-2 compressors for block A of the example of Fig. 7.29. C_out is connected to the next carry-in, C_in. Since these signals are independent, the carry is not propagated through the row.
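The two-full-adder construction can be modeled directly. The sketch below assumes the usual cascade (first full-adder on I_1, I_2, I_3; second on its sum, I_4 and the carry-in), which is one reading of Fig. 7.30(b); it checks the counting identity of the compressor and that the carry-out does not depend on the carry-in. Function names are illustrative.

```python
from itertools import product


def full_adder(a, b, c):
    """1-bit full adder returning (sum, carry)."""
    return a ^ b ^ c, (a & b) | (b & c) | (a & c)


def compressor_4_2(i1, i2, i3, i4, c_in):
    """4-2 compressor from two cascaded full adders.

    Returns (s, c, c_out) with i1 + i2 + i3 + i4 + c_in == s + 2 * (c + c_out).
    """
    s1, c_out = full_adder(i1, i2, i3)   # c_out is formed before c_in is used
    s, c = full_adder(s1, i4, c_in)
    return s, c, c_out


for bits in product((0, 1), repeat=5):
    s, c, c_out = compressor_4_2(*bits)
    assert sum(bits) == s + 2 * (c + c_out)

print(compressor_4_2(1, 1, 1, 1, 1))
```

Because c_out depends only on I_1, I_2 and I_3, chaining c_out to the neighbouring c_in never creates a ripple path along the row.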
To further enhance the Wallace tree multiplier, the modified Booth algorithm can be used to reduce the number of partial products by half in a carry-save adder array. One example of such a combined construction is the architecture of the 32 x 32 multiplier shown in Fig. 7.33. It consists of four functions: the Booth encoder, the partial product generator, the compressor blocks, and the final 64-bit adder. The Wallace tree is constructed with 3 stages (levels). The first stage has 4 blocks (A to D), with each block summing up 4 partial products among 16. The second stage sums up the 8 newly generated partial products from the first stage. Hence, two blocks are needed, E and F. Finally, block G of the third stage of the tree generates two new partial products for the final adder. This architecture exhibits some irregularities in the layout since it has a complicated interconnection scheme. Hence, the interconnection wires affect the speed and power dissipation of the adder.
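The stage counts quoted above (two stages for 8 partial products, three for 16) follow from each 4-2 compressor block turning four rows into two. A minimal sketch of that bookkeeping, under the assumption that three leftover rows are also compressed by a block padded with a zero row and that one or two leftover rows simply pass to the next stage; the helper names are hypothetical:

```python
def booth_partial_products(n):
    """Number of partial products of an n x n modified Booth multiplier (n even)."""
    return n // 2


def compressor_stages(rows):
    """Number of 4-2 compressor stages needed to reduce `rows` rows to two."""
    stages = 0
    while rows > 2:
        groups, rest = divmod(rows, 4)
        # Each full group of 4 rows becomes 2 rows; 3 leftover rows are
        # compressed with a zero row, 1 or 2 leftover rows pass through.
        rows = 2 * groups + (2 if rest == 3 else rest)
        stages += 1
    return stages


print(compressor_stages(8))                            # unsigned 8 x 8 example
print(compressor_stages(booth_partial_products(32)))   # 32 x 32 with Booth recoding
print(compressor_stages(32))                           # 32 x 32 without recoding
```

Under these assumptions, Booth recoding saves one compressor stage for the 32 x 32 case.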
7.2.5 Multipliers Comparison
The basic array multipliers, like the Baugh-Wooley scheme, consume low power and have relatively good performance. However, their use can be limited to processing operands with less than 16 bits (e.g., 8-bit). For operands of 16 bits and over, the modified Booth algorithm reduces the number of partial products by half. Therefore, the delay of the multiplier is reduced. Its power dissipation is comparable to that of the Baugh-Wooley multiplier due to the circuitry overhead in the Booth algorithm. However, circuit techniques can enhance this multiplier to have low-power characteristics. The fastest multipliers adopt the Wallace tree with modified Booth encoding. A Wallace tree would lead, in general, to larger power dissipation and area, due to the interconnect wires. Hence, it is not recommended for low-power applications. Dynamic multipliers are not discussed in this section since they introduce problems of control and timing. Hence extra area and power dissipation are added to the design.
7.3 DATAPATH
A VLSI chip can be partitioned in two parts: the data path (or execution unit) and the control unit. Data paths are often used in digital signal processors, microprocessors and application specific ICs (ASICs). The data path consists of a combination of an Arithmetic Logic Unit (ALU), a shifter, a register file, I/O ports, a multiplier, an adder, a magnitude comparator, data busses, etc. It performs many operations on the data in the register file, to which the results are sent back. The data busses are the communication means for the data transfer between the different units of the data path, such as the ALU, the shifter and the register file. These busses have a heavy load (a few pF). In CMOS design, dynamic techniques are used to allow fast operation. One way to reduce the power dissipation due to the precharging transistors is to use static busses [11].
Figure 7.34 Arithmetic Logic Unit (ALU).
The control unit delivers the instructions to the data path. These instructions determine the operations that the data path has to perform. The control unit can be implemented using random logic, micro-ROM (Read Only Memory), PLA (Programmable Logic Array) or a combination of these three implementations. Other macrocells, such as a TLB (Translation Lookaside Buffer), cache memory, etc., can be added to the data path and the control unit. In this section, several blocks of a data path are discussed.
7.3.1 Arithmetic Logic Unit

The ALU is an important part of a data path. It is a macrocell which executes arithmetic operations such as multiplication, addition, subtraction and negation, and logic operations such as AND, OR, XOR, comparison, etc. It performs the operation on two operands stored in latch A and latch B and puts the result in latch C as shown in Fig. 7.34. The operation code (op code) selects the operation of the ALU to be executed. The flags indicate the status of the ALU, such as overflow, zero-result, carry generation, etc. The input latches A and B are, in general, connected to two parallel data busses. Sometimes, the input latches are merged with MUXs to select among many input sources to the ALU. The result latch is connected to one of the busses or to a third one. The ALU described in this section is static for low-power applications. The maximum clock frequency of a VLSI circuit may be limited by the ALU operations, especially the arithmetic ones. The critical delay of an arithmetic operation is due mainly to the carry propagation along the width of the ALU.

There are many types of ALU, depending on the number of operations to be performed. Fig. 7.35 shows the block diagram of a 1-bit slice of an ALU. It has exactly the same structure as the adder, except that the P and G blocks are programmable. Fig. 7.35(a) shows the P block with 4 control signals (OP1...OP4). The feedback PMOS transistor, P_f, permits restoration of the high level from V_DD - V_Tn to V_DD. Hence the DC current of the first inverter, due to the reduced high level, is eliminated. Fig. 7.35(b) shows the G block with 4 op code signals (OP5...OP8). The P and G blocks use the pass-transistor style. The techniques discussed in Section 7.1 can be applied to achieve low-power and fast operation. The carry and result (sum) blocks are shown in Fig. 7.35(c) and (d), respectively. Table 7.4 summarizes some of the functions that can be implemented with these blocks. Several other operations can be realized with this ALU.
Table 7.4 Examples of ALU operations

Operation      LSB C_in   P function          G function          Op code (OP1...OP8)   Result
Add w. carry      0       P = A xor B         G = A and B         10011101              A + B
Subtraction       1       P = A xor \bar{B}   G = A and \bar{B}   10011101              A + \bar{B} + 1
Bit-wise AND      0       P = A and B         G = 0               01110000              A and B
Bit-wise OR       0       P = A or B          G = 0               00010000              A or B
Not A             0       P = \bar{A}         G = 0               10100000              \bar{A}
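The idea of a bit slice with programmable P and G blocks can be illustrated behaviorally. In the sketch below each block is a 4-entry truth table indexed by the pair (A_i, B_i); this encoding is a hypothetical choice for illustration and is not the op code assignment of Table 7.4. The carry and result follow the adder structure: C_{i+1} = G_i + P_i C_i and R_i = P_i xor C_i.

```python
# Truth tables indexed by (a << 1) | b. Hypothetical encoding for this
# sketch; it is not the OP1...OP8 assignment of Table 7.4.
P_XOR, P_XNOR = (0, 1, 1, 0), (1, 0, 0, 1)
P_AND, P_OR, P_NOT_A = (0, 0, 0, 1), (0, 1, 1, 1), (1, 1, 0, 0)
G_AND, G_A_NOT_B, G_ZERO = (0, 0, 0, 1), (0, 0, 1, 0), (0, 0, 0, 0)

# operation -> (P table, G table, carry into the least significant bit)
OPERATIONS = {
    "add": (P_XOR, G_AND, 0),
    "sub": (P_XNOR, G_A_NOT_B, 1),     # A + NOT(B) + 1
    "and": (P_AND, G_ZERO, 0),
    "or": (P_OR, G_ZERO, 0),
    "not_a": (P_NOT_A, G_ZERO, 0),
}


def alu(operation, a, b, n):
    """n-bit ripple ALU built from identical slices; returns (result, carry_out)."""
    p_table, g_table, carry = OPERATIONS[operation]
    result = 0
    for i in range(n):
        index = (((a >> i) & 1) << 1) | ((b >> i) & 1)
        p, g = p_table[index], g_table[index]
        result |= (p ^ carry) << i          # result (sum) block
        carry = g | (p & carry)             # carry block
    return result, carry


print(alu("add", 9, 5, 4), alu("sub", 9, 5, 4), alu("not_a", 0b1001, 0, 4))
```

With G = 0 and a zero carry-in, the carry chain stays at 0 and the result equals P, which is how the logic operations reuse the adder structure.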
To implement an n-bit ALU, all the techniques discussed for carry speed-up in adders can be applied. Drivers are needed to distribute the op code signals for an n-bit ALU. For low-power design, the busses which communicate with the ALU are in general not precharged, as is the case in many data paths.

7.3.2 Absolute Value Calculator

The Absolute Value Calculator (AVC) is, in general, used in the data paths of video processors to compare the data of two pictures. Fig. 7.36 shows the architecture of the AVC. This parallel circuit performs two subtractions simultaneously, A - B and B - A. Using the most significant bit of these two operations, the MUX circuit selects the positive one. The output then gives the absolute value |A - B|.

Figure 7.36 Absolute value calculator.

To reduce the power dissipation and the area of an n-bit AVC, the logic of the two n-bit adders required can be reduced by merging the common functions of both operations. Also, the techniques described in Section 7.1 for n-bit addition should be used.
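A behavioral sketch of the AVC selection follows. It assumes unsigned n-bit operands and differences computed on n + 1 bits so that the most significant bit acts as a sign bit; these widths are assumptions of the sketch, since the text does not give them. The function name is illustrative.

```python
def absolute_difference(a, b, n):
    """|A - B| for unsigned n-bit A and B using two parallel subtractions."""
    if not (0 <= a < (1 << n) and 0 <= b < (1 << n)):
        raise ValueError("operands must be unsigned n-bit values")
    mask = (1 << (n + 1)) - 1
    diff_ab = (a - b) & mask          # A - B on n + 1 bits (two's complement)
    diff_ba = (b - a) & mask          # B - A on n + 1 bits (two's complement)
    sign_ab = diff_ab >> n            # MSB is 1 when A - B is negative
    selected = diff_ba if sign_ab else diff_ab   # MUX picks the positive result
    return selected & ((1 << n) - 1)


print(absolute_difference(3, 12, 4), absolute_difference(12, 3, 4))
```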
7.3.3 Comparator
A magnitude comparator is used in many DSP applications. It permits comparison of the magnitudes of two numbers A and B by indicating whether A < B, A = B, or A > B. Fig. 7.37(a) shows an example of a two-bit comparator which requires two types of cells, C1 and C2. The cell C1 is constructed by the circuit of Fig. 7.37(b). Table 7.5 shows the truth table for this cell.

Table 7.5 Truth table for cell C1

Let us explain how a 2-bit comparator works. When A_1 < B_1, then C_1 = D_1 = 0, and A_1A_0 < B_1B_0 regardless of the magnitudes of the lower bits. Similarly, when A_1 > B_1, then C_1 = 1, D_1 = 0, and A_1A_0 > B_1B_0 regardless of the magnitudes of the lower bits. When A_1 = B_1, the relative magnitude of the two 2-bit numbers depends on A_0 and B_0. In this situation, there are three different cases:

1. A_1A_0 < B_1B_0 for A_0 < B_0 (i.e., C_0 = D_0 = 0). Then we can set E_0 = F_0 = 0.
2. A_1A_0 = B_1B_0 for A_0 = B_0 (i.e., C_0 = 0, D_0 = 1). Then we can set E_0 = 0 and F_0 = 1.
3. A_1A_0 > B_1B_0 for A_0 > B_0 (i.e., C_0 = 1, D_0 = 0). Then we can set E_0 = 1 and F_0 = 0.

These relations can easily be used to implement the second cell, C2, of the comparator as shown in Fig. 7.37(c).
This technique, for the two-bit comparator, can be extended to an n-bit comparator. It can be constructed by using a parallel tree of the cells C1 and C2. A 4-bit comparator could, for example, be constructed with two 2-bit comparators connected in parallel, with the 4 generated E and F signals at their outputs fed to an added C2 cell. In this architecture, the glitching is reduced by equalizing the delay paths of each cell.
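The cell behavior described above can be turned into a small tree model. The Boolean forms used here (C = A and not B, D = A xnor B for cell C1; E = C_1 + D_1 C_0 and F = D_1 D_0 for cell C2) are derived from the three cases in the text rather than read from Fig. 7.37, so they should be taken as one consistent realization. The function names are hypothetical, and n is assumed to be a power of two.

```python
def cell_c1(a, b):
    """1-bit cell: (C, D) = (0, 0) if A < B, (0, 1) if A = B, (1, 0) if A > B."""
    return a & (1 - b), 1 - (a ^ b)


def cell_c2(high, low):
    """Combine the (C, D) pairs of a more and a less significant group."""
    c1, d1 = high
    c0, d0 = low
    e = c1 | (d1 & c0)      # greater if high part greater, or equal and low greater
    f = d1 & d0             # equal only if both parts are equal
    return e, f


def compare(a, b, n):
    """Tree comparator for unsigned n-bit numbers (n a power of two)."""
    level = [cell_c1((a >> i) & 1, (b >> i) & 1) for i in range(n)]  # LSB first
    while len(level) > 1:
        level = [cell_c2(level[i + 1], level[i]) for i in range(0, len(level), 2)]
    e, f = level[0]
    return "A > B" if e else ("A = B" if f else "A < B")


print(compare(0b1010, 0b1001, 4), compare(5, 5, 4), compare(2, 7, 4))
```

A 4-bit comparison uses four C1 cells and three C2 cells arranged in two tree levels, so every input sees the same number of cell delays.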
7.3.4 Shifter
Another macrocell of the data path is the shifter. It performs shift or rotate operations on the data. If the number of bits to be shifted is arbitrary, then a barrel shifter is used [12, 13]. Fig. 7.38 shows the CMOS implementation of a 4-bit barrel shifter. NMOS transistors are used as switches in the array. The input bus (D_0 - D_6) can be connected to the output bus (R_0 - R_3) via the pass transistors. The control signals S_0 - S_3 select the pass transistors to be switched. These signals determine the amount of shift and they are generated by a 2-bit decoder. Since the outputs have a high level of V_DD - V_T, due to the pass transistors, the output buffer uses a feedback PMOS device, P_f, to restore the high level to V_DD. This eliminates any DC current in the first inverter of the buffer.

Table 7.6 shows the values of the output bus as a function of the input data. Depending on the values of D<6:0>, several shift operations can be performed. For example, if D<6:4> = 0 and D<3:0> is the 4-bit input data, then a logical shift is realized. However, if D<6:4> = 1 and D<3:0> is the input data, then an arithmetic shift operation is performed.

Table 7.6 Output bus as a function of the shifting amount
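A behavioral sketch of such a pass-transistor array is given below. It assumes that the select line S_s connects input D_{i+s} to output R_i, and that the upper inputs D<6:4> carry zeros, copies of the sign bit, or the low data bits depending on the operation; both points are assumptions of this sketch, since the array wiring and Table 7.6 are not reproduced here. Function names are illustrative.

```python
def barrel_shift(d, shift):
    """4-bit barrel shifter: d is the 7-bit input D<6:0>, shift is 0..3.

    Assumed wiring: select line S_shift connects D_{i+shift} to R_i.
    """
    if not 0 <= shift <= 3:
        raise ValueError("shift amount must be 0..3")
    select = [1 if s == shift else 0 for s in range(4)]   # one-hot S_0..S_3
    r = 0
    for i in range(4):
        for s in range(4):
            if select[s]:                                 # pass transistor is ON
                r |= ((d >> (i + s)) & 1) << i
    return r


def logical_shift_right(data, shift):
    return barrel_shift(data & 0xF, shift)                # D<6:4> = 000


def arithmetic_shift_right(data, shift):
    upper = 0b111 if (data >> 3) & 1 else 0b000           # D<6:4> = sign bits
    return barrel_shift((upper << 4) | (data & 0xF), shift)


def rotate_right(data, shift):
    upper = data & 0b111                                  # D<6:4> = D<2:0>
    return barrel_shift((upper << 4) | (data & 0xF), shift)


print(bin(logical_shift_right(0b1010, 1)),
      bin(arithmetic_shift_right(0b1010, 1)),
      bin(rotate_right(0b1001, 1)))
```

The same array therefore serves several operations, and only the values driven on the upper input lines change.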
The barrel shifter is not a critical unit for the delay. Low-power operation is achieved by using a static implementation. This shifter can be implemented with transmission gates, in which case the feedback PMOS devices are not required. However, for low power, the use of an NMOS array is more efficient. The feedback PMOS should be of minimum size.
7.3.5 Register File
A register file is a set of registers which store data. It consists of a small array of static memory cells. Register files are used by microprocessors and DSPs and they permit multiple read and write ports [14, 15, 16, 17]. A typical array is 32 registers of 32 bits. For example, an ALU needs two pieces of data from the register file. The array then has a dual-read single-write architecture.
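At the architectural level, a dual-read single-write (2R-1W) register file behaves as in the following sketch. It is a purely functional model with hypothetical class and method names; it says nothing about the memory cell circuit, timing or sensing.

```python
class RegisterFile2R1W:
    """Behavioral model of a register file with two read ports and one write port."""

    def __init__(self, num_registers=32, width=32):
        self._mask = (1 << width) - 1
        self._cells = [0] * num_registers

    def write(self, address, data, write_enable=True):
        """Write port: the addressed register is updated only when enabled."""
        if write_enable:
            self._cells[address] = data & self._mask

    def read(self, address_1, address_2):
        """Two independent read ports, e.g. to feed both ALU operands."""
        return self._cells[address_1], self._cells[address_2]


registers = RegisterFile2R1W()
registers.write(3, 0x1234)
registers.write(7, 0xFFFFFFFFF)          # truncated to the 32-bit width
registers.write(3, 0, write_enable=False)
print(registers.read(3, 7))
```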
Fig. 7.39 shows the schematic of the single-ended memory cell with 2 read ports and 1 write port (2R-1W). The read ports are the read bit-lines BL_R1 and BL_R2. The memory cell, composed of two cross-coupled inverters I_1 and I_2, is addressed by two read word-line signals, WL_R1 and WL_R2. The NMOS transistor N_1 is controlled by the Write Enable (WE) signal. N_1 is connected serially to the write access transistor N_2. The transistor N_2 is controlled by the write word-line (WL_W) signal. The transistor N_1 isolates the stored data from the write bit-line (BL_W). To write the data to the storage node A from the write bit-line, the inverters I_1 and I_2 should be sized carefully. The β ratio of the inverter I_1 (β is defined in Chapter 4) should be larger than 1 (e.g., 5) to set the threshold voltage of I_1 to a low level. This is due to the fact that N_1 and N_2 transfer a weak high level (only V_DD - V_Tn). Moreover, to ensure a correct write operation, the feedback inverter I_2 should be weak so that the access transistors N_1 and N_2 can change the state of node A. For example, the NMOS and PMOS of I_2 should be of minimum size, except that the length of the NMOS is twice the minimum. Also, the access transistors should have a higher β compared to the transistors of I_2. For a given technology, the sizes should be determined by circuit simulation for a correct write operation. The inverter I_3 is a buffer for the storage node.

Figure 7.39 (2R-1W) register file cell.
A pair of three-port memory cells is shown in Fig. 7.40. This structure has a shared access transistor, N_a, and a shared write bit-line, BL_W. To read and write the memory cell, the simplified schematic of Fig. 7.41 is used. This schematic uses the column multiplexing scheme. For low power, the register file uses a static design and avoids the use of the conventional sense amplifier for bit-line sensing, since the sense amplifier consumes DC power. For a three-port register file, two read row decoders and one write row decoder are required. Also, Write Enable (WE) and column addresses are needed to produce the column write enable for writing the data to the specified storage node. For fast operation, AND gates with a maximum of 5 inputs can be used.
During the read operation, if for example N_a is asserted, then the data is put on the bit-line, BL_R1. The bit-line is selected through the pass-transistor N_1. The data is then sensed by the inverter I_1 in Fig. 7.41. During this period, the read enable signal, RE, is asserted, N_2 is OFF and only the feedback PMOS P_f is activated when a one (V_DD - V_Tn) is on the data-line. In this situation, the feedback PMOS charges up the data-line to V_DD. Also, the DC current, which can be generated due to the reduced high level on the data-line, is completely eliminated. The β ratio of the inverter I_1 should be higher than one (e.g., 5) to achieve a symmetrical read access time for a zero and a one. When RE = 0, the data-lines are isolated from the bit-lines and the NMOS transistor N_2 is ON. Therefore, the latch formed by the pair of inverters I_1 and I_2 holds the old data. The operation of such a register file is fully static and does not dissipate any static power in any mode of operation. Furthermore, the read and write operations are asynchronous. This type of register file is suitable for low-power applications.

Figure 7.40 A pair of three-port memory cells (2R-1W).
7.4 REGULAR STRUCTURES
In this section we examine the design of large regular structures such as Programmable Logic Arrays (PLAs), Read Only Memories (ROMs) and Content Addressable Memories (CAMs). The ROMs and PLAs are not only used to implement controllers in a regular manner but can also be applied to signal processing. RAMs are treated separately in Chapter 6. These large structures are usually dynamic circuits for fast operation. These dynamic circuits can be shut down with a power management unit for power savings. If, for example, the clock is turned OFF, all dynamic circuits go into a precharge mode with all PMOS precharge devices ON.
7.4.1 Programmable Logic Array

Logic functions such as those used in the control units of VLSI processors, or in finite-state machines, are hard to implement in random logic. One way of implementing these functions in a regular structure is the use of a Programmable Logic Array (PLA) [18, 19].
PLAs have a regular architecture divided mainly in two planes as shown in Fig. 7.42. These planes perform specific functions such as OR and AND. CMOS PLAs can be implemented in both static and dynamic styles. The style is chosen depending on the timing strategy in the chip. Other factors such as speed, power dissipation, and the allowed area play an important role in the choice of the PLA design style. A CMOS PLA example, using a pseudo-NMOS like style, is shown in Fig. 7.43. The output OR functions are realized with NOR gates. From Fig. 7.43(a), we have
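    P_1 = \overline{A + B + C} = \bar{A} \cdot \bar{B} \cdot \bar{C}    (7.33)

    P_2 = \overline{A + C} = \bar{A} \cdot \bar{C}    (7.34)

    P_3 = \overline{B + C} = \bar{B} \cdot \bar{C}    (7.35)

    P_4 = \overline{A + \bar{C}} = \bar{A} \cdot C    (7.36)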
The buffers are used when the load on the bit-line is large. They consist in general of two inverter stages. The OR plane is in principle similar to the AND plane [Fig. 7.43(b)]. From Fig. 7.43(b), we have
    X = P_1 + P_2 + P_3    (7.37)

    Y = P_1 + P_4    (7.38)
For this pseudo-NMOS PLA, the NOR-NOR logic gate style is used. This example shows that the PLA organization is useful for implementing Sum Of Products (SOP) functions. Hence any SOP function can be realized by programming the array with the AND and OR cells. Any type of latch or register can be used at the input and output. This design style of PLAs has a reduced area and is simple to implement. However, it is not suitable for low-power applications due to the high DC power dissipation, particularly when the PLA is large. Moreover, it has a speed problem.

Figure 7.42 AND-OR PLA architecture.
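The NOR-NOR organization can be illustrated with a small logic-level model. In the sketch below every product line is the NOR of the complements of its literals (De Morgan), and every output is a NOR of product lines followed by an inverting buffer. The two functions programmed into the array, X and Y, are hypothetical examples chosen for this sketch and are not those of Fig. 7.43.

```python
from itertools import product


def nor(*signals):
    """NOR gate: the line stays high only if no pull-down device conducts."""
    return 0 if any(signals) else 1


def nor_nor_pla(inputs, and_plane, or_plane):
    """Evaluate a NOR-NOR PLA.

    inputs:    dict mapping input name to 0/1
    and_plane: list of product terms; each term is a list of
               (name, complemented) literals to be ANDed together
    or_plane:  list of outputs; each output lists the product indices it sums
    """
    products = []
    for term in and_plane:
        # To AND the literals with a NOR gate, drive it with their complements.
        lines = [inputs[name] if complemented else 1 - inputs[name]
                 for name, complemented in term]
        products.append(nor(*lines))
    # OR plane: NOR of the selected product lines, then an inverting buffer.
    return [1 - nor(*(products[k] for k in term_ids)) for term_ids in or_plane]


# Hypothetical personality: X = A.B + NOT(A).C, Y = A.B + B.NOT(C)
AND_PLANE = [
    [("A", False), ("B", False)],
    [("A", True), ("C", False)],
    [("B", False), ("C", True)],
]
OR_PLANE = [[0, 1], [0, 2]]

for a, b, c in product((0, 1), repeat=3):
    x, y = nor_nor_pla({"A": a, "B": b, "C": c}, AND_PLANE, OR_PLANE)
    assert x == ((a & b) | ((1 - a) & c))
    assert y == ((a & b) | (b & (1 - c)))

print(nor_nor_pla({"A": 1, "B": 1, "C": 0}, AND_PLANE, OR_PLANE))
```

Changing the personality lists reprograms the array without touching the evaluation code, which mirrors how the AND and OR cells are placed in a regular layout.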
In dynamic CMOS style, the circuit shown in Fig. 7.44 can be used. It is a selftimed PLA, where the AND and OR planes are both realised =sing precharged NOR configuration. In this structure, o d a ~ &gle clock phase is needed. When the dock, elk, is high the bit-lines are preeharged in both planes. The NMOS transistors NA and No are OBF, guaranteeing that there is no p.th to ground. Tracking liner in both planes are used to generate a delayed clock to the OR plane. When the clod is law, the prechargt PMOS transistors, in the AND plane, turn OFF, N A tarns ON and the produets a ~ l e evdnsted. The tiaching lines ensure that No tuns ON only when the inputs to the OR planer are stable. Othetwise the outputs can be spmiously discharged. This PLA is fast, bnt it har a lot of wasted dynamio power. The wmted power har r e v a d sources such ar:
Figure 7.43 Pseudo-NMOS CMOS PLA: (a) AND plane; (b) OR plane.

Figure 7.44 Self-timed dynamic PLA using NOR-NOR style.
• The virtual ground lines are charged and discharged every cycle. The total capacitance of the virtual ground is important, particularly for large PLAs, because for the purpose of layout compactness the ground lines are in diffusion. This capacitance can be reduced by using a metal level in a multi-metal technology;
• The number of inverters forming the buffers is important, since several of them switch during the evaluation; and
• The switching activity of the dynamic NOR implementation is high [see Chapter 4].
Consider now the PLA shown in Fig. 7.45 with AND-NOR structure. The OR plane is the same as in the PLA of Fig. 7.44. However, the AND plane is considerably simplified because:

• The virtual ground lines disappear; and
• The number of inverters for buffering is reduced by half.

Figure 7.45 Self-timed dynamic PLA using AND-NOR style.

The switching activity of the NAND implementation is also lower than that of the NOR implementation, resulting in lower power in the AND plane. One problem associated with this structure is that the use of NAND may result in a large discharge time.

Another dynamic PLA combines the pseudo-NMOS and dynamic logic design styles [19]. Fig. 7.46 shows an example of such a structure. The AND plane uses a predischarged pseudo-NMOS NOR style, while the OR plane uses a conventional dynamic precharged style. During the precharge phase, the clock signal is high and the bit-lines in the AND plane are predischarged to ground. In the OR plane, the bit-lines are precharged to VDD. The inputs to the OR plane are low. During the evaluation phase (clk = 0), the PMOS loads in the AND plane are ON, and the plane behaves as pseudo-NMOS logic. In this case, the PMOS device should be sized correctly to ensure safe operation when the output stays at a low level. The product terms are evaluated and then the outputs. During this evaluation phase, the PLA dissipates static power, mainly in the AND plane. The total power is therefore increased by this DC component.
This PLA does not need the self-timing technique used previously. It was also shown that this PLA has a fast operation [19]. When implementing smaller controllers, it is sometimes more attractive to use random logic. The implementation consists of two or more levels of logic gates using a standard cell library. It is much less regular than a PLA structure and it can have lower power dissipation.
7.4.2 Read Only Memory
Read Only Memory (ROM) is used in many applications. In DSPs, for example, it can be used as a lookup table to store coefficients. It is also often used in VLSI processors as a microcode controller. In this case, the ROM contains the microprogram instructions. A typical micro-ROM size is 2k words of 64 bits. The read-out cycle of the ROM limits the speed of the processor. Conceptually, the structure of a ROM is quite similar to that of a PLA. Fig. 7.47 shows a simple ROM circuit architecture using NOR logic design. The state of the memory array is retained even if the ROM is not powered. The
Figure 7.48 Layout of a ROM memory cell.
The ROM can be implemented in both styles: static and dynamic. In static style, pseudo-NMOS logic, similar to that of the static PLA, can be used. Fig. 7.49 shows an example of a small ROM using the pseudo-NMOS circuit style. The conditioning circuits use PMOS devices, with their gates grounded, and the sense amplifier circuit is simply an inverter. The column decoder is also shown. The column decoder selects one of the two bit-lines. Node A is initially at VDD. If the selected bit-line is discharged, then node A is discharged and the output is pulled up to VDD. The pseudo-NMOS ROM is easy to design and does not need careful design; however, the power dissipation may be significant due to the DC current. For a relatively large ROM, like the one used in microcontrollers, the power dissipation can be significantly reduced using the low-power techniques of SRAMs*. They include pulse mode operation using address transition detection, reduced swing sensing, etc.

*These techniques are discussed in more detail in Chapter 6.
Figure 7.49 Pseudo-NMOS ROM circuitry.
A dynamic version of the ROM is shown in Fig. 7.50. During the precharge phase, clk = 1 and the bit-lines are precharged to VDD - VT, where VT is subject to the body effect. Node A is also precharged by the PMOS transistor Pp. The select lines Sel1 and Sel2 are controlled by a column decoder. All the word-lines are predischarged to ground. During evaluation, clk = 0 and if the bit-line is discharged to ground, node A is also discharged. Then the output of the inverter I is pulled up. If node A is not discharged, the feedback PMOS transistor Pf maintains the high level at VDD. Since the swing on the high-load bit-line is reduced, the power dissipation on this line is reduced by a factor VDD/(VDD - VT).
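The reduction factor follows from the energy drawn from the supply when a capacitance is charged through a reduced swing. The following Python fragment is a minimal sketch under stated assumptions: the bit-line is taken to be discharged and recharged once per cycle, and the numerical values (capacitance, supply, threshold, frequency) are hypothetical, not taken from the text.

```python
def bitline_power(c_bitline, vdd, swing, freq):
    """Supply power for a line of capacitance c_bitline that is charged
    through `swing` volts from a supply `vdd` once per cycle: C * VDD * dV * f."""
    return c_bitline * vdd * swing * freq


if __name__ == "__main__":
    # Assumed example values, for illustration only.
    VDD, VT = 3.3, 0.9          # volts
    C_BL, F = 2e-12, 50e6       # farads, hertz
    full = bitline_power(C_BL, VDD, VDD, F)          # full-swing precharge
    reduced = bitline_power(C_BL, VDD, VDD - VT, F)  # precharge to VDD - VT
    print("full swing   :", full, "W")
    print("reduced swing:", reduced, "W")
    print("ratio        :", full / reduced, "= VDD / (VDD - VT)")
```

The ratio of the two results equals VDD/(VDD - VT) regardless of the capacitance and frequency chosen.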
7.4.3 Content Addressable Memory

A Content Addressable Memory (CAM) is an important macrocell of the Translation Lookaside Buffer (TLB) [20] and cache memory [21] circuits of computer systems. The TLB permits the translation of the virtual address of a CPU to the physical address, and the cache memory maps the physical address to the memory data.
Figure 7.50 Dynamic ROM circuitry.
A CAM stores tags which can be compared against an input address word (A_0 ... A_{m-1}) as shown in Fig. 7.51(a). A match detection signal is sent by the CAM if the values stored in the CAM array match the input address word. A CMOS implementation of the CAM cell is illustrated in Fig. 7.51(b). It can be read and written just as an ordinary memory cell. The read/write and decoder circuits are similar to those of a RAM. A tag word is formed by identical cells which are repeated in a horizontal array. The write lines are used to write data in the array. The comparison process is described as follows. During the precharge phase, the bit-lines are predischarged low. All the write lines are low. The Match Line (ML) is precharged high. During the evaluation phase, suppose that a "1" is stored at node A. Assume that the CBL line is held high and the $\overline{CBL}$ line is held low. In this case, the transistors N3 and N1 are OFF, hence the ML line remains high, indicating a match at this bit location. Assume now that CBL is driven low and $\overline{CBL}$ high. The transistor N3 is OFF, but N1 and N2 are ON. Then the ML line is discharged, indicating a mismatch at this bit location.
Figure 7.51 (a) CAM array; (b) CMOS CAM cell.

For an array of n tags, there are n match lines (ML(0) ... ML(n)). Each match line is common to m cells. If there is a mismatch in any bit of the tag word, the match line is discharged. If all the m bits match, the common match line remains high. To detect the match signal in any of the match lines a dynamic NOR circuit is used, as shown in Fig. 7.52. When the clock is low the NOR gate is precharged along with the match lines. The inputs to the NOR gate are predischarged to ground. When the clk signal is high (evaluation phase), one of the match lines, ML(i), stays high and the others are discharged to ground. When the match lines are stable, the eval signal is asserted with clk using self-timing (similar to the PLA case). This keeps the dynamic NOR gate from falsely discharging. The inputs to the NOR gate must not go high until the data is stable. If one of the match lines stays high, then the NOR gate is discharged and the output match signal goes high.
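The match-line behavior described above can be summarized in a short behavioral model. The Python fragment below is only a sketch: the function names and the example tag contents are hypothetical, and the precharge/evaluate timing and self-timed eval signal are not modeled.

```python
def match_lines(tags, address):
    """Return one match-line value per stored tag.

    A match line is precharged high (1) and is discharged (0) as soon as
    any bit of the stored tag differs from the applied address bit.
    """
    lines = []
    for tag in tags:
        ml = 1  # precharged high
        for stored, applied in zip(tag, address):
            if stored != applied:
                ml = 0  # a pull-down path conducts: mismatch
                break
        lines.append(ml)
    return lines


def match_signal(lines):
    """Dynamic NOR plus output stage: high when any match line stays high."""
    return int(any(lines))


if __name__ == "__main__":
    tags = [(1, 0, 1, 1), (0, 0, 1, 0), (1, 1, 1, 1)]  # assumed contents
    lines = match_lines(tags, (0, 0, 1, 0))
    print(lines, match_signal(lines))
```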
7.5 PHASE LOCKED LOOPS

Phase Locked Loops (PLLs) have many applications in digital and analog systems. In digital systems, on-chip PLLs are needed for the following reasons:

• To reduce clock skew due to clock distribution. As systems continue to demand higher clock frequencies, clock skew associated with input buffers and clock distribution becomes a significant design problem, as shown in Fig. 7.53(a). The internal clock drives the output register, which in turn delivers the data to the output pad (with a buffer). The skew between the external and internal clocks is due to the clock tree. The output data is significantly delayed compared to the external clock. One main contribution is the clock skew. In Fig. 7.53(b), the internal clock is deskewed via the use of a PLL. The PLL should reduce this skew over a wide range of process, temperature and voltage variations;
• To synchronize data between chips as shown in Fig. 7.54. The PLL solves the problem of clock skew from chip to chip. An example of such an application is discussed in [22]; and
• To generate internal clocks with higher frequencies than the external clock (system clock).
There are other applications of PLLs, such as clock recovery in serial data communications, which are not discussed in this section. Several theoretical references on PLLs can be found in [23, 24, 25]. This section provides an introduction to the PLL. The CMOS circuit design of the PLL, for low-power applications, is then discussed.
7.5.1 Charge-Pumped PLL
One interesting configuration of the PLL is the charge-pumped loop shown in Fig. 7.55. It is a PLL-based frequency multiplier which consists of a Phase Frequency Detector (PFD), a Charge Pump (CP), a Loop Filter (LF), a Voltage Controlled Oscillator (VCO), and a programmable frequency divider. The feedback of the internal clock is compared to the external clock for phase and frequency error. The outputs of the phase/frequency detector are two digital signals called U (for Up) and D (for Down). The charge pump and loop filter convert these digital signals into an analog signal (control) suitable for the VCO. The VCO generates an oscillation frequency that is a function of the control signal level. If the PLL generates multiples of the external clock frequency, then a frequency divider is inserted between the generated clock and the phase detector.
A simplified diagram of the charge pump and loop filter is shown in Fig. 7.56. It consists of two switchable current sources driving an impedance (LF). The pulses generated by the PFD block are used to switch the charge pump, to charge or discharge the impedance. The loop filter filters these pulses and produces an analog output signal to control the VCO.
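The action of the switched current sources on the control node can be illustrated with a simple charge-balance calculation. The following Python fragment is a sketch under a deliberately strong assumption: the loop filter is reduced to a single capacitor, so each pulse of width t changes the control voltage by I·t/C. The function name and all numerical values are hypothetical.

```python
def control_voltage(v_start, pulses, i_pump, c_filter):
    """Control voltage after a sequence of PFD pulses.

    pulses: list of (signal, width_in_seconds), where signal is "U" or "D".
    Each pulse moves the voltage by i_pump * width / c_filter, upward for
    "U" and downward for "D" (loop filter modeled as one ideal capacitor).
    """
    v = v_start
    for signal, width in pulses:
        dv = i_pump * width / c_filter
        v += dv if signal == "U" else -dv
    return v


if __name__ == "__main__":
    # Assumed values: 10 uA pump current, 10 pF filter capacitor, 1 ns pulses.
    pulses = [("U", 1e-9), ("U", 1e-9), ("D", 1e-9)]
    print(control_voltage(1.0, pulses, 10e-6, 10e-12))
```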
*The charge pump of the PLL should not be confused with the one used to generate different voltages.
Figure 7.53 PLL clock generation for deskewing: (a) a chip without PLL; (b) a chip with PLL.

Figure 7.55 Block diagram of the PLL.
7.5.2 PLL Circuit Design
This section presents the design of the PLL components. Fig. 7.57 shows the logic diagram of the PFD circuit. It uses mainly static-CMOS NAND gates, which results in good performance and low power dissipation. The operation of this circuit, using the state diagram of Fig. 7.57(c), is as follows. The circuit has three states: 1) UP, where the up signal U is asserted when the external clock clk_ext falls, 2) DOWN, where the down signal D is asserted when the internal clock clk falls, and 3) NOP, where the detector does not change the output control signals. In this last state both U and D signals are at zero level. The state changes whenever clk or clk_ext falls. In no case are U and D both activated.

Consider that clk and clk_ext have the same frequency but the falling edges of clk_ext (clk) lead the falling edges of clk (clk_ext), respectively. Then, U (D) is asserted with a certain duty cycle, while D (U) is never asserted. In this case, the PFD is characterized as a phase detector. Consider now the case where clk_ext has a higher frequency than clk. U is asserted most of the time, because there are more falling edges of the clk_ext signal than of clk. A similar situation occurs when clk has a higher frequency than clk_ext, and D is asserted most of the time. In this case, the PFD is characterized as a frequency detector.

The U and D signals, generated by the PFD, are connected to the charge pump circuit of Fig. 7.58(a). When the signal U (D) is asserted, the pull-up PMOS (pull-down NMOS) transistor charges (discharges) the output, respectively. Another variation of the charge pump circuit is shown in Fig. 7.58(b). Two transistors are added as current sources biased by a current mirror circuit. In this situation, the output current of the charge pump can be adjusted through the control of the current mirror.
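The phase-detector and frequency-detector behavior of the PFD can be checked with a small event-driven model. The Python sketch below assumes the usual three-state machine, which may differ in detail from the state diagram of Fig. 7.57(c): a falling edge of the external clock moves the state one step toward UP, a falling edge of the internal clock moves it one step toward DOWN, and the state saturates at either end. The function names and clock parameters are hypothetical.

```python
def falling_edges(period, phase, t_end):
    """Falling-edge times of a clock with the given period and phase offset."""
    edges = []
    t = phase
    while t < t_end:
        edges.append(t)
        t += period
    return edges


def pfd_duty(ext_period, int_period, int_phase, t_end):
    """Return the fractions of time for which U and D are asserted."""
    events = [(t, +1) for t in falling_edges(ext_period, 0.0, t_end)]
    events += [(t, -1) for t in falling_edges(int_period, int_phase, t_end)]
    events.sort()
    state, last_t = 0, 0.0            # +1 = UP, 0 = NOP, -1 = DOWN
    time_in = {+1: 0.0, 0: 0.0, -1: 0.0}
    for t, step in events:
        time_in[state] += t - last_t
        state = max(-1, min(+1, state + step))   # saturating three-state FSM
        last_t = t
    time_in[state] += t_end - last_t
    return time_in[+1] / t_end, time_in[-1] / t_end


if __name__ == "__main__":
    print("same frequency, clk_ext leads:", pfd_duty(10.0, 10.0, 2.0, 100.0))
    print("clk_ext faster               :", pfd_duty(5.0, 10.0, 1.0, 100.0))
    print("clk faster                   :", pfd_duty(10.0, 5.0, 1.0, 100.0))
```

With equal frequencies the duty cycle of U is proportional to the phase lead, and D stays at zero; with unequal frequencies the signal associated with the faster clock is asserted for more than half of the time.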
The monolithic implementation of the filter of Fig. 7.56 is shown in Fig. 7.59. The two capacitors C1 and C2 are on the order of tens of pF and are made with the NMOS transistors N_C1 and N_C2. The resistor is made with a transmission gate in the closed state. It can also be implemented with an N-well implant available in the CMOS process. The capacitor C2 is added in parallel to the simple RC (R-C1) low-pass filter to form a second-order filter. In this case, the stability of the system is maintained even with the process variation of these on-chip components. Note that these capacitors can occupy a large portion of the PLL.
The charge pump and filter generate a control voltage for the VCO. One important parameter of the VCO is the VCO gain. When considering the frequency versus control voltage characteristic, the VCO gain is the slope of this characteristic. A linear characteristic is, in general, desirable. In general the VCO is implemented using a ring oscillator as shown in Fig. 7.60. A series connection of delay inverter cells forms a tapped delay line which oscillates with a frequency determined by the delay time of the cell and the odd number of stages. The delay of the cell is controlled by a current which in turn is controlled by the control voltage Vc. Vc modulates the ON resistance of the pull-down N1 and, through the current mirror, the pull-up P1. All the devices of the VCO should be oriented in the same direction and have redundant contacts to reduce the jitter due to process variations. In the VCO of Fig. 7.60, maximum frequency is achieved at maximum control voltage. Typical values of the VCO gain at low power supply voltage range from 10 MHz/V to 100 MHz/V depending on the number of stages and the technology.

Note that the bandwidth of the VCO presented previously is limited. The VCO of Fig. 7.61 has an excellent bandwidth characteristic, where a wide range of frequency can be generated [26]. It is used for video signal processors and covers a wide range of applications. The frequency range can change by one order of magnitude, from 50 MHz to 350 MHz. In Fig. 7.61, by turning ON and OFF 8 CMOS TGs with control signals, the number of ring oscillator stages can be selected among eight values (7, 9, 11, 15, 21, 29, 39, 51). Each stage of the ring oscillator combines an inverter in parallel with a current-controlled inverter. The inverter increases the frequency of oscillation of the VCO, whereas the current-controlled inverter permits tuning of the frequency of the VCO.

The generated clock frequency can be N times the external clock frequency (reference frequency). This clock then feeds the clock driver and tree. Since the PLL discussed here is intended to be integrated on-chip, it is sensitive to the noise generated on the power lines (called power-supply-induced clock jitter). If the power supply changes by 100 mV, the skew or phase error will be important before the PLL has time (tens of clock cycles) to correct this error [27]. One way to reduce the effect of this problem is to dedicate an analog power supply pin to the VCO and the charge pump. At the circuit level, a new VCO delay cell was proposed by Young [27] to reduce the phase error.

Another VCO alternative is shown in Fig. 7.62. It is similar to the Voltage-Controlled Delay Line (VCDL) [28]. The control voltage, Vc, is used to vary the amount of the effective load seen by each inverter output. The frequency versus control voltage characteristic of this VCO has a negative slope. Then the minimum frequency of oscillation is limited by the maximum VDD. Therefore, the minimum frequency is increased with reduced VDD. A positive slope is, in general, desirable so that the minimum frequency is not set by VDD.

The frequency divider can be implemented using toggle flip-flops. Fig. 7.63 shows an example of a divider with division ratios of 1, 1/2, 1/4, and 1/8. The PLL discussed so far is not completely digital. Only the PFD, the charge pump and the frequency divider are digital, while the LF and VCO are analog and operate as continuous-time systems.

Figure 7.60 VCO using current-controlled CMOS ring oscillator.

Figure 7.61 VCO with variable characteristics.
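The dependence of the ring-oscillator frequency on the stage delay and the number of stages, and the effect of the toggle flip-flop divider, can be sketched numerically. To first order a ring of N inverting stages with stage delay t_d oscillates at f = 1/(2·N·t_d), because a transition must travel around the ring twice per period. The Python fragment below uses this relation; the stage delay is an assumed value, and the stage counts 7 and 51 are used only as example ring lengths.

```python
def ring_frequency(n_stages, stage_delay):
    """First-order oscillation frequency of a ring of inverting stages."""
    if n_stages % 2 == 0:
        raise ValueError("a single-ended ring needs an odd number of stages")
    return 1.0 / (2 * n_stages * stage_delay)


def divided_frequency(f_in, n_toggle_ffs):
    """Frequency after a chain of toggle flip-flops (each divides by two)."""
    return f_in / (2 ** n_toggle_ffs)


if __name__ == "__main__":
    T_D = 0.2e-9  # assumed stage delay in seconds
    for n in (7, 51):
        print(n, "stages:", ring_frequency(n, T_D) / 1e6, "MHz")
    for k in range(4):  # division ratios 1, 1/2, 1/4, 1/8
        print("ratio 1/%d:" % 2 ** k, divided_frequency(200e6, k) / 1e6, "MHz")
```

For a fixed stage delay the frequency scales inversely with the number of stages, so switching between a short and a long ring changes the frequency range by the ratio of the stage counts.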
7.5.3 Low-Power Design
In sleep mode, the on-chip PLL may be controlled for low-frequency operation, or it may be disabled to reduce its power dissipation to the leakage currents.
Figure 7.64 A VCO controlled by an enable signal for low-power mode.
As an example, one way to disable the PLL is to shut down the VCO and disable the external clock. Fig. 7.64 shows the same VCO as Fig. 7.62 but with one inverter transformed into a two-input NAND gate. One of the inputs is controlled by the Enable signal to shut down the PLL when it is low. The NAND gate can be used for any of the VCOs presented previously. The enable signal can also be used to disable any current source used in the PLL to eliminate any DC current. A typical power dissipation of a PLL, at 3.3 V, is in the range of tens of mW depending on the frequency.
7.6 CHAPTER SUMMARY
This chapter has presented the design of several subsystems used in VLSI chips. Many circuit alternatives are discussed which trade area, speed and power. The reader can construct these options and compare their performance in terms of power, delay and area. The power dissipation issue is stressed more. Several building blocks of VLSI chips using advanced circuit techniques have also been investigated. These include:

• High-speed addition.
• Multiplication techniques.
• PLL and clock deskewing technique.
REFERENCES

[1] J. Mori, et al., "A 10-ns 54 x 54-b Parallel Structured Full Array Multiplier with 0.5-µm CMOS Technology," IEEE Journal of Solid-State Circuits, vol. 26, no. 4, pp. 600-606, April 1991.

[2] J. Sklansky, "An Evaluation of Several Two-Summand Binary Adders," IRE Transactions on Electronic Computers, vol. EC-9, pp. 213-226, June 1960.

[3] J. Sklansky, "Conditional-Sum Addition Logic," IRE Transactions on Electronic Computers, vol. EC-9, pp. 226-231, June 1960.

[4] I. S. Abu-Khater, R. H. Yan, A. Bellaouar, and M. I. Elmasry, "A 1-V Low-Power High-Performance 32-b Conditional Sum Adder," IEEE Symposium on Low-Power Electronics, Tech. Dig., San Diego, pp. 66-67, October 1994.

[5] T. Sato, et al., "An 8.5-ns 112-b Transmission Gate Adder with a Conflict-Free Bypass Circuit," IEEE Journal of Solid-State Circuits, vol. 27, no. 4, pp. 657-659, April 1992.

[6] K. Ueda, H. Suzuki, K. Suda, Y. Tsujihashi, and H. Shinohara, "A 64-bit Adder by Pass Transistor BiCMOS Circuit," IEEE Custom Integrated Circuits Conference, Tech. Dig., pp. 12.2.1-12.2.4, May 1993.

[7] K. Hwang, "Computer Arithmetic: Principles, Architecture, and Design," John Wiley and Sons, 1979.

[8] J. J. F. Cavanagh, "Computer Science Series: Digital Computer Arithmetic," McGraw-Hill Book Co., 1984.

[9] M. Nagamatsu, S. Tanaka, J. Mori, T. Noguchi, and K. Hatanaka, "A 15-ns 32x32-bit CMOS Multiplier with an Improved Parallel Structure," IEEE Custom Integrated Circuits Conference, Tech. Dig., pp. 10.3.1-10.3.4, May 1989.

[10] N. Ohkubo, M. Suzuki, T. Shinbo, T. Yamanaka, A. Shimizu, K. Sasaki, and Y. Nakagome, "A 4.4-ns CMOS 54x54-b Multiplier using Pass-Transistor Multiplexer," IEEE Custom Integrated Circuits Conference, Tech. Dig., pp. 599-602, May 1994.

[11] R. Bechade, et al., "A 32b 66MHz Microprocessor," IEEE International Solid-State Circuits Conference, Tech. Dig., pp. 208-209, February 1994.

[12] C. A. Mead and L. A. Conway, "Introduction to VLSI Systems," Addison-Wesley, 1980.

[13] R. W. Sherburne, et al., "Data path Design for RISC," Proc. Conf. Advanced Research in VLSI, pp. 53-62, 1982.

[14] R. W. Sherburne, et al., "A 32-bit NMOS Microprocessor with a Large Register File," IEEE Journal of Solid-State Circuits, vol. SC-19, no. 5, pp. 682-689, October 1984.

[15] K. J. O'Connor, "The Twin-Port Memory Cell," IEEE Journal of Solid-State Circuits, vol. SC-22, no. 5, pp. 712-720, October 1987.

[16] R. D. Jolly, "A 9-ns, 1.4-Gigabyte/s, 17-Ported CMOS Register File," IEEE Journal of Solid-State Circuits, vol. 26, no. 10, pp. 1407-1412, October 1991.

[17] H. Shinohara, et al., "A Flexible Multiport RAM Compiler for Data Path," IEEE Journal of Solid-State Circuits, vol. 26, no. 3, pp. 343-349, March 1991.

[18] A. R. Linz, "A Low-Power PLA for a Signal Processor," IEEE Journal of Solid-State Circuits, vol. 26, no. 2, pp. 107-115, February 1991.

[19] G. M. Blair, "PLA Design for Single-Clock CMOS," IEEE Journal of Solid-State Circuits, vol. 27, no. 8, pp. 1211-1213, August 1992.

[20] H. Kadota, et al., "A 32-bit Microprocessor with On-Chip Cache and TLB," IEEE Journal of Solid-State Circuits, vol. SC-22, no. 5, pp. 800-807, October 1987.

[21] A. J. Smith, "Cache Memories," Computing Surveys, vol. 14, pp. 473-530, September 1982.

[22] L. Ashby, "ASIC Clock Distribution using a Phase Locked Loop (PLL)," in IEEE International ASIC Conference and Exhibit, Tech. Dig., pp. P1.6.1-P1.6.3, September 1991.

[23] F. M. Gardner, "Phase Lock Techniques," John Wiley and Sons, 1979.

[24] F. M. Gardner, "Charge-Pump Phase-Locked Loops," IEEE Transactions on Communications, COM-28(11), pp. 1849-1858, November 1980.

[25] R. E. Best, "Phase-Locked Loops," McGraw Hill, 1984.

[26] J. Goto, et al., "A Programmable Clock Generator with 50 to 350 MHz Lock Range for Video Signal Processors," IEEE Custom Integrated Circuits Conference, Tech. Dig., pp. 4.4.1-4.4.4, May 1993.

[27] I. A. Young, J. K. Greason, and K. L. Wong, "A PLL Clock Generator with 5 to 110 MHz of Lock Range for Microprocessors," IEEE Journal of Solid-State Circuits, vol. 27, no. 11, pp. 1599-1607, November 1992.

[28] M. G. Johnson and E. L. Hudson, "A Variable Delay Line PLL for CPU-Coprocessor Synchronization," IEEE Journal of Solid-State Circuits, vol. 23, no. 5, pp. 1218-1223, October 1988.
8
LOW-POWER VLSI DESIGN METHODOLOGY
This chapter presents Low-Power (LP) design methodologies at several abstraction levels, namely the physical, logical, architectural, and algorithmic levels. All the power reduction techniques discussed are related to the dynamic power dissipation. It is shown that LP techniques at the high level (algorithmic and architectural) of the design lead to power savings of several orders of magnitude. Many examples are included to give the reader a quantitative picture of LP issues. Several LP techniques, particularly at the circuit level, have already been discussed in Chapters 4, 6, and 7, including those related to static power considerations. They are not reconsidered in this chapter. The power estimation techniques at the circuit, logical, architectural and behavioral levels are overviewed. Power analysis at a high level allows early prediction and optimization of the power of a system. The LP concepts such as switching activity, glitching, etc., discussed in Chapter 4 are used throughout this chapter.
8.1 LP PHYSICAL DESIGN

There are several techniques to reduce the power at the physical design (layout) level. Some of these issues have been discussed in Chapter 4 for full-custom and semi-custom designs. In this section we present two approaches for low-power physical design.
8.1.1 Floorplanning
Floorplanning of a circuit is the first step in VLSI layout design. It permits the allocation of space on a chip for a given set of modules. A module can be rigid, e.g., the module is in the library and its dimensions and power dissipation are known, or flexible, e.g., it has not been designed and has a list of parameters such as different shapes and power consumptions for feasible implementations. A floorplanner for low-power design should choose a suitable implementation for each module such that the total power and area of the chip are optimized [1].
8.1.2 Placement and Routing

The placement and routing of a VLSI circuit is performed on standard cells, gate arrays, functional blocks, etc. All the different modules are already laid out and well characterized in the library. Traditionally, placement refers to the process of placing modules to minimize area and delay. Placement for low power uses the switching activity-capacitance products as the function to be minimized, in contrast with delay minimization, where the wire capacitance has to be minimized. After placement, routing permits connection of the modules with wires. High switching activity wires should be kept short and routed on the layer with the lower parasitic capacitance. A CAD tool for placement has already been developed [2].
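The difference between a wirelength-driven and a power-driven placement objective can be illustrated with a small cost function. The Python sketch below is only illustrative: it estimates the wire capacitance of each net from its half-perimeter bounding box, which is a common approximation and an assumption here, and all names, activities and coordinates are hypothetical.

```python
def power_cost(nets, positions, cap_per_unit=1.0):
    """Sum over nets of switching activity times estimated wire capacitance.

    nets: list of (activity, pin_names); positions: name -> (x, y).
    Wire capacitance is approximated by cap_per_unit times the
    half-perimeter of the bounding box of the net's pins.
    """
    total = 0.0
    for activity, pins in nets:
        xs = [positions[p][0] for p in pins]
        ys = [positions[p][1] for p in pins]
        half_perimeter = (max(xs) - min(xs)) + (max(ys) - min(ys))
        total += activity * cap_per_unit * half_perimeter
    return total


if __name__ == "__main__":
    # One high-activity net (a-b) and one low-activity net (b-c).
    nets = [(0.5, ("a", "b")), (0.05, ("b", "c"))]
    placement_1 = {"a": (0, 0), "b": (1, 0), "c": (5, 0)}  # busy net short
    placement_2 = {"a": (0, 0), "b": (4, 0), "c": (5, 0)}  # busy net long
    print(power_cost(nets, placement_1), power_cost(nets, placement_2))
```

Both placements have the same total wirelength, but the one that keeps the high-activity net short has the lower activity-capacitance cost.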
8.2 LP GATE-LEVEL DESIGN

The low-power design methodology should also be applied to logic design. To achieve this goal, power is traded for speed and area. In this section, we discuss a number of techniques to reduce the switching activity and internal capacitances during the technology-independent and technology-dependent phases of logic design.
8.2.1 Logic Minimization and Technology Mapping

The area and power optimization of logic structures (both combinational and sequential) has matured considerably. The power optimization task benefits from these techniques. The objective of logic minimization is to reduce the boolean function. For low-power design, the signal switching activity is minimized by restructuring a logic circuit during the technology-independent phase [3]. It is assumed that at the higher level of abstraction, decisions regarding the power supply voltage and the clock frequency have already been made. The power minimization is constrained by the delay; however, the area may increase. During this phase of logic minimization, the function to be minimized is

$\sum_i P_i (1 - P_i) C_i$ (8.1)

where $P_i$ is the probability of node i being a "1", $(1 - P_i)$ is the probability that node i is a "0", and $C_i$ is the capacitance of this node. For more information on this model see Section 8.5.2.1. To minimize the above expression, one has to first evaluate the current value of $P_i$ and then change it by making $P_i$ close to 0 or close to 1. Also, in [3], a zero-delay approximation is assumed. This implies that the glitching power is neglected.
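The effect of pushing a signal probability toward 0 or 1 can be seen numerically. The Python fragment below is a minimal sketch that assumes the cost is the sum over nodes of P_i (1 - P_i) C_i, built from the three quantities defined in the text; the node probabilities and capacitances are hypothetical.

```python
def switching_cost(nodes):
    """Sum of P * (1 - P) * C over nodes given as (p_one, capacitance)."""
    return sum(p * (1.0 - p) * c for p, c in nodes)


if __name__ == "__main__":
    # Assumed example: the same two capacitances with different probabilities.
    balanced = [(0.5, 1.0), (0.5, 2.0)]   # P = 0.5 maximizes P(1 - P)
    skewed = [(0.1, 1.0), (0.9, 2.0)]     # probabilities pushed toward 0 or 1
    print(switching_cost(balanced), switching_cost(skewed))
```

The product P(1 - P) peaks at P = 0.5, so restructuring that moves node probabilities away from 0.5, especially on high-capacitance nodes, lowers the cost.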
To minimize the switching activity, some techniques that can be used are:

• Use don't-cares to minimize the probability $P_i$ of a function. Indeed, the signal probability of a gate can be changed by altering the ON-set or the OFF-set, adding points from the don't-care set.
• Collapse nodes that are not on the critical path. The intermediate signal lines are implemented as a single node. The delay may increase; however, this does not affect the overall performance of the circuit.
Power dissipation can be improved by as much as 60%, at the expense of an 8% area increase [3] and with no delay degradation. A more typical power reduction would be in the range of 10-20% [4]. The technology mapping step for low power refers to the process of transforming a logic function into a technology-dependent (e.g., CMOS) circuit with minimized power consumption. This technology-dependent step uses a target technology. The first step in technology mapping is to decompose each logic function into two-input gates. The objective of this decomposition is to minimize the total power dissipation by reducing the total switching activity. Fig. 8.1 shows an example of a four-input AND gate decomposed into two different implementations. The probabilities of the inputs being at logical "1" are also shown in Fig. 8.1. Primary inputs are assumed to be uncorrelated. The switching activity at each internal node is also shown in Fig. 8.1. The switching activity of a two-input (i, j) AND gate is given by
$a = (1 - P_i P_j) P_i P_j$ (8.2)
We also assume that the gate delays are zero, to ignore the power due to the glitching phenomenon. The total switching activities for implementations 1 and 2 are 0.888 and 1.056, respectively. Therefore, implementation 1 is better than implementation 2. This problem of decomposition was addressed by [5, 6]. In [5], the power dissipation associated with glitching is neglected, while in [6] it is not. Taking into account the power dissipation of glitches is very important, as discussed in Section 8.2.2. Technology mapping is an important step of logic optimization for standard cell and gate array (or sea of gates) circuit design. All the cells in the library are characterized in terms of area and speed. Another parameter to be added for low-power design is the characterization of the internal power of the gate and its output parasitic capacitance. Hence the process of technology mapping is to search, using a target library, for the best possible implementation under constraints such as power, area and delay.
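The dependence of the total switching activity on the decomposition can be reproduced with the two-input AND activity expression a = (1 - P_i P_j) P_i P_j. The Python sketch below compares a chain and a balanced tree for a four-input AND. The input probabilities are hypothetical and are not those of Fig. 8.1, so the totals differ from the values quoted in the text; the function names are likewise illustrative, and zero gate delay and uncorrelated inputs are assumed.

```python
def and_node(p_i, p_j):
    """Return (output probability, switching activity) of a two-input AND."""
    p_out = p_i * p_j
    return p_out, (1.0 - p_out) * p_out


def chain_activity(p):
    """Total internal activity of the chain ((p0.p1).p2).p3."""
    total, acc = 0.0, p[0]
    for q in p[1:]:
        acc, activity = and_node(acc, q)
        total += activity
    return total


def tree_activity(p):
    """Total internal activity of the balanced tree (p0.p1).(p2.p3)."""
    p_ab, a_ab = and_node(p[0], p[1])
    p_cd, a_cd = and_node(p[2], p[3])
    _, a_out = and_node(p_ab, p_cd)
    return a_ab + a_cd + a_out


if __name__ == "__main__":
    probs = (0.1, 0.3, 0.7, 0.9)  # assumed input probabilities
    print("chain:", chain_activity(probs))
    print("tree :", tree_activity(probs))
```

For these assumed probabilities the chain that combines the low-probability inputs first keeps every internal node close to 0 and has the lower total activity; with other probabilities the ranking can change, which is why the decomposition is part of the optimization.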
In this section we do not consider the algorithms for technology mapping. The reader can consult references [5, 7]. We illustrate the concept of technology mapping with the following example. Fig. 8.2 shows a logic circuit mapped into two implementations. The first implementation [Fig. 8.2(a)] is a minimal-area design using an OAI (OR-AND-INVERT) gate. The second implementation [Fig. 8.2(b)] is a minimal-power design where the high switching node N of Fig. 8.2(a) is hidden using a more complex gate.
Thus the process of technology mapping is to first decompose the logic function such that the total switching activity is minimized, and then to hide any high switching activity node within complex gates so that the capacitance of that node is minimized. However, making a gate too complex can trade delay for low power. The typical reduction in power dissipation is on the order of 20% without any degradation in performance but at the expense of a small area penalty. The quality of the targeted cell library can considerably impact the results of mapping [8]. For example, the availability of cells with different drive strengths and double-rail outputs (signal and its complement) gives more flexibility for logic optimization. A good library can result in a 20-50% reduction of power dissipation.
8.2.2 Spurious Transitions Reduction

Due to the finite delays of logic gates, signal races in static logic designs can result in dynamic hazards. Hence, a node can have several transitions in one clock cycle before stabilizing to the correct logic level. These unnecessary switching transitions (glitches) can account for on the order of 20-40% of the power dissipation [9, 10, 11].
To reduce this power, the first approach is to balance the path delays by changing the logic structure (e.g., tree) as explained in Section 4.5.5. Another technique is to balance the delay of the paths by sizing down the gates in the fast paths [12]. However, this approach can increase the delay of the circuit. Insertion of buffers (delay elements) in the fast paths can also balance the delay. However, the added buffers increase the power dissipation.
Another technique employs self-timing to reduce the logic depth and hence the glitching power [9, 11]. The self-timed circuit should save more power than it introduces. An example of a circuit that exhibits spurious transitions is an adder. The sum signals can have false transitions before they are stable. If the load capacitances on the outputs are relatively large, then the power due to the glitches can be important.
A conventional self-timed method for an adder is shown in Fig. 8.3. A Transition Detector (TD) similar to the one discussed for SRAMs in Chapter 6 is used. For each set of inputs ($A_i$ and $B_i$) there is one transition detector. If A and B are both n bits wide, then n TDs are required for the parallel adder. For any transition at the inputs, the TD generates a pulse for the self-timed function. This self-timed circuit delays the pulse by an amount equal to the critical path of the adder. The delayed pulse then feeds the clock of a D-Flip-Flop (DFF) or a gated circuit for the sum function. Consequently, the output sums are not switched until they are evaluated. The additional circuitry in the conventional approach can consume more power than it may save.

[Fig. 8.3: conventional self-timed parallel adder (self-timed function, parallel adder, gated function).]
Another approach based on self-timing to reduce the spurious transitions was proposed by [11]. Fig. 8.4 shows a parallel adder using simple self-timed circuitry. When input signals are written into the registers A and B, a single register bit is used to generate an "Input Valid" signal to the self-timed function. For an n-bit parallel adder, only a one-bit register is required, as shown in Fig. 8.4. The self-timed function is implemented using a series of inverters with dual-rail. Two enable signals, E and its complement $\bar{E}$, are generated by the self-timed circuit. They feed the gated sum XOR gates. Also the enable signal, E, controls the one-bit register to disable the input valid signal. This technique has resulted in 25% power reduction [11].
[Fig. 8.4: parallel adder with simple self-timed circuitry and gated output XOR gates.]
8.2.3 Precomputation-Based Power Reduction

Consider the original circuit of Fig. 8.5(a). R1 and R2 are two registers at the input and output of a combinational logic block. The idea of precomputing is to pre-evaluate the output values of the circuit one clock cycle before they are required, to disable a part of the input register R1, and then to reduce the internal switching activity in the succeeding clock cycle [13]. Fig. 8.5(b) illustrates a simplified architecture of the precomputing concept. This technique can be applied to several circuits such as Finite State Machines (FSMs), pipeline circuits, etc. To illustrate this technique, consider the example of an n-bit comparator that compares two n-bit numbers A and B and computes the function F that indicates that A > B. Fig. 8.6 shows the application of the precomputing technique to the comparator. If the most significant bits, $A_{n-1}$ and $B_{n-1}$, are different, then F can be computed from the 1-bit MSB comparator and the registers R2 and R3 are disabled. Therefore, the (n-1)-bit comparators are shut down. If the inputs have a uniform probability equal to 0.5, the enable signal has a probability of 0.5 to be at the logical level "1" or "0". Therefore, for a relatively large n the power saving can be quite significant even if we include the power due to the additional circuitry. This technique of precomputation can be synthesized for logic optimization. The selection of the subset of input signals for which the output is precomputed is critical for power savings. Otherwise, the additional circuitry can dissipate a relatively important power. Note that this added logic slightly increases the area of the circuit and may also increase the clock cycle. The precomputation technique can be applied to a multiple-output function. However, if the logic has a large number of outputs, then it may be worthwhile to selectively apply the precomputation technique to a small number of complex outputs. This selective partitioning will add a duplication of combinational logic and registers and this may offset the power savings.
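The comparator example can be sketched behaviorally as follows. This is only an illustrative model, not the circuit of Fig. 8.6: the function names and the Boolean `lsb_enabled` flag (standing in for the enable of the registers that hold the lower n-1 bits) are assumptions made for the sketch.

```python
import random

def comparator_with_precomputation(a: int, b: int, n: int):
    """Return (F, lsb_enabled) for F = (A > B) on n-bit unsigned inputs."""
    msb_a = (a >> (n - 1)) & 1
    msb_b = (b >> (n - 1)) & 1
    if msb_a != msb_b:
        # MSBs differ: the 1-bit MSB comparator decides the result and the
        # registers holding the lower (n-1) bits do not need to be loaded.
        return msb_a > msb_b, False
    # MSBs equal: the lower (n-1)-bit comparator has to be evaluated.
    mask = (1 << (n - 1)) - 1
    return (a & mask) > (b & mask), True

def lsb_enable_rate(n: int, trials: int = 20000, seed: int = 0) -> float:
    """Fraction of random input pairs for which the lower bits are enabled."""
    rng = random.Random(seed)
    enabled = 0
    for _ in range(trials):
        a = rng.getrandbits(n)
        b = rng.getrandbits(n)
        f, en = comparator_with_precomputation(a, b, n)
        assert f == (a > b)  # precomputation must not change the function
        enabled += en
    return enabled / trials

print(f"lower-bit comparator enabled in {lsb_enable_rate(8):.1%} of cycles")
```

With uniformly distributed inputs the lower-bit logic is enabled in roughly half of the cycles, which matches the 0.5 probability of the enable signal stated above.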
8.3 LP ARCHITECTURE-LEVEL DESIGN
In this section, architecture means also Register Transfer Level (RTL). The architecture uses a set of primitives such as adders, multipliers, ROMs, register files, etc. RTL synthesis programs are used to convert an RTL description to a set of registers and combinational logic. The impact of low-power techniques at the architecture level can be more significant than at the gate level, as will be shown in this section. The techniques discussed to reduce the power dissipation are: parallelism, pipelining, distributed processing and power management.
8.3.1 Parallelism
Parallelism can be used to reduce the power dissipation at the expense of area while maintaining the same throughput [10]. To illustrate this, the quantitative example of Fig. 8.7 is considered. In Fig. 8.7(a), a register supplies two 16-bit operands to a 16 x 16 multiplier. We refer to this architecture as the reference one and we use the ref notation for frequency, power supply voltage, power dissipation, etc. This register is clocked at a maximal frequency $f_{ref}$ = 50 MHz. We assume that the worst case delay of the multiplication is 20 ns at $V_{ref}$ = 3.3 V power supply voltage. It is clear that we cannot reduce $V_{ref}$ to reduce the power dissipation of this architecture, since the multiplier delay would then exceed the 20 ns clock period.

In the parallel architecture, two identical multipliers are used to maintain the same throughput as in the case of Fig. 8.7(a). The input registers are clocked at $f_{ref}/2$ = 25 MHz. Therefore, the power supply can be reduced to achieve a worst case delay of 40 ns. With the same 16 x 16 multiplier, the power supply can be reduced from $V_{ref}$ = 3.3 V to 1.8 V ($V_{ref}/1.83$). This value can be determined from the simulation of the two architectures. The effective capacitance has increased by a factor of 2 due to the duplication. However, due to the extra routing to both multipliers, this effective capacitance is around $2.2\,C_{ref}$. Thus, the estimated power dissipation is given by

$$P_{par} = C_{par} V_{par}^2 f_{par} = (2.2\,C_{ref}) \left(\frac{V_{ref}}{1.83}\right)^2 \frac{f_{ref}}{2}$$

Hence

$$P_{par} = 0.33\,P_{ref}$$

Thus, the power dissipation is significantly reduced.
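The arithmetic of this example can be checked with a short sketch. It assumes only the dynamic power relation P = C V^2 f and the scaling factors quoted above; the helper name `relative_power` is introduced here for illustration.

```python
def relative_power(cap_factor: float, voltage_divisor: float, freq_factor: float) -> float:
    """Return P / P_ref for P = C * V^2 * f, with C, V and f scaled
    relative to the reference architecture."""
    return cap_factor * (1.0 / voltage_divisor) ** 2 * freq_factor

# Parallel multiplier: 2.2x capacitance, V_ref / 1.83, half the clock frequency.
p_parallel = relative_power(cap_factor=2.2, voltage_divisor=1.83, freq_factor=0.5)
print(f"P_par / P_ref = {p_parallel:.2f}")
```

The same helper can be reused to explore other combinations of capacitance overhead and supply scaling, keeping in mind that the voltage needed for a given delay has to come from circuit simulation.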
The key to this power savings is the duplication of the hardware in parallel configuration. In general, N processors can be parallelized by duplication, with each processor running with a slower clock (by a factor of N). In this case, for the same throughput, the power dissipation can be reduced with the increase of N, because the power supply voltage ($V_{DD}$) can be aggressively reduced to meet a worst case delay almost equal to N times the reference delay. To exploit this power supply reduction, the threshold voltage ($V_T$) should also be reduced to limit the degradation of the delay as $V_{DD}$ approaches $V_T$. Keep in mind that the scaling of $V_T$ is also limited by static current considerations.
When the number N is relatively large, the parallelism can lead to several problems. A highly parallelized configuration can result in a drastic increase of the occupied area. In addition, there is routing overhead to distribute the input and output signals. This also increases the area and the wiring capacitance. Therefore, the power dissipation also tends to increase, which limits the utility of parallelism.
8.3.2 Pipelining
Pipelining is another architecture that can reduce the power dissipation [10]. As an example, let us consider the case of the 16 x 16 multiplier presented in Section 8.3.1. The 50 MHz multiplier is broken into two equal parts as shown in Fig. 8.8. A set of pipeline registers (or latches) is inserted, resulting in a 2-stage pipelined version of the multiplier. Architectures with more pipeline stages can be realized. Since the hardware between the pipeline stages is reduced, the reference voltage $V_{ref}$ = 3.3 V can be reduced to 1.8 V ($V_{ref}/1.83$) while maintaining a worst case delay of 20 ns (50 MHz). The estimated power dissipation is given by

$$P_{pipe} = C_{pipe} V_{pipe}^2 f_{pipe} = C_{pipe} \left(\frac{V_{ref}}{1.83}\right)^2 f_{ref}$$
The switching capacitance has increased slightly due to the pipelining. Thus, the power dissipation is reduced by a factor of almost 2.8, which is approximately the same as in the parallel case. Also the area increase is relatively low, and the area penalty is due only to the additional registers (or latches). As the pipeline registers reduce the logic depth, the power dissipation due to the glitches is also reduced.
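As a rough consistency check, the quoted reduction factor can be related to the capacitance overhead of the pipeline registers. The sketch below assumes P = C V^2 f with the clock frequency unchanged; the overhead of about 20% is inferred here from the factor of 2.8 and is not a figure given in the text.

```python
V_DIVISOR = 1.83   # supply scaled from 3.3 V to 1.8 V
REDUCTION = 2.8    # power reduction factor quoted for the pipelined multiplier

# With no capacitance overhead the reduction would be V_DIVISOR ** 2.
ideal_reduction = V_DIVISOR ** 2

# Capacitance factor C_pipe / C_ref implied by the quoted reduction
# (an inference for illustration, assuming f_pipe = f_ref).
implied_cap_factor = ideal_reduction / REDUCTION

print(f"ideal reduction    : {ideal_reduction:.2f}")
print(f"implied C_pipe/C_ref: {implied_cap_factor:.2f}")
```

Under these assumptions the added registers would account for roughly a fifth more switched capacitance than the reference design.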
In general, if a processor is pipelined with N stages of registers, then the delay between the pipeline stages is reduced by almost a factor of N while the clock frequency is maintained. Then, the power supply voltage can be scaled aggressively. Consequently, the power saving is large.
Note that, as in the case of parallelism, an architecture with a large number of pipeline stages can result in an offset in power and area. The added registers must be clocked, and hence the load on the clock network can become important with increased pipelining. One drawback of the use of the pipeline is that more latency is added to the output signal.
The combination of pipelining and parallelism can result in further power reduction, because the power supply voltage can be reduced aggressively. Also the frequency of operation is reduced. However, the area would increase significantly. For low voltage, the threshold voltage should also be reduced to reduce the power dissipation, otherwise the power supply voltage reduction is limited. Indeed, at low voltage, $V_{DD}$ approaches $V_T$ and the delay increases drastically. To maintain the throughput with parallelism/pipelining, the threshold voltage should be reduced compared to $V_{DD}$.
8.3.3 Distributed Processing

To reduce the power dissipation of a centralized processor, a distributed processing technique can be utilized. This concept of distributed processing is explained by the example of the Vector Quantized (VQ) image encoder [14] presented in reference [15]. First we review the VQ algorithm for video compression; then, in the next section, the power reduction at the algorithm level of the VQ is discussed.
A video image, represented by a group of pixels, is vector quantized by breaking it into blocks (vectors) of pixels that are mapped to a codebook of probable vectors using Mean Square Error (MSE) as the distortion measure. For the example given in [15], the image is segmented into 4 x 4 pixel vectors (vector size is 16). The VQ employs a codebook of 256 levels. The input data is represented on 16 x 8 bits and the output (8 bits) represents the index of the best match, as shown in Fig. 8.9 [15]. Then the compression ratio is 16:1. To process 30 frames/s, a vector must be compressed every 17.3 µs (each frame is 128 x 240 pixels). The MSE (distortion metric) between a vector X of 16 pixels and a codebook vector C is given by

$$\mathrm{MSE} = \sum_{i=0}^{15} (C_i - X_i)^2 \qquad (8.8)$$
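The figures quoted for this example follow from simple arithmetic, sketched below together with the distortion of Equation (8.8). The constant and function names are chosen here for illustration only.

```python
FRAME_WIDTH, FRAME_HEIGHT = 128, 240   # pixels per frame
PIXELS_PER_VECTOR = 16                 # 4 x 4 block
BITS_PER_PIXEL = 8
INDEX_BITS = 8                         # index into a 256-level codebook
FRAMES_PER_SECOND = 30

vectors_per_frame = FRAME_WIDTH * FRAME_HEIGHT // PIXELS_PER_VECTOR
time_per_vector_us = 1e6 / (vectors_per_frame * FRAMES_PER_SECOND)
compression_ratio = PIXELS_PER_VECTOR * BITS_PER_PIXEL / INDEX_BITS

def mse(code_vector, x):
    """Distortion of Equation (8.8): sum of squared differences."""
    return sum((c - xi) ** 2 for c, xi in zip(code_vector, x))

print(f"{vectors_per_frame} vectors per frame")
print(f"{time_per_vector_us:.2f} us available per vector")
print(f"compression ratio {compression_ratio:.0f}:1")
```

The time budget comes out at about 17.36 µs per vector, in line with the 17.3 µs quoted above.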
To compute this algorithm, a large number of memory accesses to the codebook and arithmetic operations is needed (see Section 8.4). The number of computations can be reduced by using a differential search, computed a priori, combined with Tree Search (TS). The distortion difference between two vectors a and b at the same level of the tree is given by

$$\mathrm{MSE}_{ab} = \mathrm{MSE}_a - \mathrm{MSE}_b \qquad (8.9)$$

Then,

$$\mathrm{MSE}_{ab} = \sum_{i=0}^{15} (C_{ai} - X_i)^2 - \sum_{i=0}^{15} (C_{bi} - X_i)^2 \qquad (8.10)$$

Expanding the squares, the $X_i^2$ terms cancel and

$$\mathrm{MSE}_{ab} = \sum_{i=0}^{15} (C_{ai}^2 - C_{bi}^2) + \sum_{i=0}^{15} 2\,(C_{bi} - C_{ai})\,X_i \qquad (8.11)$$

The two terms $\sum_i (C_{ai}^2 - C_{bi}^2)$ and $2(C_{bi} - C_{ai})$ are computed a priori and stored in a memory to reduce the number of operations.
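A small sketch can confirm that the differential form gives the same value as subtracting the two distortions directly. The function names are hypothetical, and the split into one stored constant plus 16 stored weights per vector pair is an assumption about how the a priori terms would be organized.

```python
import random

def mse(code_vector, x):
    """Direct distortion: sum of squared differences."""
    return sum((c - xi) ** 2 for c, xi in zip(code_vector, x))

def precompute_terms(ca, cb):
    """Terms that depend only on the codebook and can be stored a priori."""
    constant = sum(a * a - b * b for a, b in zip(ca, cb))
    weights = [2 * (b - a) for a, b in zip(ca, cb)]
    return constant, weights

def differential_mse(constant, weights, x):
    """MSE_a - MSE_b using one multiplication per pixel."""
    return constant + sum(w * xi for w, xi in zip(weights, x))

rng = random.Random(1)
ca = [rng.randrange(256) for _ in range(16)]
cb = [rng.randrange(256) for _ in range(16)]
x = [rng.randrange(256) for _ in range(16)]

constant, weights = precompute_terms(ca, cb)
direct = mse(ca, x) - mse(cb, x)
assert differential_mse(constant, weights, x) == direct
# A negative difference means code vector a is the closer match.
print("branch a" if direct < 0 else "branch b")
```

Only the sign of the result is needed to choose a branch, so each tree level costs 16 multiplications instead of the 32 required to evaluate both distortions separately.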
Fig. 8.10(a) shows the centralized implementation of the VQ. It has a centralized memory, processing element, and controller. This architecture is time-multiplexed and performs operations sequentially over a large number of clock cycles. In TSVQ, each level of the tree has specific code vectors that are found only at that level. Therefore, the memory can be partitioned into separate memories for each level of the tree. Fig. 8.10(b) shows the distributed implementation of the VQ. The memory size increases from one module to the next. The architecture is pipelined, allowing the clock frequency and supply voltage to be reduced. The distributed memory architecture has lower switched capacitance when reading the code vectors than the centralized case. This distributed implementation has eight controllers and processing elements, but since they are clocked at lower frequency, with low supply voltage, the energy dissipated per vector does not change [15]. Through this partitioning, the power dissipated by the centralized implementation was reduced by a factor of 11 at the expense of an area increase by a factor of 2.
From this example we can learn that a proper design of the architecture, through distributed processing, is more power-efficient than the centralized processor. In the distributed implementation, the different local hardware resources can be optimized more efficiently than the global hardware in the centralized implementation. The application of this technique depends on whether the executed algorithm can be partitioned. Keep in mind that the power saving is traded for occupied area, while the throughput is maintained.
8.3.4 Power Management

In old designs of microprocessors, DSPs, ASICs, etc., there was wasted power due to the clocking of blocks which are idle for a significant period of time. Recently, power management methodologies are playing an important role in avoiding wasted power in normal and standby modes of operation [17, 18, 19, 20]. In this section, only some of the power management techniques are discussed.
There are two types of power management: i) dynamic and ii) static. Dynamic Power Management (DPM) allows selective shut-down of different blocks of the chip based on the level of activity required to run a particular application. Different blocks of the chip may be idle for a certain period of time when running different applications. For example, the floating point unit can have 100% idle time when the processor is executing integer applications. The DPM requires additional logic on the chip. This logic is controlled by signals that indicate idle periods.
In the PowerPC¹ 603 [21], the DPM mode is enabled by software. The DPM logic automatically stops the clock switching of a specific unit, generated by clock regenerators. The clock regenerators produce two clocks, C1 and C2, which feed master and slave latches. Two "freeze" input signals control the clocks, C1 and C2, as shown in the timing diagrams of Fig. 8.11. The logic needed for DPM does not introduce any performance degradation and it consumes about 0.3% of the total die area in the PowerPC. The DPM provides a power saving of 10-20% depending on the application to be executed. The DPM can be implemented at either a high level (e.g., execution unit) or a low level (e.g., a block inside a unit) of hardware.
Static Power Management (SPM) permits the saving of the power dissipation in the standby mode. In this case, the activity of the entire system is monitored rather than a specific unit (or block). When the system remains idle for a significant period of time, then the entire chip is shut down. The SPM may have several modes depending on whether the entire chip is shut down or a part of it. For example, the PowerPC 603 has three modes which are programmable through hardware but controlled by software (operating system). In this microprocessor, one mode is called sleep mode, which allows maximum power savings by disabling the clock to all units. In this mode the PLL and external input clock are disabled to bring the power dissipation down to the leakage levels. The power of the PowerPC 603, in the sleep mode, is as low as 1.8 mW [20].

¹ PowerPC 603 is from IBM Corp.

[Fig. 8.11: timing diagrams of the clocks C1 and C2 and the freeze signals.]
8.4 ALGORITHMIC-LEVEL POWER REDUCTION

Algorithm optimization can have a significant impact on the power consumption of a system. Design decisions made at this level, combined with the architecture level, may lead to a large power saving. In this section, we discuss two approaches that reduce the power dissipation at the algorithm level. The first one is based on the reduction of the switched capacitance, by minimizing the complexity of the system. The second method exploits data coding for the purpose of low switching activity.
8.4.1 Switched Capacitance Reduction

The power dissipated by an algorithm can be measured, for example, by counting the number of operations required to execute such an algorithm. To reduce the power of an algorithm, the number of primitive operations, such as memory accesses, ALU operations, etc., should be minimized. The different types of operations do not consume the same amount of power. For example, a multiplication operation consumes more power than an addition operation. Thus, when minimizing the number of operations of an algorithm, the type of operation should be taken into account. Keep in mind that high performance systems use complex algorithms that require a large number of operations.
To illustrate this consideration, the computation complexity for three methods of the VQ algorithm is presented. Remember that the distortion metric between the input data (vector X) and a codebook vector C is given by Equation (8.8). One method to evaluate the distortion and find the best match is to use a full search through the codebook. Thus, the distortion is computed for the 256 levels of the codebook. Each level requires 16 memory accesses to perform 16 subtractions, 16 multiplications, 15 additions, etc. Hence a large number of primitive operations is needed.

In the binary TSVQ already presented in Section 8.3.3, the codebook is organized into a tree structure as shown in Fig. 8.12. The input vector is compared with two code vectors at each node. Based on this comparison, one of the two branches is chosen and the codebook search space is reduced compared to the full search, since a reduced number of code vectors (16) is utilized. For each comparison, at a specific level, an index bit is generated as shown in Fig. 8.12. The process of comparison through the tree is repeated until a leaf node is reached. For a codebook of 256 levels, the tree has a depth of 8 (d=7). Compared to the full search, the number of memory accesses and executed operations is reduced considerably, since only 16 code vectors are used in the TSVQ algorithm. One VLSI implementation of the TSVQ algorithm uses systolic arrays [22].

[Fig. 8.12: binary tree structure of the TSVQ codebook, levels d=0 to d=7.]

The number of computations can be further reduced by using the differential search of the TSVQ [see Equation (8.11)]. At each level (i) of the tree, the differential distortion between the left (vector a) and right (vector b) code vectors connected to the level (i - 1) is computed. Therefore, the number of operations is reduced. Table 8.1 [15] shows the computation complexity of the three methods of the VQ. The differential TSVQ results in a lower number of operations to be executed for each type.
8.4.2 Switching Activity Reduction
Minimizing the switching activity, at a high level, is one way to reduce the power dissipation of digital processors. This can have an influence on the power reduction, especially when the switching signals have a large capacitance. One method to minimize the switching activity, at the algorithmic level, is to use an appropriate coding for the signals rather than straight binary code.
Table 8.1 Computation complexity of the three VQ methods [15].

Algorithm                  Memory Access   Multiplication   Add/Subtract
Full Search                4096            4096             8448
Tree Search                256             256              520
Differential Tree Search   136             128              136
In [23], Gray coding has been used for the address lines of a microprocessor, for both instruction and data accesses, to reduce the switching activity of the nets. The advantage of Gray code over binary code is that Gray code changes by only one bit as it sequences from one number to the next. In other words, if the memory access pattern is a sequence of consecutive addresses, then each memory access changes only one bit of its address. Due to instruction locality, during program execution, most of the memory accesses are sequential. Therefore the Gray code eliminates the simultaneous switching of a significant number of bits. Table 8.2 shows a comparison of the 3-bit representations of the binary and Gray codes. Note that the Gray code has only one transition for each sequential change, while the binary code may have many transitions.

Table 8.2 Binary and Gray-code representation.

Decimal Equivalent   Binary Code   Gray Code
0                    000           000
1                    001           001
2                    010           011
3                    011           010
4                    100           110
5                    101           111
6                    110           101
7                    111           100
In [23], the switching property of the address coding was measured using the number of bit switches per executed instruction. For instruction accesses, the Gray and binary codings were compared using benchmark programs. The maximum reduction in bit switches was found to be as high as 58% and the average reduction was equal to 31%. The same study was also carried out for data addresses. The average reduction of bit switches was 8%.
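The bit-switch metric is easy to illustrate for a purely sequential address stream. The sketch below uses the standard binary-to-Gray conversion (n XOR n shifted right by one); the helper names and the 3-bit address sequence are chosen here for illustration and are not the benchmark setup of [23].

```python
def binary_to_gray(n: int) -> int:
    """Standard reflected binary (Gray) code of a non-negative integer."""
    return n ^ (n >> 1)

def bit_switches(a: int, b: int) -> int:
    """Number of address lines that toggle between two consecutive codes."""
    return bin(a ^ b).count("1")

def total_switches(addresses, encode=lambda value: value) -> int:
    """Total toggles on the address bus for a sequence of addresses."""
    codes = [encode(address) for address in addresses]
    return sum(bit_switches(p, q) for p, q in zip(codes, codes[1:]))

sequential = list(range(8))  # consecutive 3-bit addresses 0..7
binary_total = total_switches(sequential)
gray_total = total_switches(sequential, binary_to_gray)
print(f"binary: {binary_total} switches, Gray: {gray_total} switches")
```

For this sequence the binary code toggles 11 lines in total while the Gray code toggles 7, one per address increment. Non-sequential accesses, which are more common for data addresses, reduce the benefit.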
8.5 POWER ESTIMATION TECHNIQUES
Power estimation means, in general, the techniques of estimating the average power dissipation of circuits. The goal of this section is to present an overview of power analysis techniques and tools at the circuit, gate, architectural, and behavioral levels of abstraction. Measuring the power consumption is critical for low-power design, as it permits the designer to optimize power, meet requirements, and know the power distribution through the chip.
8.5.1 Circuit-Level Tools
The most straightforward method of power estimation is circuit simulation: perform a circuit simulation of the design and measure the average current drawn from the supply. From this, the average power can be estimated. The disadvantage of this approach is that the results are strongly dependent on the input patterns to the circuit (pattern-dependent technique); it is also called dynamic³ power simulation. If the circuit has a large number of inputs, then the circuit simulation would be time consuming and even impractical. The most accurate power simulator to date is still SPICE. However, it can handle only very small circuits (e.g., hundreds of transistors). SPICE accurately takes into account non-linear capacitances (junction and gate) which cannot be captured by higher level tools. Also, it can accurately measure short-circuit and leakage currents. The latter is very important for low-$V_T$ applications. SPICE cannot be used to estimate the power of large circuits or chips, due to the time-consuming nature of the simulator. It is a pattern-dependent power analysis tool.

³ Dynamically computed power should not be confused with dynamic power.
Another transistor-level power simulator/analyzer is PowerMill⁴ [24]. It applies an event-driven simulation algorithm to increase the computation speed by two to three orders of magnitude over SPICE, with an acceptable level of accuracy (within 10%). Also, it uses table lookup to determine the terminal current of the device from the applied voltages.
PowerMill can also identify the hot spots (which consume more dynamic power) and trouble spots (which consume unexpectedly large amounts of leakage current). Moreover, elements with excessive short-circuit current are detected. This allows the designer to resize the circuit to reduce the rise/fall time. Static reduced-swing nodes are detected, as shown in the example of Fig. 8.13. The node A is charged to $V_{DD} - V_T$ when the input is low.
Another approach for power estimation is the use of statistical techniques. The work in [25] suggested the use of Monte Carlo simulation to estimate the total average power of the circuit. Basically, this statistical technique is based on applying randomly generated input patterns at the primary inputs and monitoring the convergence of the power dissipation. The simulation is stopped when the measured power is close enough to the true average power. This approach, based on the Monte Carlo method, requires simulation over a large number of measurements. The advantage of the statistical techniques is that they can be built around existing simulation tools.
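The general shape of such a Monte Carlo loop can be sketched as follows. This is not the procedure of [25]: the stopping rule (a normal-approximation confidence interval on the running mean) is an assumption, and `toy_power_sample` is a hypothetical stand-in for one run of a real power simulator.

```python
import random
import statistics

def toy_power_sample(rng, n_inputs=8, energy_per_toggle=1.0):
    """Hypothetical stand-in for a simulator run: apply two random input
    vectors and charge a fixed energy for every input bit that toggles."""
    before = rng.getrandbits(n_inputs)
    after = rng.getrandbits(n_inputs)
    return energy_per_toggle * bin(before ^ after).count("1")

def monte_carlo_power(sample, rel_error=0.05, z=1.96, batch=30,
                      max_batches=1000, seed=0):
    """Draw samples in batches until the confidence-interval half width
    falls below rel_error times the running mean (assumed stopping rule)."""
    rng = random.Random(seed)
    samples = []
    mean = 0.0
    for _ in range(max_batches):
        samples.extend(sample(rng) for _ in range(batch))
        mean = statistics.fmean(samples)
        half_width = z * statistics.stdev(samples) / len(samples) ** 0.5
        if mean > 0 and half_width <= rel_error * mean:
            break
    return mean, len(samples)

estimate, runs = monte_carlo_power(toy_power_sample)
print(f"estimated average: {estimate:.2f} after {runs} samples")
```

For the toy model the true average is 4 toggles per pattern pair, so the estimate should settle close to that value after a few hundred samples. With a real simulator each sample is far more expensive, which is why the number of required measurements matters.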
⁴ PowerMill is from EPIC Design Technology.
8.5.2 Gate-Level Techniques
In order to overcome the shortcomings of power analysis tools at the circuit level, several gate-level estimation tools have recently been proposed. In this section, we present two techniques for power estimation at the gate level. The first approach relies on the probabilistic method, while the second one is based on event-driven simulation.
8.5.2.1 Probabilistic Power Estimation
The power dissipation can be analyzed using a pattern-independent approach when the signals are represented with probabilities (also called static techniques). This approach makes it possible to overcome the shortcomings of simulation-based techniques. The user supplies the probabilities of the primary inputs to a logic network. The average power dissipation of a logic network is estimated as

$$P = V_{DD}^2 f \sum_{i=1}^{N} \alpha_i C_i \qquad (8.12)$$

where N is the number of nodes in the network, $C_i$ is the total physical capacitance of node i, and $\alpha_i$ is the switching activity (also called transition probability, $P_t$) given by

$$\alpha_i = P_i (1 - P_i) \qquad (8.13)$$

where $P_i$ is the probability that the node i is at the high level. In this expression of activity it is assumed that the circuit input and internal nodes are independent (spatial independence). Also the values of the same signal, in two consecutive clock cycles, are assumed independent (temporal independence).
If the input probabilities to a network are provided, then they are propagated through the circuit to evaluate the transition probability at each node. For example, for a 2-input AND function, $y = x_1 \cdot x_2$, the probability of the output being at the high level is given by $P_y = P_{x_1} \cdot P_{x_2}$. The computation of the probabilities for different gates is discussed in Chapter 4.
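A minimal sketch of this propagation, under the spatial and temporal independence assumptions stated above, is given below. The three-gate network, the node capacitances and the function names are hypothetical and serve only to show how Equations (8.12) and (8.13) are applied.

```python
def p_and(pa: float, pb: float) -> float:
    return pa * pb

def p_or(pa: float, pb: float) -> float:
    return pa + pb - pa * pb

def p_not(pa: float) -> float:
    return 1.0 - pa

def switching_activity(p: float) -> float:
    """Equation (8.13): probability of a transition on a node."""
    return p * (1.0 - p)

def average_power(vdd: float, freq: float, nodes) -> float:
    """Equation (8.12); nodes is an iterable of (probability, capacitance)."""
    return vdd ** 2 * freq * sum(switching_activity(p) * c for p, c in nodes)

# Hypothetical network: y = (x1 AND x2) OR (NOT x3), all inputs at P = 0.5.
p_n1 = p_and(0.5, 0.5)
p_n2 = p_not(0.5)
p_y = p_or(p_n1, p_n2)

# Assumed node capacitances in farads.
nodes = [(p_n1, 10e-15), (p_n2, 10e-15), (p_y, 20e-15)]
power = average_power(vdd=3.3, freq=50e6, nodes=nodes)
print(f"P(y = 1) = {p_y}, estimated power = {power * 1e6:.2f} uW")
```

Because the method never applies input patterns, it runs in a single pass over the network, but it inherits the errors of the independence and zero-delay assumptions.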
One tool (LTIMES), based on probabilities, was first proposed in [26]. In this work, the temporal and spatial independence of signals are assumed. In practice, the signals may be correlated. Also a zero-delay model was assumed, which leads to an error in estimating the power, since the glitching power is not accounted for.
Probabilistic power estimation approaches that compute the power due to glitches and apply a real delay model have been proposed [27, 28]. In [27], the switching activity computation is based on the transition density. The assumption made in [27] is the spatial independence of the signals. A power estimator tool based on the transition density has been called DENSIM. The transition density of a node is defined as the average number of nodal transitions per unit time. If y is a boolean function with inputs $x_i$, then the boolean difference of y with respect to $x_i$ is defined by

$$\frac{\partial y}{\partial x_i} = y(x_i = 1) \oplus y(x_i = 0) \qquad (8.14)$$

It was shown in [29] that if the $x_i$ are spatially independent, then the density of the boolean function is given by

$$D(y) = \sum_{i} P\!\left(\frac{\partial y}{\partial x_i}\right) D(x_i) \qquad (8.15)$$

where P(x) is the equilibrium probability of the signal over time. Equations (8.14) and (8.15) are used to propagate the density through the boolean network. $\partial y/\partial x_i$ is one if a transition at $x_i$ will cause a simultaneous transition at y. As an example, consider the case of a 2-input AND gate with $y = x_1 x_2$. In this case, $\partial y/\partial x_1 = x_2$ and $\partial y/\partial x_2 = x_1$, so that $D(y) = P(x_2) D(x_1) + P(x_1) D(x_2)$. Hence, from the probability and density values at the primary inputs of a logic network, the density at the output can be computed. The boolean differences of a logic network are calculated using Binary Decision Diagrams (BDDs) [30].
Note that the average power dissipation is computed by

$$P_{avg} = \frac{1}{2} V_{DD}^2 \sum_{i=1}^{N} C_i D(x_i) \qquad (8.16)$$

The factor 1/2 is added to account for the double transition per clock period.
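The density propagation of Equations (8.14) and (8.15) can be illustrated for small gates by enumerating the truth table. This exhaustive enumeration is used here in place of the BDD-based computation mentioned in the text, purely for illustration; the function names and the numerical values are assumptions.

```python
from itertools import product

def boolean_difference_probability(func, n, i, probs):
    """P(dy/dx_i = 1), assuming the inputs are spatially independent.
    Sums the probability of every assignment of the other inputs for
    which toggling x_i toggles the output (Equation 8.14)."""
    others = [k for k in range(n) if k != i]
    total = 0.0
    for bits in product((0, 1), repeat=n - 1):
        assignment = [0] * n
        weight = 1.0
        for k, bit in zip(others, bits):
            assignment[k] = bit
            weight *= probs[k] if bit else 1.0 - probs[k]
        assignment[i] = 1
        y_high = func(*assignment)
        assignment[i] = 0
        y_low = func(*assignment)
        if y_high != y_low:
            total += weight
    return total

def output_density(func, probs, densities):
    """Equation (8.15): transition density at the gate output."""
    n = len(probs)
    return sum(boolean_difference_probability(func, n, i, probs) * densities[i]
               for i in range(n))

def and_gate(a, b):
    return a & b

probs = [0.5, 0.25]      # equilibrium probabilities P(x1), P(x2)
densities = [2.0, 4.0]   # transition densities D(x1), D(x2)
d_and = output_density(and_gate, probs, densities)
print(f"D(y) for the AND gate: {d_and}")
```

For the AND gate this reproduces $D(y) = P(x_2) D(x_1) + P(x_1) D(x_2)$, here 0.25 x 2.0 + 0.5 x 4.0 = 2.5. The enumeration grows exponentially with the number of inputs, which is why a compact representation such as BDDs is preferred for real networks.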
This model, based on transition density, ignores the spatial correlation of the signals and computes the power due to glitches only approximately. The work in [28] attempts to handle both spatial and temporal correlations. One disadvantage of the approach in [28] is that the use of BDDs for the whole circuit tends to limit the size of the network that can be analyzed.
The probabilistic techniques have the advantage that the user does not have to supply simulation patterns, and they are claimed to have fast computation time. However, they do not account for the internal power of the gates and static power dissipation. These techniques can be used, for example, as a fast power estimator for logic synthesis. They might also be suited for comparing various subsystem structures.
8.5.2.2 Event-Driven Simulation

Another gate level power analysis approach has been proposed for semi-custom design [31]. The environment of the system is shown in Fig. 8.14. The system uses a cell library that has been characterized for static and dynamic power dissipation with the Entice⁵ (ENergy and TIming Characterization Environment) cell characterization system [32]. The dynamic power includes the power due to the short-circuit current and the one due to the load capacitance. Entice characterizes each cell taking into account the following parameters: input signal slope, output capacitive load, operating voltage, temperature, and process parameters. Entice uses SPICE as a circuit simulator to model each cell for power. A set of power vectors describes all possible events where power can be dissipated by the cell for dynamic and static cases. With SPICE these power events are accurately characterized. There are two types of power vectors: i) dynamic and ii) static. A dynamic power vector describes an event in which power is dissipated due to a signal switching at the cell inputs. For example, for a 2-input (A and B) AND gate, when A = 1 and B makes a transition from 0 to 1, an energy is dissipated. A static power vector describes the conditions of logic signals under which leakage power occurs.

The designer creates a design from the cell library at gate level, which is then input to the Aspen⁶ (A System for Power EstimatioN) system. Also the stimulus to drive the logic simulator and the interconnect loads, representing the intercell connectivity (estimates or actual values provided by back-annotation from layout), are specified. A logic simulator such as Verilog-XL⁷ is used as an event-driven simulator. Upon invocation, Aspen monitors the power event occurrence (node activity) of each cell and computes the total power dissipation as the sum of the power dissipation of all the cells in the power vector paths. Multiple time windows can be specified for simulation to compute the average power over different time periods. Note that Aspen uses the power vectors of a cell to compute the total power.

⁵ Entice is from Motorola Inc.
⁶ Aspen is from Motorola Inc.
⁷ Verilog-XL is from Cadence Design Systems Inc.
The dynamic power of each cell is computed by multiplying the number of power events (transition count) by the energy dissipation per transition event of a cell. This process is applied to all dynamic power vectors for a cell to obtain the total energy dissipated. The total dynamic power of a cell, over a certain time period, is equal to the total energy divided by the time period.
The static power vector is used to compute the leakage of a cell. Note that the static power of a cell is dependent on the logic state of the cell, as shown in Fig. 8.15. To compute the static power dissipation, the duration of activation of the corresponding static power vector is measured. A transition of a net signal may cause a static power vector to be activated and another vector to be deactivated. Vectors are time stamped upon activation and upon deactivation. Then the total time length in which the vector is active is found. The activation time length of the static power vector is multiplied by the power dissipation value (per time unit) to obtain the static power of the vector. Again, the static power dissipation for all vectors associated with a cell instance is summed to derive the total power dissipation.
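The bookkeeping described in the last two paragraphs can be sketched for a single cell as follows. The data layout, the vector names and the numerical values are hypothetical; in particular, dividing the accumulated static contribution by the observation period to obtain an average power is an assumption of this sketch, and it does not reflect the actual Aspen implementation.

```python
def dynamic_power(event_counts, energy_per_event, period):
    """Total energy of all dynamic power vectors divided by the period."""
    total_energy = sum(count * energy_per_event[vector]
                       for vector, count in event_counts.items())
    return total_energy / period

def static_power(active_intervals, power_per_vector, period):
    """Average static power: each vector contributes its leakage power
    weighted by the time it was active (from activation and deactivation
    time stamps). Normalizing by the period is an assumption."""
    total = 0.0
    for vector, intervals in active_intervals.items():
        active_time = sum(stop - start for start, stop in intervals)
        total += active_time * power_per_vector[vector]
    return total / period

# Hypothetical 2-input AND cell observed for 100 us (energies in pJ,
# powers in uW, so pJ / us gives uW).
events = {"B_rises_while_A_is_1": 1000, "A_rises_while_B_is_1": 500}
energies_pj = {"B_rises_while_A_is_1": 2.0, "A_rises_while_B_is_1": 1.0}
intervals_us = {"A0_B0": [(0.0, 40.0), (70.0, 100.0)], "A1_B1": [(40.0, 70.0)]}
leakage_uw = {"A0_B0": 0.1, "A1_B1": 0.3}

p_dyn = dynamic_power(events, energies_pj, period=100.0)
p_stat = static_power(intervals_us, leakage_uw, period=100.0)
print(f"dynamic: {p_dyn:.2f} uW, static: {p_stat:.2f} uW")
```

Summing these per-cell figures over all cell instances gives the chip-level estimate, and repeating the calculation for several time windows gives the average power over different periods.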
The results reported by Aspen, such as the switching activity of nodes, can be used to drive floorplanning, placement and routing tools. Also Aspen can handle chips with a complexity of several hundred thousand gates and is four orders of magnitude faster than SPICE. It produces results within 10% of SPICE results. One disadvantage of Aspen is that it cannot handle power due to the glitches.
8.5.3 Architecture-Level Power Estimation

The architecture of a design is represented by functional blocks, and the complexity of the design at this level is relatively low compared to the circuit and gate levels. In this section, several approaches and techniques for power modeling and analysis at the architectural level are reviewed.
8.5.3.1 Gate Count Method

One tool developed for architectural power dissipation estimation is based on equivalent logic count, memory size, logic circuit style (dynamic or static), interconnection busses, clock network and layout style (full-custom or semi-custom) [33]. The complexity of an architecture is described in terms of an average number of logic gates, such as a 3-input AND (buffered NAND) gate connected to three identical AND gates at the output node (i.e., fanin = fanout = 3), as shown in Fig. 8.16. The total power of the logic part is roughly equal to the number of gates multiplied by the power of a gate, using a user-specified switching activity. This activity factor is assumed fixed across the design.
The power of the on-chip memory is modeled for a certain memory architecture. The interconnections are divided into local and intermediate interconnections and global busses. The local interconnection is defined as the interconnections within a logic gate. The intermediate interconnections are used for connections between gates or functional blocks (subsystems). The global bus includes the data, control, and address busses. The lengths of the local and intermediate interconnections are modeled by Rent's rule [34]. The power can then be computed from the lengths using a fixed switching activity equal to the one specified for the logic. The global interconnect is determined from the dimensions of the chip and the number of drivers/receivers connected to it. The power model of the clock network is based on the H-tree [34] and the chip dimensions. The power of the on-chip drivers is also modeled in two components. One is the power used to drive the total off-chip capacitance. The other is the power consumed by the pad driver itself. The activity factor for the pads is assumed fixed and equal to 1 [33].
The tool developed in [33] is used as a power estimator in the early stage of the design. It requires some technological parameters (feature size, gate oxide thickness, parameters of the interconnection layers, etc.), the supply voltage, the chip area, the switching factor and the gate count. This tool can only be used as a rough estimator of the total power of the chip, since the switching activity is assumed fixed throughout the design. Therefore, the power partition between the different units can be incorrectly estimated.
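As a rough illustration of the logic part of the gate count method, the following Python sketch multiplies a gate count by the power of one average gate. It is not the tool of [33]: the function name, the per-gate model (activity times capacitance times supply voltage squared times frequency) and all numbers are assumptions chosen for illustration.

```python
def logic_power(gate_count, activity, gate_capacitance, vdd, frequency):
    """Rough logic power: number of gates times the power of one average gate.

    Assumption: one gate dissipates activity * C * Vdd^2 * f, with the same
    user-specified activity factor applied to every gate in the design.
    """
    power_per_gate = activity * gate_capacitance * vdd ** 2 * frequency
    return gate_count * power_per_gate


# Hypothetical design: 100k equivalent gates, 50 fF per gate, 3.3 V, 50 MHz.
total = logic_power(100_000, 0.25, 50e-15, 3.3, 50e6)
datapath = logic_power(60_000, 0.25, 50e-15, 3.3, 50e6)
control = logic_power(40_000, 0.25, 50e-15, 3.3, 50e6)
print(total, datapath, control)
```

Because the activity factor is the same everywhere, the estimated split between the two hypothetical units follows the gate counts alone (60% and 40% here), which is why the partition between units can be misleading even when the total is a reasonable first guess.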
8.5.3.2 The Power Factor Approximation Method

The Power Factor Approximation (PFA) technique is another method to estimate the power dissipation [35]. It has been used for DSP architectures. The total power dissipation of a functional block such as a multiplier, an adder, or a memory can be modeled by the approximation of Equation (8.17), where $G$ is the number of logic gates composing the functional block, $a_i$ is the switching activity of the $i$th gate, $C_i$ is the load of the $i$th gate, $i_{sc,i}$ is the short-circuit component, and $f$ is the frequency. This power equation can be expressed in a more compact form as

$$P_{avg} = \kappa G f \qquad (8.18)$$
where $\kappa$ is the PFA constant and can be related easily to Equation (8.17). $G$ can also be regarded as a hardware complexity factor instead of a number of gates. The parameter $\kappa$ has different values for different blocks. For example, for an $n$-bit multiplier, the factor $G$ is approximately equal to $n^2$, as shown in Fig. 8.17. This is due to the number of adder cells in the multiplier. Then, we have

$$P_{mult} = \kappa_{mult}\, n^2 f_{mult} \qquad (8.19)$$
The power supply voltage is included in the parameter $\kappa$. This parameter is extracted empirically from measured or simulated power values at a fixed power supply voltage.
For a VLSI chip composed of several functional blocks, the total power dissipation can be determined by summing the power of all blocks. We have

$$P_{tot} = \sum_{i \in \text{all blocks}} \kappa_i G_i f_i \qquad (8.20)$$
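A minimal Python sketch of the PFA bookkeeping is given here, assuming the compact form of Equations (8.18) to (8.20). The function names and the values of the PFA constants are hypothetical; real constants would come from measured or simulated blocks at a fixed supply voltage.

```python
def multiplier_complexity(n_bits):
    """Complexity factor G of an n-bit multiplier, taken as roughly n^2."""
    return n_bits ** 2


def pfa_block_power(kappa, complexity, frequency):
    """Power of one precharacterized block: kappa * G * f."""
    return kappa * complexity * frequency


def pfa_chip_power(blocks):
    """Sum of kappa_i * G_i * f_i over all (kappa, G, f) block tuples."""
    return sum(pfa_block_power(k, g, f) for k, g, f in blocks)


# Hypothetical PFA constants, in W/Hz per unit of complexity.
KAPPA_MULT = 15e-15
KAPPA_ADD = 10e-15

blocks = [
    (KAPPA_MULT, multiplier_complexity(16), 20e6),  # 16-bit multiplier
    (KAPPA_ADD, 16, 20e6),                          # 16-bit adder, G taken as n
]
p_mult = pfa_block_power(*blocks[0])
p_total = pfa_chip_power(blocks)
print(p_mult, p_total)
```

Each block keeps its own constant, so adding a new block type only requires characterizing that block once.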
Footnote to the power dissipation above: without the static power dissipation.

Thus, the PFA technique is based on modeling precharacterized functional blocks. Each block has a PFA factor that is independent of the others. Hence this technique provides a more general methodology than the gate equivalent model of Svensson and Liu discussed previously. The PFA factor is extracted using independent Uniform White Noise (UWN) inputs (i.e., random inputs). UWN inputs mean that the input bits are uncorrelated in space and time and independent of the data distribution. The signal and transition probabilities of each bit $i$ of the input are given by

$$P_i(1) = 0.5 \quad \text{and} \quad P_i(0 \to 1) = 0.25 \qquad (8.21)$$

Consequently, this technique does not account for the strong dependency of power consumption on the statistics of the input data [36]. The next section treats the case of power modeling that takes into account the correlated behavior of the bits.
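The two probabilities of Equation (8.21) follow from independence: a bit is 1 with probability 0.5, and a 0 followed by a 1 occurs with probability 0.5 × 0.5 = 0.25. The short Python check below estimates both values from a simulated random bit stream; the function name and the sample count are arbitrary choices for illustration.

```python
import random


def uwn_bit_statistics(n_samples, seed=0):
    """Estimate P(1) and P(0->1) for one bit of a white-noise input stream."""
    rng = random.Random(seed)
    bits = [rng.getrandbits(1) for _ in range(n_samples)]
    p_one = sum(bits) / n_samples
    # Count consecutive pairs where the bit goes from 0 to 1.
    rising = sum(1 for prev, cur in zip(bits, bits[1:]) if prev == 0 and cur == 1)
    p_rise = rising / (n_samples - 1)
    return p_one, p_rise


p_one, p_rise = uwn_bit_statistics(100_000)
print(p_one, p_rise)  # both estimates should be close to 0.5 and 0.25
```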
8.5.3.3 Dual Bit Type Model

In digital signal processing, correlation can exist between the values of a temporal sequence of data. The UWN model can lead to an error in estimating the power of a circuit even if the bit-width utilization is maximized. To take the data correlation into account, the Dual Bit Type (DBT) data model has been proposed in [36, 37]. The DBT data model permits accurate estimation of the power dissipation.
[Figure 8.18: Transition activity P(0→1) versus bit position for two's complement data streams with temporal correlation ρ = -0.99, -0.80, -0.60, 0.0, 0.60, 0.80, and 0.99.]
Fig. 8.18 shows the transition activity of several different two's complement data streams versus the bit position (for an $n$-bit word). In this figure, each curve corresponds to a different temporal correlation given by

$$\rho = \frac{\mathrm{cov}(X_{t-1}, X_t)}{\sigma^2} \qquad (8.22)$$

where $X_{t-1}$ and $X_t$ are successive data (in time) and $\sigma^2$ is the variance. $\rho = 0$ corresponds to the white noise case, where $P(0 \to 1) = 0.25$. From Fig. 8.18 it is evident that the UWN model, while sufficient for describing the activity in the Least Significant Bits (LSBs), is inadequate for the Most Significant Bit (MSB) region. The UWN model works correctly for the LSBs up to the break point BP0. The MSB region corresponds to the sign bits and, consequently, the signal and transition probabilities of these bits are far from random. $\rho > 0$ corresponds to a lower activity for positively correlated signals, while $\rho < 0$ corresponds to a higher activity for negatively correlated signals. The MSB region starts from the break point BP1. The region between BP0 and BP1 can be modeled by linear interpolation. BP0 and BP1 can be determined from the word-level statistics [37].

The power estimation of the architecture modules is based on a black-box technique of the switched capacitance. Typical modules are adders, multipliers, shifters, RAMs, ROMs, etc. The power dissipation is modeled for each module by

$$P = C V_{DD}^2 f \qquad (8.23)$$

where the switched capacitance $C$ is related to the complexity and the activity of the module. For example, for an $n$-bit ripple-carry subtractor, the switched capacitance is modeled by

$$C = C_{eff}\, n \qquad (8.24)$$

where $C_{eff}$ is a capacitive coefficient (in fF/bit) determined from the DBT model. $C_{eff}$ can be a single coefficient for the UWN case. The DBT model employs several coefficients for $C_{eff}$, which reflect the data representation and the signal statistics. For the subtractor, for example, a table of $C_{eff}$ values is generated as a function of all possible data transitions, i.e., sign bit transitions and random LSB transitions.
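The dependence of the bit-level activity on the temporal correlation, as discussed for Fig. 8.18, can be reproduced qualitatively with a small simulation. The Python sketch below is an assumption-laden illustration and not the method of [36, 37]: it generates a first-order autoregressive Gaussian sequence with correlation `rho`, rounds it to an `n_bits` two's complement word, and counts 0→1 transitions per bit. The function name, the standard deviation and the sample count are arbitrary.

```python
import math
import random


def bit_activity(rho, n_bits=16, sigma=1000.0, n_samples=20000, seed=1):
    """Estimate P(0->1) for every bit of a correlated two's complement stream."""
    rng = random.Random(seed)
    lo, hi = -(1 << (n_bits - 1)), (1 << (n_bits - 1)) - 1
    mask = (1 << n_bits) - 1
    x = rng.gauss(0.0, sigma)
    prev = None
    rising = [0] * n_bits
    for _ in range(n_samples):
        # AR(1) update keeps variance sigma^2 and lag-one correlation rho.
        x = rho * x + math.sqrt(1.0 - rho * rho) * rng.gauss(0.0, sigma)
        word = max(lo, min(hi, int(round(x)))) & mask  # two's complement bits
        if prev is not None:
            up = ~prev & word  # bits that were 0 and are now 1
            for b in range(n_bits):
                rising[b] += (up >> b) & 1
        prev = word
    return [r / (n_samples - 1) for r in rising]


white = bit_activity(0.0)
positive = bit_activity(0.99)
negative = bit_activity(-0.99)
print("LSB:", white[0], positive[0], negative[0])
print("MSB:", white[-1], positive[-1], negative[-1])
```

In such a run the LSB activity stays near 0.25 for all three correlations, while the sign-bit activity drops for the positively correlated stream and rises for the negatively correlated one, which is the behavior the UWN model cannot capture.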
To extract the capacitive coefficients of each module, the library should be characterized. This operation is performed once per library. The extraction process consists of several steps:

Pattern generation. Input patterns to a module are generated based on the DBT data model. Both random (UWN) and sign data streams should be used. The input patterns containing the UWN component must be simulated for several cycles. This allows convergence of the average capacitance.

Simulation. The generated patterns are fed to a simulator (such as a circuit simulator) from which the switching capacitances are extracted.

Capacitive coefficient extraction. The simulation step produces the average effective switching capacitances for the entire series of applied input transitions such as U→U, S→S, etc., where U and S denote the UWN and sign types of the input bits, respectively. The capacitive coefficients are extracted from the effective switching capacitances and the complexity parameters.

Based on this methodology, a power analysis tool, at the architectural level, has been developed [37].
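The last extraction step and the use of the resulting coefficient can be sketched as follows, assuming the linear model of Equation (8.24), in which the switched capacitance of a module is a capacitive coefficient times the word length. The function names, the transition-class keys ("UU", "SS") and the capacitance values are hypothetical; in practice the averages would come from the simulation step.

```python
def extract_coefficients(avg_capacitance_by_class, n_bits):
    """Divide each simulated average switched capacitance by the word length."""
    return {cls: c / n_bits for cls, c in avg_capacitance_by_class.items()}


def module_power(ceff, n_bits, vdd, frequency):
    """Module power P = C * Vdd^2 * f with C = ceff * n_bits."""
    return ceff * n_bits * vdd ** 2 * frequency


# Hypothetical averages (in farads) from simulating a 16-bit subtractor.
simulated = {"UU": 1.6e-12, "SS": 0.8e-12}
coefficients = extract_coefficients(simulated, 16)

# Reuse the UWN coefficient to estimate a 32-bit instance at 3.3 V, 20 MHz.
p_32bit = module_power(coefficients["UU"], 32, 3.3, 20e6)
print(coefficients, p_32bit)
```

Because the characterization is done once per library, the same table of coefficients can be reused for every instance and word length of the module.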
8.5.4 Behavioral-Level Power Estimation

A behavioral representation describes the function of the system versus a set of inputs. The behavior can be specified, for example, by algorithms (in Verilog, VHDL, etc.) or by Boolean functions. Power estimation at the behavioral level relates the consumed energy to the execution of an algorithm. Decisions at the system and behavioral levels can influence the final power dissipation of the circuit by several orders of magnitude.
One approach to power estimation at the behavioral level has been proposed in [38]. It is based on a combination of analytical and stochastic power models. In this work, a class of applications such as real-time DSPs is considered for the power estimator. In the behavioral context, the power consumed by a hardware resource is given by

$$P = N_a C V_{DD}^2 f \qquad (8.25)$$

where $N_a$ is the number of accesses to the resource over the period of computation, $C$ is the average capacitance switched per access, and $f$ is the computation frequency.
In [38] the power of some hardware resources, such as execution units, registers, etc., is analytically modeled (using Equation (8.25)) from the Control/Data Flow Graph (CDFG), which is used to represent the design. The average capacitance switched per access for a particular hardware resource is estimated from the white noise data model. The power consumed by hardware resources such as controllers, interconnects, and the clock network is difficult to estimate. For these resources, statistics from a large number of realized chips are used to estimate the switched capacitance.
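Equation (8.25) can be applied resource by resource once the access counts are known. The Python sketch below assumes those counts have already been obtained from the CDFG; the resource names, access counts and capacitances are invented for illustration and are not taken from [38].

```python
def resource_power(accesses, cap_per_access, vdd, frequency):
    """Power of one resource: N_a * C * Vdd^2 * f, as in Equation (8.25)."""
    return accesses * cap_per_access * vdd ** 2 * frequency


# Hypothetical resources: (accesses per computation, farads per access).
resources = {
    "adder": (4, 2.0e-12),
    "register": (10, 0.5e-12),
}
VDD = 3.0      # supply voltage in volts
F_COMP = 1e6   # computations per second

per_resource = {
    name: resource_power(n_a, cap, VDD, F_COMP)
    for name, (n_a, cap) in resources.items()
}
p_analytical = sum(per_resource.values())
print(per_resource, p_analytical)
```

Controllers, interconnect and the clock network are left out of this sum on purpose, since the text notes that they are estimated statistically rather than analytically.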
8.6 CHAPTER SUMMARY
Low dynamic power techniques at several levels of abstraction have been presented. Algorithmic and architectural decisions can influence the power dissipation of a circuit by orders of magnitude. Therefore, CAD tools that help the designer analyze the power of the circuit at these levels are needed. At lower levels of the design, the power reduction techniques offer some savings, but less than those expected at higher levels. Several power estimation tools have been discussed at the different levels of the design. Keep in mind that circuit simulators provide high accuracy for power analysis and take into account all power components.
REFERENCES
[1] K-Y. Chao and D. F. Wong, "Low Power Considerations in Floorplan Design," Proc. of the International Workshop on Low Power Design, pp. 45-50, April 1994.
[2] H. Vaishnav and M. Pedram, "PCUBE: A Performance Driven Placement Algorithm for Low Power Designs," Proc. of the EURO-DAC'93, pp. 72-77, September 1993.
[3] A. Shen, A. Ghosh, S. Devadas, and K. Keutzer, "On Average Power Dissipation and Random Pattern Testability of CMOS Combinational Logic Networks," Proc. of the International Conference on Computer-Aided Design, pp. 402-407, November 1992.
[4] K. Keutzer, "The Impact of CAD on the Design of Low Power Digital Circuits," IEEE Symposium on Low Power Electronics, Tech. Dig., pp. 42-45, October 1994.
[5] C-Y. Tsui, M. Pedram, and A. M. Despain, "Technology Decomposition and Mapping Targeting Low Power Dissipation," 30th ACM/IEEE Design Automation Conference, Tech. Dig., pp. 68-73, June 1993.
[6] R. Murgai, R. K. Brayton, and A. Sangiovanni-Vincentelli, "Decomposition of Logic Functions for Minimum Transition Activity," Proc. of the International Workshop on Low Power Design, pp. 33-38, April 1994.
[7] V. Tiwari, P. Ashar, and S. Malik, "Technology Mapping for Low Power," 30th ACM/IEEE Design Automation Conference, Tech. Dig., pp. 74-79, June 1993.
[8] K. Scott and K. Keutzer, "Improving Cell Libraries for Synthesis," IEEE Custom Integrated Circuits Conference, Tech. Dig., pp. 128-131, May 1994.
[9] C. Lemonds and S. Mahant-Shetti, "A Low Power 16 by 16 Multiplier using Transition Reduction Circuitry," Proc. of the International Workshop on Low Power Design, pp. 139-142, April 1994.
[10] A. Chandrakasan, S. Sheng, and R. W. Brodersen, "Low-Power CMOS Design," IEEE Journal of Solid-State Circuits, vol. 27, no. 4, pp. 472-484, April 1992.
[11] U. Ko, P. T. Balsara, and W. Lee, "A Self-timed Method to Minimize Spurious Transitions in Low Power CMOS Circuits," IEEE Symposium on Low Power Electronics, Tech. Dig., pp. 62-63, October 1994.
[12] R. I. Bahar, H. Cho, G. D. Hachtel, E. Macii, and F. Somenzi, "An Application of ADD-Based Timing Analysis to Combinational Low Power Re-Synthesis," Proc. of the International Workshop on Low Power Design, pp. 139-142, April 1994.
[13] M. Alidina, J. Monteiro, S. Devadas, A. Ghosh, and M. Papaefthymiou, "Precomputation-Based Sequential Logic Optimization for Low Power," IEEE Transactions on Very Large Scale Integration Systems, vol. 2, no. 4, pp. 426-436, December 1994.
[14] A. Gersho and R. Gray, "Vector Quantization and Signal Compression," Kluwer Academic Publishers, MA, 1992.
[15] D. B. Lidsky and J. M. Rabaey, "Low-Power Design of Memory Intensive Functions," IEEE Symposium on Low Power Electronics, Tech. Dig., pp. 16-17, October 1994.
[16] A. P. Chandrakasan, A. Burstein, and R. W. Brodersen, "A Low-Power Chipset for a Portable Multimedia I/O Terminal," IEEE Journal of Solid-State Circuits, vol. 29, no. 12, pp. 1415-1428, December 1994.
[17] J. Schutz, "A 3.3 V 0.6 μm BiCMOS Superscalar Microprocessor," IEEE International Solid-State Circuits Conference, Tech. Dig., pp. 202-203, February 1994.
[18] N. K. Yeung, Y-H. Sutu, T. Y-F. Su, E. T. Pak, C-C. Chao, S. Akki, D. D. Yau, and R. Lodenquai, "The Design of a 55SPECint92 RISC Processor under 2W," IEEE International Solid-State Circuits Conference, Tech. Dig., pp. 206-207, February 1994.
[19] D. Pham, et al., "A 3.0W 75SPECint92 85SPECfp92 Superscalar RISC," IEEE International Solid-State Circuits Conference, Tech. Dig., pp. 212-213, February 1994.
[20] G. Gerosa, et al., "A 2.2 W 80 MHz Superscalar RISC Microprocessor," IEEE Journal of Solid-State Circuits, vol. 29, no. 12, pp. 1440-1454, December 1994.
[21] S. Gary, C. Dietz, J. Eno, G. Gerosa, S. Park, and H. Sanchez, "The PowerPC 603 Microprocessor: A Low-Power Design for Portable Applications," Proc. of COMPCON'94, Tech. Dig., pp. 307-315, February 1994.
[22] R. K. Kolagotla, S-S. Yu, and J. F. JaJa, "VLSI Implementation of a Tree Searched Vector Quantizer," IEEE Transactions on Signal Processing, vol. 41, no. 2, pp. 901-905, February 1993.
[23] C-L. Su, C-Y. Tsui, and A. M. Despain, "Low Power Architecture Design and Compilation Techniques for High-Performance Processors," Proc. of COMPCON'94, Tech. Dig., pp. 489-498, February 1994.
[24] A-C. Deng, "Power Analysis for CMOS/BiCMOS Circuits," Proc. of the International Workshop on Low Power Design, pp. 3-8, April 1994.
[25] C. M. Huizer, "Power Dissipation Analysis of CMOS VLSI Circuits by Means of Switch-Level Simulation," Proc. of the European Solid-State Circuits Conference, pp. 61-64, 1990.
[26] M. A. Cirit, "Estimating Dynamic Power Consumption of CMOS Circuits," IEEE International Conference on Computer-Aided Design, pp. 534-537, November 1987.
[27] F. Najm, I. Hajj, and P. Yang, "An Extension of Probabilistic Simulation for Reliability Analysis of CMOS VLSI Circuits," 28th ACM/IEEE Design Automation Conference, Tech. Dig., pp. 644-649, June 1991.
[28] A. Ghosh, S. Devadas, K. Keutzer, and J. White, "Estimation of Average Switching Activity in Combinational and Sequential Circuits," 29th ACM/IEEE Design Automation Conference, Tech. Dig., pp. 253-259, June 1992.
[29] F. N. Najm, "A Survey of Power Estimation Techniques in VLSI Circuits," IEEE Transactions on Very Large Scale Integration Systems, vol. 2, no. 4, pp. 446-455, December 1994.
[30] R. E. Bryant, "Graph-Based Algorithms for Boolean Function Manipulation," IEEE Transactions on Computer-Aided Design, pp. 677-691, August 1986.
[31] B. J. George, G. Yeap, M. G. Wloka, S. C. Tyler, and D. Gossain, "Power Analysis for Semi-Custom Design," IEEE Custom Integrated Circuits Conference, Tech. Dig., pp. 249-252, 1994.
[32] B. J. George, G. Yeap, M. G. Wloka, S. C. Tyler, and D. Gossain, "Power Analysis and Characterization for Semi-Custom Design," Proc. of the International Workshop on Low Power Design, pp. 215-218, April 1994.
[33] D. Liu and C. Svensson, "Power Consumption Estimation in CMOS VLSI Chips," IEEE Journal of Solid-State Circuits, vol. 29, no. 6, pp. 663-670, June 1994.
[34] H. B. Bakoglu, "Circuits, Interconnections, and Packaging for VLSI," Addison-Wesley, Reading, MA, 1990.
[35] S. R. Powell and P. M. Chau, "Estimating Power Dissipation of VLSI Signal Processing Chips: The PFA Technique," VLSI Signal Processing IV, pp. 250-259, 1990.
[36] P. E. Landman and J. M. Rabaey, "Power Estimation for High Level Synthesis," EDAC-EUROASIC, Paris, France, pp. 361-366, February 1993.
[37] P. E. Landman and J. M. Rabaey, "Black-Box Capacitance Models for Architectural Power Analysis," Proc. of the International Workshop on Low Power Design, Napa, CA, pp. 165-170, April 1994.
[38] R. Mehra and J. Rabaey, "Behavioral Level Power Estimation and Exploration," Proc. of the International Workshop on Low Power Design, Napa, CA, pp. 197-202, April 1994.
