Low-Power Digital VLSI Design
CHAPTER 1

Recently, power dissipation has become an important constraint in a design. Several reasons underlie the emergence of this issue. Among them we cite battery-powered systems such as laptop/notebook computers, electronic organizers, etc. The need for low power in these systems arises from the need to extend battery life. Many portable electronics use rechargeable Nickel Cadmium (NiCd) batteries. Although the battery industry has been making efforts to develop batteries with higher energy capacity than that of NiCd, a sharp increase does not seem imminent. The expected improvement of the energy density is 40% by the turn of the century. With recent NiCd batteries, the energy density is around 20 Watt-hour/pound and the voltage is around 1.2 V. So, for example, for a notebook consuming a typical power of 10 Watts and using 1.5 pounds of batteries, the time of operation between recharges is 3 hours. Even with advanced battery technologies, such as Nickel-Metal Hydride (Ni-MH), which provide a larger energy density (about 30 Watt-hour/pound), the lifetime of the battery is still low. Since battery technology has offered only a limited improvement, low-power design techniques are essential for portable devices.
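The operating-time figure quoted above follows from dividing the stored energy by the load power. The short Python sketch below reproduces that arithmetic for the NiCd and Ni-MH energy densities given in the text. The function name and the assumption of an ideal, loss-free battery are illustrative choices, not part of the original discussion.

```python
def runtime_hours(energy_density_wh_per_lb, battery_weight_lb, load_power_w):
    """Return the operating time between recharges, ignoring conversion losses."""
    stored_energy_wh = energy_density_wh_per_lb * battery_weight_lb
    return stored_energy_wh / load_power_w


# Figures quoted in the text: a 10 W notebook carrying 1.5 lb of batteries.
nicd_hours = runtime_hours(20.0, 1.5, 10.0)   # NiCd, about 20 Wh/lb
nimh_hours = runtime_hours(30.0, 1.5, 10.0)   # Ni-MH, about 30 Wh/lb

print(f"NiCd:  {nicd_hours:.1f} h between recharges")
print(f"Ni-MH: {nimh_hours:.1f} h between recharges")
```

Under these assumptions, even the better chemistry adds only about an hour and a half of operation, which is why reducing the load power matters.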
Low-power design is not only needed for portable applications but also to reduce the power of high-performance systems. With large integration density and improved speed of operation, systems with high clock frequencies are emerging. These systems use high-speed products such as microprocessors. The cost associated with the packaging, cooling and fans required by these systems to remove the heat is increasing significantly. Table 1.1 shows the power consumption of various microprocessors that operate in the frequency range of 66 to 300 MHz. This table demonstrates that, at higher frequencies, the power dissipation is excessive.
Another issue related to high power dissipation is reliability. With the generation of high on-chip temperature, failure mechanisms are provoked [8]. Among them, we cite silicon interconnect fatigue, package related failure, electrical parameter shift, electromigration, junction fatigue, etc.
In addition, there is a trend to keep computers from using more than a 5% share of the total US power budget [9]. Note that 50% of office power is used by PCs. Since the processors' frequency is increasing, which results in increased power, low-power design techniques are prerequisites.
The power dissipation issues and the devices' reliability problems, when devices are scaled down to 0.5 µm and below, have driven the electronics industry to adopt a supply voltage lower than the old standard, 5 V. The new industry standard for IC operating voltage is 3.3 V (±10%). The effect of lowering the voltage to much lower values can be impressive in terms of power saving. Not only is the power reduced, but also the weight and volume associated with batteries in battery-operated systems.
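To give a feel for the saving, the sketch below assumes the standard first-order model of CMOS switching power, P = C · VDD² · f, which is not stated in this passage and is introduced here only as an assumption. With capacitance and frequency held fixed, the power ratio depends only on the square of the supply ratio. The function name is hypothetical.

```python
def dynamic_power_ratio(vdd_new, vdd_old):
    """Ratio of switching power after a supply change.

    Assumes P = C * VDD**2 * f with C and f unchanged (first-order model only;
    leakage and short-circuit power are ignored).
    """
    return (vdd_new / vdd_old) ** 2


ratio = dynamic_power_ratio(3.3, 5.0)       # move from the 5 V to the 3.3 V standard
saving_percent = (1.0 - ratio) * 100.0

print(f"Power at 3.3 V relative to 5 V: {ratio:.2f}")
print(f"Estimated saving: {saving_percent:.0f}%")
```

In this simplified model the move from 5 V to 3.3 V alone removes more than half of the switching power, before any circuit or architectural change is made.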
[Table residue: microprocessor power data. Legible fragments: PowerPC 603, 80 MHz, 3.3 V supply; cited sources [10], [11], [12].]
dissipation dominated by I/O devices such as hard disk drives and LCD displays. The total expected power dissipation of notebooks is 2 Watts with 4 pounds weight and daily recharge.
Electronic pocket communication products such as cordless and cellular telephones, PDAs (Personal Digital Assistants), pagers, etc. Table 1.3 shows a battery analysis for a handheld cellular system. Low power is crucial for extending the battery life of these systems. Also, battery improvement is needed. PDAs require a large amount of data processing with multimedia capabilities. The expected power of PDAs is around 0.5 Watt with 0.5 pound weight. Also, the expected power for pagers is 10 mW with 0.125 pound weight.
[Table 1.3, battery analysis for a handheld cellular system. Legible entries: 750 mAh secondary NiCd battery; 75 minutes talk time; 20 hours standby; example RF power: 650 mA x 6 V = 3900 mW.]
Sub-GHz processors for high-performance workstations and computers. Systems of 100 MHz and over are emerging, and 500 MHz and higher will be common by the end of the decade. Since the power consumed increases with frequency, processors with new architectures and circuits optimized for low power are crucial.
Other applications such as WLANs (Wireless Local Area Networks) and electronic goods (calculators, hearing aids, watches, etc.).
[Figure 1.1: caption not legible. Legible labels: LOGIC/CIRCUIT, DEVICE/PROCESS.]
the threshold voltage. To overcome this problem, the devices should be scaled properly. The advantages of scaling for low-power operation are the following:
• Improved device characteristics for low-voltage operation. This is due to the improvement of the current drive capabilities;
• Reduced capacitances through small geometries and junction capacitances;
• Improved interconnect technology;
• Availability of multiple and variable threshold devices. This results in good management of the active and standby power trade-off; and
• Higher density of integration. It was shown that the integration of a whole system into a single chip provides orders of magnitude in power savings.
Table 1.4 shows the effect of scaling on microprocessor performance [14]. The power dissipation can be reduced by one order of magnitude at a fixed frequency of operation.

Table 1.4 (partially legible)
L (µm)       0.50    0.35    0.25    0.15
Leff (µm)    0.35    0.25    0.15    0.10
VDD (V)      3.3     2.5     1.8     1.5
(The area, clock and power rows are not legible in the source.)
• Use of more static style over dynamic style;
• Reduce the switching activity by logic optimization;
• Optimize clock and bus loading;
• Clever circuit techniques that minimize device count and internal swing;
• Custom design may improve the power; however, the design cost increases;
• Reduce VDD in non-critical paths and use proper transistor sizing;
• Use of multi-VT logic circuits; and

• Low-power architectures based on parallelism, pipelining, etc.;
• Memory partition with selectively enabled blocks;
• Reduction of the number of global busses; and
1.3.4
Among the techniques to minimize the power at the algorithmic level, we cite:
• Minimizing the number of operations and hence the number of hardware resources; and

• Utilize low system clocks. Higher frequencies are generated with on-chip phase-locked loops; and
• High level of integration. Integrate off-chip memories (ROM, RAM, etc.) and other ICs such as digital and analog peripherals.
1.4
THIS BOOK
This book is an early contribution to the field of low-power digital VLSI circuit and system design. It targets two types of audiences: senior undergraduate and postgraduate university students, and VLSI circuit and system designers working in industry. In this book we have tried to cover the basics of VLSI systems, from the process technologies and device modeling to the architecture level. The fundamentals of power dissipation in CMOS circuits are presented to provide the readers with sufficient background to be familiar with the low-power design world. Several practical circuit examples and low-power techniques, mainly in CMOS technology, are discussed. Low-voltage issues for digital CMOS and BiCMOS circuits are also emphasized. This book also provides an extensive study of advanced CMOS subsystem design. Various power minimization techniques, at the circuit, logic, architecture and algorithm levels, are presented. Finally, the book includes a rich list of references, treating advanced topics, at the end of each chapter. This allows the readers to study, in depth, any topics they find interesting. This book is organized into eight chapters. The first chapter is an introduction to low-power design. The other chapters are presented in the following sections.
1.4.1
Chapter 2 deals with CMOS bulk, bipolar, BiCMOS and CMOS Silicon On Insulator (SOI) process technologies. Several CMOS technologies (N-well and twin-tub) and low-voltage CMOS enhancements are reviewed. Bipolar technology, with emphasis on advanced structures, is considered. The topic of the isolation techniques used for both bipolar and CMOS is addressed. Three BiCMOS technologies, with different performance/cost, are presented. A complementary BiCMOS structure, where a vertical isolated PNP transistor is merged with an NPN transistor in a CMOS process, is introduced. The design rules of a 0.8 µm BiCMOS process are supplied. Finally, SOI technology is reviewed for low-voltage and low-power applications.
FETs is discussed. The SPICE device models of a 0.8 µm CMOS/BiCMOS process are also presented. This should help the reader to appreciate the meaning of the model parameters as well as to analyze the power and delay of the low-voltage circuits presented throughout the book. Supply voltage scaling, due to reliability and power dissipation issues, is presented.
1.4.4
A variety of BiCMOS logic circuits suitable for 3.3 V and sub-3.3 V operation are presented in Chapter 5. The chapter starts with the introduction of the conventional BiCMOS (totem-pole) gate, which was used in 5 V applications. The degradation of this gate with supply voltage scaling is demonstrated. The BiNMOS family suitable for low-voltage applications (3.3 to 2 V range) is introduced. It is shown that it provides better performance and delay-power product than CMOS at these voltages, even at low fan-out. Other logic families for low power supply voltage operation are also discussed. Finally, this chapter presents several low-voltage applications of BiCMOS.
SPICE is the most commonly used circuit simulator.
REFERENCES
[1] Special Report, "The New Contenders," IEEE Spectrum, pp. 20-25, December 1993.
[2] D. W. Dobberpuhl et al., "A 200-MHz 64-b Dual-Issue CMOS Microprocessor," IEEE J. Solid-State Circuits, vol. 27, no. 11, pp. 1555-1567, November 1992.
[3] W. J. Bowhill et al., "A 300 MHz 64b Quad-Issue CMOS RISC Microprocessor," IEEE International Solid-State Circuits Conf., Tech. Dig., pp. 182-183, February 1995.
[4] Technology 1995: Solid State, IEEE Spectrum, pp. 35-39, January 1995.
[5] D. Bearden et al., "A 133 MHz 64b Four-Issue CMOS Microprocessor," IEEE International Solid-State Circuits Conf., Tech. Dig., pp. 174-175, February 1995.
[9] P. Verhofstadt, "Keynote Address," IEEE Symposium on Low Power Electronics, Tech. Dig., October 1994.
2.2 W 80 MHz Superscalar RISC Microprocessor," IEEE Journal of Solid-State Circuits, vol. 29, no. 12, pp. 1440-1454, December 1994.
[12] N. K. Yeung, Y-H. Sutu, T. Y-F. Su, E. T. Pak, C-C Chao, S. Akki, D. D. Yau, and R. Ladenquai, "The Design of a 55SPECint92 RISC Processor under 2W," IEEE International Solid-State Circuits Conference, Tech. Dig., pp. 206-207, February 1994.
[13] S. Lipoff and A. D. Little, "Evaluation of New Battery Technology in Selected Applications," IEEE Workshop on Low-Power Electronics, Phoenix, AZ, August 1993.
[14] J. M. C. Stork, "Technology Leverage for Ultra-Low Power Information Systems," IEEE Symposium on Low Power Electronics, Tech. Dig., pp. 52-55, October 1994.
2
LOW-VOLTAGE PROCESS TECHNOLOGY
This chapter serves as an introduction to IC fabrication of CMOS bulk, bipolar, BiCMOS and CMOS SOI devices, including sub-micron devices for low-voltage applications. Section 2.1 is a review of CMOS process technologies. Examples for an N-well CMOS process and a twin-tub CMOS process are considered. Section 2.2 deals with bipolar technology with emphasis on advanced bipolar structures. The topic of the isolation techniques used for both bipolar and CMOS is addressed in Section 2.3. In Section 2.4 we discuss the similarities between advanced CMOS and advanced bipolar transistor structures to demonstrate how both technologies are indeed converging. The BiCMOS technologies are introduced in Section 2.5, with emphasis on CMOS-based processes. Three BiCMOS technologies, with different performance/cost, are presented. Section 2.6 introduces a complementary BiCMOS structure, where a vertical isolated PNP transistor is merged with an NPN transistor in a CMOS process. In Section 2.7, a table with the design rules of a generic 0.8 µm BiCMOS process is supplied. Finally, in Section 2.8, SOI technology is reviewed for low-voltage applications.
CHAPTER 2
In this section we review two CMOS bulk technologies: N-well and twin-tub processes. Other processes, such as retrograde-well technology, are not discussed.
2.1.1
In the N-well CMOS process, the P-channel transistor is formed in the N-well itself and the N-channel in the P-substrate. Fig. 2.1 illustrates cross-sectional views and process steps of a typical N-well process.
The process starts by growing an oxide on the wafer. The oxide is then patterned to open N-well windows. Phosphorus atoms are implanted into the silicon, followed by a high-temperature annealing to diffuse the well [Fig. 2.1(a)]. The LOCOS (Local Oxidation of Silicon)¹ technique is used to isolate the different active areas. After removing the nitride used in the LOCOS process, a photoresist layer is deposited and is then patterned by a P-well mask (new mask). This is followed by low-energy ion implantation of boron (B I/I) to adjust the threshold voltage of the N-channel transistor [Fig. 2.1(b)]. A second ion implantation can be applied to eliminate punchthrough in the short channel device. Similarly, the threshold voltage of the P-channel transistor is adjusted [Fig. 2.1(c)]. A thin gate oxide is then grown and a layer of polysilicon is deposited and doped with phosphorus. The polysilicon is patterned to form the gates of all the transistors and an interconnection layer [Fig. 2.1(d)]. The source and drain regions are then implanted by using a photoresist mask. Boron is used for the P+ regions of the P-channel transistors and arsenic for the N-channel transistors [Fig. 2.1(e)]. The N+ and P+ regions are also used for N- and P-well contacts, respectively. The photoresist is removed and a thick oxide is deposited by Chemical Vapor Deposition (CVD) as an isolation layer between the polysilicon layer and the subsequent metal layer. Contact holes are opened in the oxide layer and metal (usually aluminum) is deposited on the whole wafer. At this stage, the metal is patterned and annealed at relatively low temperature (450 °C) [Fig. 2.1(f)]. One or two other metal layers are usually added. At the end, the wafer is passivated and windows are patterned over the metal bonding pads to provide electrical contacts with pins.
¹ For more details on the LOCOS isolation see Section 2.3.1.
[Figure 2.1: cross-sectional views and process steps of a typical N-well process. Legible step labels: strip resist/oxide; grow gate oxide; deposit polysilicon; apply photoresist and pattern; strip resist; grow oxide.]
Fig. 2.2 shows the major steps involved in a typical twin-tub process. The starting material is a lightly doped P-epitaxial material over a P+ substrate to reduce latch-up. In addition to the conventional N-tub process, another N-type (arsenic) shallow implant is used to increase the surface doping of the N-tub to prevent punchthrough (for short channel devices). It is also used to form the channel-stoppers for the P-channel transistors [Fig. 2.2(a)]. The photoresist is stripped and a selective oxidation of the N-tub is performed. The nitride/pad oxide layers are removed to implant boron, which is driven in to form the P-tub. This is followed by a second boron ion implantation for the channel-stoppers for the N-channel device [Fig. 2.2(b)]. The N-tub oxide is then stripped. So far only one mask (N-tub mask, MASK#1) is required for the self-aligned wells and channel-stopper processes. Both tubs are driven in. LOCOS isolation is developed to isolate between the devices using MASK#2, which defines the active areas. After the LOCOS process, boron is implanted through the pad oxide (used in the LOCOS) to reduce the threshold voltage of the P-channel transistor using MASK#3. This process results in a buried-channel PMOS transistor. The pad oxide is then removed. The remaining steps are similar to those used in the N-well process, where MASK#4 is needed to pattern the polysilicon [Fig. 2.2(c)]. MASK#5 and MASK#6 are required to form the N+ and P+ sources/drains (S/D), respectively. MASK#7 is used for contact openings, and MASK#8 for patterning the metal [Fig. 2.2(d)].
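The mask sequence described above can be summarized as a small lookup table. The Python sketch below is only a bookkeeping aid: the short descriptions paraphrase the text, and the source/drain masks are taken to be MASK#5 and MASK#6, which is a reading of partly illegible mask numbers rather than a confirmed detail.

```python
# Twin-tub CMOS mask sequence as described in the text (through first metal).
# Entries 5 and 6 are an assumed reading of the source/drain mask numbers.
TWIN_TUB_MASKS = [
    (1, "N-tub (self-aligned wells and channel-stoppers)"),
    (2, "Active areas (LOCOS isolation)"),
    (3, "P-channel threshold-adjust boron implant"),
    (4, "Polysilicon patterning"),
    (5, "N+ source/drain implant"),
    (6, "P+ source/drain implant"),
    (7, "Contact openings"),
    (8, "Metal patterning"),
]

mask_count = len(TWIN_TUB_MASKS)

for number, purpose in TWIN_TUB_MASKS:
    print(f"MASK#{number}: {purpose}")
print(f"Total masks through first metal: {mask_count}")
```

Keeping the sequence in one table makes it easy to compare against process variants that add steps, such as the LDD structure or the bipolar additions in a BiCMOS flow.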
The fabrication of submicron MOS transistors requires additional process steps to avoid hot carrier effects. Fig. 2.3 illustrates a CMOS twin-tub structure with Lightly Doped Drain (LDD). Both NMOS and PMOS devices have lightly doped extensions to the source and drain regions. The electric field near the drain is reduced due to its light doping. This prevents the generation of hot carriers. The major process steps to fabricate the LDD structure are shown in Fig. 2.4.
[Figure 2.2: major steps of a typical twin-tub CMOS process. Legible labels: P-tub; N-tub; strip resist; oxide; P epi-layer; N+ S/D; P+ S/D; contacts; metallization.]
[Figures 2.3 and 2.4: LDD structure and its process steps. Legible labels: side wall; field oxide.]
other parasitic capacitances. Also, the subthreshold current should be reduced when a low threshold voltage (VT ≤ 0.3 V) is used.
Extensions and variations of the standard CMOS process have been proposed to enhance the performance of devices at low voltage [3, 4]. These devices have good short channel behavior, low junction capacitance and reduced parasitic resistance. The power supply choice depends on performance/reliability/power trade-offs. A reduced power supply is needed for low-power applications, but a deep-submicron CMOS device with ultrathin gate oxide and low threshold voltage should be used to improve performance. Table 2.1 shows the speed achieved at low voltages using deep-submicron processes.
Table 2.1  Performance comparison at low-voltage.

Name [Ref.]      CMOS Process
IBM [3]          0.10 µm
AT&T [4]         0.10 µm
NEC [5]          0.15 µm
Fujitsu [6]      0.10 µm
(not legible)    0.15 µm
Toshiba [8]      0.35 µm
(The remaining columns are not legible in the source.)
An example of an improved-performance CMOS technology suitable for low voltage is the one proposed by Toshiba [8], called CMOS Shallow Junction Well FET (SJET). Fig. 2.5 shows the cross-sectional view of the CMOS-SJET process. The N-well and P-well depths are very shallow and comparable to the maximum depletion layer width in the channel. With this CMOS-SJET structure the depletion layer of the NMOS device, for example, is extended compared to the original one and reaches the depletion layer of the P-well and the N-type substrate. As a result, the total depletion layer width is increased and a low depletion capacitance, C_D, is obtained. This leads to the reduction of the subthreshold slope (see Section 3.3.2). Thus, the threshold voltage can be reduced at low power supply voltage compared to the conventional CMOS process. Furthermore, the wells are designed to reduce the junction capacitance of the S/D regions by 40 to 55% compared to the conventional one. The structure of Fig. 2.5 also uses dual polysilicon gates, N+ and P+, to optimize the threshold voltages of the MOS devices. Also, W-polycide gates are used to reduce the poly sheet resistance. The delay of the CMOS-SJET inverter is 2.5 times better than that of conventional CMOS using the same gate size (0.5 µm technology) at 1.5 V power supply. The power-delay product of a CMOS-SJET gate at 1.5 V using 0.35 µm technology is 1.3 fJ, which is 1/3 of that for conventional CMOS devices. However, the main drawback of the CMOS-SJET is the large body effect due to its retrograde doping profile.
[Figure 2.5: cross-sectional view of the CMOS-SJET process. Legible labels: P MOSFET; N MOSFET; N-substrate.]
[Figure 2.7: caption not legible.]
have been replaced by the side wall base electrodes. This allows the base area to be almost as large as the emitter. The SICOS structure is suitable for VLSI applications because of its density and low parasitics.
One of the features of advanced bipolar transistors is the replacement of aluminum by polysilicon for the contact of the emitter. This step has led to a noticeable improvement in the current gain of bipolar transistors. For further reading on polysilicon emitter BJTs refer to [10, 12, 13].
In this section, we introduce a typical Double-Polysilicon Self-Aligned (DPSA) process technology as an example of the advanced bipolar technologies¹.
Any bipolar process typically starts with creating the buried layers and the epitaxial layer. Fig. 2.8 illustrates the major steps of the epitaxial growth with an N+ buried layer (BL). This buried layer is introduced to reduce the collector resistance of a bipolar transistor, while the epitaxial layer offers the high-quality silicon host for the bipolar transistor. The steps involved in Fig. 2.8 are the following. First, an oxide layer is grown on the substrate and is then patterned using the buried layer mask. The photoresist on the oxide serves as a mask against etching and ion implantation. After etching the oxide, the exposed regions of the silicon surface are implanted with arsenic or antimony to form the N+ buried layers. The photoresist is then removed and an annealing step is carried out. All oxide is then stripped. An N-epitaxial layer is grown on the substrate as shown in Fig. 2.8(b). The thickness of this epitaxial layer can be as low as 0.8 µm for advanced digital bipolar technology. The problems limiting the scaling down of the thickness of the epitaxial layer are the autodoping and out-diffusion of the buried layer.
¹ A review of conventional bipolar technology using the junction isolation technique can be found in [14].
[Figure 2.8: major steps of the epitaxial growth with an N+ buried layer. Legible labels: photoresist; Si epitaxial layer; strip resist; anneal.]
Fig. 2.9 illustrates the sequence of a DPSA process, assuming a starting structure with N+ buried layer, N-epitaxial layer and isolation oxide as shown in Fig. 2.9(a). First, photoresist is deposited and patterned to define the collector contact region (deep N+ collector sink). This region is then implanted with phosphorus to increase its doping level. The photoresist is stripped and the P-type base is implanted through a pre-implantation oxide as shown in Fig. 2.9(b). The resist and the oxide are then removed. A combination of P+ polysilicon and oxide layers is deposited over the wafer. These layers are then etched as shown in Fig. 2.9(c). A CVD oxide is deposited over the wafer. The oxide is then dry etched using reactive ion etching (RIE). The P+ polysilicon is walled with the oxide (called sidewall spacers) [Fig. 2.9(d)]. The second level of polysilicon is deposited and implanted with phosphorus, which will ultimately form the diffused emitter junction. At this stage, the wafer is annealed to drive the dopants from the P+ and N+ polysilicon layers. Fig. 2.9(e) illustrates the structure after patterning the N+ polysilicon. The P+ diffusion under the polysilicon forms the extrinsic base. The contact openings to the P+ and N+ polysilicon and to the collector are etched. This is followed by the metallization step. At the end, the metal is patterned as shown in Fig. 2.9(f).
[Figure 2.9: sequence of a DPSA process, panels (a) to (f). Legible labels: oxide isolation; CVD oxide; P I/I (N+ poly); anneal; pattern/etch N+ polysilicon; pattern/etch metal.]
The advantage of bipolar devices is their high-speed performance. However, they are not suitable for battery backup systems because they consume high DC current. Many logic circuit techniques have been proposed for low-power and low-voltage operation, particularly for telecommunications applications [15, 16].
[Figure 2.10: caption not legible. Legible labels: active area; substrate.]
Several isolation techniques have been proposed and used. The most popular are LOCOS (Local Oxidation of Silicon) [17], trench isolation [18, 19, 20, 21], and selective epitaxy [22]. Selective epitaxy is not studied in this chapter.
The steps of the LOCOS process are illustrated in Fig. 2.11. A pad oxide of 40 nm is grown and is followed by chemical vapor deposition of a 100 nm thick nitride layer, which masks the active region. The pad oxide is called stress-relief-oxide (SRO) because it protects the silicon from stress caused by the nitride during subsequent high-temperature processes. Silicon nitride is used as a mask to protect the active region from oxidation. A layer of photoresist is applied to the wafer and then patterned using the mask of the active areas. The nitride/oxide layers are etched [Fig. 2.11(a)]. A P-type dopant is implanted to form the channel-stoppers [Fig. 2.11(b)]. The photoresist, which is used for protection against ion implantation, is stripped and a thick thermal oxide, i.e. the field oxide (FOX), is grown. Only local oxidation is realized because the nitride masks the regions beneath it. At the end, the nitride/oxide are removed [Fig. 2.11(c)]. During this LOCOS process, 56% of the FOX thickness is under the silicon surface because the oxidation consumes some of the silicon. This process is called semi-recessed LOCOS isolation. One problem associated with this process is the lateral extension of the field oxide under the nitride during the oxidation, forming what is called bird's beak encroachment [Fig. 2.11(c)]. A typical value of this encroachment is 0.5 µm/side. This encroachment limits the scaling of the active areas and the channel width of the MOS device. Moreover, the bird's beak introduces imprecise channel widths.
[Figure 2.11: steps of the LOCOS process. Legible labels: P channel-stop; substrate.]
[Figure 2.12: Poly Buffer LOCOS process. Legible labels: nitride; polysilicon.]
The Poly Buffer LOCOS process was developed to reduce the bird's beak encroachment [23]. In this modified LOCOS process, the nitride mask thickness has been increased to 240 nm and a polysilicon stress-relief buffer layer of 50 nm has been added between the nitride and a 10 nm pad oxide [Fig. 2.12(a)]. This arrangement prevents deep lateral extension of the field oxide under the nitride layer [Fig. 2.12(b)]. A 0.8 µm field oxide thickness results in 0.15 µm/side of encroachment and a 2.2 µm minimum isolation pitch. Other techniques to solve the problem of the bird's beak encroachment can be found in [24, 25, 26].
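To illustrate why the encroachment figures matter, the sketch below compares the remaining active width for the two per-side values quoted in the text (0.5 µm for conventional LOCOS and 0.15 µm for Poly Buffer LOCOS). It assumes a simple geometric model in which the field oxide encroaches equally from both sides of a drawn active region; the function name and the 2.0 µm drawn width are hypothetical values chosen for illustration.

```python
def effective_active_width_um(drawn_width_um, encroachment_per_side_um):
    """Active width left after bird's beak encroachment from both sides.

    Simplified geometric assumption: the loss is twice the per-side value.
    """
    return max(drawn_width_um - 2.0 * encroachment_per_side_um, 0.0)


DRAWN_WIDTH_UM = 2.0  # hypothetical drawn active width

locos_width = effective_active_width_um(DRAWN_WIDTH_UM, 0.5)         # conventional LOCOS
poly_buffer_width = effective_active_width_um(DRAWN_WIDTH_UM, 0.15)  # Poly Buffer LOCOS

print(f"Conventional LOCOS: {locos_width:.2f} um of {DRAWN_WIDTH_UM:.2f} um remains")
print(f"Poly Buffer LOCOS:  {poly_buffer_width:.2f} um of {DRAWN_WIDTH_UM:.2f} um remains")
```

Under this model a 2.0 µm drawn region keeps half of its width with conventional LOCOS and most of it with the poly buffer variant, which is the scaling limit the text refers to.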
2.3.1.2
Trench Isolation
Trench isolation is another alternative to the LOCOS isolation process. This technology has been accepted relatively quickly by the industry [27]. It addresses the isolation problem between opposite type devices (like N-channel and P-channel MOSFETs in CMOS technology). The advantages of the trench isolation are: i) no bird's beak encroachment, ii) latch-up free structure, and iii) planar surface. Fig. 2.13 illustrates the steps of the trench isolation process. First, the pad oxide, the nitride and the thick oxide layers are patterned using the mask of the active areas. The thick oxide serves as a mask in the trench processing [Fig. 2.13(a)]. A deep trench is formed by dry etching (RIE). This is followed by a boron implant to create the P+ channel-stoppers at the bottom of the trench. The top thick oxide is removed, and the trench sidewalls are oxidized [Fig. 2.13(b)]. The polysilicon is deposited over the whole wafer, filling the trenches. The polysilicon is used as the trench dielectric because it fills the trenches more uniformly than other dielectrics. The surface polysilicon is then etched to yield the structure shown in Fig. 2.13(c). The wafer is oxidized using the nitride as a mask. The nitride is finally removed as illustrated in Fig. 2.13(d). At this stage, conventional processing can be used to integrate the CMOS devices. Although trench isolation permits reduction of the separation between the active regions, it has several drawbacks: i) it is a costly process because of the large number of processing steps, and ii) it cannot be used as an isolation region for the inactive parts of the chip. In this case, LOCOS is usually used. The description of other trench isolation processes can be found in [28].
[Figure 2.13: steps of the trench isolation process. Legible step labels: complement well; post-processing; oxidize; remove nitride.]
[Figure, number not legible: caption ends in "isolation". Legible label: P-substrate.]
the components in different N-wells (N collectors) are isolated. The area consumed by the isolation isles is large relative to the transistor area.
The packing density of the bipolar technology can be improved by replacing the junction isolation with LOCOS isolation. An additional advantage of LOCOS isolation is the reduction of the parasitic collector-substrate capacitance. Fig. 2.15 illustrates the cross-sectional view of an NPN bipolar transistor with LOCOS isolation. The area occupied by the oxide isolation is proportional to the epitaxial layer thickness. As the epitaxial thickness is reduced for higher device performance, the oxide isolation area becomes smaller, which means that LOCOS may become a practical isolation technique for advanced bipolar and BiCMOS technologies. Fig. 2.16 illustrates the process steps for oxide isolation in a bipolar process. After epitaxy growth, a thin layer of SiO2 is grown and a layer of Si3N4 is deposited. A photoresist layer is applied and patterned with an isolation mask [Fig. 2.16(a)]. Then the nitride/pad oxide layers and approximately half of the epitaxial layer are dry etched. A boron implant is performed to form the channel-stopper [Fig. 2.16(b)]. The photoresist is then removed and the wafer is oxidized to grow the thick isolation oxide. This oxide is called recessed oxide. The Si3N4 and the pad oxide are stripped at this stage. The resulting structure is almost planar. In this structure the bird's beak is formed as in the MOS case [Fig. 2.16(c)].
In the early 1980s, new isolation techniques such as grooves and trenches [29, 30, 31] were demonstrated. These techniques reduced the collector-substrate capacitance and increased the packing density. Hence they improve circuit speeds. The fabrication process is the same as the one described for CMOS trench isolation.
Many of the steps of the advanced CMOS and bipolar processes are similar; hence, they can be shared for the fabrication of MOS and bipolar transistors. The steps that can be shared are:
[Figure 2.16: process steps for oxide isolation in a bipolar process, panels (a) to (c). Legible labels: oxide; photoresist; nitride; N+ BL process; grow epi-layer (N-type); grow pad oxide; deposit nitride/resist; pattern resist; epi-layer; remove nitride/oxide.]
1. The N-well can be used as the body of the PMOS transistor and as the N-collector of the NPN transistor;
2. The N+ buried layer of the NPN can be used to form a retrograde well for the PMOS to reduce the latch-up susceptibility;
3. The polysilicon can be used for the CMOS gates and for the emitter contacts;
4. The shallow P-type implantation can be shared by the PMOS S/D and the emitter of the NPN transistor; and
5. The final annealing steps match.
However, as more steps are shared by the different devices, the device characteristics have to be compromised. There is a tradeoff between the process complexity and device quality.
2.5
BICMOS TECHNOLOGY
Although the idea of merging bipolar and CMOS on the same chip originated 20 years ago [32], it was not feasible from a practical point of view because of the lack of adequate process technology. With the technological progress achieved in recent years, this idea has been revived. There are many techniques to merge bipolar and CMOS devices as reported in the literature [33, 34, 35, 36, 37, 38]. There are two ways of classifying BiCMOS processes. One way is to classify them according to the baseline process. A CMOS-based BiCMOS process is a CMOS baseline process to which a bipolar transistor is added. Similarly, a bipolar-based BiCMOS process is a bipolar baseline process to which CMOS transistors are added. In both cases, the added device would have to be compromised, which means that its characteristics cannot be optimized. Alternatively, BiCMOS processes can be classified according to their cost/performance. In this regard, three categories can be identified:
1. Low-cost;
2. Medium-performance; and
3. High-performance (high-speed).
In this section, we present three examples of BiCMOS processes. The first one represents a low-cost process. It needs only one mask to incorporate the bipolar device in a CMOS-based process. The second example shows a medium-performance BiCMOS process, which adds 3 extra masks to a CMOS process. The third example illustrates a high-performance process in which polysilicon emitter and self-aligned structures are used.
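The cost side of this classification can be tabulated by the number of masks each option adds to the CMOS baseline. The Python sketch below records the counts given in this section (one mask for the low-cost process, three for the medium-performance process) together with the four additional mask levels quoted for the high-performance example in Section 2.5.3. The dictionary layout and names are illustrative only.

```python
# Extra mask levels added to a CMOS baseline, as quoted in this section.
# The mask names paraphrase the text; the data structure is a sketch.
BICMOS_EXTRA_MASKS = {
    "low-cost": ["bipolar device (single added mask)"],
    "medium-performance": ["N+ buried layer", "N+ deep collector", "base implant"],
    "high-performance": ["N+ buried layer", "N+ deep collector", "P-base", "emitter window"],
}

extra_mask_counts = {name: len(masks) for name, masks in BICMOS_EXTRA_MASKS.items()}

for name, count in extra_mask_counts.items():
    print(f"{name}: {count} extra mask(s)")
```

Each added mask buys a better bipolar device (lower collector resistance, then a polysilicon emitter) at the price of process complexity, which is the trade-off the three examples walk through.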
[Figure: (a) process flow of the CMOS baseline (P-substrate, N-well, LOCOS isolation, NMOS and PMOS channel implantations, gate oxide, polysilicon gate, S/D N+ implantation, S/D P+ implantation, contact opening, metallization) with the bipolar addition (collector); device labels: NPN, NMOS, PMOS.]
region between the N+ buried layers. A thin epitaxial layer (1 µm - 2 µm) is used to increase the cutoff frequency of the NPN transistor and to reduce the required width of the isolation isles between the bipolar transistors. The N collector is formed at the same time as the N-well of the PMOS transistor. After the formation of LOCOS, a deep N+ sink is implanted and driven in. The P+ extrinsic base is implanted at the same time as the P+ S/D regions of the PMOS transistor. The N+ emitter and the N+ S/D share the same implantation step. In this process an aluminum emitter contact is used. Therefore, the size of the emitter is larger compared to the case where a self-aligned polysilicon emitter contact is used. This process uses only 3 extra masks to form the bipolar transistor. The first mask is needed for the N+ buried layer. The second mask is used to implant the N+ deep collector, and the third one is used for the base implantation. The BiCMOS process described above can be optimized for use in high-performance circuits. The collector resistance is low in comparison to the low-cost process (example 1). For a 0.8 µm process, the cutoff frequency (fT) of a bipolar transistor can be as high as 5 GHz.
2.5.3
A high-performance BiCMOS process can be achieved by replacing the N+ S/D implant, used to form the emitter in example (2), by a doped polysilicon emitter. One extra mask is required to open the emitter window of the bipolar transistor. The ion implantation of the poly emitter and the MOS gates is carried out simultaneously. As shown in Fig. 2.19, four additional mask levels (N+ buried layer, N+ deep collector, P-base, and emitter window) are required to obtain an advanced BiCMOS. After the formation of the N+/P+ buried layers, the conventional twin-tub process is carried out. LOCOS is developed to isolate the devices. The deep collector N+ is implanted and driven in, and the P-base is then patterned and implanted. The threshold voltages of the MOS transistors are adjusted by additional ion implantations. After the gate oxide growth, a thin polysilicon layer is deposited as shown in Fig. 2.20(a). The emitter window is then patterned and a second polysilicon layer is deposited [Fig. 2.20(b)]. The polysilicon is then doped by implantation and patterned to define the CMOS gates and the polysilicon emitter [Fig. 2.20(c)]. Next, implants are selectively carried out to form the LDD regions for CMOS. Before implanting the N+/P+ S/D regions, a sidewall oxide is formed near the emitter and gate edges. Fig. 2.19(b) shows the final cross-section of this BiCMOS process.

[Figures 2.19 and 2.20: device cross-section (polysilicon, NPN, P-base, N-well) and polysilicon emitter formation steps: apply photoresist, pattern emitter, etch poly/oxide, strip resist; deposit LPCVD poly (250 nm), second part of split poly; implant As, apply photoresist, pattern poly, anneal.]

The BJTs realized in the presented high-performance BiCMOS process have low collector resistance (because of the buried layer and deep sink), high current gain (because of the poly emitter contact) and low parasitic capacitances (because of the self-alignment). With this BiCMOS process, fT's greater than 5 GHz can be achieved.

BiCMOS technology has a relatively high cost and complexity, because it requires a total of 15 masks for a submicron process. Several solutions have been proposed to reduce the number of process steps and so lower process complexity and cost. Recently one idea [40] has resulted in a low-cost 0.35 µm BiCMOS technology which needs only 11 masks by using a W-plug trench collector sink. This technology is suitable for a 3.3 V power supply voltage and is promising for low-power mixed-signal applications.

Recently, BiCMOS technologies with high NPN transistor fT's, from 10 to 30 GHz, have been reported [38, 40, 41]. These technologies are applied, for example, to low-voltage (3 V and sub-3 V) and high-speed logic circuits. Another application of BiCMOS is mixed analog/digital ICs, ranging from telecommunication circuits and high-speed networks to wireless systems. Among these applications, BiCMOS can be used for low-power high-frequency portable systems. Bipolar devices can be used for the high-frequency and high-speed parts with low-power innovative circuits, and CMOS can be used for the low-speed ultra-low-power parts.
ated with the PNP transistor are its high collector resistance, low current gain, and high base transit time. It has recently been reported that CBiCMOS processes can offer NPNs with fT's of 8-20 GHz and PNPs with fT's of 2-7 GHz [45, 46, 47, 48, 49, 50]. Fig. 2.21 shows a cross-sectional view and process flow of a CBiCMOS [46]. The N+ buried layer of the NPN transistor creates a retrograde well for the PMOS transistor. The P+ buried layer is only used for isolation isles between NPN transistors. After the epitaxial layer growth, twin-well and LOCOS processes are performed. The P-well of the NMOS device is used as the collector of the PNP transistor. A second high-energy (600 keV) boron ion implantation is carried out to form the retrograde well (2nd P-well) for the NMOS and the P+ buried layer for the PNP device. The S/D implants of the MOS transistors are used simultaneously for the extrinsic bases of the NPN and the PNP transistors. The emitters of the NPN and the PNP are formed by the self-aligned contact doping technique to simplify the process flow. Finally, the metal is deposited and patterned.
Complementary BiCMOS offers a technology with versatile devices. It adds flexibility for mixed bipolar/MOS circuit design. The CBiCMOS technology promises further improvements to the performance of BiCMOS circuits.
[Figure 2.21 (a) Fabrication process flow: P-substrate, N+/P+ buried layer, N-type epitaxial layer, N/P twin well (2nd P-well for PNP), field isolation, collector deep N+, contact holes, N+ and P+ emitter implant, metallization; (b) Cross-sectional view of CBiCMOS [48].]
N-well (NW): The NW mask is used to define the N substrate (bulk) of the PMOS and the N-collector of the NPN transistor.
N-collector plug (CN): The CN mask defines the area which is exposed for the N+ sink implantation.
P-base diffusion (CP): The CP mask defines the region which is to receive a P-implant to create the base diffusion.
Polysilicon (PO): The PO mask defines the gate and the emitter electrodes, and the polysilicon interconnect layer.
Emitter window (EW): The EW mask defines the opening for the emitter window.
N+ diffusion (DN) and P+ diffusion (DP): The DN (DP) mask defines the N+ (P+) source and drain regions of the N-channel (P-channel) device within the P-well (N-well), and the body contact regions in the N-well (P-well) respectively.
Contact (CO): The CO mask defines the contact openings.
Metal 1 (M1): The M1 mask defines the metal 1 interconnects.
Via (VIA): The VIA mask defines the openings of the via that connects metal 1 to metal 2.
Metal 2 (M2): The M2 mask defines the metal 2 interconnects.
1.
    12λ  12λ

2.  N+ diffusion (DN)
    2.1 minimum width                             3λ
    2.2 minimum spacing                           3λ
    2.3 minimum NW overlap of DN                  0λ
    2.4 minimum NW to external DN spacing         6λ

3.  P+ diffusion (DP)
    3.1 minimum width
    3λ  3λ  4λ  4λ  CIA  3λ

4.  N-collector plug (CN)
    4.1 minimum width                             4λ
    4.2 minimum spacing                           12λ
    4.3 minimum space to NW                       10λ
    4.4 minimum NW overlap of CN                  3λ
    4.5 minimum space to DN                       6λ
    4.6 minimum space to DP                       5λ

5.  P-base diffusion (CP)
    5.1 minimum width                             4λ
    5.2 minimum spacing                           4λ
    5.3 minimum NW overlap of CP                  3λ
    5.4 minimum space to CN                       5λ
    5.5 minimum space to DN                       3λ
    5.6 minimum space to DP                       3λ

6.  Polysilicon (PO)
    6.1 minimum width                             2λ
    6.2 minimum spacing                           3λ
    6.3 minimum space to DP or DN                 2λ
    6.4 gate overhang of DP or DN                 2λ
    6.5 minimum space to CN or CP                 1λ

7.  Emitter window (EW)
    7.1 minimum width                             2λ
    7.2 minimum length                            4λ
    7.3 minimum spacing                           3λ
    7.4 minimum CP overlap of EW                  2λ
    7.5 minimum poly overlap of EW                2λ

8.  Contact (CO)
    8.1 minimum size (single)
    8.2 minimum size (double)
    8.3 minimum spacing
    8.4 minimum DN or DP overlap of CO
    8.5 minimum space to gate
    8.6 minimum PO overlap of CO
    8.7 minimum CN or CP overlap of CO
    8.8 minimum PO to CO spacing in P-base
    8.9 minimum poly emitter CO to CP spacing
    1λ  1λ  2λ  2λ

9.  Metal 1 (M1)
    9.1 minimum width                             2λ
    9.2 minimum spacing                           3λ
    9.3 minimum M1 overlap of CO                  1λ
    9.4 maximum current density                   1 mA/µm

10. Metal 2 (M2)
    10.1 minimum width
    10.2 minimum spacing
    10.3 maximum current density

11. Via (VIA)
    11.1 minimum size
    11.2 minimum spacing
    11.3 minimum M1 or M2 overlap of VIA
    11.4 minimum VIA to CO spacing
    11.5 minimum PO to VIA spacing
    11.6 minimum PO overlap of VIA
[Figure: drain current versus drain voltage, showing the kink effect.]
The SOI SIMOX is now a mature material and represents a potential technology for low-power applications. Several LSI/VLSI circuits have been fabricated in SOI/SIMOX, particularly for low-power applications. Such circuits include a PLL (Phase Locked Loop) for wireless terminal applications [54], and a 1.2 GHz frequency divider under a 1 V power supply [55]. The SOI technology was also applied to design a fully pipelined 512-Kb SRAM [53]. This SRAM worked successfully down to 0.? V with an access time of less than 5 ns.
Fig. 2.24 shows a thin-film SOI/SIMOX CMOS process cross-section. The process starts with the formation of the buried oxide in the silicon wafer as explained above [Fig. 2.24(a)]. Then, an oxide is grown on the surface silicon and a nitride layer is deposited. Silicon nitride is used as a mask to protect the active region from oxidation. The nitride/oxide layers are patterned and a LOCOS isolation is applied [Fig. 2.24(b)]. At the end, the nitride/oxide layers are removed. This is followed by a P-type ion implantation (I/I) to adjust the threshold voltage of the N-channel transistor. Similarly, the threshold voltage of the P-channel transistor is adjusted by I/I. A thin gate oxide is then grown and a layer of polysilicon is deposited and doped with phosphorus. Then the P+ source and drain regions of the PMOS are patterned and implanted with boron [Fig. 2.24(c)]. Similarly, the N+ S/D regions of the NMOS are patterned and implanted with phosphorus. A thick oxide is then deposited as an isolation layer between the polysilicon and the subsequent metal layer. The oxide is etched at contact locations. Next, the metal layer (aluminum) is deposited over the whole surface. Finally, the metal is etched and annealed.

[Figure 2.24: thin-film SOI/SIMOX CMOS process cross-sections; step labels include P-channel VT implant, N-channel VT pattern, N-channel VT implant.]

This simple process description shows that the SOI process is much simpler than bulk CMOS. For instance, the wells are no longer needed, and the punchthrough implants are also unnecessary if thin-film SOI is used. Fig. 2.25 shows a ...
Due to the dielectric isolation, the SOI MOS devices have several advantages over bulk CMOS, such as absence of latch-up, high packing density and lower parasitic capacitances. SOI reduces the circuit capacitance by 30% [57]. It has been discovered that if the silicon film (containing the devices) is made sufficiently thin (< 100 nm), the MOSFET devices are fully depleted [51] even when VGS = 0. Fully depleted thin-film SOI MOS devices offer attractive characteristics for CMOS applications, such as immunity from short-channel effects, absence of the kink effect, superior subthreshold leakage and high drain saturation current (due to low channel doping) [58, 59, 60].
Unfortunately, the technology has minor disadvantages such as floating body effects, which result in i) floating-body-induced threshold voltage lowering and ii) low drain-to-source breakdown voltage. For a 1 V power supply this is not a problem. However, for 3 V operation this could be an important limitation. Also, the threshold voltage is very sensitive to the thickness uniformity of the superficial silicon. In addition, the low thermal conductivity of the oxide underneath the thin-film silicon layer is a severe problem when the SOI circuit is operating at high frequency. Therefore, technological improvements are still needed to solve these limitations.
devices. We have shown that the advanced CMOS and bipolar processes are converging, and many process techniques can be shared for the fabrication of both devices. The different options for merging bipolar and CMOS devices are then discussed. Three examples of BiCMOS processes with different complexities are presented. The complementary BiCMOS process is also considered. A table of design rules for a state-of-the-art BiCMOS technology is given for layout exercises. Several advanced technologies such as CMOS SOI/SIMOX and CMOS-SJET are reviewed for low-voltage operation.
REFERENCES
[1] F.M. Wanlass and C.T. Sah, "Nanowatt Logic using Field-Effect MOS Triodes," International Solid-State Circuits Conference Tech. Dig., pp. 32-33, 1963.
[2] L.C. Parrillo, R.S. Payne, R.E. Davis, G.W. Reutlinger, and R.L. Field, "Twin-Tub CMOS: A Technology for VLSI Circuits," International Electron Devices Meeting Tech. Dig., pp. 752-755, December 1980.
[3] Y. Taur et al., "High-Performance 0.1 µm CMOS Devices with 1.5 V Power Supply," International Electron Devices Meeting Tech. Dig., pp. 127-130, December 1993.
[4] K.F. Lee et al., "Room Temperature 0.1 µm CMOS Technology with 11.8 ps Gate Delay," International Electron Devices Meeting Tech. Dig., pp. 131-134, December 1993.
[5] K. Takeuchi et al., "0.15 µm CMOS with High Reliability and Performance," International Electron Devices Meeting Tech. Dig., pp. 883-886, December 1993.
[6] T. Yamazaki, K. Goto, T. Fukano, Y. Nara, T. Sugii, and T. Ito, "21 ps Switching 0.1 µm-CMOS at Room Temperature using High Performance Co Salicide Process," International Electron Devices Meeting Tech. Dig., pp. 906-908, December 1993.
[7] A. Oyamatsu, K. Kinugawa, and M. Kakumu, "Design Methodology of Deep Submicron CMOS Devices for 1 V Operation," Symposium on VLSI Technology Tech. Dig., pp. 89-90, 1993.
[8] H. Yoshimura, F. Matsuoka, and M. Kakumu, "New CMOS Shallow Junction Well FET Structure (CMOS-SJET) for Low Power-Supply Voltage," International Electron Devices Meeting Tech. Dig., pp. 909-912, December 1992.
[9] T. Uchino, T. Shiba, T. Kikuchi, Y. Tamaki, A. Watanabe, Y. Kiyota, and M. Honda, "15-ps ECL/74-GHz fT Bipolar Technology," International Electron Devices Meeting Tech. Dig., pp. 67-70, December 1993.
[10] T.H. Ning and D.D. Tang, "Bipolar Trends," Proc. IEEE, vol. 74, no. 12, pp. 1669-1671, December 1986.
[11] T. Nakamura, T. Miyazaki, S. Takahashi, T. Kure, T. Okabe, and M. Nagata, "Self-Aligned Bipolar Transistor with Polysilicon Sidewall Base Electrode for High Packing Density and High Speed," IEEE Journal of Solid-State Circuits, vol. 17, no. 2, pp. 226-230, April 1982.
[12] T.H. Ning and R.D. Isaac, "Effect of Emitter Contact on Current Gain of Silicon Bipolar Devices," IEEE Electron Device Letters, ED-27, pp. 2051-2055, November 1980.
[13] A.K. Kapoor and D.J. Roulston, "Polysilicon Emitter Bipolar Transistors," IEEE Press Book, 1989.
[14] M.I. Elmasry, "Digital Bipolar Integrated Circuits," John Wiley & Sons, New York, 1983.
[17] E. Kooi, J.G. Van Lierop, and J.A. Appels, "Formation of Silicon Nitride at a Si-SiO2 Interface during Local Oxidation of Silicon and During Heat Treatment of Oxidized Silicon in NH3 Gas," J. Electrochem. Soc., vol. 123, p. 1117, 1976.
[18] R.D. Rung, H. Momose, and Y. Nagakubo, "Deep-Trench Isolated CMOS Devices," International Electron Devices Meeting Tech. Dig., pp. 6-9, December 1982.
[19] T. Yamaguchi, S. Morimoto, G. Kawamoto, H.K. Park, and G.C. Eiden, "High-Speed Latch-up Free 0.5 µm-Channel CMOS using Self-Aligned TiSi and Deep-Trench Isolation Technologies," International Electron Devices Meeting Tech. Dig., pp. 522-525, December 1983.
[20] R.D. Rung, "Trench Isolation Prospects for Application in CMOS VLSI," International Electron Devices Meeting Tech. Dig., pp. 574-577, December 1984.
[21] A. Mikashiba, T. Homma, and K. Hamano, "A New Trench Isolation Technology as a Replacement for LOCOS," International Electron Devices Meeting Tech. Dig., pp. 578-581, December 1984.
[22] P. Singer, "Selective Epitaxial Growth Finds New Applications," Semiconductor International, p. 15, January 1988.
[23] R.A. Chapman, et al., "An 0.8 µm CMOS Technology for High-Performance Logic Applications," International Electron Devices Meeting Tech. Dig., pp. 362-365, December 1987.
[24] K.Y. Chiu, R. Fang, J. Lin, and J.L. Moll, "The SWAMI - A Defect Free and Near-Zero Bird's Beak Local Oxidation Technology for VLSI," Symp. on VLSI Technology Tech. Dig., pp. 28-29, 1982.
[25] K.Y. Chiu, J.L. Moll, and J. Manoliu, "A Bird's Beak Free Local Oxidation Technology Feasible for VLSI Circuits Fabrication," IEEE Trans. on Electron Devices, vol. ED-29, pp. 536-540, 1982.
[26] J. Hui, P. Vande Voorde and J. Moll, "Scaling Limitations of Submicron Local Oxidation Technology," International Electron Devices Meeting Tech. Dig., pp. 392-395, December 1985.
[27] H.B. Pogge, "Trench Isolation Technology," Bipolar Circuits and Technology Meeting Tech. Dig., pp. 18-25, September 1990.
[28] Y. Niitsu et al., "Latch-up Free CMOS Structure using Shallow Trench Isolation," International Electron Devices Meeting Tech. Dig., pp. 509-512, December 1985.
[29] H. Yamamoto, O. Mizuno, T. Kubota, M. Nakamae, A. Shiraki, and Y. Ikushima, "High-Speed Performance of a Basic ECL Gate with 1.25 Micron Design Rule," Symp. on VLSI Technology Tech. Dig., pp. 38-39, 1981.
[30] Y. Tamaki, T. Shiba, N. Honma, S. Mizuo, and A. Hayasaka, "New U-Groove Isolation Technology for High-Speed Bipolar Memory," Symp. on VLSI Technology Tech. Dig., pp. 24-25, 1983.
[31] D.D. Tang, P.M. Solomon, T.H. Ning, R.D. Isaac, and R.E. Burger, "1.25 µm Deep-Groove-Isolated Self-Aligned Bipolar Circuits," IEEE Journal of Solid-State Circuits, vol. SC-17, pp. 925-931, 1982.
[32] H.C. Lin, J.C. Ho, R.R. Iyer, and K. Kwong, "CMOS-Bipolar Transistor Structure," IEEE Trans. Electron Devices, vol. ED-16, no. 11, pp. 945-951, November 1969.
[33] T. Ikeda, A. Watanabe, Y. Nishio, I. Masuda, N. Tamba, M. Okada, and K. Ogiue, "High-Speed BiCMOS Technology with a Buried Twin Well Structure," IEEE Trans. on Electron Devices, vol. ED-34, no. 6, pp. 1304-1309, June 1987.
[34] H. Momose, K.M. Cham, C.I. Drowley, H.R. Grinolds, and R.S. Fu, "0.5 Micron BiCMOS Technology," International Electron Devices Meeting Tech. Dig., pp. 838-840, December 1987.
[35] A.R. Alvarez, J. Teplik, D.W. Schucker, T. Hulseweh, H.B. Liang, M. Dydyk, and I. Rahim, "Second Generation BiCMOS Gate Array Technology," Bipolar Circuits and Technology Meeting Tech. Dig., pp. 113-117, 1987.
[36] B. Bastani, C. Lage, L. Wong, J. Small, R. Lahri, L. Bouknight, T. Bowman, J. Manoliu, and T. Tuntasood, "Advanced 1 Micron BiCMOS Technology for High Speed 256k SRAM's," Symp. on VLSI Technology Tech. Dig., pp. 41-42, 1987.
[37] T. Yamaguchi and T.H. Yuzuriha, "Process Integration and Device Performance of a Submicron BiCMOS with 16-GHz fT Double Poly-Bipolar Devices," IEEE Trans. on Electron Devices, vol. 36, no. 5, pp. 890-896, May 1989.
[38] C.K. Lau, C-H Lin and D.L. Packwood, "Sub-micron BiCMOS Process Design for Manufacturing," Bipolar/BiCMOS Circuits and Technology Meeting Tech. Dig., pp. 76-83, 1992.
[39] C.H. Wang and J. Van Der Velden, "A Single-Poly BiCMOS Technology with a 30 GHz Bipolar fT," Bipolar/BiCMOS Circuits and Technology Meeting Tech. Dig., pp. 234-237, October 1994.
[40] H. Yoshida, H. Suzuki, Y. Kinoshita, K. Imai, T. Akimoto, K. Tokashiki, and T. Yamazaki, "Process Integration Technology for Low Process Complexity BiCMOS using Trench Collector Sink," Bipolar/BiCMOS Circuits and Technology Meeting Tech. Dig., pp. 230-233, October 1994.
[41] J.M. Sung et al., "BEST2 - A High Performance Super-Aligned 3V/5V BiCMOS Technology, with Extremely Low Parasitics for Low-Power Mixed-Signal Applications," IEEE Custom Integrated Circuits Conf. Tech. Dig., pp. 15-18, May 1994.
[42] H.J. Shin, "Performance Comparison of Driver Configurations and Full-Swing Techniques for BiCMOS Logic Circuits," IEEE Journal of Solid-State Circuits, vol. 25, no. 3, pp. 863-865, June 1990.
[43] S.H.K. Embabi, A. Bellaouar, M.I. Elmasry, and R.A. Hadaway, "New Full-Voltage-Swing BiCMOS Buffers," IEEE Journal of Solid-State Circuits, vol. SC-26, pp. 150-153, February 1991.
[44] M. Hiraki, K. Yano, M. Minami, K. Sato, N. Matsuzaki, A. Watanabe, T. Nishida, K. Sasaki, and K. Seki, "A 1.5-V Full-Swing BiCMOS Logic Circuit," IEEE Journal of Solid-State Circuits, vol. 27, no. 11, pp. 1568-1574, November 1992.
[45] Y. Kobayashi, C. Yamaguchi, Y. Amemiya, and T. Sakai, "High Performance LSI Process Technology: SST CBiCMOS," International Electron Devices Meeting Tech. Dig., pp. 760-763, December 1988.
[46] K. Higashitani, H. Honda, K. Ueda, M. Hatanaka, and S. Nagao, "A Novel CBi-CMOS Technology by DIP Process," Symp. on VLSI Technology Tech. Dig., pp. 77-78, 1990.
[47] T. Maeda, K. Ishimaru, and H. Momose, "Lower Submicron FCBiMOS (Fully Complementary BiMOS) Process with RTP and MeV Implanted 5GHz Vertical PNP Transistor," Symp. on VLSI Technology Tech. Dig., pp. 79-80, 1990.
[48] W.R. Burger, C. Lage, B. Landau, M. DeLong, and J. Small, "An Advanced 0.8 Micron Complementary BiCMOS Technology for Ultra-High Speed Circuit Performance," Bipolar Circuits and Technology Meeting Tech. Dig., pp. 78-81, December 1990.
[49] S.W. Sun, et al., "A Fully Complementary BiCMOS Technology for Sub-Half-Micrometer Microprocessor Applications," IEEE Trans. Electron Devices, vol. 39, no. 12, pp. 2733-2739, December 1992.
[50] T. Ikeda, T. Nakashima, S. Kubo, A. Jonba, and M. Yamawaki, "A High Performance CBiCMOS with Novel Self-Aligned Vertical PNP," Bipolar/BiCMOS Circuits and Technology Meeting Tech. Dig., pp. 238-240, October 1994.
[51] ... Publishers, 1991.
[52] K. Izumi, M. Doken, and H. Ariyoshi, "CMOS Device Fabricated on Buried SiO2 Layers Formed by Oxygen Implanted into Silicon," Electron. Lett., vol. 14, pp. 593-594, 1978.
[53] G.G. Shahidi, T.H. Ning, R.H. Dennard and B. Davari, "SOI for Low-Voltage and High-Speed CMOS," International Conf. SSDM, Japan, pp. 265-267, 1994.
[54] ... Film CMOS/SIMOX Technology with Synchrotron X-ray Lithography," IEDM Tech. Digest, pp. 243-246, December 1993.
[55] M. Fujishima, K. Asada, Y. Omura and K. Izumi, "Low-Power 1/2 Frequency Dividers using 0.1-µm CMOS Circuits Built with Ultrathin SIMOX Substrate," IEEE Journal of Solid-State Circuits, vol. 28, no. 4, pp. 510-512, April 1993.
[56] T. Ohno, Y. Kado, M. Harada, and T. Tsuchiya, "A High-Performance Ultra-Thin Quarter-Micron CMOS/SIMOX Technology," IEEE Symposium on VLSI Technology Tech. Dig., pp. 25-26, 1993.
[57] Y. Yamaguchi, A. Ishibashi, M. Shimizu, T. Nishimura, K. Tsukamoto, K. Horie, and Y. Akasaka, "A High-Speed 0.6-µm 16K CMOS Gate Array on a Thin SIMOX Film," IEEE Trans. Electron Devices, vol. 40, no. 1, pp. 179-186, January 1993.
[58] J.P. Colinge, "Subthreshold Slope of Thin Film SOI MOSFETs," IEEE Electron Device Letters, pp. 274-276, September 1988.
[59] J.C. Sturm, K. Tokunaga, and J.P. Colinge, "Increased Drain Saturation Current in Ultrathin SOI MOS Transistors," IEEE Electron Device Letters, vol. 9, no. 9, pp. 460-?, September 1988.
[60] Y. Omura, S. Nakashima, K. Izumi, and T. Ishii, "0.1-µm-Gate Ultrathin Film CMOS Devices using SIMOX Substrate with 80-nm Thick Buried Oxide Layer," IEDM Tech. Dig., pp. 675-678, December 1991.
3
LOW-VOLTAGE DEVICE MODELING
The objective of this chapter is two-fold. It is intended to review the basics of the MOS transistor, which is a prerequisite for Chapters 4 to 7, and to introduce commonly used models of both MOS and bipolar devices [Sections 3.1, 3.2, and 3.6]. In this chapter we consider simple analytical models which can be used for circuit analysis and design of deep-submicrometer MOSFETs at low voltage. Also, a simple model to compute the leakage current of MOSFETs is presented [Section 3.3]. The more sophisticated SPICE device models are also presented to allow the reader to appreciate the meaning of the model parameters as well as the capabilities and limitations of these models. The SPICE parameters for the 0.8 µm CMOS/BiCMOS process presented in Chapter 2 are included in this chapter for readers who are interested in designing and simulating low-voltage CMOS circuits as well as BiCMOS circuits. In Section 3.4, supply voltage scaling due to reliability and power dissipation issues is presented.
surface charge of the semiconductor ($Q_S$, coul/cm²) is equal in magnitude to the charge of the gate electrode ($Q_G$, coul/cm²). Thus, we have

$$Q_S = -Q_G = -(V_{GS} - V_{FB} - \phi_s)\,C_{ox} \qquad (3.1)$$

where $V_{GS}$ is the gate-source voltage and $\phi_s$ is the semiconductor surface potential. $C_{ox}$ is the gate oxide capacitance per unit area and is given by

$$C_{ox} = \frac{\epsilon_{ox}}{t_{ox}} \qquad (3.2)$$
$Q_o$ is the total of all charges in the oxide and near the oxide/silicon interface. This charge is positive. The work function difference between the gate electrode and the semiconductor, $\phi_{ms}$, depends on the type of the electrode and the doping concentration of the semiconductor. For an aluminum electrode, we have

$$\phi_{ms} = -0.61 + \phi_f \qquad (3.4)$$

For an N+ polysilicon gate,

$$\phi_{ms} = -0.55 + \phi_f \qquad (3.5)$$

$$\phi_{fp} = -V_t \ln\left(\frac{N_a}{n_i}\right) \quad \text{for P-type Si} \qquad (3.6)$$

$$\phi_{fn} = +V_t \ln\left(\frac{N_d}{n_i}\right) \quad \text{for N-type Si} \qquad (3.7)$$
where K = K T / q . The charge Qs is the s u m of the charge in the depletion layer QB and the inversion layer QI.Therefore;
vos =
vrs
+ b,
QB +&I ___
(3.8)
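The Fermi potential and work function expressions above can be evaluated numerically. The following is a minimal sketch, assuming the signed Fermi potential convention (negative for P-type silicon), the gate offsets of -0.61 V and -0.55 V as read from the text, and an intrinsic concentration of 1.45e10 cm^-3 at 300 K. The function names and the default values are illustrative assumptions, not part of the original text.

```python
import math

K_B_OVER_Q = 8.617333e-5  # Boltzmann constant divided by electron charge, V/K

def thermal_voltage(temp_k=300.0):
    """Thermal voltage kT/q in volts."""
    return K_B_OVER_Q * temp_k

def fermi_potential(doping_cm3, substrate="p", n_i=1.45e10, temp_k=300.0):
    """Signed Fermi potential: negative for P-type, positive for N-type silicon.

    n_i = 1.45e10 cm^-3 is an assumed room-temperature value.
    """
    magnitude = thermal_voltage(temp_k) * math.log(doping_cm3 / n_i)
    return -magnitude if substrate == "p" else magnitude

def work_function_difference(phi_f, gate="al"):
    """Gate-to-semiconductor work function difference in volts.

    Offsets follow the values quoted in the text (assumed sign convention).
    """
    offset = {"al": -0.61, "n+poly": -0.55}[gate]
    return offset + phi_f

phi_f = fermi_potential(1e16, "p")          # hypothetical substrate doping
phi_ms = work_function_difference(phi_f, "al")
print(round(phi_f, 3), round(phi_ms, 3))
```

With a signed Fermi potential, the same helper covers both substrate types, which is why the sign is handled in one place rather than in the caller.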
The bulk depletion charge (per unit area) consists of ionized acceptors (P-type substrate) or donors (N-type substrate). The depletion charge of a P-type bulk, with zero bulk-source bias voltage ($V_{SB} = 0$), is given by

$$Q_{B0} = -q N_a W_D \qquad (3.9)$$

Figure 3.1 (a) The layout and cross-sectional view of an NMOS transistor; (b) Symbols of different types of MOS transistors.
where $q$ is the electron charge and $N_a$ is the acceptor concentration. The width of the depletion layer in the bulk ($W_D$) is given by

$$W_D = \sqrt{\frac{2 \epsilon_{si} \phi_s}{q N_a}} \qquad (3.10)$$
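The turn-on (or threshold) voltage of an NMOS transistor is defined as the gate-source voltage at which the surface potential $\phi_s$ is equal to $2|\phi_f|$. This condition also defines what is known as strong inversion. At the onset of strong inversion we can assume that $Q_S \approx Q_B$. Using Equation (3.8), we can write the following expression for the threshold voltage

$$V_{T0} = V_{FB} + 2|\phi_f| - \frac{Q_{B0}}{C_{ox}} \qquad (3.11)$$

$Q_{B0}$ is equal to $-q N_a W_{Dm}$, where $W_{Dm} = W_D(\phi_s = 2|\phi_f|)$. Thus, the threshold voltage can be rewritten as

$$V_{T0} = V_{FB} + 2|\phi_f| + \frac{\sqrt{2 q \epsilon_{si} N_a \, 2|\phi_f|}}{C_{ox}} \qquad (3.12)$$

If the bulk-source junction is reverse biased ($|V_{SB}| > 0$), the threshold voltage becomes

$$V_T = V_{FB} + 2|\phi_f| + \frac{\sqrt{2 q \epsilon_{si} N_a (|V_{SB}| + 2|\phi_f|)}}{C_{ox}} \qquad (3.13)$$

$$\gamma = \frac{\sqrt{2 q \epsilon_{si} N_a}}{C_{ox}} \qquad (3.14)$$

$$V_T = V_{T0} + \gamma \left( \sqrt{|V_{SB}| + 2|\phi_f|} - \sqrt{2|\phi_f|} \right) \qquad (3.15)$$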
This value is negative and is not suitable for digital circuits, where a positive $V_{T0}$ is required for switching. To get a reasonable $V_{T0}$, the device surface is implanted with boron. The implanted dose $D_I$ causes $V_{T0}$ to increase by the amount $q D_I / C_{ox}$. The threshold voltage is hence given by

$$V_{T0} = V_{FB} + 2|\phi_f| + \gamma \sqrt{2|\phi_f|} + \frac{q D_I}{C_{ox}} \qquad (3.16)$$
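The size of the implant term can be checked with a short calculation. This is only a sketch: the oxide thickness (15 nm) and substrate doping (9e15 cm^-3) are hypothetical values chosen for illustration and are not taken from the text, and the body-effect coefficient is computed from the usual expression sqrt(2 q eps_si N_a) / C_ox. Only the dose of 1.725e12 cm^-2 comes from the example in the text.

```python
import math

Q = 1.602176634e-19     # electron charge, C
EPS0 = 8.854e-14        # vacuum permittivity, F/cm
EPS_OX = 3.9 * EPS0     # SiO2 permittivity, F/cm
EPS_SI = 11.7 * EPS0    # silicon permittivity, F/cm

def oxide_capacitance(t_ox_cm):
    """Gate oxide capacitance per unit area, F/cm^2."""
    return EPS_OX / t_ox_cm

def body_factor(n_a_cm3, c_ox):
    """Body-effect coefficient gamma, in V^0.5."""
    return math.sqrt(2.0 * Q * EPS_SI * n_a_cm3) / c_ox

def implant_shift(dose_cm2, c_ox):
    """Threshold shift q*D_I/C_ox caused by an implanted dose, in volts."""
    return Q * dose_cm2 / c_ox

c_ox = oxide_capacitance(15e-7)          # 15 nm oxide: hypothetical value
gamma = body_factor(9e15, c_ox)          # 9e15 cm^-3 doping: hypothetical value
shift = implant_shift(1.725e12, c_ox)    # dose quoted in the text
print(round(gamma, 3), round(shift, 2))
```

Under these assumed values the shift is a little over one volt, which is the order of magnitude needed to move a negative unimplanted threshold to a usable positive one.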
Consider now the previous example. With $D_I = 1.725 \times 10^{12}$ cm$^{-2}$ and $\gamma = 0.238$ V$^{1/2}$, we find that $V_T$ is equal to 0.7 V when $|V_{SB}| = 0$ V and is equal to 0.98 V when $|V_{SB}| = 3.3$ V. The symbols of the NMOS and PMOS transistors are shown in Fig. 3.1(c). Typical values of $V_T$ are -2.5 V to -4 V for depletion-mode NMOS devices. For low-voltage CMOS they are 0.3 V to 0.8 V for enhancement-mode NMOS devices and -0.3 V to -0.8 V for enhancement-mode PMOS devices.

When $V_{GS} < V_{T0}$, the transistor is in the cutoff region, since no inversion layer exists, as shown in Fig. 3.2(a). The drain current is, therefore, approximately zero. When $V_{GS} > V_{T0}$, the channel is formed and a drain current flows from the drain to the source [Fig. 3.2(b)]. The transistor is in the linear region (also called the ohmic region) when $V_{GD}$ (i.e. $V_{GS} - V_{DS}$) $\geq V_T$. When $V_{GS} > V_T$ and $V_{DS} > V_{GS} - V_T$ (i.e. $V_{GD} < V_T$), the channel is pinched off as illustrated in Fig. 3.2(c) and the device enters the saturation region. The drain-source voltage which causes the channel to pinch off at the drain edge is commonly known as the saturation drain-source voltage $V_{DS,sat}$ and is equal to $V_{GS} - V_T$.
The voltage drop between the pinchoff point and the source is $V_{DS,sat}$. Any $V_{DS}$ higher than $V_{DS,sat}$ will appear between the pinchoff point and the drain. If we assume that the distance between the pinchoff point and the drain is extremely small compared with the overall length, then for $V_{DS} > V_{DS,sat}$ the drain current is constant. The carriers which reach the pinchoff point are swept across to the drain by the potential ($V_{DS} - V_{DS,sat}$) between the drain and the end of the channel.
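The body-effect example and the region boundaries described above can be reproduced in a few lines. This is a minimal sketch: the value 2|phi_f| = 0.7 V is an assumption (the text does not state it at this point), and the function names are hypothetical.

```python
import math

def threshold_voltage(v_t0, gamma, v_sb, two_phi_f):
    """Threshold voltage with body effect; all voltages in volts."""
    return v_t0 + gamma * (math.sqrt(abs(v_sb) + two_phi_f) - math.sqrt(two_phi_f))

def operating_region(v_gs, v_ds, v_t):
    """Classify an NMOS operating point as cutoff, linear, or saturation."""
    if v_gs <= v_t:
        return "cutoff"              # no inversion layer
    if v_ds <= v_gs - v_t:
        return "linear"              # V_GD >= V_T, channel not pinched off
    return "saturation"              # V_DS > V_DS,sat = V_GS - V_T

# V_T0 = 0.7 V and gamma = 0.238 V^0.5 are the values quoted in the text;
# 2|phi_f| = 0.7 V is an assumed value.
v_t = threshold_voltage(0.7, 0.238, 3.3, 0.7)
print(round(v_t, 2))
print(operating_region(3.3, 0.1, 0.7), operating_region(3.3, 3.3, 0.7))
```

Under this assumption the computed threshold at a 3.3 V bulk-source bias agrees with the 0.98 V quoted in the text.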
I-V characteristics of
(3.17)
We assume that the mobility ($\mu$) of the electrons in the channel of an NMOS device is constant. A current $I_{DS}$ crossing the incremental resistance $dR$ causes a voltage drop of

$$dV = I_{DS}\, dR \qquad (3.18)$$

Substituting from Equation (3.17) in Equation (3.18) and integrating from the source to the drain, we obtain
To solve this integration, we need to express the electron inversion charge density $Q_I(x)$ in terms of $V$. From Equation (3.8), we have

$$V_{GS} - V_{FB} - \phi_s(x) = -\frac{Q_{B0}}{C_{ox}} - \frac{Q_I(x)}{C_{ox}} \qquad (3.20)$$

The surface potential $\phi_s$ at any point $x$ along the channel is equal to $2|\phi_f|$ [Equation (3.11)] plus $V(x)$. By substituting $V_{T0}$ for $V_{FB} - Q_{B0}/C_{ox} + 2|\phi_f|$ in Equation (3.20) we get

$$Q_I(x) = -[V_{GS} - V_{T0} - V(x)]\, C_{ox} \qquad (3.21)$$
The surface potential at the drain is larger than that at the source by $V_{DS}$. Therefore, the magnitude of $Q_I$ decreases with the distance across the channel. This is why the inversion layer is triangular, as illustrated in Fig. 3.3. Assuming that $Q_{B0}$ is constant across the channel and substituting for $Q_I$ from Equation (3.21) into Equation (3.19), we obtain
where $k_p$ is a process-dependent parameter defined as $k_p = \mu C_{ox}$. Equation (3.24) is valid only for $V_{DS} \leq V_{DS,sat}$ (ohmic region). When $V_{DS}$ exceeds $V_{DS,sat}$ the drain-source current saturates. The saturation current can be found by substituting $V_{DS,sat}$ for $V_{DS}$ in Equation (3.24) and is hence given by
The characteristics of an MOS transistor based on Equations (3.24) and (3.25) are shown in Fig. 3.4. The current equations (3.24) and (3.25) have to be modified if the bulk-source voltage is greater than zero by replacing $V_{T0}$ by $V_T$ [see Equation (3.15)]. Note that when $V_{DS}$ is small (say 60 mV), Equation (3.24) can be approximated by
This equation expresses a linear relationship between $I_{DS}$ and $V_{GS}$. Using linear extrapolation, $V_{T0}$ and $k_p$ can be determined as shown in Fig. 3.4(b).
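Several of the equations in this derivation are not legible in this copy. The following worked derivation is a sketch of the standard long-channel result that the surrounding text describes, assuming a channel of width $W$ and length $L$ and the inversion charge of Equation (3.21). The symbols $W$ and $L$ and the intermediate forms are assumptions, and the equation numbers of the original are not reproduced.

```latex
% Incremental channel resistance (Q_I is negative for electrons)
dR = -\frac{dx}{W \mu Q_I(x)}

% Voltage drop dV = I_{DS}\,dR, integrated from source (x = 0, V = 0)
% to drain (x = L, V = V_{DS})
I_{DS} \int_0^{L} dx = W \mu C_{ox} \int_0^{V_{DS}} \left[ V_{GS} - V_{T0} - V \right] dV

% Ohmic region, with k_p = \mu C_{ox}
I_{DS} = k_p \frac{W}{L} \left[ (V_{GS} - V_{T0}) V_{DS} - \frac{V_{DS}^2}{2} \right],
\qquad V_{DS} \le V_{DS,sat} = V_{GS} - V_{T0}

% Saturation region, obtained by setting V_{DS} = V_{DS,sat}
I_{DS,sat} = \frac{k_p}{2} \frac{W}{L} (V_{GS} - V_{T0})^2

% Small V_{DS}: the quadratic term is negligible
I_{DS} \approx k_p \frac{W}{L} (V_{GS} - V_{T0}) V_{DS}
```

The last line is the linear relationship in $V_{GS}$ used for extracting $V_{T0}$ and $k_p$ by extrapolation.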
The measured I-V characteristics show that the drain current, in the saturation region, is a weak function of $V_{DS}$. This is due to the channel length modulation phenomenon, which can be explained as follows. Let us define

$$L'_{eff} = L_{eff} - \Delta L \qquad (3.27)$$

where $\Delta L$ is the width of the depletion layer between the pinchoff point and the drain as shown in Fig. 3.5. The voltage across this depletion layer is $V_{DS} - V_{DS,sat}$; therefore $\Delta L$ can be written as
If we assume that
The ratio
can
$$\frac{\Delta L}{L_{eff}} = \lambda V_{DS} \qquad (3.31)$$
The drain current model described so far is known as the LEVEL 1 (MOS1) model in SPICE⁴. This model is also called the Shichman-Hodges model. However, this model is still too simple⁵ to account for state-of-the-art CMOS devices and might lead to a 100% error in the current, particularly for low-voltage deep-submicrometer CMOS devices. Nevertheless, $k_p$ (or $\mu$) can be used as a fitting parameter to reduce this error. This model is most suitable for preliminary analysis.

⁴ SPICE 2G6 or 3B1 or 3C1.
⁵ This model was used in the 70's.
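For preliminary hand analysis, the LEVEL 1 description above reduces to a few lines of code. This is a minimal sketch, assuming the standard Shichman-Hodges square-law form and applying the channel length modulation factor (1 + lambda * V_DS) in both the ohmic and saturation regions so that the current stays continuous at the boundary, as SPICE implementations commonly do. The function name and the numerical values in the usage lines are hypothetical.

```python
def level1_drain_current(v_gs, v_ds, v_t, kp, w, l, lam=0.0):
    """Square-law NMOS drain current in amperes.

    kp is mu*C_ox in A/V^2, w and l share the same length unit,
    lam is the channel length modulation parameter in 1/V.
    """
    v_ov = v_gs - v_t                     # gate overdrive
    if v_ov <= 0.0:
        return 0.0                        # cutoff: weak inversion is ignored here
    beta = kp * w / l
    clm = 1.0 + lam * v_ds                # channel length modulation factor
    if v_ds < v_ov:
        return beta * (v_ov * v_ds - 0.5 * v_ds ** 2) * clm   # ohmic region
    return 0.5 * beta * v_ov ** 2 * clm                        # saturation

# Hypothetical device: kp = 100 uA/V^2, W/L = 10, V_T = 0.7 V
i_sat = level1_drain_current(3.3, 3.3, 0.7, 100e-6, 10.0, 1.0)
i_lin = level1_drain_current(3.3, 0.05, 0.7, 100e-6, 10.0, 1.0)
print(i_sat, i_lin)
```

Because the model ignores velocity saturation and mobility degradation, the currents it predicts for short-channel devices are optimistic, which is consistent with the large errors mentioned in the text.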
• A model for mobility degradation with the vertical and the horizontal electric fields;
• A model for the threshold voltage of short- and narrow-channel devices (the Drain Induced Barrier Lowering ...
The correction factor for the short-channel effect is based on a modified trapezoidal approach for calculating the charge $Q_B$ [Fig. 3.6]. The correction factor can be obtained from [3]
where $W_c$ is the depletion layer width of a cylindrical junction and is given by

$$\frac{W_c}{x_j} = 0.0831353 + 0.8013929 \left(\frac{W_D}{x_j}\right) - 0.0111077 \left(\frac{W_D}{x_j}\right)^2 \qquad (3.35)$$
MOS is given by
where $\theta$ is an empirical constant which depends on the oxide thickness. A typical value of $\theta$ is 0.05. To account for the effect of the lateral average electric field, the effective mobility is related to the drain-source voltage and the channel length by [4]
(3.38)
In this expression, when the device operates in saturation, $V_{DS}$ is replaced by $V_{DSsat}$.
(3.39)
(3.40)
where
$V_{on} = V_T + n V_t$
and
$n = 1 + \frac{q N_{FS}}{C_{ox}} + \frac{C_d}{C_{ox}}$    (3.41)
where
$C_d = \frac{dQ_B}{dV_{BS}}$    (3.42)
and $N_{FS}$ is a curve fitting parameter. $V_{on}$ marks the point between the weak and strong inversion modes. Typical values of $n$ range from 1.0 to 2.5. $I_{on}$ is related to the current of Equation (3.39) by taking $V_{GS} = V_{on}$. Fig. 3.7 illustrates the transfer characteristics of the weak inversion and drift model. The voltage $V_{on}$ ensures the continuity of the current, but it is clear from the figure that at $V_{GS} = V_{on}$ a discontinuity exists in the derivative. Therefore, the MOS3 model is not precise in simulating the intermediate region where the diffusion and drift currents are comparable.
In strong inversion, the drain current can be expressed as
d m ) + FNV(=)
(3.44)
(3.45)
$V_T(x) = V_T + (1 + F_B) V(x)$
$I_{DS} = \mu_{eff} C_{ox} \frac{W_{eff}}{L_{eff}} \left[ V_{GS} - V_T - \frac{1 + F_B}{2} V_{DS} \right] V_{DS}$    (3.47)
The saturation voltage, which takes into account the carrier velocity saturation effect, is given by
$V_{DSsat} = V_a + V_b - \sqrt{V_a^2 + V_b^2}$    (3.48)
where
(3.49)
(3.50)
$V_b = v_{max} L_{eff} / \mu_s$
Table 3.1 shows the correspondence between the CMOS device and HSPICE parameters. Typical values for parameters of LEVEL 3 are shown in Table 3.2 for MOS devices of the 0.8 µm BiCMOS process described in Chapter 2.
The LEVEL 3 model approximates the device physics and relies on the proper choice of the empirical parameters to accurately reproduce the device characteristics.
Drain-induced barrier lowering effect;
Effect of non-uniform doping in the channel surface and sub-surface regions;
Table 3.1

Parameter   Description
LEVEL       Model level
VTO         Zero-bias threshold voltage
TOX         Gate oxide thickness
NSUB        Substrate doping
NFS         Surface fast state density
UO          Surface mobility
VMAX        Maximum drift velocity of carriers
ETA         Static feedback on threshold voltage
KAPPA       Saturation field factor
THETA       Mobility degradation factor
DELTA       Width effect on threshold voltage
XJ          Junction depth
CJ          Zero-bias bulk junction capacitance
JS          Bulk junction saturation current
JSW         Sidewall bulk junction saturation current
MJ          Bulk junction grading coefficient
PB          Junction potential
CJSW        Zero-bias sidewall capacitance
MJSW        Sidewall capacitance grading coefficient
CGDO        Gate-drain overlap capacitance
CGSO        Gate-source overlap capacitance
CGBO        Gate-bulk overlap capacitance
RD          Drain ohmic resistance
RS          Source ohmic resistance
LD          Lateral diffusion from drain or source
WD          Lateral diffusion along the width
XL          Masking and etching effects on L
XW          Masking and etching effects on W
ACM         Area calculation method
LDIF        Lateral diffusion beyond the gate
(LEVELs1) ( 0 8 p m BxC-
N.Channel
3 0.8 17.5 Y 10-9 3.23 x 10" 820 Y 10s 503 150 x lo8 45 Y lo-*
PChannel 3
-0.9 17.5 x 10-9 3.37 Y 1 0 ' 6 764 Y 10' 165 190 x 108 121 x 10-8 1.45 135 x 10-3 0.336 230 x 450 x lo-' 5 x 10-4 5.5 Y 10-8 0.50 0.92 212 x 10-'1 0.30 215 x lo-" 215 Y lo-'> 571 x lo-'' 1189 1189 0.
0.
Units
uo
6.7
63.4 x
10-3
lo-'
fl 728
C J
JS JSW MJ PB CJSW MJSW CGDO CGSO CGBO RD
n.m ...
0.92 205 x lo-'' 0.30 274 x 274 x 10-12 571 x 10-l' 596 596 59.5 x 10-9 0.
0.
RS
LD WD XL
xw
ACM LDIF
0.
0. 2 1 x 10-8
0.
2 940 x 10Wo
Depletion charge sharing by the drain and source;
Channel-length modulation;
Dependence of some electrical parameters on drain and substrate biases;
Better modeling of weak-, medium-, and strong-inversion regions and elimination of the discontinuity problem in the drain current; and
Geometric dependencies.
3.2.3.1 Threshold voltage
$V_T = V_{FB} + \phi_s + K_1 \sqrt{\phi_s + |V_{BS}|} - K_2 (\phi_s + |V_{BS}|) - \eta V_{DS}$    (3.51)
The two parameters, $K_1$ and $K_2$, model the effect of non-uniform doping of the substrate on the threshold voltage. Typical values for $K_1$ and $K_2$ are 1 V$^{1/2}$ and 0.12 respectively. The factor $\eta$ models the DIBL effect and accounts for the channel-length modulation effect. It is a function of $V_{DS}$ and $V_{BS}$.
3.2.3.2 Drain current
PO 1t UO(V0S - VT) (1
*
'=f)
+ $$V,,)
" )
(3.52) (3.53)
(3.54)
where
$a = 1 + \frac{g K_1}{2} (\phi_s + |V_{BS}|)^{-1/2}$
and
$g = 1 - \frac{1}{1.744 + 0.836 (\phi_s + |V_{BS}|)}$
The parameters $U_0 = U_0(V_{BS})$, $U_1 = U_1(V_{BS})$ and $\mu_0 = \mu_0(V_{BS}, V_{DS})$ are bias sensitive. For $V_{DS} > V_{DSsat}$, the drain current is given by
where
$K = \frac{1 + v_c + \sqrt{1 + 2 v_c}}{2}$    (3.56)
and
(3.57)
The drain-source saturation voltage is given by
(3.58)
and (3.61)
The factor $e^{1.8}$ is empirical and is chosen to achieve the best fit. The subthreshold parameter $n$ is a function of $V_{DS}$ and $V_{BS}$.
(3.62)
where $P_0$ is an arbitrary parameter, and $L_{P0}$ and $W_{P0}$ are the L and W sensitivity factors of $P_0$.
Another deep-submicrometer MOSFET model, called BSIM3 [8], has been developed for circuit simulation. It uses improved threshold voltage, drain current and channel-length modulation models. The model is also simple and has a small number of parameters (about 25).
3.2.4 MOS Capacitances
In transient simulation, MOS capacitances are very important for the analysis of CMOS and BiCMOS circuits. The MOS capacitances can be divided into two types of lumped capacitors:
the depletion capacitors of the bulk-drain and bulk-source junctions ($C_{BD}$ and $C_{BS}$) [Fig. 3.8].
the capacitors associated with the gate ($C_{GS}$, $C_{GD}$, $C_{GB}$, $C_{GSov}$, $C_{GDov}$ and $C_{GBov}$) [see Fig. 3.8, except for $C_{GBov}$].
3.2.4.1
The bulk-source and the bulk-drain junctions have a bottom area $A_S$ and $A_D$ respectively and a sidewall with a perimeter $P_S$ and $P_D$ respectively. Both the bottom area and the sidewall contribute to the total depletion capacitance. The bottom area capacitance is measured per unit area, while the sidewall capacitance is measured per unit perimeter. Both of these components are voltage dependent. As these junctions are normally reverse biased, we will consider the case when the bulk-source and bulk-drain voltages ($V_{BS}$ and $V_{BD}$) are less than or equal to $0.5 \phi_j$ ($\phi_j$ is the junction built-in potential). The total bulk-source and bulk-drain capacitances can be expressed by the following relations [1]
The exponential factors $M_j$ and $M_{jsw}$ are in the order of 0.3-0.5. $C_j$ is the zero-bias capacitance of the bottom junction per unit area and $C_{jsw}$ is the zero-bias capacitance per unit perimeter.
The fixed overlap capacitances: gate-drain ($C_{GDov}$), gate-source ($C_{GSov}$), and gate-bulk ($C_{GBov}$) overlap capacitances. Both $C_{GSov}$ and $C_{GDov}$ exist due to the lateral diffusion of the source and drain under the gate. They are usually given per unit width as $C_{GSO}$ and $C_{GDO}$. The total gate-source and gate-drain overlap capacitance is given by:
$C_{GSov} = C_{GSO} W_{eff}$    (3.65)
$C_{GDov} = C_{GDO} W_{eff}$    (3.66)
where $C_{GSO}$ and $C_{GDO}$ are equal to $C_{ox} L_D$. The capacitor $C_{GBov}$ is due to the overlap of the gate oxide and the bulk along the channel length at both ends of the active area of the transistor. This capacitance is typically normalized to the effective channel length; the total $C_{GBov}$ is hence given by
$C_{GBov} = C_{GBO} L_{eff}$    (3.67)
The nonlinear capacitance due to the charge of the bulk or the channel. This capacitance is actually distributed but can be modeled by lumped capacitances. In the case when the channel does not exist, the capacitance can be expressed as
$C_{GB} = C_{ox} W_{eff} L_{eff}$    (3.68)
When the device is in the linear region the channel extends uniformly to the drain. The channel shields the bulk and the capacitance exists only between the gate and the channel. The gate-bulk capacitance goes to zero. The gate-channel capacitance can be expressed in terms of two equal lumped capacitances, a gate-source and a gate-drain capacitance, which are denoted $C_{GS}$ and $C_{GD}$ and are given by
$C_{GS} = C_{GD} = \frac{1}{2} C_{ox} W_{eff} L_{eff}$    (3.69)
Finally, when the device enters saturation, the channel at the drain pinches off and hence the gate-drain capacitance component becomes zero while the gate-source capacitance can be expressed by
$C_{GS} = \frac{2}{3} C_{ox} W_{eff} L_{eff}$    (3.70)
Fig. 3.9 depicts the change of the capacitance components as a function of the gate-source voltage (assuming that the source-bulk voltage is zero). The total gate-source capacitance is given by the summation of $C_{GSov}$ and $C_{GS}$ and, similarly, the total gate-drain capacitance is given by the summation of $C_{GDov}$ and $C_{GD}$. The capacitance model described above can be used for circuit analysis and circuit design. SPICE uses a charge-control model, which was developed by Ward and Dutton [9]. This model is based on the actual distribution of charge in the MOS structure and its conservation.
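The piecewise gate capacitance model of Equations (3.65) to (3.70) can be sketched in a few lines of Python. This is only an illustrative sketch: the function name, the argument names and the explicit `region` argument are assumptions made for the example (a real simulator derives the region from the bias), and the gate-bulk term is taken as zero in saturation on the assumption that the channel still shields the bulk.

```python
def gate_capacitances(region, c_ox, w_eff, l_eff,
                      c_gso=0.0, c_gdo=0.0, c_gbo=0.0):
    """Return total (C_GS, C_GD, C_GB) in farads for one operating region.

    region : 'off' (no channel), 'linear' or 'saturation'
    c_ox   : gate oxide capacitance per unit area (F/m^2)
    c_gso, c_gdo : overlap capacitances per unit width (F/m)
    c_gbo  : gate-bulk overlap capacitance per unit length (F/m)
    """
    c_gate = c_ox * w_eff * l_eff
    if region == "off":
        # No channel: the whole gate capacitance appears to the bulk (3.68).
        c_gs, c_gd, c_gb = 0.0, 0.0, c_gate
    elif region == "linear":
        # Channel shields the bulk; equal split between source and drain (3.69).
        c_gs, c_gd, c_gb = 0.5 * c_gate, 0.5 * c_gate, 0.0
    elif region == "saturation":
        # Pinch-off at the drain: only the gate-source part remains (3.70).
        c_gs, c_gd, c_gb = (2.0 / 3.0) * c_gate, 0.0, 0.0
    else:
        raise ValueError("region must be 'off', 'linear' or 'saturation'")
    # Add the fixed overlap terms (3.65)-(3.67).
    return (c_gs + c_gso * w_eff,
            c_gd + c_gdo * w_eff,
            c_gb + c_gbo * l_eff)
```

Because the model is piecewise, the returned values jump at the region boundaries, which is one reason a charge-based formulation is preferred inside the simulator.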
3.3.1
The threshold voltage, $V_T$, has several definitions which are important for the estimation of the static power dissipation. The first definition is the extrapolated threshold voltage from the characteristic $I_{DS}$-$V_{GS}$ [see Section 3.2.1]. Another one is the constant-current (i.e., 10 nA per unit width) threshold voltage. These voltages do not have the same value [10, 11]. The extrapolated $V_T$ is approximately 0.2 V higher than the constant-current one [11]. The extrapolated threshold voltage should be scaled down proportionally to the supply voltage. This is because the drive (saturation) current depends on ($V_{DD} - V_{T(extrapolated)}$).
$I_{DS,sub} = \frac{W_{eff}}{W_0} I_0 \, 10^{(V_{GS} - V_T)/S}$    (3.71)
where $V_T$ here is the constant-current threshold voltage. $I_0$ and $W_0$ are the drain current and the gate width used to define $V_T$. $S$ is the subthreshold swing parameter, which is the gate voltage swing required to reduce the drain current by one decade. The current $I_0$ is related to $V_{DS}$ by
$I_0 = I_0' \left( 1 - e^{-V_{DS}/V_t} \right)$    (3.72)
$S = \frac{kT}{q} \ln 10 \left( 1 + \frac{C_d}{C_{ox}} \right)$ V/decade    (3.73)
where $C_d$ is the depletion-layer capacitance of the source/drain junctions. Thus, $S$ has a theoretical minimum limit which is 60 mV/decade.
The leakage current, due to the subthreshold conduction, is computed from $I_{DS,sub}$ when $V_{GS} = 0$. Then
$I_{leak} = \frac{W_{eff}}{W_0} I_0 \, 10^{-V_T/S}$    (3.74)
Using the examples of Fig. 3.10, typical values for constant-current and extrapolated threshold voltages are 0.3 V and 0.5 V respectively. The parameter $S$ is equal to 75 mV/decade and the leakage current is equal to 1 pA/µm. When estimating the static power dissipation, the worst-case leakage current has to be evaluated. In this case, the worst-case threshold voltage, $V_{Tw}$, has to be used where
$V_{Tw} = V_T - \Delta V_T$    (3.75)
$\Delta V_T$ is the variation of the threshold voltage due to the fluctuation of process parameters such as the oxide thickness, doping profile, junction depth, gate lengths and widths, etc. $\Delta V_T$ can be as high as 50 mV on the same wafer and 150 mV for different wafers. This results in almost two decades of leakage current increase. The temperature effect also has to be considered when the leakage current is computed. The temperature affects both $V_T$ and $S$. A typical value of the temperature coefficient of the threshold voltage is a 1.6 mV decrease per degree Celsius. The subthreshold swing $S$ increases by 0.25 mV/decade per degree Celsius [see Equation (3.73)]. For example, if the temperature increases from 25 °C to 75 °C, the threshold voltage decreases by 80 mV and the leakage current equals 30 pA/µm (initial extrapolated $V_T$ = 0.5 V). This value is 30 times higher than that at 25 °C. Both the temperature and process effects can result in a drastic increase of the worst-case static power dissipation. Note that this variation of $V_T$ greatly affects the delay of CMOS circuits at low supply voltage, since the drive current is proportional to ($V_{DD} - V_T$).
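The numbers quoted above can be checked with a short Python sketch of Equation (3.74). It assumes that the reference current per unit width, $I_0/W_0$, is the 10 nA/µm used in the constant-current definition of $V_T$, that $V_{DS}$ is large enough for the factor in Equation (3.72) to be close to one, and that both temperature coefficients are linear. The function names are illustrative only.

```python
def leakage_per_um(v_t, s, i0_per_um=10e-9):
    """Subthreshold leakage per um of width at V_GS = 0, Eq. (3.74).

    v_t : constant-current threshold voltage (V)
    s   : subthreshold swing (V/decade)
    i0_per_um : assumed reference current per unit width (A/um)
    """
    return i0_per_um * 10.0 ** (-v_t / s)


def at_temperature(v_t_25, s_25, temp_c,
                   dvt_dt=-1.6e-3, ds_dt=0.25e-3):
    """Shift V_T and S linearly from their 25 C values (assumed linear)."""
    dt = temp_c - 25.0
    return v_t_25 + dvt_dt * dt, s_25 + ds_dt * dt


# Nominal case: V_T = 0.3 V, S = 75 mV/decade -> about 1 pA/um.
i_25 = leakage_per_um(0.3, 0.075)

# Hot case: 75 C lowers V_T by 80 mV and raises S to 87.5 mV/decade.
v_t_75, s_75 = at_temperature(0.3, 0.075, 75.0)
i_75 = leakage_per_um(v_t_75, s_75)

# Process corner: a 150 mV drop in V_T costs two decades at S = 75 mV/decade.
i_corner = leakage_per_um(0.3 - 0.15, 0.075)

print(f"25 C: {i_25:.2e} A/um, 75 C: {i_75:.2e} A/um, "
      f"ratio {i_75 / i_25:.1f}, corner ratio {i_corner / i_25:.0f}")
```

Under these assumptions the sketch gives about 1 pA/µm at 25 °C and about 30 pA/µm at 75 °C, in line with the values in the text.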
the vertical electric field in the inversion layer. At this point we prefer to use the symbol $\mu_{eff}$ for the mobility to denote its dependence on the vertical electric field. Also, the velocity ($v$) is no longer proportional to $E$ but is given by the following two-region piecewise empirical model [14]
where
$E_c = \frac{2 v_{sat}}{\mu_{eff}}$    (3.77)
where the saturation velocity is equal to $8 \times 10^6$ cm/s for electrons (NMOS device) and $6.5 \times 10^6$ cm/s for holes (PMOS device). The drain current in the triode region ($V_{DS} \le V_{DSsat}$) is given by [13]
$I_{DSsat} = v_{sat} C_{ox} W_{eff} (V_{GS} - V_T - V_{DSsat})$    (3.79)
By equating (3.78) and (3.79) we can derive the following expression for $V_{DSsat}$
$V_{DSsat} = (1 - K)(V_{GS} - V_T)$    (3.80)
(3.81)
where
(3.82)
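The saturation voltage and current of this velocity-saturation model can be sketched as follows. The expression used for $K$, namely $K = 1 / (1 + E_c L_{eff} / (V_{GS} - V_T))$, is an assumption: it is what results from equating the saturation current of Equation (3.79) with the triode current of the two-region velocity model of [13], taken here as $I_{DS} = W_{eff} \mu_{eff} C_{ox} (V_{GS} - V_T - V_{DS}/2) V_{DS} / [L_{eff} (1 + V_{DS} / (E_c L_{eff}))]$. The function name and the SI units are also choices made for the example.

```python
def velocity_saturation(v_gs, v_t, l_eff, w_eff, c_ox, mu_eff, v_sat):
    """Return (E_c, K, V_DSsat, I_DSsat) in SI units.

    l_eff, w_eff in m, c_ox in F/m^2, mu_eff in m^2/(V s), v_sat in m/s.
    """
    v_gt = v_gs - v_t
    if v_gt <= 0.0:
        raise ValueError("model applies only above threshold (V_GS > V_T)")
    e_c = 2.0 * v_sat / mu_eff                 # critical field, Eq. (3.77)
    k = 1.0 / (1.0 + e_c * l_eff / v_gt)       # assumed form of K
    v_dsat = (1.0 - k) * v_gt                  # Eq. (3.80)
    i_dsat = v_sat * c_ox * w_eff * (v_gt - v_dsat)   # Eq. (3.79)
    return e_c, k, v_dsat, i_dsat


# Illustrative NMOS: L = 0.5 um, W = 10 um, mobility 400 cm^2/(V s),
# v_sat = 8e6 cm/s (all device values here are placeholders).
e_c, k, v_dsat, i_dsat = velocity_saturation(
    v_gs=3.3, v_t=0.7, l_eff=0.5e-6, w_eff=10e-6,
    c_ox=3.45e-3, mu_eff=0.04, v_sat=8e4)
```

For a long channel ($E_c L_{eff}$ much larger than $V_{GS} - V_T$) $K$ tends to zero and $V_{DSsat}$ approaches $V_{GS} - V_T$; for a short channel $K$ tends to one and the saturation current becomes linear in the gate overdrive.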
Note that $V_T$, in the current equation, is the extrapolated threshold voltage. The mobility $\mu_{eff}$ for electrons can be expressed as [15]
$\mu_n = 240 \sqrt{0.06 \, t_{ox} / (V_{GS} + V_T)}$
f m NC ply-gate
fm ' P
fop
(3 83)
..=(
POlY- gate
N i p l y - gate
(3.84)
where $t_{ox}$ is in Å and the mobility is in cm²/(V·s). This analytical model can be used for gate lengths down to the deep-submicron range.
3.4
Scaling device feature size has been used to increase packing density and speed. MOSFET scaling can follow three theories:
1. Constant Electric Field (CE) scaling [16].
2. Constant Voltage (CV) scaling [17].
3. Quasi-Constant Voltage (QCV) scaling.
[Table 3.3: scaling expressions for dimensions, gate oxide, doping, voltage, capacitance, current, gate delay, dynamic power and dynamic energy]
In the CE scheme all horizontal and vertical dimensions and voltages scale linearly with the same factor. In the CV scheme, the dimensions are scaled, while the voltages are kept constant. This scenario has been the most commonly used. While the constant electric field scaling is natural from the device physics point of view, the constant voltage scaling is more practical from the systems standpoint. Changing the supply voltage every technology generation (when the feature sizes are scaled) is too expensive because multiple power-supply generators will be required for each PC board. However, as the channel length scales below about 0.6 µm the 5 V supply voltage must be reduced for reliability reasons (e.g. hot carrier effects, breakdown, etc.). The quasi-constant voltage (QCV) scaling is an intermediary scheme between the CE and CV views. The scaling factors of the horizontal dimensions and the voltage are denoted by $k_h$ and $k_v$, respectively. Table 3.3 summarizes the scaling of the important device parameters according to the three theories as a function of the horizontal scaling factor ($k_h$). Note that in the QCV scheme, the dimensions scale more aggressively than the voltage ($k_v = k_h^{0.5}$).
$(W/L) \, C_{ox} \, (V_{GS} - V_T)^{1.5}$    (3.85)
Thk expression is not far fiom the one propored by [El. Table 3.3 shows the erect of device sealing on the delay, power and energy. It is assnmed that a gate
drives other gates, where the load is mainly the gate cspscithnce. The threshold voltage is sealed proportional to VDD rcsling. The gate delays imprave with scaling for all the scenarios, but with II better rate in the CV scheme. However. the dynamic power. at maximal frequency, of the gate increases by a factor k ; ' in the case of CV. For the CE scheme, the power is reduced by a high factor equal to kF6. Also in this Table, the dynamic energy dissipated by a gate is reported. This is independent of fkquency. For all schemes, it has improved significantly, particularly for the CE case.
Scaling the supply voltage is an efficient way to reduce the power consumption. However, to get a better performance at low voltage the device sizes and the threshold voltage have to be properly scaled. For a fixed sub-micron technology, the supply voltage cannot be reduced aggressively, otherwise the speed is degraded. However, for each fixed technology generation, there is a lower limit of the power supply voltage, $V_{DD,min}$ [18]. For values of $V_{DD}$ higher than this minimum limit the speed does not improve significantly. Typical values for $V_{DD,min}$ are 3.3 V and 2.5 V for $L_{eff}$ of 0.5 µm and 0.3 µm, respectively. On the other hand, the higher limit of $V_{DD}$ is driven by the reliability and the power dissipation limitation. The value of this $V_{DD}$ is proportional to the square root of the design rules ($\delta$) [19]. For 0.6 µm and 0.3 µm design rules with LDD structure, these high limits are 4.5 V and 3.3 V, respectively.
= Za t L d
+
+
Ira
IPE
(3.87)
(3.88)
4 = I,&
+I
Note that it has been assumed that the base and collector currents are flowing into the device, while the emitter current is flowing out of it [Fig. 3.12]. The emitter injection efficiency, which is defined as the ratio of the electron current injected into the base to the total emitter current, is given by
(3.89)
This ratio has to be near unity; that is, the emitter current should mostly be due to electrons for an NPN transistor. The current gain is defined as the ratio
$\beta = \frac{I_C}{I_B}$    (3.90)
When the emitter-base junction is reverse-biased and the collector-base junction is forward-biased, the transistor is in the inverse region where the emitter and collector may be exchanged. When both junctions are reverse-biased the transistor is in the cutoff region. But when they are both forward-biased, the device is said to be in the saturation region. In this situation, both junctions are injecting into the base, and the small electric fields in the two depletion regions sweep the carriers into the emitter and collector regions. Both junctions collect as well as emit.
$I_E = I_{nC} + I_{pE}$    (3.91)
The current due to the holes injected from the base into the emitter is given by [20]
$I_{pE} = \frac{q A_E D_{pE} \, p_{nE0}}{W_E} \left[ e^{V_{BE}/V_t} - 1 \right]$    (3.92)
where $p_{nE0}$ is the equilibrium hole concentration in the emitter and $W_E$ is the neutral emitter width. The current $I_{nC}$ is dominated by the diffusion current in the base and is proportional to the gradient of the minority carriers (electrons) in the neutral base. Because the neutral base width ($W_B$) is very thin, this gradient is approximately a constant. Therefore, we can write $I_{nC}$ as [20]
$I_{nC} = q A_E D_{nB} \left[ \frac{n_B(0) - n_B(W_B)}{W_B} \right]$    (3.93)
where $n_B(0)$ and $n_B(W_B)$ are the electron concentrations at the edges of the emitter-base and collector-base depletion regions respectively [see Fig. 3.13]. Note that the slope of the electron concentration in the base is given by the term between the brackets, as demonstrated by Fig. 3.13.
' B ? app~ying KCL (i
If thc recombination in the bsrc i s n&c$cd bstuten LB and I.o. j l s . / w e that I,., ri L o .
I,
+ I~
I, = 0).
scL t h t
is the differcncc
0). we can
(LB =
Using the junction law, the electron concentrations $n_B(0)$ and $n_B(W_B)$ can be expressed in terms of $V_{BE}$ and $V_{BC}$ respectively. The current $I_{nC}$ can hence be given by [20]
$I_C = I_{nC} - I_{pC}$    (3.95)
The current $I_{pC}$ is due to the holes injected from the base to the collector⁸. The base-collector junction is basically a P+NN+ structure as shown in Fig.
⁸ Note that $I_{pC}$ was not included in Equation (3.88) because in deriving Equation (3.86) we have assumed that the collector-base junction was reverse biased.
where $p_{nC0}$ is the equilibrium hole concentration in the collector, $W_C$ is the epitaxial thickness under the base and $\tau_p$ is the hole lifetime in the epitaxial layer. By substituting from Equations (3.92) and (3.94) in Equation (3.91) and from Equations (3.94) and (3.96) in Equation (3.95) we get the following equations for $I_E$ and $I_C$
$I_E = I_F - \alpha_R I_R$    (3.97)
$I_C = -I_R + \alpha_F I_F$    (3.98)
Equations (3.97) and (3.98) are called the Ebers-Moll equations. Fig. 3.14 shows the equivalent circuit of the BJT based on the Ebers-Moll equations. The Ebers-Moll model described above is general and can be used for any region of operation by substituting the appropriate values for $V_{BE}$ and $V_{BC}$. In the forward active region, assuming that $V_{BE}$ = 0.8 V and $V_{BC}$ < 0.3 V, the emitter and collector currents of Equations (3.97) and (3.98) reduce to
$I_E = I_F \approx I_{ES} \, e^{V_{BE}/V_t}$    (3.102)
where the reverse saturation current of the base-emitter junction, $I_{ES}$, can be derived from Equation (3.99) and is given by
Figure 3.14 BJT model based on the Ebers-Moll equations.
-F
Ql
(3.105)
Equations (3.102), (3.103) and (3.105) are the well-known current equations of a forward-biased bipolar transistor. Note that Equation (3.105) yields the famous relation between $\alpha_F$ and the DC forward current gain $\beta$ [$\beta = \alpha_F / (1 - \alpha_F)$].
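The Ebers-Moll equations (3.97) and (3.98) can be evaluated numerically with the sketch below. The diode expressions used for $I_F$ and $I_R$ are the standard ideal-diode forms and are an assumption here, as are the function names and the numerical values, which are placeholders rather than parameters of any particular process.

```python
import math


def ebers_moll(v_be, v_bc, i_es, i_cs, alpha_f, alpha_r, v_t=0.02585):
    """Return (I_E, I_C) from the Ebers-Moll equations.

    I_F and I_R are assumed to be ideal diode currents of the
    base-emitter and base-collector junctions respectively.
    """
    i_f = i_es * (math.exp(v_be / v_t) - 1.0)
    i_r = i_cs * (math.exp(v_bc / v_t) - 1.0)
    i_e = i_f - alpha_r * i_r        # Eq. (3.97)
    i_c = alpha_f * i_f - i_r        # Eq. (3.98)
    return i_e, i_c


def beta_from_alpha(alpha_f):
    """DC forward current gain: beta = alpha_F / (1 - alpha_F)."""
    return alpha_f / (1.0 - alpha_f)


# Forward active region: V_BE = 0.8 V, collector-base junction reverse biased.
i_e, i_c = ebers_moll(v_be=0.8, v_bc=-2.0, i_es=1e-16, i_cs=2e-16,
                      alpha_f=0.99, alpha_r=0.495)
```

In the forward active region the reverse diode term is negligible, so the ratio `i_c / i_e` is essentially $\alpha_F$, and an $\alpha_F$ of 0.99 corresponds to a current gain of about 99.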
The simple Ebers-Moll model lacks accuracy for the following three reasons:
1. It does not account for the parasitic resistors of the emitter, base and collector.
2. It does not account for the Early effect, which causes the collector current to increase as the collector-emitter voltage increases.
3. It does not account for the effect of the high collector currents on the current gain.
Next, we will discuss the modeling of each phenomenon separately.
(3.106)
(3.107)
+ RcIc + REIE
The drop across the parasitic resistors has to be accounted for to get more accurate results from the EM model. Neglecting these drops may even lead to erroneous results. For example, if the external collector-emitter voltage is found to be equal to 2 V one may deduce that the BJT operates in the active region. However, if $R_C$ = 1.8 kΩ, $R_E$ = 0.1 kΩ and $I_C \approx I_E$ = 1 mA, then the intrinsic collector-emitter voltage ($V'_{CE}$) is 0.1 V. This implies that the bipolar transistor is actually saturated. This phenomenon is known as quasi-saturation.
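The arithmetic of this example can be written out as a small Python sketch. It assumes that the external and intrinsic voltages are related by $V_{CE} = V'_{CE} + R_C I_C + R_E I_E$; the function name is illustrative.

```python
def intrinsic_vce(v_ce_ext, i_c, i_e, r_c, r_e):
    """Intrinsic collector-emitter voltage after removing the ohmic drops."""
    return v_ce_ext - r_c * i_c - r_e * i_e


# Values from the example: 2 V external, 1.8 kOhm and 0.1 kOhm, 1 mA.
v_ce_int = intrinsic_vce(2.0, i_c=1e-3, i_e=1e-3, r_c=1.8e3, r_e=0.1e3)
print(f"intrinsic V_CE = {v_ce_int:.2f} V")
```

The result is about 0.1 V, so a device that appears to be in the active region from its terminals is in fact saturated internally.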
3.5.2.2
The Early effect refers to the base width modulation due to the change of the collector-base reverse voltage (in the forward active region). As the collector-base reverse voltage increases, the base-collector depletion layer widens. The resulting reduction in the neutral base width causes the current gain to increase which, in turn, leads to an increase in the collector current [see Fig. 3.16]. This effect can be modeled by introducing the Early voltage ($V_{AF}$) in the expression of the collector current as follows
(3.108)
The inverse of the forward Early voltage, $1/V_{AF}$, is analogous to the coefficient $\lambda$ in an MOS transistor. A typical value of $V_{AF}$ is 50 V. The AC output resistance of the BJT in the forward active region is related to the Early voltage and is given by
$r_o = \frac{V_{AF}}{I_C}$    (3.109)
The Early effect in the inverse active region can be modeled by using the reverse Early voltage ($V_{AR}$), which characterizes the slope of the collector current in that region.
3.5.2.3
The current gain and the cut-off frequency are degraded due to high collector current. Fig. 3.17 shows the effect of the collector current on the gain. This degradation can be attributed to the high level injection in the base (Webster effect) and/or the base pushout (Kirk effect). For a detailed discussion of these phenomena, the reader is advised to consult reference [20]. In the case where the injection level in the base is high (Webster effect) the collector
Ic =
ev-l=v%
where the forward knee current $I_{KF}$ is defined as the collector current at which its slope in the Gummel plot changes from 1 to 1/2 [see Fig. 3.18]. This current marks the onset of high level injection. The degradation of the current gain, when $I_C > I_{KF}$, can be described by the following relation [20]
(3.110)
$\beta = \frac{I_C}{I_B} = \beta_0 \frac{I_{KF}}{I_C}$    (3.111)
where $\beta_0$ is the value of the gain when $I_C < I_{KF}$. The modeling of the Kirk effect is very complex. However, a simple model for the current gain, which can be used in first-order circuit analysis, is given below [21]
(3.112)
The accuracy of the simple EM model can be enhanced by accounting for the parasitic resistors, the Early effect and the high current effect, which can be modeled by simple analytical expressions as shown above.
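As an illustration of such a simple analytical expression, the gain roll-off of Equation (3.111) can be coded as a piecewise function. The abrupt switch at $I_{KF}$ and the function name are assumptions made for the sketch; a real device shows a smooth transition around the knee current.

```python
def current_gain(i_c, beta_0, i_kf):
    """First-order current gain with high-injection roll-off.

    Below the knee current the gain is beta_0; above it the gain
    falls as beta_0 * I_KF / I_C, following Eq. (3.111).
    """
    if i_c <= 0.0:
        raise ValueError("collector current must be positive")
    if i_c < i_kf:
        return beta_0
    return beta_0 * i_kf / i_c


# Placeholder values: beta_0 = 100 and a knee current of 1 mA.
beta_low = current_gain(0.5e-3, 100.0, 1e-3)    # below the knee
beta_high = current_gain(2e-3, 100.0, 1e-3)     # gain halves at 2 * I_KF
```

The two branches meet at $I_C = I_{KF}$, so the sketch is continuous even though its slope is not.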
3.5.3
Two BJT models are implemented in SPICE: the Ebers-Moll model and a more sophisticated one, which is based on the Gummel-Poon (GP) model [22]. The second model includes the following second order effects:
Base width modulation effect.
High-level injection effects (the Kirk effect is not included).
Base resistance variation with current.
The GP model is based on one-dimensional analysis. It is valid for all regions of operation: cutoff, forward-active, inverse-active, and saturation. The GP-based bipolar model is illustrated by the equivalent circuit shown in Fig. 3.19.
*A trpicai value of 1x1 B
C ~
u i Lacsi s 1 m.4/pmn
The two back-to-back diodes on the right represent the intrinsic base-emitter and base-collector junctions and their currents are given by [23]
$I_{CC} = \frac{I_S}{q_b} \left( e^{V_{BE}/n_f V_t} - 1 \right)$    (3.113)
$I_{EC} = \frac{I_S}{q_b} \left( e^{V_{BC}/n_r V_t} - 1 \right)$    (3.114)
(3.115)
The forward and reverse current emission coefficients ($n_f$ and $n_r$), which are introduced in Equations (3.113) and (3.114), are used to model the low currents. The parameter $q_b$ (base charge factor) accounts for the high current and base
+ 1-
(3.116)
The general expression of $q_b$ [Equation (3.116)] can be simplified for low-level and high-level injection conditions.
$q_b \approx q_1$  if  $q_2 \ll q_1^2/4$
$q_b \approx \sqrt{q_2}$  if  $q_2 \gg q_1^2/4$    (3.119)
The two back-to-back diodes on the left [Fig. 3.19] account for the currents caused by the recombination of carriers in the emitter-base and the collector-base space-charge layers and other recombinations. These currents can be modeled by [23]
$C_2 I_S \left( e^{V_{BE}/n_e V_t} - 1 \right)$    (3.120)
$C_4 I_S \left( e^{V_{BC}/n_c V_t} - 1 \right)$    (3.121)
where $C_2$, $C_4$, $n_e$ and $n_c$ have been introduced to fit the measured currents. Further improvements to this model are possible by the inclusion of three parasitic resistances ($R_C$, $R_E$, $R_B$); three junction capacitances ($C_E$, $C_C$, $C_S$); and two diffusion capacitances ($C_{DE}$, $C_{DC}$) as shown in Fig. 3.19. The model of the base resistance takes into account the effect of the current (current crowding) through the following expression [24]
$R_B(I) = R_{BM} + 3 (R_B - R_{BM}) \frac{\tan(z) - z}{z \tan^2(z)}$    (3.122)
where the variable $z$ is given by
$R_B$ represents the low-current maximum resistance and $R_{BM}$ the high-current minimum resistance. The junction depletion capacitance is a function of the junction voltage ($V$). This function can be approximated by the following two expressions
$C_{j,dep} = C_j \left( 1 - \frac{V}{\phi_j} \right)^{-M_j}$  if  $V < FC \, \phi_j$    (3.124)
The empirical factor FC has a value between 0 and 1. Its default value in SPICE is 0.5. Note that Equations (3.124) and (3.125) apply for a reverse and forward biased junction respectively. The diffusion capacitances model the charge associated with injected carriers. For example, the electrons injected in the base have a corresponding storage charge
$Q_{DE} = \tau_F I_{CC}$    (3.126)
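The two-branch depletion capacitance described above can be sketched in Python as follows. The branch for $V < FC \, \phi_j$ follows Equation (3.124). The forward-bias branch is an assumption: it is the linear extrapolation commonly used in SPICE-type simulators, chosen so that the capacitance and its slope are continuous at $V = FC \, \phi_j$, and it may differ in form from Equation (3.125). Function and argument names are illustrative.

```python
def depletion_capacitance(v, c_j0, phi_j, m_j, fc=0.5):
    """Junction depletion capacitance versus junction voltage v.

    c_j0  : zero-bias capacitance (F)
    phi_j : built-in potential (V)
    m_j   : grading factor
    fc    : forward-bias coefficient (between 0 and 1)
    """
    if v < fc * phi_j:
        # Reverse or weak forward bias, Eq. (3.124).
        return c_j0 * (1.0 - v / phi_j) ** (-m_j)
    # Assumed forward-bias branch: linear extrapolation from the knee.
    f1 = (1.0 - fc) ** (-(1.0 + m_j))
    return c_j0 * f1 * (1.0 - fc * (1.0 + m_j) + m_j * v / phi_j)


# Placeholder junction: 60 fF zero-bias, 0.8 V built-in potential, m_j = 0.5.
c_reverse = depletion_capacitance(-3.0, 60e-15, 0.8, 0.5)
c_zero = depletion_capacitance(0.0, 60e-15, 0.8, 0.5)
c_knee = depletion_capacitance(0.4, 60e-15, 0.8, 0.5)
```

Without the second branch the expression of Equation (3.124) would diverge as $V$ approaches $\phi_j$, which is why a bounded form is used for strong forward bias.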
where VTF is a fitting parameter to model the change of $\tau_F$ as a function of $V_{BC}$ (or $V_{CB}$), ITF models the change due to $I_C$ and XTF controls the increase of $\tau_F$. $I_{C0}$ is the collector current in the absence of the high-current effects, which corresponds to that of the Ebers-Moll model. The diffusion capacitance (associated with the injected electrons from the emitter into the base, when the base-emitter junction is forward biased) is given by
$C_{DE} = \frac{\partial Q_{DE}}{\partial V_{BE}}$    (3.128)
Similarly, the base-collector junction has a diffusion capacitance, which is given by
$C_{DC} = \frac{\partial Q_{DC}}{\partial V_{BC}}$    (3.129)
where
$Q_{DC} = \tau_R I_{EC}$    (3.130)
Although the SPICE models account for most of the first and second order effects, they are not highly accurate. This originates from some weaknesses in the theory on which the models are based. As the device features are scaled down the currently available models become less accurate. The physics and the theory of the scaled devices are more complex. Hence, accurate modeling becomes very difficult. One way around that problem is to choose the model parameters such that simulated device characteristics agree with measurements. In practice, the model parameters are extracted automatically using parameter analyzers with software tools to obtain the best fit. As a result, the values of the extracted parameters may not correspond to their actual values. For example, it is common to find a discrepancy of 20% between the measured current gain of a bipolar transistor and that listed in the SPICE file. Another approach, which is equivalent to tweaking the parameters, is to use empirical models (e.g. the BSIM model), in which the empirical (fitting) parameters can be optimized to get the best fit between simulation and measurements. Typical GP parameters for the 0.8 µm BiCMOS process presented in Chapter 2 are shown in Tables 3.4 and 3.5.
Table 3.4 Bipolar device parameters

SPICE Keyword   Description
IS              Saturation current
BF              Ideal maximum forward gain
BR              Ideal maximum reverse gain
NF              Forward current-emission coefficient
NR              Reverse current-emission coefficient
VAF             Forward Early voltage
VAR             Reverse Early voltage
IKF             Forward-knee current
IKR             Reverse-knee current
ISE             Base-emitter leakage saturation current
ISC             Base-collector leakage saturation current
NE              Base-emitter leakage emission coefficient
NC              Base-collector leakage emission coefficient
RE              Emitter resistance
RC              Collector resistance
RB              Base resistance at zero current
IRB             Base current where RB = RB(0)/2
RBM             Minimum high-current base resistance
CJE             Base-emitter zero-bias depletion cap.
VJE             Base-emitter built-in potential
MJE             Base-emitter junction grading factor
CJC             Base-collector zero-bias depletion cap.
VJC             Base-collector built-in potential
MJC             Base-collector junction grading factor
CJS             Collector-substrate zero-bias cap.
VJS             Collector-substrate built-in potential
MJS             Collector-substrate junction grading factor
XCJC            Internal base fraction of base-collector cap.
FC              Coefficient for forward-bias depletion cap.
Table 3.4 (continued)

SPICE Keyword   Description
TF              Forward transit time
XTF             TF bias-dependent coefficient
VTF             TF base-collector voltage dependence coeff.
ITF             TF high-current parameter
TR              Reverse transit time
XTB             Forward and reverse beta temperature exponent
XTI             Saturation current temperature exponent
EG              Energy gap
KF              Flicker noise coefficient
AF              Flicker noise exponent
Table 3.5
SPICE Keyword
IS BF BR NF NR VA P VAR IKF IKR ISE
Vdue
Units
A
Zx
100 1 1
1
sn . .
5 5n 10P
0.
0.
A A
A
RE RC RB
IRB
30 87
RBM CJE
VJE MJE CJC VJC
n n n A
62 F V
F V
FC
TF XTF VTF
ITF
TR
XTB
XTI EG
ev
-
XF
AF
2.0
3.5.4 Chapter Summary
In this chapter, we have reviewed the fundamentals of the MOS and bipolar devices. The most common device models used in SPICE have been presented. The key device parameters of each model have been defined and explained, so that the reader is familiar with the details of these models and can appreciate the importance of the different model parameters. The reader is given a list of model parameters, for a typical 0.8 µm BiCMOS process, that can be used for circuit simulations. These models can be used even at low-voltage operation. Moreover, a simple analytical model suited for submicrometer MOSFETs has been discussed.
REFERENCES
[1] A. Vladimirescu and S. Liu, "The Simulation of MOS Integrated Circuits Using SPICE2," Memo. No. UCB/ERL M80/7, Univ. California, Berkeley, October 1980.
[2] H. Masuda, M. Nakai and M. Kubo, "Characteristics and Limitations of Scaled Down MOSFET's Due to Two Dimensional Field Effect," IEEE Trans. on Electron Devices, vol. ED-26, pp. 980-986, 1979.
[3] R.L.M. Dang, "A Simple Current Model for Short-Channel IGFET and Its Application to Circuit Simulation," IEEE Journal of Solid-State Circuits, vol. SC-14, pp. 358-367, 1979.
[4] G. Merckel, J. Borel and N.Z. Cupcea, "An Accurate Large Signal MOS Transistor Model for Use in Computer-Aided Design," IEEE Trans. on Electron Devices, vol. ED-19, 1972.
[5] G. Baum and H. Beneking, "Drift Velocity Saturation in MOS Transistors," IEEE Trans. on Electron Devices, vol. ED-17, pp. 481-482, 1970.
[6] R.M. Swanson and J.D. Meindl, "Ion-Implanted Complementary MOS Transistors in Low-Voltage Circuits," IEEE Journal of Solid-State Circuits, vol. SC-7, pp. 146-153, 1972.
[7] B.J. Sheu, D.L. Scharfetter, P.-K. Ko, and M.C. Jeng, "BSIM: Berkeley Short-Channel IGFET Model for MOS Transistors," IEEE Journal of Solid-State Circuits, vol. SC-22, pp. 558-566, 1987.
[8] J.H. Huang, Z.H. Liu, M.C. Jeng, P.K. Ko, and C. Hu, "A Robust Physical and Predictive Model for Deep-Submicrometer MOS Circuit Simulation," IEEE Custom Integrated Circuits Conf., Tech. Dig., pp. 14.2.1-14.2.4, May 1993.
[9] D.E. Ward and R.W. Dutton, "A Charge-Oriented Model for MOS Transistor Capacitances," IEEE Journal of Solid-State Circuits, vol. SC-13, pp. 703-707, 1978.
[10] Y.P. Tsividis, "Operation and Modeling of the MOS Transistor," McGraw-Hill, 1988.
[11] T. Sakata et al., "Subthreshold-Current Reduction Circuits for Multi-Gigabit DRAM's," IEEE Journal of Solid-State Circuits, vol. 29, no. 7, pp. 761-769, July 1994.
[12] S.M. Sze, "Physics of Semiconductor Devices," John Wiley & Sons, 1981.
[13] C.G. Sodini, P.-K. Ko, and J.L. Moll, "The Effect of High Fields on MOS Device and Circuit Performance," IEEE Trans. on Electron Devices, vol. ED-31, no. 10, pp. 1386-1393, October 1984.
[14] B. Hoefflinger, H. Sibbert, and G. Zimmer, "Model and Performance of Hot-Electron MOS Transistors for VLSI," IEEE Trans. on Electron Devices, vol. ED-26, pp. 513, 1979.
[15] C. Hu, "Low-Voltage CMOS Device Scaling," IEEE International Solid-State Circuits Conf., Tech. Dig., pp. 86-87, 1994.
[16] R.H. Dennard et al., "Design of Ion-Implanted MOSFET's with Very Small Physical Dimensions," IEEE Journal of Solid-State Circuits, vol. SC-9, pp. 256-266, October 1974.
[17] P.K. Chatterjee et al., "The Impact of Scaling Laws on the Choice of N-Channel or P-Channel for MOS VLSI," IEEE Electron Device Letters, vol. EDL-1, pp. 220-223, October 1980.
[18] M. Kakumu, "Process and Device Technologies of CMOS Devices for Low-Voltage Operation," IEICE Trans. Electron., vol. E76-C, no. 5, pp. 672-680, May 1993.
[19] M. Kakumu, M. Kinugawa, and K. Hashimoto, "Choice of Power-Supply Voltage for Half-Micrometer and Lower Submicrometer CMOS Devices," IEEE Trans. on Electron Devices, vol. 37, no. 6, pp. 1334-1342, May 1990.
[20] D.J. Roulston, "Bipolar Semiconductor Devices," McGraw-Hill Publishing Company, 1990.
[21] K. Nakazato et al., "Characteristics and Scaling Properties of n-p-n Transistors with a Sidewall Base Contact Structure," IEEE Trans. on Electron Devices, vol. ED-32, no. 2, pp. 328-332, February 1985.
[22] H.K. Gummel and H.C. Poon, "An Integral Charge Control Model of Bipolar Transistors," Bell Syst. Tech. J., vol. 49, 1970.
4
LOW-VOLTAGE LOW-POWER VLSI CMOS CIRCUIT DESIGN
In this chapter we introduce the CMOS logic gate and develop simple models for delay and power dissipation estimation. This analysis permits us to understand the mechanisms that control the performance, particularly the power dissipation, of a logic circuit. Several CMOS design styles, such as pseudo-NMOS, dynamic logic and NORA, are presented. Other circuit variations of static complementary CMOS which are suitable for low-power applications are discussed. These include the pass-transistor logic families such as Complementary Pass-transistor Logic (CPL), Dual Pass-transistor Logic (DPL), and Swing Restored Pass-transistor Logic (SRPL). An overview of clocking strategy in VLSI systems is also covered. Included in this chapter is one important area, the I/O circuits, whose power dissipation is also analyzed. Finally, low-power techniques for CMOS design are reviewed at the transistor level. We will cover the low-power issues at the subsystem, system and architecture levels in Chapters 6, 7 and 8 in more detail. Several books treat other CMOS circuit design aspects in detail [1, 2, 3]; the reader can refer to them. Many issues existing in today's advanced CMOS circuit structures are considered, such as:
Power dissipation components of a CMOS gate and their importance;
Concept of switching activity;
Power dissipation in I/O circuits;
Clock distribution in VLSI systems;
Ground bouncing; and
Low-power circuit techniques and design guidelines.
4.1
Fig. 4.1 shows the basic complementary MOS inverter. Before deriving the DC transfer characteristics of this inverter (the output voltage versus the input voltage), let us understand the operation of this circuit.
When the input is HIGH, which means at $V_{DD}$, we have
$$V_{GSn} = V_{in} = V_{DD} \qquad (4.1)$$
$$V_{GSp} = V_{in} - V_{DD} = 0 \qquad (4.2)$$
In this case, $V_{GSn} > V_{Tn}$ and $|V_{GSp}| < |V_{Tp}|$. The PMOS is OFF and the NMOS is ON. The NMOS transistor N provides a current path to ground. The final stable value of the output voltage $V_o$ is
$$V_o = 0 \qquad (4.3)$$
At steady state, the DC current from $V_{DD}$ to ground is controlled by the subthreshold current of the PMOS P, since this device is OFF and the NMOS N has a $V_{DS}$ equal to zero. We assume that the junction leakage is negligible. If $V_{Tp}$ (footnote 1) is low enough (lower, for example, than -0.5 V), the subthreshold current is negligible (< 1 pA/µm of width). If $V_{Tp}$ (negative) is high, the subthreshold current is not negligible and can be as high as 1 µA/µm for $V_{Tp}$ = -0.05 V [see Section 3.3.2]. In this case the output is not exactly at zero and can have a value of tens of mV. In this section we assume that the subthreshold current is not important. Low-$V_T$ CMOS circuits are treated in Section 4.10. Similarly, when $V_{in}$ is low (0 V), $V_{GSn} < V_{Tn}$ and $|V_{GSp}| > |V_{Tp}|$. The PMOS transistor is ON and the NMOS transistor is OFF. The output voltage is given by
$$V_o = V_{DD} \qquad (4.4)$$
1. Extrapolated threshold voltage.
Figure 4.1  A CMOS inverter
The logic levels of the CMOS inverter are close to $V_{DD}$ and ground, and the logic swing is equal to $V_{DD}$. This is a main feature of CMOS gates.
Region (A): $0 \le V_{in} < V_{Tn}$. The NMOS transistor is operating in the subthreshold region and its current is assumed to be zero. Hence the PMOS current is also zero. The PMOS transistor is in the linear region. Thus, $V_o = V_{DD}$.
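Region (B): $V_{Tn} < V_{in} < V_{inv}$. $V_{inv}$ is defined as the input voltage at which the gain of the inverter is maximum; it is also called the gate threshold voltage. In this region, the NMOS transistor is operating in the saturation region and the PMOS is in the linear region. Since the current in both devices is the same (in absolute value), we have
$$I_{DSp} = -I_{DSn} \qquad (4.5)$$
The PMOS current is given by
$$I_{DSp} = -\beta_p\left[(V_{in} - V_{DD} - V_{Tp})(V_o - V_{DD}) - \tfrac{1}{2}(V_o - V_{DD})^2\right] \qquad (4.6)$$
where
$$\beta_p = k_p \frac{W_{eff}}{L_{eff}} \qquad (4.7)$$
The NMOS saturation current is
$$I_{DSn} = \frac{\beta_n}{2}(V_{GSn} - V_{Tn})^2 \qquad (4.10)$$
where
$$\beta_n = k_n \frac{W_{eff}}{L_{eff}} \qquad (4.11)$$
and
$$V_{GSn} = V_{in} \qquad (4.12)$$
Using equations (4.5), (4.6) and (4.10), the output voltage is given by
$$V_o = (V_{in} - V_{Tp}) + \sqrt{(V_{in} - V_{Tp})^2 - 2\left(V_{in} - \frac{V_{DD}}{2} - V_{Tp}\right)V_{DD} - \frac{\beta_n}{\beta_p}(V_{in} - V_{Tn})^2} \qquad (4.13)$$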
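Region (C): $V_{in} = V_{inv}$. Both transistors are in saturation. The PMOS saturation current is
$$I_{DSp} = -\frac{\beta_p}{2}(V_{in} - V_{DD} - V_{Tp})^2 \qquad (4.14)$$
The NMOS saturation current is given in Equation (4.10). By equating the absolute values of the two drain currents we have
$$V_{inv} = \frac{V_{DD} + V_{Tp} + V_{Tn}\sqrt{\beta}}{1 + \sqrt{\beta}} \qquad (4.15)$$
where
$$\beta = \frac{\beta_n}{\beta_p} \qquad (4.16)$$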
... a design point of view. Note, from this equation, that the logic threshold voltage of this gate is set by the designer, since the parameters $\beta_n$ and $\beta_p$ depend on $W_{eff}$ and $L_{eff}$. Moreover, region (C) is defined for only one value of $V_{in}$. For symmetrical NMOS and PMOS devices we have
$$V_{Tn} = -V_{Tp} \qquad (4.17)$$
If the designer sets
$$\beta_n = \beta_p \qquad (4.18)$$
This ratio is a typical example. The designer should set the size ratio as given by Equation (4.20). We obtain
$$V_{in} = V_{inv} = \frac{V_{DD}}{2} \qquad (4.21)$$
An inverter with this $V_{inv}$ is sometimes called a symmetrical gate. The output voltage in this case is not necessarily equal to $V_{DD}/2$ and is given by the following inequality
$$V_{in} - V_{Tn} < V_o < V_{in} - V_{Tp} \qquad (4.22)$$
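Region (D): $V_{inv} < V_{in} \le V_{DD} + V_{Tp}$. The NMOS transistor is in the linear region and the PMOS transistor is in saturation. Equating the two drain currents gives
$$V_o = (V_{in} - V_{Tn}) - \sqrt{(V_{in} - V_{Tn})^2 - \frac{\beta_p}{\beta_n}(V_{in} - V_{DD} - V_{Tp})^2} \qquad (4.23)$$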
Region (E): $V_{DD} + V_{Tp} < V_{in} \le V_{DD}$. In this region the NMOS transistor is ON, in the linear region, and the PMOS is operating in the subthreshold region. If we assume that the PMOS current is negligibly small, then
$$V_o = 0 \qquad (4.24)$$
The current flowing from $V_{DD}$ to ground, versus the input voltage, is plotted in Fig. 4.2(b). It reaches its maximum when both MOS transistors are in saturation. It is important to note that for $V_{in} = V_{inv}$ the DC power dissipation is maximal.
Figure 4.3  Effect of the ratio $\beta$ on (a) the DC transfer characteristic and (b) the threshold voltage of the CMOS inverter
4.1.2
Effect of $\beta$
As discussed before, the ratio $\beta$ controls the threshold voltage of the CMOS inverter. This parameter is set by the circuit designer through the transistor sizes. Other parameters, such as the mobility and the threshold voltage of the devices, are set during fabrication and the circuit designer cannot change them. Fig. 4.3 illustrates the dependence of the DC transfer characteristics and of the threshold voltage of the CMOS inverter on the ratio $\beta$. Increasing $\beta$ decreases the voltage $V_{inv}$. $V_{inv}$ has a practical maximum less than $V_{DD} + V_{Tp}$ and a practical minimum greater than $V_{Tn}$. Practical values mean that $\beta$ cannot be zero or infinite. In general, the circuit designer tries to set $\beta = 1$ for symmetrical operation, unless the gate is used to switch an input swing different from a CMOS swing (from ground to $V_{DD}$).
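The dependence of the gate threshold on the ratio can be checked numerically. The sketch below assumes the standard expression obtained by equating the NMOS and PMOS saturation currents, $V_{inv} = (V_{DD} + V_{Tp} + V_{Tn}\sqrt{\beta})/(1 + \sqrt{\beta})$ with $\beta = \beta_n/\beta_p$; the function name and the numerical values (3.3 V supply, thresholds of ±0.5 V) are illustrative assumptions, not data from a particular process.

```python
import math


def inverter_threshold(vdd, vtn, vtp, beta_ratio):
    """Gate threshold V_inv of a CMOS inverter (first-order model).

    vtp is negative; beta_ratio is beta_n / beta_p (assumed > 0).
    """
    if beta_ratio <= 0:
        raise ValueError("beta_ratio must be positive")
    root = math.sqrt(beta_ratio)
    return (vdd + vtp + vtn * root) / (1.0 + root)


if __name__ == "__main__":
    # Illustrative values only (assumed, not taken from a process file).
    VDD, VTN, VTP = 3.3, 0.5, -0.5
    for ratio in (0.25, 1.0, 4.0):
        print(f"beta = {ratio:4.2f}  V_inv = {inverter_threshold(VDD, VTN, VTP, ratio):.3f} V")
```

With symmetrical thresholds and a ratio of 1 the function returns $V_{DD}/2$, and larger ratios push the threshold toward $V_{Tn}$, which is the trend described in the text.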
4.1.3
Noise Margins
Noise margin is an important parameter in logic design. It is defined as the allowable noise voltage on the input such that the output is not affected. In other words, we define the valid logic levels such that they are restored when they propagate through a digital circuit. The logic levels can be extracted from the DC characteristic. As illustrated in Fig. 4.4, we define the levels at the input by
Logic 0: for $0 \le V_{in} \le V_{IL}$
Logic 1: for $V_{IH} \le V_{in} \le V_{DD}$
The noise margins are
$$NM_L = |V_{IL} - V_{OL}| \qquad (4.25)$$
$$NM_H = |V_{OH} - V_{IH}| \qquad (4.26)$$
The $V_{IL}$ and $V_{IH}$ levels can be defined as the points where the slope of the DC transfer characteristic is -1, i.e.,
$$\frac{dV_o}{dV_{in}} = -1 \qquad (4.27)$$
These values can be deduced using equations (4.13) and (4.23). To have good noise margins, it is desirable to have $V_{IL}$ and $V_{IH}$ near each other, around the point $V_{DD}/2$.
For CMOS circuits, the output voltage levels can be defined by letting $V_{OH} = V_{DD}$ and $V_{OL} = 0$. The CMOS logic inverter has a fairly ideal transfer function and tends to have very good noise margins. In some applications, either $NM_H$ or $NM_L$ is compromised to obtain good speed of operation.
$$V_{DD,min} = \frac{8kT}{q} \qquad (4.28)$$
At room temperature this value is equal to 0.2 V. This demonstrates that CMOS is a good candidate for ultra-low-power applications.
4.1.5
For an inverter with $W_p = 2W_n$ = 4 µm (in 0.8 µm CMOS technology), and using a threshold voltage $V_T = V_{Tn} = |V_{Tp}|$ = 0.5 V, we have the following values for $NM_L$ and $NM_H$. At a 3.3 V power supply voltage, $NM_L$ = 1.15 V and $NM_H$ = 1.45 V. However, at 1.5 V, $NM_L$ = 0.60 V and $NM_H$ = 0.65 V. So the noise level should be kept low, particularly at low power supply voltage.
Figure 4.5
The power dissipation issue during the switching is considered in Section 4.3.
4.2.1
The load capacitance shown in Fig. 4.5 at the output of the CMOS inverter represents the total of the input capacitance of the driven gates, the parasitic capacitance at the output of the gate itself and the wiring capacitance. In Section 4.4, we discuss the estimation of this load capacitance. For simplicity we assume, for the 50% delay, that the MOS current is averaged and is equal to the saturation current. The saturation current used in this section is the one given by Equation (3.82) of Section 3.3.3, which models short-channel devices well. During the discharge of the load,
$$C_L \frac{dV_o}{dt} = -I_{DSsat,n} \qquad (4.29)$$
where
$$I_{DSsat,n} = K_n v_{sat} C_{ox} W_{eff,n}(V_{GSn} - V_{Tn}) \qquad (4.30)$$
We assume that the factor $K_n$ does not change. By integrating Equation (4.29) from $t = t_1$, corresponding to $V_o = V_{DD}$, to $t = t_2$, corresponding to $V_o = V_{DD}/2$, and substituting (4.30) into (4.29) with $V_{GSn} = V_{DD}$, we obtain
$$t_{df} = t_2 - t_1 = \frac{C_L V_{DD}}{2 K_n v_{sat} C_{ox} W_{eff,n}(V_{DD} - V_{Tn})} \qquad (4.31)$$
Note from this equation that the delay is inversely proportional to the width of the MOS transistor. So by sizing the gate we can reduce the delay of the gate alone.
From the above equation we can deduce that the rise delay is greater than the fall delay for equally sized MOS transistors. So $W_{eff,p}$ should be sized such that the two saturation currents are almost equal in order to get symmetrical rise and fall delays.
$$t_d = \tfrac{1}{2}(t_{dr} + t_{df}) \qquad (4.33)$$
Hence, for
The width of the MOS transistor; the load capacitances (input of the next stage, wiring, etc.); and the supply voltage $V_{DD}$.
Fig. 4.7(a) shows the simulated effect of the power supply voltage on the delay of an inverter with fanout = 3, using the device parameters given in Chapter 3. We buffer the input voltage with one inverter stage to obtain accurate results. The delay is almost stable at high $V_{DD}$; however, when $V_{DD}$ approaches the threshold voltage of the NMOS and PMOS devices, it increases drastically, as expected from Equation (4.35). Therefore, the threshold voltage should be reduced to overcome this problem. In Fig. 4.7(b), the delay of the inverter is plotted versus the ratio $V_T/V_{DD}$ at $V_{DD}$ = 2.5 V. For $V_T/V_{DD}$ > 0.5, the delay increases rapidly. In order to maintain improvement in circuit performance at reduced power supply voltage, $V_T/V_{DD}$ must be $\le$ 0.2.
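The supply-voltage dependence of the delay can be illustrated with a first-order model. The sketch below assumes the 50% delay is the charge $C_L V_{DD}/2$ divided by a velocity-saturated current proportional to $(V_{DD} - V_T)$, in the spirit of Equation (4.30); the lumped factor `k_sat_w` (standing for $K v_{sat} C_{ox} W_{eff}$) and all numerical values are hypothetical.

```python
def fall_delay(c_load, vdd, vt, k_sat_w):
    """First-order 50% delay: C_L * (VDD / 2) / I_sat.

    I_sat = k_sat_w * (VDD - VT), where k_sat_w lumps K * v_sat * C_ox * W_eff
    (an assumed, illustrative parameter in A/V).
    """
    if vdd <= vt:
        raise ValueError("model is only meaningful for VDD > VT")
    i_sat = k_sat_w * (vdd - vt)
    return c_load * (vdd / 2.0) / i_sat


if __name__ == "__main__":
    C_LOAD = 100e-15   # 100 fF load (assumed)
    VT = 0.5           # threshold voltage in volts (assumed)
    K_SAT_W = 1e-3     # lumped current factor in A/V (assumed)
    for vdd in (3.3, 2.5, 1.5, 1.0, 0.8):
        delay_ps = fall_delay(C_LOAD, vdd, VT, K_SAT_W) * 1e12
        print(f"VDD = {vdd:3.1f} V  VT/VDD = {VT / vdd:4.2f}  delay = {delay_ps:6.1f} ps")
```

In this simplified model the delay scales as $V_{DD}/(V_{DD} - V_T)$: it is nearly flat at high supply voltage and grows quickly once $V_T/V_{DD}$ exceeds about 0.5, which is the qualitative behavior reported for Fig. 4.7.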
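There are three power dissipation components within the CMOS inverter. These are:
1. Static power caused by the leakage current and by the static current due to the value of the input voltage;
2. Dynamic power caused by charging and discharging the load capacitance $C_L$; and
3. Short-circuit power during the switching transient.
Sometimes components (2) and (3) are merged as the total dynamic power.
The static power has two components,
$$P_s = P_{s1} + P_{s2} \qquad (4.36)$$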
Leakage current consists of MOS junction leakage currents. Fig. 4.9 shows the parasitic diodes in a CMOS inverter. The body ties in this structure are such that the parasitic diodes are not conducting (i.e. reverse biased and/or at zero voltage). The current in a diode is given by
$$I_d = I_s\left(\exp\frac{qV_d}{nkT} - 1\right) \qquad (4.37)$$
where n is the emission coefficient of the diode (sometimes equal to 1) and $V_d$ is the voltage applied to the diode. Note that the current parameter $I_s$ increases with temperature. The total power dissipation due to these leakage currents is given by
$$P_{s1} = \sum_i I_{d,i} V_{DD} \qquad (4.38)$$
A typical value of this leakage current $I_d$ is 1 fA per device junction. This value is too small to have any effect on the static power: with one million devices, the total contribution to the power would be 0.01 µW. This first component of the static power is neglected in the analysis throughout this book, except in Chapter 6 in the case of memory design.
We now consider the second component of the static power, which is a function of the input voltage $V_{in}$. Assume that the input of the pull-down NMOS of the inverter is at a voltage $0 \le V_{in} < V_T$. In this case the current is given by the subthreshold expression (Fig. 4.10)
$$I_{DS} = I_0 \frac{W_{eff}}{W_0} 10^{(V_{in} - V_T)/S} \qquad (4.39)$$
where $V_T$ is the constant-current threshold voltage. For $V_{in} > V_T$ the current is given by the expressions discussed in Chapter 3. The corresponding static power dissipation is given by
$$P_{s2} = I_{DS,mean} V_{DD} \qquad (4.40)$$
The mean value of the current is taken over both the PMOS and NMOS transistors. For example, if $V_{in}$ = 0, $V_T$ = 0.15 V, $W_{eff}$ = 10 µm and S = 75 mV/decade, this current is 1 nA. For 1 million integrated devices, the total static current would be significant (1 mA). Note that this current increases drastically with temperature [see Section 3.3.2]. This value, in standby mode, is not acceptable for battery-operated applications. CMOS circuits have been known to consume energy only during switching. But this is no longer true, since low-$V_T$ CMOS is used for low-voltage operation. Some CMOS circuits which exhibit a high DC current are discussed in Section 4.6.
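The numerical example above can be reproduced with a few lines of code. This is a minimal sketch that assumes the constant-current threshold is defined at 10 nA per µm of width (the parameter `i0_per_um` below); that reference current is an assumption chosen because it reproduces the 1 nA figure quoted in the text, and the function name is hypothetical.

```python
def subthreshold_current(v_in, v_t, w_eff_um, s_mv_per_decade, i0_per_um=10e-9):
    """Subthreshold drain current in amperes for v_in below v_t.

    i0_per_um is the assumed current per micron of width at v_in == v_t
    (constant-current threshold definition); s_mv_per_decade is the
    subthreshold swing S in mV/decade.
    """
    exponent = (v_in - v_t) / (s_mv_per_decade * 1e-3)
    return i0_per_um * w_eff_um * 10.0 ** exponent


if __name__ == "__main__":
    i_dev = subthreshold_current(v_in=0.0, v_t=0.15, w_eff_um=10.0, s_mv_per_decade=75.0)
    n_devices = 1_000_000
    print(f"per device : {i_dev * 1e9:.2f} nA")
    print(f"{n_devices} devices : {i_dev * n_devices * 1e3:.2f} mA")
```

Under these assumptions a 150 mV threshold with a 75 mV/decade swing gives two decades of attenuation, so each 10 µm device leaks about 1 nA and a million such devices draw about 1 mA in standby.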
4.3.2
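In this section we estimate the power dissipation due to the total output load capacitance $C_L$. This power is due to the currents needed to charge and discharge $C_L$, as shown in Figs. 4.11 and 4.12. We assume a step input, so that the PMOS and NMOS are never on simultaneously. The average dynamic power $P_d$ required to charge and discharge a capacitance $C_L$ at a switching frequency $f = 1/T$ (Fig. 4.12) is given by
$$P_d = \frac{1}{T}\left[\int_0^{T/2} i_p (V_{DD} - v_o)\,dt + \int_{T/2}^{T} i_n v_o\,dt\right] \qquad (4.41)$$
where
$$i_p = C_L \frac{dv_o}{dt} \qquad (4.42)$$
$$i_n = -C_L \frac{dv_o}{dt} \qquad (4.43)$$
Substituting (4.42) and (4.43) into (4.41) gives
$$P_d = \frac{C_L}{T}\left[\int_0^{V_{DD}} (V_{DD} - v_o)\,dv_o - \int_{V_{DD}}^{0} v_o\,dv_o\right] \qquad (4.44)$$
$$P_d = C_L V_{DD}^2 f \qquad (4.45)$$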
This equation shows that the power dissipation is proportional to the operating frequency. Moreover, reducing the power supply drastically reduces the power dissipation. Ideally, a 3.3 V supply voltage reduces the power dissipation by 56% compared to that at 5 V. Moreover, at 1 V the power is reduced by 96% compared to 5 V. The expression of dynamic power in Equation (4.45) is valid only for an inverter. For a complex gate, the concept of switching activity is introduced [see Section 4.5.3].
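The quoted savings follow directly from the quadratic dependence on supply voltage. The short sketch below evaluates $P_d = C_L V_{DD}^2 f$ and the relative reduction when scaling the supply; the function names, the 100 fF load and the 100 MHz frequency are illustrative assumptions.

```python
def dynamic_power(c_load, vdd, freq):
    """Average dynamic power C_L * VDD^2 * f of an inverter, in watts."""
    return c_load * vdd ** 2 * freq


def reduction_percent(vdd_new, vdd_ref):
    """Percentage of dynamic power saved by scaling VDD from vdd_ref to vdd_new."""
    return 100.0 * (1.0 - (vdd_new / vdd_ref) ** 2)


if __name__ == "__main__":
    C_LOAD, FREQ = 100e-15, 100e6   # assumed: 100 fF load switched at 100 MHz
    for vdd in (5.0, 3.3, 1.0):
        p_uw = dynamic_power(C_LOAD, vdd, FREQ) * 1e6
        print(f"VDD = {vdd:3.1f} V  P_d = {p_uw:7.2f} uW  saving vs 5 V = {reduction_percent(vdd, 5.0):5.1f} %")
```

The savings are independent of the load and frequency chosen, since both cancel in the ratio.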
During the first output transition (charging) from 0 to $V_{DD}$, the energy drawn from the power supply is $E_d = C_L V_{DD}^2$. For this transition, the energy stored in the load capacitor is
$$E_c = \frac{1}{2} C_L V_{DD}^2 \qquad (4.46)$$
This means that during the output transition $0 \to V_{DD}$, half of the energy drawn from the supply is stored in the capacitor and the other half is consumed by the pull-up PMOS transistor. For the output transition $V_{DD} \to 0$, the energy ($\frac{1}{2} C_L V_{DD}^2$) stored in the capacitor is consumed by the pull-down NMOS transistor and no current is drawn from the supply.
4.3.2.1
It is important to distinguish between energy and power. If, for example, we reduce the clock rate of a CMOS gate, its power consumption will be reduced by the same proportion. However, its energy will still be the same. Assume that the gate is powered by a battery to perform computations. The time required to complete the computation at the low clock rate will be increased. Therefore, after the computation the battery will be just as dead as if the computation had been performed at the high clock rate. So low-energy design is more important than low-power design. The figure of merit in this case can be defined as the product of energy and delay. The conventional term, low-power, is used throughout this book to mean that we design for low energy.
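$$P_{sc} = I_{mean} V_{DD} \qquad (4.47)$$
To estimate $I_{mean}$, we use the simple model of the short-circuit current of Fig. 4.14 [5]. We also assume that the inverter has symmetrical devices, which means that $\beta_n = \beta_p = \beta$ and $V_{Tn} = -V_{Tp} = V_T$, and that the rise time is equal to the fall time of the input signal ($\tau_r = \tau_f = \tau$). The mean short-circuit current in the unloaded inverter is
$$I_{mean} = \frac{2}{T}\left[\int_{t_1}^{t_2} i(t)\,dt + \int_{t_2}^{t_3} i(t)\,dt\right] \qquad (4.48)$$
Figure 4.13  Short-circuit current as a function of the input slope
Between $t_1$ and $t_2$ the current is the saturation current
$$i(t) = \frac{\beta}{2}\left(V_{in}(t) - V_T\right)^2 \qquad (4.49)$$
with the input ramp
$$V_{in}(t) = \frac{V_{DD}}{\tau}\,t \qquad (4.50)$$
It can be derived from Fig. 4.14 that
$$t_1 = \frac{V_T}{V_{DD}}\,\tau \qquad (4.51)$$
and
$$t_2 = \frac{\tau}{2} \qquad (4.52)$$
Figure 4.14
Since the current waveform is symmetrical about $t_2$, substituting (4.49) to (4.52) into (4.48) and (4.47) gives
$$P_{sc} = \frac{\beta}{12}(V_{DD} - 2V_T)^3\,\frac{\tau}{T} \qquad (4.53)$$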
This equation shows that the short-circuit power dissipation is also proportional to the frequency. The only parameters that can be controlled by the circuit designer, at a given frequency and power supply, to reduce $P_{sc}$ are $\beta$ and $\tau$. Power supply scaling greatly reduces the short-circuit power dissipation. Note that this analysis was done for an unloaded inverter. For a loaded gate, if the output signal and input signal have equal rise/fall times, the short-circuit power dissipation will be less than 20% of the total power [5]. So it is very important to keep the edges fast to have negligible $P_{sc}$, or, at least, it is desirable to have equal input and output rise/fall times.
If the load capacitance is high, the output rise/fall times become larger than the input ones. In this case, the input changes completely before the output changes significantly. Therefore, the short-circuit current is near zero. Note that if $V_{DD}$ approaches ($V_{Tn} + |V_{Tp}|$) or is less, the short-circuit current can be eliminated because the two devices cannot conduct simultaneously.
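The short-circuit model can be checked numerically. The sketch below assumes the usual simplifications for an unloaded symmetrical inverter: a linear input ramp $V_{in}(t) = V_{DD}\,t/\tau$, a saturation current $\frac{\beta}{2}(V_{in} - V_T)^2$ between $t_1 = (V_T/V_{DD})\tau$ and $t_2 = \tau/2$, a waveform symmetrical about $t_2$, and two input transitions per period. It compares a midpoint-rule integration with the closed form $\frac{\beta}{12}(V_{DD} - 2V_T)^3 \tau/T$. All names and numerical values are illustrative assumptions.

```python
def short_circuit_power_closed_form(beta, vdd, vt, tau, period):
    """Closed-form short-circuit power of an unloaded symmetrical inverter."""
    if vdd <= 2.0 * vt:
        return 0.0  # both devices can never conduct simultaneously
    return beta / 12.0 * (vdd - 2.0 * vt) ** 3 * tau / period


def short_circuit_power_numeric(beta, vdd, vt, tau, period, steps=20000):
    """Same quantity by midpoint integration of i(t) over [t1, t2]."""
    t1 = vt / vdd * tau
    t2 = tau / 2.0
    if t2 <= t1:
        return 0.0
    dt = (t2 - t1) / steps
    charge = 0.0
    for k in range(steps):
        t = t1 + (k + 0.5) * dt
        v_in = vdd * t / tau
        charge += 0.5 * beta * (v_in - vt) ** 2 * dt
    # Two transitions per period, each made of two symmetrical halves.
    i_mean = 4.0 * charge / period
    return i_mean * vdd


if __name__ == "__main__":
    # Assumed values: beta = 100 uA/V^2, 1 ns edges, 100 MHz switching.
    args = dict(beta=100e-6, vdd=3.3, vt=0.5, tau=1e-9, period=10e-9)
    print("closed form:", short_circuit_power_closed_form(**args))
    print("numeric    :", short_circuit_power_numeric(**args))
```

The cubic dependence on $(V_{DD} - 2V_T)$ is why supply scaling is so effective here, and why the short-circuit component vanishes when the supply falls to the sum of the threshold magnitudes.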
4.3.4
$$P_t = P_s + P_d + P_{sc} \qquad (4.54)$$
It represents the total power of a gate when it is switching at the same rate as the operating frequency. In Chapter 8, we will discuss how to estimate the power dissipation of a complex circuit.
Other power dissipation issues exist, such as worst-case power estimation and temperature effects. The worst-case conditions are: maximum $V_{DD}$ and junction temperature, and a fast-fast process. Static power dissipation (subthreshold current) increases with temperature and with power supply voltage. Dynamic power is not sensitive to temperature but is greatly affected by the worst-case $V_{DD}$. Short-circuit power dissipation depends on temperature just as the short-circuit current does. It also depends on the power supply. The mobility and the threshold voltage decrease with increasing temperature, and these two parameters have opposite effects on the current. So it is important to evaluate the worst-case power consumption in any design.
The average total power dissipation can easily be measured with the SPICE simulator using power measurement commands. In addition, several papers in the literature have introduced a "power meter" in circuit simulation to measure the power dissipation [6, 7, 8].
4.4
CAPACITANCE ESTIMATION
Previously we saw that the speed and power dissipation of CMOS gates depend strongly on the total output load capacitance. This capacitance is the sum of three components, as shown in Fig. 4.15:
Total input capacitance of the N driven gates, noted $C_{gate}$;
Parasitic output capacitance of the driving gate, noted $C_p$; and
Wiring capacitance, noted $C_w$.
For simplicity we estimate, in this section, the average value of $C_L$ over the range of the output swing. This approach is used only for an initial estimation of the design. Circuit simulation, layout extraction and post-layout simulation are needed for more accuracy. Moreover, it is sometimes useful to derive a simple expression for the load capacitance to see the impact of important parameters on the speed and the power dissipation. We first examine the different components of the output load capacitance; then we illustrate by an example.
4.4.1
Estimation of $C_{gate}$
The total capacitance of the driven gates can be evaluated by summing the input capacitances of all the receiving gates, and we have
$$C_{gate} = C_{ox} \sum_{i=1}^{n} (WL)_i \qquad (4.56)$$
where n is the number of transistors of the gate. This expression sums the gate capacitances of all the transistors composing the driven circuit. For a CMOS inverter it is given by
$$C_{gate} = C_{ox}(W_n L_n + W_p L_p) \qquad (4.57)$$
CHAPTER 4
3.5
3 -
VOllll
y:
i i i
,?
'
? '
', ,'
! ? I
voD=3.3 v -
2.5
Vin
2 1.5
- i
1 -
i
i i
7
0.5 -
t. _..
-0.5
i i
i
;vout2
. . .
*< .... . ..
ei
141
6
Figure 4.16 shows an example of the equivalent gate capacitance of the receiving gate. The driven inverter has the following drawn sizes: $W_p = W_n$ = 20 µm and L = 0.8 µm. This gate can be replaced by an equivalent capacitance $C_{gate} \approx 50$ fF, which is approximately the same as the one calculated from Equation (4.57).
The parasitic output capacitance of a gate is
$$C_p = C_{gdp} + C_{gdn} + C_{jp} + C_{jn} \qquad (4.58)$$
where the gate-drain overlap capacitance is
$$C_{gd} = C_{gdo} W \qquad (4.59)$$
$C_{gdo}$ is defined in the SPICE parameters of Chapter 3 as CGDO. The drain junction capacitance is a function of the reverse voltage applied during the switching of the inverter. The average value of this capacitance over the range of the output swing is defined by
$$C_j = C_{ja} A_D + C_{jsw} P_D \qquad (4.60)$$
where $A_D$ and $P_D$ are the area and the perimeter of the drain junction, as shown in Fig. 4.18. The average bottom junction capacitance $C_{ja}$ is given by Equation (4.61) and the average side-wall capacitance $C_{jsw}$ by Equation (4.62).
4.4.3
Wiring Capacitance
The simple model of wiring capacitance is based on the parallel-plate model [Fig. 4.19] given by
$$C_{wa} = \frac{\varepsilon_{ox}}{H} \qquad (4.63)$$
where H is the thickness of the insulator layer (oxide), and $C_{wa}$ is the capacitance per unit area. The total capacitance of the wire is
$$C_w = l\,W\,C_{wa} \qquad (4.64)$$
where W is the width of the wire (metal or poly), and l is the length of the wire. Table 4.1 gives some values of the wiring capacitance per area for the 0.8 µm process presented in Chapter 2. This capacitance cannot be known in the early design stage, but can be known after layout extraction. When the thickness of the insulator becomes comparable to that of the wire, T, the fringing fields at the edge of the wire become important. The effect of the fringing fields is manifested by an increase of the effective area of the plates [Fig. 4.19]. Many approximations have been proposed to compute the wire capacitance including fringing fields, such as
$$C_{wl} = \varepsilon_{ox}\left[\frac{W}{H} + 0.77 + 1.06\left(\frac{W}{H}\right)^{0.25} + 1.06\left(\frac{T}{H}\right)^{0.5}\right] \qquad (4.65)$$
where $C_{wl}$ is the total capacitance of the wire per unit length. The contribution of the fringing effect is important in many cases. Table 4.2 shows the fringing capacitance per unit of length.

Table 4.1  Area wiring capacitance
Layer                          Area capacitance (10^-3 fF/µm²)
Metal2 to substrate            11
Metal2 to metal1               25
Metal1 to substrate            19
Metal1 to poly                 28
Metal1 to diffusion            27
Gate poly over field oxide     58

Table 4.2  Fringing (perimeter) wiring capacitance
Layer                          Perimeter capacitance (10^-3 fF/µm)
Metal2 to substrate            38
Metal2 to metal1               47
Metal1 to substrate            44
Metal1 to poly                 48
Metal1 to diffusion            47
Gate poly over field oxide     44
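To see how much the fringing terms add, the per-unit-length expression can be compared with the plain parallel-plate value. The sketch below assumes the widely quoted empirical form $\varepsilon_{ox}[W/H + 0.77 + 1.06(W/H)^{0.25} + 1.06(T/H)^{0.5}]$ and a silicon dioxide permittivity of 3.9 times the vacuum value; the wire dimensions are hypothetical and do not correspond to a specific process.

```python
# Permittivity of SiO2 in fF/um (3.9 * 8.854e-3 fF/um); assumed dielectric.
EPS_OX_FF_PER_UM = 3.9 * 8.854e-3


def plate_cap_per_length(width_um, height_um):
    """Parallel-plate capacitance per unit length, in fF/um."""
    return EPS_OX_FF_PER_UM * width_um / height_um


def wire_cap_per_length(width_um, height_um, thickness_um):
    """Capacitance per unit length including fringing fields, in fF/um."""
    ratio_w = width_um / height_um
    ratio_t = thickness_um / height_um
    return EPS_OX_FF_PER_UM * (ratio_w + 0.77 + 1.06 * ratio_w ** 0.25 + 1.06 * ratio_t ** 0.5)


if __name__ == "__main__":
    # Hypothetical geometry: 2 um wide, 1 um thick wire over 1 um of oxide.
    W, H, T = 2.0, 1.0, 1.0
    plate = plate_cap_per_length(W, H)
    total = wire_cap_per_length(W, H, T)
    print(f"parallel plate : {plate:.4f} fF/um")
    print(f"with fringing  : {total:.4f} fF/um  (x{total / plate:.2f})")
```

For narrow wires the fringing contribution is comparable to or larger than the parallel-plate term, which is consistent with the remark that it cannot be neglected.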
4.4.4
Example
Consider an inverter with $W_p = 2W_n$ = 20 µm, with a 3 µm length for each drain and source. This inverter is driving a metal1 line of 100 µm length by 2 µm width and an inverter with $W_p = 2W_n$ = 20 µm, operating at $V_{DD}$ = 3.3 V.
The total load capacitance is computed using the 0.8 µm device parameters presented in Chapter 3 as follows:
The gate capacitance $C_{gate}$ of the driven inverter is obtained from Equation (4.57).
The gate-drain overlap capacitance is $C_{gd} = CGDO_p W_p + CGDO_n W_n$.
The total drain junction capacitances can be approximated at the mid-voltage of 1.65 V (1/2 of $V_{DD}$) instead of computing integrals. The drain areas are 60 µm² and 30 µm² for the PMOS and NMOS respectively. The drain perimeters are 46 µm and 26 µm for the PMOS and NMOS transistors respectively. The total junction capacitance can be easily calculated and is $C_j \approx 32$ fF. Note that this capacitance increases as the power supply voltage is reduced.
The wire capacitance is estimated by adding the two components, the parallel-plate and fringing capacitances. The area of the wire is 200 µm² while its perimeter is 204 µm. We have
$$C_w = W \times l \times C_{wa}\ \text{(per area)} + 2(W + l) \times C_{wl}\ \text{(per length)}$$
$$= 200\ \mu m^2 \times 19 \times 10^{-3}\ fF/\mu m^2 + 204\ \mu m \times 44 \times 10^{-3}\ fF/\mu m = 3.8 + 9.0 \approx 13\ fF$$
Note that the fringing capacitance is an important portion of the total wire capacitance.
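The wire estimate above is a simple area-plus-perimeter sum and is easy to script for other geometries. The sketch below uses the metal1-to-substrate coefficients quoted in the example (19 × 10⁻³ fF/µm² and 44 × 10⁻³ fF/µm); the function name is hypothetical.

```python
def wire_capacitance_ff(width_um, length_um, c_area_ff_um2, c_perim_ff_um):
    """Return (parallel-plate, fringing, total) wire capacitance in fF."""
    area = width_um * length_um
    perimeter = 2.0 * (width_um + length_um)
    plate = area * c_area_ff_um2
    fringe = perimeter * c_perim_ff_um
    return plate, fringe, plate + fringe


if __name__ == "__main__":
    # Metal1 line, 2 um x 100 um, coefficients from the worked example.
    plate, fringe, total = wire_capacitance_ff(2.0, 100.0, 19e-3, 44e-3)
    print(f"plate = {plate:.2f} fF, fringing = {fringe:.2f} fF, total = {total:.2f} fF")
```

For this long, narrow line the perimeter term dominates, which matches the observation that fringing is an important portion of the wire capacitance.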
Hence the total capacitance at the output is 100 fF. Note that the contribution of the junction capacitance is important. The contribution of each component varies from one circuit to another and depends on the layout style used. Before starting any circuit layout, it is important to keep in mind an estimation of capacitances such as the gate and output capacitance of a unit-size inverter and the wire capacitance of, for example, a 100 µm poly line and a 100 µm metal1 line. With these data, when starting the design, it is possible to size the different transistors correctly.
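We assume that
$$W_{n1} = W_{n2} = W_{n3} = W_n \qquad (4.66)$$
and
$$W_{p1} = W_{p2} = W_{p3} = W_p \qquad (4.67)$$
The first thing to do is to approximate the gate by an equivalent inverter where the effective $\beta$ is given by
$$\frac{1}{\beta_{neff}} = \frac{1}{\beta_{n1}} + \frac{1}{\beta_{n2}} + \frac{1}{\beta_{n3}} = \frac{3}{\beta_n} \qquad (4.68)$$
and
$$\beta_{peff} = \beta_p \qquad (4.69)$$
To have $V_{inv}$ of the gate midway of the power supply in the DC characteristics, the following condition should be satisfied for the 3-input NAND gate (see Equation 4.18)
$$\beta_{peff} = \beta_{neff} \qquad (4.70)$$
that is,
$$\beta_p = \frac{\beta_n}{3} \qquad (4.71)$$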
To have the same delay as an inverter with determined sizes $W_{pi}$ and $W_{ni}$, we should have (assuming that L is the same)
$$W_p = W_{peff} = W_{pi} \qquad (4.72)$$
and
$$\frac{W_n}{3} = W_{neff} = W_{ni} \qquad (4.73)$$
But in practice the size of the transistors composing the 3-input NAND gate should be increased, because the output parasitic capacitance of the NAND gate (or any complex gate) is larger than that of the inverter. Hence
$$W_p > W_{pi} \qquad (4.74)$$
and
$$W_n > 3W_{ni} \qquad (4.75)$$
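The sizing rule can be expressed as a small helper. This is a first-order sketch that assumes identical channel lengths, treats m series NMOS devices as one device of width $W_n/m$, and takes a single conducting PMOS as the worst case for the pull-up; the function names are hypothetical, and, as the text notes, real designs need somewhat larger devices to compensate for the extra output parasitic capacitance.

```python
def series_beta(betas):
    """Effective gain factor of series-connected transistors."""
    return 1.0 / sum(1.0 / b for b in betas)


def nand_first_order_sizes(m, w_n_inverter, w_p_inverter):
    """Lower-bound widths (W_n, W_p) for an m-input NAND that matches the
    drive of a reference inverter, ignoring the extra parasitic capacitance."""
    if m < 1:
        raise ValueError("m must be at least 1")
    return m * w_n_inverter, w_p_inverter


if __name__ == "__main__":
    # Reference inverter with W_p = 2 W_n = 20 um (illustrative sizes).
    w_n, w_p = nand_first_order_sizes(3, 10.0, 20.0)
    print(f"3-input NAND: W_n >= {w_n} um per series device, W_p >= {w_p} um")
    print("effective series beta for three unit devices:", series_beta([1.0, 1.0, 1.0]))
```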
Note that by circuit simulation we can properly size the transistors. Moreover, it should be noted that the back-gate bias effect has to be taken into consideration in the design of the series NMOS devices in a NAND gate (or series PMOS in a NOR). The series-connected MOSFETs, during switching, exhibit a threshold voltage increase due to a non-null source-substrate voltage, as shown in the simulation example of Fig. 4.22. In Fig. 4.22(a), the transistor $N_1$ of the first NAND3 gate, near the output out1, is driven by the latest signal because $N_2$ and $N_3$ are already ON. Therefore, the node $a_1$ is at the ground level and the source of the transistor $N_1$ is not subject to the body effect. In the other NAND3 gate, the transistors $N_4$ and $N_5$ are ON, while $N_6$ receives the input signal. In this case, the nodes $a_2$ and $b_2$ are at a certain voltage level. Hence, during the discharging period the transistors $N_4$ and $N_5$ are subject to the body effect. This effect slows the discharge of the output, as shown in Fig. 4.22(b). The output out1 is discharged more rapidly than the output out2. One way to reduce the body effect at the logic level is to put the transistor driven by the latest arriving signal near the output. The early arriving signals should be used to discharge the nodes susceptible to the body effect. For example, in an adder circuit, the transistor driven by the carry is placed near the output. Let us derive the output parasitic capacitance of the m-input NAND gate and compare it to that of the CMOS inverter of Fig. 4.21(b). We have
$$C_p = m W_p C_{gdo,p} + W_n C_{gdo,n} + m C_{jp} + C_{jn} \qquad (4.76)$$
The $C_{gd}$ of the m-input gate is larger than that of the CMOS inverter by the ratio $W_p/W_{pi}$. From the above equation it is obvious that $C_p$ of the m-input NAND gate is larger than that of the CMOS inverter. Note that for the same performance and for the same number of inputs, the NAND gate consumes less silicon area than a NOR gate because of the smaller area taken by the NMOS devices. Hence, CMOS NAND gates are more widely used than NOR gates. Moreover, the NOR gate consumes more power than the NAND gate.
4.5.2
The strategy used to build NAND/NOR gates can be extended to build more complex logic gates. Complex logic functions can be realized by connecting several NAND, NOR and INVERTER gates. However, they can also be efficiently realized using a single CMOS logic gate. Any complex CMOS gate is formed by two logic blocks, N and P, as shown in Fig. 4.23(a). The two blocks have the same number of transistors. Fig. 4.23(b) shows a three-input complex CMOS gate and its logic equivalent symbol. The topology of block N is the dual of block P, i.e., parallel connections become series and vice versa. In either the P or the N logic block, the parallel combination is placed far from the output to minimize the output capacitance and hence improve the speed and possibly the dynamic power dissipation. For example, the contribution of the N block to the output capacitance in Fig. 4.23(b) is less than that of Fig. 4.23(c). There is no direct DC path between $V_{DD}$ and ground for any of the logic input combinations. In practice, complex CMOS gates are used for a maximum fan-in of 5-6.
Figure 4.23  CMOS complex gate
4.5.3
So far, we have discussed the dynamic power dissipation of an inverter due to the load capacitance. What about a CMOS complex gate driving a load capacitance? The dynamic power dissipation has two components in a complex gate: the internal cell power, $P_{internal}$, and the capacitive load power. The internal cell power consists of the power dissipated by the internal capacitive nodes. Sometimes the internal short-circuit power is added to the internal cell dynamic power. The dynamic power of a complex gate cannot be estimated by the simple expression $C_L V_{DD}^2 f$, because the gate might not always switch when the clock is switching. The switching activity determines how often this switching occurs on a capacitive node. For N periods of $0 \to V_{DD}$ and $V_{DD} \to 0$ transitions, the switching activity $\alpha$ determines how many $0 \to V_{DD}$ transitions occur at the output. In other words, the activity $\alpha$ represents the probability that a transition $0 \to V_{DD}$ will occur during the period $T = 1/f$, where f is the frequency of the inputs of the gate. The average dynamic power of a complex gate due to the output load capacitance is
$$P_d = \alpha C_L V_{DD}^2 f \qquad (4.77)$$
The internal power dissipation, due to the internal capacitive nodes, can be characterized by simulation. Fig. 4.24 illustrates an example of a complex gate with internal nodes. The internal dynamic power of a cell is given by
$$P_{internal} = \sum_{i=1}^{n} \alpha_i C_i V_i V_{DD} f \qquad (4.78)$$
where n is the number of internal nodes, $\alpha_i$ is the switching activity of each node i, $C_i$ is the parasitic capacitance of the internal node, and $V_i$ is the internal voltage swing of each node i. The parasitic capacitance at the output is included with the load $C_L$. Note that the internal voltage swing can be different from $V_{DD}$.
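The two contributions can be combined in a short calculation. The sketch below evaluates the load term $\alpha C_L V_{DD}^2 f$ and the internal-node sum $\sum_i \alpha_i C_i V_i V_{DD} f$; the node list, activities, capacitances and swings are hypothetical values chosen only to exercise the formulas.

```python
def load_power(alpha, c_load, vdd, freq):
    """Dynamic power on the output load: alpha * C_L * VDD^2 * f."""
    return alpha * c_load * vdd ** 2 * freq


def internal_power(nodes, vdd, freq):
    """Internal dynamic power: sum of alpha_i * C_i * V_i * VDD * f.

    nodes is an iterable of (alpha_i, c_i, v_swing_i) tuples.
    """
    return sum(alpha * cap * swing * vdd * freq for alpha, cap, swing in nodes)


if __name__ == "__main__":
    VDD, FREQ = 3.3, 50e6                      # assumed operating point
    # Hypothetical internal nodes: (activity, capacitance in F, swing in V).
    internal_nodes = [(0.10, 15e-15, 2.5), (0.05, 10e-15, 2.5)]
    p_load = load_power(0.1875, 100e-15, VDD, FREQ)
    p_int = internal_power(internal_nodes, VDD, FREQ)
    print(f"load power     : {p_load * 1e6:.3f} uW")
    print(f"internal power : {p_int * 1e6:.3f} uW")
```

Note that the internal swing in the example is below $V_{DD}$, as can happen for nodes charged through NMOS devices.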
... in the next sections. First we consider the case of a NOR gate. Then we treat several static gates. Table 4.3 illustrates the truth table of the NOR gate. From the table, the probability that the output is at zero is 3/4 and that it is at one is 1/4. The probability of a $0 \to V_{DD}$ transition is computed by multiplying the probability that the output will be at zero, $P_0$, by the probability that it will be at one, $P_1$:
$$P_{NOR} = P_0 \cdot P_1 = \frac{3}{4} \times \frac{1}{4} = \frac{3}{16} \qquad (4.79)$$
We assume that the inputs are uniformly distributed (i.e., the probabilities P(A=1) = P(B=1) = 1/2). In general, the transition probability is given by
$$P_{0 \to 1} = P_0 P_1 = P_0 (1 - P_0) \qquad (4.80)$$
where $P_0$ is computed by dividing the number of zeros in the output column of the truth table by the total number of input combinations ($N = 2^n$ for an n-input gate) and $P_1$ is computed by dividing the number of ones by N. $P_0$ is also equal to $(1 - P_1)$. Fig. 4.25 shows the probability that the output makes a $0 \to 1$ transition for several static gates. The inputs are assumed to be uniformly distributed.
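The counting procedure can be automated by enumerating the truth table. The sketch below assumes uniformly distributed, uncorrelated inputs, as in the text, and uses exact fractions; the helper name and the gate definitions are illustrative.

```python
from fractions import Fraction
from itertools import product


def transition_probability(gate, n_inputs):
    """Probability of a 0 -> 1 output transition, P0 * P1, obtained by
    counting ones in the truth table of `gate` (uniform, independent inputs)."""
    outputs = [gate(*bits) for bits in product((0, 1), repeat=n_inputs)]
    ones = sum(1 for value in outputs if value)
    p_one = Fraction(ones, len(outputs))
    return (1 - p_one) * p_one


if __name__ == "__main__":
    gates = {
        "NOR2": (lambda a, b: not (a or b), 2),
        "NAND2": (lambda a, b: not (a and b), 2),
        "XOR2": (lambda a, b: a != b, 2),
        "AND6": (lambda *bits: all(bits), 6),
    }
    for name, (func, n) in gates.items():
        print(f"{name:6s} P(0->1) = {transition_probability(func, n)}")
```

The two-input NOR gives 3/16, in agreement with Equation (4.79), and a 6-input AND gives 63/4096.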
[Figure 4.25: probability of a 0 → 1 output transition, P(0→1), for several static gates.]
4.5.4.1 Example
As an example of a logic decision for low power, consider the different implementations of a 6-input AND gate driving a 0.1 pF load. As shown in Fig. 4.26, we may compare the following implementations:
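• Implementation 1: a 6-input NAND followed by an inverter;
• Implementation 2: two 3-input NANDs followed by a 2-input NOR; and
• Implementation 3: three 2-input NANDs followed by a 3-input NOR.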
The library used for such a comparison is a high-performance standard cell library optimized for speed. Table 4.4 shows some characteristics of the library. The average delay reported is the average value of the rise and fall delay times. $W_p = 2W_n = 10\ \mu m$ is set for all the transistors composing the different gates. The delay is a function of the output load capacitance $C_L$ in pF (this capacitance does not include the output parasitic one). The area is a function of a unit area called cell grid. Each unit area for a cell has a certain height and width. Also included in this table are the input capacitance of a gate and the output parasitic capacitance in fF. We make, for this example, the following assumptions:
[Figure 4.26: implementations of the 6-input AND gate; the output transition probability is P = 63/4096.]
Table 4.4
Gate type   Area (cell unit)   Average delay (ns)   Output cap. (fF)   Input cap. (fF)
INV         2                  0.22 + 1.00 C_L      85                 48
NAND2       3                  0.30 + 1.24 C_L      105                48
NAND3       4                  0.37 + 1.50 C_L      132                48
NAND6       7                  0.65 + 2.30 C_L      200                48
NOR2        3                  0.27 + 1.50 C_L      101                48
NOR3        4                  0.31 + 2.00 C_L      117                48
First we compare the delay and the area of the different implementations. Using the data of Table 4.4, the results are reported in Table 4.5. The delay may be computed or simulated by SPICE, as illustrated in Table 4.5. Implementations 2 and 3 offer the best speed compared to the first one. However, they require more area.
Table 4.5
Implementation   Area (cell unit)   Computed delay (ns)   SPICE delay (ns)
Implem. 1        9                  1.1                   1.1
Implem. 2        11                 0.86                  0.85
Implem. 3        13                 0.87                  0.83
Let us now compare the power dissipation using the power cost function. It is defined by

$$\text{Power cost} = \sum_i P_{0\rightarrow 1,i}\, C_i \qquad (4.86)$$

where $P_{0\rightarrow 1,i}$ is the probability of a $0 \rightarrow 1$ transition at each node $i$ and $C_i$ is the total capacitance at each node $i$. We assume that the inputs A, B, C, D, E, and F are uncorrelated and random (i.e., $P_1 = 0.5$). For the implementations of Fig. 4.26, we compute the transition probabilities. Table 4.6 summarizes the procedure for computing the probabilities of the different nodes in the circuit.
Table 4.6
Implementation 1
Node      P1       P0 = 1 - P1   P(0→1)
O1        63/64    1/64          63/4096
Output    1/64     63/64         63/4096

Implementation 2
Node      P1       P0 = 1 - P1   P(0→1)
O1        7/8      1/8           7/64
O2        7/8      1/8           7/64
Output    1/64     63/64         63/4096
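The power cost of Eq. (4.86) can be evaluated with a short Python sketch. Two things here are assumptions rather than statements from the text: the gate topologies (6-input NAND plus inverter; two 3-input NANDs plus a 2-input NOR; three 2-input NANDs plus a 3-input NOR) and the node capacitances, which are taken as our reading of Table 4.4 (output parasitic of the driving gate, plus 48 fF for each driven input, plus the 100 fF load on the final output). Primary inputs are not included.

```python
from fractions import Fraction
from math import prod

HALF = Fraction(1, 2)  # P1 of each uncorrelated, random primary input


def nand(*p1_inputs):
    """P1 at a NAND output, assuming independent inputs."""
    return 1 - prod(p1_inputs)


def nor(*p1_inputs):
    """P1 at a NOR output, assuming independent inputs."""
    return prod(1 - p for p in p1_inputs)


def p01(p1):
    """Zero-delay transition probability P0 * P1."""
    return (1 - p1) * p1


def power_cost(nodes):
    """Eq. (4.86): nodes is a list of (P1, capacitance in fF) pairs."""
    return sum(p01(p1) * cap for p1, cap in nodes)


LOAD = 100  # fF, the 0.1 pF output load
C_IN = 48   # fF, assumed input capacitance of every gate

# Implementation 1 (assumed): NAND6 -> INV
o1 = nand(*[HALF] * 6)
cost1 = power_cost([(o1, 200 + C_IN), (1 - o1, 85 + LOAD)])

# Implementation 2 (assumed): two NAND3 -> NOR2
n3 = nand(*[HALF] * 3)
cost2 = power_cost([(n3, 132 + C_IN)] * 2 + [(nor(n3, n3), 101 + LOAD)])

# Implementation 3 (assumed): three NAND2 -> NOR3
n2 = nand(HALF, HALF)
cost3 = power_cost([(n2, 105 + C_IN)] * 3 + [(nor(n2, n2, n2), 117 + LOAD)])

for name, cost in (("1", cost1), ("2", cost2), ("3", cost3)):
    print(f"Implementation {name}: power cost = {float(cost):.1f}")
```

Under these assumptions the ordering agrees with the conclusion in the text: the implementation whose internal node switches least has the lowest power cost.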
Note that the node O1, in implementation 1, has a lower switching activity compared to the other two. To compute the power cost function we do not include the primary inputs. Table 4.7 illustrates the results of this calculation. The results indicate that implementation 1 has the lowest power. So technology mapping is important for low-power applications.

We consider now another example using a low-area 0.8 µm CMOS standard cell library for the 6-input AND implementation. Some characteristics of this library are shown in Table 4.8. Compared to the library presented in Table 4.4, this library uses small transistors with $W_p = W_n = 4\ \mu m$. Compared to the case of the high-performance library, the cell area unit, in the low-area case, is smaller by a factor of 1.5. Note that the delays of the different gates are higher. However, the input gate and output parasitic capacitances are lower. Thus, this library can be used for low-power function implementation.
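Table 4.8 Characteristics of a low-area 0.8 µm CMOS library
Gate type   Area (cell unit)   Average delay (ns)   Output cap. (fF)   Input cap. (fF)
INV         ...                0.23 + 3.73 C_L      35                 13
NAND2       3                  ...                  60                 13
NAND3       4                  ...                  65                 13
NAND6       7                  ... + 8.84 C_L       81                 13
NOR2        3                  ...                  62                 13
NOR3        4                  ...                  69                 13

Table 4.9
Implem. 3   43.7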
The delays reported in Table 4.8 do not include the effect of the input voltage slope. The delay of the different implementations was simulated with SPICE and it is almost the same for all the configurations. The delay is 1.5 ns. Using the same reasoning discussed earlier, we can compute the power cost function using this library. The transition probabilities are the same; only the total node capacitances are different. The results of the power cost evaluation are illustrated in Table 4.9. The power cost, in the case of the low-power library, is almost half of that of the high-performance one. Still, implementation 1 has a low-power characteristic while the speed is almost the same compared to the others. The area is also lower than that of the other implementations. This example shows that the power dissipation can be reduced at the gate level. Even if we take into account the wire capacitances between the cells, the conclusion is still valid. The topic of low power at the gate level is discussed further in Chapter 8. Keep in mind that, in this comparison, the internal power of the gates has not been considered.
4.5.5
Glitching Power
Note that in the probability discussed so far, we assumed that the gates had zero delay. In that case, we are not taking into account the glitches and we consider only the transitions between stable states. Glitches must be considered if we assume non-zero delay at gates. Thus the total dynamic power of a circuit is the sum of the dynamic power with zero delays and the glitching power. So what is the glitching phenomenon?
In a static logic gate, the output or internal nodes can switch before the correct logical value is stable. To illustrate this spurious transition, Fig. 4.27 shows an example of a circuit with a cascaded configuration. When the inputs ABC make the transition 100 → 111, the output, with zero-delay gates, should stay high. However, considering a unit delay for each gate, the output O1 is delayed compared to the input C, which causes the output Z to evaluate with the new value of C and the old value of O1. In that case, the output experiences a dynamic hazard (glitch). This transition increases the dynamic power of the circuit and adds a dynamic component to the switching activity.
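The glitch mechanism can be reproduced with a small unit-delay simulation in Python. The circuit used here is an assumption made for illustration (O1 = NAND(A, B), Z = NAND(O1, C)); it is not necessarily the circuit of Fig. 4.27, but it shows the same behavior for the transition ABC = 100 → 111.

```python
def nand(a, b):
    return int(not (a and b))


def simulate_unit_delay(a, b, c, o1, z, steps=3):
    """Unit-delay simulation of the assumed cascade O1 = NAND(A, B),
    Z = NAND(O1, C). Each gate output at time t+1 is computed from the
    signal values present at time t."""
    trace = [(o1, z)]
    for _ in range(steps):
        # The right-hand side is evaluated with the old value of o1,
        # so Z sees O1 one time unit late.
        o1, z = nand(a, b), nand(o1, c)
        trace.append((o1, z))
    return trace


# Steady state for ABC = 100: O1 = 1 and Z = 1.
# At t = 0 the inputs switch to ABC = 111.
trace = simulate_unit_delay(a=1, b=1, c=1, o1=1, z=1)
z_wave = [z for _, z in trace]
print(z_wave)  # [1, 0, 1, 1]
```

With zero-delay gates Z would remain at 1. The unit-delay model produces a spurious 1 → 0 → 1 pulse, that is, one extra 0 → 1 transition that charges the output capacitance.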
Another example is shown in Fig. 4.28(a). The cascaded circuit exhibits a glitching problem. However, the same function can be implemented using a balanced-delay implementation as shown in Fig. 4.28(b). These are some rules to avoid this problem:
• Balance delay paths, particularly on highly loaded nodes;
• Insert, if possible, buffers to equalize the fast path; and
To do a layout of a complex gate (i.e., several tens of transistors), the following general layout guidelines can be used:
• Run $V_{DD}$ and $V_{SS}$ in metal (1 or 2) horizontally. For example, $V_{DD}$ at the top and $V_{SS}$ at the bottom of the cell in semi-rectangular form;
• Define the orientation of the polysilicon gate lines and order them for maximum active area crossover to form the gate regions;
• Place the N-block (NMOS transistors) near $V_{SS}$ and the P-block (PMOS transistors) near $V_{DD}$. The PMOS devices should be located in a common N-well if they use the same bulk potential;
• Adhere to the design rules and use, if possible, an interactive DRC (Design Rule Checker);
• Keep the internal junction and wire capacitances to the minimum to minimize the power and the delay; and
• Complete the connection of the different nodes inside the cell using the different layers available (metal1, poly, etc.).
Note that the power line widths are drawn taking into consideration the current consumed by the cell, because the electromigration phenomenon sets the minimum width of conductors.
For low-power design, these are some layout guidelines:
• Use for these high-activity nodes low-capacitance layers such as metal1, metal2, etc.;
• Keep the wires of high-activity nodes short;
NAND, NOR, XOR, AOI, OAI, latches, buffers, multiplexers, full-adder, flip-flops, etc.;
• Linear cells: low-battery detector, power-up reset, etc.;
• MSI/LSI functions: ALU (Arithmetic and Logic Unit), counters, magnitude comparators, etc.;
• Compiled macrocells: register file, FIFO (First In First Out), ROM
A circuit is designed by capturing the schematic or the functional model (VHDL, Verilog, etc.) of the cells. The layout is generated by automatic placement and routing. An example of a CMOS standard cell library can be found in [10]. In the standard cell approach, the logic cells have the same height and the width is variable. In many libraries, the cells are available in two layout styles. In the area-optimized style, the cells are made as small as possible. In the performance-optimized style, cells are optimized for high-speed performance and, as a result, occupy more area than the small cells. Even the height of the cells in the two styles is different. A typical standard cell layout for a NAND gate is shown in Fig. 4.32. This methodology provides lower cost and higher productivity than the full-custom one. For low-power applications, the small and large cells for the same function can be carefully chosen to optimize the power in a complex design without degrading the timing requirement.

The third layout methodology is the gate array. (One recent gate array architecture was based on multiplexers with small-size transistors to maintain low-power characteristics [11].) Gate arrays consist of preimplemented cells and need only the personalization steps. Fig. 4.33 illustrates an example of a gate-array core using a Sea-Of-Gates structure. It consists of I/O and internal cell areas. The I/O cell area contains pads with input/output buffers. The internal cell array contains a continuous array of NMOS and PMOS transistors. Hence, the transistors and interconnects are already predefined. The design of a logic gate consists of wiring the different transistors using metallization and contacts. The isolation of a logic gate is performed by tying the polysilicon gates of the limiting transistors to $V_{SS}$ or $V_{DD}$, depending on the type of gate diffusion. Routing channels are routed over unused transistors. This methodology permits the reduction of the design cost at the expense of area, power and performance.
[Figure: cell layout with labels $V_{DD}$ (metal), P-diffusion, polysilicon gates, N-diffusion, and $V_{SS}$ (metal).]
Comparing these layout approaches, the full-custom methodology offers the best approach to minimize the power dissipation. However, for a complex design, it is costly to use such a design strategy. The standard cell approach provides good performance and an improved design time. However, in many libraries the devices are oversized for performance purposes and, consequently, the power dissipation would be high. To efficiently use the standard cell technique for low-power applications, the library should be expanded to include several versions of the same function with different driving capabilities. In that case, powerful synthesis tools are needed to optimize the power while maintaining the timing specifications. Moreover, both the standard cell and gate array styles require new place and route tools for low-power design.
4.5.8
Another alternative to CMOS static complementary logic is the conventional pass-transistor logic based on MOS switches. Fig. 4.34 shows a CMOS transmission gate (TG) as the primitive element. It consists of a complementary pair connected in parallel. It acts as a switch, with the logic variable A as the control input. If A is low, the gate is OFF and presents a high resistance between the terminals. If A is high, the gate is ON and acts as a switch with the on-resistances $R_n$ and $R_p$ in parallel. The equivalent resistance of the TG is $R_{TG} = R_n \parallel R_p$. This resistance is always less than the smaller of $R_n$ and $R_p$. This permits a fast switching characteristic. When the input I is at $V_{DD}$, the output F is quickly charged initially by the NMOS, then at the end by the PMOS transistor, as illustrated by the equivalent resistances of Fig. 4.35. In this figure, we assume that at $V_{out} = 0$, $A$ and $\bar{A}$ are set to their final values. During this transient switching phase the NMOS is subject to the body effect while the PMOS is not. When a zero, at the input I, is to be transmitted, the PMOS is subject to the body effect. The PMOS and NMOS transistors should be sized such that they charge and discharge the output symmetrically. If $V_{Tn} = |V_{Tp}|$ and the body effect is symmetrical, then we can size the devices such that $\beta_n = \beta_p$. Sometimes, equal-sized NMOS and PMOS devices can be used. It is easy to see that the delay of the TG gate is approximately independent of the input level. This is not the case if the pass logic uses a single-channel transistor. A drawback of the CMOS TG is that it consumes more area than a single-channel transmission gate (NMOS TG or PMOS TG). Thus, if the area is of prime concern, NMOS TGs are used.

[Figure 4.35: equivalent resistances of the TG versus time while charging the output, with the regions where the NMOS and the PMOS are ON.]

Any CMOS TG logic function (we call it here conventional pass-transistor logic) can be implemented using the TG primitive element described above. In such an implementation the transistor count, hence the silicon area, is low compared to a standard static CMOS implementation. This is highlighted in the implementation of such functions as multiplexing, demultiplexing, decoding and addition. Fig. 4.36 shows a 4:1 multiplexer, where the data lines A, B, C and D are controlled by $S_1$ and $S_2$ such that
$$F = \ldots + D \cdot S_1 \cdot S_2 \qquad (4.87)$$
This form of logic is used when the inputs and their logic complements are available. The implementation does not need $V_{DD}$ or ground lines. However, the implementation suffers from a number of drawbacks: the driving capability of the circuit is limited and the delay increases with long TG chains. Moreover, the circuit does not provide a restoration of the logic levels, i.e., the logic gates are passive with no gain elements. Fig. 4.37 shows an example of how to restore the voltage levels in chained TGs. When 8 TGs are put in series, the output signal changes very slowly. However, when an inverter stage is added every 4 TG stages, the level is restored, as shown in the SPICE voltage waveforms of Fig. 4.37.

The CMOS TG logic can be used in CMOS circuit design, offering an extra degree of circuit design freedom. An example is the full-adder. The adder circuits will be discussed in detail in Chapter 7. Fig. 4.38 shows the schematic of the XOR gate which is used by the adder. When the input $A$ is low, $\bar{A}$ is high. The transmission gate TG is closed, so the output is equal to $B$. When $A$ is high, $\bar{A}$ is low. The inverter formed by the transistors N and P is enabled, so the output is equal to $\bar{B}$. The TG gate is open in this case. To implement an adder, let us first review its functions. The Boolean functions of a full-adder are:

$$S_{out} = A \oplus B \oplus C_{in} \qquad (4.88)$$

$$C_{out} = A \cdot B + C_{in}(A + B) \qquad (4.89)$$

$A$ and $B$ are the inputs, $C_{in}$ is the carry input, $S_{out}$ is the sum output, and $C_{out}$ is the carry output. The truth table of an adder is shown in Table 4.10.
The CMOS implementation of a one-bit full-adder is shown in Fig. 4.39(a). It requires 28 transistors and has two gate delays. In this circuit the transistors controlled by the carry signal $C_{in}$ should be placed close to the output. This will offset the body effect problem, since the carry is the latest arriving signal. An optimized implementation of the full-adder is shown in Fig. 4.39(b). It uses only 18 transistors and is based on the XOR function shown in Fig. 4.38 and the TG gates. Hence, this adder is more compact and faster, and consumes less power, than the complementary static one.
Figure 4.38 TG XOR gate.
Table 4.10 Adder Truth Table
A   B   C_in   S_out   C_out
0   0   0      0       0
0   0   1      1       0
0   1   0      1       0
0   1   1      0       1
1   0   0      1       0
1   0   1      0       1
1   1   0      0       1
1   1   1      1       1
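As a quick cross-check of Eqs. (4.88) and (4.89) against the truth table, the following Python sketch evaluates both Boolean functions for all eight input combinations and verifies that they agree with ordinary binary addition. The function name `full_adder` is an illustrative choice.

```python
from itertools import product


def full_adder(a, b, c_in):
    """One-bit full-adder using Eqs. (4.88) and (4.89)."""
    s_out = a ^ b ^ c_in                # sum: A xor B xor Cin
    c_out = (a & b) | (c_in & (a | b))  # carry: A.B + Cin.(A + B)
    return s_out, c_out


rows = []
for a, b, c_in in product((0, 1), repeat=3):
    s_out, c_out = full_adder(a, b, c_in)
    # The pair (c_out, s_out) must encode the arithmetic sum a + b + c_in.
    assert 2 * c_out + s_out == a + b + c_in
    rows.append((a, b, c_in, s_out, c_out))
    print(a, b, c_in, s_out, c_out)
```

The printed rows are the adder truth table in the same column order (A, B, carry in, sum, carry out).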
4.5.9
Fig. 4.40 shows a cross-coupled CMOS static latch. In the storage mode (input LD = 0), when the node A is high, B is low, P1 and N2 are ON while P2 and N1 are OFF. Similarly, when A is low, B is high, P1 and N2 are OFF while P2 and N1 are ON. The standby power dissipation of the cell is very small. The state of the latch is changed by turning the two transmission gates ON (LD high) and applying the input and its complement.
[Figure 4.40: cross-coupled CMOS static latch.]
Thus $V_{OL}$ depends strongly on the ratio $\beta_p/\beta_n$. For example, if we need $V_{OL} = 0.04 V_{DD}$ and $V_T = 0.2 V_{DD}$, then the ratio $\beta_p/\beta_n$ should be no larger than about 0.1. If the NMOS transistor is minimum size, the PMOS should be weak to provide adequate noise margins (low $V_{OL}$). In this case, the rise time of the gate is too slow. If we improve the rise time, the ratio condition tends to increase the gate area and hence the input capacitance. Although this circuit offers a reduction in total transistor count and ease of layout, it has the disadvantage of non-zero static power dissipation. Since the pull-up PMOS is always ON, a current flows from $V_{DD}$ to ground whenever the pull-down section of the pseudo-NMOS is turned ON. This current is the source of the static power dissipation. When a pseudo-NMOS gate, with output at $V_{OL}$, is driving another one, the driven gate, with its pull-down section OFF, leaks a high subthreshold current, but this current is still lower than the one that flows when the pull-down is ON. For an n-input pseudo-NMOS gate there are (n+1) transistors. Fig. 4.42 illustrates an example of a complex gate implemented in pseudo-NMOS style. This logic has been used in many applications such as decoding logic for memories and PLAs. Because of its high static power, it is not suitable for low-power applications.
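The numerical example above ($V_{OL} = 0.04 V_{DD}$, $V_T = 0.2 V_{DD}$) can be checked with a short Python sketch. The expression for $V_{OL}$ is not reproduced in this part of the text, so the model below is an assumption: a long-channel square-law model with the NMOS driver in the linear region, the PMOS load in saturation, and equal threshold magnitudes for both devices.

```python
def beta_ratio_for_vol(vol, vt, vdd=1.0):
    """Return beta_p / beta_n that yields the output-low level vol.

    Assumed current balance (square-law model, equal |VT|):
        beta_n * ((vdd - vt) * vol - vol**2 / 2) = (beta_p / 2) * (vdd - vt)**2
    All voltages are expressed in units of vdd when vdd = 1.0.
    """
    nmos_term = (vdd - vt) * vol - vol ** 2 / 2  # linear-region NMOS driver
    pmos_term = (vdd - vt) ** 2 / 2              # saturated PMOS load
    return nmos_term / pmos_term


ratio = beta_ratio_for_vol(vol=0.04, vt=0.2)
print(f"beta_p / beta_n = {ratio:.4f}")
```

Under these assumptions the ratio comes out just below 0.1, which is consistent with the value quoted in the text: a larger ratio (a stronger PMOS) would raise $V_{OL}$ above the target.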
4.6.2
Figure 4.42 Pseudo-NMOS complex logic gate.

To reduce the area and improve the speed of CMOS circuits, another popular style called dynamic logic is used. Fig. 4.43 shows a dynamic CMOS gate. This logic is referred to as domino CMOS logic [13]. The domino gate shown in Fig. 4.43(a) consists of a dynamic CMOS circuit followed by a static CMOS buffer. The dynamic circuit consists of a PMOS precharge transistor P1, an evaluation NMOS transistor N1, a storage capacitor C, and an N-logic block, which is a series-parallel combination of NMOS transistors activated by the inputs and implementing the required logic. The storage capacitance represents the parasitic capacitance at node A. This circuit uses a single clock phase clk. During the precharge phase (clk = 0), the storage capacitance is charged through the PMOS pull-up P1 to $V_{DD}$ and the inputs have no effect since there is no path to ground. The output of the buffer is precharged to ground. During the evaluation phase (clk = 1), N1 is ON and, depending on the logic performed by the N-logic block, the node A is either discharged or stays precharged. Fig. 4.43(b) shows an example of a complex gate. In a cascaded set of domino logic stages, as shown in Fig. 4.44, the first stage evaluates and causes the next one to evaluate (like falling dominoes). The number of cascaded stages is limited by the evaluation clock phase. Compared to pseudo-NMOS, domino logic has the same input capacitance and improved rise time. However, the fall time is affected since there is one more transistor in the pull-down section. Also, the gate is suitable for high-fanout operation because of the CMOS buffer. Moreover, it is efficient in area for high fan-in because n+4 transistors are required compared to 2n for a CMOS static gate.
Figure 4.44 Domino logic chain.
The domino gate has a problem called charge sharing or redistribution. Fig. 4.45 gives an example to explain this problem. During the precharge, the node A is at $V_{DD}$ and the charge $C V_{DD}$ is stored on the capacitance $C$. We assume (worst case) that the parasitic capacitances of nodes B and C, $C_1$ and $C_2$ respectively, have zero charge. During the evaluation, the node A should stay at $V_{DD}$; however, due to $C_1$ and $C_2$, charge sharing takes place. Using the charge conservation principle before and after redistribution, we have

$$C V_{DD} = (C + C_1 + C_2) V_A \qquad (4.92)$$

$$V_A = \frac{C}{C + C_1 + C_2} V_{DD} \qquad (4.93)$$

If, for example, $C_1 = C_2 = 0.5C$, then this voltage would be $V_{DD}/2$. This voltage can alter the logic and cause the CMOS buffer to dissipate high static power.
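Eq. (4.93) is easy to explore numerically. The Python sketch below computes the voltage at node A after charge redistribution; the storage capacitance of 20 fF and the 3.3 V supply are hypothetical values used only for illustration.

```python
def shared_voltage(c_store, parasitics, vdd):
    """Eq. (4.93): V_A = C / (C + C1 + C2 + ...) * VDD, assuming the
    parasitic capacitances start with zero charge (worst case)."""
    return c_store / (c_store + sum(parasitics)) * vdd


C = 20e-15  # hypothetical storage capacitance in farads
VDD = 3.3   # hypothetical supply voltage in volts

va_half = shared_voltage(C, [0.5 * C, 0.5 * C], VDD)
va_small = shared_voltage(C, [0.1 * C, 0.1 * C], VDD)

print(f"C1 = C2 = 0.5C -> V_A = {va_half:.2f} V")
print(f"C1 = C2 = 0.1C -> V_A = {va_small:.2f} V")
```

The first case reproduces the $V_{DD}/2$ result of the text. The second shows that keeping the internal parasitics small relative to the storage capacitance limits the voltage drop.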
If the clock frequency is too low, the node A leaks the charge stored on C due to the leakage currents. The dynamic node can leak its charge in a time of a few hundred µs to a few ms, depending on the temperature, the storage capacitance and the leakage current. When using power-down techniques, the dynamic nodes should not be left floating for a long time. If the leakage is high with low-$V_T$ devices, the charge can be depleted in a time as low as 100 ns. This problem is similar to charge sharing. Fig. 4.46 shows two alternatives to solve the problems of charge sharing and leakage. In Fig. 4.46(a), a weak PMOS (low W/L) is added as a pull-up transistor. This circuit operates like pseudo-NMOS during the evaluation phase. Hence it consumes some static power. If the circuit operates at high frequency, the added weak PMOS has no role because it does not have enough time to operate. Note that this weak PMOS increases the output capacitance and thus slows this dynamic gate. To eliminate the DC path during evaluation, the gate of the weak PMOS can be driven from the output of the CMOS buffer, as shown in Fig. 4.46(b). This circuit adds another capacitance at the output of the inverter. A third alternative circuit, which solves only the problem of charge sharing, is shown in Fig. 4.47. In this circuit configuration, the intermediate nodes of the complex gate are precharged with additional precharge PMOS devices.

Figure 4.45 Charge sharing in CMOS logic.
Another limitation of the domino logic gate is that it implements only noninverting logic functions. However, this is not a serious limitation and can be overcome, if the need arises, by using CMOS static gates. The designer can mix both static and dynamic CMOS logic circuits in a given design to optimize the overall performance.
Historically, dynamic design styles have been devised for low-power characteristics because of the reduced device count. Moreover, dynamic gates do not experience the short-circuit power dissipation and glitching problems of static circuits. However, to drive the clocked transistors, a large clock distribution network is needed. This highly loaded network consumes a significant amount of dynamic power, particularly at high frequency of operation. The switching activities of dynamic gates are higher than those of static gates. In a dynamic gate the output makes a $0 \rightarrow 1$ transition during the precharge cycle only if the N-block discharges the output during the evaluation phase. Hence, the probability of a $0 \rightarrow 1$ transition is given by

$$P_{0\rightarrow 1} = P_0 \qquad (4.94)$$

where $P_0$ is the probability that the output is at "0". For a two-input NAND dynamic gate, the output has only one zero for 4 input states. So,

$$P_{0\rightarrow 1} = \frac{1}{2^2} = \frac{1}{4} \qquad (4.95)$$
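To make the comparison with static gates concrete, the following Python sketch computes both transition probabilities for a two-input NAND: $P_0 \cdot P_1$ for the static gate (the zero-delay estimate used earlier in this chapter) and $P_0$ for the dynamic gate, Eq. (4.94). Inputs are assumed independent and uniformly distributed; the helper name `output_probabilities` is illustrative.

```python
from fractions import Fraction
from itertools import product


def output_probabilities(gate, n_inputs):
    """Return (P0, P1) of a gate output for uniformly distributed inputs."""
    outputs = [gate(*bits) for bits in product((0, 1), repeat=n_inputs)]
    p1 = Fraction(sum(outputs), len(outputs))
    return 1 - p1, p1


def nand2(a, b):
    return int(not (a and b))


p0, p1 = output_probabilities(nand2, 2)
static_p01 = p0 * p1  # static gate: the output must be 0, then become 1
dynamic_p01 = p0      # dynamic gate: every discharge is followed by a precharge

print("static  NAND2:", static_p01)   # 3/16
print("dynamic NAND2:", dynamic_p01)  # 1/4
```

The dynamic gate switches more often than its static counterpart for the same input statistics, which is the point made in the text.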
Another refinement of the domino CMOS logic is shown in Fig. 4.48 [14], where the CMOS buffer is removed. N and P logic blocks are alternated and each drives the other. When clk is low (0), the first and third stages are precharged high and the second stage is precharged low.

Fig. 4.49 shows another NP domino logic called NORA (No Race) [15]. Two sections, clk and $\overline{clk}$, are shown in Fig. 4.49. It is constructed by cascading N and P blocks followed by a C²MOS (clocked CMOS) latch. CMOS buffers (inverters) are used to provide logic inversion. When clk = 1 (evaluation phase in section clk), the C²MOS latch (see the example of the DEC Alpha chip in Section 4.8.4) operates like an inverter. When clk = 0, the latch moves into the hold state because the output NMOS and PMOS transistors are OFF. In this case, the old data is latched at the output. This latch is used to avoid signal races. A NORA pipeline is shown in Fig. 4.50 and it consists of alternating clk and $\overline{clk}$ sections. Signal races do not occur in this structure because of the use of C²MOS. Another logic style has been proposed to overcome charge sharing by using additional clocking signals. It is called Zipper CMOS logic. For more details refer to [16].
Figure 4.48 NP domino logic.
An example of a pipelined full-adder (FA) NORA circuit is shown in Fig. 4.51. This cell can be used in many designs such as a pipelined multiplier. The output C²MOS latches can use only three transistors rather than four: one clocked transistor can be removed from each output C²MOS latch. The reason is that during the precharge phase (clk = 0), the output nodes A and B are set to ground and $V_{DD}$ respectively. Thus, the latch transistors driven by these nodes are turned OFF. Hence, the corresponding clocked transistors can be removed and the FA cell is still isolated from the other sections during precharge.
Figure 4.50 NORA pipelined logic.
logic and improved rise time. The power dissipated by this logic is high due to the high switching activity of the clock, even if the circuit is not used. However, power-down techniques can be used to control the clock of the logic. Using this style requires the designer to spend more design effort than with the static style to solve all the problems of dynamic logic such as charge sharing, clock skew, precharging, etc. Finally, we note that pass-transistor logic is very promising for high-performance low-voltage low-power applications.
Figure 4.52 Clock skew.
4.6.4
Clock skew is a critical design parameter in high-speed circuits. Fig. 4.52 shows the clock skew in single complementary-phase clock signals. If $\overline{clk}$ is generated from clk, clock skew is possible. The time skew is measured between the half-$V_{DD}$ points of the clk and $\overline{clk}$ signals. In the presence of clock skew, a glitch can be transmitted from one section to another, as illustrated in the example of Fig. 4.53(b). This structure contains one stage between the two C²MOS latches, and a glitch can be transmitted to the last C²MOS latch. The example of Fig. 4.53(c) does not have this problem. It has been shown that, to eliminate the signal race in N-P domino logic, an even number of inversions should be used between stages [17]. Moreover, the clock skew should be minimized to improve the speed of dynamic circuits. One possible solution for single complementary-phase clock generation, with minimal skew and process insensitivity, is the one shown in Fig. 4.54 [18]. The delays $clk_i \rightarrow clk$ and $clk_i \rightarrow \overline{clk}$ are equalized with special buffer sizing.
4.7
CLOCKING
One way to synchronize thousands of signals in a VLSI system is to employ a clocking strategy. The clock controls the flow of data in the digital system and reduces the complexity of design.
Figure 4.55 Clocked pipeline system.
Most VLSI processors are constructed using a set of functional blocks (ALU, shifter, register file, etc.) connected via pipeline registers, as shown in the example of Fig. 4.55. The clock signal can be split into one, two, three or four phases. Typically the phases are non-overlapping.
First we present the different storage elements (latches, registers), then we treat two clocking strategies, single-phase and two-phase, with emphasis on the former, which is usually the main option available in standard cell and gate-array approaches. The clock distribution issues are discussed in Section 4.9.4.
4.7.1
Storage Elements
There are many types of storage elements. Some of the ones used in VLSI design are the following:
4.7.1.1 D-Latch
Sometimes called a level-sensitive latch. Its operation is shown in Fig. 4.56. The output changes with the input when the clock is high (case of a positive level-sensitive latch). The D input must be stable within a time window around the positive transition of the clock (Fig. 4.57). The input data is passed to the output within a delay $t_d$. The time window is defined by two times, called setup time $t_s$ and hold time $t_h$. Setup time, $t_s$, is the time needed for the D input to be stable prior to the clock edge. More specifically, it is the delay between the input of the latch and the storage node. Hold time, $t_h$, is the time needed for the D input to be stable after the clock edge. This time relates to the delay between the clock input and the storage point.

There are a variety of implementations for this D-latch. Fig. 4.58 reviews some of the static versions. The circuit of Fig. 4.58(a) has a weak inverter used as a feedback path for the latch mode. The voltage at node A is not changed by noise or leakage because the feedback inverter keeps the level. The feedback inverter should have low (W/L) for NMOS and PMOS (weak inverter) compared to the transmission gate and forward inverter. This ensures that the transmission gate is capable of overdriving the feedback inverter when data is being written to the latch. The feedback inverter should be carefully sized to guarantee switching for all process corners and the maximum fanout condition.
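The setup and hold window described above can be expressed as a simple check. The Python sketch below flags a violation whenever a data transition falls inside the window that starts $t_s$ before the clock edge and ends $t_h$ after it. The function name and all timing values are hypothetical.

```python
def timing_violation(data_edges_ns, clock_edge_ns, t_setup_ns, t_hold_ns):
    """Return True if any data transition falls strictly inside the window
    (clock_edge - t_setup, clock_edge + t_hold)."""
    window_start = clock_edge_ns - t_setup_ns
    window_end = clock_edge_ns + t_hold_ns
    return any(window_start < t < window_end for t in data_edges_ns)


# Hypothetical timing values in nanoseconds.
CLOCK_EDGE = 10.0
T_SETUP = 0.5
T_HOLD = 0.2

early_ok = timing_violation([9.2], CLOCK_EDGE, T_SETUP, T_HOLD)
setup_bad = timing_violation([9.8], CLOCK_EDGE, T_SETUP, T_HOLD)
hold_bad = timing_violation([10.1], CLOCK_EDGE, T_SETUP, T_HOLD)

print(early_ok)   # False: data settled before the setup window
print(setup_bad)  # True: data changed during the setup time
print(hold_bad)   # True: data changed during the hold time
```

A real timing check would also account for clock skew and for the variation of $t_s$ and $t_h$ over process corners, which this sketch ignores.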
The problem of ratioed design in Fig. 4.58(a) can be avoided by using the modified version in Fig. 4.58(b), where a transmission gate is added in the feedback path. When clk = 1, the data is passed to the storage node and the feedback node is disconnected. When clk = 0, the feedback loop is closed, and the latch is in store (latch) mode. Fig. 4.58(c) shows another version of Fig. 4.58(b), where the outputs are buffered. This latter latch is found in the cell libraries of standard-cell and gate-array approaches. All these static latches store their state even if the clock is stopped. Note that these latches do not dissipate any DC power.
To reduce the size of the static latches, dynamic versions can be used, as illustrated in Fig. 4.59, Fig. 4.60 and Fig. 4.61. Fig. 4.59 shows a simple dynamic latch, where the storage node A temporarily stores the data. Note that latches have a property called "transparency": the output follows the input when the clock is asserted. Otherwise they are "opaque". Fig. 4.60 shows two other latches [19]. The circuit of Fig. 4.60(a) is transparent when the clock clk is high and latches the data (opaque) when the clock is low. This latch is positive level-sensitive. The negative level-sensitive version is shown in Fig. 4.60(b). Note that these latches use one clock line (clk). The circuits of Fig. 4.60 have reduced noise immunity. For example, for the circuit of Fig. 4.60(a), when the latch is opaque (clk = 0), the node A may be tristated high with Q tristated low. The node A is isolated and may be susceptible to noise, which reduces its voltage. The reduced voltage of node A can cause the PMOS P2 to leak current, thereby destroying the output Q. This problem was addressed with latches designed for the DEC Alpha microprocessor [21]. For example, the circuit of Fig. 4.61 is an improved version of that of Yuan and Svensson [19]. A weak PMOS device P3 is added to solve the problem of noise in the positive level-sensitive latch. The operation of this latch is as follows. When clk is high, P1, N1 and N3 function like an inverter. P2, N2 and N4 also function like an inverter. Therefore the latch passes the input D to the output Q. If D falls low, then A is high and Q is low. When clk is low, N3 and N4 are OFF. If D goes high, P1 is OFF, while the nodes A and Q are tristated high and low respectively. The added P3, in this case, is ON and holds P2 OFF. This device supplies current to node A and counters any noise.
For reliability reasons, many latches have been designed for the DEC Alpha chip [21]. Some are illustrated in Fig. 4.62. These latches have been designed for all process corners and circuit conditions (supply voltage, temperature, rise/fall times, etc.). The results showed no appreciable evidence of race-through for clk rise/fall times at or below 0.8 ns. With 1-ns rise/fall times, the latches showed some signs of failure. A rise/fall time of 0.5 ns was set for the clock in this chip.
$\overline{clk}$ locally, to reduce the clock skew problem. Clock skew, in a single-phase strategy, can lead to invalid data storage.
A dynamic version of the positive ETDFF is shown in Fig. 4.64 [19]. The operation of this circuit is illustrated by the voltage waveforms. The value of the hold time of this flip-flop is close to zero [20]. This dynamic flip-flop, compared to the static one, needs only 9 transistors and one clock line. The negative ETDFF is shown in Fig. 4.65.
4.7.1.3 Miscellaneous
Many other latches and flip-flops are available, for example in gate-array libraries, such as the JK flip-flop and the toggle (T) flip-flop. Fig. 4.66 shows the T flip-flop with reset control. When clk = 1, the output Q is complemented, whereas when clk = 0, Q keeps its old state. This T flip-flop provides divide-by-2 operation. A JK flip-flop is shown in Fig. 4.67. When the J and K inputs are low, the outputs are maintained on the positive edge of the clock. If J = 0 and K = 1, the output Q is set to 0, whereas when J = 1 and K = 0, the output Q is set to 1. When both J and K are high, the outputs are complemented.
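The behaviour described above can be summarized in a small behavioural model. This is only a sketch of the next-state rules stated in the text, not of the transistor-level circuits of Fig. 4.66 and Fig. 4.67; the function names are illustrative assumptions.

```python
def jk_next_state(q: int, j: int, k: int) -> int:
    """Return Q after a positive clock edge for a JK flip-flop."""
    if j == 0 and k == 0:
        return q        # hold: the outputs are maintained
    if j == 0 and k == 1:
        return 0        # reset
    if j == 1 and k == 0:
        return 1        # set
    return 1 - q        # J = K = 1: the output is complemented


def t_flip_flop_trace(n_cycles: int, q0: int = 0) -> list:
    """Q after each clock cycle of a T flip-flop that toggles once per cycle."""
    trace, q = [], q0
    for _ in range(n_cycles):
        q = 1 - q       # Q is complemented once per clock cycle
        trace.append(q)
    return trace
```

Because Q toggles once per clock cycle, it completes one full period every two clock cycles, which is the divide-by-2 behaviour mentioned for the T flip-flop.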
4.7.2 Single-Phase Clocking
A generic single-phase finite-state machine (FSM) is shown in Fig. 4.68. The storage element can be either a latch or a register (flip-flop). The latch case demands a more constrained design because of the transparency property of the latch. When the latch is transparent, the state signals can pass the logic block more than once during one clock cycle. To avoid a race condition in this FSM, the clock width (of transparency) has to satisfy a two-sided constraint [22]. Hence, single-phase clocking with latches, in the case of an FSM, is insidiously complex. To reduce the complexity of the timing constraints, single-phase ETDFFs can be used. The flip-flop is never transparent. At the clock edge, the state is stored and it cannot pass the logic more than once during one clock cycle. Designing and synchronizing VLSI circuits with ETDFFs is rather simple and straightforward, particularly when using static flip-flops. For high-speed CMOS applications it is necessary that the storage elements be carefully designed with minimum delay, setup time and clock skew. In this case, tristate dynamic latches can be used efficiently. Fig. 4.69 shows an example of using dynamic latches [21]. Notice that L1 and L2 are transparent latches separated by random logic and are not simultaneously active. When
Figure 4.67 JK flip-flop.
clk is high, L1 is transparent, whereas when clk is low, L2 is transparent. The minimum number of logic gates between latches can be zero and the maximum is constrained by the cycle time.
Fig. 4.70 shows another example of a single-phase system using ETDFFs. This system is edge based and the minimum cycle time is given by [22]

$$t_{cycle,min} = t_{ff,max} + t_{logic,max} + t_{setup,max} + t_{skew,max} \qquad (4.97)$$

where $t_{ff,max}$, $t_{logic,max}$, $t_{setup,max}$ and $t_{skew,max}$ are the worst-case delays of the flip-flop and the combinational logic block, the setup time and the clock skew, respectively. When designing with gate-array and/or standard-cell approaches, the single-phase clocking scheme using static ETDFFs is the only option available to the designer.
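As a quick illustration of the cycle-time budget of Equation (4.97), the sketch below adds the four worst-case terms and converts the result to a maximum clock frequency. The function names and the numerical values are hypothetical and are not taken from the text.

```python
def min_cycle_time(t_ff_max, t_logic_max, t_setup_max, t_skew_max):
    """Minimum clock period of an edge-triggered single-phase system (seconds)."""
    return t_ff_max + t_logic_max + t_setup_max + t_skew_max


def max_clock_frequency(t_ff_max, t_logic_max, t_setup_max, t_skew_max):
    """Highest clock frequency allowed by the cycle-time budget (Hz)."""
    return 1.0 / min_cycle_time(t_ff_max, t_logic_max, t_setup_max, t_skew_max)


# Hypothetical worst-case values: 0.5 ns flip-flop delay, 3.0 ns logic,
# 0.3 ns setup time and 0.2 ns clock skew.
t_min = min_cycle_time(0.5e-9, 3.0e-9, 0.3e-9, 0.2e-9)
f_max = max_clock_frequency(0.5e-9, 3.0e-9, 0.3e-9, 0.2e-9)
```

With these assumed numbers the budget is 4 ns, i.e. a clock of about 250 MHz; any reduction of the skew term directly shortens the cycle.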
4.7.3 Two-Phase Clocking
The two-phase non-overlapping clocking strategy removes many constraints existing in the single-phase discipline. However, the use of two-phase (or multiple-phase) non-overlapping clock structures becomes more difficult as clock frequencies and chip size increase. This is because of the increase in clock skew and clock interconnect wiring. For high-speed applications, the single-phase strategy is preferred and tends to be widely used in many VLSI system designs. Fig. 4.71 shows an example of a two-phase non-overlapping clocking scheme. The first latch L1 is transparent when the clock clk1 is high, whereas L2 is transparent when clk2 is high. The example of Fig. 4.71 is not the only way to build a two-phase system. Latches can be replaced by two-phase master-slave flip-flops where the master latch is clocked by clk1 and the slave latch by clk2. This latter structure does not have the transparency property.
4.8.1 CPL
The main concept behind CPL is shown in the block diagram of Fig. 4.72. It consists of an NMOS pass-transistor logic network driven by two sets of complementary inputs and two CMOS inverters used as buffers.

Fig. 4.73 illustrates an example of an AND/NAND gate built in CPL logic. At the node Q, for example, we have

$$Q = A \cdot B + B \cdot \bar{B} = A \cdot B \qquad (4.98)$$

At the output of the corresponding inverter we have the NAND function. The NMOS pass-transistor logic network forms the pull-up and pull-down functions. When the inputs (A B) have the combination (11), the voltage of the node Q is given by

$$V_Q = V_{DD} - V_{Tn}(V_Q) \qquad (4.99)$$
where $V_{Tn}$ is the threshold voltage subject to the body effect. So the inverting buffers translate the swing of the output from (ground to $V_{DD} - V_{Tn}$) to a full-rail logic swing (ground to $V_{DD}$). The logic threshold voltage of the inverting buffers should be shifted to a voltage lower than $V_{DD}/2$. Hence the β ratio of the inverter in this case should be higher than unity. This inverting buffer also makes it possible to drive a large load capacitance efficiently. When the outputs of the logic networks are at $V_{DD} - V_{Tn}$, all the output inverters are driven by a reduced swing, as shown in Fig. 4.74. Hence, the DC power of the inverter increases because the pull-up PMOS device is not completely OFF. The $V_{GS}$ of the pull-up PMOS is equal to $-V_{Tn}$. Moreover, the drive capability of the pull-down NMOS transistor is reduced, particularly if the power supply voltage is reduced. The noise margins are also affected. To solve the problem of DC power dissipation we can design NMOS transistors with a lower VT than that of the PMOS transistor. Also, the body effect should be controlled. Another way to solve all the problems associated with the reduced high level is to add a PMOS latch to the CPL, as shown in the case of the AND/NAND circuit of Fig. 4.75. In this case, the two added PMOS transistors can be sized to be minimum, as long as the high level reaches VDD in the given cycle time. We call this style PMOS latch CPL. Careful design should be considered when the NMOS network has minimum-size devices. Otherwise the high level stored in the latch cannot be discharged.

Fig. 4.76 shows examples of CPL arrays for OR/NOR and XOR/XNOR functions. With only 4 transistors we can produce many two-input functions with their complement. More examples are shown in Fig. 4.77 for 3-input AND/NAND and OR/NOR gates. In these examples 8 NMOS transistors are needed to generate the 3-input functions. Any complex logic function can be constructed easily using this principle of NMOS network transistors. For example, the full-adder circuit can be constructed using wired CPL as shown in Fig. 4.78. The circuit is constructed using the basic CPL primitives discussed before.
Figure 4.77 CPL 3-input gates: (a) AND/NAND (ABC); (b) OR/NOR (A+B+C).
Also, the sizes of the transistors are shown in this figure for fast operation. The transistors of the NMOS network that are far from the output have a larger size than those closer to the output. This is because the NMOS devices closer to the output pass a reduced swing. The sizing of the transistors depends on the circuit type, layout and device parameters. Compared to a full-adder implemented in standard static CMOS style, the adder of Fig. 4.78 is much faster and dissipates less power due to the low internal swing. Also, the schematic of this CPL adder is structured, resulting in a simplified layout.
One drawback associated with CPL logic is its limited driving capability: the delay increases with long pass-transistor chains. So buffering is needed to restore the transmitted level and improve the driving capability.
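The CPL AND/NAND operation can be checked with a simple switch-level sketch. It assumes ideal NMOS switches that pass a low level unchanged and a high level degraded to VDD minus the body-affected threshold; the function name and the default voltage values are hypothetical.

```python
def cpl_and_nand(a: int, b: int, vdd: float = 3.3, vtn_body: float = 1.0) -> dict:
    """Switch-level model of a CPL AND/NAND gate with inverting output buffers."""
    # Network 1: B gates A onto node Q, not-B gates B onto node Q, so Q = A.B
    q = a if b == 1 else b
    # Network 2 (complementary inputs): B gates not-A, not-B gates not-B
    q_bar = (1 - a) if b == 1 else (1 - b)
    # An NMOS pass transistor cannot pull a node above VDD - VTn
    high_level = vdd - vtn_body
    return {
        "v_q": high_level if q else 0.0,
        "v_q_bar": high_level if q_bar else 0.0,
        "nand_out": 1 - q,       # inverter driven by Q restores full swing
        "and_out": 1 - q_bar,    # inverter driven by the complementary node
    }
```

Enumerating the four input combinations shows that the internal nodes never exceed the reduced high level, while the buffered outputs are proper AND and NAND functions.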
4.8.2 DPL
DPL is a modified version of CPL suitable for low-voltage applications. It avoids the problems of CPL associated with the reduced high level. An example of an AND/NAND gate is illustrated in the schematic of Fig. 4.79. It consists of NMOS and PMOS pass transistors, in contrast to the CPL gate, where only NMOS devices are used. In the example of the AND/NAND gate, the NMOS transistors are used to pass the ground while the PMOS transistors are used to pass the high level (VDD). The output of the DPL gate has a full rail-to-rail swing owing to the addition of the PMOS devices. However, this addition results in increased input capacitance compared to CPL. This will not limit the performance of DPL, as will be explained.
Fig. 4.80 shows a comparison between the switching characteristics of CPL, conventional pass-transistor CMOS and DPL XOR gates. In the truth tables, the column labeled "Pass" shows which signals are passed and perform the XOR function. There are some features of DPL:

■ The DPL gate has a balanced input capacitance. This reduces the dependence of the delay on the input data, contrary to CPL and conventional CMOS pass-transistor logic, where the input capacitances for the signals A and B are not the same.

■ In DPL, for any input combination, there are always two current paths driving the output. This compensates for any reduction in speed due to the additional PMOS. For example, when the inputs A and B are low, A is passed by a PMOS while B is passed by an NMOS.
A DPL full-adder implementation is shown in Fig. 4.81. When all the inputs A, B and C are low, for example, there are two current paths to the output buffer. This implementation uses DPL primitives such as AND/NAND, OR/NOR, XOR/XNOR and MUX to generate the carry and sum signals.
Figure 4.80 Comparison of CPL, conventional CMOS TG and DPL logic for XOR gates.

Figure 4.81 DPL full-adder.
improves the speed as shown in the simulation curves of Fig. 4.84. It has been found that the size of the latch should be minimum, for fast operation, using the 0.8 μm device parameters of Chapter 3. If the size of the NMOS transistors in the network is small, the output of the SRPL gate fails to switch to ground because the equivalent impedance of the network is higher than the one seen by the output to VDD. This problem becomes worse when many gates are cascaded. Fig. 4.85 illustrates this problem for 2 cascaded AND/NAND gates. When the input goes from VDD to ground, the nodes A and B, initially at VDD, cannot be completely discharged.
4.8.4 Pass-Transistor Logics Comparison
The speed and power dissipation of the different pass-logic styles presented so far depend on the circuit type and the application of the circuit (cascaded gates, driving a fixed load, etc.). For the case of a full-adder used in a multiplier array, a comparison is given in Chapter 7. In general, SRPL has the lowest power dissipation, but careful design is needed when small device sizes are used. DPL consumes more power than SRPL and PMOS latch CPL because of the higher transistor count. Both CPL and SRPL circuits have the smallest area and the fastest speed. In summary, CPL-like styles are promising for low-power and high-speed applications.
4.9 I/O CIRCUITS
I/O circuits connect the on-chip logic circuitry to the external world. They play an important role in the limitation of speed and power dissipation of the whole chip. In this section many I/O circuits are discussed, such as input and output buffers, clock distribution, clock buffering and low-swing I/O. The power dissipation issues related to these circuits are also studied. Layout techniques for I/O circuits are not covered in this chapter.
4.9.1 Input Circuits
To distribute an input signal to the internal circuitry of a chip, an input buffer is needed. It has its gate connected to the input pad. Excessive electrostatic charge on the input pad can break down the oxide and destroy the transistors of the input buffer. For an oxide thickness of 100 Å, the breakdown voltage is approximately 7 V. The voltage built up on the gate by the electrostatic charge can be as high as 300 V. Fig. 4.86 shows an example of electrostatic discharge protection. If the voltage at the node N goes above VDD or below ground, then the clamping diodes D1 and D2 limit the voltage excursion of the node N to within $-V_{BE}$ and $V_{DD} + V_{BE}$. The role of the resistance R is to limit the
[%I.
peak current that flows in the diodes. Typical values of R are a few hundred Ω and are realized using the diffusion layers. The input protection circuit has a parasitic RC time constant which can limit high-speed operation. It ranges from a few tens of ps to a few hundreds of ps. The input buffer connected to this input pad consists in general of a number of inverter stages to drive the internal circuitry. The input buffer for clock distribution needs special care and design and is discussed in Section 4.9.4.
DC power, as shown in Fig. 4.87, particularly if the VT of the devices is low. If the first inverter does not fully translate the input TTL levels, then the second stage dissipates some DC power. The static power dissipated by a TTL input buffer is

$$P_{TTL} = V_{DD} I_{DDTTL} \qquad (4.100)$$
where
(4.101)
$I_{DDTTL}$ is the average dissipated current for the cases when the input is at low and high levels. At VDD = 3.3 V, the input buffer dissipates more static power when the input is high than when it is low. Fig. 4.88 shows the characteristics of the static power dissipation of the input buffer. Note that when VDD is scaled down the DC current is reduced because the $V_{GS}$ of the pull-up PMOS of the input buffer is reduced. If the number of TTL input pads is large, then the DC power of the input buffers could be an important and limiting factor. A static power-saving input buffer for reducing $I_{DDTTL}$ for a 5 V power supply voltage has been proposed in [21].
$$P_I = A N_I E_{ii} f \qquad (4.102)$$

where A is the switching activity, $N_I$ is the number of input pads and $E_{ii}$ is the internal energy of the input pad in Watt/Hz.
When the input signal has ECL levels, an ECL input buffer with an ECL-to-CMOS converter is used. In general they are implemented in BiCMOS technology and consume DC power. An ECL-CMOS converter can be designed in full CMOS ps].
4.9.2 Schmitt Trigger
When the input signal to a chip is slowly varying, a hysteresis circuit is needed at the input pad to generate a clean edge. A circuit called a Schmitt trigger can be used for this function. Schmitt triggers are often found at the on-chip inputs. Fig. 4.89 illustrates the transfer characteristic of an ideal Schmitt inverter with hysteresis voltage $V_H = V_{T+} - V_{T-}$. For a 3.3 V power supply, with 3.6 V for the fast process and 3.0 V for the slow process, typical values are $V_{T+}$ = 1.7 V and $V_{T-}$ = 1.0 V. The Schmitt circuit switches at different thresholds. When the input is rising, it switches when $V_{in} = V_{T+}$, and when the input is falling, it switches when $V_{in} = V_{T-}$. Fig. 4.90 shows an example of how the Schmitt trigger turns a signal with a very slow transition into a signal with a sharp transition.
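The two-threshold behaviour can be illustrated with a purely behavioural model of an ideal Schmitt inverter. This is a sketch, not the circuit of the figures: the function name is an assumption, the default thresholds are the typical values quoted above, and the input samples are invented to show noise around the thresholds.

```python
def schmitt_inverter(samples, vt_plus=1.7, vt_minus=1.0, initial_high=True):
    """Ideal Schmitt inverter: returns the logic output for each input sample."""
    output_high = initial_high
    result = []
    for v in samples:
        if output_high and v >= vt_plus:
            output_high = False      # rising input crossed the upper threshold
        elif not output_high and v <= vt_minus:
            output_high = True       # falling input crossed the lower threshold
        result.append(1 if output_high else 0)
    return result


# Slow, noisy input (volts): it dips back to 1.2 V after switching at 1.8 V
noisy_ramp = [0.0, 1.2, 1.6, 1.8, 1.5, 1.2, 1.8, 0.9, 0.5]
clean = schmitt_inverter(noisy_ramp)
```

In this trace the dip to 1.2 V does not switch the output back, because the input must fall below the lower threshold first; this is how hysteresis produces a single clean edge.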
A CMOS version of the Schmitt trigger is shown in Fig. 4.91. When the input is rising, initially the NMOS transistors are OFF. The $V_{GS}$ of the transistor N2 is given by

$$V_{GS2} = V_{in} - V_{FN} \qquad (4.103)$$
[Figure: the CMOS Schmitt trigger characteristic]
When $V_{in} = V_{T+}$, N2 enters conduction mode, which means $V_{GS2} = V_{Tn}$ (neglecting the body effect of N2). Then

$$V_{FN} = V_{T+} - V_{Tn} \qquad (4.104)$$
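The voltage $V_{FN}$ is controlled by N1 and N3. These transistors operate in saturation because $V_{GS1} = V_{T+}$ and $V_{DS1} = V_{FN} = V_{T+} - V_{Tn}$ for N1, while N3 has $V_{DS3} = V_{GS3} = V_{DD} - V_{FN}$ (its gate is driven by the output, which is still high). Equating the two drain currents, we have

$$\frac{\beta_1}{2}(V_{T+} - V_{Tn})^2 = \frac{\beta_3}{2}(V_{DD} - V_{T+})^2 \qquad (4.109)$$

(4.110)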
where
(4.111)
This equation shows that the trigger point is independent of the process parameters except for $V_{Tn}$. By symmetry, the trigger point for the falling transition can also be deduced from the pull-up section. We have
(4.112)
where
(4.113)
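If the β ratios of the pull-down and pull-up sections are equal to unity and $V_{Tn} = -V_{Tp} = V_T$, then

$$V_{T+} = \frac{V_{DD}}{2} + \frac{V_T}{2} \qquad (4.114)$$

$$V_{T-} = \frac{V_{DD}}{2} - \frac{V_T}{2} \qquad (4.115)$$

$$V_H = V_{T+} - V_{T-} = V_T \qquad (4.116)$$

In this case the hysteresis voltage can be made equal to VT. The short-circuit power dissipation of the Schmitt trigger can be very important since the rise/fall time of the input signal is very long.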
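The symmetric result above can be checked numerically. The sketch below assumes the usual square-law (saturation) analysis of this kind of CMOS Schmitt trigger, in which each trigger point depends only on the supply, the threshold voltage and the β ratio between the input device and the feedback device of the same section. The closed-form expressions and the parameter names are assumptions made for illustration, not formulas quoted from the text.

```python
import math


def schmitt_thresholds(vdd, vtn, vtp_abs, beta_ratio_n=1.0, beta_ratio_p=1.0):
    """Return (VT+, VT-, VH) under a square-law model.

    beta_ratio_n: beta of the NMOS input device over beta of the NMOS
                  feedback device (hypothetical naming).
    beta_ratio_p: same ratio for the PMOS pull-up section.
    """
    rn = math.sqrt(beta_ratio_n)
    rp = math.sqrt(beta_ratio_p)
    vt_plus = (vdd + rn * vtn) / (1.0 + rn)
    vt_minus = rp * (vdd - vtp_abs) / (1.0 + rp)
    return vt_plus, vt_minus, vt_plus - vt_minus


# Unity ratios and equal threshold magnitudes (hypothetical 0.8 V)
vt_plus, vt_minus, v_h = schmitt_thresholds(3.3, 0.8, 0.8)
```

With unity ratios the model gives trigger points placed symmetrically around VDD/2 and a hysteresis equal to VT, in agreement with Equations (4.114) to (4.116). Increasing either ratio narrows the distance between the trigger point and VDD/2 on that side.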
Fig. 4.92 shows a SPICE simulation of the circuit of Fig. 4.91 in 0.8 μm technology. In this example, the load capacitance is 0.1 pF and the total power dissipation is 0.85 mW. The dynamic power dissipation, due to the load and parasitic capacitances, is 0.40 mW. Therefore, the power due to the short-circuit current is 0.45 mW, which represents 53% of the total power dissipation.
4.9.3
When a gate is intended to drive a large load capacitance (larger than the input capacitance of the gate), the driving capability is limited and the delay is large. If we increase the size of the gate (driver configuration), we improve the rise/fall times, but the delay can still be improved by putting several stages of buffering between the first gate and the load. The objective in a buffer configuration is to get the input signal to the load as quickly as possible. Each stage in the buffer chain should have its transistor widths larger than the previous
Question: What are the values of the size ratio a and the number of stages n that optimize the delay?

By differentiating the delay equation with respect to a and then setting the result equal to zero, we have

$$a_{opt} = e \approx 2.7 \qquad (4.124)$$

The optimum number of stages is

$$n_{opt} = \ln(C_L/C_{in}) \qquad (4.125)$$
In this analysis, we have neglected the parasitic output capacitance of each stage. Other studies [30, 31, 32, 33] illustrate that the size ratio a depends on the ratio of the parasitic output capacitance to the load capacitance. In [34] a new approach for CMOS tapered buffers with a large $C_L/C_{in}$ ratio was proposed. It uses a variable size ratio between the stages.
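A minimal sizing sketch based on the result above: the ideal number of stages is the natural logarithm of the capacitance ratio, which in practice has to be rounded to an integer, after which the stage ratio is recomputed so that the chain still spans the full ratio. Parasitic capacitances are neglected, as in the analysis; the function name and the example capacitances are hypothetical.

```python
import math


def tapered_buffer(c_load: float, c_in: float):
    """Return (number of stages, stage size ratio) for a tapered buffer chain."""
    ratio = c_load / c_in
    n_ideal = math.log(ratio)          # optimum when the stage ratio equals e
    n = max(1, round(n_ideal))         # a real chain needs an integer stage count
    a = ratio ** (1.0 / n)             # stage ratio so that a**n equals the ratio
    return n, a


# Hypothetical example: 50 fF input capacitance driving a 50 pF load
n_stages, stage_ratio = tapered_buffer(50e-12, 50e-15)
```

For this assumed ratio of 1000 the sketch gives 7 stages with a stage ratio of about 2.68, close to e. The delay is proportional to the product of n and a, which varies slowly around the optimum, so somewhat larger ratios cost little delay.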
The power dissipation of a CMOS buffer is mainly dominated by the dynamic power dissipation for large VT. The short-circuit power dissipation can be neglected in a first-order analysis [34]. If we include the parasitic output capacitance, stage i has a total output capacitance

$$C_i = a^i C_{in} + a^{i-1} C_p \qquad (4.126)$$

where we assume that the parasitic capacitance of stage i is proportional to the size ratio a. The dynamic power dissipation at the output of gate i is

$$P_i = C_i V_{DD}^2 f = V_{DD}^2 f (a^i C_{in} + a^{i-1} C_p) \qquad (4.127)$$

or

$$P_i = V_{DD}^2 f a^{i-1} (a C_{in} + C_p) \qquad (4.128)$$

The total power is

$$P_T = \sum_{i=1}^{n} P_i \qquad (4.129)$$

Hence

$$P_T = V_{DD}^2 f (a C_{in} + C_p) \frac{a^n - 1}{a - 1} \qquad (4.130)$$
where $P_L$ is the power dissipated due to the load $C_L$, which is simply $C_L V_{DD}^2 f$. $P_T$ is the total power dissipated, given by Equation (4.130). This power efficiency, for given $C_L$, $C_{in}$ and $C_p$, is a function of only the factor a. One minus the efficiency characterizes the additional power dissipation overhead needed by the buffer chain to drive the load $C_L$. For high values of a, the power efficiency of the buffer increases. In practice a can be in the range of 2 to 10. The value of a can be set depending on speed, delay and power dissipation constraints.
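The dependence of the efficiency on a can be made concrete with a short calculation. The sketch uses Equation (4.130) for the total power and assumes that the efficiency is the ratio of the load power to the total power, with the last stage driving the load so that the load equals the input capacitance scaled by a to the power n. The function names and the numerical values are hypothetical.

```python
def buffer_chain_power(vdd, f, c_in, c_p, a, n):
    """Total dynamic power of an n-stage tapered buffer, Equation (4.130)."""
    return vdd ** 2 * f * (a * c_in + c_p) * (a ** n - 1) / (a - 1)


def power_efficiency(vdd, f, c_in, c_p, a, n):
    """Assumed definition: load power divided by total chain power."""
    c_load = (a ** n) * c_in                  # load driven by the last stage
    p_load = c_load * vdd ** 2 * f
    return p_load / buffer_chain_power(vdd, f, c_in, c_p, a, n)


# Same load (1024 times the input capacitance) reached with two stage ratios,
# taking the parasitic capacitance equal to the input capacitance.
eff_a2 = power_efficiency(3.3, 100e6, 50e-15, 50e-15, a=2, n=10)
eff_a4 = power_efficiency(3.3, 100e6, 50e-15, 50e-15, a=4, n=5)
```

Under these assumptions the efficiency rises from about 0.33 for a = 2 to about 0.60 for a = 4, which illustrates why a ratio larger than the delay-optimal value can be attractive when power matters.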
4.9.4
Usually, when the clock is to be distributed on-chip, input buffers are needed. The clock circuit has to drive a very high internal load with extremely fast fall/rise times. For example, in the case of the DEC Alpha chip [21] the clock load is 3.2 nF. If this load has to be driven by a large driver, with rise/fall times of 0.5 ns when the clock frequency is 200 MHz ($T_{clock}$ = 5 ns), then the average transient current would be

$$I = C \frac{dV}{dt} = 3.2 \times 10^{-9} \times \frac{3.3}{0.5 \times 10^{-9}} \approx 21\ \text{A} \qquad (4.132)$$

and the corresponding dynamic power dissipation would be

$$P = C V_{DD}^2 f = 3.2 \times 10^{-9} \times 3.3^2 \times 200 \times 10^{6} \approx 7\ \text{W} \qquad (4.133)$$
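The two figures quoted for the 3.2 nF clock load can be reproduced with a few lines of arithmetic. The sketch only restates the numbers given in the text (3.3 V supply, 0.5 ns edges, 200 MHz clock); the variable names are illustrative.

```python
c_clock = 3.2e-9      # total clock load (F)
vdd = 3.3             # supply voltage (V)
t_edge = 0.5e-9       # rise/fall time (s)
f_clock = 200e6       # clock frequency (Hz)

# Average current while the load is charged through VDD in one edge time
i_transient = c_clock * vdd / t_edge

# Dynamic power of charging and discharging the load once per cycle
p_dynamic = c_clock * vdd ** 2 * f_clock
```

The results are about 21 A of transient current and about 7 W of dynamic power, which is why both clock buffering and clock power reduction receive special attention.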
■ The difference in RC interconnect time constants: For example, in Fig. 4.94 node A and node B have two different branch lengths to node C. In this case, the delays of the signals at node A and node B with respect to node C are different. Therefore, the clock skew is equal to the time difference between these two signals.

■ Unbalanced loads at different nodes: As shown in the example of Fig. 4.95, if the loads at the nodes A and B, $C_A$ and $C_B$ respectively, are different, then a skew exists between the signals at these nodes.
Several strategies have been proposed to minimize clock skew. The first approach is to use cascaded inverters (a buffer) to drive a large load and feed all blocks, as shown in Fig. 4.96. The buffer chain is designed by the approach presented in Section 4.9.3. In another approach, the clock distribution is accomplished by using a tree of well-sized clock buffers, as illustrated by Fig. 4.97. Identical buffers are used in each level and each buffer sees the same load capacitance. Equalizing the clock buffer loads is possible by: 1) equalizing the interconnect lengths between the buffers of different levels, and 2) adding dummy buffers at the lightly loaded buffer outputs. The last distribution level has buffers which drive the functional elements such as registers. This structure results in very reduced skew, and the only skew that exists is the one produced by variations in process parameters. To further minimize the skew, an identical layout should be used for all the buffers. An example of the tree approach is the following case. To distribute the clock signal to 64 elements (for example registers), 3 stages (levels) of buffering with a 1-to-4 tree structure are required. A variety of software packages have been developed for clock tree synthesis [35, 36].
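The 64-element example can be generalized with a small helper that counts the buffer levels of a balanced tree. This is a sketch under the assumption of a uniform fan-out at every level; the function names are illustrative.

```python
def clock_tree_levels(n_elements: int, fanout: int) -> int:
    """Number of buffer levels needed so that the last level reaches all elements."""
    levels, reach = 0, 1
    while reach < n_elements:
        reach *= fanout       # each extra level multiplies the reach by the fan-out
        levels += 1
    return levels


def buffers_per_level(n_elements: int, fanout: int) -> list:
    """Buffer count at each level of the balanced tree, from root to leaves."""
    return [fanout ** i for i in range(clock_tree_levels(n_elements, fanout))]
```

For 64 elements and a fan-out of 4 this gives 3 levels with 1, 4 and 16 buffers, so that each of the 16 last-level buffers drives 4 elements and every buffer in a level sees the same load.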
To reduce the high dynamic power dissipation (a few watts) in clock distribution
at a
in intermediate
For the second approach, a half-swing clocking scheme has been proposed [37]. Fig. 4.98 shows the half-swing clock driver, which generates half-VDD clock signals (four phases) for the elements (e.g., latches). Using the charge-sharing principle, the voltage of the half-VDD node can be expressed by H-VDD = H-VDD
c, + c ,
+ c, + c s V D ~
-VDD
(4.134) (4.135)
ca+ c3+ c, + G B
whenclk ia hwh
where $C_A$ and $C_B$ are capacitors added to the power lines, and $C_1$ through $C_4$ are the load capacitances of the driver. When $C_A$ is equal to $C_B$ and both are large enough compared to $C_1$ through $C_4$, the H-VDD node is stabilized at $V_{DD}/2$.
Fig. 4.99 shows the clocking schemes of the latches driven by the clock driver. Compared to the conventional scheme, which uses two clock phases, the half-swing scheme requires four clock phases. Two phases are for the PMOS devices and two are for the NMOS devices, as shown in Fig. 4.99(b). This scheme reduces the power by 75%. However, the delay of the latch is increased by the new clocking scheme, which can be acceptable [37].
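The 75% figure is consistent with the quadratic dependence of dynamic power on the voltage swing. The sketch below assumes that the switched capacitance and the clock frequency are unchanged and that only the swing is halved; the function name is illustrative.

```python
def swing_power_saving(swing_fraction: float) -> float:
    """Fraction of dynamic power saved when the swing is scaled by swing_fraction.

    Assumes P is proportional to C * V**2 * f with C and f unchanged.
    """
    return 1.0 - swing_fraction ** 2


saving_half_swing = swing_power_saving(0.5)
```

Halving the swing leaves one quarter of the original clocking power, i.e. a saving of 75% under these assumptions.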
4.9.5 Output Circuits
a high-drive-capability driver is needed to achieve adequate rise and fall times. In this case, an inverter chain is used to handle the large load of the pad, package wiring, and off-chip load. This capacitance can be a few tens of pF. A typical value of this capacitance is 50 pF. There are many types of output pads, such as tristate, bidirectional, low-VDD (3.3 V) to high-VDD (5 V) output buffer and low-swing output.
Figure 4.101 Bidirectional pad.
The $I_{DS}$ value would be important, because the devices in the output buffer have a large size, particularly the output transistors. This current should be computed in the worst case, where VT has its minimum value. Thus for future technologies, where the threshold voltage is low and the number of output pads is large, the static power dissipation would be very important and can be a limiting factor for low-power applications. Hence low-power circuit techniques are needed for output buffers.
If the CMOS output buffer is intended to drive bipolar TTL inputs (not CMOS TTL inputs), then an important current is sunk. Fig. 4.102 shows the final stage of the buffer driving TTL logic. Since bipolar TTL inputs can source significant amounts of current, a CMOS output buffer must sink this current. For a 3.3 V power supply, this current can be in the range of 1 mA to 12 mA depending on the strength of the output driver. The static power dissipated by one output pad driving bipolar TTL inputs is

$$P = V_{OL} I_{OL} \qquad (4.137)$$
Figure 4.102 TTL output buffer.
where $I_{OL}$ is the current sunk by the output buffer and is equal to the sum of the currents from all the bipolar TTL inputs. $V_{OL}$ = 0.4 V for a low TTL output. This dissipated power is due to the output NMOS pull-down transistor and can be an important issue as far as the chip heat is concerned. Note that the corresponding energy is not drawn from the internal power supply. Another component of the total power dissipated at the output pads is the dynamic power. It is given by
(4.138)
where $E_{io}$ is the internal switching energy of the output pad, and $C_O$ is the average output load capacitance (including the pad load). As an example, 64 output pads switching with an activity of 10% at 200 MHz dissipate 0.8 W (VDD = 3.3 V, $E_{io}$ = 70 μW/MHz and $C_O$ = 50 pF). This value is very important to take into account. The total power dissipation of the bidirectional pads can be evaluated using the approaches developed for the input and output circuits.
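The quoted 0.8 W can be reproduced under a simple assumption: that the dynamic power of the output pads is the product of the activity, the number of pads and the frequency with the energy per transition, the latter being the internal pad energy plus the energy of charging the load. This form is an assumption chosen by analogy with Equation (4.102); the function name is illustrative.

```python
def output_pad_dynamic_power(activity, n_pads, f, e_internal, c_load, vdd):
    """Assumed form: P = A * N * f * (E_internal + C_load * VDD**2)."""
    return activity * n_pads * f * (e_internal + c_load * vdd ** 2)


# Values from the example in the text; 70 uW/MHz equals 70 pJ per transition
p_pads = output_pad_dynamic_power(
    activity=0.10, n_pads=64, f=200e6,
    e_internal=70e-12, c_load=50e-12, vdd=3.3,
)
```

The result is about 0.79 W, which rounds to the 0.8 W stated in the example. Most of it comes from the 50 pF external load rather than from the internal energy of the pad.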
4.9.5.3 3.3-to-5 V Output Interface
When a 3.3 V chip is connected to a 5 V chip, interfaces with zero DC power dissipation are needed. If conventional CMOS is used to interface the 3.3 V logic to the 5 V logic, the DC power would be large. Fig. 4.103 illustrates this problem. For example, if the 3.3 V inverter drives high into the 5 V inverter, the magnitude of the $V_{GS}$ of the PMOS transistor of the 5 V inverter is equal to 1.7 V. This value is larger than the VT of the device and thus results in a large DC power dissipation in the range of milliwatts. Since this power is dissipated for every I/O, for a whole ASIC chip it could be hundreds of mW. This situation is unacceptable for low-power applications.

The circuit of Fig. 4.104 defines a solution to the problem of DC power dissipation [38]. The circuit has two power supplies, denoted $V_{DDL}$ and $V_{DDH}$, corresponding to the low VDD (for example 3.3 V) and the high VDD (for example 5 V), respectively. For low input data, node A is at $V_{DDL}$ and node B is at zero. The NMOS transistor N is conducting and the output is at $V_{SS}$. Since the output is zero, the feedback PMOS transistor $P_f$ is also conducting. The pass NMOS transistor $N_p$ is cut off, thus the node C is pulled up to $V_{DDH}$. Then the PMOS transistor P is completely OFF. Hence there is no leakage in this state except the junction leakage currents and the subthreshold currents. For high input data, node A is at zero and node B is at $V_{DDL}$. In this case the NMOS transistor N is OFF and the pass transistor $N_p$ is conducting. Initially the feedback PMOS transistor $P_f$ is ON, and since $N_p$ is conducting, proper sizing of $P_f$ and $N_p$ (higher conductance of $N_p$) will permit node C to be discharged through $N_p$. This causes P to conduct, which in turn charges the output to $V_{DDH}$. Then the feedback device $P_f$ is completely OFF. Thus this interface results in very limited leakage current and solves the interface problem.
As mentioned, the transistors $P_f$ and $N_p$ should be sized properly so that the circuit does not latch the previous data. $P_f$ should be much smaller than $N_p$. We use a simple analysis to find the relationship between the sizes of the two transistors. For high input data, initially the node C is at $V_{DDH}$. Thus the NMOS $N_p$ is in saturation and the PMOS $P_f$ is in the linear region. By assuming that the drain current of $N_p$ is much higher than that of $P_f$, we have
(4.140)
where $\beta_n$ and $\beta_p$ are the βs of the NMOS transistor $N_p$ and the PMOS transistor $P_f$, respectively. The low-to-high voltage converter has a negligible DC current when the input is stable since all the devices are completely OFF. This technique can be used to interface any low voltage to a higher voltage.
4.9.6 Ground Bounce
When a high-drive-current CMOS driver switches, it generates high current spikes. This current can generate noise, as shown in Fig. 4.105. The current flows through the impedance between the pad and the supply node and produces a voltage noise. This noise is often called $L\,di/dt$ noise or ground bounce. The L is due to the package inductance. The ground bounce is given by

$$V = L \frac{di}{dt} \qquad (4.141)$$
This noise problem can also occur on the power lead, where it is termed power bounce. We will use only one name to refer to this problem. Consider a CMOS output driver driving an output pad of 50 pF at 3.3 V with 2 ns rise/fall times. It can be shown [39] that $di/dt$ is related to the fall/rise times by
(4.142)
The $di/dt$ can be as high as 165 mA/ns. If, for example, 8 drivers are allowed to switch simultaneously per VDD/VSS pad pair, the resulting ground bounce for L = 1 nH is 1320 mV. This value can be a problem, particularly for low-voltage applications, since this ground bounce consumes a large fraction of the digital noise margins. Some of the problems encountered are 1) false triggering, 2) double clocking, and/or 3) missing clocked pulses.
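The quoted numbers can be reproduced if the output current is assumed to be a triangular pulse that delivers the full load charge during one edge, which gives a maximum slope of four times the charge divided by the square of the edge time. This pulse shape is an assumption made here for illustration; it is not stated in the text. The function names are illustrative.

```python
def max_di_dt(c_load: float, v_swing: float, t_edge: float) -> float:
    """Peak current slope (A/s) for an assumed triangular current pulse."""
    return 4.0 * c_load * v_swing / t_edge ** 2


def ground_bounce(n_drivers: int, inductance: float, di_dt: float) -> float:
    """Voltage across the shared supply inductance when n drivers switch together."""
    return n_drivers * inductance * di_dt


slope = max_di_dt(50e-12, 3.3, 2e-9)          # 50 pF, 3.3 V, 2 ns
bounce = ground_bounce(8, 1e-9, slope)        # 8 drivers, 1 nH
```

Under this assumption the slope is 165 mA/ns and the bounce is 1.32 V. Because the slope scales with the inverse square of the edge time, relaxing the rise/fall times is a very effective way to reduce the bounce.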
I/O buffers are not the only source of ground bounce in CMOS circuits. Clock buffers and, slightly, the core logic can also cause serious ground bounce in the supply leads when driving large loads. Careful power supply routing is required when powering large buffers. The resistance of the metal should be minimized so that the voltage drop due to the current spike is reduced. There are many techniques to reduce the ground bounce. One simple approach is to use separate supply pins for the output buffers. Some approaches, based on reducing L and $di/dt$, are the following:

■ Multiple supply pads and pins are one way to reduce the inductance of the supply. A recent chip uses 121 power/ground pins out of a total of 293 pins [40]. Placing power and ground pins adjacent to one another reduces the effective inductance of the power and ground pins through mutual inductance. This approach causes an increase in chip size and cost.

■ Circuit techniques to reduce the $di/dt$ of the output and clock buffers while maintaining adequate performance. The simplest way is to control the rise/fall times while maintaining the timing requirement. However, this approach has a serious problem, since the worst-case-slow process dictates the buffer sizing (worst-case delay), while the best-case-fast process dictates the ground bounce level. Hence the buffer design is constrained by the two extremes of process variations. Once the buffer is sized to satisfy the worst-case delay, the worst-case ground bounce may exceed the fixed level. This problem can be solved by controlling the signal slope at the input of the output transistors of the buffer [41].

■ For clock buffers, and in high-performance design, on-chip bypass capacitances are added between the power bus and the substrate as shown in Fig. 4.106. This capacitance lowers the impedance of the power supply. On-chip bypass capacitance does not reduce the noise produced by output buffers.

■ Another approach is to reduce the output voltage swing of the large buffer.

In conclusion, to reduce the ground bounce, all the techniques can be combined to reduce L and $di/dt$. The reader can refer to many other techniques to reduce the ground bounce [42, 43, 44, 45].
4.9.7
With the advent of high-performance VLSI chips, which operate beyond 100 MHz and have over 100 I/Os on the same chip, high-data-rate CMOS I/O interfaces with low-swing signals are needed, such as ECL (Emitter Coupled Logic) [46, 47, 48], BTL [49], GTL [50], and CMTL (Current Mode Transceiver Logic) [51]. Conventional unterminated interconnects (between VLSI chips) for CMOS-level signals usually have poor signal quality, with severe overshoot and ringing, accompanied by EMI (electromagnetic interference) and the possibility of triggering latch-up.
Fig. 4.107 shows two chips connected to a bidirectional transmission line (50 Ω termination resistors) through GTL I/O (Gunning I/O) transceivers. Both ends of the transmission line are terminated to prevent reflections. The load seen by each driver is 25 Ω. The termination voltage $V_{TM}$ is about 1.2 V. The output driver is an open-drain NMOS pull-down transistor, and when it is inactive the output is at the high-level signal $V_{OH}$, equal to 1.2 V. The input receiver uses a differential comparator with an external reference voltage $V_{REF}$ = 0.8 V.
Figure 4.107 GTL I/O transceivers on a terminated transmission line
Fig. 4.108 shows an output driver in open-drain configuration which includes circuitry to reduce overshoot and the turn-off dI/dt. When V_in is low, P1 turns ON, which in turn switches N3 and N4 ON. In this case, the maximum output voltage is V_OL,max = 0.4 V. The power dissipated by the pull-down NMOS is maximum and mainly static. The static current is equal to (V_TM - V_OL)/R = 0.8/25 = 32 mA. Hence, the maximum static power dissipated on-chip is P = 32 mA x 0.4 V = 12.8 mW for each I/O. The typical value of V_OL is 0.24 V, thus the nominal power dissipated by each active driver is 9.2 mW. When the input goes from low to high, N2 turns ON and N3 is still ON because the signal through the two inverters I1 and I2 is delayed by about 1 ns. The transistor N1 is weak, hence the output discharge is controlled by N2 and N3. These transistors keep the drain of N4 connected to its gate as long as V_DS is higher than V_T. When N3 turns OFF, N1 discharges the gate of N4 to ground. Thus, the turn-off of N4 is controlled. In this case, there is no DC power dissipated. Fig. 4.109 shows the input buffer, which employs a differential comparator. This circuit switches V_out to high (low) when V_in - V_REF > 50 mV (< -50 mV), respectively, over process, power supply and junction temperature variations.
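The static power figures quoted for the GTL driver can be checked with a few lines of arithmetic. The sketch below is only an illustration: the function name `gtl_static_power` is hypothetical, and it assumes the simple model used in the text, in which the whole static current (V_TM - V_OL)/R flows through the pull-down NMOS, which drops V_OL.

```python
def gtl_static_power(v_tm, v_ol, r_load):
    """Static current (A) and on-chip power (W) of an open-drain pull-down.

    Assumes the simple model of the text: the termination (v_tm through
    r_load) supplies the current and the NMOS drops v_ol.
    """
    i_static = (v_tm - v_ol) / r_load
    return i_static, i_static * v_ol


# Worst case quoted in the text: V_TM = 1.2 V, V_OL,max = 0.4 V, R = 25 ohm
i_max, p_max = gtl_static_power(1.2, 0.4, 25.0)
# Typical case quoted in the text: V_OL = 0.24 V
i_typ, p_typ = gtl_static_power(1.2, 0.24, 25.0)

print(f"worst case: {i_max * 1e3:.1f} mA, {p_max * 1e3:.1f} mW")
print(f"typical:    {i_typ * 1e3:.1f} mA, {p_typ * 1e3:.1f} mW")
```

Under this model the typical case draws more current than the worst case (the drop across the termination is larger) but dissipates less on-chip power, because the NMOS sees a smaller V_OL.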
This technique has been used mainly to reduce the static power dissipation in standby mode of the memory decoded-driver [53]. The drivers, in memory, have a large number of circuits, arranged repeatedly, but only a few of them operate simultaneously. The circuit of Fig. 4.111 can drastically reduce the subthreshold current of the drivers. The technique simply consists of inserting a PMOS transistor P_c with a size W_c between the power supply V_DD and the common source node A. All the PMOS transistors (P_d1, P_d2, ..., P_dn) of the drivers have, in this example, the same size W_d and a common source (node A). The number of drivers n can be between a few hundred and a few thousand. The MOS transistors in the drivers have low |V_Td| (e.g., 0.1 V). The PMOS transistor P_c has a threshold voltage |V_Tc| slightly higher than |V_Td| (e.g., 0.2 ~ 0.4 V).
In active mode, the input S is low and the transistor P_c is ON. For the drivers, only one circuit is ON. In order that the PMOS transistor P_c does not affect the drive current of the drivers, its size W_c should be larger than W_d, depending on the capacitance of the common source, which is huge for high n. In standby mode, the input S is high and the PMOS transistor P_c is OFF. The inputs of all drivers are set to high (V_DD). Without the PMOS transistor P_c, the total subthreshold current would be n times the current of each driver, which makes this current very high. Hence P_c reduces and limits the subthreshold current. The voltage of the common source node A is reduced by an amount ΔV_SRB (a few hundred mV). This causes the PMOS transistors of all drivers to have a self-reverse-biasing gate-source voltage, which drastically reduces the subthreshold current. The time needed for the node to stabilize to V_DD - ΔV_SRB (or the time needed to switch from the active to the standby mode) is called the evolution time and can be very long (order of 1 ms) compared to the delay of the driver. The reason is that only the leakage and subthreshold currents discharge the node A in this mode. This time can be insignificant to low-power operation if the standby mode time is large enough, as in the case of many low-power applications. When the input S is turned low (active mode), the time needed for the common source A to recover (to reach almost V_DD) is very short and can be shorter than the delay time. Hence, it does not interrupt the start of normal operation.
Figure 4.111 Active mode and standby mode
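Let us now derive the subthreshold current expressions before and after reduction by the SRB technique. The total subthreshold current without the self-reverse-biasing technique is given by
$$I_{sub1} = n\, I_0\, \frac{W_d}{W_0} \exp\left(\frac{-|V_{Td}|}{S/\ln 10}\right) \qquad (4.143)$$
With the technique, the standby current is set by the switch transistor P_c:
$$I_{sub2} = I_0\, \frac{W_c}{W_0} \exp\left(\frac{-|V_{Tc}|}{S/\ln 10}\right) \qquad (4.144)$$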
We assume that the devices have the same I_0, W_0 and S. By dividing the current equations (4.143) and (4.144), we have, for the subthreshold current, a reduction factor
$$\gamma = \frac{I_{sub1}}{I_{sub2}} = n\, \frac{W_d}{W_c} \exp\left(\frac{|V_{Tc}| - |V_{Td}|}{S/\ln 10}\right)$$
For example, for n = 512, W_c = 10 W_d (with this ratio the speed is not affected), |V_Tc| = 0.3 V, |V_Td| = 0.1 V and S = 90 mV/decade, the factor γ = 8.5 x 10^3. So the saving in subthreshold current is substantial. The parameter ΔV_SRB can be easily deduced. Note that this technique needs a multi-V_T technology.
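As a quick numerical check of the example above, the ratio of the two subthreshold currents can be evaluated directly. This is a minimal sketch, assuming both currents follow the same exponential subthreshold law with identical I_0, W_0 and S, and that W_c = 10 W_d; the function name `srb_reduction_factor` is hypothetical.

```python
import math


def srb_reduction_factor(n, wc_over_wd, v_tc, v_td, s):
    """Ratio I_sub1 / I_sub2 for n drivers sharing one switch transistor.

    n          : number of drivers
    wc_over_wd : width ratio W_c / W_d (assumed value, not a measured one)
    v_tc, v_td : threshold voltages of the switch and driver PMOS (V)
    s          : subthreshold swing (V/decade)
    """
    swing_natural = s / math.log(10)  # S / ln 10, in volts per e-fold
    return (n / wc_over_wd) * math.exp((abs(v_tc) - abs(v_td)) / swing_natural)


gamma = srb_reduction_factor(n=512, wc_over_wd=10, v_tc=0.3, v_td=0.1, s=0.090)
print(f"reduction factor: {gamma:.2e}")  # on the order of 8.5e3
```

The exponential term dominates: each additional 90 mV of threshold difference buys another decade of leakage reduction, while the width ratio only enters linearly.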
4.10.1.2 Multi-VT Technique
This technique is similar to the one discussed above, but it can be applied to any CMOS logic [54, 56]. The basic idea is shown in the example of the NAND gate of Fig. 4.112. Here the MOS transistors P and N have high V_T (e.g., 0.6 V extrapolated) for 1 V power supply applications. The logic gate itself has MOSFETs with low V_T (≤ 0.3 V). The signal SL is used to switch the gate into active or sleep (standby) mode. The virtual supply lines V_DDV and V_SSV are common for many gates. We call this logic multi-threshold CMOS logic (MT-CMOS).
In the active mode, the signal SL is low, P and N are ON, so the virtual supply lines V_DDV and V_SSV can be set to almost V_DD and ground, respectively. Hence, the low-V_T logic can switch efficiently, but care should be taken in the sizing of the P/N devices compared to the logic. Fig. 4.113 shows the effect of sizing the high-V_T devices on the delay of the gate. The width of P/N should be at least 10 times larger than that of the logic cells. This condition depends greatly on the parasitic capacitances of the virtual supply lines, C_1 and C_2 [see Fig. 4.112]. If C_1 and C_2 are large then the width of the P and N transistors can be reduced, because these capacitances tend to suppress the bouncing of V_DDV and V_SSV and hence improve the speed. The high-V_T MOSFETs can be common for several logic gates (e.g., 10).
In the standby (sleep) mode, the signal SL is high, so P and N are OFF. Hence, the subthreshold current is limited by that of these high-V_T devices. In this case, the static power dissipation is dramatically reduced in the sleep mode. The subthreshold reduction factor can be deduced using the analysis presented in the previous section. One problem associated with this MT logic is that the evolution and recovery times can be large.
The measured delay, as a function of the supply voltage, for a 2-input NAND gate with FO = 3 and a wiring load of 1 mm (0.25 pF) is shown in Fig. 4.114. The technology is 0.5-µm CMOS with low V_Tn = 0.25 V, low V_Tp = -0.35 V, high V_Tn = 0.55 V and high V_Tp = -0.65 V. The MT-CMOS logic has almost the same speed as the full low-V_T logic. The logic delay time is reduced by 70% at 1 V as compared with that of the high-V_T one.
For holding the level of the output during the sleep mode, a level holder is necessary, as shown in Fig. 4.115. It consists only of cross-coupled inverters with high-V_T devices powered from the power supply V_DD.
The source of the static power dissipation is not only low-V_T devices. Several other issues contribute to static power increase. These are some circuit design guidelines to reduce the static power dissipation:
Figure 4.116
where α is the gate activity, V_s is the voltage swing, C_L is the load and parasitic capacitance, and f is the operating frequency of the system. Equation (4.146) demonstrates that there are several ways to reduce P_d:
1. Reduce the power supply voltage. Scaling V_DD from 3.3 V to 1 V results in a power reduction factor of 11. However, this approach leads to speed degradation for a given technology. But if device scaling is applied, in a next-generation technology, the delay will improve and hence the operating frequency. In a complex digital system, local supply reductions can be used for non-critical circuits.
2. Reduce, temporarily, the clock frequency of unused blocks on a VLSI chip using an on-chip power management unit, or reduce the gate activity. These can be done at the architectural level.
3. Reduce the output capacitance C_L. As a first-order approximation this capacitance is composed of the interconnect capacitance C_int and the total input capacitance of the driven gates C_in. The latter can be reduced by using a low-input-capacitance logic family [50] such as CPL-like logic. Also, using minimum-size logic gates in non-critical parts of the design can reduce the dynamic power significantly.
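The three levers listed above can be tried out numerically. The sketch below assumes the common form P_d = α C_L V_s V_DD f for the dynamic power; this expression is an assumption made here for illustration, since the text's Equation (4.146) is not reproduced in this passage. The function name and all parameter values other than the 3.3 V and 1 V supplies are hypothetical.

```python
def dynamic_power(alpha, c_load, v_swing, v_dd, freq):
    """Dynamic power (W), assuming P_d = alpha * C_L * V_s * V_DD * f."""
    return alpha * c_load * v_swing * v_dd * freq


# Illustrative values only: activity 0.5, 1 pF load, 100 MHz clock
base = dynamic_power(0.5, 1e-12, 3.3, 3.3, 100e6)

# 1. Full-swing supply scaling from 3.3 V to 1 V (swing follows the supply)
scaled = dynamic_power(0.5, 1e-12, 1.0, 1.0, 100e6)
supply_factor = base / scaled

# 2. Halving the clock frequency (or the activity) halves the power
half_clock_factor = base / dynamic_power(0.5, 1e-12, 3.3, 3.3, 50e6)

# 3. Halving the switched capacitance also halves the power
half_cap_factor = base / dynamic_power(0.5, 0.5e-12, 3.3, 3.3, 100e6)

print(f"supply scaling factor: {supply_factor:.2f}")
print(f"half clock factor:     {half_clock_factor:.2f}")
print(f"half load factor:      {half_cap_factor:.2f}")
```

With a rail-to-rail swing the supply enters quadratically, which is why the voltage reduction gives a factor close to 11 while the other two levers are only linear.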
When C_int dominates, as in busses and high-capacitance interconnections (interblock wires), circuit techniques based on low-swing signals, while maintaining the power supply voltage, can lead to power dissipation reduction [58, 59]. With increasing chip dimensions and integration density, the capacitances of wires will dominate. It is expected that the power dissipation associated with the busses and the interconnections in future ULSI chips will reach half of the total power dissipation [58].
These are some guidelines for the design of low-dynamic-power circuits:
• Choose the technology that has low junction and oxide capacitances for the same performance.
• Avoid, if possible, the use of dynamic logic design style.
• For any logic design, reduce the switching activity by logic reordering and by balancing the delays through the gate tree to avoid glitching.
where V is the power supply voltage as shown in Fig. 4.116(a). Half of the energy is dissipated by the resistor of the pull-up PMOS device during the charging phase. A similar argument applies to the discharge resistor of the pull-down NMOS transistor. This analysis is valid even if a step power supply voltage, V, is applied to the network. From Fig. 4.116(b), the voltage drop across the resistor, V_R, varies from V (supply voltage) to zero. Hence, the energy dissipated by R_P is given by
$$E_R = \int V_R\, dQ = \int V_R\, C_L\, d(V - V_R) \qquad (4.148)$$
then
$$E_R = \frac{1}{2} C_L V^2 \qquad (4.149)$$
$$E_R = C_L V \bar{V}_R \qquad (4.150)$$
where $\bar{V}_R$ is the average voltage drop across the resistor of the pull-up PMOS.
If the power supply voltage has two half steps, as shown in Fig. 4.116(c), the energy dissipated by the resistor is
$$E_R = \frac{1}{4} C_L V^2 \qquad (4.151)$$
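So less energy is dissipated by the resistor when the average voltage is reduced, while keeping the swing and load capacitance constant. This is the principle of adiabatic computing. If the supply is applied in N equal steps of V/N, each step dissipates $\frac{1}{2} C_L (V/N)^2$, so that
$$E_R = N \cdot \frac{1}{2} C_L \left(\frac{V}{N}\right)^2 = \frac{E_{conventional}}{N} \qquad (4.152)$$
$$E_R = \frac{1}{N} \cdot \frac{1}{2} C_L V^2 \qquad (4.153)$$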
where N is the number of uniformly distributed voltage steps. Fig. 4.117 shows an example of a driver with uniformly distributed supplies which are switched in successively. The voltage V_i is given by
$$V_i = \frac{i}{N} V, \qquad i = 0, 1, \ldots, N$$
To charge the load, V_1 through V_N are connected to the load in succession (by closing switch 1, opening switch 1, closing switch 2, etc.). To discharge the load, V_{N-1} through V_1 are switched in the same way, and then switch 0 is closed, connecting the output to ground. Note that the supply voltage, with multiple steps, needs a longer time period than the conventional case to charge up the load capacitance. This technique has been used for large loads. Another variation is to use a supply voltage with a ramp form [62]. In this case, the energy is drastically reduced if a long time period is used. For the inverter, for example, pulsed power supplies (PPS) are applied to the circuit. Adiabatic computing becomes attractive only when the delay is not critical, because in that technique energy is traded for delay. The energy-delay product of the adiabatic circuit is much worse than that of conventional CMOS gates [64].
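The dependence of the dissipated energy on the number of supply steps can be verified with a short simulation. The sketch below is illustrative only: it assumes ideal switches, full settling at every step and uniformly spaced supplies V_i = iV/N, and the function names are hypothetical. It compares the closed form C_L V²/(2N) against a step-by-step energy balance.

```python
def stepwise_loss_closed_form(c_load, v_supply, n_steps):
    """Resistive loss when charging in n equal steps: C V^2 / (2 N)."""
    return c_load * v_supply ** 2 / (2.0 * n_steps)


def stepwise_loss_simulated(c_load, v_supply, n_steps):
    """Sum, step by step, energy drawn from each supply minus energy stored.

    Assumes the load settles completely to V_i before the next switch closes.
    """
    v_out = 0.0
    loss = 0.0
    for i in range(1, n_steps + 1):
        v_i = i * v_supply / n_steps               # assumed uniform supply levels
        charge = c_load * (v_i - v_out)            # charge delivered by supply V_i
        e_drawn = v_i * charge                     # energy taken from supply V_i
        e_stored = 0.5 * c_load * (v_i ** 2 - v_out ** 2)
        loss += e_drawn - e_stored                 # remainder is lost in the switch
        v_out = v_i
    return loss


C_LOAD, V_SUPPLY = 1e-12, 3.3                      # illustrative values only
for n in (1, 2, 4, 8):
    closed = stepwise_loss_closed_form(C_LOAD, V_SUPPLY, n)
    simulated = stepwise_loss_simulated(C_LOAD, V_SUPPLY, n)
    print(f"N={n}: closed form {closed:.3e} J, simulated {simulated:.3e} J")
```

The loss falls as 1/N, but each added step also adds settling time, which is the energy-for-delay trade the text describes.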
4.12 CHAPTER SUMMARY
This chapter has provided an introduction to low-power CMOS design. The power dissipation components of a CMOS gate have been discussed. Techniques to reduce the different components, at physical and circuit levels, were presented. Novel CMOS design styles such as CPL, DPL, and SRPL were examined. Several issues in CMOS circuit design, such as clock distribution, ground bouncing, etc., were reviewed. This chapter represents a base for Chapters 6, 7, and 8, where subsystems and low-power architectures are discussed.
REFERENCES
[1] N. H. E. Weste and K. Eshraghian, "Principles of CMOS VLSI Design: A Systems Perspective," second edition, Addison-Wesley, Reading, MA, 1993.
[2] J. P. Uyemura, "Circuit Design for CMOS VLSI," Kluwer Academic Publishers, Norwell, MA, 1992.
[3] M. I. Elmasry, "Digital MOS Integrated Circuits II," IEEE Press Book, 1993.
[4] R. M. Swanson and J. D. Meindl, "Ion-Implanted Complementary MOS Transistors in Low-Voltage Circuits," IEEE J. Solid-State Circuits, vol. 7, no. 2, pp. 146-153, April 1972.
[5] H. J. M. Veendrick, "Short-Circuit Dissipation of Static CMOS Circuitry and Its Impact on the Design of Buffer Circuits," IEEE J. Solid-State Circuits, vol. 19, no. 4, pp. 468-473, August 1984.
[6] S. M. Kang, "Accurate Simulation of Power Dissipation in VLSI Circuits," IEEE J. Solid-State Circuits, vol. 21, no. 5, pp. 889-891, October 1986.
[7] G. J. Fisher, "An Enhanced Power Meter for SPICE2 Circuit Simulation," IEEE Trans. Computer-Aided Design, vol. 7, pp. 641-643, May 1988.
[8] G. Y. Yacoub and W. H. Ku, "An Enhanced Technique for Simulating Short-Circuit Power Dissipation," IEEE J. Solid-State Circuits, vol. 24, no. 3, pp. 844-847, June 1989.
[9] N. Meijs and J. T. Fokkema, "VLSI Circuit Reconstruction From Mask Topology," Integration, vol. 2, no. 2, pp. 85-119, 1984.
[12] M. I. Elmasry, "Digital MOS Integrated Circuits I," IEEE Press Book, 1981.
[13] R. H. Krambeck, C. M. Lee and H-F S. Law, "High-Speed Compact Circuits with CMOS," IEEE J. Solid-State Circuits, vol. 17, no. 3, pp. 614-619, June 1982.
[14] V. Friedman and S. Liu, "Dynamic Logic CMOS Circuits," IEEE J. Solid-State Circuits, vol. 19, no. 2, pp. 263-266, April 1984.
[15] N. F. Goncalves and H. J. De Man, "NORA: A Race-Free Dynamic CMOS Technique for Pipelined Logic Structures," IEEE J. Solid-State Circuits, vol. 18, no. 3, pp. 261-266, June 1983.
[16] C. M. Lee and E. W. Szeto, "Zipper CMOS," IEEE Circuits and Devices Mag., vol. 2, no. 3, pp. 10-17, May 1986.
[17] N. Weste and K. Eshraghian, "Principles of CMOS VLSI Design: A Systems Perspective," Addison-Wesley, Reading, MA, 1985.
[18] F. Lu and H. Samueli, "A 200-MHz CMOS Pipelined Multiplier-Accumulator Using a Quasi-Domino Dynamic Full-Adder Cell Design," IEEE J. Solid-State Circuits, vol. 28, no.
[19] J. Yuan and C. Svensson, "High-Speed CMOS Circuit Technique," IEEE J. Solid-State Circuits, vol. 24, no. 1, pp. 62-71, February 1989.
[20] M. Afghahi and C. Svensson, "A Unified Single-Phase Clocking Scheme for VLSI Systems," IEEE J. Solid-State Circuits, vol. 25, no. 1, pp. 225-233, February 1990.
[21] D. W. Dobberpuhl et al., "A 200-MHz 64-b Dual-Issue CMOS Microprocessor," IEEE J. Solid-State Circuits, vol. 27, no. 11, pp. 1555-1567, November 1992.
[22] H. B. Bakoglu, "Circuits, Interconnects, and Packaging for VLSI," Addison-Wesley, Reading, MA, 1990.
[23] K. Yano et al., "A 3.8-ns CMOS 16x16 Multiplier Using Complementary Pass-Transistor Logic," IEEE J. Solid-State Circuits, vol. SC-25, no. 2, pp. 388-394, April 1990.
[24] M. Suzuki et al., "A 1.5-ns 32-b CMOS ALU in Double Pass-Transistor Logic," IEEE J. Solid-State Circuits, vol. SC-28, no. 11, pp. 1145-1151, November 1993.
[25] A. Parameswar, H. Hara, and T. Sakurai, "A High-Speed, Low-Power, Swing Restored Pass-Transistor Logic Based Multiply and Accumulate Circuit for Multimedia Applications," IEEE Custom Integrated Circuits Conference, Tech. Dig., San Diego, CA, pp. 278-281, May 1994.
[26] L. A. Glasser and D. W. Dobberpuhl, "The Design and Analysis of VLSI Circuits," Addison-Wesley, Reading, MA, 1985.
[27] T. Kobayashi et al., "A Current-Controlled Latch Sense Amplifier and a Static Power-Saving Input Buffer for Low-Power Architecture," IEEE J. Solid-State Circuits, vol. SC-28, no. 4, pp. 523-527, April 1993.
[28] M. S. J. Steyaert et al., "ECL-CMOS and CMOS-ECL Interface in 1.2-µm CMOS for 150-MHz Digital ECL Data Transmission Systems," IEEE J. Solid-State Circuits, vol. SC-26, no. 1, pp. 18-24, January 1991.
[29] C. Mead and L. Conway, "Introduction to VLSI Systems," Addison-Wesley, Reading, MA, 1980.
[30] N. C. Li, G. L. Haviland and A. A. Tuszynski, "CMOS Tapered Buffer," IEEE J. Solid-State Circuits, vol. SC-25, no. 4, pp. 1005-1008, August 1990.
[31] M. Nemes, "Driving Large Capacitances in MOS LSI Systems," IEEE J. Solid-State Circuits, vol. SC-19, no. 1, pp. 159-161, February 1984.
[32] N. Hedenstierna and K. O. Jeppson, "CMOS Circuit Speed and Buffer Optimization," IEEE Trans. Computer-Aided Design, vol. CAD-6, no. 2, pp. 276-281, March 1987.
[33] A. J. Al-Khalili, Y. Zhu and D. Al-Khalili, "A Module Generator for Optimized CMOS Buffers," IEEE Trans. Computer-Aided Design, vol. CAD-9, no. 10, pp. 1028-1046, October 1990.
[34] S. R. Vemuru and A. R. Thorbjornsen, "Variable-Taper CMOS Buffer," IEEE J. Solid-State Circuits, vol. SC-26, no. 9, pp. 1265-1269, September 1991.
[35] J. Burkis, "Clock Tree Synthesis for High Performance ASICs," in IEEE ASIC Intern. Conf. and Exhibit, Rochester, NY, pp. PS-8.1-PS-8.3, September 1991.
[36] P. D. Ta and K. Do, "A Low-Power Clock Distribution Scheme for Complex IC System," in IEEE ASIC Intern. Conf. and Exhibit, Rochester, NY, pp. P1-5.1-P1-5.4, September 1991.
"Half-Swing Clocking Scheme for 75% Power Saving in Clocking Circuitry," Symposium on VLSI Circuits, Tech. Dig., Honolulu, pp. 23-24, June 1994.
[38] J. S. Caravella and J. H. Quigley, "Three Volt to Five Volt Interface Circuit with Device Leakage Limited DC Power Dissipation," in IEEE ASIC Intern. Conf. and Exhibit, Rochester, NY, pp. 448-451, September 1993.
[39] M. Shoji, "CMOS Digital Circuit Technology," Prentice Hall Inc., Englewood Cliffs, NJ, 1988.
[40] F. Abu-Nofal et al., "A Three-Million Transistor Microprocessor," in IEEE International Solid-State Circuits Conf., pp. 108-109, February 1992.
[41] T. Gabara and D. Thompson, "Ground Bounce Control in CMOS Integrated Circuits," in IEEE International Solid-State Circuits Conf., pp. 88-89, February 1988.
[42] T. Gabara, "Ground Bounce Control and Improved Latch-up Suppression Through Substrate Conduction," IEEE J. Solid-State Circuits, vol. 23, no. 5, pp. 1224-1232, October 1988.
[43] M. Hashimoto and O-K Kwon, "Low dI/dt Noise and Reflection Free CMOS Signal Driver," in IEEE Custom Integrated Circuits Conf., Tech. Dig., pp. 14.4.1-14.4.4, 1989.
[44] T. Wada, M. Eino and K. Anami, "Simple Noise Model and Low-Noise Data-Output Buffer for Ultra-High-Speed Memories," IEEE J. Solid-State Circuits, vol. 25, no. 6, pp. 1586-1588, December 1990.
[45] R. Senthinathan and J. L. Prince, "Application Specific CMOS Output Driver Circuit Design Techniques to Reduce Simultaneous Switching Noise," IEEE J. Solid-State Circuits, vol. 28, no. 12, pp. 1383-1388, December 1993.
[46] T. Knight and A. Krymm, "A Self-Terminating Low-Voltage-Swing CMOS Output Driver," IEEE J. Solid-State Circuits, vol. 23, no. 2, pp. 457-464, April 1988.
[47] H-J Schumacher, J. Dikken and E. Seevinck, "CMOS Subnanosecond True ECL Output Buffer," IEEE J. Solid-State Circuits, vol. 25, no. 1, pp. 150-154, February 1990.
[48] M. Pedersen and P. Metz, "A CMOS to 100K ECL Interface Circuit," in IEEE International Solid-State Circuits Conf., Tech. Dig., pp. 226-227, February 1989.
[49] J. Martinez, "BTL Transceivers Enable High-Speed Bus Design," EDN, August 1992.
[50] B. Gunning, L. Yuan, T. Nguyen and T. Wong, "A CMOS Low-Voltage-Swing Transmission-Line Transceiver," in IEEE International Solid-State Circuits Conf., Tech. Dig., pp. 58-59, February 1992.
[51] J. A. Quigley, J. S. Caravella and W. J. Neil, "Current Mode Transceiver Logic (CMTL) for Reduced Swing CMOS, Chip to Chip Communication," in IEEE International ASIC Conference and Exhibit, Rochester, NY, Tech. Dig., pp. 452-457, September 1993.
[52] M. Kakumu, "Process and Device Technologies of CMOS Devices for Low-Voltage Operation," IEICE Trans. Electron., vol. E76-C, no. 5, pp. 672-680, May 1993.
[53] T. Kawahara et al., "Subthreshold Current Reduction for Decoded-Driver by Self-Reverse-Biasing," IEEE J. Solid-State Circuits, vol. 28, no. 11, pp. 1136-1144, November 1993.
[54] S. Mutoh et al., "1 V High-Speed Digital Circuit Technology with 0.5-µm Multi-Threshold CMOS," in IEEE International ASIC Conference and Exhibit, Rochester, NY, Tech. Dig., pp. 186-189, September 1993.
[55] M. Horiguchi et al., "SSI CMOS Circuit for Low-Standby Subthreshold Current Giga-Scale LSI's," IEEE J. Solid-State Circuits, vol. 28, no. 11, pp. 1131-1135, November 1993.
[56] R. W. Badeau et al., "A 100-MHz Macropipelined VAX Microprocessor," IEEE J. Solid-State Circuits, vol. 27, no. 11, pp. 1585-1597, November 1992.
[57] R. Brodersen, A. Chandrakasan and S. Sheng, "Design Techniques for Portable Systems," in IEEE International Solid-State Circuits Conf., Tech. Dig., pp. 168-169, February 1993.
[58] Y. Nakagome et al., "Sub-1-V Swing Internal Architecture for Future Low-Power ULSI's," IEEE J. Solid-State Circuits, vol. 28, no. 4, pp. 414-419, April 1993.
[59] A. Bellaouar, I. S. Abu-Khater, and M. I. Elmasry, "Low-Power CMOS/BiCMOS Drivers and Receivers for On-Chip Interconnects," IEEE J. Solid-State Circuits, vol. 30, no. 1, May 1995.
[60] A. Chandrakasan et al., "Low-Power CMOS Digital Design," IEEE J. Solid-State Circuits, vol. 27, no. 4, pp. 473-484, April 1992.
[61] L. J. Svensson and J. G. Koller, "Driving a Capacitive Load without Dissipating fCV²," IEEE Symposium on Low Power Electronics, Tech. Dig., San Diego, pp. 100-101, October 1994.
[62] T. Gabara, "Pulsed Power Supply CMOS - PPS CMOS," IEEE Symposium on Low Power Electronics, Tech. Dig., San Diego, pp. 98-99, October 1994.
[63] J. S. Denker, "A Review of Adiabatic Computing," IEEE Symposium on Low Power Electronics, Tech. Dig., San Diego, pp. 94-97, October 1994.
[64] M. Horowitz, T. Indermaur, and R. Gonzalez, "Low-Power Digital Design," IEEE Symposium on Low Power Electronics, Tech. Dig., San Diego, pp. 8-11, October 1994.
5
LOW-VOLTAGE VLSI BICMOS CIRCUIT DESIGN
BiCMOS technology offers enhanced performance compared to CMOS at 5 V power supply voltage. Many high-speed BiCMOS SRAMs, gate arrays, ASICs, etc. have been fabricated [1]. In this chapter, we present a variety of BiCMOS logic circuits suitable for 3.3 V and sub-3.3 V operation. The potential gates for digital applications are identified. The chapter starts with the introduction of the conventional BiCMOS (totem-pole) gate which is used in 5 V applications. The degradation of this gate, with supply voltage scaling, is demonstrated. In Section 5.2, we introduce the BiNMOS family suitable for low-voltage applications. Other logic families, for low power supply voltage operation, are discussed in Section 5.3. Low-voltage digital applications of BiCMOS are identified. The reader is referred to BiCMOS books [2, 3] to get more familiar with BiCMOS circuits.
inverter. The addition of the bipolar driver stage to the basic CMOS inverter is responsible for the high current driving capability of BiCMOS over CMOS. As a result, BiCMOS offers lower delay compared to that of CMOS, especially at high loading capacitance. The operation of this gate is straightforward. When the input is low, the PMOS P is ON and its drain current turns the transistor Q1 ON. The collector current of Q1 charges the output load capacitance. As the output reaches V_DD - V_BE,on, where V_BE,on is the turn-on voltage of the bipolar transistor and is about 0.7 V, Q1 gradually turns OFF. During this period, the NMOS transistor N_d2 is ON. Since N_d2 is conducting, Q2 is in the cutoff region. Transistor N_d2 can also be controlled by the output node. However, using the base node results in faster operation because the base of Q1 is pulled up faster than the output node and because the voltage level of the base node is larger. If the input is high, the NMOS transistors N and N_d1 are ON. Q1 is OFF while Q2 turns ON to discharge the output node. As a result, the load capacitance is pulled down. As the output V_o reaches V_BE, transistor Q2 turns OFF and the output stays at this level. The conventional BiCMOS gate provides high drive capability, zero static power dissipation and high input impedance. More discussion of this gate is given in the following sections.
Figure 5.2 Conventional BiCMOS inverter
5.1.1 DC Characteristics
Fig. 5.3 shows the DC transfer characteristic of the conventional BiCMOS inverter of Fig. 5.2. When the input voltage to the BiCMOS inverter is zero, both the bipolar transistors are OFF. The PMOS device P operates in the linear region with zero drain-source voltage. Due to the subthreshold current of the transistor N (~ 10 pA), the base-emitter voltage of Q1 is around 0.45 V. As a result, the output voltage V_o = 4.55 V (V_DD = 5 V). The base of the bipolar transistor Q2 is at zero voltage because N_d2 is ON.
As the input voltage increases, the subthreshold current of N increases, causing V_BE,Q1 to rise and the output voltage to fall. When the input voltage is around mid-V_DD, both the P and N MOSFETs are ON and operate in the saturation region. The bipolar devices are also ON. At this point, the BiCMOS inverter is in the high-gain region and the output voltage drops sharply towards its low level.
Figure 5.3 DC transfer characteristic of the conventional BiCMOS inverter
As the input voltage increases again, the base of Q2 follows the voltage of the output since N is ON. When the input voltage reaches V_DD, the PMOS P is OFF. The discharge device, N_d1, is ON and the base of Q1 is at zero. Also, the output is completely discharged and N is ON. Then, the base of Q2 is at zero. In this case, the output voltage is zero and both the base-emitter voltages are zero.
[Figure: transient waveforms, panels (a) and (b); horizontal axis: Time (ns)]
analysis of the pull-up section. Then we show the difference in the case of the pull-down section. We assume a step input.
$$C_{P1} = C_{d,P} + C_{d,N_{d1}} + C_{g,N_{d2}}$$
where C_d,P and C_d,Nd1 are the drain junction capacitances of P and N_d1, and C_g,Nd2 is the gate oxide capacitance of N_d2. The overlap capacitances of P and N_d1 are assumed negligible. The bipolar parasitic capacitance C_P2 of Fig. 5.5(a) is given by
$$C_{P2} = C_{C,Q_1} + C_{E,Q_1} \qquad (5.3)$$
The total load capacitance, C_o, shown in Fig. 5.5(b), is given by
$$C_o = C_L + C_{S,Q_2} + C_{C,Q_2} \qquad (5.4)$$
where C_L is the external load capacitance, C_S,Q2 is the average collector-substrate capacitance of Q2 and C_C,Q2 is the average base-collector capacitance of Q2. Recall from Section 3.5.3 that the base-emitter diffusion capacitance is given by
$$C_{D,Q_1} = \tau_F \frac{dI_{C,Q_1}}{dV_{BE}} \qquad (5.5)$$
1. The first component, t_1, is defined as the time required to turn Q1 ON. The model of Fig. 5.5(a) can be used in this case. Writing the current equation at the base node of Q1, we have
$$(C_{P1} + C_{P2}) \frac{dV_{BE}}{dt} = I_{DS,sat} \qquad (5.6)$$
Solving that equation and assuming that initially the base-emitter voltage of Q1 is zero, we have
$$t_1 = (C_{P1} + C_{P2}) \frac{V_{BE,on}}{I_{DS,sat}} \qquad (5.7)$$
If the initial V_BE is not zero then the above expression should be corrected. A typical value of t_1 is 17.5 ps for a total parasitic capacitance at the base node of 50 fF, V_BE,on = 0.7 V, and I_DS,sat = 2 mA.
2. The second component, t_2, is defined as the time required to charge the diffusion capacitance, C_D,Q1. Starting from t_1, the collector current begins to rise quickly and then reaches its peak value, I_CP. The output voltage changes slowly (see the waveforms of Fig. 5.4). So t_2 is defined as the time required for the collector current to reach its peak. This delay component is given by
$$t_2\, I_{DS,sat} = \tau_F\, I_{CP} \qquad (5.8)$$
which means that the charge furnished by the PMOS is needed to charge the diffusion capacitance. Therefore,
$$t_2 = \frac{\tau_F\, I_{CP}}{I_{DS,sat}} \qquad (5.9)$$
The peak collector current of Q1 can be approximated using Equation (3.111) [Section 3.5.2]. So we have
$$I_{CP} = \sqrt{\beta_0\, I_{KF}\, I_{DS,sat}} \qquad (5.10)$$
where β_0 is the value of the gain for low-level injection and I_KF is the forward knee current. Note that τ_F is increased by the collector current [see Equation (3.127), Section 3.5.3]. Hence, an average value of the forward transit time should be used in the above delay expression. The initial value of τ_F is 12 ps and it can reach 50 ps when the collector current reaches, for example, 5 mA. For I_DS,sat = 2 mA, a typical value for t_2 is 78 ps (the average forward transit time is 31 ps).
3. The third component, t_3, is defined as the time required to charge the total load capacitance to the middle point of the output swing. If we assume that the voltage across the base-emitter of Q1 is almost constant, then we have the following approximation
$$C_o \frac{dV_o}{dt} = I_{DS,sat} + I_{C,Q_1} \qquad (5.11)$$
If we assume that I_C,Q1 is constant during this time [see Fig. 5.4], and the mid-point of the output is V_DD/2, then we have
$$t_3 = \frac{C_o\, V_{DD}}{2\,(I_{DS,sat} + I_{CP})} \qquad (5.12)$$
The value of this delay varies by more than an order of magnitude depending on the device sizes and the load capacitance. For example, for a load C_L of 1 pF, this delay, t_3, has a typical value of 400 ps at 5 V power supply voltage, while for a load of 100 fF a typical value is 70 ps.
Hence, the total delay t_d can be written as
$$t_d = t_1 + t_2 + t_3 \qquad (5.13)$$
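The three delay components can be evaluated together for the typical values quoted in the text. This is only a rough sketch: the t_3 expression is taken here as C_o V_DD / (2 (I_DS,sat + I_CP)), which is an assumption consistent with the delay-load slope discussed for this gate; the 5 mA peak collector current and the 1 pF total output capacitance are illustrative choices, and the function name `bicmos_pull_up_delay` is hypothetical.

```python
def bicmos_pull_up_delay(c_base, v_be_on, i_ds_sat, tau_f, i_cp, c_out, v_dd):
    """Return (t1, t2, t3) in seconds for the BiCMOS pull-up path.

    t1: time to charge the base-node parasitics up to V_BE,on
    t2: time to supply the diffusion charge tau_f * i_cp
    t3: time to charge the output to V_DD / 2 (assumed constant currents)
    """
    t1 = c_base * v_be_on / i_ds_sat
    t2 = tau_f * i_cp / i_ds_sat
    t3 = c_out * v_dd / (2.0 * (i_ds_sat + i_cp))
    return t1, t2, t3


# Values quoted in the text: 50 fF at the base node, V_BE,on = 0.7 V,
# I_DS,sat = 2 mA, average tau_F = 31 ps. The 5 mA peak collector current
# and the 1 pF output capacitance are assumed for illustration.
t1, t2, t3 = bicmos_pull_up_delay(
    c_base=50e-15, v_be_on=0.7, i_ds_sat=2e-3,
    tau_f=31e-12, i_cp=5e-3, c_out=1e-12, v_dd=5.0,
)
t_total = t1 + t2 + t3
print(f"t1 = {t1 * 1e12:.1f} ps, t2 = {t2 * 1e12:.1f} ps, "
      f"t3 = {t3 * 1e12:.1f} ps, total = {t_total * 1e12:.1f} ps")
```

With these numbers t_3 is several times larger than t_1 + t_2, which illustrates why the load term dominates for large capacitive loads while the base-node and transit-time terms set the floor for small ones.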
The first delay is associated with the parasitics at the base, the second one with the forward transit time, and the last one is a function of the load capacitance. For small loads, t_1 and t_2 dominate. However, for large output loads, the third delay term, t_3, dominates. The expression of the pull-down time is similar to that of the pull-up time except for the value of the drain current of the transistor N [see Fig. 5.2]. The saturation current of this device is given by
$$I_{DS,sat} = K_n\, C_{ox}\, v_{sat}\, W_n\, (V_{GS} - V_{Tn}) \qquad (5.14)$$
The V_GS of the NMOS during the switching is affected by the V_BE drop while that of the PMOS is not. This voltage is given by
$$V_{GS} = V_{in} - V_{BE} \qquad (5.15)$$
So the effective gate-source voltage of the NMOS is lower than that of the PMOS. The sizing of the NMOS and PMOS devices does not follow the rule used for CMOS. It can only be determined from circuit simulation to get symmetrical rise/fall delay times.
The slope of the delay-load characteristic of the BiCMOS gate is smaller than that of CMOS, since it is equal to V_DD/[2(I_DS,sat + I_CP)]. For a CMOS gate, the slope is simply V_DD/(2 I_DS,sat). The saturation current in the CMOS gate is slightly higher than that of BiCMOS because the CMOS inverter has a slightly wider PMOS device [see next section]. However, the slope of the BiCMOS inverter is still smaller due to the large I_CP. Therefore, the BiCMOS gate has a higher drivability than CMOS.
5.1.3
Let us compare the delay of the BiCMOS gate to that of the CMOS gate, both having the same input capacitance. We consider the case of inverters with the following sizes. For the BiCMOS inverter, we have: W_p = W_n = 10 µm, W_Nd1 = W_Nd2 = 2 µm, and the emitter area is x2 the minimum area. For the CMOS inverter, we have W_p = 15 µm and W_n = 7 µm. For unloaded inverters, and from the delay expression of the BiCMOS inverter discussed above, t_d,CMOS < t_d,BiCMOS because the BiCMOS circuit has more parasitics and requires an initial delay to turn ON the bipolar device. For large loads, t_d,CMOS > t_d,BiCMOS, as explained previously. Fig. 5.6 shows the simulated delays of the CMOS and BiCMOS inverters as a function of the fanout. Fanout is defined here as the ratio of the load seen by the gate to the input capacitance. In other words, fanout is equal to the number of gates connected to the output of the driving gate, all having the same input capacitance. The inputs are driven by a small-size inverter of the same type to have typical input waveform fall/rise times. For low fanout, 1 to 2, CMOS outperforms BiCMOS at 5 V power supply voltage. However, when the fanout is greater than 3, BiCMOS outperforms CMOS, particularly for high loads. In Fig. 5.6, the crossover capacitance (or fanout), denoted C_x, is typically in the order of 100 fF. This crossover value is critical for the performance of BiCMOS, particularly when the supply voltage is scaled down.
5.1.4 Power Dissipation
As discussed, the BiCMOS gate of Fig. 5.2 has no DC current path from $V_{DD}$ to $V_{SS}$ if the input has rail-to-rail swing. Hence the static power dissipation is negligible if the $V_T$ of the MOS devices is high. The dynamic power dissipation of the gate can be estimated from the circuit diagram of Fig. 5.7.
It is estimated by
$$P_d = C_{p1} V_{DD}^2 f + C_{p2} V_{BE,max}^2 f + C_L V_{DD} (V_H - V_L) f \qquad (5.16)$$
The first term is due to the total parasitic capacitance at the base node of $Q_1$, where the swing is $V_{DD}$. The second term is due to the parasitic capacitance at the base node of $Q_2$. The swing at this node is limited to $V_{BE,max}$, when the collector current reaches its peak. Finally, the third term is related to the output load capacitance, $C_L$, and the parasitic capacitance at the output. The swing is only $V_H - V_L$, where $V_H$ and $V_L$ are the high level and the low level of the output, respectively. These levels are affected by the output load.
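The three terms of Equation (5.16) translate directly into a small function. This is a minimal sketch that assumes the equation reads $P_d = C_{p1}V_{DD}^2 f + C_{p2}V_{BE,max}^2 f + C_L V_{DD}(V_H - V_L) f$; every numeric value below is hypothetical and serves only to show the relative size of the terms.

```python
def bicmos_dynamic_power(c_p1, c_p2, c_l, v_dd, v_be_max, v_h, v_l, f):
    """Return the three dynamic-power terms (W) of the assumed Eq. (5.16)."""
    base_q1 = c_p1 * v_dd ** 2 * f          # base node of Q1, swing V_DD
    base_q2 = c_p2 * v_be_max ** 2 * f      # base node of Q2, swing V_BE,max
    output = c_l * v_dd * (v_h - v_l) * f   # output node, swing V_H - V_L
    return base_q1, base_q2, output


# Hypothetical operating point: 50 fF parasitics, 1 pF load, 5 V, 100 MHz.
terms = bicmos_dynamic_power(
    c_p1=50e-15, c_p2=50e-15, c_l=1e-12,
    v_dd=5.0, v_be_max=0.9, v_h=4.3, v_l=0.7, f=100e6,
)
p_total = sum(terms)
print([f"{t * 1e3:.4f} mW" for t in terms], f"total {p_total * 1e3:.4f} mW")
```

With a large load the output term dominates, which fits the observation that BiCMOS and CMOS dissipate similar dynamic power when the load is large.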
For small loads the power of BiCMOS is greater than that of CMOS, while for large loads they have almost the same dynamic power. Table 5.1 shows the simulation results of the power dissipation for both gates at 5 V power supply. At a fanout of 1, CMOS consumes much less power than BiCMOS and it is faster. However, at a fanout of 10, BiCMOS is faster (37.5% delay reduction) and dissipates only 24% more power than CMOS.

When a BiCMOS gate is driving another BiCMOS gate or a CMOS gate, the driven gate exhibits a DC power dissipation. This DC current is not acceptable, particularly when the circuit is in standby mode. It is due to the reduced swing at the output of the first gate. Fig. 5.8 shows an example of a BiCMOS gate driving a CMOS gate. If, for example, the output of the first gate (BiCMOS) is at $V_{BE}$, the $V_{GS}$ of the driven NMOS would be higher than zero and around $V_T$, resulting in appreciable DC power. Furthermore, the drive current of the driven gate would be reduced, particularly at low power supply voltage. Another disadvantage of the reduced swing is the noise margin reduction.
Table 5.1 Power dissipation of the BiCMOS and CMOS gates at $V_{DD}$ = 5 V, f = 100 MHz.

Driver    Fanout=1   Fanout=5   Fanout=10
BiCMOS    0.67       0.83       1.26
CMOS      0.23       0.58       1.02
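The fanout-of-10 trade-off can be checked with a few lines of arithmetic. The sketch below assumes that the pair 1.26 / 1.02 in Table 5.1 reads as BiCMOS then CMOS (an interpretation, since the table does not label the pair in this copy) and combines the quoted 37.5% delay reduction with the 24% power increase into a relative power-delay product; the function names are introduced here for illustration.

```python
def power_overhead(p_bicmos, p_cmos):
    """Fractional extra power of BiCMOS relative to CMOS."""
    return (p_bicmos - p_cmos) / p_cmos


def relative_power_delay(delay_reduction, power_increase):
    """Power-delay product of BiCMOS normalised to CMOS (1.0 = equal)."""
    return (1.0 - delay_reduction) * (1.0 + power_increase)


overhead = power_overhead(1.26, 1.02)          # assumed BiCMOS, CMOS order
pdp_ratio = relative_power_delay(0.375, 0.24)  # figures quoted in the text

print(f"power overhead: {100 * overhead:.1f} %")
print(f"relative power-delay product: {pdp_ratio:.3f}")
```

A ratio below 1 means that, at this fanout, the speed gain outweighs the extra power in terms of energy per switching event.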
Figure 5.8 Gate 1 (BiCMOS) driving Gate 2 (CMOS), with the DC current path in the driven gate.
and $N_s$. When $V_o$ falls below $V_{BE}$, $Q_2$ ceases to sink current from the load capacitance. Then the output is discharged to ground through only the MOS transistors $N$ and $N_s$. The final charging and discharging phases occur through the shunting devices. Hence, these phases can be slow because the MOS shunting devices have low drive capabilities. When this FS BiCMOS gate is operating at high frequency, the output swing can be reduced. Another drawback of this circuit is that part of the current supplied by $P$ ($N$) is wasted through the shunting transistors, which weakens the bipolar drive. The shunting transistors $P_s$ and $N_s$ can be minimum size.

The problem of the base drive inherent in the "FS type" BiCMOS gate can be overcome by using feedback (FB) from the output through an inverter, as shown in Fig. 5.9(b). This circuit is called "FB type" [9]. During the pull-up transition, the shunting device $P_s$ is initially OFF and the PMOS transistor $P$ supplies all its current to the base of $Q_1$. When $V_o$ approaches its high level, the inverter $I$ turns ON $P_s$, which itself charges the output node to $V_{DD}$. The pull-down transition can be explained similarly. The shunting devices $P_s$ and $N_s$ and the inverter $I$ can be sized properly to achieve greater speed than the other configurations, even the conventional BiCMOS gate.
Figure 5.9 Full-swing BiCMOS gate types: (a) "FS type"; (b) "FB type"; (c) "CE shunting type".
Another full-swing configuration is the one shown in Fig. 5.9(c). It uses a parallel inverter from the input to shunt the collector-emitter (CE) of the $Q_1$ and $Q_2$ outputs. The disadvantage of this gate is the increased input capacitance.
5.1.6
The output bipolar stage introduces $V_{BE}$ voltage losses at the output node, as discussed earlier. When a BiCMOS gate is driving another BiCMOS gate, the conventional BiCMOS gate loses its superior performance over CMOS at lower power supply voltage. The major cause of this problem is the pull-down section of the BiCMOS gate. The $V_{GS}$ voltage of the driving NMOS transistor of the pull-down section is equal to $V_{DD} - 2V_{BE}$. As $V_{DD}$ is reduced, $V_{GS}$ is significantly reduced, resulting in degradation of the drain current and hence of the driving capability of the conventional BiCMOS gate. Fig. 5.10 shows the delay of a BiCMOS inverter in comparison to that of a CMOS inverter as the supply voltage is scaled down. The reported delay times were extracted from SPICE simulation by measuring the delay of the second gate in a chain of identical inverters. All gates were equally loaded by a load $C_L$ = 0.25 pF and one fanout. All the circuits have the same input capacitance. The BiCMOS inverter fails to operate at 2 V power supply. BiCMOS outperforms CMOS, but at 3 V and below it loses its superior performance. The limit of operation of the conventional BiCMOS gate with the power supply voltage is determined by the NMOS device of the pull-down section. The drive current of this NMOS device scales with $(V_{DD} - 2V_{BE} - V_{Tn})$. Hence, $V_{DD,min} \approx 2.2$ V. Therefore, high-performance BiCMOS circuits, at low voltage, are needed that minimize
In the basic circuit of Fig. 5.11(a), the output reaches only the $V_{DD} - V_{BE}$ level. This increases the delay and power dissipation of the subsequent gates. If a resistor (in this case the gate is called BiRNMOS) or a grounded-gate PMOS transistor is inserted between the emitter and the base of the pull-up bipolar transistor, the output achieves full swing. However, this degrades the speed of the gate because part of the base current is bypassed by the inserted element and hence is reduced.
Many alternatives have been proposed, such as BiPNMOS [11] and PBiNMOS [12], to realize full-swing output. The BiPNMOS is shown in Fig. 5.11(c). A small-size PMOS transistor and an inverter are added to the basic BiNMOS gate. The PMOS device realizes full-swing output when the output changes from low to high. The added PMOS turns ON only when the output reaches the threshold voltage of the feedback inverter. Hence, the base current supplied by the pull-up PMOS transistor is not affected by this added PMOS transistor. Consequently, the BiPNMOS gate has higher performance than conventional BiNMOS and BiRNMOS. One drawback of the BiPNMOS is the increased output load capacitance due to the inverter $I$.

The PBiNMOS gate configuration, shown in Fig. 5.11(d), uses a small-size PMOS device in parallel with the bipolar pull-up transistor to realize full-swing output. This configuration results in better performance compared to the other circuit structures but slightly increases the input capacitance of the gate. In this section, we show that a properly optimized PBiNMOS gate is faster than CMOS, even at low power supply and load.
5.2.1
In this section we discuss the effect of the circuit parameters available to the designer to optimize the PBiNMOS gate for fast operation at low fanout, using the 0.8 µm BiCMOS device parameters discussed in Chapter 3. We optimize the design of the inverter. The technique can then be extended to more complex gates.
Finding the proper sizing of the input MOSFETs $P$ and $N$ ($W_p$ and $W_n$, respectively) is not trivial. The sizing of the remaining MOS devices [see Fig. 5.11(d)] is not critical; for typical applications, it is enough to use near-minimum-size devices. When the delay of the PBiNMOS is plotted versus the width of one of the devices $P$ or $N$ for different fanouts, a common optimum width exists, as shown in Fig. 5.12(a), with a flattened region. This optimum arises because increasing the size increases the drivability of the gate but also increases the equivalent output load, so at a certain size the delay reaches a minimum. From this figure, the optimum $W_p$ is 9 µm and $W_n$ = 11 µm (particularly for low fanout). Note that in Fig. 5.12(a), we have chosen $W_p = 0.8 W_n$. This is explained in more detail below.

When the BiNMOS inverter is used as a driver of a fixed load (e.g., a bus) instead of driving gates, we should consider the delay of the driver including the delay of the stage that drives it. In Fig. 5.12(b), the total delay of the PBiNMOS driver and the CMOS inverter that drives it is plotted for two fixed loads: 0.2 pF and 0.5 pF. The CMOS stage has a minimum size. The minimum delay is around the point determined previously for the fanout case.

The choice of the emitter area in this gate depends on the technology and the load. For the 0.8 µm BiCMOS at 3.3 V power supply voltage, it was found that using the minimum emitter area ($A_E \times 1$ = 0.8 µm × 4 µm) gives the minimum delay for loads up to 1 pF.

Fig. 5.13 shows that the optimal $W_p/W_n$ ratio is the same for different fanouts and is equal to 0.8. This point also gives almost symmetrical fall/rise delays. So even if the fanout is unknown, the optimum gate is fixed and the sizes depend only on the device parameters. This result is very important for standard cells and gate arrays, where the cells are designed with unknown loads.
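The quoted optimum widths follow from the 0.8 ratio once the total input width is fixed. The sketch below assumes the constraint $W_p + W_n = 20$ µm that appears as an annotation on Fig. 5.13; `split_widths` is a helper name introduced here for illustration.

```python
def split_widths(total_width_um, ratio_p_to_n):
    """Split a fixed total gate width into (W_p, W_n) with W_p / W_n = ratio."""
    w_n = total_width_um / (1.0 + ratio_p_to_n)
    w_p = total_width_um - w_n
    return w_p, w_n


# Assumed constraint from the Fig. 5.13 annotation: W_p + W_n = 20 um.
w_p, w_n = split_widths(20.0, 0.8)
print(f"W_p = {w_p:.2f} um, W_n = {w_n:.2f} um")
```

Rounded to whole microns this gives the 9 µm and 11 µm values read from Fig. 5.12(a).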
Figure 5.13 The delay of the PBiNMOS inverter versus the ratio $W_p/W_n$ for a fixed input capacitance ($V_{DD}$ = 3.3 V, $W_p + W_n$ = 20 µm).
Figure 5.14 Comparison of the CMOS and PBiNMOS delays, for the same input capacitance, as a function of the fanout.
5.2.2
Fig. 5.14 shows the delay of CMOS and PBiNMOS inverters as a function of the fanout. Both gates have the same input capacitance. The important result of this plot is that the PBiNMOS gate is always faster than CMOS, except for a fanout of 1, where PBiNMOS is slightly slower. For a fanout of 3, which is a typical value in many designs, the delay is reduced by 20%. For a higher fanout, the delay is reduced by 25-40%. This result is quite different from the case of conventional BiCMOS, where a high fanout (or load) is required for BiCMOS to outperform CMOS.
Let us compare the power dissipation of the gates for different fanouts. Table 5.2 shows this comparison for small fanouts. The power dissipations of both gates are comparable and are the same for a fanout greater than 3. The small-size additional bipolar transistor in the BiNMOS gate does not result in significant power dissipation overhead. This result shows that the BiNMOS family is an excellent choice for low-power and high-speed operation. However, for a fanout of 1 to 2, CMOS can still be used.
Table 5.2 CMOS/PBiNMOS power dissipations versus fanout at $V_{DD}$ = 3.3 V, f = 100 MHz.

           Fanout=2   Fanout=3   Fanout=5
CMOS       149        192        277
PBiNMOS    171        203        287
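The shrinking power overhead with fanout can be made explicit. The sketch below assumes the Table 5.2 entries pair up as (CMOS, PBiNMOS) = (149, 171), (192, 203) and (277, 287) for fanouts of 2, 3 and 5; this pairing is an interpretation of the table, not something the text states outright.

```python
# Assumed reading of Table 5.2: fanout -> (CMOS power, PBiNMOS power),
# in the table's own (unstated) units.
power = {2: (149, 171), 3: (192, 203), 5: (277, 287)}

overhead = {
    fanout: (p_pbinmos - p_cmos) / p_cmos
    for fanout, (p_cmos, p_pbinmos) in power.items()
}

for fanout in sorted(overhead):
    print(f"fanout {fanout}: PBiNMOS overhead {100 * overhead[fanout]:.1f} %")
```

Under this reading the relative overhead falls steadily as the fanout grows, in line with the statement that the two gates dissipate comparable power at higher fanout.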
easily constructed using the basic PBiNMOS inverter of Fig. 5.11(d). Two-input NOR and NAND gates are shown in Fig. 5.15(a) and Fig. 5.15(b). The logic function is implemented using the PMOS and NMOS blocks as in CMOS technology. The bipolar device $Q_1$ is used as a current drive. More complex functions can be implemented using standard CMOS gate formation theory. The layout of the PBiNMOS inverter is shown in Fig. 5.16. The BJT consumes area in the PBiNMOS gate. However, when complex gates are implemented with more MOS devices, the relative extra area of the BJT is reduced.
Figure 5.15 Schematics of: (a) PBiNMOS two-input NOR gate; (b) PBiNMOS two-input NAND gate.
One technique to reduce the area penalty of the BJT is to use merged N-well bipolar and PMOS devices.
5.2.4
For future technologies, the power supply voltage will be scaled below 3.3 V. Fig. 5.17 shows the delay of PBiNMOS and CMOS inverters for a fanout of 3 versus the power supply voltage. The reported delay times were extracted from SPICE simulation by measuring the delay of the second gate in a chain of identical inverters. In this case, the full-swing operation at the input of a PBiNMOS inverter is provided by an identical gate, where a shunting PMOS is used. Fig. 5.17 shows that PBiNMOS is faster than CMOS down to 2.5 V. At 2.5 V the delay reduction is 15%. The crossover power supply voltage between PBiNMOS and CMOS is around 2.15 V. Note that in this comparison we used a 0.8 µm BiCMOS technology optimized for 5 V operation. To compare BiNMOS with CMOS at low voltage, a deep-submicron technology should be used. From the device-level point of view, scaled technology is expected to improve the performance of BiNMOS at low voltage. However, 2 V is the limit of the use of BiNMOS, since almost half of the swing at sub-2 V is provided by the poor shunting PMOS device. In summary, the BiNMOS family provides the following advantages:
- Simple gate compared to other BiCMOS logic circuits;
- Good performance at the 3.3 and 2.5 V power supply voltage generations, even at low fanout; and
- Needs only a simple BiCMOS process.
The only disadvantage of BiNMOS is its poor performance for sub-2 V operation. The small area penalty of BiNMOS is not a problem, since for complex gates the overhead of the bipolar device is minimized.
For fast operation at low voltage, the full-swing operation should be realized with bipolar devices. Otherwise, the techniques based on shunting devices do not provide high drivability.
5.3.1
In this section two circuit techniques to overcome the shortcomings of the conventional BiCMOS gate are discussed and compared. These gates are intended to be used for sub-3.3 V operation. They are also devised to solve the problem of using a PNP transistor (see the next section on Complementary BiCMOS). In all these circuits, the improvement is made mainly to the pull-down section of the conventional BiCMOS, since it is the major cause of speed degradation at low voltage.
Figure 5.18 The MBiCMOS gate.
It was shown that this configuration (with shunted source/substrate) is faster than its CMOS counterpart down to 2.2 V supply voltage using a sub-0.5 µm BiCMOS technology [15, 16].
5.3.2
Full-swing operation can also be achieved by using what is called Complementary BiCMOS (CBiCMOS). The use of complementary BiCMOS has been encouraged by the recent advances in bipolar technology, which led to high-performance PNP transistors. It is expected that the NPN and PNP transistors will exhibit close performance when the devices are scaled down and the base doping increases. In this section, we study the emitter-follower (EF) CBiCMOS. Fig. 5.20 shows the use of a complementary bipolar output stage to form the basic complementary BiCMOS circuits [18, 19]. The pull-up section is similar to the conventional BiCMOS. The pull-down section is symmetrical to the pull-up. The current of the NMOS transistor $N$ does not suffer from the $V_{GS}$ reduction due to $Q_2$, as it does in conventional BiCMOS. The static swing varies between $V_{BE}$ and $V_{DD} - V_{BE}$. However, as explained in Section 5.1.2, the actual swing might be larger than the static design value. The balanced transconductance of the PMOS/NPN and NMOS/PNP pairs makes it easier to obtain symmetrical fall and rise times. Hence this circuit eliminates the degradation of the pull-down delay with power supply voltage seen in the conventional BiCMOS.
Figure 5.20 Basic complementary BiCMOS circuit.
The gate of Fig. 5.20 can be modified to achieve full-swing operation by using emitter-base shunting devices. Fig. 5.21(a) shows EF CBiCMOS with the shunting technique. The shunting MOS transistors across the base-emitter junctions permit restoration of the full logic level of the output. But the full swing is still achieved with the two slow MOS devices. Some of the base current can be consumed by the shunting devices, which weakens the drive of $Q_1$ and $Q_2$. To overcome this problem, the feedback technique can be used, as shown in the circuit of Fig. 5.21(b). The turn-ON of the shunting devices is delayed by the feedback inverter, $I$. These CBiCMOS circuits have two drawbacks: poor performance at 2 V power supply voltage and below, and high processing cost because of the high-performance PNP device needed. This low performance at low voltage is due mainly to the fact that $2V_{BE}$ of the output swing is generated by the two shunting transistors.
Figure 5.21 EF CBiCMOS circuits: (a) with shunting devices; (b) with feedback.
only when the operating frequency is low, where the gate can complete its full-swing operation, and/or when the load capacitance is small [20]. Full-swing circuits with full bipolar drive are needed. In this section, a CBiCMOS variation suitable for sub-2 V operation, called Transient Saturation (TS), is presented.
Fig. 5.22 shows the basic common-emitter complementary BiCMOS (CE CBiCMOS) circuit. The circuit is symmetrical and has symmetrical fall and rise times. When the input goes high, $N$ turns ON to sink current from the base of the PNP transistor $Q_2$. When the base voltage of $Q_2$ falls to $V_{DD} - V_{BE}$, $Q_2$ turns ON to source current to the output load capacitance. $Q_2$ eventually saturates and the output node is pulled up to $V_{DD} - V_{CE,sat}$. At the end of charging, the MOS device is still consuming current. The operation of the pull-down section can be explained similarly. Hence, the operation of CE CBiCMOS is non-inverting and the gate needs an extra CMOS inverter at the input to achieve the complement function. In this circuit, the MOS transistors operate in saturation, hence they supply high current to the bipolar transistors. Furthermore, the output has a near rail-to-rail swing ($V_{CE,sat}$ to $V_{DD} - V_{CE,sat}$). This circuit offers high speed at low voltage, but has two drawbacks: (i) high static power dissipation, due to the DC current flowing through the base of either $Q_1$ or $Q_2$, and (ii) excess delay due to the slow process of turning the saturated BJTs OFF.
Figure 5.22 Common-emitter CBiCMOS gate.
These two problems have been solved with several implementations [21, 22]. One possible implementation is shown in Fig. 5.23. It is called Transient Saturation Full-Swing (TS-FS) BiCMOS. This logic uses the principle of the CE CBiCMOS described in Fig. 5.22. When the input falls, we assume that the output is charged high, so $P_3$ is ON. $P_2$ turns ON and the base of $Q_1$ is charged through $P_2$ and $P_3$ [Fig. 5.23(b)]. Consequently, $Q_1$ discharges the output (load). When the output voltage approaches zero, the inverter $I_1$ turns $P_3$ OFF and $N_4$ ON [Fig. 5.23(c)]. The base voltage of $Q_1$ falls below $V_{BE}$, causing it to turn OFF. Although $Q_1$ saturates, this does not slow the next pull-up transition because the excess minority carriers of $Q_1$ are discharged immediately after the pull-down operation. Thus, the bipolar transistor saturates only transiently. The circuit is symmetrical, hence the operation of the pull-up section can be explained similarly. The PMOS transistor $P_3$ cuts off the DC current path during the pull-down transition to avoid any static power dissipation. The small-size output latch, composed of the inverters $I_1$ and $I_2$, holds the output level because in steady state there is no path between the output and the supply lines. Compared to the BiCMOS logic circuits presented so far, TS-FS is faster below 2 V supply when the load is relatively large (~1 pF). At 1.5 V it is twice as fast as CMOS for large loads. Although this circuit solves the problem of speed degradation of BiCMOS at 1.5 V power supply, it still has several drawbacks:
- process complexity due to the PNP bipolar transistor;
- large area;
- relatively high crossover point with CMOS (~0.4 pF); and
- it is a noninverting circuit.

Figure 5.23 (a) Circuit configuration of TS-FS BiCMOS; (b) and (c) transient saturation operation for the pull-down section.
5.3.4
Bootstrapped BiCMOS
An alternate way to avoid the negative effect of the $V_{BE}$ loss in BiCMOS is simply to use a second supply voltage equal to $(V_{DD} + V_{BE})$. However, this approach is costly because of the additional wires needed to distribute it across the chip and the need for the second supply voltage. Another approach is to use a bootstrapping technique to pull up the base of the pull-up bipolar transistor to $(V_{DD} + V_{BE})$ and hence the output to $V_{DD}$. The generation of voltages higher than the power supply at the gate level adds an extra degree of freedom to BiCMOS. Schottky BiNMOS/BiCMOS circuit configurations using bootstrapping have been proposed to overcome the negative effect of the $V_{BE}$ loss [20]. The full-swing operation is performed by saturating the bipolar transistor of the pull-up section with a base current pulse, after which the base is isolated and bootstrapped to a voltage higher than $V_{DD}$. These Schottky circuits outperform all existing BiCMOS families in the low-voltage regime down to 2 V, but they need a BiCMOS technology with a good integrated Schottky diode. Other examples of such a technique are the bootstrapped BiCMOS circuits published in [23, 24, 25]. The main advantage of the bootstrapped circuits is that they can be realized in a conventional BiCMOS process with CMOS and NPN transistors only. In this section, we present one bootstrapped circuit which overcomes many drawbacks of the BiCMOS logic families discussed previously.
Figure 5.24 The BFBiCMOS inverter.
Compared to the Bootstrapped BiCMOS (BS-BiCMOS) [23] of Fig. 5.25, the BFBiCMOS has several advantages. First, the bootstrapping capacitor is driven by the output rather than by the input, as in the BS-BiCMOS. Second, in BS-BiCMOS the gate of the precharge transistor $P_p$ is driven to $V_{DD}$ and the node $n_1$ to $V_{DD} + V_{BE}$. Hence, when $V_T$ is lower than $V_{BE}$, the bootstrapped node leaks its charge, which results in less efficient bootstrapping. Third, a PMOS transistor $P_3$ is used to discharge the base to a precharged level $V_T$, resulting in improved performance. Furthermore, the BS-BiCMOS has a higher crossover capacitance and lower performance than the BFBiCMOS.
Figure 5.25 The BS-BiCMOS circuit [23].
The simulated waveforms at 1.5 V power supply of the BFBiCMOS inverter are shown in Fig. 5.26. The base of $Q_1$ goes to $(V_{DD} + V_{BE})$ when the input is low. Note that when the input is high the base voltage falls to $V_T$.
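$$Q_i = V_{DD} C_{boot} + V_{DD} C_p \qquad (5.17)$$
In order for the output to reach $V_{DD}$, the base of $Q_1$ must reach $V_{DD} + V_{BE}$ (during the bootstrapping cycle). Thus the charge on the base-node capacitance $C_p$ is $(V_{DD} + V_{BE}) C_p$, and the total charge at the end of the bootstrapping cycle is
$$Q_f = V_{BE} C_{boot} + (V_{DD} + V_{BE}) C_p \qquad (5.18)$$
$$\Delta Q = Q_i - Q_f \qquad (5.19)$$
$$\Delta Q = I_b t_r \qquad (5.20)$$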
where $I_b$ is the average base current of $Q_1$ and $t_r$ is the rise time of the output. From Equations (5.17)-(5.20) we find that
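As a sketch of this step, assume the charge on the bootstrapped node is $Q_i = V_{DD}(C_{boot} + C_p)$ before bootstrapping and $Q_f = V_{BE}C_{boot} + (V_{DD} + V_{BE})C_p$ after it, where $C_p$ denotes the parasitic capacitance at the base node; $Q_i$, $Q_f$ and $C_p$ are labels adopted here for illustration. Equating the charge difference to the base charge $I_b t_r$ delivered during the rise time then gives

$$(V_{DD} - V_{BE})\,C_{boot} - V_{BE}\,C_p = I_b t_r \quad\Rightarrow\quad C_{boot} = \frac{I_b t_r + V_{BE} C_p}{V_{DD} - V_{BE}}$$

Under these assumptions the denominator $V_{DD} - V_{BE}$ shrinks as the supply is lowered, so the required $C_{boot}$ grows.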
This equation indicates that $C_{boot}$ has to be increased as the power supply is scaled down. When power supply scaling is accompanied by device scaling, $t_r$ improves and as a result $C_{boot}$ can be kept small. At 3.3 V, a typical value of $C_{boot}$ is 100 fF, while at 1.5 V, without technology scaling, it is equal to 250 fF. The bootstrapping capacitance can be implemented using an NMOS transistor with its source and drain connected together. In this case, the capacitance is related to the area and gate oxide thickness of the MOS transistor. Simulations have shown that for 1.5 V power supply voltage, the width and length of this bootstrapping NMOS are equal to 13 µm and 6 µm, respectively. A typical area increase for a two-input NAND gate due to $C_{boot}$ is 10%.

As shown in Fig. 5.24 of the BFBiCMOS inverter, the N-well of the PMOS devices $P_p$, $P_1$ and $P_2$ is connected to the bootstrapped node $n_1$. This prevents their source/drain-well junctions from turning ON during the bootstrapping cycle. It also prevents any latch-up which might be caused by the parasitic SCR when the drain/source-well junctions are forward-biased. The PMOS transistor $P_3$ also has its well connected to its source. This eliminates the body effect of the transistor and prevents any leakage during the bootstrapping.
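The link between the capacitor's geometry and its value can be checked with a parallel-plate estimate. The sketch below is a rough illustration only: it assumes the whole 250 fF comes from the gate oxide of the 13 µm × 6 µm NMOS, ignores overlap and fringing capacitance, and uses the standard relative permittivity of 3.9 for SiO2; the function name is introduced here.

```python
EPS_0 = 8.854e-12        # vacuum permittivity, F/m
EPS_OX = 3.9 * EPS_0     # permittivity of SiO2, F/m


def implied_oxide_thickness_nm(cap_f, width_um, length_um):
    """Oxide thickness (nm) for which a W x L gate gives the capacitance cap_f."""
    area_m2 = (width_um * 1e-6) * (length_um * 1e-6)
    return EPS_OX * area_m2 / cap_f * 1e9


# Values quoted in the text for 1.5 V operation: 250 fF, W = 13 um, L = 6 um.
t_ox_nm = implied_oxide_thickness_nm(250e-15, 13.0, 6.0)
print(f"implied gate-oxide thickness: {t_ox_nm:.1f} nm")
```

A thinner oxide would give the same capacitance in a smaller area, which is one reason device scaling helps keep the area penalty of $C_{boot}$ small.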
Figure 5.27
$(V_{DD} + V_{BE})$ to $V_{DD}$ by the PMOS $P_3$, inverter $I_2$ holds the output level at $V_{DD}$. Without this inverter, the output falls to a level equal to $(V_{DD} - V_{BE})$ due to the base-emitter coupling capacitance. The simulated waveforms of the different voltages are shown in Fig. 5.28.
For an n-input gate implementation, the BFBiNMOS requires 4n input transistors, whereas the BFBiCMOS and the BS-BiCMOS require 5n and 6n input transistors, respectively. The crossover load capacitance represents one of the important parameters in circuit comparison. It is a measure of the load at which BiCMOS circuits start to have a speed advantage over CMOS. In the range 1.2-3.3 V, BFBiCMOS/BFBiNMOS circuits require almost an equivalent minimum fanout of 5. The BS-BiCMOS has a higher crossover capacitance.
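The input-transistor counts quoted above scale linearly with the number of inputs, which a small lookup makes concrete. This is only a bookkeeping sketch; the dictionary and function names are introduced here for illustration.

```python
# Input transistors per gate input, as quoted in the text.
INPUT_TRANSISTORS_PER_INPUT = {
    "BFBiNMOS": 4,
    "BFBiCMOS": 5,
    "BS-BiCMOS": 6,
}


def input_transistors(family, n_inputs):
    """Number of input transistors for an n-input gate of the given family."""
    if n_inputs < 1:
        raise ValueError("a gate needs at least one input")
    return INPUT_TRANSISTORS_PER_INPUT[family] * n_inputs


for family in INPUT_TRANSISTORS_PER_INPUT:
    print(family, input_transistors(family, 2))
```

For a two-input NAND this gives 8, 10 and 12 input transistors respectively, so the gap between families widens with gate complexity.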
A two-input NAND gate configuration was chosen to evaluate and compare the performance of the circuits shown in Fig. 5.29. The logic families compared are: CMOS [Fig. 5.29(a)], PBiNMOS [Fig. 5.29(b)], TS-FS [Fig. 5.29(c)], BS-BiCMOS [Fig. 5.29(d)], BFBiNMOS [Fig. 5.29(e)], and BFBiCMOS [Fig. 5.29(f)]. The simulations were carried out using a chain of gates. The reported 50% delay times are those of an intermediate gate. Table 5.4 shows the delay, the average power dissipation and the power-delay product of the different NAND gates at two supplies: 3.3 and 1.5 V. The simulation was carried out at a typical load capacitance of 1 pF. The bootstrapped family consumes more power than CMOS because of the higher internal node capacitance. However, it provides a high speed of operation, particularly the BFBiCMOS, which has a factor of 3 speed advantage compared to CMOS at 1.5 V. Moreover, the delay-power product of the bootstrapped family is lower than that of CMOS. Notice that at 3.3 V, PBiNMOS has the lowest delay-power product and less delay than CMOS. BiNMOS at 1.5 V is slower than CMOS and is not reported in the table. These results also indicate that the use of the bootstrapped BiCMOS/BiNMOS gate would improve the delay-power product when $V_{DD}$ is scaled down to 1.5 V.

Table 5.3 BiCMOS process parameters (0.35 µm).
Table 5.4 Delay (ps), power (µW/MHz) and delay × power (fJ/MHz) of the two-input NAND gates for each logic type.
5.3.6 Conclusion
We have demonstrated in the previous sections that the best family to use for a fanout higher than 5 is the bootstrapped BiCMOS, for the range of power supply from 1 to 3.3 V. However, because of the larger area it occupies, it can be used mainly in high-speed digital applications. Note that when the load is large, in the range of 1 pF, the bootstrapped family provides high speed and a good delay-power product. One drawback of this family, besides the large area, is that the bootstrapping is sensitive to the shape of the input voltage. One practical gate which can be used in several applications, even when the fanout is low, is the BiNMOS family. It has good performance for 3.3 and 2.5 V power supplies. It also provides a better delay-power product than CMOS. In the next section, many digital applications based on the BiNMOS family are outlined.
5.4.1
BiNMOS logic has been used in several microprocessors [26, 27]. In this application, BiNMOS can be used for critical-path delay reduction without increasing chip area, since BiNMOS needs only a low fanout to outperform CMOS. Among the critical paths, we cite:
- Sense amplifiers and output buffers in the register file and the cache;
- Booth's encoder, Wallace tree, and the final adder in a multiplier;
In the microprocessor of [26], the PBiNMOS logic family is used at 3.3 V power supply. The critical path of the control unit is reduced by 36% over CMOS. The BiNMOS gates keep their speed advantage even in the worst case ($V_{DD}$ = 2.7 V and T = 125 °C).
BiCMOS logic is not limited to conventional gates; many other logic styles can be devised. One such example is the pass-transistor BiNMOS used in the design of a 64-bit adder [28], similar to the CMOS CPL logic family discussed in Chapter 4. Fig. 5.30 shows an exclusive OR/NOR gate using the pass-transistor BiNMOS gate (abbreviated PT-BiCMOS) with double rail. The outputs of the pass-transistor network are connected to the bases of the bipolar transistors $Q_1$ and $Q_2$ to reduce the intrinsic delay. The PMOS transistors $P_1$ and $P_3$ are cross-coupled to restore the high level of the pass logic to full $V_{DD}$. The PMOS transistors $P_2$ and $P_4$ charge the output to full swing. These transistors are subject to body effect, hence they turn ON later during transitions.
Figure 5.30 PT-BiCMOS exclusive OR/NOR gate with its pass-transistor network.
exclusive OR and NOR gates using PT-BiCMOS, TG-type CMOS, and CPL-type CMOS in a 0.5 µm BiCMOS process at 3.3 V power supply voltage. A fanout of 1 is equivalent to a capacitance of 35 fF. The PT-BiCMOS gate is faster than the CMOS gates for any fanout. The power-delay product is also shown in Fig. 5.31(b). The TG gate has the best delay-power product for a fanout lower than 3. However, for a fanout greater than 3, the PT-BiCMOS gate is better.
This PT-BiCMOS has been used in the design of a 64-bit adder [28]. It is used mainly in the P, sum and carry blocks. A delay time of 3.5 ns was obtained for the 64-bit adder at 3.3 V, which is 25% better than the CMOS version. The area and power dissipation penalties of the PT-BiCMOS adder, compared to CMOS, were 13% and 14%, respectively. The speed advantage is kept down to almost 2 V.
5.4.2
One of the largest applications of BiCMOS is in RAM design, particularly Static RAMs (SRAMs). The first BiCMOS SRAM was proposed in 1985 [29]; since then many BiCMOS SRAMs have been reported [30, 31, 32, 33, 34, 35, 36, 37]. The major applications of fast BiCMOS SRAMs are cache for workstations and main memory for supercomputers. Many BiCMOS SRAMs are in production
complexity. BiCMOS was limited to some peripheral circuits because of layout pitch matching. It was used in the I/O buffers, decoder and drivers, main sense amplifier and voltage down converter. In general, BiCMOS SRAMs and DRAMs are not suitable for low-power applications.
5.4.3
High-performance DSPs are needed in many applications such as video signal processors, convolvers, filters, etc. BiCMOS technology has been used successfully in DSPs operating at a frequency of 300 MHz [41, 42]. These DSPs operate at 3.3 V power supply voltage using the BiNMOS logic family. Among the characteristics of these BiCMOS DSPs, we cite:
- High performance and high density of integration. In this case, critical data-path functional blocks are customized; and
- BiNMOS is used in blocks such as SRAM, ROM (Read Only Memory), ALU (Arithmetic Logic Unit), multiplier, clock driver, etc.
Fig. 5.33 shows a block diagram of a DSP [41]. This architecture can process any signal processing operation. The BiNMOS inverters are used as clock buffers to reduce the clock skew at 300 MHz clock frequency. The clock is distributed to about 1000 registers. A high clock frequency drastically increases power and reduces the power supply voltage due to the power noise (an effect of the high dissipated current). The BiNMOS inverter used in the clock distribution is the conventional one, which has a high level of $V_{DD} - V_{BE}$. Hence, the dynamic power of the clock network is reduced by 17% compared to CMOS when using BiNMOS. The BiNMOS logic is also used as:
- Output buffer of the Booth encoder of the convolver/multiplier block;
- Decoder driver of the register file; and
- Other drivers.
5.4.4
Gate Arrays
Gate arrays became very popular for a wide spectrum of applications because of their low cost and short turn-around time. Gate array chips consist of a large number of identical sites or basic cells, which are usually placed in rows. The rows are separated by routing channels. The core of rows and channels is surrounded by I/O cells at the chip periphery, as illustrated in Fig. 5.34. Each of the basic cells is typically made up of a number of transistors which can be connected to form a two-input NAND or NOR gate or a simple latch. The only processing step that can be customized is the metallization. The user of a gate array can implement the system by specifying the required connections between the devices in each cell and then the connections between the various cells. This is done automatically using CAD tools. The number of metal levels used for wiring varies from 2 to 4. The first one or two levels are used for internal wiring of the cell and the upper levels (e.g., third and fourth) for wiring between the cells in the horizontal and vertical directions [43].
BiCMOS technology has been used extensively for building gate arrays and channelless gate arrays (sea-of-gates) [43, 44, 45, 46]. At 3.3 V power supply voltage, the BiNMOS logic family has been used [10, 11]. In [11], a BiPNMOS logic gate has been proposed for the channelless gate array. Fig. 5.35 shows a layout of a BiPNMOS basic cell in 0.5 µm BiCMOS technology. A bipolar transistor and a small-size MOS transistor are added to the pure CMOS basic cell. These transistors are not only used to implement BiPNMOS gates but also flip-flops, memory macros (RAM, ROM, and CAM), etc. A BiPNMOS two-input NAND gate has a 36% delay reduction compared to a similar CMOS gate for a fanout of 7. The speed advantage is maintained down to 2.5 V.
Figure 5.34 Gate array floorplan.
5.4.5
In order to realize high-performance ASICs, fast standard cell library macros for rapid design are important. This library contains custom functional macros such as: adder, Programmable Logic Array (PLA), register file, RAM, cache, Translation Look-aside Buffer (TLB), controller, etc. PBiNMOS logic has been used for such a standard cell library [12]. The cells of logic gates are designed in CMOS and PBiNMOS for the same logic functions. The PBiNMOS gates are used for a relatively high fanout and load, whereas CMOS gates are used for a small fanout. A CAD tool can be utilized to choose the most appropriate cells in the design.
Figure 5.35 Layout of a BiPNMOS basic cell (labeled regions: bipolar, resistor, PMOS, NMOS).
5.5 CHAPTER SUMMARY
In this chapter, we have demonstrated the advantage of using BiCMOS over CMOS in terms of speed. We have shown the historical evolution of the different BiCMOS logic families. A variety of alternative circuit techniques for low-voltage operation have been outlined and compared to the conventional BiCMOS. We have also shown how optimized BiNMOS gates are faster than CMOS even if the fanout is low (greater than 1). The design techniques can be extended to more complex gates and building blocks such as flip-flops, adders, etc. A variety of applications where BiCMOS, particularly BiNMOS, can be used at low voltage are reviewed. The addition of the bipolar transistor to CMOS to devise new structures enhances the performance of ICs. This feature improves the access time of memories, register files, ALUs, DSPs, etc. Notice that a large portion of a BiCMOS IC is implemented in CMOS, while bipolar transistors represent a small portion (0.5-4%) for driving or sensing purposes. The power dissipation of BiCMOS circuits, compared to their CMOS counterparts, increases drastically if ECL is used because of the DC current. However, if only BiCMOS logic gates are used, the power increase is not significant compared to the speed enhancement. In some cases, like the clock distribution network, the power dissipation is reduced when using BiNMOS.
REFERENCES
[1] A. R. Alvarez, "BiCMOS Technology and Applications," Kluwer Academic Pub., MA, Second Edition, 1993.
[2] S. H. K. Embabi, A. Bellaouar, and M. I. Elmasry, "BiCMOS Digital Integrated Circuit Design," Kluwer Academic Pub., MA, 1993.
[3] M. I. Elmasry, "Design and Analysis of BiCMOS ICs," IEEE Press, 1994.
[4] G. P. Rosseel and R. W. Dutton, "Influence of Device Parameters on the Switching Speed of BiCMOS Buffers," IEEE Journal of Solid-State Circuits, vol. 24, no. 1, pp. 90-99, February 1989.
[5] P. Raje, K. Cham, and K. Saraswat, "BiCMOS Gate Performance Optimization using Unified Delay Model," Symposium on VLSI Technology,
[6] S. H. K. Embabi, A. Bellaouar, and M. I. Elmasry, "Analysis and Optimization of BiCMOS Digital Circuit Structures," IEEE Journal of Solid-State Circuits, vol. 26, no. 4, pp. 676-679, April 1991.
[7] P. A. Raje, K. C. Saraswat, and K. M. Cham, "Performance-driven Scaling of BiCMOS Technology," IEEE Trans. on Electron Devices, ED-39, no. 3, pp. 685-693, March 1992.
[8] J. Gallia, et al., "High-Performance BiCMOS 100K-Gate Array," IEEE Journal of Solid-State Circuits, vol. 25, no. 1, pp. 142-149, February 1990.
[9] Y. Nishio, et al., "A BiCMOS Logic Gate with Positive Feedback," International Solid-State Circuits Conference, Tech. Dig., pp. 116-117, February 1989.
[10] A. E. Gamal et al., "BiNMOS a Basic Cell for BiCMOS Logic Circuits," in Custom Integrated Circuits Conf., Tech. Dig., pp. 8.3.1-8.3.4, 1989.
[11] H. Hara et al., "0.5-um 2M-Transistor BiPNMOS Channelless Gate Array," IEEE Journal of Solid-State Circuits, vol. 26, no. 11, pp. 1615-1620, November 1991.
[12] H. Hara et al., "0.5-um 3.3-V BiCMOS Standard Cells with 32-kb Cache and Ten-Port Register File," IEEE Journal of Solid-State Circuits, vol. 27, no. 11, pp. 1579-1584, November 1992.
[13] M. I. Elmasry and A. Bellaouar, "BiCMOS at Low-Supply Voltage," in IEEE Bipolar/BiCMOS Circuits and Technology Meeting, pp. 89-96, October 1993.
[14] P. Raje, et al., "MBiCMOS: A Device and Circuit Technique for Submicron, sub-2 V Regime," International Solid-State Circuits Conference, Tech. Dig., pp. 150-151, 1991.
[15] P. G. Y. Tsui et al., "Study of BiCMOS Logic Gate Configurations for Improved Low-Voltage Performance," IEEE Journal of Solid-State Circuits, vol. 28, no. 3, pp. 371-374, March 1993.
[16] S. W. Sun et al., "A Fully Complementary BiCMOS Technology for Sub-Half-Micrometer Microprocessor Applications," IEEE Trans. Electron Devices, vol. 39, no. 12, pp. 2733-2739, December 1992.
[17] K. Yano et al., "Quasi-Complementary BiCMOS for Sub-3V Digital Circuits," IEEE Journal of Solid-State Circuits, vol. 26, no. 11, pp. 1708-1719, November 1991.
[18] A. Watanabe et al., "Future BiCMOS Technologies for Scaled Supply Voltage," International Electron Devices Meeting, Tech. Dig., pp. 429-433, December 1989.
[19] H. J. Shin et al., "Full-swing CBiCMOS Logic Circuits," in IEEE Bipolar/BiCMOS Circuits and Technology Meeting, Tech. Dig., pp. 229-233, September 1989.
[20] A. Bellaouar, I. S. Abu-Khater, M. I. Elmasry, and A. Chikima, "Full-Swing Schottky BiCMOS/BiNMOS and the Effects of Operating Frequency and Supply Voltage Scaling," IEEE Journal of Solid-State Circuits, vol. 29, no. 6, pp. 693-700, June 1994.
[21] S. H. K. Embabi, A. Bellaouar, M. I. Elmasry, and R. A. Hadaway, "New Full-Voltage-Swing BiCMOS Buffers," IEEE Journal of Solid-State Circuits, vol. 26, no. 2, pp. 150-153, February 1991.
[22] M. Hiraki et al., "A 1.5-V Full-Swing BiCMOS Logic Circuit," IEEE Journal of Solid-State Circuits, vol. 27, no. 11, pp. 1568-1574, November 1992.
[23] R. Y. V. Chik and C. A. T. Salama, "1.5 V Bootstrapped BiCMOS Logic Gate," IEE Electronics Letters, vol. 29, no. 3, pp. 301-309, February 1993.
[24] S. H. K. Embabi, A. Bellaouar, and K. Islam, "A Bootstrapped Bipolar CMOS (B2CMOS) Gate for Low Voltage Applications," IEEE Journal of Solid-State Circuits, vol. 30, no. 1, pp. 47-53, January 1995.
[25] A. Bellaouar, M. I. Elmasry, and S. H. K. Embabi, "Bootstrapped Full-Swing BiCMOS/BiNMOS Logic Circuits for 1.2-3.3 V Supply Voltage Regime," IEEE Journal of Solid-State Circuits, vol. 30, no. 6, June 1995.
[26] J. Shutz, "A 3.3 V 0.6 µm BiCMOS Superscalar Microprocessor," IEEE International Solid-State Circuits Conference, Tech. Dig., pp. 202-203, 1994.
[27] F. Murabayashi, et al., "3.3 V, Novel Circuit Techniques for a 2.8-Million-Transistor BiCMOS RISC Microprocessor," IEEE Custom Integrated Circuits Conference, Tech. Dig., pp. 12.1.1-12.1.4, May 1993.
[28] K. Ueda, H. Suzuki, K. Suda, Y. Tsujihashi, and H. Shinohara, "A 64-bit Adder By Pass Transistor BiCMOS Circuit," IEEE Custom Integrated Circuits Conference, Tech. Dig., pp. 12.2.1-12.2.4, May 1993.
[29] K. Ogiue, et al., "A 15 ns/250 mW 64K Static RAM," in ICCD, Tech. Dig., pp. 17-20, 1985.
[30] H. Tran et al., "An 8-ns 1-Mb ECL BiCMOS SRAM with a Configurable Memory Array Size," International Solid-State Circuits Conf., Tech. Dig., pp. 36-37, February 1989.
[31] M. Matsui et al., "An 8-ns 1-Mb ECL BiCMOS SRAM," International Solid-State Circuits Conf., Tech. Dig., pp. 38-39, February 1989.
[32] Y. Maki et al., "A 6.5-ns 1-Mb BiCMOS ECL SRAM," International Solid-State Circuits Conf., Tech. Dig., pp. 136-137, February 1990.
[33] M. Takada et al., "A 5-ns 1-Mb ECL BiCMOS SRAM," IEEE Journal of Solid-State Circuits, vol. 25, no. 5, pp. 1057-1062, October 1990.
[34] A. Ohba et al., "A 7-ns 1-Mb BiCMOS ECL SRAM with Program-Free Redundancy," in Symp. VLSI Circuits Conf., Tech. Dig., pp. 41-42, May 1990.
[35] Y. Okajima et al., "A 7-ns 4-Mb BiCMOS SRAM with a Parallel Testing Circuit," International Solid-State Circuits Conf., Tech. Dig., pp. 54-55, February 1991.
[36] N. Tamba et al., "A 1.5 ns 256Kb BiCMOS SRAM with 11K 60 ps Logic Gates," International Solid-State Circuits Conf., Tech. Dig., pp. 246-247, February 1993.
[37] K. Nakamura et al., "A 200-MHz Pipelined 16-Mb BiCMOS SRAM with PLL Proportional Self-Timing Generator," IEEE Journal of Solid-State Circuits, vol. 29, no. 11, pp. 1317-1322, November 1994.
[38] G. Kitsukawa, et al., "An Experimental 1-Mb BiCMOS DRAM," IEEE Journal of Solid-State Circuits, vol. SC-22, no. 5, pp. 657-662, October 1987.
[39] S. Watanabe, et al., "BiCMOS Circuit Technology for High Speed DRAMs," Symposium on VLSI Circuits, Tech. Dig., pp. 79-80, 1987.
[40] G. Kitsukawa, et al., "Design of ECL 1-Mb BiCMOS DRAM," Electronics and Communications in Japan, Part 2, vol. 76, no. 5, pp. 89-102, 1992.
[41] M. Nomura et al., "A 300-MHz, 16-bit, 0.5-µm BiCMOS Digital Signal Processor Core LSI," IEEE Custom Integrated Circuits Conference, Tech. Dig., pp. 12.6.1-12.6.4, May 1993.
[42] T. Inoue, et al., "A 300-MHz 16-bit BiCMOS Video Signal Processor," IEEE Journal of Solid-State Circuits, vol. 28, no. 12, pp. 1321-1329, December 1993.
[43] F. Murabayashi, et al., "A 0.5 micron BiCMOS Channelless Gate Array," IEEE Custom Integrated Circuits Conference, Tech. Dig., pp. 8.7.1-8.7.4, May 1989.
[44] H. Hara, et al., "A 350 ps 50K 0.8 micron BiCMOS Gate Array with Shared Bipolar Cell Structure," IEEE Custom Integrated Circuits Conference, Tech. Dig., pp. 8.5.1-8.5.4, May 1989.
[45] J. D. Gallia, et al., "High-Performance BiCMOS 100K-Gate Array," IEEE Journal of Solid-State Circuits, vol. 25, no. 1, pp. 142-149, February 1990.
[46] T. Hanibuchi, et al., "A Bipolar-PMOS Merged Basic Cell for 0.8 micron BiCMOS Sea of Gates," IEEE Journal of Solid-State Circuits, vol. 26, no. 3, pp. 427-431, March 1991.
6
LOW-POWER CMOS RANDOM ACCESS MEMORY CIRCUITS
Low-power Random Access Memory (RAM) has seen remarkable and rapid progress in power reduction. Many circuit techniques for active and standby power reduction in static and dynamic RAMs have been devised. In this chapter we study low-power memory circuit techniques which are also of interest for several other applications. Among these circuits, we examine memory cells, sense amplifiers, precharging circuits, etc. Circuit techniques for a 1.x V power supply are also discussed. The voltage targets using NiCd and Mn batteries are 1.2 and 1.5 V respectively. The minimum voltage of a NiCd cell is 0.9 V. We also consider the Voltage Down Converters (VDCs) which are used in memories and processors. No consideration is given to the detail of designing a complete memory chip because a single configuration requires an entire book.
CMOS technology
0.35-µm 0.50-µm 0.60-µm
Access time
7 ns 23 ns 68 ns 15 ns 15 ns
Power dissipation
140 mW @ 100 MHz 100 mW @ 10 MHz
1-Mb [7] 4-Mb [8] 4-Mb [9] 16-Mb [10] 16-Mb [11] 16-Mb [12]
0.25-µm
0.40-µm
0.35-µm
3.3 V
9 ns
The power dissipation reduction in SRAMs is not only due to power supply voltage reduction, but also to low-power circuit techniques. In this section we review some of these circuit techniques for low-power applications.
However, SRAMs have the great disadvantage of a large memory cell compared to DRAMs. For this reason, their capacities are smaller than those of DRAMs.
2. Write Enable
3. Chip Select system;
4. Output Enable
6. Power supply pins.

A timing diagram for the read cycle is shown in Fig. 6.1(a). During this time the data stored in a specific SRAM location (defined by the address) is read out. For a read cycle, two times are shown in the figure: the read cycle time, $t_{RC}$, and the address access time, $t_{AA}$. Fig. 6.1(b) shows the write cycle, which permits a change to the data in an SRAM. Two times are indicated: the write cycle time, $t_{WC}$, and the write recovery time, $t_{WR}$. Some of this information is used in this chapter. For more detail on the timing, the reader can refer to any memory data book.

A typical SRAM architecture is shown in Fig. 6.2. The memory array contains the memory cells, which are readable and writable. The row decoder (X-decoder) selects 1 out of $n = 2^k$ rows, while the column decoder (Y-decoder) selects $l = 2^i$ out of $m = 2^j$ columns. The addresses (row and column) are not multiplexed as in the case of a DRAM. Sense amplifiers detect small voltage variations on the complementary bit-lines of the memory, which reduces the reading time. The conditioning circuit permits the precharge of the bit-lines. The access time is determined by the critical path from the address input to the data output as shown in Fig. 6.3. This path contains the address input buffer, row decoder, memory cell array, sense amplifier and output buffer circuits. The word-line decoding and bit-line sensing delay times are critical delay components. To reduce the sensing time during a read operation, the swing on the bit-lines should be as small as possible.
For an asynchronous SRAM, a special circuit called Address Transition Detection (ATD) permits the generation of internal pulses. These pulses are of two types: activation and equalization. Activation pulses selectively activate particular circuits, while equalization pulses permit the reduction of the delay by restoring and equalizing differential nodes prior to being selected. In this section we treat only asynchronous SRAMs.
Not clocked externally.
Figure 6.1 Typical timing of an SRAM: (a) read cycle; (b) write cycle.
To write in the cell, one of the bit-lines is pulled low and the other high, and then the cell is selected by WL. Assume that B is set to "0" while initially "1" is stored at node A ("0" at B). N1 and P1 should be sized such that node A is pulled down enough to turn P2 ON. This in turn causes node B to be pulled up. The cross-coupled inverter pair has a high gain, which causes the nodes A and B to switch to opposite voltages. The data retention (standby) current of this cell can be as low as 10-"A. Although this full-CMOS cell has a low retention current, the cell area is so large that it does not allow high-density SRAMs. A typical cell area using 0.8 µm design rules is 75 µm².

The stability of the memory cell is its ability to hold a stable state. Fig. 6.5(a) shows the transfer curves of full CMOS SRAMs. The box between the two characteristics (I and II) defines the Static Noise Margin (SNM). Static noise is a DC disturbance, such as offsets and mismatches, due to the processing and variations in process conditions. The SNM is defined as the maximum value of $V_n$ (the static noise source shown in Fig. 6.5(b)) that can be tolerated by the cross-coupled inverters before altering state. An important parameter in SNM is the memory cell ratio, $r$, defined by
$$r = \frac{(W/L)_{N_d}}{(W/L)_{N_a}} \qquad (6.1)$$

where transistors $N_a$ and $N_d$ are the access and driver NMOS transistors shown
about 30% to 40% smaller than the CMOS six-transistor memory cell, because the two polysilicon resistances can be formed on top of the two NMOS driver transistors. The High Resistive Load (HRL) memory cell has been used in several SRAM generations from 4 Kb. The high-state storage node of Fig. 6.6 will be pulled down with time due to two kinds of leakage current: the leakage current of the drain junction and the subthreshold current. The voltage drop across the resistance R prevents regular cell operation if the leakage current reaches the level of the poly-Si resistor current. In several SRAM generations using the HRL memory cell, the total standby current was set to 1 µA per chip at room temperature for battery-backup applications. Thus, for each memory generation with quadrupled density, the poly-Si resistance value is also quadrupled. For a 4-Mb chip which has a total standby current of less than 1 µA, typical values of resistance are in the 5 × 10^12 Ω range and the resistance current is limited to 10^-13 A. This current should be much larger than the total leakage current of the storage node of the cell to improve the data retention margin. The leakage current cannot be scaled because, first, the subthreshold current per channel width tends to increase, particularly with the trend to decrease the threshold voltage for low-voltage operation. Second, the leakage current of the drain junction per unit area tends to increase with technology scaling. Moreover, the junction area shrinks at a rate lower than the SRAM density increase rate. In [14], it was determined that the maximum SRAM capacity for low-power applications using an HRL memory cell is 4 Mb, where the retention current is 1 µA. Note that the high-level node voltages of all poly-Si load memory cells are (VDD - VT) after the write cycle, where VT is the threshold voltage of the access transistor, subject to body effect. These nodes need a time of several ms to charge up to VDD. The SNM of the poly-Si load memory cell is more sensitive to the cell ratio $r$ than that of the full CMOS cell [13]. A typical value of $r$ is 3. Also, the cell stability is drastically degraded when VDD is 3 V or less. The transfer curves in the read mode can be easily plotted for different VDD to find out that the cell cannot store the data at a certain low voltage.
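The scaling argument for the resistive load (a fixed 1 µA chip budget shared by four times as many cells in each generation) can be illustrated with a short Python sketch. The supply voltage of 3.3 V and the assumption that each cell draws its whole budget through a single conducting load resistor are illustrative choices, not values taken from the text, and the function names are hypothetical.

```python
# Sketch: per-cell standby current budget for a resistive-load SRAM cell.
# Assumptions (illustrative, not from the text): VDD = 3.3 V, and the whole
# per-cell budget flows through one conducting load resistor.

TOTAL_STANDBY_A = 1e-6   # 1 uA per chip, the battery-backup target quoted above
VDD = 3.3                # assumed supply voltage in volts


def per_cell_current(n_cells: int, total_a: float = TOTAL_STANDBY_A) -> float:
    """Current each cell may draw if the chip budget is shared equally."""
    return total_a / n_cells


def min_load_resistance(n_cells: int, vdd: float = VDD) -> float:
    """Smallest load resistance that keeps the chip within its budget."""
    return vdd / per_cell_current(n_cells)


if __name__ == "__main__":
    for kbits in (64, 256, 1024, 4096):
        n = kbits * 1024
        print(f"{kbits:5d} Kb: I_cell <= {per_cell_current(n):.2e} A, "
              f"R >= {min_load_resistance(n):.2e} ohm")
```

Under these assumptions the required resistance grows by exactly a factor of four per generation, which is the trend described above; the absolute values depend on the assumed supply voltage.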
For 4 Mb and higher density SRAMs, the polysilicon load cell starts to be replaced by a polysilicon PMOS load called PMOS Thin Film Transistor (TFT) for low-power applications [8, 9, 15]. Fig. 6.7 shows a cross section and circuit diagram of the poly-Si PMOS load memory cell [8]. The TFT device is fabricated from amorphous silicon (a-Si). This material has a grain size of 2 µm while that of the conventional poly-Si material is 0.03 µm. The thickness of this a-Si is 100 nm and the gate oxide thickness of the TFT is 40 nm. This technology results in improved ON/OFF currents compared to the one using poly-Si. The N+ drain area of the NMOS transistor is used as the gate electrode for the PMOS TFT. To obtain a small area, the polysilicon PMOS must be stacked on the NMOS driver. The second polysilicon layer forms the channel regions. The TFT memory cell area is more than 40% smaller than the full CMOS one.

For a PMOS TFT used in a 4-Mb SRAM, with the drain current taken as a function of the gate voltage, an ON current of more than 10^-7 A is obtained at a supply voltage of 3 V, while an OFF current of 10^-13 A is attained. The ON current is larger by more than six orders of magnitude than the memory cell leakage currents, which is much better than the current of the HRL cell. Thus, it results in an excellent data retention characteristic. Moreover, the very low OFF current results in a standby current of less than 1 µA for a 4-Mb SRAM. This current is low enough for battery back-up operation. At 1.2 V power supply, the current flowing in the PMOS TFT is more than one-and-a-half orders of magnitude larger than the OFF current. This demonstrates the ability of this technology for low-voltage operation.
After the write cycle, the high-storage node voltage in the cell becomes VDD - VT. The time needed for charging up this node to VDD is

$$t_{ch} = \frac{C_s V_T}{I} \qquad (6.2)$$

where $I$ is the current flowing in the load device and $C_s$ is the total parasitic capacitance of the node. Using 4-Mb data for the TFT memory cell, $V_T$ = 1 V, $C_s$ = 10 fF and $I$ = 10 pA, $t_{ch}$ is around 1 ms. For the poly-Si load this charge-up time is larger than 100 ms because $I$ is low (about 0.1 pA). The average interval time between two word-line selections (for the same word-line) is given by

$$t_s = \frac{N \, t_{cycle}}{M} \qquad (6.3)$$

where $N$ is the number of memory cells per SRAM chip, $M$ is the number of memory cells per word-line, and $t_{cycle}$ (also noted $t_{RC}$) is the operating cycle time. For 4 Mb, a typical value of $t_s$ is 4.5 ms when the cycle time is 70 ns and $M$ equals 64 cells per word-line. Comparing $t_s$ to $t_{ch}$ for the poly-Si load and the PMOS TFT we have

$$t_{ch} < t_s \quad \text{for PMOS TFT} \qquad (6.4)$$

$$t_{ch} > t_s \quad \text{for poly-Si load} \qquad (6.5)$$

Thus, the high-storage node, in the case of the PMOS TFT cell, is charged up quickly to VDD. For this reason, the Soft Error Rate (SER) of the PMOS TFT cell is much lower than that of the poly-Si cell [8].
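The comparison in Eqs. (6.2) to (6.5) can be checked numerically with the values quoted above. The following Python sketch is only an illustration: the function names are hypothetical, and 4 Mb is assumed to mean 4 × 2^20 cells.

```python
# Sketch: charge-up time of the high-storage node (Eq. 6.2) versus the
# average interval between selections of the same word-line (Eq. 6.3).
# Assumption: a 4-Mb chip holds 4 * 2**20 cells.

def charge_up_time(c_node_f: float, v_t: float, i_load_a: float) -> float:
    """t_ch = C_s * V_T / I, in seconds."""
    return c_node_f * v_t / i_load_a


def word_line_interval(n_cells: int, cells_per_word_line: int,
                       t_cycle_s: float) -> float:
    """t_s = N * t_cycle / M, in seconds."""
    return n_cells / cells_per_word_line * t_cycle_s


C_S = 10e-15          # 10 fF node capacitance
V_T = 1.0             # 1 V threshold voltage
I_TFT = 10e-12        # 10 pA load current of the PMOS TFT
I_POLY = 0.1e-12      # about 0.1 pA load current of the poly-Si resistor

t_ch_tft = charge_up_time(C_S, V_T, I_TFT)
t_ch_poly = charge_up_time(C_S, V_T, I_POLY)
t_s = word_line_interval(4 * 2**20, 64, 70e-9)

if __name__ == "__main__":
    print(f"t_ch (PMOS TFT)  = {t_ch_tft * 1e3:.2f} ms")
    print(f"t_ch (poly-Si)   = {t_ch_poly * 1e3:.2f} ms")
    print(f"t_s  (4 Mb, 70 ns) = {t_s * 1e3:.2f} ms")
```

With these inputs the TFT node recharges in less time than the average word-line interval, while the poly-Si node does not, which matches inequalities (6.4) and (6.5).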
6.1.3 Read/Write Operation
Fig. 6.9 shows a simplified readout circuitry for an SRAM. The circuit has static bit-line loads composed of pull-up NMOS devices N1 and N2. The bit-lines are pulled up to a voltage (VDD - VT), where VT is the threshold voltage subject to body effect. When the word-line WL is asserted, one word is selected. At this time, the bit-line BL is pulled down to a level determined by the pull-up NMOS N1, the word-line transistor Na, and the driver NMOS transistor Nd as shown in Fig. 6.9(b). The voltage at the node A should be low (near ground) so as not to alter the RAM content during this read operation. A small swing on BL is desirable to achieve a high-speed readout, particularly if the bit-line capacitance $C_{BL}$ is high. The Sense Amplifier (SA) amplifies the small swing, ΔV, on the bit-line. A typical value of ΔV is 100 mV. It should be noted that the SA should provide a wide operating margin over all process, temperature, and voltage corners.
If the WL signal stays asserted, all selected columns consume a DC current flowing through the NMOS devices N1, Na and Nd. Thus, shortening the read mode duration is necessary to reduce the power dissipation during this active mode. This is possible by pulsing WL for just enough time to read the cell, as shown in Fig. 6.10. The generation of the pulsed WL signal is possible owing to the Address Transition Detection (ATD) technique, as will be discussed in Section 6.1.5.
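The benefit of pulsing the word-line can be estimated with a simple duty-cycle model. The Python sketch below assumes that the average DC read power is the column DC current times the supply voltage times the fraction of the cycle during which the word-line is active; this model, the function name, and all numerical values except the 70 ns cycle time are illustrative assumptions, not data from the text.

```python
# Sketch: average DC power of the selected columns with a static versus a
# pulsed word-line. Assumed model: P = m * I_DC * VDD * (active time / cycle).

def dc_read_power(m_columns: int, i_dc_a: float, vdd: float,
                  active_time_s: float, t_cycle_s: float) -> float:
    """Average DC power in watts; the duty factor is capped at 1."""
    duty = min(active_time_s / t_cycle_s, 1.0)
    return m_columns * i_dc_a * vdd * duty


M_COLUMNS = 64        # hypothetical number of selected columns
I_DC = 100e-6         # hypothetical DC current per column, 100 uA
VDD = 3.3             # hypothetical supply voltage
T_CYCLE = 70e-9       # cycle time used elsewhere in this chapter
T_PULSE = 10e-9       # hypothetical word-line pulse width

p_static = dc_read_power(M_COLUMNS, I_DC, VDD, T_CYCLE, T_CYCLE)
p_pulsed = dc_read_power(M_COLUMNS, I_DC, VDD, T_PULSE, T_CYCLE)

if __name__ == "__main__":
    print(f"static WL: {p_static * 1e3:.2f} mW")
    print(f"pulsed WL: {p_pulsed * 1e3:.2f} mW")
```

In this model the saving is simply the ratio of the cycle time to the pulse width, so the pulse should be made as short as reliable sensing allows.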
Fig. 6.11(a) shows a simplified circuit configuration for the SRAM write operation. For a write operation the memory cell state should be flipped. When the write signal WE is asserted, the input data and its complement are placed on the bit-lines. If, for example, a zero has to be stored in the node A initially at VDD, the voltage at this node should be pulled below the threshold voltage of the cell, as shown in the equivalent circuit of Fig. 6.11(b). The bit-line in this case is pulled down to almost 0 V. The design of the write circuitry should provide a wide operating margin over all process, temperature, and voltage corners. Note that a DC current is consumed during the write mode, hence the WE signal should also be short to cut this current at the end of the write operation.

In high-speed SRAMs, the write recovery time is an important component of the write cycle time. It is defined as the time necessary to recover from the write cycle to the read state after the WE signal is disabled. Note that the swing on the bit-lines after a write operation is large. Thus, an equalizer circuit is needed to reduce this swing, so that the read operation is performed quickly. Fig. 6.12 illustrates a simplified schematic of an SRAM with read/write circuitry. At the end of the memory cycle a differential voltage exists on the bit-lines. A PMOS equalizing device is used to equalize the bit-lines after each read and write operation. The differential voltages on the bit-lines are restored
■ The decoders (row and column);
■ The memory array. If m memory cells are connected to the word-line, the active power of the memory array (in read mode) is given by

$$P_{mem\text{-}array} = m P_{act} + \cdots \qquad (6.6)$$
where $P_{act}$ is the power dissipated in active mode when selecting the m cells and $P_{ret}$ is the data retention (standby) power of the unselected memory cells in the m × n array. The second term is negligible. The third term is due to the DC current, $I_{DC}$, during the read operation. Δt is the activation time of the DC consuming parts and f is the operating frequency ($f = 1/t_{RC}$). An example of such a current is the DC current flowing from the bit-line load to the ground through the memory cell;
To reduce the active power consumption many techniques can be used; they are summarized as follows:
■ Reducing the capacitance of the word-line and the number m of cells connected to it. This is possible by using Hierarchical Word-Line (HWL) techniques.
■ Reducing the DC current by using the pulse operation technique for the word-line and the periphery circuits (including the sense amplifier).
■ Using multi-stage static CMOS decoding to reduce the AC current.
■ Lowering the operating power supply voltage.
The standby power (sometimes called retention current) of an SRAM has a major contribution from the memory cells in the array if the sense amplifiers are disabled in this mode. It is given by

$$P_{standby} = m \, n \, P_{ret} \qquad (6.7)$$
One way to reduce the standby current is to reduce the operating voltage. However, note that the data-retention current will increase with memory capacity. Moreover, the leakage current per cell tends to increase because the threshold voltage is expected to be reduced for low-voltage operation.
In the following sections, many key circuits in an SRAM are reviewed. The circuit techniques and memory organization used to reduce the active and data-retention currents are presented.
6.1.5
an on-chip pulse generator, which detects the address change, is needed. It is based on the address transition detection technique. The ATD is a key technique to reduce the active power of memories. Fig. 6.14(a) shows the schematic diagram of an ATD pulse generator. Short pulses are generated with XOR circuits when the address changes from "L" to "H" or from "H" to "L"; they are then summed through an OR gate. The overall pulse width is controlled by the RC delay line shown in Fig. 6.14(b). The corresponding waveforms are shown in Fig. 6.14(c). The ATD pulse is usually stretched out with a delay circuit to generate the different pulses needed in the SRAM. Note that the CS signal is also included as an input to the ATD generator.
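The XOR-and-OR structure of the ATD generator can be mimicked behaviorally. The Python sketch below works on sampled binary waveforms and replaces the RC delay line with an ideal delay of a fixed number of samples; that idealization and the helper names are assumptions made for illustration only.

```python
# Sketch: behavioral model of an ATD pulse generator.
# Each address bit is XORed with a delayed copy of itself, and the
# per-bit pulses are ORed together. The delay is an ideal sample delay
# standing in for the RC delay line (an assumption of this sketch).
from typing import List, Sequence


def delay(signal: Sequence[int], n: int) -> List[int]:
    """Delay a sampled binary signal by n samples, holding its first value."""
    if n <= 0:
        return list(signal)
    padded = [signal[0]] * n + list(signal)
    return padded[:len(signal)]


def atd_pulse(address_bits: Sequence[Sequence[int]], n_delay: int) -> List[int]:
    """Return the ORed transition pulses of all address bit waveforms."""
    length = len(address_bits[0])
    pulse = [0] * length
    for bit in address_bits:
        delayed = delay(bit, n_delay)
        for t in range(length):
            pulse[t] |= bit[t] ^ delayed[t]   # XOR detects either edge
    return pulse


if __name__ == "__main__":
    a0 = [0, 0, 0, 1, 1, 1, 1, 1, 1, 1]   # rising edge at sample 3
    a1 = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]   # falling edge at sample 6
    print(atd_pulse([a0, a1], n_delay=2))
```

In this model the pulse width equals the delay, which mirrors the statement that the delay line controls the overall pulse width; both rising and falling edges produce a pulse.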
6.1.6 Decoders
Usually the decoding in an SRAM is performed by using complementary CMOS. Two kinds of decoders are used: the row and the column decoders. Fast static decoders are based on OR/NOR and AND/NAND gates. Fig. 6.15 shows an example of a two-bit input address row decoder. The input buffers have to drive the interconnect capacitance of the address lines and the input capacitance of the NAND gates. To match the pitch of the memory cell and to perform decoding for several blocks, two-stage decoders are used. The first stage performs predecoding and the second one performs the final decoding function [Fig. 6.16]. The two-stage decoder circuit has other advantages over the one-stage decoder, such as a reduced number of transistors and a reduced fan-in. It also reduces the loading on the address input buffers. This predecoding technique optimizes both speed and power. In the last stage an additional signal, φ, is included in the AND gate. This signal is generated from an ATD pulse generator to enable the decoder and ensure a pulse-activated word-line. There are several ways to build row decoders, and the choice depends on the RAM architecture division.
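The two-stage decoding described above can be sketched at the logic level. In the following Python example the address is split into groups of two bits, each group is predecoded into a one-hot code, and every word-line is the AND of one line from each group and an enable input that plays the role of the ATD-derived pulse. The group size, bit ordering (most significant bit first) and function names are assumptions of this sketch, not details from the text.

```python
# Sketch: logic-level model of a two-stage (predecoded) row decoder.
# Assumptions: address bits are given MSB first and grouped in pairs;
# `enable` stands in for the ATD-derived pulse gating the final AND stage.
from itertools import product
from typing import List


def predecode(bits: List[int]) -> List[int]:
    """First stage: one-hot decode a small group of address bits."""
    index = 0
    for b in bits:
        index = (index << 1) | b
    return [1 if i == index else 0 for i in range(1 << len(bits))]


def two_stage_row_decode(address: List[int], group: int = 2,
                         enable: int = 1) -> List[int]:
    """Second stage: AND one predecoded line per group with the enable."""
    groups = [address[i:i + group] for i in range(0, len(address), group)]
    one_hots = [predecode(g) for g in groups]
    word_lines = []
    for combo in product(*[range(len(h)) for h in one_hots]):
        line = enable
        for hot, idx in zip(one_hots, combo):
            line &= hot[idx]
        word_lines.append(line)
    return word_lines


if __name__ == "__main__":
    wl = two_stage_row_decode([1, 0, 1, 1])
    print(wl.index(1), sum(wl))
```

For a four-bit address, each final gate in this model has three inputs (two predecoded lines and the enable) instead of five for a single-stage AND with enable, which illustrates the fan-in reduction mentioned above.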
The column decoder permits the selection of l out of m bits of the accessed row. Fig. 6.17(a) shows the circuits involved in column selection, using an example of 4 columns. The selected gate permits the transfer of the data from the bit-lines to the common data-lines I/O. The signals $Y_i$ are controlled by the AND/NAND column decoder as shown in Fig. 6.17(b).
To reduce the DC current during the write cycle, a variable bit-line load technique can be employed [Fig. 6.19]. It realizes fast sensing in the read cycle and a short write pulse width in the write cycle. For fast sensing, the voltage swing of the bit-line should be small. To achieve this, the load impedance should be low. On the other hand, to obtain a low current during the write cycle, the load impedance of the bit-lines should be high. As shown in Fig. 6.19, during the read operation, all four NMOS transistors N1, N2, N3, and N4 are turned ON. The bit-lines are switched into a low-impedance state so that the voltage swing of the bit-lines is limited to a small value (e.g., 100 mV). During the write operation, two of the NMOS devices are switched OFF and only the two small-size transistors are turned ON.
Figure 6.19 Variable load bit-lines.
As the power supply voltage is scaled down to 3 V, the precharge level can be lower than 2 V. Thus, during the read operation the high-level node of the memory cell can become equal to the bit-line voltage. Hence, the noise margin of the memory cell is drastically degraded and consequently the cell stability and soft error immunity are degraded. Therefore, at 3 V power supply voltage, a PMOS transistor can be used as the bit-lines' load [Fig. 6.20]. The bit-lines' precharge voltage is VDD. For a low-voltage bit-line precharge voltage, special sense amplifiers should be used because conventional sensing circuits have poor voltage gain (less than 10). A variable impedance bit-line, using PMOS transistors, can also be implemented.
6.1.8 Sense Amplifier
When reading a memory cell, the bit-lines are initially precharged, then one of the two bit-lines goes down, while the other stays high. The operation of pulling down the bit-line is very slow because the discharging MOS device in the memory cell is small and the bit-line capacitance is high. This results in a very slow memory read time. Sense amplifiers are used to detect the small variation on the bit-lines and amplify it to obtain a full-swing signal at the end. A simple unbalanced inverter with a high logic threshold voltage can be used. Since its input is single-ended and has a very small noise margin, it is very sensitive to noise on the bit-line. Thus, sense amplification for the data-lines is a key to achieving fast access time and low-power dissipation. In general, the delay of a sense amplifier (from the time of word-line activation) represents 30 to 40% of the whole read access time.
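A rough feel for why a small bit-line swing speeds up the read can be obtained from a constant-current estimate. The Python sketch below assumes that the cell discharges the bit-line with a constant current, so the time to develop a swing ΔV is the capacitance times ΔV divided by that current. The bit-line capacitance and cell current used here are hypothetical values chosen only for illustration; the 100 mV swing is the typical value quoted earlier.

```python
# Sketch: time for a memory cell to develop a given bit-line swing,
# assuming a constant discharge current (a simplifying assumption).

def bitline_discharge_time(c_bl_f: float, delta_v: float,
                           i_cell_a: float) -> float:
    """t = C_BL * dV / I_cell, in seconds."""
    return c_bl_f * delta_v / i_cell_a


C_BL = 1e-12      # hypothetical bit-line capacitance, 1 pF
I_CELL = 100e-6   # hypothetical cell read current, 100 uA

t_small_swing = bitline_discharge_time(C_BL, 0.1, I_CELL)   # 100 mV swing
t_full_swing = bitline_discharge_time(C_BL, 3.3, I_CELL)    # full 3.3 V swing

if __name__ == "__main__":
    print(f"100 mV swing: {t_small_swing * 1e9:.1f} ns")
    print(f"3.3 V swing:  {t_full_swing * 1e9:.1f} ns")
```

Under this model the delay scales linearly with the required swing, which is why the sense amplifier is asked to resolve a swing of about 100 mV instead of waiting for a full logic level.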
Various kinds of sense amplifiers have been devised for fast sensing operation and low-power dissipation. Fig. 6.21(a) shows a single-end sense amplifier with an active current-mirror. This structure forms the basis for many SRAMs' sense amplifier circuits. It has two differential inputs, DL and $\overline{DL}$. Noise affects both inputs equally and only the difference is detected. One NMOS transistor acts as a current source. Before the signal $\phi_{SA}$ is asserted, the data-lines DL and $\overline{DL}$ are high. All the nodes, A, B and C, are high. The signal $\phi_{SA}$ is asserted when DL starts, for example, to drop slowly. In this case, one of the NMOS driver transistors is ON. The output voltage (node C) drops suddenly to a certain voltage. Thus, the input signal is amplified by the gain of this differential amplifier.
Fig. 6.21(b) shows the voltage waveforms of the single-end sense amplifier obtained using SPICE simulation. The signal $\phi_{SA}$ is generated with an ATD pulse. It is asserted for a time long enough to amplify the small variation (a few hundred mV) on the data-lines¹, then it is deactivated. In this scheme the DC current consumed by the sense amplifier is cut off. Usually the sense amplifier is common to many columns through the common data-lines. The small-signal gain of this amplifier is given by

$$A = \frac{g_{mn}}{g_o} \qquad (6.8)$$

where $g_{mn}$ is the transconductance of the driver NMOS Nd and $g_o$ is the combined output conductance of the PMOS load and the NMOS driver.
In many SRAMs multi-stage sense amplifiers are needed to attain large voltage gain. In this case, the double-end sense amplifier is used as shown in Fig. 6.22. This circuit has often been used in many SRAMs. To attain high-speed data sensing, a two- or three-stage sense amplifier technique can be adopted. Fig. 6.23 shows a two-stage amplifier structure. An equalization technique is used for the data-lines, using the equalization pulse φeq, which is generated with an ATD pulse. It is indispensable, not only to attain faster data transfer during read operation, but also to suppress incorrect data before the correct data appears in the sense amplifier [17].

For low-power applications, and also due to the plastic packaging limitations of static memories, this type of sense amplifier can result in high power dissipation for high-density memories even if the current source is pulsed. Many circuits have been proposed to reduce the power of the sense amplifier while improving the sensing delay time. One of them is the PMOS cross-coupled amplifier [18] shown in Fig. 6.24. The PMOS loads, P1 and P2, are cross-coupled and the differential outputs S and $\overline{S}$ are connected to their gates. The positive feedback in this latch amplifier permits much faster sense speed than the conventional one. In this circuit the equalization technique is used for the reasons discussed above. Fig. 6.25 shows the sense delays of both the PMOS cross-coupled amplifier and the double-end current-mirror amplifier as a function of the average current of the amplifier (0.6 μm CMOS). The input voltages simulate the common data-lines' voltages, and the sense delay td is defined as the delay time from the crossover point of the input voltages to the point when the output reaches 1 V difference. The PMOS cross-coupled amplifier has less than half the delay of the conventional current-mirror sense amplifier. Moreover, this latch amplifier consumes less than one-fifth of the power of a current-mirror amplifier. The PMOS cross-coupled latch amplifier requires much more accurate timing to optimize the sensing delay [18]. This circuit also has a low-power property compared to the current-mirror amplifier since it has nearly full-swing outputs with positive feedback.

¹ The output of the sense amplifier is then latched.
When the voltage is scaled to a 3 V power supply, the data-line voltage is near VDD, so level shifting can be performed. Fig. 6.26 shows a two-stage sense amplifier used for a 3.3 V supply. The first stage is a cross-coupled NMOS amplifier which also performs level shifting of the common data-line voltage. In the second stage, a conventional sense amplifier is used which operates at the maximum gain point since the levels on SA and $\overline{SA}$ are medium levels.
Fig. 6.27 shows another sense amplifier developed for low-voltage power supply [19]. This circuit is used when the bit-lines are close to VDD, where the gain of a conventional current-mirror amplifier is poor. The circuit is composed of a level-shift circuit and a conventional current-mirror amplifier. The level-shifter shifts the bit-line voltage to a medium voltage, 0.6 to 0.7 V (at 1 V power supply voltage), where the gain is maximum. Low-VT NMOS devices N1 and N2 are used to provide these medium levels. These devices are subject to the body effect. Recently current sense-amplifiers have been proposed to overcome the gain reduction of voltage amplifiers at low power supply [7, 12]. They also reduce the power dissipation of the sensing operation compared to voltage sense amplifiers at the same delay. These circuits require very careful design.
6.1.9 Output Latch
In low-power SRAM, the pulse technique for the word-line and sense amplifier is indispensable in order to reduce the DC current. In such a pulse mode, a data-latch circuit is required to store the data amplified by the sense amplifier from the memory cell for the data output circuitry. Fig. 6.28 shows an example of an output latch placed after the sense amplifier. The requirements of such an output latch are the following:

• The latch circuit must not delay the read access time. Such a requirement is attained by connecting the latch with the data-bus lines in parallel. One input transmission gate, controlled by φI, is used to enter the data to the latch. Another transmission gate, controlled by φO, is used to put the data back onto the data-bus.

• The latched data must not be destroyed by the noise entering the SRAM. A noise in an SRAM is generated and propagated by the following mechanism. On the system board, a ground noise can enter the SRAM. When the peak level of the ground noise becomes large enough for the first gate of the address buffer to change the logic value of the address input, an ATD pulse noise is generated. This noise pulse could turn on the word-line and the sense amplifier for a short time, resulting in an unexpected signal on the data-bus. Therefore, the latched data could be destroyed if the input gate is ON. To avoid such a problem, two circuit techniques are included in the circuit of Fig. 6.28. The first one is the generation of φI only when the pulse width of the ATD is large enough, compared to that of the noise. The other circuit technique is to place latch-protecting inverters [Fig. 6.28] in front of the output gates. The inverters prevent noise from entering the output gates.

• The new data must be quickly latched into the data-latch. The circuit of Fig. 6.28 can be optimized for fast operation.
Figure 6.29 Divided word-line (DWL) structure: 1st block, 2nd block, ..., nB-th block, with block select lines.
is reduced, since only the selected columns switch. Moreover, the word-line selection delay, which is the delay time from the address input to the divided word-line, is reduced. This delay is composed of the main word-line select delay and the divided word-line select delay. The main word-line selection delay is reduced compared to the conventional one, because the total capacitance of connected transistors is reduced. In a conventional SRAM, the word-line has the gates of all the memory cells of a row connected to it. The main word-line delay increases as the number of blocks increases because the number of block select gates increases. On the other hand, the divided word-line delay decreases as the number of connected cells is reduced with the increasing number of blocks. Consequently, the word-line selection delay has a minimum for a certain number of blocks.
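The trade-off just described, a main word-line delay that grows with the number of blocks and a divided word-line delay that shrinks with it, can be illustrated with a toy model. The following is only a sketch: the linear and inverse dependencies and all three coefficients are assumptions chosen for illustration, not values from the text or from any measured SRAM.

```python
def word_line_select_delay(n_blocks, main_per_block=0.15,
                           divided_total=6.0, fixed=1.0):
    """Toy model of the DWL word-line select delay (arbitrary time units).

    All coefficients are hypothetical. The main word-line term grows with
    the number of block-select gates; the divided word-line term shrinks
    as fewer cells hang on each divided line.
    """
    main_delay = main_per_block * n_blocks
    divided_delay = divided_total / n_blocks
    return fixed + main_delay + divided_delay


candidates = [1, 2, 4, 8, 16, 32]
delays = {nb: word_line_select_delay(nb) for nb in candidates}
best = min(delays, key=delays.get)

for nb in candidates:
    print(f"{nb:2d} blocks -> delay {delays[nb]:.2f}")
print("minimum at", best, "blocks")
```

With these made-up coefficients the minimum falls at eight blocks; a real design would extract both terms from layout capacitances and then add the area penalty to the comparison.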
Fig. 6.30 shows the effect of the number of blocks in the DWL structure on the word-line select delay and the column power for a 64-Kb SRAM [10]. In this example, a number of blocks of eight can be chosen. The area penalty for this case is only 5%, compared to the conventional memory. As an example, for a 1-Mb SRAM, the cell array is divided into 16 blocks and each block consists of 512 rows by 128 columns. A 9-bit address (A0...A8) is used to select a row within
Figure 6.30 Word-line select delay and column power versus number of blocks.
address. The DWL structure has been widely used in high-density SRAMs for its low-power, high-speed characteristics. However, in high-density SRAMs with a capacity of more than 4 Mb, the number of blocks in the DWL structure will have to increase. Therefore, the capacitance of the global word-line increases, causing the delay and power to increase. To solve this problem, the concept of Hierarchical Word Decoding (HWD) was proposed in [21] as shown in Fig. 6.31. The word select line is divided into more than two levels. The number of levels (hierarchy) is determined by the total load capacitance of the word select line, so as to distribute it efficiently. Hence, the delay and the power are reduced. For 4-Mb, three levels of hierarchy have been used with 32 blocks, each block having 128 columns by 1024 rows. Fig. 6.32 shows the delay time and the total capacitance of the word decoding path for the optimized DWL and HWD structures of 256-Kb, 1-Mb, and 4-Mb SRAMs. For the 256-Kb SRAM there is no significant advantage of HWD over DWL. However, for high-density SRAMs the advantage of HWD, in terms of power and delay, becomes clear. The three-level scheme can be used efficiently for 16-Mb SRAMs.
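The array organizations quoted above (16 blocks of 512 rows by 128 columns for the 1-Mb DWL example, 32 blocks of 128 columns by 1024 rows for the 4-Mb HWD example) can be checked with a few lines of arithmetic. The helper below is a hypothetical illustration; its name and the split into block, row and column address bits are assumptions made for the sketch.

```python
def organization(blocks, rows, columns):
    """Return the capacity and address-bit split of a blocked cell array.

    Address widths assume power-of-two dimensions; (n - 1).bit_length()
    is the number of bits needed to index n entries.
    """
    return {
        "bits": blocks * rows * columns,
        "block_bits": (blocks - 1).bit_length(),
        "row_bits": (rows - 1).bit_length(),
        "column_bits": (columns - 1).bit_length(),
    }


sram_1mb = organization(blocks=16, rows=512, columns=128)
sram_4mb = organization(blocks=32, rows=1024, columns=128)

print(sram_1mb)
print(sram_4mb)
```

The 512-row block needs a 9-bit row address, which agrees with the 9-bit row select mentioned for the 1-Mb example.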
Figure 6.34 Two-step word voltage technique.

Figure 6.35 TSW circuitry (a) and voltage waveforms (b). Labels: word driver, low-VT MOSFET, Din, WE.
A Two-Step Word (TSW) voltage technique has been proposed by Ishibashi et al. [19] to solve the cited problem. Fig. 6.34 shows the block diagram of the proposed memory. The boosted-level generator¹ generates a voltage Vch = 1.5 V for VDD = 1 V. The word-line voltage has two steps: one is VDD and the other is Vch. The circuitry for the TSW method is shown in Fig. 6.35(a). When the input signal φ goes to zero, the signal WL is raised to VWL = VDD. Then, when φch is asserted with a high level equal to Vch, the transistor P1 turns ON and the WL level is increased to VWL = Vch. In this case, the low-threshold-voltage NMOS device turns OFF, isolating the inverter of the word driver to reduce any leakage current. Fig. 6.35(b) shows the voltage waveforms for the TSW circuitry in read/write modes. During the write cycle, the high node A is first charged to a low voltage, then raised to VDD. The bit-lines are initially floating, then precharged at the end of the write cycle. In the next read cycle, the bit-lines are floating. Before the word-line voltage rises to Vch, the cell discharges BL through the low node B. Thus, when the word-line has risen to Vch, current does not flow in the cell and the node B stays at a low voltage level. Note that this technique requires multi-VT CMOS devices and causes delay in writing because the bit-lines are discharged before writing.

¹ The boosted-level generator is presented in Section 6.2.11.
However, the low-voltage SRAMs discussed above require a relatively high threshold voltage, VT ≥ 0.5 V. Thus, their speed is quite slow. As an example, a 256-Kb SRAM with full CMOS memory cells attained a 3 μs access time at 1 V power supply using 0.8 μm CMOS technology [22]. The active power at 0.1 MHz is 0.2 mW and the standby power is 5 nW. Another example is a 1-Mb SRAM with full CMOS memory cells which achieves a 200 ns access time at 1 V power supply using 0.5 μm CMOS technology [23]. The active power at 1 MHz is 0.1 mW and the standby power is 10 nW. Note that if the threshold voltage is too low for ultra-low voltage applications, all the circuits composing the SRAM will suffer from subthreshold current leakage. Thus, the retention current increases drastically, causing a serious problem for low-power applications. Moreover, the temperature effect and the threshold voltage variation enhance this current. So far, no practical solution has been proposed.
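To compare the two 1 V SRAM examples on an equal footing, the quoted active figures can be converted to energy per cycle. This is a back-of-the-envelope sketch that assumes the figures are average active power at the stated operating frequency; the function name is hypothetical.

```python
def energy_per_cycle_nj(active_power_mw, frequency_mhz):
    """Energy per cycle in nJ: mW / MHz = (1e-3 W) / (1e6 Hz) = 1e-9 J."""
    return active_power_mw / frequency_mhz


# Figures quoted in the text for the two 1 V SRAMs.
e_256k = energy_per_cycle_nj(active_power_mw=0.2, frequency_mhz=0.1)
e_1m = energy_per_cycle_nj(active_power_mw=0.1, frequency_mhz=1.0)

print(f"256-Kb SRAM: {e_256k:.2f} nJ per cycle")
print(f"1-Mb SRAM:   {e_1m:.2f} nJ per cycle")
```

On this measure the 1-Mb design spends roughly a twentieth of the energy per cycle of the 256-Kb design, although the two were measured at different frequencies and in different technologies.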
Figure 6.36 Trend of DRAM power supply voltage versus memory density (1M to 256M), with feature size (μm) and Tox (nm) scales; battery types Li, Mn and NiCd are marked.
down converter [see Section 6.3]. However, the 3.3 V external power supply will dominate.

Recently, activities to realize 1.5 V battery-operated DRAMs are accelerating the trend in low-voltage operation [27, 28, 29]. Fig. 6.36 shows the trend of DRAM supply [25]. In battery operation, the chip must be operated on a variety of batteries with various supply voltages for a long term and under supply fluctuations.
6.2.1 Basics of a DRAM
• Address, which is separated in time into two separate fields. These fields are the row and column addresses.
• Row Address Strobe ($\overline{RAS}$).
• Column Address Strobe ($\overline{CAS}$). The column address on the multiplexed pins is clocked by this signal.
• Write Enable ($\overline{WE}$).
It is clear that the multiplexed address penalizes the access delay, so for fast DRAMs separate address input pins can be used. The multiplexing permits the reduction of the pin count and the cost of packaging. An example of DRAM timing, using address multiplexing during read mode, is shown in Fig. 6.37. Some important times are shown, such as the access time from row, tRAC, the row address strobe cycle time (or cycle time), tRC, and the row address strobe low-state time, tRAS. Fig. 6.38 shows a general 4-Mb DRAM architecture. It uses almost the same circuit techniques as an SRAM except for the memory array. Some additional circuits are needed such as a Back-Bias Generator (BBG), a Half-Voltage Generator (HVG), an optional Voltage-Down Converter (VDC), a Reference Voltage Generator (RVG), and a boosted voltage generator circuit. The substrate back-bias voltage is indispensable for stable operation of the DRAM array. The half-voltage generator permits generation of the half-VDD precharge level for the bit-lines, as explained in the following sections. The reference voltage generator is needed for the VDC. The boosted voltage generator uses a charge-pump circuit and permits overdriving of the word-line WL to a voltage higher than VDD. More details on these circuits, composing the DRAM, are given in the following sections.
6.2.2
CMOS DRAMs, with three-transistor and four-transistor cells, were used in the 1- and 4-kb generations. The one-transistor (1T) cell offers smaller chip size and low cost. These justify the process complexity to fabricate the 1T cell, particularly its capacitor. A schematic of a 1T DRAM cell is illustrated in Fig. 6.39(a). The charge is stored in capacitor Cs. To prevent loss of the stored information, the capacitor must be refreshed within a specific time with special circuitry. The bit-line has a capacitance CBL including the parasitic load of the connected circuits. Typical values for the storage and the bit-line capacitors are 30 fF and 250 fF, respectively. The ratio R = CBL/Cs is very important for the sensing operation.
where (VMC - VBL) is the difference between the memory cell voltage and the bit-line voltage before the selection of the cell. A typical value of the difference is VDD/2. Hence, we have for the bit-line sense signal

$$V_s = \frac{V_{DD}}{2(1+R)} \qquad (6.10)$$

For 3.3 V supply voltage, and using a ratio R = 8 for a 16-Mb DRAM, the sense signal is Vs = 180 mV. This small voltage change of the bit-line requires sensing circuits. For low-voltage operation, Vs decreases, thus a low ratio R is required. This is possible by reducing CBL and increasing Cs.
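The 180 mV figure can be reproduced from charge sharing between the storage capacitor and the bit-line. The sketch below assumes the usual charge-sharing result, a sense signal of (VMC - VBL)/(1 + R) with R = CBL/Cs, and a cell-to-bit-line difference of VDD/2 as stated in the text; the function names are hypothetical.

```python
def sense_signal_from_ratio(v_dd, r):
    """Bit-line sense signal for a VDD/2 cell-to-bit-line difference.

    Assumes simple charge sharing: Vs = (VDD / 2) / (1 + R), R = CBL / Cs.
    """
    return v_dd / (2.0 * (1.0 + r))


def sense_signal(v_dd, c_bl, c_s):
    """Same expression, starting from the two capacitances (in farads)."""
    return sense_signal_from_ratio(v_dd, c_bl / c_s)


vs_16mb = sense_signal_from_ratio(v_dd=3.3, r=8)            # about 183 mV
vs_typical = sense_signal(v_dd=3.3, c_bl=250e-15, c_s=30e-15)
vs_low_voltage = sense_signal_from_ratio(v_dd=1.5, r=8)     # same R, lower VDD

print(f"R = 8, VDD = 3.3 V: {vs_16mb * 1e3:.0f} mV")
print(f"250 fF / 30 fF:     {vs_typical * 1e3:.0f} mV")
print(f"R = 8, VDD = 1.5 V: {vs_low_voltage * 1e3:.0f} mV")
```

Keeping R = 8 while dropping the supply to 1.5 V leaves well under 100 mV of signal, which is why a lower R is needed at low voltage.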
Cs was implemented using a simple planar-type capacitor as shown in the structure of Fig. 6.39(b). This structure was used in DRAMs with capacity up to 1-Mb. With the increased density, many three-dimensional approaches were used for DRAMs with capacity higher than 1-Mb. One approach is to stack the capacitor over the access transistor (STC cell). Another approach is to use a trench capacitor. For more details on advanced cell structures the reader can consult [30, 31].
The signal charge (Qsig = Cs ΔVs) transferred to the bit-line during a read operation should have enough margin against noise. The sources of noise are the following:

• bit-line noise, which is caused by capacitive couplings and other sources;
• leakage charge, which is mainly due to the leakage in the junction of the NMOS transistor of a 1T memory cell; and
• α-particle-induced soft errors.
In the early DRAMs, the plate of the capacitor was grounded to reduce the noise injection from the VDD power supply. However, for multi-Mb DRAMs, a VDD/2 bias of the cell plate was used. This scheme has several advantages, such as the reduction of the stress on the thinner oxide of the storage capacitor and the reduction of supply voltage noise. Many 1-Mb DRAMs have used this cell biasing scheme.
For Gb DRAM cell design with reduced VDD, the ratio R should be reduced. This is possible by reducing the bit-line capacitance, CBL, and increasing the storage capacitance Cs. On the other hand, the area occupied by Cs should be reduced to increase the chip capacity. One solution for Cs area reduction is the use of a capacitor insulator with extremely high permittivity ε, such as a ferroelectric material like BaSrTiO3 film. Consequently, a simple planar-type capacitor can be used in that case.
6.2.3 Read/Write Circuitry
Fig. 6.40 illustrates the different circuits for read, write, precharge, and equalization functions. The read operation is performed as follows. Initially both the bit-lines (BL and $\overline{BL}$) are precharged to a voltage equal to VDD/2 and equalized before the data reading operation. This half-VDD precharge technique permits the reduction of the active power dissipation as discussed in Section 6.2.9. The signal WL is selected by the row decoder. The high level of the word-line voltage has to be greater than VDD to increase the stored charge in the memory cell. The selected memory cell is connected to one bit-line. Then ΔVBL (100 to 200 mV) appears between the bit-lines, immediately after the word-line rises. Then it is amplified by the latch-type CMOS sense amplifier which is connected to both bit-lines. After the sensing and the restoring operations, the voltage levels of the bit-lines have a full-swing condition. The bit-line differential voltage signal is transferred to the differential output-lines (O and $\overline{O}$) through a read circuit. The signal YR is selected almost at the same time as WL. The parasitic capacitance of the output-line is large (a typical value is 2 pF for a 4-Mb DRAM), and the readout circuit would need a long time to amplify the output-line signal. A main sense amplifier is used to read the output-lines, then the data is selected among several main SAs connected to different sub-arrays. Finally it is transferred to the output buffer.
The DRAM cell readout mechanism is destructive, and hence the same data must be rewritten to the cell on every read access. Consequently, on each bit-line pair, a CMOS amplifier is needed to amplify and restore the level. This mechanism is not needed in SRAMs since the read operation is non-destructive.
In the write mode, the YW signal is selected by a column decoder as shown in Fig. 6.40. In this case, the write control signal is activated. The selected bit-lines are connected to a pair of write-lines W and $\overline{W}$ and the data are transferred to the memory cell when WL goes HIGH.
• The decoders;

• The memory array. This is the dominant one. If m memory cells are connected to the word-line, the active power of the memory array is given by

$$P_{act,array} = m \times P_{act,m} \qquad (6.11)$$

where $P_{act,m}$ is the power dissipated in active mode for each of the m selected cells. It is given by

$$P_{act,m} = C_{BL}\,\Delta V_{BL}\,V_{DD}\,f \qquad (6.12)$$

• Other circuits such as the refresh circuit, substrate back-bias generator, boosted level generator, voltage reference circuit, and half-VDD generator. These circuits also dissipate a DC current; and

• The rest of the periphery such as the main sense amplifier, input/output buffers, write circuitry, etc.
To reduce this active power, many techniques can be used. They are summarized as follows:

• Reducing all capacitances, particularly the bit-line and word-line capacitances. As seen from Equations (6.11) and (6.12), m × CBL should be reduced. Techniques which permit this are partial activation of a multi-divided bit-line and shared I/O [see Section 6.2.7]. Also, to reduce the word-line capacitance, a technique such as partial activation of a multi-divided word-line can be used [see Section 6.2.8];

• Lowering the internal VDD. This includes the generation of half-VDD for precharging the bit-lines and reducing the external supply voltage; and

• Reducing the DC power required by periphery circuits. This is possible by using static CMOS decoders and the pulse operation technique using an ATD circuit (as in SRAMs).
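The leverage of each technique in the list above can be seen by evaluating Equations (6.11) and (6.12) numerically. The sketch below assumes (6.12) has the form CBL ΔVBL VDD f per selected cell; the array size, bit-line swing and cycle frequency are hypothetical numbers chosen only to make the scaling visible.

```python
def array_active_power(m, c_bl, delta_v_bl, v_dd, f):
    """Active power of the memory array, in watts.

    Per-cell term (assumed form of Eq. 6.12): CBL * dVBL * VDD * f.
    Array term (Eq. 6.11): m times the per-cell term.
    """
    p_cell = c_bl * delta_v_bl * v_dd * f
    return m * p_cell


# Hypothetical baseline: 1024 cells per word-line, 250 fF bit-lines,
# half-VDD bit-line swing at VDD = 3.3 V, 10 MHz cycle rate.
baseline = array_active_power(m=1024, c_bl=250e-15, delta_v_bl=1.65,
                              v_dd=3.3, f=10e6)
# Partial activation: only half of the cells on the word-line are selected.
half_m = array_active_power(m=512, c_bl=250e-15, delta_v_bl=1.65,
                            v_dd=3.3, f=10e6)
# Multi-divided bit-line: bit-line capacitance cut in four.
divided_bl = array_active_power(m=1024, c_bl=62.5e-15, delta_v_bl=1.65,
                                v_dd=3.3, f=10e6)

print(f"baseline:           {baseline * 1e3:.2f} mW")
print(f"half the cells:     {half_m * 1e3:.2f} mW")
print(f"quarter CBL:        {divided_bl * 1e3:.2f} mW")
```

Because the expression is a plain product, halving m or quartering CBL scales the array power by the same factor, while lowering VDD also lowers the half-VDD swing and therefore acts roughly quadratically.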
The data retention power in a DRAM is mainly due to the refresh operation and the DC power (IDC) of peripheral circuits such as the BBG, VDC, RVG, and HVG. The refresh process is performed by reading the m cells connected on each word-line and restoring them. Thus, n refresh cycles are needed for an n × m DRAM. The refresh contribution is proportional to n/t_refresh, where t_refresh is the refresh period. To reduce the power due to the refresh mode, one obvious technique is to increase t_refresh and decrease n.

The other contribution is the AC and DC power dissipated by the other circuits such as the VDC, BBG, RVG, HVG, and boosted level generator. To reduce this power many techniques can be used. One of them is to reduce, in data retention mode, the frequency of operation of circuits which have high power during active mode. Another one is to reduce the DC current of these circuits using, for example, a dynamic concept.
In the following sections, the circuit techniques to reduce the active and data-retention power dissipation are presented. Also, the different circuits constituting a DRAM are described and low-power issues of these circuits are discussed.
6.2.5 Decoder
In a DRAM, static CMOS NAND decoders are used. The power is reduced by using the predecoding technique. This topic is discussed in more detail in Section 6.1.6 for SRAMs. Fig. 6.41 shows a static CMOS word-line driver. The boosted level, Vch, generated by an internal charge pump circuit, is used in the output stage. When node A is high at (VDD - VT), the output inverter leaks a high DC current because this level is lower than Vch by at least two threshold voltages, subject to the body effect. Therefore, a small-size PMOS transistor P1 is used to restore the level of the node A to the Vch level. This transistor also permits the latching of the low output level (ground). The Xi signal, when selected, is normally at VDD. The unselected Xi is discharged to ground in the selected block before the row decoder selection.
Reducing the bit-line capacitance not only reduces the power dissipation but also improves the signal-to-noise ratio of the memory cell. This is possible by:

1. Reducing the number of memory cells connected to each bit-line. In this case, the multi-divided bit-line technique is used.
2. Reducing the junction capacitances of connected transistors such as access devices. One possible solution is the back-bias of the substrate containing these devices. A negative voltage on the substrate reduces the junction capacitance. In addition, the use of the trench isolation technique for CMOS devices rather than LOCOS isolation results in almost 50% reduction in capacitance.

Fig. 6.42 shows the principle of the multi-divided bit-line architecture for the memory array. The m × n array is now divided into m columns by k subarrays. Each subarray contains n/k word-lines. In this scheme the bit-line capacitance CBL is reduced by dividing it into k sections. Also the signal-to-noise ratio of the cell is improved. Fig. 6.43 illustrates an example of a 1-Mb DRAM [32]. The memory is divided into two parts, upper and lower. One part is divided into N = 16 sub-arrays and the total number of subarrays is k = 32. Two sub-bit-lines share one amplifier and are selected by the isolation signals, ISO and $\overline{ISO}$. Thus, a partial activation is performed by selecting only one SA along the bit-line. The switch SW is controlled by the Y signal from the shared column decoder. This signal runs in parallel to the bit-lines and uses metal-2. Thus, the I/O is shared by two sub-bit-lines. This principle results in reduced power dissipation and chip size. It has been used for many DRAM generations up to 16 Mb.
Figure 6.43 Multi-divided bit-line architecture with shared SA, I/O and column decoder [32]. Labels: row decoder, bit-line in metal-1, Y line in metal-2.
Fig. 6.44 shows the hierarchical word-line structure proposed for a 256-Mb DRAM [26]. This scheme resembles the one used in the SRAM. The DRAM cell array is divided into several blocks and each one is itself divided into subarrays. The Sub-Word-Line (SWL) circuitry is embedded in the subarray. Only one SWL is activated by the Main-Word-Line (MWL) and the row select signal. It is common to two sub-arrays as shown in Fig. 6.44. Thus, only two cell subarrays are activated, which represents a very small portion of the total cell array. In the case of the 256-Mb, the active cell array size is 1/1024 of the total. This structure results in reduced active current and ground bounce.
A simple circuit which permits the generation of this half-VDD is shown in Fig. 6.45. The HVG circuit is composed of two stages. One stage is a bias generator which generates two voltage levels: (VDD/2 + VT) and (VDD/2 - VT). The second one is the push-pull output stage which generates the level VDD/2 distributed to the memory array. The load capacitance seen by the push-pull output stage is huge. A typical value is a few tens of nF. A typical response time when the circuit is powered up is a few tens of μs at 3.3 V power supply voltage for a 16-Mb DRAM. This HVG circuit has many disadvantages such as
duty ratio of the HVGE signal in the data-retention mode. To solve the other problems cited, an HVG circuit was proposed in [28], but this circuit dissipates a DC current.
For NMOS devices with a P-well (substrate), a negative VBB is generated by pumping electrons out of the ground node and into the substrate. A typical VBB generator configuration is shown in Fig. 6.47. This circuit is known as a charge pump. The node A oscillates between VT and (VT - VDD). During the high side of the cycle, the node A must be at least at VT to pump the charge from the ground. On the low side of the cycle, the node A must be a VT drop below VBB. The output node VBB stabilizes at a voltage level equal to (2VT - VDD), since the load capacitance is huge. The clock (clk) is generated by a ring oscillator with N stages (N is an odd number). The frequency of oscillation, f, is approximately 1/(2N td), where td is the delay of one inverter. The buffer is needed to drive the huge Cpump capacitance. The average current pumped out of the substrate is approximated by

$$I_{pump} = (V_{BB} - V_{BBmin})\,C_{pump}\,f \qquad (6.14)$$

where VBBmin is the back-bias voltage when no current is pumped and is equal to (2VT - VDD) (optimum value). During start-up a large current is pumped, equal to (-VBBmin Cpump f). A PMOS version of the charge-pump circuit is shown in Fig. 6.48. Since the gate voltage of P1 only reaches -VDD, VBB is pumped to a limit of (VT - VDD). For VDD = 5 V, the NMOS and PMOS charge pump circuits generate typical voltages of -3 and -4 V, respectively. However, for a 3.3 V power supply, the PMOS version can generate a negative voltage of -2.5 V, which is lower than the one generated by the NMOS version at this power supply. Fig. 6.49 shows a pumping circuit which avoids the VT losses and hence is suitable for low-voltage operation [35]. When the clock (clk) is low, the voltage of the node A reaches (|VTp| - VDD), and the PMOS transistor P1 clamps
the voltage of node B to the ground level. The VBB level is, in that case, (|VTp| - VDD - VTn). When clk goes to a high level, the voltage of A rises to |VTp| and the voltage of B, by capacitive coupling, becomes -VDD, causing VBB to be equal to -VDD. Therefore VBB will be

$$V_{BB} = \max\{-V_{DD},\; |V_{Tp}| - V_{DD} - V_{Tn}\} \qquad (6.15)$$

This circuit needs a special triple-well structure to avoid minority carrier injection of the NMOS transistor, as discussed in [35]. To reduce the power dissipation of the BBG circuit while the DRAM is not in an active mode, the BBG can be operated at low frequency. Fig. 6.50 shows a simplified circuit diagram of the BBG circuits for low-power operation [36]. In the normal mode, the ring oscillator works all the time to retain the VBB level. In the data retention mode, the BBG Enable (BBGE) signal is clocked with a low duty ratio. Then the ring oscillator operates at low frequency to refresh the pumping circuit.
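The back-bias generator relations quoted in this section (oscillation frequency of about 1/(2N td), pumped current of Equation (6.14), and the NMOS and PMOS pump limits) are simple enough to evaluate directly. The sketch below uses them as stated; the stage count, inverter delay and pump capacitance are hypothetical values chosen only for illustration.

```python
def ring_oscillator_frequency(n_stages, t_d):
    """Approximate ring-oscillator frequency, f = 1 / (2 * N * td)."""
    if n_stages % 2 == 0:
        raise ValueError("a ring oscillator needs an odd number of stages")
    return 1.0 / (2 * n_stages * t_d)


def pump_current(v_bb, v_bb_min, c_pump, f):
    """Average pumped current, Eq. (6.14): (VBB - VBBmin) * Cpump * f."""
    return (v_bb - v_bb_min) * c_pump * f


def nmos_pump_limit(v_dd, v_t):
    """Steady-state VBB of the NMOS pump: 2*VT - VDD."""
    return 2 * v_t - v_dd


def pmos_pump_limit(v_dd, v_t):
    """Steady-state VBB of the PMOS pump: VT - VDD."""
    return v_t - v_dd


# Hypothetical generator: 5 stages, 10 ns per inverter, 100 pF pump capacitor.
f_osc = ring_oscillator_frequency(n_stages=5, t_d=10e-9)
v_min = nmos_pump_limit(v_dd=5.0, v_t=1.0)
i_startup = pump_current(v_bb=0.0, v_bb_min=v_min, c_pump=100e-12, f=f_osc)

print(f"f = {f_osc / 1e6:.0f} MHz, VBBmin = {v_min:.1f} V, "
      f"start-up current = {i_startup * 1e3:.1f} mA")
```

With VT = 1 V the two limit functions return -3 V and -4 V at VDD = 5 V, matching the typical values quoted above. The pumped current falls to zero as VBB approaches VBBmin, which is why the oscillator can be slowed down in data retention mode.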
is needed to generate a voltage level above VDD by at least VT. The word-line driver is powered with this voltage Vch. A simple boosted voltage generator is shown in Fig. 6.51(a). It uses the charge pump circuit technique discussed in Section 6.2.10. The output of this circuit switches between (VDD - VT) and (2VDD - VT). The clock φ is generated by a simple ring oscillator. Another circuit, which switches between VDD and 2VDD, is shown in Fig. 6.51(b). It uses two non-overlapping clock phases. This second circuit configuration uses feedback NMOS devices, N1 and N2, to eliminate the threshold voltage loss and boost the voltage to a higher level. This circuit is not sensitive to power supply voltage reduction.

The boosted level cannot be directly used to drive the load. Thus a pass transistor is needed to isolate the switching boosted level from the load, as shown in the example circuit of Fig. 6.52(a) [28]. The charge pump circuit CP1 generates, at the node A, a boosted signal switching between VDD and 2VDD. To control the pass transistor N, two pump circuits, CP2 and CP3, and an inverter INV are needed. The pump circuit CP2 generates, at node B, a signal switching between 2VDD and 3VDD and uses the boosted voltage Vch. The other pump circuit, CP3, controls the inverter INV. The output of this inverter (node D) switches between VDD and 3VDD. The output of this boosted voltage generator is Vch = 2VDD, and it is stable since the load capacitance is large. The voltage waveforms are shown in Fig. 6.52(b). This circuit is insensitive to VDD reduction and can work down to a sub-1 V power supply.
temperature. One way to increase this time, and hence reduce the data retention power dissipation, is to control the refresh period as a function of the chip temperature. Fig. 6.53 shows an on-chip self-refresh control circuit with a memory-cell leakage monitoring scheme. A refresh clock φrefresh is generated automatically with a period of t_refresh. The monitor cell, which has a leakage current IL, controls the refresh period. Initially node A is high, the NMOS transistor N is OFF, and node B is low. When the charge on node A has decreased to the point that the PMOS transistor P turns ON, node B rises. Then, during the time τ, a high pulse is generated at the node C, which in turn charges node A back up to the high level.
One solution to these problems is the use of low-VT devices in the DRAM array for the CMOS SA, precharge and equalizing circuits. However, this leads to a drastic increase in the leakage current during the active period. The leakage current paths are shown in Fig. 6.55. To significantly reduce this leakage current, the concept of Well-Synchronized Sensing and Equalizing (WSSE) was proposed [37]. It is based on the following two concepts:
• The voltage levels of the transistor sources and the well are equalized during the sensing, the restoring, and the equalizing periods. This eliminates the body effect.

• A negative bias VBB (respectively, a positive bias) is applied to the P-well (respectively, the N-well) during the active period. Thus, the leakage current is reduced because VT increases due to the body effect.
Fig. 6.56(a) shows the WSSE circuits using a triple-well structure. The N-well and the P-well control voltages, VWn and VWp, respectively, are controlled by a special logic. Fig. 6.56(b) illustrates the voltage waveforms. Before the word-line is activated, the bit-lines and the sense amplifier drive signals φn and φp are equalized to half-VDD. The P-well and N-well levels are precharged to (VDD/2 - VTn) and (VDD/2 + |VTp|), respectively. These voltage levels avoid any forward-biasing of the drain-well junctions during the initial time after WL activation. During this initial time, one bit-line is different from VDD/2. In the sensing and restoring period, the signals φn and VWp are pulled down while the signals φp and VWn are pulled up; each pair is synchronized. After this period, the bit-lines BL and $\overline{BL}$ are in full-swing condition. Then, the level VWp is pulled below GND to VBB and isolated from φn, while the level VWn is pulled above VDD to a boosted level and isolated from φp.
$$V_{ch} > V_{DD} + V_T(V_{DD}) + \alpha \qquad (6.16)$$
where α is the voltage margin and VT(VDD) is the threshold voltage of the access NMOS transistor when its source is at VDD. Note that the NMOS device has (VDD + |VBB|) as an effective back-bias voltage. For transistor reliability, Vch should be as small as possible. This means that VT(VDD) is required to be small. This threshold voltage is given by

$$V_T(V_{DD}) = V_{T0} + \gamma\left(\sqrt{V_{DD} + |V_{BB}| + 2\phi_F} - \sqrt{2\phi_F}\right) \qquad (6.17)$$
where VT0 is the threshold voltage at zero source and substrate bias, γ is the body effect coefficient and φF is the Fermi potential.
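The dependence of the access-transistor threshold on back-bias can be explored numerically. The sketch below uses the standard long-channel body-effect expression, which is assumed here to be the form intended by Equation (6.17); the values of VT0, γ and φF are hypothetical and do not come from the text.

```python
import math


def threshold_with_body_effect(v_t0, gamma, phi_f, v_sb):
    """Threshold voltage under a source-to-body reverse bias v_sb (volts).

    Standard long-channel model (assumed form of Eq. 6.17):
    VT = VT0 + gamma * (sqrt(2*phi_F + VSB) - sqrt(2*phi_F)).
    """
    return v_t0 + gamma * (math.sqrt(2 * phi_f + v_sb) - math.sqrt(2 * phi_f))


# Hypothetical device: VT0 = 0.5 V, gamma = 0.4 V^0.5, phi_F = 0.35 V.
params = dict(v_t0=0.5, gamma=0.4, phi_f=0.35)

# Source at VDD = 1.5 V with VBB = -1 V: effective back-bias VDD + |VBB|.
vt_source_high = threshold_with_body_effect(v_sb=1.5 + 1.0, **params)
# Source at ground with VBB = -1 V: back-bias is |VBB| only.
vt_source_low = threshold_with_body_effect(v_sb=1.0, **params)

print(f"VT with source at VDD: {vt_source_high:.3f} V")
print(f"VT with source at GND: {vt_source_low:.3f} V")
```

The larger the spread between these two values, the higher the word-line boost needed to write a full VDD into the cell, which is the tension between the two threshold requirements discussed in the text.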
Fig. 6.58 shows the unselected memory cell in long cycle operation. The bit-line has completed the sensing operation and is at ground level (GND). In this situation, the memory cell is exposed to the worst-case leakage condition. The charge stored in the cell leaks rapidly due to the subthreshold current. This situation sets the lower limit of the threshold voltage. Note that the access transistor of the memory cell has $|V_{BB}|$ as back-bias voltage. The threshold voltage in this mode is given by
$$V_T(0) = V_{T0} + \gamma \left( \sqrt{|V_{BB}| + 2\phi_F} - \sqrt{2\phi_F} \right) \tag{6.18}$$
To meet these two requirements on the threshold voltage, the substrate should have a sufficient back-bias voltage to suppress the body effect.
For example, when the internal supply voltage is $V_{DD}$ = 1.5 V, $V_{BB}$ is set to -1 V. Then $V_T$(1.5 V) is 1 V, $V_T$(0) is 0.75 V and S = 90 mV/decade (the threshold voltages quoted are extrapolated values).
Therefore, the leakage current of a transistor with W = 1 µm is 10 fA. In this case, $V_{CH}$, which must be larger than $V_{DD} + V_T(V_{DD})$, is 3 V.
When the $V_T$ of the memory cell is reduced, the leakage current increases drastically. The concept of Boosted Sense Ground (BSG) [38] was proposed to shut down the subthreshold current in the memory cell access transistor. This is achieved by slightly boosting the low-level voltage of the bit-line. This level is called the BSG level, and is set at 0.5 V. During a long cycle operation, the gate-source voltage of an unselected cell is negative (-0.5 V), so the subthreshold current is reduced by 6 orders of magnitude (for S = 80 mV/decade). Fig. 6.59 shows the BSG circuit applied to a memory cell. The BSG line is common to all N-channel sense amplifiers. The BSG level is generated by a circuit similar to the VDC circuit [see Section 6.3]. In active mode, the differential amplifier and $N_1$ are activated and the voltage of the sense ground becomes $V_{ref}$. The $N_2$ transistor has a large width and is activated by the signal SE at the beginning of the sensing period to suppress an unnecessary rise in the BSG level caused by the sensing current. In the standby mode, the differential amplifier is made inactive to reduce the standby current, as are $N_1$ and $N_2$. The BSG level is then clamped to the threshold voltage of an NMOS transistor (see Fig. 6.59). Note that the boosted level, $V_{CH}$, is reduced compared to the conventional scheme because $V_T$ is reduced.
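The "6 orders of magnitude" figure follows directly from the definition of the subthreshold swing S: every S volts of gate-source reduction lowers the current by one decade. The short Python sketch below checks this arithmetic for the BSG numbers quoted above. The function names are ours and purely illustrative, and the simple exponential model is an assumption that ignores drain-induced effects and junction leakage.

```python
def decades_of_reduction(delta_vgs, swing):
    """Number of decades by which the subthreshold current drops when the
    gate-source voltage is lowered by delta_vgs (V), for a swing in V/decade."""
    return delta_vgs / swing


def leakage_ratio(delta_vgs, swing):
    """Ratio I(after) / I(before) under the simple exponential model."""
    return 10.0 ** (-decades_of_reduction(delta_vgs, swing))


if __name__ == "__main__":
    # BSG example from the text: V_GS = -0.5 V, S = 80 mV/decade
    print(decades_of_reduction(0.5, 0.080))
    print(leakage_ratio(0.5, 0.080))
```

With a BSG level of 0.5 V and S = 80 mV/decade the reduction is slightly more than six decades, in agreement with the text.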
When the threshold voltage is low, the subthreshold current of each driver becomes important. For a DRAM with k blocks of n drivers, the total subthreshold current of the drivers is
$$I_{subt} = k \cdot n \cdot I_{sub} \tag{6.19}$$
where $I_{sub}$ is the subthreshold current of the NMOS and PMOS transistors (assumed the same). For a high-capacity DRAM, the current $I_{subt}$ would be huge. For example, if a multi-gigabit DRAM has 1 million drivers, and each driver has a subthreshold current of 10 nA at room temperature, then the total subthreshold current would be 10 mA. At 75 °C, this current can be hundreds of mA. This high DC current destroys the $V_{CH}$ level because the charge pump circuit cannot handle such a DC current. Note that this current should always be evaluated in the worst case: maximum temperature and the lowest value of $V_T$. In the standby mode, all the drivers are turned OFF, yet the current $I_{subt}$ is still the same. To solve this problem, the Self-Reverse-Biasing (SRB) scheme can be used [24]. This concept has already been discussed in Section 4.10 [Chapter 4]. Fig. 6.61 shows the application of the SRB scheme to word-line drivers. During the active mode, the control signal is low and the node SL is equal to $V_{CH}$. Only one word-line is selected. When the control signal goes high (standby mode), the PMOS device $P_s$ limits the subthreshold current. In this mode, all drivers are OFF, even the selected one. Fig. 6.62 shows the technique to turn off the
d is low,
One problem associated with the SRB scheme is that during the active mode, after one selected word-line driver is activated, all the other drivers are leaking, thereby substantially contributing to the active current. This problem is solved by the partial activation of a hierarchical power-line scheme [39]. Fig. 6.63 shows the principle of the 2-D selection scheme. In this scheme, the array of k blocks by n drivers is divided into k sub-blocks in columns and l sub-blocks in rows. The total number of sub-blocks, each containing a set of drivers, is k × l. During the active mode, only one sub-block is activated. Thus the subthreshold current in the active mode is drastically reduced.
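To give a feeling for the orders of magnitude involved, the following Python sketch evaluates the total driver leakage for the example quoted earlier (1 million drivers at 10 nA each) and the leakage left when only one of k × l sub-blocks is active. The function names and the 8 × 8 partitioning are hypothetical, and the sketch assumes that the inactive sub-blocks contribute negligible current, which is an idealization of the hierarchical power-line scheme.

```python
def total_driver_leakage(num_drivers, i_sub):
    """Total subthreshold current (A) when every driver leaks i_sub (A)."""
    return num_drivers * i_sub


def active_leakage_one_subblock(num_drivers, i_sub, k, l):
    """Leakage (A) when only one of k*l equal sub-blocks is active.
    Assumption: the other sub-blocks are cut off and leak negligibly."""
    return total_driver_leakage(num_drivers, i_sub) / (k * l)


if __name__ == "__main__":
    n_drivers = 1_000_000      # example from the text
    i_sub = 10e-9              # 10 nA per driver at room temperature
    print(total_driver_leakage(n_drivers, i_sub))               # about 10 mA
    print(active_leakage_one_subblock(n_drivers, i_sub, 8, 8))  # hypothetical 8 x 8 split
```

Under these assumptions an 8 × 8 partition brings the active leakage from about 10 mA down to roughly 156 µA.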
buffers are powered with the external voltage to maintain compatibility. However, the VDC, in this situation, should be stable when supplying a large current to the periphery and the memory array. When the VDC is used for battery-operated applications, the standby current should be less than 1 µA over a wide range of temperature (0-70 °C).
Fig. 6.65 shows a schematic of the VDC structure for a DRAM, used to convert the external supply voltage to a lower internal supply voltage. It comprises a Reference Voltage Generator (RVG), a driver circuit and a time-dependent load. The buffer circuit consists of a differential amplifier [Fig. 6.66] and a common-source drive PMOS transistor $P_b$. The load current has a peak of more than 100 mA in 10-30 ns for the memory array, and of more than 100 mA in a few ns for the periphery circuit. To deliver such a large current, the width of the PMOS $P_b$ of the output stage should be large. Moreover, when the output current changes rapidly, the output voltage $V_{DD}$ decreases by $\Delta V_{DD}$. To minimize $\Delta V_{DD}$, the gate control voltage, $V_G$, has to change quickly. This is possible by increasing the tail current of the differential amplifier. A current source is needed to clamp the output voltage $V_{DD}$ when the load current becomes almost zero.
A VDC circuit is one of the keys for achieving a DRAM whose data-retention current is low enough for battery-based applications. The requirements for low power include the following: the standby current must be less than 1 µA over a wide range of temperature, process and power supply voltage variations.
The circuit has two poles: $p_1 = 1/(C_G r_1)$ for the differential amplifier and $p_2 = 1/(C_L r_2)$ for the output stage, each set by the capacitance and the resistance seen at the corresponding node (typical values are $C_G$ = 100 pF and $C_L$ = 1200 pF). The two poles must be sufficiently separated from each other to ensure a good phase margin [48]. For a DRAM application, the pole $p_2$ varies drastically because of the load variation. Thus, the circuit can fail to ensure a sufficient phase margin and hence it can generate ringing or oscillation. Therefore, phase compensation has to be applied. One possible compensation technique, shown in Fig. 6.68(a), is the Miller compensation technique. The compensation capacitor $C_C$ is connected between the input and the output of the second stage. It shifts the pole $p_1$ towards a lower frequency $p_1'$, as shown in Fig. 6.68(b). Thus, the phase margin is improved. The stability condition is defined at the point of 0 dB loop gain, where the phase margin must be larger than 45 degrees. Using small-signal analysis with the compensation capacitor $C_C$, the condition can be extracted. This capacitor is a function of $g_{m1}$, $g_{m2}$, $C_L$ and $C_G$. To determine it, $g_{m2}$ has to be known, using large-signal analysis. The PMOS driver $P_b$ has to be sized to satisfy the condition on $\Delta V_{DD}/V_{DD}$ (less than 10%) under the transient load current variation. Hence $g_{m2}$ can be determined from the size of $P_b$. For a 16-Mb DRAM, the width of the output PMOS $P_b$ can be as high as 30,000 µm and $C_C$ equal to 200 pF. This is for 3.3 V internal power supply generation from 5 V. The tail current of the differential amplifier can be high (a few mA) in active mode. In standby mode, the driver can be deactivated by the Chip Select (CS) signal so that it consumes only a very small current. In this case, the internal voltage can be supplied by a low-power voltage follower [46]. The voltage follower has the same configuration as the driver but its tail current is in the sub-µA range.
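The need for pole separation can be illustrated numerically. The Python sketch below models the loop as an idealized two-pole system with DC gain $A_0$, finds the 0 dB crossover frequency in closed form and returns the phase margin. This is only a sketch under stated assumptions: the function name, the gain of 100 and the normalized pole values are hypothetical, and the zero introduced by a real Miller capacitor is ignored.

```python
import math


def phase_margin_deg(dc_gain, p1, p2):
    """Phase margin (degrees) of the loop gain A0 / ((1 + s/p1)(1 + s/p2)).
    p1 and p2 are pole frequencies in the same units (e.g. rad/s);
    dc_gain must be greater than 1 so that a 0 dB crossover exists."""
    # |T(jw)| = 1  <=>  (1 + w^2/p1^2)(1 + w^2/p2^2) = A0^2, a quadratic in x = w^2
    a = 1.0 / (p1 * p1 * p2 * p2)
    b = 1.0 / (p1 * p1) + 1.0 / (p2 * p2)
    c = 1.0 - dc_gain * dc_gain
    x = (-b + math.sqrt(b * b - 4.0 * a * c)) / (2.0 * a)
    w_unity = math.sqrt(x)
    phase_lag = math.atan(w_unity / p1) + math.atan(w_unity / p2)
    return 180.0 - math.degrees(phase_lag)


if __name__ == "__main__":
    print(phase_margin_deg(100.0, 1.0, 10.0))    # poles only a decade apart
    print(phase_margin_deg(100.0, 0.01, 10.0))   # first pole moved to lower frequency
    print(phase_margin_deg(100.0, 1.0, 1000.0))  # widely separated poles
```

With poles only a decade apart the margin falls well below 45 degrees, whereas moving the first pole to a lower frequency, which is what the compensation capacitor does, restores a comfortable margin.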
The former consumes a DC current which is not low enough for low-power applications. The latter is more suitable for a CMOS technology. Fig. 6.69(a) shows a PMOS $V_T$-difference generator with an output voltage $\Delta V_T = |V_{Tp1}| - |V_{Tp2}|$ ($V_{Tp1} < V_{Tp2} < 0$). The equivalent circuit is shown in Fig. 6.69(b). This circuit needs a PMOS device with a high threshold voltage. A typical value for the threshold voltage difference is around 1 V. PMOS transistors are chosen for the threshold voltage difference generator because they are in N-wells and therefore the difference is independent of the back-bias ($V_{BB}$). The circuit of Fig. 6.69(a) does not suffer much from power-supply bounce. The temperature dependency of the $V_T$ difference is expressed by [49]
(6.21)
where $N_{s1}$ and $N_{s2}$ are the surface impurity concentrations of $P_1$ and $P_2$, respectively. For a temperature-stable design, the concentration ratio $N_{s1}/N_{s2}$, and therefore the threshold voltage difference, should not be excessively large. A typical value of the temperature dependency is 0.4 mV/°C, which is small for the VDC circuit. Since $\Delta V_T$ is around 1 V, the circuit of Fig. 6.70 is used to convert this difference to the required internal supply voltage. The voltage-up converter amplifies $\Delta V_T$ to:
$$V_{ref} = \Delta V_T \left( 1 + \frac{R_1}{R_2} \right) \tag{6.22}$$
The mismatch between the two PMOS devices $P_1$ and $P_2$ of Fig. 6.69 can be minimized by using large channel widths and lengths. But the deviation of $V_T$ due to the fabrication process still has to be eliminated. This can be done by using a fuse trimming technique to control the ratio of the resistors $R_1$ and $R_2$. The total current consumed by this RVG circuit is
$$I_{RVG} = 3I + I_s + I_R \tag{6.23}$$
where $3I$ is the current consumed by the voltage regulator [see Fig. 6.69(a)] and $I_s$ is the current of the differential amplifier. $I_R = V_{ref}/(R_1 + R_2)$ is the current of the output stage. $I$ can be made less than 1 µA; however, $I_s$ and $I_R$ cannot be made smaller, particularly $I_R$. The resistors are implemented, for example, by using doped polysilicon. Typical values of the resistances are of the order of 100 kΩ. They cannot be increased excessively, otherwise the area of the RVG becomes significantly large. Moreover, the substrate noise can affect the reference voltage through the coupling capacitances of the resistors. The total current of this type of RVG is of the order of a few tens of µA.
To reduce the current of the RVG to the sub-µA range for battery-operated DRAMs, the concept of the dynamic RVG can be used [50], as shown in Fig. 6.71. A PMOS transistor with low $|V_T|$ is used. During the sampling period ($\phi_1$ is high), all switches $S_1$-$S_4$ are closed. The threshold voltage difference, $\Delta V_T$, between the two PMOS devices, $P_1$ and $P_2$, appears across the resistor $R_R$. If the transistor dimensions of the pairs $P_1$ and $P_2$, and $N_1$ and $N_2$ are identical, the reference current is given by
$$I_R = \frac{\Delta V_T}{R_R} \tag{6.24}$$
This current is mirrored to the output node. If the dimension of $P_3$ is identical to that of $P_2$, the output voltage $V_{ref}$ is given by
$$V_{ref} = \Delta V_T \, \frac{R_L}{R_R} \tag{6.25}$$
This shows that the reference voltage can be adjusted to any voltage. Moreover, with a trimming technique $V_{ref}$ can be adjusted against the process variation effect ($\Delta V_T$ variation). The output voltage is sampled on the hold capacitor $C_H$. When $\phi_1$ is low, the circuit is in hold mode. Clock $\phi_2$ is delayed with respect to clock $\phi_1$ to minimize fluctuation of the output voltage. These clocks are generated from the self-refresh clock circuit in a DRAM. The circuit consumes a DC current only when $\phi_1$ is applied. The average current consumed by this circuit is
$$I_{av} = 3 I_R \, \tau_\phi = 3 \, \frac{\Delta V_T}{R_R} \, \tau_\phi \tag{6.26}$$
where $\tau_\phi$ is the duty ratio of $\phi_1$. The current of this circuit can be reduced to a low level in the sub-µA range by controlling the duty ratio. For example, to generate a reference voltage of 2.4 V from an external power supply voltage of 3.3 V, $R_R$ and $R_L$ are 9 kΩ and 72 kΩ, respectively, and $\Delta V_T$ has a typical value of 0.3 V. The total DC current is 100 µA. So with a duty ratio lower than 1/100, the average current can be reduced below 1 µA. It can be easily shown that this circuit has a low sensitivity to power supply voltage and temperature variations.
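The numbers of this example can be cross-checked with a few lines of Python. The sketch below assumes the relations $I_R = \Delta V_T / R_R$, $V_{ref} = \Delta V_T R_L / R_R$ and an average current equal to three times $I_R$ times the duty ratio; the function name is ours, and the load resistor is taken as 72 kΩ, the value for which these relations yield the quoted 2.4 V and 100 µA.

```python
def dynamic_rvg(delta_vt, r_r, r_l, duty_ratio, branches=3):
    """Return (v_ref, i_dc, i_avg) for the dynamic reference voltage generator.
    delta_vt   : threshold voltage difference (V)
    r_r, r_l   : reference and load resistors (ohm)
    duty_ratio : fraction of time the sampling clock is active
    branches   : number of current branches, each carrying delta_vt / r_r"""
    i_ref = delta_vt / r_r            # reference current set by the V_T difference
    v_ref = i_ref * r_l               # mirrored current flowing in the load resistor
    i_dc = branches * i_ref           # DC current while the sampling clock is high
    i_avg = i_dc * duty_ratio         # average current over a full clock period
    return v_ref, i_dc, i_avg


if __name__ == "__main__":
    v_ref, i_dc, i_avg = dynamic_rvg(0.3, 9e3, 72e3, 1.0 / 100.0)
    print(v_ref, i_dc, i_avg)   # about 2.4 V, 100 uA and 1 uA
```

At a duty ratio of exactly 1/100 the average current is 1 µA, so any lower duty ratio brings it into the sub-µA range.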
6.4
CHAPTER SUMMARY
Low-power architectures and circuit techniques for SRAMs, DRAMs and VDCs were reviewed. The obvious technique to reduce the power dissipation is voltage scaling. The reduction of the power supply voltage to the 1 V and sub-1 V range requires new circuit innovations and breakthroughs, particularly when low threshold voltage devices are used. It was shown that not only power supply voltage scaling contributes to the reduction of power consumption, but also the reduction of capacitances and DC currents using sophisticated techniques. Many of the techniques presented for memories can be useful in other applications such as ASICs, DSPs, etc. Design issues for stable operation of a VDC and low-standby-current techniques were investigated.
REFERENCES
[1] H. Tran et al., "An 8-ns 1-Mb ECL BiCMOS SRAM with a Configurable Memory Array Size," International Solid-State Circuits Conf. Tech. Dig., pp. 36-37, February 1989.
[2] M. Matsui et al., "An 8-ns 1-Mb ECL BiCMOS SRAM," International Solid-State Circuits Conf. Tech. Dig., pp. 38-39, February 1989.
[3] Y. Maki et al., "A 6.5-ns 1 Mb BiCMOS ECL SRAM," International Solid-State Circuits Conf. Tech. Dig., pp. 136-137, February 1990.
[4] M. Takada et al., "A 5-ns 1-Mb ECL BiCMOS SRAM," IEEE Journal of Solid-State Circuits, vol. 25, no. 5, pp. 1057-1062, October 1990.
[5] A. Ohba et al., "A 7-ns 1-Mb BiCMOS ECL SRAM with Program-Free Redundancy," in Symp. VLSI Circuits Conf. Tech. Dig., pp. 41-42, May 1990.
[6] Y. Okajima et al., "A 7-ns 4-Mb BiCMOS SRAM with a Parallel Testing Circuit," International Solid-State Circuits Conf. Tech. Dig., pp. 54-55, February 1991.
[7] K. Sasaki et al., "A 7-ns 140-mW 1-Mb CMOS SRAM with Current Sense Amplifier," IEEE Journal of Solid-State Circuits, vol. 27, no. 11, pp. 1511-1518, November 1992.
[8] T. Ootani et al., "A 4-Mb CMOS SRAM with a PMOS Thin-Film Transistor Load Cell," IEEE Journal of Solid-State Circuits, vol. 25, no. 5, pp. 1082-1092, October 1990.
[9] S. Murakami et al., "A 21-mW 4-Mb CMOS SRAM for Battery Operation," IEEE Journal of Solid-State Circuits, vol. 26, no. 11, pp. 1563-1570, November 1991.
[10] K. Sasaki et al., "16-Mb CMOS SRAM with a 2.3-µm² Single-Bit-Line Memory Cell," IEEE Journal of Solid-State Circuits, vol. 28, no. 11, pp. 1125-1130, November 1993.
[11] M. Matsumiya et al., "A 15-ns 16-Mb CMOS SRAM with Interdigitated Bit-Line Architecture," IEEE Journal of Solid-State Circuits, vol. 27, no. 11, pp. 1497-1503, November 1992.
[12] K. Seno et al., "A 9-ns 16-Mb CMOS SRAM with Offset-Compensated Current Sense Amplifier," IEEE Journal of Solid-State Circuits, vol. 28, no. 11, pp. 1119-1124, November 1993.
[13] E. Seevinck, F. J. List, and J. Lohstroh, "Static-Noise Margin Analysis of MOS SRAM Cells," IEEE Journal of Solid-State Circuits, vol. SC-22, no. 5, pp. 748-754, October 1987.
[14] H. Kato et al., "Consideration of Poly-Si Loaded Cell Capacity Limits for Low-Power and High-Speed," IEEE Journal of Solid-State Circuits, vol. 27, no. 4, pp. 683-685, April 1992.
[15] K. Sasaki et al., "A 23-ns 4-Mb CMOS SRAM with 0.2-µA Standby Current," IEEE Journal of Solid-State Circuits, vol. 25, no. 5, pp. 1075-1081, October 1990.
[16] K. Ishibashi, T. Yamanaka, and K. Shimohigashi, "An α-Immune, 2-V Supply Voltage SRAM using a Polysilicon PMOS Load Cell," IEEE Journal of Solid-State Circuits, vol. 25, no. 1, pp. 55-60, February 1990.
[17] K. Sasaki et al., "A 15-ns 1-Mbit CMOS SRAM," IEEE Journal of Solid-State Circuits, vol. 23, no. 5, pp. 1067-1072, October 1988.
[18] K. Sasaki et al., "A 9-ns 1-Mbit CMOS SRAM," IEEE Journal of Solid-State Circuits, vol. 24, no. 5, pp. 1219-1225, October 1989.
[19] K. Ishibashi, K. Takasugi, T. Yamanaka, T. Hashimoto, and K. Sasaki, "A 1-V TFT-Load SRAM using a Two-Step Word-Voltage Method," IEEE Journal of Solid-State Circuits, vol. 27, no. 11, pp. 1519-1524, May 1992.
[20] M. Yoshimoto, K. Anami, H. Shinohara, T. Yoshihara, H. Takagi, S. Nagao, S. Kayano, and T. Nakano, "A Divided Word-Line Structure in the Static RAM and its Application to a 64K Full CMOS RAM," IEEE Journal of Solid-State Circuits, vol. SC-18, no. 5, pp. 479-485, October 1983.
[21] T. Hirose, H. Kuriyama, S. Murakami, K. Yuzuriha, T. Mukai, K. Tsutsumi, Y. Nishimura, Y. Kohno, and K. Anami, "A 20-ns 4-Mb CMOS SRAM with Hierarchical Word Decoding Architecture," IEEE Journal of Solid-State Circuits, vol. 25, no. 5, pp. 1068-1074, October 1990.
[22] A. Sekiyama, T. Seki, S. Nagai, A. Iwase, N. Suzuki, and M. Hayasaka, "A 1-V Operating 256-Kb Full-CMOS SRAM," IEEE Journal of Solid-State Circuits, vol. 27, no. 5, pp. 776-782, May 1992.
[23] T. Yabe et al., "High-Speed and Low-Standby-Power Circuit Design of 1 to 5 V Operating 1 Mb Full CMOS SRAM," Symposium on VLSI Circuits Tech. Dig., pp. 107-108, May 1993.
[24] G. Kitsukawa et al., "256-Mb DRAM Circuit Technologies for File Applications," IEEE Journal of Solid-State Circuits, vol. 28, no. 11, pp. 1105-1113, November 1993.
[25] T. Hasegawa et al., "An Experimental DRAM with a NAND-Structured Cell," IEEE Journal of Solid-State Circuits, vol. 28, no. 11, pp. 1099-1104, November 1993.
[26] T. Sugibayashi et al., "A 30-ns 256-Mb DRAM with a Multidivided Array Structure," IEEE Journal of Solid-State Circuits, vol. 28, no. 11, pp. 1092-1099, November 1993.
[27] M. Aoki, J. Etoh, K. Itoh, S-I. Kimura, and Y. Kawamoto, "A 1.5-V DRAM for Battery-Based Applications," IEEE Journal of Solid-State Circuits, vol. 24, no. 6, pp. 1206-1212, October 1989.
[28] Y. Nakagome et al., "An Experimental 1.5-V 64-Mb DRAM," IEEE Journal of Solid-State Circuits, vol. 26, no. 4, pp. 465-471, April 1991.
[29] H. Yamauchi et al., "A Circuit Technology for High-Speed Battery-Operated 16-Mb CMOS DRAMs," IEEE Journal of Solid-State Circuits, vol. 28, no. 11, pp. 1084-1091, November 1993.
[30] N. C. C. Lu, "Advanced Cell Structures for Dynamic RAMs," IEEE Circuits and Devices Magazine, no. 1, pp. 21-36, January 1989.
[31] M. Takada, "DRAM Technology for Giga-bit Age," International Conf. Solid State Devices and Materials, Tech. Dig., pp. 874-876, 1993.
[32] K. Itoh et al., "An Experimental 1-Mb DRAM with On-Chip Voltage Limiter," in International Solid-State Circuits Conf., Tech. Dig., pp. 282-283, 1984.
[33] N. C-C. Lu and H. H. Chao, "Half-VDD Bit-Line Sensing Scheme in CMOS DRAMs," IEEE Journal of Solid-State Circuits, vol. SC-19, no. 5, pp. 451-454, August 1984.
[34] H. Kawamoto, T. Shinoda, Y. Yamaguchi, S. Shimizu, K. Ohishi, N. Tanimura, and T. Yasui, "A 288K CMOS Pseudostatic RAM," IEEE Journal of Solid-State Circuits, vol. SC-19, no. 5, pp. 619-625, October 1984.
[35] Y. Tsukikawa et al., "An Efficient Back-Bias Generator with Hybrid Pumping Circuit for 1.5 V DRAMs," in Symposium on VLSI Circuits, Tech. Dig., pp. 85-86, May 1993.
[36] Y. Konishi et al., "A 38-ns 4-Mb DRAM with a Battery-Backup (BBU) Mode," IEEE Journal of Solid-State Circuits, vol. 25, no. 5, pp. 1112-1117, October 1990.
[37] T. Ooishi et al., "A Well-Synchronized Sensing/Equalizing Method for Sub-1 V Operating Advanced DRAMs," in Symposium on VLSI Circuits, Tech. Dig., pp. 81-82, May 1993.
[38] M. Asakura et al., "An Experimental 256-Mb DRAM with Boosted Sense-Ground Scheme," IEEE Journal of Solid-State Circuits, vol. 29, no. 11, pp. 1303-1309, November 1994.
[39] T. Sakata et al., "Subthreshold-Current Reduction Circuits for Multi-Gigabit DRAMs," in Symposium on VLSI Circuits, Tech. Dig., pp. 45-46, May 1993.
[40] T. Furuyama et al., "A New On-Chip Voltage Converter for Submicrometer High-Density DRAMs," IEEE Journal of Solid-State Circuits, vol. 22, no. 3, pp. 437-441, June 1987.
[41] M. Takada et al., "A 4-Mb DRAM with Half Internal Voltage Bit-Line Precharge," IEEE Journal of Solid-State Circuits, vol. 21, no. 5, pp. 612-617, October 1986.
[42] M. Horiguchi et al., "Dual-Operation-Voltage Scheme for a Single 5-V, 16-Mb DRAM," IEEE Journal of Solid-State Circuits, vol. 23, no. 5, pp. 1128-1132, October 1988.
[43] G. Kitsukawa et al., "A 1-Mb BiCMOS DRAM Using Temperature-Compensation Circuit Techniques," IEEE Journal of Solid-State Circuits, vol. 24, no. 3, pp. 597-602, June 1989.
[44] M. Horiguchi et al., "A Tunable CMOS-DRAM Voltage Limiter with Stabilized Feedback Amplifier," IEEE Journal of Solid-State Circuits, vol. 25, no. 5, pp. 1129-1135, October 1990.
[45] M. Horiguchi et al., "Dual-Regulator Dual-Decoding-Trimmer DRAM Voltage Limiter for Burn-in Test," IEEE Journal of Solid-State Circuits, vol. 26, no. 11, pp. 1544-1549, November 1991.
[46] K. Ishibashi, K. Sasaki, and H. Toyoshima, "A Voltage Down Converter with Submicroampere Standby Current for Low-Power Static RAMs," IEEE Journal of Solid-State Circuits, vol. 27, no. 6, pp. 920-926, June 1992.
[47] P. E. Allen and D. R. Holberg, "CMOS Analog Circuit Design," Holt, Rinehart and Winston, 1987.
[48] P. R. Gray and R. G. Meyer, "Analysis and Design of Analog Integrated Circuits," 2nd Edition, Wiley, 1984.
[49] R. A. Blauschild et al., "A New NMOS Temperature-Stable Voltage Reference," IEEE Journal of Solid-State Circuits, vol. SC-13, pp. 767-774, December 1978.
[50] H. Tanaka, Y. Nakagome, J. Etoh, E. Yamasaki, M. Aoki, and K. Miyazawa, "Sub-1-µA Dynamic Reference Voltage Generator for Battery-Operated DRAMs," in Symp. VLSI Circuits, Tech. Dig., pp. 87-88, May 1993.
7
VLSI CMOS SUBSYSTEM DESIGN
In this chapter, we study the application of the circuit techniques developed through Chapter 4 to the implementation of CMOS building blocks such as adders, multipliers, ALUs, data-paths, regular structures, etc. The power dissipation constraint is also included through the several options presented for each circuit. The use of the Phase Locked Loop (PLL) in high-speed CMOS systems for deskewing the internal clock is also examined. Low-power issues of the circuits presented are also discussed.
This section is devoted to the following classes of adders: Ripple Carry Adders (RCA); Carry Look-Ahead Adders (CLA); Carry Select Adders (CS); and Conditional Sum Adders (CSA).
7.1.1
In Chapter 4, a description of the functionality of an adder cell was presented. In an n-bit adder, a propagation of the carry always occurs. This propagation limits the speed of the adder. The simplest way to construct an n-bit adder is to cascade n 1-bit adders as shown in Fig. 7.1. This adder is called the Ripple Carry Adder (RCA). Because the carry ripples through the n stages, the sum of the nth bit cannot be computed until the carry $C_{n-1}$ is evaluated. The delay of an n-bit addition is given by
$$t_{add} = (n - 1)\, t_c + t_s \tag{7.1}$$
where $t_c$ is the carry delay and $t_s$ is the sum delay. Since the carry propagation path is the critical stage for the delay, the full-adder cell should be optimized. The sum and carry out are given by
$$S = A \oplus B \oplus C_{in} \tag{7.2}$$
$$C_{out} = A \cdot B + (A + B) \cdot C_{in} \tag{7.3}$$
The schematic of Fig. 7.2 can be generated to implement the adder cell efficiently. Compared to the conventional CMOS full-adder implementation, there is no inverter stage. Therefore, the carry delay is reduced. To optimize the cell, the transistors in the carry path can be sized up [see Fig. 7.2]. The other devices can be kept small to reduce the load on the carry and the power dissipation. The transistors driven by the carry in, $C_{in}$, are placed close to the output. This reduces the body effect, since the carry signal is the latest one in an adder chain. The schematic of Fig. 7.2 is symmetrical and leads to a better layout and a small area. Since the outputs are complemented, the configuration of Fig. 7.3 can be used to implement an RCA circuit. In this case, many cells use inverted inputs.
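The behaviour of the ripple carry adder can be sketched at the bit level in a few lines of Python. The sketch below uses the sum and carry expressions of Equations (7.2) and (7.3) and the delay expression of Equation (7.1). The function names are ours, and the unit delays in the usage lines are arbitrary illustrative values rather than data from any technology.

```python
def full_adder(a, b, c_in):
    """One-bit full adder: returns (sum, carry_out)."""
    s = a ^ b ^ c_in                      # Eq. (7.2)
    c_out = (a & b) | ((a | b) & c_in)    # Eq. (7.3)
    return s, c_out


def ripple_carry_add(a, b, n, c_in=0):
    """Add two n-bit unsigned integers by cascading n full adders.
    Returns (n-bit sum, carry out of the last stage)."""
    carry = c_in
    total = 0
    for i in range(n):
        s, carry = full_adder((a >> i) & 1, (b >> i) & 1, carry)
        total |= s << i
    return total, carry


def rca_delay(n, t_carry, t_sum):
    """Worst-case delay of an n-bit ripple carry adder, Eq. (7.1)."""
    return (n - 1) * t_carry + t_sum


if __name__ == "__main__":
    print(ripple_carry_add(0b1011, 0b0110, 4))   # 11 + 6 = 17, i.e. sum 1 with carry out 1
    print(rca_delay(32, 1.0, 1.5))               # 31 carry delays plus one sum delay
```

The linear growth of the delay with n is what the faster adder classes discussed in this section try to avoid.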
Note that an n-bit RCA circuit is subject to the glitching problem. Fig. 7.4 shows a static simulation of a 4-bit adder, with the inputs $A_i$ set to zero (0), and the inputs $B_i$ and $C_{in}$ rising from 0 to 1. The outputs $S_i$ should stay at 0; however, due to the delay of the carry signal through the chain of full-adders, the outputs exhibit spurious transitions (glitching). These dynamic transitions dissipate extra power and can represent an important portion of the total power. With careful design this glitching problem can be minimized. One advantage of the RCA is its low-power characteristic. However, its speed is very limited, particularly when the adder is wide.
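The origin of these spurious transitions can be reproduced with a very simple unit-delay model. In the Python sketch below every full adder updates its sum and carry outputs one time step after its inputs change. This is an assumed, idealized timing model with hypothetical function names, not a circuit simulation, and it reuses the stimulus described above: all $A_i$ at 0, with $B_i$ and $C_{in}$ rising from 0 to 1.

```python
def simulate_rca_glitches(a_bits, b_bits, c_in, steps):
    """Unit-delay simulation of a ripple carry adder.
    The adder starts settled for all-zero inputs; the new inputs are applied
    at time 0. Returns the list of sum vectors, one per time step."""
    n = len(a_bits)
    carries = [c_in] + [0] * n          # carries[i] is the carry into bit i
    sums = [0] * n
    history = [sums[:]]
    for _ in range(steps):
        # every cell computes its outputs from the carries of the previous step
        new_sums = [a_bits[i] ^ b_bits[i] ^ carries[i] for i in range(n)]
        new_carries = [c_in] + [
            (a_bits[i] & b_bits[i]) | ((a_bits[i] | b_bits[i]) & carries[i])
            for i in range(n)
        ]
        sums, carries = new_sums, new_carries
        history.append(sums[:])
    return history


def count_transitions(history):
    """Total number of output bit changes between consecutive time steps."""
    return sum(
        x != y
        for prev, cur in zip(history, history[1:])
        for x, y in zip(prev, cur)
    )


if __name__ == "__main__":
    hist = simulate_rca_glitches([0, 0, 0, 0], [1, 1, 1, 1], 1, steps=5)
    for t, s in enumerate(hist):
        print(t, s)
    print(count_transitions(hist))
```

In this model the final sum is zero, as expected, yet the three upper sum outputs each rise and fall once while waiting for the carry, giving six transitions that dissipate power without doing useful work.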
Another efficient full-adder cell is based on Transmission Gates (TGs). Fig. 7.5 shows an optimized version of the full-adder cell using TGs already discussed in Chapter 4. The carry in a cell propagates through only one TG. Hence, an n-bit RCA built with this cell would be faster and more compact than the conventional one. Fig. 7.6 shows the construction of an n-bit adder. In practice, an inverter is added every four stages to reduce the degradation of the carry signal due to the distributed RC effect. When the carry signal is inverted after four 1-bit stages, complementary carry path adders are used for the next 4-bit stages. This adder structure is sometimes called the Manchester adder. This circuit is faster than the RCA and may have lower power dissipation.
$$G_i = A_i \cdot B_i \tag{7.4}$$
Fig. 7.7 shows the block diagram of a 4-bit CLA adder. The carry generator blocks (CLG1 to CLG4) generate the carries $C_1$ to $C_4$, in parallel, from the carry-in signal $C_0$. The different $P_i$ and $G_i$ signals are implemented following the expressions given by Equations (7.4) and (7.5). The sum generator blocks (SG1 to SG4) generate the sums. The sum, $S_i$, is generated by
$$S_i = C_{i-1} \oplus A_i \oplus B_i \tag{7.10}$$
or
$$S_i = C_{i-1} \oplus P_i \tag{7.11}$$
if the propagate signal is given by
$$P_i = A_i \oplus B_i \tag{7.12}$$
In general, an n-bit CLA adder can be implemented efficiently using 4-bit blocks.
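A 4-bit look-ahead block can be sketched functionally as follows. The Python code assumes the usual generate signal (the AND of the two operand bits), the XOR form of the propagate signal of Equation (7.12), and the standard look-ahead expansion in which every carry is written directly in terms of the generate and propagate signals and the carry in. The function name is ours, and the bits are indexed from 0 in the code, whereas the text numbers the stages from 1.

```python
def cla_4bit(a, b, c0):
    """4-bit carry look-ahead addition. Returns (4-bit sum, carry out)."""
    g = [((a >> i) & (b >> i)) & 1 for i in range(4)]   # generate signals
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(4)]   # propagate signals, Eq. (7.12)

    # all carries are expressed directly from g, p and the carry in c0
    c1 = g[0] | (p[0] & c0)
    c2 = g[1] | (p[1] & g[0]) | (p[1] & p[0] & c0)
    c3 = g[2] | (p[2] & g[1]) | (p[2] & p[1] & g[0]) | (p[2] & p[1] & p[0] & c0)
    c4 = (g[3] | (p[3] & g[2]) | (p[3] & p[2] & g[1])
          | (p[3] & p[2] & p[1] & g[0]) | (p[3] & p[2] & p[1] & p[0] & c0))

    carries_in = [c0, c1, c2, c3]
    # each sum bit is the incoming carry XOR the propagate signal, Eq. (7.11)
    total = sum((carries_in[i] ^ p[i]) << i for i in range(4))
    return total, c4


if __name__ == "__main__":
    print(cla_4bit(0b1001, 0b0111, 0))   # 9 + 7 = 16, i.e. sum 0 with carry out 1
```

The growing number of product terms from the first carry to the fourth mirrors the growing number of stacked transistors in the corresponding carry generator circuits.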
Fig. 7.8(a) and 7.8(b) show the first and the fourth CMOS carry look-ahead generator circuits, respectively. The generate and propagate signals are generated in parallel and are fed to all carry generators together with the input carry signal $C_0$. The carry signals are generated simultaneously. However, because the number of stacked MOS transistors increases, the delay of the fourth carry is greater than that of the first and limits the adder speed. The sum generator of the CMOS adder of Fig. 7.2 can be used in this case. The same circuit is used for all four bits. This implementation is slow because of the large number of stacked MOS transistors, which represent a high equivalent resistance in the pull-up and pull-down paths.
Another CLA circuit implementation in static CMOS design, which improves the critical carry path delay, is shown in Fig. 7.9(a). In this circuit, the number of stacked devices is reduced. The same cell of Fig. 7.9(a) can be used to generate each carry within a 4-bit block. P and G are the global propagate and generate signals, respectively. The inverter of the circuit of Fig. 7.9(a) is used to reduce the load on the fourth carry when it is used to drive the next fourth CLG circuit. The output of this inverter drives many blocks, such as the next first-bit, second-bit and third-bit CLGs, and the next sum blocks. For the fourth bit stage, P and G are given by
$$P = P_{i+3} P_{i+2} P_{i+1} P_i \tag{7.13}$$
$$G = G_{i+3} + P_{i+3} G_{i+2} + P_{i+3} P_{i+2} G_{i+1} + P_{i+3} P_{i+2} P_{i+1} G_i \tag{7.14}$$
The circuits of Fig. 7.9(b) and Fig. 7.9(c) show the implementations of the global functions P and G. Similarly, the P and G signals for the third, second and first bit stages can be constructed. For an n-bit adder, all the P and G signals are computed in parallel. Hence, the critical path is the carry path from $C_i$ to $C_{i+4}$, except for the first 4-bit adder block, where the critical path can be from one of the inputs ($A_0$ or $B_0$) to the carry out $C_4$.
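The role of the global signals can be checked with a short functional model. The Python sketch below evaluates Equations (7.13) and (7.14) for one 4-bit block and forms the block carry out with the usual relation, carry out = G + P · (carry in), which is assumed here. The function names are ours, and index 0 of each list corresponds to bit i of the block.

```python
def group_propagate_generate(p, g):
    """Global propagate and generate of a 4-bit block.
    p, g: lists of four bit-level signals, index 0 = bit i, index 3 = bit i+3."""
    big_p = p[3] & p[2] & p[1] & p[0]                      # Eq. (7.13)
    big_g = (g[3] | (p[3] & g[2]) | (p[3] & p[2] & g[1])
             | (p[3] & p[2] & p[1] & g[0]))                # Eq. (7.14)
    return big_p, big_g


def block_carry_out(a, b, c_in):
    """Carry out of a 4-bit block, obtained from P, G and the carry in only."""
    p = [((a >> k) ^ (b >> k)) & 1 for k in range(4)]
    g = [((a >> k) & (b >> k)) & 1 for k in range(4)]
    big_p, big_g = group_propagate_generate(p, g)
    return big_g | (big_p & c_in)


if __name__ == "__main__":
    print(block_carry_out(0b1111, 0b0000, 1))   # the block propagates the incoming carry
    print(block_carry_out(0b1000, 0b1000, 0))   # the block generates a carry by itself
```

Because P and G depend only on the operand bits, they are ready before the carry in arrives, and the carry then crosses the block through a single gate level.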
The sum generator is implemented using the propagate signals, $P_i$ and $\overline{P_i}$. Fig. 7.10(a) illustrates one possible circuit using a static CMOS implementation.
Another circuit, more compact and faster, is shown in Fig. 7.10(b). It uses transmission gates and needs only 6 transistors.
Many circuit techniques for high-speed carry look-ahead adders have been proposed. One of them uses a pseudo-NMOS-like style [1]. The adder was used in a multiplier and achieved high-speed static operation. However, it consumes a DC current and is not suitable for low-power applications.
Other CLA implementations that improve the carry path delay are based on the transmission gate and CPL families. In this section we present the one based on CPL. The TG version is left to the reader to design. Fig. 7.11 shows the block diagram of a 32-bit PMOS latch CPL carry look-ahead adder using 4-bit blocks. The carry generators (CLGs) of each 4-bit block generate the carries $C_{i+1}$ through $C_{i+4}$ in parallel from the carry in, $C_i$. The different $P_i$ and $G_i$ signals required by each 4-bit block are not shown, for clarity. When the carry $C_{i+4}$ is fed to the next 4-bit block, it uses a buffer to distribute this carry to the other CLGs and SGs. Therefore, the carry path is not significantly loaded. This results in fast operation. Fig. 7.12 shows the CPL implementation of the CLG of the fourth bit. This circuit is located in the critical path of the carry signal. It is compact and uses only NMOS pass transistors. P and G are the global propagate and generate signals, respectively. The fourth carry is generated from the carry in or the G signal through only one NMOS device. The P signal block is implemented using the AND/NAND CPL style. After every 4 CLG blocks of the critical path, the carry is buffered and restored using PMOS latch buffers. The PMOS latch restores the reduced high level to full swing to avoid any DC leakage current, as shown in Fig. 7.11. Fig. 7.13 shows the G signal block for the fourth-bit CLG as an example. The same circuit style can be used to generate the G signal for the third-bit, the second-bit, and the first-bit CLGs. In addition, the output inverter uses a PMOS latch to restore the swing. The PMOS latch circuit is incorporated only when dual-rail signals are available. For a single-ended signal, however, a feedback PMOS transistor is added to restore the full-rail high level, as in the case of the sum generator of Fig. 7.14.
in Chapter 3, simulations show that the optimal staging of a 32-bit CS adder using TGs is 4-4-7-9-8 at a 3.3 V power supply voltage. This implementation is regular and easy to lay out; however, it occupies a larger area than the RCA.
7.1.4 Conditional Sum Adders
In 1960 Sklansky considered the Conditional Sum Adder (CSA) as the fastest one, from a theoretical point of view [2, 3]. The concept behind this architecture is explained using the basic circuit of Fig. 7.16. This example is for a 4-bit conditional sum adder. It uses two types of cells: i) the conditional cell, and ii) the multiplexer. For each bit there is one conditional cell circuit. It computes two sums and two carries: $S^0$ and $C^0$ are calculated for a carry in of zero, and $S^1$ and $C^1$ are calculated for a carry in of one. The selection of the true sums is done with the first carry in and the previous carries. The true final carry out (see Fig. 7.16) is also selected.
A possible implementation of the conditional sum adder is shown in Fig. 7.17 for the case of a 4-bit adder [4]. The conditional cell can be implemented with the compact logic elements of Fig. 7.17(b). The different signals of the conditional cell are constructed using the following relations
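$$S_i^0 = A_i \cdot \overline{B_i} + \overline{A_i} \cdot B_i$$
$$S_i^1 = A_i \cdot B_i + \overline{A_i} \cdot \overline{B_i}$$
$$C_i^0 = A_i \cdot B_i$$
$$C_i^1 = A_i + B_i$$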
The adder mainly uses transmission gates for the multiplexers, as shown in Fig. 7.17(c). Note that the architecture uses the signals and their complements (dual-rail architecture) to avoid the use of inverters for the multiplexers. Otherwise the delay of the carry path would be penalized by the addition of inverters. To design an n-bit (e.g., 32-bit) adder, one possible technique for fast operation is to use staged blocks of constant or variable width. In this case, all the conditional sum blocks compute their respective double sums and double output carries in parallel. The true sum and carry out signals of each block are then selected by the carry in generated by the previous stage. The architecture at the block level uses a carry-select-like technique where the carry in of each block is the true carry out of the previous block. The optimal staging can be determined from circuit simulation. The architecture has two critical delay paths within a block. One is from the carry in to the carry out, which is affected by the layout routing since the carry in of a block is distributed to all the final multiplexers. The other critical delay path is the one from the LSB input of a block to the carry out. To reduce the power dissipation and the delay of the CSA adder, a CPL-like circuit style can be used. Fig. 7.18 shows the different circuit cells needed to implement such an adder. The conditional cell schematic is shown in Fig. 7.18(a). The output signals have a high-level voltage equal to $V_{DD} - V_T$. Fig. 7.18(b) shows the compact multiplexer using NMOS pass-transistors. The control signals of the multiplexers should have full-rail swing. When using these reduced-swing circuits in the adder, a full-rail swing can be generated wherever it is needed with the double-rail swing-restoring circuit of Fig. 7.18(c). The output inverter of the sum signal is shown in Fig. 7.18(d). The feedback PMOS transistor is needed to restore the high level when only a single rail exists.
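The conditional sum principle can be modelled functionally as follows. In the Python sketch below each conditional cell produces the pair of results for a carry in of zero and of one, and neighbouring blocks are then merged level by level, the carry of the lower block acting as the select input of a multiplexer on the upper block. The function names are ours, the operand width is assumed to be a power of two, and the model ignores the dual-rail and CPL circuit details.

```python
def conditional_cell(a, b):
    """Both possible outcomes for one bit: {carry_in: (sum, carry_out)}."""
    return {
        0: (a ^ b, a & b),          # S0, C0: carry in assumed to be 0
        1: (1 ^ a ^ b, a | b),      # S1, C1: carry in assumed to be 1
    }


def conditional_sum_add(a, b, n, c_in=0):
    """Conditional sum addition of two n-bit unsigned integers (n a power of two).
    Returns (n-bit sum, carry out)."""
    blocks = [conditional_cell((a >> i) & 1, (b >> i) & 1) for i in range(n)]
    width = 1
    while len(blocks) > 1:
        merged = []
        for k in range(0, len(blocks), 2):
            low, high = blocks[k], blocks[k + 1]
            entry = {}
            for assumed_carry in (0, 1):
                s_low, c_low = low[assumed_carry]
                s_high, c_high = high[c_low]      # multiplexer selected by c_low
                entry[assumed_carry] = (s_low | (s_high << width), c_high)
            merged.append(entry)
        blocks = merged
        width *= 2
    return blocks[0][c_in]                        # final selection by the true carry in


if __name__ == "__main__":
    print(conditional_sum_add(0b1101, 0b0111, 4))   # 13 + 7 = 20, i.e. sum 4 with carry out 1
```

The number of merging levels grows only with the logarithm of the word length, which is the reason this architecture is attractive from a theoretical point of view.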
The layout of such an adder is regular. Only three c& of the first. second and third bits have to be drawn. Fig. 7.19 illutratw the layout of a 4bit block 0.8 pm design rules.
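The block-level behaviour described above can be sketched in a few lines of Python. This is only a behavioural model under stated assumptions: the function name `conditional_sum_add` and the list of block widths are hypothetical, and nothing here models the circuit style or the delays.

```python
def conditional_sum_add(a, b, widths):
    """Behavioural model of a staged conditional sum adder.

    a, b   : unsigned operands
    widths : block widths from LSB to MSB (illustrative staging, an assumption)
    Returns (sum, carry_out).
    """
    carry = 0      # true carry in of the first block
    result = 0
    pos = 0
    for w in widths:
        mask = (1 << w) - 1
        a_blk = (a >> pos) & mask
        b_blk = (b >> pos) & mask
        # Every block computes both candidate results in parallel.
        with_cin0 = a_blk + b_blk        # sum and carry assuming carry in = 0
        with_cin1 = a_blk + b_blk + 1    # sum and carry assuming carry in = 1
        # The true carry out of the previous block selects one of them.
        chosen = with_cin1 if carry else with_cin0
        result |= (chosen & mask) << pos
        carry = chosen >> w
        pos += w
    return result, carry
```

A variable staging such as `[4, 4, 8, 16]` gives a 32-bit adder; which staging is best is a circuit-level question that this model cannot answer.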
7.1.5
The ripple adder has the smallest area compared to the other classes and the lowest power in many cases. So it should be limited to applications where the area and/or the power must be minimized, while the speed is not important. For fast adders, usually the CLA circuit is used; however, its power dissipation can be relatively high. The carry select adders are widely used as the optimum compromise between the high-speed operation of the CLAs and the small area of RCAs. The conditional sum adder, with variable block staging, combined with a carry-select-like style can result in the fastest adder if well optimized. The power dissipation of this adder can be comparable to, or maybe less than, that of the RCA because it uses a reduced internal swing and a relatively small transistor count if the CPL-like style is used. When considering all the criteria such as the power, the area and the speed, a tool can be developed to select the adder class which satisfies the specified requirements.

[Figure 7.17: Conditional sum adder architecture (*: MUXs).]
[Figure 7.18: Circuit cells of the CPL-like conditional sum adder.]
[Figure 7.19: Conditional sum adder layout.]
For wide adders, with an operand size of more than 64 bits, the different architectures can still be utilized. However, to optimize the speed and power of such a wide adder, several additional algorithms can be combined. Examples of wide adders can be found in [5, 6].
which have been used in VLSI. The reader can consult references [7, 8] for more details on array multiplication algorithms.
7.2.1
Braun Multiplier
The product P = P_{2n-1} ... P_1 P_0, which results from multiplying the multiplicand X by the multiplier Y, can be written in the following form

\[ P = \sum_{i=0}^{n-1} \sum_{j=0}^{n-1} (X_i Y_j) \, 2^{i+j} \]

Each of the partial product terms P_k = X_i Y_j is called a summand. Fig. 7.20(a) shows an example of 4 x 4 multiplication. The summands are generated in parallel with AND gates. Fig. 7.20(b) shows Braun's array multiplier [7]. Such an n x n multiplier requires n(n - 1) adders and n^2 AND gates. The multiplier can be implemented efficiently by arranging the array for a regular layout. Fig. 7.21 shows a regular 4 x 4 array implementation of the multiplier of Fig. 7.20 using three different cells. The first cell contains an AND gate [Fig. 7.21(b)]. The second cell, shown in Fig. 7.21(c), contains a full-adder and an AND gate. The routing lines are also illustrated in these cells. The last cell represents a full-adder composing the final carry propagate adder. The multiplier array uses what are called carry-save adders. The delay of such a multiplier depends on the delay of the full-adder cell and of the final adder in the last row. In the multiplier array, an adder with balanced carry and sum delays is desirable because sum and carry signals are both on the critical path. This is different from the case of a parallel adder, where the carry path should be optimized and sped up compared to the sum path. For large arrays, the speed and power of the full-adder are very important. The CPL-like styles discussed in Chapter 4 can result in reduced power dissipation and high-speed operation. The final adder in the last row can use the techniques presented in Section 7.1.
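As a quick check of the summand formulation, the following Python sketch forms every summand X_i Y_j with an AND operation and accumulates it with weight 2^(i+j). It assumes unsigned n-bit operands; the function names are hypothetical, and the model does not reproduce the carry-save array structure, only its arithmetic result and cell counts.

```python
def braun_multiply(x, y, n):
    """Unsigned n x n multiplication as a weighted sum of AND-gate summands."""
    product = 0
    for i in range(n):
        for j in range(n):
            summand = ((x >> i) & 1) & ((y >> j) & 1)   # one AND gate per summand
            product += summand << (i + j)
    return product


def braun_cell_counts(n):
    """Cell counts quoted in the text for an n x n Braun array."""
    return {"and_gates": n * n, "adders": n * (n - 1)}
```

For n = 4 this gives 16 AND gates and 12 adders, matching the counts stated above.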
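[Figure 7.20: (a) Example of 4 x 4 multiplication of X = X3 X2 X1 X0 by Y = Y3 Y2 Y1 Y0; (b) Braun's array multiplier.]
[Figure 7.21: Regular 4 x 4 array implementation of the multiplier and its cells.]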
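\[ X = -X_{n-1} 2^{n-1} + \sum_{i=0}^{n-2} X_i 2^i \tag{7.22} \]

\[ Y = -Y_{n-1} 2^{n-1} + \sum_{i=0}^{n-2} Y_i 2^i \tag{7.23} \]

\[ P = XY = X_{n-1} Y_{n-1} 2^{2n-2} + \sum_{i=0}^{n-2} \sum_{j=0}^{n-2} X_i Y_j 2^{i+j} - X_{n-1} \sum_{i=0}^{n-2} Y_i 2^{n+i-1} - Y_{n-1} \sum_{i=0}^{n-2} X_i 2^{n+i-1} \tag{7.24} \]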
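In order to avoid the use of subtractor cells and use only adders, the negative terms should be transformed. So

\[ -X_{n-1} \sum_{i=0}^{n-2} Y_i 2^{n+i-1} = X_{n-1} \left( -2^{2n-2} + 2^{n-1} + \sum_{i=0}^{n-2} \bar{Y}_i 2^{n+i-1} \right) \tag{7.25} \]

Applying the same transformation to the second negative term, the product becomes

\[ P = -2^{2n-1} + \left( \bar{X}_{n-1} + \bar{Y}_{n-1} + X_{n-1} Y_{n-1} \right) 2^{2n-2} + \sum_{i=0}^{n-2} \sum_{j=0}^{n-2} X_i Y_j 2^{i+j} + \left( X_{n-1} + Y_{n-1} \right) 2^{n-1} + X_{n-1} \sum_{i=0}^{n-2} \bar{Y}_i 2^{n+i-1} + Y_{n-1} \sum_{i=0}^{n-2} \bar{X}_i 2^{n+i-1} \]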
Using the above relation, an n x n multiplier using only adders can be implemented. The schematic circuit diagram of a 4 x 4 two's complement multiplier based on Baugh-Wooley's algorithm is shown in Fig. 7.22. The different cells composing the array are also shown. In this scheme n(n - 1) + 3 full-adders are required. So for the case of n = 4 the array needs 15 adders. When n is relatively large, the final adder stage in the multiplier array can be implemented with the techniques discussed in Section 7.1. This type of multiplier is suitable for applications where operands with less than 16 bits are to be processed. Applications for such a multiplier are, for example, digital filters where small operands are used (e.g., 6, 8 and 12 bits). For low-power and high-speed operation, the array uses a CPL-like adder as mentioned previously in Section 7.2.1, while a CSA scheme, combined with carry select, can be utilized in the final adder. For operands equal to or greater than 16 bits, the Baugh-Wooley scheme becomes too area-consuming and slow. Hence, techniques to reduce the size of the array, while maintaining the regularity, are required.
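The adder-only form can be checked numerically. The sketch below is an illustration under assumptions, not the array of Fig. 7.22: it applies the transformation of (7.25) to both negative terms of (7.24), so that apart from the constant -2^(2n-1) every contribution is a non-negative summand. The helper name `baugh_wooley_product` is hypothetical.

```python
def baugh_wooley_product(x, y, n):
    """Two's complement n x n product built only from non-negative summands
    plus the constant -2**(2n-1). x and y are signed n-bit integers."""
    mask = (1 << n) - 1
    X = [((x & mask) >> i) & 1 for i in range(n)]
    Y = [((y & mask) >> i) & 1 for i in range(n)]
    xs, ys = X[n - 1], Y[n - 1]                      # sign bits

    p = -(1 << (2 * n - 1))                           # constant term
    p += ((1 - xs) + (1 - ys) + xs * ys) << (2 * n - 2)
    p += sum((X[i] & Y[j]) << (i + j)
             for i in range(n - 1) for j in range(n - 1))
    p += (xs + ys) << (n - 1)
    # Complemented magnitude bits replace the subtracted terms, as in (7.25).
    p += xs * sum((1 - Y[i]) << (n + i - 1) for i in range(n - 1))
    p += ys * sum((1 - X[i]) << (n + i - 1) for i in range(n - 1))
    return p
```

An exhaustive comparison against ordinary multiplication for n = 4 is a cheap way to confirm that no sign term was dropped in the transformation.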
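\[ Y = -Y_{n-1} 2^{n-1} + \sum_{i=0}^{n-2} Y_i 2^i \]

It can be rewritten as follows

\[ Y = \sum_{i=0}^{n/2-1} \left( Y_{2i-1} + Y_{2i} - 2 Y_{2i+1} \right) 2^{2i}, \qquad Y_{-1} = 0 \tag{7.27} \]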
In this equation, the terms in brackets have values in the set {-2, -1, 0, +1, +2}. The recoding of Y, using the modified Booth algorithm, generates another number with the following five signed digits: -2, -1, 0, +1, +2. Each recoded digit in the multiplier performs a certain operation on the multiplicand, X, as follows:

Y_{2i+1}  Y_{2i}  Y_{2i-1}   Recoded digit   Operation on X
   0        0        0            0             0 x X
   0        0        1           +1            +1 x X
   0        1        0           +1            +1 x X
   0        1        1           +2            +2 x X
   1        0        0           -2            -2 x X
   1        0        1           -1            -1 x X
   1        1        0           -1            -1 x X
   1        1        1            0             0 x X
So the bits of the multiplier are partitioned into overlapping groups of 3 bits, and each group permits generation of a certain partial product. The five possible multiples of the multiplicand are relatively easy to generate, following the explanation given in Table 7.2.
The generated partial product is related to the multiplicand for each recoded digit by the relationships presented in Table 7.3. PP_i is bit i of the partial product and PP_n is the sign bit of the partial product, with PP_n = PP_{n-1} when no shifting of the partial product is performed. Note that the partial product is represented on n + 1 bits.
Table 7.2

Recoded digit   Operation on X
     0          Add 0 to the partial product
    +1          Add X to the partial product
    +2          Shift left X one position and add it to the partial product
    -1          Add two's complement of X to the partial product
    -2          Take two's complement of X and shift left one position

Table 7.3

Recoded digit   Operation on X                                    Added to LSB
     0          All bits zero                                          0
    +1          X                                                      0
    +2          X shifted left one position                            0
    -1          Complement of X                                        1
    -2          Complement of X shifted left one position              1
To clarify this algorithm, an example is presented in Fig. 7.23. Let X = 10010101 and Y = 01101001. The recoded digits of Y are

Y       =  01  10  10  01
Recoded =  +2  -1  -2  +1

The bits are grouped into 3-bit groups overlapped by one bit, and a bit with a value of zero is added on the right side of Y as Y_{-1}. So the multiplication of two 8-bit numbers generates only 4 partial products. The number of partial products is thus reduced by half. The partial product in this example is represented on 9 bits. For a correct addition of the partial products, the signs are extended as shown in Fig. 7.23. The shape of the multiplier is then trapezoidal due to the sign extension.
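The recoding step can be sketched as follows. This is a minimal Python illustration, assuming an even word length n and a two's complement multiplier; the function name `booth_recode` is hypothetical. Each digit is computed as Y_{2i-1} + Y_{2i} - 2 Y_{2i+1}, with Y_{-1} = 0.

```python
def booth_recode(y, n):
    """Modified Booth (radix-4) recoding of a signed n-bit multiplier.

    Returns the recoded digits, least significant digit first.
    Assumes n is even.
    """
    bits = y & ((1 << n) - 1)      # two's complement bit pattern of y
    digits = []
    prev = 0                        # Y_{-1} = 0
    for i in range(0, n, 2):
        low = (bits >> i) & 1           # Y_{2i}
        high = (bits >> (i + 1)) & 1    # Y_{2i+1}
        digits.append(prev + low - 2 * high)
        prev = high                     # overlap bit for the next group
    return digits
```

For Y = 01101001 the function returns `[1, -2, -1, 2]`, which read from the most significant digit is +2, -1, -2, +1, the recoding used in the example. Weighting digit j by 4^j recovers Y.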
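[Figure 7.23: Example of 8 x 8 multiplication using the modified Booth algorithm with sign extension, X = 10010101 (-107), Y = 01101001 (+105). Recoded bit groups 010, 100, 101, 011 give the operations +1, -2, -1, +2.]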
In order to make the array rectangular, and hence more regular for VLSI implementation, the problem of sign extension must be addressed. This problem is more crucial when the operands are wide, where each partial product must be sign-extended to the length of the product. In this section we will not deal with all the techniques that solve the problem of the sign extension, but we will discuss one technique, which is shown in Fig. 7.24 for the example of Fig. 7.23. The basic idea is to use two extra bits in the partial product. For the first partial product, the two additional bits, PP_{n+1} and PP_{n+2}, are equal to the sign bit of the partial product

\[ PP_{n+2} = PP_{n+1} = PP_n \tag{7.29} \]

For the second partial product, if the first partial product was positive, then the two additional bits for this second partial product are given by the expression above; otherwise we have two cases

\[ PP_{n+2} = PP_{n+1} = 1 \quad \text{if } PP_n = 0 \tag{7.30} \]

and

\[ PP_{n+2} = \overline{PP_{n+1}} = 1 \quad \text{if } PP_n = 1 \tag{7.31} \]
So it is more interesting to use a third bit, F, as a flag to indicate whether there is, from the previous partial product, a negative sign bit to be propagated. F_1 is the flag generated by the first partial product to the next one. For the example of Fig. 7.24, F_0 = 0 (no PP before the first one), and F_1 = F_2 = F_3 = 1. So from the first partial product there is a sign propagation to all the others. This flag is given by

\[ F_{j+1} = F_j + PP_{n,j} \tag{7.32} \]

where PP_{n,j} is the sign bit of the jth partial product.

[Figure 7.24: The previous example of Figure 7.23 with simplified sign extension, X = 10010101 (-107), Y = 01101001 (+105), P = -11235. The additional bits are generated from the previous sign and the present sign.]
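The two-extra-bits scheme can be exercised with a short behavioural model. The sketch below is an assumption-laden illustration: the function names are hypothetical, the "+" in the flag relation is taken to be a logical OR, and each partial product is held on n + 1 bits and shifted two positions per recoded digit. The result is taken modulo 2^(2n), as a 2n-bit final adder would produce it.

```python
def booth_digits(y, n):
    """Radix-4 Booth digits of a signed n-bit value, LSB digit first (n even)."""
    bits, prev, out = y & ((1 << n) - 1), 0, []
    for i in range(0, n, 2):
        low, high = (bits >> i) & 1, (bits >> (i + 1)) & 1
        out.append(prev + low - 2 * high)
        prev = high
    return out


def multiply_with_flag_extension(x, y, n):
    """Sum (n+1)-bit partial products, each extended by only two extra bits.

    Returns the 2n-bit two's complement pattern of x * y.
    """
    pp_mask = (1 << (n + 1)) - 1
    total, flag = 0, 0                      # F_0 = 0
    for j, digit in enumerate(booth_digits(y, n)):
        pp = (digit * x) & pp_mask          # partial product on n + 1 bits
        sign = (pp >> n) & 1                # PP_n
        if flag == 0:
            ext1 = ext2 = sign              # plain sign extension, as in (7.29)
        else:
            ext2 = 1                        # PP_{n+2} = 1, as in (7.30) and (7.31)
            ext1 = 1 - sign                 # PP_{n+1} = 1 if PP_n = 0, else 0
        total += (pp | (ext1 << (n + 1)) | (ext2 << (n + 2))) << (2 * j)
        flag |= sign                        # flag update, as in (7.32), with OR assumed
    return total & ((1 << (2 * n)) - 1)
```

For X = -107 and Y = +105 with n = 8 the function returns the 16-bit pattern of -11235. The corner case X = -2^(n-1) with a recoded digit of -2 does not fit on n + 1 bits and is outside what this sketch covers.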
Let us now see the implementation of the n x n modified Booth multiplier. Fig. 7.25 shows the block diagram of the multiplier. It also gives an idea of the floorplan of this subsystem. It is composed of the following blocks:

• The multiplier array, containing partial product generators and 1-bit adders;
• The Booth encoder and the sign extension bits (PP_{n+2}, PP_{n+1}, F). The Booth encoder generates the five signals (0, +1x, +2x, -1x, and -2x) for each group of 3 bits of Y; and
• The final stage adder, which performs a 2n-bit addition.
[Figure 7.25: Block diagram of the n x n multiplier using the modified Booth algorithm.]

For the sake of simplicity, we treat the case of a 6 x 6 multiplier. All the cells described in this example are the basic cells of any multiplier size. Fig. 7.26 shows the implementation of such a multiplier. Four types of cells are used in addition to the final adder. These cells are:
• The ADD cell, which generates 0 or 1 [see Table 7.3]. The schematic circuit of this cell is shown in Fig. 7.27(a). Two implementations are possible: one using pass-transistors controlled by the five signals defining the recoded digit code, and the other one an AND2 gate of the two signals -1x and -2x.
• The partial product MUX (PP-MUX), which generates the partial product. Fig. 7.27(b) shows the schematic of the PP-MUX using CPL type logic. The feedback PMOS, Pf, in this figure or in the one of Fig. [...]
• The Booth Encoder (BE). It generates the five control signals 0x, +1x, +2x, -1x, and -2x from a group of three bits of the multiplier Y. Fig. 7.28 shows the schematic of the different circuits involved in the BE block. The additional circuits of the two bits PP_{n+1,j} and PP_{n+2,j} of the jth PP are also illustrated. F_j and F_{j+1} are the previous and the next flags, respectively. PP_{n,j} is the sign bit of the jth PP. Note that F_0 is 0.

[Figure 7.27: Multiplier cells: (a) ADD; (b) PP-MUX; (c) PP-FA.]
The Booth multiplier exhibits a lot of unnecessary glitches. The main reason for the glitches is the race condition between the multiplicand and the multiplier caused by the Booth encoder. The power dissipation associated with the glitches can be an important portion of the total power, and hence it needs to be reduced by signal synchronization techniques.
7.2.4
Wallace Tree
By applying the Booth algorithm, the number of partial products is halved. However, for large multipliers, 32-bit and over, the number of partial products is 16 or more. In this case, the performance of the modified Booth algorithm is limited. One technique to improve the performance of these multipliers is to adopt the Wallace tree using 4-2 compressors. A 4-2 compressor accepts 4 numbers and a carry in, and sums them to produce 2 numbers and a carry out (really it is a 5-3 compressor). Fig. 7.29(a) shows an example of such a tree on the partial products of an unsigned 8 x 8 multiplier. Eight partial products are produced. Using 4-2 compressors, two levels of addition (stages) are needed. The final two summands are added using a fast 16-bit adder. Some zeros are added to the array. This example shows that the bits which are not used in the first stage (level) jump to the next one, to be combined with the ones produced by the compressors. Fig. 7.29(b) shows the architecture of the 8 x 8 multiplier. For the first stage of the tree, two blocks, A and B, are required. The blocks A and B of compressors group the first and last four partial products, respectively.
[Figure 7.28: Booth encoder circuits and sign extension logic.]
Fig. 7.30 shows how the 4-2 compressor can be implemented by 2 full-adders or by custom static CMOS logic [9]. Four bits, I1, ..., I4, are added to produce 2 sums, S and C. Hence, 4 bits of the partial products are compressed to produce two new partial products. The compressor is implemented, using carry-save adder construction, by two cascaded full-adders as shown in Fig. 7.30(b). Notice that carry-out is never generated by carry-in. Fig. 7.31 shows the 4-2 compressor circuit using a compact structure of multiplexers [10]. This structure is faster than the static complementary version. Fig. 7.32 shows the interconnection of the 4-2 compressors for block A of the example of Fig. 7.29. Each carry-out is connected to the next carry-in. Since these signals are independent, the carry is not propagated through the row.

[Figure 7.33: Architecture of the 32 x 32 multiplier, with X<31:0> input, first stage blocks A to D, second stage blocks E and F, third stage block G, and output P. PPG: generator of partial products.]
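The two-full-adder construction can be written down directly. In this hedged Python sketch the first full-adder takes I1, I2 and I3 and produces the carry-out, while the second combines its sum with I4 and the carry-in; the function names and the exact input assignment are assumptions, since the figure is not reproduced here.

```python
def full_adder(a, b, c):
    """1-bit full adder: returns (sum, carry)."""
    return a ^ b ^ c, (a & b) | (a & c) | (b & c)


def compressor_4_2(i1, i2, i3, i4, carry_in):
    """4-2 compressor built from two cascaded full-adders.

    Returns (S, C, carry_out) with
    i1 + i2 + i3 + i4 + carry_in == S + 2 * (C + carry_out).
    """
    partial_sum, carry_out = full_adder(i1, i2, i3)   # carry_out ignores carry_in
    s, c = full_adder(partial_sum, i4, carry_in)
    return s, c, carry_out
```

Because `carry_out` depends only on I1, I2 and I3, chaining compressors along a row does not create a ripple path, which is the independence property noted in the text.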
To further enhance the Wallace tree multiplier, the modified Booth algorithm can be used to reduce the number of partial products by half in a carry-save adder array. One example of such a combined construction is the architecture of the 32 x 32 multiplier shown in Fig. 7.33. It consists of four functions: the Booth encoder, the partial product generator, the compressor blocks, and the final 64-bit adder. The Wallace tree is constructed with 3 stages (levels). The first stage has 4 blocks (A to D), with each block summing up 4 partial products among 16. The second stage sums up the 8 newly generated partial products from the first stage. Hence, two blocks are needed, E and F. Finally, block G of the third stage of the tree generates two new partial products for the final adder. This architecture exhibits some irregularities in the layout since it has a complicated interconnection scheme. Hence, the interconnection wires affect the speed and power dissipation of the multiplier.
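The number of compressor stages follows from the fact that each block of 4-2 compressors turns four rows into two. The sketch below counts stages under a simple grouping policy that is an assumption of this example: rows are grouped by four, three leftover rows are padded with a zero row to form one more group, and one or two leftover rows jump to the next stage.

```python
def compressor_stages(rows):
    """Number of 4-2 compressor stages needed to reduce `rows` partial
    products to the two operands of the final adder."""
    stages = 0
    while rows > 2:
        groups, rest = divmod(rows, 4)
        if rest == 3:                 # pad three leftover rows with a zero row
            groups, rest = groups + 1, 0
        rows = 2 * groups + rest      # each group of four becomes two rows
        stages += 1
    return stages
```

With this policy, 8 partial products need 2 stages and 16 partial products need 3 stages, in agreement with the 8 x 8 and 32 x 32 examples in the text.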
7.2.5
Multipliers Comparison
The basic array multipliers, like the Baugh-Wooley scheme, consume low power and have relatively good performance. However, their use is limited to operands with less than 16 bits (e.g., 8 bits). For operands of 16 bits and over, the modified Booth algorithm reduces the number of partial products by half. Therefore, the delay of the multiplier is reduced. Its power dissipation is comparable to that of the Baugh-Wooley multiplier due to the circuitry overhead in the Booth algorithm. However, circuit techniques can enhance this multiplier to give it low-power characteristics. The fastest multipliers adopt the Wallace tree with modified Booth encoding. A Wallace tree would lead, in general, to larger power dissipation and area, due to the interconnect wires. Hence, it is not recommended for low-power applications. Dynamic multipliers are not discussed in this section since they introduce problems of control and timing. Hence extra area and power dissipation are added to the design.
7.3 DATAPATH
A VLSI chip can be partitioned into two parts: the data path (or execution unit) and the control unit. Data paths are often used in digital signal processors, microprocessors and application specific ICs (ASICs). The data path consists of a combination of an Arithmetic Logic Unit (ALU), a shifter, a register file, I/O ports, a multiplier, an adder, a magnitude comparator, data busses, etc. It performs many operations on the data in the register file, to which the results are sent back. The data busses are the communication means for the data transfer between the different units of the data path, such as the ALU, the shifter, and the register file. These busses have a heavy load (a few pF). In CMOS design, dynamic techniques are used to allow fast operation. One way to reduce the power dissipation due to the precharging transistors is to use static busses [11].
[Figure 7.34: Arithmetic Logic Unit (ALU), with latches A, B and C, the op code input, and Bus-B.]
The control unit delivers the instructions to the data path. These instructions determine the operations that the data path has to perform. The control unit can be implemented using random logic, micro-ROM (Read Only Memory), PLA (Programmable Logic Array) or a combination of these three implementations. Other macrocells, such as a TLB (Translation Lookaside Buffer), cache memory, etc., can be added to the data path and the control unit. In this section, several blocks of a data path are discussed.
operation is due mainly to the carry propagation along the width of the ALU. There are many types of ALU, depending on the number of operations to be performed. Fig. 7.35 shows the block diagram of a 1-bit slice of an ALU. It has exactly the same structure as the adder, except that the P and G blocks are programmable. Fig. 7.35(a) shows the P block with 4 control signals (OP1 ... OP4). The feedback PMOS transistor, Pf, permits restoration of the high level from VDD - VTn to VDD. Hence the DC current of the first inverter, due to the reduced high level, is eliminated. Fig. 7.35(b) shows the G block with 4 op code signals (OP5 ... OP8). The P and G blocks use the pass-transistor style. The techniques discussed in Section 7.1 can be applied to achieve low-power and fast operation. The carry and result (sum) blocks are shown in Fig. 7.35(c) and (d), respectively. Table 7.4 summarizes some of the functions that can be implemented with these blocks. Several other operations can be realized with this ALU.
Table 7.4 Examples of ALU operations

Operation        LSB C_in   G function
Add w. carry        0       G = A · B
Subtraction         1       G = A · B̄
Bit-wise AND        0       G = 0
Bit-wise OR         0       G = 0
Not A               0       G = 0 (P = Ā)

Op codes (OP1 ... OP8), in the order listed: 10011101, 10011101, 01110000, 00010000, 10100000.
To implement an n-bit ALU, all the techniques discussed for carry speed-up in adders can be applied. Drivers are needed to distribute the op code signals for an n-bit ALU. For low-power design, the busses which communicate with the ALU are in general not precharged, as they are in many data paths.

[Figure 7.35: 1-bit slice of the ALU.]
7.3.2
The Absolute Value Calculator (AVC) is, in general, used in data paths of video processors to compare the data of two pictures. Fig. 7.36 shows the architecture of the AVC. This parallel circuit performs two subtractions simultaneously, A - B and B - A. Using the most significant bit of these two operations, the MUX circuit selects the positive one. The output then gives the absolute value |A - B|.
To reduce the area of an n-bit AVC, the logic of the two n-bit adders required can be reduced by merging the common functions of both operations. Also, the techniques described in Section 7.1 for n-bit addition should be used.
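The select-the-positive-result idea can be modelled as below. This Python sketch assumes unsigned n-bit inputs and computes both differences on n + 1 bits so that the most significant bit acts as a sign; the function name `absolute_difference` and the word-length choice are assumptions, not details taken from Fig. 7.36.

```python
def absolute_difference(a, b, n):
    """|a - b| for unsigned n-bit inputs, using two parallel subtractions
    and a multiplexer controlled by the sign (MSB) of A - B."""
    mask = (1 << (n + 1)) - 1
    a_minus_b = (a - b) & mask        # first subtractor
    b_minus_a = (b - a) & mask        # second subtractor, operating in parallel
    negative = (a_minus_b >> n) & 1   # MSB of A - B
    selected = b_minus_a if negative else a_minus_b   # MUX picks the positive one
    return selected & ((1 << n) - 1)
```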
7.3.3
Comparator
A magnitude comparator is used in many DSP applications. It permits comparison of the magnitudes of two numbers A and B by indicating whether A < B, A = B, or A > B. Fig. 7.37(a) shows an example of a two-bit comparator, which requires two types of cells, C1 and C2. The cell C1 is constructed with the circuit of Fig. 7.37(b). Table 7.5 shows the truth table for this cell.

Table 7.5 Truth table for cell C1

Consider how a 2-bit comparator works. When A1 < B1, then C1 = D1 = 0, and A1A0 < B1B0 regardless of the magnitudes of the lower bits. Similarly, for A1 > B1, then C1 = 1, D1 = 0, and A1A0 > B1B0 regardless of the magnitudes of the lower bits. When A1 = B1, the magnitudes of the two 2-bit numbers depend on A0 and B0. In this situation, there are three different cases:

1. A1A0 < B1B0 for A0 < B0; we can set E0 = F0 = 0.

These relations can easily be used to implement the second cell, C2, of the comparator, as shown in Fig. 7.37(c).
This technique, for the two-bit comparator, can be extended to an n-bit comparator. It can be constructed by using a parallel tree of the cells C1 and C2. A 4-bit comparator could, for example, be constructed with two 2-bit comparators connected in parallel, with the 4 generated E and F signals at their outputs fed to an added C2 cell. In this architecture, the glitching is reduced by equalizing the delay paths of each cell.
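A tree of compare cells can be sketched as follows. The encoding used here is partly an assumption: the text gives (0, 0) for "less than" and (1, 0) for "greater than", and this sketch additionally assumes that a second output equal to 1 signals equality. The cell functions `cell_c1` and `cell_c2` are hypothetical models, not the circuits of Fig. 7.37.

```python
def cell_c1(a_bit, b_bit):
    """1-bit compare cell. Returns (C, D): D = 1 means equal (assumed
    encoding); with D = 0, C = 1 means A > B and C = 0 means A < B."""
    if a_bit == b_bit:
        return 0, 1
    return (1, 0) if a_bit > b_bit else (0, 0)


def cell_c2(upper, lower):
    """Merge cell: the upper result decides unless it signals equality."""
    return lower if upper[1] == 1 else upper


def compare(a, b, n):
    """n-bit magnitude comparison using a parallel tree of C1 and C2 cells."""
    level = [cell_c1((a >> i) & 1, (b >> i) & 1) for i in range(n - 1, -1, -1)]
    while len(level) > 1:
        merged = [cell_c2(level[k], level[k + 1]) for k in range(0, len(level) - 1, 2)]
        if len(level) % 2:            # an unpaired lowest result jumps a level
            merged.append(level[-1])
        level = merged
    e, f = level[0]
    return "A=B" if f else ("A>B" if e else "A<B")
```

The tree depth grows with log2(n), and every path passes through the same number of cells when n is a power of two, which is the delay equalization the text credits with reducing glitches.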
7.3.4
Shifter
Another macrocell of the data path is the shifter. It performs shift or rotate operations on the data. If the number of bits to be shifted is arbitrary, then a barrel shifter is used [12, 13]. Fig. 7.38 shows the CMOS implementation of a 4-bit barrel shifter. NMOS transistors are used as switches in the array. The input bus (D0 - D6) can be connected to the output bus (R0 - R3) via the pass transistors. The control signals S0 - S3 select the pass transistors to be switched. These signals determine the amount of shift, and they are generated by a 2-bit decoder. Since the outputs have a high level of VDD - VT, due to the pass transistors, the output buffer uses a feedback PMOS device, Pf, to restore the high level to VDD. This eliminates any DC current in the first inverter of the buffer.
Table 7.6 shows the values of the output bus as a function of the input data. Depending on the values of D<6:0>, several shift operations can be performed. For example, if D<6:4> = 0 and D<3:0> is the 4-bit input data, then a logical shift is realized. However, if D<6:4> = 1 and D<3:0> is the input data, then an arithmetic shift operation is performed.
Table 7.6
The barrel shifter is not a critical unit for the delay. Low-power operation is achieved by using a static implementation. This shifter can be implemented with transmission gates, in which case the feedback PMOS devices are not required. However, for low power, the use of an NMOS array is more efficient. The feedback PMOS should be sized to the minimum.
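The role of the seven input lines can be illustrated with a small model. It assumes that with select line S_k active each output R_i is connected to D_(i+k), which is one plausible wiring for a 7-input, 4-output array; the function names are hypothetical and Table 7.6 is not reproduced here.

```python
def barrel_shift(d, k):
    """4-bit barrel shifter array: with select line S_k active (k = 0..3),
    output R_i is connected to input D_(i+k). d is the 7-bit bus D<6:0>."""
    return (d >> k) & 0xF


def logical_shift_right(x, k):
    """D<6:4> = 0: zeros are shifted in from the left."""
    return barrel_shift(x & 0xF, k)


def arithmetic_shift_right(x, k):
    """D<6:4> set to the sign bit of the 4-bit data: the sign is replicated."""
    sign = (x >> 3) & 1
    upper = 0b111 if sign else 0b000
    return barrel_shift((upper << 4) | (x & 0xF), k)
```

For the input 1010, a shift by one gives 0101 in the logical case and 1101 in the arithmetic case.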
7.3.5
Register File
A register file is a set of registers which store data. It consists of a small array of static memory cells. Register files are used by microprocessors and DSPs, and they permit multiple read and write ports [14, 15, 16, 17]. A typical array is 32 registers of 32 bits. For example, an ALU needs two pieces of data from the register file. The array then has a dual-read single-write architecture.
Fig. 7.39 shows the schematic of the single-ended memory cell with 2 read ports and 1 write port (2R-1W). The read ports are the read bit-lines BL_R1 and BL_R2. The memory cell, composed of two cross-coupled inverters I1 and I2, is addressed by two read word-line signals, WL_R1 and WL_R2. The NMOS transistor N1 is controlled by the Write Enable (WE) signal. N1 is connected serially to the write access transistor N2. The transistor N2 is controlled by the write word-line (WL_W) signal. The transistor N1 isolates the stored data from the write bit-line (BL_W). To write the data into the storage node A from the write bit-line, the inverters I1 and I2 should be sized carefully. The β ratio* of the inverter I1 should be larger than 1 (e.g., 5) to set the threshold voltage of I1 to a low level. This is due to the fact that N1 and N2 transfer a weak high level (only VDD - VTn). Moreover, to ensure a correct write operation, the feedback inverter I2 should be weak so that the access transistors N1 and N2 can change the state of node A. For example, the NMOS and PMOS of I2 should be of minimum size, except that the length of the NMOS is twice the minimum. Also, the access transistors should have a higher β compared to the transistors of I2. For a given technology, the sizes should be determined by circuit simulation for a correct write operation. The inverter I3 is a buffer for the storage node.

*The definition of β is given in Chapter 4.

[Figure 7.39: Three-port (2R-1W) register file cell, with bit-lines BL_W, BL_R1 and BL_R2, word-lines WL_W, WL_R1 and WL_R2, and WE (Write Enable).]
A pair of three-port memory cells is shown in Fig. 7.40. This structure has a shared access transistor N2 and write bit-line, BL_W. To read and write the memory cell, the simplified schematic of Fig. 7.41 is used. This schematic uses the column multiplexing scheme. For low power, the register file uses static design and avoids the use of the conventional sense amplifier for bit-line sensing. The sense amplifier consumes DC power. For a three-port register file, two read row decoders and one write row decoder are required. Also, Write Enable (WE) and column addresses are needed to produce the column write enable for writing the data to the specified storage node. For fast operation, AND gates with a fan-in of 5 inputs can be used.
During the read operation, if for example N3 is asserted, then the data is put on the bit-line, BL_R1. The bit-line is selected through the pass-transistor N1. The data is then sensed by the inverter I1 in Fig. 7.41. During this period, the read enable signal, RE, is asserted, N2 is OFF and only the feedback PMOS Pf is activated when a one (VDD - VTn) is on the data-line. In this situation, the feedback PMOS charges up the data-line to VDD. Also, the DC current, which can be generated due to the reduced high level on the data-line, is completely eliminated. The β ratio of the inverter I1 should be higher than one (e.g., 5) to achieve a symmetrical read access time for a zero and a one. When RE = 0, the data-lines are isolated from the bit-lines and the NMOS transistor N2 is ON. Therefore, the latch formed by the pair of inverters I1 and I2 latches the old data. The operation of such a register file is fully static and does not dissipate any static power in any mode of operation. Furthermore, the read and write operations are asynchronous. This type of register file is suitable for low-power applications.

[Figure 7.40: Pair of three-port (2R-1W) memory cells, with bit-lines BL_W, BL_R1A, BL_R1B, BL_R2A and BL_R2B and write enables WE_1 and WE_2.]
7.4
REGULAR STRUCTURES
In this section we examine the design of large regular structures such as Programmable Logic Arrays (PLAs), Read Only Memories (ROMs) and Content Addressable Memories (CAMs). The ROMs and PLAs are not only used to implement controllers in a regular manner but can also be applied to signal processing. RAMs are treated separately in Chapter 6. These large structures are usually dynamic circuits for fast operation. These dynamic circuits can be shut down with a power management unit for power savings. If, for example, the clock is turned OFF, all dynamic circuits go into a precharge mode with all PMOS precharge devices ON.

[Figure 7.41: Simplified read and write schematic of the register file, including the write decoder.]
in finite-state machines,
PLAs have a regular architecture divided mainly into two planes, as shown in Fig. 7.42. These planes perform specific functions such as OR and AND. CMOS PLAs can be implemented in both static and dynamic styles. The style is chosen depending on the timing strategy in the chip. Other factors such as speed, power dissipation, and the allowed area play an important role in the choice of the PLA design style. A CMOS PLA example, using a pseudo-NMOS-like style, is shown in Fig. 7.43. The output OR functions are realized with NOR gates. From Fig. 7.43(a), the product terms are given by Eqs. (7.33) to (7.36). Each one is the NOR of its input literals, which by De Morgan's law equals the AND of their complements, for example

\[ P = \overline{A + \bar{C}} = \bar{A} \cdot C \tag{7.36} \]

The buffers are used when the load on the bit-line is large. They consist in general of two inverter stages. The OR plane is in principle similar to the AND plane [Fig. 7.43(b)]. From Fig. 7.43(b), the output X is the sum of three of the product terms (7.37) and the output Y is the sum of two of the product terms (7.38).
For this pseudo-NMOS PLA, the NOR-NOR logic gate style is used. This example shows that the PLA organization is useful for implementing Sum Of Products (SOP) functions. Hence any SOP function can be realized by programming the array with the AND and OR cells. Any type of latch or register can be used at the input and output. This design style of PLAs has a small area and is simple to implement. However, it is not suitable for low-power applications due to the high DC power dissipation, particularly when the PLA is large. Moreover, it has a speed problem.

[Figure 7.42: AND-OR PLA architecture, with inputs and outputs.]
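The NOR-NOR organization can be checked with a small functional model. The following Python sketch is an illustration under assumptions: the programming format (`and_plane` as a list of literal dictionaries, `or_plane` as lists of product-term indices) and the example function are hypothetical and are not the PLA of Fig. 7.43.

```python
def nor(*signals):
    """NOR gate: output is 1 only when every input is 0."""
    return 0 if any(signals) else 1


def pla_nor_nor(inputs, and_plane, or_plane):
    """Evaluate a NOR-NOR PLA.

    inputs    : dict of input name -> 0/1
    and_plane : list of product terms; each maps an input name to the value
                it must have for the term to be 1
    or_plane  : dict of output name -> list of product-term indices
    """
    products = []
    for term in and_plane:
        # A NOR of complemented literals realizes the AND of the literals.
        literals = [(1 - inputs[name]) if wanted else inputs[name]
                    for name, wanted in term.items()]
        products.append(nor(*literals))
    outputs = {}
    for name, indices in or_plane.items():
        nor_out = nor(*[products[i] for i in indices])
        outputs[name] = 1 - nor_out     # output inverter turns NOR into OR
    return outputs
```

As a usage example, programming `and_plane = [{"A": 1, "B": 1}, {"A": 0, "C": 1}]` and `or_plane = {"X": [0, 1], "Y": [0]}` realizes X = A·B + Ā·C and Y = A·B, with the first product term shared by both outputs.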
In dynamic CMOS style, the circuit shown in Fig. 7.44 can be used. It is a self-timed PLA, where the AND and OR planes are both realized using a precharged NOR configuration. In this structure, only a single clock phase is needed. When the clock, clk, is high, the bit-lines are precharged in both planes. The NMOS transistors NA and NO are OFF, guaranteeing that there is no path to ground. Tracking lines in both planes are used to generate a delayed clock for the OR plane. When the clock is low, the precharge PMOS transistors in the AND plane turn OFF, NA turns ON and the products are evaluated. The tracking lines ensure that NO turns ON only when the inputs to the OR plane are stable. Otherwise the outputs can be spuriously discharged. This PLA is fast, but it has a lot of wasted dynamic power. The wasted power has several sources, such as:

• The virtual ground lines are charged and discharged every cycle. The total capacitance of the virtual ground is important, particularly for large PLAs, because for the purpose of layout compactness the ground lines are in diffusion. This capacitance can be reduced by using a metal level in a multi-metal technology;
• The number of inverters forming the buffers is important. During the evaluation, several of them switch; and
• The switching activity of the dynamic NOR implementation is high [see Chapter 4].

[Figure 7.43: Pseudo-NMOS PLA: (a) AND plane; (b) OR plane.]
[Figure 7.44: Self-timed dynamic PLA, showing the AND plane, the OR plane, clk and the virtual ground.]
Consider now the PLA shown in Fig. 7.45, with an AND-NOR structure. The OR plane is still the same as that of the PLA of Fig. 7.44. However, the AND plane is considerably simplified because:

• The number of inverters for buffering is reduced by half.
• The switching activity of the NAND implementation is also lower than that of the NOR implementation, resulting in lower power in the AND plane.

[Figure 7.45: AND-NOR dynamic PLA, showing the AND plane, the delay and tracking lines, the OR plane and the virtual ground.]

One problem associated with this structure is that the use of NAND may result in a large discharge time.
Another dynamic PLA combines the pseudo-NMOS and dynamic logic design styles [19]. Fig. 7.46 shows an example of such a structure. The AND plane uses a predischarged pseudo-NMOS NOR style, while the OR plane uses a conventional dynamic precharged style. During the precharge phase, the clock signal is high and the bit-lines in the AND plane are predischarged to ground. In the OR plane, the bit-lines are precharged to VDD. The inputs to the OR plane are low. During the evaluation phase (clk = 0), the PMOS loads in the AND plane are ON, and the plane behaves as pseudo-NMOS logic. In this case, the PMOS device should be sized correctly to ensure safe operation when the output stays at a low level. The product terms are evaluated and then the outputs. During this evaluation phase, the PLA dissipates static power, mainly in the AND plane. The power is therefore increased by this DC component.
[Figure 7.46: Dynamic PLA with pseudo-NMOS AND plane (PMOS load).]

This PLA does not need the self-timing technique used previously. Also, it was shown that this PLA has a fast operation [19]. When implementing smaller controllers, it is sometimes more interesting to use random logic. The implementation consists of two or more levels of logic gates using a standard cell library. It is much less regular than a PLA structure, and it can have lower power dissipation.
7.4.2
Read Only Memory (ROM) is used in many applications. In DSPs, for example. it can be used BJ table lookup to store coefficients. Also it i s often used in VLSI processors as a microcode controller. In this case, the ROM contains the microprogram instructions. Typical miero-ROM size is 2k words of 64 bits. The read-out cycle of the ROM limits the speed of the processor. Conceptually, the structore of a ROM is quite similar to that of B PLA. Fig. 7.41 shows a simple ROM circuit architecture using NOR logic design. The state of the memory array is retained even if the ROM is not powered. The
[Figure 7.41. Labels: bit-line (metal 1), word-line (metal 2), word-line (polysilicon), diffusion.]
The ROM can be implemented in both styles: static and dynamic. In the static style, pseudo-NMOS logic, similar to that of the static PLA, can be used. Fig. 7.49 shows an example of a small ROM using the pseudo-NMOS circuit style. The conditioning circuits use PMOS devices with their gates grounded, and the sense amplifier circuit is simply an inverter. The column decoder is also shown. One of the column decoders selects one of the two bit-lines. Node A is initially at VDD. If the selected bit-line is discharged, then node A is discharged and the output is pulled up to VDD. The pseudo-NMOS ROM is easy to design and does not require careful design; however, the power dissipation may be significant due to the DC current. For a relatively large ROM, like the one used in microcontrollers, the power dissipation can be significantly reduced using the low-power techniques of SRAMs*. They include pulse mode operation using address transition detection, reduced swing sensing, etc.
*These techniques are discussed in more detail in Chapter 6.
[Figure 7.49. Labels: row decoder, grounded PMOS.]
A dynamic version of the ROM is shown in Fig. 7.50. During the precharge phase, clk = 1 and the bit-lines are precharged to VDD - VT, where VT is subject to the body effect. Node A is also precharged by the PMOS transistor Pp. The select lines Sel1 and Sel2 are controlled by a column decoder. All the word-lines are predischarged to ground. During evaluation, clk = 0 and if the bit-line is discharged to ground, node A is also discharged. Then the output of the inverter I is pulled up. If node A is not discharged, the feedback PMOS transistor Pf maintains the high level at VDD. Since the swing on the high-load bit-line is reduced, the power dissipation is reduced on this line by a factor VDD/(VDD - VT).
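The reduction factor quoted for the bit-line can be checked numerically. The sketch below assumes the usual dynamic-energy model, in which the energy drawn from the supply to recharge a capacitance C through a swing Vswing is C x VDD x Vswing. The capacitance, supply and threshold values are illustrative assumptions, not values taken from the text.

```python
def bitline_energy(c_bl, vdd, v_swing):
    """Energy drawn from the supply to recharge a bit-line by v_swing (joules)."""
    return c_bl * vdd * v_swing

C_BL = 1e-12   # assumed bit-line capacitance (1 pF), illustrative only
VDD = 3.3      # assumed supply voltage (V)
VT = 0.9       # assumed NMOS threshold including body effect (V)

full = bitline_energy(C_BL, VDD, VDD)          # full-swing bit-line
reduced = bitline_energy(C_BL, VDD, VDD - VT)  # bit-line precharged to VDD - VT
factor = full / reduced                        # equals VDD / (VDD - VT)
print(f"power reduction factor: {factor:.3f}")
```

With these assumed values the factor is 3.3/2.4, so a larger threshold (or a lower supply) gives a larger relative saving on the bit-line.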
[Figure 7.50. Labels: decoder, word-line, Sel1, bit-line.]
A CAM stores tags which can be compared against an input address word (A0 ... Am) as shown in Fig. 7.51(a). A match detection signal is sent by the CAM if the values stored in the CAM array match the input address word. A CMOS implementation of the CAM cell is illustrated in Fig. 7.51(b). It can be read and written just as an ordinary memory cell. The read/write and decoder circuits are similar to those of a RAM. A tag word is formed by identical cells which are repeated in a horizontal array. The write lines are used to write data in the array. The comparison process is described as follows. During the precharge phase, the bit-lines are predischarged low. All the write lines are low. The Match Line (ML) is precharged high. During the evaluation phase, suppose that a "1" is stored at node A. Assume that the CBL line is held high and the complementary CBL line is held low. In this case, the transistors N3 and N1 are OFF, hence the ML line remains high, indicating a match at this bit location. Assume now that CBL is driven low and the complementary CBL line high. The transistor N3 is OFF, but N1 and N2 are ON. Then the ML line is discharged, indicating a mismatch at this bit location.
For an array of n tags, there are n match lines (ML(0) ... ML(n)). Each match line is common to m cells. If there is a mismatch in any bit of the tag word, the match line is discharged. If all the m bits match, the common match line remains high. To detect the match signal in any of the match lines, a dynamic NOR circuit is used, as shown in Fig. 7.52. When the clock is low, the NOR gate is precharged along with the match lines. The inputs to the NOR gate are predischarged to ground. When the clk signal is high (evaluation phase), one of the match lines, ML(i), stays high and the others are discharged to ground. When the match lines are stable, the eval signal is asserted with clk using self-timing (similar to the PLA case). This keeps the dynamic NOR gate from falsely discharging. The inputs to the NOR gate must not go high until the data is stable. If one of the match lines stays high, then the NOR gate is discharged and the output match signal goes high.
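The match-line behavior described above can be summarized with a small behavioral model. This is only a logic-level sketch: the function names and the stored tags are hypothetical, and precharge timing and self-timing of the eval signal are not modeled.

```python
def match_lines(tags, word):
    """Return one match-line value per stored tag (1 = stays high, 0 = discharged)."""
    lines = []
    for tag in tags:
        # A single mismatching bit is enough to discharge the precharged match line.
        mismatch = any(t != w for t, w in zip(tag, word))
        lines.append(0 if mismatch else 1)
    return lines

def match_signal(lines):
    """Dynamic NOR stage: the output match signal goes high if any match line stays high."""
    return 1 if any(lines) else 0

# Hypothetical stored tags (n = 3 tags, m = 4 bits each).
tags = [(1, 0, 1, 1), (0, 0, 1, 0), (1, 1, 1, 1)]
ml = match_lines(tags, (0, 0, 1, 0))
print(ml, match_signal(ml))
```

Only the second tag equals the input word, so only its match line stays high and the match signal is asserted.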
To reduce clock skew due to clock distribution. As systems continue to demand higher clock frequencies, clock skew associated with input buffers and clock distribution becomes a significant design problem, as shown in Fig. 7.53(a). The internal clock drives the output register, which in turn delivers the data to the output pad (with a buffer). The skew between the external and internal clocks is due to the clock tree. The output data is significantly delayed compared to the external clock. One main contribution is the clock skew. In Fig. 7.53(b), the internal clock is deskewed via the use of a PLL. The PLL should reduce this skew over a wide range of process, temperature and voltage variations;
To synchronize data between chips as shown in Fig. 7.54. The PLL solves the problem of clock skew from chip to chip. An example of such an application is discussed in [22]; and
To generate internal clocks with higher frequencies than the external clock (system clock).
There are other applications of the PLL, such as clock recovery in serial data communications, which are not discussed in this section. Several theoretical references on PLLs can be found in [23, 24, 25]. This section provides an introduction to the PLL. The CMOS circuit design of the PLL, for low-power applications, is then discussed.
7.5.1 Charge-Pumped PLL
One interesting configuration of the PLL is the charge-pumped loop shown in Fig. 7.55. It is a PLL-based frequency multiplier which consists of a Phase Frequency Detector (PFD), a Charge Pump (CP), a Loop Filter (LF), a Voltage Controlled Oscillator (VCO), and a programmable frequency divider. The feedback of the internal clock is compared to the external clock for phase and frequency error. The outputs of the phase/frequency detector are two digital signals called U (for Up) and D (for Down). The charge pump and loop filter convert these digital signals into an analog signal (control) suitable for the VCO. The VCO generates an oscillation frequency that is a function of the control signal level. If the PLL generates multiples of the external clock frequency, then a frequency divider is inserted between the generated clock and the phase detector.
A simplified diagram of the charge pump and loop filter is shown in Fig. 7.56. It consists of two switchable current sources* driving an impedance (LF). The pulses generated by the PFD block are used to switch the charge pump, to charge or discharge the impedance. The loop filter filters these pulses and produces an analog output signal to control the VCO.
*The charge pump of the PLL should not be confused with the one used to generate different voltages.
Figure 7.53 PLL clock generation for deskewing: (a) a chip without PLL; (b) a chip with PLL.
[Figure labels: Chip #1, Chip #2, data pad.]
7.5.2
This section presents the design of the PLL components. Fig. 7.57 shows the logic diagram of the PFD circuit. It uses mainly static-CMOS NAND gates, which results in good performance and low power dissipation. The operation of this circuit, using the state diagram of Fig. 7.57(c), is as follows. The circuit has three states: 1) UP, where the up signal U is asserted when the external clock clk_ext falls, 2) DOWN, where the down signal D is asserted when the internal clock clk falls, and 3) NOP, where the detector does not change the output control signals. In this last state both the U and D signals are at zero level. The state changes whenever clk or clk_ext falls. In no case are U and D both activated. Consider that clk and clk_ext have the same frequency but the falling edges of clk_ext (clk) lead the falling edges of clk (clk_ext), respectively. Then U (D) is asserted with a certain duty cycle, while D (U) is never asserted. In this case, the PFD is characterized as a phase detector. Consider now the case where clk_ext has a higher frequency than clk. U is asserted most of the time, because the clk_ext signal has more falling edges than clk. A similar situation occurs when clk has a higher frequency than clk_ext, and D is asserted most of the time. In this case, the PFD is characterized as a frequency detector. The U and D signals, generated by the PFD, are connected to the charge pump circuit of Fig. 7.58(a). When the signal U (D) is asserted, the pull-up PMOS (pull-down NMOS) transistor charges (discharges) the output, respectively. Another variation of the charge pump circuit is shown in Fig. 7.58(b). Two transistors are added as current sources biased by a current mirror circuit. In this situation, the output current of the charge pump can be adjusted through the control of the current mirror.
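The three-state behavior of the PFD can be illustrated with a small state machine. The transition table below is an assumption (the common three-state PFD, where a falling edge of the external clock moves the state toward UP and a falling edge of the internal clock moves it toward DOWN); it is meant as a behavioral sketch and not as a description of the NAND-gate circuit of the figure. The edge labels "ext" and "int" are hypothetical names.

```python
UP, DOWN, NOP = "UP", "DOWN", "NOP"

# Assumed transition table of a three-state PFD, keyed by (state, falling edge).
NEXT_STATE = {
    (NOP, "ext"): UP,    (NOP, "int"): DOWN,
    (UP, "ext"): UP,     (UP, "int"): NOP,
    (DOWN, "ext"): NOP,  (DOWN, "int"): DOWN,
}

def run_pfd(edges, state=NOP):
    """Return the list of (U, D) outputs after each falling edge in `edges`."""
    outputs = []
    for edge in edges:
        state = NEXT_STATE[(state, edge)]
        outputs.append((int(state == UP), int(state == DOWN)))
    return outputs

# Same frequency, clk_ext leading clk: U pulses, D is never asserted.
same_freq = run_pfd(["ext", "int", "ext", "int"])
# clk_ext faster than clk: U is asserted most of the time.
ext_faster = run_pfd(["ext", "ext", "int", "ext", "ext"])
print(same_freq)
print(ext_faster)
```

In both runs U and D are never asserted together, which matches the property stated in the text.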
The monolithic implementation of the filter of Fig. 7.56 is shown in Fig. 7.59. The two capacitors C1 and C2 are on the order of tens of pF and are made with the NMOS transistors NC1 and NC2. The resistor is made with a transmission gate in the closed state. It can also be implemented with an N-well implant available in the CMOS process. The capacitor C2 is added in parallel to the simple RC (R-C1) low-pass filter to form a second-order filter. In this case, the stability of the system is maintained even with the process variation of these on-chip components. Note that these capacitors can occupy a large portion of the PLL.
The charge pump and filter generate a control voltage for the VCO. One important parameter of the VCO is the VCO gain. When considering the frequency versus control voltage characteristic, the VCO gain is the slope of this characteristic. A linear characteristic is, in general, desirable. In general the VCO is implemented using a ring oscillator as shown in Fig. 7.60. A series connection of delay inverter cells forms a tapped delay line which oscillates with a frequency determined by the delay time of the cell and the odd number of stages. The delay of the cell is controlled by a current which in turn is controlled by the control voltage Vc. Vc modulates the ON resistance of the pull-down N1 and, through the current mirror, the pull-up P1. All the devices of the VCO should be oriented in the same direction and have redundant contacts to reduce the jitter due to process variations. In the VCO of Fig. 7.60, maximum frequency is achieved at maximum control voltage. Typical values of the VCO gain at low power supply voltage can range from 10 MHz/V to 100 MHz/V depending on the number of stages and technology. Note that the bandwidth of the VCO presented previously is limited. The VCO of Fig. 7.61 has an excellent bandwidth characteristic, where a wide range of frequencies can be generated. It is used for video signal processors and covers a wide range of applications. The frequency range can change by one order of magnitude, from 50 MHz to 350 MHz. In Fig. 7.61, by turning ON and OFF 8 CMOS TGs with control signals, the number of ring oscillator stages can be selected among eight values ranging from 7 to 61. Each stage of the ring oscillator combines an inverter in parallel with a current-controlled inverter. The inverter increases the frequency of oscillation of the VCO, whereas the current-controlled inverter permits tuning of the frequency of the VCO. The generated clock frequency can be N times the external clock frequency (reference frequency). This clock then feeds the clock driver and tree. Since the PLL discussed here is intended to be integrated on-chip, it is sensitive to the noise generated on the power lines (called power-supply-induced clock jitter). If the power supply changes by 100 mV, the skew or phase error will be significant before the PLL has time (tens of clock cycles) to correct this error [27]. One way to reduce the effect of this problem is to dedicate an analog power supply pin to the VCO and the charge pump. At the circuit level, a new VCO delay cell was proposed by Young [27] to reduce the phase error. Another VCO alternative is shown in Fig. 7.62. It is similar to the Voltage-Controlled Delay Line (VCDL). The control voltage, Vc, is used to vary the amount of the effective load seen by each inverter output. The frequency versus control voltage characteristic of this VCO has a negative slope. Then the minimum frequency of oscillation is limited by the maximum VDD. Therefore, the minimum frequency is increased with reduced VDD. A positive slope is, in general, desirable so that the minimum frequency is not set by VDD. The frequency divider can be implemented using toggle flip-flops. Fig. 7.63 shows an example of a divider with division ratios of 1, 1/2, 1/4, and 1/8. The PLL discussed so far is not completely digital. Only the PFD, charge pump and frequency divider are digital, while the LF and VCO are analog and operate as continuous-time systems.
[Figure 7.60]
[Figure 7.61. Labels: selection signals, 7th stage, generated clock.]
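The dependence of the VCO frequency on the stage delay and the number of stages, and the divider ratios, can be sketched numerically. The relation f = 1/(2 N td) used below is the standard first-order estimate for an N-stage ring oscillator; it is an assumption brought in for illustration and is not stated in the text. The stage delay value is also illustrative.

```python
def ring_osc_freq(n_stages, t_delay):
    """First-order oscillation frequency (Hz) of an n-stage ring oscillator."""
    if n_stages % 2 == 0:
        raise ValueError("a simple inverter ring needs an odd number of stages")
    return 1.0 / (2 * n_stages * t_delay)

def divided_freqs(f_in, n_toggle_ffs=3):
    """Frequencies available along a chain of toggle flip-flops (each halves f)."""
    return [f_in / (2 ** k) for k in range(n_toggle_ffs + 1)]

T_D = 0.5e-9  # assumed per-stage delay (0.5 ns), illustrative only
f7 = ring_osc_freq(7, T_D)
f61 = ring_osc_freq(61, T_D)
print(f"7 stages: {f7 / 1e6:.1f} MHz, 61 stages: {f61 / 1e6:.1f} MHz")
print([f / 1e6 for f in divided_freqs(f7)])  # ratios 1, 1/2, 1/4, 1/8
```

Under this model, selecting more stages lowers the frequency for a fixed stage delay, while the control voltage tunes the stage delay itself.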
7.5.3 Low-Power Design
In sleep mode, the on-chip PLL may be controlled for low-frequency operation, or it may be disabled to reduce its power dissipation to that of the leakage currents.
As an example, one way to disable the PLL is to shut down the VCO and disable the external clock. Fig. 7.64 shows the same VCO as Fig. 7.62 but with one inverter transformed into a two-input NAND gate. One of the inputs is controlled by the Enable signal to shut down the PLL when it is low. The NAND gate can be used for any of the VCOs presented previously. The enable signal can also be used to disable any current sources used in the PLL to eliminate any DC current. A typical power dissipation of a PLL, at 3.3 V, is in the range of tens of mW depending on the frequency.
7.6 CHAPTER SUMMARY
This chapter has presented the design of several subsystems used in VLSI chips. Many circuit alternatives are discussed which trade area, speed and power. The reader can construct these options and compare their performance in terms of power, delay and area. The power dissipation issue is stressed more. Also several building blocks of VLSI chips using advanced circuit techniques have been investigated.
REFERENCES
[1] J. Mori, et al., "A 10-ns 54 x 54-b Parallel Structured Full Array Multiplier with 0.5-µm CMOS Technology," IEEE Journal of Solid-State Circuits, vol. 26, no. 4, pp. 600-606, April 1991.
[2] J. Sklansky, "An Evaluation of Several Two-Summand Binary Adders," IRE Transactions on Electronic Computers, vol. EC-9, pp. 213-226, June 1960.
[3] J. Sklansky, "Conditional-Sum Addition Logic," IRE Transactions on Electronic Computers, vol. EC-9, pp. 226-231, June 1960.
[4] I. S. Abu-Khater, R. H. Yan, A. Bellaouar, and M. I. Elmasry, "A 1-V Low-Power High-Performance 32-b Conditional Sum Adder," IEEE Symposium on Low-Power Electronics, Tech. Dig., San Diego, pp. 66-67, October 1994.
[5] T. Sato, et al., "An 8.5-ns 112-b Transmission Gate Adder with a Conflict-Free Bypass Circuit," IEEE Journal of Solid-State Circuits, vol. 27, no. 4, pp. 657-659, April 1992.
[6] K. Ueda, H. Suzuki, K. Suda, Y. Tsujihashi, and H. Shinohara, "A 64-bit Adder by Pass Transistor BiCMOS Circuit," IEEE Custom Integrated Circuits Conference, Tech. Dig., pp. 12.2.1-12.2.4, May 1993.
[7] K. Hwang, "Computer Arithmetic: Principles, Architecture, and Design," John Wiley and Sons, 1979.
[8] J. J. F. Cavanagh, "Computer Science Series: Digital Computer Arithmetic," McGraw-Hill Book Co., 1984.
[9] M. Nagamatsu, S. Tanaka, J. Mori, T. Noguchi, and K. Hatanaka, "A 15-ns 32x32-bit CMOS Multiplier with an Improved Parallel Structure," IEEE Custom Integrated Circuits Conference, Tech. Dig., pp. 10.3.1-10.3.4, May 1989.
[10] N. Ohkubo, M. Suzuki, T. Shinbo, T. Yamanaka, A. Shimizu, K. Sasaki, and Y. Nakagome, "A 4.4-ns CMOS 54x54-b Multiplier using Pass-Transistor Multiplexer," IEEE Custom Integrated Circuits Conference, Tech. Dig., pp. 599-602, May 1994.
[11] R. Bechade, et al., "A 32b 66MHz Microprocessor," IEEE International Solid-State Circuits Conference, Tech. Dig., pp. 208-209, February 1994.
[12] C. A. Mead, and L. A. Conway, "Introduction to VLSI Systems," Addison-Wesley, 1980.
[13] R. W. Sherburne, et al., "Data path Design for RISC," Proc. Conf. Advanced Research in VLSI, pp. 53-62, 1982.
[14] R. W. Sherburne, et al., "A 32-bit NMOS Microprocessor with a Large Register File," IEEE Journal of Solid-State Circuits, vol. SC-19, no. 5, pp. 682-689, October 1984.
[15] K. J. O'Connor, "The Twin-Port Memory Cell," IEEE Journal of Solid-State Circuits, vol. SC-22, no. 5, pp. 712-720, October 1987.
[16] R. D. Jolly, "A 9-ns, 1.4 Gigabyte/s 17-Ported CMOS Register File," IEEE Journal of Solid-State Circuits, vol. 26, no. 10, pp. 1407-1412, October 1991.
[17] H. Shinohara, et al., "A Flexible Multiport RAM Compiler for Data Path," IEEE Journal of Solid-State Circuits, vol. 26, no. 3, pp. 343-349, March 1991.
[18] A. R. Linz, "A Low-Power PLA for a Signal Processor," IEEE Journal of Solid-State Circuits, vol. 26, no. 2, pp. 107-115, February 1991.
[19] G. M. Blair, "PLA Design for Single-Clock CMOS," IEEE Journal of Solid-State Circuits, vol. 27, no. 8, pp. 1211-1213, August 1992.
[20] H. Kadota, et al., "A 32-bit Microprocessor with On-Chip Cache and TLB," IEEE Journal of Solid-State Circuits, vol. SC-22, no. 5, pp. 800-807, October 1987.
[21] A. J. Smith, "Cache Memories," Computing Surveys, vol. 14, pp. 473-530, September 1982.
[22] L. Ashby, "ASIC Clock Distribution using a Phase Locked Loop (PLL)," in IEEE International ASIC Conference and Exhibit, Tech. Dig., pp. P1.6.1-P1.6.3, September 1991.
[23] F. M. Gardner, "Phase Lock Techniques," John Wiley and Sons, 1979.
[24] F. M. Gardner, "Charge-Pump Phase-Locked Loops," IEEE Transactions on Communications, COM-28(11), pp. 1849-1858, November 1980.
[25] M. G. Johnson, and E. L. Hudson, "A Variable Delay Line PLL for CPU-Coprocessor Synchronization," IEEE Journal of Solid-State Circuits, vol. 23, no. 5, pp. 1218-1223, October 1988.
8
LOW-POWER VLSI DESIGN METHODOLOGY
methodologies at several abstraction levels such as physical, logical, architectural, and algorithmic levels. All the power reduction techniques discussed are related to the dynamic power dissipation. It is shown that LP techniques at the high level (algorithmic and architectural) of the design lead to power savings of several orders of magnitude. Many examples are included to give the reader a quantitative picture of LP issues. Several LP techniques, particularly at the circuit level, have already been discussed in Chapters 4, 6, and 7, including those related to static power considerations. However, they are not reconsidered in this chapter. The power estimation techniques at the circuit, logical, architectural and behavioral levels are overviewed. Power analysis at the high level allows an early prediction and optimization of the power of a system. The LP concepts such as switching activity, glitching, etc., discussed in Chapter 4 are used throughout this chapter.
CHAPTER 8
8.1.1 Floorplanning
Floorplanning of a circuit is the first step in VLSI layout design. It permits the allocation of space on a chip for a given set of modules. A module can be rigid, e.g., the module is in the library and its dimension and power dissipation are known, or flexible, e.g., it has not been designed and has a list of parameters such as different shapes and power consumptions for feasible implementations. A floorplanner for low-power design should choose a suitable implementation for each module such that the total power and area of the chip are optimized [1].
mized by restructuring a logic circuit during the technology-independent phase [3]. It is assumed that at the higher level of abstraction, decisions regarding the power supply voltage and the clock frequency have already been made. The power minimization is constrained by the delay; however, the area may increase. During this phase of logic minimization, the function to be minimized is

$$\sum_i P_i (1 - P_i) C_i \qquad (8.1)$$

where $P_i$ is the probability of node i being a "1", $(1 - P_i)$ is the probability that node i is a "0", and $C_i$ is the capacitance of this node. For more information on this model see Section 8.5.2.1. To minimize the above function, one has to first evaluate the current value of $P_i$ and then change it by making $P_i$ close to 0 or close to 1. Also, in [3] a zero-delay approximation is assumed. This implies that the glitching power is neglected.
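The effect of moving a signal probability toward 0 or 1 on this cost function can be illustrated with a few lines of code. The node list below is hypothetical (probabilities and capacitances are made up for the example), and the zero-delay assumption of the text applies, so glitching is ignored.

```python
def switching_cost(nodes):
    """Sum of P_i * (1 - P_i) * C_i over (probability, capacitance) pairs."""
    return sum(p * (1.0 - p) * c for p, c in nodes)

# Hypothetical nodes: (signal probability, capacitance in fF).
before = [(0.5, 10.0), (0.5, 20.0), (0.25, 5.0)]
after = [(0.5, 10.0), (0.125, 20.0), (0.25, 5.0)]  # second node pushed toward 0

cost_before = switching_cost(before)
cost_after = switching_cost(after)
print(cost_before, cost_after)
```

Because P(1 - P) peaks at P = 0.5, the largest gain comes from moving the probability of a high-capacitance node away from 0.5, as done for the second node here.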
To minimize the switching activity, some techniques that can be used are:
■ Use don't-cares to minimize the probability $P_i$ of a function. Indeed, the signal probability of a gate can change by altering the ON-set or the OFF-set by adding points from the don't-care set.
■ Collapse nodes that are not on the critical path. The intermediate signal lines are implemented as a single node. The delay may increase; however, this does not affect the overall performance of the circuit.
Power dissipation can be improved by as much as 60%, at the expense of an 8% area increase [3] and with no delay degradation. A more typical power reduction would be in the range of 10-20% [4]. The technology mapping step for low power refers to the process of transforming a logic function into a technology-dependent (e.g., CMOS) circuit with minimized power consumption. This technology-dependent step uses a target technology. The first step in technology mapping is to decompose each logic function into two-input gates. The objective of this decomposition is to minimize the total power dissipation by reducing the total switching activity. Fig. 8.1 shows an example of a four-input AND gate decomposition into two different implementations. The probabilities of the inputs being at logical "1" are also shown in Fig. 8.1. Primary inputs are assumed to be uncorrelated. The switching activity at each internal node is also shown in Fig. 8.1. The switching activity at the output of a two-input (i, j) AND gate is given by

$$a = (1 - P_i P_j) P_i P_j \qquad (8.2)$$
[Figure 8.1. Panels: Implementation 1, Implementation 2.]
We assume also that the gate delays are zero to ignore the power due to the glitching phenomenon. The total switching activities for implementations 1 and 2 are 0.888 and 1.056, respectively. Therefore, implementation 1 is better than implementation 2. This problem of decomposition was addressed by [5, 6]. In [5], the power dissipation associated with glitching is neglected, while in [6] it is not. Taking into account the power dissipation of glitches is very important, as is discussed in Section 8.2.2. The concept of technology mapping of logic optimization is an important step for standard cells and gate arrays (or sea of gates) circuit design. All the cells in the library are characterized in terms of area and speed. Another parameter to be added for low-power design is the characterization of the internal power of the gate and its output parasitic capacitance. Hence the process of technology mapping is to search, using a target library, for the best possible implementation under constraints such as power, area and delay.
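The comparison between two decompositions can be reproduced in a few lines using Eq. (8.2). The sketch below assumes that the two candidate structures are a chain and a balanced tree of two-input AND gates, and it uses hypothetical input probabilities; neither the structures nor the probabilities of Fig. 8.1 are recoverable here, so the totals differ from the 0.888 and 1.056 quoted in the text.

```python
def and_activity(p_i, p_j):
    """Switching activity and output probability of a two-input AND gate, Eq. (8.2)."""
    p_out = p_i * p_j
    return (1.0 - p_out) * p_out, p_out

def chain_activity(probs):
    """Total activity of a chain ((a.b).c).d of two-input AND gates."""
    total, p = 0.0, probs[0]
    for q in probs[1:]:
        act, p = and_activity(p, q)
        total += act
    return total

def tree_activity(probs):
    """Total activity of a balanced tree (a.b).(c.d) of two-input AND gates."""
    a, b, c, d = probs
    act_ab, p_ab = and_activity(a, b)
    act_cd, p_cd = and_activity(c, d)
    act_out, _ = and_activity(p_ab, p_cd)
    return act_ab + act_cd + act_out

probs = [0.5, 0.5, 0.5, 0.5]  # hypothetical input probabilities, not those of Fig. 8.1
chain_total = chain_activity(probs)
tree_total = tree_activity(probs)
print(chain_total, tree_total)
```

For these assumed probabilities the chain has the lower total activity, but the ranking depends on the input probabilities and on their ordering, which is why the decomposition is treated as an optimization problem.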
In this section we do not consider the algorithms for technology mapping. The reader can consult references [5, 7]. We illustrate this concept of technology mapping by the following example. Fig. 8.2 shows an example for implementing the logic circuit of Fig. 8.2(a) into two implementations. The first implementation [Fig. 8.2(a)] is for minimal area design using an OAI (OR-AND-INVERT) gate. The second implementation [Fig. 8.2(b)] is for minimal power design, where the high switching node N of Fig. 8.2(a) is hidden using a more complex gate.
Thus the process of technology mapping is to first decompose the logic function such that the total switching activity is minimized, and then to hide any high switching activity node within complex gates so that the capacitance of that node is minimized. However, making a gate too complex can trade delay for low power. The typical reduction in power dissipation is on the order of 20% without any degradation in performance but at the expense of a small area penalty. The quality of the targeted cell library can considerably impact the results of mapping [8]. For example, the availability of cells with different drive strengths and double-rail outputs (signal and its complement) gives more flexibility for logic optimization. A good library can result in a 20-50% reduction in power dissipation.
Another technique employs self-timing to reduce the logic depth and then the glitching power [9, 11]. The self-timed circuit should save more power than what it introduces. An adder is an example of a circuit that exhibits spurious transitions. The sum signals can have false transitions before they are stable. If the load capacitances on the outputs are relatively large, then the power due to the glitches can be significant.
A conventional self-timed method for an adder is shown in Fig. 8.3. A Transition Detector (TD) similar to the one discussed for SRAMs in Chapter 6 is used. For each set of inputs ($A_i$ and $B_i$) there is one transition detector. If A and B are both n bits wide, then n TDs are required for the parallel adder. For any transition at the inputs, the TD generates a pulse for the self-timed function. This self-timed circuit delays the pulse by an amount equal to the critical path of the adder. The delayed pulse then feeds the clock of a D-Flip-Flop (DFF) or a gated circuit for the sum function. Consequently, the output sums are not switched until they are evaluated. The additional circuitry in the conventional approach can consume more power than it may save.
Another approach based on self-timing to reduce the spurious transitions was proposed by [11]. Fig. 8.4 shows a parallel adder using simple self-timed circuitry. When input signals are written into the registers A and B, a single register bit is used to generate an "Input Valid" signal to the self-timed function. For an n-bit parallel adder, only a one-bit register is required, as shown in Fig. 8.4. The self-timed function is implemented using a series of inverters with dual-rail outputs. Two enable signals, E and its complement, are generated by the self-timed circuit. They feed the gated sum XOR gates. The enable signal E also controls the one-bit register to disable the input valid signal. This technique has resulted in a 25% power reduction [11].
is critical for power savings. Otherwise, the additional circuitry can dissipate a relatively significant amount of power. Note that this added logic slightly increases the area of the circuit and may also increase the clock cycle. The precomputation technique can be applied to a multiple-output function. However, if the logic has a large number of outputs, then it may be worthwhile to selectively apply the precomputation technique to a small number of complex outputs. This selective partitioning will add a duplication of combinational logic and registers, and this may offset the power savings.
8.3 LP ARCHITECTURE-LEVEL DESIGN
In this section, architecture also means Register Transfer Level (RTL). The architecture uses a set of primitives such as adders, multipliers, ROMs, register files, etc. RTL synthesis programs are used to convert an RTL description to a set of registers and combinational logic. The impact of low-power techniques at the architecture level can be more significant than at the gate level, as will be shown in this section. The techniques discussed to reduce the power dissipation are: parallelism, pipelining, distributed processing and power management.
8.3.1 Parallelism
Parallelism can be used to reduce the power dissipation at the expense of area while maintaining the same throughput [10]. To illustrate this, the quantitative example of Fig. 8.7 is considered. In Fig. 8.7(a), a register supplies two 16-bit operands to a 16 x 16 multiplier. We refer to this architecture as the reference one and we use the ref notation for frequency, power supply voltage, power dissipation, etc. This register is clocked at a maximal frequency $f_{ref}$ = 50 MHz. We assume that the worst-case delay of the multiplication is 20 ns at $V_{ref}$ = 3.3 V power supply voltage. It is clear that we cannot reduce $V_{ref}$ to reduce the
throughput as in the case of Fig. 8.7(a). The input registers are clocked at $f_{ref}/2$ = 25 MHz. Therefore, the power supply can be reduced to achieve a worst-case delay of 40 ns. With the same 16 x 16 multiplier, the power supply can be reduced from $V_{ref}$ = 3.3 V to 1.8 V ($V_{ref}/1.83$). This value can be determined from the simulation of the two architectures. The effective capacitance has increased by a factor of 2 due to the duplication. However, due to the extra routing to both multipliers, this effective capacitance is around 2.2 $C_{ref}$. Thus, the estimated power dissipation is given by

$$P_{par} = (2.2\, C_{ref}) \left( \frac{V_{ref}}{1.83} \right)^2 \frac{f_{ref}}{2}$$

Hence

$$P_{par} = 0.33\, P_{ref}$$

Thus, the power dissipation is significantly reduced.
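The arithmetic behind this ratio follows directly from the dynamic power model P = C V^2 f. The short sketch below only reproduces that calculation with the factors given in the text; the function and parameter names are hypothetical.

```python
def relative_power(cap_factor, v_scale, f_factor):
    """P / P_ref for P = C * V^2 * f, with C = cap_factor * C_ref,
    V = V_ref / v_scale and f = f_factor * f_ref."""
    return cap_factor * (1.0 / v_scale) ** 2 * f_factor

# Two multipliers in parallel: 2.2x capacitance, V_ref / 1.83, half the clock rate.
p_par = relative_power(cap_factor=2.2, v_scale=1.83, f_factor=0.5)
print(f"P_par / P_ref = {p_par:.2f}")
```

The quadratic dependence on the supply voltage is what makes the trade worthwhile: the capacitance more than doubles, yet the voltage term alone contributes a factor of about 0.3.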
The key to this power savings is the duplication of the hardware in a parallel configuration. In general, N processors can be parallelized by duplication, with each processor running with a slower clock (by a factor of N). In this case, for the same throughput, the power dissipation can be reduced with the increase of N, because the power supply voltage (VDD) can be aggressively reduced to meet a worst-case delay almost equal to the reference delay multiplied by N. To exploit this power supply reduction, the threshold voltage (VT) should also be reduced to limit the degradation of the delay as VDD approaches VT. Keep in mind that the scaling of VT is also limited by static current considerations.
When the number N is relatively large, the parallelism can lead to several problems. A highly parallelized configuration can result in a drastic increase of the occupied area. In addition, there is routing overhead to distribute the input and output signals. This also increases the area and the wiring capacitance. Therefore, the power dissipation also tends to increase, which limits the utility of parallelism.
8.3.2 Pipelining
Pipelining is another architecture that can reduce the power dissipation [10]. As an example, let us consider the case of the 16 x 16 multiplier presented in Section 8.3.1. The 50 MHz multiplier is broken into two equal parts as shown in Fig. 8.8. A set of pipeline registers (or latches) is inserted, resulting in a 2-stage pipelined version of the multiplier. Architectures with more pipeline stages can be realized. Since the hardware between the pipeline stages is reduced, the reference voltage $V_{ref}$ = 3.3 V can be reduced to 1.8 V ($V_{ref}/1.83$) to maintain a worst-case delay of 20 ns (50 MHz). The estimated power dissipation is given by

$$P_{pipe} = C_{pipe} \left( \frac{V_{ref}}{1.83} \right)^2 f_{ref}$$

The switching capacitance $C_{pipe}$ has increased slightly relative to $C_{ref}$ due to the pipelining. Thus, the power dissipation is reduced by a factor of almost 2.8, which is approximately the same as in the parallel case. Also the area increase is relatively low, and the area penalty is due only to the additional registers (or latches). As the pipeline registers reduce the logic depth, the power dissipation due to the glitches is also reduced.
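The pipelined case can be checked with the same P = C V^2 f model. The capacitance overhead factor of 1.2 used below is a hypothetical value chosen for illustration (the text only says the capacitance increases slightly); with it, the reduction factor comes out close to the quoted value of almost 2.8.

```python
def relative_power(cap_factor, v_scale, f_factor=1.0):
    """P / P_ref for P = C * V^2 * f with scaled capacitance, voltage and frequency."""
    return cap_factor * (1.0 / v_scale) ** 2 * f_factor

CAP_OVERHEAD = 1.2  # assumed capacitance increase from pipeline registers (hypothetical)

# Two-stage pipeline: clock frequency unchanged, supply reduced to V_ref / 1.83.
p_pipe = relative_power(cap_factor=CAP_OVERHEAD, v_scale=1.83)
print(f"P_pipe / P_ref = {p_pipe:.2f}, reduction factor = {1.0 / p_pipe:.2f}")
```

Unlike the parallel case, the frequency term stays at 1, so the saving comes entirely from the voltage reduction and is eroded only by the register overhead.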
In general, if a processor is pipelined with N stages of registers, then the delay between the pipeline stages is reduced by almost a factor of N while the clock frequency is maintained. Then, the power supply voltage can be scaled aggressively. Consequently, the power saving is large.
Note that, as in the case of parallelism, an architecture with a large number of pipeline stages can incur a power and area overhead that offsets the savings. The added registers must be clocked, and hence the load on the clock network can be significant with increased pipelining. One drawback of the use of the pipeline is that more latency is added to the output signal.
The combination of pipelining and parallelism can result in further power reduction, because the power supply voltage can be reduced aggressively. Also the frequency of operation is reduced. However, the area would increase significantly. For low-voltage operation, the threshold voltage should also be reduced to reduce the power dissipation; otherwise the power supply voltage reduction is limited. Indeed, at low voltage, VDD approaches VT and the delay increases drastically. To maintain the throughput with parallelism/pipelining, the threshold voltage should be reduced compared to VDD.
A video image, represented by a group of pixels, is vector quantized by breaking it into blocks (vectors) of pixels that are mapped to a codebook of probable vectors using Mean Square Error (MSE) as the distortion measure. For the example given in [15], the image is segmented into 4 × 4 pixel vectors (vector size is 16). The VQ employs a codebook of 256 levels. The input data is represented on 16 × 8 bits and the output (8 bits) represents the index of the best match, as shown in Fig. 8.9 [15]. The compression ratio is then 16:1. To process 30 frames/s, a vector must be compressed every 17.3 µs (each frame is 128 × 240 pixels). The MSE (distortion metric) between a vector X of 16 pixels and a codebook vector C is given by
$$MSE = \sum_{i=0}^{15} (C_i - X_i)^2 \qquad (8.8)$$
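As an illustration of Equation (8.8), the following minimal sketch performs a full search over a codebook and recomputes the per-vector time budget. The random codebook and the function names are assumptions made only for this example.

```python
import random

VECTOR_SIZE = 16      # 4 x 4 pixel block
CODEBOOK_SIZE = 256   # number of code vectors (levels)


def mse(code_vector, x):
    """Distortion of Equation (8.8): sum of squared pixel differences."""
    return sum((c - p) ** 2 for c, p in zip(code_vector, x))


def full_search(codebook, x):
    """Return the index of the code vector with the smallest distortion."""
    return min(range(len(codebook)), key=lambda k: mse(codebook[k], x))


rng = random.Random(0)  # hypothetical random codebook, for illustration only
codebook = [[rng.randrange(256) for _ in range(VECTOR_SIZE)]
            for _ in range(CODEBOOK_SIZE)]
x = list(codebook[42])            # an input that matches entry 42 exactly
index = full_search(codebook, x)  # 8-bit index of the best match

# Time budget per vector for 30 frames/s of 128 x 240 pixels.
vectors_per_frame = (128 * 240) // VECTOR_SIZE
time_per_vector_us = 1e6 / (30 * vectors_per_frame)
print(index, round(time_per_vector_us, 2))
```

The full search touches every code vector, which is the cost that the tree-structured and differential searches aim to avoid.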
Computing this algorithm requires a large number of memory accesses to the codebook and a large number of arithmetic operations (see Section 8.4). The number of computations can be reduced by using a differential search combined with Tree Search (TS) between two vectors a and b at the same level of the tree. The distortion difference between the two vectors a and b at the same level of the tree is given by

$$MSE_{ab} = MSE_a - MSE_b \qquad (8.9)$$

Then,

$$MSE_{ab} = \sum_{i=0}^{15} (C_{ai}^2 - C_{bi}^2) + 2 \sum_{i=0}^{15} X_i (C_{bi} - C_{ai}) \qquad (8.10)$$

The two terms $(C_{ai}^2 - C_{bi}^2)$ and $2(C_{bi} - C_{ai})$ are precomputed and stored in a memory to reduce the number of operations.
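The identity behind the differential search can be checked numerically. The sketch below, with hypothetical function names and random test vectors, precomputes the two code-vector terms and confirms that the result equals the direct difference of the two distortions.

```python
import random


def mse(c, x):
    """Direct distortion: sum of squared differences."""
    return sum((ci - xi) ** 2 for ci, xi in zip(c, x))


def precompute(c_a, c_b):
    """Terms that depend only on the code vectors (stored in memory)."""
    energy_diff = sum(a * a - b * b for a, b in zip(c_a, c_b))
    cross = [2 * (b - a) for a, b in zip(c_a, c_b)]
    return energy_diff, cross


def differential_mse(energy_diff, cross, x):
    """MSE_a - MSE_b from the stored terms: one multiply-add per pixel."""
    return energy_diff + sum(k * xi for k, xi in zip(cross, x))


rng = random.Random(1)
c_a = [rng.randrange(256) for _ in range(16)]
c_b = [rng.randrange(256) for _ in range(16)]
x = [rng.randrange(256) for _ in range(16)]

energy_diff, cross = precompute(c_a, c_b)
diff = differential_mse(energy_diff, cross, x)
choose_a = diff <= 0   # vector a is the closer one when the difference is not positive
```

Only the sign of the difference is needed to pick a branch, so the two full distortions never have to be evaluated separately.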
Fig. 8.10(a) shows the centralized implementation of the VQ. It has a centralized memory, processing element, and controller. This architecture is time-multiplexed: it performs operations sequentially over a large number of clock cycles. In TSVQ, each level of the tree has specific code vectors that are found only at that level. Therefore, the memory can be partitioned into separate memories for each level of the tree. Fig. 8.10(b) shows the distributed implementation of the VQ. The memory size increases from one module to the next. The architecture is pipelined, allowing the clock frequency and supply voltage to be reduced. The distributed memory architecture has lower switched capacitance when reading the code vectors than the centralized case. This distributed implementation has eight controllers and processing elements, but since they are clocked at a lower frequency, with a low supply voltage, the energy dissipated per vector does not change [15]. Through this partitioning, the power dissipated by the centralized implementation was reduced by a factor of 11 at the expense of an area increase by a factor of 2.
From this example we learn that a properly designed architecture based on distributed processing is more power-efficient than the centralized processor. In the distributed implementation, the different local hardware resources can be optimized more efficiently than the global hardware in the centralized implementation. The application of this technique depends on whether the executed algorithm can be partitioned. Keep in mind that the power saving is traded for occupied area, while the throughput is maintained.
In the PowerPC¹ 603 [21], the DPM mode is enabled by software. The DPM logic automatically stops the clock switching of specific units, generated by clock regenerators. The clock regenerators produce two clocks, C1 and C2, which feed master and slave latches. Two "freeze" input signals control the clocks C1 and C2, as shown in the timing diagrams of Fig. 8.11. The logic needed for DPM does not introduce any performance degradation, and it consumes approximately 0.3% of the total die area in the PowerPC. The DPM provides a power saving of 10-20%, depending on the application to be executed. The DPM can be implemented at either a high level (e.g., an execution unit) or a low level (e.g., a block inside a unit) of the hardware.
Static Power Management (SPM) permits the saving of power dissipation in the standby mode. In this case, the activity of the entire system is monitored rather than that of a specific unit (or block). When the system remains idle for a significant period of time, the entire chip is shut down. The SPM may have several modes, depending on whether the entire chip or only a part of it is shut down. For example, the PowerPC 603 has three modes, which are programmable through a hardware bit controlled by software (operating system). In this microprocessor, one mode is called sleep mode, which allows maximum power savings by disabling the clock to all units. In this mode, the PLL and the external input clock are disabled to bring the power dissipation down to the leakage level. The power of the PowerPC 603 in the sleep mode is as low as 1.8 mW [20].

¹ PowerPC 603 is from IBM Corp.

[Figure 8.11: timing diagrams of the clocks C1 and C2 and of the freeze signals of the clock regenerator.]
256 levels of the codebook. Each level requires 16 memory accesses to perform 16 subtractions, 16 multiplications, 15 additions, etc. Hence, a large number of primitive operations is needed. In the binary TSVQ already presented in Section 8.3.3, the codebook is organized into a tree structure as shown in Fig. 8.12. The input vector is compared with two code vectors at each node. Based on this comparison, one of the two branches is chosen, and the codebook search space is reduced compared to the full search, since a reduced number of code vectors (16) is utilized. For each comparison at a specific level, an index bit is generated as shown in Fig. 8.12. The process of comparison through the tree is repeated until a leaf node is reached. For a codebook of 256 levels, the tree has a depth of 8 (levels d = 0 to d = 7). Compared to the full search, the number of memory accesses and executed operations is reduced considerably, since only 16 code vectors are used in the TSVQ algorithm. One VLSI implementation of the TSVQ algorithm uses systolic arrays [22]. The number of computations can be further reduced by using the differential search of the TSVQ [see Equation (8.11)]. At each level (i) of the tree, the differential distortion between the left (vector a) and right (vector b) code vectors connected to the level (i − 1) is computed. Therefore, the number of operations is reduced. Table 8.1 [15] shows the computation complexity of the three methods of the VQ. The differential TSVQ results in a lower number of operations to be executed for each type.

[Figure 8.12: binary tree structure of the TSVQ codebook, levels d = 0 to d = 7.]
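The tree search described above can be sketched as follows. The tree layout (level d holding 2**d nodes, each with a left and a right code vector) and the random contents are assumptions made for illustration; they are not the codebook of [15].

```python
import random

DEPTH = 8          # levels d = 0 .. 7
VECTOR_SIZE = 16


def mse(c, x):
    return sum((ci - xi) ** 2 for ci, xi in zip(c, x))


def random_vector(rng):
    return [rng.randrange(256) for _ in range(VECTOR_SIZE)]


def build_random_tree(rng):
    """Hypothetical tree: level d holds 2**d nodes with two code vectors each."""
    return [[(random_vector(rng), random_vector(rng)) for _ in range(2 ** d)]
            for d in range(DEPTH)]


def tsvq_search(tree, x):
    """Walk from the root to a leaf, producing one index bit per level."""
    node = 0
    compared = 0
    for level in tree:
        left, right = level[node]
        bit = 0 if mse(left, x) <= mse(right, x) else 1
        compared += 2                 # two code vectors read per level
        node = 2 * node + bit         # append the index bit
    return node, compared


rng = random.Random(2)
tree = build_random_tree(rng)
index, compared = tsvq_search(tree, random_vector(rng))
tsvq_accesses = compared * VECTOR_SIZE          # 16 vectors x 16 pixels
full_search_accesses = 256 * VECTOR_SIZE        # 256 vectors x 16 pixels
```

The access counts computed at the end correspond to the first two memory-access entries discussed for Table 8.1.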
8.4.2
Minimizing the switching activity at a high level is one way to reduce the power dissipation of digital processors. This can have an influence on the power reduction, especially when the switching signals have a large capacitance. One method to minimize the switching activity at the algorithmic level is to use an appropriate coding for the signals rather than straight binary code.
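Table 8.1  Computation complexity of the three VQ methods [15].

Algorithm            Memory Access    Multiplication    Add/Subtract
Full search          4096             (not legible)     8448
TSVQ                 256              (not legible)     520
Differential TSVQ    136              (not legible)     136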
In [23], Gray coding has been used for the address lines of a microprocessor, for both instruction and data accesses, to reduce the switching activity of the nets. The advantage of Gray code over binary code is that Gray code changes by only one bit as it sequences from one number to the next. In other words, if the memory access pattern is a sequence of consecutive addresses, then each memory access changes only one of the address bits. Due to instruction locality, most of the memory accesses during program execution are sequential. Therefore, the Gray code eliminates the simultaneous switching of a significant number of bits. Table 8.2 shows a comparison of the 3-bit representation of the binary and Gray codes. Note that the Gray code has only one transition for each sequential change.
Table 8.2  Binary and Gray-code representation.

Binary Code    Gray Code    Decimal Equivalent
000            000          0
001            001          1
010            011          2
011            010          3
100            110          4
101            111          5
110            101          6
111            100          7
In [23], the switching property of the address coding was measured using the number of bit switches per executed instruction. For instruction accesses, the Gray and binary codings were compared using benchmark programs. The maximum reduction in bit switches was found to be as high as 58%, and the average reduction was equal to 37%. The same study was also carried out for data addresses, where the average reduction of bit switches was 8%.
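A minimal sketch of the bit-switch count for a run of sequential addresses is given below. It uses the standard binary-reflected Gray code; the helper names are hypothetical, and the 3-bit address range is chosen only to mirror Table 8.2.

```python
def binary_to_gray(n):
    """Binary-reflected Gray code of a non-negative integer."""
    return n ^ (n >> 1)


def bit_switches(a, b):
    """Number of bit positions that differ between two codes."""
    return bin(a ^ b).count("1")


def total_switches(addresses, encode=lambda a: a):
    """Sum of bit switches on the address lines over an access sequence."""
    codes = [encode(a) for a in addresses]
    return sum(bit_switches(p, q) for p, q in zip(codes, codes[1:]))


sequential = list(range(8))                       # consecutive 3-bit addresses
binary_total = total_switches(sequential)
gray_total = total_switches(sequential, binary_to_gray)
print(binary_total, gray_total)
```

For purely sequential accesses the Gray-coded bus toggles exactly one line per access, whereas the binary bus toggles several lines whenever a carry propagates.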
8.5
Power estimation refers, in general, to the techniques of estimating the average power dissipation of circuits. The goal of this section is to present an overview of power analysis techniques and tools at the circuit, gate, architectural, and behavioral levels of abstraction. Measuring the power consumption is critical for low-power design, as it permits the designer to optimize power, meet requirements, and know the power distribution through the chip.
8.5.1
Circuit-Level Tools
The most straightforward method of power estimation is circuit simulation: perform a circuit simulation of the design and measure the average current drawn from the supply, from which the average power can be estimated. The disadvantage of this approach, also called dynamic³ power simulation, is that the results are strongly dependent on the input patterns to the circuit (pattern-dependent technique). If the circuit has a large number of inputs, then the circuit simulation would be time consuming and even impractical. The most accurate power simulator to date is still SPICE. However, it can handle only very small circuits (e.g., hundreds of transistors). SPICE accurately takes into account non-linear capacitances (junction and gate), which cannot be captured by higher level tools. Also, it accurately measures short-circuit and leakage currents. The latter is very important for low-$V_T$ applications. SPICE cannot be used to estimate the power of large circuits or chips, due to the time-consuming nature of the simulator. It is a pattern-dependent power analysis tool.
³ Dynamically computed power should not be confused with dynamic power.
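The step from a simulated supply current to an average power figure can be sketched as follows. The waveform samples are hypothetical, and the trapezoidal integration is an assumption of this example rather than a description of what SPICE does internally.

```python
def average_power(vdd, times, currents):
    """Average supply power from sampled supply current.

    Integrates the current with the trapezoidal rule to obtain the charge,
    then multiplies the average current by the supply voltage.
    """
    if len(times) != len(currents) or len(times) < 2:
        raise ValueError("need at least two matching samples")
    charge = 0.0
    for k in range(1, len(times)):
        dt = times[k] - times[k - 1]
        charge += 0.5 * (currents[k] + currents[k - 1]) * dt
    return vdd * charge / (times[-1] - times[0])


# Hypothetical waveform: a triangular 1 mA current pulse lasting 2 ns
# inside a 20 ns clock period, with a 3.3 V supply.
times = [0.0, 1e-9, 2e-9, 20e-9]
currents = [0.0, 1e-3, 0.0, 0.0]
p_avg = average_power(3.3, times, currents)
```

Because the result depends entirely on the current waveform, a different input pattern would give a different figure, which is the pattern dependence noted above.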
Another transistor-level power simulator/analyzer is PowerMill⁴ [24]. It applies an event-driven simulation algorithm to increase the computation speed by two to three orders of magnitude over SPICE, with an acceptable level of accuracy (within 10%). Also, it uses table lookup to determine the terminal current of the device from the applied voltages.
PowerMill can also identify the hot spots (which consume more dynamic power) and trouble spots (which consume unexpectedly large amounts of leakage current). Moreover, elements with excessive short-circuit current are detected. This allows the designer to resize the circuit to reduce the rise/fall time. Static reduced-swing nodes are detected, as shown in the example of Fig. 8.13. The node A is charged to $V_{DD} - V_T$ when the input is low.
Another approach for power estimation is the use of statistical techniques. The work in [25] suggested the use of Monte Carlo simulation to estimate the total average power of the circuit. Basically, this statistical technique is based on applying randomly generated input patterns at the primary inputs and monitoring the convergence of the power dissipation. The simulation is stopped when the measured power is close enough to the true average power. This approach, based on the Monte Carlo method, requires simulation over a large number of measurements. The advantage of the statistical techniques is that they can be built around existing simulation tools.
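The idea can be sketched on a toy example. Everything below is an assumption made for illustration: the two-gate circuit, the zero-delay evaluation, the node capacitances, and the stopping rule based on a normal confidence interval over batch averages. It is not the procedure of [25].

```python
import math
import random


def toy_circuit(a, b, c):
    """Hypothetical two-gate network; returns the node values (n1, y)."""
    n1 = a & b
    y = n1 | c
    return n1, y


def batch_power(rng, cycles, caps, vdd, freq):
    """Average power over one batch of random input vectors (zero delay)."""
    prev = toy_circuit(*(rng.getrandbits(1) for _ in range(3)))
    energy = 0.0
    for _ in range(cycles):
        cur = toy_circuit(*(rng.getrandbits(1) for _ in range(3)))
        for p, q, cap in zip(prev, cur, caps):
            if p == 0 and q == 1:          # a charging transition draws C*VDD^2
                energy += cap * vdd ** 2
        prev = cur
    return energy * freq / cycles


def monte_carlo_power(caps, vdd, freq, seed=1, rel_error=0.02, z=1.96,
                      min_batches=30, max_batches=100000):
    """Add batches until the confidence half-width is below rel_error * mean."""
    rng = random.Random(seed)
    samples = []
    while len(samples) < max_batches:
        samples.append(batch_power(rng, 50, caps, vdd, freq))
        n = len(samples)
        if n >= min_batches:
            mean = sum(samples) / n
            var = sum((s - mean) ** 2 for s in samples) / (n - 1)
            if z * math.sqrt(var / n) <= rel_error * mean:
                return mean, n
    return sum(samples) / len(samples), len(samples)


caps = (10e-15, 10e-15)                     # assumed 10 fF per node
estimate, batches = monte_carlo_power(caps, vdd=3.3, freq=50e6)
```

The stopping rule never sees the true average; it only checks that the running estimate has stabilized, which is why such estimators can wrap any existing simulator.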
⁴ PowerMill is from EPIC Design Technology.
8.5.2
Gate-Level Techniques
In order to overcome the shortcomings of power analysis tools at the circuit level, several gate-level estimation tools have recently been proposed. In this section, we present two techniques for power estimation at the gate level. The first approach relies on the probabilistic method, while the second one is based on event-driven simulation.
8.5.2.1
The power dissipation can be analyzed using a pattern-independent approach when the signals are represented with probabilities (also called static techniques). This approach overcomes the shortcomings of simulation-based techniques. The user supplies the probabilities of the primary inputs to a logic network. The average power dissipation of a logic network is estimated as

$$P = V_{DD}^2 f \sum_{i=1}^{N} \alpha_i C_i \qquad (8.12)$$

where N is the number of nodes in the network, each with a total physical capacitance $C_i$. $\alpha_i$ is the switching activity (also called the transition probability, $P_t$), given by

$$\alpha_i = P_i (1 - P_i) \qquad (8.13)$$

where $P_i$ is the probability that node i is at the high level. In this expression of the activity, it is assumed that the circuit inputs and internal nodes are independent (spatial independence). Also, the values of the same signal in two consecutive clock cycles are assumed independent (temporal independence).
If the input probabilities to a network are provided, then they are propagated through the circuit to evaluate the transition probability at each node. For example, for a 2-input AND function $y = x_1 \cdot x_2$, the probability of the output being at the high level is given by $P_y = P_{x_1} \cdot P_{x_2}$. The computation of the probabilities for different gates is discussed in Chapter 4.
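A minimal sketch of this propagation, under the spatial and temporal independence assumptions stated above, is shown below. The small network y = (x1·x2) + x3 and the 10 fF node capacitances are hypothetical.

```python
import math


def p_and(*probs):
    """Output-high probability of an AND gate with independent inputs."""
    return math.prod(probs)


def p_or(*probs):
    """Output-high probability of an OR gate with independent inputs."""
    return 1.0 - math.prod(1.0 - p for p in probs)


def activity(p_high):
    """Switching activity of a node, Equation (8.13)."""
    return p_high * (1.0 - p_high)


def average_power(vdd, freq, nodes):
    """Equation (8.12); nodes is a list of (P_high, capacitance in farads)."""
    return vdd ** 2 * freq * sum(activity(p) * c for p, c in nodes)


# Hypothetical network: n1 = x1 AND x2, y = n1 OR x3, all inputs at P = 0.5.
p_n1 = p_and(0.5, 0.5)
p_y = p_or(p_n1, 0.5)
power = average_power(3.3, 50e6, [(p_n1, 10e-15), (p_y, 10e-15)])
```

No input vectors are simulated at any point: the estimate follows from the input probabilities alone, which is what makes the method pattern independent.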
One tool (LTIMES), based on probabilities, was first proposed in [26]. In this work, the temporal and spatial independence of the signals is assumed. In practice, the signals may be correlated. Also, a zero-delay model was assumed, which leads to an error in estimating the power, since the glitching power is not accounted for.
Probabilistic power estimation approaches that compute the power due to glitches and apply a real delay model have been proposed [27, 28]. In [27], the switching activity computation is based on the transition density. The assumption made in [27] is the spatial independence of the signals. A power estimator tool based on the transition density has been called DENSIM. The transition density of a node is defined as the average number of nodal transitions per unit time. If y is a boolean function with inputs $x_i$, then the boolean difference of y with respect to $x_i$ is defined by

$$\frac{\partial y}{\partial x_i} = y|_{x_i=1} \oplus y|_{x_i=0} \qquad (8.14)$$

It was shown in [29] that if the $x_i$ are spatially independent, then the density of the boolean function is given by

$$D(y) = \sum_{i} P\left(\frac{\partial y}{\partial x_i}\right) D(x_i) \qquad (8.15)$$

where $P(x)$ is the equilibrium probability of the signal x over time. Equations (8.14) and (8.15) are used to propagate the density through the boolean network. $\partial y/\partial x_i$ is one if a transition at $x_i$ will cause a simultaneous transition at y. As an example, consider the case of a 2-input AND gate with $y = x_1 \cdot x_2$. In this case, $\partial y/\partial x_1 = x_2$ and $\partial y/\partial x_2 = x_1$, so that $D(y) = P(x_2) D(x_1) + P(x_1) D(x_2)$. Hence, from the probability and density values at the primary inputs of a logic network, the density at the output can be computed. The boolean differences of a logic network are calculated using Binary Decision Diagrams (BDDs) [30].
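The density propagation can be sketched for a single gate as follows. For brevity this example evaluates the boolean difference by brute-force enumeration of the other inputs instead of BDDs, which is a substitution made here and only practical for small gates; the input probabilities and densities are hypothetical.

```python
from itertools import product


def boolean_difference_probability(func, probs, i):
    """P(dy/dx_i = 1) for independent inputs with P(x_k = 1) = probs[k]."""
    others = [k for k in range(len(probs)) if k != i]
    total = 0.0
    for values in product((0, 1), repeat=len(others)):
        x = [0] * len(probs)
        weight = 1.0
        for k, v in zip(others, values):
            x[k] = v
            weight *= probs[k] if v else 1.0 - probs[k]
        x[i] = 0
        y0 = func(*x)
        x[i] = 1
        y1 = func(*x)
        if y0 != y1:                  # a transition at x_i reaches the output
            total += weight
    return total


def transition_density(func, probs, densities):
    """Output density as the weighted sum of the input densities."""
    return sum(boolean_difference_probability(func, probs, i) * d
               for i, d in enumerate(densities))


def and2(x1, x2):
    return x1 & x2


# Hypothetical inputs: P(x1) = 0.5, P(x2) = 0.25, D(x1) = 2, D(x2) = 4.
d_y = transition_density(and2, [0.5, 0.25], [2.0, 4.0])
```

For the AND gate the result reduces to P(x2)·D(x1) + P(x1)·D(x2), matching the worked example in the text.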
The factor 1/2 is added to account for the double transition per clock period.
This model, based on transition density, ignores the spatial correlation of the signals and computes, approximately, the power due to glitches. The work in [28] attempts to handle both spatial and temporal correlations. One disadvantage of the approach in [28] is that the use of BDDs for the whole circuit tends to limit the size of the network that can be analyzed.
The probabilistic techniques have the advantage that the user does not have to supply simulation patterns, and they are claimed to have fast computation time. However, they do not account for the internal power of the gates or for the static power dissipation. These techniques can be used, for example, as a fast power estimator for logic synthesis. They might also be suited for comparing various subsystem structures.
The dynamic power of each cell is computed by multiplying the number of power events (transition count) by the energy dissipation per transition event of a cell. This process is applied to all dynamic power vectors for a cell to obtain the total energy dissipated. The total dynamic power of a cell, over a certain time period, is equal to the total energy divided by the time period.
The static power vector is used to compute the leakage of a cell. Note that the static power of a cell is dependent on the logic state of the cell, as shown in Fig. 8.15. To compute the static power dissipation, the duration of the activation time of the corresponding static power vector is measured. A transition of a net signal may cause one static power vector to be activated and another vector to be deactivated. Vectors are time stamped upon activation and upon deactivation. The total length of time during which the vector is active is then found. The activation time of the static power vector is multiplied by the power dissipation value (per time unit) to obtain the static power of the vector. Again, the static power dissipation of all vectors associated with a cell instance is summed to derive the total power dissipation.
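The bookkeeping described in the last two paragraphs can be sketched as follows. The vector names, energies, leakage values, and time stamps are hypothetical, and averaging the static energy over the observation period is an assumption of this example.

```python
def dynamic_power(event_counts, energy_per_event, period):
    """Total energy of all dynamic power vectors divided by the time period."""
    energy = sum(count * energy_per_event[vector]
                 for vector, count in event_counts.items())
    return energy / period


def static_power(activations, power_per_vector, period):
    """Average static power from time-stamped vector activations.

    activations -- list of (vector, t_activated, t_deactivated)
    """
    energy = sum(power_per_vector[vector] * (t_off - t_on)
                 for vector, t_on, t_off in activations)
    return energy / period


period = 1e-6                                          # observation window (s)

# Hypothetical characterization data for one cell instance.
events = {"A_rise": 100, "A_fall": 100}                # transition counts
energies = {"A_rise": 2e-13, "A_fall": 1e-13}          # joules per event
p_dyn = dynamic_power(events, energies, period)

leakage = {"A=0": 1e-9, "A=1": 3e-9}                   # watts per logic state
stamps = [("A=0", 0.0, 0.6e-6), ("A=1", 0.6e-6, 1.0e-6)]
p_stat = static_power(stamps, leakage, period)
```

Both figures come from counting events and intervals against a precharacterized library, so no transistor-level simulation is needed once the cell data exist.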
The results reported by Aspen, such as the switching activity of nodes, can be used to drive floorplanning, placement, and routing tools. Also, Aspen can handle chips with a complexity of several hundred thousand gates and is four orders of magnitude faster than SPICE. It produces results within 10% of SPICE results. One disadvantage of Aspen is that it cannot handle the power due to glitches.
The power of the on-chip memory is modeled for a certain memory architecture. The interconnections are divided into local interconnections, intermediate interconnections, and global busses. The local interconnections are defined as interconnections within a logic gate. The intermediate interconnections are used for connections between gates or functional blocks (subsystems). The global busses include data, control, and address busses. The lengths of the local and intermediate interconnections are modeled by Rent's rule [34]. The power can then be computed from the lengths using a fixed switching activity equal to the one specified for the logic. The global interconnect is determined from the dimensions of the chip and the number of drivers/receivers connected to it. The power model of the clock network is based on the H-tree [34] and the chip dimensions. The power of the on-chip drivers is also modeled in two components. One is the power used to drive the total off-chip capacitance. The other is the power consumed by the pad driver itself. The activity factor for the pads is assumed fixed and equal to 1 [33].
The tool developed in [33] is used as a power estimator in the early stage of the design. It requires some technological parameters (feature size, gate oxide thickness, parameters of the interconnection layers, etc.), the supply voltage, the chip area, the switching factor, and the gate count. This tool can only be used as a rough estimator of the total power of the chip, since the switching activity is assumed fixed throughout the design. Therefore, the power partition between the different units can be incorrectly estimated.
where G is the number of logic gates composing the functional block, $a_i$ is the switching activity of the ith gate, $C_i$ is the load of the ith gate, $i_{sc,i}$ is the short-circuit component, and f is the frequency. This power equation can be expressed in a more compact form as

$$P_{avg} = \kappa G f \qquad (8.18)$$
where $\kappa$ is the PFA constant and can be related easily to Equation (8.17). G can also be looked at as the hardware complexity factor instead of a number of gates. The parameter $\kappa$ has different values for different blocks. For example, for an n-bit multiplier, the factor G can be approximately equal to $n^2$, as shown in Fig. 8.17. This is due to the number of adder cells in the multiplier. Then, we have

$$P_{mult} = \kappa_{mult}\, n^2 f_{mult} \qquad (8.19)$$
The power supply voltage is included in the parameter $\kappa$. This parameter is extracted empirically from measured or simulated power values at a fixed power supply voltage.
For a VLSI chip composed of several functional blocks, the total power dissipation can be determined by summing the power of all blocks. We have

$$P_{tot} = \sum_{i \in \text{all blocks}} \kappa_i G_i f_i \qquad (8.20)$$
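Equations (8.18) to (8.20) can be sketched as follows. All PFA constants below are hypothetical placeholders; in practice they would be extracted from measured or simulated power at a fixed supply voltage, as stated above.

```python
def pfa_block_power(kappa, complexity, freq):
    """Equation (8.18): power of one block from its PFA constant."""
    return kappa * complexity * freq


def multiplier_power(kappa_mult, n_bits, freq):
    """Equation (8.19): the complexity of an n-bit multiplier grows as n^2."""
    return pfa_block_power(kappa_mult, n_bits ** 2, freq)


def chip_power(blocks):
    """Equation (8.20): sum over blocks given as (kappa, G, f) tuples."""
    return sum(pfa_block_power(kappa, g, f) for kappa, g, f in blocks)


# Hypothetical PFA constants (W per unit complexity per Hz).
p_mult = multiplier_power(1e-12, n_bits=16, freq=50e6)
p_chip = chip_power([
    (1e-12, 16 ** 2, 50e6),   # 16 x 16 multiplier
    (2e-12, 16, 50e6),        # 16-bit adder, complexity taken as n
])
```

Each block carries its own constant, so a change in one block's characterization does not affect the estimate of the others.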
Thus, this PFA technique is based on modeling precharacterized functional blocks. Each block has a PFA factor independent of the others. Hence, this technique provides a more general methodology compared to the gate-equivalent model of Svensson and Liu discussed previously. The PFA factor is extracted using independent Uniform White Noise (UWN) inputs (i.e., random inputs). UWN inputs mean that the input bits are uncorrelated in space and time and independent of the data distribution. The signal and transition probabilities of each bit i of the input are given by

$$P_i(1) = 0.5 \quad \text{and} \quad P_i(0 \rightarrow 1) = 0.25 \qquad (8.21)$$

Consequently, this technique does not account for the strong dependency of power consumption on the statistics of the input data [36]. The next section treats the case of power modeling, taking into account the correlated behavior of the bits.

⁵ Without the static power dissipation.
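The UWN probabilities of Equation (8.21), and the way temporal correlation departs from them, can be observed with a short experiment. The first-order autoregressive source used for the correlated stream, its parameters, and the function names are assumptions of this sketch.

```python
import math
import random


def bit_statistics(words, bit):
    """Return (P(1), P(0->1)) of one bit position over a word stream."""
    bits = [(w >> bit) & 1 for w in words]
    p_one = sum(bits) / len(bits)
    rises = sum(1 for prev, cur in zip(bits, bits[1:])
                if prev == 0 and cur == 1)
    return p_one, rises / (len(bits) - 1)


def correlated_stream(rng, count, rho, sigma=1000.0, width=16):
    """Hypothetical autoregressive source coded in two's complement."""
    mask = (1 << width) - 1
    limit = (1 << (width - 1)) - 1
    value = rng.gauss(0.0, sigma)
    words = []
    for _ in range(count):
        noise = rng.gauss(0.0, sigma)
        value = rho * value + math.sqrt(1.0 - rho * rho) * noise
        clipped = max(-limit - 1, min(limit, int(round(value))))
        words.append(clipped & mask)      # two's complement encoding
    return words


rng = random.Random(0)

# Uniform white noise: every bit is random and uncorrelated.
uwn_words = [rng.getrandbits(16) for _ in range(20000)]
p_one, p_rise = bit_statistics(uwn_words, bit=3)

# Positively correlated data: the sign bit toggles far less often.
corr_words = correlated_stream(rng, 20000, rho=0.9)
_, msb_rise = bit_statistics(corr_words, bit=15)
_, lsb_rise = bit_statistics(corr_words, bit=0)
```

In this experiment the least significant bit of the correlated stream still behaves like white noise, while the sign bit does not, which is the distinction a data-dependent model has to capture.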
[Figure 8.18: transition probability P(0→1) versus bit position for two's complement data streams with temporal correlation ρ = −0.99, −0.80, −0.60, 0.0, 0.60, 0.80, and 0.99.]
Fig. 8.18 shows the transition activity of several different two's complement data streams versus the bit position (for an n-bit word). In this figure, each curve corresponds to a different temporal correlation given by

$$\rho = \frac{cov(X_{t-1}, X_t)}{\sigma^2} \qquad (8.22)$$
where $X_{t-1}$ and $X_t$ are successive data (in time) and $\sigma^2$ is the variance. $\rho = 0$ corresponds to the white noise case, where $P(0 \rightarrow 1) = 0.25$. From Fig. 8.18 it is evident that the UWN model, while sufficient for describing the activity in the Least Significant Bits (LSBs), is inadequate for the Most Significant Bit (MSB) region. The UWN model works correctly for the LSBs up to the break point BP0. The MSB region corresponds to the sign bits and, consequently, the signal and transition probabilities of these bits are far from random. $\rho > 0$ corresponds to a lower activity for positively correlated signals, while $\rho < 0$ corresponds to a higher activity for negatively correlated signals. The MSB region starts from the break point BP1. The region between BP0 and BP1 can be modeled by linear interpolation. BP0 and BP1 can be determined from the word-level statistics [37].

The power estimation of the architecture modules is based on a black-box technique of the switched capacitance. Typical modules are adders, multipliers, shifters, RAMs, ROMs, etc. The power dissipation is modeled for each module by

$$P = C V_{DD}^2 f \qquad (8.23)$$

where the switched capacitance C is related to the complexity and the activity of the module. For example, for an n-bit ripple-carry subtractor, the switched capacitance is modeled by

$$C = C_{eff}\, n \qquad (8.24)$$

where $C_{eff}$ is a capacitive coefficient (in fF/bit) determined from the DBT model. $C_{eff}$ can be a single coefficient for the UWN case. The DBT model employs several coefficients for $C_{eff}$, which reflect the data representation and signal statistics. For the case of the subtractor, for example, a table of $C_{eff}$ is generated as a function of all possible data transitions, i.e., sign-bit transitions and random LSB transitions.
To extract the capacitive coefficients of each module, the library should be characterized. This operation is performed once for each library. The process of extraction consists of several steps:

■ Pattern generation. Input patterns to a module are generated based on the DBT data model. Both random (UWN) and sign data streams should be used. The input patterns containing the UWN component must be simulated for several cycles. This allows convergence of the average capacitance.

■ Simulation. The generated patterns are fed to a simulator (such as a circuit simulator) from which the switching capacitances are extracted.

■ Capacitive coefficient extraction. The simulation step produces the average effective switching capacitances for the entire series of applied input transitions, such as U → U, S → S, etc. The capacitive coefficients are extracted from the effective switching capacitances and the complexity parameters.

Based on this methodology, a power analysis tool at the architectural level has been developed [36].
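The last extraction step and the use of the resulting coefficient in Equations (8.23) and (8.24) can be sketched as follows. The least-squares fit through the origin is an assumption of this example (the text does not state how the coefficients are fitted), and the capacitance values are hypothetical simulation results.

```python
def extract_ceff(measurements):
    """Least-squares fit of C = C_eff * n through the origin.

    measurements -- list of (word length n, average switched capacitance in fF)
    """
    numerator = sum(n * c for n, c in measurements)
    denominator = sum(n * n for n, _ in measurements)
    return numerator / denominator


def module_power(ceff_ff, n_bits, vdd, freq):
    """Equations (8.23) and (8.24) with C_eff given in fF/bit."""
    capacitance = ceff_ff * 1e-15 * n_bits
    return capacitance * vdd ** 2 * freq


# Hypothetical simulated capacitances for 8-, 16-, and 32-bit subtractors
# under UWN inputs.
ceff = extract_ceff([(8, 400.0), (16, 800.0), (32, 1600.0)])
p_sub = module_power(ceff, n_bits=16, vdd=3.3, freq=50e6)
```

In the DBT setting the same fit would be repeated for each transition type, giving a table of coefficients instead of the single UWN value used here.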
One approach for power estimation at the behavioral level has been proposed in [38]. It is based on the combination of analytical and stochastic power models. In this work, a class of applications such as real-time DSPs is considered for the power estimator. In the behavioral context, the power consumed by a hardware resource is given by

$$P = N_a C V_{DD}^2 f \qquad (8.25)$$

where $N_a$ is the number of accesses to the resource over the period of computation, C is the average capacitance switched per access, and f is the computation frequency.

In [38], the power of some hardware resources, such as execution units, registers, etc., is analytically modeled (using Equation (8.25)) from the Control/Data Flow Graph (CDFG), which is used to represent the design. The average capacitance switched per access for a particular hardware resource is estimated from the white noise data model. The power consumed by hardware resources such as controllers, interconnects, and the clock network is difficult to estimate. Statistics from a large number of realized chips are used to estimate the switched capacitance of these hardware resources.
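Equation (8.25) can be applied to a toy operation list standing in for a CDFG. The operation mix, the capacitance per access, and the operating point below are hypothetical values chosen only to show the bookkeeping.

```python
from collections import Counter


def resource_power(accesses, cap_per_access, vdd, freq):
    """Equation (8.25): power of one hardware resource type."""
    return accesses * cap_per_access * vdd ** 2 * freq


def behavioral_power(operations, cap_table, vdd, freq):
    """Sum Equation (8.25) over the resource types used by the operations."""
    counts = Counter(operations)          # N_a per resource type
    return sum(resource_power(n, cap_table[op], vdd, freq)
               for op, n in counts.items())


# Hypothetical operations executed in one computation period.
operations = ["mul", "mul", "add", "add", "add"]
cap_table = {"mul": 2.0e-12, "add": 0.2e-12}   # farads switched per access
total = behavioral_power(operations, cap_table, vdd=1.5, freq=1e6)
```

Resources whose access counts cannot be read off the graph, such as controllers and interconnect, are the ones the text says must be estimated statistically.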
8.6
CHAPTER SUMMARY
Low dynamic power techniques at several levels of abstraction have been presented. Algorithmic and architectural decisions can influence the power dissipation of a circuit by orders of magnitude. Therefore, CAD tools that help the designer analyze the power of the circuit at these levels are needed. At lower levels of the design, the power reduction techniques offer some savings, but less than those expected at higher levels. Several power estimation tools have been discussed at the different levels of the design. Keep in mind that circuit simulators provide high accuracy for power analysis and take into account all power components.
REFERENCES
[Z] H. V8ishnav and M. Pedram, "PCUBE A Performance Driven Placement Algorithm for Lower Power Designs," Proc. of the EURO-DAC'93, pp.7277, September 1983. [3] A. Shcn, A. Ghosh, S. Devadar, and K. Keutaer, "On Average Power Dissipation and Random Pattern Testability of CMOS Combinational Logic Network," Proc. of the International Conference on Computer-Aided Design, pp. 402-401, November 1992. [4] K. Keutaer, "The Impact of CAD on the Design of Low Power Digital Circuits." IEEE Symposinm on Low Power Electronics, Tech. Dig., pp. 4245, October 1994. [5] GY. Tsui, M. Pedram, and A. M. Despain, "Technology Decomposition and Mapping Targeting Low Power Dissipation," 30th ACMfIEEE Dcsign Automation Conference, Tech. Dig., pp.68-T3, June 1993. [6] R. Murgai, R. K. Brayton, and A. Sangiovanni-VinEente, "Deeomposition of Logic Functions for Minimum Transition Activity," Proe. of the International Workshop on Low Power Design, pp. 33-38, A p d 1994.
[TI
"Technology Mapping for Low Power." 30th ACMfIEEE Design Antomation Conference, Tech. Dig.,pp.74-79, Jrme 1993.
[a] K.
Scott and K. Keutsc., "Improving Cell Libraries for Synthesis," IEEE Custom Integrated Circuits Conference, Tech. Dig., pp. 128-151, May 1994.
[9] C. Lemonds and S. Mhhant Shetti, "A Low Power 16 by 16 Multiplier using Transition Reduction Circuitry," Proe. of the International Workshop on Low Power Design, pp. 139-142, April 1994.
524
LOW-POWER DIGITALVLSI
DESIGN
A. Chandrakasan, S. Sheng, and R. W. Brodcrren, '%w-Power CMOS Design," IEEE Journal of Solid-state Circuits, "01. 27,no. 4, pp. 472-484, A p d 1992. U. KO,P. T. Balsam, and W. Lee, '"A Self-timed Method to Mlnimiie Spurious Trannitionr in Low Power CMOS Cixcuit.," IEEE Symposium on Low Power Electronics, Tech. Dig., pp. 62-63,October 1994.
[I21 R. I. Bahar, H.Cho. 0 . D. Hachtcl, E. Mac", and F. Somenzi. "An Application of ADD-Based Timing Analysis to Combinational Low Power ReSynthesis," Proe. of the International Workshop on Low Power Design, pp. 139-142. April 1994.
[I31 M. Alidins, 1. Montiero. S. Devadar, A. Ghosh, and M. Papaefthmiou, "Precomputing-Based Sequential Logic Optimization for Low-Power," IEEE lhnsactionr on Very Large Scale Integration Systems, vol. 2, no. 4, pp. 426-436, December 1994. 1141 A. Ghersho, and R. Gray, "Vector Qusntisation and Signal Compression,' Khwer Academic Pubhhers, MA, 1992.
[I51 D. B. Lidrky, and J. M. Rabaey, "Low-Power Design of Memory Intensive Functions," IEEE Symposium on Low Power Electronic-, Tech. Dig., pp.
16-11. October 1994.
[16] A. P. Chnndrskasan, A. Burstein, and R. W. Brodersen, "A Low-Power Chipset for B Portable Multimedia I/O Terminal," IEEE Jonrnal of SolidState Circuits, "01. 29, no. 12, pp. 1415-1428. December 1994.
[I71 J. Sfhut., *A 3.3 V 0.6 p m HiCMOS Superscalar Microprocessor," IEEE International Solid-State Cholits Conf., Tech. Dig., pp. 202203,Febiuary 1994.
[I81 N. K. Yeung, Y-H. Sutu. T. Y-F. Su, E. T. Pat, C-C Chao, S. Akki, D. D. Yau, and R. Lodenquai. "The Design o f a SSSPECint92 RISC Processor under ZW," IEEE International Solid-state Circuits Conference, Tech Dig., pp. 206-207, February 1994.
[19] D. Pham, et s l . , "A 3.0W 75SPECint92 85SPECfp92 Superscalar RISC," IEEE International Solid-state Circuits Conference. Tech. Dix., DO. 212213. February 1994
[ZO] G. Gerora, et al., "A 2.2 W 80 MHz Superscalar RISC Microprocessor." lEEE Journal of Solid-State Circuits, vol. 29, no. 12, pp. 1440-1454, De-
cember 1994.
REFERENCES
525
[XI S. Gary, C. Diete, J. Eno, G. Geross, S. Park, and H. Sanches. "The PoaerPC 603 Microprocessor: A Low-Pow- Design for Portable Apphtiom," Proc. of COMPCON'94, Tech. Dig., pp. 307-315, February 1994.
[22] R. K. Kolagotla, S-S. Yu, and J. F. Jda, "VLSI Implementation of a 'Itee Searched Vector Quantieer," IEEE Transactions on Signal Processing, "01. 41, no. 2, pp. 901-905, February 1993.
[23] C-L. Su, C-Y. Tsui, and A. M. Derpain, "Low Power Aichitecture Design and Compilation Techniques foz High-Performance Processors," Proceedings of COMPCON'OI, Tech. Dig., pp. 489-498, Februsry 1994.
[24] A-C Deng, "Power Analysis for CMOS/BiCMOS Circuits." Proe. of the International Workshop on Low Pow- Design, pp. 3-8, A p d 1994.
[25] C. M. Emher, "Power Dkipation Andyysk of CMOS VLSI Circaits by Means of Switch-Level Simulation," Proc.of the European Solid-state Circuits Conference,pp. 61-64, 1990.
1261 M. A. Cirit, "Estimating Dynamic Power Consumption of CMOS Circuits," IEEE International Conference on Computer Aided Design, pp. 534537, November 1987.
[27]F. Najm, I. Hai,and P. Yang, *An extension of Probabilistic Simulation for Reliability Andy& of CMOS VLSI Circnits," 28th ACMjIEEE Design Automation Conference, Tech. Dig., pp. 644649, June 1991.
[28] A. Ghosh, S. Devadas, K. Keutser, and J. White, 'Estimation of Average Switching Activity in Combinational and Sequential Circuits," 29th ACM/IEEE Design Automation Conference, Tech. Dig., pp. 253-259. June 1992. [29] F. N. Najm, '"A Survey of Power Estimation Techniques in VLSI Circuits," IEEE Transactions on Very Large Scale Integration Systems. vol. 2, no. 4, pp. 446-455, December 1994. [30] R. E. Bryant, "Graph-Baaed Algorithms For Boolean Function Manipulation," IEEE Tmnsaetiona on Computer-Aided Design, pp. 677-691, Augort 1986. [31] B. J. George, G. Yeap, M. G. Wloka. S. C. Tyle., and D. GossCn, "Power Analysis for Semi-custom Design," IEEE Custom Integrated Circuits Conference, Tech. Dig., pp. 249-252, 1994.
526
[32] B. J. George, G. Yeap, M. G. Wloka, S. C. Tyler, and D. Goss&, "Power Analysis and Characteridion for Semi-Custom Design," Proc. of the Int e r n s t i o d Workshop on Low Power D e s i g n ,pp. 215-218, April 1934. 1.331 D. Lui, and C. Svensron, "Power Conramption Estimation in CMOS VLSI Chips,' IEEE Journal of Solid-state Circuits, uol. 29, no. 6, pp. 663-610, June 1994.
[34] A. B. Bakoglu, "Circuits, Interconnects, and Packaging for VLSI," Addison-Wesley, Rcading, MA, 1990.
[35] S. R. Powell and P. M. Chm, 'Estimating Power Dissipation o f VLSI Signal Processing Chips: The PFA Technique," VLSI Signal Procesing N.pp. 250-259, 1990.
1361 P. E. Landman, and J. M. Rabaey, "Power Estimation for High Level Synthesis," EDAGEUROASIC, Paris, Rance, pp. 361-366,February 1993.
[37] P. E. Landman, and J. M. Rahaey, "Bla&-Box Capacitance Models for Architectural Power Analysis," Proceedings of the International Workshop on Low Power Design, N a p , CA, pp. 165-170,A p d 1994.
[38] R. Mehra and J. Rabaey, "Behavioral Level Power Estimation and Exploration," Proceedings of the International Workshop on Low Power Design, Napa, CA, pp. 191-202, April 1994.
INDEX
Absolute value calculator, 454
Adders
  carry lookahead, 412
  carry select, 420
  comparison, 425
  conditional sum, 423
  Manchester, 412
  ripple carry, 410
Address transition detection, 332
Adiabatic computing, 249
ALU, 451
Arithmetic logic unit, 451
Array multiplication, 429
ATD, 332
AVC, 454
Back-bias generator, 373
Barrel shifter, 456
BiCMOS
  applications, 299
  BiNMOS logic, 272
  bootstrapped, 288
  CEBiCMOS, 285
  comparison, 294
  complementary technology, 43
  complementary, 283
  conventional gate, 257
  delay analysis, 262
  DSP, 303
  gate array, 304
  low-voltage families, 280
  merged, 281
  power dissipation, 266
  process, 36
Bidirectional I/O, 229
BiNMOS
  family, 272
  gate design, 274
  logic gates, 277
  p-transistor, 299
Bipolar
  Ebers-Moll model, 94
  Gummel-Poon model, 101
  high current effects, 99
  high level injection, 101
  Kirk effect, 99
  knee current, 101
  structure, 91
  technology, 21
  transit time, 105
  Webster effect, 99
Bird's beak, 30
Body effect, 66
Boosted voltage generator, 377
Booth multiplier, 434
Bootstrapped BiCMOS, 288
BSIM model, 77
Buffer sizing, 221
By-pass capacitance, 235
CAM, 470
Capacitance
  estimation, 138
  fringing, 144
  gate, 83
  input, 139
  junction, 82
  MOS, 82
  parasitic, 141
  wiring, 143
CBiCMOS, 283
CEBiCMOS, 285
Channel length modulation, 75
Charge pump, 373
Charge sharing, 180
Clock buffers, 226
Clock distribution, 224
Clock skew, 187, 474
Clock tree, 226
Clocked CMOS, 183
Clocking
  single-phase, 198
  strategy, 188
  two-phase, 202
CMOS scaling, 89
CMOS
  complex gate, 149
  CPL, 203
  delay, 124
  domino, 177
  DPL, 207
  dynamic, 177
  full-adder, 171
  inverter, 116
  layout, 161
  NORA, 183
  power dissipation, 129
  process technology, 14
  pseudo-NMOS, 176
  SRPL, 210
  transmission gate, 169
  Zipper, 183
Column decoder, 332
Comparator, 455
Complementary BiCMOS, 283
Complementary pass-transistor logic, 203
Compressor, 442
Content addressable memory, 470
Control unit, 451
CPL, 203
Current gain, 97
Digital signal processor, 303
Distributed processing, 502
Domino logic, 177
DPL, 207
DRAM, 356
  access time, 359
  architecture, 359
  back-bias generator, 373
  boosted voltage generator, 377
  cell, 359
  charge pump, 373
  decoder, 366
  half-voltage generator, 371
  hierarchical word-line, 370
  low-voltage, 381
  refresh, 377
  sense amplifier, 367
DSP, 303
Dual pass-transistor logic, 203
Dynamic logic, 177
Early
  effect, 89
  voltage, 99
Ebers-Moll model, 94
Edge-triggered D-flipflop, 194
Fanin, 146
Fanout, 146
Flipflop, 194
Floorplanning, 490
Frequency divider, 482
Full-adder, 171
Full-custom design, 165
Gate array, 166, 304
Glitches, 160, 493
Ground bounce, 233
GTL, 236
Gummel-Poon model, 101
Gunning I/O, 236
Half-voltage generator, 371
High level injection, 101
HSPICE
  bipolar parameters, 105
  MOS parameters, 77
I/O circuits, 214
Input pad, 214
Isolation, 27
JK flipflop, 197
Kink effect, 62
Kirk effect, 99
Latch, 190
  dynamic, 191
  hold time, 190
  setup time, 190
  static, 190
Leakage current, 130
Lightly doped drain, 17
Local oxidation of silicon, 28
LOCOS, 28
Low-power
  algorithmic-level, 507
  architecture-level, 498
  circuit techniques, 239
  CMOS technology, 17
  DRAM, 364
  gate-level, 490
  layout guidelines, 165
  physical design, 489
  reference voltage generator, 399
  SRAM, 330
Low-voltage
  CMOS technology, 20
  DRAM, 381
  MOS model, 84
  SRAM, 352
  TTL, 215
MBiCMOS, 281
Memory
  DRAM, 356
  ROM, 467
  SRAM, 313
Merged BiCMOS, 281
Minimum power supply, 123
Mobility model, 74
MOS SPICE Models, 69
MOS1 model, 72
MOS3 model, 73
Multi-threshold voltage technique, 242
Multiplexer, 171
Multipliers
  Baugh-Wooley, 432
  Braun, 429
  comparison, 450
  modified Booth, 434
  Wallace, 442
N-well process, 14
Noise margin, 121
NORA logic, 183
Output buffer, 229
Output pad, 227
Parallel adders, 409
Parallelism, 498
Pass-transistor logic
  complementary, 203
  conventional, 169
  dual, 203
  swing restored, 203
Phase locked loop, 473
Pipelining, 500
PLA, 462
Placement and routing, 490
PLL, 473
  charge pumped loop, 414
  filter, 479
  phase frequency detector, 476
  voltage controlled oscillator, 479
Power dissipation
  components, 129
  dynamic, 132
  estimation, 510
  internal, 152
  measurement, 138
  short-circuit, 135
  static, 130
Power management, 505
Precharge transistor, 178
Precomputation, 496
Probabilistic power estimation, 512
Programmable logic array, 462
Pseudo-NMOS, 176
QCBiCMOS, 282
Quasi-complementary BiCMOS, 282
Race, 493
RAM
  dynamic, 356
  static, 313
Read only memory, 467
Reference voltage generator, 395
Register file, 458
Register transfer level, 498
Register, 194
Regular structures, 460
ROM, 467
Row decoder, 332
RTL, 498
RVG, 395
Scaling, 89
Schmitt trigger, 218
Self-reverse biasing, 239
Semi-custom design, 165
Sense amplifier, 339
Shifter, 456
Silicon On Insulator, 52
SOI SIMOX, 52
SOI, 52
SPICE, 510
Spurious transition, 160, 412, 493
SRAM, 313
  address access time, 315
  architecture, 315
  ATD, 332
  bitline precharge, 337
  cell, 318
  column decoder, 332
  divided word-line, 348
  equalizing, 327
  hierarchical word decoding, 350
  low-voltage, 352
  output latch, 347
  read cycle time, 315
  read/write circuitry, 324
  row decoder, 332
  sense amplifier, 339
SRPL, 210
Standard-cell, 165
Subthreshold current, 86
Swing restored pass-transistor logic, 203
Switching activity, 152
Technology mapping, 491
TFT, 323
Thin film transistor, 323
Threshold voltage, 66, 85
TLB, 470
Toggle, 197
Trench isolation, 31
TTL, 215
Video compression, 502
Voltage controlled oscillator, 479
Voltage down converter, 389
Voltage levels interface, 231
Voltage-controlled delay line, 482
VQ, 502
Wallace tree, 442
Webster effect, 99
Zipper CMOS logic, 183