Ajit Pal
Computer Science and Engineering
Indian Institute of Technology Kharagpur
Kharagpur
West Bengal
India
Preface

Several years ago, I introduced a graduate course entitled “Low Power VLSI Cir-
cuits and Systems” (CS60054) to our students at IIT Kharagpur. Although the
course became very popular among students, the lack of a suitable textbook was
sorely felt. To overcome this problem, I began to hand out lecture notes, which
were highly appreciated by the students. Over the years, those lecture notes have
gradually evolved into this book. The book is intended as a first-level course on
VLSI circuits for graduate and senior undergraduate students. While a basic course
on digital circuits is a prerequisite, no background in the area of VLSI circuits is
necessary to use this book. Each chapter is provided with an abstract and keywords
in the beginning and a chapter summary, review questions and references at the
end to meet pedagogical requirements of a textbook. This will help the students in
understanding the topics covered and also help the instructors while teaching the
subject. The book comprises the following 12 chapters covering different aspects
of the digital VLSI circuit design with particular emphasis on low-power aspects. A
chapter-wise summary of coverage is given below.
Chapter 1: Introduction
This chapter begins with the historical background that led to the development of
present-day VLSI circuits. In the next section, Sect. 1.2, the importance of low
power in high-performance and battery-operated embedded systems is highlighted.
Various sources of power dissipation are identified in Sect. 1.3. Low-power design
methodologies are introduced in Sect. 1.4.
The structure of various types of MOS transistors that can be obtained after fabri-
cation is presented in Sect. 3.1. In Sect. 3.2, the characteristics of MOS transistors are
explained with the help of the fluid model, which helps one understand the operation of
a MOS transistor without going into the details of device physics. Three different
modes of operation, namely accumulation, depletion, and inversion, are discussed
in Sect. 3.3. Electrical characteristics of MOS transistors are explained in detail in
Sect. 3.4. Use of MOS transistors as a switch is explored in Sect. 3.5.
Various sources of power dissipation in MOS circuits are presented in this chapter. It
begins with the explanation of the difference between power and energy. How short
circuit power dissipation takes place in CMOS circuits is explained and the expres-
sion for short circuit power dissipation is derived in Sect. 6.1. Switching power
dissipation in CMOS circuits is considered in Sect. 6.2, and an expression
for switching power dissipation is derived. The switching activity for different types of
gates is calculated, and that for dynamic CMOS circuits is highlighted. The expression
for power dissipation due to charge sharing is derived. Section 6.3 presents glitching
power dissipation along with techniques to reduce it. Sources of leakage power
dissipation, such as subthreshold leakage and gate leakage, are introduced, and
techniques to reduce them are presented in Sect. 6.4. Various mechanisms which
affect the subthreshold leakage current are also highlighted.
In this chapter various voltage scaling techniques starting with static voltage scal-
ing are discussed. The challenges involved in supply voltage scaling for low power
are highlighted. The distinction between constant-field and constant-voltage scaling
is explained in detail. First, the physical-level approach of device feature
size scaling to overcome the loss in performance is discussed in Sect. 7.1. The
short-channel effect arising out of feature size scaling is introduced. In Sect. 7.2,
architecture-level approaches such as parallelism and pipelining for static voltage
scaling are discussed. The relevance of multi-core architectures for low power is explained. Static
As multiple threshold voltages are used to minimize leakage power, various ap-
proaches for the fabrication of multiple threshold voltage transistors are first pre-
sented in Sect. 9.1. Variable threshold voltage CMOS (VTCMOS) approach for
leakage power minimization is discussed in Sect. 9.2. Transistor stacking approach
based on the stack effect to minimize standby leakage power is highlighted in
Sect. 9.3. How run-time leakage power can be minimized by using multiple-thresh-
old voltage (MTCMOS) approach is discussed in Sect. 9.4. Section 9.5 addresses
the power-gating technique to minimize leakage power and various issues related
to power-gating approaches are highlighted. How power management approach can
be used to reduce leakage power dissipation and how it can be combined with dy-
namic voltage scaling approach are explained. Isolation strategy is highlighted in
Sect. 9.6. State retention strategy is introduced in Sect. 9.7. Power gating control-
lers are discussed in Sect. 9.8. Power management techniques are considered in
Sect. 9.9. Dual-Vt assignment technique is introduced in detail in Sect. 9.10. De-
lay-constrained dual-Vt technique is presented in Sect. 9.11 and energy constrained
dual-Vt technique is considered in Sect. 9.12. Dynamic Vt scaling technique is in-
troduced in Sect. 9.13.
Section 10.1 introduces adiabatic charging which forms the basis of adiabatic cir-
cuits. The difference between adiabatic charging and conventional charging of a
capacitor is explained. As amplification is a fundamental operation performed by
electronic circuits to increase the current or voltage drive, adiabatic amplifica-
tion is presented in Sect. 10.2. The steps of realization of adiabatic logic gates are
explained and illustrated with the help of an example. Adiabatic logic gates are
introduced in Sect. 10.3. Realization of pulsed power supply, which is the most
fundamental building block of an adiabatic logic circuit is introduced in Sect. 10.4.
The realizations of both synchronous and asynchronous pulsed power supplies are
explained. How stepwise charging and discharging can be used to minimize power
dissipation is explained in Sect. 10.5. Various partially adiabatic circuits such as
efficient charge recovery logic (ECRL), positive feedback adiabatic logic (PFAL),
and 2N-2N2P are introduced and compared in Sect. 10.6.
This chapter discusses a few design techniques and proposes an architectural power
management method to optimize the battery lifetime and to obtain the maximum
number of cycles per recharge. Section 11.1 introduces the so-called battery gap,
which depicts the ever-increasing power requirement versus the actual rate of growth
of the energy density of battery technology. An overview of different battery
technologies is provided in Sect. 11.2. Section 11.3 introduces different characteristics of
a rechargeable battery. The underlying process of battery discharge is explained
in Sect. 11.4. Different approaches of battery modeling are briefly introduced in
Sect. 11.5. Realizations of battery-driven systems are presented in Sect. 11.6. As an
example of a battery-aware system, Sect. 11.7 presents battery-aware sensor net-
works.
This chapter introduces different software optimization techniques for low power.
Power-aware software does not require any additional hardware but performs
suitable optimization of the software to minimize the energy consumed during its
execution. The optimization techniques can be broadly classified into two categories:
machine independent and machine dependent. Machine-independent optimization
techniques do not depend on the processor architecture and can be used for any
processor. Various software optimization techniques to reduce power consumption
without any change in the underlying hardware are considered in this chapter, and
both categories of techniques are discussed. Various sources of power dissipation in
computer hardware are highlighted in Sect. 12.1. Machine-independent software
optimization approaches are discussed in Sect. 12.2. Various loop optimization
techniques have been combined with dynamic voltage and frequency scaling (DVFS)
to achieve a larger reduction in energy dissipation; this is discussed in detail in
Sect. 12.3. The power-aware software prefetching approach exploits the architectural
features of the target processor and the hardware platform; it is discussed in detail in Sect. 12.4.
Acknowledgements
I am indebted to the editorial team at Springer, especially Kamiya Khatter for help-
ing shape the raw manuscript of the book to the present form. I am also grateful to
Ms. Zaenab Khan, Crest Premedia Solutions Private Limited, Pune, for her patience
during the production workflow of the manuscript and for resolving all my queries. I
am thankful to my wife Alpana, my younger daughter Amrita, her husband Shiladitya,
my elder daughter Aditi, and her husband Arjun for their help and encouragement
during the daunting task of writing this book.
Contents

1 Introduction
  1.1 Introduction
  1.2 Historical Background [1]
  1.3 Why Low Power? [2]
  1.4 Sources of Power Dissipations [3]
    1.4.1 Dynamic Power
    1.4.2 Static Power
  1.5 Low-Power Design Methodologies
  1.6 Chapter Summary
  1.7 Review Questions
  References

Index
About the Author

List of Figures
Fig. 8.4 The code morphing software mediates between x86 software and the Crusoe processor. BIOS basic input/output system, VLIW very long instruction word
Fig. 8.5 Flowchart of a program with a branch
Fig. 8.6 Encoder and decoder blocks to reduce switching activity
Fig. 8.7 Encoder and decoder for Gray code
Fig. 8.8 One-hot encoding
Fig. 8.9 Bus-inversion encoding
Fig. 8.10 Encoder and decoder of bus-inversion encoding. CLK clock signal, INV invalid
Fig. 8.11 T0 encoding
Fig. 8.12 T0 encoder and decoder. CLK clock signal, MUX multiplexer, INC increment
Fig. 8.13 Power reduction using clock gating
Fig. 8.14 Clock-gating mechanism. EN enable, CLK global clock, CLKG gated clock
Fig. 8.15 a Clock gating using AND gate, b clock gating using OR gate, c glitch propagation through the AND gate, and d glitch propagation through the OR gate. EN enable, CLK global clock, CLKG gated clock
Fig. 8.16 a Clock gating using a level-sensitive, low-active latch along with an AND gate and b clock gating using a level-sensitive, low-active latch along with an OR gate. EN enable, CLK global clock, CLKG gated clock
Fig. 8.17 Clock gating the register file of a processor. EN enable, CLK global clock, CLKG gated clock, ALU arithmetic logic unit
Fig. 8.18 a Synchronous load-enabled register bank and b clock-gated version of the register bank. EN enable, CLK global clock, CLKG gated clock, MUX multiplexer
Fig. 8.19 Basic structure of a finite-state machine. PI primary input, PO primary output, PS previous state, NS next state
Fig. 8.20 Gated-clock version of the finite-state machine. PI primary input, PO primary output, PS previous state, NS next state, EN enable, CLK clock, CLKG gated clock
Fig. 8.21 State-transition diagram of a finite-state machine (FSM)
Fig. 8.22 Gated-clock implementation of the finite-state machine (FSM) of Fig. 8.20. CLK clock, CLKG gated clock, EN enable
Fig. 8.23 State-transition diagram of a modulo-6 counter
Fig. 8.24 State-transition diagram of the “11111” sequence detector
Fig. 8.25 a An example finite-state machine (FSM) and b the FSM decomposed into two FSMs
Fig. 8.26 a An example circuit and b operand isolation. CLK clock signal, AS activation signal
Fig. 11.10 Three approaches to task scheduling with voltage scaling
Fig. 11.11 Schematic diagram of a clustered sensor network
Fig. 11.12 Schematic diagram of a clustered sensor network with sensor nodes
Fig. 11.13 Protocol operation of assisted-LEACH
Fig. 12.1 Simplified schematic diagram of a computer system
Fig. 12.2 Codes “before inlining” and “after inlining”
Fig. 12.3 Codes “before code hoisting” and “after code hoisting”
Fig. 12.4 Dead-store elimination
Fig. 12.5 Dead-code elimination
Fig. 12.6 Loop-invariant computation
Fig. 12.7 Loop unrolling
Fig. 12.8 Loop unrolling, where n = 10,000 and uf = 8. a Original code. b Transformed code
Fig. 12.9 Loop tiling, where n = 10,000 and block = 32. a Original code. b Transformed code
Fig. 12.10 Loop permutation, where n = 256. a Original code. b Transformed code
Fig. 12.11 Strength reduction, where n = 10,000. a Original code. b Transformed code
Fig. 12.12 Loop fusion, where n = 10,000. a Original code. b Transformed code
Fig. 12.13 Loop peeling, where n = 10,000. a Original code. b Transformed code
Fig. 12.14 Loop unswitching, where n = 10,000. a Original code. b Transformed code
Fig. 12.15 3D Jacobi's kernel
Fig. 12.16 3D Jacobi's kernel with software prefetching
Fig. 12.17 General structure of a program with software prefetching
Fig. 12.18 General structure of the power-aware software prefetching program (PASPP)
Fig. 12.19 3D Jacobi's kernel with power-aware software prefetching
Fig. 12.20 Detailed power dissipation at different units for three versions of 3D Jacobi's kernel
List of Tables
1.1 Introduction
Design for low power has nowadays become one of the major concerns for complex,
very-large-scale integration (VLSI) circuits. Deep submicron technology,
from 130 nm onwards, poses a new set of design problems related to the power
consumption of the chip. Tens of millions of gates are now being implemented
on a relatively small die, leading to a power density and total power dissipation
that are at the limits of what packaging, cooling, and other infrastructure can support.
As technology has shrunk to 90 nm and below, the leakage current has increased
dramatically, and in some 65-nm designs, leakage power is nearly as large
as dynamic power. It is thus becoming impossible to increase the clock speed of
high-performance chips as technology shrinks and the chip density increases, because
the peak power consumption of these chips is already at the limit and cannot
be increased further. The high power density also leads to reliability problems, because
the mean time to failure decreases with temperature; besides, timing degrades
and leakage currents increase with temperature. For battery-powered devices too,
this high on-chip power density has become a significant problem, and techniques
ranging from the software level through the architecture level down to the implementation
level, such as power gating and multi-threshold libraries, are being used in these
devices to alleviate the problem as much as possible. Other techniques in use nowadays
include applying different supply voltages to different blocks of the design according
to their performance requirements, and voltage scaling techniques. Moreover, aggressive device
size scaling used to achieve high performance leads to increased variability due to
short-channel and other effects. This, in turn, leads to variations in process parameters
such as Leff, Nch, W, Tox, Vt, etc. Performance parameters such as power and
delay are significantly affected by the variations in process parameters and
environmental/operational conditions (Vdd, temperature, input values, etc.). Due to this
variability, the design methodology for future nanometer VLSI circuits will essentially
require a paradigm shift from deterministic to probabilistic and statistical design
approaches. The objective of this book is to provide a comprehensive coverage of
different aspects of low-power circuit synthesis at various levels of the design
hierarchy. This chapter gives an overview of different low-power
techniques in practice at this juncture. In Sect. 1.2, the historical background of
VLSI circuits is briefly introduced. In Sect. 1.3, we shall focus on why low power
is so important. Before we embark upon various techniques for low power, in Sect.
1.4, we shall briefly discuss the sources of power dissipation in complementary
metal-oxide-semiconductor (CMOS) circuits, which is the technology of choice of
present-day VLSI circuits. In Sect. 1.5 we shall introduce the low-power design ap-
proaches which will be elaborated in subsequent chapters.
1.2 Historical Background

The invention of the transistor by William Shockley and his colleagues at Bell Laboratories,
Murray Hill, NJ, ushered in the “solid-state” era of electronic circuits
and systems. Within a few years of the invention, transistors were commercially
available, and almost all electronic systems started carrying the label “solid state,”
signifying the conquest of the transistor over its rival, the vacuum tube. Smaller
size, lower power consumption, and higher reliability were some of the reasons
that made it a winner over the vacuum tube. About a decade later, Shockley and his
colleagues John Bardeen and Walter Brattain of Bell Laboratories were rewarded
with a Nobel Prize for their revolutionary invention.
The tremendous success of the transistor led to vigorous research activity in the
field of microelectronics. Later, Shockley founded a semiconductor company, and some
of his colleagues joined him or founded semiconductor companies of their own.
Gordon Moore, a member of Shockley's team, cofounded Fairchild and later Intel.
Research engineers at Fairchild developed the first planar transistor in the late 1950s,
which was the key to the development of integrated circuits (ICs) in 1959. Planar
technology allowed the realization of a complete electronic circuit by fabricating a number
of devices and interconnecting them on a single silicon wafer. Within a few years
of the development of ICs, Gordon Moore, director of Research and Development
Laboratories, Fairchild Semiconductor, wrote an article entitled “Cramming More
Components onto Integrated Circuits” in the April 19, 1965 issue of Electronics
magazine. He had been asked to predict what would happen over the next 10 years in
the semiconductor component industry. Based on very few empirical data points, he
predicted that by 1975 it would be possible to cram as many as 65,000 components
onto a single silicon chip of about one fourth of a square inch. The curve
Moore used to make his prediction is shown in Fig. 1.1. The starting point was the
year 1959, the year of production of the first planar transistor. The other three
points are based on the ICs made by Fairchild in the early 1960s, including an IC
with 32 components in production in 1964. The last one was an IC to be produced in
1965 with 64 components. By extrapolating the plot to 1975, he observed that “by 1975,
the number of components per IC for minimum cost will be 65,000.” He also concluded
that the component density in an IC would double every year.
Later in the year 1975, Moore revisited the topic at the IEEE International Elec-
tron Devices Meeting and observed that his 10-year-old forecast of 65,000 compo-
nents was on the mark. However, he revised his prediction rate from 1 year to 18
months, that is, the component density would double every 18 months. This became
known as Moore’s law. Again after 30 years, Moore compared the actual perfor-
mance of two kinds of devices—random-access memories (RAM) and micropro-
cessors. Amazingly, it was observed that both kinds traced the slope fairly closely
to the revised 1975 projection.
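The arithmetic behind these extrapolations is simple compound doubling. As an illustrative sketch (the starting count of 64 components and the doubling periods are taken from the discussion above; the function name is ours):

```python
def projected_components(start_count, years, doubling_period_years):
    """Components per chip after `years`, doubling every `doubling_period_years` years."""
    return start_count * 2 ** (years / doubling_period_years)

# Moore's 1965 extrapolation: 64 components, doubling every year, for 10 years
print(round(projected_components(64, 10, 1)))    # 65536, i.e., roughly 65,000
# With the revised 1975 rate of doubling every 18 months instead
print(round(projected_components(64, 10, 1.5)))  # about 6,500
```

The tenfold gap between the two results shows how sensitive a decade-long projection is to the assumed doubling period.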
Moore’s law acted as a driving force for the spectacular development of IC tech-
nology leading to different types of products. Based on the scale of integration,
the IC technology can be divided into five different categories, as summarized in
Table 1.1. The first half of the 1960s was the era of small-scale integration (SSI),
with about ten planar transistors on a chip. The SSI technology led to the fabrication
of gates and flip-flops. In the second half of the 1960s, counters, multiplexers,
decoders, and adders were fabricated using the medium-scale integration
(MSI) technology, having 100–1000 components on a chip. The 1970s was the era
of large-scale integration (LSI) technology with 10,000–20,000 components on a
chip, producing typical products like 8-bit microprocessors, RAMs, and read-only
memories (ROMs). In the 1980s, VLSI with about 20,000–50,000 components led
to the development of 16-bit and 32-bit microprocessors. Beyond 1985 is the era of
ultra-large-scale integration (ULSI), with more than 50,000 devices on a chip, which
led to the fabrication of digital signal processors (DSPs), reduced instruction set
computer (RISC) processors, etc.
In 1971, Intel marketed an IC with the capability of a general-purpose building
block of digital systems. It contained all the functionality of the central processing
unit (CPU) of a computer. The chip was code-named 4004. It was a 4-bit CPU.
Later on, this device was given the name “microprocessor.” Thus, the microprocessor,
“the CPU on a chip,” was born. The huge success of this chip led to the
development of the 8008 and 8085, among the most popular 8-bit microprocessors, by Intel.
Other semiconductor device manufacturers such as Motorola and Zilog joined the
race of producing more and more powerful microprocessors. In the past three decades,
the evolution of microprocessors has grown into a large tree with three
main branches, as shown in Fig. 1.2.
The main branch in the middle represents the general-purpose microprocessors,
which are used to build computers of different kinds such as laptops, desktops,
workstations, and servers. This branch has produced more and more powerful CPUs,
with word lengths growing from 4 bits in the earliest processors to 64 bits in
present-day processors. Moreover, clock rates have increased from a few megahertz
to thousands of megahertz, and many advanced architectural features such as pipelining,
superscalar execution, on-chip cache memory, and dual cores have been introduced.
Computers built using present-day microprocessors have the capability of the mainframe
computers of the 1980s and 1990s. Figure 1.3 shows the series of Intel microprocessors.
Fig. 1.2 Evolution tree of microprocessors: the main branch is the general-purpose microprocessors (4-bit through 64-bit, with pipelines, superscalars, and cache memory); the side branches are microcontrollers and special-purpose processors (e.g., for embedded systems, switches, and routers). RISC reduced instruction set computer, DSP digital signal processor
Fig. 1.3 Moore's law and the Intel microprocessors, from the 4004 to the Pentium 4 (transistor counts per chip, 1970–2010). (Source: Intel)
Until recently, performance of a processor has been synonymous with circuit speed
or processing power, e.g., million instructions per second (MIPS) or million float-
ing point operations per second (MFLOPS). Power consumption was of secondary
concern in designing ICs. However, in nanometer technology, power has become
the most important issue because of:
• Increasing transistor count
• Higher speed of operation
• Greater device leakage currents
Increased process parameter variability due to aggressive device size scaling has
created problems in yield, reliability, and testing. As a consequence, there is a
change in the trend of specifying the performance of a processor: power consumption
is now considered one of the most important design parameters. Some of the important
reasons for this change in trend are considered below.
In order to continuously improve the performance of circuits and to integrate
more and more functionality into the chip, the device feature size has to shrink
continuously. Figure 1.4 shows the power dissipation of Intel processors. As a
consequence of this shrinking, the power per unit area, known as the power density,
is increasing, as shown in Fig. 1.5. To remove the heat generated by the device, it is necessary to
provide suitable packaging and cooling mechanism. There is an escalation in the
cost of packaging and cooling as the power dissipation increases. To make a chip
commercially viable, it is necessary to reduce the cost of packaging and cooling,
which in turn demands lower power consumption.
Fig. 1.5 Power density of Intel processors (4004 through Pentium, 1970–2010), approaching that of a hot plate and, by extrapolation, a nuclear reactor
Fig. 1.6 Different failure mechanisms against temperature (onset temperatures of thermal runaway, gate dielectric breakdown, junction/diffusion failure, electromigration, electrical parameter shift, package-related failure, and silicon interconnect fatigue)
roughly doubles the failure rate. So, lower power dissipation of a device is essential
for reliable operation. According to an estimate of the US Environmental Protection
Agency (EPA), 80 % of the power consumption by office equipment is due to
computing equipment, and a large part of that comes from unused equipment. Power is
dissipated mostly in the form of heat. Cooling techniques, such as air conditioning,
transfer the heat to the environment. To reduce the adverse effect on the environment,
efforts such as the EPA's Energy Star program, leading to power management standards
for desktops and laptops, have emerged.
Although power and energy are used interchangeably in many situations, the two
have different meanings, and it is essential to understand the difference between
them, especially in the case of battery-operated devices. Figure 1.7 illustrates
the difference. Power is the instantaneous rate at which a device consumes energy,
while energy is the integral of power over time. For example, in Fig. 1.7, approach 1
takes less time but consumes more power than approach 2. But the energy consumed
by the two, that is, the area under the curve for both approaches, is the same, and
battery life is primarily determined by this energy.
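The distinction can be made concrete with a small sketch (the wattages and durations below are hypothetical, chosen only to mirror the two approaches of Fig. 1.7):

```python
def energy_joules(power_watts, duration_s):
    """Energy is power integrated over time; for constant power, simply P * t."""
    return power_watts * duration_s

e1 = energy_joules(2.0, 5.0)    # approach 1: higher power, shorter time
e2 = energy_joules(1.0, 10.0)   # approach 2: lower power, longer time
print(e1, e2)                   # 10.0 10.0 -- equal areas under the power curve
```

Battery life tracks the energy (10 J in both cases), not the peak power.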
1.4.1 Dynamic Power
Dynamic power is the power consumed when the device is active, that is, when
the signals of the design are changing values. It is generally categorized into three
types: switching power, short-circuit power, and glitching power, each of which is
discussed in detail below.
1.4.1.1 Switching Power

The first and primary source of dynamic power consumption is the switching power,
the power required to charge and discharge the output capacitance of a gate.
Figure 1.9 illustrates switching power for charging a capacitor.

The energy per transition is given by

    Energy/transition = (1/2) × C_L × V_dd^2,

where C_L is the load capacitance and V_dd is the supply voltage. Defining the
effective switched capacitance as

    C_switch = P_trans × C_L,

where P_trans is the probability of an output transition, we can also describe the
dynamic power with the more familiar expression

    P_switch = (1/2) × C_switch × V_dd^2 × f_clock = (1/2) × V_dd^2 × f_clock × Σ_i (α_i × C_i),

where α_i and C_i are the transition probability and capacitance, respectively, of an
internal node i.
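A numerical sketch of the switching-power expression for a single node (the activity factor, load, supply, and clock values below are hypothetical):

```python
def switching_power(alpha, c_load_f, vdd_v, f_hz):
    """P_switch = 0.5 * alpha * C_L * Vdd^2 * f for one switching node."""
    return 0.5 * alpha * c_load_f * vdd_v ** 2 * f_hz

# Hypothetical gate output: activity 0.1, 50 fF load, 1.0 V supply, 1 GHz clock
p = switching_power(0.1, 50e-15, 1.0, 1e9)
print(f"{p * 1e6:.2f} uW")   # 2.50 uW
```

Note the quadratic dependence on V_dd, which is why supply voltage scaling is such an effective lever for reducing dynamic power.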
1.4.1.2 Short-Circuit Power
In addition to the switching power, short-circuit power also contributes to the dy-
namic power. Figure 1.10 illustrates short-circuit currents. Short-circuit currents
occur when both the negative metal–oxide–semiconductor (NMOS) and positive
metal–oxide–semiconductor (PMOS) transistors are on. Let Vtn be the threshold
voltage of the NMOS transistor and Vtp be the threshold voltage of the PMOS
transistor. Then, while the input is switching either from 1 to 0 or vice versa, during
the period when the input voltage lies between Vtn and Vdd − Vtp, both the PMOS
and the NMOS transistors remain ON, and a short-circuit current flows from Vdd to
ground (GND).

The expression for short-circuit power is given by

    P_short-circuit = t_sc × V_dd × I_peak × f_clock = (μ × ε_ox × W) / (12 × L × D) × (V_dd − 2V_th)^3 × t_sc × f_clock,

where t_sc is the rise/fall time duration of the short-circuit current, I_peak is the total
internal switching current (short-circuit current plus the current to charge the internal
capacitance), μ is the mobility of the charge carriers, ε_ox is the permittivity of the
silicon dioxide, W is the width and L is the length of the transistor, and D is the
thickness of the silicon dioxide.
From the above equation, it is evident that the short-circuit power dissipation
depends on the supply voltage, the rise/fall time of the input, and the clock frequency,
apart from the physical parameters. The short-circuit power can therefore be kept low
if the ramp (rise/fall) time of the input signal is kept short for each transition, in
which case the overall dynamic power is determined mainly by the switching power.
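As a rough numerical sketch, the device-dependent prefactor μ·ε_ox·W/(L·D) can be folded into a single gain factor β; all of the values below are hypothetical, and symmetric thresholds (Vtn = |Vtp| = Vth) are assumed:

```python
def short_circuit_power(beta_a_per_v2, vdd_v, vth_v, t_sc_s, f_hz):
    """Veendrick-style estimate: P_sc = (beta/12) * (Vdd - 2*Vth)^3 * t_sc * f."""
    return beta_a_per_v2 / 12.0 * (vdd_v - 2.0 * vth_v) ** 3 * t_sc_s * f_hz

# Hypothetical values: beta = 100 uA/V^2, Vdd = 1.2 V, Vth = 0.4 V,
# 50 ps input ramp, 1 GHz clock
p = short_circuit_power(100e-6, 1.2, 0.4, 50e-12, 1e9)
print(f"{p * 1e9:.1f} nW")   # tens of nanowatts; shrinks linearly with the ramp time
```

The cubic term also shows why short-circuit power vanishes once Vdd approaches 2·Vth, a point revisited under voltage scaling.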
1.4.1.3 Glitching Power

The third type of dynamic power dissipation is the glitching power, which arises due to the finite delay of the gates. Since the dynamic power is directly proportional to the number of output transitions of a logic gate, glitching can be a significant source of signal activity and deserves mention here. Glitches often occur when paths with
unequal propagation delays converge at the same point in the circuit. Glitches oc-
cur because the input signals to a particular logic block arrive at different times,
causing a number of intermediate transitions to occur before the output of the logic
block stabilizes. These additional transitions result in power dissipation, which is
categorized as the glitching power.
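The mechanism can be demonstrated with a minimal unit-delay gate model (a sketch, not a real timing simulator): an AND gate fed by a signal and its inverted, hence delayed, copy produces a transient output pulse even though the steady-state output never changes.

```python
# Unit-delay model: each gate's output at time t is computed from its
# inputs at time t-1. Circuit: inv = NOT a; out = a AND inv.
# The two inputs of the AND arrive via paths of unequal delay.
def simulate(a_wave):
    inv, out = [0], [0]                       # gate outputs at t = 0
    for t in range(1, len(a_wave)):
        inv.append(1 - a_wave[t - 1])           # inverter: 1 unit delay
        out.append(a_wave[t - 1] & inv[t - 1])  # AND: 1 more unit delay
    return out

a = [0, 0, 1, 1, 1, 1]                  # input switches 0 -> 1 at t = 2
out = simulate(a)
transitions = sum(out[t] != out[t - 1] for t in range(1, len(out)))
print(out, transitions)                 # [0, 0, 0, 1, 0, 0] 2
```

The steady-state output is constantly 0, so ideally there should be no transitions at all; the two extra transitions of the glitch dissipate switching power without doing useful work.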
1.4.2 Static Power
Static power dissipation takes place as long as the device is powered on, even when
there are no signal changes. Normally in CMOS circuits, in the steady state, there is
no direct path from Vdd to GND and so there should be no static power dissipation,
but there are various leakage current mechanisms which are responsible for static
power dissipation. Since the MOS transistors are not perfect switches, there will be
leakage currents and substrate injection currents, which will give rise to static pow-
er dissipation in CMOS. Since the substrate current reaches its maximum for gate
voltages near 0.4Vdd and gate voltages are only transiently in this range when the
devices switch, the actual power contribution of substrate currents is negligible as
compared to other sources of power dissipation. Leakage currents are also normally negligible, of the order of nanoamperes, compared to dynamic power dissipation. But with deep submicron technologies, the leakage currents are increasing drastically, to the extent that in 90-nm and smaller technologies the leakage power has become comparable to the dynamic power dissipation.
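The contrast can be made concrete with P_static = Vdd × I_leak; the two leakage currents below are illustrative magnitudes, not measured data:

```python
# Static power is simply the leakage current drawn from the supply.
def static_power(i_leak, vdd):
    return i_leak * vdd

# Older process: nanoampere leakage at 3.3 V is tens of nanowatts,
# negligible next to milliwatts of dynamic power.
print(static_power(10e-9, 3.3))   # 3.3e-08 W
# Deep-submicron chip: milliampere-range total leakage at 1.0 V gives
# static power in the same ballpark as the dynamic power.
print(static_power(10e-3, 1.0))   # 0.01 W
```

The six-orders-of-magnitude jump in the assumed leakage current is what moved static power from a footnote to a first-order design concern.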
Figure 1.11 shows several leakage mechanisms that are responsible for static
power dissipation. Here, I1 is the reverse-bias p–n junction diode leakage current,
leakage. There are various standby leakage reduction techniques such as input vec-
tor control (IVC), body bias control (BBC), multi-threshold CMOS (MTCMOS),
etc. and runtime leakage reduction techniques such as static dual threshold voltage
CMOS (DTCMOS) technique, adaptive body biasing, dynamic voltage scaling, etc.
Aggressive device size scaling used to achieve high performance leads to increased variability due to short-channel and other effects. Performance parameters such as power and delay are significantly affected by variations in process parameters and in environmental/operational conditions (Vdd, temperature, input values). Due to this variability, the design methodology for future nanometer VLSI circuit designs will essentially require a paradigm shift from deterministic to probabilistic and statistical design approaches. The impact of process variations has been investigated, and several techniques have been proposed to optimize performance and power in the presence of process variations.
1.6 Chapter Summary
1.7 Review Questions
Q1.1. Why has low power become an important issue in the present-day VLSI
circuit realization?
Q1.2. How is reliability of a VLSI circuit related to its power dissipation?
Q1.3. How is the environment affected by the power dissipation of VLSI circuits?
Q1.4. Why has leakage power dissipation become an important issue in deep sub-
micron technology?
Q1.5. What are the different components of dynamic power dissipation?
Q1.6. What are the different components of leakage power dissipation?
Q1.7. Distinguish between energy and power dissipation of VLSI circuits. Which
one is more important for portable systems?
Q1.8. What is glitching power dissipation?
References
1. Pal, A.: Microcontrollers: Principles and Applications, PHI Learning, India (2011)
2. Raghunathan, A., Jha, N.K., Dey, S.: High-Level Power Analysis and Optimization. Kluwer,
Norwell (1998)
3. Bellaouar, A., Elmasry, M.I.: Low-Power Digital VLSI Design: Circuits and Systems. Kluwer, Norwell (1995)
4. Chandrakasan, A.P., Brodersen, R.W.: Low Power Digital CMOS Design, Kluwer, Boston
(1995)
5. Roy, K., Mukhopadhyay, S., Mahmoodi-Meimand, H.: Leakage Current Mechanisms and Leakage Reduction Techniques in Deep-Submicrometer CMOS Circuits. Proceedings of the IEEE, vol. 91, no. 2, pp. 305–327 (2003)
Chapter 2
MOS Fabrication Technology
2.1 Introduction
the fabrication of CMOS devices are presented in Sect. 2.4. The latch-up problem and various techniques to prevent it are highlighted in Sect. 2.5. Short-channel effects
(SCEs) have been considered in Sect. 2.6 and emerging technologies for low power
have been considered in Sect. 2.7.
2.2.1 Wafer Fabrication
The MOS fabrication process starts with a thin wafer of silicon. The raw material
used for obtaining silicon wafer is sand or silicon dioxide. Sand is a cheap material
and it is available in abundance on earth. However, it has to be purified to a high
level by reacting with carbon and then crystallized by an epitaxial growth process.
The purified silicon is held in molten state at about 1500 °C, and a seed crystal is
slowly withdrawn after bringing in contact with the molten silicon. The atoms of
the molten silicon attached to the seed cool down and take the crystalline structure
of the seed. While forming this crystalline structure, the silicon is lightly doped
by inserting controlled quantities of a suitable doping material into the crucible.
The setup for wafer fabrication to produce nMOS devices is shown in Fig. 2.1a. Here, boron may be used to produce a p-type impurity concentration of 10^15 to 10^16 per cm^3, which gives a resistivity in the range of 2–25 Ω cm. After the withdrawal of
the seed, an “ingot” several centimeters in length and about 8–10 cm in diameter, as shown in Fig. 2.1b, is obtained. The ingot is cut into slices of 0.3–0.4 mm thickness to obtain wafers for IC fabrication.
2.2.2 Oxidation
Silicon dioxide layers are used as insulating separators between different conducting layers. They also act as masks or protective layers against diffusion and high-energy
Fig. 2.1 a Set up for forming silicon ingot, showing the seed crystal, dopant gas inlet, melted silicon, crucible, and heater. b An ingot

(Fig. 2.2: oxidation furnace, with the wafers heated in O2 or water vapour.)
ion implantation. The process of growing oxide layers is known as oxidation be-
cause it is performed by a chemical reaction between oxygen (dry oxidation), or
oxygen and water vapor (wet oxidation) and the silicon slice surface in a high-
temperature furnace at about 1000 °C as shown in Fig. 2.2. To grow an oxide layer
of thickness tox, the amount of silicon consumed is approximately 0.5tox. Dry oxidation, performed in O2 with a few percent of hydrochloric acid added, produces thin but robust oxide layers and is used to form the gate structure. These layers are known as gate oxide layers. Wet oxidation produces a thicker and slightly porous layer.
This layer is known as field oxide layer. The oxide thickness is limited by the diffu-
sion rate of the oxidizing agent through the already grown layer and is about 1 µm
at one atmospheric pressure, but can be doubled by using higher pressure, say approximately 20 atm. Another advantage of a high-pressure system is that thicker oxides can be grown in less time at high temperature.
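The silicon budget implied by the 0.5·tox rule above can be worked out directly; a minimal sketch:

```python
# Growing an oxide of thickness t_ox consumes roughly 0.5 * t_ox of
# silicon, because part of the oxide forms below the original surface.
def silicon_consumed(t_ox_um, ratio=0.5):
    return ratio * t_ox_um

# A 1-um field oxide (the ~1 atm limit quoted in the text) therefore
# consumes about 0.5 um of the wafer's silicon.
print(silicon_consumed(1.0))   # 0.5
```

This matters when budgeting the wafer surface: the grown oxide stands roughly half above and half below the original silicon surface.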
2.2.3 Mask Generation
To create patterned layers of different materials on the wafer, masks are used at different stages. Masks are made of either inexpensive green glass or costly low-expansion glass plates, with opaque and transparent regions created using photographic emulsion, which is cheap but easily damaged. Alternative materials used for creating masks are iron oxide and chromium, both of which are more durable and give better line resolution, but are more expensive.
A mask can be generated either optically or with the help of an electron beam.
In the optical process, a reticle, which is a photographic plate of exactly ten times
the actual size of the mask, is produced as the master copy of the mask. Transparent
and opaque regions are created with the help of a pattern generator by projecting an
image of the master onto the reticle. Special masking features such as parity masks
and fiducials are used on the reticle to identify, align, and orient the mask. Master
plates are generated from reticles in a step-and-repeat process by projecting an image of the reticle, reduced ten times, onto the photosensitized plate to create an array of geometrical shapes over the entire plate. Fiducials are used to control the
separation between exposures and align the reticle images relative to one another.
This process has the disadvantage that if there is a defect on the reticle, it is repro-
duced on all the chips. The step-and-repeat process not only is slow but also suffers
from alignment problems and defect propagation due to dust specks. The electron
beam mask generation technique overcomes these problems.
In the electron beam masking process, the masking plate is generated in one step.
It is based on the raster scan approach where all the geometrical data are converted
into a bit map of 1’s and 0’s. While scanning the masking plate in a raster scan man-
ner, squares containing 1’s are exposed and those containing 0’s are not. Exposures
are made by blanking and un-blanking the beam controlled by the bit map. Using
this technique, several different chip types can be imprinted on the same set of
masks. The main disadvantage of this approach is that it is a sequential technique.
A better alternative is to use the soft X-ray photolithographic technique, in which the entire chip can be irradiated simultaneously. This technique also gives higher resolution.
These master plates are usually not used for mask fabrication. Working plates
made from the masters by contact printing are used for fabrication. To reduce turn-
around time, specially made master plates can be used for wafer fabrication.
2.2.4 Photolithography
To create the desired pattern, actual removal of the material is done by the etch-
ing process. The wafer is immersed in a suitable etching solution, which eats out
the exposed material leaving the material beneath the protective photo-resist intact.
The etching solution depends on the material to be etched out. Hydrofluoric acid
(HF) is used for SiO2 and poly-silicon, whereas phosphoric acid is used for nitride
and metal.
Another alternative to this wet chemical etching process is the plasma etching
or ion etching. In this dry process, a stream of ions or electrons is used to blast the
material away. Ions created by glow discharge at low pressure are directed to the
target. Ions can typically penetrate only about 800 Å of oxide or photo-resist, so thicker layers of these materials are used to mask selected areas while the exposed material is sputtered away. This plasma technique can produce vertical etch-
ing with little undercutting. As a consequence, it is commonly used for producing
fine lines and small geometries associated with high-density VLSI circuits.
Finally, the photo-resist material is removed by a chemical reaction with fuming nitric acid or by exposure to atomic oxygen, which oxidizes away the photo-resist. Patterned layers of different materials, in engraved form, are left at the end of this process.
2.2.5 Diffusion
After masking some parts of the silicon surface, selective diffusion can be done in
the exposed regions. There are two basic steps: pre-deposition and drive-in. In the
pre-deposition step, the wafer is heated in a furnace at 1000 °C, and dopant atoms
such as phosphorus or boron, mixed with an inert gas, say nitrogen, are introduced
into it. Diffusion of these atoms takes place into the surface of the silicon, forming a saturated solid solution of the dopant atoms. The impurity concentration increases with temperature up to about 1300 °C and then drops. The depth of penetration depends on the duration for which the process is carried out. In the drive-in step, the wafer is heated in an inert atmosphere for a few hours to distribute the atoms more uniformly and to a greater depth.
Another alternative method for diffusion is ion implantation. Dopant gas is
first ionized with the help of an ionizer and ionized atoms are accelerated between
two electrodes with a voltage difference of 150 kV. The accelerated gas is passed
through a strong magnetic field, which separates the stream of dopant ions on the
basis of molecular weights, as it happens in mass spectroscopy. The stream of these
dopant ions is deflected by the magnetic field to hit the wafer. The ions strike the
silicon surface at high velocity and penetrate the silicon layer to a certain depth as
determined by the concentration of ions and the accelerating field. This process is also followed by a drive-in step to achieve a uniform distribution of the ions and increase the depth of penetration.
Different materials, such as thick oxide, photo-resist, or metal, can serve as masks for the ion implantation process. However, implantation can be achieved through
thin oxide layers. This is frequently used to control the threshold voltage of MOS transistors. Such control was not possible using other techniques, and ion implantation is now widely used not only for controlling the threshold voltage but also for all doping stages in MOS fabrication.
2.2.6 Deposition
In the MOS fabrication process, conducting layers such as poly-silicon and alu-
minium, and insulation and protection layers such as SiO2 and Si3N4 are deposited
onto the wafer surface by using the chemical vapor deposition (CVD) technique in
a high-temperature chamber:
Poly:   SiH4 --(1000 °C)--> Si + 2H2

SiO2:   SiH4 + O2 --(400–450 °C)--> SiO2 + 2H2

Si3N4:  3SiCl2H2 + 4NH3 --(600–750 °C)--> Si3N4 + 6HCl + 6H2
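The three reactions above can be sanity-checked for element balance with a short script; the molecule encodings are just element-count dictionaries:

```python
# Verify that each CVD reaction conserves every element.
from collections import Counter

def total(side):
    """side: list of (coefficient, {element: count}) pairs."""
    atoms = Counter()
    for coeff, formula in side:
        for elem, n in formula.items():
            atoms[elem] += coeff * n
    return atoms

SiH4 = {"Si": 1, "H": 4}; O2 = {"O": 2}; SiO2 = {"Si": 1, "O": 2}
H2 = {"H": 2}; SiCl2H2 = {"Si": 1, "Cl": 2, "H": 2}
NH3 = {"N": 1, "H": 3}; Si3N4 = {"Si": 3, "N": 4}
HCl = {"H": 1, "Cl": 1}; Si = {"Si": 1}

reactions = [
    ([(1, SiH4)],              [(1, Si), (2, H2)]),                # poly
    ([(1, SiH4), (1, O2)],     [(1, SiO2), (2, H2)]),              # SiO2
    ([(3, SiCl2H2), (4, NH3)], [(1, Si3N4), (6, HCl), (6, H2)]),   # Si3N4
]
for lhs, rhs in reactions:
    assert total(lhs) == total(rhs)
print("all three reactions balance")
```

For the nitride reaction, for instance, both sides carry 3 Si, 4 N, 6 Cl, and 18 H atoms.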
2.3 nMOS Fabrication Steps

Using the basic processes mentioned in the previous section, typical processing steps of the poly-silicon gate self-aligning nMOS technology are given below. They can be better understood by considering the fabrication of a single enhancement-type transistor. Figure 2.3 shows the step-by-step production of the transistor.
Step 1 The first step is to grow a thick silicon dioxide (SiO2) layer, typically of
1 µm thickness all over the wafer surface using the wet oxidation technique. This
oxide layer will act as a barrier to dopants during subsequent processing and provide an insulating layer on which other patterned layers can be formed.
Step 2 In the SiO2 layer formed in the previous step, some regions are defined
where transistors are to be formed. This is done by the photolithographic process
discussed in the previous section with the help of a mask (MASK 1). At the end of
this step, the wafer surface is exposed in those areas where diffusion regions along
with a channel are to be formed to create a transistor.
Step 3 A thin layer of SiO2, typically 0.1 μm thick, is grown over the entire wafer surface, and on top of it a poly-silicon layer about 1.5 μm thick, consisting of heavily doped poly-silicon, is deposited using the CVD technique. In this step, precise control of thickness, impurity concentration, and resistivity is necessary.
Step 4 Again, by using another mask (MASK 2) and the photolithographic process, the poly-silicon is patterned. By this process, poly-gate structures and interconnections in the poly layer are formed.
Step 5 Then the thin oxide layer is removed to expose areas where n-diffusions are
to take place to obtain source and drain. With the poly-silicon and underlying thin
oxide layer as the protective mask, the diffusion process is performed. It may be
noted that the process is self-aligning, i.e., source and drain are aligned automati-
cally with respect to the gate structure.
Step 6 A thick oxide layer is grown all over again and holes are made at selected
areas of the poly-silicon gate, drain, and source regions by using a mask (MASK 3)
and the photolithographic process.
Step 7 A metal (aluminium) layer of 1 μm thickness is deposited on the entire
surface by the CVD process. The metal layer is then patterned with the help of a
mask (MASK 4) and the photolithographic process. Necessary interconnections are
provided with the help of this metal layer.
Step 8 The entire wafer is again covered with a thick oxide layer—this is known
as over-glassing. This oxide layer acts as a protective layer to protect different parts
from the environment. Using a mask (MASK 5), holes are made on this layer to pro-
vide access to bonding pads for taking external connections and for testing the chip.
The above processing steps allow only the formation of nMOS enhancement-type transistors on a chip. However, if depletion-type transistors are also to be formed, one additional step is necessary for the formation of n-diffusions in the channel regions where the depletion transistors are to be created. This step, carried out between step 2 and step 3, requires one additional mask to define those channel regions, which are then doped using the ion implantation technique.
2.4 CMOS Fabrication Steps

There are several approaches for CMOS fabrication, namely, p-well, n-well, twin-tub, triple-well, and SOI. The n-well approach is compatible with the nMOS process and can be easily retrofitted to it. However, the most popular approach is the p-well approach, which is similar to the n-well approach. The twin-tub and silicon-on-sapphire approaches are more complex and costly. These are used to produce superior-quality devices and to overcome the latch-up problem, which is predominant in CMOS devices.
The n-well CMOS fabrication process starts with a lightly doped p-type substrate and creates an n-type well for the fabrication of pMOS transistors. The major steps of the n-well CMOS process are illustrated as follows:
Step 1 The basic idea behind the n-well process is the formation of an n-well or tub
in the p-type substrate and fabrication of p-transistors within this well. The formation of an n-well by ion implantation is followed by a drive-in step (a phosphorus dose of about 1.8 × 10^12 cm^-2 at 80 kV, with drive-in at 1150 °C for 15 h). This step requires a mask (MASK 1),
which defines the deep n-well diffusions. The n-transistor is formed outside the
well. The basic steps are mentioned below:
• Start with a blank wafer, commonly known as a substrate, which is lightly doped.
• Cover the wafer with a protective layer of SiO2 (oxide) using the oxidation pro-
cess at 900–1200 °C with H2O (wet oxidation) or O2 (dry oxidation) in the oxida-
tion furnace.
• Expose photoresist through the n-well mask and strip off the exposed photoresist
using organic solvents. The n-well mask used to define the n-well in this step is
shown below.
• Etch the oxide with HF, which attacks the oxide only in the areas where the resist has been stripped away.
• Implant or diffuse n dopants into the exposed wafer using diffusion or ion implantation. The ion implantation process allows shallower wells, suitable for the fabrication of devices of smaller dimensions. The diffusion process proceeds in all directions: the deeper the diffusion, the more it spreads laterally. This affects how closely two separate structures can be fabricated.
• Strip off SiO2 leaving behind the p-substrate along with the n-well.
Step 2 The formation of thin oxide regions for the p- and n-transistors requires MASK 2, which is also known as the active mask because it defines the thin oxide regions where gates are formed.
Step 3 The formation of patterned poly-silicon (nitride on the thin oxide) regions
is done using MASK 3. Patterned poly-silicon is used for interconnecting different
terminals.
Step 4 The formation of n-diffusion is done with the help of the n+ mask, which is
essentially MASK 4.
Step 5 The formation of p-diffusion is done using the p+ mask, which is usually a
negative form of the n+ mask. Similar sets of steps form p+ diffusion regions for the
pMOS source and drain and substrate contact.
Step 6 Thick SiO2 is grown all over, and then contact cuts are defined using another mask (MASK 6).
Step 7 The whole chip then has metal deposited over its surface to a thickness of
1 μm. The metal layer is then patterned by the photolithographic process to form
interconnection patterns using MASK 7.
Two transistors, one pMOS and the other nMOS, which can be used to realize a CMOS inverter, are formed using the n-well process, as shown in Fig. 2.4.
Typical p-well fabrication steps are similar to an n-well process, except that a p-
well is implanted to form n-transistors rather than an n-well. p-Well processes are
preferred in circumstances where the characteristics of the n- and p-transistors are
required to be more balanced than that achievable in an n-well process. Because the
transistor that resides in the native substrate has been found to have better character-
istics, the p-well process has better p-devices than an n-well process.
2.4.3 Twin-Tub Process

2.5 Latch-Up Problem and Its Prevention
The latch-up [4, 5] is an inherent problem in both n-well- and p-well-based CMOS
circuits. The phenomenon is caused by the parasitic bipolar transistors formed in
the bulk of silicon as shown in Fig. 2.6a for the n-well process. Latch-up can be
defined as the formation of a low-impedance path between the power supply and
ground rails through the parasitic n–p–n and p–n–p bipolar transistors. Figure 2.6a
shows a cross section of a CMOS inverter. Two parasitic bipolar transistors, Q1 and
Q2 are shown in the figure. The p–n–p transistor has its emitter formed by the p+
source/drain implant used in the pMOS transistors. It may be noted that either the
drain or the source may act as the emitter, although the source is the terminal that
maintains the latch-up condition. The base is formed by the n-well, and the collector
is formed by the p-substrate. The emitter of the n–p–n transistor is the n+ source/drain implant; its base is formed by the p-substrate, and its collector is the n-well.

(Fig. 2.5: twin-tub structure, with the p-well and n-well formed in a high-purity epitaxial layer grown with accurately determined dopant concentrations on an n+ substrate. Fig. 2.6 a Cross section of a CMOS inverter showing the parasitic transistors Q1 and Q2 and resistors Rwell and Rs. b Equivalent circuit. c I–V characteristic showing the holding voltage VH and holding current IH.)
The parasitic resistors Rwell and Rs are formed because of the resistivity of the
semiconductor material in the n-well and p-substrate, respectively.
As shown in Fig. 2.6b, the bipolar junction transistors (BJTs) are cross-coupled
to form the structure of a silicon-controlled rectifier (SCR) providing a short-circuit
path between the power rail and the ground. Leakage current through the para-
sitic resistors can cause one transistor to turn on, which in turn turns on the other
transistor due to positive feedback, leading to heavy current flow and device failure.
The mechanism of latch-up may be understood by referring to Fig. 2.6b. In normal
operation, currents passing through the intrinsic resistors are diode-leakage cur-
rents, which are very small and the voltages developed across the resistors cannot
turn on either of the BJTs. However, because of some external disturbance, the current through one of the two BJTs may increase, leading to a voltage drop across Rs (or Rwell) that turns the transistor on. This produces a high collector current, causing a larger voltage drop across Rwell (or Rs), and the resulting positive feedback leads to a self-sustaining low-resistance current path between Vdd and ground (GND).
The latch-up process is triggered by transient currents or voltages generated in-
ternally during power-up, or externally due to voltages and currents beyond the
normal operating ranges. Two distinct situations responsible for triggering are re-
ferred to as vertical triggering and lateral triggering. Vertical triggering takes place
due to current flow in the vertical p–n–p transistor Q1. The current is multiplied by
the common-base current gain, which leads to a voltage drop across the emitter–
base junction of the n–p–n transistor, due to resistance Rs. In a similar way, lateral
triggering takes place when a current flows in the lateral n–p–n transistor leading
to voltage drop across Rwell. In either of the situations, the resulting feedback loop
causes the current transients multiplied by β1 × β 2 . It may be noted that when the
condition β1 × β 2 ≥ 1 is satisfied, both transistors continue to conduct a high current
even after the initial disturbance no longer exists. At the onset of latch-up, the volt-
age drop across the BJT pair is given by
where VH is called the holding voltage. The latch-up condition is sustained as long
as the current is greater than the holding current IH; the holding current value de-
pends on the total parasitic resistance RT in the current path. There are several ap-
proaches to reduce the tendency of latch-up. The slope of the I–V curve depends on
the total parasitic resistance RT in the current path. The possibility of internal latch-
up can be reduced to a great extent by using the following rules:
• Every well must have an appropriate substrate contact.
• Every substrate contact should be directly connected to a supply pad by metal.
• Substrate contacts should be placed as close as possible to the source connections of the transistors tied to the supply rails. This helps to reduce the value of both Rs and Rwell.
• Alternatively, place a substrate contact for every 5–10 transistors.
• nMOS devices should be placed close to Vss and pMOS devices close to Vdd.
In addition to the above, guard rings and trenches, as discussed below, are used to
overcome latch-up.
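The sustaining condition β1 × β2 ≥ 1 discussed above can be checked numerically; the gain values used here are arbitrary illustrative numbers:

```python
# Latch-up is self-sustaining when the loop gain of the cross-coupled
# parasitic BJT pair is at least unity: beta1 * beta2 >= 1.
def latch_up_sustained(beta1, beta2):
    return beta1 * beta2 >= 1.0

print(latch_up_sustained(0.5, 1.5))   # False: loop gain 0.75 < 1
print(latch_up_sustained(5.0, 2.0))   # True: loop gain 10 >= 1
```

The layout rules above (guard rings, nearby substrate contacts) work precisely by degrading these effective gains and lowering Rs and Rwell so that the trigger and holding currents are never reached.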
2.5.1 Use of Guard Rings

The gain of the parasitic transistors can be reduced by using guard rings and making
additional contacts to the ring, as shown in Fig. 2.7. This reduces the parasitic resistance values, and the contacts drain excess well or substrate leakage currents away
from the active devices so that the trigger current that initiates latch-up is never attained. The guard bands act as dummy collectors: they reduce the gain of the parasitic transistors by collecting minority carriers and preventing them from being injected into the base. This, however, increases the space between the n-channel and p-channel devices and leads to a reduction in gate density.
2.5.2 Use of Trenches
Another approach to overcome the latch-up problem is to use trenches between the
individual transistor devices of the CMOS structure, and highly doped field regions
are formed in the bottom of the trenches. Each n- and p-well includes a retrograde
impurity concentration profile and extends beneath adjacent trenches as shown in
Fig. 2.8.
2.6 Short-Channel Effects

The channel length L is usually reduced to increase both the speed of operation and the number of components per chip. However, when the channel length becomes of the same order of magnitude as the depletion-layer widths (xdD, xdS) of the source and drain junctions, the device behaves as a short-channel device and various short-channel effects appear.
For long-channel devices, the source and drain regions are separated far apart, and
the depletion regions around the drain and source have little effect on the potential
distribution in the channel region. So, the threshold voltage is independent of the
(Fig. 2.10: subthreshold ID–VG characteristics on a logarithmic current scale for VD = 0.1, 2.7, and 4.0 V, showing the regions dominated by DIBL, GIDL, weak inversion, and junction leakage.)
channel length and drain bias for such devices. However, for short-channel devices, the source and drain depletion widths in the vertical direction and the source–drain potential have a strong effect on a significant portion of the device, leading to a variation of the subthreshold leakage current with the drain bias. This is known as the drain-induced barrier-lowering (DIBL) effect. DIBL occurs when the depletion regions of the drain and the source interact with each other near the channel surface to lower the source potential barrier; the lowered barrier reduces the effective threshold voltage of a short-channel device and thereby increases the subthreshold current. The DIBL effect is visualized in Fig. 2.10.
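The drain-bias dependence can be sketched with a common compact model in which DIBL appears as an effective threshold-voltage reduction η·Vds; the model form and every parameter value below are generic assumptions for illustration, not taken from the text:

```python
import math

# Assumed compact model of subthreshold leakage with a DIBL term:
#   I_sub = I0 * exp((Vgs - Vth0 + eta * Vds) / (n * vT))
# eta is the DIBL coefficient, n the subthreshold slope factor,
# vT the thermal voltage (~26 mV at room temperature).
def i_sub(vgs, vds, i0=1e-7, vth0=0.4, eta=0.08, n=1.5, vt=0.026):
    return i0 * math.exp((vgs - vth0 + eta * vds) / (n * vt))

# Raising Vds from 0.1 V to 1.2 V at Vgs = 0 raises the leakage:
low, high = i_sub(0.0, 0.1), i_sub(0.0, 1.2)
print(f"{high / low:.1f}x")   # 9.5x for these assumed parameters
```

Even this modest assumed DIBL coefficient gives nearly an order of magnitude more leakage at the higher drain bias, which is the behaviour plotted in Fig. 2.10.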
Due to the proximity of the drain and the source in short-channel devices, the deple-
tion regions at the source–substrate and drain–substrate junctions extend into the
channel. If the doping is kept constant while the channel length is reduced, the sepa-
ration between the depletion region boundaries decreases. Increased reverse bias
across the junction further decreases the separation. When the depletion regions
merge, majority carriers in the source enter into the substrate and get collected by
the drain. This situation is known as punch-through condition as shown in Fig. 2.11.
The net effect of punch through is an increase in the subthreshold leakage current.
2.7 Emerging Technologies for Low Power

Over the past two decades, the industry has closely followed Moore's law, scaling transistors that use silicon dioxide (SiO2) as the gate dielectric. But, as transistor size shrinks, the leakage current increases drastically. Managing that leakage is crucial for reliable high-speed operation. As a consequence, this is becoming an
increasingly important factor in chip design. High-K (Hi-K) materials are proposed
to reduce the gate leakage current, a metal gate is used to suppress the poly-silicon
gate depletion, and SOI technologies with single or multiple gate transistors offer
opportunities for further scaling down of the transistor dimensions. Many other
alternatives such as dual-gated SOI and substrate biasing have recently been pro-
posed to address the conflicting requirement of high performance during active
mode of operation and low leakage during sleep mode of operation.
A significant breakthrough has been made by the industry in solving the chip power problem: a new “Hi-K” material based on hafnium (Hf) to replace the transistor’s silicon dioxide gate dielectric, and new metals like nickel (Ni) silicide to replace the poly-silicon gate electrode of n-type and p-type MOS transistors. The scaling of CMOS transistors has made the silicon dioxide gate dielectric so thin (about 1.4 nm) that its leakage current is too large. It is
necessary to replace the SiO2 dielectric with a physically thicker layer of oxides of
higher dielectric constant (K) or “Hi-K” gate oxides such as hafnium oxide (HfO2)
and hafnium silicate (HfSiO). For the sub-100-nm MOS structure, such a dielectric reduces the leakage current significantly compared with SiO2 of the same electrical equivalent thickness. It has been established that these oxides must be implemented
in conjunction with metal gate electrodes, the development of which is further behind. A metal gate electrode is a gate electrode made of a metal or a compound with metallic conductivity. The current standard gate electrode, doped polycrystalline silicon (poly-Si), is slightly depleted at its surface due to its semiconducting nature, and this depletion decreases the current drivability of MOS transistors; a metal gate electrode is free from such depletion.
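The benefit of a physically thicker high-K film can be quantified through the equivalent oxide thickness (EOT); the dielectric constants used below are commonly quoted approximate values, not figures from the text:

```python
# A high-K film of physical thickness t_phys behaves electrically like
# SiO2 of thickness t_phys * (K_SiO2 / K_material).
def eot(t_phys_nm, k_material, k_sio2=3.9):
    return t_phys_nm * k_sio2 / k_material

# An assumed 5-nm HfO2 film (K ~ 20) is electrically close to ~1 nm
# of SiO2 while being physically much thicker, hence far less leaky.
print(eot(5.0, 20.0))   # 0.975 (nm of equivalent SiO2)
```

The same gate capacitance is thus obtained with a film several times thicker than the 1.4-nm SiO2 layer mentioned above, which suppresses the tunneling leakage.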
2.7.3 Silicon on Insulator
Rather than using silicon as the substrate, technologies such as SOI have been de-
veloped that use an insulating substrate to improve process characteristics such as
latch-up and speed. Figure 2.14 shows a CMOS inverter fabricated using the SOI
approach. The steps used in a typical SOI CMOS process are as follows:
• A thin film (7–8 µm) of very lightly doped n-type Si is epitaxially grown over an
insulator. Sapphire or SiO2 is a commonly used insulator.
• An anisotropic etch is used to etch away the Si except where a diffusion area will
be needed.
• Implantation of the p-island where an n-transistor is formed.
• Implantation of the n-island where a p-transistor is formed.
• Growing of a thin gate oxide (100–250 Å).
• Depositing of phosphorus-doped poly-silicon film over the oxide.
• Patterning of the poly-silicon gate.
• Forming of the n-doped source and drain of the n-channel devices in the p-
islands.
• Forming of the p-doped source and drain of the p-channel devices in the n-is-
lands.
• Depositing of a layer of insulator material such as phosphorus glass or SiO2 over
the entire structure.
• Etching of the insulator at contact cut locations. The metallization layer is formed
next.
• Depositing of the passivation layer and etching of the bonding pad location.
2.7.4 Advantages of SOI
• Due to the absence of wells, transistor structures denser than bulk silicon are
feasible.
• Lower substrate capacitance.
• No field-inversion problems (the existence of a parasitic transistor between two
normal transistors).
• No latch-up is possible because of the isolation of transistors by insulating sub-
strate.
2.7.5 FinFET
The FinFET was developed at the University of California, Berkeley; Intel uses the
tri-gate term to describe their closely related tri-gate architecture.
FinFET is used somewhat generically to describe any fin-based, multi-gate transis-
tor architecture regardless of the number of gates. In a FinFET, gates turn on and
off much faster than with planar transistors, since the channel is surrounded on
three sides by the gate. As a result, leakage current is substantially reduced, and the
supply voltage Vdd, and hence the dynamic power, can be significantly lowered as well.
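Because dynamic switching power follows the standard expression P = α·C·Vdd²·f, any reduction in Vdd enabled by the better gate control of a FinFET pays off quadratically. A minimal sketch with illustrative numbers (none taken from the text):

```python
def dynamic_power(alpha, c_load_f, vdd_v, freq_hz):
    """Dynamic switching power P = alpha * C * Vdd^2 * f
    (alpha = switching activity factor, C = switched capacitance)."""
    return alpha * c_load_f * vdd_v ** 2 * freq_hz

# Halving Vdd (all else equal) cuts switching power by a factor of four.
p_high = dynamic_power(0.1, 1e-9, 1.0, 1e9)   # 1.0 V supply
p_low = dynamic_power(0.1, 1e-9, 0.5, 1e9)    # 0.5 V supply
```

The quadratic dependence on Vdd is why supply-voltage scaling is the single most effective low-power knob.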
2.8 Chapter Summary
2.9 Review Questions
Q2.1. Compare the two oxidation techniques used in the MOS fabrication process.
Q2.2. Explain the steps used in the photolithographic techniques for the fabrication
of MOS transistors.
Q2.3. Compare the two approaches used for diffusion in the MOS fabrication
process.
Q2.4. State the steps used for nMOS fabrication.
Q2.5. State the steps used for the fabrication of an n-well CMOS process.
Q2.6. Explain the latch-up problem of CMOS devices. How can it be overcome?
Q2.7. Explain the twin-tub process of CMOS fabrication. What are the advantages
of this technique?
Q2.8. Explain the channel-length modulation effect.
Q2.9. Explain the LDD structure for the fabrication of MOS transistors. How does
it help to overcome SCE?
Q2.10. How is the SOI approach used to overcome latch-up problems of CMOS
transistors?
Q2.11. Explain the FinFET approach for the fabrication of MOS transistors.
References
1. Mukherjee, A.: Introduction to nMOS and CMOS VLSI Systems Design. Prentice Hall, Engle-
wood Cliffs (1986)
2. Kang, S.-M., Leblebici, Y.: CMOS Digital Integrated Circuits, 3rd edn. Tata McGraw-Hill,
New Delhi (2003)
3. Pucknell, D.A., Eshraghian, K.: Basic VLSI Design Systems and Circuits, 2nd edn. Prentice-
Hall, New Delhi (1988)
4. Troutman, R.R.: Latch-up in CMOS Technology: The Problem and Its Cure. Kluwer, Boston
(1986)
5. Estreich, D.B., Dutton, R.W.: Modeling latch-up in CMOS integrated circuits and systems.
IEEE Trans. Comput. Aided Des. CAD–1(4), 347–354 (1982)
6. D’Agostino, F., Quercia, D.: Short-Channel Effects in MOSFETs, Project Report (2000)
7. Fossum, J.G., Trivedi, V.P.: Fundamentals of Ultra-Thin-Body MOSFETs and FinFETs. Cam-
bridge University Press, Cambridge (2013)
8. Roy, K., Mukhopadhyay, S., Mahmoodi-Meimand, H.: Leakage current mechanisms and
leakage reduction techniques in deep-submicrometer CMOS circuits. Proc. IEEE 91(2),
305–327 (2003)
Chapter 3
MOS Transistors
3.1 Introduction
Fig. 3.2 a nMOS enhancement-mode transistor. b nMOS depletion-mode transistor
The structure of an nMOS depletion-mode transistor is shown in Fig. 3.2b. Here,
the source and drain are normally connected by a conducting path,
which can be removed by applying a suitable negative voltage to the gate. This is
known as the depletion mode of operation.
For example, consider the case when the substrate is lightly doped p-type and
the channel region is implanted with an n-type impurity. This leads to the formation of
an nMOS depletion-mode transistor. In both cases, the current flow between the
source and drain can be controlled by varying the gate voltage, and only one type of
charge carrier, that is, electrons or holes, takes part in the flow of current. That is the
reason why MOS devices are called unipolar devices, in contrast to bipolar junction
transistors (BJTs), where both types of charge carriers take part in the flow of cur-
rent. Therefore, by using the MOS technology, four basic types of transistors can be
fabricated—nMOS enhancement type, nMOS depletion type, pMOS enhancement
type, and pMOS depletion type. Each type has its own pros and cons. It is also pos-
sible to realize circuits by combining both nMOS and pMOS transistors, known as
Complementary MOS ( CMOS) technology. Commonly used symbols of the four
types of transistors are given in Fig. 3.3.
[Figure labels: electrons, depleted region, p-type substrate; interface potential; "fluid" representing amount of charge]
To visualize the operation of charge-controlled devices without resorting to detailed
device physics, we need an effective model. The Fluid model [1] is one such tool,
which can be used to visualize the behavior of charge-controlled devices such as
MOS transistors, charge-coupled devices (CCDs), and bucket-brigade devices
(BBDs). Using this model,
even a novice can understand the operation of these devices.
The model is based on two simple ideas: (a) electrical charge is treated as a fluid,
which moves from one place to another depending on the difference in level
between them, and (b) electrical potentials can be mapped into the geometry of a
container, in which the fluid can move around.
Based on this idea, first, we shall consider the operation of a simple MOS capacitor
followed by the operation of an MOS transistor.
From the knowledge of basic physics, we know that a simple parallel-plate ca-
pacitor can be formed with the help of two identical metal plates separated by an
insulator. An MOS capacitor is realized by sandwiching a thin oxide layer between
a metal or poly-silicon plate on a silicon substrate of suitable type as shown in Fig
3.4a. As we know, in the case of a parallel-plate capacitor, if a positive voltage is
applied to one of the plates, it induces a negative charge on the lower plate. Here, if
a positive voltage is applied to the metal or poly-silicon plate, it will repel the majority
carriers of the p-type substrate, creating a depletion region. Gradually, minority
carriers (electrons) are generated by some physical process, such as heat or incident
light, or they can be injected into this region. These minority carriers accumulate
underneath the MOS electrode, just as in a parallel-plate capacitor. Based
on the fluid model, the MOS electrode generates a pocket in the form of a surface
3.3 The Fluid Model 47
potential in the silicon substrate, which can be visualized as a container. The shape
of the container is defined by the potential along the silicon surface. The higher the
potential, the deeper is the container, and more charge can be stored in it. However,
the minority carriers present in that region create an inversion layer. This changes
the surface potential; increase in the quantity of charge decreases the positive sur-
face potential under the MOS electrode. In the presence of inversion charge, the
surface potential is shown in Fig. 3.4b by the solid line. The area between the solid
line and the dashed line shows not only the presence of charge but also the amount
of charge. The capacity of the bucket is finite and depends on the applied electrode
voltage. Here, it is shown that the charge is sitting at the bottom of the container
just as a fluid would stay in a bucket. In practice, however, the minority carriers
in the inversion layer actually reside directly at the silicon surface. The surface of
the fluid must be level in the equilibrium condition. If it were not, electrons would
move under the influence of potential difference until a constant surface potential
is established. From this simple model, we may conclude that the amount of charge
accumulated in an MOS capacitor is proportional to the voltage applied between the
plates and to the area of the plates.
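This proportionality is just Q = C·V, with C set by the oxide thickness and the plate area. A small numeric sketch (the dimensions and voltage are illustrative, not from the text):

```python
EPS0 = 8.854e-12  # permittivity of free space, F/m
K_SIO2 = 3.9      # relative permittivity of SiO2

def mos_charge(area_m2, tox_m, v_volts):
    """Charge stored in an MOS capacitor: Q = C*V, with C = K*eps0*A/tox."""
    cap = K_SIO2 * EPS0 * area_m2 / tox_m
    return cap * v_volts

# Illustrative: a 1 um x 1 um gate, 10 nm oxide, 2 V effective voltage
# stores a few femtocoulombs of channel charge.
q = mos_charge(1e-12, 10e-9, 2.0)
```

Doubling either the voltage or the gate area doubles the stored charge, exactly as the fluid-model picture of a deeper or wider bucket suggests.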
[Fig. 3.5 labels: p-type substrate; Vsb and Vdb wells; b Vdb = Vsb + Δ; c Vdb > Vsb; d Vdb > Vgb − Vt]
Depending on the gate voltage, the
two wells are connected or separated. The potential in the channel region can be
controlled with the help of the gate voltage. The potential at the channel region is
shown by the dotted lines of Fig. 3.5b. The dotted line 1 corresponding to Vgb = 0
is above the drain and source potentials. As the gate voltage is gradually increased,
more and more holes are repelled from the channel region, and the potential at the
channel region moves downward as shown by the dotted lines 2, 3, etc. In this situ-
ation, the source and drain wells are effectively isolated from each other, and no
charge can move from one well to the other. A point is reached when the potential
level at the gate region is the same as that of the source and drain regions. At
this point, the channel region is completely devoid of holes. The gate voltage at
which this happens is called the threshold voltage ( Vt) of the MOS transistor. If the
gate voltage is increased further, there is an accumulation of electrons beneath the
SiO2 layer in the channel region, forming an inversion layer. As the gate voltage is
increased further, the potential at the gate region moves below the source and drain
potentials as shown by the dotted lines 3 and 4 in Fig. 3.5b. As a consequence, the
barrier between the two regions disappears and the charge from the source and drain
regions spills underneath the gate electrode leading to a uniform surface potential in
the entire region. By varying the gate voltage, the thickness of the inversion layer
can be controlled, which in turn will control the conductivity of the channel as visu-
alized in Fig. 3.5b. Under the control of the gate voltage, the region under it acts as a
movable barrier that controls the flow of charge between the source and drain areas.
When the source and drain are biased to different potentials ( Vdb > Vsb), there will
be a difference in the potential levels. Let us consider two different situations. In the
first case, the drain voltage is greater than the source voltage by some fixed value,
and the gate voltage Vgb is gradually increased from 0 V. Figure 3.6 shows different
situations. Initially, for Vgb = 0 V, the potential level in the channel region is above
the potential level of either of the source and drain regions, and the source and drain
are isolated. Now, if the gate voltage is gradually increased, first, the gate region
potential reaches the potential of the source region. Charge starts moving from the
source to the drain as the gate voltage is slightly increased. The rate of flow of
charge moving from the source to the drain region, represented by the slope of the
interface potential in the channel region, keeps on increasing until the gate region
potential level becomes the same as that of the drain potential level. In this situation,
the device is said to be operating in an active, linear, or unsaturated region. If the
gate voltage is increased further, the width of the channel between the source and
drain keeps on increasing, leading to a gradual increase in the drain current.
Let us consider another case when the gate voltage is held at a fixed value for a
heavily turned-on channel. To start with, the drain voltage is the same as that of the
source voltage, and it is gradually increased. Figure 3.6a shows the case when the
source and drain voltages are equal. Although the path exists for the flow of charg-
es, there will be no flow because of the equilibrium condition due to the same level.
In Fig. 3.6b, a small voltage difference is maintained by an externally applied
voltage. There will be a continuous flow of charge, resulting in drain current. With the
increase in voltage difference between the source and drain, the difference in the
fluid level increases, and the charge layer becomes thinner, signifying faster
movement of charges. With the increasing drain potential, the amount of charge
flowing from the source to drain per unit time increases. In this situation, the device
is said to be operating in an active, linear, or unsaturated region. However, there
is a limit to it. It attains a maximum value, when the drain potential Vdb = ( Vgb−Vt).
Further increase in drain voltage does not lead to any change in the rate of charge
flow. The device is said to be in the saturation region. In this condition, the drain
current becomes independent of the drain voltage, and it is fully determined by the
gate potential.
The strength of the fluid model is demonstrated above by the visualization of the
operation of an MOS transistor. It can be applied to more complex situations where
it is difficult to derive closed form of equations. In such situations, the fluid model
will be of real help in understanding the operation of such circuits.
To summarize this section, we can say that an MOS transistor acts as a voltage-
controlled device. The device first conducts when the effective gate voltage ( Vgb−Vt)
is more than the source voltage. The conduction characteristic is represented in
Fig. 3.7a. On the other hand, as the drain voltage is increased with respect to the
source, the current increases until Vdb = ( Vgb−Vt). For drain voltage Vdb ˃ ( Vgb−Vt),
the channel becomes pinched off, and there is no further increase in current. A plot
of the drain current with respect to the drain voltage for different gate voltages is
shown in Fig. 3.7b.
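The qualitative picture developed above maps directly onto the three operating regions of the device. A minimal classifier (simple long-channel view, with the threshold passed as a parameter):

```python
def operating_region(vgs, vds, vt):
    """Classify an nMOS transistor's region of operation (long-channel model)."""
    if vgs <= vt:
        return "cutoff"       # no inversion layer: source and drain isolated
    if vds < vgs - vt:
        return "linear"       # channel extends from source to drain
    return "saturation"       # channel pinched off: Ids set by Vgs alone
```

For example, with Vt = 1 V, a device at Vgs = 2 V sits in the linear region for Vds below 1 V and in saturation above it.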
50 3 MOS Transistors
Fig. 3.8 a Accumulation mode, b depletion mode, and c inversion mode of an MOS transistor
Having gained some insight into the operation of an MOS transistor, let us now have
a look at the charge distribution under the gate region under different operating con-
ditions of the transistor. When the gate voltage is very small and much less than the
threshold voltage, Fig. 3.8a shows the distribution of the mobile holes in a p-type
substrate. In this condition, the device is said to be in the accumulation mode. As the
gate voltage is increased, the holes are repelled from the SiO2–substrate interface
and a depletion region is created under the gate when the gate voltage is equal to
the threshold voltage. In this condition, the device is said to be in depletion mode
as shown in Fig. 3.8b. As the gate voltage is increased further above the threshold
voltage, electrons are attracted to the region under the gate creating a conducting
layer in the p substrate as shown in Fig. 3.8c. The transistor is now said to be in
inversion mode.
The fluid model, presented in the previous section, gives us some basic understand-
ing of the operation of an MOS transistor [3, 4]. We have seen that the whole con-
cept of the MOS transistor is based on the use of the gate voltage to induce charge
(inversion layer) in the channel region between the source and the drain. Applica-
tion of the source-to-drain voltage Vds causes this charge to flow through the chan-
nel from the source to drain resulting in source-to-drain current Ids. The Ids depends
on two variable parameters—the gate-to-source voltage Vgs and the drain-to-source
voltage Vds. The operation of an MOS transistor can be divided into the following
three regions:
3.5 Electrical Characteristics of MOS Transistors 51
(a) Cutoff region: This is essentially the accumulation mode, when there is no
effective flow of current between the source and drain.
(b) Nonsaturated region: This is the active, linear, or triode mode, when
the drain current is dependent on both the gate and the drain voltages.
(c) Saturated region: This is the strong inversion mode, when the drain current is
independent of the drain-to-source voltage but depends on the gate voltage.
In this section, we consider an nMOS enhancement-type transistor and establish
its electrical characteristics. The structural view of the MOS transistor, as shown
in Fig. 3.9, shows the three important parameters of MOS transistors, the channel
length L, the channel width W, and the dielectric thickness D. The expression for the
drain current is given by
Ids = charge induced in the channel (Qc) / electron transit time (tn).  (3.1)

The induced charge is

Qc = CG · Veff,

where CG is the gate capacitance and Veff is the effective gate voltage. Now, the
transit time is

tn = length of the channel (L) / velocity of electrons (τn).  (3.3)
52 3 MOS Transistors
The velocity, τ n = µ n ⋅ Eds , where μn is the mobility of electron and Eds is the drain to
the source electric field due to the voltage Vds applied between the drain and source.
Now, Eds = Vds/L.
So,

τn = µn·Vds / L  and  tn = L² / (µn·Vds).  (3.4)

The drain current is then

Ids = Qc / tn.
So, in terms of the gate-channel capacitance the expression for drain-to-source cur-
rent can be written as
Ids = (Cg·µn / L²) · [ (Vgs − Vt)·Vds − Vds² / 2 ].  (3.10)
The Saturated Region As we have seen in the previous section, the drain current
( Ids) increases as drain voltage increases until the IR drop in the channel equals the
effective gate voltage at the drain. This happens when Vds = Vgs−Vt. At this point,
the transistor comes out of the active region and Ids remains fairly constant as Vds
increases further. This is known as saturation condition. Assuming Vds = Vgs−Vt for
this region, the saturation current is given by
Ids = K · (W/L) · (Vgs − Vt)² / 2

or

Ids = (Cg·µn / 2L²) · (Vgs − Vt)² = (Cox·W·µn / 2L) · (Vgs − Vt)² = (µn·Cox / 2) · (W/L) · (Vgs − Vt)².  (3.11)
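Equations 3.10 and 3.11 combine into a single drain-current routine. The sketch below uses the µn·Cox·(W/L) form; the process transconductance and W/L values are illustrative, not from the text:

```python
def ids(vgs, vds, vt, k_prime=100e-6, w_over_l=2.0):
    """Long-channel nMOS drain current per Eqs. 3.10-3.11.
    k_prime = mu_n * Cox in A/V^2; w_over_l = W/L. Values are illustrative."""
    if vgs <= vt:
        return 0.0                                   # cutoff
    veff = vgs - vt
    if vds < veff:                                   # nonsaturated (linear)
        return k_prime * w_over_l * (veff * vds - vds ** 2 / 2)
    return k_prime * w_over_l * veff ** 2 / 2        # saturated
```

The two expressions meet continuously at Vds = Vgs − Vt, the onset of saturation, beyond which the current no longer grows with Vds.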
[Fig. 3.11: drain current Ids (mA) vs. drain voltage Vds (volts) for several gate voltages]
In a depletion-mode nMOS transistor, a negative gate voltage
decreases the channel width, leading to a reduced drain current. A suitable negative
gate voltage fully depletes the channel isolating the source and drain regions. The
characteristic curve, as shown in Fig. 3.11, is similar except the threshold volt-
age, which is a negative voltage in case of a depletion-mode nMOS transistor. In a
similar manner, the expression for drain current can be derived and voltage–current
characteristics can be drawn for pMOS enhancement-mode and pMOS depletion-
mode transistors.
3.5.1 Threshold Voltage
One of the parameters that characterize the switching behavior of an MOS transistor
is its threshold voltage Vt. As we know, this can be defined as the gate voltage at
which an MOS transistor begins to conduct. Typical value for threshold voltage for
an nMOS enhancement-type transistor is 0.2 Vdd, i.e., for a supply voltage of 5 V,
Vtn = 1.0 V. As we have seen, the drain current depends on both the gate voltage and
the drain voltage with respect to the source. For a fixed drain-to-source voltage, the
variation of conduction of the channel region (represented by the drain current) for
different gate voltages is shown in Fig. 3.12 for four different cases: nMOS deple-
tion, nMOS enhancement, pMOS enhancement, and pMOS depletion transistors,
shown in Fig. 3.12a–d, respectively.
The threshold voltage is a function of a number of parameters, including gate
conductor material, gate insulation material, thickness of the gate insulator, doping
level in the channel regions, impurities in the silicon–insulator interface and voltage
between the source and substrate Vsb.
Moreover, the absolute value of the threshold voltage decreases with an increase
in temperature at the rate of −2 mV/°C and −4 mV/°C for low and high substrate
doping levels, respectively.
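The stated temperature coefficients fold into a simple linear estimate; the room-temperature reference and the 0.7 V nominal threshold below are assumed for illustration:

```python
def vt_at_temp(vt_room, temp_c, tc_mv_per_c=-2.0, room_c=27.0):
    """Linear estimate of threshold drift with temperature: about -2 mV/C
    for lightly doped substrates and about -4 mV/C for heavily doped ones."""
    return vt_room + tc_mv_per_c * 1e-3 * (temp_c - room_c)

# A 0.7 V threshold at 27 C falls to about 0.554 V at 100 C (-2 mV/C case).
```

The drop in threshold at high temperature matters for low-power design because subthreshold leakage grows rapidly as Vt falls.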
Fig. 3.12 Variation of drain current with gate voltage. a n-Channel enhancement. b n-Channel
depletion. c p-Channel enhancement. d p-Channel depletion
where the parameter γ is the substrate-bias coefficient, φb is the substrate Fermi
potential, and Vsb is the source-to-substrate voltage.
The expression holds good for both n-channel and p-channel devices.
• The substrate Fermi potential φb is negative in nMOS and positive in pMOS.
• The substrate bias coefficient γ is positive in nMOS and negative in pMOS.
• The substrate bias voltage Vsb is positive in nMOS and negative in pMOS.
where q is the charge of an electron, εSi is the permittivity of the silicon sub-
strate, NA is the doping concentration of the substrate (10¹⁶ cm⁻³), Cox is the
gate-oxide capacitance, and ni is the intrinsic carrier concentration of silicon
(1.45 × 10¹⁰ cm⁻³).
3.5.2 Transistor Transconductance gm
The transit time of carriers from the source to the drain is

tsd = L² / (µn·Vds).  (3.16)
Thus,

δIds = (δQc / L²)·µn·Vds.  (3.17)

But,

δQc = Cg·δVgs.
So,

δIds = (µn·Cg / L²)·Vds·δVgs,  (3.18)

or

gm = δIds / δVgs = Cg·µn·Vds / L².

In saturation, Vds = (Vgs − Vt),  (3.19)

and substituting Cg = εins·ε0·W·L / D, we get

gm = (µn·εins·ε0 / D)·(W/L)·(Vgs − Vt).  (3.20)
3.5.3 Figure of Merit
The figure of merit ω0 gives us an idea about the frequency response of the device:

ω0 = gm / Cg = (µn / L²)·(Vgs − Vt) = 1 / tsd.  (3.21)

A fast circuit requires gm to be as high as possible and a small value of Cg. From
Eq. 3.21, it can be concluded that a higher gate voltage and higher electron mobility
provide better frequency response.
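Equations 3.20 and 3.21 are easy to evaluate numerically; the device parameters used below are illustrative only:

```python
EPS0 = 8.854e-12  # permittivity of free space, F/m

def gm_sat(mu_n, k_ins, w, l, d, veff):
    """Saturation transconductance, Eq. 3.20:
    gm = (mu_n * eps_ins * eps0 / D) * (W/L) * (Vgs - Vt)."""
    return mu_n * k_ins * EPS0 * w * veff / (d * l)

def omega0(mu_n, l, veff):
    """Figure of merit, Eq. 3.21: w0 = gm/Cg = (mu_n/L^2)*(Vgs - Vt) = 1/tsd."""
    return mu_n * veff / l ** 2

# Illustrative values: mu_n = 0.06 m^2/Vs, SiO2 insulator (K = 3.9),
# W = L = 1 um, D = 10 nm, effective gate voltage 1 V.
g = gm_sat(0.06, 3.9, 1e-6, 1e-6, 10e-9, 1.0)
w0 = omega0(0.06, 1e-6, 1.0)
```

Note how ω0 improves quadratically as the channel length L shrinks, which is one driver behind device scaling.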
3.5.4 Body Effect
All MOS transistors are usually fabricated on a common substrate and substrate
(body) voltage of all devices is normally constant. However, as we shall see in
subsequent chapters, when circuits are realized using a number of MOS devices,
several devices are connected in series. This results in different source potentials for
different devices. It may be noted from Eq. 3.13 that the threshold voltage Vt is not
constant with respect to the voltage difference between the substrate and the source
of the MOS transistor. This is known as the substrate-bias effect or body effect. In-
creasing the Vsb causes the channel to be depleted of charge carriers, and this leads
to an increase in the threshold voltage.
Using Eq. 3.13, we compute and plot the threshold voltage Vt as a function of the
source-to-substrate voltage Vsb. The voltage Vsb will be assumed to vary between 0
and 5 V. The graph obtained is shown in Fig. 3.13.
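With the standard long-channel body-effect expression (assumed here to correspond to the book's Eq. 3.13; the γ and 2φF values are illustrative), the curve of Fig. 3.13 can be computed directly:

```python
from math import sqrt

def vt_body(vt0, vsb, gamma=0.4, two_phi_f=0.6):
    """Body effect (standard form, assumed to match Eq. 3.13):
    Vt = Vt0 + gamma * (sqrt(2*phiF + Vsb) - sqrt(2*phiF)).
    gamma (V^0.5) and 2*phiF (V) are illustrative values."""
    return vt0 + gamma * (sqrt(two_phi_f + vsb) - sqrt(two_phi_f))

# Vt rises monotonically as Vsb sweeps from 0 to 5 V, as in Fig. 3.13.
curve = [vt_body(0.7, v) for v in range(6)]
```

The square-root dependence means the threshold shift is steepest at small Vsb and flattens as the substrate bias grows.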
Fig. 3.13 Threshold voltage Vt (V) as a function of the source-to-substrate (substrate-bias) voltage Vsb (V)
The variation of the threshold voltage due to the body effect is unavoidable in
many situations, and the circuit designer should take appropriate measures to over-
come the ill effects of this threshold voltage variation.
3.5.5 Channel-Length Modulation
Fig. 3.14 a Nonsaturated region. b Onset of saturation. c Deep in saturation
When the gate voltage exceeds the threshold voltage, the inversion layer provides a
conducting path between the source and drain. As the drain voltage is increased from zero, the
current flow increases linearly with the drain voltage, and the channel depth at the
drain end also gradually decreases. Eventually at drain voltage Vds = Vgs − Vt , the
inversion charge and the channel depth reduces to zero as shown in Fig. 3.14b. This
is known as the pinch-off point. As the drain voltage is increased further, a depletion
region is formed adjacent to the drain, and the depletion region gradually grows
with the increase in drain voltage. This leads to gradual shifting of the pinch-off
point towards the source, thereby reducing channel length as shown in Fig. 3.14c.
This effective channel length Leff can be represented by
Leff = L − ΔL.  (3.22)

The saturation current then becomes

Ids(sat) = [1 / (1 − ΔL/L)] · (µn·Cox / 2) · (Wn/Ln) · (Vgs − Vtn)².

Since

1 − ΔL/L ≈ 1 − λVds,

and assuming λVds ≪ 1, so that 1/(1 − λVds) ≈ 1 + λVds,

Ids(sat) = (µn·Cox / 2) · (Wn/Ln) · (Vgs − Vt0)² · (1 + λVds).  (3.23)
The channel-length modulation coefficient λ has a value in the range of 0.005–
0.02 per volt. Taking into consideration the channel-length modulation effect, the
voltage–current characteristic is shown in Fig. 3.15.
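Equation 3.23 is straightforward to evaluate; the λ chosen below sits inside the stated range, and the other parameters are illustrative:

```python
def ids_sat_clm(vgs, vds, vt, lam=0.02, k_prime=100e-6, w_over_l=2.0):
    """Saturation current with channel-length modulation, Eq. 3.23:
    Ids(sat) = (k'/2) * (W/L) * (Vgs - Vt)^2 * (1 + lambda*Vds).
    Parameter values are illustrative."""
    return 0.5 * k_prime * w_over_l * (vgs - vt) ** 2 * (1 + lam * vds)

# The (1 + lambda*Vds) factor gives the saturation curves of Fig. 3.15
# their slight upward slope instead of being perfectly flat.
```

Without the correction term the saturation current would be independent of Vds; with it, the output conductance in saturation is small but nonzero.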
We have seen that in the linear region (when the drain-to-source voltage is small)
an MOS transistor acts as a variable resistance, which can be controlled by the gate
voltage. An nMOS transistor can be switched from very high resistance when the
gate voltage is less than the threshold voltage, to low resistance when Vgs exceeds
the threshold voltage Vt n. This has opened up the possibility of using an MOS tran-
sistor as a switch, just like a relay. For example, an nMOS transistor when used as
a switch is OFF when Vgs = 0 V and ON when Vgs = Vdd. However, its behavior as a
switch is not ideal. When Vgs = Vdd, the switch turns on but the on resistance is not
zero. As a result, there is some voltage drop across the switch, which can be ne-
glected when it is in series with a large resistance. Moreover, if Vdd is applied to the
input terminal, at the other end we shall get ( Vdd−Vt n). This is because when output
voltage is more than ( Vdd−Vt n), the channel turns off, and it no longer functions as
a closed switch, as shown in Fig. 3.16a. However, a low-level signal can be passed
without any degradation. The transistor used in the above manner is known as pass
transistor. It may be noted that the roles of drain and source are interchangeable,
and the device truly acts as a bilateral switch.
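The signal-level degradation described above can be summarized in two one-line helpers (the Vdd and threshold values in the comment are illustrative):

```python
def nmos_pass_high(vdd, vtn):
    """Highest level an nMOS pass transistor can deliver: Vdd - Vtn
    (the channel cuts off once the output rises to within Vtn of the gate)."""
    return vdd - vtn

def pmos_pass_low(vtp_mag):
    """Lowest level a pMOS pass transistor can deliver: |Vtp|."""
    return vtp_mag

# With Vdd = 5 V and Vtn = |Vtp| = 1 V: the nMOS passes 0 V but only up to
# 4 V; the pMOS passes 5 V but only down to 1 V. In parallel, the two
# devices cover the full 0 V to 5 V range.
```

This complementary weakness is exactly what motivates the transmission gate discussed next.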
Similarly, a pMOS transistor can also be used as a switch. In this case, the mini-
mum voltage that it can pass is |Vtp|, since once the output falls to |Vtp|, the magnitude
of the gate-to-source voltage no longer exceeds |Vtp| and the transistor turns off.
This is shown in Fig. 3.16b.
Therefore, a p-channel transistor passes a weak low-level signal but a strong high-
level signal as shown below. Later, we shall discuss the use of pass transistors in
realizing Boolean functions and discuss its advantages and disadvantages.
To overcome the limitation of either of the transistors, one pMOS and one nMOS
transistor can be connected in parallel with complementary inputs at their gates. In
this case, we can get both low and high levels of good quality at the output. The low
level passes through the nMOS switch, and the high level passes through the pMOS
switch without any degradation as shown in Fig. 3.16c. A more detailed discussion
on transmission gates is given in the following subsection.
3.6.1 Transmission Gate
The transmission gate is one of the basic building blocks of MOS circuits. It finds
use in realizing multiplexors, logic circuits, latch elements, and analog switches.
3.6 MOS Transistors as a Switch 61
The current contributing to charge the load capacitor by the two transistors is
Idsn = Kn · (Wn / 2Ln) · (Vdd − Vout − Vtn)²,  (3.24)

Idsp = Kp · (Wp / 2Lp) · (Vdd − |Vtp|)².  (3.25)
Fig. 3.17 a and e Output node charges from low-to-high level or high-to-low level. b and f The
output voltage changing with time for different transitions. c and g The drain currents through the
two transistors as a function of the output voltage. d and h The equivalent resistances as a function
of the output voltage
and

Reqp = (Vdd − Vout) / Isdp = 2Lp · (Vdd − Vout) / [Kp·Wp·(Vdd − |Vtp|)²].  (3.27)
Fig. 3.18 a Charging a small capacitor. b Variation of the output currents with the input voltage.
c Variation of the equivalent resistances with the input voltage
Region II: In this region, the nMOS transistor remains in saturation region, whereas
the pMOS transistor operates in the linear region. Therefore, in this case
Idsp = (Kp·Wp / Lp) · [ (Vdd − |Vtp|)·(Vdd − Vout) − (Vdd − Vout)² / 2 ],  (3.28)

Reqp = (2Lp / (Kp·Wp)) · 1 / [ 2·(Vdd − |Vtp|) − (Vdd − Vout) ].  (3.29)
Region III: In this region, the nMOS transistor turns off and pMOS transistor con-
tinues to operate in the linear region.
These individual nMOS and pMOS currents and the combined current are shown in
Fig. 3.17c. It may be noted that the current decreases linearly as voltage builds up
across the capacitor CL. The equivalent resistances and their combined values are
shown in Fig. 3.17d.
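Because the two devices conduct side by side, the switch resistance seen by the load is the parallel combination of their equivalent resistances. The sketch below mirrors Eq. 3.27; the saturated nMOS form and all device parameters are assumptions for illustration (Eq. 3.26 is not reproduced above):

```python
def reqn_sat(vdd, vout, vtn, k_n=100e-6, wn_over_ln=2.0):
    """nMOS equivalent resistance while saturated (form assumed by analogy
    with Eq. 3.27): Reqn = (Vdd - Vout) / Idsn,
    with Idsn = (k_n/2) * (W/L) * (Vdd - Vout - Vtn)^2."""
    idsn = 0.5 * k_n * wn_over_ln * (vdd - vout - vtn) ** 2
    return (vdd - vout) / idsn

def req_parallel(rn, rp):
    """Combined transmission-gate resistance: Rn || Rp = Rn*Rp/(Rn+Rp)."""
    return rn * rp / (rn + rp)
```

The parallel combination stays comparatively flat across the output swing, which is why the transmission gate behaves as a nearly constant on-resistance, as Fig. 3.17d suggests.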
Similarly, when the input voltage changes quickly from Vdd to 0 V and the load
capacitance discharges through the switch, it can be visualized by Fig. 3.17e–h.
Region I: Both nMOS and pMOS are in saturation for Vout > (Vdd − Vtn).
Region II: nMOS is in the linear region, and pMOS is in saturation for
|Vtp| < Vout < (Vdd − Vtn).
Region III: nMOS is in the linear region, and pMOS is cutoff for Vout < |Vtp|.
As shown in Fig. 3.17f, the current decreases linearly as voltage across the capacitor
decreases from Vdd to 0 V. Note that the role of the two transistors reverses in the
two cases.
Case II: Small Capacitive Load Another situation is the operation of the trans-
mission gate when the output is lightly loaded (smaller load capacitance). In this
case, the output closely follows the input. This is represented in Fig. 3.18a.
In this case, the transistors operate in three regions depending on the input voltage
as follows:
Region I: nMOS is in the linear region and pMOS is cutoff for Vin < |Vtp|.
Region II: Both nMOS and pMOS are in the linear region for |Vtp| < Vin < (Vdd − Vtn).
Region III: nMOS is cutoff and pMOS is in the linear region for Vin > (Vdd − Vtn).
As the voltage difference between the transistors is always small, the transistors
either operate in the nonsaturated region or are off as shown above. The individual
currents along with the total current are shown in Fig. 3.18b. The variation of the on
resistance and combined resistance is also shown in Fig. 3.18c.
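The three lightly loaded regions listed above can be captured in a small classifier (thresholds are passed as magnitudes; the 5 V/1 V values in the tests are illustrative):

```python
def tg_region(vin, vdd, vtn, vtp_mag):
    """Which transmission-gate device conducts for a lightly loaded output
    (Vout tracking Vin), per Regions I-III above."""
    if vin < vtp_mag:
        return "nMOS only"    # Region I: pMOS cutoff
    if vin <= vdd - vtn:
        return "both"         # Region II: both devices in the linear region
    return "pMOS only"        # Region III: nMOS cutoff
```

At every input level at least one device conducts, which is why the combined on-resistance in Fig. 3.18c never goes to infinity.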
3.7 Chapter Summary
3.8 Review Questions
Calculate and plot the threshold voltage at room temperature, for Vsb varying from
0 to 5 V.
Q3.3. What is the channel-length modulation effect? How does it affect the charac-
teristics of an MOS transistor?
Q3.4. The input of a lightly loaded transmission gate slowly changes from high
level to low level. How do the currents through the two transistors vary?
Q3.5. Explain the operation of the transmission gate as a switch. How does the on-
resistance change as the input varies from 0 V to Vdd when the output has a light
capacitive load?
Q3.6. What is the body effect? How can it be used to realize low-power and high-
performance circuits?
Q3.7. Explain the function of an MOS transistor in the saturation mode using the
fluid model.
Q3.8. Explain the function of an MOS transistor in the nonsaturation mode using
the fluid model.
Q3.9. Explain the linear region of the I–V characteristic of an nMOS transistor us-
ing the fluid model.
Q3.10. What is the hot electron effect? How can its effect be minimized?
Q3.11. Explain the behavior of an MOS transistor based on the fluid model.
Q3.12. The input of a heavily loaded transmission gate slowly changes from high
level to low level. How do the currents through the two transistors vary, and how
does the output voltage vary with time?
References
1. Mead, C., Conway, L.: Introduction to VLSI Systems. Addison-Wesley, Reading (1980)
2. Pucknell, D.A., Eshraghian, K.: Basic VLSI Design: Systems and Circuits, 2nd edn. Prentice
Hall, New Delhi (1998)
3. Weste, N.H.E., Eshraghian, K.: Principles of CMOS VLSI Design: A System Perspective, 2nd
edn. Addison-Wesley, Reading (1993)
4. Kang, Sung-Mo, Leblebici, Y.: CMOS Digital Integrated Circuits Analysis and Design.
Mc-Graw-Hill, New Delhi (2003)
Chapter 4
MOS Inverters
4.1 Introduction
The pull-up device can be realized in several ways. The characteristics of the
inverter strongly depend on the pull-up device used to realize the inverter. Theoreti-
cally, a passive resistor of suitable value can be used. Although the use of a passive
resistor may be possible in realizing an inverter using discrete components, this is
not feasible in a very-large-scale integration (VLSI) implementation. Instead, an active
pull-up device realized using a depletion-mode nMOS transistor, an enhancement-mode
nMOS transistor, or a p-type metal–oxide–semiconductor (pMOS)
transistor could be used. Basic characteristics of MOS inverters are highlighted in
Sect. 4.2. The advantages and disadvantages of different inverter configurations are
explored in Sect. 4.3. Section 4.4 explores the inverter ratio in different situations.
The switching characteristics of MOS inverters are considered in Sect. 4.5. Various
delay parameters are estimated in Sect. 4.6. Section 4.7 presents different
circuit configurations to drive a large capacitive load.
4.2 Inverter and Its Characteristics
Before we discuss the practical inverters realized with MOS transistors, we
consider the characteristics of an ideal inverter [1, 2]. The truth table and logic sym-
bol of an inverter are shown in Fig. 4.2. The input to the inverter is Vin and the output
is Vout.
Figure 4.3 shows how the output of an ideal inverter changes as the input of the
inverter is varied from 0 V (logic level 0) to Vdd (logic level 1). Initially, the output is
Vdd when the input is 0 V; as the input crosses Vdd/2, the output switches to 0 V and
remains at this level up to the maximum input voltage Vdd. This diagram is known as
the input–output or transfer characteristic of the inverter. The input voltage, Vdd/2,
at which the output changes from high ‘1’ to low ‘0’, is known as inverter threshold
voltage. For practical inverters realized with MOS devices, the voltage transfer
characteristics will be far from this ideal voltage transfer characteristic represented
by Fig. 4.3. A more realistic voltage transfer characteristic is shown in Fig. 4.4a. As
shown in Fig. 4.4a, because of some voltage drop across the pull-up device, the output
high voltage level is less than Vdd for the low input voltage level. This voltage is
represented by VOH, which is the maximum output voltage level for output level ‘1’.
As the input voltage increases and crosses the threshold voltage of the pull-down
transistor, it starts conducting, which leads to a decrease in the output voltage level.
However, instead of an abrupt change in the voltage level from logic level ‘1’ to
logic level ‘0’, the voltage decreases rather slowly. The unity-gain point at which
dV0 / dVin = −1 is defined as the input low voltage VIL, which is the maximum input
voltage that can be treated as logic level ‘0’.
As the input voltage is increased further, the output crosses a point where Vin = Vout.
The voltage at which this occurs is referred to as the inverter threshold voltage VT.
Fig. 4.4 a Various voltage levels on the transfer characteristics; b low- and high-level noise
margins
It may be noted that the inverter threshold voltage may not be equal to Vdd/2 for
practical inverters. Before the output attains the output low voltage VOL, which is
the minimum output voltage for output logic level ‘0’, the transfer-characteristic
curve crosses another important point VIH, the minimum input voltage that can be
accepted as logic ‘1’. This point is also obtained at another unity gain point at which
dV0 / dVin = −1 as shown in Fig. 4.4a.
An important parameter called the noise margin is associated with the input–out-
put voltage characteristics of a gate. It is defined as the allowable noise voltage on
the input of a gate so that the output is not affected. The deviations in logic levels
from the ideal values, which are restored as the signal propagates to the output, can
be obtained from the DC characteristic curves. The logic levels at the input and
output are characterized by the four voltages VOL, VOH, VIL, and VIH defined above.
The low-level noise margin is defined as the difference in magnitude between the
minimum low output voltage of the driving gate and the maximum input low volt-
age accepted by the driven gate:

NML = VIL − VOL. (4.1)
The high-level noise margin is defined as the difference in magnitude between the
minimum high output voltage of the driving gate and the minimum voltage accept-
able as high level by the driven gate:

NMH = VOH − VIH. (4.2)
To find out the noise margin, we can use the transfer characteristics as shown in
Fig. 4.4a. The noise margins are shown in Fig. 4.4b.
When any of the noise margins is low, the gate is susceptible to a switching noise
at the input.
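These two definitions are simple enough to check numerically. A minimal sketch in Python, with illustrative voltage levels that are assumptions rather than data for any particular process:

```python
def noise_margins(v_ol, v_oh, v_il, v_ih):
    """Eqs. 4.1 and 4.2: NM_L = VIL - VOL and NM_H = VOH - VIH."""
    nm_l = v_il - v_ol   # low-level noise margin
    nm_h = v_oh - v_ih   # high-level noise margin
    return nm_l, nm_h

# Assumed levels for a 5-V gate
nm_l, nm_h = noise_margins(v_ol=0.1, v_oh=4.4, v_il=1.2, v_ih=2.9)
print(nm_l, nm_h)
```

A gate whose noise margin is small on either side is easily upset by switching noise at its input, which is the point made above.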
4.3 MOS Inverter Configurations
The various MOS inverter configurations [3] realized using different types of pull-
up devices are discussed in this section. In Sect. 4.3.1, the use of a passive resistor
as the pull-up device is discussed and its disadvantages are highlighted. The
use of a depletion-mode nMOS transistor as the pull-up device is discussed in
Sect. 4.3.2. Section 4.3.3 discusses the use of an enhancement-mode nMOS
transistor, whereas Sect. 4.3.4 discusses the use of a pMOS transistor as a pull-up
Fig. 4.5 a An nMOS inverter with resistive load; b voltage–current characteristic; c transfer char-
acteristic. nMOS n-type–metal–oxide semiconductor
device. The pMOS device can also be used to realize the CMOS
inverter, where the two transistors are used in complementary mode, as discussed in
Sect. 4.3.5. The various inverters introduced in this section are compared in Sect. 4.3.6.
A passive resistor RL can be used as the pull-up device as shown in Fig. 4.5a. The
value of the resistor should be chosen such that the circuit functionally behaves
like an inverter. When the input voltage Vin is less than Vtn, the transistor is OFF
and the output capacitor charges to Vdd. Therefore, we get Vdd as the output for any
input voltage less than Vtn. When Vin is greater than Vtn, the MOS transistor acts as
a resistor Rc, where Rc is the channel resistance with Vgs > Vtn. The output capacitor
discharges through this resistor and output voltage is given by
VOL = Vdd · Rc/(Rc + RL). (4.3)

Normally, this output is used to drive other gates. Functionally, this voltage can be
accepted as a low level provided it is less than Vtn. So,

VOL = Vdd · Rc/(Rc + RL) < Vtn.

In practice, a margin is kept by requiring

VOL = Vdd · Rc/(Rc + RL) ≤ 0.2Vdd, or RL ≥ 4Rc. (4.4)
This imposes a restriction on the minimum value of load resistance for a successful
operation of the circuit as an inverter. The input–output characteristic of the inverter
is shown in Fig. 4.5b. The circuit operates along the load line as shown in Fig. 4.5b.
For Vin = 0 V, the output voltage Vout = Vdd (point A), and for Vin = Vdd, the output volt-
age Vout = VOL, as shown by point B. The transfer characteristic is shown in Fig. 4.5c,
which shows that the output is Vdd for Vin = 0 V, but for Vin = Vdd the output is not 0 V.
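The constraint of Eq. 4.4 can be illustrated with a short sketch; the resistor values are assumptions:

```python
def v_ol(vdd, r_c, r_l):
    """Eq. 4.3: the ON pull-down and the load resistor form a voltage divider."""
    return vdd * r_c / (r_c + r_l)

vdd = 5.0
r_c = 2e3        # assumed channel resistance of the ON pull-down transistor
r_l = 4 * r_c    # boundary case RL = 4*Rc of Eq. 4.4
vol = v_ol(vdd, r_c, r_l)
print(vol)       # 1.0 V, i.e., exactly 0.2*Vdd at the boundary
```

Any RL larger than 4Rc pushes VOL below 0.2Vdd, at the cost of a slower pull-up, which is the speed trade-off described in the following bullets.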
This implementation of this inverter has a number of disadvantages:
• As the charging of the output capacitor takes place through the load resistor RL
and discharge through Rc and their values must be different as per Eq. 4.4, there
is asymmetry in the ON-to-OFF and OFF-to-ON switching times.
• To have higher speeds of operation, the value of both Rc and RL should be re-
duced. However, this increases the power dissipation of the circuit. Moreover, as
we shall see later, to achieve a smaller value of Rc, the area of the MOS inverter
needs to be increased.
• The resistive load can be fabricated by two approaches—using a diffused resis-
tor approach or using an undoped poly-silicon approach. In the first case, an
n-type or a p-type isolated diffusion region can be fabricated to realize a resistor
between the power supply line and the drain of the nMOS transistor. To realize
a resistor of the order of a few kΩ, as required for proper operation of the circuit,
the length-to-width ratio must be large. To realize this large length-to-width ratio in
a small area, a serpentine form is used as shown in Fig. 4.6. However, this re-
quires a very large chip area. To overcome the limitation of this approach, the
second approach based on undoped poly-silicon can be used. Instead of using
doped poly-silicon, which is commonly used to realize the gate and interconnect
regions having lower resistivity, undoped poly-silicon is used here to get higher
resistivity. Although this approach leads to a very compact resistor compared
to the previous approach, the resistance value cannot be accurately controlled
leading to large process parameter variations. In view of the above discussion,
it is evident that this inverter configuration is not suitable for VLSI realization.
Better alternatives for the realization of the pull-up resistor are explored in the
following subsections.
To overcome the limitations mentioned above, MOS transistors can be used as pull-
up devices instead of using a passive resistor. There are three possible alternatives
for pull-up devices—an nMOS enhancement-mode transistor, a depletion-mode
Fig. 4.7 a nMOS inverter with depletion-mode transistor as pull-up device; b voltage–current
characteristic; c transfer characteristic. nMOS n-type metal–oxide–semiconductor
nMOS transistor, or a pMOS transistor. Any one of the transistors can be used as a
pull-up device. First, we consider the use of an nMOS depletion-mode transistor as
an active pull-up (pu) device as shown in Fig. 4.7a. As the output of an inverter is
commonly connected to the gate of one or more MOS transistors in the next stage,
there is no fan-out current, and the currents flowing through both the transistors
must be equal. The input voltage is applied to the gate of the pull-down (pd) transis-
tor, and the output is taken out from the drain of the pd device.
1. Pull-down device off and pull-up device in linear region: This corresponds to
point ‘A’ on the curve with the input voltage Vin < Vtn , Vout = Vdd and I ds = 0 .
In this situation, there is no current flow from the power supply and no current
flows through either of the transistors.
2. Pull-down device in saturation and pull-up device in linear region: This corre-
sponds to point B. Here,
Ipd = (Kn Wpd / 2Lpd)(Vin − Vtpd)² (4.5)

and

Ipu = (Kn Wpu / Lpu)[(−Vtpu)(Vdd − Vout) − (Vdd − Vout)²/2], (4.6)
where Vtpd and Vtpu are the threshold voltages of the enhancement- and depletion-
mode MOS transistors, respectively.
3. Pull-down and pull-up device, both in saturation: This is represented by point C
on the curve. In this situation,
Ipd = (Kn Wpd / 2Lpd)(Vin − Vtpd)²

and

Ipu = (Kn Wpu / 2Lpu) Vtpu². (4.7)
4. Pull-down device in linear region and pull-up device in saturation: This situa-
tion occurs when input voltage is equal to Vdd. Here,
Ipd = βpd (Vin − Vtpd − VOL/2) VOL

and

Ipu = (βpu/2) Vtpu², (4.8)

where βpd = Kn Wpd/Lpd and βpu = Kn Wpu/Lpu. Equating Ipd = Ipu with Vin = Vdd
and neglecting the VOL/2 term, we get

βpd (Vdd − Vtpd) VOL = (βpu/2) Vtpu², (4.9)

so that

VOL = (βpu/βpd) · Vtpu²/(2(Vdd − Vtpd)) = (1/2K) · Vtpu²/(Vdd − Vtpd), (4.10)

where

K = βpd/βpu = (W/L)pd/(W/L)pu. (4.11)
The quantity K is called the ratio of the inverter. For successful inverter operation,
the low output voltage, VOL, should be smaller than the threshold voltage of the
pull-down transistor of the next stage. From the above discussion, we can make the
following conclusion:
• The output is not ratioless, which leads to asymmetry in switching characteris-
tics.
• There is static power dissipation when the output logic level is low.
• It produces strong high output level, but weak low output level.
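Equation 4.10 makes the "ratioed" nature of this inverter concrete; a small sketch with assumed device parameters:

```python
def v_ol_depletion(vdd, vt_pd, vt_pu, k):
    """Eq. 4.10: VOL = Vtpu**2 / (2*K*(Vdd - Vtpd))."""
    return vt_pu ** 2 / (2 * k * (vdd - vt_pd))

vdd, vt_pd, vt_pu = 5.0, 1.0, -3.0   # assumed supply and threshold voltages
for k in (1, 4, 8):
    print(k, v_ol_depletion(vdd, vt_pd, vt_pu, k))
```

VOL falls as the ratio K = βpd/βpu grows, so the low logic level (and hence the noise margin) is set by transistor geometry.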
Fig. 4.8 a nMOS inverter with enhancement-mode nMOS transistor as pull-up device; b transfer
characteristic

Fig. 4.9 a Pseudo-nMOS inverter with pMOS transistor as pull-up device; b transfer characteristic.
nMOS n-type metal–oxide–semiconductor
With an enhancement-mode nMOS transistor as the pull-up device (Fig. 4.8a), the output can never
reach Vdd. The maximum output voltage that can be attained is ( Vdd − Vtn), where
Vtn is the threshold voltage of the enhancement-mode pull-up transistor. The output
voltage for Vin = Vdd is not 0 V, because in this case both the transistors are conduct-
ing and act as a voltage divider. The transfer characteristic is shown in Fig. 4.8b.
From the above discussion, we can make the following conclusion:
• The output is not ratioless, which leads to asymmetry in switching characteris-
tics.
• There is static power dissipation when the output level is low.
• It produces weak low and high output levels.
As a consequence, an nMOS enhancement-type transistor is not suitable as a pull-up
device for realizing an MOS inverter.
We can realize another type of inverter with a pMOS transistor as a pull-up device
with its gate permanently connected to the ground as shown in Fig. 4.9a. As it is
functionally similar to a depletion-type nMOS load, it is called a ‘pseudo-nMOS’
inverter. Unlike the CMOS inverter, discussed in Sect. 4.3.5, the pull-up transis-
tor always remains ON, and there is DC current flow when the pull-down device
is ON. The low-level output is also not zero and is dependent on the β n / β p ratio
like the depletion-type nMOS load. The voltage-transfer characteristic is shown in
Fig. 4.9b.
Fig. 4.10 a CMOS inverter; b voltage–current characteristic; and c transfer characteristic
In this case, a pMOS enhancement type transistor is used as a pull-up device. How-
ever, here the gates of both the pull-up and pull-down transistors are tied together
and used as input as shown in Fig. 4.10a. Output is taken from the drain of the pull-
down device as usual. In this case, when the input voltage Vin = 0 V, the gate
of the pull-up transistor is Vdd below its source voltage, i.e., Vgs = −Vdd, which
makes the pull-up transistor ON, and the pull-down transistor OFF. So, there is no
DC current flow between Vdd to ground. When the input voltage Vin = Vdd, the gate
input of the pull-up transistor is zero with respect to its source, which makes it OFF.
The pull-down transistor, however, is ON because the Vgspd = Vdd . In this situation
also, there is no DC current flow between Vdd and ground. However, as the gate
voltage is gradually increased from ‘0’ to ‘1’, the pull-up transistor switches from
ON to OFF and the pull-down transistor switches from OFF to ON. Around the
midpoint, both transistors are ON and DC current flows between Vdd and ground.
Detailed analysis can be made by dividing the entire region of operation into five
basic regions as follows:
Region 1: 0 ≤ Vin < Vtn The pull-down transistor is off and the pull-up transistor is in
the linear region, as shown by a representative point ‘A’ on the superimposed nMOS
and pMOS transistors’ drain-to-source voltage–current characteristic curves. In this
region, there is no DC current flow and the output voltage remains
close to Vdd. This corresponds to point ‘A’ in Fig. 4.10b.
Region 2: Vtn < Vin < Vinv Here, the pull-down transistor moves into a saturation
region and the pull-up transistor remains in the linear region as represented by point
B, when the input is VIL. The pull-down transistor acts as a current source and pull-
up transistor acts as a resistor. The current flow through the transistor increases as
Vin increases from Vtn to Vinv and attains a maximum value when Vin = Vinv.
Since the same current flows through both the devices, Idsp = −Idsn.
For the pMOS device, the drain current is given by

Idsp = −βp [(Vin − Vdd − Vtp)(VO − Vdd) − (VO − Vdd)²/2], (4.12)

where

βp = Kp Wp/Lp, Vgsp = Vin − Vdd and Vdsp = VO − Vdd.

For the nMOS device in saturation,

Idsn = βn (Vin − Vtn)²/2.

Equating the two currents and solving for the output voltage gives

VO = (Vin − Vtp) + √((Vin − Vtp)² − 2(Vin − Vdd/2 − Vtp)Vdd − (βn/βp)(Vin − Vtn)²). (4.13)

To find VIL, we substitute

Vgsn = Vin, Vgsp = −(Vdd − Vin) and Vdsp = −(Vdd − Vout),

so that

(βn/2)(Vin − Vtn)² = (βp/2)[2(Vin − Vdd − Vtp)(Vout − Vdd) − (Vout − Vdd)²]. (4.14)

Differentiating both sides with respect to Vin,

βn (Vin − Vtn) = βp [(Vin − Vdd − Vtp) dVout/dVin + (Vout − Vdd) − (Vout − Vdd) dVout/dVin]. (4.15)

Setting dVout/dVin = −1, for βn/βp = 1 and Vout ≅ Vdd, the value of VIL can be found to be

VIL = (3Vdd + 2Vtn)/8. (4.16)
Region 3: Vin = Vinv At this point, both the transistors are in saturation, as
represented by the point C on the superimposed characteristic curves. In this
region, both the transistors can be modeled as current sources.
Assuming

Vgspd = Vin and Vgspu = Vin − Vdd = Vinv − Vdd,

we may equate the saturation currents of the pull-up and pull-down transistors. The
two currents are

Idsn = (1/2) Kn (Wn/Ln)(Vinv − Vtn)²

and

Idsp = −(1/2) Kp (Wp/Lp)(Vinv − Vdd − Vtp)².

Equating, we get

(βn/2)(Vinv − Vtn)² = (βp/2)(Vinv − Vdd − Vtp)²

or

(Vinv − Vdd − Vtp)/(Vinv − Vtn) = −√(βn/βp)

or

Vinv (1 + √(βn/βp)) = Vdd + Vtp + √(βn/βp) Vtn

or

Vinv = (Vdd + Vtp + √(βn/βp) Vtn) / (1 + √(βn/βp)). (4.17)

For a symmetric inverter with Vinv = Vdd/2 and Vtn = −Vtp, Eq. 4.17 requires βn = βp.
In a CMOS process,

Kn/Kp = μn/μp ≈ 2.5,

so βn = βp requires

(W/L)p = 2.5 (W/L)n, (4.18)
which can be realized with pMOS channel 2.5 times wider than the nMOS channel
of the same length. The inverter acts as a symmetrical gate when this is realized.
Since both the transistors are in saturation, they act as current sources and at this
point the current attains a maximum value as shown in Fig. 4.10c. This leads to
power dissipation, known as short-circuit power dissipation. Later, we will derive a
detailed expression for short-circuit power dissipation.
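Equation 4.17 can be evaluated directly; a minimal sketch (the voltage values are assumptions):

```python
import math

def v_inv(vdd, vtn, vtp, beta_ratio):
    """Eq. 4.17 with beta_ratio = beta_n/beta_p; vtp is negative by convention."""
    r = math.sqrt(beta_ratio)
    return (vdd + vtp + r * vtn) / (1 + r)

# Symmetric design: beta_n = beta_p and Vtn = -Vtp gives Vinv = Vdd/2.
print(v_inv(vdd=5.0, vtn=1.0, vtp=-1.0, beta_ratio=1.0))   # 2.5
# Equal-sized devices (beta_n/beta_p = 2.5) pull Vinv below Vdd/2.
print(v_inv(vdd=5.0, vtn=1.0, vtp=-1.0, beta_ratio=2.5))
```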
Region 4: Vinv < Vin ≤ Vdd − Vtp As the input voltage has been increased above
Vinv, the nMOS transistor moves from the saturation region to the linear region,
whereas the pMOS transistor remains in saturation. With the increase in input volt-
age beyond Vinv, the output voltage and also the drain current continue to drop. A
representative point in this region is point D. In this region, nMOS transistor acts as
a resistor and pMOS transistor acts as a current source.
The drain currents of the two transistors are given by

Idsn = βn [(Vin − Vtn)VO − VO²/2] and Idsp = −(βp/2)(Vin − Vdd − Vtp)².

As Idsn = −Idsp, we get

βn [(Vin − Vtn)VO − VO²/2] = (βp/2)(Vin − Vdd − Vtp)²

or

VO²/2 − (Vin − Vtn)VO + (βp/2βn)(Vin − Vdd − Vtp)² = 0.

Solving for VO and taking the root that satisfies the boundary conditions,

VO = (Vin − Vtn) − √((Vin − Vtn)² − (βp/βn)(Vin − Vdd − Vtp)²). (4.19)
Similarly, we can find the value of VIH when the nMOS transistor operates in
the linear region and the pMOS transistor operates in saturation. Equating
Idsn = −Idsp at this point, we get

(βn/2)[2(Vgsn − Vtn)Vdsn − Vdsn²] = (βp/2)(Vgsp − Vtp)². (4.20)

Differentiating both sides with respect to Vin and setting dVout/dVin = −1,

βn [(Vin − Vtn) dVout/dVin + Vout − Vout dVout/dVin] = βp (Vin − Vdd − Vtp), (4.21)

or

VIH = (Vdd + Vtp + (βn/βp)(2Vout + Vtn)) / (1 + βn/βp). (4.22)

This equation can be solved together with the Kirchhoff’s current law (KCL)
equation (4.20) to obtain numerical values of VIH and Vout. For βn/βp = 1, the value
of VIH = (5Vdd − 2Vtn)/8.
For a symmetric inverter, VIH + VIL = Vdd.
Noise Margin: The noise margins of the CMOS inverter follow from the values of
VOL, VOH, VIL, and VIH obtained above, using Eqs. 4.1 and 4.2. The transfer
characteristic, and hence Vinv, shifts with the βn/βp ratio. The threshold voltages
decrease with increase in temperature, leading to some shrinkage of region I and
expansion of region V.
Table 4.1 summarizes the performance of the five different types of inverters dis-
cussed in this section. As given in column 2, low output level is weak, i.e., the out-
put does not go down to 0 V, for all the inverters except CMOS. High-output level is
weak only for the inverter with the enhancement-mode nMOS transistor as pull-up
devices. For all other inverters, full high-level ( Vdd) is attained as shown in column
3. As shown in column 4, CMOS gives a good noise margin both for high- and low-
logic levels. The inverters with nMOS enhancement-mode transistor as pull up have
poor noise margin both for high- and low-logic levels. For all other types of invert-
ers, noise margin is poor for low levels. So far as power consumption is concerned,
all inverters except CMOS draw DC current from the supply when the input is high.
As a consequence, only CMOS consumes lower power compared to other types of
inverters. From this table, we can conclude that CMOS is superior in all respects
compared to other types of logic circuits because of these advantages. As a conse-
quence, CMOS has emerged as the dominant technology for the present-day VLSI
circuits. In case of nMOS depletion mode of transistor as a pull-up device, low-level
noise margin is poor, and in case of nMOS enhancement-mode transistor, the noise
margin is poor for both low and high levels. It may be noted that CMOS circuits
provide better noise margin compared to other two cases as given in Table 4.1.
4.4 Inverter Ratio in Different Situations
In a practical circuit, an inverter will drive other circuits and, in turn, will be driven
by other circuits. Different inverter ratios will be necessary for correct and satis-
factory operation of the inverters. In this section, we consider two situations, an
inverter driven by another inverter and an inverter driven through one or more pass
transistors, and find out the inverter ratio for these two cases.
Consider first an inverter directly driven by another inverter. The saturation
current of an MOS transistor is given by

Ids = K (W/L)(Vgs − Vt)²/2. (4.23)

For the depletion-mode pull-up transistor, whose gate is tied to its source so that
Vgs = 0,

Idspu = K (Wpu/Lpu)(−Vtdp)²/2, (4.24)

where Vtdp is the threshold voltage of the depletion-mode transistor.
For the enhancement-mode pull-down transistor at Vin = Vinv, equating the two
saturation currents gives

(Wpd/Lpd)(Vinv − Vtn)² = (Wpu/Lpu)(−Vtdp)². (4.26)

Assuming Zpd = Lpd/Wpd and Zpu = Lpu/Wpu, where Z is known as the aspect
ratio of the MOS devices, we have
(1/Zpd)(Vinv − Vtn)² = (1/Zpu)(−Vtdp)²

or

Vinv − Vtn = −Vtdp / √(Zpu/Zpd), (4.27)

or

Vinv = Vtn − Vtdp / √(Zpu/Zpd).

Substituting typical values Vtn = 0.2Vdd, Vtdp = −0.6Vdd and Vinv = 0.5Vdd,
we get

Zpu/Zpd = 4/1. (4.28)
This ratio Z pu / Z pd is known as the inverter ratio ( Rinv) of the inverter, and it is 4:1
for an inverter directly driven by another inverter.
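Rearranging Eq. 4.27 gives Zpu/Zpd = (Vtdp/(Vinv − Vtn))², which reproduces the 4:1 result of Eq. 4.28; a quick sketch:

```python
def inverter_ratio(vtn, vtdp, vinv):
    """Zpu/Zpd needed for a chosen switching threshold (rearranged Eq. 4.27)."""
    return (vtdp / (vinv - vtn)) ** 2

vdd = 5.0
# Typical values from the text: Vtn = 0.2*Vdd, Vtdp = -0.6*Vdd, Vinv = 0.5*Vdd
print(inverter_ratio(vtn=0.2 * vdd, vtdp=-0.6 * vdd, vinv=0.5 * vdd))   # 4.0
```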
Here, an nMOS inverter is driven through one or more pass transistors as shown in
Fig. 4.13. As we have seen in Sect. 4.1, a pass transistor passes a weak high level.
If Vdd is applied to the input of a pass transistor, at the output, we get ( Vdd–Vtp),
where Vtp is the threshold voltage of the pass transistor. Therefore, instead of Vdd, a
degraded high level ( Vdd − Vtp) is applied to the second inverter. We have to ensure
that the same voltage levels are produced at the outputs of the two inverters in spite
of different input voltage levels.
First, let us consider inverter 1 with input voltage Vdd. In this situation, the pull-
down transistor is in active mode and behaves like a resistor, whereas the pull-up
transistor is in saturation, and it represents a current source.
For the pull-down transistor,

Ids = K (Wpd1/Lpd1)[(Vdd − Vtn)Vds1 − Vds1²/2]. (4.29)
Therefore,

R1 = Vds1/Ids1 = (1/K)(Lpd1/Wpd1) · 1/(Vdd − Vtn − Vds1/2). (4.30)

Ignoring the Vds1/2 factor,

R1 ≈ Zpd1 · 1/(K(Vdd − Vtn)). (4.31)

The pull-up current is

I1 = Ids = K (Wpu1/Lpu1)(−Vtdp)²/2. (4.32)

The product

Vout1 = I1R1 = (Zpd1/Zpu1) · (−Vtdp)²/(2(Vdd − Vtn)). (4.33)
In a similar manner, we can consider the case of inverter 2, where the input voltage
is ( Vdd − Vtp). As in the case of inverter 1, we get

R2 ≈ Zpd2 · 1/(K(Vdd − Vtp − Vtn)) and I2 = K (1/Zpu2)(−Vtdp)²/2. (4.34)

Therefore,

Vout2 = I2R2 = (Zpd2/Zpu2) · (−Vtdp)²/(2(Vdd − Vtp − Vtn)). (4.35)
If we impose the condition that both the inverters will have the same output,
Vout1 = Vout2, then

(Zpd1/Zpu1) · 1/(Vdd − Vtn) = (Zpd2/Zpu2) · 1/(Vdd − Vtp − Vtn). (4.36)

Substituting typical values Vtn = 0.2Vdd and Vtp = 0.3Vdd, we get

Zpu2/Zpd2 = (4.0/2.5) · (Zpu1/Zpd1) or Zpu2/Zpd2 ≈ 2 · (Zpu1/Zpd1) = 8/1. (4.37)
Fig. 4.14 a Parasitic capacitances of a CMOS inverter; b lumped load capacitance at the output.
CMOS complementary metal–oxide–semiconductor
It may be noted that, if there is more than one pass transistor before the second
inverter, the degradation in the high voltage level is the same. Therefore, we may
conclude that, if an inverter is driven through one or more pass transistors, it should
have an inverter ratio Zpu/Zpd ≥ 8/1.
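The 8:1 figure follows from scaling the 4:1 directly driven ratio by (Vdd − Vtn)/(Vdd − Vtp − Vtn), as in Eq. 4.37; a small sketch using the same typical values:

```python
def ratio_through_pass(vdd, vtn, vtp, direct_ratio=4.0):
    """Inverter ratio when the input arrives through pass transistors."""
    return direct_ratio * (vdd - vtn) / (vdd - vtp - vtn)

vdd = 5.0
r = ratio_through_pass(vdd, vtn=0.2 * vdd, vtp=0.3 * vdd)
print(r)   # 6.4, rounded up in practice to the 8:1 rule of thumb
```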
4.5 Switching Characteristics
Fig. 4.16 a CMOS inverter; b delay-time timings; c fall-time model; d rise-time model; e Rise
time and fall times. CMOS complementary metal–oxide–semiconductor
We assume that an ideal step waveform, with zero rise and fall times, is applied to
the input as shown in Fig. 4.16b. The delay td is the time difference between the
midpoint of the input swing and the midpoint of the swing of the output signal. The
load capacitance shown at the output of the inverter represents the total of the input
capacitance of driven gates, the parasitic capacitance at the output of the gate itself,
and the wiring capacitance. In the following section, we discuss the estimation of
the load capacitances.
4.5.1 Delay-Time Estimation
The circuit for high-to-low propagation delay time tPHL estimation can be modeled
as shown in Fig. 4.16c. It is assumed that the pull-down nMOS transistor remains
in saturation region and the pMOS transistor remains off during the entire discharge
period. The saturation current for the nMOS transistor is given by
Idsn = (βn/2)(Vgs − Vtn)². (4.38)
The capacitor CL discharges from voltage Vdd to Vdd/2 during the interval t0 to t1:

Vdd/2 = Vdd − (βn/2CL)(Vgs − Vtn)² (t1 − t0). (4.39)

Solving for tphl = t1 − t0 with Vgs = Vdd,

tphl = CL / (βn Vdd (1 − Vtn/Vdd)²). (4.40)

When the input goes from high ( Vdd) to low, initially the output is at a low level.
The pull-up pMOS transistor operates in the saturation region. In a similar manner,
based on the model of Fig. 4.16d, the rise-time delay is given by

tplh = CL / (βp Vdd (1 − Vtp/Vdd)²). (4.41)
For equally sized MOS transistors, the rise delay is greater than the fall delay be-
cause of the lower mobility of p-type carriers. So, the pMOS transistor should be sized
up by increasing the width Wp in order to get equal rise and fall delays.
Delay Time By taking the average of the rise and fall delay times, we get the delay time

td = (1/2)(tphl + tplh) = (CL/2Vdd) [1/(βn (1 − Vtn/Vdd)²) + 1/(βp (1 − Vtp/Vdd)²)]. (4.42)

Assuming Vtn = |Vtp| = Vt and substituting βn = KnWn/Ln and βp = KpWp/Lp,

td = (Ln/(KnWn) + Lp/(KpWp)) · CL / (2Vdd (1 − Vt/Vdd)²). (4.43)
This expression gives us a simple analytical expression for the delay time. It is
observed that the delay is linearly proportional to the total load capacitance CL. The
delay also increases as the supply voltage is scaled down, and it increases drasti-
cally as it approaches the threshold voltage. To overcome this problem, the thresh-
old voltage is also scaled down along with the supply voltage. This is known as
the constant field scaling. Another alternative is to use constant voltage scaling, in
which the supply voltage is kept unchanged because it may not be always possible
to reduce the supply voltage to maintain electrical compatibility with other sub-
systems used to realize a complete system. The designer can control the following
parameters to optimize the speed of CMOS gates.
• The width of the MOS transistors can be increased to reduce the delay. This is
known as gate sizing, which will be discussed later in more detail.
• The load capacitance can be reduced to reduce delay. This is achieved by using
transistors of smaller and smaller dimensions as provided by future-generation
devices.
• Delay can also be reduced by increasing the supply voltage Vdd and/or reducing
the threshold voltage Vt of the transistors.
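Equation 4.43 can be evaluated directly to see the trends mentioned above; a sketch with assumed device and load parameters (not from a specific process):

```python
def delay_time(cl, vdd, vt, kn, wn, ln, kp, wp, lp):
    """Average inverter delay from Eq. 4.43, assuming Vtn = |Vtp| = vt."""
    device_term = ln / (kn * wn) + lp / (kp * wp)
    return device_term * cl / (2 * vdd * (1 - vt / vdd) ** 2)

base = dict(cl=50e-15, vdd=5.0, vt=1.0,
            kn=50e-6, wn=2e-6, ln=1e-6,
            kp=20e-6, wp=5e-6, lp=1e-6)
td = delay_time(**base)
print(td)   # on the order of 0.1 ns for these assumed values
# Delay is linear in the load capacitance:
print(delay_time(**{**base, "cl": 100e-15}) / td)   # 2.0
```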
4.5.2 Ring Oscillator
For an n-stage (where n is an odd number) ring oscillator, the time period T = 2nτd.
Therefore, the frequency of oscillation f = 1/(2nτd), or τd = 1/(2nf). It may be noted
that the delay time can be obtained by measuring the frequency of the ring oscil-
lator. For better accuracy of the measurement of frequency, a suitable value of n
(say 151) can be chosen. This can be used for the characterization of a fabrication
process or a particular design. The ring oscillator can also be used for on-chip clock
generation. However, it does not provide a stable or accurate clock frequency due to
dependence on temperature and other parameters. To generate stable and accurate
clock frequency, an off-chip crystal is used to realize a crystal oscillator.
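The relation τd = 1/(2nf) converts a measured ring-oscillator frequency into a per-stage delay; a minimal sketch with assumed numbers:

```python
def stage_delay(n_stages, freq_hz):
    """Per-stage delay of an n-stage (n odd) ring oscillator: 1/(2*n*f)."""
    return 1.0 / (2 * n_stages * freq_hz)

# A 151-stage ring measured at 10 MHz:
td = stage_delay(151, 10e6)
print(td)   # roughly 0.33 ns per stage
```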
4.6 Delay Parameters
4.6.1 Resistance Estimation
The resistance of a uniform slab of conducting material of length L, width W, and
thickness t between its two end faces A and B is

RAB = ρL/(tW) = ρL/A Ω, (4.44)

where A is the cross-sectional area. Consider the case in which L = W; then

RAB = ρ/t = Rs Ω per square, (4.45)

where Rs is the sheet resistance.
In the nonsaturated region, the channel current of an MOS transistor is

Ids = β[(Vgs − Vt)Vds − Vds²/2]. (4.46)

For small Vds, this reduces to Ids = β(Vgs − Vt)Vds, so the channel resistance is

Rc = Vds/Ids = 1/(β(Vgs − Vt)) = (L/W) · 1/(μCg(Vgs − Vt)) = K(L/W), (4.47)

where K = 1/(μCg(Vgs − Vt)).
K may take a value between 1000 and 3000 Ω/square. Since the mobility and the threshold
voltage are functions of temperature, the channel resistance will vary with tempera-
ture. Typical sheet resistances for different conductors are given in Table 4.2.
Sheet resistance concept can be conveniently applied to MOS transistors and
inverters. A diffusion and a poly-silicon layer are laid orthogonal to each other with
overlapping areas to represent a transistor. The thinox mask layout is the hatched
overlapped region. The poly-silicon and the thinox layer prevent diffusion to occur
below these layers. Diffusion takes place in the areas where these two layers do not
overlap. Assuming a channel length of 2λ and a channel width of 2λ, this channel
represents one square, so its resistance is R = Z · Rs with Z = L/W = 1; the pull-up
of a 4:1 inverter, being four squares long, has Z = L/W = 4/1.
4.6.2 Capacitance Estimation
From the structure of MOS transistors, it is apparent that each transistor can be rep-
resented as a parallel-plate capacitor. The area capacitance can be calculated based
on the dielectric thickness (silicon dioxide) between the conducting layers.
Here, the area capacitance is

C = ε0 εins A / D farads, (4.48)

where D is the thickness of the silicon dioxide, A is the area of the plates, εins = 3.9
is the relative permittivity of SiO2, and ε0 = 8.85 × 10⁻¹⁴ F/cm is the permittivity of
free space. For 5-µm MOS technology, typical area capacitances are given in Table 4.3.
Example 2.2
Let us consider a rectangular sheet of length L and width W. Assuming L = 20 and
W = 3, the relative area and capacitances can be calculated as follows:
Relative area = (20 × 3)/(2 × 3) = 10
Let us consider the situation of one standard gate capacitance being charged through
one square of channel resistance.
Time constant τ = (1 square of channel resistance Rs) × (1 standard gate capacitance Cg).

For 5-µm technology,

τ = 10⁴ Ω × 0.01 pF = 0.1 ns.

When circuit wiring and parasitic capacitances are taken into account, this figure
increases by a factor of 2–3. So, in practice, τ = 0.2–0.3 ns.
The value of τ obtained from transit-time calculations is

τ = L²/(μn Vds).
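The one-square/one-gate time constant quoted above works out directly; a sketch using the 5-µm figures from the text:

```python
r_s = 1e4         # channel sheet resistance, 10 kOhm per square
c_g = 0.01e-12    # standard gate capacitance, 0.01 pF
tau = r_s * c_g   # one square of resistance charging one gate capacitance
print(tau)        # 1e-10 s, i.e., 0.1 ns
print(2 * tau, 3 * tau)   # 0.2-0.3 ns once wiring parasitics are included
```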
4.7 Driving Large Capacitive Loads
There are situations when a large capacitive load, such as a long bus, an off-chip
capacitive load, or an I/O buffer, is to be driven by a gate [1]. In such cases, the delay
can be very high if the load is driven by a standard gate. A super buffer or a BiCMOS
inverter, or a cascade of gates of increasing size, can be used to achieve smaller delays
as discussed in the following subsections.
4.7.1 Super Buffers
We have seen that one important drawback of the basic nMOS inverters (because
of ratioed logic) in driving capacitive load is asymmetric drive capability of pull-up
and pull-down devices. This is because of longer channel length (four times) of the
pull-up device. Moreover, when the pull-down transistor is ON, the pull-up transis-
tor also remains ON. As a consequence, the complete pull-down current is not used to discharge the load capacitance; part of it is neutralized by the current passing through the pull-up transistor. This asymmetry can be overcome in specially
designed circuits known as super buffers. It is possible to realize both inverting and
noninverting super buffers as shown in Fig. 4.21. In standard nMOS inverter, the
gate of the depletion-mode pull-up device is tied to the source. Instead, the gate of
the pull-up device of the super buffer is driven by another inverter with about twice
the gate drive. Thus, the pull-up device is capable of sourcing about four times the
current of the standard nMOS inverter. This is the key idea behind both the inverting
and noninverting types of super buffers. This not only overcomes the asymmetry
but also enhances the drive capability.
Schematic diagrams of nMOS super buffers, both inverting and noninverting types, are shown in Fig. 4.21. As shown, the output stage is a push–pull stage. The
gate of the pull-up device is driven by a signal of opposite level of the pull-down de-
vice, generated using a standard inverter. For the inverting-type super buffer, when
the input voltage is low, the gates of both the pull-down transistors are low, the gates
of both the pull-up devices are high, and the output is high. When the input voltage
is high, the gates of both pull-down devices switch to high, the gates of both pull-up
devices switch to low, and the output switches from high to low. The small-channel
resistance of the pull-down device discharges the output load capacitor quickly.
When the input voltage goes from high to low, the gate of the pull-up device quickly
switches to the high level. This drives the pull-up stage very quickly, and the output
switches quickly from low to high. It can be shown that the high-to-low and low-to-
high switching times are comparable.
In order to compare the switching speed of a standard inverter and a super buffer,
let us compare the current sourcing capability in both the situations.
For a standard inverter, the drain currents of the pull-up device in the saturation (0 < V0 < 2 V) and linear regions are given as follows:
Ids(sat) = (βpu/2)( Vgs − Vtdep)², where Vtdep = −3 V
= (βpu/2)(0 + 3)² = 4.5 βpu. (4.49)

Ids(lin) = βpu[( Vgs − Vtdep) Vds − Vds²/2]
= βpu[(0 + 3)(2.5) − (2.5)²/2] = 4.38 βpu. (4.50)
The average of the saturation current for Vds = 5 V and linear current for Vds = 2.5 V
is approximately 4.4 βpu.
For the super buffer, the pull-up device is always in linear region because
Vgs = Vds. The average pull-up current can be assumed to be an average of the linear
currents at Vds = 5 V and Vds = 2.5 V.
Ids(5 V) = (βpu/2)[2(5 + 3)5 − 5²] = 27.5 βpu,

Ids(2.5 V) = (βpu/2)[2(2.5 + 3)2.5 − (2.5)²] = 10.62 βpu.

The average pull-up current is therefore (27.5 + 10.62) βpu/2 = 19.06 βpu.
Therefore, the current drive of the super buffer using the totem-pole output configuration is 19.06/4.44 ≈ 4.3 times that of the standard inverter. This high drive also alleviates the asymmetry in driving capacitive loads and makes the nMOS inverter
behave like a ratioless circuit.
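The drive comparison above can be reproduced numerically. The sketch below keeps all currents in units of βpu and uses the voltages quoted in the text.

```python
# Standard nMOS inverter pull-up vs. super-buffer pull-up drive current,
# in units of beta_pu (Eqs. 4.49-4.50). Vtdep = -3 V depletion pull-up.
V_TDEP = -3.0

def i_sat(vgs):
    """Saturation current / beta_pu: (Vgs - Vt)^2 / 2."""
    return (vgs - V_TDEP) ** 2 / 2

def i_lin(vgs, vds):
    """Linear-region current / beta_pu: (Vgs - Vt)*Vds - Vds^2/2."""
    return (vgs - V_TDEP) * vds - vds ** 2 / 2

# Standard inverter: pull-up gate tied to source, so Vgs = 0
std = (i_sat(0.0) + i_lin(0.0, 2.5)) / 2          # ~4.44 beta_pu
# Super buffer: gate driven to Vdd, always linear, Vgs = Vds
sb = (i_lin(5.0, 5.0) + i_lin(2.5, 2.5)) / 2      # ~19.06 beta_pu
print(std, sb, sb / std)                          # drive ratio ~4.3
```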
Fig. 4.22 a A conventional BiCMOS inverter; b output characteristics of static CMOS and BiCMOS inverters. CMOS complementary metal–oxide–semiconductor
4.7.2 BiCMOS Inverters
Higher current drive capability of bipolar NPN transistors is used in realizing BiC-
MOS inverters. Out of the several possible configurations, the conventional BiC-
MOS inverter, shown in Fig. 4.22, is considered here. The operation of the inverter
is quite straightforward. The circuit requires four MOS transistors and two bipolar
NPN transistors Q1 and Q2. When the input Vin is low, the pMOS transistor P1 is
ON, which drives the base of the bipolar transistor Q1 to make it ON. The nMOS transistors N1 and N2 are OFF, but transistor N3 is ON, which shunts the base of Q2 to
turn it OFF. The current through Q1 charges capacitor CL and at the output Vout, we
get a voltage ( Vdd−Vbe), where Vbe is the base–emitter voltage drop of the transistor
Q1. If the input is high ( Vdd), transistors P1 and N3 are OFF and N1 and N2 are ON.
The drain current of N2 drives the base of the transistor Q2, which turns ON and
the transistor N1 shunts the base of Q1 to turn it OFF. The capacitor CL discharges
through Q2. The conventional BiCMOS gate gives high-current-drive capability,
zero static power dissipation, and high input impedance. The DC characteristic of the conventional BiCMOS inverter is shown in Fig. 4.22b. For a zero input voltage,
the pMOS transistor operates in the linear region. This drives the NPN transistor
Q1, and the output we get is Vdd − Vbe, where Vbe is the base–emitter voltage drop
of Q1. As input voltage increases, the subthreshold leakage current of the transistor
N3 increases leading to a drop in the output voltage. For Vin = Vinv ( Vdd/2), both P1
and N2 transistors operate in the saturation region driving both the NPN transistors
(Q1 and Q2) ON. In this region, the gain of the inverter is very high, which leads
to a sharp fall in the output voltage, as shown in Fig. 4.22b. As the input voltage is
further increased, the output voltage drops to zero. The output voltage characteris-
tics of CMOS and BiCMOS are compared in Fig. 4.22b. It may be noted that the
BiCMOS inverter does not provide strong high or strong low outputs. High output
is Vdd−VCE1, where VCE1 is the saturation voltage across Q1 and the low-level output
is VCE2, which is the saturation voltage across Q2.
Fig. 4.23 Delay (in ps) of CMOS and BiCMOS inverters for different fan-outs ( Vdd = 5 V). CMOS complementary metal–oxide–semiconductor
The delays of CMOS and BiCMOS inverters are compared in Fig. 4.23 for dif-
ferent fan-outs. Here, it is assumed that for CMOS, Wp = 15 µm and Wn = 7 µm and
for the BiCMOS, Wp = Wn = 10 µm and WQ1 = WQ2 = 2 µm. It may be noted that for a fan-out of 1 or 2, CMOS provides smaller delay than BiCMOS, because of the longer delay through the two stages of the BiCMOS inverter. However, as fan-out increases further, BiCMOS
performs better. So, BiCMOS inverters should be used only for larger fan-out.
4.7.3 Buffer Sizing
It may be observed that an MOS transistor of unit length (2λ) has a gate capacitance proportional to its width ( W), which may be a multiple of λ. With the increase of the width, the current-driving capability is increased. But this, in turn, also increases the gate capacitance. As a consequence, the delay in driving a load capacitance CL by a transistor of gate capacitance Cg is given by the relationship ( CL/Cg)τ, where τ is the unit delay, i.e., the delay in driving an inverter by another of the same size.
Let us now consider a situation in which a large capacitive load, such as an out-
put pad, is to be driven by an MOS gate. The typical value of such load capacitance
is about 100 pF, which is several orders of magnitude higher than Cg. If such a load
is driven by an MOS gate of minimum dimension (2λ × 2λ), then the delay will be 10³τ. To reduce this delay, if the driver transistor is made wider, say 10³ × 2λ, the delay of this stage becomes τ, but the delay in driving this driver stage is 1000τ; so, the total delay is 1001τ, which is more than in the previous case. It has been observed
that the overall delay can be minimized by using a cascaded stage of inverters of
increasing size as shown in Fig. 4.24. For example, considering each succeeding
stage is ten times bigger than that of the preceding stage, each stage gives a delay
of 10τ. This results in a total delay of 31τ instead of 1001τ. Now, the question arises
about the relative dimension of two consecutive stages. If relative dimension is
large, fewer stages are needed, and if relative dimension is small, a large number
of driver stages is needed. Let us assume that there are n stages and each buffer is
scaled up by a factor of f, which gives stage delay of fτ. For n stages, the delay is
nfτ. Let y be the ratio of the load capacitance to one gate capacitance. Then, for n
buffer stages in cascade
Fig. 4.24 a Using a single driver with W to L ratio of 1000:1; b using drivers of increasing size
with stage ratio of 10. W width; L length
y = CL/Cg = ( Cg2/Cg1)( Cg3/Cg2) … ( CL/Cgn) = f^n, (4.51)

or ln y = n ln f, or n = ln y/ln f.

The total delay is then

nfτ = (ln y/ln f) fτ = ( f/ln f) · τ · ln y. (4.52)
The delay is proportional to ln( y), which is dependent on the capacitive load CL and
for a given load ln( y) is constant. To find out the value of f at which the delay is
minimum, we can take the derivative of f/ln( f) with respect to f and equate it to zero.
d/df ( f/ln f) = (ln f − 1)/(ln f)² = 0, (4.53)

which gives ln f = 1, i.e., f = e, and the minimum total delay is

tmin = nfτ = eτ ln( CL/Cg). (4.54)
Therefore, the total delay is minimized when each stage is larger than the previous
one by a factor of e. For 4:1 nMOS inverters, the delay per pair of stages is 5τ, and the overall delay is ( n/2)e · 5τ = 2.5enτ ( n even).
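The sizing argument above can be checked numerically. The sketch below evaluates the total delay nfτ (in units of τ) for a few stage ratios, driving a load 1000 times one gate capacitance as in the text.

```python
import math

# Total delay, in units of tau, of a chain of inverters each scaled by f,
# driving a load y = CL/Cg times one gate capacitance (Eqs. 4.51-4.52).
def chain_delay(y, f):
    """n * f with n = ln(y)/ln(f); delay in units of tau."""
    n = math.log(y) / math.log(f)
    return n * f

Y = 1000.0                      # load is 1000x one gate capacitance
for f in (2.0, math.e, 4.0, 10.0):
    print(f, chain_delay(Y, f))
# The minimum lies at f = e, where the delay is e * ln(y) ~= 18.8 tau,
# versus 30 tau for a stage ratio of 10.
```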
4.8 Chapter Summary
4.9 Review Questions
Q4.1. Compare the performance of nMOS inverters with the following types of
pull-up MOS transistors: (a) nMOS depletion type, (b) nMOS enhancement type,
and (c) pMOS enhancement type.
Q4.2. Why are nMOS circuits called ‘ratioed’ logic? How does it affect the area
and delay as the fan-in is increased from 2 to 4?
Q4.3. What is the inversion voltage? Compare the inversion voltage of a CMOS inverter with that of a four-input NAND gate. Assume that Vdd = 5 V, Vtn = |Vtp| = 1 V.
Q4.4. Derive the expression for delay time of a CMOS inverter. For a fixed
threshold voltage of the transistors, how does the delay vary with the supply
voltage?
Q4.5. Design a scaled chain of inverters such that the delay time between the logic gate ( Cg = 100 fF) and a load capacitance of 2 pF is minimized. Find out the number of stages and the stage ratio.
Q4.6. Explain the latch-up problem of CMOS gates. Suggest a technique to over-
come the problem.
Q4.7. Give the schematic diagram of a BiCMOS inverter. Explain its operation.
Q4.8. What is a noise margin? Compare the noise margin of a CMOS inverter and
an nMOS inverter with enhancement transistor as a pull-up device. Assume that
Vdd = 5 V, Vtn = |Vtp| = 1 V, and Vinv = Vdd/2 in both cases.
Q4.9. Calculate the delay of a CMOS inverter and identify the parameters, which
can be used to control the delay of the inverter.
Q4.10. For a CMOS inverter, establish the condition for Vinv = Vdd/2.
Q4.11. What is a ring oscillator? How can it be used to characterize the delay of a
new process technology?
Q4.12. Justify the statement ‘NAND gates are generally a better choice than NOR
gates for realizing CMOS circuits’.
Q4.13. Why are the nMOS gates called ‘ratioed’ logic? How does it affect the area
and delay as the fan-in is increased from 2 to 4?
Q4.14. Design a scaled chain of inverters such that the delay time between the logic gate ( Cg = 150 fF) and a load capacitance of 1.2 pF is minimized. Find out the number of stages and the stage ratio.
Q4.15. Prove that the delay of a series of pass transistors can be reduced from qua-
dratic dependence to linear dependence on the number of transistors in series by
inserting buffers at suitable intervals.
Q4.16. What is inversion voltage? How does it change as the fan-in of a CMOS
NAND gate increases?
Q4.17. What is super buffer? In what way does it differ from a standard inverter?
Q4.18. Give a plot of the variation of the power dissipation and delay of a CMOS
inverter as the supply voltage is gradually reduced from 5 to 1 V.
Q4.19. What is a noise margin? Compare the noise margin of CMOS inverter with
that of nMOS inverters with enhancement- and depletion-type nMOS transistors as
the pull-up devices.
Q4.20. Explain how a scaled chain of inverters can be used to drive a large capacitive load with minimum delay. Find out the optimum stage ratio. Calculate the
number of stages and the stage ratio that you will require to drive a capacitive load
of 50 pF with minimum delay, assuming that the gate capacitance of a minimum-
sized CMOS inverter is 25 fF.
References
5.1 Introduction
5.2 Pass-Transistor Logic
As pass transistors functionally behave like switches, logic functions can be re-
alized using pass transistors in a manner similar to relay contact networks. Re-
lay contact networks have been presented in the classical text of Caldwell (1958).
However, a closer look will reveal that there are some basic differences between
relay circuits and pass-transistor networks. Some of the important differences are
discussed in this section [1]. In our discussion, we assume that the pass transistors
are nMOS devices.
As we have seen in Sect. 3.6, a high-level signal gets degraded as it is steered
through an nMOS pass transistor. For example, if both drain and gate are at some
high voltage level, the source will rise to the lower of the two potentials: the drain voltage and ( Vg − Vt). If both drain and gate are at Vdd, the source voltage cannot exceed ( Vdd − Vt).
We have already seen that when a high-level signal is steered through a pass tran-
sistor and applied to an inverter gate, the inverter has to be designed with a high
inverter ratio (8:1) such that the gate drive (3.5 V assuming the threshold voltage
equal to 1.5 V) is sufficient to drive the inverter output to an acceptable
low level. This effect becomes more prominent when a pass-transistor output is
allowed to drive the gate of another pass-transistor stage as shown in Fig. 5.1. As
the gate voltage is 3.5 V, the high-level output from the second pass transistor can
never exceed 2.0 V, even when the drain voltage is 5 V as shown in Fig. 5.1. In
such a situation, the gate potential will not be sufficient to drive the output
low of an inverter with an aspect ratio of 8:1. Therefore, while synthesizing nMOS
pass-transistor logic, one must not drive the gate of a pass transistor by the output
of another pass transistor. This problem does not exist either in relay logic or in case
of transmission gate.
Fig. 5.2 a Relay logic to realize f = a + b′c. b Pass-transistor network corresponding to relay
logic. c Proper pass-transistor network for f = a + b′c
In the synthesis of relay logic, the network is realized such that high-level signal
reaches the output for which the function is “1.” Absence of the high voltage level
is considered to be “0.” In pass-transistor logic, however, this does not hold good.
For example, to realize the function f = a + b′ c the relay logic network is shown
in Fig. 5.2a. The pass-transistor realization based on the same approach is shown in
Fig. 5.2b. It can be easily proved that the circuit shown in Fig. 5.2b does not realize
the function f = a + b′c. Let us consider an input combination (say 100) for which the function is “1”: the output will rise to the high level by charging the output load
capacitance. Now, if we apply an input combination 010, for which the function is
“0,” then the output remains at “1” logic level because the output capacitance does
not get a discharge path when input combination of 010 is applied. A correct nMOS
pass-transistor realization of f = a + b′c is shown in Fig. 5.2c. The difference be-
tween the circuit of Fig. 5.2c and that of Fig. 5.2b is that the former provides a discharge path for the output capacitor when the input combination corresponds to a low output level. Therefore, in synthesizing pass-transistor logic, it is essential to provide
both charging and discharging path for the load capacitance. In other words, charg-
ing path has to be provided for all input combinations requiring a high output level
and discharge path has to be provided for all input combinations requiring a low
output level.
Finally, care should be taken that an output point is never driven, advertently or inadvertently, by signals of opposite polarity. In such a situation, each transistor acts as a resistor, and a voltage of about half the high level is generated at the
output. Presence of this type of path is known as sneak path. In such a situation, an
undefined voltage level is generated at a particular node. This should be avoided.
Pass transistors can be used to realize multiplexers of different sizes. For example,
2-to-1 and 4-to-1 multiplexer realization using pass transistor is shown in Fig. 5.3a,
b, respectively. Now, any two-variable function can be expanded around its variables using Shannon’s expansion theorem. The expanded function can be represented by the following expression:

f( a, b) = g0 a′b′ + g1 a′b + g2 ab′ + g3 ab. (5.1)
Fig. 5.3 a A 2-to-1 multiplexer. b A 4-to-1 multiplexer circuit using pass-transistor network
Each gi can assume a binary value, either 0 or 1, depending on the function. By applying the variables to control the gate potentials and applying the logic value gi to the corresponding data input, the function can be realized. In this way, any function of two variables can be realized.
The realization of the function f = a ′b + ab′ is shown in Fig. 5.4a. When a func-
tion is realized in the above manner, none of the problems mentioned earlier arises.
This approach may be termed as universal logic module (ULM)-based approach
because of the ability of multiplexers to realize any function up to a certain number
of input variables.
However, the circuit realized in the above manner may not be optimal in terms of the number of pass transistors; that is, there may exist another realization with a smaller number of pass transistors. For example, an optimal realization of f = a′b + ab′ is shown in Fig. 5.4b. An optimal network may be obtained by expanding a function iteratively using Shannon’s expansion theorem. The variable around which expansion is done has to be suitably chosen, and each reduced function is further expanded until it is a constant or a single variable.
Example 5.1 Consider the function f = a + b′c. Realize it using pass-transistor logic.
By expanding around the variable “a,” we get f = a·1 + a′·( b′c).
Now, by expanding around the variable “b,” we get f = a·1 + a′·( b·0 + b′·c).
No further expansion is necessary because there does not exist any reduced
function, which is a function of more than one variable. Realization of the function
f = a + b′c based on this approach is shown in Fig. 5.2c. Circuit realization based
on the above approach is not only area efficient but also satisfies all the require-
ments of pass-transistor logic realization mentioned earlier.
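The iterative expansion used in Example 5.1 is mechanical enough to automate. The sketch below is our own illustration (function names and the truth-table representation are not from the text): it expands a function one variable at a time until every cofactor is a constant, yielding a nested 2-to-1 multiplexer tree.

```python
from itertools import product

def cofactor(table, value):
    """Restrict a truth table (dict: input tuple -> 0/1) to its first
    variable = value, dropping that variable from each key."""
    return {k[1:]: v for k, v in table.items() if k[0] == value}

def shannon(table, names):
    """Iterative Shannon expansion: returns '0', '1', or a nested
    (var, branch_when_var_is_0, branch_when_var_is_1) mux tree."""
    vals = set(table.values())
    if vals == {0}:
        return '0'
    if vals == {1}:
        return '1'
    return (names[0],
            shannon(cofactor(table, 0), names[1:]),
            shannon(cofactor(table, 1), names[1:]))

# f = a + b'c, as in Example 5.1
names = ['a', 'b', 'c']
table = {bits: int(bits[0] == 1 or (bits[1] == 0 and bits[2] == 1))
         for bits in product((0, 1), repeat=3)}
tree = shannon(table, names)
print(tree)
# ('a', ('b', ('c', '0', '1'), '0'), '1')
# i.e., f = a*1 + a'(b*0 + b'*c), matching the expansion in the text
```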
Realization of logic functions using pass transistors has several advantages and
disadvantages. The advantages are mentioned below:
a. Ratioless: We have seen that for a reliable operation of an inverter (or gate logic),
the width/length (W/L) ratio of the pull-up device is four times (or more) that of
the pull-down device. As a result, the geometrical dimension of the transistors
is not minimum (i.e., 2λ × 2λ ). The pass transistors, on the other hand, can be
of minimum dimension. This makes pass-transistor circuit realization very area
efficient.
b. Powerless: In a pass-transistor circuit, there is no direct current (DC) path from
supply to ground (GND). So, it does not require any standby power, and power
dissipation is very small. Moreover, each additional input requires only a mini-
mum geometry transistor and adds no power dissipation to the circuit.
c. Lower area: Any one type of pass-transistor networks, nMOS or p-type MOS
(pMOS), is sufficient for the logic function realization. This results in a smaller
number of transistors and smaller input loads. This, in turn, leads to smaller chip
area, lower delay, and smaller power consumption.
However, pass-transistor logic suffers from the following disadvantages:
1. When a signal is steered through several stages of pass transistors, the delay can
be considerable, as explained below.
When the number of stages is large, a series of pass transistors can be modeled as a network of resistance–capacitance (RC) elements, as shown in Fig. 5.5. The ON resistance of each pass transistor is Rpass, and the capacitance at each node is CL. The values of Rpass and CL depend on the type of switch used. The time constant Rpass CL approximately gives the time for CL to charge to 63 % of its final value. To calculate the delay of a cascaded stage of n transistors, we can simplify the
equivalent circuit by assuming that all resistances and capacitances are equal and
can be lumped together into one resistor of value n × Rpass and one capacitor of value
n × CL. The equivalent time constant is n2 Rpass CL. Thus, the delay increases as the
square of number of pass transistors in series. However, this simplified analysis
leads to overestimation of the delay. A more accurate estimate of the delay can be
obtained by approximating the effective time constant to be
τeff = Σk Rk Ck, (5.2)
Fig. 5.5 a Pass-transistor network; b RC model for the pass-transistor network. RC resistance–capacitance
where Rk stands for the resistance of the path common to node k and the input.
For n = 4,

τeff = R1C1 + ( R1 + R2) C2 + ( R1 + R2 + R3) C3 + ( R1 + R2 + R3 + R4) CL. (5.3)

If all resistances are equal to Req and all capacitances to C, the time constant becomes τ = [ n( n + 1)/2] Req C.
Assuming all resistances are equal to Rpass and all capacitances are equal to C, the delay of the n-stage pass-transistor network can be estimated by using the Elmore approximation:

tp = 0.69 [ n( n + 1)/2] Rpass C. (5.4)
For n = 4, this gives tp = 6.9 Rpass C. If a buffer with delay tbuf is inserted after every k pass transistors, the overall delay becomes

tp = 0.69 ( n/k)[ k( k + 1)/2] CL Rpass + ( n/k − 1) tbuf = 0.69 [ n( k + 1)/2] CL Rpass + ( n/k − 1) tbuf.
Setting ∂tp/∂k = 0 gives the optimum buffering interval

kopt = 1.7 √( tbuf/( Rpass CL)).
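The quadratic growth of the unbuffered chain delay and the optimum buffering interval can be sketched numerically. The Rpass, C, and tbuf values below are illustrative assumptions, not figures from the text.

```python
import math

# Elmore delay of an n-stage pass-transistor chain (Eq. 5.4) and the
# optimal buffering interval k_opt derived above.
def elmore_chain_delay(n, r_pass, c):
    """tp = 0.69 * n(n+1)/2 * Rpass * C -- grows quadratically in n."""
    return 0.69 * n * (n + 1) / 2 * r_pass * c

def k_opt(t_buf, r_pass, c_l):
    """Pass transistors per buffered segment: 1.7*sqrt(tbuf/(Rpass*CL))."""
    return 1.7 * math.sqrt(t_buf / (r_pass * c_l))

R_PASS, C_NODE = 1e4, 10e-15       # assumed 10 kOhm switch, 10 fF per node
print(elmore_chain_delay(4, R_PASS, C_NODE))    # 6.9*Rpass*C = 0.69 ns
print(elmore_chain_delay(16, R_PASS, C_NODE))   # 4x the stages, ~13.6x delay
print(k_opt(0.2e-9, R_PASS, C_NODE))            # ~2.4: buffer every 2-3 stages
```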
Several pass-transistor logic styles have been proposed in recent years to overcome
the limitations mentioned above. The most common ones are: the conventional
nMOS pass-transistor logic or complementary pass-transistor logic (CPL), the dual
pass-transistor logic (DPL), and the swing-restored pass-transistor logic (SRPL)
[2]. In this section, they are introduced and compared.
Complementary Pass-Transistor Logic A basic structure of a CPL gate is shown
in Fig. 5.7a. It comprises two nMOS pass-transistor logic networks (one for each rail), two small pull-up pMOS transistors for swing restoration, and two output
inverters for the complementary output signals. CPL realization for the 2-to-1
Fig. 5.7 a Basic complementary pass-transistor logic (CPL) structure; and b 2-to-1 multiplexer
realization using CPL logic
Fig. 5.8 Complementary pass-transistor logic (CPL) logic circuit for a 2-input AND/NAND, b
2-input OR/NOR, and c 2-input EX-OR
multiplexer is shown in Fig. 5.7b. This is the basic and minimal gate structure with
ten transistors. All two-input functions can be implemented by this basic gate struc-
ture. Realizations of 2-input NAND, NOR, and EX-OR functions are shown in
Fig. 5.8a–c, respectively.
Swing-Restored Pass-Transistor Logic The SRPL logic style is an extension of
CPL to make it suitable for low-power low-voltage applications. The basic SRPL
logic gate is shown in Fig. 5.9a, where the output inverters are cross-coupled to form a latch structure, which performs both swing restoration and output buffering. The pull-up pMOS transistors are no longer required, and the output nodes of
the nMOS networks are the gate outputs. Figure 5.9b shows an example of AND/
NAND gate using SRPL logic style. As the inverters have to drive the outputs and
must also be overridden by the nMOS network, transistor sizing becomes very dif-
ficult and results in poor output-driving capacity, slow switching, and larger short-
circuit currents. This problem becomes worse when many gates are cascaded.
Double Pass-Transistor Logic The DPL is a modified version of CPL, in which
both nMOS and pMOS logic networks are used together to alleviate the problem
of the CPL associated with reduced high logic level. As this provides full swing on
Fig. 5.9 a Basic swing-restored pass-transistor logic (SRPL) configuration; and b SRPL realiza-
tion of 2-input NAND gate
the output, no extra transistors for swing restoration are necessary. Two-input AND/
NAND DPL realization is shown in Fig. 5.10. As shown in the figure, both nMOS
and pMOS pass transistors are used. Although, owing to the addition of the pMOS transistors, the output of the DPL swings rail to rail, this results in increased input capacitance compared to CPL. It may be noted that DPL can be regarded as a dual-rail pass-gate logic. The DPL has a balanced input capacitance, and its delay is independent of the applied inputs, contrary to the CPL and conventional CMOS pass-transistor logic, where the input capacitance seen by different signal inputs is not the same. As two current paths always drive the output in DPL, the reduction in speed due to the additional transistors is compensated.
Single-Rail Pass-Transistor Logic A single-rail pass-transistor logic style has been adopted in the LEAP (LEAn integration with Pass transistors) design [7], which exploits the full functionality of the multiplexer structure scheme. Swing restoration is done by a feedback pull-up pMOS transistor.
This is slower than the cross-coupled pMOS transistors of CPL working in differential mode. This swing-restoration scheme works for Vdd > Vtn + |Vtp|, because the threshold-voltage drop through the nMOS network for a logic “1” prevents the nMOS transistor of the inverter, and with it the pull-up pMOS, from turning ON. As a consequence, robustness at low voltages is guaranteed only if the threshold voltages are
appropriately selected. Complementary inputs are generated locally by inverters.
The three basic cells, Y1, Y2, and Y3, used in LEAP logic design are shown in Fig. 5.11. It may be noted that the Y1 cell corresponds to a 2-to-1 multiplexer circuit, whereas Y3 corresponds to a 4-to-1 multiplexer circuit realized using three 2-to-1 multiplexers. Y2 is essentially a 3-to-1 multiplexer realized using two 2-to-1 multiplexers. Any
complex logic circuit can be realized using these three basic cells.
Comparison of the pass-transistor logic styles based on some important circuit
parameters, such as number of MOS transistors, the output driving capabilities, the
presence of input/output decoupling, requirement for swing restoration circuits, the
number of rails, and the robustness with respect to voltage scaling and transistor
sizing, is given in Table 5.1.
Fig. 5.12 a A 2-input NAND gate and a 4-input NOR gate with fan-in of 2 and 4; b a 4-input NOR gate with fan-out of 3
5.3 Gate Logic
In gate logic, conventional gates, such as inverters, NAND gates, NOR gates, etc.,
are used for the realization of logic functions. In Chap. 4, we have already discussed
the possible realization of inverter circuits by using both nMOS and CMOS tech-
nology. Realization of NAND, NOR, and other types of gates are considered in this
section.
A basic inverter circuit can be extended to incorporate multiple inputs leading
to the realization of standard logic gates such as NAND and NOR gates [4]. These
gates along with the inverters can be used to realize any complex Boolean function.
In this section, we shall consider various issues involved in the realization of these
multi-input gates. But, before that we introduce the concept of fan-in and fan-out,
which are important parameters in the realization of complex functions using these
gates.
Fan-in is the number of signal inputs that the gate processes to generate some out-
put. Figure 5.12a shows a 2-input NAND gate and a four-input NOR gate with
fan-in of 2 and 4, respectively. On the other hand, the fan-out is the number of logic
inputs driven by a gate. For example, in Fig. 5.12b a four-input NOR gate is shown
with fan-out of 3.
Fig. 5.13 a An n-input nMOS NAND gate; b its equivalent circuit; c an n-input nMOS NOR gate
Realizations of nMOS NAND gate with two or more inputs are possible. Let us
consider the generalized realization with n inputs as shown in Fig. 5.13 with a de-
pletion-type nMOS transistor as a pull-up device and n enhancement-type nMOS
transistors as pull-down devices. In this kind of realization, the length/width (L/W)
ratio of the pull-up and pull-down transistors should be carefully chosen such that
the desired logic levels are maintained. The critical factor here is the low-level
output voltage, which should be sufficiently low such that it turns off the transistors
of the following stages. To satisfy this, the output voltage should be less than the
threshold voltage, i.e., Vout ≤ Vt = 0.2 Vdd. To get a low-level output, all the pull-down
transistors must be ON to provide the GND path. Based on the equivalent circuit
shown in Fig. 5.13b, we get the following relationship:
Vdd × nZpd/( nZpd + Zpu) ≤ 0.2 Vdd, (5.5)

or nZpd/( nZpd + Zpu) ≤ 0.2,

or nZpd ≤ 0.2 nZpd + 0.2 Zpu,

or Zpu/( nZpd) ≥ 4.
Therefore, the ratio of Zpu to the sum of all the pull-down Zpds must be at least 4:1.
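The voltage-divider argument behind Eq. (5.5) can be sketched numerically; impedances below are in arbitrary ratio units.

```python
# nMOS NAND ratio rule: with n pull-down transistors in series, the low
# output is a resistive divider and must stay at or below 0.2*Vdd.
def v_out_low(vdd, n, z_pd, z_pu):
    """Vout = Vdd * n*Zpd / (n*Zpd + Zpu)."""
    return vdd * n * z_pd / (n * z_pd + z_pu)

VDD = 5.0
# An inverter-style 4:1 ratio is no longer sufficient for 2 inputs:
print(v_out_low(VDD, 2, 1, 4))    # 5*2/6 ~= 1.67 V, well above 0.2*Vdd
# Scaling Zpu so that Zpu/(n*Zpd) = 4 restores Vout = 0.2*Vdd:
print(v_out_low(VDD, 2, 1, 8))    # 1.0 V = 0.2*Vdd
```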
It may be noted that not only is one pull-down transistor required per input of the NAND gate, but the size of the pull-up transistor also has to be adjusted
Fig. 5.14 a General CMOS network; and b an n-input CMOS NAND gate. CMOS complementary metal–oxide–semiconductor
to maintain the required overall ratio. This requires a considerably larger area than that of the corresponding nMOS inverter.
Moreover, the delay increases in direct proportion to the number of inputs. If each pull-down transistor is kept of minimum size, then each will present one gate capacitance at its input, and the resistances of all the pull-down transistors will be in series. Therefore, an n-input NAND gate has a delay of n times that of an inverter, i.e.,

τNAND = n τinv.
As a consequence, nMOS NAND gates are only used when absolutely necessary
and the number of inputs is restricted so that area and delay remain within some
limit.
An n-input NOR gate is shown in Fig. 5.13c. Unlike the NAND gate, here the
output is low when any one of the pull-down transistors is on, as it happens in the
case of an inverter. As a consequence, the aspect ratio of the pull-up to any pull-
down transistor will be the same as that of an inverter, irrespective of the number of
inputs of the NOR gate.
The area occupied by the nMOS NOR gate is reasonable, because the pull-up
transistor geometry is not affected by the number of inputs. The worst-case delay of
a NOR gate is also comparable to that of the corresponding inverter. As a consequence, the use of NOR gates is preferred to that of NAND gates in nMOS circuit realization, when there is a choice.
5.3.3 CMOS Realization
For an n-input CMOS NAND gate with all inputs tied together, the n series nMOS transistors behave as a single transistor of transconductance βn/n, while the n parallel pMOS transistors act as a single transistor of transconductance nβp. Replacing the transconductance ratio βn/βp in the inverter expression by βn/( n²βp), we get

Vinv = [ Vdd + Vtp + Vtn √(βn/( n²βp))]/[1 + √(βn/( n²βp))] (5.6)

or

Vinv = [ Vdd + Vtp + ( Vtn/n)√(βn/βp)]/[1 + (1/ n)√(βn/βp)]. (5.7)
As the fan-in n increases, the effective transconductance ratio βn/( n²βp) decreases, leading to an increase in Vinv. In other words, with the increase in fan-in, the switching point (inversion point) moves towards the right. It may be noted that in our analysis, we have ignored the body effect.
Fig. 5.16 a An n-input CMOS NOR gate; b its equivalent circuit

The realization of the n-input CMOS NOR gate is shown in Fig. 5.16a. Its equivalent circuit is shown in Fig. 5.16b. In this case, the transconductance ratio is equal to n²βn/βp. Substituting this ratio in Eq. 5.6, we get

Vinv = [ Vdd + Vtp + nVtn √(βn/βp)]/[1 + n√(βn/βp)]. (5.8)
As the fan-in number n increases, the effective transconductance ratio n²βn/βp
increases, leading to a decrease in Vinv. In other words, with the increase in fan-in,
the switching point (inversion point) moves towards the left. Here also, we have
ignored the body effect. Therefore, with the increase in fan-in, the inversion point
moves towards the left for NOR gates, whereas it moves towards the right in the
case of NAND gates.
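The shift of the switching point with fan-in can be checked numerically. The sketch below evaluates Eqs. 5.7 and 5.8 with illustrative values (Vdd = 5 V, Vtn = 1 V, Vtp = −1 V, and βn = βp; all of these numbers are assumptions, not values from the text), ignoring the body effect as in the analysis above.

```python
import math

def v_inv_nand(n, vdd=5.0, vtn=1.0, vtp=-1.0, beta_ratio=1.0):
    """Switching threshold of an n-input CMOS NAND gate (Eq. 5.7).

    The n series nMOS transistors reduce the effective nMOS
    transconductance, so r = sqrt(betan/betap)/n."""
    r = math.sqrt(beta_ratio) / n
    return (vdd + vtp + r * vtn) / (1 + r)

def v_inv_nor(n, vdd=5.0, vtn=1.0, vtp=-1.0, beta_ratio=1.0):
    """Switching threshold of an n-input CMOS NOR gate (Eq. 5.8).

    The n parallel nMOS transistors raise the effective nMOS
    transconductance, so r = n*sqrt(betan/betap)."""
    r = n * math.sqrt(beta_ratio)
    return (vdd + vtp + r * vtn) / (1 + r)

for n in (1, 2, 4, 8):
    # NAND threshold drifts right (up), NOR threshold drifts left (down)
    print(n, round(v_inv_nand(n), 2), round(v_inv_nor(n), 2))
```

For n = 1 both reduce to the symmetric-inverter threshold Vdd/2; as n grows, the NAND value rises towards Vdd + Vtp while the NOR value falls towards Vtn, exactly the left/right drift described above.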
5.3.4 Switching Characteristics
To study the switching characteristics, let us consider the series and parallel tran-
sistors separately. Figure 5.17a shows n pull-up pMOS transistors with their gates
tied together along with a load capacitance CL. An equivalent circuit is shown in
Fig. 5.17b. The intrinsic time constant of this network is given by

tdr = (Rp/n)·(n·Coutp) + (Rp/n)·CL = (Rp/n)·(n·Coutp + CL).   (5.9)
Fig. 5.17 a Pull-up transistors tied together with a load capacitance; and b equivalent circuit
Fig. 5.18 a Pull-down transistors along with load capacitance CL, and b equivalent circuit
Here, the load capacitance CL includes all capacitances on the output node except
the output capacitances of the transistors in parallel. The worst-case rise time occurs
when only one pMOS transistor is ON and remaining pMOS transistors are OFF.
This rise time is close to the rise time of an inverter.
To find out the fall time (high-to-low delay), let us consider the n series-connected
nMOS transistors shown in Fig. 5.18a. The intrinsic switching time for this network,
based on Fig. 5.18b, is given by

tdf = n·Rn·(Coutn/n + Cload) + 0.35·Rn·Cinn·(n − 1)².   (5.10)

The first term represents the intrinsic switching time of the n series-connected
transistors and the second term represents the delay caused by Rn charging Cinn.
For an n-input NAND gate,

tdr = (Rp/n)·(n·Coutp + Coutn/n + Cload)   (5.11)

and

tdf = n·Rn·(Coutn/n + n·Coutp + CL) + 0.35·Rn·Cinn·(n − 1)².   (5.12)
Thus, the delay increases linearly with the increase of fan-in number n.
In a similar manner, the rise and fall times of an n-input NOR gate can be obtained
as

tdf = (Rn/n)·(n·Coutn + CL)   (5.13)

and

tdr = n·Rp·(Coutp/n + n·Coutn + CL) + 0.35·Rp·Cinp·(n − 1)².   (5.14)
n
It may be noted that in the case of the NAND gate, the discharge path is through the
series-connected nMOS transistors. As a consequence, the high-to-low delay increases
with the fan-in. If the output load capacitance is considerably larger than the
other capacitances, the delays reduce to tdf = nRnCL and tdr = RpCL. On the
other hand, in the case of the NOR gate, charging of the load capacitance takes place
through the series-connected pMOS transistors, giving tdr = nRpCL and
tdf = RnCL. As Rp > Rn, the rise-time delay of a NOR gate is greater than the fall-time
delay of a NAND gate with the same fan-in. This is the reason why NAND gates
are generally a better choice than NOR gates in complementary CMOS logic. NOR
gates may be used for limited fan-in.
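These delay expressions are easy to tabulate. The sketch below plugs illustrative device values into Eqs. 5.11–5.14 (all resistances and capacitances are assumed, with Rp = 2Rn to reflect the lower hole mobility) and confirms that the rise delay of a NOR gate exceeds the fall delay of a NAND gate of the same fan-in.

```python
def nand_delays(n, rn=1e3, rp=2e3, c_outn=5e-15, c_outp=5e-15,
                c_inn=5e-15, c_load=50e-15):
    """n-input NAND rise/fall delays (Eqs. 5.11 and 5.12); values assumed."""
    tdr = (rp / n) * (n * c_outp + c_outn / n + c_load)
    tdf = (n * rn * (c_outn / n + n * c_outp + c_load)
           + 0.35 * rn * c_inn * (n - 1) ** 2)
    return tdr, tdf

def nor_delays(n, rn=1e3, rp=2e3, c_outn=5e-15, c_outp=5e-15,
               c_inp=5e-15, c_load=50e-15):
    """n-input NOR rise/fall delays (Eqs. 5.13 and 5.14); values assumed."""
    tdf = (rn / n) * (n * c_outn + c_load)
    tdr = (n * rp * (c_outp / n + n * c_outn + c_load)
           + 0.35 * rp * c_inp * (n - 1) ** 2)
    return tdr, tdf

for n in (2, 4, 8):
    _, nand_fall = nand_delays(n)   # worst-case NAND delay: series nMOS
    nor_rise, _ = nor_delays(n)     # worst-case NOR delay: series pMOS
    print(n, nand_fall, nor_rise)   # NOR rise is the slower of the two
```

The worst-case delay of both gates grows roughly linearly with n, and because the NOR worst case goes through the slower series pMOS chain, it always loses to the NAND, in line with the comparison above.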
The MOS circuits that we have discussed so far are static in nature [2]. In static
circuits, the output voltage levels remain unchanged as long as inputs are kept the
same and the power supply is maintained. nMOS static circuits have two disad-
vantages: they draw static current as long as power remains ON, and they require
larger chip area because of “ratioed” logic. These two factors contribute towards
slow operation of nMOS circuits. Although there is no static power dissipation in
a full-complementary CMOS circuit, the logic function is implemented twice: once
in the pull-up p-network and once in the pull-down n-network. Due to the extra
area and extra number of transistors, the load capacitance on gates of a full-comple-
mentary CMOS is considerably higher. As a consequence, speeds of operation of
the CMOS and nMOS circuits are comparable. The CMOS not only has twice the
available current drive but also has twice the capacitance of nMOS. The trade-off in
choosing one or the other is between the lower power of the CMOS and the lower
area of nMOS (or pseudo nMOS).
In the static combinational circuits, capacitance is regarded as a parameter re-
sponsible for poor performance of the circuit, and therefore considered as an un-
desirable circuit element. But, a capacitance has the important property of holding
charge. Information can be stored in the form of charge. This information storage
capability can be utilized to realize digital circuits. In MOS circuits, the capaci-
tances need not be externally connected. Excellent insulating properties of silicon
dioxide provide very good quality gate-to-channel capacitances, which can be used
for information storage. This is the basis of MOS dynamic circuits. In Sect. 5.4.1,
we discuss how dynamic MOS circuits are realized utilizing these capacitances us-
ing single-phase and two-phase clocks.
We shall see that the advantage of low power of full-complementary CMOS cir-
cuits and smaller chip area of nMOS circuits are combined in dynamic circuits lead-
ing to circuits of smaller area and lower power dissipation. MOS dynamic circuits
Fig. 5.20 a Single-phase clock; and b single-phase n-type MOS (nMOS) inverter
are also faster in speed. However, they are not free from disadvantages. Like any
other capacitor, an MOS capacitor gradually leaks its stored charge. To retain information,
it is necessary to restore the information periodically by a process known as refreshing.
There are other problems like charge sharing and clock skew leading to hazards
and races. Suitable measures should be taken to overcome these problems. These
problems are considered in Sect. 5.4.3.
To realize dynamic circuits, it is essential to use a clock. Two types of clocks are
commonly used: the single-phase clock and the nonoverlapping two-phase clock. The
single-phase clock consists of a sequence of pulses having high (logic 1) and low
(logic 0) levels with width W and time period T as shown in Fig. 5.20a. A single-
phase clock has two states (low and high) and two edges per period. The schematic
diagram of a single-phase dynamic nMOS inverter is shown in Fig. 5.20b. Opera-
tion of a single-phase inverter circuit is explained below.
When the clock is in the high state, both transistors Q2 and Q3 are ON. Depend-
ing on the input, Q1 is either ON or OFF. If the input voltage is low, Q1 is OFF and
the output capacitor (gate capacitor of the next stage) charges to Vdd through Q2 and
Q3. When the input voltage is high, Q1 is ON and the output is discharged through it
to a low level. When the clock is in the low state, the transistors Q2 and Q3 are OFF,
isolating the output capacitor. This voltage is maintained during the OFF period of
the clock, provided the period is not too long. During this period, the power supply
is also disconnected and no current flows through the circuit. As current flows only
when the clock is high, the power consumption is small, and it depends on the duty
cycle (ratio of high time to the time period T). It may be noted that the output of the
circuit is also ratioed, because the low output voltage depends on the ratio of the ON
resistance of Q1 to that of Q2. As we know, this ratio is related to the physical
dimensions ( L:W ratios) of Q1 and Q2 and is often referred to as the inverter ratio. The
idea of the MOS inverter can be extended to realize dynamic MOS NAND, NOR,
and other gates as shown in Fig. 5.21.
The circuits realized using the single-phase clocking scheme have the disadvantage
that the output voltage level depends on the inverter ratio and the number
of transistors in the current path to GND. In other words, single-phase dynamic
circuits are ratioed logic. Moreover, as mentioned above, the circuit dissipates
power when the output is low and the clock is high.
Another problem arising out of single-phase clocked logic is the clock skew
problem, caused by the delay a clock signal suffers during its journey through a
number of circuit stages. This results in undesired signals such as glitches, hazards, etc.
Some of these problems can be overcome using a two-phase clocking scheme, as
discussed in the following subsection.
is why the circuits based on two-phase clocking are often termed ratioless and
powerless. Moreover, in dynamic circuits, there can be at most one transition
per clock cycle, and therefore there cannot be the multiple spurious transitions, called
glitches, that can take place in static CMOS circuits. As with the inverter, NAND, NOR,
and other complex functions can be realized by replacing Q2 with a network of two
or more nMOS transistors.
Dynamic CMOS circuits avoid the area penalty of static CMOS and at the same
time retain its low power dissipation; they can be considered an extension
of pseudo-nMOS circuits. The function of these circuits can be better explained
using the idea of pre-charged logic. Irrespective of
several alternatives available in dynamic circuits, in all cases the output node
is pre-charged to a particular level, while the current path to the other level
is turned OFF. The charging of inputs to the gate must take place during this
phase. At the end of this pre-charge phase, the path between the output node and
the pre-charge source is turned OFF by a clock, while the path to the other level
is turned ON through a network of MOS transistors, which represents this func-
tion. During this phase, known as evaluation phase, the inputs remain stable
and depending on the state of the inputs, one of the two possibilities occurs. The
network of transistors either connects the output to the other level discharging
the output or no path is established between the other level and output, thereby
retaining the charge. This operation is explained with the help of a circuit re-
alization for the function f = x3 + x1 x2 . The realization of this function using
full-complementary CMOS approach is shown in Fig. 5.24a. There are two al-
ternative approaches for realizing it using dynamic CMOS circuits. As shown in
Fig. 5.24b, the circuit is realized using the nMOS block scheme, where there is
only one pMOS transistor, which is used for pre-charging. During φ = 0, the out-
put is pre-charged to a high level through the pMOS transistor. And during the
evaluation phase ( φ = 1), the output is evaluated through the network of nMOS
transistors. At the end of the evaluation phase, the output level depends on the
input states of the nMOS transistors. In the second approach, the circuit uses
the pMOS block scheme, where there is only one nMOS transistor used for pre-
discharging the output as shown in Fig. 5.24c. Here, during pre-charge phase,
the output is pre-discharged to a low level through the nMOS transistor and
during evaluation phase the output is evaluated through the network of pMOS
transistors. In the evaluation phase, the low-level output is either retained or
charged to a high level through the pMOS transistor network.
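The precharge–evaluate behavior just described can be captured in a few lines. The sketch below models the nMOS-block version at the logic level: the output node is precharged high, and it is discharged in the evaluation phase exactly when the pull-down network for x3 + x1·x2 conducts, so the node ends up holding the complement of that expression. This is a behavioral illustration of the scheme, not a circuit-level model.

```python
from itertools import product

def dynamic_nmos_gate(x1, x2, x3):
    """Logic-level model of a dynamic nMOS-block gate whose pull-down
    network realizes x3 + x1*x2 (the text's example function)."""
    out = 1                          # precharge phase: output node charged high
    if x3 or (x1 and x2):            # evaluation phase: a discharge path exists
        out = 0                      # output node is pulled low
    return out                       # otherwise the stored charge is retained

# Exhaustive check: the evaluated node is the complement of the
# pull-down expression, as expected for an nMOS-block dynamic gate.
for x1, x2, x3 in product((0, 1), repeat=3):
    assert dynamic_nmos_gate(x1, x2, x3) == 1 - (x3 | (x1 & x2))
```

The dual pMOS-block scheme would simply swap the roles: pre-discharge to 0, then conditionally charge the node high through the pMOS network.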
Dynamic CMOS circuits have a number of advantages. The number of transistors
required for a circuit with fan-in N is ( N + 2), in contrast to 2N in the case of a static
CMOS circuit. Not only do dynamic circuits require ( N + 2) MOS transistors, but
the load capacitance is also substantially lower than that of static CMOS circuits:
about 50 % less, and closer to that of nMOS (or pseudo-nMOS) circuits. Moreover,
the full pull-down (or pull-up) current is available for discharging (or charging)
the output capacitance.
Therefore, the speed of operation is faster than that of static CMOS circuits, while
the static power consumption remains close to that of static CMOS. Dynamic circuits
thus provide a superior area-speed product compared to their static counterparts.
For example, a dynamic NOR gate is about five times faster than the static CMOS
NOR gate. The speed advantage is due to the smaller output capacitance and reduced
overlap current. However, dynamic circuits are not free from problems, as discussed
in the following subsections.
iD = Is (e^(qV/kT) − 1),

where Is is the reverse saturation current, V is the diode voltage (negative under
reverse bias), q is the electronic charge (1.602 × 10⁻¹⁹ C), k is the Boltzmann constant
(1.38 × 10⁻²³ J/K), and T is the temperature in kelvin.
It may be noted that the leakage current is a function of the temperature. The cur-
rent is in the range 0.1–0.5 nA per device at room temperature. The current doubles
for every 10 °C increase in temperature. Moreover, there will be some subthreshold
leakage current: even when the transistor is OFF, current can still flow from the
drain to the source. This component increases as the gate voltage rises above zero,
and the effect becomes more pronounced as the gate voltage approaches the threshold
voltage. A sufficiently high threshold voltage Vt is recommended to alleviate
this problem. The charge leakage results in gradual degradation of the high-level
output voltage with time, which prevents the circuit from operating at a low clock
rate. This makes it unattractive for many applications, such as toys, watches, etc.,
which are operated at lower frequencies to minimize power drawn from the battery.
As a consequence, the voltage on the charge-storage node slowly decreases and
needs to be compensated by refreshing the charge at regular intervals.
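The temperature dependence quoted above (roughly 0.1–0.5 nA per device at room temperature, doubling every 10 °C) translates into a simple rule-of-thumb calculation; the room-temperature value of 0.3 nA below is an assumed midpoint of the text's range.

```python
def junction_leakage_na(temp_c, i_room_na=0.3, room_c=25.0):
    """Per-device junction leakage in nA, assuming it doubles for
    every 10 degC rise above an assumed room-temperature value."""
    return i_room_na * 2 ** ((temp_c - room_c) / 10.0)

print(junction_leakage_na(25))   # 0.3 nA at room temperature
print(junction_leakage_na(85))   # 2**6 = 64x larger at 85 degC: 19.2 nA
```

A hotter chip therefore discharges its storage nodes far faster, which tightens the minimum refresh rate of a dynamic circuit.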
Dynamic circuits suffer from charge sharing problem because of parasitic capaci-
tances at different nodes of a circuit. For example, consider the circuit of Fig. 5.26a.
The node A charges to Vdd during the pre-charge phase with a charge of CLVdd stored
on the capacitor CL. Assume that the parasitic capacitances of nodes B and C are C1
and C2, respectively. To start with, these capacitors are assumed to have no charge.
As there is no current path from node A to GND, it should stay at Vdd during the
evaluation phase. But because of the presence of C1 and C2, redistribution of charge
will take place leading to the so-called charge-sharing problem. This charge-shar-
ing mechanism can be modeled as shown in Fig. 5.26b, where C1 and C2 are con-
nected to node A through two switches representing the MOS transistors. Before the
switches are closed, the charge on CL is given by QA = VddCL and charges at node B
and C are QB = 0 and QC = 0, respectively.
After the switches are closed, there will be a redistribution of charges based on
the charge conservation principle, and the voltage VA at node A is given by
CLVdd = (CL + C1 + C2 )VA. (5.15)
Therefore, the new voltage at node A, at the end of the evaluation phase, will be

VA = [CL/(CL + C1 + C2)]·Vdd.   (5.16)
If C1 = C2 = 0.5 CL, then VA = 0.5 Vdd. This may not only lead to incorrect interpretation
of the output but may also result in high static power dissipation in the
succeeding stage.
To overcome the charge-sharing and leakage problems, some techniques may be
used. As shown in Fig. 5.27, a weak pMOS transistor (low W/L) is added as a pull-up.
This transistor always remains ON, and the gate behaves like a pseudo-nMOS circuit
during the evaluation phase. Although this causes some static power dissipation
during the evaluation phase, it helps to maintain the output voltage by replenishing
the charge lost to leakage current or charge sharing.
of the circuit; because, if the output depends on the difference in delay time of the
two clocks, then the charge on the load capacitance CL of the second stage may
discharge to the extent that the voltage is less than the threshold voltage of the MOS
transistor of the next stage. As a consequence, a cascading problem arises: because
the output is pre-charged to a high level, internal delays can lead to inadvertent
discharge of the next stage. This problem can be
overcome if the output can be set to low during pre-charge. This basic concept is
used in realizing domino CMOS.
There are several approaches to overcome this problem. One straightforward
approach to deal with this problem is to introduce a delay circuit in the path of the
clock that will compensate for the discharge delay between the first and the second
stage of the dynamic circuit. It may be noted that the delay is a function of the
complexity of the discharge path. This "self-timing" scheme is fairly simple and is
often used in CMOS programmable logic array (PLA) and memory design. However,
there exist other approaches by which this problem can be overcome. Special
types of circuits, namely domino and NORA (NO RAce) CMOS circuits, as discussed
below, can also be used.
works based on the pre-charge technique. The second component is a static invert-
ing CMOS buffer. Only the output of the static buffer is fed as an input to the subse-
quent stage. During the pre-charge phase, the dynamic gate has a high output, so the
buffer output is low. Therefore, during pre-charge, all circuit inputs which connect
the output of one domino gate to the input of another are low, and the transistors
which are driven by these outputs are OFF. Moreover, during the evaluation phase,
a domino gate can make only one transition, namely, a low-to-high transition. This
is because the dynamic gate output can only go from high to low during the evaluation
phase. As a result, the buffer output can only go from low to high during the evaluation
phase, which makes the circuits glitch-free.
During the evaluation phase, output signal levels change from low to high from
the input towards the output stages, when several domino logic gates are cascaded.
This phenomenon resembles the chain action of a row of falling dominos, hence the
name domino logic [5].
Domino CMOS circuits have the following advantages:
• Since no DC current path is established either during the pre-charge phase or
during the evaluation phase, domino logic circuits have lower power consump-
tion.
• As only the n-block is used to realize the circuit, domino circuits occupy less chip
area compared to static CMOS circuits.
• Due to the smaller number of MOS transistors used in circuit realization, domino
CMOS circuits have smaller parasitic capacitances and hence are faster than
static CMOS.
Full pull-down current is also available to drive the output nodes. Moreover, the
use of single clock edge to activate the circuit provides simple operation and maxi-
mum possible speed. One limitation of domino logic is that each gate requires an
inverting buffer. However, this does not pose a serious problem, because buffers are
needed anyway to achieve higher speed and higher fan-out. Another limitation of
the domino logic is that all the gates are non-inverting in nature, which calls for new
logic design paradigm which is different from the conventional approach.
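The monotonic, glitch-free behavior of domino logic is easy to demonstrate at the logic level. In the sketch below (a behavioral model, not a transistor-level one), each stage is a dynamic AND node followed by the static inverting buffer, so every buffer output is 0 during precharge and can make at most one 0-to-1 transition during evaluation.

```python
def domino_and(a, b, evaluate=True):
    """One domino stage: dynamic NAND node plus inverting static buffer."""
    dyn = 1                          # dynamic node precharged high
    if evaluate and a and b:         # evaluation: nMOS network discharges it
        dyn = 0
    return 1 - dyn                   # buffer output: always 0 during precharge

# Two cascaded stages form a 3-input AND; buffer outputs feed only
# nMOS gates of the next stage, so they start low and can only rise.
def domino_and3(a, b, c, evaluate=True):
    return domino_and(domino_and(a, b, evaluate), c, evaluate)

assert domino_and3(1, 1, 1, evaluate=False) == 0  # all outputs low in precharge
assert domino_and3(1, 1, 1) == 1                  # single 0->1 transition
assert domino_and3(1, 0, 1) == 0                  # no transition at all
```

Because every stage output is non-inverting, realizing arbitrary logic requires the design-style changes mentioned above; this is exactly the limitation NORA logic relaxes.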
5.4.6 NORA Logic
Another alternative is to use the duality property of the n-blocks and p-blocks used
in the realization of CMOS circuits. The pre-charge output of an n-block is "1," whereas
the pre-charge output of a p-block is "0." By alternately using p-blocks and n-blocks,
as shown in Fig. 5.30, the clock skew problem is overcome in NORA logic
circuits.
NORA stands for NO RAce [6]. Here, both the n- and p-blocks are in the pre-charge
phase when clk′ = 1 (i.e., clk = 0). The outputs of the n-mostly and p-mostly blocks are
pre-charged and pre-discharged to 1 and 0, respectively. During this phase, the inputs
are set up. The n-mostly and p-mostly blocks are in the evaluation phase when
clk′ = 0 (i.e., clk = 1). During this phase, inputs are held constant and outputs are
evaluated as a function of the inputs.
As the pre-charged output is 1 (0) for an n-block (p-block), it cannot lead
to any discharge of the outputs of a following p-block (n-block). As a consequence,
internal delays cannot result in a race condition, and we may conclude that circuits
designed based on the NORA logic are internal-delay race-free.
Compared to the domino technique, this approach gives higher flexibility. As
both inverted and non-inverted signals are available with the NORA technique,
alternating n-mostly and p-mostly blocks can be used to realize circuits of arbitrary
logical depth. When a connection between blocks of the same type is necessary,
domino-like inverters may be used between them. The NORA logic style has the
disadvantage that p-blocks are slower than n-blocks, due to the lower mobility of
holes, the current carriers in pMOS transistors. Proper sizing, requiring extra area,
may be used to equalize the delays. Compared to domino logic circuits, NORA
logic circuits are faster due to the elimination of the static inverter and the smaller
load capacitance. The Digital Equipment Corporation (DEC) Alpha processor, the
first 250-MHz CMOS microprocessor, made extensive use of NORA logic circuits.
The NORA technique can also be used to realize pipelined circuits in a convenient
manner. A single stage of a pipeline block is shown in Fig. 5.30. It consists of one
n-mostly block, one p-mostly block, and a clocked CMOS (C²MOS) block, which is
used as a latch stage for storing information. Pipelined circuits can be realized by
applying φ and φ′ to consecutive pipeline blocks. For φ = 0 and φ′ = 1, the φ-sections
are pre-charged, while the φ′-section outputs are held constant by the C²MOS latch
stages. Then, for φ = 1 and φ′ = 0, the φ-sections are in the evaluation phase and the
φ′-sections are in the pre-charge phase. Now the φ′-section outputs, evaluated in the
previous phase, are held constant, and the φ-sections can use this information for
computation. In this way, information flows through the pipeline of alternating φ
and φ′ sections.
5.5 Some Examples
In this section, the realization of several example functions, such as the full adder,
parity generator, and priority encoder, using different logic styles is considered,
and the number of transistors required for realization in the different logic styles is
compared.
d. Full adder
The block diagram of a full adder is shown in Fig. 5.31. The sum-of-products forms
of the sum and carry functions are given below:
S = ABCin + C0′( A + B + Cin )
C0 = AB + Cin ( A + B).
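These sum and carry expressions can be verified exhaustively against the usual full-adder definitions (S = A ⊕ B ⊕ Cin, C0 = majority). Note that the sum expression uses the complemented carry C0′, a bar that is easy to lose in print extraction; the sketch below spells it out.

```python
from itertools import product

def carry(a, b, cin):
    return (a & b) | (cin & (a | b))                    # C0 = AB + Cin(A + B)

def sum_bit(a, b, cin):
    c0 = carry(a, b, cin)
    return (a & b & cin) | ((1 - c0) & (a | b | cin))   # S = ABCin + C0'(A+B+Cin)

# Exhaustive check over all eight input combinations.
for a, b, cin in product((0, 1), repeat=3):
    assert carry(a, b, cin) == (a + b + cin) // 2       # majority of the inputs
    assert sum_bit(a, b, cin) == a ^ b ^ cin            # exclusive-OR sum
print("full-adder expressions verified")
```

Sharing the carry term inside the sum expression is exactly what makes this factored form economical in transistors compared with realizing the two functions independently.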
f. Priority encoder
The block diagram of the 8-bit priority encoder is shown in Fig. 5.39. The truth
table is given in Table 5.3. The sum-of-products forms of the three functions to be
realized are given below:
y0 = x4′x6′( x1x2′ + x3 ) + x5x6′ + x7
y1 = x4′x5′( x2 + x3 ) + x6 + x7
y2 = x4 + x5 + x6 + x7.
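With x7 as the highest-priority input (the usual convention, assumed here since Table 5.3 is not reproduced, and with the primes taken as complements restored from the printed overbars), the three expressions can be checked exhaustively against a brute-force encoder that outputs the binary index of the highest active input:

```python
from itertools import product

def y_bits(x):
    """Encoder outputs (y2, y1, y0) from the sum-of-products forms;
    x is an 8-tuple (x0 .. x7) and primes are written as (1 - xi)."""
    _, x1, x2, x3, x4, x5, x6, x7 = x
    n2, n4, n5, n6 = 1 - x2, 1 - x4, 1 - x5, 1 - x6
    y0 = (n4 & n6 & ((x1 & n2) | x3)) | (x5 & n6) | x7
    y1 = (n4 & n5 & (x2 | x3)) | x6 | x7
    y2 = x4 | x5 | x6 | x7
    return y2, y1, y0

def reference(x):
    """Brute-force priority encoder: binary index of the highest active input."""
    idx = max((i for i in range(8) if x[i]), default=0)
    return (idx >> 2) & 1, (idx >> 1) & 1, idx & 1

# Exhaustive check over all 256 input combinations.
for bits in product((0, 1), repeat=8):
    assert y_bits(bits) == reference(bits)
print("priority-encoder equations verified")
```

The check confirms that the factored expressions match the priority behavior over every input combination, which is a useful sanity test before committing a sum-of-products form to silicon.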
The realizations of the function using static CMOS, domino CMOS, and pass-tran-
sistor circuit are shown in Figs. 5.40, 5.41, and 5.42, respectively.
Comparison: Table 5.4 compares the area requirements, in terms of the number
of transistors, for the three examples using the three logic styles. We find that both
the dynamic CMOS and pass-transistor realizations require fewer transistors
and hence smaller area. The delays of the different implementations, in nanoseconds,
have been estimated using the Cadence SPICE simulation tool and are presented
in Table 5.5. We find that the domino CMOS has the minimum delay.
5.6 Chapter Summary
Fig. 5.40 Static complementary MOS (CMOS) realization of the priority encoder functions
Fig. 5.41 Domino complementary MOS (CMOS) realization of the priority encoder functions
• The different members of the pass-transistor logic family have been introduced.
• The difference between gate logic and pass-transistor logic circuits has been
explained.
• The operation of MOS dynamic circuits has been explained.
• The charge-sharing problem of MOS dynamic circuits has been explained.
• The charge-leakage problem of MOS dynamic circuits has been explained.
• The clock skew problem of MOS dynamic circuits has been explained.
• The operation of domino CMOS circuits to overcome the clock skew problem
has been explained.
• The operation of the NORA CMOS circuits to overcome the clock skew problem
has been explained.
• The area and delay of several real-life circuits for different logic styles are com-
pared.
5.7 Review Questions
Q5.1. In what way do relay logic circuits differ from pass-transistor logic circuits?
Why is the output of a pass-transistor circuit not used as a control signal for the next
stage?
Q5.2. Realize the function F = AB + AC + BC using CPL circuit.
Q5.3. How does the resistance of the two transistors of a lightly loaded transmission
gate vary as the input slowly changes from Vdd to 0 V?
Q5.4. Realize the function f = A′ + BC′ using SRPL pass-transistor circuits. Why is it
necessary to restore the output voltage level of a pass-transistor circuit?
Q5.5. What is clock skew problem? How is it overcome by using the NORA logic?
Q5.6. Consider a transmission gate with an input terminal tied to GND (0 V), while
the other non-gate terminal is tied to 1 pF load capacitance initially charged to Vdd
(5 V). At time t = 0, both the transistors of the transmission gate are turned ON. Plot
the effective resistance of the transmission gate with time till the voltage on the
capacitor is 0 V.
Q5.7. Obtain the reduced and ordered binary decision diagrams (ROBDD) of the
function f = ( x1 + x2 ).( x3 + x4 ). Realize the function with a minimum number of
LEAP cells.
Q5.8. Why is it necessary to have a swing restoration logic in pass-transistor logic
circuits? Explain its operation.
Q5.9. Why is it necessary to insert a buffer after not more than four pass transistors
in cascade?
Q5.10. How can a sneak path be avoided in a ULM-based realization of a pass-transistor
network? Illustrate with an example.
Q5.11. What are the advantages and limitations of pass-transistor logic circuits?
How are the limitations overcome?
Q5.12. Use LEAP-like pass-transistor logic cells to realize the Boolean function:
f = A + B + C + D + E
x1 x2 x3 y1 y2
0 0 0 0 0
0 0 1 0 1
0 1 0 1 0
0 1 1 1 0
1 0 0 1 1
1 0 1 1 1
1 1 0 1 1
1 1 1 1 1
References
Abstract This chapter presents various sources of power dissipation. At the outset,
the difference between power and energy is explained, because most of the time
when we say low power, we actually mean low energy. Various sources of power
dissipation, both static and dynamic, are introduced first. The expression for short-circuit
power dissipation, which is one of the components of dynamic power dissipation, is derived.
Switching power dissipation, which occurs due to the charging and discharging of
node capacitances, is introduced and the expression for switching power dissipation
is derived. Switching activity for different types of gates is calculated. Switching
activity for dynamic complementary metal–oxide–semiconductor (CMOS) circuits
is also discussed. The expression for power dissipation due to charge sharing in
dynamic CMOS circuits is obtained. Glitching power dissipation, which is another
source of dynamic power dissipation, is introduced along with suitable approaches
to reduce it. Various sources of leakage power dissipation, such as subthreshold
leakage and gate leakage, are presented. Various mechanisms that
affect the subthreshold leakage current are also highlighted.
Keywords Short-circuit power · Switching power · Glitching power · Energy
consumption · Voltage transfer characteristics · Dynamic power · Switching
activity · Transition activity · Leakage power · Diode leakage · Band-to-band
tunneling · Subthreshold leakage · Drain-induced barrier lowering · Body effect ·
Narrow-width effect · Vth rolling · Temperature effect · Hot-carrier injection · Punch
through · Vth roll-off · Gate-induced drain leakage
6.1 Introduction
[Fig. 6.1: two power (watts) versus time profiles, with power levels P1 and P2 and corresponding energies E1 and E2]
portable devices. The average power dissipation will decide the battery lifetime.
Here, we will be concerned mainly with the average power dissipation, although
the techniques used for reducing the average power dissipation will also lead to the
reduction of peak power dissipation and improve reliability by reducing the pos-
sibility of power-related failures.
Power dissipation can be divided into two broad categories—static and dynamic.
Static power dissipation takes place continuously even if the inputs and outputs
do not change. For some logic families, such as nMOS and pseudo-nMOS, both
pull-up and pull-down devices are simultaneously ON for low output level causing
direct current (DC) flow. This leads to static power dissipation. However, in our
low-power applications, we will be mainly using complementary metal–oxide–
semiconductor (CMOS) circuits, where this type of static power dissipation does
not occur. On the other hand, dynamic power dissipation takes place due to a change
in input and output voltage levels.
At this point, it is important to distinguish between the two terms—power and
energy. Although we often use both the terms interchangeably, these two do not
have the same meaning. Power dissipation is essentially the rate at which energy is
drawn from the power supply. Power dissipation is important from the viewpoint of
cooling and packaging of integrated circuit (IC) chips. On the other hand, energy
consumed is important
for battery-driven portable systems. As power dissipation is proportional to the clock
frequency, to reduce power dissipation by half we may reduce the clock frequency
by half. Although this will reduce the power dissipation and keep the chip cooler,
the time required to perform a computation will double that of the previous case. In
this case, the energy is drawn from the battery at half the rate of the previous case.
But, the total energy drawn from the battery for performing a particular computation
remains the same. Figure 6.1 illustrates the difference between power and energy.
The height of the graphs represents the power and the area under the curve repre-
sents the energy. Here, the same energy is drawn for different average power dis-
sipation. For battery-operated portable systems, a reduction in energy consumption
is more important than a reduction in power consumption. Instead of using power as
a performance metric, the power-delay product, which is dimensionally the same as
energy, is the natural metric for low-power systems, where battery life is the index of
energy efficiency. We usually mean low energy when we say low power.
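The power/energy distinction of Fig. 6.1 can be restated with a toy calculation (the 1 W and 2 s figures are arbitrary illustrations, not values from the text): halving the clock frequency halves the power but doubles the computation time, so the energy drawn from the battery is unchanged.

```python
def energy_joules(power_w, time_s):
    """Energy is the area under the power-time curve: E = P * t."""
    return power_w * time_s

full_speed = energy_joules(1.0, 2.0)   # 1 W for 2 s at full clock rate
half_speed = energy_joules(0.5, 4.0)   # 0.5 W for 4 s at half clock rate
print(full_speed, half_speed)          # both 2.0 J
assert full_speed == half_speed        # same battery drain either way
```

This is why the power-delay product, which is dimensionally an energy, is the natural figure of merit for battery-operated systems.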
In CMOS circuits, power dissipation can be divided into two broad categories:
dynamic and static. Dynamic power dissipation in CMOS circuits occurs when the
circuits are in working condition, or active mode, that is, when there are changes in
input and output conditions with time. In this section, we introduce the following three
basic mechanisms involved in dynamic power dissipation:
• Short-circuit power: Short-circuit power dissipation occurs when both the nMOS
and pMOS networks are ON. This can arise due to slow rise and fall times of the
inputs as discussed in Sect. 6.2.
• Switching power dissipation: As the input and output values keep on changing,
capacitive loads at different circuit points are charged and discharged, leading to
power dissipation. This is known as switching power dissipation. Until recently,
this was the most dominant source of power dissipation. The switching power
dissipation is discussed in Sect. 6.3.
• Glitching power dissipation: Due to a finite delay of the logic gates, there are
spurious transitions at different nodes in the circuit. Apart from the abnormal
behavior of the circuits, these transitions also result in power dissipation known
as glitching power dissipation. This is discussed in Sect. 6.4.
Static power dissipation occurs due to various leakage mechanisms. The following
seven leakage current components are discussed in Sect. 6.5:
• Reverse-bias p–n junction diode leakage current
• Reverse-biased p–n junction current due to the tunneling of electrons from the
valence band of the p region to the conduction band of the n region, known as
band-to-band-tunneling current
• Subthreshold leakage current between source and drain when the gate voltage is
less than the threshold voltage Vt
• Oxide-tunneling current due to a reduction in the oxide thickness
• Gate current due to hot-carrier injection of electrons
• Gate-induced drain-leakage (GIDL) current due to a high field effect in the drain
junction
• Channel punch-through current due to the close proximity of the drain and the source
in short-channel devices
When there are finite rise and fall times at the input of CMOS logic gates, both
pMOS and nMOS transistors are simultaneously ON for a certain duration, short-
ing the power supply line to ground. This leads to current flow from supply to
ground. Short-circuit power dissipation takes place for input voltage in the range
Vtn < Vin < Vdd − | Vtp |, when both pMOS and nMOS transistors turn ON creating
a conducting path between Vdd and ground (GND). It is analyzed in the case of a
CMOS inverter as shown in Fig. 6.2. To estimate the average short-circuit current,
we have used the simple model shown in Fig. 6.3. It is assumed that τ is both the rise
and the fall time of the input (τr = τf = τ) and that the inverter is symmetric, i.e., βn = βp = β
and Vtn = −Vtp = Vt. The mean short-circuit current of the inverter having no load
attached to its output is given by

I_mean = (2/T)·[ ∫_{t1}^{t2} I(t) dt + ∫_{t2}^{t3} I(t) dt ] = (4/T)·∫_{t1}^{t2} I(t) dt,   (6.2)

where, for t1 ≤ t ≤ t2, the conducting nMOS transistor is in saturation, so that

I(t) = (β/2)·(Vin(t) − Vt)².   (6.3)

Here, Vin(t) = (Vdd/τ)·t, t1 = (τ/Vdd)·Vt, and t2 = τ/2.
Substituting these in Eq. 6.2, we get

I_mean = (2β/T)·∫_{(τ/Vdd)·Vt}^{τ/2} (Vin(t) − Vt)² dt   (6.4)

= (2β/T)·∫_{(τ/Vdd)·Vt}^{τ/2} ((Vdd/τ)·t − Vt)² dt.   (6.5)

[Fig. 6.3 The input ramp Vin(t), which crosses Vtn at t1 and Vdd − |Vtp| at t3, and the resulting short-circuit current pulse I(t), which peaks at t2 = τ/2]

Substituting (Vdd/τ)·t − Vt = X and (Vdd/τ)·dt = dX, we get

I_mean = (2β/T)·∫_0^{Vdd/2 − Vt} X²·(τ/Vdd) dX
= (2β/T)·(τ/Vdd)·(1/3)·(Vdd/2 − Vt)³
= (2β·τ)/(24·T·Vdd)·(Vdd − 2Vt)³.

This results in

I_mean = (β/(12·Vdd))·(Vdd − 2Vt)³·(τ/T).   (6.6)
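Eq. (6.6) is easy to evaluate numerically. The parameter values below are illustrative, not taken from the text:

```python
# Mean short-circuit current of a symmetric, unloaded CMOS inverter, Eq. (6.6).
beta = 100e-6   # gain factor beta_n = beta_p (A/V^2), assumed
Vdd = 2.5       # supply voltage (V)
Vt = 0.5        # Vtn = -Vtp = Vt (V)
tau = 1e-9      # input rise/fall time (s)
T = 10e-9       # clock period (s), i.e. f = 100 MHz

I_mean = beta / (12 * Vdd) * (Vdd - 2 * Vt)**3 * (tau / T)
P_sc = I_mean * Vdd         # mean short-circuit power drawn from the supply
print(f"I_mean = {I_mean*1e6:.3f} uA, P_sc = {P_sc*1e6:.3f} uW")
```

Note the cubic dependence on (Vdd − 2Vt): modest supply scaling shrinks this term rapidly, which is the point made in the next paragraph.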
As the clock frequency decides how many times the output changes per second,
the short-circuit power is proportional to the frequency. The short-circuit current is
also proportional to the rise and fall times. Short-circuit currents for different input
slopes are shown in Fig. 6.4. The power supply scaling affects the short-circuit
power considerably because of cubic dependence on the supply voltage.
So far, we have assumed that there is no (capacitive) load at the output. In practice,
there will be some load capacitance at the output, and the output rise and fall
times will depend on it. If the load capacitance is very large, the input
changes from high to low or low to high before the output changes significantly. In
such a situation, the short-circuit current will be very small. It is maximum when
there is no load capacitance. The variation of short-circuit current for different output
capacitances is shown in Fig. 6.5. From this analysis, it is evident that the short-circuit
power dissipation can be minimized by making the input rise/fall times
smaller. The short-circuit power dissipation is also reduced by increasing the load
capacitance. However, this makes the circuit slower. One good compromise is to
have equal input and output slopes.
Because of the cubic dependence of the short-circuit power on the supply voltage,
the supply voltage may be scaled to reduce short-circuit power dissipation. It may
also be noted that if the supply voltage is scaled down such that Vdd < (Vtn + |Vtp|),
the short-circuit current is virtually eliminated, because the nMOS and pMOS
transistors never conduct simultaneously for any input value. The variation of the
voltage transfer characteristic (VTC) for different supply voltages is shown in
Fig. 6.6. It is evident from the diagram that the circuit can function as an inverter
even when the supply voltage is equal to or less than the sum of the two threshold
voltages (Vtn + |Vtp|). In such a situation, however, the delay time of the circuit
increases drastically. Figure 6.7 shows the VTC when the supply voltage is less than
the sum of the two threshold voltages. Here, the VTC contains a region where neither
transistor conducts. However, the previous voltage level is maintained by
the charge stored on the output capacitor, and the VTC exhibits hysteresis behavior.
We may conclude this subsection by stating that the short-circuit power dissipation
depends on the input rise/fall time, the clock frequency, the load capacitance,
the gate sizes, and, above all, the supply voltage.
6.3 Switching Power Dissipation
There exists a capacitive load at the output of each gate. The exact value of the capacitance
depends on the fan-out of the gate, the output capacitance, and the wiring capacitances,
and all these parameters depend on the technology generation in use. As the
output changes from a low to a high level and from a high to a low level, the load capacitor
charges and discharges, causing power dissipation. This component of power dissipation
is known as switching power dissipation.
Switching power dissipation can be estimated based on the model shown in
Fig. 6.8. Figure 6.8a shows a typical CMOS gate driving a total output load ca-
pacitance CL. For some input combinations, the pMOS network is ON and nMOS
network is OFF as modeled in Fig. 6.8b. In this state, the capacitor is charged to Vdd
by drawing power from the supply. For some other input combinations, the nMOS
network is ON and pMOS network is OFF, which is modeled in Fig. 6.8c. In this
state, the capacitor discharges through the nMOS network. For simplicity, let us as-
sume that the CMOS gate is an inverter.
[Fig. 6.8 Dynamic power dissipation model: a a CMOS gate driving a total load capacitance CL; b the pMOS network ON, charging CL; c the nMOS network ON, discharging CL]
During the transition of the output from 0 to Vdd, the energy drawn from the
power supply is given by

E0→1 = ∫_0^T p(t) dt = ∫_0^T Vdd·i(t) dt,   (6.8)

where i(t) is the instantaneous current drawn from the supply voltage Vdd, which can
be expressed as

i(t) = CL·(dV0/dt),   (6.9)

V0 being the output voltage. Substituting Eq. 6.9 into Eq. 6.8,

E0→1 = CL·Vdd·∫_0^Vdd dV0 = CL·Vdd².   (6.10)

Regardless of the waveform and the time taken to charge the capacitor, CL·Vdd² is the
energy drawn from the power supply for a 0-to-Vdd transition at the load capacitance.
Part of the energy drawn from the supply is stored in the load capacitance:

Ec = ∫_0^T V0·i(t) dt = ∫_0^Vdd CL·V0 dV0 = (1/2)·CL·Vdd².   (6.11)

This implies that half of the energy is stored in the capacitor, and the remaining half,
(1/2)·CL·Vdd², is dissipated in the pMOS transistor network. During the Vdd-to-0 transition
at the output, no energy is drawn from the power supply, and the charge stored
in the capacitor is dissipated through the nMOS transistor network.
If a square wave of repetition frequency f (= 1/T) is applied at the input, the average
power dissipated per unit time is given by

Pd = CL·Vdd²·(1/T) = CL·Vdd²·f.   (6.12)
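The energy bookkeeping of Eqs. (6.8)–(6.12) can be checked with a few lines of arithmetic; CL, Vdd, and f below are illustrative values:

```python
# Switching energy and power for one CMOS output node.
CL = 0.1e-12   # load capacitance (F), assumed
Vdd = 2.5      # supply voltage (V), assumed
f = 200e6      # switching frequency (Hz), assumed

E_supply = CL * Vdd**2        # energy drawn from the supply per 0->Vdd transition
E_cap = 0.5 * CL * Vdd**2     # half is stored on CL ...
E_pmos = E_supply - E_cap     # ... and half is dissipated in the pMOS network
P_d = CL * Vdd**2 * f         # average switching power for a square-wave input
print(f"E_supply = {E_supply*1e15:.1f} fJ, E_pmos = {E_pmos*1e15:.1f} fJ, "
      f"P_d = {P_d*1e6:.1f} uW")
```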
For an inverter having a load capacitance CL, the dynamic power expression is
CL·Vdd²·f. Here, it is assumed that the output switches from rail to rail and that
input switching occurs in every clock cycle. This simple assumption does not hold for
complex gates, for several reasons. First, apart from the output load capacitance,
there exist capacitances at other nodes of the gate. As these internal nodes
also charge and discharge, dynamic power dissipation takes place on the internal
nodes as well. This leads to two components of dynamic power dissipation: load power and
internal-node power. Second, at different nodes of a gate, the voltage swing may not
be from rail to rail. This reduced voltage swing has to be taken into consideration
when estimating the dynamic power. Finally, to account for the fact that a
capacitive node of a gate might not switch when the clock switches, a concept
known as switching activity is introduced. Switching activity determines how
often switching occurs on a capacitive node. These three issues are considered in
the following subsections.
There are situations where a rail-to-rail swing does not take place on a capacitive
node. This situation arises in pass-transistor logic and when the pull-up device is an
enhancement-type nMOS transistor in nMOS logic gates, as shown in Fig. 6.9. In
such cases, the output can only rise to Vdd − Vt. This situation also occurs at internal
nodes of CMOS gates. Instead of CL·Vdd² for full-rail charging, the energy drawn
from the power supply for charging the capacitance to (Vdd − Vt) is given by

E0→1 = CL·(Vdd − Vt)·Vdd.   (6.13)

[Fig. 6.10 A three-input NAND gate, with output capacitance CL and internal node capacitances C1 and C2; the series nMOS transistors are driven by inputs A, B, and C]

A three-input NAND gate is shown in Fig. 6.10. Apart from the output capacitance
CL, two capacitances C1 and C2 are shown at two internal nodes of the gate. For the
input combination 110, the output is “1” and transistors Q3, Q4, and Q5 are ON. All
the capacitors will draw energy from the supply. Capacitor CL will charge to Vdd
through Q3, and capacitor C1 will charge to (Vdd − Vt) through Q3 and Q4. Capacitor C2
will also charge to (Vdd − Vt) through Q3, Q4, and Q5. For each 0-to-Vdd transition at
an internal node, the energy drawn is given by

E0→1 = Ci·Vi·Vdd,   (6.14)

where Ci is the internal node capacitance and Vi is the internal voltage swing at node i.
For a complex logic gate, the switching activity depends on two factors—the topology
of the gate and the statistical timing behavior of the circuit. To handle the transition-rate
variation statistically, let n(N) be the number of 0-to-Vdd output transitions
in the time interval [0, N]. The total energy EN drawn from the power supply in this
interval is given by

EN = CL·Vdd²·n(N).   (6.15)

The average power dissipation during an extended interval is Pavg = lim_{N→∞} (EN/N)·f,
where f is the clock frequency. Thus,

Pavg = lim_{N→∞} (n(N)/N)·CL·Vdd²·f.   (6.16)

The term lim_{N→∞} (n(N)/N) gives us the expected (average) number of
transitions per clock cycle, which is defined as the switching activity. Therefore,

α0→1 = lim_{N→∞} n(N)/N.   (6.17)

Taking into account the various aspects discussed so far, the dynamic power consumption
is given by

Pd = α0·CL·Vdd²·f + Σ_{i=1}^{k} αi·Ci·Vi·Vdd·f,   (6.18)

where α0 is the switching activity at the output node, αi is the switching activity at
the ith internal node, and f is the clock frequency. Here, it is assumed that there are
k internal nodes.
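Eq. (6.18) can be sketched directly in code. The node list below (capacitances, swings, and activities) is entirely hypothetical:

```python
# Total dynamic power with per-node switching activities, Eq. (6.18).
Vdd, f = 2.5, 100e6                   # supply (V) and clock (Hz), assumed
alpha0, CL = 0.25, 0.2e-12            # output node: full swing, activity 0.25
internal = [                          # (Ci in F, Vi swing in V, alpha_i), assumed
    (0.05e-12, 2.5, 0.10),
    (0.03e-12, 2.0, 0.15),            # reduced-swing internal node (Vi < Vdd)
]

P_d = alpha0 * CL * Vdd**2 * f                          # load-power term
P_d += sum(a * C * Vi * Vdd * f for C, Vi, a in internal)  # internal-node terms
print(f"P_d = {P_d*1e6:.3f} uW")
```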
The switching activity at the output of a static CMOS gate depends strongly on the
function it realizes. Assume that the inputs are independent of each other and
that “0” and “1” occur with equal probability. Let P0 be the probability that
the output is “0” and P1 the probability that the output is “1.” Then,

P0→1 = P0·P1 = P0·(1 − P0).   (6.19)

For an n-input gate, the total number of input combinations is 2^n. Out of the 2^n combinations
in the truth table, let n0 be the number of combinations for which the output
is “0” and n1 the number for which the output is “1.” Then

P0 = n0/2^n and P1 = n1/2^n.   (6.20)

We get

P0→1 = (n0/2^n)·(n1/2^n) = n0·(2^n − n0)/2^{2n}.   (6.21)

For example, consider a two-input NAND gate with the truth table given in
Table 6.1. In this case, P0 = 1/4 and P1 = 3/4, so P0→1 = 3/16.
For NAND/NOR gates, the switching activity decreases as the
number of inputs increases. On the other hand, the switching activity at the output of an EX-OR
gate is 1/4, independent of the number of inputs. Output activities for
different gates are shown in Table 6.2. The variation of the switching activity at the
output of NAND, NOR, and EX-OR gates with the number of inputs
is shown in Fig. 6.11.
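A short script reproduces the trend of Fig. 6.11 from Eqs. (6.19)–(6.21):

```python
# Output switching activity alpha = P0*(1 - P0) for n-input gates
# under independent, equiprobable inputs.
def activity(n0, n):
    """n0 = number of input combinations giving output 0; n = number of inputs."""
    p0 = n0 / 2**n
    return p0 * (1 - p0)

for n in range(2, 7):
    a_nand = activity(1, n)             # NAND: output is 0 only for all-ones
    a_nor = activity(2**n - 1, n)       # NOR: output is 1 only for all-zeros
    a_xor = activity(2**(n - 1), n)     # XOR: exactly half the rows are 0
    print(f"n={n}: NAND {a_nand:.4f}  NOR {a_nor:.4f}  XOR {a_xor:.4f}")
```

NAND and NOR activities fall off with n, while the EX-OR column stays at 1/4, as stated above.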
So far, we have assumed that the inputs are independent of each other and equiprobable.
But inputs might not be equiprobable. In such cases, the probability of transitions
at the output depends on the signal probabilities at the primary inputs. Let
us consider a two-input NAND gate with inputs A and B having mutually independent
signal probabilities PA and PB. The probability of a logical “0” at the output
is P0 = PA·PB and the probability of a “1” at the output is P1 = (1 − P0) = (1 − PA·PB).
Therefore,

P0→1 = P0·P1 = (1 − PA·PB)·PA·PB.   (6.22)

Similarly, for a two-input NOR gate,

P1 = (1 − PA)·(1 − PB),
P0 = 1 − (1 − PA)·(1 − PB),
P0→1 = (1 − PA)·(1 − PB)·[1 − (1 − PA)·(1 − PB)].   (6.23)
In this manner, the switching activity at the output for any complex gate can be ob-
tained. The output switching activity of different gates is given in Table 6.3.
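The same bookkeeping can be captured in two small helper functions (the NOR case follows Eq. 6.23):

```python
# Output switching activity of two-input gates for arbitrary, mutually
# independent input signal probabilities PA and PB.
def nand2_activity(pa, pb):
    p0 = pa * pb                      # output is 0 only when both inputs are 1
    return p0 * (1 - p0)

def nor2_activity(pa, pb):
    p1 = (1 - pa) * (1 - pb)          # output is 1 only when both inputs are 0
    return p1 * (1 - p1)

print(nand2_activity(0.5, 0.5))       # equiprobable case: 3/16 = 0.1875
print(nand2_activity(0.9, 0.1))       # skewed inputs give a different activity
```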
When a multilevel network of complex gates is considered, the inputs to a gate at a
particular stage might not be independent. This can happen because the signal from
one input may propagate through multiple paths and recombine at a later
stage. This problem of re-convergent fan-out is illustrated with the simple example
given in Fig. 6.12. Two circuits are shown, one without re-convergent fan-out
and the other with re-convergent fan-out. In the circuit of Fig. 6.12a, the signals
at nodes E and F are independent, giving a switching activity P0→1 = 63/256
at node G. This computation of the transition probability is quite straightforward: starting from
the primary inputs, signal probabilities are calculated until the output node is reached.
But this approach will not work for circuits having re-convergent fan-out, which
results in inter-signal dependencies. For example, in the circuit of Fig. 6.12b, the signals E
and F are mutually dependent, leading to a different switching activity at node G: here P0→1 = 15/64.
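Exhaustive truth-table enumeration handles re-convergent fan-out correctly, since every input combination is evaluated explicitly. The exact gates of Fig. 6.12 are not spelled out in the text, so the two functions below are a reconstruction (two-level NAND–NAND, with one input shared in the re-convergent case) chosen to be consistent with the 63/256 and 15/64 values quoted above:

```python
# Exact switching activity at a node by enumerating all input combinations.
from itertools import product

def p0_to_1(node_fn, n_inputs):
    """P0->1 = P0*P1 with P1 computed exactly over the full truth table."""
    total = 2**n_inputs
    p1 = sum(node_fn(*bits) for bits in product((0, 1), repeat=n_inputs)) / total
    return (1 - p1) * p1

# Independent case: G = NAND(NAND(A, B), NAND(C, D)) -- four distinct inputs.
g_indep = lambda a, b, c, d: 1 - ((1 - a*b) * (1 - c*d))
# Re-convergent case: input A feeds both first-level gates.
g_reconv = lambda a, b, c: 1 - ((1 - a*b) * (1 - a*c))

print(p0_to_1(g_indep, 4))    # 63/256
print(p0_to_1(g_reconv, 3))   # 15/64
```

Propagating signal probabilities gate by gate would get the first circuit right but the second one wrong, because it ignores the correlation between E and F.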
Example 6.1 Let us now consider three different implementations of a six-input
OR function, which drives a capacitive load of 0.1 pF. The three implementations
are:
[Fig. 6.12 a Circuit without re-convergent fan-out. b Circuit with re-convergent fan-out]
[Figure: the three implementations of the six-input OR function — I: a single six-input gate (node O1) driving output Z; II: two three-input gates (nodes O1, O2) combined into Z; III: three two-input gates (nodes O1, O2, O3) combined into Z. The node signal probabilities are listed in Table 6.5]
Table 6.5 Transition activity at different points and relative performance of the three
implementations

        Implementation I     Implementation II           Implementation III
        O1       Z           O1     O2     Z             O1    O2    O3    Z
P1      63/64    1/64        7/8    7/8    1/64          3/4   3/4   3/4   1/64
P0      1/64     63/64       1/8    1/8    63/64         1/4   1/4   1/4   63/64
P0→1    63/4096  63/4096     7/64   7/64   63/4096       3/16  3/16  3/16  63/4096
Area    9                    11                          13
Delay   1.1                  0.85                        0.87
Pcost   6.7                  42.5                        89.4
The logic style has a strong influence on the node transition probability. For static
CMOS, we calculated the transition probability as P0·P1 = P0·(1 − P0). In the case of
dynamic gates, the situation is different. As we have seen, in the domino
or NORA logic styles, the output nodes are pre-charged to a high level in the pre-charge
phase and then discharged in the evaluation phase, depending on the
outcome of the evaluation. In other words, the output transition probability depends
only on the signal probability P0, i.e.,

P0→1 = P0,   (6.24)

where P0 is the probability that the output is in the zero state. For n independent
inputs to a gate,

P0→1 = N0/2^n,   (6.25)

where N0 is the number of zero entries in the truth table. Table 6.6 gives the output
transition probabilities for NAND, NOR, and inverter gates. Compared to static
CMOS, the transition probabilities are higher, because the output is always pre-charged
to 1, i.e., P1 = 1. It may be noted, therefore, that the switching activity of dynamic
NAND and NOR gates is higher than that of their static counterparts.
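A minimal comparison of Eq. (6.25) against the static-CMOS expression:

```python
# Static vs. dynamic output transition probability for simple gates.
def static_alpha(n0, n):
    p0 = n0 / 2**n
    return p0 * (1 - p0)          # Eq. (6.19): P0*(1 - P0)

def dynamic_alpha(n0, n):
    return n0 / 2**n              # Eq. (6.25): node is pre-charged high each cycle

for name, n0, n in (("NAND2", 1, 2), ("NOR2", 3, 2), ("INV", 1, 1)):
    print(f"{name}: static {static_alpha(n0, n):.4f}, "
          f"dynamic {dynamic_alpha(n0, n):.4f}")
```

Since P0·(1 − P0) ≤ P0, the dynamic value is never smaller, and for NOR gates the gap is large.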
[Fig. 6.14 A three-input dynamic NAND gate clocked by Clk, with inputs A, B, C, output load capacitance CL, and internal node capacitances C1 and C2]
Moreover, in the case of dynamic gates, power dissipation also takes place due to the
phenomenon of charge sharing. Even when the output is not 0 at the time of evaluation,
i.e., the output load capacitance is not discharged, part of the charge of the
load capacitance may get redistributed onto the internal node capacitances, leading to a
reduction in the output voltage level. In the next pre-charge period, the output is
again pre-charged back to Vdd. For example, consider the three-input dynamic NAND
gate shown in Fig. 6.14. After charge sharing, the new voltage level is given by

(C1 + C2 + CL)·Vnew = Vdd·CL,

or

Vnew = CL·Vdd/(C1 + C2 + CL).   (6.26)

Therefore,

ΔV = Vdd − (CL/(C1 + C2 + CL))·Vdd = ((C1 + C2)/(C1 + C2 + CL))·Vdd.   (6.27)
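Plugging illustrative capacitance values into Eqs. (6.26)–(6.27):

```python
# Charge-sharing droop on the output of a dynamic gate; values are assumed.
Vdd = 2.5
CL, C1, C2 = 50e-15, 10e-15, 10e-15   # output and internal node caps (F)

V_new = CL * Vdd / (C1 + C2 + CL)          # output voltage after redistribution
dV = (C1 + C2) / (C1 + C2 + CL) * Vdd      # droop relative to Vdd
print(f"V_new = {V_new:.3f} V, droop = {dV:.3f} V")
```

Even modest internal capacitances (here 40% of CL combined) cost the output node a sizeable fraction of Vdd, which the next pre-charge phase must restore.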
[Fig. 6.15 Input change 101 → 000: the delayed response of node O1 produces a spurious pulse of one gate-delay duration at the output]
In the power calculations so far, we have assumed that the gates have zero delay.
In practice, the gates will have finite delay and this delay will lead to spurious
undesirable transitions at the output. These spurious signals are known as glitches.
In the case of a static CMOS circuit, the output node or internal nodes can make
undesirable transitions before attaining a stable value. Consider the circuit shown
in Fig. 6.15. If the inputs ABC change value from 101 to 000, ideally for zero gate
delay the output should remain at the 0 logic level. However, considering unit gate
delay of the first gate stage, output O1 is delayed compared to the C input. As a
consequence, the output switches to 1 logic level for one gate delay duration. This
transition increases the dynamic power dissipation and this component of dynamic
power is known as glitching power. Glitching power may constitute a significant
portion of dynamic power, if circuits are not properly designed.
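The glitch can be reproduced with a toy unit-delay simulation. The text does not name the gate types in Fig. 6.15, so a two-level NOR realization (O1 = NOR(A, B), Z = NOR(O1, C)) is assumed here; the qualitative behavior — a spurious 1-pulse of one gate-delay duration — is what matters:

```python
# Unit-delay simulation of a glitch on a two-level (assumed NOR) network.
def nor(a, b):
    return 1 - (a | b)

def simulate(new_inputs, steps=5):
    """Settle with ABC = 101, switch to new_inputs at t = 0, return Z trace."""
    a, b, c = 1, 0, 1
    o1 = nor(a, b)                      # settled: O1 = 0
    z = nor(o1, c)                      # settled: Z = 0
    a, b, c = new_inputs
    trace = []
    for _ in range(steps):
        o1, z = nor(a, b), nor(o1, c)   # simultaneous update = one unit delay
        trace.append(z)
    return trace

print(simulate((0, 0, 0)))   # [1, 0, 0, 0, 0]: a one-gate-delay glitch at Z
```

Both steady states give Z = 0, yet the extra CL·Vdd² charge/discharge pair of the glitch is paid on every such input change.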
Usually, cascaded circuits as shown in Fig. 6.16a exhibit high glitching power.
The glitching power can be minimized by realizing a circuit by balancing delays,
as shown in Fig. 6.16b. On highly loaded nodes, buffers can be inserted to balance
delays and cascaded implementation can be avoided, if possible, to minimize glitch-
ing power.
[Fig. 6.16 Realization of a function of A, B, C, and D: a in cascaded form, b balanced realization]
[Fig. 6.17 Summary of the leakage current mechanisms I1–I7 in a deep-submicron transistor]
6.5 Leakage Power Dissipation

When the circuit is not in an active mode of operation, there is static power dissipation
due to various leakage mechanisms. In deep-submicron devices, these leakage
currents are becoming a significant contributor to the power dissipation of CMOS
circuits. Figure 6.17 illustrates the seven leakage mechanisms. Here, I1 is the reverse-bias
p–n junction diode leakage current; I2 is the reverse-biased p–n junction
current due to tunneling of electrons from the valence band of the p region to the
conduction band of the n region; I3 is the subthreshold leakage current between the
source and the drain when the gate voltage is less than the threshold voltage Vt; I4
is the oxide-tunneling current due to a reduction in the oxide thickness; I5 is the gate
current due to hot-carrier injection of electrons; I6 is the GIDL current due to a high
field effect in the drain junction; and I7 is the channel punch-through current due
to the close proximity of the drain and the source in short-channel devices. These
leakage components are discussed in the following subsections.
The reverse-bias p–n junction diode leakage current is governed by the diode equation

Id = Is·(e^{Vd/(n·VT)} − 1),

where Js is the reverse saturation current density (this increases with temperature),
Is = A·Js (A being the junction area), Vd is the diode voltage, n is the emission coefficient of the diode (sometimes
equal to 1), q is the charge of an electron (1.602 × 10⁻¹⁹ C), K is the Boltzmann
constant (1.38 × 10⁻²³ J/K), T is the temperature in K, and VT = KT/q is known as the
thermal voltage. At room temperature, VT ≈ 26 mV.

[Figure: cross-section of a CMOS inverter showing the reverse-biased drain–substrate and drain–well junctions]
The leakage current approaches Is, the reverse saturation current even for a
small reverse-biased voltage across the diode. The reverse saturation current per
unit area is defined as the current density Js, and the resulting current is approxi-
mately IL = AD · Js, where AD is area of drain diffusion. For a typical CMOS process,
J s ≈ 1 − 5 pA/µ m 2 at room temperature, and the AD is about 6 μm2 for a 1.0-μm
minimum feature size. It leads to a leakage current of about 1 fA per device at
room temperature. The reverse diode leakage current Irdlc increases significantly
with temperature.
The total static diode leakage current for n devices is given by

Irdlc = Σ_{i=1}^{n} Ii,

and the total static power dissipation for n devices is equal to

P = Vdd·Σ_{i=1}^{n} Ii.   (6.30)

Then, the total static power dissipation due to the diode leakage current for one million
transistors is given by

P = Vdd·Σ_{i=1}^{10⁶} Idi ≈ 0.01 µW.   (6.31)
This is too small to make any significant impact on the power dissipation when
the circuit is in the normal (active) mode of operation. However, when a particular
portion of the chip remains in the standby mode for a considerable period of time,
this component of power dissipation cannot be ignored.
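As a sketch of Eqs. (6.30)–(6.31): a per-device reverse leakage of about 4 fA (an assumed figure; the actual value depends on Js, the junction area AD, and temperature) reproduces the ≈ 0.01 µW quoted above for one million devices at Vdd = 2.5 V:

```python
# Total standby power from reverse-diode leakage, P = Vdd * sum(I_i).
Vdd = 2.5                   # supply voltage (V), assumed
n_devices = 1_000_000
I_per_device = 4e-15        # assumed reverse leakage per device (A)

P_static = Vdd * n_devices * I_per_device
print(f"P_static = {P_static*1e6:.3f} uW")
```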
[Fig. 6.19 Band diagram of a heavily doped, reverse-biased p–n junction under applied bias qVapp, showing band-to-band tunneling of electrons from Ev on the p side to Ec on the n side]
When both the n and p regions are heavily doped, a high electric field across a
reverse-biased p–n junction causes a significant current to flow through the junction
due to tunneling of electrons from the valence band of the p region to the conduction
band of the n region. This is illustrated in Fig. 6.19. It is evident from this diagram
that tunneling requires the voltage drop across the junction to be more than the band gap Eg. The
tunneling current density is given by

Jb-b = A·(E·Vapp/Eg^{1/2})·exp(−B·Eg^{3/2}/E),

where

A = (√(2m*)·q³)/(4π³·ħ²) and B = (4·√(2m*))/(3·q·ħ),

m* is the effective mass of the electron, E is the electric field at the junction, q is
the electronic charge, and ħ is the reduced Planck's constant (h/2π).
Assuming a step junction, the electric field at the junction is given by

E = √(2·q·Na·Nd·(Vapp + Vbi)/(εsi·(Na + Nd))),

where Na and Nd are the doping concentrations on the p and n sides, respectively, εsi is the permittivity
of silicon, and Vbi is the built-in voltage across the junction. High doping concentrations
and abrupt doping profiles are responsible for a significant band-to-band
tunneling (BTBT) current in scaled devices through the drain–well junction.
The subthreshold leakage current in CMOS circuits is due to carrier diffusion be-
tween the source and the drain regions of the transistor in weak inversion, when
the gate voltage is below Vt. The behavior of an MOS transistor in the subthreshold
operating region is similar to a bipolar device, and the subthreshold current exhibits
an exponential dependence on the gate voltage. The amount of the subthreshold
current may become significant when the gate-to-source voltage is smaller than, but
very close to, the threshold voltage of the device.
The subthreshold current expression as given by the BSIM3 model is stated
below:

Istlc = A·e^{q·(Vgs − Vth)/(n′·KT)}·(1 − e^{−q·Vds/(KT)})
      = µ0·Cox·(W/L)·(m − 1)·VT²·e^{(Vgs − Vth)/(m·VT)}·(1 − e^{−Vds/VT}),

with

m = 1 + Cdm/Cox = 1 + (εsi/Wdm)/(εox/tox) ≈ 1 + 3·tox/Wdm,

where m is the subthreshold swing coefficient, VT = KT/q is the thermal voltage, µ0
is the zero-bias mobility, Cox is the gate oxide capacitance per unit area, Wdm is the
maximum depletion-layer width, and tox is the gate oxide thickness.
The typical value of this current for a single transistor is 1–10 nA. In long-channel
devices, the subthreshold current is independent of the drain voltage for Vds
larger than a few VT. For short-channel devices, however, the current also depends on
the drain voltage, as illustrated in Fig. 6.20. The inverse of the slope of the
log10(Ids) versus Vgs characteristic is called the subthreshold slope (St), which is
given by the following relationship:

St = [d(log10 Ids)/dVgs]⁻¹ = 2.3·(m·kT/q) = 2.3·(kT/q)·(1 + Cdm/Cox),

where Cdm is the capacitance of the depletion layer and Cox is the oxide capacitance. Here,
St is measured in mV per decade of the drain current. Typical values of St for a bulk
CMOS process range from 70 to 120 mV/decade. As a low value of St is desirable,
the gate oxide thickness tox is reduced or the doping concentration is lowered
to achieve this.
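The subthreshold-slope expression can be evaluated for a few assumed Cdm/Cox ratios; the results span roughly 65–120 mV/decade, bracketing the 70–120 mV/decade range quoted above:

```python
# Subthreshold slope St = 2.3*(kT/q)*(1 + Cdm/Cox) at room temperature.
k, q = 1.38e-23, 1.602e-19
vT = k * 300.0 / q                 # thermal voltage, ~26 mV

for ratio in (0.1, 0.4, 1.0):      # Cdm/Cox, assumed values
    St = 2.3 * vT * (1 + ratio)
    print(f"Cdm/Cox = {ratio}: St = {St*1e3:.1f} mV/decade")
```

The ideal limit Cdm/Cox → 0 gives about 60 mV/decade at 300 K, which is why reducing tox (raising Cox) pushes St toward its floor.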
Various mechanisms which affect the subthreshold leakage current are:
• Drain-induced barrier lowering (DIBL)
• Body effect
• Narrow-width effect
• Effect of channel length and Vth roll-off
• Effect of temperature
Fig. 6.20 Log( ID) versus VG at two different drain voltages for 20 × 0.4-µm n-channel transistor
in a 0.35-µm CMOS process
For long-channel devices, the source and drain regions are separated far apart, and
the depletion regions around the drain and source have little effect on the potential
distribution in the channel region. So, the threshold voltage is independent of the
channel length and drain bias for such devices. However, for short-channel devices,
the source and drain depletion widths in the vertical direction and the source–drain
potential have a strong effect on a significant portion of the device, leading to a variation
of the subthreshold leakage current with the drain bias. This is known as the
DIBL effect. Because of the DIBL effect, the potential barrier of a short-channel device
reduces as the drain voltage increases, lowering the threshold voltage and thereby
increasing the subthreshold current.
DIBL occurs when the depletion regions of the drain and the source interact with
each other near the channel surface to lower the source potential barrier. Figure 6.22
shows the surface potential along the channel from the source
to the drain. It is evident from the figure that DIBL occurs for short channel lengths,
and it is further enhanced at high drain voltages. Ideally, the DIBL effect does not
change the value of St, but it does lower Vth. Figure 6.23 illustrates the DIBL effect:
it shows the ID–VG curves for different drain voltages.
[Fig. 6.22 Surface potential versus normalized position y/L along the channel: curve A, L = 6.25 µm, Vds = 0.5 V; curve B, L = 1.25 µm, Vds = 0.5 V; curve C, L = 1.25 µm, Vds = 5 V]
6.5.3.2 Body Effect
When a negative voltage is applied to the substrate with respect to the source, the
well-to-source junction of the device is reverse-biased and the bulk depletion region is
widened. This leads to an increase in the threshold voltage. This effect is known as
the body effect. The threshold voltage equation given below gives the relationship
of the threshold voltage to the body bias:

Vth = Vfb + 2·ψB + √(2·εsi·q·Na·(2·ψB + Vbs))/Cox,
[Fig. 6.23 n-Channel drain current versus gate voltage at VD = 0.1, 2.7, and 4.0 V, illustrating various leakage components: weak inversion, junction leakage, DIBL, and GIDL]
where Vfb is the flat-band voltage, Na is the doping density in the substrate, and

ψB = (KT/q)·ln(Na/ni)

is the difference between the Fermi potential and the intrinsic potential in the substrate.
The variation of the threshold voltage with respect to the substrate bias, dVth/dVbs,
is referred to as the substrate sensitivity:

dVth/dVbs = √(εsi·q·Na/(2·(2·ψB + Vbs)))/Cox.

Substrate sensitivity is higher for higher bulk doping concentration, and it decreases
as the substrate reverse bias increases. At Vbs = 0, it is equal to Cdm/Cox, or m − 1,
where m is also called the body-effect coefficient.
Figure 6.24 shows the ID versus VG curves for different body biases from 0 to −5 V.
The curves show that St does not change with body bias, but IOFF decreases due
to the increase in the threshold voltage.
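The substrate-sensitivity formula can be exercised with illustrative process numbers (all assumed, loosely representative of a sub-micron bulk technology):

```python
# Substrate sensitivity dVth/dVbs = sqrt(eps_si*q*Na/(2*(2*psi_B + Vbs)))/Cox.
import math

q = 1.602e-19
eps_si = 1.04e-10          # permittivity of silicon (F/m)
eps_ox = 3.45e-11          # permittivity of SiO2 (F/m)
tox = 7e-9                 # gate oxide thickness (m), assumed
Cox = eps_ox / tox         # oxide capacitance per unit area (F/m^2)
Na = 1e23                  # substrate doping (m^-3), i.e. 1e17 cm^-3, assumed
psi_B = 0.4                # Fermi potential (V), assumed

sens = []
for Vbs in (0.0, 1.0, 3.0):   # reverse substrate bias (V)
    s = math.sqrt(eps_si * q * Na / (2 * (2 * psi_B + Vbs))) / Cox
    sens.append(s)
    print(f"Vbs = {Vbs} V: dVth/dVbs = {s:.3f}")
```

The printed values decrease with increasing reverse bias, matching the statement above that the sensitivity falls off as the substrate reverse bias grows.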
Taking into consideration all three effects, i.e., weak inversion, DIBL, and the body
effect, the subthreshold leakage can be modeled as

Isubth = A·e^{(1/(m·VT))·(VG − VS − Vth0 − γ′·VS + η·VDS)}·(1 − e^{−VDS/VT}),

where

A = µ0·Cox·(W/Leff)·VT²·e^{1.8}·e^{−ΔVth/(η·VT)},

Vth0 is the zero-bias threshold voltage, VT = KT/q is the thermal voltage, γ′ is the
linearized body-effect coefficient for small values of the source-to-bulk voltage, η is the
DIBL coefficient, Cox is the gate oxide capacitance, µ0 is the zero-bias mobility, m
is the subthreshold swing coefficient (m = 1 + 3·tox/Wdm), and ΔVth is a term introduced to
account for transistor-to-transistor leakage variations.
6.5.3.3 Narrow-Width Effect
The width of a gate, particularly when it becomes narrow, affects the threshold
voltage as shown in Fig. 6.25. The reduction in threshold voltage also leads to an
increase in the subthreshold leakage current (Fig. 6.24).
6.5.3.4 Vth Roll-Off
6.5.3.5 Temperature Effect
[Fig. 6.24 n-Channel log(ID) versus gate voltage for substrate biases VSUB from 0 to −5 V]
[Fig. 6.25 Threshold voltage Vth versus channel length for low and high Vds, illustrating Vth roll-off]
[Figure: n-Channel log(ID) versus gate voltage at −50 °C and 100 °C (VD = 2.7 V); the accompanying table lists the subthreshold slope and off-current versus temperature:]

T (°C)   St (mV/decade)   IOFF (A)
−50      58.2             4.50E−13
−25      65.4             3.80E−12
0        72.6             2.88E−11
25       81.9             1.60E−10
50       88.0             7.07E−10
75       96.6             2.50E−09
100      105.8            7.67E−09
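The temperature trend in the table is roughly the linear-in-T behavior that St = 2.3·m·kT/q predicts. Fitting m from the 25 °C entry and extrapolating is a rough sketch (m itself drifts slightly with temperature, so the match is approximate):

```python
# Predicted subthreshold slope vs. temperature, fitted at the 25 C table entry.
k, q = 1.38e-23, 1.602e-19
m = 0.0819 / (2.3 * k * 298.15 / q)   # swing coefficient fitted at 25 C (~1.39)

preds = {}
for t_c, st_table in [(-50, 58.2), (0, 72.6), (100, 105.8)]:
    st_pred = 2.3 * m * k * (t_c + 273.15) / q * 1e3   # mV/decade
    preds[t_c] = st_pred
    print(f"T = {t_c:4d} C: predicted {st_pred:.1f} mV/dec, table {st_table}")
```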
Device size scaling leads to a reduction in oxide thickness, which in turn results in
an increase in the field across the oxide. The high electric field along with low oxide
thickness leads to the tunneling of electrons from the substrate to the gate and vice
versa, through the gate. The basic phenomenon of tunneling is explained with the
help of a heavily doped n + type poly-silicon gate and a p-type substrate.
[Fig. 6.29 Energy band diagrams of the n+ poly-silicon gate–oxide–p-substrate system: a flat-band condition, with barrier height Φox; b positive gate bias +Vox, with electrons tunneling from the inverted substrate surface; c negative gate bias −Vox, with electrons tunneling from the n+ poly-silicon gate]
Because of their higher mobility, it is primarily electrons that take part in the tunneling
process. An energy band diagram in the flat-band condition is shown in Fig. 6.29a, where
Φox is the Si–SiO2 interface barrier height for electrons. The energy band diagram
changes to that of Fig. 6.29b, when a positive bias is applied to the gate. The elec-
trons at the strongly inverted surface can tunnel through the SiO2 because of the
small width of the potential barrier arising out of small oxide thickness. Similarly,
if a negative gate bias is applied, the electrons from the n + poly-silicon can tunnel
through the oxide layer as shown in Fig. 6.29c.
This results in gate oxide-tunneling current, which violates the classical infinite
input impedance assumption of MOS transistors and thus affects the circuit perfor-
mance significantly.
6.5.3.7 Hot-Carrier Injection
Electrons and holes near the Si–SiO2 interface can gain sufficient energy due to the
high electric field in short-channel MOSFETs and cross the interface potential bar-
rier and enter into the oxide layer as shown in Fig. 6.30. This phenomenon is known
[Fig. 6.30 Hot-carrier injection: energetic carriers near the Si–SiO2 interface surmounting the oxide barrier between the substrate and the n+ poly-silicon gate]
as hot-carrier injection. Because of their lower effective mass and the smaller barrier
height for electrons (3.1 eV) compared to holes (4.5 eV), injection of electrons is more
likely than injection of holes.
Due to a high electric field ( Vdg) across the oxide, a deep depletion region under
the drain overlap region is created, which generates electrons and holes by band-to-
band tunneling. The resulting drain to body current is called GIDL current. When
the negative gate bias is large ( v, gate at zero or negative voltages with respect to
drain at Vdd), the n + region under the gate can be depleted and even inverted as
shown in Fig. 6.31. As a consequence, minority carriers are emitted in the drain
region underneath the gate. As the substrate is at a lower potential, the minority car-
riers at the drain depletion region are swept away laterally to the substrate, creating
a GIDL current. Thinner oxide thickness and higher Vdd enhance the electric field
and therefore enhance GIDL current.
At low drain doping, the electric field is not high enough to cause tunneling. At
very high drain doping, the depletion width and hence the tunneling is limited. The
GIDL is worst for moderate doping, when the depletion width and electric field
are both considerable.
6.5.3.9 Punch-Through
Due to the proximity of the drain and the source in short-channel devices, the deple-
tion regions at the source–substrate and drain–substrate junctions extend into the
channel. If the doping is kept constant while the channel length is reduced, the
separation between the depletion region boundaries decreases. Increased reverse bias
(higher Vds) across the junction decreases the separation further. When the depletion
regions merge, a majority of the carriers in the source enter the substrate and are
collected by the drain. This situation is known as the punch-through condition.
[Fig. 6.31 (a, b): depletion edge under the n+ poly gate ( Vg < 0, Vd = VDD) and tunnel-created minority carriers in the p-substrate giving rise to GIDL current]
The punch-through voltage varies approximately as VPT ∝ NB · (L − Wj)^3,
where NB is the doping concentration at the bulk, L is the channel length, and Wj is
the junction width.
One method for controlling punch-through is to use a halo implant at the leading
edges of the drain–source junctions.
[Figs. 6.32 and 6.33: switching power, leakage power, short-circuit power, static power, and standby power (W) versus technology generation (µm)]
6.6 Conclusion
In this chapter, various sources of power dissipation in digital CMOS circuits have
been presented. The contribution of various sources of power dissipation in the
total power for present-generation static CMOS circuits is shown in Fig. 6.32. It is
evident from the figure that the switching power dissipation constitutes 80–90 %
of the total power dissipation. The next dominant source of power dissipation is the
subthreshold leakage current, which constitutes 10–30 % of the total power dissipation.
However, as the size of the MOS transistors shrinks, the power dissipation due to
leakage current is increasing rapidly, as shown in Fig. 6.33. It is anticipated that the
dynamic power and subthreshold leakage power dissipation will be comparable in
terms of percentage of the total power in the next generation circuits of submicron
technology. Both short-circuit power dissipation and static power dissipation con-
stitute about 5 % of the total power dissipation.
However, the above situation holds good only when the circuit is switching at the
operating frequency. This is not true when some subsystems remain in standby
mode. In such cases, standby power dissipation due to the diode leakage current and
the subthreshold leakage current takes place, of which the subthreshold leakage
current is dominant.
From the equation for dynamic power, we find that, because of quadratic depen-
dence of the dynamic power on the supply voltage, the supply voltage reduction is
the dominant technique for realizing low-power circuits. The other parameters that
affect power (or energy) are the switching activity α and the capacitance CL. The
product of the switching activity and capacitance, known as switched capacitance,
is another parameter that can be minimized to reduce power dissipation. In the sub-
sequent chapters, we discuss various low-power synthesis approaches at different
levels of design hierarchy.
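The quadratic dependence discussed above can be illustrated with a small numerical sketch. This is a hedged example: the activity, capacitance, and frequency values below are made-up assumptions, not figures from the text.

```python
# Illustrative sketch of the dynamic (switching) power relation
# P = alpha * C_L * Vdd^2 * f discussed above.
# All numbers are assumed values, not taken from the text.

def switching_power(alpha: float, c_load: float, vdd: float, freq: float) -> float:
    """Switching power in watts: alpha * C_L * Vdd^2 * f."""
    return alpha * c_load * vdd ** 2 * freq

# Halving Vdd at fixed activity, capacitance, and frequency
# quarters the power -- the quadratic dependence.
p_full = switching_power(0.2, 100e-15, 3.3, 150e6)
p_half = switching_power(0.2, 100e-15, 3.3 / 2, 150e6)
print(round(p_full / p_half, 6))  # -> 4.0
```

Reducing the switched capacitance αCL lowers power only linearly, which is why supply voltage scaling is treated as the primary lever in the subsequent chapters.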
6.7 Chapter Summary
6.8 Review Questions
Q6.1. Derive the expression for short-circuit power. How does the short-circuit
power vary with the load capacitance?
Q6.2. What is switching activity? Calculate the switching activity at the output of
the following circuit.
Q6.3. A 32-bit off-chip bus operating at 3.3 V and 150 MHz clock rate is driving a
capacitance of 15 pF/bit. Each bit is estimated to have a toggling probability of 0.20
at each clock cycle. What is the power dissipation in operating the bus?
Q6.4. Derive the expression for short-circuit power dissipation of a CMOS inverter.
How is it affected for different load capacitances?
Q6.5. What is body bias? How can it be used to reduce static power dissipation?
Q6.6. What is subthreshold leakage current? Briefly explain the mechanisms that
affect subthreshold leakage current.
Q6.7. For a two-input NAND gate, assume that the supply voltage = 5 V, the output
capacitance = 100 fF, input capacitance = 40 fF, and average delay = 0.30 + 1.20C0
(in ns). Also, assume that the inputs are uncorrelated and random in nature.
References 173
References
1. Bellaouar, A., Elmasry, M.I.: Low-Power Digital VLSI Design: Circuits and Systems. Kluwer,
Norwell (1995)
2. Chandrakasan, A.P., Sheng, S., Brodersen, R.W.: Low-power CMOS digital design. IEEE J.
Solid-State Circuits 27(4), 473–484 (1992)
3. Chandrakasan, A.P., Brodersen, R.W.: Low-Power Digital CMOS Design. Kluwer, Norwell
(1995)
4. Roy, K., Mukhopadhyay, S., Mahmoodi-Meimand, H.: Leakage current mechanisms and
leakage reduction techniques in deep-submicrometer CMOS circuits. Proc. IEEE 91(2), 305–
327 (2003)
Chapter 7
Supply Voltage Scaling for Low Power
Abstract This chapter focuses on supply voltage scaling which is the most effective
way to reduce power dissipation. First, the challenges involved in supply voltage
scaling for low power are highlighted. Then, the difference between constant-field
and constant-voltage scaling are explained in the context of feature size scaling. The
short-channel effects arising out of feature size scaling are also discussed. Architec-
ture-level approaches for low power, using parallelism and pipelining are explored.
Multi-core processor architecture as an approach for low power is explained. Volt-
age scaling techniques using high-level transformations are presented. The mul-
tilevel voltage scaling (MVS) approach is introduced and various challenges in
MVS are discussed. The implementation of dynamic voltage and frequency scaling
(DVFS) approach is presented. Then, a closed-loop approach known as adaptive
voltage scaling (AVS) is described, which monitors performance at execution
time to estimate the required supply voltage and performs voltage scaling
accordingly. Finally, subthreshold circuits are introduced that operate with a supply
voltage less than the threshold voltage of the metal–oxide–semiconductor (MOS)
transistors, resulting in a significant reduction of power dissipation at the cost of
longer delay.
Keywords Static voltage scaling · Multilevel voltage scaling · Dynamic voltage
scaling · Adaptive voltage scaling · Feature size scaling · Constant-field scaling ·
Constant-voltage scaling · Short-channel effects · Parallelism for low power ·
Pipelining for low power · Multi-core for low power · High-level transformations ·
Voltage-scaling interfaces · Level converters · Converter placement · Dynamic
voltage scaling (DVS) · Dynamic voltage and frequency scaling · Workload
prediction · Adaptive voltage scaling
7.1 Introduction
Fig. 7.1 a Variation of normalized energy with respect to supply voltage; b variation of delay with
respect to supply voltage
Although the dynamic power has three components, the switching power is the
most dominant. The switching power, Pswitching = α0 · CL · Vdd^2 · f, caused by
the charging and discharging of capacitances at different nodes of a circuit, can be
reduced by reducing each of its factors: the clock frequency f, the
total switched capacitance Σ αi Ci, and the supply voltage Vdd. Another dynamic
power component, the glitching power, is often neglected, but it can account for
up to 15 % of the dynamic power. The third component, the short-circuit power,
captures the power dissipated by the short-circuit current that flows be-
tween the supply voltage and ground (GND) when a CMOS logic gate switches
from 0 to 1 or from 1 to 0. It can be minimized by minimizing the rise and fall
times. The static power dissipation also has three dominant components. The most
significant among them is the subthreshold leakage power due to the flow of current
between the drain and source. The second important component is the gate leakage
power due to the tunneling of electrons from the bulk silicon through the gate oxide
potential barrier into the gate. In sub-50-nanometer devices, the source–substrate
and drain–substrate reversed p–n junction band-to-band tunneling current, the third
component, is also large. Because of the quadratic dependence of dynamic power
on the supply voltage, supply voltage scaling was initially developed to reduce dy-
namic power. But, the supply voltage scaling also helps to reduce the static power
because the subthreshold leakage power decreases due to the reduction of the drain-
induced barrier lowering (DIBL), the gate-induced drain leakage (GIDL), and the
gate tunneling current as well. It has been demonstrated that supply voltage scaling
reduces the subthreshold leakage and gate leakage currents roughly as Vdd^3 and
Vdd^4, respectively.
From the above discussion, it is quite evident that reducing the supply voltage,
Vdd, is the most effective way to reduce both dynamic and static power dissipations.
A plot of the normalized energy with supply voltage variation is shown in Fig. 7.1.
The normalized energy, which is equivalent to the power-delay product (PDP), can
be considered as the most appropriate performance metric for low-power applica-
tions. Unfortunately, this reduction in power dissipation comes at the expense of
performance. The delay of a circuit is related to the supply voltage by the following
equation:
Delay ∝ Vdd / (Vdd − Vt)^2 = 1 / [Vdd · (1 − Vt/Vdd)^2].  (7.2)
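Equation (7.2) can be explored numerically. The sketch below is illustrative: the threshold voltage of 0.7 V and the 3.3 V reference supply are assumed values, not parameters from the text.

```python
# Hedged sketch of Eq. (7.2): Delay ~ Vdd / (Vdd - Vt)^2.
# Vt and the supply voltages are illustrative assumptions.

def relative_delay(vdd: float, vt: float = 0.7) -> float:
    """Delay up to a proportionality constant, per Eq. (7.2)."""
    return vdd / (vdd - vt) ** 2

ref = relative_delay(3.3)
for vdd in (3.3, 2.0, 1.2, 0.9):
    # Delay grows sharply as Vdd approaches Vt.
    print(f"Vdd = {vdd:.1f} V  delay = {relative_delay(vdd) / ref:.2f}x")
```

As Vdd approaches Vt, the delay penalty explodes; this is the performance cost of voltage scaling referred to above.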
Table 7.1 Recent history of device size scaling for CMOS circuits
Year:              1985 1987 1989 1991 1993 1995 1997 1999 2003 2005 2007 2009
Feature size (µm): 2.5  1.7  1.2  1.0  0.8  0.5  0.35 0.25 0.18 0.090 0.065 0.045
CMOS complementary metal–oxide–semiconductor
7.2.1 Constant-Field Scaling
In this approach, the magnitudes of all the internal electric fields within the device
are preserved, while the dimensions are scaled down by a factor of S. This requires
[Fig. 7.2 panels: a, b gate delay (ps) versus channel length (µm) for n-channel and p-channel devices; c gate oxide thickness (Å) and d operating voltage (V) versus channel length (µm)]
Fig. 7.2 Trends in metal–oxide–semiconductor (MOS) device scaling
Fig. 7.3 Scaling of a typical metal–oxide–semiconductor field-effect transistors (MOSFET) by a
scaling factor S
that all potentials must be scaled down by the same factor. Accordingly, supply and
threshold voltages are scaled down proportionately. This also dictates that the dop-
ing densities are to be increased by a factor of S to preserve the field conditions. A
list of scaling factors for all device dimensions, potentials, and doping densities are
given in Table 7.2.
Table 7.2 Constant-field scaling of the device dimensions, voltages, and doping densities
Quantity                Before scaling    After scaling
Channel length          L                 L′ = L/S
Channel width           W                 W′ = W/S
Gate oxide thickness    tox               tox′ = tox/S
Junction depth          xj                xj′ = xj/S
Power supply voltage    Vdd               Vdd′ = Vdd/S
Threshold voltage       VT0               VT0′ = VT0/S
Doping densities        NA                NA′ = NA · S
                        ND                ND′ = ND · S
As both length and width parameters are scaled down by the same factor, the W/L
remains unchanged. So, the transconductance parameter Kn is also scaled by a factor
S. Both linear-mode and saturation-mode drain currents are reduced by a factor of
S, as given below:
I′ds(lin) = (K′n/2) · [2(V′gs − V′T) · V′ds − V′ds^2]  (7.4)
          = (S · Kn/2) · (1/S^2) · [2(Vgs − VT) · Vds − Vds^2]  (7.5)
          = Ids(lin)/S.  (7.6)
As both supply voltage and the drain current are scaled down by a factor of S, the
power dissipation is reduced by a factor of S2. This significant reduction in power
dissipation is the most attractive feature of the constant-field scaling approach:
P = Ids · Vds
P′ = I′ds · V′ds = (1/S^2) · Ids · Vds = P/S^2.  (7.7)
Another important parameter is the power density per unit area, which remains
unchanged because the area and power dissipation, both are reduced by a factor of
S2. As the gate oxide capacitance reduces by a factor of S, there will be a reduction
in both the rise-time and fall-time of the device. This leads to the reduction of delay
Important benefits of constant-field scaling are: (i) smaller device sizes leading to
a reduced chip size, higher yield, and more number of integrated circuits (ICs) per
wafer, (ii) higher speed of operation due to smaller delay, and (iii) reduced power
consumption because of the smaller supply voltage and device currents.
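The rules of Table 7.2 can be captured in a minimal sketch; the starting device parameters below are illustrative assumptions, not values from the text.

```python
# Hedged sketch of constant-field scaling (Table 7.2): all dimensions and
# voltages shrink by S, doping densities grow by S.
# The starting values are illustrative assumptions.

def constant_field_scale(params: dict, s: float) -> dict:
    """Apply the Table 7.2 scaling rules for a scaling factor S."""
    return {
        "L": params["L"] / s, "W": params["W"] / s,
        "tox": params["tox"] / s, "xj": params["xj"] / s,
        "Vdd": params["Vdd"] / s, "Vt": params["Vt"] / s,
        "NA": params["NA"] * s, "ND": params["ND"] * s,
    }

# A hypothetical 1 um device scaled by S = 2.
dev = {"L": 1.0, "W": 10.0, "tox": 0.02, "xj": 0.5,
       "Vdd": 5.0, "Vt": 1.0, "NA": 1e17, "ND": 1e20}
scaled = constant_field_scale(dev, 2.0)
print(scaled["Vdd"], scaled["L"])  # -> 2.5 0.5
# Current and voltage both fall by S, so power falls by S^2 (Eq. 7.7);
# since area also falls by S^2, the power density stays constant.
```

Because current and supply voltage both scale by 1/S, the P/S² power reduction of Eq. (7.7) follows directly.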
7.2.2 Constant-Voltage Scaling
In constant-voltage scaling, all the device dimensions are scaled down by a factor of
S, just as in constant-field scaling. However, in many situations, scaling of the supply
voltage may not be feasible in practice. For example, if the supply voltage of a cen-
tral processing unit (CPU) is scaled down to minimize power dissipation, it leads
to electrical incompatibility with peripheral devices, which usually operate at higher
supply voltages. It may be necessary to use multiple supply voltages and compli-
cated level translators to resolve this problem. In such situations, constant-voltage
scaling may be preferred. In the constant-voltage scaling approach, the power supply
voltage and the threshold voltage of the device remain unchanged. To preserve the
charge–field relations, however, the doping densities have to be scaled by a factor
of S2. Key device dimensions, voltages, and doping densities for constant-voltage
scaling are shown in Table 7.4.
Constant-voltage scaling results in an increase in drain current (both in linear
mode and in saturation mode) by a factor of S. This, in turn, results in an increase
in the power dissipation by a factor of S and the power density by a factor of S3, as
shown in Table 7.5. As there is no decrease in delay, there is also no improvement in
performance. This increase in power density by a factor of S3 has possible adverse
effects on reliability such as electromigration, hot-carrier degradation, oxide break-
down, and electrical overstress.
Table 7.4 Constant-voltage scaling of the device dimensions, voltages, and doping densities
Quantity                Before scaling    After scaling
Channel length          L                 L′ = L/S
Channel width           W                 W′ = W/S
Gate oxide thickness    tox               tox′ = tox/S
Junction depth          xj                xj′ = xj/S
Power supply voltage    Vdd               Vdd′ = Vdd
Threshold voltage       Vt0               Vt0′ = Vt0
Doping densities        NA                NA′ = NA · S^2
                        ND                ND′ = ND · S^2
I′ds(lin) = S · Ids(lin),  (7.9)
I′ds(sat) = S · Ids(sat),  (7.10)
P′ = I′ds · V′ds = (S · Ids) · Vds = S · P.
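The contrast between the two scaling schemes can be summarized in a small sketch. The scheme names and factors follow Tables 7.2 and 7.4; the starting current and voltage values are illustrative assumptions.

```python
# Hedged comparison of how drain current and power scale under the two
# schemes discussed above (Eqs. (7.6)-(7.7) vs. Eqs. (7.9)-(7.10)).

def scaled_current_and_power(i_ds: float, vdd: float, s: float, scheme: str):
    """Return (I', P') after scaling by factor S under the given scheme."""
    if scheme == "constant-field":      # I' = I/S, V' = V/S -> P' = P/S^2
        return i_ds / s, (i_ds / s) * (vdd / s)
    if scheme == "constant-voltage":    # I' = S*I, V' = V   -> P' = S*P
        return i_ds * s, (i_ds * s) * vdd
    raise ValueError(scheme)

i, v, s = 1e-3, 5.0, 2.0
for scheme in ("constant-field", "constant-voltage"):
    i_new, p_new = scaled_current_and_power(i, v, s, scheme)
    # Power ratio relative to the unscaled device: 1/S^2 vs. S.
    print(f"{scheme}: I' = {i_new:.1e} A, P'/P = {p_new / (i * v):.2f}")
```

The printout makes the trade-off explicit: constant-field scaling cuts power by S², while constant-voltage scaling increases it by S, with the reliability consequences noted in the text.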
7.2.3 Short-Channel Effects
However, along with the benefits mentioned above, scaling has some unwanted
side effects, commonly referred to as short-channel effects, which have been dis-
cussed at length in Sect. 2.6. Short-channel effects arise when the channel length is of
the same order of magnitude as the depletion region thickness of the source and drain
junctions, or when it is approximately equal to the source and drain junction
depths. As the channel length is reduced below 0.6 μm, the short-channel effects
start manifesting. This leads to an increase in subthreshold leakage current, a reduc-
tion in threshold voltage with decreasing channel length, and a linear, rather than
quadratic, dependence of the saturation current on the gate-to-source voltage.
Moreover, if the channel length is scaled down without scaling the supply voltage
(constant-voltage scaling), the electric field across the gate oxide continues to
increase, creating hot carriers. Hot carriers can cause an avalanche breakdown of
the gate oxide. It is necessary to restrict the maximum electric field across the gate
oxide to 7 MV/cm, which translates into 0.7 V per 10 Å of gate oxide thickness. For
gate oxide thickness of 70 Å, the applied gate voltage should be limited to 4.9 V for
long-term reliable operation.
From the above discussion, it is evident that voltage scaling and scaling down of
the device feature size are complementary to each other. However, if voltage scaling
is not done along with the scaling of feature size because of some other design con-
straints, it will be necessary to use an appropriate measure to reduce the number of
hot carriers. One technique is to use lightly doped drain structure shown in Fig. 7.4.
The physical device structure is modified so that the carriers do not gain energy
from the field to become hot carriers. Of course, the performance of the device is
traded to obtain long-term reliability.
7.3 Architectural-Level Approaches
Fig. 7.5 a A 16-bit adder; b parallel architecture of the 16-bit adder. MUX multiplexer
If the threshold voltage is scaled by the same factor as the supply voltage, the maxi-
mum frequency of operation is roughly linearly dependent on the power supply
voltage. Reducing the supply voltage forces the circuit to operate at a lower fre-
quency. In simple terms, if the supply voltage is reduced by half, the power is re-
duced to one fourth and the performance is lowered by half. The loss in performance
can be compensated by parallel processing. This involves splitting the computation
into two independent tasks running in parallel. This has the potential to reduce the
power by half without reduction in the performance. Here, the basic approach is to
trade the area for power while maintaining the same throughput.
To illustrate this, let us consider the example of Fig. 7.5a, where two 16-bit reg-
isters supply two operands to a 16-bit adder. This is considered the reference
architecture and all the parameters, such as power supply voltage, frequency of
operation, power dissipation, etc., of this architecture are referred by ref notation.
If 10 ns is the delay of the critical path of the adder, then the maximum clock fre-
quency that can be applied to the registers is f ref = 100 MHz. In this situation, the
estimated dynamic power of the circuit is:
Pref = Cref · Vref^2 · fref,  (7.12)
where Cref is the total effective switched capacitance, which is the sum of the prod-
ucts of the switching activities with the node capacitances, that is,
Cref = Σ αi Ci.  (7.13)
Without reducing the clock frequency, the power dissipation cannot be reduced by
reducing the supply voltage. However, the same throughput (number of operations per
unit time) can be maintained by the parallel architecture shown in Fig. 7.5b. Here,
the adder has been duplicated, and the input registers are clocked at half
the frequency, fref/2. This allows the supply voltage to be reduced such that the critical
path delay is no more than 20 ns. With the same 16-bit adder, the power supply
can be reduced to about half of Vref. Because of the duplication of the adder, the ca-
pacitance increases by a factor of two; with the extra routing to both
adders, the effective capacitance is about 2.2 times Cref.
Therefore, the estimated power dissipation of this parallel implementation is:
Ppar = (2.2 Cref) · (Vref/2)^2 · (fref/2)  (7.14)
     = (2.2/8) · Pref = 0.275 Pref.  (7.15)
This shows that the power dissipation reduces significantly. The impact of parallel-
ism is highlighted in Table 7.6. Column 2 shows parallelism for higher performance
without voltage scaling having larger power dissipation, whereas column 3 corre-
sponds to parallelism for low power with voltage scaling and without degradation
of performance.
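Equations (7.14) and (7.15) amount to a one-line calculation; the sketch below parameterizes the capacitance, voltage, and frequency factors (2.2, 1/2, and 1/2, as quoted in the text).

```python
# Hedged sketch of Eqs. (7.14)-(7.15). The parallel datapath switches about
# 2.2x the reference capacitance (duplication plus extra routing) but runs
# at half the supply voltage and half the clock frequency.

def power_ratio(cap_factor: float, v_factor: float, f_factor: float) -> float:
    """P_new / P_ref for P = C * Vdd^2 * f."""
    return cap_factor * v_factor ** 2 * f_factor

print(round(power_ratio(2.2, 0.5, 0.5), 6))  # -> 0.275
```

The quadratic weight on the voltage factor is what makes parallelism pay off: a 2.2x capacitance penalty is overwhelmed by the (1/2)² voltage term.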
[Fig. 7.6: four-core multiplier architecture — input latches feed four multiplier cores clocked at a reduced rate, with a multiplexer and multiphase clock control combining the outputs]
The idea behind the parallelism for low power can be extended for the realiza-
tion of multi-core architecture. Figure 7.6 shows a four-core multiplier architecture.
Table 7.7 shows how the clock frequency can be reduced with commensurate scal-
ing of the supply voltage as the number of cores is increased from one to four while
maintaining the same throughput. This is the basis of the present-day multi-core
commercial processors introduced by Intel, AMD, and other processor manufac-
turers. Thread-level parallelism is exploited in multi-core architectures to increase
throughput of the processors.
[Fig. 7.7: a a 16-bit adder clocked at fref; b two-stage pipelined realization using 8-bit adder stages with interstage latches]
Instead of reducing the clock frequency, in pipelined approach, the delay through
the critical path of the functional unit is reduced such that the supply voltage can be
reduced to minimize the power. As an example, consider the pipelined realization of
16-bit adder using two-stage pipeline shown in Fig. 7.7. In this realization, instead
of 16-bit addition, 8-bit addition is performed in each stage. The critical path delay
through the 8-bit adder stage is about half that of the 16-bit adder. Therefore, the
8-bit adder stages can operate at the clock frequency of 100 MHz with a reduced power
supply voltage of Vref/2. It may be noted that in this realization the area penalty is
much smaller than for the parallel implementation, leading to Cpipe = 1.15 Cref. Substituting
these values, we get:
Ppipe = Cpipe · Vpipe^2 · fpipe = (1.15 Cref) · (Vref/2)^2 · fref = 0.2875 Pref.  (7.16)
It is evident that the power reduction is very close to that of the parallel implementa-
tion, with the additional bonus of a reduced area overhead. The impact of pipelining
is highlighted in Table 7.8. Here, column 2 shows pipelining for improved perfor-
mance with larger power dissipation, a higher clock frequency, and no voltage
scaling, whereas column 3 corresponds to pipelining for low power with voltage
scaling and without degradation of performance.
[Fig. 7.8: combined parallel and pipelined realization of the 16-bit adder using 8-bit adder stages, latches, and a multiplexer, clocked at fref/2]
The effective switched capacitance Cparpipe will be larger than in the previous cases
because of the duplication of functional units and the larger number of latches; it is
assumed to be 2.5 Cref. The supply voltage can be reduced more aggressively, to
about 0.3 Vref, and the frequency of operation is halved to fref/2. Thus,
Pparpipe = (2.5 Cref) · (0.3 Vref)^2 · (fref/2)  (7.18)
         = 0.1125 Pref.
Table 7.9 highlights the impact of combined parallelism and pipelining. Without Vdd
scaling, the performance can be improved by four times with an increase in power
Table 7.9 Impact of parallelism and pipelining on area, power, and throughput
Parameter Without Vdd scaling With Vdd scaling
Area 2.5× 2.5×
Power 5.0× 0.1125×
Throughput 4× 1×
Fig. 7.9 a A first-order infinite impulse response (IIR) filter; b directed acyclic graph (DAG) cor-
responding to the IIR filter
dissipation by five times. On the other hand, with Vdd scaling, the power dissipation
can be reduced by about 90 %, without the degradation of performance.
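The trade-offs of Table 7.9 and Eqs. (7.15), (7.16), and (7.18) can be recomputed in a few lines; the capacitance, voltage, and clock factors below are the ones quoted in the text for each architecture.

```python
# Hedged recap of the architecture trade-offs: each entry gives the
# (capacitance, Vdd, clock) factors relative to the reference adder,
# as quoted in Eqs. (7.15), (7.16), and (7.18).

architectures = {
    "reference":          (1.0, 1.0, 1.0),
    "parallel":           (2.2, 0.5, 0.5),
    "pipelined":          (1.15, 0.5, 1.0),
    "parallel+pipelined": (2.5, 0.3, 0.5),
}

# P/Pref = (C factor) * (V factor)^2 * (f factor)
ratios = {name: c * v ** 2 * f for name, (c, v, f) in architectures.items()}
for name, r in ratios.items():
    print(f"{name:20s} P/Pref = {r:.4f}")
```

Running this reproduces the roughly 72 %, 71 %, and 89 % power reductions of the parallel, pipelined, and combined realizations, respectively.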
Fig. 7.10 Directed acyclic graph (DAG) after unrolling
Fig. 7.11 Directed acyclic graph (DAG) after unrolling and using distributivity and constant
propagation
the same supply voltage. Normalized throughput and power dissipation are shown
in Fig. 7.10.
However, loop unrolling facilitates other transformations, such as distributiv-
ity, constant propagation, and pipelining, to be readily applied. For example, after
applying distributivity and constant propagation, the output samples can be repre-
sented by the following equations:
YN−1 = XN−1 + K · YN−2,
YN = XN + K · YN−1 = XN + K · XN−1 + K^2 · YN−2.  (7.20)
The DAG corresponding to these equations is shown in Fig. 7.11. It may be noted
that the critical path is now reduced to 3, i.e., the computation of two samples can be
completed in three time steps. The computation can further be pipelined into two stages:
Fig. 7.12 Directed acyclic graph (DAG) after unrolling and pipelining
Stage 1:
Z = XN + K · XN−1.
Stage 2:
YN−1 = XN−1 + K · YN−2,
YN = Z + K^2 · YN−2.  (7.21)
The corresponding DAG is shown in Fig. 7.12. It may be noted that the critical
path of both the stages is now reduced to two. To maintain the throughput level,
the sample rate can be reduced to half with a corresponding reduction of the supply
voltage to 2.9 V. This results in the reduction in power by 50 %. Alternatively, the
throughput can be doubled by keeping the supply voltage and sample rate same as
in the case of Fig. 7.9b. This, however, increases the normalized power dissipation
by about 50 % because of the increase in effective capacitance by 50 %.
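The algebra of Eq. (7.20) can be checked numerically. The sketch below compares the original first-order IIR recursion y[n] = x[n] + K·y[n−1] with its two-sample unrolled form; K and the input samples are made-up values.

```python
# Hedged check that the unrolled two-sample recurrence of Eq. (7.20)
# reproduces the original first-order IIR update y[n] = x[n] + K*y[n-1].
# K and the input samples are illustrative values.

K = 0.5
x = [1.0, 2.0, 3.0, 4.0]

# Original recursion, one sample per step (y[-1] = 0).
y = [0.0]
for xn in x:
    y.append(xn + K * y[-1])

# Unrolled form: compute two outputs directly from y[n-2] (Eq. 7.20).
yu = [0.0]
for n in range(0, len(x), 2):
    y_first = x[n] + K * yu[-1]                       # Y(N-1)
    y_second = x[n + 1] + K * x[n] + K ** 2 * yu[-1]  # Y(N)
    yu += [y_first, y_second]

print(y[1:] == yu[1:])  # -> True
```

The equality confirms that unrolling changes only the dependency structure (and hence the critical path), not the computed samples.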
This simple example highlights the point that a particular transformation may
have conflicting effects on the different parameters of power dissipation. In this
case, loop unrolling leads to reduction of supply voltage but increase in effective
capacitance. However, the reduction in the supply voltage more than compensates
for the increase in capacitance resulting in overall reduction in power dissipation.
In the above example, an unrolling factor of 2 (two samples are computed)
has been considered to demonstrate the effectiveness of the loop unrolling for the
[Fig. 7.13: normalized power and voltage (at fixed throughput), capacitance, and speedup as functions of the unrolling factor]
Fig. 7.14 Assignment of multiple supply voltages based on delay on the critical path
As high-Vdd gates have less delay but higher dynamic and static power dissipation,
devices on time-critical paths can be assigned the higher Vdd, while devices on noncriti-
cal paths are assigned the lower Vdd, so that the total power consumption can be
reduced without degrading the overall circuit performance [5]. Figure 7.14 shows
that the delay on the critical path is 10 ns, whereas delay on the noncritical path is
8 ns. The gates on the critical path are assigned higher supply voltage VddH. The
slack of the noncritical path can be traded for lower switching power by assigning
lower supply voltage VddL to the gates on the noncritical path.
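The slack-trading idea of Fig. 7.14 can be sketched as a simple test: a path moves to VddL only if it still meets the clock period after the resulting slowdown. The supply values, path data, and the delay-stretch factor at VddL are all illustrative assumptions.

```python
# Hedged sketch of the dual-Vdd assignment in Fig. 7.14: paths with enough
# slack are moved to VddL, cutting their switching power by (VddL/VddH)^2.
# All numeric values here are illustrative assumptions.

VDD_H, VDD_L = 1.2, 0.8
CLOCK_PERIOD = 10e-9          # set by the critical path at VddH
DELAY_STRETCH = 1.2           # assumed slowdown factor when run at VddL

# (path delay at VddH, switched capacitance) per path.
paths = [(10e-9, 50e-15), (8e-9, 30e-15)]

f = 1.0 / CLOCK_PERIOD
power = 0.0
for delay, cap in paths:
    # Assign VddL only if the slowed-down path still meets timing.
    vdd = VDD_L if delay * DELAY_STRETCH <= CLOCK_PERIOD else VDD_H
    power += cap * vdd ** 2 * f

baseline = sum(c for _, c in paths) * VDD_H ** 2 * f
print(f"power saving: {1 - power / baseline:.1%}")  # -> power saving: 20.8%
```

Here only the 8 ns path qualifies for VddL, yet its (0.8/1.2)² power reduction already yields a double-digit overall saving without touching the critical path.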
[Fig. 7.15: dual-Vdd standard-cell clustering — flip-flop (FF) paths through VddL cells pass through level converters, while the critical path uses VddH cells]
For multiple dual-Vdd designs, the voltage islands can be generated at different
levels of granularity, such as macro level and standard cell level. In the standard-cell
level, gates on the critical path and noncritical paths are clustered into two groups.
Gates on the critical path are selected from the higher supply voltage ( VddH) stan-
dard cell library, whereas gates on the noncritical path are selected from the lower
supply voltage ( VddL) standard cell library, as shown in Fig. 7.15. This approach
modifies the normal distribution of path delays using a single supply voltage in a
design to a distribution of path delays skewed toward higher delay with multiple
supply voltages, as shown in Fig. 7.16.
The macro-based voltage island methodology targets an entire macro or func-
tional block to be assigned to operate at a different voltage at the time of high-level
synthesis. Starting with an intermediate representation known as a directed acyclic
graph (DAG), high-level synthesis involves two basic steps: scheduling and allocation.
Scheduling refers to the assignment of operations to control steps; a control step,
corresponding to a clock cycle, is the fundamental sequencing unit
in a digital system. Allocation involves assigning the
operations to the hardware, i.e., allocating functional units, storage, and communi-
cation paths. After the scheduling step, some operations will be on the critical path,
whereas some operations will be on the noncritical path. The slack of the off-critical
path can be utilized for the allocation of macro modules to off-critical-path opera-
tions. As shown in Fig. 7.17, a multiplier operating at lower supply voltage ( VddL) is
allocated to the off-critical-path multiplier operation *3.
Fig. 7.16 Distribution of path delays under single supply voltage (SSV) and multiple supply volt-
age (MSV)
A number of studies have shown that the use of multiple supply voltages results
in the reduction of dynamic power from less than 10 % to about 50 %, with an aver-
age of about 40 %. It is possible to use more than two, say three or four, supply volt-
ages. However, the benefit of using multiple Vdd saturates quickly. Extending the
approach to more than two supply voltages yields only a small incremental benefit.
The major gain is obtained by moving from a single Vdd to a dual Vdd. It has been
found that in a dual-Vdd/single-Vt system, the optimal lower Vdd is about 60–70 % of
the original Vdd. The optimal supply voltage depends on the threshold voltage Vt of
the MOS transistors as well.
7.6 Challenges in MVS
The overhead involved with multiple-Vdd systems includes the additional power
supply networks, insertion of level converters, complex characterization and static
timing analysis, complex floor planning and routing, and power-up–power-down
sequencing. As a consequence, even a simple multi-voltage design presents the de-
signer with a number of challenges, which are highlighted in this section.
When signals go from one voltage domain to another voltage domain, quite often, it
is necessary to insert level converters or shifters that convert the signals of one volt-
age level to another voltage level. Consider a signal going from a low-Vdd domain
to a high-Vdd domain, as shown in Fig. 7.18. A high-level output from the low-Vdd
domain has an output VddL, which may turn on both nMOS and pMOS transistors
of the high-Vdd domain inverter resulting in a short circuit between VddH and the
GND. A level converter needs to be inserted to avoid this static power consump-
tion. Moreover, to avoid the significant rise and fall time degradations between the
voltage-domain boundaries, it is necessary to insert buffers to improve the quality
of signals that go from one domain to another domain with proper voltage swing
and rise and fall times. So, it may be necessary to insert buffers even when signals
go from high- to low-voltage domain. This approach of clean interfacing helps to
maintain the timing characteristics and improves ease of reuse.
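The short-circuit hazard described above can be captured in a small check: the receiving inverter's pMOS stays on whenever the incoming logic-high sits more than |Vtp| below that domain's supply. The threshold value used here is an assumed number, not one from the text.

```python
# Hedged illustration of why a low-to-high interface needs a level
# converter (Fig. 7.18): a logic-high of only VddL gives the receiving
# pMOS a gate-source voltage of VddL - VddH; if |Vgs| exceeds |Vtp| the
# pMOS conducts together with the nMOS, creating a static current path.
# Vtp = -0.4 V is an illustrative assumption.

def pmos_conducts(v_in_high: float, vdd_receiver: float,
                  vtp: float = -0.4) -> bool:
    """True if the receiver's pMOS remains ON for the given input high."""
    vgs = v_in_high - vdd_receiver   # gate at the incoming high, source at VddH
    return vgs < vtp                 # |Vgs| > |Vtp| -> pMOS conducting

print(pmos_conducts(0.7, 1.2))   # VddL high into a VddH domain -> True
print(pmos_conducts(1.2, 1.2))   # full-swing high, same domain -> False
```

When the check returns True, a level converter must be inserted; this is exactly the condition the low-to-high converter described below removes.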
High-to-Low-Voltage Level Converters The need for a level converter as a sig-
nal passes from high-Vdd domain to low-Vdd domain arises primarily to provide a
clean signal having a desired voltage swing and rise and fall times. Without a level
converter, the voltage swing of the signal reaching the low-Vdd domain is 0 to VddH.
This causes higher switching power dissipation and high leakage power dissipa-
tion due to GIDL effect. Moreover, because of the longer wire length between the
voltage domains, the rise and fall time may be long leading to increase in short-
circuit power dissipation. To overcome this problem, a level converter as shown
in Fig. 7.19b may be inserted. The high-to-low level converter is essentially two
inverter stages in cascade. It introduces a buffer delay and its impact on the static
timing analysis is small.
Low-to-High-Voltage Level Converters Driving logic signals from a low-voltage
domain to high-voltage domain is a more critical problem because it has significant
196 7 Supply Voltage Scaling for Low Power
Fig. 7.20 a Logic symbol of low-to-high level converter; b low-to-high-voltage level converter
realization
degrading effect on the operation of a circuit. The logic symbol of the low-to-high-
voltage level converter is shown in Fig. 7.20a. One of the most common approaches
of converting a low-voltage level signal to a high-voltage level signal is using dual
cascode voltage switch logic (DCVSL) as shown in Fig. 7.20b. The circuit com-
bines two concepts: differential logic and positive feedback. Two inverters operat-
ing in the low-voltage domain provide inputs to two nMOS transistors that form
the differential inputs. When the signal coming from the low-voltage domain is of
low logic level, the output of the first inverter is high and the output of the second
inverter is low. The high-level output of the first inverter turns the nMOS transistor
M1 ON, which, in turn, turns the pMOS transistor M4 ON. As M4 turns ON, the
pMOS transistor M3 is turned OFF because of the positive feedback implemented
using a cross-coupled connection. This produces a low-level output at the output of
the inverter (OUTH). Similarly, when the input INL is of high logic level, the output
of the second inverter is high, which turns the nMOS transistor M2 ON. This sets in
a regenerative action to quickly switch the OUTH output to high logic level ( VddH).
The circuit requires both VddL and VddH supply voltages for its operation. The low-
to-high level converters introduce considerably larger delay compared to the high-
to-low level converters discussed above. Moreover, these level converters are to be characterized over an extended voltage range, covering the low and high sides of the voltage domains, for accurate static timing analysis.
7.6.2 Converter Placement
One important design decision in the voltage scaling interfaces is the placement
of converters. As the high-to-low level converters use low-Vdd voltage rail, it is
appropriate to place them in the receiving or destination domain, that is, in the low-
Vdd domain. This not only avoids the routing of the low-Vdd supply rail from
the low-Vdd domain to the high-Vdd domain but also helps in the improvement of
the rise and fall time of the signals in low-Vdd domain. Placement of high-to-low
level converter is shown in Fig. 7.21a. It is also recommended to place the low-to-
Fig. 7.21 a High-to-low converter placement; b low-to-high converter placement
high level converters in the receiving domain, that is, in the high-Vdd domain. This,
however, involves routing of the low-Vdd supply rail to the high-Vdd domain. As the
low-to-high level converters require both low- and high-Vdd supply rails, at least
one of the supply rails needs to be routed from one domain to the other domain. The
placement of the low-to-high level converter is shown in Fig. 7.21b.
Multiple power domains demand multiple power grid structures and a suitable power distribution among them. As a consequence, more careful floor planning, placement, and routing are required. It is necessary to separate low-Vdd and high-Vdd cells because they have different n-well voltages. Typically, a row-by-row separation, as shown in Fig. 7.22, is used. This type of layout provides high performance and is suitable for standard cells and gate arrays.
Use of a single voltage makes static timing analysis simpler because it can be performed at a single performance point based on the characterized libraries. Electronic design automation (EDA) tools can optimize the design for worst-case process, voltage, and temperature (PVT) conditions. In the case of multi-voltage designs, on the other hand, libraries must be characterized for each voltage level used in the design, and the EDA tools have to optimize both the individual blocks or subsystems and the design across multiple voltage domains.
One of the important system-level issues is the power sequencing. To avoid power
line glitches, it is not advisable to bring up all power supplies simultaneously. It is
essential to plan a power sequence such that different domains can come up in a
Fig. 7.22 Row-by-row separation of VddH and VddL cell rows
well-defined order to satisfy the correct operation. Particularly, reset can be initiated
only after all power domains are completely powered up. The ramp times are to be
carefully controlled to avoid overshoot and undershoot on the supply lines.
7.6.6 Clock Distribution
As the clock needs to be distributed across different voltage domains, the clocks
have to go through the level converters. This requires that the clock synthesis tool
must understand the need for level converters and automatically insert them in ap-
propriate places.
7.6.7 Low-Voltage Swing
When a signal is sent through a long interconnect wire having a large capacitance, the voltage swing can be reduced to minimize the switching power dissipation.
Power saving can be achieved based on the following two mechanisms:
• Charge needed to charge/discharge the capacitance is smaller.
• Driver size can be reduced because of smaller current requirement.
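The first mechanism can be quantified with a back-of-the-envelope sketch; the capacitance and voltage values below are illustrative assumptions, not from the text:

```python
def switching_energy(c_wire, v_swing):
    """Energy drawn per charge/discharge cycle when the wire is driven
    rail-to-rail from a supply equal to the swing: E = C * V^2."""
    return c_wire * v_swing ** 2

C_WIRE = 1e-12                          # assume a 1-pF interconnect
e_full = switching_energy(C_WIRE, 1.8)  # full swing (VddH = 1.8 V)
e_low = switching_energy(C_WIRE, 0.9)   # reduced swing (VddL = 0.9 V)
# halving the swing cuts the switching energy to one quarter
```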
Fig. 7.23 Reduced voltage swing circuit using a driver and a receiver
Figure 7.23 shows how a low-voltage swing circuit can be realized using the low-to-high and high-to-low level converters discussed earlier. Before allowing a signal to pass through a long interconnect wire, it is converted into a low-voltage swing signal using a high-to-low level converter. At the end of the interconnect wire, the signal is translated back to a high level using a low-to-high level converter, as shown in Fig. 7.23. Although this approach has the potential to reduce the switching power by a significant amount, its efficacy is gradually decreasing in deep-submicron process technologies due to the small supply voltage Vdd and threshold voltage Vt.
7.7 Dynamic Voltage and Frequency Scaling
DVFS has emerged as a very effective technique to reduce CPU energy [6]. The
technique is based on the observation that for most of the real-life applications, the
workload of a processor varies significantly with time and the workload is bursty in
nature for most of the applications. The energy drawn for the power supply, which is
the integration of power over time, can be significantly reduced. This is particularly
important for battery-powered portable systems.
7.7.1 Basic Approach
The energy drawn from the power supply can be reduced by using the following
two approaches:
Dynamic Frequency Scaling During periods of reduced activity, there is a
scope to lower the operating frequency with varying workload keeping the sup-
ply voltage constant. As we know, digital CMOS circuits are used in a majority of
microprocessors, and, for present-day digital CMOS, the sources of power dissipa-
tion can be summarized as follows:
Fig. 7.24 Normalized energy versus workload for ideal scaling and DVFS
Ptotal = C·Vdd²·f + (Isub + Idiode + Igate)·Vdd,    (7.22)
where the first term represents the switching power dissipation, the most dominant component of power dissipation, which is proportional to f, the frequency of operation, C, the switched capacitance ∑αiCi of the circuit, and the square of the supply voltage. The static power dissipation is due to the various leakage current components shown within the brackets in Eq. 7.22. By scaling the frequency with varying workload, the power dissipation reduces linearly with the workload, as shown in Fig. 7.24 by the solid line.
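Eq. 7.22 can be evaluated directly; the sketch below (with illustrative parameter values, not from the text) shows that frequency scaling alone reduces only the switching term, leaving the leakage term untouched:

```python
def total_power(c_sw, vdd, f, i_leak):
    """Eq. 7.22: P_total = C * Vdd^2 * f + (Isub + Idiode + Igate) * Vdd.
    i_leak lumps the three leakage components together."""
    return c_sw * vdd ** 2 * f + i_leak * vdd

C_SW, VDD, I_LEAK = 1e-9, 1.2, 1e-3   # assumed values for illustration
p_full = total_power(C_SW, VDD, f=1e9, i_leak=I_LEAK)
p_half = total_power(C_SW, VDD, f=0.5e9, i_leak=I_LEAK)
# the switching component halves with f; the static component does not
```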
Dynamic Voltage and Frequency Scaling An alternative approach is to reduce
the operating frequency along with the supply voltage without sacrificing the per-
formance required at that instance. It has been established that CMOS circuits can
operate over a certain voltage range with reasonable reliability, where frequency
increases monotonically with the supply voltage. For a particular process technol-
ogy, there is a maximum voltage limit beyond which the circuit operation is destruc-
tive. Similarly, there is a lower voltage limit below which the circuit operation is
unreliable or the delay paths no longer vary monotonically. Within the reliable oper-
ating range, the delay increases monotonically with the decrease in supply voltage
following Eq. 7.23. Therefore, the propagation delay restricts the clock frequency
in a microprocessor:
Delay(D) ∝ Vdd/(Vdd − Vt)² = 1/[Vdd·(1 − Vt/Vdd)²].    (7.23)
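The delay model of Eq. 7.23 can be tabulated to see how quickly delay grows as Vdd approaches Vt; the proportionality constant and the voltage values below are arbitrary:

```python
def gate_delay(vdd, vt=0.4, k=1.0):
    """Eq. 7.23: Delay(D) = k * Vdd / (Vdd - Vt)^2, valid for Vdd > Vt.
    k is an arbitrary proportionality constant."""
    return k * vdd / (vdd - vt) ** 2

# delay increases monotonically as the supply voltage is lowered
delays = [gate_delay(v) for v in (1.2, 1.0, 0.8, 0.6)]
```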
Fig. 7.25 Four different cases with two different workloads and with voltage and frequency
scaling
However, the processor can continue to operate at a lower frequency as the supply
voltage is lowered. This has opened up the possibility of lowering the supply volt-
age, as the frequency is lowered due to the reduced workload. Because of the square
law dependence of switching power dissipation on the supply voltage, there is a
significant saving in energy. The goal of the DVFS technique is to adapt the supply
voltage and the operating frequency of the processor to match the workload such
that the performance loss of the processor is minimized, thereby saving maximum
possible energy, as shown by the dotted line in Fig. 7.24.
Four different situations are illustrated in Fig. 7.25. Figure 7.25a shows the
situation when the workload is 100 %. Here, the processor is operating with the
highest voltage and frequency and it is taking time T1 to complete the task. As
a consequence, it requires average power dissipation of P1 and energy dissipa-
tion is E1 = (P1 × T1). Now, when the workload is 50 %, energy consumption is
E2 = (P1 × T2) without voltage and frequency scaling as shown in Fig. 7.25b. Since
T2 = T1/2, E2 = E1/2. Now, let us consider the situation represented by Fig. 7.25c.
Here, frequency scaling by 50 % has been done such that execution time is again
T1. So, the energy consumption is P2 × T1 = E2, which is the same as the situation
represented by Fig. 7.25b. Although energy consumption is the same, the power
dissipation is reduced by 50 %. Finally, consider the situation when the workload
is 50 %, as shown in Fig. 7.25d. In this case, both the voltage and frequency scal-
ing are done and power dissipation is P3, which is less than P2. Here, the energy
consumption is E3 = P3 × T1. As P3 = 0.25P1, there is a significant reduction in the
energy consumption.
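The bookkeeping for the four cases can be verified with a few lines of arithmetic; P1 and T1 are normalized to 1, and the value P3 = 0.25·P1 is taken from the text:

```python
P1, T1 = 1.0, 1.0
E1 = P1 * T1            # case (a): 100 % workload, no scaling
T2 = T1 / 2
E2 = P1 * T2            # case (b): 50 % workload, no scaling -> E1/2
P2 = 0.5 * P1           # case (c): frequency halved, voltage unchanged
E2c = P2 * T1           # same energy as case (b), but at half the power
P3 = 0.25 * P1          # case (d): both voltage and frequency scaled
E3 = P3 * T1            # a significant further energy reduction
```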
So far, we have considered static conditions to establish the efficiency of the DVFS approach. In real systems, however, the frequency and voltage are to be dynamically adjusted to match the changing demands for processing power. The implementation of a DVFS system requires the following hardware building blocks:
• Variable voltage processor µ(r)
• Variable voltage generator V(r)
• Variable frequency generator f(r)
In addition to the above hardware building blocks, there is a need for a workload monitor. Workload monitoring can be performed by the operating system (OS). Usually, the OS has no prior knowledge of the workload to be generated by a bursty application. In general, future workloads are nondeterministic. As a consequence, predicting the future workload from the current situation is very difficult, and errors in prediction can seriously reduce the gains of DVFS, as has been observed in several simulation studies. Moreover, the rate at which DVFS is done has a significant bearing on the performance and energy. So, it is essential to develop a suitable strategy for workload prediction and for the rate of DVFS, such that processor utilization and energy saving are maximized.
Processors that can operate over a frequency range with a correspondingly lower supply voltage range can be manufactured using present-day process technology, and several such processors are commercially available. Transmeta's TM 5400 or "Crusoe" processor and the Strong ARM processor are examples of such variable voltage processors. Transmeta's Crusoe processor can operate over the voltage range of 1.65–1.1 V, with a corresponding frequency range of 700–200 MHz. Table 7.10 provides the relation among frequency, voltage, and power consumption for this processor. It allows the following adjustments:
• Frequency change in steps of 33 MHz
• Voltage change in steps of 25 mV
• Up to 200 frequency/voltage changes per second
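A DVFS governor must snap any requested operating point onto such a discrete grid. The sketch below uses the step sizes and ranges quoted above; the rounding policy (frequency rounded down, voltage rounded up, so the voltage always suffices for the chosen frequency) is our own assumption:

```python
def quantize_operating_point(f_req, v_req, f_step=33, v_step=25,
                             f_range=(200, 700), v_range=(1100, 1650)):
    """Snap a requested frequency (MHz) and voltage (mV) onto the
    discrete grid supported by a Crusoe-class processor."""
    # clamp the request into the supported ranges
    f = max(f_range[0], min(f_range[1], f_req))
    v = max(v_range[0], min(v_range[1], v_req))
    # frequency: round down to the nearest step above the range minimum
    f = f_range[0] + (f - f_range[0]) // f_step * f_step
    # voltage: round up (ceiling division via negated floor division)
    v = v_range[0] + -((v_range[0] - v) // v_step) * v_step
    return f, v

op = quantize_operating_point(500, 1310)   # (497 MHz, 1325 mV)
```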
Fig. 7.26 Processor voltage versus CPU clock frequency for the Strong ARM central processing unit, showing the operational and nonfunctional regions
Fig. 7.27 Block diagram of a pulse-width-modulated DC-to-DC converter, with reference voltage and comparator
Another processor, the Strong ARM, allows voltage scaling as well. Experimental results obtained on the Strong ARM 1100 are shown in Fig. 7.26. The diagram shows the operating frequency and the corresponding supply voltages for the ARM processor.
The variable voltage power supply can be realized with the help of a direct current
(DC)-to-DC converter, which receives a fixed voltage VF and generates a variable
voltage Vr based on the input from the workload monitoring system. Block diagram
of a DC-to-DC converter based on pulse-width modulation (PWM) is shown in
Fig. 7.27. The fixed voltage VF passes through a power switch, which generates a
rectangular wave of variable duty cycle. The pulse-width modulator controls the
duty cycle of the rectangular wave based on the input received from the compara-
tor. The low-pass filter removes the alternating current (AC) ripple to the accept-
able level and generates an average value of the PWM signal. The average value is
nearly equal to the desired DC output voltage. With the availability of good-quality components such as the insulated-gate bipolar transistor (IGBT) and other semiconductor devices, it is possible to realize DC-to-DC converters having high efficiency, say 90 % and above. However, the efficiency drops off as the load decreases, as shown in Fig. 7.28. At a lower current load, most of the power drawn
Fig. 7.28 Converter efficiency as a function of the relative load
from the fixed voltage supply gets dissipated in the switch and the efficiency falls
off. This also leads to a reduction in the efficiency of the dynamic voltage scaling
(DVS) system.
The variable frequency is generated with the help of a phase lock loop (PLL) system. The heart of the device is the high-performance PLL core, consisting of a phase frequency detector (PFD), a programmable on-chip filter, and a voltage-controlled oscillator (VCO). The PLL generates a high-speed clock which drives a frequency divider. The divider generates the variable frequency f(r). The PLL and the divider together generate frequencies derived from the PLL operating frequency.
7.7.3 The Model
Fig. 7.29 Model of the DVFS system: a DC-to-DC converter generates the variable voltage V(r) from a fixed voltage Vfixed, a workload monitor W drives a frequency generator producing f(r), and a variable-voltage processor µ(r) serves a task queue with arrival rates λ1, …, λn
7.7.4 Workload Prediction
It is assumed that the workload for the next observation interval can be predicted
based on the workload statistics of the previous N intervals. The workload predic-
tion for ( n + 1) interval can be represented by
Wp[n + 1] = Σ(k = 0 to N − 1) hn[k]·W[n − k],    (7.24)
where W[n] denotes the average normalized workload in the interval (n − 1)T ≤ t ≤ nT
and hn[k] represents an N-tap, adaptable finite impulse response (FIR) filter, whose
coefficients are updated in every observation interval based on the difference be-
tween the predicted and actual workloads. Three possible filter types are examined below:
Moving Average Workload (MAW) In this case, hn[k] = 1/N; that is, the filter predicts the workload in the next time slot as the average of the previous N time slots. This simplistic scheme removes high-frequency workload changes and does not provide a satisfactory result for time-varying workload statistics.
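Eq. 7.24 is simply a dot product over the recent workload history; with hn[k] = 1/N it degenerates to the MAW scheme. A minimal sketch (history ordered most-recent-first, with assumed workload values):

```python
def predict_workload(history, h):
    """Eq. 7.24: Wp[n+1] = sum_k h[k] * W[n-k]; `history` holds the
    past workloads with the most recent sample first."""
    return sum(hk * wk for hk, wk in zip(h, history))

N = 4
h_maw = [1.0 / N] * N                 # moving-average coefficients
wp = predict_workload([0.8, 0.6, 0.4, 0.2], h_maw)  # plain average
```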
Exponential Weighted Averages (EWA) In this approach, instead of giving equal
weightage to all the workload values of the previous slots, higher weightages are
given to the most recent workload history. The idea is to give progressively decreas-
ing importance to historical data. Mathematically, this can be achieved by providing the filter coefficients hn[k] = a^(−k), for all n, where a positive value of a is chosen such that Σ hn[k] = 1.
Fig. 7.30 RMS prediction error of the three filters as a function of the number of taps N. RMS root mean square
Least Mean Square In this approach, the filter coefficients are modified based on the prediction error. One of the popular adaptive filter algorithms is the least-mean-square (LMS) algorithm. Let W[n] and Wp[n] be the actual and predicted workloads, respectively. Then the prediction error is given by We[n] = W[n] − Wp[n]. The filter coefficients are updated based on the following rule:
hn+1[k] = hn[k] + µ·We[n]·W[n − k], where µ is the step size.
Although these types of filters learn from the history of workloads, the main problems are stability and convergence. Selection of a wrong number of coefficients or an inappropriate step size may lead to undesirable consequences.
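A minimal LMS predictor illustrating the update rule is sketched below; the step size µ and the workload trace are assumptions for illustration:

```python
def lms_step(h, history, actual, mu=0.05):
    """Predict with the current coefficients, then apply the update
    h[k] <- h[k] + mu * We[n] * W[n-k], with We[n] = W[n] - Wp[n]."""
    predicted = sum(hk * wk for hk, wk in zip(h, history))
    err = actual - predicted
    return [hk + mu * err * wk for hk, wk in zip(h, history)], err

h = [1.0 / 3] * 3                 # start from moving-average weights
hist = [0.6, 0.5, 0.4]            # recent workloads, newest first
h, err1 = lms_step(h, hist, actual=0.7)
_, err2 = lms_step(h, hist, actual=0.7)   # error shrinks after the update
```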
A comparison of the prediction performance in terms of root-mean-square error
of the three filters is given in Fig. 7.30. It shows that the prediction is too noisy for
a small number of taps and when the number of taps is large, it leads to excessive
low pass filtering. Both result in poor prediction. It is observed that the LMS adap-
tive filter outperforms the other techniques and an optimum result is obtained with
N = 3 taps.
The operating points are determined analytically, first by finding out appropriate
clock frequencies for different workloads. As PLLs along with frequency dividers
are used to generate different frequencies, it is preferable to select clock periods
which are multiples of the PLL frequency. This helps to generate clock frequencies
with minimum latency. Otherwise, it is necessary to change the PLL frequency,
which requires a longer time to stabilize. Finally, the voltages required to support
each of the frequencies are found out.
Too few operating points may result in ramping between two levels for typical workload profiles, while too many levels may result in the power supply hunting for a new supply voltage most of the time. To optimize the number of operating points, it is necessary to understand the number of distinct performance levels needed under realistic dynamic workload conditions. Most of the processors available with a built-in DVS facility have a discrete set of operating frequencies, which implies that
Fig. 7.31 Energy saving ratio (Eactual/Eperfect) of the LMS filter as a function of the number of levels L
the processing rate levels are quantized. The question arises whether this quantization has any impact on the performance. The effect of the number of discrete processing levels is shown in Fig. 7.31. As the number of discrete levels increases, the energy saving ratio (ESR) gets close to the perfect prediction case. For L = 10, which is provided by all practical processors, the ESR degradation due to quantization noise is less than 10 %.
7.7.6 Latency Overhead
There is a latency overhead involved in processing rate update. This is due to the
finite feedback bandwidth associated with the DC-to-DC converter. Changing the
processor clock frequency also involves a latency overhead, during which the PLL
circuit locks. To be on the safe side, it is recommended that the voltage and frequen-
cy changes should not be done in parallel. In case of switching to higher processing
rate, the voltage should be increased first, followed by the increase in frequency,
and the following steps are to be followed:
• Set the new voltage.
• Allow the new voltage to settle down.
• Set the new frequency by changing the divider value, if possible. Otherwise,
change the PLL clock frequency.
In case of switching to low processing rates, the frequency should be decreased first
and then the voltage should be reduced to the appropriate level, and the following
steps are to be followed:
• Set the new frequency by changing the divider value, if possible. Otherwise,
change the PLL clock frequency.
• Set the new voltage. The CPU continues operating at the new frequency while
voltage settles to the new value.
Fig. 7.32 Adaptive voltage scaling system. DVC dynamic voltage control, DFC dynamic fre-
quency control, DVFM dynamic voltage and frequency management, DC direct current, DRAM
dynamic random-access memory, PLL phase lock loop
This ensures that the voltage level is always sufficient to support the required operating frequency and helps to avoid data corruption due to failure of the circuit. The voltage must be limited to the range over which delay and voltage track monotonically. Moreover, delay is temperature dependent; normally, delay increases as temperature increases, but below a certain voltage (about 2VT), this relationship inverts and the delay decreases as temperature increases.
The voltage scaling techniques discussed so far are open loop in nature [7]. Voltage–frequency pairs are determined at design time, keeping sufficient margin for guaranteed operation across the entire range of best- and worst-case PVT conditions. As the design needs to be very conservative for successful operation, the actual benefit obtained is less than what is possible. A better alternative that can overcome this limitation is adaptive voltage scaling (AVS), where a closed-loop feedback system is implemented between the voltage-scaling power supply and a delay-sensing performance monitor at execution time. The on-chip monitor not only checks the actual voltage developed but also detects whether the silicon is slow, typical, or fast, and the effect of temperature on the surrounding silicon.
The implementation of the AVS system is shown in Fig. 7.32. The dynamic voltage control (DVC) emulates the critical path characteristic of the system by using a delay synthesizer and controls the dynamic supply voltage. It consists of three
major components: the pulse generator, the delay synthesizer, and the delay detec-
tor. By comparing the digitized delay value with the target value, the delay detec-
tor determines whether to increase, decrease, or keep the present supply voltage
7.9 Subthreshold Logic Circuits
value. The minimum operating voltage, from 0.9 to 1.6 V in 5-mV steps, is supplied in real time by controlling an off-chip DC–DC converter to adjust the digitized value to the target. The dynamic frequency control (DFC) adjusts the clock frequency by
monitoring the system activity. It consists of an activity monitor and a frequency
adjuster. The DFC block controls the clock frequency at the required minimum
value autonomously in hardware without special power-management software. The
activity monitor calculates the total large scale integrated (LSI) activity periodically
from activity information of embedded dynamic random-access memory (DRAM),
bus, and CPU. The voltage and frequency are predicted according to the perfor-
mance monitoring of the system. The dynamic voltage and frequency management
(DVFM) system can track the required performance with a high level of accuracy
over the full range of temperature and process deviations.
As the supply voltage continues to scale with each new generation of CMOS tech-
nology, subthreshold design is an inevitable choice in the semiconductor road map
for achieving ultra-low power consumption [8, 9]. Figure 7.33 shows the opera-
tion of an MOS transistor in the subthreshold region. The subthreshold region is
often referred to as the weak inversion region. Subthreshold drain current Isubth
is the current that flows between the source and drain of an MOSFET when the
transistor is in subthreshold region, that is, for gate-to-source voltages below the
threshold voltage. In digital circuits, subthreshold conduction is generally viewed
as a parasitic leakage in a state that would ideally have no current. In the early years of very-large-scale integration (VLSI) technology, the subthreshold conduction of transistors was very small, but as transistors have been scaled down to the deep-submicron region, leakage from all sources has increased. For a technology generation with a threshold voltage of 0.2 V, leakage can exceed 50 % of the total power consumption. The reason for the growing importance of subthreshold conduction is that
the supply voltage has continually scaled down, both to reduce the dynamic power
consumption and to keep electric fields inside small devices low, to maintain device
reliability. The amount of subthreshold current depends on the threshold voltage,
which lies between 0 V and Vdd, and so has to be reduced along with the supply
voltage. That reduction means less gate voltage swing below threshold to turn the
device OFF, and as subthreshold current varies exponentially with gate voltage, it
becomes more and more significant as MOSFETs shrink in size.
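The exponential dependence on gate voltage can be made concrete with the standard weak-inversion current model; all parameter values below are illustrative assumptions, not from the text:

```python
import math

def subthreshold_current(vgs, vt, i0=1e-7, n=1.5, v_therm=0.026):
    """Weak-inversion model: I = I0 * exp((Vgs - Vt) / (n * vT)),
    with vT = kT/q, about 26 mV at room temperature."""
    return i0 * math.exp((vgs - vt) / (n * v_therm))

# lowering Vt by 100 mV multiplies the OFF-state (Vgs = 0) leakage ~13x
i_off_high_vt = subthreshold_current(0.0, vt=0.4)
i_off_low_vt = subthreshold_current(0.0, vt=0.3)
```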
Subthreshold circuits operate with a supply voltage Vdd less than the threshold
voltage Vt of the MOS transistor. As power is quadratically related to the supply
voltage, reducing it to the subthreshold level leads to significant reduction in both
power and energy consumption. However, the reduction in power consumption results in an increase in delay. As a consequence, subthreshold logic is suitable only for specific applications that do not need high performance but require ultra-low power consumption. Such applications include medical equipment such as hearing aids and pacemakers, wireless sensor nodes, radio-frequency identification (RFID) tags, etc. Subthreshold circuits can also be used in applications where the circuits remain idle for extended periods of time.
One important metric is the power delay product (PDP). The PDP for Vdd = 0.5 V is about 40 times smaller than the PDP for Vdd = 3 V; as the reduction in power dissipation outweighs the increase in delay, the PDP is lower. It has been established that the delay of circuits operating in the subthreshold region is three orders of magnitude higher than that of circuits operating in the strong inversion region, but the power consumption is four orders of magnitude lower for subthreshold circuits [6]. This implies that if both circuits operate at the same clock frequency, the subthreshold circuits consume less energy than the circuit operating in the strong inversion region. One of the major concerns in subthreshold circuit design is the increased sensitivity to PVT variations. The performance and robustness of bulk CMOS and double-gate silicon-on-insulator (DGSOI) subthreshold basic logic gates, with and without parameter variations, have been compared [7], and it has been observed that DGSOI subthreshold logic circuits achieve a 60–70 % improvement in PDP and roughly 50 % better tolerance to PVT variations compared to bulk CMOS subthreshold circuits at the 32-nm node.
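As a first-order check of the ~40x figure: if the energy per operation is approximated by C·Vdd² (ignoring the leakage contribution), scaling Vdd from 3 V down to 0.5 V gives a factor of 36, the same order as the value quoted above:

```python
# energy-per-operation ratio under E ~ C * Vdd^2 (the capacitance cancels)
ratio = (3.0 / 0.5) ** 2   # = 36, consistent with the ~40x PDP claim
```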
7.10 Chapter Summary
• The various challenges involved in supply voltage scaling for low power have
been explained.
• The differences between constant-field and constant-voltage scaling are ex-
plained.
• The short-channel effects arising out of feature size scaling have been discussed.
• The various architecture level approaches for low power, such as parallelism and
pipelining, have been discussed.
• The concept of multi-core for low power is introduced.
• How voltage scaling is performed using high-level transformations is explained.
7.11 Review Questions
Q7.1. As you move to a new process technology with a scaling factor S = 1.4, how
do the drain current, power density, delay, and energy requirement change for the
constant-field scaling?
Q7.2. Distinguish between constant-field and constant-voltage feature size scaling.
Compare their advantages and disadvantages.
Q7.3. Compare the constant-field and constant-voltage scaling approaches in terms
of the area, delay, energy, and power density parameters.
Q7.4. Why is it preferable to use constant-field scaling over constant-voltage scal-
ing in the context of low power?
Q7.5. Explain with a schematic diagram the basic scheme of DVS to minimize the
power dissipation of a microprocessor.
Q7.6. Explain the use of pipelining in realizing low-power circuits.
Q7.7. Explain how parallelism can be used to achieve low power instead of high
performance in realizing digital circuits.
Q7.8. How can you combine pipelining and parallelism to achieve low power in-
stead of higher performance?
Q7.9. Discuss how low power can be achieved in multi-core processors.
Q7.10. What is loop unrolling? How can it be used to realize low-power circuits?
Illustrate it with an example circuit.
Q7.11. Explain the concept of clustered voltage scaling approach for the reduction
of power dissipation.
Q7.12. How is the path delay of a circuit affected due to the use of multiple supply
voltages?
Q7.13. List the challenges we face in realizing VLSI circuits using multiple volt-
ages.
Q7.14. Why is it necessary to use a level converter while driving a logic signal from
a low-voltage domain to a high-voltage domain? Draw the schematic diagram and
explain the operation of a low-to-high level converter.
Q7.15. Give the basic scheme for DVS used for reducing the power dissipation of a
CPU. How can the workload be predicted efficiently?
Q7.16. What is AVS? Explain the function of the different components used to real-
ize AVS.
Q7.17. Explain how pin swapping can be used to minimize dynamic power
dissipation.
Q7.18. Briefly explain how buffer resizing can be used to reduce dynamic power
dissipation.
References
1. Kang, S-M., Leblebici, Y.: CMOS Digital Integrated Circuits Analysis and Design, 3rd edn.
Tata McGraw-Hill, Noida (2003)
2. Chandrakasan, A.P., Brodersen, R.W.: Low-Power Digital CMOS Design. Kluwer, Norwell
(1995)
3. Bellaouar, A., Elmasry, M.I.: Low-Power Digital VLSI Design. Kluwer, Norwell (1995)
4. Chandrakasan, A.P., Sheng, S., Brodersen, R.W.: Low-power CMOS digital design. IEEE J.
Solid-State Circuits 27(4):473–484 (1992)
5. Keating, M., Flynn, D., Aitken, R., Gibbons, A., Shi, K.: Low Power Methodology Manual:
For System-on-Chip Design. Springer (2007)
6. Sinha, A., Chandrakasan, A.P.: Dynamic voltage scheduling using adaptive filtering of work-
load traces. In: Proceedings of 14th VLSI Design 2001 Conference, pp. 221–226, Jan 2001
7. Nakai, M., Akui, S., Seno, K., Meguro, T., Seki, T., Kondo, T., Hashiguchi, A., Kawahara,
H., Kumano, K., Shimura, M.: Dynamic voltage and frequency management for a low-power
embedded microprocessor. IEEE J. Solid-State Circuits 40(1):28–35 (2005)
8. Vaddi, R., Dasgupta, S., Agarwal, R.P.: Device and Circuit Design Challenges in the Digital
Subthreshold Region for Ultralow-Power Applications, VLSI Design, Hindawi, vol. 2009, Ar-
ticle ID 283702, 14 pp. (2009)
9. Vaddi, R., Dasgupta, S., Agarwal, R.P.: Device and circuit co-design robustness studies in the
subthreshold logic for ultralow-power applications for 32 nm CMOS. IEEE Trans. Electron
Devices 57(3):654–664 (2010)
Chapter 8
Switched Capacitance Minimization
8.1 Introduction
Fig. 8.1 a Analog-to-digital converter (ADC) implemented by hardware and b ADC implemented by hardware–software mix. DAC digital-to-analog converter, EOC end of conversion
of the output of the temperature sensor can be done at intervals of a few seconds.
So, the ADC functionality can be implemented in software with a few inexpensive
external components, such as a digital-to-analog converter (DAC) and a comparator,
as shown in Fig. 8.1b. Any of the A/D conversion algorithms, such as successive
approximation, can be implemented in software utilizing the DAC and the
comparator. As a consequence, this approach has a higher software overhead but a
lower hardware cost of implementation.
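The successive-approximation scheme mentioned above can be sketched in software. This is an illustrative sketch, not the book's listing; the dac function and its 5-V full-scale range are assumptions:

```python
def sar_adc(analog_in, dac, n_bits=8):
    """Successive-approximation A/D conversion using an external
    DAC and a comparator, resolving one bit per iteration."""
    code = 0
    for bit in range(n_bits - 1, -1, -1):
        trial = code | (1 << bit)        # tentatively set the next bit
        if dac(trial) <= analog_in:      # comparator: DAC output vs. analog input
            code = trial                 # keep the bit
    return code

# Hypothetical 8-bit DAC with a 5-V full-scale range
dac = lambda code: code * 5.0 / 255
```

Each of the n iterations performs one DAC write and one comparator read, which is why the software conversion is slow but needs almost no hardware.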
The first approach provides a fast conversion time at the cost of more hardware and
a larger chip area. In the second alternative, the hardware cost and chip area are
lower, but the conversion time is longer. So, for a given application, there is a
trade-off between how much is implemented by hardware and how much by software.
This has led to the concept
of hardware–software codesign, which involves partitioning of the system to be re-
alized into two distinct parts: hardware and software. Choosing which functions to
implement in hardware and which in software is a major engineering challenge that
involves consideration of issues such as cost, complexity, performance, and power
consumption. From the behavioral description, it is necessary to perform hardware/
software partitioning.
Fig. 8.2 A 128-bit molecule can contain up to four atoms, which are executed in parallel. FADD floating-point addition, ADD addition, LD load, BRCC branch if carry cleared, ALU arithmetic logic unit
complex task of translating the x86 instructions into the instructions of the VLIW
is performed by a piece of software known as the code morphing software (CMS).
8.3.1 The Hardware
Fig. 8.3 The Crusoe hardware: x86 instructions are decoded and translated into micro-ops, which are dispatched by the superscalar dispatch unit to the functional units and retired by an in-order retire unit
It is evident from the table that Crusoe processors fabricated in the same technology
and offering a comparable performance level require only about 50 % of the chip area.
This is also reflected in the power dissipated in the form of heat by a Pentium III
and a Crusoe processor running DVD software. In the case of the PIII, the temperature
rises to 105 °C, which may lead to the point of failure if it is not aggressively
cooled. On the other hand, in the case of the Crusoe processor module TM5400, the
temperature rises to only 48 °C, which requires no active cooling. The smaller power
dissipation of Crusoe processors not only makes the chip less prone to failure but
also cheaper in terms of packaging and cooling costs.
8.3.2 The Software
The success of the Crusoe processor lies in the use of an innovative piece of software
known as the CMS that surrounds the VLIW hardware processor. This software layer
decouples the x86 instruction set architecture (ISA) from the underlying processor
hardware, which is very different from that of conventional x86 processor
architectures. The CMS is essentially a dynamic translation system: a program that
compiles the instructions of the x86 ISA into the instructions of the VLIW processor.
Normally, it is the first program that starts execution. The CMS is the only program
that is written directly for the VLIW processor, as shown in Fig. 8.4. All the other
software, such as the basic input/output system (BIOS), the operating system (OS),
and other application programs, is written using the x86 ISA. The x86 software only
sees the CMS, which insulates x86 programs from the native VLIW processor. As a
consequence, the native processor ISA can be changed arbitrarily without any need to
modify the x86 software; any change in the VLIW ISA calls only for a change in the
CMS itself. The proof of this concept has been demonstrated by replacing the TM3120
processor with the TM5400 processor simply by providing a different version of the CMS.
Example 8.2 The translation of a piece of x86 code into equivalent code for the
Crusoe VLIW processor is illustrated below. The x86 code to be translated, consisting
of four instructions, is given below:
The translation is performed in three phases. In the first pass, the front end of the
translator system decodes the x86 instructions and translates them into a simple
sequence of atoms of VLIW ISA. Using %r30 and %r31 as temporary registers for
the memory load operations, the output of the first pass is given below:
In the second pass, the translator performs typical compiler optimizations such as
common subexpression elimination, dead code elimination, etc. This helps in elimi-
nating some of the atoms. For example, the output of the second pass for the above
code is given below:
In the final pass, the scheduler reorders the atoms and groups them into individual
molecules, which is somewhat similar to the function of the dispatch hardware. The
output, consisting of only two molecules of the final phase, is given below:
The above VLIW instructions are executed in-order by the hardware and the mol-
ecules explicitly encode the instruction-level parallelism. As a consequence, the
VLIW engine is very simple.
8.4 Bus Encoding

The intrinsic capacitances of system-level busses are usually several orders of
magnitude larger than those of the internal nodes of a circuit. As a consequence, a
considerable amount of power is dissipated in the transmission of data over
input/output (I/O) pins. It is possible to save a significant amount of power by
reducing the number of transitions, i.e., the switching activity, at the processor's
I/O interface. One possible approach for reducing the switching activity is to
suitably encode [4, 9] the data before sending them over the I/O interface. A decoder
is used to recover the original data at the receiving end, as shown in Fig. 8.6.
Encoding can be used either to remove undesired correlation among the data
bits, as is done in encryption, or to introduce controlled correlation.
Coding to reduce the switching activity falls in the second category. To reduce
the switching activity, it is necessary to increase the correlation among the signal
values transmitted over the I/O pins.
Fig. 8.6 Bus encoding: an encoder on the sender side maps the n data lines onto m bus lines, and a decoder on the receiver side recovers the original n-bit data
8.4.1 Gray Coding
One popular example of a nonredundant encoding scheme is Gray coding. Gray coding
produces a code word sequence in which adjacent code words differ in only 1 bit,
i.e., they have a Hamming distance of 1, as shown in Table 8.2. For the 16-value
sequence of Table 8.2, the number of transitions for the binary representation is 30,
whereas the number of transitions for the Gray code is always 16. As a consequence,
the transition activity, and hence the power dissipation, reduces by about 50 %,
which is very useful when the data to be transmitted are sequential and highly
correlated. The encoder and decoder are memory-less, as shown in Fig. 8.7. Each
encoder and decoder requires ( n − 1) two-input exclusive OR (EX-OR) gates.
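The XOR-based encoder and decoder can be mirrored in software; a minimal sketch on plain unsigned integers:

```python
def bin_to_gray(b):
    # Each Gray bit is the XOR of two adjacent binary bits
    return b ^ (b >> 1)

def gray_to_bin(g):
    # The decoder is a cascade of XORs: each binary bit is the
    # XOR of all Gray bits at or above its position
    b = 0
    while g:
        b ^= g
        g >>= 1
    return b
```

Summing the Hamming distances over the cyclic count sequence 0, 1, …, 15, 0 gives 30 transitions for the binary codes but only 16 for the Gray codes, matching the figures quoted above.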
Table 8.2 Binary and Gray codes for different decimal values

Decimal value   Binary code   Gray code
0               0000          0000
1               0001          0001
2               0010          0011
3               0011          0010
4               0100          0110
5               0101          0111
6               0110          0101
7               0111          0100
8               1000          1100
9               1001          1101
10              1010          1111
11              1011          1110
12              1100          1010
13              1101          1011
14              1110          1001
15              1111          1000
Gray coding has been applied to the address lines for both instruction and data
accesses for the reduction of switching activity. Simulation results show that there
is a significant reduction in switching activity on the address bus because
instructions are fetched from sequential memory locations. However, when a branch
instruction is executed, the next address is not sequential, which leads to more than
one bit transition even when Gray coding is used. The frequency of occurrence of
branch instructions in typical code is about 1 in 15.
As data accesses are not usually sequential in nature, there is no noticeable
reduction in the switching activity for Gray coding compared to uncoded binary data
accesses. Approximately equal transition activity has been found for random data
Fig. 8.7 Encoder and decoder for Gray coding, each built from ( n − 1) two-input EX-OR gates
Table 8.3 Bit transitions per instruction for different benchmark programs

Benchmark program   Instruction address        Data address
                    Binary coded  Gray coded   Binary coded  Gray coded
Fastqueens          2.46          1.03         0.85          0.91
Qsort               2.64          1.33         1.32          1.25
Reducer             2.57          1.71         1.47          1.40
Circuit             2.33          1.47         1.33          1.18
Semigroup           2.68          1.99         1.38          1.34
Nand                2.42          1.57         1.25          1.16
Boyer               2.76          2.09         1.76          1.72
Browse              2.51          1.64         1.32          1.40
Chat                2.43          1.54         1.32          1.20
patterns. Table 8.3 shows the switching activities in terms of the number of bit
transitions per instruction (BPI) for different situations for a set of benchmark pro-
grams.
8.4.2 One-Hot Coding
Nonredundant coding provides a reduction in the switching activity only when the data
sent over the lines are sequential and highly correlated, which cannot be ensured in
many situations. As we have mentioned in the previous section, when data sequences
are random in nature, Gray coding does not provide a reduction in the switching
activity. If we want a reduction of the switching activity in all possible
situations, we have to go for redundant coding. One-hot encoding is one such
redundant coding that always results in a reduction in switching activity. Here, an
n-bit data word maps to a unique code word of m bits, where m = 2^n. In this case,
two devices are connected using m = 2^n signal lines as shown in Fig. 8.8. The encoder
Fig. 8.9 Bus-inversion coding scheme with m = n + 1 lines; the extra invert line indicates whether Bus( t) carries Data( t) or its complement
receives the n-bit data as input and produces a “1” on the i-th line, 0 ≤ i ≤ 2^n − 1,
where i is the binary value of the n-bit data word, and a “0” on the remaining lines.
In this case, both the encoder and the decoder are memory-less.
The most important advantage of this approach is that the number of transitions
for the transmission of any pair of distinct data words one after the other is two:
one 0-to-1 and one 1-to-0. The resulting reduction in dynamic power consumption can
be computed accordingly. Although one-hot encoding provides a large reduction in
switching activity, the number of signal lines increases exponentially (2^n) with n.
For example, for n = 8, the number of signal lines is m = 256, and the reduction in
switching activity is 75 %.
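The two-transitions-per-word property is easy to demonstrate in a sketch:

```python
def one_hot_encode(value, n):
    # n-bit data word -> one of m = 2**n lines driven high
    return 1 << value

def one_hot_decode(code):
    # The position of the single high line is the data value
    return code.bit_length() - 1

def transitions(x, y):
    # Number of bus lines that toggle between two bus states
    return bin(x ^ y).count("1")
```

For any two distinct data words, exactly one line falls and one line rises, so the bus always makes exactly two transitions, regardless of the data statistics.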
8.4.3 Bus-Inversion Coding
Another redundant coding scheme is bus-inversion coding, which requires only one
redundant bit, i.e., m = n + 1, for the transmission of data words as shown in
Fig. 8.9. When the Hamming distance between Data( t) and Bus( t − 1) is more than n/2,
the complement of Data( t) is sent by the encoder over the bus as Bus( t). On the
other hand, when the Hamming distance between Data( t) and Bus( t − 1) is less than or
equal to n/2, Data( t) is sent unchanged. The redundant bit P is added to indicate
whether Bus( t) is an inverted version of Data( t) or not. The encoder and decoder for
bus-inversion encoding are shown in Fig. 8.10. By using this encoding method, the
number of transitions is reduced by 10–20 % for
data busses. It may be noted that this approach is not applicable to address busses.
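The bus-inversion rule can be sketched directly from its definition:

```python
def bus_invert_encode(data, prev_bus, n):
    mask = (1 << n) - 1
    hd = bin(data ^ prev_bus).count("1")   # Hamming distance to previous bus state
    if hd > n // 2:                        # majority of lines would toggle
        return (~data) & mask, 1           # send complement, P = 1
    return data, 0                         # send data unchanged, P = 0

def bus_invert_decode(bus, p, n):
    mask = (1 << n) - 1
    return (~bus) & mask if p else bus
```

Because the encoder always picks whichever of Data( t) and its complement is closer to Bus( t − 1), the number of data-line transitions per transfer is bounded by n/2.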
Fig. 8.10 Encoder and decoder of bus-inversion encoding. CLK clock signal, INV invert

8.4.4 T0 Coding
Fig. 8.11 T0 encoding scheme with m = n + 1 lines, using a redundant increment line INC( t)
considered. However, the code is optimum only in the class of irredundant codes,
i.e., codes that employ exactly n-bit patterns to encode a maximum of 2^n words. By
adding some redundancy to the code, better performance can be achieved by adopting
the T0 encoding scheme [1], which requires a redundant increment (INC) line as shown
in Fig. 8.11. When the INC signal value is 1, it implies that a consecutive stream of
addresses is being generated; all other bus lines are then frozen to avoid
unnecessary transitions, and the receiver computes the new addresses directly. On the
other hand, when two addresses are not consecutive, the INC signal is LOW and the
remaining bus lines carry the raw address. The T0 encoding scheme can be formally
specified as
( B( t), INC( t)) = ( B( t − 1), 1)   if t > 0 and b( t) = b( t − 1) + S
( B( t), INC( t)) = ( b( t), 0)       otherwise

Fig. 8.12 T0 encoder and decoder. CLK clock signal, MUX multiplexer, INC increment
where B( t) is the value on the encoded bus lines at time t, INC( t) is the redundant
bus line value at time t, b( t) is the address value at time t, and S is the stride between
two successive addresses. The encoder and decoder for T0 encoding are shown in
Fig. 8.12.
The T0 code provides the zero-transition property for infinite streams of
consecutive addresses. As a result, it outperforms Gray coding, because Gray coding
still performs one bit-line transition per pair of consecutive addresses. On average,
a 35 % reduction in address bus switching activity is achieved by this encoding
scheme.
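The two cases of the T0 specification can be sketched as follows (the stride S defaults to 1 here for simplicity):

```python
def t0_encode(addr, prev_addr, prev_bus, stride=1):
    # Consecutive address: freeze the bus lines and raise INC
    if prev_addr is not None and addr == prev_addr + stride:
        return prev_bus, 1
    return addr, 0                 # non-consecutive: send the raw address

def t0_decode(bus, inc, prev_decoded, stride=1):
    # INC = 1: the receiver computes the next address itself
    return prev_decoded + stride if inc else bus
```

For a run of consecutive addresses, the bus lines change only on the first transfer and stay frozen afterwards, which is the zero-transition property described above.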
8.5 Clock Gating
Fig. 8.14 Clock-gating mechanism. EN enable, CLK global clock, CLKG gated clock
Clock gating (CG) is based on the observation that a large portion of the clock
signal transitions are unnecessary and can therefore be suppressed to reduce power
dissipation without affecting the functionality of the circuit. CG involves
dynamically preventing the clock from propagating to some portions of the clock path
under a certain condition computed by additional circuits, as shown in Fig. 8.13. In
this case, only leakage power dissipation takes place when a circuit is clock gated.
The reduction in dynamic power dissipation takes place due to the reduction of the
switched capacitance in the clock network and of the switching activity in the logic
block fed by the clock-gated storage elements.
8.5.1 CG Circuits
Fig. 8.15 a Clock gating using AND gate, b clock gating using OR gate, c glitch propagation
through the AND gate, and d glitch propagation through the OR gate. EN enable, CLK global
clock, CLKG gated clock
Fig. 8.16 a Clock gating using a level-sensitive, low-active latch along with an AND gate and
b clock gating using a level-sensitive, low-active latch along with an OR gate. EN enable, CLK
global clock, CLKG gated clock
generated by the Fcg cannot propagate through the OR gate as shown in Fig. 8.15d.
The glitches generated by the CG function Fcg can be filtered by using a level-
sensitive, low-active latch and an AND gate as shown in Fig. 8.16a. This method
freezes the latch output at the rising edge of the clock (CLK) and ensures that the
EN signal at the input of the AND gate is stable while the clock is high. This also
allows the entire clock period for propagating the enable (EN) signal to the latch,
thereby providing a more flexible timing constraint. Similarly, a level-sensitive,
low-active latch along with an OR gate can be used for better glitch suppression as
shown in Fig. 8.16b.
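The behaviour of the latch-based gate of Fig. 8.16a can be mimicked with a cycle-by-cycle sketch; the signal traces here are illustrative:

```python
def latch_and_gate(clk_trace, en_trace):
    """Latch-based clock gating: a low-active latch holds EN while
    CLK is high, so EN glitches cannot reach the AND-gate output."""
    gated, latched = [], 0
    for clk, en in zip(clk_trace, en_trace):
        if clk == 0:                      # latch transparent while clock is low
            latched = en
        gated.append(clk & latched)       # AND gate produces the gated clock
    return gated
```

In the test trace below, EN changes while CLK is high, yet the gated clock shows no glitch: the change only takes effect from the next low phase of CLK.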
Fig. 8.17 Clock gating the register file of a processor. EN enable, CLK global clock, CLKG gated
clock, ALU arithmetic logic unit
8.5.2 CG Granularity
Fig. 8.18 a Synchronous load-enabled register bank and b clock-gated version of the register
bank. EN enable, CLK global clock, CLKG gated clock, MUX multiplexer
value (D_in) is loaded into it or not. The clock-gated version of the same register
bank is shown in Fig. 8.18b. In this clock-gated version, the register does not
receive the clock in the cycles in which no new data are to be loaded. The
elimination of the multiplexer circuit (MUX) from the clock-gated version also saves
power. As the CG circuit has its own associated power dissipation, it is not
cost-effective to gate the clock of a single-bit register. However, the penalty of
the CG circuit can be amortized over a large number of registers, saving the
flip-flop clocking power and the power of all the multiplexers with a single CG
circuit.
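A rough accounting sketch shows why amortizing one CG cell over a wide register bank pays off; the activity trace below is illustrative:

```python
def clock_edges_seen(en_trace, width, gated):
    """Count flip-flop clock events for a width-bit register bank."""
    edges = 0
    for en in en_trace:
        if gated and not en:
            continue              # gated clock is silent in idle cycles
        edges += width            # every flop in the bank sees the edge
    return edges

en_trace = [1, 0, 0, 1, 0, 0, 0, 0, 1, 0]   # bank loaded in 3 of 10 cycles
```

For a 32-bit bank loaded in 3 of 10 cycles, gating cuts the flop clock events from 320 to 96, while the cost of the single CG cell is independent of the bank width.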
The power saving per clock gate in register-level CG is much less compared to
the module-level CG, but there is much more scope for shutting off the clock com-
pared to module-level CG. It also lends itself to automated insertion of CG in the
design and can result in massively clock-gated designs.
Cell-Level Gating The cell designer introduces cell-level CG by incorporating the CG
circuit as part of a cell, thereby removing the burden from the circuit designer. For
example, a register bank can be designed such that it receives the clock only when
new data need to be loaded. Similarly, memory banks can be clocked only during active
memory access cycles. Although this approach involves no flow issues and the design
is simple, it may not be efficient in terms of area and power: there is an area
overhead, all registers have to be predesigned with the inbuilt CG feature, and it
does not allow sharing of the CG logic among many registers.
CG Challenges Although CG helps to reduce dynamic power dissipation, it intro-
duces several challenges in the application-specific integrated circuit (ASIC) design
flow. Some of the important issues are as follows:
• Clock latency
• Effect of clock skew
• Clock tree synthesis
• Physical CG
• Testability concerns
Fig. 8.20 Gated-clock version of the finite-state machine. PI primary input, PO primary output, PS present state, NS next state, EN enable, CLK clock, CLKG gated clock
8.6 Gated-Clock FSMs
In digital systems, FSMs are very common components. They are widely used for a
variety of purposes, such as sequence generators, sequence detectors, controllers of
data paths, etc. The basic structure of an FSM is shown in Fig. 8.19. It has a
combinational circuit that computes the next state (NS) as a function of the primary
input (PI) and the present state (PS). It also computes the primary output (PO) as a
function of the PS and PI for Mealy machines; the PO is a function of only the PS for
a Moore machine. An FSM can be characterized by a quintuple, M = ( PI, PO, S, δ, λ),
where PI is the finite nonempty set of inputs, PO is the finite nonempty set of
outputs, S is the finite nonempty set of states, δ is the state-transition function
( δ: S × PI → S), and λ is the output function ( λ: S × PI → PO for a Mealy machine,
λ: S → PO for a Moore machine). For a given FSM, there are situations in which the NS
and output values do not change. If these situations can be identified, the clock
network can be shut off, thereby saving unnecessary power dissipation. This is the
basic idea of CG. As shown in Fig. 8.20, an activation function Fcg is synthesized
that evaluates to logic 1 when the clock needs to be shut off. An automated
gated-clock synthesis approach for Moore-type FSMs was reported in [13]. The
procedure for synthesizing Fcg is based on the observation that, for a Moore FSM, a
self-loop in the state-transition graph represents an idle condition during which the
clock to the FSM can be stopped. The self-loop for each state
Fig. 8.21 State-transition diagram of the example FSM, with transitions governed by the START, ZERO, TC_SET, and TC_EN signals
Si can be defined as Self_Si: PI → {0, 1}, such that Self_Si( x) = 1 iff
δ( x, Si) = Si, x ∈ PI. The activation function can be written as
Fcg = ∑ Self_Si · Xi, where Xi is the decoded state variable corresponding to state
Si, i.e., Xi is 1 if and only if the FSM is in state Si.
The circuitry realizing the activation function to identify these idle conditions
may involve a large area, delay, and power overhead. Here, the challenge is to
identify a subset of the idle conditions at a significantly lower overhead such that
the complexity of the implementation of Fcg is reduced. The gated-clock FSM
implementation is illustrated with the help of Example 8.4.
Example 8.4 The state-transition diagram of an FSM is shown in Fig. 8.21. The
FSM interacts with a timer-counter section, which generates a delay of thousands
of cycles, and at the end of timeout (indicated by the ZERO signal), a short but
complex operation is performed in the DO_IT state of the machine. The clock can
be shut off during the long countdown time of the timer-counter (WAIT state of the
machine). The idle situation can be easily identified by the signals TC_SET = 0 and
TC_EN = 1. Fcg can be realized accordingly as shown in Fig. 8.22.
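For Example 8.4 the activation function reduces to a two-input expression; a minimal sketch:

```python
def fcg(tc_set, tc_en):
    # Clock is shut off (Fcg = 1) while the FSM idles in WAIT,
    # identified by TC_SET = 0 and TC_EN = 1
    return int((not tc_set) and tc_en)

def gated_clock(clk, tc_set, tc_en):
    # AND-style gating: the FSM receives CLK only when Fcg = 0
    return clk & (1 - fcg(tc_set, tc_en))
```

During the long countdown, Fcg stays at 1 and the FSM receives no clock edges; when ZERO arrives and the idle condition ends, Fcg drops to 0 and clocking resumes.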
8.7 FSM State Encoding

State assignment is an important step in the synthesis of an FSM [3]. In this step,
each state is given a unique code. State assignment strongly influences the complex-
ity of the combinational logic part of the FSM. Traditionally, the state assignment
has been used to optimize the area and delay of the circuit. It can also be used to
reduce the switching activity for the reduction of the dynamic power. Given the
state-transition diagram of a machine, the state assignment can be done such that the
objective function γ, as specified below, is minimized.
γ = ∑ pij wij over all transitions,

where pij is the probability of transition from state Si to state Sj and wij is the
corresponding weight representing the activity factor.
This is illustrated with the help of the following two examples:
Example 8.6 Consider the state-transition diagram of a sequence detector that pro-
duces “1” when five 1’s appear sequentially at the input, as shown in Fig. 8.24. In
this case, the probability of each transition is 0.5 and weight wij is the Hamming
Fig. 8.24 State-transition diagram of the sequence detector
distance between the codes of Si and Sj. The objective function is computed for
assignment-1 as well as for assignment-2 (Table 8.5). The objective function for
assignment-1 is 10, whereas that for assignment-2 is 5.5. Therefore, the FSM
implementation using assignment-2 will have less switching activity than that using
assignment-1.
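The objective function γ can be evaluated mechanically. The transition list and state codes below are illustrative, not the entries of Table 8.5:

```python
def gamma(transitions, codes):
    """gamma = sum of p_ij * w_ij, with w_ij taken as the Hamming
    distance between the codes of S_i and S_j."""
    return sum(p * bin(codes[si] ^ codes[sj]).count("1")
               for si, sj, p in transitions)

# Two hypothetical assignments for a toy two-state machine
transitions = [("A", "B", 0.5), ("B", "A", 0.5)]
assignment_1 = {"A": 0b00, "B": 0b11}   # distance-2 codes
assignment_2 = {"A": 0b00, "B": 0b01}   # adjacent codes
```

Even in this toy case, choosing adjacent codes halves γ, which is exactly the effect exploited by the low-power state assignment of Example 8.6.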
8.8 FSM Partitioning
The idea is to decompose a large FSM into several smaller FSMs with smaller numbers
of state registers and smaller combinational blocks. Only the active FSM receives the
clock and switching inputs; the others are idle and consume no dynamic power.
Consider a large FSM that includes a small subroutine, as shown in Fig. 8.25a. The
FSM is decomposed into two FSMs as shown in Fig. 8.25b. Here, two additional wait
states, SW22 and TW0, have been added between the entry and exit points of the two
FSMs. In this case, when one of the FSMs is operating, the other one is in its wait
state. When an FSM is in the wait state, its inputs and clock can be gated to
minimize switched capacitance. There can be significant savings in switched
capacitance if it is possible to isolate a very small subset of states in which the
initial FSM remains most of the time.
Fig. 8.25 a An example finite-state machine FSM and b decomposed FSM into two FSMs
Fig. 8.26 a An example circuit and b operand isolation. CLK clock signal, AS activation signal
8.9 Operand Isolation
8.10 Precomputation
Fig. 8.30 a Glitch generated due to finite delay of the gates, b cascaded realization of a circuit
with high glitching activity, and c tree realization to reduce glitching activity
Example 8.7 A simple and realistic example is an n-bit digital comparator. In this
case, the most significant bits are sufficient to predict the output when they are
not equal. Therefore, the predictor functions can be realized using An and Bn, the
most significant bits. The expressions for the predictor functions are
g1 = An · Bn′ and g2 = An′ · Bn. When either of the predictor functions evaluates to
1, all inputs to the comparator except the most significant bits can be disabled. The
realization of the precomputation circuit is shown in Fig. 8.29. In this example,
disabling of only a subset of the inputs takes place.
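The predictor logic of Example 8.7 can be checked with a sketch:

```python
from itertools import product

def precomputed_greater(a, b, n):
    """n-bit A > B comparator with MSB-based precomputation."""
    msb = 1 << (n - 1)
    an, bn = bool(a & msb), bool(b & msb)
    g1 = an and not bn          # g1 = An . Bn'
    g2 = bn and not an          # g2 = An'. Bn
    if g1 or g2:                # MSBs differ: result is already known and the
        return g1               # remaining input registers stay disabled
    return a > b                # MSBs equal: full comparison is needed
```

For uniformly random inputs the MSBs differ half of the time, so on average half of the comparisons never clock the lower-order input registers at all.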
turn, leads to a low-level output of one gate-delay duration after two gate delays. A
node, internal or output, can have multiple transitions inside a single clock cycle
before settling down to its final value. These spurious transitions are known as
glitches [2]. Glitches are due to converging combinational paths with different
propagation delays. The number of these undesired transitions depends on the logic
depth, the signal skew of different paths, and the input signal pattern. At the logic
synthesis phase, selective collapsing and logic decomposition can be carried out
before technology mapping to minimize the level difference between the inputs of the
nodes driving high-capacitive nodes. Alternatively, delay insertion and pin
reordering can be done after technology mapping to make the delays of all the paths
equal. Here, the issue is to insert the minimum number of delay elements to achieve
the maximum reduction in glitching activity. On highly loaded nodes, buffers can be
inserted to balance delays, and cascaded implementations can be avoided, if possible,
to minimize glitching power. Figure 8.30b shows a cascaded realization, which is
prone to high glitching activity compared to the tree realization of Fig. 8.30c that
balances the path delays.
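A unit-delay sketch of the classic reconvergent pair A · A′ shows how unequal path delays create a glitch; the waveform is illustrative:

```python
def and_of_a_and_delayed_not_a(a_wave):
    """Unit-delay model: O(t) = a(t) AND NOT a(t-1).
    Logically O should be constant 0, yet a rising edge on `a`
    produces a one-cycle glitch because the inverted path is slower."""
    out, not_a_prev = [], 1      # NOT of the assumed initial a = 0
    for a in a_wave:
        out.append(a & not_a_prev)
        not_a_prev = 1 - a
    return out
```

The direct path delivers the new value of a one gate delay before the inverter does, so both AND inputs are briefly high: a spurious transition that dissipates switching power without carrying information.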
Fig. 8.31 a Generic structure of a static CMOS gate with a pull-up p-block and a pull-down n-block and b realization of f = A + B · C
static CMOS and dynamic CMOS styles. To compare the three logic styles on an equal
basis, it is necessary to develop efficient tools and techniques for logic synthesis
using dynamic CMOS and PTL styles, such as already exist for the static CMOS style,
and then compare the performance of the three logic styles.
Based on this motivation, efficient logic synthesis tools have been developed for
PTL circuits [22] and dynamic CMOS [21]. This section reports a comparative study of
these logic styles in terms of area, delay, and power with the help of a large number
of International Symposium on Circuits and Systems (ISCAS) benchmark circuits, each
having a large number of inputs and outputs. An overview of the efficient logic
synthesis techniques reported in [22] and [21] is given in Sects. 8.10.4 and 8.10.5.
Section 8.4 presents the implementation details. Finally, experimental results based
on the realization of ISCAS benchmark circuits are given in Sect. 8.5.
The static CMOS or full complementary circuits require two separate transistor
networks: a pull-up pMOS network and a pull-down nMOS network, as shown in
Fig. 8.31a, for the realization of digital circuits. The realization of the function
f = A + B ⋅ C is shown in Fig. 8.31b. Both the nMOS and pMOS networks are, in
general, series–parallel combinations of MOS transistors. In realizing complex
functions using full complementary CMOS logic, it is necessary to limit the number of
MOS transistors in both the pull-up network (PUN) and the pull-down network (PDN)
(the limit is in the range of four to six transistors), so that the delay remains
within acceptable limits.
Advantages of Static CMOS Logic
• Ease of fabrication
• Availability of matured logic synthesis tools and techniques
• Good noise margin
Fig. 8.32 Dynamic complementary metal–oxide–semiconductor ( CMOS) gate with a n-block and
b p-block
Dynamic CMOS circuits are realized based on pre-charge logic. There are two basic
configurations of a dynamic CMOS circuit, as shown in Fig. 8.32a and b. In the first
case, an n-block is used as the PDN, as shown in Fig. 8.32a. Here, the output is
charged to “1” in the pre-charge phase; in the evaluation phase, the output
discharges to “0” through the PDN if there is a discharge path for the applied input
combination. Otherwise, the output maintains the “1” state. In the second case, a
p-block is used as the PUN, as shown in Fig. 8.32b. In the pre-charge phase, the
output is discharged to the “0” level; in the evaluation phase, it is charged to the
“1” level through the PUN if there is a charging path for the applied input
combination. Otherwise, the output remains at the “0” level.
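The two phases of the n-block configuration can be sketched as follows; the pull-down function chosen here realizes the complement of f = A + B · C from Fig. 8.31:

```python
def dynamic_n_gate(pdn_conducts, phase):
    """Dynamic CMOS gate with an n-block PDN.
    phase = 'precharge': output node is charged to 1 regardless of inputs.
    phase = 'evaluate' : node discharges to 0 iff the PDN conducts."""
    if phase == "precharge":
        return 1
    return 0 if pdn_conducts else 1

# PDN conducts when A or (B and C) is true, so the dynamic node
# carries the complement (A + B.C)'
pdn = lambda a, b, c: a or (b and c)
```

The dynamic node itself produces the inverted function; a domino stage adds a static inverter after the node to recover f and to make the output monotonically rising during evaluation.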
8.12.3 PTL
There are various members of the PTL family, such as complementary pass-transistor
logic ( CPL), swing-restored pass-transistor logic ( SRPL), double pass-transistor
logic ( DPL), and single-rail pass-transistor logic ( LEAP). Among the different PTL
families, LEAP cells based on single-rail PTL require the minimum number of
transistors. All the PTL families except SRPL provide good driving capability using
inverters; SRPL also does not provide I/O decoupling. Additional inverters and weak
pMOS transistors for swing restoration are required for all the PTL styles except
DPL, which does not require swing restoration. We have used three LEAP-like cells for
the technology mapping stage of logic synthesis.
Advantages of PTL
• Lower area due to a smaller number of transistors and smaller input loads.
• As PTL is ratioless, minimum-dimension transistors can be used. This makes
pass-transistor circuit realization very area efficient.
• No short-circuit current and leakage current, leading to lower power dissipation.
Disadvantages of PTL
• When a signal is steered through several stages of pass transistors, the delay can
be considerable.
• There is a voltage drop as a signal is steered through nMOS transistors. To
overcome this problem, it is necessary to use swing-restoration logic at the gate
output.
• Pass-transistor structures require complementary control signals. Dual-rail logic
is usually necessary to provide all signals in complementary form.
• Double intercell wiring increases the wiring complexity and the capacitance by a
considerable amount.
• There is a possibility of sneak paths.
Efficient techniques for the logic synthesis using dynamic CMOS and PTL have
been developed. An overview of the existing approaches is given in the following
section.
In recent years, some works [18, 19, 21, 24] have been reported on the synthesis of
dynamic CMOS circuits. In [19], an approach to perform technology-independent
optimization and technology mapping for synthesizing dynamic CMOS circuits using
domino cells has been proposed. This approach assumes a circuit in multilevel
decomposed form, converts the circuit into a unate one using the standard
bubble-pushing algorithm, and then optimizes the unate circuit using a unate
optimization procedure based on extraction/factoring, substitution, elimination, and
don't-care optimization. For technology mapping, the approach follows the general
tree-by-tree mapping approach without using any standard cell library. The approach
proposed in [19] has been found to give good results in terms of delay and area.
However, it adopts the computationally expensive bubble-pushing algorithm and is not
applicable to large circuits. Another approach has been reported in [24] to
synthesize domino circuits. This approach starts with a circuit optimized by an
output phase assignment technique. The optimized circuit is decomposed into a forest
of trees to perform technology mapping. The approach proposed on-the-fly domino cell
mapping, satisfying the width (the number of transistors in parallel) and length (the
number of transistors in series) constraints. Further, the approach proposed
multi-output domino cells and a dynamic programming framework to reduce the total
number of cells to be used. According to the results reported in [24], some circuits
have an area overhead; this is because of the logic replication needed to obtain
inverters, as necessary. However, there is no reported work on NORA logic synthesis.
A different approach has been adopted to synthesize dynamic CMOS circuits,
based on the two-level unate decomposition of logic functions [18]. The two-level
unate decomposition of a switching function f can be expressed as f = ∑i Pi·Ni,
where Pi is a positive unate function (all literals are in non-complemented form)
and Ni is a negative unate function (all literals are in complemented form). The
generic structures of dynamic CMOS circuits based on two-level decomposition
with domino logic and NORA logic are shown in Fig. 8.35a and b, respectively.
The keeper transistors are not shown for simplicity. Details of the algorithm to
obtain the two-level unate decomposition of a function can be found in [18].
The domino and NORA circuits based on this unate decomposition are shown in
Fig. 8.36a and b, respectively. As stated earlier, the dynamic NORA circuit
is completely inverter-free.
f1 = ∑ (3, 5, 7, 8, 11, 12, 14, 15)
Following the algorithm in [21], the two-level unate decomposition of the function
can be given as
f1 = (x4 + x3 x1)·(x̄4 + x̄2 x̄1) + (x2 x1 + x4 x3 x2)·1
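The decomposition can be checked mechanically against the minterm list. The short sketch below (illustrative Python, not part of the original text) evaluates the product-of-unate-factors form over all 16 input combinations; the overbars on the negative-unate literals, which are easily lost in print, are restored here as an assumption.

```python
# Check the two-level unate decomposition of f1 = sum(3,5,7,8,11,12,14,15).
# Variable order: x4 is the MSB, x1 the LSB of the minterm index.
# P-factors are positive unate, N-factors negative unate (complemented literals).

def f1_decomposed(x4, x3, x2, x1):
    P1 = x4 | (x3 & x1)                      # positive unate factor
    N1 = (1 - x4) | ((1 - x2) & (1 - x1))    # negative unate factor
    P2 = (x2 & x1) | (x4 & x3 & x2)          # positive unate factor; N2 = 1
    return (P1 & N1) | P2

minterms = {3, 5, 7, 8, 11, 12, 14, 15}
computed = {m for m in range(16)
            if f1_decomposed((m >> 3) & 1, (m >> 2) & 1, (m >> 1) & 1, m & 1)}
assert computed == minterms
print("decomposition covers exactly:", sorted(computed))
```

Running the check confirms that the two-level form realizes exactly the listed minterms.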
244 8 Switched Capacitance Minimization
Fig. 8.35 Dynamic CMOS circuits based on two-level unate decomposition: a domino CMOS
circuit and b NORA CMOS circuit
Fig. 8.36 Realization of dynamic circuits for f1 a using domino logic and b using NORA logic
As the number of transistors in series increases (the case when a product term
consists of a large number of variables), the on-resistance of the gate increases,
thereby affecting the delay of the gate. An increase in the width, i.e., the number of
transistors in parallel (the case when there are a large number of product terms in
the logic), proportionally increases the drain capacitance of the gate, which in turn
increases the delay as well as the dynamic power consumption. It is known that the
delay of a gate increases drastically when there are more than four transistors in
series. In order to reduce the delay, there should be at most four transistors in
series. Similarly, the number of transistors in parallel is restricted to five in order to
limit the delay as well as the dynamic power consumption due to the load capacitance.
To satisfy the length and width constraints, another decomposition step is proposed,
where each node in the two-level decomposed circuit is further broken down into
intermediate nodes till each node satisfies the length and width constraints. This
decomposition is termed cell-based multilevel decomposition. To minimize the
number of nodes, all kernels are extracted and then kernel-based decomposition is
performed. After the kernel-based decomposition is over, each node is examined to
check whether it satisfies the length and width constraints. If the constraints are not
satisfied, we carry out the cell-based multilevel decomposition, which is illustrated
in Example 8.9. This cell-based multilevel decomposition accomplishes the task of
technology mapping. It may be noted that this technology mapping is not based on
any library of standard gates; instead, it generates cells "on the fly", satisfying the
length and width constraints.
Further assume that the length and width constraints for our technology are 3 and
4, respectively. After the decomposition satisfying the length and width constraints,
we can obtain a netlist as shown below:
A = x1 x2 x3
B = x2 x5 x6
C = x9 x10 x11
D = x6 x7 x8 + x1 x9 + x2 x7 + x4 x8 x9
f2 = A·x4 x12 + D + B·C + x1 x8 x11
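As a quick sanity check, the netlist above can be verified programmatically against the assumed constraints (length ≤ 3, width ≤ 4). The sketch below uses an illustrative encoding of each node as a list of product terms; the encoding itself is not from the text.

```python
# Check each node of the decomposed netlist against the stated constraints:
# length = max literals per product term (series transistors) <= 3,
# width  = number of product terms (parallel branches)        <= 4.
netlist = {
    "A":  [["x1", "x2", "x3"]],
    "B":  [["x2", "x5", "x6"]],
    "C":  [["x9", "x10", "x11"]],
    "D":  [["x6", "x7", "x8"], ["x1", "x9"], ["x2", "x7"], ["x4", "x8", "x9"]],
    "f2": [["A", "x4", "x12"], ["D"], ["B", "C"], ["x1", "x8", "x11"]],
}

LENGTH, WIDTH = 3, 4
for name, terms in netlist.items():
    length = max(len(t) for t in terms)   # longest series chain
    width = len(terms)                    # number of parallel branches
    assert length <= LENGTH and width <= WIDTH, name
    print(f"{name}: length={length}, width={width}  OK")
```

Every node satisfies both bounds, so each can be realized as a single dynamic cell.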
Domino circuits corresponding to this netlist can be readily obtained. On the other
hand, for NORA circuits, we need a decomposition into an odd number of levels, so
that alternate n-block, p-block, and n-block (for the PDN of NORA) or p-block, n-block,
and p-block (for the PUN of NORA) structures can be realized. As the number of
levels in the current example is two, we can perform one more decomposition at the
final level:

f21 = A·x4 x12 + D
f22 = B·C + x1 x8 x11
f2 = f21 + f22
Figure 8.37 shows the various steps involved in our synthesis procedure. We assume
that input circuits are in the Berkeley logic interchange format (BLIF). The
first step in our approach is partitioning. This step is required because the unate
decomposition algorithm is based on the minterm table, and it has been observed that if
the number of input variables is not more than 15, then the runtime of the algorithm
lies within an acceptable range. To ensure this, the input circuit graph is partitioned
into a number of partitions such that the number of inputs of a partition
does not exceed the upper limit of 15. The partitioning is done in two phases, as shown in
Fig. 8.37. In the first phase, an initial partition is created, and in the second phase,
the initial partition is refined to reduce the total number of partitions. To create the
initial partition, the entire graph is traversed in breadth-first fashion, and based on
the maximum size (number of inputs) of a partition, the circuit graph is partitioned
into smaller subgraphs, as shown in Fig. 8.38b. The first phase partitions the input
graph into variable-sized partitions. We usually end up with some partitions that are
too small, which can be merged with other partitions to form a partition with a
reasonably larger number of inputs. For example, as shown in Fig. 8.38c, a node
in the partition Px can be moved to Py and a node in partition Py can be moved to
Pz without violating the size limits of partitions Py and Pz. The second phase allows
us to merge or refine those partitions. We move a node from its current position to
Fig. 8.37 Steps of the synthesis procedure: input circuit → partitioning → unate decomposition → technology mapping → synthesized circuit
the new one if the number of inputs to the new partition does not exceed the upper
limit on the partition size. This is similar to the refinement phase in the standard
Fiduccia–Mattheyses (FM) bipartitioning algorithm [14]. It has been observed that,
by adopting this divide-and-conquer approach, it is possible to handle VLSI circuits
involving hundreds of input variables.
In the earlier works, binary decision diagrams (BDDs) of the complete functions,
called monolithic BDDs, were constructed, and then technology mapping onto PTL
cells was performed. The problem with the monolithic BDD-based approach is that
the size of monolithic BDDs grows exponentially with the number of input variables
of the function, making the approach unsuitable for functions with a large number
of variables. The decomposed BDD approach has been proposed to overcome this
problem. In the decomposed BDD approach, instead of creating a single big monolithic
BDD, compact BDDs in terms of intermediate variables are constructed. These
smaller BDDs can be optimized with ease, and the approach allows synthesis of
any arbitrary function with a large number of variables. Liu et al. [17] and Chaudhury
et al. [10] have proposed performance-oriented and area-oriented optimization
techniques based on decomposed BDDs. Here, decomposed BDDs have been
realized using LEAP-like cells in the PTL circuit synthesis approach. In addition to
using decomposed BDDs, we have used a new heuristic, called ratio parameters (RP)
[18], for optimal variable ordering. The RP is purely a functional property, which
can be obtained from the completely specified minterm table consisting of only the
true vectors of Boolean functions. The RP can be defined as follows:
For an n-variable function f, the set of parameters Nn/Dn, Nn−1/Dn−1, …, N1/D1
is called the RP of f, where Ni and Di are the numbers of 1's and 0's, respectively, in
the xi column of the minterm table of f.
Example 8.10 Consider a function f3 = ∑ (0, 1, 4, 6, 8, 10, 11). The RP of f3 can be
obtained as shown in Table 8.6.
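Since the definition is purely counting-based, the RP of f3 can also be computed directly from its minterm list. The sketch below is illustrative; taking x4 as the most significant bit of the minterm index is an assumption of the sketch.

```python
# Ratio parameters (RP) of f3 = sum(0,1,4,6,8,10,11), computed from the
# definition: Ni (Di) is the number of 1's (0's) in the xi column of the
# minterm table, taken over the true vectors only. Assumes x4 = MSB.

minterms = [0, 1, 4, 6, 8, 10, 11]
n = 4
rp = {}
for i in range(n, 0, -1):                     # columns x4 down to x1
    Ni = sum((m >> (i - 1)) & 1 for m in minterms)
    Di = len(minterms) - Ni
    rp[i] = (Ni, Di)
    print(f"x{i}: N{i}/D{i} = {Ni}/{Di}")
```

With this bit convention the sketch yields N4/D4 = 3/4, N3/D3 = 2/5, N2/D2 = 3/4, and N1/D1 = 2/5.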
To explain the working methodology of the RP-based heuristic, let us consider
a Boolean function f with n Boolean variables x1, x2, x3, …, xi, …, xn. The selection
of a variable at any stage is governed by three cases:
Case 1: Ni (Di) = 2^(n−1) or 0 for a variable xi in f. Then Shannon's expansion
[22] will be:
(a) f = xi·C + x̄i·f1, if Ni = 2^(n−1) or 0
or (b) f = xi·f1 + x̄i·C, if Di = 2^(n−1) or 0,
where C = 1 (0) if Ni (Di) = 2^(n−1) (0).
Case 2: Ni (Di) = 2^(n−2), and for some j ≠ i, Nj (Dj) ≥ 2^(n−2), with simultaneous
occurrence of xi·xj equal to 2^(n−2). Shannon's expansion in this case will look like:
(a) f = xi·xj + x̄i·f2, if Ni = 2^(n−2), Nj ≥ 2^(n−2)
or (b) f = x̄i·xj + xi·f2, if Di = 2^(n−2), Nj ≥ 2^(n−2)
or (c) f = xi·x̄j + x̄i·f2, if Ni = 2^(n−2), Dj ≥ 2^(n−2)
or (d) f = x̄i·x̄j + xi·f2, if Di = 2^(n−2), Dj ≥ 2^(n−2)
Case 3: There is a k-group xi1 xi2 xi3 … xik, k ≥ 3. The BDD can be obtained by
cofactoring f with respect to xi1, xi2, …, xik, i.e.,
f = xi1·xi2·…·xik·f1 + xi1·xi2·…·x̄ik·f2 + … + x̄i1·x̄i2·…·x̄ik·fm, where m = 2^k
Note that case 3 is the generalization of case 2.
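The cofactoring step can be sketched as follows (illustrative Python, not from the text; for brevity the demonstration cofactors f3 on a 2-variable group, although the text requires k ≥ 3 for a k-group):

```python
# Cofactor a function, given as a set of true minterms, on a group of
# variables: each assignment of the group variables selects one of the
# 2^k subfunctions of the expansion. Variable i maps to bit (i - 1).

def cofactor_expand(minterms, group):
    """Split true minterms by the assignment of the group variables."""
    sub = {}
    for m in minterms:
        key = tuple((m >> (i - 1)) & 1 for i in group)
        sub.setdefault(key, set()).add(m)
    return sub  # one subfunction per group assignment

f3 = {0, 1, 4, 6, 8, 10, 11}
parts = cofactor_expand(f3, group=[4, 2])   # cofactor on x4 and x2
for assign, mins in sorted(parts.items()):
    print(f"x4={assign[0]}, x2={assign[1]} ->", sorted(mins))
```

Each printed line corresponds to one subfunction fi of the expansion; assignments that select no true minterm would simply yield the constant-0 subfunction.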
The RP-based heuristic helps in obtaining BDDs with a smaller number of nodes.
However, BDDs generated by this approach may not have the same ordering of
variables at the same level along different paths. In other words, we propose reduced
unordered BDDs (RUBDDs), in contrast to the reduced ordered BDDs (ROBDDs)
commonly used in the existing approaches.
There are three major steps in our approach of synthesizing PTL circuits:
• Decompose large circuits into a number of smaller subcircuits. We term each
decomposed subcircuit a Boolean node.
• For each Boolean node in the decomposed graph, create a BDD using the RP
heuristic.
• Map the entire circuit with the PTL cells.
A circuit in BLIF is itself in a multilevel decomposed form, where most components
of the network are, in general, very simple gates. It is overkill to apply
PTL circuit synthesis at this level, since the BDD representation of such a small gate
is seldom advantageous. We, therefore, propose merging smaller components into
a set of super Boolean nodes using the removal of reconvergent structures and
partial collapsing.
The second step in our algorithm is to construct a BDD for each node in the decomposed
circuit obtained in the previous step. We propose a heuristic based on RP to
select a variable for the expansion of the function at the current level. The heuristic is to
select a variable at the current stage in such a way that the maximum number of closed
inputs is available at a minimum path length from the current stage.
After the construction of a BDD for each Boolean node in the network, we
map the BDDs to the PTL cell library. Combining all the steps mentioned above,
the algorithm ptlRP for the realization of PTL circuits is outlined below:
Algorithm ptlRP
1. Decompose the input circuit into a number of components of moderate sizes.
2. Extract all reconvergent structures in the circuit and merge each reconvergent structure into a single node.
3. Perform partial collapsing of a node into its fanouts if the cost gain of the partial collapsing is above a certain threshold value.
4. Repeat Step 3 until there is no partial collapsing with the cost gain above the threshold value.
5. Create a BDD for each partition in the decomposed graph:
   (a) Let the function f representing the current node be an n-input function with Boolean variables x1, x2, …, xi, …, xn.
   (b) Compute RPs for each Boolean variable xi in f.
   (c) Case 1: If there is a variable xi for which Ni (or Di) is equal to 2^(n−1) or 0, then select the variable for expansion.
   (d) Case 2: If there is a variable xi whose Ni (or Di) is equal to 2^(n−2), and there is another variable xj whose Nj (or Dj) is equal to or greater than 2^(n−2), and the simultaneous occurrence of the variables xi and xj is equal to 2^(n−2), then select the variable xi for expansion with xj as one of the inputs of the binary node.
   (e) Case 3: Find the smallest k-group. Let its size be k, k ≥ 3. Select a variable from the k-group which gives the maximum number of closed inputs at the earliest stage. For each xi in the k-group, expand into a decision tree.
   (f) Repeat Steps (a) to (e) for each subfunction obtained, to get the full BDD.
6. Remove all redundant nodes in the BDDs so obtained:
   (a) Remove all duplicate nodes in the BDD.
   (b) Remove all duplicate terminal nodes in the BDD.
7. Map the BDDs to the PTL cell library:
   (a) Visit every node in the BDD in depth-first fashion.
   (b) Let the current node under visit be n.
   (c) If n is not already visited, then map n and its immediate successor, if any, covering them with the best suitable Y-cell.
   (d) Mark all nodes currently being covered as visited.
8. Stop.
In order to verify the performance of the three logic styles, we have realized static
CMOS, dynamic CMOS, and PTL circuits for a number of ISCAS benchmark circuits.
The implementation details are discussed in the following sections.
For the synthesis of the static CMOS circuits, we have employed the Berkeley
sequential interactive synthesis (SIS) software tool [6, 7, 21] to generate the netlist for a
given circuit. A particular benchmark circuit is first optimized with the script.rugged
script in SIS, and then gate mapping is done with the minimum-area option using the
standard library 44-1.genlib in SIS. In the output file generated by the SIS tool, the
circuit is described by equations employing NAND, NOR, and NOT operations. In
order to simplify the analysis, we have taken the first two types of gates with two to
four inputs. The mapped circuit is then translated into a bidirected graph using a
parser and a graph generator program.
The synthesis of dynamic CMOS circuits starts with partitioning an input circuit
in BLIF into a number of smaller subcircuits. Each subcircuit is then transformed
into PLA form (in terms of minterms), and disjoint decomposition is carried out
so that it contains nonoverlapping minterms only. After preprocessing the input, our tool
decomposes a Boolean function into a set of positive and negative unate subfunctions
based on the algorithm MUD [3]. It should be noted that algorithm MUD can handle a
single logic function at a time, whereas a circuit is, in general, multi-output in nature.
This problem has been sorted out by extracting one function at a time in a partition,
obtaining its unate decomposed version (in the form of a netlist), and storing the result
in a temporary file. This process is repeated for all functions in the partition, finally
giving the netlist for the partition. Our tool then performs cell-based multilevel
decomposition for the nodes in the netlist in order to satisfy the length and width
constraints. During the multilevel decomposition, the cell constraints are chosen as
length = 4 and width = 5. After the multilevel decomposition is over, our tool produces
the final netlist of the synthesized circuit.
To synthesize a PTL circuit, the circuit under test is optimized with the script.rugged
command using the Berkeley SIS tool [7]. A parser is written to store the optimized
circuit in multilevel decomposed form as a directed acyclic graph (DAG). The
graph is then transformed by replacing all reconvergent nodes [11] by their corresponding
single nodes. After that, partial collapsing [11] is performed to get Boolean
nodes of reasonably larger sizes. For each node in the graph so obtained, a BDD is
constructed with our algorithm ptlRP [11]. After the construction of the BDDs, we
map them to the PTL cell library [11].
8.12.6.4 Experimental Results
The switching power and delay of the realized static CMOS, dynamic CMOS, and
PTL circuits are estimated using the estimation models presented in [3, 10, 11], respectively.
The accuracy of these models has been verified with Spice and Cadence tools.
The values of the transistor parameters are extracted using the BSIM3v3 model, for a
0.18-μm process technology as a particular instance. In our experiment, we have
assumed an effective channel length of 0.18 μm for all transistors, channel widths
of 0.54 and 1.08 μm for nMOS and pMOS transistors, respectively, and
a gate-oxide thickness of 40 Å. Further, we have assumed 1.0 V as the supply
voltage (Vdd) and 0.2 V as the threshold voltage. The input transition probability for
each primary input (PI) is assumed to be 0.3. The operating temperature is assumed
to be 110 °C, and the operating frequency is assumed to be 100 MHz.
Table 8.7 Area, delay, and switching power in static CMOS, dynamic CMOS, and PTL circuits

             Static CMOS circuits        Dynamic CMOS circuits      PTL circuits
Benchmark    Area      Delay   Power     Area     Delay   Power     Area     Delay   Power
             (#Tr.)    (ns)    (μW)      (#Tr.)   (ns)    (μW)      (#Tr.)   (ns)    (μW)
(1)          (2)       (3)     (4)       (5)      (6)     (7)       (8)      (9)     (10)
C432         692       3.32    122       581      1.82    88        546      2.08    91
C499         1880      2.23    367       1506     1.69    248       1428     1.62    167
C880         1412      2.21    293       1249     1.79    166       988      1.18    267
C1355        1880      2.61    400       1603     1.51    280       1203     1.04    379
C1908        1756      2.91    367       1689     1.80    251       1088     1.57    298
C2670        1804      2.94    493       1584     1.74    395       1010     1.54    449
C3540        4214      4.57    409       2815     2.63    314       2782     2.58    294
C5315        7058      3.65    830       5970     2.69    515       5364     1.62    778
C6288        11,222    11.84   409       8716     4.64    504       6060     4.69    445
C7552        8214      2.99    1604      7328     1.98    1173      5682     1.66    1328
Average % reduction
compared to static CMOS circuits         −16      −37     −25       −33      −47     −17

CMOS complementary metal–oxide–semiconductor, PTL pass-transistor logic
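The "average % reduction" row can be re-derived from the tabulated entries. The sketch below (illustrative Python, not part of the original text) averages the per-benchmark percentage changes relative to static CMOS; the computed means agree with the reported row to within a percentage point of rounding.

```python
# Recompute the average percentage reductions of Table 8.7 relative to
# static CMOS. Each row: (static area, delay, power, dynamic area, delay,
# power, PTL area, delay, power), in the same units as the table.
rows = [
    (692, 3.32, 122,    581, 1.82, 88,     546, 2.08, 91),
    (1880, 2.23, 367,   1506, 1.69, 248,   1428, 1.62, 167),
    (1412, 2.21, 293,   1249, 1.79, 166,   988, 1.18, 267),
    (1880, 2.61, 400,   1603, 1.51, 280,   1203, 1.04, 379),
    (1756, 2.91, 367,   1689, 1.80, 251,   1088, 1.57, 298),
    (1804, 2.94, 493,   1584, 1.74, 395,   1010, 1.54, 449),
    (4214, 4.57, 409,   2815, 2.63, 314,   2782, 2.58, 294),
    (7058, 3.65, 830,   5970, 2.69, 515,   5364, 1.62, 778),
    (11222, 11.84, 409, 8716, 4.64, 504,   6060, 4.69, 445),
    (8214, 2.99, 1604,  7328, 1.98, 1173,  5682, 1.66, 1328),
]

def avg_reduction(col_style, col_static):
    """Mean percent change of a style's column relative to static CMOS."""
    pct = [100 * (r[col_style] - r[col_static]) / r[col_static] for r in rows]
    return sum(pct) / len(pct)

for name, base in (("dynamic", 3), ("PTL", 6)):
    area, delay, power = (avg_reduction(base + k, k) for k in range(3))
    print(f"{name}: area {area:+.0f}%, delay {delay:+.0f}%, power {power:+.0f}%")
```

Averaging the per-benchmark ratios (rather than the column totals) reproduces the table's row; the PTL power figure lands within one percentage point of the tabulated −17 %.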
The experimental results on the synthesis of circuits with the three different logic
styles are shown in Table 8.7. Only dynamic CMOS circuits using domino logic are
reported. In Table 8.7, column 1 lists the benchmark circuits.
The area of a circuit is approximated as the number of transistors; the area
requirements of the static CMOS, dynamic CMOS, and PTL circuits are shown in
column 2, column 5, and column 8, respectively. It is observed that, on average, the
dynamic CMOS and PTL circuits require 16 and 33 % less area, respectively, compared
to the static CMOS circuits. The area requirement is compared as shown in
Fig. 8.39.
The delays of the circuits with the three logic styles are shown in column 3 (static
CMOS), column 6 (dynamic CMOS), and column 9 (PTL). A comparison of the
delays among the circuits with the three logic styles is shown in Fig. 8.40. It is observed
that the dynamic CMOS and PTL circuits are 37 and 47 % faster, respectively,
compared to the static CMOS circuits.
The power dissipation (the sum of the switching power and leakage power) of the
circuits is shown in column 4, column 7, and column 10 for static CMOS, dynamic CMOS,
and PTL circuits, respectively. A comparison of the power dissipations among the three
circuits is shown in Fig. 8.41. We may observe that the dynamic CMOS and PTL circuits
dissipate 25 and 17 % less power, respectively, than the static CMOS circuits.
8.12.6.5 Conclusion
The primary goal of this section has been to compare the potential logic styles for
low-power synthesis. In the process, we have compared the area, power, and delay
of a number of ISCAS benchmark circuits. It is observed that the PTL style requires
Fig. 8.39 Area (#Transistor) for static CMOS, dynamic CMOS, and PTL circuit. CMOS comple-
mentary metal–oxide–semiconductor, PTL pass-transistor logic
Fig. 8.40 Delay for static CMOS, dynamic CMOS, and PTL circuits. CMOS complementary
metal–oxide–semiconductor, PTL pass-transistor logic
the minimum area and a smaller delay compared to the other two logic styles. The power
dissipations of the dynamic CMOS and PTL styles are comparable and are substantially
lower than that of the static CMOS circuits. Based on this study, we may conclude
that the dynamic CMOS and PTL styles are better suited for low-power and
high-performance applications.
Fig. 8.41 Power dissipation for static CMOS, dynamic CMOS, and PTL circuits. CMOS comple-
mentary metal–oxide–semiconductor, PTL pass-transistor logic
8.13 Some Related Techniques for Dynamic Power Reduction

Power reduction techniques are applied throughout the design process, from the system
level to the layout level, gradually refining or detailing the abstract specification or
model of the design. Power optimization approaches at different levels result in
different percentages of reduction. Approaches at the higher levels are significant,
since research results indicate that higher levels of abstraction have greater potential
for power reduction. Apart from the techniques mentioned in the preceding sections
to reduce dynamic power, there remain a few other methods that can also help to
reduce the power dissipation. These are mainly system-level and gate-level optimization
procedures that help in significant power reduction. These techniques are discussed below:
Operand Isolation Operand isolation as discussed earlier reduces dynamic power
dissipation in data path blocks controlled by an EN signal. Thus, when EN is inac-
tive, the data path inputs are disabled so that unnecessary switching power is not
wasted in the data path. Operand isolation is implemented automatically in synthe-
sis by enabling an attribute before the elaboration of the design. Operand isolation
logic is inserted during elaboration, and evaluated and committed during synthesis
based on power savings and timing. Figure 8.42 illustrates the concept and how it
contributes to the power savings.
In the digital system (Fig. 8.42), before operand isolation, register C uses the
result of the multiplier when the EN is on. When the EN is off, register C uses only
the result of register B, but the multiplier continues its computations. Because the
multiplier dissipates the most power, the total amount of power wasted is quite
significant. One solution to this problem is to shut down (isolate) the function unit
(operand) when its results are not used, as shown by the digital system after operand
isolation. The synthesis tool inserts AND gates at the inputs of the multiplier and
uses the enable logic of the multiplier to gate the signal transitions. As a result, no
dynamic power is dissipated when the result of the multiplier is not needed.
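The saving can be illustrated with a toy switching-activity model (purely hypothetical data stream and enable duty cycle, not from the text): forcing the multiplier operands to zero while EN is inactive removes most of the input bit transitions the multiplier would otherwise see.

```python
# Toy model of operand isolation: count bit transitions at a multiplier
# input with and without AND-gating of the operand by the enable signal.
import random

random.seed(42)
data = [random.getrandbits(8) for _ in range(1000)]   # 8-bit operand stream
en = [random.random() < 0.25 for _ in range(1000)]    # result needed 25% of cycles

def transitions(values):
    """Total number of bit flips between consecutive samples."""
    return sum(bin(a ^ b).count("1") for a, b in zip(values, values[1:]))

without = transitions(data)                               # no isolation
with_iso = transitions([d if e else 0 for d, e in zip(data, en)])
print(f"transitions without isolation: {without}, with isolation: {with_iso}")
assert with_iso < without
```

The gated stream toggles only around the enabled cycles, which is exactly why inserting AND gates at the multiplier inputs saves dynamic power in the datapath.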
Memory Splitting In many systems, the memory capacity is designed for peak
usage. During the normal system activity, only a portion of that memory is actually
used at any given time. In many cases, it is possible to divide the memory into two
or more sections, and selectively power down unused sections of the memory. With
increasing memory capacity, reducing the power consumed by memories is increas-
ingly important.
Logic Restructuring This is a gate-level dynamic power optimization technique.
Logic restructuring can, for example, reduce three stages to two stages through
logic equivalence transformation, so the circuit has less switching and fewer transi-
tions. It helps mainly in reducing the glitching power of a circuit. An example of
logic restructuring is shown in Fig. 8.43.
Logic Resizing By removing a buffer to reduce gate counts, logic resizing reduces
dynamic power. An example of logic resizing is shown in Fig. 8.44. In Fig. 8.44,
there are also fewer stages; both gate count and stage reduction can reduce power
and also, usually, the area.
Transition Rate Buffering In transition rate buffering, buffer manipulation
reduces dynamic power by minimizing switching times. An example of transition
rate buffering is shown in Fig. 8.45.
8.14 Chapter Summary
• Hardware–software codesign approach for low power has been explained with
an example.
• Difference between redundant and nonredundant bus-encoding technique to re-
duce switching activity has been explained.
8.15 Review Questions
Q8.1. Explain how low power is achieved in the hardware–software codesign step
of VLSI synthesis.
Q8.2. Explain how bus encoding is used to minimize off-chip dynamic power dissipation.
How does the power dissipation vary with the number of bits on the bus
and increased redundancy in encoding?
Q8.3. Distinguish between redundant and nonredundant bus encoding.
Q8.4. Explain the Gray coding approach for encoding the address bus to minimize
switching activity.
Q8.5. Compare the number of bit transitions for 4-bit Gray-coded data with respect
to binary-coded data.
Q8.6. Explain how bus-inversion encoding works. For which bus is this encoding
not suitable?
Q8.7. Explain how T0 encoding works. Give a realization of the encoder and decoder
circuits for T0 encoding.
Q8.8. How does CG minimize power dissipation? Explain how it can be implemented.
Q8.9. Distinguish between module-level CG and register-level CG.
References
1. Aghaghiri, Y., Fallah, F., Pedram, M.: Irredundant address bus encoding for low power. In:
Proceedings of ISLPED '01, Huntington Beach, CA, 6–7 August 2001
2. Bellaouar, A., Elmasry, M.I.: Low-Power Digital VLSI Design: Circuits and Systems. Klu-
wer, Norwell (1995)
3. Benini, L., Micheli, G.D.: State assignment for low power dissipation. IEEE J. Solid-State
Circuits 30, 258–268 (1995)
4. Benini, L., Micheli, G., Macii, E., Sciuto, D., Silvano, C.: Address bus encoding techniques
for system-level power optimization. In: Proceedings of the Conference on Design, Automa-
tion and Test in Europe, pp. 861–867 (1998)
5. Bertacco, V., Minato, S., et al.: Decision diagrams and pass transistor logic synthesis. In:
IEEE/ACM Proceeding of ICCAD. pp. 256–263, Nov 1995
6. Brayton, R., Rudell, R., Sangiovanni-Vincentelli, A., Wang, A.: MIS: a multiple-level
logic optimization system. IEEE Trans. Computer-Aided Des. Integr. Circuits CAD-6(6),
1062–1081 (1987)
7. Brayton, R.K., et al.: SIS: a system for sequential circuit synthesis. Technical Report UCB/
ERL M92/41, Electronics Research Laboratory, College of Engineering, University of Cali-
fornia, Berkeley (1992)
8. Buch, P., Narayana, A., Newton, A.R., Vincentelli, A.S.: Logic synthesis for large pass tran-
sistor circuits. In: IEEE/ACM Proceeding of ICCAD. pp. 663–670, Nov 1997
9. Chandrakasan, A.R., Brodersen, R.W.: Low-Power Digital CMOS Design. Kluwer, Norwell
(1995)
10. Chaudhury, R., Liu, T-H., Aziz, A., Burns, J.L.: Area oriented synthesis for pass transistor
logic. In: IEEE/ACM Proceeding of ICCAD, pp. 592–596, Nov 1998
11. Choudhury, P., Pradhan, S.N.: Power modeling of power gated FSM and its low power
realization by simultaneous partitioning and state encoding using genetic algorithm. In:
Proceedings of the 16th International Conference on Progress in VLSI Design and Test
(VDAT 2012), pp. 19–29. Springer, Heidelberg (2012)
12. Detjens, E., Gannot, G., Rudell, R., Sangiovanni-Vincentelli, A., Wang, A.: Technology map-
ping in MIS. In: IEEE/ACM Proceeding of ICCAD, pp. 116–119 (1987)
13. Favalli, M., Benini, L., De Micheli, G.: Design for testability of gated-clock FSMs. In: Pro-
ceedings of the European Design and Test Conference ED&TC, pp. 589–596, 11–14 March
1996
14. Fiduccia, C.M., Mattheyses, R.M.: A linear-time heuristic for improving network partitions.
In: Proceedings of the DAC 1982, pp. 171–181, 14–16 June 1982
15. Keating, M., Flynn, D., Aitken, R., Gibbons, A., Shi, K.: Low Power Methodology Manual:
For System-on-Chip Design. Springer, Berlin (2007)
16. Klaiber, A.: The Technology Behind Crusoe™ Processors: Low-Power x86-Compatible Pro-
cessors Implemented with Code Morphing™ Software. Transmeta Corporation, Jan 2000
17. Liu, T-H., Aziz, A., Burns, J.L.: Performance oriented synthesis for pass transistor logic. In:
Proceeding of IWLS. pp. 334–338, June 1998
18. Pal, A., Mukherjee, A.: Synthesis of two-level dynamic CMOS circuits. In: IEEE Proceeding
of International Workshop on Logic Synthesis, May 1999
19. Prasad, M.R., Kirkpatrick, D., Brayton, R.K.: Domino logic synthesis and technology map-
ping. In: IEEE Proceeding of IWLS, pp. 233–240 (1987)
20. Samanta, D., Pal, A.: Logic styles for high performance and low power. In: Proceedings of
the 12th International Workshop on Logic and Synthesis, 2003 (IWLS-2003), pp. 355–362.
Laguna Beach, California, USA, 28–30 May 2003
21. Samanta, D., Sinha, N., Pal, A.: Synthesis of high performance low power dynamic CMOS
Circuits. In: IEEE Proceeding of VLSI Design 2002, pp. 99–104, Jan 2002
22. Samanta, D., Dharmadeep, M.C., Pal, A.: Synthesis of high performance low power PTL
circuits. In: ACM Proceeding of ASP-DAC 2003, pp. 209–212 Jan 2003
23. Yano, K., Sasaki, Y., Rikino, K., Seki, K.: Top down pass transistor logic design. IEEE J.
Solid-State Circuits 31(6), 792–803 (1996)
24. Zhao, M., Sapatnekar, S.S.: Technology mapping for domino logic. In: IEEE/ACM Proceed-
ing of DAC, pp. 248–251 (1998)
25. Zimmermann, R., Fichtner, W.: Low-power logic styles: CMOS versus pass-transistor logic.
IEEE J. Solid-State Circuits 32(7), 1079–1090 (1997)
Chapter 9
Leakage Power Minimization
9.1 Introduction
Fig. 9.1 Gate delay time (a) and subthreshold leakage current (b) dependence on threshold voltage
The threshold voltage (Vt) of metal–oxide–semiconductor (MOS) transistors should
also be scaled at the same rate as the supply voltage to maintain the gate overdrive
(Vcc/Vt) and hence performance. Unfortunately, the reduction of Vt leads to an
exponential increase in the subthreshold leakage current. As a consequence, the leakage
power dissipation has gradually become a significant portion of the total power
dissipation. For example, for a 90-nm technology, the leakage power is 42 % of
the total power and for a 65-nm technology, the leakage power is 52 % of the total
power. This has led to vigorous research work to develop suitable approaches for
leakage power minimization. Various components of the leakage power at the tran-
sistor level have been presented in Sect. 6.5. In this chapter, we discuss different
approaches for leakage power reduction.
As the supply voltage is scaled down, the delay of the circuit increases as per
the relationship given in Eq. (7.2). Particularly, there is a dramatic increase in delay
as the supply voltage approaches the threshold voltage. This tends to limit the ad-
vantageous range of the supply voltage to a minimum of about twice the threshold
voltage. The delay can be kept constant if the threshold voltage is scaled at the same
ratio as the supply voltage; i.e. the ratio of Vt/Vdd is kept constant. Unfortunately,
as the threshold voltage is scaled down, the subthreshold leakage current increases
drastically, as shown in Fig. 9.1a. Moreover, the delay increases with an increase in
threshold voltage when the supply voltage is kept constant as shown in Fig. 9.1b.
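The exponential dependence can be made concrete with the standard subthreshold-current model, I_sub ∝ exp((V_GS − V_t)/(n·v_T)). In the sketch below, the slope factor n and the operating point are illustrative assumptions, not values from the text.

```python
# How off-state (Vgs = 0) subthreshold leakage grows when Vt is scaled
# down, using I_sub ~ exp(-Vt / (n * vT)). The slope factor n = 1.5 and
# thermal voltage vT = 26 mV (room temperature) are assumed values.
import math

def leakage_growth(vt_high, vt_low, n=1.5, vT=0.026):
    """Multiplicative increase in subthreshold leakage when Vt drops."""
    return math.exp((vt_high - vt_low) / (n * vT))

ratio = leakage_growth(0.4, 0.2)   # halving Vt from 0.4 V to 0.2 V
print(f"leakage grows by a factor of about {ratio:.0f}")
```

Under these assumed numbers, halving the threshold voltage inflates the off-state leakage by more than two orders of magnitude, which is why every technique in this chapter ultimately manipulates Vt, statically or dynamically.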
The threshold voltage is the parameter of importance for controlling leakage power.
As leakage power has an exponential dependence on the threshold voltage, and all
the leakage power reduction techniques are based on controlling the threshold voltage
either statically or dynamically, we first discuss the various parameters on which the
threshold voltage of a MOS transistor depends. Various approaches for the fabrication
of multiple-threshold-voltage transistors are presented in Sect. 9.2. The leakage
power reduction techniques can be categorized into two broad types—standby and
run-time leakage. When a circuit or a part of it is not in use, it is kept in the standby
mode by a suitable technique such as clock gating. The clock gating helps to reduce
the dynamic power dissipation, but leakage power dissipation continues to take
place even when the circuit is not in use. There are several approaches such as tran-
sistor stacking, variable-threshold-voltage complementary metal–oxide–semicon-
ductor (VTCMOS), and multiple-threshold-voltage complementary metal–oxide–semiconductor (MTCMOS), which can be used to reduce the leakage power when
a circuit is in the standby condition. Variable-threshold-voltage CMOS (VTCMOS)
approach for leakage power minimization has been discussed in Sect. 9.3. Transis-
tor stacking approach based on the stack effect has been highlighted in Sect. 9.4.
On the other hand, there are several approaches for the reduction of the leakage
power when a circuit is in actual operation. These are known as run-time leakage
power reduction techniques. It may be noted that run-time leakage power reduction
techniques also reduce the leakage power even when the circuit is in standby mode.
As leakage power constitutes a significant portion of the total power, run-time
leakage power reduction is becoming increasingly important.
Classification of leakage power reduction techniques is also possible based on
whether the technique is applied at the time of fabrication of the chip or at run time.
The approaches applied at fabrication time can be classified as static approaches.
On the other hand, the techniques that are applied at run time are known as dynamic
approaches. Leakage power reduction based on multi-threshold-voltage CMOS (MTCMOS) has been discussed in Sect. 9.5. Section 9.6 addresses the power-gating technique to minimize leakage power. The isolation strategy has been highlighted in Sect. 9.7. The state retention strategy has been introduced in Sect. 9.8. Power-gating controllers have been discussed in Sect. 9.9. Power management tech-
niques have been considered in Sect. 9.10. The dual-Vt assignment technique has
been introduced in Sect. 9.11. The delay-constrained dual-Vt technique has been
presented in Sect. 9.12 and energy constraint has been considered in Sect. 9.13. The
dynamic Vt scaling technique has been introduced in Sect. 9.14.
9.2 Fabrication of Multiple Threshold Voltages

The most commonly used technique for realizing multiple-Vt MOSFETs is to use
different channel-doping densities based on the following expression:
$$V_{th} = V_{fb} + 2\phi_B + \frac{\sqrt{2\varepsilon_{si}\, q\, N_a \left(2\phi_B + V_{bs}\right)}}{C_{ox}}, \qquad (9.1)$$
Fig. 9.2 Variation of threshold voltage (mV) with doping concentration (10^17 cm^-3)
where $V_{fb}$ is the flat-band voltage, $N_a$ is the doping density in the substrate, and
$\phi_B = (kT/q)\,\ln(N_a/n_i)$, with $n_i$ the intrinsic carrier concentration.
Based on this expression, the variation of threshold voltage with channel-doping
density is shown in Fig. 9.2. A higher doping density results in a higher threshold
voltage. However, to fabricate two types of transistors with different threshold volt-
ages, two additional masks are required compared to the conventional single-Vt
fabrication process. This makes the dual-Vt fabrication costlier than single-Vt fab-
rication technology. Moreover, due to the non-uniform distribution of the doping
density, it may be difficult to achieve dual threshold voltage when these are very
close to each other.
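Equation (9.1) can be evaluated numerically to reproduce the trend of Fig. 9.2. The following Python sketch assumes illustrative values for the flat-band voltage, oxide thickness, and room-temperature constants; it is not calibrated to any particular process:

```python
import math

# Physical constants (cm-based units)
Q = 1.602e-19              # electron charge (C)
KT_Q = 0.02585             # thermal voltage kT/q at 300 K (V)
NI = 1.45e10               # intrinsic carrier concentration of Si (cm^-3)
EPS_SI = 11.7 * 8.854e-14  # permittivity of silicon (F/cm)
EPS_OX = 3.9 * 8.854e-14   # permittivity of SiO2 (F/cm)

def vth(na, tox_cm=5e-7, vfb=-0.9, vsb=0.0):
    """Threshold voltage from Eq. (9.1).

    na     : substrate doping density (cm^-3)
    tox_cm : gate oxide thickness in cm (5e-7 cm = 5 nm, illustrative)
    vfb    : flat-band voltage (V) -- illustrative value
    vsb    : source-to-body reverse bias (V)
    """
    phi_b = KT_Q * math.log(na / NI)   # bulk Fermi potential
    cox = EPS_OX / tox_cm              # gate capacitance per unit area
    body = math.sqrt(2 * EPS_SI * Q * na * (2 * phi_b + vsb)) / cox
    return vfb + 2 * phi_b + body

# Higher doping density gives a higher threshold voltage (cf. Fig. 9.2)
for na in (2e17, 4e17, 6e17):
    print(f"Na = {na:.0e} cm^-3 : Vth = {vth(na):.3f} V")
```

With these assumed parameters the sketch also shows the body effect used later in the chapter: a positive source-to-body reverse bias raises the threshold voltage.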
The expression for the threshold voltage shows a strong dependence on the value
of Cox, the unit gate capacitance. Different gate capacitances can be realized by us-
ing different gate oxide thicknesses. The variation of threshold voltage with oxide
thickness ( tox) for a 0.25-μm device is shown in Fig. 9.3. Dual-Vth MOSFETs can
be realized by depositing two different oxide thicknesses. A lower gate capacitance
due to higher oxide thickness not only reduces subthreshold leakage current but also
provides the following benefits:
a. Reduced gate oxide tunnelling because the oxide tunnelling current exponen-
tially decreases with the increase in oxide thickness.
Fig. 9.3 Variation of threshold voltage with oxide thickness (nm) for a 0.25-μm device

Fig. 9.5 Variation of threshold voltage (mV) with channel length (μm)
In the case of short-channel devices, the threshold voltage decreases as the channel
length is reduced, which is known as Vth roll-off. This phenomenon can be exploited
to realize transistors of dual threshold voltages. The variation of the threshold volt-
age with channel length is shown in Fig. 9.5. However, for transistors with feature
sizes close to 0.1 μm, halo techniques have to be used to suppress the short-channel
effects. As the Vth roll-off becomes very sharp, it turns out to be a very difficult task
to control the threshold voltage near the minimum feature size. For such technolo-
gies, longer channel lengths for higher Vth transistors increase the gate capacitance,
which leads to more dynamic power dissipation and delay.
The application of reverse body bias to the well-to-source junction leads to an in-
crease in the threshold voltage due to the widening of the bulk depletion region,
which is known as body effect. This effect can be utilized to realize MOSFETs hav-
ing multiple threshold voltages. However, this necessitates separate body biases to
be applied to different nMOS transistors, which means the transistors cannot share
the same well. Therefore, costly triple-well technologies are to be used for this pur-
pose. Another alternative is to use silicon-on-insulator (SoI) technology, where the
devices are isolated naturally.
In order to get the best of both worlds, i.e. a smaller delay of low-Vt devices and
a smaller power consumption of high-Vt devices, a balanced mix of both low-Vt and
high-Vt devices may be used. The following two approaches can be used to reduce
leakage power dissipation in the standby mode.
9.3 VTCMOS Approach
We have observed that low supply voltage along with low-threshold voltage pro-
vides a reduced overall power dissipation without a degradation in performance.
However, the use of low-Vt transistors inevitably leads to increased subthreshold
leakage current, which is of major concern when the circuit is in standby mode. In
many recent applications, such as cell phones, personal digital assistants (PDAs),
etc., a major part of the circuit remains in standby mode most of the time. If the
standby current is not low, it will lead to a shorter battery life.
VTCMOS [2, 3] circuits make use of the body effect to reduce the subthreshold
leakage current, when the circuit is in normal mode. We know that the threshold
voltage is a function of the voltage difference between the source and the substrate.
The substrate terminals of all the n-channel metal–oxide–semiconductor (nMOS)
transistors are connected to the ground potential and the substrate terminals of all
the p-channel metal–oxide–semiconductor (pMOS) transistors are connected to Vdd,
as shown in Fig. 9.6. This ensures that the source and drain diffusion regions always
remain reversed-biased with respect to the substrate and the threshold voltages of
the transistors are not significantly influenced by the body effect. On the other hand,
in the case of VTCMOS circuits, the substrate bias voltages of nMOS and pMOS
transistors are controlled with the help of a substrate bias control circuit, as shown
in Fig. 9.7.
Fig. 9.6 Physical structure of a CMOS inverter a without body bias, b with body bias
When the circuit is operating in the active mode, the substrate bias voltages for nMOS and pMOS transistors are VBn = 0 V and VBp = Vdd, respectively. Assuming Vdd = 1 V, the corresponding threshold voltages for the nMOS and pMOS transistors are Vtn = 0.2 V and Vtp = −0.2 V, respectively. On the other hand, when the circuit is in standby mode, the substrate bias control circuit generates a lower substrate bias voltage of VBn = −VB for the nMOS transistors and a higher substrate bias voltage VBp = Vdd + VB for the pMOS transistors. This leads to an increase in the threshold voltages of the nMOS and pMOS transistors to Vtn = 0.5 V and Vtp = −0.5 V, respectively. This, in turn, leads to a substantial reduction in subthreshold leakage currents because of the exponential dependence of subthreshold leakage current on the threshold voltage. It has been found that for every 100-mV increase in threshold voltage, the subthreshold leakage current reduces by half.
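The benefit of the standby threshold shift follows directly from the rule of thumb just quoted (leakage halves per 100 mV of Vth increase). A minimal sketch, using only that rule:

```python
def standby_leakage_reduction(delta_vth_mv, halving_step_mv=100):
    """Leakage reduction factor if subthreshold leakage halves for every
    `halving_step_mv` of threshold-voltage increase (the rule of thumb
    quoted in the text; the exact figure is process dependent)."""
    return 2 ** (delta_vth_mv / halving_step_mv)

# VTCMOS standby shift in the example above: |Vth| goes from 0.2 V to 0.5 V,
# i.e. a 300-mV increase
print(standby_leakage_reduction(300))  # 8x lower subthreshold leakage
```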
Although the VTCMOS technique is a very effective technique for controlling
threshold voltage and reducing subthreshold leakage current, it requires a twin-well
or triple-well CMOS fabrication technology so that different substrate bias voltages
can be applied to different parts of the chip. Separate power pins may also be re-
quired if the substrate bias voltage levels are not generated on chip. Usually, the ad-
ditional area required for the substrate bias control circuitry is negligible compared
to the overall chip area.
9.4 Transistor Stacking
When more than one transistor is in series in a CMOS circuit, the leakage current
has a strong dependence on the number of turned off transistors. This is known as
the stack effect [4–6]. The mechanism of the stack effect can be best understood by
Fig. 9.8 a Source voltages of the nMOS transistors in the stack, b a three-input NAND gate
considering the case when all the transistors in a stack are turned off. Figure 9.8a
shows four nMOS devices of a four-input NAND gate in a stack. The source and
drain voltages of the MOS transistors obtained by simulation are shown in the fig-
ure. These voltages are due to a small drain current passing through the circuit. The
source voltages of the three transistors on top of the stack have positive values. As-
suming all gate voltages are equal to zero, the gate-to-source voltages of the three
transistors are negative. Moreover, the drain-to-source potential of the MOS transis-
tors is also reduced. The following three mechanisms come into play to reduce the
leakage current:
i Due to the exponential dependence of the subthreshold current on gate-to-source
voltage, the leakage current is greatly reduced because of negative gate-to-source
voltages.
ii The leakage current is also reduced due to body effect, because the body of all
the three transistors is reverse-biased with respect to the source.
iii As the source-to-drain voltages for all the transistors are reduced, the subthreshold
current due to drain-induced barrier lowering (DIBL) effect will also be lesser.
As a consequence, the leakage currents will be minimum when all the transistors
are turned off, which happens when the input vector is 0000. The leakage cur-
rent passing through the circuit depends on the input vectors applied to the gate
and it will be different for different input vectors. For example, for a three-input
NAND gate shown in Fig. 9.8b, the leakage current contributions for different
input vectors are given in Table 9.1. It may be noted that the highest leakage
current is 99 times the lowest leakage current. The current is lowest when all the
transistors in series are OFF, whereas the leakage current is highest when all the
transistors are ON.

Table 9.1 Input vectors and corresponding leakage currents for the three-input NAND gate

State (ABC) | Leakage current (nA) | Leaking transistors
000 | 0.095 | Q1, Q2, Q3
001 | 0.195 | Q1, Q2
010 | 0.195 | Q1, Q3
011 | 1.874 | Q1
100 | 0.184 | Q2, Q3
101 | 1.220 | Q2
110 | 1.140 | Q3
111 | 9.410 | Q4, Q5, Q6
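Given per-vector leakage values such as those in Table 9.1, selecting the minimum-leakage vector for a single gate is a simple lookup. A sketch using the table's data:

```python
# Leakage currents (nA) from Table 9.1 for the three-input NAND gate
LEAKAGE_NA = {
    "000": 0.095, "001": 0.195, "010": 0.195, "011": 1.874,
    "100": 0.184, "101": 1.220, "110": 1.140, "111": 9.410,
}

best = min(LEAKAGE_NA, key=LEAKAGE_NA.get)    # minimum leakage vector
worst = max(LEAKAGE_NA, key=LEAKAGE_NA.get)   # maximum leakage vector
ratio = LEAKAGE_NA[worst] / LEAKAGE_NA[best]  # roughly the 99x quoted above
print(best, worst, ratio)
```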
For a single gate, it is a trivial problem to find out the input vector that produces
the minimum leakage current when the circuit is in standby mode. However, when
the circuit is complex, it is a nontrivial problem to find out the input vector that
produces the minimum leakage current. Many researchers have used models and
algorithms to estimate the nominal leakage current of a circuit. The minimum and
maximum leakage currents of a circuit have been estimated using suitable greedy
heuristics. The set of input vectors that can put the circuit in the low-power
standby mode with standby leakage power reduced by more than 50 % are selected.
Given a primary input combination, the logic value of each internal node can be
obtained simply by starting from primary inputs and simulating the circuit level by
level (level of a node is equal to the maximum of the level of its fan-in nodes plus
1, with the level of all primary inputs being 0). By applying the minimum-leakage
producing input combination to the circuit when it is in the idle mode, we can sig-
nificantly reduce the leakage power dissipation of the circuit. Consequently, the
identification of a minimum leakage vector (MLV) is an important problem in low
power design of VLSI circuits.
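The level-by-level evaluation described above can be sketched in Python; the gate library and netlist here are illustrative, not taken from the text:

```python
# Levelized simulation of a combinational circuit: the level of a node is
# max(level of its fan-in nodes) + 1, with primary inputs at level 0, and
# nodes are evaluated in increasing level order.

def nand(*xs):
    return int(not all(xs))

def nor(*xs):
    return int(not any(xs))

# node -> (gate function, fan-in nodes); primary inputs have no entry
NETLIST = {
    "n1": (nand, ("a", "b")),
    "n2": (nor, ("b", "c")),
    "out": (nand, ("n1", "n2")),
}

def levelize(netlist, primary_inputs):
    """Assign a level to every node, primary inputs at level 0."""
    level = {i: 0 for i in primary_inputs}
    def lv(n):
        if n not in level:
            level[n] = 1 + max(lv(f) for f in netlist[n][1])
        return level[n]
    for n in netlist:
        lv(n)
    return level

def simulate(netlist, inputs):
    """Evaluate all internal nodes for one primary-input combination."""
    values = dict(inputs)
    levels = levelize(netlist, inputs)
    for n in sorted(netlist, key=levels.get):
        fn, fanin = netlist[n]
        values[n] = fn(*(values[f] for f in fanin))
    return values

print(simulate(NETLIST, {"a": 0, "b": 0, "c": 0}))
```

Pairing this simulator with a per-gate leakage table (as in Table 9.1) and sweeping candidate vectors is the essence of the exhaustive, random, and heuristic MLV searches discussed next.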
In view of the above discussion, researchers have attempted to make use of the
stack effect to minimize leakage power dissipation in the standby condition. The
following three basic approaches have been proposed by the researchers to select an
appropriate input vector.
• An input vector that maximizes the stack effect is determined so that the leakage power is minimized. This method is suitable for regular-structure data path circuits such as adders, multipliers, etc. The best input vector is identified by using a suitable algorithm such as a genetic algorithm.
• Input vectors are generated randomly and applied to the circuit, and the leakage power dissipation is measured. An input vector that minimizes the leakage current is selected.
• A probabilistic approach eliminates the need to do simulations over all the 2^n input combinations, where n is the number of circuit inputs. A small subset of all the possible states is evaluated for leakage.
Fig. 9.9 Basic MTCMOS circuit scheme (Q1 and Q2 are sleep transistors)
9.5 MTCMOS Approach
Mutoh and his colleagues [7] introduced the MTCMOS approach to implement
high-speed and low-power circuits operating from a 1-V power supply. In this ap-
proach, MOSFETs with two different threshold voltages are used in a single chip.
It uses two operational modes—active and sleep for efficient power management.
A basic MTCMOS circuit scheme is shown in Fig. 9.9. The realization of a two-
input NAND gate is shown in the figure. The CMOS logic gate is realized with
transistors of low-threshold voltage of about 0.2–0.3 V. Instead of connecting the
power terminal lines of the gate directly to the power supply lines Vdd and GND,
here these are connected to the ‘virtual’ power supply lines (VDDV and GNDV).
The real and virtual power supply lines are linked by the MOS transistors Q1 and Q2.
These transistors have a high-threshold voltage in the range 0.5–0.6 V and serve as
sleep control transistors. Sleep control signals SL and SL are connected to Q1 and
Q2, respectively, and used for active/sleep mode control. In the active mode, when
SL is set to LOW, both Q1 and Q2 are turned ON connecting the real power lines
to VDDV and GNDV. In this mode, the NAND gate operates at a high speed cor-
responding to the low-threshold voltage of 0.2 V, which is relatively low compared
to the supply voltage of 1.0 V. In the sleep mode, SL is set to HIGH to turn both
Q1 and Q2 OFF, thereby isolating the real supply lines from VDDV and GNDV.
As the sleep transistors have a high-threshold voltage (0.6 V), the leakage current
flowing through these two transistors will be significantly smaller in this mode. As
a consequence, the leakage power consumption during the standby period can be
dramatically reduced by sleep control. The performance of the MTCMOS approach
is shown in Fig. 9.10. In Fig. 9.10a, the delay of an MTCMOS NAND gate as a function of
supply voltage is compared with that of conventional full high-Vt and full low-Vt
logic gates. It is demonstrated that the delay of the MTCMOS gate is much smaller
than that of a conventional CMOS gate with high Vt and it is very close to the con-
ventional CMOS with low-Vt realization. With a power supply voltage of 1 V, the
MTCMOS gate delay time is reduced by 70 % as compared with the conventional
CMOS gate with high-Vt.

Fig. 9.10 a Delay characteristic of MTCMOS gate, b dependence of energy on supply voltage

Fig. 9.11 Dependence of the effective supply voltage on the normalized gate width of the sleep transistor (CV/C0 = 1)

Moreover, the supply voltage dependence of the MTCMOS
gate delay is much smaller than that of a conventional CMOS gate with high
Vt, and MTCMOS gate operates as fast as the full low-Vt gate. On the other hand,
the variation of normalized power-delay product (NPDP) with the supply voltage
is shown in Fig. 9.10b, where power consumption is normalized by frequency. At
low voltage, say at 1 V, the NPDP of the MTCMOS gate is much lower than that of
the conventional high-Vt gates, reflecting the improved speed performance at lower
voltage. The curve also shows that the power reduction, which is proportional to the
square of the supply voltage, outweighs the speed degradation in low-voltage operation.
It is also claimed that the standby current is reduced by three or four orders of
magnitude due to sleep control.
Two other factors that affect the speed performance of an MTCMOS circuit are:
the width of the sleep control transistors and the capacitances of the virtual power
line. The sleep transistors should have widths large enough that their ON resistances
are small. It has been established by simulation that a WH/WL of 5 and a CV/C0
of 5 keep the drop in Veff within 10 % of Vdd and the degradation in gate delay
time within 15 % compared to a pure low-Vth CMOS, as shown in Fig. 9.11. The area
penalty is relatively small because this is shared by all the logic gates on a chip. So
far as CV is concerned, the condition is satisfied by the intrinsic capacitances present
and no external capacitance needs to be added.
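The Veff criterion above can be illustrated with a simple IR-drop model of the sleep transistor: the logic sees an effective supply Veff = Vdd − Ipeak·Rsleep, and Rsleep falls as the sleep transistor is widened. All numbers below are illustrative assumptions, not simulation data from the text:

```python
# IR-drop sketch for sleep-transistor sizing.

def effective_supply(vdd, i_peak, r_unit, width_factor):
    """Effective supply seen by the logic block.

    i_peak       : peak current drawn through the sleep transistor (A)
    r_unit       : on-resistance of a unit-width sleep transistor (ohm)
    width_factor : number of unit widths used in parallel
    """
    r_sleep = r_unit / width_factor   # wider device -> lower on-resistance
    return vdd - i_peak * r_sleep

VDD = 1.0
for w in (1, 2, 5):
    veff = effective_supply(VDD, i_peak=2e-3, r_unit=200.0, width_factor=w)
    ok = (VDD - veff) <= 0.1 * VDD   # the 10 %-of-Vdd criterion from the text
    print(w, round(veff, 3), ok)
```

With these assumed numbers, only the 5x-wide device keeps the drop within 10 % of Vdd, echoing the width ratio of 5 reported above.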
9.6 Power Gating

The basic approach of the MTCMOS implementation has been generalized and
extended in the name of power management or power gating. Here, the basic strat-
egy is to provide two power modes: an active mode, which is the normal operat-
ing mode of the circuit, and a low-power mode when the circuit is not in use. The
low-power mode is commonly termed as sleep mode. At an appropriate time, the
circuit switches from one mode to the other in an appropriate manner such that the
energy drawn from the power source is minimized with minimum or no impact on
the performance.
We have discussed the use of the clock-gating approach for the reduction of switch-
ing power. The typical activity profile for a subsystem using a clock-gating strategy
is shown in Fig. 9.12a. As shown in the figure, no dynamic power dissipation takes
place when the circuit is clock-gated. However, leakage power dissipation takes
place even when the circuit is clock-gated. In the early generation of CMOS circuits
(above 250 nm), the leakage power was an insignificant portion of the total power.
So, the power dissipation of a clock-gated subsystem was negligible. However, the
leakage power has grown with every generation of CMOS process technology and
it is essential to use power gating to reduce leakage power when the circuit is not
in use. The activity profile for the same subsystem using a power-gating strategy is
shown in Fig. 9.12b. Power gating can be implemented by inserting power-gating
transistors in the stack between the logic transistors and either power or ground,
thus creating a virtual supply rail or a virtual ground rail, respectively. The logic
block contains all low-Vth transistors for fastest switching speeds while the switch
transistors, header or footer, are built using high-Vth transistors to minimize the leak-
age power. Power gating can be implemented without using multiple thresholds,
but it will not reduce leakage as much as if implemented with multiple thresholds.
MTCMOS refers to the use of transistors with multiple threshold voltages in power-
gating circuits. The most common implementations of power gating use a footer
switch alone to limit the switch area overhead. High-Vth NMOS footer switches are
about half the size of equivalent-resistance high-Vth PMOS header switches due to
differences in majority carrier mobilities. Power gating reduces leakage by reducing
the gate-to-source voltage, which, in turn, drives the logic transistors deeper into the cutoff region.

Fig. 9.12 a Activity profile for a subsystem with clock gating, b activity profile of the same subsystem with power gating

This occurs because of the stack effect. The source terminal of
the bottom-most transistor in the logic stack is no longer at ground, but rather at a
voltage somewhat above ground due to the presence of the power-gating transistor.
Leakage is reduced due to the reduction of the Vgs.
9.6.2 Power-Gating Issues
The clock-gating approach discussed earlier does not affect the functionality of the
circuit and does not require changes in the resistor–transistor logic (RTL) represen-
tation. But, power gating is much more invasive because, as we shall see later, it af-
fects inter-block interfaces and introduces significant time delays in order to safely
enter and exit power-gated modes. The most basic form of power-gating control,
and the one with the lowest long-term leakage power, is an externally switched
power supply. An example is an on-chip CPU that has a dedicated off-chip power supply, that is, the supply provides power only to the CPU. We can, then, shut down this power supply and reduce the leakage in the CPU to essentially zero. This approach, though, also takes the longest time and requires the most energy to restore power to a gated block. Internal power gating, where internal switches are used to control power to selected blocks, is a better solution when the blocks have to be powered down for shorter periods of time. Figure 9.13 illustrates a system on chip (SoC) that uses internal power gating.

Fig. 9.13 An SoC that uses internal power gating
Unlike a block that is always powered on, the power-gated block receives its
power through a power-switching network. This network switches either VDD or
VSS to the power-gated block. In this example, VDD is switched and VSS is pro-
vided directly to the entire chip. The switching fabric typically consists of a large
number of CMOS switches distributed around or within the power-gated block.
Various issues involved in the design of power-gated circuits are listed below:
• Power-gating granularity
• Power-gating topologies
• Switching fabric design
• Isolation strategy
• Retention strategy
• Power-gating controller design
9.6.2.1 Power-Gating Granularity
Two levels of granularity are commonly used in power gating. One is referred to as
fine-grained power gating and the other one is referred to as coarse-grained power
gating. In the case of fine-grained power gating, the power-gating switch is placed
9.6 Power Gating 275
locally as part of the standard cell. The switch must be designed to supply the worst-
case current requirement of the cell so that there is no impact on the performance.
As a consequence, the size of the switch is usually large (2 × to 4 × the size of the
original cell) and there is significant area overhead.
In the case of coarse-grained power gating, a relatively larger block, say a pro-
cessor, or a block of gates is power switched by a block of switch cells. Consider
two different implementations of a processor chip. In the first case, a single sleep
control signal is used to power down the entire chip. In the second case, separate
sleep control signals are used to control different building blocks such as instruction
decoder, execution unit, and memory controller. The former design is considered as
coarse-grained power gating, whereas the latter design may be categorized as fine-
grained power gating.
The choice of granularity has both logical and physical implications. A power
domain refers to a group of logic with a logically unique sleep signal. Each power
domain must be physically arranged to share the virtual ground common to that
particular group (except for the boundary case of the switching-cell topology in
which there are no shared virtual grounds). The motivation for fine-grained power
gating is to reduce run-time leakage power, that is, the leakage power consumed
during normal operation. While the coarse-grained example mentioned above will
reduce leakage during standby, it will not affect run-time leakage since with a single
sleep domain the power supply is either completely connected (active mode) or
completely disconnected (standby mode). However, with fine-grained switching,
portions of the design may be switched off while the other portions continue to
operate. For example, in a multicore implementation with four cores, if only three
of the cores are active, the fourth one may be put to sleep until such time as it is
scheduled to resume computation.
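The multicore scenario above can be sketched with per-domain sleep signals; the class and names are illustrative:

```python
# Fine-grained power gating: each domain has its own (logically unique)
# sleep signal, so idle domains can be gated while the others keep running.

class PowerDomain:
    def __init__(self, name):
        self.name = name
        self.asleep = False

    def sleep(self):
        self.asleep = True    # assert this domain's SLEEP signal

    def wake(self):
        self.asleep = False

def active_leakage(domains, leak_each=1.0):
    """Run-time leakage in arbitrary units, counting only awake domains
    (residual leakage through the off switches is ignored here)."""
    return sum(leak_each for d in domains if not d.asleep)

cores = [PowerDomain(f"core{i}") for i in range(4)]
cores[3].sleep()              # only three cores are scheduled to run
print([c.name for c in cores if not c.asleep], active_leakage(cores))
```

A coarse-grained design would correspond to a single `PowerDomain` for the whole chip: either everything leaks or everything is off, with no run-time saving.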
Another advantage of fine-grained power gating is that the timing impact of the
current I passing through a switch with equivalent resistance R, resulting
in an IR drop across the switch, can be easily characterized, and it may be possible to
use the traditional design flow to deploy fine-grained power gating. On the other
hand, the sizing of the coarse-grained switched network is more difficult because
the exact switching activity of the logic block may not be known at design time. In
spite of the advantages of fine-grained power gating, coarse-grained power gating
is preferred because of its smaller area overhead.
9.6.2.2 Power-Gating Topologies
Global power gating refers to a logical topology in which multiple switches are con-
nected to one or more blocks of logic, and a single virtual ground is shared in com-
mon among all the power-gated logic blocks as shown in Fig. 9.14. This topology
is effective for large blocks (coarse-grained) in which all the logic is power gated,
but is less effective, for physical design reasons when the logic blocks are small. It
does not apply when there are many different power-gated blocks, each controlled
by a different sleep enable signal.
Local power gating refers to a logical topology in which each switch singularly
gates its own virtual ground connected to its own group of logic. This arrangement
results in multiple segmented virtual grounds for a single sleep domain as shown
in Fig. 9.15.
9.6.2.5 Switch in Cell
Although the basic concept of using sleep transistors for power gating is simple, the
actual implementation of the switching fabric involves many highly technology-
specific issues. First and foremost among the issues is the architectural issue to
decide whether to use only header switch using pMOS transistors or use only footer
switch using nMOS transistors or use both. Some researchers have advocated the
use of both types of switches. However, for designs at 90 nm and below, either a
header or a footer switch alone is recommended because of the tight voltage margin,
the significant IR drop, and the large area and delay penalties when both types of
transistors are used. Various issues to be addressed for switching fabric design are:
• Header-versus-footer switch
• Power-gating implementation style
9.6.2.7 Header-Versus-Footer Switch
Figure 9.17a shows a header switch used for power gating. High-Vt PMOS transis-
tors are used to realize the header switch. Similarly, a footer switch used for power
gating is shown in Fig. 9.17b, where high-Vt NMOS transistors are used to real-
ize the switch. Key issues affecting the choice are: architectural constraint, area
overhead, and IR drop constraints. When external and internal power gating are
used together, the header switch is the most appropriate choice. The header switch is also suitable if multiple power rails and/or voltage swings are used on the chip.

Fig. 9.17 a Header switch and b footer switch used for power gating
The switch efficiency of sleep transistors is defined as the ratio of drain current
in the ON and OFF states (Ion/Ioff). A high switch efficiency is desirable to
achieve a high current drive in normal operation and low leakage in the sleep
condition. It has been found that for the same drive current, a header switch
using pMOS transistors results in 2.67 times more leakage current than a footer
switch realized using nMOS transistors.
9.6.2.8 Implementation Styles
9.7 Isolation Strategy
Ideally, the outputs and internal nodes of a header-style power-gated block should
collapse down towards the ground level. Similarly, the outputs and internal nodes of
a footer-style power-gated block should float up towards the supply rail. However,
in practice, the outputs and internal nodes may neither discharge to ground
level nor fully charge to supply voltage level, because of finite leakage currents
passing through the off switches. So, if the output of power-down block drives a
power-up block, there is possibility of short-circuit power (also known as crowbar
power) in the power-up block as shown in Fig. 9.20. It is necessary to ensure that
the floating output of the power-down block does not result in spurious behaviour
of the power-up block. This can be achieved by using an isolation cell to clamp the
output of the power-down block to some well-defined logic level. Isolation cells can
be categorized into three types: cells that clamp the output to the '0' logic level, cells
that clamp it to the '1' logic level, and cells that clamp the output to its most recent
value. The value to which the output should be clamped depends on the inactive state of
the power-up block. In the case of active-high logic, the output is clamped to logic
'0', and in the case of active-low logic, the output is clamped to the logic '1' level. When
the power-down block is driving a combinational logic circuit, the output can be
clamped to a particular value that reduces the leakage current using stack effect
(refer to Sect. 9.4). Figure 9.21 shows how the output of a power-gated block is
clamped with the help of an AND gate.

Fig. 9.22 a AND gate to clamp the output to LOW level and b OR gate to clamp the output to HIGH level

Clamping the output to the '0' logic level can be accomplished using an AND gate as shown in Fig. 9.22a. The output is
clamped to ‘0’ as long as the isolation (ISOLN) signal is active.
Similarly, isolation cell to clamp the output to ‘1’ logic level can be accom-
plished using an OR gate as shown in Fig. 9.22b. It can be realized using a NOR
gate and an inverter. The output is forced to ‘1’ as long as the ISOL signal is held
high. An alternative approach is to use pull-up or pull-down transistors to avoid full
gate delay as shown in Fig. 9.23. If we want to clamp the output to the last value, it
is necessary to use a latch to hold the last value.
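The three clamping behaviours can be sketched at the behavioural level; signal polarities follow the text (ISOLN is active-low, ISOL is active-high), and the names are illustrative:

```python
# Behavioural models of the three isolation-cell types.

def clamp_low(out, isoln):
    """AND-style cell: output forced to 0 while ISOLN is active (low)."""
    return out & isoln

def clamp_high(out, isol):
    """OR-style cell: output forced to 1 while ISOL is active (high)."""
    return out | isol

class ClampLast:
    """Latch-style cell: transparent normally, holds the most recent
    value while `isolate` is asserted."""
    def __init__(self):
        self.q = 0

    def __call__(self, out, isolate):
        if not isolate:
            self.q = out      # transparent when not isolated
        return self.q

print(clamp_low(1, 0), clamp_high(0, 1))  # floating inputs clamped to 0 and 1
```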
9.8 Retention Strategy

Given a power switching fabric and an isolation strategy, it is possible to power gate
a block of logic. But unless a retention strategy is employed, all state information
is lost when the block is powered down. To resume its operation on power up, the
block must either have its state restored from an external source or build up its state
from the reset condition. In either case, the time and power required can be signifi-
cant. One of the following three approaches may be used:
• A software approach based on reading and writing registers
• A scan-based approach based on the reuse of scan chains to store state off chip
• A register-based approach that uses retention registers
Software-Based Approach In the software approach, the always-ON CPU reads
the registers of the power-gated blocks and stores their contents in the processor’s
memory. During the power-up sequence, the CPU writes the registers back from
memory. Bus traffic slows down the power-down and power-up sequences, and bus
conflicts may make powering down unviable. Dedicated software must be written
and integrated into the system software to handle power down and power up.
282 9 Leakage Power Minimization
Fig. 9.24 Retention register: master/slave latches with SAVE, RESTORE, and RETAIN controls
(inputs D, CLK, RESETN; switched supply VDD_SW)
Scan-Based Approach Scan chains used for built-in self-test (BIST) can be reused.
During the power-down sequence, the scan register outputs are routed to an on-chip
or off-chip memory. Because existing scan chains are reused, this approach can give
significant savings in chip area.
Retention Registers In this approach, standard registers are replaced by retention
registers. A retention register contains a shadow register that can preserve the
register’s state during power down and restore it at power up. High-Vt transistors
are used in the slave latch, the clock buffers, and the inverter that connects the
master latch to the slave latch, as shown in Fig. 9.24. In addition to an area penalty,
this approach requires a more complex power controller.
9.9 Power-Gating Controller
A key concern in controlling the switching fabric is to limit the in-rush of current
when power to the block is switched on. An excessive in-rush current can cause
voltage spikes on the supply, possibly corrupting registers in the always-on blocks,
as well as retention registers in the power-gated block. One representative approach
is to daisy-chain the control signal to the switches. The result of this daisy chaining
is that it takes some time from the assertion of a ‘power-up’ signal until the block
is powered up. A more aggressive approach to turning on the switching fabric is to
use several power-up control signals in sequence. Regardless of the specific control
method, during the power-up sequence, it is important to wait until the switching
fabric is completely powered up before enabling the power-gated block to resume
normal operation. A realistic activity profile for power gating is shown in Fig. 9.25.
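The daisy-chained control can be viewed as a staggered turn-on schedule; a behavioral sketch in which the per-stage delay is an assumed parameter:

```python
def daisy_chain_schedule(n_switches, stage_delay):
    # Daisy-chained power-switch enables: each switch turns on
    # `stage_delay` after its predecessor, so the in-rush current is
    # spread over time instead of the whole fabric switching at once.
    return [i * stage_delay for i in range(n_switches)]
```

The last entry of the schedule gives the minimum time the controller must wait before letting the power-gated block resume normal operation.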
Power-gating control without retention is shown in Fig. 9.26. The following se-
quence is followed at the time of switching OFF:
• Flush through any bus or external operation in progress
• Stop the clock at appropriate phase
Fig. 9.25 Activity profile for power gating: active intervals (Activity-1, Activity-2, Activity-3)
separated by power-gated intervals in which only leakage power is drawn, with SLEEP and
WAKE events marking the transitions

Fig. 9.26 Power-gating control without retention: waveforms of CLOCK, N_ISOLATE,
N_RESET, and N_PWRON
At the time of switching ON, the same signals are sent in reverse order as follows:
• De-assert the power-gating control signal to power back up the block
• De-assert reset to ensure clean initialization following the gated power-up
• Assert the state retention restore condition
• De-assert the isolation control signal to restore all outputs
• Restart the clocks, without glitches and without violating minimum pulse width
design constraints
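The two orderings can be captured as lists of signal transitions; a minimal behavioral sketch for the no-retention case, with signal names taken from Fig. 9.26 and the list-based controller model itself an assumption:

```python
# Power-down ordering for a controller without retention, modeled as an
# ordered list of (signal, level) transitions. All controls are active low.
POWER_DOWN = [
    ("CLOCK", 0),      # stop the clock at the appropriate phase
    ("N_ISOLATE", 0),  # assert isolation: clamp the block outputs
    ("N_RESET", 0),    # assert reset
    ("N_PWRON", 0),    # assert the power-gating control: switch power off
]

# Switching on sends the same signals in reverse order, de-asserted.
POWER_UP = [(sig, 1 - lvl) for sig, lvl in reversed(POWER_DOWN)]
```

Note that `POWER_UP` begins by de-asserting `N_PWRON` and ends by restarting the clock, matching the sequence listed above.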
9.10 Power Management
The basic idea of power management stems from the fact that all parts of a circuit
are not needed to function all the time. The power management scheme can identify
conditions under which either certain parts of the circuit or the entire circuit can re-
main idle and shut them down to reduce power consumption. For example, the most
conventional approach used in the X86-compatible processors is to regulate the
power consumption by rapidly alternating between running the processor at full speed
and turning the processor off. A different performance level is achieved by varying
the on/off time (duty cycle) of the processor. The mobile X86 power management
model provides one ‘ON’ state corresponding to normal (100 %) activity level and
multiple states as shown in Table 9.2. It is based on the advanced configuration and
power interface (ACPI), a standard developed by Microsoft, Intel, and Toshiba [9].
The relationship between the apparent performance level and the power con-
sumption is shown in Fig. 9.28. By decreasing the amount of time the processor
spends in the active ‘ON’ state, the performance level as well as the power con-
sumption level can be decreased linearly as shown in the figure.
However, this approach delivers performance in discrete bursts, which tends to
be unsuitable for smooth multimedia content such as software-based DVD or MP3
playback. This may create artefacts, such as dropped frames during movie playback
that are perceptible to a user.

Fig. 9.28 Power consumption as a function of activity level under traditional duty-cycle-based
power management

In such a situation, the dynamic voltage and frequency scaling (DVFS) technique
can be used to match the desired performance, allowing the power reduction to
follow a cubic relationship, as shown in Fig. 9.29. This approach drives the
processor frequency down to the point where the traditional sleep states almost
disappear.
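The linear and cubic relationships of Figs. 9.28 and 9.29 can be compared with an idealized numerical model; the value of p_max and the cubic scaling law are simplifying assumptions, not measured data:

```python
def duty_cycle_power(activity, p_max=0.5):
    # Traditional on/off power management: average power falls linearly
    # with the fraction of time spent in the 'ON' state.
    return p_max * activity

def dvfs_power(activity, p_max=0.5):
    # Under DVFS, frequency scales with the required activity and voltage
    # scales roughly with frequency, so P ~ f * V^2 ~ activity^3.
    return p_max * activity ** 3
```

At 50 % activity, the duty-cycle model still burns half the full-speed power, while the idealized DVFS model burns only one eighth of it, which is why DVFS is preferred for sustained medium workloads.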
However, DVFS hits the limits of physics and the curve flattens out when the
minimum frequency–voltage point is reached. At this point, it is more efficient
Fig. 9.29 Power consumption as a function of activity level with DVFS, showing the DVFS limit
9.11 Dual-Vt Assignment Approach (DTCMOS)

The availability of two or more threshold voltages on the same chip has opened
up a new opportunity of making tradeoffs between power and delay. Unlike the
MTCMOS approach, where additional sleep transistors are used to reduce leakage
power in standby mode, two or more threshold voltages may be assigned to the
transistors realizing functional gates in the circuit. This not only avoids the
additional area and delay due to the inserted sleep transistors but also achieves
power reduction in both the standby and active modes of operation of the circuit.
As we know, the delay of a logic gate increases with the threshold voltage of
the transistors, as shown in Fig. 9.2b. The static power due to the subthreshold leakage
current, on the other hand, decreases with the increase in the threshold voltage as
shown in Fig. 9.2a. So, by the judicious use of both low and high-Vt transistors in a
circuit, one can try to realize digital circuits with speed corresponding to the low-Vt
transistors and power consumption corresponding to the high-Vt transistors. The
key idea is to use low-Vt transistors for realizing gates on the critical path so that
there is no degradation in performance and high-Vt transistors in realizing gates off
the critical path such that leakage power consumption is significantly lower.
This is illustrated with the help of an example circuit shown in Fig. 9.31. Fig-
ure 9.31a is the original single-Vt circuit, where the supply voltage is 1 V and
threshold voltage is 0.2 V. Gates on the critical paths are shown by dark shade. Fig-
ure 9.31b shows a dual-Vt circuit, in which a high Vt of 0.25 V is assigned to all the
nodes on the off-critical path. The same circuit is shown in Fig. 9.31c and d with a
high-Vt of 0.396 V and 0.46 V, respectively. Note that to maintain the delay less than
or equal to the critical path, some nodes on the off-critical path are assigned with
low Vt. Figure 9.32 shows the standby leakage power of the example circuit with
different high-Vt values in the range of 0.2–0.5 V. The diagram shows that there
exists an optimum high-Vt value for which the standby leakage power is minimum.
This has opened up a problem known as the dual-Vt assignment problem. The goal is
to obtain an optimum value of high Vt such that it can be assigned to the maximum
number of gates to realize circuits requiring minimum standby leakage power.
Depending on the application for which the circuit is designed, there are two
alternative approaches—delay-constrained and energy-constrained realizations.
In the first case, the circuit is designed for high-performance applications by op-
timizing power dissipation with constraint on delay, i.e. without compromise in
performance. This problem has been addressed in several recent research publica-
tions [10–14]. The basic approach used in these works is to realize circuits using
low-threshold voltage transistors (low-Vt) for gates on the critical path and high-
threshold voltage transistors (high-Vt) for gates on the off-critical path of the circuit
under realization. It has been demonstrated that significant savings in leakage
power are possible for the optimized circuit based on this approach.
In the second case, the circuit is designed for battery-operated, hand-held and
portable systems, where the energy requirement for the circuit is minimized at the
expense of performance. A preliminary work on this was reported in [13]. Based on
the observation that a small compromise in performance (increase in delay) leads to
significant reduction in leakage power, a novel approach has been developed for the
synthesis of dual-Vt CMOS circuits. This may be termed the energy-constrained
approach.
In both the cases, the key issue is how efficiently the two threshold voltages are
assigned to the MOS transistors of different gates to minimize power (energy) con-
sumption of the realized circuits.
Fig. 9.31 a Original single-Vt example circuit with gates on the critical path shaded; b–d the
same circuit with dual-Vt assignment, distinguishing nodes on the critical path (low-Vth),
off-critical-path nodes with low-Vth, and off-critical-path nodes with high-Vth
9.12 Delay-Constrained Dual-Vt CMOS Circuits
The basic approach in dual-Vt CMOS circuits is to use two threshold voltages
(0.2Vdd ≤ Vt ≤ 0.5Vdd ) instead of single threshold voltage so that both low power
and high performance circuits can be realized. In a dual-Vt CMOS circuit, as shown
in Fig. 9.33, low Vt is assigned to the transistors for the gates on the critical paths
to obtain a smaller delay and a high Vt is assigned to all the transistors for the gates
on the off-critical path to get a smaller power dissipation. It may be noted that not
all gates on the off-critical paths can be assigned a high Vt, because the critical
paths, and hence the delay, might change. The savings in leakage power in this
approach depend on how many transistors can be assigned a high Vt: the larger the
number of transistors assigned high Vt, the lower the leakage power. However, the
assignment of dual Vt to obtain maximum savings in leakage power is an
NP-complete problem.
In the synthesis of delay-constrained dual-Vt CMOS circuits, the problem is to
reduce the standby leakage power subject to the constraint of not increasing the delay.
In [12], a heuristic-based algorithm has been presented which, for a given circuit with
a fixed supply voltage (1 V) and a fixed low-threshold voltage Vt L (0.2V) for gates
on the critical path, finds a ‘static power optimum’ high-threshold voltage Vt H , and
selects a subset of the gates on the off-critical path which can be switched to Vt H .
This approach demonstrated significant savings in the leakage power for dual-Vt
CMOS circuits without degradation in performance. However, this technique
employed a simple backward breadth-first search (BFS) strategy to identify, by
examining slacks, the subset of gates that can be switched to VtH. As far as
computational effort is concerned, this heuristic-based approach is fast, but it
selects fewer gates for assigning VtH on the off-critical path. Another approach,
called the delay-balancing method, is based on the insertion of specific-delay
fictitious (SDF) buffers into the circuit, transforming a circuit graph G into a
functionally equivalent circuit G′ in which every wire has zero edge-slack; it then
assigns high threshold voltages to gates so as to eliminate all these SDF buffers.
However, the dual-Vt assignment in that approach is formulated within an integer
linear programming framework and thus takes an inordinate amount of CPU time
for large circuits. Moreover, this technique is believed to give good results when an
arbitrary number of threshold voltages is allowed, rather than the two used in
dual-Vt CMOS circuits. Here, the same problem has been addressed with a better
heuristic-based algorithm such that VtH is assigned to a larger number of gates,
leading to greater savings in leakage power.
Algorithms
A directed acyclic graph (DAG) representation G( V, E) of a gate-level combina-
tional circuit is assumed, where V is a set of nodes representing the gates and E is
a set of edges representing the interconnection among the gates. The graph is first
initialized by assigning, for each node x in G, the values of tp(x), the propagation
delay of node x; Ta(x), the arrival time (the propagation delay along each fan-in
path of node x); Tl(x), the departure time; and Tmax(x), the maximum propagation
delay. For each primary input x, tp(x) = 0, Ta(x) = 0, Tl(x) = 0, and Tmax(x) = 0. For
all other nodes in the graph, Tmax(x) and Tl(x) are computed as:

Tmax(x) = max {Ta(x)[i]}, i ∈ fanin(x)
Tl(x) = Tmax(x) + tp(x).
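The two relations above translate directly into a topological-order traversal; a sketch assuming a simple dictionary-based circuit representation:

```python
def timing(fanin, tp, topo_order):
    # Compute Tmax(x) and Tl(x) for every node of a DAG in topological
    # order. `fanin[x]` lists the fan-in nodes of x, `tp[x]` is its
    # propagation delay; primary inputs have empty fan-in lists.
    tmax, tl = {}, {}
    for x in topo_order:
        # The arrival time along each fan-in path of x is the departure
        # time Tl of the corresponding fan-in node.
        tmax[x] = max((tl[i] for i in fanin[x]), default=0)
        tl[x] = tmax[x] + tp[x]
    return tmax, tl
```

The circuit delay is then simply the largest Tl over all primary outputs, i.e. the longest propagation time through a critical path.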
The delay of a circuit is defined as the longest propagation time through a critical
path. In dual-VT CMOS circuits, the minimum delay occurs when all transistors in
the circuit are assigned with a low-threshold voltage ( Vt L = 0.2Vdd ). Here, Vt H is
initially assigned to the transistors of all gates. In this initial condition, the circuit
consumes minimum power but incurs maximum delay. In order to reduce the delay,
the procedure selects a gate on the critical path, in depth-first fashion, that can be
assigned VtL, and assigns VtL to all transistors in that gate. As this assignment may
alter the critical path, all critical paths are recomputed, and the low-threshold-
voltage assignment is iterated until no gate on the critical path(s) remains without
VtL. When all nodes on the critical paths have been assigned VtL, the circuit attains
minimum delay. The pseudo code for assigning an arbitrary high threshold voltage
(Vti) to the transistors of the majority of gates on the off-critical paths is given
below.
1. Initialize all the nodes with Vti.
2. Compute the critical paths in the circuit.
3. Select a node on the critical path, in depth-first fashion, that is not assigned
with VtL. If there is no such node, then go to Step 6.
4. Assign the node with VtL.
5. Go to Step 2.
6. Stop.
The algorithm Dual-Vt Assignment is based on depth-first search with an iteration
heuristic and assigns VtL to the transistors of gates on the critical paths and a high
threshold voltage Vti to the transistors of gates on the off-critical paths. It has been
observed [10] that, for a given circuit with VtL = 0.2Vdd, the standby leakage power
varies with the value of Vti. In other words, there is an optimal value of Vti for
which the maximum number of nodes can be assigned the high Vt so that minimum
leakage power is achieved (Fig. 9.32). To select the optimal value of VtH, the
dual-Vt assignment procedure is repeated with various values of
Vti (0.2Vdd < Vti ≤ 0.5Vdd), as given in the algorithm Delay Constrained Dual-Vt
Assignment.
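The six steps, together with the outer sweep over Vti, can be sketched as follows; the circuit model and the helpers `critical_path_nodes` and `leakage` are placeholders for illustration, not the implementation from [12]:

```python
VT_L = 0.2  # low threshold, 0.2*Vdd with Vdd = 1 V

def dual_vt_assignment(nodes, vti, critical_path_nodes):
    vt = {x: vti for x in nodes}            # Step 1: all nodes get Vti
    while True:
        critical = critical_path_nodes(vt)  # Step 2: recompute critical paths
        todo = [x for x in critical if vt[x] != VT_L]  # Step 3
        if not todo:
            return vt                       # Step 6: stop
        vt[todo[0]] = VT_L                  # Step 4; loop back is Step 5

def delay_constrained(nodes, vti_values, critical_path_nodes, leakage):
    # Repeat the assignment for each candidate Vti and keep the
    # assignment with minimum standby leakage power.
    return min((dual_vt_assignment(nodes, v, critical_path_nodes)
                for v in vti_values), key=leakage)
```

Recomputing the critical paths inside the loop is what distinguishes this iteration from a single backward pass, and is also the source of the O(n²) cost discussed below.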
Fig. 9.34 Leakage power as a function of threshold voltage, with the optimal VT lying between
0.2 and 0.5 V
Fig. 9.35 Example circuit in which the slack at node y, driven by node x, can be distributed
among nodes i, j, k, and l in its transitive fan-in cone toward the primary inputs (PIs)
leakage power is minimum. Now, let us consider Fig. 9.35 as an example. The slack
present with the node y is to be distributed among the nodes in its transitive fan-in
cone. The algorithm in [10] would assign the slack to node x itself while checking,
if x can be assigned a high value for Vt. On the other hand, the Dual-Vt Assignment
assigns the slack not to x, but rather to each of i, j, k, and l. Thus, the algorithm pro-
posed in [12] assigns the higher threshold to four gates as opposed to one in [10].
However, the time complexity of the algorithm in [10] is O(n). Actually, the order
is that of BFS on a graph, and since the graph has at most 8n edges (a maximum of
four forward edges and four backward edges per node), the complexity O(|V| + |E|)
degenerates to O(|V|). On the other hand, the time complexity of the algorithm in
[12] is O(n²), because finding the critical path takes O(n) time and the critical path,
which keeps changing, has to be found in each iteration. As a consequence, the
proposed algorithm takes longer to compute. With the availability of faster
computers, this does not appear to be a serious limitation; moreover, experimental
results show that even for a reasonably large benchmark circuit, the overall time
requirement is only about double that of [10].
9.13 Energy-Constrained Dual-Vt CMOS Circuits
Fig. 9.36 Leakage energy as a function of delay: starting from the minimum delay at
VT = 0.2VDD (low VT), the energy falls to a knee where the slope ΔE/ΔT = −1
To understand how the leakage energy changes with threshold voltage, let us have
a look at Fig. 9.36. The graph in Fig. 9.36 shows the change in leakage energy with
the delay, which depends on the threshold voltage of the transistors. For the entire
range of threshold voltage variations, the curve looks like a typical bathtub curve.
Initially, the energy decreases rapidly and for a small increase in delay, there is a
large reduction in leakage power. After this initial part, the leakage energy remains
more or less constant. This is due to the fact that the reduction in leakage power
and increase in delay take place at the same rate. Subsequently, when the threshold
voltage approaches the supply voltage, there is a sharp increase in delay with a very
small reduction in leakage power, leading to an increase in leakage energy
dissipation. The leakage energy versus delay graph thus shows a decrease in energy
up to a point beyond which there is only a very small decrease in leakage for a
large increase in delay. This point corresponds to the slope ΔE/ΔT = −1, and the
leakage energy there is very near the minimum possible, as shown in Fig. 9.36. As
reported in [13], simulation studies reveal that, for most circuits, an increase in
delay of only 4–9 % yields a decrease in leakage energy of around 70 %. Thus, if
the delay is increased by a small percentage over the minimum delay, a significant
amount of leakage energy can be saved. In this alternative approach to synthesizing
dual-Vt CMOS circuits, a small amount of performance is traded for near-minimal
energy. The pseudo code for energy-constrained dual-Vt CMOS circuits is given
below.
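A minimal sketch of such an energy-constrained selection, under the assumption that the procedure sweeps the delay budget and stops near the ΔE/ΔT = −1 knee of the normalized energy–delay curve (the `energy` function stands in for a full circuit evaluation):

```python
def knee_delay(delays, energy):
    # Walk the normalized energy-delay curve (delays in increasing order)
    # and stop where the slope dE/dT first becomes shallower than -1:
    # beyond this point a large delay increase buys little energy saving.
    for i in range(1, len(delays)):
        slope = (energy(delays[i]) - energy(delays[i - 1])) / (
            delays[i] - delays[i - 1])
        if slope > -1.0:
            return delays[i - 1]
    return delays[-1]
```

The selected delay budget would then be handed to the dual-Vt assignment of Sect. 9.12 as a relaxed delay constraint.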
Implementation
The algorithms were implemented in C on a Sun UltraSPARC 10 workstation with
256 MB of memory. The Berkeley SIS software tool was employed to generate the
netlist for a given circuit. Each benchmark circuit is first optimized with the
script.rugged script in SIS, and then gate mapping is done with the minimum-area
option using the standard library 44-1.genlib. In the output file generated by the
SIS tool, the circuit is realized by
using only NAND, NOR, and NOT gates. In order to simplify the analysis, the first
two types of gates with two to four inputs are taken. The mapped circuit is then
translated into a bi-directed graph using a parser and a graph generator program.
To calculate the delay, leakage power, and dynamic power, the estimation models
proposed in [11] are used. Energy is measured as the product of power and delay.
For the power calculation, the transition activity has been assumed to be 0.3 for all
inputs, although it can be as low as 0.1 in many situations, leading to a smaller
switching power requirement. The temperature is assumed to be 110 °C in the
active mode and 25 °C in the standby mode of operation of the circuit. The supply
voltage is taken as 1.0 V, and the zero-bias threshold voltage used in the
experiments is 0.2 V. It has
been observed that the threshold voltage increases by about 0.8 mV for every 1 °C
decrease in temperature. The threshold voltage at standby mode is taken as 68 mV
higher than the threshold voltage at active mode. The values of various transis-
tor parameters have been extracted using the BSIM3V3 model and for the TSMC
0.18 µm process technology. The effective channel length of the transistors is taken
as 0.18 µm and the gate oxide thickness as 40 Å. For simplicity, all
transistors are assumed to have the same channel length of 0.18 μm while the chan-
nel widths for nMOSFETs and pMOSFETs are assumed to be 0.54 and 1.62 μm,
respectively. The subthreshold swing coefficient is taken as 1.44 and the body effect
coefficient and DIBL coefficients are 0.03 and 0.21 for nMOSFETs and 0.02 and
0.11 for pMOSFETs, respectively. The value of α is taken as 1.3 (for short-channel
MOSFETs). However, our experiments are not limited to such assumptions.
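The 68 mV standby offset is simply the quoted temperature coefficient applied over the 85 °C difference between the active and standby operating points:

```python
# Threshold-voltage shift between active (110 degC) and standby (25 degC)
# modes, using the quoted coefficient of 0.8 mV per degC of cooling.
coeff_mv_per_degC = 0.8
active_temp_degC, standby_temp_degC = 110, 25
delta_vt_mv = (active_temp_degC - standby_temp_degC) * coeff_mv_per_degC
# 85 degC * 0.8 mV/degC = 68 mV, the standby offset used above
```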
Experimental Results
Experimental results for the synthesis of delay-constrained dual-Vt CMOS circuits
based on the algorithm Delay Constrained Dual-Vt Assignment are presented in
Tables 9.3 and 9.4. Comparative charts are presented, as shown in Fig. 9.37, to de-
pict how the reduction of leakage power occurs in static CMOS circuits synthesized
for three cases: all low Vt, dual Vt, and all high Vt.

Fig. 9.37 Reduction of leakage power in active mode in delay-constrained realization, comparing
leakage power for all low-Vt, dual-Vt, and all high-Vt circuits

The dissipation of leakage power in the active mode and standby mode is
presented in Table 9.3. The leakage power
dissipations for single-Vt CMOS circuits are shown in column 2 and column 5 of
Table 9.3 in active mode and standby mode, respectively. Column 3 and column 6
of Table 9.3 present the corresponding leakage power of the dual-Vt circuits in
active mode and standby mode, respectively. The percentage reduction of leakage
power in dual-Vt circuits with respect to the single-Vt realization in active mode
and standby mode is shown in column 4 and column 7 of Table 9.3, respectively. It
is observed that, on average, the percentage reductions of leakage power in
delay-constrained dual-Vt CMOS circuits are 79.97 % and 82.82 % in active mode
and standby mode, respectively, compared to single low-Vt CMOS circuits.
Table 9.4 Delay and total power in single-Vt and delay-constrained dual-Vt CMOS circuits
C432 3.55 6.93 329.00 78.29 64.66 407.29 393.66 570.02 405.48
C499 2.38 6.72 688.76 131.76 118.53 820.52 807.29 1035.95 850.15
C880 2.37 7.24 527.30 122.18 100.05 649.48 627.35 809.85 637.91
C1355 2.78 6.51 849.01 196.28 179.08 1045.29 1028.09 1358.78 1097.2
C1908 3.10 6.53 880.43 187.66 161.11 1068.09 1041.54 1415.47 1088.89
C2670 3.14 4.76 1204.89 246.65 200.98 1451.54 1405.87 1943.21 1429.11
C3540 4.88 4.59 1404.61 463.74 379.07 1868.35 1783.68 2827.11 1839.56
C5315 3.90 6.85 2458.98 574.18 452.01 3033.16 2910.99 4518.15 2999.84
C6288 12.65 6.84 3973.30 873.31 759.96 4846.61 4733.26 11011.52 5345.37
C7552 3.20 7.02 4079.53 729.18 610.84 4808.71 4690.37 6371.42 4822.63
In Table 9.4, the total power (sum of the switching power and leakage power)
requirement in delay-constrained dual-Vt CMOS circuits is presented. The delay of
the circuits is shown in column 2 of Table 9.4. The total switching power dissipated
at the load capacitances is shown in column 3 of Table 9.4. The switching power
dissipations at internal node capacitances in single-Vt and dual-Vt CMOS circuits
are shown in column 4 and column 5 of Table 9.4, respectively. The total switching
power is the sum of the switching power at load capacitances and at internal node
capacitances as shown in column 6 and column 7 of Table 9.4 for single-Vt and
dual-Vt CMOS circuits, respectively. The total power consumption, which includes
total switching power and leakage power of the circuits with single-Vt and delay-
constrained dual-Vt realizations, are shown in column 8 and column 9 of Table 9.4,
respectively. It can be observed that, on average, the percentage reduction of total
power in delay-constrained dual-Vt CMOS circuits with respect to single-Vt CMOS
circuits is 24.92 %.
Table 9.7 Energy reduction in delay-constrained and energy-constrained dual-Vt CMOS circuits
Benchmark % Redn. in delay-constrained % Redn. in energy-constrained
dual-Vt CMOS circuits dual-Vt CMOS circuits
Active mode Standby mode Active mode Standby mode
C432 26.17 87.35 28.86 95.19
C499 12.68 64.45 17.94 86.34
C880 19.41 88.65 21.23 95.66
C1355 13.15 59.95 19.25 84.67
C1908 21.75 83.45 23.07 90.30
C2670 24.80 92.96 26.46 97.23
C3540 32.29 90.42 34.93 96.32
C5315 31.05 89.69 33.60 96.02
C6288 45.42 83.69 51.45 93.66
C7552 22.36 87.65 24.31 94.26
Average 24.92 82.82 28.11 92.96
DVFS has been found to be very effective in reducing dynamic power dissipation.
For sub-100-nm generations, where leakage power is a significant portion of the to-
tal power at run time, a dynamic Vth scheme (DVTS) can be used to reduce the run-
time leakage power just like the DVFS scheme is used to reduce dynamic power.
When the workload is less than the maximum value, the processor can be operated
Fig. 9.38 Reduction of leakage energy in standby mode for energy-constrained realization,
comparing leakage energy for all low-Vt, dual-Vt, and all high-Vt circuits
at a lower clock frequency and instead of reducing the supply voltage, the DVTS
hardware is used to raise the threshold voltage using reverse body biasing. This
increases the delay to match the lower clock frequency and in turn will reduce the
run-time leakage power. The processor is controlled to deliver just enough
throughput to sustain the current workload, by dynamically adjusting the threshold
voltage in an optimal manner such that the leakage power reduction is maximized.
A simpler scheme, called Vth hopping, dynamically switches between two
threshold voltages, low Vt and high Vt, as per the workload requirement.
Figure 9.39 shows an implementation of the Vth-hopping scheme. As shown in the
figure, the frequency controller generates either fCLK or fCLK/2 clock frequencies based on
the workload requirement and the VTH controller adjusts the threshold voltage by
selecting appropriate body bias voltages for both the nMOS and pMOS transistors.
9.15 Chapter Summary
• The dependence of delay of CMOS circuits on the threshold voltage has been
explained.
• The dependence of leakage power dissipation of CMOS circuits on the threshold
voltage of the MOS transistors has been explained.
Fig. 9.39 A simple approach for Vth hopping for leakage power minimization
9.16 Review Questions
Q9.1. For a logic gate, plot the variation of leakage power and delay with the thresh-
old voltage for a fixed power supply voltage.
Q9.2. Discuss various approaches for the fabrication of MOS transistors with
multiple threshold voltages.
References
1. Roy, K., Mukhopadhyay, S., Mahmoodi-Meimand, H.: Leakage current mechanisms and
leakage reduction techniques in deep-submicrometer CMOS circuits. Proceedings of the IEEE,
vol. 91, no. 2, pp. 305–327, February (2003)
2. Inukai, T., Hiramoto, T., Sakurai, T.: Variable Threshold Voltage CMOS (VTCMOS) in series
connected circuits, ISLPED ’01, Huntington Beach, California, USA, pp. 201–206, August
6–7 (2001)
3. Kang, Sung Mo, Leblebici, Y. : CMOS digital integrated circuits-analysis and design, 3rd edn.
Tata McGraw Hill, New Delhi (2003)
4. Johnson, M.C., Roy, K., Somasekhar, D.: A model for leakage control by transistor stacking,
Technical Report TR-ECE 97-12, Purdue University (1997)
5. Gu, R.X., Elmasry, M.I.: Power dissipation analysis and optimization of deep submicron
CMOS digital circuits, IEEE JSSC 31(5):707–713 (1996)
6. Yang, S., Wolf, W., Vijaykrishnan, N., Xie, Y., Wang, W.: Accurate stacking effect macro-
modeling of leakage power in sub-100 nm circuits. Proceedings of the 18th International Con-
ference on VLSI Design held jointly with 4th International Conference on Embedded Systems
Design (VLSID’05), Kolkata, January 3–7 (2005)
7. Mutoh, S., et al.: A 1 V multi-threshold voltage CMOS DSP with an efficient power manage-
ment technique for mobile phone application, International Solid-State Circuits Conference
(ISSCC), pp. 168–169, Feb. (1996)
8. Keating, M., Flynn, D., Aitken, R., Gibbons, R., Shi, K.: Low power methodology manual:
for system-on-chip design, Springer (2007)
9. Hewlett-Packard, Intel Corporation, Microsoft, Phoenix Technologies, Toshiba: Advanced
Configuration and Power Interface Specification, Revision 5.0, December 6 (2011)
10. Wei, L., Chen, Z., Roy, K., Johnson, M.C., Ye, Y., De, V.: Design and optimization of dual
threshold circuits for low voltage low power applications, IEEE Trans. VLSI Syst. 7(1):16–
24 (1999)
11. Kim, Chris H., Roy, K.: Dynamic VTH scaling scheme for active leakage power reduction,
Proceedings of the 2002 Design, Automation and Test in Europe Conference and Exhibition
(DATE02), pp. 163–167 (2002)
12. Tripathi, N., Bhosle, A., Samanta, D., Pal, A.: Optimal assignment of high-VT for synthesiz-
ing dual-VT CMOS circuits, IEEE/ACM Proceedings of the 14th International Conference on
VLSI Design, pp. 227–232, January (2001)
13. Samanta, D., Pal, A.: Optimal dual-VT assignment for low-voltage energy-constrained CMOS
circuits, IEEE/ACM Proceedings of the 7th ASP-DAC and 15th International Conference on
VLSI Design, pp. 193–198, January (2002)
14. Sundararajan, V., Parhi, K.K.: Low power synthesis of dual threshold voltage CMOS VLSI
circuits, IEEE/ACM Proceedings of the International Symposium on Low Power Electronics
Design (ISLPED), pp. 363–368, July (1999)
Chapter 10
Adiabatic Logic Circuits
10.1 Introduction
10.2 Adiabatic Charging
To introduce the basic concept of adiabatic circuits, we first consider the
conventional charging of a capacitor C through a resistor R, followed by adiabatic
charging [2]. Figure 10.1a shows a resistor R and a capacitor C in series with a
supply voltage Vdd. When the switch is closed at time t = 0, current starts flowing.
Initially, at time t = 0, the capacitor holds no charge; therefore, the voltage across
the capacitor is 0 V and the voltage across the resistor is Vdd, so a current of Vdd/R
flows through the circuit. As current flows through the circuit, charge
accumulates in the capacitor and the voltage builds up. As time progresses, the
voltage across the resistor decreases, with a consequent reduction in the current
through the circuit. At any instant of time, Vdd = IR + Q/C, where Q is the charge
stored in the capacitor. Figure 10.1b shows how conventional charging of a
capacitor leads to the dissipation of an energy of (1/2)CVdd².
Now let us consider the adiabatic charging of a capacitor as shown in Fig. 10.2.
Here, a capacitor C is charged through a resistor R using a constant current I( t) in-
stead of a fixed voltage Vdd. Here also it is assumed that initially at time t = 0, there
is no charge in the capacitor. The voltage across the capacitor Vc( t) is a function of
time and it can be represented by the following expression:
Vc(t) = (1/C)·I(t)·t  or  I(t) = C·Vc(t)/t

Assuming the current is constant, the energy dissipated by the resistor in the time interval 0 to T is given by

Ediss = ∫₀ᵀ R·I²(t) dt = R·I²(T)·T = (RC/T)·C·Vc²(T).   (10.1)
Fig. 10.3 Output waveform of a pulsed power supply: the voltage VA ramps up in the precharge phase, is held constant in the hold phase, ramps down in the recover phase and stays at 0 V in the wait phase
• The energy dissipated can be made arbitrarily small by increasing the charging time T: the longer T, the smaller the dissipated energy.
• Contrary to conventional charging, where the energy dissipation is independent of the value of the resistor R, here the dissipated energy is proportional to R. This opens up the possibility of reducing the energy dissipation by decreasing the value of R.
• In the case of adiabatic charging, the charge moves from the power supply slowly but efficiently, dissipating less energy.
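These two charging regimes can be compared numerically. The sketch below (illustrative code, not from the text) implements the conventional result (1/2)·C·Vdd² and the adiabatic result of Eq. (10.1); the component values are assumed for the example:

```python
def conventional_charge_energy(C, Vdd):
    """Energy dissipated when C is charged to Vdd from a fixed supply:
    E = (1/2)*C*Vdd**2, independent of R and of the charging time."""
    return 0.5 * C * Vdd ** 2

def adiabatic_charge_energy(R, C, Vdd, T):
    """Energy dissipated when the same capacitor is charged by a constant
    current over the time T, Eq. (10.1): E = (R*C/T)*C*Vdd**2."""
    return (R * C / T) * C * Vdd ** 2

# Assumed example values: C = 100 fF, R = 1 kOhm, Vdd = 1.8 V.
C, R, Vdd = 100e-15, 1e3, 1.8
E_conv = conventional_charge_energy(C, Vdd)
for T in (1e-9, 10e-9, 100e-9):
    ratio = adiabatic_charge_energy(R, C, Vdd, T) / E_conv
    # the ratio equals 2*R*C/T, so stretching T shrinks the loss
    print(f"T = {T:.0e} s: E_adiabatic/E_conventional = {ratio:.3f}")
```

For these values, lengthening the ramp from 1 ns to 100 ns reduces the dissipation from 20 % to 0.2 % of the conventional loss.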
Moreover, the necessary computation is performed after the capacitor has been charged, when it provides a stable voltage, and the energy stored in the capacitor can then be returned to the power supply after the computation is over. This, however, requires that the power supply be specially designed to retrieve the energy, in the form of charge, back into the supply. Another requirement is that the power supply must generate a non-standard, time-varying output, contrary to the fixed voltage output generated by standard power supplies. Such supplies are known as ‘pulsed power supplies’ and have the output characteristic shown in Fig. 10.3. It may be noted that the waveform has four phases: precharge, hold, recover and wait. In the precharge phase, the load capacitor is adiabatically charged; in the hold phase, the necessary computation is performed; in the recover phase, the charge is transferred back to the power supply; and finally the circuit idles in the wait phase before a new precharge phase starts.
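The four-phase waveform can be modelled as a simple piecewise-linear function. The sketch below is an illustration, with the four phase durations as assumed parameters:

```python
def pulsed_supply(t, Vdd, t_pre, t_hold, t_rec, t_wait):
    """Piecewise-linear model of one period of a pulsed power supply:
    ramp up (precharge), flat top (hold), ramp down (recover), 0 V (wait)."""
    period = t_pre + t_hold + t_rec + t_wait
    t = t % period
    if t < t_pre:                       # precharge: adiabatic ramp 0 -> Vdd
        return Vdd * t / t_pre
    t -= t_pre
    if t < t_hold:                      # hold: stable Vdd, computation happens here
        return Vdd
    t -= t_hold
    if t < t_rec:                       # recover: ramp Vdd -> 0, energy returned
        return Vdd * (1.0 - t / t_rec)
    return 0.0                          # wait: idle before the next precharge
```

With unit phase durations, `pulsed_supply(0.5, 1.8, 1, 1, 1, 1)` returns 0.9 V, halfway up the precharge ramp.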
10.3 Adiabatic Amplification
Step 1: Input X and its complement are applied to the circuit, which remain stable
in the following steps.
Step 2: The amplifier is activated by applying VA, which is a slow ramp voltage from
0 V to Vdd.
Step 3: The load capacitor connected through the conducting transmission gate is adiabatically charged to VA, while the other one is clamped to 0 V, during the transition time T.
Step 4: After the charging is complete, the output signal pair remains stable and can
be used as inputs to the next stage of the circuit.
Step 5: The amplifier is de-energized by ramping the voltage from VA to 0 V. In this
step, the energy that was stored in C is transferred back to the power supply.
Let us now consider the energy dissipation in the above operation, which takes place in steps 3 and 5. As VA ramps up and down between 0 V and Vdd, the states of the two transistors of the transmission gate change. Both transistors operate in the non-saturated region in the middle part of the ramp-up and ramp-down (between |Vtp| and Vdd−Vtn). Initially, the nMOS transistor is ON and remains ON until the output reaches the voltage (Vdd−Vtn). On the other hand, the pMOS transistor turns ON when the ramp voltage attains |Vtp| and remains ON until the ramp reaches its maximum value.
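The ramp-charging behaviour described above can be explored numerically. The sketch below (an illustration, not from the text) integrates a capacitor charged through a fixed resistance by a linear ramp and shows that the dissipation approaches the adiabatic limit (RC/T)·C·Vdd² for slow ramps; a constant R is a simplifying assumption, since a real transmission gate has a voltage-dependent on-resistance:

```python
def ramp_charge_dissipation(R, C, Vdd, T, steps=100000):
    """Forward-Euler integration of a capacitor C charged through a fixed
    resistance R by a linear voltage ramp 0 -> Vdd of duration T; returns
    the energy dissipated in R."""
    dt = T / steps
    vc = 0.0          # capacitor voltage
    energy = 0.0      # accumulated I^2 * R loss
    for k in range(steps):
        va = Vdd * (k + 0.5) / steps    # ramp value at the mid-point of the step
        i = (va - vc) / R               # current through the resistance
        energy += i * i * R * dt
        vc += i * dt / C
    return energy

# For T >> R*C the result approaches the adiabatic limit (R*C/T)*C*Vdd**2.
```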
Starting with a static CMOS gate, the adiabatic logic gate for the same Boolean
function can be realized using the following steps:
Step 1: Replace the pull-up pMOS network and the pull-down nMOS network of the static CMOS circuit with transmission-gate networks.
Step 2: Use the expanded pull-up network to drive the true output load capacitance.
Step 3: Use the expanded pull-down network to drive the complementary output
load capacitance.
Step 4: Replace Vdd by a pulsed power supply VA.
Figure 10.5a shows the schematic diagram of a static CMOS circuit with its pull-up
and pull-down blocks. Figure 10.5b shows the transformed adiabatic circuit where
both the networks are used to charge and discharge the load capacitances adiabati-
cally. Figure 10.6 shows the realization of the adiabatic AND/NAND gate based on
the above procedure.

Fig. 10.5 a Static CMOS schematic diagram, b adiabatic circuit schematic diagram

Fig. 10.6 Realization of the adiabatic AND/NAND gate with inputs X and Y

In this way, the adiabatic realization of any function can be
performed. It may be noted that the number of transistors required for the realiza-
tion of the adiabatic circuit is larger than that of the static CMOS realization of the
same function.
10.5 Pulsed Power Supply

The pulsed power supply plays an important role in the realization of adiabatic circuits. As discussed, adiabatic circuits dissipate less energy during charging/discharging of the load capacitance than static CMOS circuits. The pulsed power supply also allows energy recovery during discharge of the load capacitance and, above all, serves as the timing clock for the adiabatic circuits. The recovered node energies are stored back in the pulsed power supply. The total energy consumed in an adiabatic switching operation is the sum of the energy consumed by the adiabatic circuit and by the pulsed power supply. Therefore, the pulsed power supply itself should dissipate as little energy as possible to achieve the maximum energy efficiency from an adiabatic circuit [3].
Fig. 10.7 Asynchronous two-phase clock generator a 2N, b 2N2P
Various circuit topologies for energy recovery have been proposed for different
adiabatic logic styles and for diverse applications [6]. The power clock generators
can be grouped into two main types: asynchronous and synchronous. Asynchronous
power clock generators are free running circuits that use feedback loops to self-
oscillate without any external timing signals. Figure 10.7 illustrates two commonly
used asynchronous power clock generators: 2N and 2N2P power clock generators.
These are simple, dual-rail LC oscillators in which the active elements are cross-coupled pairs of nMOS and pMOS transistors. Asynchronous structures suffer from several problems. For example, their oscillation frequencies are sensitive to variations of the capacitive load in different cycles of the system operation, resulting in an unstable operating frequency. In addition, they are unsuitable for generating the four-phase or higher phase-shifted power clocks required in the realization of some adiabatic circuits. Finally, in a large system, the inputs and outputs of each module should be synchronized with the other modules, which prohibits the integration of an adiabatic module driven by an asynchronous power clock generator into a larger non-adiabatic system. Using phase-locked loops or synchronizers such as
self-timed first-in-first-out (FIFO) memory devices would not be energy- and area-
efficient solutions to this problem. In such cases, the synchronous power clock gen-
erators overcome the above-mentioned problems and provide a better alternative in
terms of efficiency.
Synchronous power clock generators are synchronized to external timing signals
usually available in large systems. Figure 10.8 illustrates two synchronous pow-
er clock generators similar to the asynchronous counterparts except that the gate
control signals are generated externally. The capacitors CE1 and CE2 are external
balancing capacitors to achieve better conversion efficiency. The adiabatic module
can be easily synchronized to a larger conventional non-adiabatic system by using
synchronous power clock generators. Level-to-pulse and pulse-to-level converters
(as discussed in Sect. 7.6) can be used for interfacing between adiabatic and conventional circuits. It is also envisaged that synchronous power clock generators have higher energy efficiency than their asynchronous counterparts.
Fig. 10.8 Synchronous two-phase clock generator a 2N, b 2N2P
We have seen that the power dissipation during the charging of a capacitor by a constant current source can be minimized and, in the ideal case, approaches zero. This requires that the power supply be able to generate linear voltage ramps. Practical supplies can be constructed by using resonant inductor circuits to approximate the constant output current and the linear voltage ramp with sinusoidal signals. But the use of inductors presents several difficulties at the circuit level, especially in terms of chip-level integration and overall efficiency.
An alternative to using pure voltage ramps is to use stepwise supply voltage
waveforms [4], where the output voltage of the power supply is increased and de-
creased in small increments during charging and discharging. Since the energy dis-
sipation depends on the average voltage drop traversed by the charge that flows
onto the load capacitance, using smaller voltage steps, or increments, should reduce
the dissipation considerably.
Figure 10.9 shows a CMOS inverter driven by a stepwise supply voltage waveform. Assuming that the output voltage is initially zero and the input is set to the logic low level, the power supply voltage VA is increased from 0 to Vdd in n equal voltage steps, as shown in Fig. 10.10. Since the pMOS transistor is conducting
during this transition, the output load capacitance will be charged up in a stepwise
manner. The on-resistance of the pMOS transistor can be represented by the linear
resistor R. Thus, the output load capacitance is being charged up through a resistor,
in small voltage increments. For the ith time increment, the amount of capacitor
current can be expressed as
ic(t) = C·dVout(t)/dt = (VA(i+1) − Vout(t))/R

Solving this differential equation with the initial condition Vout(ti) = VA(i) yields

Vout(t) = VA(i+1) − (Vdd/n)·e^(−t/RC).
Here, n is the number of steps of the supply voltage waveform. The amount of energy dissipated during a single voltage step increment can now be found as

Estep = ∫₀^∞ ic²(t)·R dt = (1/n²)·(C·Vdd²/2).

Since n steps are used to charge up the capacitance to Vdd, the total dissipation is

Etotal = n·Estep = n·(CL·Vdd²)/(2n²) = CL·Vdd²/(2n).
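The per-step and total losses derived above can be checked with a few lines of code (an illustrative sketch, with assumed component values):

```python
def stepwise_charging_energy(C, Vdd, n):
    """Total energy dissipated when C is charged to Vdd in n equal voltage
    steps: each step burns (1/n**2)*(C*Vdd**2/2), so the total is
    C*Vdd**2/(2*n) -- notably independent of the switch resistance R."""
    e_step = (C * Vdd ** 2 / 2.0) / n ** 2
    return n * e_step

C, Vdd = 1e-12, 1.8
for n in (1, 2, 4, 8):
    print(n, stepwise_charging_energy(C, Vdd, n))
# n = 1 recovers the conventional C*Vdd**2/2; each doubling of n halves the loss.
```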
Fig. 10.9 CMOS inverter driven by a stepwise supply voltage VA; the conducting pMOS transistor is modelled by its on-resistance R in series with the load capacitance C carrying current iC

Fig. 10.10 Stepwise supply waveform: VA rises from V1 through VN in increments of Vdd/n, and Vout follows through successive RC charging steps
According to this simplified analysis, charging the output capacitance with n volt-
age steps, or increments, reduces the energy dissipation per cycle by a factor of n.
Therefore, the total power dissipation is also reduced by a factor of n using stepwise
charging. This result thus implies that if the voltage steps can be made very small
and the number of voltage steps n approaches infinity (i.e. if the supply voltage is a
slow linear ramp), the energy consumption will approach zero.
Another example of a simple stepwise charging circuit is the stepwise driver for capacitive loads, implemented with nMOS devices as shown in Fig. 10.11. Here, a
bank of n constant voltage supplies with evenly distributed voltage levels is used.
The load capacitance is charged up by connecting the constant voltage sources V1
through VN to the load successively, using an array of switch devices. To discharge
the load capacitance, the constant voltage sources are connected to the load in the
reverse sequence.
The switch devices are shown as nMOS transistors in Fig. 10.11, yet some of them, particularly those connected to the higher constant voltage sources Vi, may be replaced by pMOS transistors to prevent the undesirable threshold-voltage drop problem and the substrate-bias effects at higher voltage levels.
One of the most significant drawbacks of this circuit configuration is the need for
multiple supply voltages. A power supply system capable of efficiently generating n different voltage levels would be complex and expensive. Also, the routing of n
different supply voltages to each circuit in a large system would create a significant
overhead. In addition, the concept is not easily extensible to general logic gates.
Therefore, stepwise charging driver circuits can be best utilized for driving a few
critical nodes in the circuit that are responsible for a large portion of the overall
power dissipation, such as output pads and large busses.
10.7 Partially Adiabatic Circuits
Implementation of fully reversible adiabatic logic circuits has a very large over-
head. A fully reversible, bit-level pipelined three-bit adder requires several times as
many devices as a conventional one and many times the silicon area. This has motivated researchers to apply the adiabatic technique to realize partially adiabatic logic circuits. Most of these circuits use cross-coupled devices connecting the two nodes that form the true and complementary outputs, i.e. the outputs are dual-rail encoded. When a voltage ramp is applied, the outputs settle to one of the states based on the inputs. The supply node VΦ is connected to the pulsed power supply.
There is a non-adiabatic dissipation of approximately (1/2)·CL·Vth² for a transition from one state to another.
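This observation suggests a simple per-transition loss model for a partially adiabatic gate: an adiabatic term plus the frequency-independent offset (1/2)·CL·Vth² just mentioned. The sketch below is illustrative, and the clean split into exactly these two terms is a simplifying assumption:

```python
def partially_adiabatic_energy(R, C, Vdd, Vth, T):
    """Illustrative per-transition energy of a partially adiabatic gate:
    an adiabatic term that shrinks as the ramp time T grows, plus the
    non-adiabatic offset (1/2)*C*Vth**2 that does not depend on T."""
    adiabatic = (R * C / T) * C * Vdd ** 2
    non_adiabatic = 0.5 * C * Vth ** 2
    return adiabatic + non_adiabatic

# Assumed values: R = 1 kOhm, C = 100 fF, Vdd = 1.8 V, Vth = 0.45 V.
# Slowing the ramp only helps until the Vth-related floor dominates:
for T in (1e-9, 1e-8, 1e-7, 1e-6):
    print(T, partially_adiabatic_energy(1e3, 100e-15, 1.8, 0.45, T))
```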
Fig. 10.13 Generalized ECRL schematic: a cross-coupled pMOS pair generating out and its complement, with two nMOS evaluation networks F driven by inputs in1…inN and their complements inb1…inbN
Figure 10.13 shows the generalized schematic for efficient charge recovery logic (ECRL). It consists of two pMOS transistors connected in a cross-coupled manner and two networks of nMOS transistors acting as evaluation networks. The supply-clock and I/O waveforms of an ECRL NOT gate follow the four-phase behaviour described earlier. In order to recover and to reuse the supplied energy, an AC power supply is used for ECRL gates. As usual in adiabatic circuits, the supply voltage also acts as a clock.
Both the output out and its complement are generated so that the power clock generator can always drive a constant load capacitance, independent of the input signal. If the circuit operates correctly, the energy has an oscillatory behaviour, because a large part of the energy supplied to the circuit is given back to the power supply. As usual for adiabatic logic, the energy behaviour follows the supply voltage. It is also observed that, due to a coupling effect, the low-level output goes to a negative voltage during the recovery phase (that is, when the supply voltage ramps down).
The dissipated energy can be defined as the difference between the energy that
the circuit needs to load the output capacitance, and the energy that the circuit re-
turns back to the power supply during the recovery phase. The dissipated energy
value depends on the input sequence and on the switching activity factor. Therefore,
the dissipated energy per cycle can be obtained from the mean value of the whole
sequence. It can also be seen that a larger energy is dissipated if the input state
changes and therefore the output capacitances have to switch from one voltage level
to the other. An ECRL realization of an inverter is shown in Fig. 10.14. How the
data are transferred from one stage to its succeeding stage is shown in Fig. 10.15.
The arrows show when data move from one gate to the consecutive gate in four
phases: precharge, hold, recover and wait phases.
Fig. 10.14 ECRL realization of an inverter with cross-coupled pMOS transistors P1, P2 and nMOS input devices N1, N2
Fig. 10.17 Sum cell of a full adder realized using PFAL logic
For positive feedback adiabatic logic (PFAL), correct operation at frequencies up to 200 MHz in 0.25-µm CMOS technology has been observed. The realization of the sum cell of a full adder using PFAL logic is shown in Fig. 10.17.
The 2N−2N2P adiabatic logic family was derived from ECRL in order to reduce the coupling effect. Figure 10.18 shows the general schematic diagram. The primary advantage of 2N−2N2P over ECRL is that the cross-coupled nMOSFET switches result in non-floating outputs for a large part of the recovery phase.
10.8 Some Important Issues

Loss in Adiabatic Circuits Discharging a gate in PFAL and ECRL logic styles
will lead to a residual voltage at the output node that is in the range of the threshold
voltage of the PMOS device. As long as the gate evaluates the same input in the
next cycle, in ECRL, the residual charge will be reused in the next cycle, otherwise
it is discharged to ground. In PFAL, this charge is dissipated when the output signal
changes, as the output is then connected to ground via the NMOS device in the latch
in the evaluate interval. If the output state remains the same, the charge is dissipated
in the wait phase, as the input transistors are turned on and connect the output to the power clock (which is at ground potential in the wait phase). Besides that, in ECRL the output cannot instantly follow the rising power clock: only when the power clock exceeds |Vtp| does the charging path through the pMOS device open, and the output voltage then follows the power clock abruptly, leading to a dynamic loss. All these losses are related to the threshold voltage and lead to non-adiabatic losses that are independent of the operating frequency, producing an offset in the energy dissipation over the whole frequency range. Thus, three loss mechanisms contribute to the overall losses in adiabatic logic: adiabatic losses, non-adiabatic losses and leakage losses. Adiabatic losses depend on the operating frequency f, and a minimum of the energy dissipation at a certain frequency is observed. Therefore, an optimum frequency exists in adiabatic logic, at which the energy consumed per cycle is minimized.
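The existence of an optimum frequency can be illustrated with a minimal model: adiabatic loss per cycle grows as a·f (since the ramp time T scales as 1/f), while leakage loss per cycle falls as b/f. The coefficients below are assumptions for illustration, not measured values:

```python
import math

def energy_per_cycle(f, a, b):
    """Simplified loss model: adiabatic loss a*f plus leakage loss b/f
    (leakage power integrated over one, longer, cycle)."""
    return a * f + b / f

def optimum_frequency(a, b):
    """Setting d/df (a*f + b/f) = 0 gives f_opt = sqrt(b/a)."""
    return math.sqrt(b / a)

# Assumed coefficients for illustration only:
a, b = 1e-22, 1e-9
f_opt = optimum_frequency(a, b)     # about 3.2 MHz for these numbers
print(f_opt, energy_per_cycle(f_opt, a, b))
```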
Energy Saving Factor Energy saving factor (ESF) is defined in the context of
comparing the energy dissipation between static CMOS and the adiabatic logic
circuits. It is a measure of how much more energy is dissipated in a static CMOS circuit with respect to its adiabatic logic counterpart. The precise definition of the ESF depends on the level of the design hierarchy at which the comparison is made. It may be considered at the gate level or at the system level, where losses due to layout parasitics are to be included in the calculation of the ESF. A general definition is given below, in which all energy dissipation components are summed up:

ESF = ΣE_CMOS / ΣE_AL.
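As a worked example, the inverter figures quoted later in this section (at f = 1 MHz the ECRL, PFAL and 2N−2N2P inverters dissipate 30.9, 6.2 and 18.3 % of the CMOS inverter energy) translate into the following gate-level ESF values:

```python
def esf(e_cmos, e_adiabatic):
    """Energy saving factor: total energy of the static CMOS realization
    divided by the total energy of its adiabatic counterpart."""
    return e_cmos / e_adiabatic

# Fractions of the CMOS inverter energy at f = 1 MHz quoted in this section:
for name, fraction in (("ECRL", 0.309), ("PFAL", 0.062), ("2N-2N2P", 0.183)):
    print(name, round(esf(1.0, fraction), 1))
# ECRL 3.2, PFAL 16.1, 2N-2N2P 5.5
```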
Voltage Scaling We know that an easy and very effective way to reduce losses in
static CMOS is by reducing the supply voltage (known as supply voltage scaling).
The limiting factor for voltage scaling is the propagation delay, that is, the delay
increases as supply voltage reduces. A trade-off exists between speed and power
consumption, therefore, the voltage can only be reduced to a level where no timing
constraints in the design are violated. The critical path in a static CMOS design
determines the maximum degree to which the voltage can be reduced. In designs
where only a few critical paths exist, but many paths have a positive slack after reducing the supply voltage, the gain from globally reducing the supply voltage is not satisfactory. To make voltage scaling more effective, one can try to break up the critical paths to allow a further reduction of voltage and thus power; using different voltage domains for fast and slow paths could also increase the benefits of scaling. Delay is not a concern for adiabatic logic circuits, as the maximum possible
frequency is far above the optimum frequency for an energy-efficient operation of
gates and systems. Looking into the frequency regime where adiabatic losses dominate the energy consumption of adiabatic logic, it is expected that reducing the supply voltage will lead to a benefit in energy consumption. At first sight, a Vdd² dependence is observed, but the on-resistance of the transistor in the charging path is also a function of the supply voltage: if the overdrive voltage (Vgs−Vth) is reduced by lowering the supply voltage, the resistance increases. As long as Vdd is far above Vth, the dissipated energy is reduced with voltage scaling. Thus, adiabatic logic also gains from voltage scaling, but the ESF at the gate level will decrease if
voltage reduction is applied. Leakage losses are also impacted by reducing the sup-
ply voltage. As long as the leakage losses are negligible, compared to the dynamic
losses in static CMOS, and as long as the adiabatic circuit is not operated in the
leakage-dominated regime, and if non-adiabatic losses are negligible, the impact of
voltage scaling on the ESF is negligible.
The lower bound for Vdd in static CMOS is mainly set by timing constraints, including margins for process variations and fluctuations in temperature and supply voltage. Supply voltage reduction in adiabatic logic is not limited by timing constraints, but a functional limit for ECRL and PFAL is observed when reducing Vdd; below this lower bound, circuits constructed from ECRL and PFAL gates malfunction. In ECRL, the nMOS device is responsible for keeping one output node at ground potential, while the pMOS device charges the dual output node. Thus, in ECRL, the supply voltage has to be higher than the highest absolute threshold voltage. In the PFAL gate, the output node has to be charged to at least Vtn to make the nMOS device in the latch conductive, which is responsible for keeping the dual output node at ground. The source node of the input device is connected to the output node, which is expected to be at least at Vtn; thus, the gate of the input device needs a voltage greater than 2Vtn to conduct. Finally, the reduction in voltage levels degrades the noise margin for static CMOS as well as for adiabatic logic. Energy reduction via supply voltage scaling is thus a trade-off between energy and robustness of the design.
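The interplay between the Vdd² term and the growing on-resistance can be sketched with a simple first-order model, R = k/(Vdd − Vth); this linear overdrive dependence is a simplifying assumption, as are the numeric values:

```python
def scaled_adiabatic_energy(Vdd, Vth, k=1.0, C=1.0, T=1.0):
    """Adiabatic loss under voltage scaling, E ~ (R*C/T)*C*Vdd**2, with the
    charging-path on-resistance modelled as R = k/(Vdd - Vth). Valid only
    for Vdd well above Vth."""
    R = k / (Vdd - Vth)
    return (R * C / T) * C * Vdd ** 2

Vth = 0.44                         # nominal nMOS threshold quoted in Sect. 10.8
e_18 = scaled_adiabatic_energy(1.8, Vth)
e_12 = scaled_adiabatic_energy(1.2, Vth)
# Energy still drops when Vdd is reduced from 1.8 V to 1.2 V, but by less
# than the bare Vdd**2 ratio, because R has grown at the lower supply.
print(e_12 / e_18, (1.2 / 1.8) ** 2)
```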
Threshold Voltage Variations For the three logic families ECRL, PFAL and
2N−2N2P, the influence of inter-die and intra-die threshold voltage variations on
the power dissipation has been investigated by means of PSPICE simulations by
a group of researchers from the Technical University of Munich [5]. In the con-
sidered 0.25-µm CMOS technology, the nominal threshold voltages of the n- and
p-channel transistors are Vtn = 0.44 V and Vtp = 0.43 V. Both inter-die and intra-die
Fig. 10.19 Energy consumption per switching operation versus frequency for a CMOS inverter,
an ECRL inverter, a PFAL inverter and a 2N−2N2P inverter
parameter variations, in particular between the pMOSFETs of the inverter, are taken
into account. In order to determine the energy dependence on the threshold volt-
age and on the frequency, the three adiabatic inverters are simulated for the whole
useful frequency range. Figure 10.19 shows the energy consumption per switch-
ing operation for the three logic families in the case of nominal threshold voltage.
As a reference, the energy consumption of a conventional CMOS inverter is also
plotted. The supply voltage Vdd is 1.8 V and the load capacitance Cload is 20 fF.
For high frequencies, the behaviour is no longer adiabatic and therefore the energy consumption increases. At low frequencies, the dissipated energy increases for both CMOS and adiabatic gates due to the leakage currents of the transistors. Thus, for each logic family, an optimal interval of operating frequency is obtained, which is called the ‘adiabatic frequency range’. At f = 1 MHz, the ECRL, PFAL and 2N−2N2P inverters dissipate 30.9, 6.2 and 18.3 % of the energy dissipated by
the CMOS inverter. For gates with a larger number of transistors, e.g. an adder, the
adiabatic logic shows an even better improvement with respect to CMOS, because
the number of transistors needed for an adiabatic implementation becomes compa-
rable with those of the conventional CMOS implementation. It has also been shown
that the power dissipation variations due to parameter ( Vt) variations are strongly
dependent on the logic family.
10.9 Chapter Summary
• The difference between adiabatic charging and conventional charging has been
explained.
• How adiabatic amplification takes place has been discussed.
• The steps of realization of adiabatic logic gates have been stated.
• The realizations of synchronous and asynchronous pulsed power supplies have
been considered.
• How stepwise charging and discharging can be used to minimize power dissipa-
tion has been explained.
• The realization of partially adiabatic circuits such as ECRL, PFAL and 2N−2N2P has been discussed.
• Non-adiabatic loss in partially adiabatic circuits has been highlighted.
• The impact of supply voltage scaling and threshold voltage variations has been presented.
• Variations of energy consumption per switching with frequency of operation
have been mentioned.
10.10 Review Questions
References
1. Dickinson, A.G., Denker, J.S.: Adiabatic dynamic logic. IEEE J. Solid-State Circuits 30(3), 311–315 (1995)
2. Teichmann, P.: Adiabatic Logic—Future Trend and System Level Perspective. Springer (2012)
3. Athas, W.C., Svensson, L.J., Tzartzanis, N.: A resonant signal driver for two-phase, almost-non-overlapping clocks. IEEE Int. Symp. Circuits Syst. 4, 129–132 (1996)
4. Svensson, L.: Adiabatic switching. In: Chandrakasan, A.P., Brodersen, R.W. (eds.) Low Power Digital CMOS Design, pp. 181–218. Kluwer Academic (1995)
5. Amirante, E., Bargagli-Stoffi, A., Fischer, J., Iannaccone, G., Schmitt-Landsiedel, D.: Variations of the power dissipation in adiabatic logic gates. Institute for Technical Electronics, Technical University of Munich
6. Mahmood-Meimand, H., Afzali-Kusha, A.: Efficient power clock generator for adiabatic logic. IEEE Int. Symp. Circuits Syst. 4, 642–645 (2001)
Chapter 11
Battery-Aware Systems
11.1 Introduction
Over the years, with the increasing use of mobile devices in everyday life, there has been a proliferation of portable computing and communication equipment, such as laptops, palmtops, cell phones, etc. The number of these portable devices is growing at a much higher rate than that of desktop and server systems. It has been observed that the processing capability of contemporary portable computing devices is becoming comparable to that of desktop computers. Moreover, the complexity of these portable devices keeps increasing as more and more functionality is added, and power dissipation increases with computing complexity. Figure 11.1 illustrates how the advancement of VLSI technology has led to an increase in the number of transistors per die, as the feature size of transistors has shrunk and chip area has grown over the years. As a consequence, power consumption has also increased, as shown in Fig. 11.2. Unfortunately, battery technology has not been able to maintain the same rate of growth in energy density, leading to a widening “battery gap”, as shown in Fig. 11.3. Designers of portable systems face the challenge of bridging this gap in the foreseeable future.
11.2 The Widening Battery Gap

Fig. 11.3 The widening battery gap: growth of system power consumption (W) versus battery energy density (Wh/kg) over the years 1986–2002
11.3 Overview of Battery Technologies

Users of portable devices look for longer run times and quicker recharges, and always at a lower cost. One way of balancing a compact design with higher power
requirements is to increase the energy density of the battery. The common method
of evaluating the energy density is to consider the amount of watt-hours per kilo-
gram (Wh/kg). The energy density of the commonly used battery technologies that
have been developed over the years to meet the increasing demand for smaller,
lighter, and higher-capacity rechargeable batteries for portable devices is shown
in Fig. 11.4. Apart from the energy density, other parameters to be considered are:
cycle life (the number of discharge/charge cycles prior to battery disposal), envi-
ronmental impact, safety, cost, available supply voltage, and charge/discharge char-
acteristics. The key characteristics of the popular rechargeable battery technologies
for portable devices are considered in this section [2, 3].
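Energy density directly bounds the achievable run time per unit of battery mass. The small sketch below (illustrative, with an assumed device) uses representative Wh/kg values from the ranges quoted in this section:

```python
def runtime_hours(energy_density_wh_per_kg, battery_mass_kg, avg_power_w):
    """Ideal run time: battery energy in Wh divided by the average power
    draw in W. Discharge-rate and temperature effects (discussed below)
    are ignored."""
    return energy_density_wh_per_kg * battery_mass_kg / avg_power_w

# Assumed example: a 100 g battery powering a 2 W device, with typical
# densities from this section (NiCd 40-50, NiMH 50-60, Li ion 80-100 Wh/kg):
for name, wh_per_kg in (("NiCd", 45), ("NiMH", 55), ("Li ion", 90)):
    print(name, round(runtime_hours(wh_per_kg, 0.1, 2.0), 2), "h")
# NiCd 2.25 h, NiMH 2.75 h, Li ion 4.5 h
```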
11.3.1 Nickel Cadmium
This is a mature technology that has dominated portable devices and has been successfully used for several decades. Advantages of these batteries include low cost,
quick recharge time, and high discharge rates. As shown in Fig. 11.4, these batteries
have an energy density in the range of 40–50 Wh/kg.

Fig. 11.4 Energy density (Wh/kg) of commonly used rechargeable batteries for portable devices: nickel cadmium (1950), nickel–metal hydride (1990), lithium ion (1991), rechargeable alkaline (1992) and lithium polymer (1999)

While nickel cadmium (NiCd) technology has been losing ground in recent years owing to its low energy density and toxicity, it is still used in low-cost applications such as portable radios, MP3
players, etc. The limited energy densities of NiCd batteries affect both the cell size
and weight when improvements to portable operating time are desired. Moreover,
they are near the end of their evolution for capacity improvements, and they are
subject to increasing environmental regulations. As a consequence, manufacturers
of portable devices look for better alternatives.
11.3.2 Nickel–Metal Hydride
Nickel–metal hydride (NiMH) batteries provide higher battery capacity. The en-
ergy densities for these batteries are in the range of 50–60 Wh/kg. These batteries
are in widespread use in recent years for powering laptop computers. However,
they have shorter cycle life, are more expensive, and are inefficient at high rates
of discharge. Switching from NiCd to NiMH provides a 30–40 % improvement to
capacity. NiMH batteries are voltage compatible with NiCd batteries and they have
similar discharge characteristics. Care should be taken not to overdischarge NiMH
batteries. The downside to NiMH batteries, however, is an increase in cost of about
50–75 % compared to NiCd. The principal discharge differences between NiMH and NiCd batteries appear at high discharge rates and at low temperatures; this becomes important in applications such as radio telephones used by the military or law enforcement. The available capacity decreases rapidly at temperatures below 0 °C and at high rates of
discharge. One important issue with NiMH is self-discharge. The NiMH battery has
self-discharge rate of about 1.5 % per day at 25 °C and could be a significant portion
of the discharge capacity if the battery is stored at elevated temperatures—higher
than 35 °C.
11.3.3 Lithium Ion
For many years, NiCd had been the only suitable battery for portable devices.
NiMH and lithium (Li) ion emerged in the early 1990s and gradually achieved customer acceptance. Today, Li ion is the fastest growing and most promising battery
technology. Longer lifetimes have made them the most popular battery choice for
notebook computers, personal digital assistants (PDAs), camcorders, and cellular
phones. The energy density of Li ion is typically twice that of the standard NiCd
batteries. The Li-ion battery has an average voltage of 3.6–3.7 V and an energy den-
sity ranging from 80 to 100 Wh/kg. This is the fastest growing battery technology
today, with significantly higher energy densities, and cycle life about twice that of
NiMH batteries. In addition to power density, Li-ion solutions supply at least double
the amount of energy per unit of weight. Li-ion cells provide a standard nominal
voltage of 3.7 V/cell and a potential voltage of 4.2 V/cell as compared to only 1.2 V/
cell available in NiMH cells. As many as three NiMH cells would be required to
meet the nominal voltage of a single Li-ion cell, and more than six NiMH cells
would be needed to provide a similar power density. The load characteristics are
reasonably good and behave similarly to NiCd in terms of discharge. Li ion is a low-maintenance battery, an advantage that most other portable battery types cannot claim. There is no memory effect, and no scheduled cycling is required to prolong the battery’s life. In addition, the self-discharge is less than half that of NiCd, and the battery causes little harm when disposed of.
Li-ion batteries are more sensitive to characteristics of the discharge current, are
more expensive than NiMH batteries, and can be unsafe when improperly used.
Li-ion batteries have greater internal impedance than NiCd batteries and therefore
available capacity is reduced at higher discharge rates. It is interesting to note that Li-ion cells have about three times the voltage of NiCd cells and twice the energy density by volume, so for cells of equal volume the same C-rate discharge equates to about 50 % more current for NiCd than for Li ion. For high-rate discharge, the capacity reduction is therefore even greater when the comparison is based on current. The relative capacity reduction at cold temperatures is also greater for Li ion than for NiCd. A capacity gauge for Li-ion batteries can be more complicated than one for the nickel-based chemistries. The voltage of NiCd batteries changes by less than 20 % over 80 % of their discharge capacity, while that of Li-ion batteries changes by nearly 40 %. This
change in voltage can be used as a state-of-charge indicator for systems that have
constant discharge current. This condition rarely exists, and for most electronic ap-
plications, the current consumption changes when the supply voltage changes. So,
for the simple voltage-based capacity gauge, the current must be measured to prop-
erly interpret the battery voltage.
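The current-corrected voltage gauge described above can be sketched as follows. This is a minimal illustration for a hypothetical Li-ion cell: the internal resistance value and the open-circuit-voltage lookup table are invented for the example, not taken from the text.

```python
# Sketch of a voltage-based state-of-charge gauge for a Li-ion cell.
# The discharge current is measured so the terminal voltage can be
# corrected for the drop across the internal resistance before the
# voltage-to-charge lookup. All numeric values are illustrative.

R_INTERNAL = 0.15  # ohm, assumed internal resistance

# Assumed open-circuit voltage (V) versus state of charge (fraction),
# reflecting the ~40 % voltage swing of Li-ion over its capacity.
OCV_TABLE = [(3.0, 0.0), (3.4, 0.2), (3.6, 0.5), (3.9, 0.8), (4.2, 1.0)]

def state_of_charge(v_terminal, i_load):
    """Estimate state of charge from terminal voltage and load current."""
    v_oc = v_terminal + i_load * R_INTERNAL  # undo the resistive drop
    if v_oc <= OCV_TABLE[0][0]:
        return 0.0
    if v_oc >= OCV_TABLE[-1][0]:
        return 1.0
    for (v0, s0), (v1, s1) in zip(OCV_TABLE, OCV_TABLE[1:]):
        if v0 <= v_oc <= v1:
            return s0 + (s1 - s0) * (v_oc - v0) / (v1 - v0)

# The same terminal voltage reads as a higher state of charge once the
# load current is known and the resistive drop is compensated:
print(state_of_charge(3.45, 1.0))
print(state_of_charge(3.45, 0.0))
```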
11.3.4 Rechargeable Alkaline
11.3.5 Li Polymer
This emerging technology enables ultra-thin batteries (less than 1 mm thickness),
and is expected to suit the needs of lightweight next-generation portable computing
and communication devices. Additionally, they are expected to improve over cur-
rently available Li-ion technology in terms of energy density and safety. However,
these batteries are currently expensive to manufacture, and face challenges in inter-
nal thermal management.
So, we may conclude that currently no single battery type provides all the desired features. For fast recharge and high-rate discharge, NiMH batteries can eventually replace NiCds. NiMH batteries are also more environmentally friendly, but the added cost must be weighed against the benefits. Li-ion batteries offer a substantial advantage in gravimetric energy density; however, this battery type is not nearly as robust as NiCd. When energy requirements and weight reduction are important, this battery type should be considered, although the cost of Li ion may preclude its use in some applications. Rechargeable alkaline can have advantages in some low-discharge-rate applications; its initial energy density is good compared to nickel-based batteries, but the difficulty is in-system charging when multiple cells are required.
11.4 Battery Characteristics
Before we consider the battery discharge characteristics, let us consider the following important characteristics of a battery:
Voc Open-circuit voltage of a fully charged battery under no-load condition.
Vcut Cutoff voltage of a battery is the voltage at which a battery is considered to be
fully discharged.
11.4.1 Rate Capacity Effect
It is an established fact that the charging/discharging rates affect the rated battery
capacity. If the battery is being discharged very quickly (i.e., the discharge current
is high), then the amount of energy that can be extracted from the battery is reduced
and the battery capacity becomes lower. As we have already discussed, this is due
to the fact that the necessary components for the reaction to occur do not necessar-
ily have enough time to move to their necessary positions. Then, only a fraction of
the total reactants is converted to other forms, and therefore the energy available is
reduced. Alternatively, if the battery is discharged at a very slow rate using a low
current, more energy can be extracted from the battery and the battery capacity be-
comes higher. Therefore, the capacity of a battery should take into consideration the
charging/discharging rate. A common way to specify battery capacity is to express it as a function of the time it takes to fully discharge the battery (note that, in practice, the battery often cannot be fully discharged). The notation used to specify capacity in this way is written as Cx, where x is the time in
hours that it takes to discharge the battery. The temperature of a battery also affects the energy that can be extracted from it. At higher temperatures, the battery capacity is typically higher than at lower temperatures. However, intentionally heating a battery to extract more capacity is not a practical option, since elevated temperatures also accelerate self-discharge and battery aging.
11.4.2 Recovery Effect
When a heavy current is drawn from a battery, the rate at which positively charged ions are consumed at the cathode exceeds the rate at which they are supplied. As a consequence, the battery fails to supply the necessary current. However, in many situations, the battery capacity improves if the battery is kept idle for some duration: the battery voltage recovers during idle periods. As we know, the rate at which positively charged ions are supplied depends on their concentration near the cathode, which improves while the battery is idle. This is known as the recovery effect.
11.4.3 Memory Effect
Memory effect, also known as battery effect, lazy battery effect, or battery memory,
is an effect observed in NiCd and NiMH rechargeable batteries that causes them to
hold less charge. It describes one very specific situation in which certain NiCd and NiMH batteries gradually lose their maximum energy capacity if they are repeatedly recharged after being only partially discharged. The battery appears to “remember” the smaller capacity. The source of the effect is a change in the characteristics of the underused active materials of the cell. The term is commonly misapplied to almost any case in which a battery appears to hold less charge than expected; such cases are more likely due to battery age and use, which lead to irreversible changes in the cells through internal short circuits, loss of electrolyte, or reversal of cells. Specifically, the term “memory” came from an aerospace NiCd application in
which the cells were repeatedly discharged to 25 % of the available capacity (±1 %)
by exacting computer control and then recharging to 100 % capacity without over-
charge. This long-term, repetitive cycle regime, with no provision for overcharge,
resulted in a loss of capacity beyond the 25 % discharge point. Hence, the birth of
a “memory” phenomenon, whereby NiCd batteries purportedly lose capacity if re-
peatedly discharged to a specific level of capacity.
11.4.4 Usage Pattern
Although the battery lifetime mainly depends on the rate of energy consumption
of the load, lowering the average consumption rate is not the only way to increase
battery lifetime. Due to nonlinear physical effects in the battery, the lifetime also
depends on the usage pattern. During periods of high energy consumption, the ef-
fective battery capacity degrades, and therefore the lifetime is shortened. However,
during periods without energy consumption, the battery can recover some of its
lost capacity, and the lifetime is lengthened. A nonincreasing load sequence leads to a smaller drop in voltage than a random or increasing load sequence, as we shall discuss later.
11.4.5 Battery Age
11.5 Battery Discharge Characteristics
A battery consists of one or more cells connected in series or parallel. Chemical en-
ergy stored in the battery is converted to electrical energy through an electrochemical reaction. Figure 11.5 shows a simplified schematic diagram of an electrochemical cell, which consists of an anode, a cathode, and an electrolyte; the electrolyte separates the two electrodes and provides a medium for the transfer of charge between them. During battery discharge, oxidation at the anode results in the generation
of electrons, which flow through the external circuit, and positively charged ions,
which move through the electrolyte toward the cathode by diffusion. The negatively
charged ions combine with the positive ions to generate an insoluble compound
that gets deposited on the cathode, covering reaction sites and making them unavailable for further use. As discharge proceeds, more and more reaction sites become unavailable, eventually leading to a state of complete discharge
as shown in Fig. 11.6. A battery-aware system is one where the discharge profile
characteristics result in improved actual capacity. Although the actual capacity may
exceed the standard capacity, it cannot exceed the theoretical capacity of the battery.
Typical charge characteristics of four different batteries are also shown in Fig. 11.7.
It may be noted that when the battery is charged from fully discharged condition,
initially the voltage increases sharply. Beyond the 20 % point, the voltage increases
very slowly until it reaches the maximum value. The state of charge is indicated by
the battery voltage.
11.6 Battery Modeling
Battery models capture the characteristics of real-life batteries and predict their be-
havior under various conditions of charge/discharge [1]. There exist many differ-
ent mathematical models. None of these models is completely accurate, nor does any of them include all performance-affecting factors. Different battery models are
briefly discussed in this section.
Analytical Models Analytical expressions are formulated to calculate actual bat-
tery capacity and lifetime under different conditions such as variable/constant load.
The simplest models are based solely on electrochemistry. These models ignore
thermodynamic and quantum effects. As a consequence, these models capture rate
capacity and thermal effects but not recovery effect. Although these models can pre-
dict energy storage, they are unable to model phenomena such as the rate of change
of voltage under load. Nor do they include temperature and age effects. The fraction
of the stored charge that a battery can deliver depends on multiple factors, including
battery chemistry, the rate at which the charge is delivered (current), the required
terminal voltage, the storage period, ambient temperature, and other factors. The
higher the discharge rate, the lower the capacity. The relationship between current,
discharge time, and capacity for a lead acid battery is approximated (over a typical
range of current values) by Peukert’s law:
t = QP / I^k
where
QP Capacity when discharged at a rate of 1 A
I Current drawn from the battery, in amperes
t Amount of time, in hours, that the battery can sustain the discharge
k Constant, around 1.3.
(Figure: typical battery voltage versus state of charge for lead-acid/Li-ion and for NiMH/NiCd cells.)
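Peukert's law can be evaluated numerically as in the following sketch; the 1-A capacity value is an assumed example, and k = 1.3 follows the constant quoted above.

```python
# Peukert's law: t = QP / I**k, where QP is the capacity (in Ah) when
# discharged at 1 A and k ~ 1.3 for lead-acid. Values are illustrative.

def discharge_time_hours(qp, current, k=1.3):
    """Hours a battery sustains a constant discharge current."""
    return qp / current ** k

def delivered_capacity_ah(qp, current, k=1.3):
    """Charge actually delivered (I * t) at that discharge rate."""
    return current * discharge_time_hours(qp, current, k)

qp = 100.0  # Ah at a 1 A rate (assumed example battery)
for i in (1.0, 5.0, 10.0):
    print(i, round(delivered_capacity_ah(qp, i), 1))
# The delivered capacity falls as the discharge current rises: the
# rate-capacity effect described in the text.
```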
Batteries that are stored for a long period or that are discharged at a small fraction
of the capacity, lose capacity due to the presence of generally irreversible side reac-
tions that consume charge carriers without producing current. This phenomenon is
known as internal self-discharge. Further, when batteries are recharged, additional
side reactions can occur, reducing the capacity for subsequent discharges. After
enough recharges, essentially all capacity is lost and the battery stops producing
power. Internal energy losses and limitations on the rate at which ions pass through the electrolyte cause battery efficiency to vary. Above a minimum threshold, discharging at a low rate delivers more of the battery’s capacity than discharging at a higher rate because of the rate capacity effect, as discussed earlier. Other approaches to battery modeling
are mentioned below:
• Electrical Circuit Models: In this case, the battery discharge is modeled using an equivalent electrical circuit simulated with a simulation program with integrated circuit emphasis (SPICE) model. This approach is capable of modeling rate capacity and thermal effects.
• Stochastic Models: In this model, the battery is represented by a finite number of charge units, and the discharge behavior is modeled as a discrete-time transient stochastic process. This approach can take into account variable loads, rate capacity, and recovery effects, but not thermal effects. The computation requirement is modest.
• Electrochemical Models: This approach models electrochemical and thermo-
dynamic processes, physical construction, etc. It can analyze many discharge
effects under variable loads, including rate-capacity effect, thermal effect, and
recovery effect. This approach is most accurate but most computationally inten-
sive.
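To illustrate how such models reproduce the rate-capacity and recovery effects, the following is a minimal two-well sketch in the spirit of the kinetic battery model (KiBaM); the well sizes and diffusion constant are arbitrary illustrative values, and this is not one of the specific models referenced in the text.

```python
# Minimal two-well (KiBaM-style) battery sketch, illustrative only.
# Available charge is drained by the load; bound charge diffuses into
# the available well at a finite rate, so heavy loads strand charge
# (rate-capacity effect) and idle periods restore it (recovery effect).

def simulate(loads, c_avail=500.0, c_bound=500.0, k=0.05, dt=1.0):
    """Run a load sequence (current per step); return delivered charge."""
    delivered = 0.0
    for i in loads:
        flow = k * (c_bound - c_avail) * dt    # diffusion between wells
        c_avail += flow
        c_bound -= flow
        draw = min(i * dt, max(c_avail, 0.0))  # what the load can pull
        c_avail -= draw
        delivered += draw
        if c_avail <= 0.0 and draw < i * dt:
            break                              # battery fails under load
    return delivered

heavy = simulate([50.0] * 40)                              # continuous heavy load
rested = simulate([50.0] * 10 + [0.0] * 10 + [50.0] * 30)  # same load with an idle gap
print(heavy, rested)  # the rested profile delivers more charge
```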
11.7 Battery-Driven System Design
Battery-driven system design involves the use of one or more of the following techniques:
Voltage and Frequency Scaling As we mentioned earlier, the dynamic power dissipation has a square-law dependence on the supply voltage and a linear dependence on the frequency. Depending upon the performance requirement, the supply voltage Vdd
and frequency of operation of the circuit driven by the battery can be dynamically
adjusted to optimize energy consumption from the battery. Information from a bat-
tery model is used to vary the clock frequency dynamically at run time based on
the workload characteristics. If the workload is higher, higher voltage and clock
frequency are used, and, for lower workload, the voltage and clock frequency can
be lowered such that the battery is discharged at a lower rate. This, in turn, improves
the actual capacity of the battery.
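The square-law argument can be checked with a few lines; the capacitance, voltage, and frequency values below are hypothetical operating points, and task delay is assumed inversely proportional to clock frequency.

```python
# Dynamic power scales as P ~ C * Vdd^2 * f. Halving both voltage and
# frequency cuts power by 8x while the task takes twice as long, so the
# energy per task still drops by 4x and the battery discharges at a
# lower rate. All operating points below are hypothetical.

def dynamic_power(c_eff, vdd, freq):
    return c_eff * vdd ** 2 * freq

def task_energy(c_eff, vdd, freq, cycles):
    return dynamic_power(c_eff, vdd, freq) * (cycles / freq)

C, CYCLES = 1e-9, 1e8                    # effective capacitance, task length
hi = task_energy(C, 1.8, 200e6, CYCLES)  # high-workload operating point
lo = task_energy(C, 0.9, 100e6, CYCLES)  # low-workload operating point
print(hi / lo)  # ~4: a quarter of the energy at half voltage/frequency
```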
Dynamic Power Management The state of charge of the battery can be used to
frame a policy that controls the operation state of the system.
Battery-Aware Task Scheduling The current discharge profile is tailored to meet
battery characteristics to maximize the actual battery capacity.
Battery Scheduling and Management Efficient management of multi-battery
systems by using appropriate scheduling of the batteries.
Static Battery Scheduling These are essentially open-loop approaches, such as
serial scheduling, random scheduling, round-robin scheduling, where scheduling is
done without checking the condition of a battery.
Terminal Voltage-Based Battery Scheduling The scheduling algorithm makes
use of the state of charge of the battery.
11.7.1 Multi-battery System
(Figure: measured lifetime, on a logarithmic scale, versus constant load current from 100 to 1100 mA.)
The next test set consisted of five variable-current load profiles, P1–P5, shown in Fig. 11.9. To obtain P1–P4, four currents of certain durations, 1011 mA
for 10 min, 814 mA for 15 min, 518 mA for 20 min, and 222 mA for 15 min, were
selected, which were arranged in different order. For each of these four profiles, the
total length and delivered charge are 60 min and 36,010 mA-min, respectively. Note
that in P1, after 60 min, the battery is discharged at 222 mA until a failure occurs.
In P1, the load is decreasing in nature, whereas in P2, the load is increasing in na-
ture. Table 11.1 shows the measured lifetimes Lm as well as the measured delivered
charges Cm for different profiles.
The results show that P1 is the best sequence, and P2 is the worst sequence, from
the battery utilization perspective. Indeed, in P1, after 60 min, the battery survives
for another 4.9 min under 222 mA (residual 1088 mA-min charge). However, in P2,
the battery fails to service the last 6.0 min under 1011 mA (undelivered 6066 mA-
min charge). For P1 and P2, the difference in the total delivered charge is as much
as 20 % of 36,010 mA-min. As predicted by the battery model and demonstrated
by the measurements, the other alternative sequences, P3 and P4, are neither bet-
ter than P1 nor worse than P2. The last profile, P5, shows the benefit of reducing
battery load by decreasing energy consumption of a hypothetical processor through
reducing its voltage. P5 is obtained from P2 by changing the failing 10-min load of
1011 mA to a 20-min load of 518 mA to reflect a change in the processor voltage.
Note that the charge required from the battery is approximately the same before and
after voltage reduction. The profile length is increased by 10 min, and the battery
failure occurs at 67.5 min. The total delivered charge is 34,966 mA-min, which is
a noticeable improvement over P2 with 29,944 mA-min. This clearly demonstrates
that battery behavior depends on the characteristics of the load profile as envisaged
by the rate capacity effect.
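The charge figures quoted for P1–P4 can be verified directly from the four current/duration segments given above:

```python
# Each of P1-P4 rearranges the same four (current mA, duration min)
# segments, so all four deliver the same total charge over 60 minutes.
segments = [(1011, 10), (814, 15), (518, 20), (222, 15)]

total = sum(i * t for i, t in segments)
print(total)  # 36010 mA-min, as stated in the text

# P1 orders the segments by nonincreasing current; P2 reverses it.
p1 = sorted(segments, key=lambda s: -s[0])
p2 = list(reversed(p1))
# The total charge is order-independent; only the battery's ability to
# deliver it differs, which is why the measured lifetimes favor the
# nonincreasing sequence P1.
assert sum(i * t for i, t in p1) == sum(i * t for i, t in p2) == 36010
```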
Task scheduling can be combined with voltage scaling to maximize the amount of
charge that a battery can supply subject to the following constraints:
a. Dependency constraint: task dependencies are preserved;
b. Delay constraint: the profile length is within the delay budget; and
c. Endurance constraint: the battery survives all the tasks.
Three different approaches toward solving the task scheduling and voltage assign-
ment problem are summarized below:
1. The first approach is aimed at minimizing energy consumption which corre-
sponds to minimizing the total charge consumed during task execution. Task
charges are controlled by scaling task voltages. This approach starts with assign-
ing voltages to tasks so that the total charge consumption is minimized subject
to satisfying the delay budget. Energy minimization does not guarantee maximi-
zation of battery lifetime, since the battery lifetime is sensitive not only to task
charges but also to task ordering in time. The battery may fail before completing
all tasks (i.e., the endurance constraint may be violated), even though the total
charge consumption is minimized. In such situations, task repair is performed,
which reduces the voltage for some tasks in order to reduce the stress on the
battery. Once the profile has been repaired, its length may exceed the delay bud-
get. To meet the delay constraint, a latency reduction procedure is applied. This
scales up the task voltages, while ensuring that no failures are introduced.
2. The second method starts with the highest-power initial profile by assigning all
tasks to the highest voltage. Since the clock frequency is also the highest (i.e.,
task durations are the shortest), the delay constraint is satisfied. However, high-
task currents may result in the failure of the battery. To satisfy the endurance
constraint, task repair is performed while checking that the delay constraint is
not violated. Once the profile no longer fails, there may be some delay slack available, i.e., the difference (delay budget minus profile length) may be a positive quantity. To further reduce the profile cost, a slack utilization procedure is applied that further
scales down task voltages.
3. In contrast to the second approach, the third method starts with the lowest-power
initial profile by assigning tasks to the lowest voltage. The endurance constraint
is satisfied, but the delay constraint may be violated, since the clock frequency is
the lowest (i.e., task durations are the longest). To meet the delay budget, laten-
cies are reduced by scaling up the voltages; in this case, it is ensured that the
endurance constraint is not violated.
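The second approach can be sketched as a repair-then-relax loop. The task set, the two voltage levels, and the battery endurance check below are illustrative stubs, not the authors' implementation; in the full method, the repair step also verifies that the delay constraint is not violated.

```python
# Sketch of the second approach: start with every task at the highest
# voltage (shortest durations), repair battery failures by downscaling
# later tasks first (in the spirit of Property 3), then spend any
# remaining delay slack on further downscaling. All values are stubs.

V0, V1 = 0, 1                  # low / high voltage levels
DUR = {V0: 20, V1: 10}         # task duration (min) per voltage level
CUR = {V0: 400, V1: 1000}      # task current (mA) per voltage level

def profile_len(levels):
    return sum(DUR[v] for v in levels)

def battery_ok(levels, budget_charge=35000):
    # Stub endurance check: the battery survives if the total charge fits.
    return sum(CUR[v] * DUR[v] for v in levels) <= budget_charge

def schedule(n_tasks=4, delay_budget=70):
    levels = [V1] * n_tasks                 # highest-power initial profile
    for i in reversed(range(n_tasks)):      # repair: downscale later tasks first
        if battery_ok(levels):
            break
        levels[i] = V0
    for i in reversed(range(n_tasks)):      # use leftover delay slack
        if levels[i] == V1 and profile_len(levels) + (DUR[V0] - DUR[V1]) <= delay_budget:
            trial = levels[:]
            trial[i] = V0
            if battery_ok(trial):
                levels = trial
    return levels

print(schedule())  # -> [1, 0, 0, 0]: only the first task stays at V1
```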
Assume that there are eight tasks, T1–T8, with two possible supply voltages V0 and V1, where V1 > V0. The delay budget B is assumed to be 90 min.
Task scheduling for six different task profiles is shown in Fig. 11.10. Profile P1
corresponds to minimum-charge initial solution which starts with a voltage assign-
ment that consumes the minimum charge. In this case, tasks are completed within
the delay budget. The task ordering is ( T4, T1, T7, T2, T8, T3, T5, T6), and the task
voltages are ( V1, V0, V1, V0, V1, V0, V0, V0). The profile P2 corresponds to the
highest-power initial solution, i.e., the processor is operating at the highest voltage.
The second method scales down the voltages starting from the highest-power initial
solution, as illustrated in Fig. 11.10b–11.10d. Since the task currents are the high-
est possible, the endurance constraint may be violated. As shown in Fig. 11.10b,
the highest-power initial profile, P2, fails. After failures are repaired by voltage
downscaling, profile P3 is obtained as shown in Fig. 11.10c. Note that there is still
some slack available, and the voltage can be scaled down even further as shown in
Fig. 11.10d. This is represented by profile P4 with the task ordering ( T1, T2, T3,
T5, T4, T6, T7, T8), and the corresponding task voltages ( V0, V0, V1, V0, V1, V0,
V0, V1). Finally, the third method scales up the voltages, starting from the lowest-
power initial solution, as illustrated in Fig. 11.10e–f. Since the task durations are
the longest possible, the delay constraint may be violated. The lowest-power initial
profile, P5, is shown in Fig. 11.10e. To meet the delay budget, voltage upscaling is
performed without introducing failures to get the final solution, P6. The task order-
ing is ( T1, T2, T3, T5, T4, T6, T7, T8), and the task voltages are ( V0, V1, V0, V0,
V1, V1, V0, V1).
11.8 Wireless Sensor Networks
Wireless sensor networks (WSNs) are gaining immense importance in today's world because of the numerous applications in which they can be used. These applications range from military surveillance, environment monitoring, and inventory management to habitat and health monitoring. Almost all these applications need the sensor network to be
alive for months or even for years. So, the main objective when the sensor nodes are
deployed is that the life of the sensor network has to be maximized, so that it can last
the duration the application demands.
The lifetime of a sensor network depends on the lifetime of the constituent sen-
sor nodes. Given that a sensor node is battery powered and its battery cannot be recharged or replaced upon complete discharge, the node lasts only as long as its battery.
So, in order to improve the sensor network lifetime, the battery lifetime of the indi-
vidual motes has to be maximized. Some of the approaches to improving the sensor node lifetime are the use of energy-efficient sensor node architectures, power-aware sensor network architectures, and power-aware protocols.
The power dissipation in sensor networks can be broadly divided into two parts.
They are the communication power and the computation power. It has been observed
that the amount of energy consumed in the communication process is much higher
than the amount consumed in the computation process. So, by using power-aware
protocols in the protocol stack, the amount of energy consumed can be minimized.
Researchers have been working in the fields of power-aware routing protocols and power-aware media access control (MAC) layer protocols [6]. Some of the power-aware routing protocols [7] are directed diffusion [8, 15], eavesdrop and register (EAR), and LEACH [9]. Some of the MAC layer protocols are sensor-MAC (SMAC) [6] and carrier sense multiple access with
collision avoidance (CSMA-CA) [10]. The sensor network architecture is one of the
most important design criteria for sensor networks. Some of the design features in
sensor network architecture are localization, clustering, synchronization, etc. The
other design issues in sensor network architecture are whether the communication
is single hop or multi-hop and single path or multiple paths. The goal of a sensor
network designer is to develop power-aware protocols for the chosen architecture.
The energy consumed in the sensor node can be reduced by minimizing the amount
of time for which the hardware units are switched ON. The processor can be put
in various power states whenever it is idle. Since most of the energy consumed is
due to idle listening of the receiver, the receiver has to be switched ON only when
there is some data packet for that node. So, energy-efficient node architecture has to
be designed. In most of the sensor nodes, the power consumed by the processor is
reduced by using dynamic voltage scaling. Dynamic voltage scaling mainly reduces
the switching power. In the coming years, with the feature size getting smaller, leak-
age power will start to dominate. So, the leakage power cannot be ignored. Since
the battery life has to be maximized, a battery-aware task-scheduling algorithm is proposed here that maximizes the battery life by scheduling the tasks appropriately and also assigning the voltages such that both the dynamic power and the leakage power are reduced. The battery-aware task-scheduling algorithm presented here is an exten-
sion of [9] where the slack is utilized by scaling the voltage, whereas here the slack
is utilized by scaling not only the supply voltage Vdd but also the body bias voltage
Vbs. As a consequence, the battery lifetime is significantly increased because of the
reduction of both the dynamic power and the leakage power.
The battery-aware task-scheduling algorithm presented here follows the follow-
ing three properties of the battery discharge model as given in [8]:
Property 1: For a fixed voltage assignment, sequencing tasks in the nonincreas-
ing order of their currents is optimal when the task loads are constant during the
execution of the task.
Property 2: If a battery fails during some task k, it is always cheaper to repair k by downscaling its voltage than by inserting an idle period before k.
Property 3: Given a pair of two identical tasks in the profile and a delay slack to
be utilized by voltage downscaling, it is better to use the slack on the later task than
on the earlier task.
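Property 1 translates into a simple ordering rule, sketched below with a hypothetical task set (dependency constraints are ignored for brevity):

```python
# Property 1: with voltages fixed, scheduling tasks in nonincreasing
# order of their load currents stresses the battery least. The tasks
# below (name, current in mA, duration in min) are hypothetical.

tasks = [("T1", 220, 15), ("T2", 1010, 10), ("T3", 520, 20), ("T4", 810, 15)]

def battery_friendly_order(tasks):
    """Order tasks by nonincreasing current draw (Property 1)."""
    return sorted(tasks, key=lambda t: t[1], reverse=True)

schedule = battery_friendly_order(tasks)
print([name for name, _, _ in schedule])  # ['T2', 'T4', 'T3', 'T1']
```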
The algorithm works in two phases. The first phase generates a feasible schedule
such that the tasks do not miss their deadlines. In the second phase, available slack
is utilized to minimize energy requirement by downscaling both supply voltage Vdd
and using body bias Vbs in an optimal manner. Input to the algorithm is a set of task
graphs, with each task in the task graph specifying the start time, execution time, the
deadline, and the current load as shown in Table 11.2.
The functions performed in the two phases are given below.
Phase I: Generating a Feasible Schedule The input to this phase is a set of task
graphs. From these task graphs, a feasible schedule is obtained. First, using the ear-
liest deadline first (EDF) algorithm, a schedule is formed such that the task depen-
dencies are not violated. Then, using this schedule and Property 1, a separate
Table 11.4 Variation in energy consumption with the change in duty cycle for 180 nm
Avg. duty cycle (%) | Without DVS and RBB (J) | With DVS alone (J) | With DVS and RBB (J)
10 | 177.07 | 19.92 | 3.91
15 | 177.07 | 21.46 | 5.74
20 | 177.07 | 26.62 | 12.76
25 | 177.07 | 36.33 | 21.62
30 | 177.07 | 91.69 | 57.67
35 | 177.07 | 119.1 | 107.1
40 | 177.07 | 175.4 | 175.4
DVS dynamic voltage scaling, RBB reverse body biasing
Table 11.5 Variation in energy consumption with the change in duty cycle for 70 nm
Avg. duty cycle (%) | Without DVS and RBB (J) | With DVS alone (J) | With DVS and RBB (J)
10 | 43,247 | 4,530 | 30
15 | 43,247 | 4,531 | 81
20 | 43,247 | 4,536 | 127
25 | 43,247 | 5,798 | 187
30 | 43,247 | 20,557 | 9,125
35 | 43,247 | 27,550 | 24,952
40 | 43,247 | 43,245 | 43,245
DVS dynamic voltage scaling, RBB reverse body biasing
alone. As the duty cycle further decreases, the improvement increases. As shown
in Table 11.5, for 0.07-μm technology, the improvement is very high, owing to the
high amount of leakage power the algorithm reduces, compared to the DVS-alone
algorithm. With the downscaling of complementary metal oxide semiconductor (CMOS) feature sizes being the present trend, leakage power reduction becomes essential. A combination of the power-aware protocols and the dy-
namic power management schemes ensures that power consumption is reduced, and
at the same time, the battery lifetime is increased, thereby increasing the network
lifetime.
Applying the Scheduler to Sensor Networks
WSNs are finding widespread use in diverse applications. The sensor nodes, also called motes, are getting smaller, but their battery energy density is not increasing at the same rate. Since the life of a sensor network depends on the life
Table 11.6 States of processor, radio, and the sensor for four different tasks
Task | Description | Processor | Radio | Sensor
Computation | Processor is active, others turned off | Active | Off | Off
Transmission | Radio transmits the data packet | Active | Transmit | Off
Reception | Radio receives the data packet | Active | Receive | Off
Sensing | Sensors gather data from the environment | Active | Off | On
of the sensor nodes, it is essential to increase the lifetime of the sensor nodes. This
can happen if the battery lasts longer. A power-aware sensor node architecture and
a battery-aware task-scheduling algorithm have been presented that use both DVS
and RBB to maximize the battery lifetime. The main reason for applying the algo-
rithm to sensor networks is that the sensor networks are characterized by low-duty
cycle applications. Significant reduction in energy requirement is possible based on
the proposed approach.
WSNs have certain common characteristics. Typical monitoring applications can be broken down into a clear set of regular and periodic tasks from the perspective of the sensor node's processor. Sensor nodes typically perform several data generation tasks, which include gathering data samples from the sensor and processing the data to prepare messages. The data communication task involves transmission
and reception of messages. Nodes can also perform other tasks such as clustering, medium access control, and routing of messages, which involve looking up routing information, choosing an appropriate route for messages, interrupt handling, etc., depending on the nature of the application. From a processor's perspective, the basic tasks of a sensor node can be classified into four types, namely computation, transmission, reception, and sensing, as shown in Table 11.6.
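The task-to-state mapping of Table 11.6 lends itself to a simple duty-cycle-weighted current estimate; the per-state component currents below are hypothetical placeholders (the actual values belong to Table 11.7, which is not reproduced here), and an idle state is assumed between tasks.

```python
# Average node current from the Table 11.6 task states. The component
# currents (mA) per state are hypothetical placeholders, not the
# measured values of Table 11.7.
PROC = {"active": 20.0, "sleep": 0.05}
RADIO = {"transmit": 17.0, "receive": 19.0, "off": 0.02}
SENSOR = {"on": 1.0, "off": 0.0}

TASKS = {
    "computation": ("active", "off", "off"),
    "transmission": ("active", "transmit", "off"),
    "reception": ("active", "receive", "off"),
    "sensing": ("active", "off", "on"),
    "idle": ("sleep", "off", "off"),  # assumed low-power state between tasks
}

def avg_current(time_share):
    """time_share: fraction of time spent in each task (sums to 1)."""
    total = 0.0
    for task, frac in time_share.items():
        p, r, s = TASKS[task]
        total += frac * (PROC[p] + RADIO[r] + SENSOR[s])
    return total

# A 10 % duty cycle spends most of the time idle, which is why
# low-duty-cycle applications benefit so much from power management:
low_duty = {"computation": 0.04, "transmission": 0.02,
            "reception": 0.02, "sensing": 0.02, "idle": 0.90}
print(avg_current(low_duty))
```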
A novel architecture of a sensor node has been proposed in [7], in which reduction in the energy consumption is achieved by using power-aware protocols, an interrupt-driven architecture, and a wake-up circuit that reduces the idle listening time. The sensor node is assumed to be realized using a Crusoe processor, a CC2420 radio, and a gas sensor along with an analog-to-digital converter (ADC). The current requirement for different resources in different states is given in
Table 11.7. The dynamic power management scheme controls the processor power state to reduce power consumption, along with a battery-aware task scheduler that combines DVS and RBB. The simulation results give the energy consumption in three different situations, as shown in Table 11.8.
11.9 Energy-Aware Routing
Routing protocols [6, 8, 11] for WSNs differ from the conventional network proto-
cols because of the tight constraints on the infrastructure and availability of limited
energy at individual nodes. One of the simplest routing protocols is flooding, in which packets are broadcast in the network until the destination is reached: every node simply rebroadcasts each packet it receives from its neighbors unless it is the destination. This causes the problems of duplicate packet circulation and implosion, and the protocols developed subsequently aim at improving on this behavior.
Routing protocols are classified into network structure-based protocols and the
operation-based protocols. Flat, hierarchical, and location-based protocols come
under network structure-based protocols while multipath-, query-, negotiation-, co-
herent-, and QoS-based protocols fall under operation-based protocols. In flat-based routing, all the nodes perform similar functions, forwarding data along appropriate paths from the source to the destination hop by hop, whereas in hierarchical routing a few nodes act at the top level of a hierarchy, aggregating and communicating the data between clusters. Location-based protocols use a node's geographical location in an energy-efficient manner for routing the packets. Multipath-based
protocols aim to uniformly distribute the energy dissipation by handling different
paths between the source and destination at different times. Query-based proto-
cols remove redundancies by routing to only those nodes which request (query)
for the data. In negotiation-based protocols, the actual data transfer occurs only after a series of negotiations between the nodes, which aims at reducing redundancies. Coherent-based protocols
concentrate on the amount of data processing at each node. After some amount of
minimum processing, data are sent to an aggregator for the rest of processing. All
the above set of protocols deal with specific objectives and thus have their own
advantages and limitations.
However, the most popular approaches are based on clustering, in which the sen-
sor nodes are grouped together to form clusters. Basic objective of the clustering
protocols is to achieve energy efficiency. The whole network is divided into clusters
having several sensor nodes with a cluster head node for each cluster as shown in
Fig. 11.11. The data from sensors inside a cluster are aggregated at cluster head.
This eliminates a lot of redundancy in the packet forwarding. LEACH is considered
to be a benchmark protocol in clustering/hierarchical-based protocols [12]. LEACH
protocol has been developed to create and maintain clusters in order to improve the
lifetime of a WSN. It is a hierarchical protocol in which all the nodes of a cluster
Fig. 11.11 Sensor nodes, cluster heads, and the base station in a clustered WSN
send data packets to the cluster head and the cluster head aggregates and com-
presses the data before forwarding to the base station. Cluster heads are selected
randomly so that the high-energy dissipation in communicating with the base sta-
tion is shared by all the sensor nodes leading to energy balancing among the sensor
nodes in a cluster. LEACH works in two phases: the setup phase and the steady
phase. The steady phase is usually much longer than the setup phase to minimize
the overhead. The primary function of the setup phase is to select cluster heads. For
this purpose, each node chooses a random number between 0 and 1. If the random
number is less than a threshold value T( n), the node is selected as a cluster head.
The value of T( n) is calculated as
T(n) = P / (1 − P × (r mod (1/P)))   if n ∈ G,
     = 0                             otherwise,
where P is the desired percentage of nodes to become cluster heads, r is the current round, and G is the set of nodes that have not been selected as a cluster head in the last 1/P rounds.
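As an illustration (a sketch in C, not code from the text; the function name is invented), the threshold computation used in the election can be written as:

```c
#include <assert.h>
#include <math.h>

/* LEACH cluster-head threshold T(n).
 * P   : desired fraction of cluster heads per round (0 < P < 1)
 * r   : current round number
 * in_g: nonzero if the node has not been a cluster head in the last 1/P rounds
 */
double leach_threshold(double P, int r, int in_g)
{
    if (!in_g)
        return 0.0;
    int cycle = (int)(1.0 / P);          /* length of one election cycle, 1/P rounds */
    return P / (1.0 - P * (r % cycle));  /* grows toward 1 as the cycle progresses */
}
```

A node then draws a uniform random number in [0, 1] and becomes a cluster head for the round if the number falls below T(n); the growing threshold guarantees every surviving node serves once per cycle.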
After the cluster heads are selected, the cluster heads advertise to all sensor nodes
in the network that they are the new cluster heads. Once the sensor nodes receive the
advertisement, they determine the cluster to which they want to belong based on the
signal strength of the advertisement from the cluster heads to the sensor nodes. The
sensor nodes inform the appropriate cluster heads that they will be a member of the
cluster. Afterward, the cluster heads assign the time in which the sensor nodes can
send data to the cluster heads based on a time-division multiple access (TDMA) ap-
proach. During the steady phase, the sensor nodes can begin sensing and transmit-
ting data to the cluster heads. The cluster heads also aggregate data from the nodes
in their cluster before sending these data to the base station. After a certain period of
time is spent on the steady phase, the network goes into the setup phase again and
enters another round of selecting cluster heads.
As the data from sensors in a cluster are aggregated at the cluster head, which directly sends the aggregated data to the base station, this requires heavy energy dissipation at the cluster heads.
11.10 Assisted-LEACH
In most of the clustering protocols mentioned above, the whole load of data ag-
gregation and data routing is done by cluster heads. This debilitates the lifetime
of a network. A recent extension of LEACH is the A-LEACH [13]. Here, a novel
concept known as helper node has been introduced. The helper nodes, which are
Fig. 11.12 A-LEACH network: sensor nodes, cluster heads, helper nodes, and the base station
closer to the base station in every cluster, are assigned the routing job, whereas
cluster heads take care of data aggregation as shown in Fig. 11.12. A novel approach
for route formulation is used for the helper nodes to reach the base station. Every
helper node chooses as the next hop the node nearest to the base station from all its neighboring helper nodes. The received signal strength (RSS) of the base station's beacon signals is used to decide which nodes are nearer to the base station in the helper node selection and route setup phases. Thus, the dissipation of energy is lessened due to multi-hop
and route setup phases. Thus, the dissipation of energy is lessened due to multi-hop
routing and the same is distributed among the helper and cluster-head nodes. An al-
gorithm is proposed for helper node selection and multi-hop routing. The algorithm
for cluster head selection is an extension to LEACH’s cluster head selection.
A-LEACH protocol has the following substages:
1. Cluster Head Selection
2. Cluster Formation
3. Helper Node Selection
4. Routing Setup
5. Sensing, Aggregating, and Routing
Functions of these substages are briefly discussed below.
Cluster Head Selection The cluster head selection follows a procedure extended from LEACH's [7] cluster head selection. Each node calculates its threshold based on
the formula:
T(n) = P / (1 − P × (r mod (1/P)))          if n ∈ G,
     = 0.5 × P / (1 − P × (r mod (1/P)))    if n ∈ H,
     = 0                                    otherwise,
where
P Desired percentage of cluster heads
r Current round in protocol operation
G Set of nodes that have neither been cluster heads nor helper nodes in the last rounds
H Set of nodes that have not been cluster heads but have played the role of helper nodes in the last rounds
Each sensor elects itself to be a cluster head by picking a random number between 0 and 1 and checking whether it is less than the threshold.
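A hedged sketch of the extended threshold, with the node states G and H made explicit (the enum and function names are invented):

```c
#include <assert.h>
#include <math.h>

/* A-LEACH threshold (sketch). A node is in G if it has been neither a
 * cluster head nor a helper node in the last rounds, in H if it has not
 * been a cluster head but has served as a helper node, and in neither
 * set otherwise. */
enum node_state { IN_G, IN_H, NEITHER };

double aleach_threshold(double P, int r, enum node_state s)
{
    int cycle = (int)(1.0 / P);
    double base = P / (1.0 - P * (r % cycle));
    if (s == IN_G)
        return base;        /* full LEACH threshold */
    if (s == IN_H)
        return 0.5 * base;  /* halved chance for nodes that were helpers */
    return 0.0;
}
```

The halved threshold for H biases the election toward fresh nodes, spreading the cluster-head and helper duties across the network over successive rounds.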
(Flowchart of the protocol operation, Fig. 11.13: rounds r = 1 to max, each beginning with cluster formation; the protocol ends when r exceeds max.)
Cluster heads aggregate the data, remove redundancies, and forward the data to their helper nodes, among which the actual routing takes place. All other nodes, except helper nodes, go into sleep mode while routing takes place. The protocol operation is shown in Fig. 11.13. The data sensed at sensor nodes inside a cluster are sent to the cluster head. Cluster heads aggregate the data and send them to the helper node; all helper nodes, with links between them, form a virtual routing ground for routing the aggregated data to the base station. Each round starts with
the selection of cluster heads followed by formation of clusters by assigning cluster
heads to all nodes in the network. This is followed by helper nodes selection and
route setup between the helper nodes by assigning the next hop helper at each of
the helper nodes to reach the base station. This is followed by actual sensing, data
aggregation, and routing the aggregated data to the base station.
11.11 Conclusion
By making helper nodes handle the routing job, overall load on the cluster head is
reduced. Overhead is reduced for route formulation to the base station by electing the next hop at each helper based on the RSS values of the beacon signal from the base station, already available at the helper nodes from the helper node selection phase. The
concept of helper nodes in the A-LEACH protocol has improved the lifetime of
the network by minimizing the energy dissipation throughout the nodes, which has
been substantiated by using simulation.
11.12 Chapter Summary
• The widening gap between processor power requirements and battery energy density has been highlighted.
• An overview of the popular battery technologies has been provided.
• The characteristics of rechargeable batteries have been specified.
• The underlying process of battery discharge has been explained.
• How various characteristics of a battery can be modeled has been introduced.
• The realization of some battery-driven systems has been presented.
• The realization of battery-aware sensor networks has been introduced.
11.13 Review Questions
Q11.1. Explain the widening gap between the processor and battery technologies.
Q11.2. Explain the rate capacity effect and recovery effect of commonly used rechargeable batteries.
Q11.3. Explain the "nonincreasing profile effect" and "recovery effect" of a rechargeable battery.
Q11.4. What is the energy density of a battery? How does it vary for different technologies?
Q11.5. How can rate capacity effect be used in scheduling a set of tasks to get higher
actual capacity of the battery?
Q11.6. How does a battery-aware task scheduler combine dynamic voltage scaling
and reverse body biasing to extend the life of a sensor node?
References
1. Lahiri, K., Raghunathan, A., Dey, S., Panigrahi, P.: Battery-driven system design: A new fron-
tier in low power design. In: Proceedings of the 15th International Conference on VLSI De-
sign 2002, pp. 261–267, Bangalore, January 2002
2. O’Hara, T., Wesselmark, M.: Battery technologies: A general overview & focus on Lithium-
Ion, Intertek
3. Linden, D., Reddy, T.B.: Handbook of Batteries, 3rd edn. McGraw-Hill (2002)
4. Rakhmatov, D., Vrudhula, S.: Energy measurement for battery-powered embedded systems.
ACM Trans. Embed. Comput. Syst. 2(2), 277–324 (2003)
5. Rakhmatov, D., Vrudhula, S., Chakrabarti, C.: Battery-conscious task sequencing for portable
devices including voltage/clock scaling. In: Proceedings of the 39th conference on Design
automation (DAC 2002), pp. 189–194, 10–14 June 2002
6. Akyildiz, I.F., Su, W., Sankarasubramaniam, Y., Cayirci, E.: A survey on sensor networks.
IEEE Commun. Mag. 40(8), 102–114 (2002)
7. Sravan, A., Kundu, S., Pal, A.: Low power sensor node for a wireless sensor network. In:
Proceedings of the 20th International Conference on VLSI 2007, pp. 445–450, Bangalore,
January 2007
8. Akkaya, K., Younis, M.: A survey on routing protocols for wireless sensor net-
works. Ad Hoc Netw. 3(3), 325–349 (2005)
9. Pantazis, N.A., Nikolidakis, S.A., Vergados, D.D.: Energy-efficient routing protocols in wire-
less sensor networks: A survey. IEEE Commun. Surv. Tutor. 15(2), 551–591 (2013)
10. Heinzelman, W.R., Chandrakasan, A., Balakrishnan, H.: Energy-efficient communication pro-
tocol for wireless microsensor networks. In: IEEE Computer Society Proceedings of the Thir-
ty Third Hawaii International Conference on System Sciences (HICSS ’00), vol. 8, pp. 8020,
Washington, DC, USA, January 2000
11. Singh, S.K., Singh, M.P., Singh, D.K.: Routing protocols in wireless sensor networks—A
survey. Proc. Int. J. Comput. Sci. Eng. Surv. (IJCSES) 1(2), 63–83 (2010)
12. Chowdhury, P., Chakrabarti, C.: Static task-scheduling algorithms for battery-powered DVS
systems. IEEE Trans. Very Large Scale Integr. (VLSI) Syst. 13(2), 226–237 (2005)
13. Kumar, S.V., Pal, A.: Assisted-Leach (A-Leach): Energy efficient routing protocol for wireless
sensor networks. Int. J. Comput. Commun. Eng. 2(4), 420–424 (2013)
14. Heinzelman, W.R., Chandrakasan, A., Balakrishnan, H.: An application-specific protocol ar-
chitecture for wireless microsensor networks. IEEE Trans. Wirel. Commun. 1(4), 660–670
(2002)
15. Intanagonwiwat, C., Govindan, R., Estrin, D.: Directed diffusion: A scalable and robust com-
munication paradigm for sensor networks. In: Proceedings of the Mobicom’00, pp. 56–67,
Boston (2000)
16. Martin, S.M., Flautner, K., Mudge, T., Blaauw, D.: Combined dynamic voltage scaling and
adaptive body biasing for lower power microprocessors under dynamic workloads. In: Proceedings of the IEEE/ACM International Conference on Computer-Aided Design (ICCAD 2002), pp. 721–725 (2002)
Chapter 12
Low-Power Software Approaches
12.1 Introduction
While driving a car, the driver is responsible for controlling various accessories such as the gear, clutch, accelerator, and brake for the smooth running of the car. But it is the engine that burns fuel and dissipates power in various forms. In a computer system, the role of the software is somewhat similar to that of the driver, and the hardware of the computer behaves like the engine of the car. Indeed, the power dissipation takes place in the hardware. The question naturally arises about the role of the software in minimizing the power dissipation. Just as the driver of a car can improve the mileage (km per liter of fuel used) through careful driving, the power dissipation can be minimized by suitable optimization of the software. In the preced-
ing chapters, we have discussed various hardware-based approaches to minimize
power dissipation of a processor. Reduction in power dissipation is accomplished
either by modification of some hardware or by adding additional hardware. These
are somewhat similar to various innovations and improvements incorporated in the
car engines to improve engine efficiency and reduce fuel consumption.
A. Pal, Low-Power VLSI Circuits and Systems, DOI 10.1007/978-81-322-1937-8_12, © Springer India 2015
However, in all these cases, it is the software that plays an important role behind the scenes. In this chapter, we discuss various software optimization techniques that reduce power consumption without any change in the underlying hardware. Power-aware software does not require any additional hardware; the savings come from suitable optimization of the software itself.
Software optimization techniques can be broadly classified into two categories:
machine-independent and machine-dependent. Machine-independent optimization
techniques do not require any knowledge of the hardware architecture of the proces-
sor and can be used for any processor. On the other hand, the machine-dependent
optimization techniques exploit the architectural features of the target processor and
the hardware platform.
Before we venture into either of the approaches, it is essential to understand how
power dissipation takes place in the hardware, or in other words it is essential to
identify the sources of power dissipation. This is introduced in Sect. 12.2. Machine-
independent software optimization approaches are highlighted in Sect. 12.3. Vari-
ous loop optimization techniques have been combined with dynamic voltage and
frequency scaling (DVFS) to achieve a larger reduction in energy dissipation. This
has been discussed in detail in Sect. 12.4. Power-aware software prefetching ap-
proach has been discussed in detail in Sect. 12.5.
12.2 The Hardware
The compiler sits between the front end and the code generator, and it works with
the intermediate code. The compiler can analyse control and data flows on the
intermediate code and can perform appropriate transformations to improve the
(Figure: code before inlining and after inlining)
A drawback of inlining is that it tends to increase code size, although it does not always do so.
Inlining may also decrease the performance in some cases—for instance, multiple
copies of a function may increase code size enough that the code no longer fits in
the cache, resulting in more cache misses.
Code Hoisting In many situations, there are some computations inside the body
of the loop which are computed in each pass of the loop execution. Significant
saving in the computation time and reduction in power dissipation can be obtained
by moving computations outside the loop body. This is known as code hoisting as
illustrated by the example given in Fig. 12.3.
DO I = 1, 100
   ARRAY(I) = 3.0 * PI * I
ENDDO
In this example (3.0 * PI) is an invariant expression which is computed 100 times
unnecessarily. It can be avoided by introducing a temporary variable t. The trans-
formed code is as follows:
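The transformed code does not survive in this copy. A hedged C rendering of the hoisted loop, with an illustrative body and invented function name, might be:

```c
#include <assert.h>
#include <math.h>

#define PI 3.14159265358979

/* Hoisted loop (sketch): the invariant product 3.0 * PI is computed once,
 * outside the loop, instead of on every one of the n iterations. */
void fill_array(double array[], int n)
{
    double t = 3.0 * PI;        /* hoisted loop-invariant computation */
    for (int i = 0; i < n; i++)
        array[i] = t * (i + 1); /* illustrative loop body using t */
}
```

The same values are produced either way; the transformation only removes the repeated multiplication from the loop body.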
Dead-Store Elimination Dead-code elimination (also known as dead-code removal or dead-code stripping) is a compiler optimization that removes code which does not affect the program results. If the compiler detects
variables that are never used, it may safely ignore many of the operations that com-
pute their values. Removing such a code has two benefits: it shrinks program size,
an important consideration in some contexts, and it allows the running program to
avoid executing irrelevant operations, which reduces its running time. Dead code
includes a code that can never be executed (unreachable code), and a code that only
affects dead variables, that is, variables that are irrelevant to the outcome of the
program. Consider the example written in C in Fig. 12.4.
A simple analysis of the uses of values would show that the value of y after the
first assignment is not used inside dead-store elimination (dse). Furthermore, y is
declared as a local variable inside dse, so its value cannot be used outside dse. Thus,
the variable y is dead and an optimizer can reclaim its storage space and eliminate
its initialization. Furthermore, because the first return statement is executed uncon-
ditionally, no feasible execution path reaches the second assignment to y. Thus, the
assignment is unreachable and can be removed. If the procedure had a more com-
plex control flow, such as a label after the return statement and a goto elsewhere
in the procedure, then a feasible execution path might exist to the assignment to y.
Also, even though some calculations are performed in the function, their values are
not stored in locations accessible outside the scope of this function. Furthermore,
given the function returns a static value, it may be simplified to the value it returns
(this simplification is called constant folding).
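Fig. 12.4 is not reproduced in this copy; the following C sketch, modeled on the standard dead-store example (function and variable names invented), captures the same points:

```c
#include <assert.h>

/* Before optimization: the first assignment to y is a dead store, and the
 * second one is unreachable, so both can be eliminated. */
int dse_before(void)
{
    int x = 5;
    int y = x * 2;   /* dead store: y is never read before the return */
    return x + 1;
    y = x * 3;       /* unreachable: follows an unconditional return */
}

/* After dead-store elimination, unreachable-code removal, and constant
 * folding, the function collapses to its constant result. */
int dse_after(void)
{
    return 6;
}
```

Both versions return the same value; only the wasted work disappears.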
Although dynamic linking can load libraries on demand only, in most cases it is not possible to load only the relevant routines from a particular library; and even if this were supported, a routine may still include code sections which can be considered dead code in a given scenario but could not be ruled out at compile time. The techniques used to dynamically
detect demand, identify, and resolve dependencies; remove conditionally dead code;
and recombine the remaining code at load or runtime are called dynamic dead-code
elimination.
Loop-Invariant Computations Calculations that do not change between loop iter-
ations are called loop-invariant computations. These computations can be moved
outside the loop to improve performance as illustrated in Fig. 12.6.
for (i = 0; i < n; i++)
    a[i] = i + x * x;
As the expression x * x produces the same result in each iteration, it can be moved
out of the loop as shown in Fig. 12.6.
Common Sub-Expression Elimination This is a compiler optimization technique of finding redundant expression evaluations and replacing them with a single computation. This saves the time overhead resulting from evaluating the expression more than once. This is illustrated with the help of a simple example given below.
Optimizing compilers are able to perform this operation quite well.
t = LOG(Y)
By substituting t we get
X = A * t + t * t
This saves one "heavy" function call, by elimination of the common sub-expression LOG(Y).
Loop Unrolling Unrolling duplicates the body of the loop multiple times, in order
to decrease the number of times the loop condition is tested and the number of
jumps, which hurt performance by impairing the instruction pipeline. Unrolling leads to shorter execution time for the program because fewer jumps and fewer loop manipulation instructions are executed. This, in turn,
reduces the power dissipation to execute the program. Complete unrolling of a loop
eliminates all overhead and maximizes the reduction in power dissipation. This
requires that the number of iterations be known at compile time. Moreover, as the
size of the code increases significantly, it may not fit in the cache memory with
consequent increase in cache misses and increase in power dissipation. So, it is
necessary to restrict the unrolling factor. Let us illustrate the loop unrolling with the
help of the following example given in Fig. 12.7.
sum = 0;
for (i = 0; i < N; i++)
    sum += array[i];
Now the loop is unrolled by a factor of four resulting in the following code:
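The unrolled code itself is missing from this copy; a hedged C sketch of a factor-4 unrolling, with a cleanup loop for iteration counts not divisible by four, is:

```c
#include <assert.h>

double sum_unrolled4(const double array[], int N)
{
    double sum = 0.0;
    int i;
    /* main loop: four additions per iteration, one condition test per
     * four elements instead of one per element */
    for (i = 0; i + 3 < N; i += 4) {
        sum += array[i];
        sum += array[i + 1];
        sum += array[i + 2];
        sum += array[i + 3];
    }
    /* cleanup loop: handles any leftover elements when N is not a
     * multiple of 4 */
    for (; i < N; i++)
        sum += array[i];
    return sum;
}
```

The result is identical to the rolled loop; only the per-iteration branch overhead drops by roughly a factor of four.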
Loop unrolling is a very useful transformation that can be carried out by a com-
piler. Apart from the reduction in the execution of overhead instructions, this also
exposes more instruction-level parallelism, which allows the reduction of pipeline
hazards and further reduction of power dissipation. Important characteristics of
loop unrolling are highlighted below:
• Reduces overhead instructions
• Register pressure increases, so register spilling is possible
• The unrolled code has a larger size
12.4 Combining Loop Optimizations with DVFS
Various loop optimization techniques discussed in the previous section usually help to reduce the execution time. Alternatively, they can be used to reduce energy consumption without any increase in the initial execution time. Suppose
vi and fi are the supply voltage and clock frequency, respectively, used to execute a
particular code before optimization in time ti. ti may be considered as the initial ex-
ecution time. Let the execution time after compiler optimization be to, where ti > to.
Then, the supply voltage and clock frequency can be reduced to vf and ff, respectively, such that the execution time again increases but does not exceed ti. As energy
consumption is reduced without an increase in the execution time, we may call this
approach “compilation for low power.” In this section, we explain the impact of loop optimization techniques on execution time and energy. The DVFS is applied to
the loop-optimized codes to achieve more energy savings. The energy and perfor-
mance of the original and transformed codes are measured on the XEEMU simula-
tor, which simulates Intel's XScale processor. The XScale processor supports DVFS and runs on the voltage–frequency pairs shown in Table 12.1.
For each loop optimization technique, first the energy and time consumed by
the original code are measured for the highest voltage–frequency pair (1.5 V,
733 MHz). Then the energy and time consumed by the transformed code are mea-
sured at 1.5 V and 733 MHz. Finally, DVFS is applied to the transformed loop by
further reduction of voltage–frequency pairs. The time taken by the original code at
1.5 V and 733 MHz is the deadline for the transformed code with DVFS. Execution
time and energy consumed by the translated code are compared with those of the
original code.
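The selection step described above can be sketched as follows; the pair table and the inverse-frequency scaling model are illustrative assumptions, not the measured XScale data:

```c
#include <assert.h>

struct vf_pair { double volt; int mhz; };

/* Return the index of the slowest (lowest-voltage) pair for which the
 * transformed code, scaled from its time at the fastest pair, still meets
 * the deadline. Pairs are assumed sorted fastest first. */
int pick_vf(const struct vf_pair p[], int m, double t_fast, double deadline)
{
    int best = 0;
    for (int i = 0; i < m; i++) {
        /* execution time scales roughly inversely with clock frequency */
        double t = t_fast * (double)p[0].mhz / p[i].mhz;
        if (t <= deadline)
            best = i;   /* pairs get slower as i grows; keep last feasible */
    }
    return best;
}
```

The deadline here plays the role of the original code's execution time at 1.5 V and 733 MHz: any slack created by the loop transformation is traded for a lower voltage–frequency pair.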
12.4.1 Loop Unrolling
a Original code:
sum = 0;
for (i = 0; i < n; i++)
    sum = sum + a[i] * b[i];

b Transformed code:
sum = 0;
for (i = 0; i < n; i += 8) {
    sum = sum + a[i]     * b[i];
    sum = sum + a[i + 1] * b[i + 1];
    sum = sum + a[i + 2] * b[i + 2];
    sum = sum + a[i + 3] * b[i + 3];
    sum = sum + a[i + 4] * b[i + 4];
    sum = sum + a[i + 5] * b[i + 5];
    sum = sum + a[i + 6] * b[i + 6];
    sum = sum + a[i + 7] * b[i + 7];
}
Fig. 12.8 Loop unrolling, where n = 10,000 and uf = 8. a Original code. b Transformed code
12.4.2 Loop Tiling
Loop tiling, also called loop blocking, partitions a loop’s iteration space into smaller
chunks or blocks, to help ensure that data used in a loop stay in the cache until
they are reused. The partitioning of the loop iteration space leads to partitioning of large arrays into smaller blocks, thus fitting the accessed array elements into the cache, enhancing cache reuse, and reducing the cache size requirement. The loop til-
ing transformation is essential for enhancing the utilization of data cache in dense
matrix applications.
Figure 12.9 shows the original and transformed codes of transpose of a matrix,
respectively. The cache block size is considered to be 128 bytes. Each element in
matrices “a” and “b” is of 4 bytes. Therefore, the transformed code has a block size
of 32 (block = 32). Table 12.3 shows the impact of loop tiling on the execution time
and the energy due to reduction of data cache miss. At 1.5 V and 733 MHz, the
transformed code achieves 39.05 and 34.86 % savings in time and energy, respec-
tively, with respect to the original code. At 1.1 V and 400 MHz, the transformed
code achieves 69.72 and 5.32 % reduction in energy and execution time, respec-
tively, with respect to the original code.
Fig. 12.9 Loop tiling, where n = 10,000 and block = 32. a Original code. b Transformed code
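The code of Fig. 12.9 is not reproduced in this copy; a hedged C sketch of a tiled transpose with a 32-element block is:

```c
#include <assert.h>

#define BLK 32

/* Tiled transpose (sketch): iterate over BLK x BLK blocks so that the
 * accessed elements of both a and b stay in the cache while they are
 * reused. Matrices are n x n, stored row-major in flat arrays. */
void transpose_tiled(int n, const double *a, double *b)
{
    for (int ii = 0; ii < n; ii += BLK)
        for (int jj = 0; jj < n; jj += BLK)
            for (int i = ii; i < ii + BLK && i < n; i++)
                for (int j = jj; j < jj + BLK && j < n; j++)
                    b[j * n + i] = a[i * n + j];
}
```

With 4-byte elements and a 128-byte cache line, as the text assumes, 32 consecutive elements of a row share one line, which is why a 32-wide block keeps both the source row and the destination column resident.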
12.4.3 Loop Permutation
This loop transformation exchanges inner loops with outer loops. When the loop
variables index into an array, such a transformation can improve the locality of ref-
erence, depending on the array’s layout as illustrated in Fig. 12.10.
Figure 12.11a, b show the original and transformed codes of a loop, respectively.
Table 12.4 shows the impact of loop permutation on the execution time and energy
due to reduction of data cache miss. At 1.5 V and 733 MHz, the transformed code
achieves 61.9 and 55.88 % savings in time and energy, respectively, with respect to
the original code. At 1.0 V and 333 MHz, the transformed code achieves 80.88 and
15.23 % reduction in energy and execution time, respectively, with respect to the
original code.
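The codes of the permutation example are likewise not reproduced; a hedged C sketch of interchanging a strided traversal into a unit-stride one (C arrays are row-major; function names invented) is:

```c
#include <assert.h>

/* Before: j (the column index) in the outer loop makes every access
 * stride n elements through memory. */
double sum_colmajor(int n, const double *a)
{
    double s = 0.0;
    for (int j = 0; j < n; j++)
        for (int i = 0; i < n; i++)
            s += a[i * n + j];
    return s;
}

/* After loop permutation: i outer, j inner gives unit-stride accesses,
 * so consecutive iterations hit the same cache line. */
double sum_rowmajor(int n, const double *a)
{
    double s = 0.0;
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            s += a[i * n + j];
    return s;
}
```

Both functions compute the same sum; only the memory access order, and hence the data cache miss rate, differs.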
12.4.4 Strength Reduction
Figure 12.12a, b shows the original and transformed code, respectively. The
original loop contains a multiplication operation and the transformed version of
the code where the multiplication is replaced by an addition operation. Table 12.5
shows the impact of strength reduction on the execution time and energy due to re-
placement of multiplication with addition. At 1.5 V and 733 MHz, the transformed
code achieves 9.09 and 11.11 % savings in time and energy, respectively, with re-
spect to the original code. At 1.3 V and 600 MHz, the transformed code achieves
44.44 % reduction in energy without degradation in performance, with respect to
the original code.
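The code pair for strength reduction is missing here; a hedged sketch that replaces the per-iteration multiplication i * K by a running addition is:

```c
#include <assert.h>

#define K 7

/* Before: one multiplication per iteration. */
void fill_mul(int a[], int n)
{
    for (int i = 0; i < n; i++)
        a[i] = i * K;
}

/* After strength reduction: the multiplication is replaced by an addition
 * on the induction variable t, which carries i * K across iterations. */
void fill_add(int a[], int n)
{
    int t = 0;
    for (int i = 0; i < n; i++) {
        a[i] = t;
        t += K;
    }
}
```

On processors where multiplication costs more cycles (and energy) than addition, the second form does the same work more cheaply.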
12.4.5 Loop Fusion
Loop fusion, also called loop jamming, is a type of loop transformation, which
replaces multiple loops with a single one. This transformation aims to reduce the
loop overhead (loop index increment or decrement, compare, and branch). This
transformation is also supposed to enhance the cache and register file utilization.
Some other loop transformations, such as loop reversal, loop normalization, or loop
peeling, may be applied to make sure that the two loops have the same loop bounds
and hence can be fused together.
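A hedged sketch of fusion (the chapter gives no code for it in this copy; names are invented):

```c
#include <assert.h>

/* Before fusion: two loops, two sets of loop overhead, and b[] is
 * walked through the cache twice. */
void scale_then_shift(double b[], int n, double s, double d)
{
    for (int i = 0; i < n; i++)
        b[i] = b[i] * s;
    for (int i = 0; i < n; i++)
        b[i] = b[i] + d;
}

/* After fusion: one loop body performs both operations while b[i] is
 * already in a register. */
void scale_shift_fused(double b[], int n, double s, double d)
{
    for (int i = 0; i < n; i++)
        b[i] = b[i] * s + d;
}
```

Fusion is legal here because the two loops have identical bounds and the second loop's reads depend only on the first loop's write to the same element.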
12.4.6 Loop Peeling
This transformation, also called loop splitting, attempts to eliminate or reduce the
loop dependencies introduced by the first or the last few iterations by splitting these
iterations from the loop and perform them outside the loop, thus enabling better
instructions parallelization.
Figure 12.13a, b shows the original and transformed code, respectively.
Table 12.7 shows the impact of loop peeling on the execution time and energy
due to the reduction of data movement instructions. At 1.5 V and 733 MHz, the
transformed code achieves 8.33 and 11.11 % savings in time and energy, respec-
tively, with respect to the original code. At 1.4 V and 666 MHz, the transformed
code achieves 22.22 % reduction in energy without degradation in performance,
with respect to the original code.
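The code pair of Fig. 12.13 is not reproduced; a hedged sketch that peels the first iteration, the only one that depends on the initial value of p, is:

```c
#include <assert.h>

/* Before peeling: p equals i - 1 on every iteration except the first,
 * so the loop body carries a dependence on p. The array x must hold
 * n + 1 elements. */
void combine(const int x[], int y[], int n)
{
    int p = n;
    for (int i = 0; i < n; i++) {
        y[i] = x[i] + x[p];
        p = i;
    }
}

/* After peeling the first iteration, the remaining loop body no longer
 * needs p at all, enabling better instruction parallelization. */
void combine_peeled(const int x[], int y[], int n)
{
    y[0] = x[0] + x[n];
    for (int i = 1; i < n; i++)
        y[i] = x[i] + x[i - 1];
}
```

Both versions write identical results; only the irregular first iteration has been moved outside the loop.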
12.4.7 Loop Unswitching
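The body of this subsection is truncated in this copy. In general, loop unswitching moves a loop-invariant conditional out of the loop and duplicates the loop in each branch, so the test executes once instead of on every iteration. A hedged C sketch (names invented):

```c
#include <assert.h>

/* Before unswitching: the flag test runs on every iteration even though
 * flag never changes inside the loop. */
void update(int a[], int n, int flag)
{
    for (int i = 0; i < n; i++) {
        if (flag)
            a[i] = a[i] * 2;
        else
            a[i] = a[i] + 1;
    }
}

/* After unswitching: the invariant test is hoisted, and each branch
 * gets its own copy of the loop with a branch-free body. */
void update_unswitched(int a[], int n, int flag)
{
    if (flag) {
        for (int i = 0; i < n; i++)
            a[i] = a[i] * 2;
    } else {
        for (int i = 0; i < n; i++)
            a[i] = a[i] + 1;
    }
}
```

As with unrolling, the cost is code growth: the loop body is duplicated once per branch of the hoisted conditional.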
12.5 Power-Aware Software Prefetching
A software prefetching chunk (SPC) nested in one or more loops forms a software prefetching program (SPP). An SPC has three sections: the prologue loop section (PLS), the unrolled loop section (ULS), and the epilogue loop section (ELS). PLS has p prologue loops (PL1, PL2, …, PLp), ULS has p unrolled loops (UL1, UL2, …, ULp), and ELS has p epilogue loops (EL1, EL2, …, ELp), where p ≥ 1. A pro-
logue loop (PL) contains only data prefetching instructions. An unrolled loop (UL)
section contains both data prefetching and computation instructions to process the
prefetched data. An epilogue loop (EL) contains only computation instructions to
process prefetched data. Here, for each j, PLj, ULj, and ELj are related to a prefetch
distance PDj where 1 ≤ j ≤ p. This implies p prefetch distances are associated with an
SPC. The PD directs a PL to prefetch data PD loop iterations before they are referenced. The PD is measured using the formula PD = ceiling(l/s), as defined in [6, 7], where l is the memory access latency and s is the latency of one UL iteration.
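The prefetch distance computation can be written as a small helper (illustrative, not code from the text):

```c
#include <assert.h>

/* Prefetch distance PD = ceiling(l / s), where l is the memory access
 * latency and s is the latency of one unrolled-loop iteration (both in
 * cycles). Integer arithmetic gives an exact ceiling for positive
 * operands. */
int prefetch_distance(int l, int s)
{
    return (l + s - 1) / s;   /* ceiling of l/s */
}
```

For example, a 100-cycle memory latency with a 30-cycle loop iteration requires data to be requested four iterations ahead.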
The scheme considers a processor that runs with m voltage–frequency (v, f) pairs. A volt-
age–frequency pair is considered as ( vi, fi) where 1 ≤ i ≤ m. ( v1, f1) is the peak ( v, f)
pair and ( vm, fm) is the lowest ( v, f) pair. Figure 12.19 shows the general structure of
power-aware software prefetching program (PASPP) which contains an extra block
in its SPC called power-aware code (PAC). The goal of this work is to transform a
given SPP (as in Fig. 12.17) to a PASPP (as in Fig. 12.18), so that it can take full
advantage of the processor under consideration in terms of performance improve-
ment and power awareness.
Software prefetching increases the average power consumption of SPP. Increase
in average power consumption also increases heat dissipation. Heat dissipation re-
duces the system reliability and increases leakage power consumption. The average
power dissipation is important in the context of battery-operated portable devices as it decides the battery lifetime; reducing the average power results in longer battery lifetime. The power-delay product (energy consumption) of SPP is less than that of the non-prefetched version (the original program). The present work targets the reduction of the average power consumption of PASPP. The performance of PASPP is much better than that of the original program and is at most as good as that of SPP.
The power-delay product of the PASPP is lesser than those of the original program
and SPP.
The proposed scheme is evaluated with XEEMU [8, 9]. XEEMU is a power simulator that simulates the XScale processor. It is an extension of the SimpleScalar computer architecture simulator and uses Panalyzer as its power model. The XScale processor supports data prefetching and can operate at nine (v, f) pairs, as shown in Table 12.1; so, in this case, the value of m is 9. For a given program, XEEMU measures the time taken by the program in seconds as well as in cycles. It measures total energy consumption both in watt-cycles and in joules, and average power dissipation in watts.
Given a processor with m (v, f) pairs and an SPP, generate a PASPP using the solu-
tion of the following optimization problem represented by Formula 1.
Formula 1 Multi-objective optimization problem (MOOP)
Goal 1: Minimize energy (E) consumed by the PASPP.
Goal 2: Minimize time (T) taken by the PASPP subject to
E = ∑(i=1..m) ei xi ≤ min(Eprefetch, Eno_prefetch),
∑(i=1..m) ti xi ≤ T,   ∑(i=1..m) xi = N,   and xi ≥ 0,
where E is the total energy consumed by PASPP in watts×cycles. ei and ti are energy
(in watts×cycles) consumed and time (in μs) taken, respectively, per execution of
SPC, when executed at (vi, fi). xi is the number of times the SPC is executed at (vi,
fi). T is the total time taken by PASPP in μs. Tprefetch ≤ T < Tno_prefetch. Tprefetch and
Tno_prefetch are the time taken to execute the SPP and its non-prefetched version,
respectively, when executed at v1,f1. Their unit is μs. N is the total number of times
the SPC is executed. N is a function of input size n. For example, the SPP of 3D
Jacobi’s Kernel has N = n2. Eprefetch and Eno_prefetch are energy (in watts×cycles) con-
sumed by SPP and its non-prefetched version, when executed at (v1, f1). The solu-
tion of this problem will find the value of x1,…, xm. The value of xi indicates that
the PASPP will spend (xi/N)*100 % of its execution time at (vi, fi). The solution of
this formula enables the PASPP to adjust the (v, f) pair while in execution to achieve
performance gain at the cost of minimum energy consumption. The following steps
can help a compiler to achieve this:
1. Consider a program without software prefetching as shown in Fig. 12.15 and
transform it to an SPP as shown in Fig. 12.16.
2. Find N from the SPP obtained in the previous step.
3. Find ei, ti, PDji for each (vi, fi) and store them in TEPD_TABLE. PDji is the PD
associated with PLj, ULj, and ELj in the SPC of SPP, when executed at (vi, fi).
TEPD_TABLE is a table having m records, each record having the following
attributes—t, e, and an array of p elements named pd—they store ei, ti, and PDji,
respectively.
376 12 Low-Power Software Approaches
4. Find Tprefetch, Eprefetch, Tno_prefetch, and Eno_prefetch by executing the SPP and its non-
prefetched version, respectively, at (v1, f1).
5. Run SPP to PASPP Transformation Algorithm.
The rest of this section discusses each of these steps in detail.
A Program and Its SPP Version Consider a source program and identify opportunities
for deriving an SPP version of it. This can be done either by the programmer or by a
compiler, using the algorithm of Mowry et al. [6], which takes O(p) time, where p is
the number of PLs, ULs, and ELs in the SPC of the SPP, or, in other words, the number
of prefetch distances associated with the SPC.
Finding N From SPP Identify the SPC in the SPP obtained from the source program. An
SPC nested in one or more loops forms an SPP, and the nested loop structure determines
N. When an SPC is not nested in any loop, N is taken to be 1. This step takes O(k)
time, where k ( ≥ 1) is the nesting level of the loop that contains the SPC. In
Fig. 12.17, N is n² and k is 2.
Formation of TEPD_TABLE The following code fragment enables the formation of
TEPD_TABLE in O(p) time, because m is constant for a given processor.
for (i = 1; i <= m; i++)
{
    for (j = 1; j <= p; j++)
    {
        Execute the SPC at (vi, fi) for one iteration and store its execution time in s;
        PDji = ceiling(l/s);
        TEPD_TABLE[i].pd[j] = PDji;
    }
    Execute the SPC at (vi, fi) once and store the execution time and energy
        consumed in ti and ei, respectively;
    TEPD_TABLE[i].t = ti;
    TEPD_TABLE[i].e = ei;
}
where m is the number of (v, f) pairs and p is the number of prefetch distances as-
sociated with the SPC. TEPD_TABLE[i].t and TEPD_TABLE[i].e represent the t
(time) and e (energy) attributes of the ith record of TEPD_TABLE, respectively.
TEPD_TABLE[i].pd[j] is the jth PD of the ith record of TEPD_TABLE, i.e., TEPD_
TABLE[i].pd[j] stores PDji.
Finding Tprefetch, Eprefetch, Tno_prefetch, and Eno_prefetch Execute the SPP at (v1, f1) to
obtain Tprefetch and Eprefetch. Execute the non-prefetched version of the SPP at (v1, f1)
to obtain Tno_prefetch and Eno_prefetch.
SPP to PASPP Transformation Algorithm This algorithm converts an SPP to a
PASPP. The algorithm starts with Tprefetch as an initial value of T and increases the
value of T by 1 % of Tprefetch until an optimal solution is found. On finding an optimal
solution, the values of x1,…, xm are stored in X[1],…, X[m], respectively. Then the
PAC is generated.
SPP to PASPP Transformation Algorithm (SPP–PASPP)

Input: Tno_prefetch, Tprefetch, TEPD_TABLE, N, and SPP
Output: PASPP
Initialization:
    for (i = 1; i <= m; i++)
        X[i] = 0;
    T = Tprefetch;
Algorithm:
Step 1: If T ≥ Tno_prefetch then
        {
            Report "Failure" and go to Step 5.
        }
Step 2: Solve the MOOP defined in Formula 1 by goal programming,
        using the information in TEPD_TABLE.
Step 3: If the MOOP has an optimal solution then
        {
            for (i = 1; i <= m; i++)
                X[i] = xi;
        }
        Otherwise
        {
            T = T + 0.01 × Tprefetch;
            go to Step 1.
        }
Step 4: Call Procedure Power_Aware_Code_Generator(X).
Step 5: Stop.
SPP–PASPP finds the least possible value of T such that Tprefetch ≤ T < Tno_prefetch, as
defined in Formula 1.
SPP–PASPP solves the MOOP in Formula 1 using goal programming [10],
where Goal 1 has a higher priority than Goal 2. A goal programming problem can be
reduced to a linear programming problem and solved using the simplex method
[11]. In the worst case, the time taken by the simplex method is an exponential
function of m, but m is constant for a given processor. In the present work, the value
of m is 9, and it remains the same for any input, so the time taken to solve each
optimization problem is O(1). Steps 1–3 take O(floor(((Tno_prefetch − Tprefetch)/Tprefetch) × 10²))
time, and in Step 4 the power-aware code generator (PACG) takes O(p) time. So, the total
effort required by the algorithm is O(floor(((Tno_prefetch − Tprefetch)/Tprefetch) × 10²) + p).
Power-Aware Code Generator (PACG)

Power_Aware_Code_Generator(X)
{
    char S[]; int xcounter = X[1]; boolean first = true;
    for (i = 1; i <= m; i++)
    {
        if (X[i] > 0)
        {
            if (first == false)
            {
                sprintf(S, "else ");
                Insert_Code(S);
            }
            else
            {
                first = false;
            }
            sprintf(S, "if (count < %d) { set_voltage(%d); set_frequency(%d);",
                    xcounter, i, i);
            Insert_Code(S);
            if (p == 1)
            {
                sprintf(S, "PD = %d;", TEPD_TABLE[i].pd[1]);
                Insert_Code(S);
            }
            else
            {
                for (j = 1; j <= p; j++)
                {
                    sprintf(S, "PD%d = %d;", j, TEPD_TABLE[i].pd[j]);
                    Insert_Code(S);
                }
            }
            sprintf(S, "}");
            Insert_Code(S);
        } // end of if (X[i] > 0)
        xcounter = xcounter + X[i + 1];  // cumulative threshold for the next branch
    } // end of for (i = 1; i <= m; i++)
    sprintf(S, "count++;");
    Insert_Code(S);
} // end of PACG
PACG is a procedure that inserts PAC in the SPP to form the PASPP. After the
least possible value of T is obtained, SPP–PASPP calls this procedure. PACG has a
parameter X which contains the solution of the optimization problem solved before
the algorithm reaches Step 4. PACG uses the C library function sprintf. Insert_Code
is another procedure that enables PACG to insert the desired code in the PAC block
of the PASPP. PACG takes O(p) time because m is constant for a given processor.
Figure 12.19 shows the PASPP of JACOBI, which contains an integer variable count
initialized to zero and a PAC containing an if statement. PACG inserts an if-else
ladder followed by the statement count++; together they form the PAC. The count
variable counts the number of times the SPC has executed. The if-else ladder lets the
PASPP compare count with the number of times it should execute at each (v, f) pair
and switch to the desired (v, f) pair, changing the prefetch distance accordingly.
[Bar chart: average power dissipation (W), on a 0–8 scale, of the rename, bpred,
window, lsq, regfile, icache, dcache, dcache2, alu, resultbus, and write_buffers
units for the Original, SPP, and PASPP versions.]
Fig. 12.20 Detailed power dissipation at different units for three versions of 3D Jacobi’s Kernel
Table 12.13 Performance and energy gains of SPP of the benchmark programs
Benchmark    Performance gain by SPP with respect to the original (in %)    Energy gain by SPP with respect to the original (in %)
JACOBI 53.58 22.27
MM 60.84 15.02
DET 54.41 −1.00
MOLDYN 73.15 57.2
IP 63.03 15.9
VLIA 57.05 20.0
Average 60.34 21.56
MM matrix multiplication, IP inner product, VLIA very long integer addition, SPP software
prefetch program
Table 12.14 Performance and energy gains of PASPP of the benchmark programs
Benchmark    Performance gain by PASPP with respect to the original (in %)    Energy gain by PASPP with respect to the original (in %)
JACOBI 43.23 44.82
MM 54.90 56.15
DET 46.70 48.59
MOLDYN 65.27 65.56
IP 42.80 43.44
VLIA 42.22 42.63
Average 49.18 50.19
PASPP power-aware software prefetching program, MM matrix multiplication, IP inner product,
VLIA very long integer addition
Table 12.15 Performance loss and energy gain of PASPP with respect to SPP
Benchmark    Performance loss by PASPP with respect to the SPP (in %)    Energy gain by PASPP with respect to the SPP (in %)
JACOBI 22.28 29.00
MM 15.17 48.40
DET 16.90 49.10
MOLDYN 29.34 19.54
IP 54.73 32.73
VLIA 34.55 28.24
Average 28.82 34.50
PASPP power-aware software prefetching program, MM matrix multiplication, IP inner product,
VLIA very long integer addition, SPP software prefetch program
gained by PASPP with respect to that of the original program. Table 12.15 shows
an average performance loss of 28.82 % and energy gain of 34.50 % by PASPP with
respect to the SPP. Table 12.16 shows the power and time overhead due to the PAC
and the switching of (v, f) pairs. Table 12.17 shows the time spent by the PASPPs at
different (v, f) pairs with different prefetch distances (Table 12.12).
12.5.3 Conclusions
The present work provides a method for transforming an SPP into a PASPP. The
experimental results show that PASPPs retain much of the performance gain of software
prefetching while dissipating considerably less power. The proposed methods can enable
a compiler to generate PASPPs automatically.
Review Questions
Q12.1. Distinguish between hardware and software optimizations for low power.
Q12.2. Distinguish between machine-dependent and machine-independent optimizations.
Q12.3. What are the different ways to reduce power involving memory?
Q12.4. Explain, with an example, how inlining helps to reduce power dissipation.
Q12.5. How is code hoisting used to reduce power consumption?
Q12.6. Distinguish between static and dynamic dead-code elimination.
Q12.7. What is loop-invariant computation? How does a compiler exploit it to reduce
energy consumption?
Q12.8. Explain, with an example, how loop unrolling can be used to reduce energy
consumption.
Q12.9. Explain how common sub-expression elimination is done by a compiler.
Q12.10. Briefly explain how software optimizations can be combined with VDFS to
reduce energy consumption.
Q12.11. What is software prefetching? How can it be used to reduce energy
consumption?
References
1. Tiwari, V., Malik, S., Wolfe, A.: Compilation techniques for low energy. In: Proceedings
of the 1994 Symposium on Low-Power Electronics, San Diego, CA, October 1994
2. Mowry, T.C.: Tolerating latency through software-controlled data prefetching. Doctoral
dissertation, Stanford University, March 1994
3. Agarwal, D.N., Pamnani, S.N., Qu, G., Yeung, D.: Transferring performance gain from
software prefetching to energy reduction. In: Proceedings of the 2004 International
Symposium on Circuits and Systems (ISCAS 2004), Vancouver, Canada
4. Xie, F., Martonosi, M., Malik, S.: Intraprogram dynamic voltage scaling: Bounding oppor-
tunities with analytic modeling. ACM Trans. Architecture Code Optimization 1(3), 323–367
(2004)
5. Chen, J., Dong, Y., Yi, H., Yang, X.: Power-aware software prefetching. ICESS 2007, LNCS
4523, pp. 207–218
6. Mowry, T.C., Lam, M.S., Gupta, A.: Design and evaluation of a compiler algorithm for
prefetching. In: Proceedings of the 5th International Conference on Architectural Support
for Programming Languages and Operating Systems, Boston, MA, pp. 62–73, September 1992
7. Klaiber, A.C., Levy, H.M.: An architecture for software-controlled data prefetching. In: Pro-
ceedings of the 18th International Symposium on Computer Architecture, Toronto, ON, Can-
ada, pp. 43–53, May 1991
8. Herczeg, Z., Kiss, A., Schmidt, D., Wehn, N., Gyimothy, T.: XEEMU: An improved XScale
power simulator. PATMOS 2007, LNCS 4644, pp. 300–309
9. Herczeg, Z., Kiss, A., Schmidt, D., Wehn, N., Gyimothy, T.: Energy simulation of embedded
XScale systems with XEEMU. J. Embedded Comput. 3(3), August 2009 (PATMOS 2007
selected papers on low-power electronics)
10. Taha, H.A.: Operations Research: An Introduction, 8th edn., Chap. 8, p. 338. PHI Learning
Private Limited
11. Taha, H.A.: Operations Research: An Introduction, 8th edn., Chap. 3, p. 90. PHI Learning
Private Limited
12. Sakurai, T., Newton, A.R.: Alpha-power law MOSFET model and its applications to CMOS
inverter delay and other formulas. IEEE J. Solid-State Circ. 25(2), 584–594 (1990)
13. Burd, T., Brodersen, R.: Design issues for dynamic voltage scaling. In: The Proceedings of
International Symposium on Low Power Electronics and Design (ISLPED – 00), June 2000
14. Badawy, A.-H., Aggarwal, A., Yeung, D., Tseng C.-W.: The efficacy of software prefetching
and locality optimizations on future memory systems. J. Instruct.-Level Parallel. 6 (2004)
Index

F
Fabrication steps, 19
  nMOS, 24, 26
Fan-in, 113, 117, 292
Fan-out, 73, 113, 129
Feature size scaling
  device, 15, 178
Field oxide, 21
FinFET, 40
Flooding, 346
Fluid model, 43, 45, 46, 49, 50

G
Gate induced drain leakage, 14, 158
Gate leakage, 14, 36, 176
Gate oxide, 21
Gate oxide tunneling, 168
Glitches, 123, 157, 238
Glitching power, 10, 176
  dissipation, 13, 143, 157
  minimization, 237, 238
Guard ring, 33
  use of, 33

H
Hardware-software co-design, 214, 215
High-K dielectric, 37
Holding current, 33
Holding voltage, 33
Hot-carrier injection, 143

I
Inlining, 359
  small functions, 360, 361
Input vector control, 16
Inversion layer, 48
Inversion mode, 50
Inverter ratio, 68, 81, 82, 104, 121
Inverter threshold voltage, 68, 69
Ion implantation, 21, 23, 26, 28, 38
Isolation cell, 280

J
Junction leakage, 14

L
Latch-up problem, 20, 26
  and its prevention, 31, 33
Lightly doped drain structure, 183
Loop
  fusion, 368, 369
  invariant computation, 363
  peeling, 368, 369
  permutation, 367
  tiling, 366
  unrolling, 189, 191, 358, 359, 363–365
  unswitching, 370

M
Mask generation, 21, 22
Medium access control, 345
Memory effect, 331
Moore’s Law, 3, 5, 36
MOS dynamic circuits, 120
Multi-level voltage scaling (MVS), 177, 192, 193, 194
  challenges in, 194
Multi-threshold voltage CMOS (MTCMOS), 16, 263, 270, 272, 286

N
Narrow width effect, 165
n-MOS transistor, 44
Noise margin, 70, 82, 318
Non-saturated region, 50
NORA CMOS, 128, 241
n-well process, 26, 28, 30, 31

O
Oxidation, 21, 24, 332
Over glassing, 26

P
Pass transistor, 60
Pass-transistor logic, 104, 105, 107, 112
Peak power, 10
Photolithography, 22, 23
Pinch-off point, 59
p-MOS transistor, 44
Polysilicon, 43
Positive Feedback Adiabatic Logic (PFAL), 315, 316, 318
Power-aware software, 356, 371
Power density, 1, 7, 181, 182
Power dissipation, 1, 2, 7, 8, 11, 13–15, 78, 79
  sources of, 9, 10
Power gating, 2, 263, 272, 275, 277, 324
  controller, 282
  issues, 273, 274
Powerless, 123
Pre-charge logic, 123
Precharge phase, 124
Prefetch distance, 371, 376
Pulsed power supply, 304, 308, 309
Pull-down device, 67