Low Power Arch - MIT PDF

Architecture and System Level Optimization
of Power Consumption
Anantha Chandrakasan
Massachusetts Institute of Technology
Power and Throughput in CMOS
Throughput (1/delay)
1.5
1/delay, .6m ring oscillator
1.0
0.5
0.0
1.0
2.0
3.0
4.0
VDD
5.0
6.0
How to scale VDD keeping a fixed throughput?

Architecture and System Level Optimization of Power Consumption
Anantha Chandrakasan 1997
Architecture Trade-offs - Reference Datapath
1
T
LATCH C
COMPARATOR
A>B
ADDER
LATCH B
1
T
LATCH A
COMPARATOR
Area = 636 x 833 2

1
T
Critical path delay Tadder + Tcomparator (= 25ns)

fref = 40Mhz
Total capacitance being switched = Cref
Vdd = Vref = 5V
Power for reference datapath = Pref = Cref Vref2 fref
from [Chandrakasan92] (IEEE JSSC)
Parallel Datapath
LATCH B
ADDER
COMPARATOR
LATCH C
ADDER
COMPARATOR
LATCH C
LATCH B
1
2T
COMPARATOR
MUX
1
2T
A>B
LATCH A
1
2T
LATCH A
COMPARATOR
1
2T
1
2T
A>B
1
T
1
2T
Area = 1476 x 1219 2
The clock rate can be reduced by half with the same

throughput fpar = fref / 2
Vpar = Vref / 1.7, Cpar = 2.15Cref
Ppar = (2.15Cref) (Vref/1.7)2 (fref/2) 0.36 Pref
The Faster the Better??

NORMALIZED POWER
1.00
Fixed Throughput
0.90
Minimal Area
0.80
0.70
0.60
0.50
0.40
0.30
0.20
0.10
0.00
Minimal Power
1.00
2.00
3.00
4.00
5.00
Vdd (volts)
Capacitance overhead starts to dominate at high levels

of parallelism and results in an optimum voltage
THE ANSWER IS NO!

Pipelined Datapath
LATCH C1
LATCH C2
LATCH P
LATCH B
COMPARATOR
1
T
A>B
ADDER
1
T
LATCH A
1
T
COMPARATOR
C
1
T
1
T
Area = 640 x 1081 2
Critical path delay is less max [Tadder , Tcomparator]

Keeping clock rate constant: fpipe = fref
Voltage can be dropped Vpipe = Vref / 1.7
Capacitance slightly higher: Cpipe = 1.15Cref
Ppipe = (1.15Cref) (Vref/1.7)2 fref 0.39 Pref
Architecture Summary for a Simple Datapath
Architecture type
Voltage
Simple datapath
(no pipelining or
parallelism)
5V
Pipelined datapath
2.9V
1.3
0.39
Parallel datapath
2.9V
3.4
0.36
Pipeline-Parallel
2.0V
3.7
0.2
Area
Power
Algorithmic Transformations
XN
XN
YN
YN
2D
Loop Unrolling
XN-1
Ceff = 1
Voltage = 5
Throughput = 1
Power = 25
*
*
+
A
YN-1
Ceff = 1
Voltage = 5
Throughput = 1
Power = 25
Loop-unrolling does not reduce power consumption

from [Chandrakasan95] (IEEE TCAD)
Loop Unrolling Enables Other Transformations

XN
2D
*
A
2D
A
*
+
Pipelining
YN-1
XN-1
After
Algebraic Transformations,
&
Constant Propagation
Ceff = 1.5
Voltage = 3.7
Throughput = 1
Power = 20 (20% reduction)
YN
*
A2
XN-1
XN
YN
*
A2
*
+
YN-1
Ceff = 1.5
Voltage = 2.9
Throughput = 1
Power = 12.5 (x2 reduction)
Speed vs. Power Optimization

25
21
POWER (Fixed Throughput)

17
13
9
SPEEDUP
VOLTAGE
5
1
CAPACITANCE
Unrolling Factor
Area can be traded for Higher Throughput or Lower Power

ARBITRARY SPEEDUP vs. FINITE POWER REDUCTION
10
Multiple Supply Voltage Systems: Filter Example

1
2
3
4
5
* * * *
* *
+
+
+
+
2.4V
6
7
*
+
*
+
8
9
10
* *
3V
5V
+
Power (5V) / Power (5V,3V, 2.4V)= 1.5
from [Raje95]
Similar approach to logic design proposed in [Usami95]
11
Self Regulating Voltage Reduction Technique

VDD_Ref
Equivalent
Critical
Path
VDD_Ref
Signal
+
-
Comparator
Equivalent
Critical
Path
Regulated Voltage to DSP
from [Macken90]
Feedback adjusts the regulated voltage to the point

where the equivalent critical path is about to fail
12
Implementation Using PLL

Reference
Clock
Phase
Comparator
System Clock
Loop
Filter
Voltage
Regulator
Regulated VDD
Ring
Oscillator
from [Macken90], [Horowitz93]
Limited by bandwidth of the feedback loop

13
Low Voltage Switching Regulator

PASS Vg1
DEVICE
Vx
Lf
M1
Vin
ILf
Cin
Vg2
M2
Cx
Cf
RL Vdd
-
SYNCHRONOUS RECTIFIER
Arbitrary Vdd (<Vin) generated using the Buck converter

Vdd = Vin Duty Cycle at Node X
Chief sources of inefficiencies:
Conduction loss (I2R)

Switching loss (Cx Vin2 fs and Ls I2 fs)
Gate-drive loss (Cg Vin2 fs)
from [Stratakos94]
(IEEE PESC)
14
Soft-Switching Eliminates CxV2f Loss

PASS Vin
DEVICE
Vgp
M1
Vx
ILf
Iout
Vx
Vgn
Lf
M2
Cx
Cf
RL
ILf
Iout
RECTIFIER
Dead-time when neither

FET conducts
Current reverses
t
PASS DEVICE ON
|Vgsp|
Vgsn
Lf charges and discharges Cx
FETS ARE SWITCHED WITH VDS = 0

RECTIFIER ON
15
What happens when Iout changes?

Iout Cx discharges slowly
Iout Cx discharges quickly
Vgn
Vgn
Vx
Vx
Rectifier Discharges Cx
Body Diode Conduction
Inverter node transition times depends on Iout

Typical schemes use fixed dead time set by gate delays
Adaptive Dead-time Control Needed for varying Iout

16
Normalized FET Losses
ASIC Design: Power Transistor Sizing
Ptotal = Pgd + Pcl
Pcl = b/W
Pgd = afsW
0
Wopt
Gate-Width
Minimize Ptotal = Pgate-drive + Pconduction loss

W opt =
b
-------------a fs
17
Low Voltage Support Circuitry: Level Converter

VddH
4/3
M4
OEN
4/3
M3
VddL
VIN
24/2
M2
8/2,
4/2
24/2
M1
VOUT
VddH
0 VddL
VddL
O
Tri-stateable output driver
Compatibility with 3.3V/5V standard components

(VOH)IN = 1.1V to 5V and (VOH)OUT = 1.1 to 5V
18
Adaptive Power Supply Voltages
Processor
REG
Self-timed
FIFO
FIFO
REG
Control
Power
Supply
VDD(t)
Exploit Data Dependent Computation Times To Vary the Supply

from [Nielsen94]
(IEEE Transactions on VLSI Systems)
19
But Self-timed Circuits Can be Expensive...

Vdd
Vdd
OUTB
OUT
IN
INB
I
Guaranteed transition for every operation!

0->1 = 1
Our Approach Uses Synchronous DSP

20
Variable Supply Voltage Block Diagram
Rate
Controller
Ref_CLK
Workload
Filter
Sample
Workload
Data_In
Actively
Damped
Switching
Supply
Ring
Oscillator
Var_VDD
Var_CLK
Generic
Input
Buffer
DSP
Data_Out
from [Gutnik96]
(IEEE VLSI Circuits Symposium)
Note: cant exploit low-level data dependent computation times

21
Variable Algorithmic Workload

Compare Current Image...
...to Previous Image
Receiver just updates
# of blocks to process depends

on the image activity
Scene change require a lot of
computation.
Fixed throughput, varying
computational requirements.
Exploit Time Varying Computation Requirements

to Vary the Power Supply Voltage
22
Typical MPEG IDCT Histogram
Frequency of Occurrence
0.06
0.04
0.02
0.00
0
500
1000
1500
Number of IDCTs per Frame
2000
23
Synchronous Variable Voltage System
Work Load Power

Supply
VDD
Compressed
Video Data
Synchronous
FRAME
Processor
BUFFER
For an MPEG Decoder, the Work Load is Known A Priori

NO BUFFERING NEEDED!
24
Example of Variable Supply
Tsample
Block 1
Block 2
IDLE
Block 3
Block 4
IDLE
Tsample
Block 1
VDD = 5
VDD = 5
Block 2
VDD = 5
Energy Sample = C (5)2 + 1/2C (5) 2
VDD = 2.5
Block 3
VDD = 5
VDD = 5
VDD = 5
VDD = 2.5
Block 4
Energy Sample = C (5)2 + 1/2C (2.5) 2
2
= 18.75C
2
= 14C
25
Power vs. Workload
Normalized Power
1.0
VDD = 2V
VT = .5V
0.8
Fixed Power Supply
0.6
0.4
Energy Savings
0.2
0
Variable Power Supply
0.2
0.4
0.6
0.8
Normalized Workload
1.0
26
Reduce Power Further By Buffering

Averaged Workload
Xn+Xn+1
2
Power
Supply
VDD(y)
In, In+1
Synchronous
On, On+1
Processor
In+2, In+3
On-2, On-1
Two samples processed every two sample periods

Increased latency
27
Example of Buffering
Tsample
Block 1
Block 2
Block 3
Block 4
Tsample
VDD = 5
Block 1, 2
VDD = 3.75
VDD = 2.5
Block 1, 2
VDD = 3.75
VDD = 5
Block 3,4
VDD = 3.75
VDD = 2.5
Block 3,4
VDD = 3.75
Energy Sample = C (5)2 + 1/2C (2.5) 2
Energy Sample = 2 (0.75) C (3.75)2
2
= 14C
2
= 10.5C
28
Graphical Interpretation of Averaging
Normalized Power
1.0
0.8
Fixed Supply
0.6
Variable Supply
(no buffer)
0.4
Averaging
0.2
Average Workload
0
0.2
0.4
0.6
0.8
1.0
Normalized Workload
29
In General, Use an FIR to Filter Workload

Averaged Workload
y[n]
Power
Supply
FIR
x[n]
Synchronous
Processor
y [n] =
i =1
L
i=1
FIFO
FIFO
VDD(y)
ai x [n i ]
ai = 1 ; ( ai > 0 )
30
Buffering Example: MPEG Decoder
Supply Voltage, Volts
4V
Unbuffered
VDDMAX= 3.5V VT=0.7V
3V
2V
Buffered
1V
550
555
560
565
570
Video Frame Number
575
580
Averaging smooths effective workload and power

supply
31
Quantizing the Number of Voltage Levels

1.0
Normalized Power
4 levels, dithered
0.8
4 levels, undithered
0.6
Fixed Supply
0.4
0.2
Arbitrary Levels
0
0
0.2
0.4
0.6
0.8
1.0
Normalized Workload
4 Voltage Levels are Sufficient!

32
Power Supply Control Loop

Reference Clock
Compare
Variable Clock
Update
Power
Supply
VDD
Workload -> VDD

Lookup Table
Workload
Ring
Oscillator
Two Types of Changes:

1) Large, fast changes when workload changes (Open loop)
2) Small adjustments to compensate for temperature and
processvariations (Feedback)
33
Data Driven Processing: IDCT Example

60
Number of Blocks, x103
Number of Blocks, x103
50
40
30
20
10
0
8
16 24 32 40 48 56 64
Number of Non-zero Coefficients
40
20
0
0
8
16 24 32 40 48 56
Number of Non-zero Coefficients
64
Few non-zero coefficients in a typical 8x8 DCT matrix

Conventional algorithms do not exploit distribution
Recast algorithm noting multiplication with zero = NOP!
from [Xanthopoulos96]
I = x0 C0 + x1 C1 + ... + x63 C63

64x1Pixel Data
DCT Data
64x1 Coefficient Kernels
34
CHEN
(separable)
FMIDCT
(instantaneous)
0
0
FMIDCT
(average)
200
400
600
800
1000
Time (Block Sequence Number)
Number of Additions(x1000)
Number of Multiplications (x100)
FMIDCT vs. Conventional Algorithm

4
FMIDCT
(instantaneous)
3
FMIDCT
(average)
2
CHEN
(separable)
1
00
200
400
600
800
Time (Block Sequence Number)
1000
Time variable computational requirement using FMIDCT

-> Can use variable power supply voltage
35
Relationship to Parallelism
Normalized Power
1.0
0.8
Reference System
2x Parallel
16x Parallel
0.6
0.4
0.2
0.0
0.0
0.2
0.4
0.6
0.8
Normalized Workload
1.0
If the workload is typically low, then variable voltage

processing achieves energy efficiency of parallel
solution without area penalty
36
Algorithmic Transformations: VQ Compression

Blocked
Vectors
DISTORTION
CALCULATION
index of
best codeword
(8 bits)
CODEBOOK
(256 levels)
(4x4 pixels)
Algorithm
Full Search
Tree Search
Differential
Tree Search
# of
Memory
Accesses
# of
Multiplications
# of
Adds
# of
Subs
4096
256
136
4096
256
128
3840
240
128
4096
264
0
37
Algorithmic Transformations
Comparing Two Code Vectors
15
D left right =
15
j=0
( X j C leftj )
j=0
( X j C rightj )
OR
15
D left right =
15
D left right =
j=0
2
2
(X C
)
(
X
C
)
leftj
j
rightj
j
2
2
2X j C leftj X 2j C rightj
+ 2X j C
X 2j + C leftj
rightj
j=0
15
D left right =
j=0
2
2
( C leftj
C rightj
)+
15
2X j ( C rightj C leftj )
j=0
from [Fang94]
(VLSI signal processing IV)
38
A Uni-Processor Implementation
MEMORY (36K)
Data
REG
Control
8
INDEX
from [Lidsky94]
(VLSI Signal Processing, VII)
fclock: 8.3 MHz

VDD: 3.3V
Pav: 10.8 mW
Area: 72 mm2
39
A Multi-Processor Implementation
Control
Control
INDEX
fclock: 1.04 MHz
REG
Data
MEM
REG
REG
MEM
MEMORY
(18K)
INDEX
Control
2
INDEX
Pipelined Implementation reduces VDD
Pav: 250W
Distrubuted Resources reduce Physical Capacitance

Memory Power Reduced by factor 8!
Area: 112 mm2
Smaller Clock Frequency reduces activity
VDD: 1.1 V
40
Approximate Filtering Techniques

POWER DOWN
IN
fs
fs
fs
fs
fs
OUT
Dynamically vary the number of operations per sample.
Trade power consumption and filter quality

from [Ludwig95]
(CICC95)
41
How Much Power Reduction is Possible?

Speech
x 10
Filter Length vs. Time

450
0
400
1
0
2000
4000
amplitude
x 10
8000
10000
12000
14000
16000
350
0
1
0
2000
4000
6000
8000
10000
12000
14000
16000
Output of Adaptive Filter
amplitude
6000
Speech Corrupted with Noise
Numberfilterof
Taps
length, N
amplitude
x 10
Average Filter Length = 212
250
200
150
100
0
1
0
Average length = 212
300
50
2000
4000
6000
8000
10000
discrete time index, n
12000
14000
0
0
16000
2000
4000
6000
8000
10000
discrete time index, n
12000
14000
16000
Sample Number
Switched Capacitance Reduction
Peak Number of Operations

Average Number of Operations
Strong Function of Signal Statistics

42
Number Representation
Sign Magnitude
1.0
Rapidly Varying
0.8
0.6
0.4
0.2
0.0
Slowly Varying
0
10
12
14
Transition Probability
Twos Complement
1.0
0.8
Rapidly Varying
0.6
0.4
0.2
0.0
Bit Number
10
12
14
Bit Number
Sign-extension activity significantly reduced using

sign-magnitude representation
43
Number Representation: Accumulator Example

RST
3
14
SIGN-BIT
(To Control)
RST
4
IN
13
14
ACC
IN
OCLK
CLK
(64Mhz)(640Khz)
CLK
(64Mhz)
13
3
GATED
CLK
Twos Complement
13
GATED
GATED OCLK
CLK
CLK (640Khz)
3
IN
3-bit Magnitude
RST
CLK
(64Mhz)
RST
14
14
ACC
RST
13
OCLK
(640Khz)
GATED OCLK
CLK (640Khz)
Sign Magnitude
Sign magnitude datapath switches 30% less

capacitance for uniformly distributed inputs
44
Twos Complement vs. Sign-Magnitude
Transition Activity
SUM
(Twos Complement)
1.0
SUMB
SUMA + SUMB
(Sign-Magnitude)
0.5
SUMA
0.0
0
10
12
Bit Number
Twos complement datapath has a significantly

higher glitching activity
45
Conditional Inversion Coding

Simple
Control
DataIn
Register
invert
DataOut
Large Capacitance
Bus
from [Stan94]
(1994 International Workshop on Low-power Design)
46
Reducing Activity by Reordering Inputs

SUM1
SUM2
SUM1
IN
IN
>> 7
SUM2
>> 8
>> 8
>> 7
IN
Associativity & Commutativity

IN
IN
IN
0.5
0.4
SUM1
0.3
SUM2
0.2
0.1
0.0 0
10
12
14
0.5
0.4
0.3
SUM2
0.2
SUM1
0.1
0.0 0
Bit Number
10
12
14
Bit Number
30% reduction in switching energy

47
Counter 1
BUS1
SHARED BUS
OR
Counter 2
Counter 2
Counter 1
Resource Sharing Can Increase Activity
BUS2
Number of Bus Transitions Per Cycle

= 2 (1 + 1/2 + 1/4 + ...+1/128) 4
# of Bus Transitions Per Cycle
10.0
8.0
6.0
4.0
No Bus-sharing
2.0
0.0
50
100
150
200
250
Skew Between Counter Outputs

48
SPICE Approach to Energy Estimation

Vdd
idd
IN0
IN1
CIN
+
-
Vc = Energy in pJ
COUT
Vdd idd
1pF
100M
SUM
4.0
Voltage, V
3.0
2.0
1.0
voltage across the integrating V

capacitor
IN0
IN1
Energy/ transition is 3.64pJ/8

= 0.44pJ
CIN
100 010
110 001 101 011 111 000
0.00
25
50
75
100
time, ns
125
150
Accurate but extremely slow for large circuits...

49
Switch Level Simulation

Irsim-Cap
(from P. Landman and J. Rabaey)
100
90
80
70
60
50
40
30
20
10
0
Cap (fF/bit)
Irsim-Cap
SPICE
10
20
30
40
50
60
Sample
A
B
X
F
Signal
A
Cphysical N(01) Ceffective % Total

20 fF
1
20 fF
6.9%
20 fF
40 fF
13.8%
10 fF
30 fF
10.3%
100 fF
200 fF
69.0%
50
Intermediate Level: Timing Simulation

Handles reduced signal swings
More accurate in leakage and dc-currents
Up to 2 orders of magnitude faster than SPICE
i(Vdd)
Vdd
in
in
out1
out2
out3
out1
Vdd-Vth
out2
out3
Example: PowerMill (Epic)

51
Architecture Power Analysis
Too Little
Z = X*Y
if (Z<0) then Z=0
Power
Algorithm
Time
Mem
Fast & Accurate
>>
Ctrl
Architecture
Area
Too Late
A
B
Exploration
Gate/Circuit
52
Architecture Level Power Analysis (SPA)

C
reg
reg
Mem
INPUT
RTL Netlist
Power Models
Chip Inputs
+
Ctrl
OUTPUT
Average Power:
Module
Bus
Clock
Control
reg
C
Tool from U.C. Berkeley (P. Landman and J. Rabaey)

from [Landman93] (Proc. EDAC-EUROASIC)
53
Architectural Power Estimation (Sente)

Sentes Watt Watcher Tool
- Performs early [pre-synthesis], full-chip, power estimation
- WW operates at the module-level, not gate-level
- Provides a specific power estimation technique for each full
chip power component
Watt Watchers Approach to Power Estimation

- Synthesizes design into -architectural blocks
- Uses basic Process Technology parameters, not gate-level
models
- Determines circuit activity from functional simulation
- Back-annotates wiring capacitances from floor planners or
layout
- Analyzes Mode-Dependent power contributions
54
Case Study: A Portable Multimedia I/O Terminal

Antenna
Radio Modem
Protocol Module
(2mW)
Video
Decompression
Module
(2mW)
Speech
Pen
Digitizer Codec
Text/Graphics
Frame-Buffer
Module
(1mW)
Protocol, ECC, Buffering, Video Decompression, and I/O

(InfoPad Terminal Developed at U.C. Berkeley)
from [Chandrakasan94]
55
Activity Driven Placement for Low-power
Physical Capacitance, pF
Minimize Cswitched = Cphysical f

Decreasing Activity
1.8
1.2
Display
Data
Address
SRAM Data
f = 1/32
f = 1/16
f = 1/2
0.8
0.2
10
20
30
40
50
Distance from Core to Pads

56
Self-timed Approach for Eliminating Glitching
Sense
Row Decode
Cells
Cells
INACTIVE
Sense
Block Selected
and Sense-amp
Output Valid
32
Enable tri-state drivers after sense-amp outputs are valid

to eliminate glitching on the data-bus.
57
Glitch Free, Low Swing RAM Bitslice

Data Out
Output remains tri-stated

until senseamp/latch has
resolved data
Vdd
OEN
PRE
PRE
Vdd
Column select/
cascode amp
Bitlines precharged to
Vdd - Vtn
Vdd
PRE
SEL0
SEL1
PRE
Vdd
B0 B0
Vdd
PRE
Vdd
B1 B1
58
Memory Architecture
MEMORY
CELL
ARRAY
4 4
f
f
Parallel Access
Addr
Row Decoding
Addr
Row Decoding
Serial Access
MEMORY
CELL
ARRAY
4 4
4 4
Mux
4
Latch
4 4
Latch
f/8
4 4
4 4
Mux
8-nibbles
4 bit display interface
Voltage = 3V
Voltage = 1.1V
59
Digital YIQ -> Digital RGB

R
G
B
D11 D12 D13

D21 D22 D23
D31 D32 D33
Y
I
Q
Optimized matrix multiplication (6mults -> 8 adds)

Hardwired shift-add operations
Coefficient scaling to minimize shift-add operations
Exploit multiple coefficients multiplied with the
same input
60
Optimizing Multiplications
A = IN * 0 0 1 1
B = IN * 0 1 1 1
A = (IN >>4 + IN >>3)
B = (A + IN >>2)
# of shift-add operations
A = (IN >>4 + IN >>3)

B = (IN >>4 + IN >>3 + IN >>2)
16
14
Only Scaling
12
10
6
1.10
1.15
1.20
1.25
Scaling &
Common
Sub-expression
61
Time-multiplexed Architectures
Parallel busses for I,Q
I0
I1
I2
Q0
Q1
Q2
Time-shared bus for I,Q

I0
20
20
Signal Value
30
Signal Value
30
Q1
I1
10
0
-10
-10
-20
I1
T/2
10
Q0
10
20
30
40
Time, Sample Number
50
-20
20
40
60
80
Time, Sample Number
100
Can destroy signal correlations and increase

the switching activity
62
Summary
Signal statistics can be exploited to minimize
the number of transitions required to perform
a given function
Architectural voltage scaling is a key technique for
low-voltage operation
Variable power supply reduces power and buffering
trades latency for power
Chipset for a portable terminal consumes
orders of magnitude lower power compared
to current design approaches
63
References
[Chandrakasan92] A. P. Chandrakasan, S. Sheng, and R. W. Brodersen, Low-power Digital CMOS Design, IEEE Journal of
Solid State Circuits, pp. 473-484, April 1992.
[Chandrakasan94] A. P. Chandrakasan, A. Burstein, and R. W. Brodersen, A Low-power Chipset for a Portable Multimedia I/O
Terminal, IEEE Journal of Solid-State Circuits, December 1994.
[Chandrakasan95] A. P. Chandrakasan, R. Mehra, M. Potkonjak, J. Rabaey, and R. W. Brodersen, Optimizing Power Using
Transformations, IEEE Transaction on CAD, vol. 14, pp. 13-32, January 1995.
[Fang94] W.-C. Fang, C.-Y. Chang, B. J. Sheu, O. T.-C. Chen, J. C. Curlander, VLSI systolic binary tree-searched vector quantizer for image compression, IEEE Trans. on VLSI Systems, vol. 2, no. 1, pp. 33-44, Mar. 1994.
[Gutnik96] V. Gutnik, A. Chandrakasan, An Efficient Controller for Variable Supply-Voltage Low Power Processing, 1996
VLSI Circuits Symposium, June 1996.
[Horowitz93] M. Horowitz, Low Power Processor Design Using Self-Clocking, Workshop on Low-power Electronics, 1993.
[Landman93] P. Landman and J. Rabaey, Power Estimation for High Level Synthesis, Prof. EDAC-EUROASIC 93, Paris,
France, pp. 361-366, Feb. 1993.
[Lidsky94] D. Lidsky, J. Rabaey Low-Power Design of Memory Intensive Functions: Case Study: Vector Quantization, VLSI
Signal Processing,VII October 1994 IEEE Press, NY NY.
[Ludwig95] J. Ludwig, S. Nawab, A.P. Chandrakasan,"Low Power Filtering Using Approximate Processing for DSP," IEEE 1995
Custom Integrated Circuits Conference, pp. 185-188, May 1995.
[Macken90] P. Maken, M. Degrauwe, M. Van Paemel, H. Oguey, A Voltage Reduction Technique for Digital Systems, ISSCC
February, 1990.
[Nielsen94] L Nielsen, C. Niessen, J. Sparso, K. van Berkel, Low-Power Operation Using Self-Timed Circuits and Adaptive
Scaling of Supply Voltage, IEEE Transactions on VLSI systems, December 1994, pp 391-397.
[Raje95] S. Raje and M. Sarrafzadeh, Variable Voltage Scheduling, pp. 9-13, International Symposium on Low Power Design,
April 1995.
[Stan94] M. Stan and W. Burleson, Limited-weight Codes for Low-power I/O, 1994 International Workshop on Low-power
Design, pp. 209-214, April 1994.
[Stratakos94] A. Stratakos, S. Sanders, and R. Brodersen, A Low-Voltage CMOS DC-DC Converter for a Portable Battery-Operated
System, IEEE Power Electronics Specialists Conference., pages 619-626, 1994.
[Usami95] K. Usami and M. Horowitz, Clustered Voltage Scaling Technique for Low Power Design, International Symposium on
Low Power Design, pp. 3-8, April 1995.
[Xanthopoulos96] T. Xanthopoulos, A. Chandrakasan, C. Sodini, B. Dally, A Data-Driven IDCT Architecture for Low Power Video
Applications, IEEE ESSIRC, September 1996.

Low Power Arch - MIT PDF

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Low Power Arch - MIT PDF

Uploaded by

Copyright:

Available Formats

Architecture and System Level Optimization

Power and Throughput in CMOS

How to scale VDD keeping a fixed throughput?

Anantha Chandrakasan 1997

Architecture Trade-offs - Reference Datapath

Area = 636 x 833 2

Critical path delay Tadder + Tcomparator (= 25ns)

Anantha Chandrakasan 1997

Area = 1476 x 1219 2

The clock rate can be reduced by half with the same

Anantha Chandrakasan 1997

The Faster the Better??

Capacitance overhead starts to dominate at high levels

THE ANSWER IS NO!

Anantha Chandrakasan 1997

Area = 640 x 1081 2

Critical path delay is less max [Tadder , Tcomparator]

Anantha Chandrakasan 1997

Architecture Summary for a Simple Datapath

Architecture and System Level Optimization of Power Consumption

Anantha Chandrakasan 1997

Loop-unrolling does not reduce power consumption

Anantha Chandrakasan 1997

Loop Unrolling Enables Other Transformations

Architecture and System Level Optimization of Power Consumption

Anantha Chandrakasan 1997

Speed vs. Power Optimization

POWER (Fixed Throughput)

Area can be traded for Higher Throughput or Lower Power

Anantha Chandrakasan 1997

Multiple Supply Voltage Systems: Filter Example

Anantha Chandrakasan 1997

Self Regulating Voltage Reduction Technique

Feedback adjusts the regulated voltage to the point

Anantha Chandrakasan 1997

Implementation Using PLL

from [Macken90], [Horowitz93]

Limited by bandwidth of the feedback loop

Anantha Chandrakasan 1997

Low Voltage Switching Regulator

Arbitrary Vdd (<Vin) generated using the Buck converter

Conduction loss (I2R)

Architecture and System Level Optimization of Power Consumption

Anantha Chandrakasan 1997

Soft-Switching Eliminates CxV2f Loss

Dead-time when neither

Lf charges and discharges Cx

FETS ARE SWITCHED WITH VDS = 0

Anantha Chandrakasan 1997

What happens when Iout changes?

Iout Cx discharges quickly

Body Diode Conduction

Inverter node transition times depends on Iout

Adaptive Dead-time Control Needed for varying Iout

Anantha Chandrakasan 1997

Normalized FET Losses

ASIC Design: Power Transistor Sizing

Ptotal = Pgd + Pcl

Minimize Ptotal = Pgate-drive + Pconduction loss

Architecture and System Level Optimization of Power Consumption

Anantha Chandrakasan 1997

Low Voltage Support Circuitry: Level Converter

Tri-stateable output driver