Architecture and System Level Optimization of Power Consumption
Anantha Chandrakasan
Massachusetts Institute of Technology
[Figure: throughput (1/delay) versus VDD for a 0.6 µm ring oscillator; VDD axis from 1.0 to 6.0, throughput axis from 0.0 to 1.5]
[Figure: simple datapath with latches A, B, and C clocked at 1/T, an adder, and a comparator (A>B)]
Parallel Datapath
[Figure: two copies of the adder and comparator (A>B) datapath, each with latches A, B, and C clocked at 1/2T; a MUX clocked at 1/T selects between the two comparator outputs]
[Figure: normalized curve (0.00 to 1.00) versus Vdd (volts, 1.00 to 5.00) at fixed throughput, with the minimal-area and minimal-power points marked]
Pipelined Datapath
[Figure: adder and comparator (A>B) datapath with input latches A, B, and C (C1, C2) and a pipeline latch P between the adder and the comparator, all clocked at 1/T]
Architecture type                                Voltage   Area (normalized)   Power (normalized)
Simple datapath (no pipelining or parallelism)   5V        1                   1
Pipelined datapath                               2.9V      1.3                 0.39
Parallel datapath                                2.9V      3.4                 0.36
Pipeline-Parallel                                2.0V      3.7                 0.2
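The voltage and power figures listed above can be used to back out how much extra switched capacitance each architecture implies. The sketch below assumes that normalized power at fixed throughput equals a capacitance ratio times (V / 5V)^2; the function name and the dictionary are hypothetical, and only the voltages and powers come from the slide.

```python
# Back out the implied capacitance overhead of each architecture, assuming
# normalized power = C_ratio * (V / V_ref)**2 at fixed throughput.
V_REF = 5.0  # supply of the simple (reference) datapath, in volts

# (voltage in volts, normalized power) as listed on the slide
ARCHITECTURES = {
    "pipelined": (2.9, 0.39),
    "parallel": (2.9, 0.36),
    "pipeline-parallel": (2.0, 0.2),
}

def implied_capacitance_ratio(voltage, normalized_power, v_ref=V_REF):
    """Effective switched capacitance per sample, relative to the reference."""
    return normalized_power / (voltage / v_ref) ** 2

for name, (v, p) in ARCHITECTURES.items():
    print(f"{name:18s} C_ratio = {implied_capacitance_ratio(v, p):.2f}")
```

Under this model the voltage reduction accounts for most of the savings, while the capacitance ratio above 1 is the overhead paid for the extra latches, duplicated hardware, or multiplexing.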
Algorithmic Transformations
[Figure: signal-flow graphs of a first-order recursive filter (input XN, output YN, multiply by A, add, feedback of YN-1) before and after transformation; the transformed graphs take XN-1 and XN, use the constants A and A2 (A squared), and feed back through a 2D delay]
Original and after loop unrolling: Ceff = 1, Voltage = 5, Throughput = 1, Power = 25
After algebraic transformations and constant propagation: Ceff = 1.5, Voltage = 3.7, Throughput = 1, Power = 20 (20% reduction)
After pipelining: Ceff = 1.5, Voltage = 2.9, Throughput = 1, Power = 12.5 (x2 reduction)
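The power figures above follow from Power = Ceff * Voltage^2 * Throughput, and the transformation can be checked numerically. The sketch below assumes the filter is y[n] = x[n] + A*y[n-1] (a reading of the figure labels, not stated in text on the slide); all function names are hypothetical.

```python
def power(c_eff, vdd, throughput=1.0):
    """Relative power under the Ceff * Vdd^2 * throughput model."""
    return c_eff * vdd ** 2 * throughput

def iir_direct(x, a):
    """y[n] = x[n] + a * y[n-1], with y[-1] = 0."""
    y, prev = [], 0.0
    for sample in x:
        prev = sample + a * prev
        y.append(prev)
    return y

def iir_unrolled(x, a):
    """Two outputs per iteration after unrolling and constant propagation.

    y[n-1] = x[n-1] + a * y[n-2]
    y[n]   = x[n] + a * x[n-1] + a2 * y[n-2], with a2 = a * a precomputed.
    Assumes len(x) is even.
    """
    a2 = a * a
    y, y_nm2 = [], 0.0
    for k in range(0, len(x), 2):
        y_nm1 = x[k] + a * y_nm2
        y_n = x[k + 1] + a * x[k] + a2 * y_nm2
        y.extend([y_nm1, y_n])
        y_nm2 = y_n
    return y

for label, c_eff, vdd in [("original", 1.0, 5.0),
                          ("algebraic + constant propagation", 1.5, 3.7),
                          ("pipelined", 1.5, 2.9)]:
    print(f"{label:34s} power = {power(c_eff, vdd):.2f}")
```

In the unrolled form both outputs depend only on y[n-2], so the two can be computed side by side; that shorter critical path per sample is what allows the lower supply voltage despite the higher Ceff.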
[Figure: speedup, voltage, and capacitance versus unrolling factor]
[Figure: data-flow graph of multiply (*) and add (+) operations scheduled with three supply voltages: 5V, 3V, and 2.4V]
Power (5V) / Power (5V, 3V, 2.4V) = 1.5
from [Raje95]
Similar approach to logic design proposed in [Usami95]
Architecture and System Level Optimization of Power Consumption
[Figure: supply regulation loop with VDD_Ref, a comparator (+/-) with a signal input, and an equivalent critical path, producing the regulated voltage to the DSP]
from [Macken90]
[Figure: supply control loop with a phase comparator fed by the system clock and a ring oscillator, a loop filter, and a voltage regulator producing the regulated VDD]
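[Figure: DC-DC converter schematic: input Vin with capacitor Cin, switches M1 and M2 (SYNCHRONOUS RECTIFIER, gate Vg2) at node Vx with capacitance Cx, filter inductor Lf (current ILf), filter capacitor Cf, and load RL at output Vdd]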
[Figure: rectifier stage (M2, Lf, Cx, Cf, RL) with waveforms of Vx, Vgn, ILf, and Iout versus t, marking the PASS DEVICE ON (|Vgsp|) and RECTIFIER ON (Vgsn) intervals]
[Figure: Vgn and Vx waveforms. Rectifier Discharges Cx]
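[Figure: the two loss terms versus gate width W, from 0, with the optimum marked at Wopt]
$$P_{cl} = \frac{b}{W}, \qquad P_{gd} = a f_s W, \qquad W_{opt} = \sqrt{\frac{b}{a f_s}}$$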
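The optimum gate width follows from setting the derivative of b/W + a*fs*W to zero. The sketch below checks that numerically; reading Pcl as a conduction loss and Pgd as a gate-drive loss is an assumption, and the coefficient values are hypothetical, chosen only to exercise the formula.

```python
import math

def total_loss(width, a, b, fs):
    """Sum of the two width-dependent terms: b / W + a * fs * W."""
    return b / width + a * fs * width

def optimal_width(a, b, fs):
    """Width where d/dW (b/W + a*fs*W) = 0, i.e. W = sqrt(b / (a*fs))."""
    return math.sqrt(b / (a * fs))

# Hypothetical coefficients (not from the slides).
a, b, fs = 2e-12, 5e-3, 1e6
w_opt = optimal_width(a, b, fs)

print("W_opt =", w_opt)
print("loss at 0.5 * W_opt:", total_loss(0.5 * w_opt, a, b, fs))
print("loss at W_opt:      ", total_loss(w_opt, a, b, fs))
print("loss at 2.0 * W_opt:", total_loss(2.0 * w_opt, a, b, fs))
```

At the optimum the two terms are equal, so making the device wider than Wopt spends more on the term that grows with W than it saves on the term that falls with W.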
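[Figure: circuit operating between supplies VddL and VddH: input VIN (swing 0 to VddL), output VOUT, enable OEN, transistors M1, M2, and M3, with device sizes 4/3, 24/2, 8/2, 4/2, and 24/2]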
[Figure: self-timed processor with registers (REG) and FIFOs on its input and output; a control block sets the power supply VDD(t)]
[Figure: differential circuit with inputs IN and INB, outputs OUT and OUTB, supply Vdd, and current I]
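[Figure: variable-supply system: Data_In enters a generic input buffer; a workload filter samples the workload and drives a rate controller referenced to Ref_CLK; an actively damped switching supply produces Var_VDD; a ring oscillator produces Var_CLK for the DSP, which outputs Data_Out]
from [Gutnik96]
(IEEE VLSI Circuits Symposium)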
[Figure: histogram of frequency of occurrence (0.00 to 0.06) versus number of IDCTs per frame (0 to 2000)]
[Figure: compressed video data feeds a synchronous processor that writes to a frame buffer]
[Figure: timing of four blocks over sample periods Tsample, at fixed and at variable supply]
Fixed supply: blocks 1 to 4 all run at VDD = 5, with IDLE time after block 2 and after block 4. Energy = 18.75C
Variable supply: blocks 1 and 3 run at VDD = 5, blocks 2 and 4 at VDD = 2.5, with no idle time. Energy = 14C
Anantha Chandrakasan 1997
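The two energy figures can be reproduced with a short calculation. The sketch below assumes blocks 2 and 4 carry half the workload of blocks 1 and 3, that energy per block is workload * C * VDD^2 averaged over the four blocks, and that a half-workload block can fill the sample period at half the supply (delay scaling linearly with 1/VDD). These are readings of the slide, not statements from it, and the function names are hypothetical.

```python
V_MAX = 5.0  # full-rate supply from the example, in volts

def energy_fixed(workloads, vdd=V_MAX):
    """Average energy per block (in units of C) at a fixed supply."""
    return sum(w * vdd ** 2 for w in workloads) / len(workloads)

def energy_variable(workloads, v_max=V_MAX):
    """Average energy per block when each block runs at vdd = v_max * workload."""
    return sum(w * (v_max * w) ** 2 for w in workloads) / len(workloads)

workloads = [1.0, 0.5, 1.0, 0.5]   # blocks 1 to 4; 1.0 fills Tsample at 5V

print(energy_fixed(workloads))     # 18.75
print(energy_variable(workloads))  # 14.0625, shown as 14C on the slide
```

The saving comes entirely from the short blocks: stretching them to fill the sample period lets them run at 2.5V instead of finishing early at 5V and idling.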
[Figure: normalized power versus normalized workload (both 0 to 1.0) for VDD = 2V, VT = .5V, with the energy savings marked]
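[Figure: buffered variable-supply system: inputs In, In+1 (followed by In+2, In+3) enter a synchronous processor that produces outputs On, On+1 (preceded by On-2, On-1); the power supply VDD(y) is set from the averaged workload (Xn + Xn+1)/2]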
Example of Buffering
[Figure: timing of four blocks over sample periods Tsample, without and with buffering]
Without buffering: blocks 1 to 4 run at VDD = 5, 2.5, 5, and 2.5. Energy = 14C
With buffering: blocks 1, 2 and blocks 3, 4 are processed in pairs at VDD = 3.75. Energy = 10.5C
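The effect of buffering in this example can be sketched as averaging the workload over a group of blocks before choosing the supply. As an assumption, the code below takes workloads of 1, 0.5, 1, 0.5 for blocks 1 to 4 and a supply proportional to the averaged rate, which matches the 3.75 figure on the slide; the function name and parameters are hypothetical.

```python
def energy_buffered(workloads, group=2, v_max=5.0):
    """Average energy per block (units of C) when `group` blocks share one supply."""
    total = 0.0
    for k in range(0, len(workloads), group):
        chunk = workloads[k:k + group]
        rate = sum(chunk) / len(chunk)   # averaged workload sets the rate
        vdd = v_max * rate               # linear delay-voltage assumption
        total += sum(chunk) * vdd ** 2
    return total / len(workloads)

workloads = [1.0, 0.5, 1.0, 0.5]

print(energy_buffered(workloads, group=1))  # 14.0625 (no buffering)
print(energy_buffered(workloads, group=2))  # 10.546875, shown as 10.5C
```

Because energy grows faster than linearly with rate under this model, running every block at the average rate costs less than alternating between a fast and a slow rate; the price is the latency of holding one block in the buffer.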
[Figure: normalized power versus normalized workload (both 0 to 1.0) for fixed supply, variable supply (no buffer), and averaging, with the average workload marked]
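[Figure: the workload x[n] passes through an FIR filter whose output y[n] sets the power supply VDD(y) of a synchronous processor placed between an input FIFO and an output FIFO]
$$y[n] = \sum_{i=1}^{L} a_i \, x[n-i], \qquad \sum_{i=1}^{L} a_i = 1, \quad (a_i > 0)$$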
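The workload filter above is a weighted average of the previous L workload samples. A minimal sketch, assuming zero workload before the first sample and hypothetical tap values:

```python
def fir_average(workload, taps):
    """y[n] = sum over i = 1..L of a_i * x[n - i], with x[k] = 0 for k < 0."""
    # The slide's constraints: positive taps that sum to one.
    assert all(a > 0 for a in taps) and abs(sum(taps) - 1.0) < 1e-12
    out = []
    for n in range(len(workload)):
        acc = 0.0
        for i, a in enumerate(taps, start=1):
            if n - i >= 0:
                acc += a * workload[n - i]
        out.append(acc)
    return out

x = [1.0, 0.5, 1.0, 0.5, 1.0]          # hypothetical normalized workloads
print(fir_average(x, taps=[0.5, 0.5]))  # [0.0, 0.5, 0.75, 0.75, 0.75]
```

Because the taps are positive and sum to one, the filtered value always lies between the smallest and largest recent workload, so it can be mapped directly to a rate and supply setting.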
[Figure: supply voltage (1V to 4V) versus video frame number (550 to 580), unbuffered and buffered]
[Figure: normalized power versus normalized workload (both 0 to 1.0) for fixed supply, 4 levels undithered, 4 levels dithered, and arbitrary levels]
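One plausible reading of the four curves can be sketched as follows. All modeling choices here are assumptions rather than content from the slide: four evenly spaced rate levels, power at a continuous rate r proportional to r^3 (supply scaling linearly with rate), an undithered controller that runs at the next level up and then idles at no cost, and a dithered controller that alternates between the two adjacent levels so the average rate matches the workload.

```python
import bisect

LEVELS = [0.25, 0.5, 0.75, 1.0]  # hypothetical set of 4 normalized rate levels

def level_power(rate):
    """Power when running continuously at `rate`; assumes vdd scales with rate."""
    return rate ** 3

def power_fixed(workload):
    """Full rate and full supply, idle when done."""
    return workload * level_power(1.0)

def power_undithered(workload, levels=LEVELS):
    """Run at the lowest level that keeps up, then idle. Needs 0 < workload <= 1."""
    rate = levels[bisect.bisect_left(levels, workload)]
    return (workload / rate) * level_power(rate)

def power_dithered(workload, levels=LEVELS):
    """Alternate between the two adjacent levels so the average rate matches."""
    hi_idx = bisect.bisect_left(levels, workload)
    hi = levels[hi_idx]
    if hi == workload or hi_idx == 0:
        return (workload / hi) * level_power(hi)
    lo = levels[hi_idx - 1]
    frac_hi = (workload - lo) / (hi - lo)
    return frac_hi * level_power(hi) + (1 - frac_hi) * level_power(lo)

w = 0.6
print(power_fixed(w), power_undithered(w), power_dithered(w), level_power(w))
```

Under these assumptions dithering recovers part of the gap between a quantized supply and an arbitrary one, since time-sharing two neighboring levels follows the chord of the power curve rather than the line through the upper level.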
[Figure: control loop with compare and update blocks, a power supply generating VDD, and a ring oscillator producing the variable clock]
[Figure: two plots versus number of non-zero coefficients (0 to 64) for DCT data]
[Figure: number of additions (x1000) versus time (block sequence number, 0 to 1000) for CHEN (separable), FMIDCT (instantaneous), and FMIDCT (average)]
Relationship to Parallelism
[Figure: normalized power versus normalized workload (both 0.0 to 1.0) for the reference system, 2x parallel, and 16x parallel]
[Figure: vector quantization: a distortion calculation compares the input (4x4 pixels) against a codebook (256 levels) and outputs the index of the best codeword (8 bits)]

Algorithm                  # of Memory Accesses   # of Multiplications   # of Adds   # of Subs
Full Search                4096                   4096                   3840        4096
Tree Search                256                    256                    240         264
Differential Tree Search   136                    128                    128         0
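The operation counts in the table can be reproduced from the codebook size (256), the vector dimension (16), and a binary tree of depth 8. The breakdown below is one reading that matches every entry; in particular, attributing the 8 extra subtractions in tree search to the per-level comparison, and the 17th memory access per level in differential search to a precomputed constant, are assumptions. Function names are hypothetical.

```python
def full_search(codebook_size=256, dim=16):
    """Compute the squared-error distortion against every codeword."""
    n = codebook_size
    return {"mem": n * dim, "mul": n * dim, "add": n * (dim - 1), "sub": n * dim}

def tree_search(levels=8, dim=16):
    """Two candidate distortions per tree level, plus one comparison per level."""
    candidates = 2 * levels
    return {"mem": candidates * dim, "mul": candidates * dim,
            "add": candidates * (dim - 1), "sub": candidates * dim + levels}

def differential_tree_search(levels=8, dim=16):
    """One inner product plus one stored constant per tree level."""
    return {"mem": levels * (dim + 1), "mul": levels * dim,
            "add": levels * dim, "sub": 0}

for name, counts in [("Full Search", full_search()),
                     ("Tree Search", tree_search()),
                     ("Differential Tree Search", differential_tree_search())]:
    print(f"{name:26s} {counts}")
```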
Algorithmic Transformations
Comparing Two Code Vectors
$$D_{left-right} = \sum_{j=0}^{15} (X_j - C_{left,j})^2 - \sum_{j=0}^{15} (X_j - C_{right,j})^2$$
OR
$$D_{left-right} = \sum_{j=0}^{15} \left( X_j^2 - 2 X_j C_{left,j} + C_{left,j}^2 - X_j^2 + 2 X_j C_{right,j} - C_{right,j}^2 \right)$$
$$D_{left-right} = \sum_{j=0}^{15} \left( C_{left,j}^2 - C_{right,j}^2 \right) + \sum_{j=0}^{15} 2 X_j \left( C_{right,j} - C_{left,j} \right)$$
from [Fang94]
(VLSI signal processing IV)
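The rearranged form of the distortion difference separates a constant that depends only on the two code vectors from an inner product with the input. A minimal check, with hypothetical function names and randomly generated integer test vectors:

```python
import random

def difference_direct(x, c_left, c_right):
    """Squared distance to the left code vector minus that to the right one."""
    d_left = sum((xj - cl) ** 2 for xj, cl in zip(x, c_left))
    d_right = sum((xj - cr) ** 2 for xj, cr in zip(x, c_right))
    return d_left - d_right

def precompute_node(c_left, c_right):
    """Per-node constant and coefficients; computed once, independent of x."""
    const = sum(cl * cl - cr * cr for cl, cr in zip(c_left, c_right))
    coeffs = [2 * (cr - cl) for cl, cr in zip(c_left, c_right)]
    return const, coeffs

def difference_fast(x, const, coeffs):
    """const + sum_j x_j * coeff_j: 16 multiplies and 16 adds, no subtractions."""
    return const + sum(xj * k for xj, k in zip(x, coeffs))

rng = random.Random(1)
x = [rng.randint(0, 255) for _ in range(16)]
c_left = [rng.randint(0, 255) for _ in range(16)]
c_right = [rng.randint(0, 255) for _ in range(16)]

const, coeffs = precompute_node(c_left, c_right)
print(difference_direct(x, c_left, c_right) == difference_fast(x, const, coeffs))
```

A negative result means the left code vector is closer, so only the sign of the fast form is needed to pick a branch.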
A Uni-Processor Implementation
[Figure: single processor with MEMORY (36K), data register (REG), and control, producing the 8-bit INDEX]
from [Lidsky94]
(VLSI Signal Processing, VII)
A Multi-Processor Implementation
[Figure: several processing stages, each with its own control, registers (REG), memory (MEM), and INDEX output; MEMORY (18K)]
Pav: 250 µW
VDD: 1.1 V
[Figure: cascade of stages from IN to OUT, each clocked at fs]
[Figure: signal amplitude versus discrete time index n (0 to 16000), and number of filter taps (filter length N, 0 to 400) versus sample number (0 to 16000)]
Number Representation
[Figure: transition probability (0.0 to 1.0) versus bit number for sign-magnitude and two's complement representations, each shown for rapidly varying and slowly varying inputs]
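The difference between the two representations can be reproduced by counting bit toggles directly. The sketch below uses synthetic data as a stand-in for the signals on the slide: a small-amplitude random signal that crosses zero often (rapidly varying) and the same values held for 50 samples each (slowly varying). The word length of 16 and all names are assumptions.

```python
import random

BITS = 16  # hypothetical word length

def twos_complement(value, bits=BITS):
    """Two's complement bit pattern of a signed integer."""
    return value & ((1 << bits) - 1)

def sign_magnitude(value, bits=BITS):
    """Sign bit in the MSB, magnitude in the remaining bits."""
    sign = (1 << (bits - 1)) if value < 0 else 0
    return sign | abs(value)

def transition_probability(samples, encode, bits=BITS):
    """Fraction of consecutive sample pairs in which each bit toggles."""
    words = [encode(s, bits) for s in samples]
    flips = [0] * bits
    for prev, cur in zip(words, words[1:]):
        diff = prev ^ cur
        for b in range(bits):
            flips[b] += (diff >> b) & 1
    return [f / (len(words) - 1) for f in flips]

rng = random.Random(0)
rapid = [rng.randint(-64, 63) for _ in range(2000)]   # crosses zero often
slow = [v for v in rapid[:40] for _ in range(50)]     # each value held 50 steps

tc_rapid = transition_probability(rapid, twos_complement)
sm_rapid = transition_probability(rapid, sign_magnitude)
tc_slow = transition_probability(slow, twos_complement)

print("bit 10, two's complement, rapid:", tc_rapid[10])
print("bit 10, sign magnitude,  rapid:", sm_rapid[10])
print("bit 10, two's complement, slow: ", tc_slow[10])
```

In two's complement every sign change toggles all the sign-extension bits together, while in sign magnitude only the single sign bit toggles and the unused upper magnitude bits stay at zero.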
[Figure: accumulator (ACC) datapaths in two's complement and in sign magnitude, each with reset (RST), clock CLK (64 MHz), and output clock OCLK (640 kHz); the sign-magnitude version routes the SIGN-BIT to control, takes a 3-bit magnitude input, and uses GATED CLK and GATED OCLK; bus widths of 3, 4, 13, and 14 bits are marked]
[Figure: transition activity (0.0 to 1.0) versus bit number: SUM (two's complement) compared with SUMA, SUMB, and SUMA + SUMB (sign-magnitude)]
[Figure: DataIn passes through a register onto a large-capacitance bus together with an invert line, and is recovered as DataOut]
from [Stan94]
(1994 International Workshop on Low-power Design)
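The invert line in the figure suggests a bus-invert style encoding: when more than half of the bus lines would toggle, the complement of the word is sent and the invert line is raised. Treating the figure this way is an assumption, and the sketch below (8-bit bus, bus initially at zero, hypothetical function names) is an illustration rather than the scheme as specified in [Stan94].

```python
def popcount(x):
    """Number of set bits in a non-negative integer."""
    return bin(x).count("1")

def bus_invert_encode(words, width=8):
    """Return a list of (bus_value, invert_bit) pairs."""
    mask = (1 << width) - 1
    prev_bus = 0  # assume the bus starts at all zeros
    encoded = []
    for word in words:
        word &= mask
        # Invert when more than half of the data lines would otherwise toggle.
        invert = popcount(word ^ prev_bus) > width // 2
        bus = (~word & mask) if invert else word
        encoded.append((bus, int(invert)))
        prev_bus = bus
    return encoded

def bus_invert_decode(encoded, width=8):
    """Recover the original words from (bus_value, invert_bit) pairs."""
    mask = (1 << width) - 1
    return [(~bus & mask) if invert else bus for bus, invert in encoded]

def data_line_transitions(bus_values):
    """Total toggles on the data lines, starting from an all-zero bus."""
    prev, total = 0, 0
    for value in bus_values:
        total += popcount(value ^ prev)
        prev = value
    return total

words = [0x00, 0xFF, 0x0F, 0xF0]
encoded = bus_invert_encode(words)
print(data_line_transitions(words))                     # 20 without encoding
print(data_line_transitions([b for b, _ in encoded]))   # 4 with encoding
```

The encoding bounds the data-line toggles per transfer at half the bus width, at the cost of one extra line that itself toggles occasionally.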
[Figure: two orderings of an addition involving IN, IN >> 7, and IN >> 8, each producing an intermediate sum SUM1 and a final sum SUM2; plots of transition probability (0.0 to 0.5) versus bit number for IN, SUM1, and SUM2 in each ordering]
[Figure: Counter 1 and Counter 2 driving separate buses (BUS1, BUS2) or one SHARED BUS; plot (0.0 to 10.0 versus 50 to 250) with the no-bus-sharing case marked]
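[Figure: energy measurement of an adder with inputs IN0, IN1, CIN and outputs SUM, COUT: a measurement circuit driven by Vdd idd (1pF, 100M) produces a voltage Vc that equals the energy in pJ; waveform of voltage (V) versus time (0 to 150 ns) for the input sequence 100, 010, 110, 001, 101, 011, 111, 000]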
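[Figure: capacitance (fF/bit) versus sample (10 to 60), Irsim-Cap compared with SPICE]
[Figure: circuit with signals A, B, X, and F]
Per-signal entries as extracted (column headings not recoverable): 20 fF, 40 fF, 13.8%; 10 fF, 30 fF, 10.3%; 100 fF, 200 fF, 69.0%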
[Figure: input in driving outputs out1, out2, and out3, with the level Vdd-Vth marked on the output waveforms]
[Figure: design exploration across abstraction levels, from algorithm (for example Z = X*Y; if (Z<0) then Z=0) through architecture (Mem, >>, Ctrl) to gate/circuit, trading off power, time, and area; the extremes are marked "Too Little" and "Too Late"]
[Figure: architecture-level power estimation: an RTL netlist (reg, Mem, +, Ctrl, between INPUT and OUTPUT), power models, and chip inputs produce the average power of modules, buses, clock, and control]
[Figure: chipset block diagram: radio modem, protocol module (2mW), video decompression module (2mW), speech codec, pen digitizer, and text/graphics frame-buffer module (1mW)]
[Figure: physical capacitance (pF) for the display, data, address, and SRAM data lines, annotated with f = 1/32, f = 1/16, and f = 1/2]
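[Figure: memory organization with row decode, blocks of cells, and sense amplifiers; only the selected block and its sense amplifier are active until the output is valid, and the remaining blocks are INACTIVE]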
[Figure: bitline circuit with precharge (PRE) devices, column select/cascode amp (SEL0, SEL1), bitline pairs B0 and B1, and output enable OEN]
Bitlines precharged to Vdd - Vtn
Memory Architecture
[Figure: memory cell array with row decoding, organized for serial access (4 bits per access at rate f) and for parallel access (8 nibbles read at rate f/8, latched, then multiplexed to a 4-bit output)]
Serial access: Voltage = 3V
Parallel access: Voltage = 1.1V
[Figure: Y, I, and Q signals]
Optimizing Multiplications
A = IN * 0.0011 (binary)
B = IN * 0.0111 (binary)
A = (IN >> 4) + (IN >> 3)
B = A + (IN >> 2)
[Figure: number of shift-add operations (6 to 14) for only scaling versus scaling and common sub-expression; horizontal axis from 1.10 to 1.25]
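The shift-add expressions above can be checked with integer arithmetic. The sketch below assumes the coefficients are binary fractions (0.0011 = 3/16 and 0.0111 = 7/16), which is what the shift amounts imply; function names are hypothetical, and the input is chosen as a multiple of 16 so that the right shifts lose no bits.

```python
def scale_separately(x):
    """Each product built on its own: 3 adds in total."""
    a = (x >> 4) + (x >> 3)              # x * 3/16: 1 add
    b = (x >> 4) + (x >> 3) + (x >> 2)   # x * 7/16: 2 adds
    return a, b

def scale_shared(x):
    """B reuses A as a common sub-expression: 2 adds in total."""
    a = (x >> 4) + (x >> 3)              # x * 3/16: 1 add
    b = a + (x >> 2)                     # x * 7/16: 1 more add
    return a, b

x = 160
print(scale_separately(x))  # (30, 70)
print(scale_shared(x))      # (30, 70)
```

Sharing the sub-expression removes one adder here; across a full set of coefficients the savings depend on how many bit patterns the scaled coefficients have in common.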
Time-multiplexed Architectures
[Figure: parallel busses for I (I0, I1, I2) and Q (Q0, Q1, Q2) compared with a time-multiplexed bus that alternates I and Q samples every T/2; plots of signal value over time for both cases]
Summary
Signal statistics can be exploited to minimize the number of transitions required to perform a given function.
Architectural voltage scaling is a key technique for low-voltage operation.
A variable power supply reduces power, and buffering trades latency for power.
A chipset for a portable terminal consumes orders of magnitude less power than current design approaches.
References
[Chandrakasan92] A. P. Chandrakasan, S. Sheng, and R. W. Brodersen, "Low-power Digital CMOS Design," IEEE Journal of Solid-State Circuits, pp. 473-484, April 1992.
[Chandrakasan94] A. P. Chandrakasan, A. Burstein, and R. W. Brodersen, "A Low-power Chipset for a Portable Multimedia I/O Terminal," IEEE Journal of Solid-State Circuits, December 1994.
[Chandrakasan95] A. P. Chandrakasan, R. Mehra, M. Potkonjak, J. Rabaey, and R. W. Brodersen, "Optimizing Power Using Transformations," IEEE Transactions on CAD, vol. 14, pp. 13-32, January 1995.
[Fang94] W.-C. Fang, C.-Y. Chang, B. J. Sheu, O. T.-C. Chen, and J. C. Curlander, "VLSI Systolic Binary Tree-searched Vector Quantizer for Image Compression," IEEE Transactions on VLSI Systems, vol. 2, no. 1, pp. 33-44, March 1994.
[Gutnik96] V. Gutnik and A. Chandrakasan, "An Efficient Controller for Variable Supply-Voltage Low Power Processing," 1996 VLSI Circuits Symposium, June 1996.
[Horowitz93] M. Horowitz, "Low Power Processor Design Using Self-Clocking," Workshop on Low-power Electronics, 1993.
[Landman93] P. Landman and J. Rabaey, "Power Estimation for High Level Synthesis," Proc. EDAC-EUROASIC 93, Paris, France, pp. 361-366, February 1993.
[Lidsky94] D. Lidsky and J. Rabaey, "Low-Power Design of Memory Intensive Functions: Case Study: Vector Quantization," VLSI Signal Processing, VII, IEEE Press, New York, NY, October 1994.
[Ludwig95] J. Ludwig, S. Nawab, and A. P. Chandrakasan, "Low Power Filtering Using Approximate Processing for DSP," IEEE 1995 Custom Integrated Circuits Conference, pp. 185-188, May 1995.
[Macken90] P. Macken, M. Degrauwe, M. Van Paemel, and H. Oguey, "A Voltage Reduction Technique for Digital Systems," ISSCC, February 1990.
[Nielsen94] L. Nielsen, C. Niessen, J. Sparso, and K. van Berkel, "Low-Power Operation Using Self-Timed Circuits and Adaptive Scaling of Supply Voltage," IEEE Transactions on VLSI Systems, pp. 391-397, December 1994.
[Raje95] S. Raje and M. Sarrafzadeh, "Variable Voltage Scheduling," International Symposium on Low Power Design, pp. 9-13, April 1995.
[Stan94] M. Stan and W. Burleson, "Limited-weight Codes for Low-power I/O," 1994 International Workshop on Low-power Design, pp. 209-214, April 1994.
[Stratakos94] A. Stratakos, S. Sanders, and R. Brodersen, "A Low-Voltage CMOS DC-DC Converter for a Portable Battery-Operated System," IEEE Power Electronics Specialists Conference, pp. 619-626, 1994.
[Usami95] K. Usami and M. Horowitz, "Clustered Voltage Scaling Technique for Low Power Design," International Symposium on Low Power Design, pp. 3-8, April 1995.
[Xanthopoulos96] T. Xanthopoulos, A. Chandrakasan, C. Sodini, and B. Dally, "A Data-Driven IDCT Architecture for Low Power Video Applications," IEEE ESSCIRC, September 1996.