Professional Documents
Culture Documents
Sill LowPower2
Sill LowPower2
Sill LowPower2
franksill@ufmg.br
http://www.cpdee.ufmg.br/~frank/
Agenda
Recap
Power reduction on
Gate
level
Architecture
Algorithm
System
level
level
level
Continuously increasing
performance demands
Reduced
Reducedtime
timeof
ofoperation
operation
High
Highefforts
effortsfor
forcooling
cooling
Higher
Higherweight
weight(batteries)
(batteries)
Increasing
Increasingoperational
operationalcosts
costs
Reduced
Reducedmobility
mobility
Reduced
Reducedreliability
reliability
Voltage (Volt, V)
Current (Ampere, A)
Energy
CL
Short-circuit power
( 10 % today and
decreasing
absolutely)
Leakage power
( 20 50 %
today and
increasing)
0.5
A
B
0.5
(1-0.25)*0.25 = 3/16
W
7/64 = 0.109
X
15/256
C
F
0.5
D
0.5
0.5 A
0.5 B
0.5
C
0.5 D
3/16
Y
15/256
F
Z
3/16 = 0.188
X
C
0.1
(1-0.2x0.1)*(0.2x0.1)=0.0196
0.2
B
X
C
F
0.1
A
0.5
AND: P01 = (1 - PAPB) * PAPB
Recap: Glitching
A
B
X
Z
ABC
101
000
X
Z
Unit Delay
Source: Irwin, 2000
10
Basic elements:
Logic
gates
Sequential
11
Optimal
fcircuit = 20
fopt_energy
fopt_performance = 4.47
1.5
normalized energy
Device
fcircuit=1
fcircuit=2
fcircuit=5
fcircuit=10
0.5
fcircuit=20
= 3.53
0
For
fanout f
12
Pdyn
td
13
Multiple VDD
Main ideas:
At design phase:
For low VDD: prefer gates that drive large capacitances (yields
the largest energy benefits)
14
Level converters:
VDDH
VDDL
Vout
15
Data Paths
Paths
Path
16
A
G1 ready with
evaluation
B
Y
all inputs of G2
arrived
all Inputs of G1
arrived
C
delay of G1
Copyright Sill, 2008
Slack for G1
time
17
Minimum energy consumption when all logic paths are critical (same
delay)
Each path starts with VDDH and switches to VDDL (blue gates)
when slack is available
18
Base elements:
Register structures
Memory elements
19
Clock Gating
R
Functional
e
unit
g
clock
disable
20
CLK
21
Flip-flops
PI
Clock
activation
logic
CLK
Copyright Sill, 2008
Combinational
logic
PO
Latch
Source: L. Benini and G. De Micheli,
Dynamic Power Management, Boston: Springer, 1998.
22
30.6mW
With clock gating
8.5mW
10
15
VDE
20
25
MIF
Power [mW]
896Kb SRAM
DEU
DSP/
HIF
23
Pdyn
td
24
Input
Combinational
logic
Register
Register
A Reference Datapath
Output
Cref
CLK
Supply voltage
Total capacitance switched per cycle
Clock frequency
Power consumption:
Pref
= Vref
= Cref
= fClk
= CrefVref2fclk
Source: Agarwal, 2007
25
Comb.
Logic
Copy 2
Multiphase
Clock gen.
and mux
control
fclk/N
Register
fclk/N
Output
fclk
Comb.
Logic
Copy N
CK
Copyright Sill, 2008
N = Deg. of
parallelism
Register
Comb.
Logic
Copy 1
Supply voltage:
VN Vref
N to 1 multiplexer
Input
Register
Register
Parallel Architecture
26
Pipelined Architecture
Constant throughput
A/N
Area A
A/N
CLK
CLK
A/N
Functionality:
Data
CLK
Copyright Sill, 2008
27
VDD = Vref = 5V
28
The clock rate can be reduced by half with the same throughput
fpar = fref / 2
Vpar = Vref / 1.7, Cpar = 2.15 Cref
Ppar = (2.15 Cref) (Vref / 1.7)2 (fref / 2) = 0.36 Pref
Source: Irwin, 2000
29
30
Approximate Trend
N-parallel proc.
Capacitance
N*Cref
Cref
Voltage
Vref/N
Vref/N
Frequency
fref/N
fref
Dynamic Power
CrefVref2fref/N2
CrefVref2fref/N2
Chip area
N times
10-20% increase
31
Guarded Evaluation
Reduction of switching activity by adding latches at
inputs
A
A
B
C
B
Multiplier
C
condition
Latch
Multiplier
condition
32
Precomputation
Precomputed
inputs
R1
Gated
inputs
Precomputation
logic
R2
g(X)
Combination
logic f(X)
Outputs
Load
disable
33
Design steps
1. Selection of precomputation architecture
2. Determination of precomputed and gated inputs (Register R1
should be much smaller than R2)
3. Search good implementation for g(X)
4. Evaluation of potential energy savings based on input statistics
(if savings not sufficient go to step 2 or 3 and try again)
34
Precomputation: Example
Binary Comparator
An
Bn
An-1
Bn-1
A1
B1
R1
R2
Load
disable
An = Bn
A> B
35
Adder Design
Carry Select
Ripple Carry
FA
FA
FA
FA FA
FA
FA
FA
FA
FA
FA
FA
FA
FA
FA
FA
Carry Look-ahead
FA
FA
FA
FA
Source: Mendelson, Intel
36
Adder Design
Energy
(pJ)
Ripple Carry
Constant Width Carry Skip
Variable Width Carry Skip
Carry Lookahead
Carry Select
Conditional Sum
117
109
126
171
216
304
Delay
(nSec)
54.27
28.38
21.84
17.13
19.56
20.05
37
Bus Power
Caused by:
High switching activities
Large capacitive loading
Wout
Xout
Yout
Zout
Bus
receivers
Bus
Ain
Bin
Cin
Din
Bus
drivers
Source: Irwin, 2000
38
Code compression
39
40
Bus segmentation
Another
Control
Shared Bus
B
Segmented Bus
B
41
Base elements:
Functions
Procedures
Processes
Control
structures
42
Coding styles
Variable types
43
Source-code Transformations
Computation
A*B+A*C
A*(B+C)
Communication
receive (A)
for (c = 1..N)
B=c*A
for (c = 1..N)
receive (A)
B=c*A
Storage
for (c = 1..N)
B[c] = A[c]*D[c]
for (c = 1..N)
F[c] = B[c]-1
for (c = 1..N)
F[c] = A[c]*D[c]-1
44
14000
12000
10000
Others
Functional Unit
Pipeline Registers
Register File
8000
6000
4000
2000
0
bubble.c
heap.c
quick.c
45
Active
Idle Active
Idle
Active
3.3 V
2.4 V
46
Speed
T1
T2
T1
T2
Same work,
lower energy
Task
Idle
Task
Time
Copyright Sill, 2008
Time
47
Complex modules
Processors
Sensors
ALU
MEM
Basic Elements:
MEM
MP3
48
Systems are:
Power manager:
49
50
66Mhz
80Mhz
No power mgmt
Dynamic power mgmt
2.18W
1.89W
2.54W
2.20W
DOZE
307mW
366mW
NAP
113mW
135mW
SLEEP
89mW
105mW
18mW
19mW
2mW
2mW
51
Transmeta LongRun
LongRun policies:
Processor frequency
52
Source: Transmeta
Copyright Sill, 2008
53
Source: Transmeta
Copyright Sill, 2008
54
Capacity (mAh)
1000 mAh
(Standard
Capacity)
200
125mA
( Rated Current)
Available
Charge
(mA)
Discharge
Current
(mA)
time
idle
time
Source: Timmermann, 2007
55
Electrode
After a recent
discharge
Electrode
After
Recovery
Electrode
Fully
charged
battery
Electrode
Fully
discharged
Electro-active
species
56
Battery Voltage
Source: LaFollette, Design and performance of high specific power, pulsed discharge,
bipolar lead acid batteries, 10th Annual Battery Conference on Applications and Advances,
Long Beach, pp. 4347, January 1995.
Copyright Sill, 2008
57
Current [mA]
Current [mA]
Discharge profile A
Discharge profile B
Profile
123.8
357053
15.12
124.2
536484
18.58
58
Backup
59
FSM: Clock-Gating
Xj/Zk
Xk/Zk
Clock can be stopped
when (Xk, Sk) combination
occurs.
60
Trend: Interconnects
Interconnects
Propagation delays of
global wires will be a
multiple of the clock
cycle.
Example (very optimistic):
610 clock cycles in
50nm technology
[Benini, 2002]
61
Bus Multiplexing
or
62
63
Bus Multiplexing
Example:
S2 odd
S1
S2
D1
S1
D1
D2
S2
D2
64
MSB
Bit position
LSB
Source: Irwin, 2000
65
66
Implementation
Power-Speed
Control Knob
Workload
Filter
Variable
Power-Speed
System
FIFO Input Buffer
67