Low Power Vlsi Design 2

Low Power Design in VLSI
Why Low Power ?
• Growth of battery-powered systems

• Users need for:
– Mobility
– Portability
– Reliability
• Cost
• Environmental effects
IC Design Space:
Power Impacts on System Design:
• Energy consumed per task determines battery life
– Second order effect is that higher current draws decrease
effective battery energy capacity
• Current draw causes IR drops in power supply voltage
– Requires more power/ground pins to reduce resistance R
– Requires thick&wide on-chip metal wires or dedicated
metal layers
• Switching current (dI/dT) causes inductive power supply
voltage bounce  LdI/dT
– Requires more pins/shorter pins to reduce inductance L
– Requires on-chip/on-package decoupling capacitance to
help bypass pins during switching transients
• Power dissipated as heat, higher temps reduce speed and
reliability
– Requires more expensive packaging and cooling systems
Introduction
• Most SOC design teams now regard power as one
of their top design concerns
• Why low-power design?
 Battery lifetime (especially for portable devices)
 Reliability
• Power consumption
 Peak power
 Average power
National Central University EE4012VLSI Design 5

Overview of Power Consumption
• Average power consumption
 Dynamic power consumption(Switching power
consumption)
 Short-circuit power consumption
 Leakage power consumption
 Static power consumption
• Dynamic power dissipation during switching
Cinput
interconnect
Cdrain Cinput

• Generic representation of a CMOS logic gate for
switching power calculation
VA
pMOS
VB network
Vout
VA nMOS Cdrain  C int erconnect  C input
VB network
1
P  [ T2 / T
avg 
T 0 Vout (Cload dVout )dt 
dt T/
dV
(VDD Vout )(Cload dtout )dt]
2

• The average power consumption can be expressed
as
2 2
1
Pavg  T C load V DD  load V DD f CLK
C
• The node transition rate can be slower than the
clock rate. To better represent this behavior, a
node transition factor ( T ) should be
introducedP   C f
avg T 2
load V DD CLK
• The switching power expressed above are derived

by taking into account the output node load
capacitance

VA VA
Vinternal VB
VB
Cinternal Vinternal
Vout
VA VB Vout
Cload
The generalized expression for the average power dissipation

can be rewritten as

Pavg 


 # ofnodes C V V
Ti
i 1
i i DD f

CLK

Switching Power Dissipation (Contd.):
• Ps/w= 0.5 *  * CL* Vdd2 * fclk

– CL physical capacitance, Vdd supply voltage,  switching activity, fclk
clock frequency.
– CL(i) = Σj CIN + Cwire +

j Cpar(i)
– CIN the gate input capacitance, Cwire the parasitic interconnect and
Cpar diffusion capacitances of each gate[I].
• Depends on:
– Supply voltage
– Physical Capacitance
– Switching activity
Choice of Logic
Style
Choice of Logic
Style
• Power-delay product improves as voltage decreases

• The “best” logic style minimizes power-delay for a
Power Consumption is Data
Dependent
• Example : Static 2 Input NOR Gate

Assume :
P(A=1) = ½
P(B=1) = ½
Then :
P(Out=1) = ¼
P(0→1)
=
P(Out=0).P(O
ut=1)
=3/4 * 1/4 =
3/16
CEFF = 3/16 * CL
Transition Probability of 2-
input NOR Gate
as a function of input probabilities

Switching Activity (α) :
Example
Glitching in Static
CMOS
At the Datapath Level…
Irregular Reusable
Balancing
Operations
Carry
Ripple
Data Representation
(Binary v.s. Gray Encoding)
Low Power Design Consideration
(cont’)
Resource Sharing Can Increase
Activity
(Separate Bus Structure)
Resource Sharing Can
Increase Activity
(cont’d)
Operating at the Lowest
Possible Voltage
• Desire to operate at lowest possible speeds

(using low supply voltages)
• Use Architecture optimization to compensate for
slower operation
Approach : Trade-off AREA for lower

POWER
Reducing Vdd
Lowering Vdd Increases Delay
• Concept of Dynamic Voltage Scaling (DVS)

Architecture Trade-offs :
Reference Data
Path
Parallel Data Path
Paralelna implementacija dela
datapath :
Pipelined Data Path
Protočna
implementacija:
Paralelno-protočna implementacija:
Gate-Level Design – Technology Mapping
• The objective of logic minimization is to reduce the
boolean function.
• For low-power design, the signal switching activity
is minimized by restructuring a logic circuit
• The power minimization is constrained by the
delay, however, the area may increase.
• During this phase of logic minimization, the
function to be minimized is

i
P i (1  P i ) C i

Gate-Level Design – Technology Mapping
• The first step in technology mapping is to decompose
each logic function into two-input gates
• The objective of this decomposition is to minimizing the
total power dissipation by reducing the total switching
activity
A 0.2 0.038

B 0.2 0.0196 
0.0099
C 0.5
D 0.5
A
B
C A 0.2
D   0.0384
B 0.2
C 0.5 
0.0099
D 0.5 
0.1875
Gate-Level Design – Phase Assignment
High activity node

High activity node
A
A
B
B
C
C

Gate-Level Design – Pin Swapping
a b c d a b c d
d a
Switching activity
Switching activity
c b
b
c
a d
d a
c
b b
a c
d

Gate-Level Design – Glitching Power
• Glitches
 spurious transitions due to imbalanced path delays
• A design has more balanced delay paths
 has fewer glitches, and thus has less power dissipation
• Note that there will be no glitches in a dynamic CMOS

logic
A
A
B
D
C
B E
C D
E

Gate-Level Design – Glitching Power
• A chain structure has more glitches
• A tree structure has fewer glitches
A
B
C Chain structure
D
B Tree structure
C
D
Gate-Level Design – Precomputation
REG REG
Combinational Logic
R1 R2
REG REG
Combinational Logic
R1 R2
Precomputation
Logic

Gate-Level Design – Precomputation
A<n-1> REG 1-bit Comparator

B<n- R1 (MSB)
1>
REG
A<n-2:0>
R2
(n-1)-bit REG
Enable
Comparator R4
Precomputation logic F
REG
B<n-2:0>
R3

Gate-Level Design – Gating Clock
D Q D D D
Q Q Q Fail DFT rule
clk checking
D Q D Q D Q Add control pin

D Q to solve DFT
violation
problem
clk

Gate-Level Design – Input Gating
f1
clk
+
select
f2

Clock-Gating in Low-Power Flip-Flop
D D Q
CK
Source: Prof. V. D. Agrawal

Reduced-Power Shift Register
D D Q D Q D Q D Q
multiplexer
Output
CK(f/2)
Flip-flops are operated at full voltage and half the clock frequency.

Power Consumption of Shift Register
16-bit shift register, 2μ CMOS

P = C’VDD2f/n
Deg. Of Freq Power 1.0
parallelism (MHz) (μW)
1 33.0 1535
Normalized power
2 16.5 887
4 8.25 738 0.5
0.25
C. Piguet, “Circuit and Logic Level

Design,” pages 103-133 in W. Nebel 0.0
and J. Mermet (ed.), Low Power 1 2 4
Design in Deep Submicron Degree of parallelism, n
Electronics, Springer, 1997.
Architecture-Level Design – Parallelism
16 16
A R A R
32 16 32
16x16 16x16
fref multiplier fref/2 multiplier
16 R
B R
M 32
fref/2 X
U
fref
Assume that With the same 16x16 R

multiplier, the power supply can fref
be reduced from Vref to Vref/1.83.
V ref f 16x16
ref
( fref/2 multiplier
2  16 32
Pparallel  2.2Cref 1.83)
ref
2 0.33P B R
16
fref/2
Architecture-Level Design – Pipelining
The hardware between the pipeline stages is reduced then
the reference voltage Vref can be reduced to Vnew to maintain
the same worst case delay. For example, let a 50MHz
multiplier is broken into two equal parts as shown below. The
delay between the pipeline stages can be remained at
50MHz when the voltage Vnew is equal to Vref/1.83
32 Half Half 32
(A ,B) REG REG
multiplier multiplier
fref
V ref
(
Ppipeline  1.2C ref ) 2 fref  0.36 ref
1.83 P
Architecture-Level Design – Retiming
Retiming is a transformation technique used to change the
locations of delay elements in a circuit without affecting the
input/output characteristics of the circuit.
Two versions of an IIR filter.

(1) (1)
x(n) y(n) x(n) y(n)
D D a 2D D a
w(n)
(2) (1) D (2) 2D
(1)
w1(n)
b retiming D b
w2(n)
(2) (2)

Retiming for pipeline design
REG C1 C2 REG C3
(6ns) (2ns) (4ns)
fref
REG C1 REG C2
C3
(6ns) (2ns)
(4ns)
fref

Clock cycle is 4 gate delays
Clock cycle is 2 gate delays

Architecture-Level Design – Power Management
C2
C1
C1_FREEZE
C2
C2_FREEZE
C1
C1_FREEZE
C2_FREEZE
Architecture-Level Design – Bus Segmentation
• Avoid the sharing of resources
 Reduce the switched capacitance
• For example: a global system bus
 A single shared bus is connected to all modules, this
structure results in a large bus capacitance due to
 The large number of drivers and receivers sharing the same
bus
 The parasitic capacitance of the long bus line
• A segmented bus structure
 Switched capacitance during each bus access is
significantly reduced
 Overall routing area may be increased

Architecture-Level Design – Bus Segmentation
Cbus
Cbus1
Interface
Bus
Cbus1

Algorithmic-Level Design – factivity Reduction
Minimization the switching activity, at high level, is one way to
reduce the power dissipation of digital processors.
One method to minimize the switching signals, at the
algorithmic
level, is to use an appropriate coding for the signals rather than
straight binary code.
The table shown below shows a comparison of 3-bit representation
of the binary and Gray codes.
Binary Code Gray Code Decimal Equivalent
000 000 0
001 001 1
010 011 2
011 010 3
100 110 4
101 111 5
110 101 6
111 100 7
State Encoding for a Counter
• Two-bit binary counter:
 State sequence, 00 → 01 → 10 → 11 → 00
 Six bit transitions in four clock cycles
 6/4 = 1.5 transitions per clock
• Two-bit Gray-code counter
 State sequence, 00 → 01 → 11 → 10 → 00
 Four bit transitions in four clock cycles
 4/4 = 1.0 transition per clock
• Gray-code counter is more power efficient.
G. K. Yeap, Practical Low Power Digital VLSI Design, Boston:
Kluwer Academic Publishers (now Springer), 1998.
Binary Counter: Original Encoding
a
Present Next state
state A
b
a b A B
B
0 0 0 1
0 1 1 0
1 0 1 1
1 1 0 0
A = a’b + ab’ CK
B = a’b’ + ab’ CLR
Binary Counter: Gray Encoding
Present a
Next state
state
A
a b A B
0 0 0 1 B
b
0 1 1 1
1 0 0 0
1 1 1 0
A = a’b + ab CK
B = a’b’ + a’b CLR
Three-Bit Counters
Binary Gray-code
State No. of toggles State No. of toggles
000 - 000 -
001 1 001 1
010 2 011 1
011 1 010 1
100 3 110 1
101 1 111 1
110 2 101 1
111 1 100 1
000 3 000 1
Av. Transitions/clock = 1.75 Av. Transitions/clock = 1

N-Bit Counter: Toggles in Counting Cycle
• Binary counter: T(binary) = 2(2N – 1)

• Gray-code counter: T(gray) = 2N
• T(gray)/T(binary) = 2N-1/(2N – 1) → 0.5
Bits T(binary) T(gray) T(gray)/T(binary)
1 2 2 1.0
2 6 4 0.6667
3 14 8 0.5714
4 30 16 0.5333
5 62 32 0.5161
6 126 64 0.5079
∞ - - 0.5000

FSM State Encoding
Transition
probability
based on 0.6 0.6
PI statistics
11 01
0.3 0.3
0.1 0.1
0.4 0.4
00 01 00 11
0.1 0.9 0.1 0.9
0.6 0.6
Expected number of state-bit transitions:
2(0.3+0.4) + 1(0.1+0.1) = 1.6 1(0.3+0.4+0.1) + 2(0.1) = 1.0

State encoding can be selected using a power-based cost function.

FSM: Clock-Gating
• Moore machine: Outputs depend only on the
state variables.
 If a state has a self-loop in the state transition
graph (STG), then clock can be stopped
whenever a self-loop is to be executed.
Xi/Zk
Si
Sk Xk/Zk
Clock can be stopped

Sj Xj/Zk when (Xk, Sk) combination
occurs.

Clock-Gating in Moore FSM
PI
Combinational
logic PO
Flip-flops
Clock
activation Latch
L. Benini and G. De Micheli,
logic
Dynamic Power Management,
CK Boston: Springer, 1998.

Bus Encoding for Reduced Power
• Example: Four bit bus
 0000 → 1110 has three transitions.
 If bits of second pattern are inverted, then 0000 →
0001 will have only one transition.
• Bit-inversion encoding for N-bit bus:
N
after inversion encoding
Number of bit transitions
N/2
0
0 N/2 N
Number of bit
National Central University
transitions
EE4012VLSI Design 64
Bus-Inversion Encoding Logic
Received data
Sent data
Bus register
Polarity
M. Stan and W. Burleson, “Bus-Invert
decision
Polarity bit Coding for Low Power I/O,” IEEE
logic
Trans. VLSI Systems, vol. 3, no. 1, pp.
49-58, March 1995.

RTL-Level Design – Signal Gating
Simple Decoder Decoder with enable
module decoder (a, sel); module decoder (en,a, sel);

input [1:0[ a; input en;
ouput [3:0] sel; input [1:0[ a;
reg [3:0] sel; ouput [3:0] sel;
always @(a) begin reg [3:0] sel;
case (a) always @({en,a}) begin
2’b00: case ({en,a})
sel=4’b0001; 3’b100: sel=4’b0001;
2’b01: 3’b101: sel=4’b0010;
sel=4’b0010; 3’b110: sel=4’b0100;
2’b10: 3’b111: sel=4’b1000;
sel=4’b0100; default: sel=4’b0000;
2’b11: endcase
sel=4’b1000; end
endcase endmodule
end
endmodul
e
RTL-Level Design – Datapath Reordering
Initial
Reordered
stable Mux
A<B Mux
glitchy
glitchy
A<B Mux
Mux
stable

RTL-Level Design – Memory Partition
128x32
din
32
addr dout
write
noe
q addr[7:0]
8 M
pre_addr d addr[7:1] dout
32 U
X
clk noe
write
dout
addr 32
din
addr0 128x32

• Application-driven memory partition
Reads
64K bytes
Data
ARM
Core Addr
R/W
Addr
28K 4K 32K Range
64K

• A power-optimal partitioned memory organization
Decoder
ARM
R/W
Addr
Data
R/W
R/W
Addr
Addr
Data
Data
Core CS
CS
CS

Low Power Vlsi Design 2

Uploaded by

Document Information

Original Description:

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Low Power Vlsi Design 2

Uploaded by

Copyright:

Available Formats

Low Power Design in VLSI

Why Low Power ?

• Growth of battery-powered systems

National Central University EE4012VLSI Design 5

• Dynamic power dissipation during switching

National Central University EE4012VLSI Design 6

National Central University EE4012VLSI Design 7

• The switching power expressed above are derived

National Central University EE4012VLSI Design 8

The generalized expression for the average power dissipation

National Central University EE4012VLSI Design 9

• Ps/w= 0.5 *  * CL* Vdd2 * fclk

– CL(i) = Σj CIN + Cwire +

• Power-delay product improves as voltage decreases

• Example : Static 2 Input NOR Gate

as a function of input probabilities

• Desire to operate at lowest possible speeds

Approach : Trade-off AREA for lower

• Concept of Dynamic Voltage Scaling (DVS)

National Central University EE4012VLSI Design 34

High activity node

National Central University EE4012VLSI Design 36

National Central University EE4012VLSI Design 37

• Note that there will be no glitches in a dynamic CMOS

National Central University EE4012VLSI Design 38

National Central University EE4012VLSI Design 40

A<n-1> REG 1-bit Comparator

National Central University EE4012VLSI Design 41

D Q D Q D Q Add control pin

National Central University EE4012VLSI Design 42

National Central University EE4012VLSI Design 43

Source: Prof. V. D. Agrawal

National Central University EE4012VLSI Design 44

Source: Prof. V. D. Agrawal

16-bit shift register, 2μ CMOS

C. Piguet, “Circuit and Logic Level

Assume that With the same 16x16 R

Two versions of an IIR filter.

National Central University EE4012VLSI Design 49

National Central University EE4012VLSI Design 50

Clock cycle is 4 gate delays

Clock cycle is 2 gate delays

National Central University EE4012VLSI Design 51

National Central University EE4012VLSI Design 53

National Central University EE4012VLSI Design 54

Source: Prof. V. D. Agrawal

• Binary counter: T(binary) = 2(2N – 1)

Source: Prof. V. D. Agrawal

2(0.3+0.4) + 1(0.1+0.1) = 1.6 1(0.3+0.4+0.1) + 2(0.1) = 1.0

Source: Prof. V. D. Agrawal

Clock can be stopped

Source: Prof. V. D. Agrawal

Source: Prof. V. D. Agrawal

Source: Prof. V. D. Agrawal

module decoder (a, sel); module decoder (en,a, sel);

National Central University EE4012VLSI Design 67

National Central University EE4012VLSI Design 68

• Application-driven memory partition

National Central University EE4012VLSI Design 69

• A power-optimal partitioned memory organization

You might also like