05 Timing Clocking p6

You might also like

Download as pdf or txt
Download as pdf or txt
You are on page 1of 11

Issues on Timing and Clocking

X Z
Combinational
Logic

Q FF
D

FF

FF
... clock
FF clock period
clock

ECE152B TC 1 ECE152B TC 2

Latch and Flip-Flop Clocking

D
X Z
L Combinational • For correct operation of a
Q
D Q
Logic synchronous circuit:
• The clock period must be
Q CK Q Q FF D longer than the delay of
g p
the longest path in the
CK FF combinational logic.
L1 L2 FF • The width of the clock
D Q
Q1
D Q Q2 ... pulse must be long enough
D1
to allow the flip-flops to
FF
CK Q CK Q
clock change state.

CK clock
clock period
ECE152B TC 3 ECE152B TC 4

Flip-Flop Setup and Hold Times Flip-flip Timing Parameters


• Flip-flop setup time Tsu: the required time the data
input signal value must be held stable prior to the
D Q
arrival of clock pulse.
Input 50% 50% Clk
D Q date

FF clk T
50%
Input Clk
clk Setup time Hold time
Q thold
D
tsu
z Flip-flop hold time Th: the required time the data
tc-q
input signal value must be held stable after the Q
arrival of clock pulse.
Delays can be different for rising and falling data transitions
ECE152B TC 5 ECE152B TC 6
Example: For the circuit shown below, assume the
A Simple RC Model for Logic Gates
delay through the register (tpd) is 0.6 and the delay
through each logic block is indicated inside the box.
Assume that the registers, which are positive edge- G
triggered, have a set-up time Tsu of 0.4. What is the equivalent
minimum clock period? circuit
A B A Rout B
logic
G
tpd=55 Cinput
a buffer/gate
logic logic
tpd=2 tpd=2
register register
logic logic
tpd=5 tpd=3
tθ tθ’
Clock θ

ECE152B TC 7 ECE152B TC 8

Interconnect Models Interconnect Models as a Capacitor


C1 1. Ignoring Interconnects: 2. Lumped Capacitance model:

C2 Rd Rd
Vout Vout

driver C3
Cinput Coutput C1+ C2+ C3 Cinput Coutput C1+ C2+ C3
+ Cwire
1. Ignoring interconnects
VOUT
2. Lumped capacitance model Vout(t) = VDD (1 – e–t/T) T = Rd (Coutput+ C1+ C2+ C3)
1.0 x VDD or
3. RC tree model T = Rd (Coutput+ C1+ C2+ C3 + Cwire)
4. RLC tree model 0.5 x VDD
Vout-1(0.5VDD) = (ln 2) T = 0.693 T
5. Transmission line models (RC, LC, RLC) T 2T time
Vout(T) = 0.632 VDD
6. RC / RLC Network model
ECE152B TC 9 ECE152B TC 10

Analysis of Simple RC Circuit


Analysis of Simple RC Circuit
dv ( t )
Zero-input response: RC + v (t ) = 0
dt
i(t) (natural response) 1 dv(t) 1 − t
= − ⇒ v N (t) = Ke RC
R v(t) dt RC
R ⋅ i (t ) + v (t ) = vT (t ) vT(t) ± C v(t)
d ( Cv ( t )) dv ( t ) dv ( t )
Step-input response: + v (t ) = v 0 u (t )
i (t ) = = C RC
dt
dt dt v F ( t ) = v 0 u ( t ) ⇒ v ( t ) = Ke
K
− t
RC
+ v 0u (t )
dv ( t )
⇒ RC + v (t ) = vT (t ) first-order linear v0u(t) match initial state:
dt v0 v ( 0 ) = 0 ⇒ K + v 0 u (t ) = 0
differential equation
state with constant v0(1-e-t/RC)u(t) output response for step-input:
variable coefficients v ( t ) = v 0 (1 − e
− t
RC
)u (t )
Input
waveform You can get the same result by Laplace Transform

ECE152B TC 11 ECE152B TC 12
Delays of Simple RC Circuit Interconnect Models as a Tree
3. RC tree model
v(t) = v0(1 - e-t/RC) -- waveform under step input v0u(t)
4. RLC tree model
v(t)=0.5v0 ⇒ t = 0.7RC 5. Transmission line models (RC, LC, RLC)
– i.e., delay = 0.7RC (50% delay)
C2
v(t)=0.1v0 ⇒ t = 0.1RC
v(t)=0 9v0 ⇒ t = 22.3RC
v(t)=0.9v 3RC driver C3
– i.e., rise time = 2.2RC
Rise time (Fall time): time for a waveform to rise from
10% to 90%(90% to 10%) of its steady state value
or
VOH VOH transmission
VOL line
VOL
L-type π-type T-type
Rise time Fall time
ECE152B TC 13 ECE152B TC 14

Determining Which Model to Use Model Interconnects as RC Trees

Some Rule-of-Thumbs:
• Each wire maybe
Need to consider C: segmented into several
– if interconnect C is comparable to C of gates edges
driven • Each edge E modeled as a
Need to consider R: π-type or L-type circuit
– if interconnect R is comparable to R of • rE = unit res. × length(E)
driver • cE = unit cap. × length(E)
Need to consider L:
– if ωL is comparable to R of interconnect

ECE152B TC 15 ECE152B TC 16

Interconnect Delay: Putting All Models Together


Clock Tree
G1 Rin G2
ƒ If the # of flip-flops driven by the clock line is large, the
Rout clock rise time (also called slew rate) will be
+ Cin
G1 G2
CL unacceptably long.
- ƒ Solution: Using clock power-up tree (adding buffers into
Vin the clock tree)
R= Rout+Rint CLK
Vin Vout
v0
v0(1-e-t/RC) Vout
C=Cin+CL

time ... ...


Rise/fall time ≅ 2.2 RC =>
• Delay is proportional to the loading capacitance CL
ƒ Driving more gates result in longer rise/fall time ... ... ...
ƒ Longer interconnects (larger Rint and Cin) also result in longer rise/fall time FF FF
ECE152B TC 17 ECE152B TC 18
Solution: Adding Buffers and Limiting
Number of Their Fanouts An Example
• Limiting the number of fanouts of each buffer to N ƒ Assume 64 flip-flips to be driven by a single clock source
• Create a buffer tree to drive all flip-flops while satisfying ƒ Buffer delay (with zero load): 0.2 ns
the constraint of fanout count ƒ Interconnect delay:
ƒ 0.2 ns with a single fanout (either to a flip-flop or to a buffer)
• The load seen by the clock source is significantly reduced
ƒ Addition 0.1ns delay for each additional fanout
• The same idea can be used to reduce the delay of logic
Clk source
signals which drive a large number of gates and are on
timing-critical paths
... ... ...
FF FF

ƒ Total delay from clock source to clock ports of flip-flops: 6.9 ns


0.2ns + 0.2ns + (0.2ns + 63 x 0.1ns) = 6.9ns
ECE152B TC 19 ECE152B TC 20

Techniques for Improving Speed


Example – Clock Tree
ƒ Assume each buffer has four fanouts 1. Keep the logic gate depth shallow between flip-flops.
Clk Source 2. Avoid circuit designs that have highly loaded gates in
the critical path.
– A gate delay will increase as the capacitive load is
increased on the output of the gate.
... ...
– The
Th primary
i sources off load
l d capacitance
it are routing
ti
capacitance and the input capacitance of the driven
... ... ...
FF FF gates.
ƒ Total delay from clock source to clock ports of flip-flops: 2.3ns 3. Duplicate logic to reduce fanouts (similar idea to clock
0.2ns {0th-level wire delay} + 0.2ns {1st-level buffer delay} +
tree buffering).
(0.2ns + 3 x 0.1ns) {1th-level wire delay} + 0.2ns {2nd-level buffer delay} + 4. Avoid long interconnects
(0.2ns + 3 x 0.1ns) {2nd-level wire delay} + 0.2ns {3rd-level buffer delay} + 5. Gate sizing
(0.2ns + 3 x 0.1ns) {3rd-level wire delay}

ECE152B TC 21 ECE152B TC 22

Gate Sizing: Making The Driving Gate


Larger (or Smaller) Static Timing Analysis 101
‹ Larger the driving gate => Greater the driving current (so
sharper Vin), but larger Cinput too (which slows down the
Timing
signal coming into G1)
G1 Specs
G2
Vin B Rin REPORTS
D
A B D G2 + Rout
G1 Cin CL
-

G1 v0
v0(1-e-t/RC)
A Rout B
Cinput
time (courtesy P. Joshi, IBM)
ECE152B TC 23 ECE152B TC 24
Static Timing Analysis Static Timing Analysis
PI1 1 4 6 5 PO1
0/4 1/5 7/9 13/15
18/22 PO1
PI1 1 4 6 5
Netlist with delay for PI2 3 6 6 7 PO2 0/0 9/9
3/3 15/15
each gate arrival time/required time PI2 3 6 6 7 22/22 PO2
4
PI3 1 4 5 4 PO3 0/8 4 7/15 5 14/18 418/22
PI3 1 4
PO3
1/9 7/13

0 1 7 13 18
PI1 1 4 6 5 PO1 4 4 2 2 4
PI1 1 4 6 5 PO1

Arrival times 0 9 slack = required time -


PI2 33 6 6 15 7 22 PO2 0
30 6
0
60 70
arrival time PI2 PO2

0
11 7 5 14 4 18 8
4
18 8 54 44
PI3
4
PO3 4
PI3 PO3
7 4
6

ECE152B TC 25 ECE152B TC 26

Timing Analysis with Interconnect Delay


Clock Non-idealities

22
‹ Clock skew
3 2 1 1 – Spatial variation in temporally equivalent clock
5 5 5
L L edges; deterministic + random, tSK
A A
T 19 2 1 T
‹ Clock jjitter
C C – Temporal variations in consecutive edges of the
H
4 4 4
H clock signal; modulation + random noise
2 1 3 2 – Cycle-to-cycle (short-term) tJS
– Long term tJL
‹ Variation of the pulse width
– Important for level sensitive clocking

ECE152B TC 27 ECE152B TC 28

Clock Skew
Clock Skew and Jitter
ƒ The delays from the clock source to the clock inputs
of different flip-flops are different Clk
CLOCK tSK
DRIVER A

Clk tJS

B
ƒ Both skew and jitter affect the effective cycle time
ƒ Only skew affects the race margin

ECE152B TC 29 ECE152B TC 30
Clock Uncertainties – Sources of Skew ƒ The clock skew problem: the race problem
and Jitter D1 Q1 D2 Q2
FF1 FF2
4 Power Supply 1ns
3 Interconnect
2 6 Capacitive Load
Devices
0ns
7 Coupling to Adjacent Lines 2ns
5 Temperature
1 Clock Generation Correct operation: D1 -> Q1 , D2 -> Q2
Due to clock skew: D1 -> Q2 (error!!)

ƒ Minimizing clock skew:


Distribute the clock signal in such a way that the
interconnections from the clock source to the FF’s
Sources of clock uncertainty clock inputs are of equal length.
ECE152B TC 31 ECE152B TC 32

ECE152B TC 33 ECE152B TC 34

Positive and Negative Skew Positive Skew


R1 R2 R3
R1 R2 In Combinational Combinational
R3 D Q D Q D Q •••
In Combinational Combinational Logic Logic
D Q D Q D Q •••
Logic Logic
CLK tCLK1 tCLK2 tCLK3
CLK tCLK1 tCLK2 tCLK3
delay delay
delay delay
((a)) Positive skew
(a) Positive skew
TCLK + δ
R1 R2 R3
R1 R2 R3
In Combinational TCLK Combinational
D Q1 D Q 3 D Q •••
In Combinational Combinational CLK1 Logic Logic
D Q D Q D Q •••
Logic Logic δ
tCLK1 tCLK2 tCLK3
tCLK1 tCLK2 tCLK3
delay delay CLK
delay delay CLK CLK2 2 4
(b) Negative skew
(b) Negative skew δ + th

Launching edge arrives before the receiving edge


ECE152B TC 35 ECE152B TC 36
Combinational
D Q D Q D Q •••
Logic Logic

CLK tCLK1 tCLK2 tCLK3 “Useful” Clock Skew


delay Negative Skew
delay
(a) Positive skew ƒ Clock skew is not always bad!!
R1 R2
Example:
R3
In
D Q
Combinational
Logic
D Q Combinational
D Q ••• 3
Logic

tCLK1 tCLK2 tCLK3


A B
delay delay CLK
(b) Negative skew
10
TCLK + δ
tθ tθ’

1
TCLK
3 ƒ Assume the FF propagation delay (clock to Q delay)
CLK1
tc→q=0.6, the FF setup time tsu=0.4, the FF hold time
thd=0.5
CLK2 2 4 ƒ If tθ’-tθ = 0 (no clock skew),
δ
– minimum clock period = 11
Receiving edge arrives before the launching edge
ECE152B TC 37 ECE152B TC 38

3
T+δ A B
10
ƒ If tθ’-tθ=δ=1 tθ tθ tθ’

3

A B tθ ’ Data output of A
10 1
0.6
tθ tθ’ >=11
Data input
p of B 3.6

tθ ’
ƒ For proper operation, the time between positive edges at < (3.6 – 0.5)
registers A and B must be greater than or equal to 11 ƒ On the other hand, the clock skew cannot exceed 3.1 ns.
Îclock period + clock skew >= 11 ƒ Otherwise the data latched into register A may propagate through the
Îminimum clock period = 10 short path and reach the data input of register B before the rising edge
of the clock pulse of the same cycle reaching θ’.
ECE152B TC 39 ECE152B TC 40

Effect of Negative Clock Skew Effect of Negative Clock Skew


T-(tθ-tθ’)
ƒ If tθ-tθ’=1 3
A B tθ
3 tθ 10
A B tθ tθ’
10 tθ ’ tθ ’
tθ tθ’
>=11
1 ƒ Register B will never get the chance to latch the wrong data no matter
ƒ For proper operation, the time between positive edges at how large the negative clock skew is.
registers A and B must be greater than or equal to 11 ƒ Race problem is not a concern for negative clock skew
ƒ If we define: tθ’-tθ=δ, δ would be negative for negative
clock skew.
ƒ For this example, δ = -1; T + δ >= 11, thus, the minimum
clock period Tmin= 12
ECE152B TC 41 ECE152B TC 42
Summary Example: Assume tc→q is 0.6, tsu is 0.4 and thold is
0.5.
logic
ƒ Clock Period: T tpd=5
ƒ Longest delay from Reg. A to Reg. B: LD A→B
logic logic
ƒ Shortest delay from Reg. A to Reg. B: SD A→B tpd=2 tpd=2
register register
ƒ FF propagation delay, setup time, hold time: tc→q , tsu , thold
logic logic
1. T + δ ≥ (tc→q + LD A→B + tsu) tpd=5 tpd=3
tθ tθ’
2. δ ≤ (tc→q + SD A→B - thold) Clock θ

Where δ=(tθ’ - tθ) (a) Determine the minimum clock period assuming a positive clock skew: δ = (tθ’ - tθ) = 1.
ƒ Requirement (1) have to be satisfied for datapath between any pair (b) Repeat part(a), factoring in a positive clock skew: δ = 3.
of registers where δ would be either positive or negative (c) Repeat part(a), factoring in a negative clock skew:δ = -2.
(d) Derive the maximum positive clock skew (i.e. tθ’ > tθ) that can be tolerated before the
ƒ Requirement (2) have to be satisfied for datapath between any pair circuit fails.
of registers with a positive clock skew (e) Derive the maximum negative clock skew (i.e. tθ’ < tθ) that can be tolerated before the
circuit fails.
ECE152B TC 43 ECE152B TC 44

Be Very Careful About Gated Clock


Impact of Jitter

2 TC LK 5
D Q
CLK 1 3 4 t j itter A FF
-tji tte r 6
Controller
B
clk

In
REGS Combinational
Logic
clk
CLK t log ic
tc-q , tc-q, cd t log ic, cd A
ts u, thold
tjitter An undesirable glitch or
B spike will result & cause an
additional trigger of the clock.
ECE152B TC 45 ECE152B TC 46

ƒ An alternative design:
Clock Gating

0 • Typically 40-50% of active power is


D 1 Q
in the IC clock trees
FF Q D
A FF • Clock gating allows some of this
Controller Controller
B
power eliminated in active modes
clk clk
• Root and branch clock gating can be
significant
ƒ This design causes less timing problem but consume
more power.
• Resumption of clocking is very fast
• So clock-gated modules can return
to “run” mode without loss of
services
ECE152B TC 47 ECE152B TC 48
Fine-grained Clock Gating Clock Gating Synthesis
always @(posedge CLK) SEL en

begin
CLK
q0 <= D0; RTL Synthesis
D0
q1 <= D1; or q0
Q Low-Power RTL CLK q1 Out
RTL Synthesis D en <= SEL;
end Synthesis D1
EN
always @(posedge assign Out = en ? q0:q1;
CLK
CLK) CLK
begin
if (EN)
Q <= D;
end en
SEL
D Q always @(posedge clk)
CLK
Low-Power RTL EN begin
D0
Synthesis if (sel == 1’b1) Low-Power RTL q0
Synthesis CLK CG q1 Out
CLK q0 <= d0;
CG cell if (sel == 1’b0) D1
q1 <= d1;
CLK CG
en <= sel;
end
assign out = en ? q0:q1;
Ref: Calypto
ECE152B TC 49 ECE152B TC 50

Sequential Clock Gating Advanced Clock Gating (Example)


f_1 f_2 Sequential Analysis
din_1 Original RTL
Combinational
dout Analysis
vld_1
d_1 d_2
g_1 g_2 din
din_2
dout
vld_2 vld
vld_1 vld_2

f_1 f_2 Power Optimized RTL Sequential Clock Gating


din_1 d_1 d_2
CG CG dout din
vld_1
dout
g_1 g_2 Combinational
CG CG CG
din_2 CG
vld Clock Gating
CG CG

vld_2 Combinational Analysis vld_1 vld_2


Sequential Analysis

ECE152B TC Source: Calypto 51 ECE152B TC 52


Source: Calypto
52

Improving Clock Gating Efficiency Dealing with Asynchronous Inputs:


Clock Gating Efficiency = average percentage of time each
Synchronizer
register is gated for a given testbench
ƒ Asynchronous signals incoming to a synchronous
100%
Clock Gated Registers/Duration Original Design system must be synchronized with the rest of the
Original Design
system
Duration

Block 1 Block 2

C oc Gat
Clock Gating
g 38.8% 29%
Efficiency
CLK
R Registers Rn SYNCHRONIZER
0
ASYNCHRONOUS SYNCHRONIZED
100% Clock Gated Registers/Duration D Q
Optimized Design After Identifying more INPUT SIGNAL D METASTABLE
gating opportunities STATE
Duration

Clock Gating CLK


Efficiency Block 1 Block 2
Q
54.9% 47%
POSSIBLE OUTPUTS
R Registers Rn
Source: Calypto
ECE152B 0 TC 53 ECE152B TC 54
A Timing Optimization Technique -
ƒ A multiple-stage synchronizer reduces the chance of Pipelining
synchronization failure.
ƒ Pipelining: a technique to break up a timing-
critical data path into a series of small data paths by
placing registers between sections.
Outside
D Q D Q D Q output
SYSTEM input F()
()

output
input Fa() Fb() Fc()
SYSTEM CLOCK

ECE152B TC 55 ECE152B TC 56

Another Timing Optimization An implementation for k=3


Technique - Retiming yi + + +
ƒ Retiming: a technique to transform a given
synchronous circuit into a faster circuit. xi δ δ δ δ
An example: A digital correlator:
The correlator takes a stream of bits x0,x1, . . . xk as a0 a1 a2 a3
input & compares it with a fixed-length
fixed length pattern a0,aa1,
. . ., ak . After receiving each input xi, the correlator ⎧ 0 P ≠ Q;
: register σ (P, Q) = ⎪⎨
produces as output the number of matches. ⎪
⎩ 1 P=Q
I.e. k
yi =∑ σ ( xi− j , a j )
j= 0 P+Q + P P δ P : comparator
⎧1 : adder
if x = y;
where σ (x, y) = ⎪⎨
⎪0 otherwise Q Q

ECE152B TC 57 ECE152B TC 58

ƒ Suppose each adder has a propagation delay of 7 ns.


& each comparator 3ns. ƒ These two designs are functionally equivalent:
Î The longest propagatim delay is 24ns • all input signals to the box portion arrive one clock tick
Î The clock period must be >= 24ns. earlier.
A better design: • thus, the boxed portion performs the same sequence of
yi computation as the first design, but one clock tick
+ + +
earlier.
B • Since the output from the boxed portion is delayed one
xi δ δ δ δ clock tick by the new register at B, the remainder of the
A circuit sees the same behavior as in the 1st design.
a0 a1 a2 a3
ƒ The longest propagation delay is reduced to 17 nsec.
yi + + + ƒ The elements in the boxed portion “lead” by one clock tick.
Original
B’
design Retiming - the technique of inserting & deleting registers to
δ δ δ δ
xi speed the design while preserve the function.
A’
a0 a a2 a3
ECE152B TC 1 59 ECE152B TC 60
Retiming Transformation
Example:

A 5 4 B 2 C
… a c x …
… b d

A 5 B 4 2 C
… a
x Comparison: pipelining
… b d
A 5 A’ 4 B 2 C

ECE152B TC 61 ECE152B TC 62

Retiming Retiming
Edge label: # of registers Edge label: # of registers

V7 V6 V5 V7 V6 V5
0 0 0 0 1
0 1
0
7 7 7 7 7 7

0 0 0 0 1
0 1
0 1
0 1
0

1 1 1 1 10 10 10 10
3 3 3 3 3 3 3 3
V1 V2 V3 V4 V1 V2 V3 V4

ECE152B TC 63 ECE152B TC 64

You might also like