Chapter 4 Retiming: ECE734 VLSI Arrays for Digital Signal Processing


ECE734 VLSI Arrays for Digital Signal Processing


Chapter 4 Retiming
ECE734 VLSI Arrays for Digital Signal Processing (C) 2004-2006 by Yu Hen Hu
Definitions
Retiming
Retiming is a mapping from a given DFG, $G$, to a retimed DFG, $G_r$, such that the corresponding transfer functions of $G$ and $G_r$ differ by a pure delay $z^{-L}$.
Purposes
To facilitate pipelining and thereby reduce the clock cycle time.
To reduce the number of registers needed.
Cut-set Retiming
Feed-forward cut-set [figure not recovered]
Feed-back cut-set [figure not recovered]
Delay transfer theorem
Adding the same arbitrary non-negative number of delays to each edge of a feed-forward cut-set of a DFG will not alter its output, except that the output timing will be delayed.
Transferring the same number of delays from the edges of one direction across a feed-back cut-set of a DFG to all edges of the opposite direction across the same cut-set will not alter the output, only its timing.
Feed-forward Cut-Set Retiming
Consider the FIR digital filter and its DFG:
$y(n) = b_0 x(n) + b_1 x(n-1)$
Critical path length = $T_M + T_A$
Select a cut-set.
Insert one delay on each edge in the cut-set.
Retiming:
$y_{new}(n) = b_0 x(n-1) + b_1 x(n-2)$
$y_{new}(n) = y(n-1)$
Critical path = $\max(T_M, T_A)$
[Figure: DFG of the FIR filter (two multipliers $b_0$ and $b_1$, one adder, one delay D on the input) before retiming, and the same DFG after retiming with one added delay D on each cut-set edge.]
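The claim $y_{new}(n) = y(n-1)$ can be checked numerically. The sketch below is only an illustration: it assumes the cut-set lies between the two multipliers and the adder, zero initial register contents, and hypothetical function names `fir` and `fir_retimed`.

```python
def fir(x, b0, b1):
    """Original filter y(n) = b0*x(n) + b1*x(n-1), zero initial state."""
    y = []
    for n in range(len(x)):
        x_prev = x[n - 1] if n >= 1 else 0
        y.append(b0 * x[n] + b1 * x_prev)
    return y


def fir_retimed(x, b0, b1):
    """Same filter with one register added on each multiplier output."""
    reg_m0 = reg_m1 = 0  # registers inserted on the two cut-set edges
    x_prev = 0           # the original delay D on the input
    y = []
    for xn in x:
        y.append(reg_m0 + reg_m1)              # adder uses last cycle's products
        reg_m0, reg_m1 = b0 * xn, b1 * x_prev  # multipliers run in parallel
        x_prev = xn
    return y


x = [1, 2, 3, 4, 5]
y = fir(x, 2, 3)
y_new = fir_retimed(x, 2, 3)
```

In the retimed version the adder and the multipliers work in different clock cycles, which is why the critical path drops from $T_M + T_A$ to $\max(T_M, T_A)$.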
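Feed-back Cut Set Retiming
Consider an IIR digital filter
$y(n) = a\,y(n-2) + x(n)$
loop bound = $(T_M + T_A)/2$
clock cycle = $T_M + T_A$
[Figure: loop with adder, multiplier $a$, and 2D on the feedback edge; input x(n), output y(n).]
Shift 1 delay to the other edge across a feed-back cut-set.
Filter remains unchanged.
loop bound = $(T_M + T_A)/2$
clock cycle = $\max(T_M, T_A)$
[Figure: the same loop after retiming, with one delay D on each of the two edges crossing the cut-set.]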
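A small simulation can confirm that moving one delay across the feed-back cut-set leaves the filter unchanged. This is a sketch under assumptions: zero initial registers, and the retimed loop is taken to have one register before and one after the multiplier. The function names are hypothetical.

```python
def iir_original(x, a):
    """y(n) = a*y(n-2) + x(n) with both delays (2D) in front of the multiplier."""
    d1 = d2 = 0  # d1 holds y(n-1), d2 holds y(n-2)
    y = []
    for xn in x:
        yn = a * d2 + xn  # multiply and add in the same clock cycle
        y.append(yn)
        d1, d2 = yn, d1
    return y


def iir_retimed(x, a):
    """Same filter with one delay before and one delay after the multiplier."""
    reg_in = 0   # register feeding the multiplier, holds y(n-1)
    reg_out = 0  # register after the multiplier, holds a*y(n-2)
    y = []
    for xn in x:
        yn = reg_out + xn  # adder only
        y.append(yn)
        reg_in, reg_out = yn, a * reg_in  # multiplier only
    return y


x = [1, 2, 3, 4, 5, 6]
y_before = iir_original(x, 3)
y_after = iir_retimed(x, 3)
```

Both functions return the same sequence, but in `iir_retimed` no single clock cycle contains both a multiplication and an addition.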
Timing Diagram
Assume $t_M = t_A = 1$ t.u.
Before retiming (one MAC operation per 2 t.u. clock cycle):
cycle: 1 2 3 4
input: x(1) x(2) x(3) x(4)
MAC: y(1) y(2) y(3) y(4)
After retiming (separate Add and Mul operations, 1 t.u. clock cycle):
cycle: 1 2 3 4 5 6 7 8
input: x(1) x(2) x(3) x(4) x(5) x(6) x(7) x(8)
Add: y(1) y(2) y(3) y(4) y(5) y(6) y(7) y(8)
Mul: 0, a y(1), ...
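Feed-back Cut Set Retiming
Consider an IIR digital filter
$y(n) = a\,y(n-1) + x(n)$
loop bound = $T_M + T_A$
throughput = $1/(T_M + T_A)$
[Figure: loop with adder, multiplier $a$, and one delay D; input x(n), output y(n).]
Slow down by 2: replace D with 2D and feed the slowed-down input x(m), where
$x(2k-1) = x(k)$ (the original sample) and $x(2k) = 0$
Clock period = $T_M + T_A$
Throughput = $1/[2(T_M + T_A)]$
[Figure: the slowed-down loop with 2D; input x(m), output y(m).]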
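Slowdown + Retiming
Start with
$y(n) = a\,y(n-1) + x(n)$
After 2-fold slow down and retiming:
clock cycle = $\max(T_M, T_A)$
throughput = $1/[2\max(T_M, T_A)]$
[Figure: slowed-down loop with one delay D on each side of the multiplier $a$; input x(m), output y(m).]
Start with
$y(n) = a\,y(n-2) + x(n)$
After retiming:
loop bound = $(T_M + T_A)/2$
clock cycle = $\max(T_M, T_A)$
throughput = $1/\max(T_M, T_A)$
[Figure: loop with one delay D on each side of the multiplier $a$; input x(n), output y(n).]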
Example 3.2.1
Node delay = 1 t.u.
Before retiming:
Critical path: a3 → a4 → a5 → a6
Clock cycle time = 4
2 delay units
After cut-set retiming:
Critical paths: a3 → a5, a4 → a6
Clock cycle time = 2
6 delay units
After additional retiming:
Critical path: none
Clock cycle time = 1
11 delay units
[Figures: the DFG with nodes a1 to a6 before retiming, after cut-set retiming, and after additional retiming; edge delays are D or 2D.]
Slow Down for Cut-Set Retiming
Node Retiming
Transfer delay through a node in DFG:
[Figure: delays transferred through node $v$ with $r(v) = 2$; the edges carry 3D, D, and 2D.]
$r(v)$ = # of delays transferred from out-going edges to incoming edges of node $v$
$w(e)$ = # of delays on edge $e$
$w_r(e)$ = # of delays on edge $e$ after retiming
Retiming equation, for an edge $e$ from node $u$ to node $v$:
$$w_r(e) = w(e) + r(v) - r(u)$$
subject to $w_r(e) \ge 0$.
Let $p$ be a path from $v_0$ to $v_k$:
$$p:\; v_0 \xrightarrow{e_0} v_1 \xrightarrow{e_1} \cdots \xrightarrow{e_{k-1}} v_k$$
then
$$w_r(p) = \sum_{i=0}^{k-1} w_r(e_i) = \sum_{i=0}^{k-1} \bigl( w(e_i) + r(v_{i+1}) - r(v_i) \bigr) = w(p) + r(v_k) - r(v_0)$$
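The retiming equation and the path property can be tried on a small graph. The sketch below assumes a DFG stored as a dictionary from `(u, v)` edges to delay counts; the graph, the retiming values, and the helper names `retime` and `path_delays` are all hypothetical choices made for illustration.

```python
def retime(edges, r):
    """Apply w_r(e) = w(e) + r(v) - r(u) to every edge e = (u, v)."""
    retimed = {(u, v): w + r[v] - r[u] for (u, v), w in edges.items()}
    negative = [e for e, w in retimed.items() if w < 0]
    if negative:
        raise ValueError(f"infeasible retiming, negative delays on {negative}")
    return retimed


def path_delays(edges, path):
    """Total number of delays w(p) along a path given as a list of nodes."""
    return sum(edges[(u, v)] for u, v in zip(path, path[1:]))


# Hypothetical DFG: keys are (u, v) edges, values are delay counts w(e).
edges = {(2, 1): 1, (1, 3): 1, (1, 4): 2, (3, 2): 0, (4, 2): 0}
r = {1: 0, 2: 1, 3: 0, 4: 0}

retimed = retime(edges, r)
p = [1, 3, 2]  # path from v_0 = 1 through node 3 to v_k = 2
```

Only the end nodes of a path matter: the retiming values of the interior nodes cancel in the sum, so `path_delays(retimed, p)` equals `path_delays(edges, p) + r[2] - r[1]`.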
Invariant Properties
1. Retiming does NOT change the total number of delays in any cycle.
2. Retiming does not change the loop bounds or the iteration bound of the DFG.
3. If a constant integer $j$ is added to the retiming value of every node $v$ in a DFG $G$, the retimed graph $G_r$ will not be affected. That is, the weights (# of delays) of the retimed graph will remain the same.
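Properties 1 and 3 are easy to check numerically. The following sketch uses a hypothetical four-node DFG with two cycles listed by hand; the helper names are assumptions, not part of the course material.

```python
def retime(edges, r):
    """w_r(e) = w(e) + r(v) - r(u) for every edge e = (u, v)."""
    return {(u, v): w + r[v] - r[u] for (u, v), w in edges.items()}


def cycle_delays(edges, cycle):
    """Total delays around a cycle given as a node list (last node wraps to first)."""
    return sum(edges[(u, v)] for u, v in zip(cycle, cycle[1:] + cycle[:1]))


# Hypothetical DFG and its two cycles.
edges = {(2, 1): 1, (1, 3): 1, (1, 4): 2, (3, 2): 0, (4, 2): 0}
cycles = [[1, 3, 2], [1, 4, 2]]

r = {1: 0, 2: 1, 3: 0, 4: 0}
shifted = {v: rv + 7 for v, rv in r.items()}  # add the constant j = 7 to every node

retimed = retime(edges, r)
before = [cycle_delays(edges, c) for c in cycles]
after = [cycle_delays(retimed, c) for c in cycles]
```

Around a cycle every $r(v)$ is added once and subtracted once, which is why `before` and `after` agree. Property 2 then follows because the node computation times are untouched.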

Node Retiming Examples
Original DFG:
$$y(n) = x(n) + w(n-1)$$
$$w(n) = a\,y(n-1) + b\,y(n-2)$$
Retimed DFG, $r(2) = 1$:
$$y(n) = x(n) + w_1(n-1) + w_2(n-1)$$
$$w_1(n) = a\,y(n-1)$$
$$w_2(n) = b\,y(n-2)$$
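Both sets of equations reduce to $y(n) = x(n) + a\,y(n-2) + b\,y(n-3)$, so they should produce identical outputs. A minimal check, assuming zero initial conditions and hypothetical function names:

```python
def past(seq, n, k):
    """Return seq[n - k], or 0 for samples before the start (zero initial state)."""
    return seq[n - k] if n - k >= 0 else 0


def single_w_form(x, a, b):
    """y(n) = x(n) + w(n-1),  w(n) = a*y(n-1) + b*y(n-2)."""
    y, w = [], []
    for n, xn in enumerate(x):
        y.append(xn + past(w, n, 1))
        w.append(a * past(y, n, 1) + b * past(y, n, 2))
    return y


def split_w_form(x, a, b):
    """y(n) = x(n) + w1(n-1) + w2(n-1),  w1(n) = a*y(n-1),  w2(n) = b*y(n-2)."""
    y, w1, w2 = [], [], []
    for n, xn in enumerate(x):
        y.append(xn + past(w1, n, 1) + past(w2, n, 1))
        w1.append(a * past(y, n, 1))
        w2.append(b * past(y, n, 2))
    return y


x = [1, 0, 2, -1, 3, 0, 0, 1]
y_single = single_w_form(x, 2, 3)
y_split = split_w_form(x, 2, 3)
```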
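DFG Illustration of the Example
Before retiming:
$T_\infty = \max\{(1+2+1)/2,\ (1+2+1)/3\} = 2$
Critical path delay = 2 + 1 = 3 t.u.
After retiming:
$T_\infty = \max\{(1+2+1)/2,\ (1+2+1)/3\} = 2$
Critical path delay = $\max\{2, 2, 1+1\}$ = 2 t.u.
[Figures: the DFG before and after retiming.]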
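The numbers on this slide can be reproduced with a short script. It is a sketch under assumptions: node times $t(1)=t(2)=1$ and $t(3)=t(4)=2$ and the edge delays are taken from the retiming example in these slides, the two loops are listed by hand, and the function names are hypothetical.

```python
from fractions import Fraction


def iteration_bound(t, edges, loops):
    """Maximum over the listed loops of (total node time) / (total delays)."""
    bounds = []
    for loop in loops:
        time = sum(t[v] for v in loop)
        delays = sum(edges[(u, v)] for u, v in zip(loop, loop[1:] + loop[:1]))
        bounds.append(Fraction(time, delays))
    return max(bounds)


def critical_path(t, edges):
    """Longest total node time along any path made only of zero-delay edges."""
    zero = {}
    for (u, v), w in edges.items():
        if w == 0:
            zero.setdefault(u, []).append(v)
    memo = {}

    def longest_from(u):
        # Assumes no zero-delay cycle, which holds for any valid DFG.
        if u not in memo:
            memo[u] = t[u] + max((longest_from(v) for v in zero.get(u, [])), default=0)
        return memo[u]

    return max(longest_from(u) for u in t)


t = {1: 1, 2: 1, 3: 2, 4: 2}
before = {(2, 1): 1, (1, 3): 1, (1, 4): 2, (3, 2): 0, (4, 2): 0}
after = {(2, 1): 0, (1, 3): 1, (1, 4): 2, (3, 2): 1, (4, 2): 1}
loops = [[1, 3, 2], [1, 4, 2]]
```

The iteration bound is 2 for both graphs, while the critical path drops from 3 t.u. to 2 t.u. after retiming.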
Retiming for Minimizing Clock Period
Note that retiming will NOT alter the iteration bound $T_\infty$.
The iteration bound is the theoretical minimum clock period to execute the algorithm.
Let edge $e$ connect node $u$ to node $v$. If the node computing time $t(u) + t(v) > T_\infty$ and the edge carries no delay, then the clock period $T > T_\infty$.
For such an edge, we require that $w_r(e) \ge 1$.
To generalize, for any path $p$ from $v_0$ to $v_k$, we have
$$w_r(p) = w(p) + r(v_k) - r(v_0)$$
If
$$t(p) = \sum_{i=0}^{k} t(v_i) > T_\infty,$$
then we require $w_r(p) \ge 1$.
In other words, for any possible critical path in the DFG whose computing time is larger than $T_\infty$, we require at least one delay on it after retiming.
Retiming Example Revisited
$T_\infty = 2$
$w_r(e_{21}) \ge 0$, since $t(2)+t(1) = 2 = T_\infty$.
$w_r(e_{13}) \ge 1$, since $t(1)+t(3) = 3 > T_\infty$.
$w_r(e_{14}) \ge 1$, since $t(1)+t(4) = 3 > T_\infty$.
$w_r(e_{32}) \ge 1$, since $t(3)+t(2) = 3 > T_\infty$.
$w_r(e_{42}) \ge 1$, since $t(4)+t(2) = 3 > T_\infty$.
Use eq. $w_r(e_{uv}) = w(e_{uv}) + r(v) - r(u)$:
$w(e_{21}) + r(1) - r(2) = 1 + r(1) - r(2) \ge 0$
$w(e_{13}) + r(3) - r(1) = 1 + r(3) - r(1) \ge 1$
$w(e_{14}) + r(4) - r(1) = 2 + r(4) - r(1) \ge 1$
$w(e_{32}) + r(2) - r(3) = 0 + r(2) - r(3) \ge 1$
$w(e_{42}) + r(2) - r(4) = 0 + r(2) - r(4) \ge 1$
Solution continues
Since the retimed graph $G_r$ remains the same if the same constant is added to all node retiming values, we can set $r(1) = 0$.
The inequalities become
$1 - r(2) \ge 0$ or $r(2) \le 1$
$1 + r(3) \ge 1$ or $r(3) \ge 0$
$2 + r(4) \ge 1$ or $r(4) \ge -1$
$r(2) - r(3) \ge 1$ or $r(3) \le r(2) - 1$
$r(2) - r(4) \ge 1$ or $r(2) \ge r(4) + 1$
Since
$$1 \ge r(2) \ge r(3) + 1 \ge 0 + 1 = +1$$
one must have $r(2) = +1$.
This implies $r(3) \le 0$. But we also have $r(3) \ge 0$. Hence $r(3) = 0$.
These leave $-1 \le r(4) \le 0$.
Hence the two sets of solutions are:
$r(1) = r(3) = 0$, $r(2) = +1$, and $r(4) = 0$ or $-1$.
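The hand derivation can be cross-checked by brute force. The sketch below fixes $r(1) = 0$, tries every integer combination in a small search range (the range $-3 \dots 3$ is an arbitrary assumption), and keeps the combinations that satisfy the five edge constraints with target clock period $T = T_\infty = 2$.

```python
from itertools import product

t = {1: 1, 2: 1, 3: 2, 4: 2}                                    # node computing times
edges = {(2, 1): 1, (1, 3): 1, (1, 4): 2, (3, 2): 0, (4, 2): 0}  # w(e)
T = 2                                                            # target clock period


def feasible(r):
    """Check w_r(e) >= 1 when t(u) + t(v) > T, and w_r(e) >= 0 otherwise."""
    for (u, v), w in edges.items():
        required = 1 if t[u] + t[v] > T else 0
        if w + r[v] - r[u] < required:
            return False
    return True


solutions = [
    (r2, r3, r4)
    for r2, r3, r4 in product(range(-3, 4), repeat=3)
    if feasible({1: 0, 2: r2, 3: r3, 4: r4})
]
```

Within the search range the only feasible triples $(r(2), r(3), r(4))$ are $(1, 0, -1)$ and $(1, 0, 0)$, matching the two solutions found by hand.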
Systematic Solutions
Given a system of inequalities:
$r(i) - r(j) \le k$; $1 \le i, j \le N$
Construct a constraint graph:
1. Map each $r(i)$ to node $i$. Add a node $N+1$.
2. For each inequality $r(i) - r(j) \le k$, draw an edge $e_{ji}$ from node $j$ to node $i$ such that $w(e_{ji}) = k$.
3. Draw $N$ edges $e_{N+1,i}$, $i = 1, \dots, N$, with $w(e_{N+1,i}) = 0$.
a) The system of inequalities has a solution if and only if the constraint graph contains no negative cycles.
b) If a solution exists, one solution is where $r(i)$ is the minimum length path from node $N+1$ to node $i$.
Shortest path algorithms (Appendix A):
Bellman-Ford algorithm
Floyd-Warshall algorithm
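The construction above can be sketched in a few lines. This is an illustrative edge-list form of Bellman-Ford, not the course's own script; the function name `solve_difference_constraints` and the `(i, j, k)` triple format are assumptions.

```python
def solve_difference_constraints(n, constraints):
    """Solve r(i) - r(j) <= k for nodes 1..n; constraints are (i, j, k) triples.

    Returns {i: r(i)}, or None when the constraint graph has a negative cycle.
    """
    source = n + 1
    # Inequality r(i) - r(j) <= k becomes an edge j -> i with weight k.
    graph_edges = [(j, i, k) for i, j, k in constraints]
    graph_edges += [(source, i, 0) for i in range(1, n + 1)]

    dist = {v: float("inf") for v in range(1, n + 2)}
    dist[source] = 0
    for _ in range(n):  # n + 1 nodes, so n relaxation rounds are enough
        for u, v, w in graph_edges:
            if dist[u] + w < dist[v]:
                dist[v] = dist[u] + w
    # One extra pass: any further improvement means a negative cycle.
    if any(dist[u] + w < dist[v] for u, v, w in graph_edges):
        return None
    return {i: dist[i] for i in range(1, n + 1)}


# The five inequalities of the retiming example, written as r(i) - r(j) <= k.
example = [(2, 1, 1), (1, 3, 0), (1, 4, 1), (3, 2, -1), (4, 2, -1)]
r = solve_difference_constraints(4, example)

# A contradictory pair: r(1) - r(2) <= -1 and r(2) - r(1) <= 0.
infeasible = solve_difference_constraints(2, [(1, 2, -1), (2, 1, 0)])
```

For the example this gives $r = (-1, 0, -1, -1)$. Adding 1 to every value yields $r(1) = r(3) = r(4) = 0$, $r(2) = 1$, one of the two solutions found by hand.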
Bellman-Ford Algorithm
Find the shortest path from an arbitrarily chosen origin node U to each node in a directed graph if no negative cycle exists.
Given a directed graph:
$w(m,n)$: weight on the edge from node $m$ to node $n$; $w(m,n) = \infty$ if there is no edge from $m$ to $n$
$r(i,j)$: the length of the shortest path from node U to node $i$ that uses at most $j$ edges (at most $j-1$ intermediate nodes)
$r(i,1) = w(U,i)$,
$r(i,j+1) = \min_k \{r(k,j) + w(k,i)\}$, $j = 1, 2, \dots, N-1$
If $\max(r(:,N-1) - r(:,N)) > 0$, then there is a negative cycle. Else, $r(i,N-1)$ gives the shortest path length from U to $i$.
In the example, $\max(r(:,N-1) - r(:,N)) = 1 > 0$, hence there is at least one negative cycle.
[Example: four-node graph with its weight matrix $W$ and result matrix $r$; the figure and the matrix entries were not recovered.]
Script: spbf.m
Floyd-Warshall Algorithm
Find the shortest path between all possible pairs of nodes in the graph, provided no negative cycle exists.
Algorithm:
Initialization: $R^{(1)} = W$;
For $k = 1$ to $N$:
$R^{(k+1)}(u,v) = \min\{R^{(k)}(u,:) + R^{(k)}(:,v)\}$
If $R^{(k)}(u,u) < 0$ for any $k$, $u$, then a negative cycle exists. Else, $R^{(N+1)}(u,v)$ is the shortest path length from $u$ to $v$.
[Example: four-node graph with its weight matrix $W = R^{(1)}$ and the iterates $R^{(2)}$ to $R^{(5)}$; the figure and the matrix entries were not recovered.]


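Retiming Example
For the retiming example:
$r(2) - r(1) \le 1$
$r(1) - r(3) \le 0$
$r(1) - r(4) \le 1$
$r(3) - r(2) \le -1$
$r(4) - r(2) \le -1$
Bellman-Ford Algorithm for Shortest Path
[Figure: constraint graph with nodes 1 to 5; node 5 is the added node with zero-weight edges to nodes 1 to 4.]
$$W = \begin{bmatrix} 0 & 1 & \infty & \infty & \infty \\ \infty & 0 & -1 & -1 & \infty \\ 0 & \infty & 0 & \infty & \infty \\ 1 & \infty & \infty & 0 & \infty \\ 0 & 0 & 0 & 0 & 0 \end{bmatrix}$$
$$R = \begin{bmatrix} 0 & 0 & -1 & -1 \\ 0 & 0 & 0 & 0 \\ 0 & -1 & -1 & -1 \\ 0 & -1 & -1 & -1 \\ 0 & 0 & 0 & 0 \end{bmatrix}$$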
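The recursion $r(i,1) = w(U,i)$, $r(i,j+1) = \min_k\{r(k,j) + w(k,i)\}$ can be run directly on the weight matrix of the constraint graph. The sketch below is one possible matrix-style implementation with hypothetical names; indices are 0-based, so node 5 (the origin U) is index 4.

```python
INF = float("inf")


def bellman_ford_table(W, origin):
    """Run the recursion and return (table, has_negative_cycle).

    W[m][n] is the weight of edge m -> n (INF if absent, 0 on the diagonal).
    table[i][j-1] holds r(i, j) for j = 1 .. N-1.
    """
    N = len(W)
    cols = [[W[origin][i] for i in range(N)]]  # r(i, 1) = w(U, i)
    for _ in range(N - 1):                      # j = 1 .. N-1 produces r(:, 2 .. N)
        prev = cols[-1]
        cols.append([min(prev[k] + W[k][i] for k in range(N)) for i in range(N)])
    # Negative cycle test: max(r(:, N-1) - r(:, N)) > 0.
    has_negative_cycle = any(a > b for a, b in zip(cols[-2], cols[-1]))
    table = [[col[i] for col in cols[:-1]] for i in range(N)]
    return table, has_negative_cycle


W = [
    [0, 1, INF, INF, INF],
    [INF, 0, -1, -1, INF],
    [0, INF, 0, INF, INF],
    [1, INF, INF, 0, INF],
    [0, 0, 0, 0, 0],
]
R, negative = bellman_ford_table(W, origin=4)
```

The last column of `R` is $(-1, 0, -1, -1, 0)$: the first four entries are a valid set of retiming values, and no negative cycle is reported.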

Retiming Example
Floyd-Warshall algorithm
[Matrices $W = R^{(1)}$ and $R^{(2)}$ through $R^{(6)}$ for the five-node constraint graph; the matrix entries were not recovered.]
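As a rough illustration, the all-pairs computation can be applied to the constraint graph of the retiming example. Note the assumption: the code uses the classical textbook form of Floyd-Warshall, which allows only node $k$ as a new intermediate node at step $k$, rather than the row-plus-column minimum written on the earlier slide. Both give the same final shortest path lengths when no negative cycle exists.

```python
INF = float("inf")


def floyd_warshall(W):
    """Return (R, has_negative_cycle) for weight matrix W (INF = no edge)."""
    N = len(W)
    R = [row[:] for row in W]  # work on a copy of W
    for k in range(N):
        for u in range(N):
            for v in range(N):
                if R[u][k] + R[k][v] < R[u][v]:
                    R[u][v] = R[u][k] + R[k][v]
    has_negative_cycle = any(R[u][u] < 0 for u in range(N))
    return R, has_negative_cycle


# Constraint graph of the retiming example; index 4 is the added node 5.
W = [
    [0, 1, INF, INF, INF],
    [INF, 0, -1, -1, INF],
    [0, INF, 0, INF, INF],
    [1, INF, INF, 0, INF],
    [0, 0, 0, 0, 0],
]
R, negative = floyd_warshall(W)
solution = R[4][:4]  # shortest path lengths from node 5 to nodes 1..4
```

The row of the added node gives the retiming values $(-1, 0, -1, -1)$, the same solution the Bellman-Ford approach produces.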


Retiming to Reduce Registers
Register Sharing
When a node has multiple fan-out edges with different numbers of delays, the registers can be shared so that only the branch with the max. # of delays is needed.
Register reduction through node delay transfer from multiple input edges to output edges (e.g. r(v) > 0)
Should be done only when the clock cycle constraint (if any) is not violated.
[Figure: delay reduction by moving the delays (D) across a node.]
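Register sharing changes how registers are counted: without sharing every delay on every edge costs a register, with sharing each node only pays for its fan-out edge with the most delays. A minimal sketch, assuming a DFG stored as `{(u, v): w(e)}` with a hypothetical function name and example graph:

```python
from collections import defaultdict


def register_count(edges, shared=True):
    """Number of registers needed for a DFG given as {(u, v): w(e)}."""
    if not shared:
        return sum(edges.values())
    max_fanout = defaultdict(int)
    for (u, _v), w in edges.items():
        # With sharing, node u needs only as many registers as its longest branch.
        max_fanout[u] = max(max_fanout[u], w)
    return sum(max_fanout.values())


# Hypothetical DFG: node 1 fans out to node 3 (1 delay) and node 4 (2 delays).
edges = {(2, 1): 1, (1, 3): 1, (1, 4): 2, (3, 2): 0, (4, 2): 0}
unshared = register_count(edges, shared=False)
shared = register_count(edges, shared=True)
```

Here the branch to node 3 taps the first register of the two-register chain that feeds node 4, so one register is saved.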
Time Scaling (Slow Down)
Transforming each delay element (register) D to ND and reducing the sample frequency N-fold slows down the computation N times.
During slow down, the processor clock cycle time remains unchanged. Only the sampling cycle time is increased.
Provides opportunity for retiming and interleaving.
[Figure: original loop with delay D, input x(3) x(2) x(1), output y(3) y(2) y(1); 2-slow loop with delay 2D, input -- x(3) -- x(2) -- x(1), output y(3) -- y(2) -- y(1).]
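The effect of a 2-fold slow down can be simulated. The sketch below assumes the first-order loop $y(n) = a\,y(n-1) + x(n)$, zero initial registers, and zeros in the idle input slots; `run_loop` is a hypothetical helper.

```python
def run_loop(x, a, num_delays):
    """Simulate y(m) = a*y(m - num_delays) + x(m) with zero initial registers."""
    regs = [0] * num_delays  # regs[-1] is the oldest stored output
    y = []
    for xm in x:
        ym = a * regs[-1] + xm
        y.append(ym)
        regs = [ym] + regs[:-1]  # shift the register chain by one clock
    return y


x = [1, 2, 3, 4]
a = 2
y = run_loop(x, a, 1)                      # original loop with delay D

x_slow = [v for xk in x for v in (xk, 0)]  # x(1), --, x(2), --, ...
y_slow = run_loop(x_slow, a, 2)            # 2-slow loop with delay 2D
```

Every second output of the slowed-down loop equals the original output and the slots in between stay idle, which is what makes room for retiming or for interleaving a second data stream.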
