FPGA - Ch0 - Folding

Ho Chi Minh City University of Technology (ĐHBK Tp HCM)
Department of Electronics (BMĐT)
Lecturer: Hồ Trung Mỹ
Ch. 06
Folding
References:
1. Slides from the book by Prof. Parhi
2. Slides by Prof. Fredrik Edman
3. Slides by Prof. Viktor Öwall
4. Slides by Prof. Lan-Da Van
Outline
6.1 Introduction
6.2 Folding Transformation
6.3 Register Minimization Techniques
6.4 Register Minimization in Folded Architecture
6.5 Conclusions
What is folding?
Folding is the "inverse" of unfolding.
[Figure: folding by N (N = folding factor) maps the nodes A0, A1, ..., A(J-1) onto a single node A; unfolding by J (J = unfolding factor) expands node A into A0, A1, ..., A(J-1).]
Folding?
• Used to minimize silicon area, at the expense of computation time.
• A systematic way to determine the control circuits in DSP architectures: the folding transformation time-multiplexes multiple algorithm operations onto a single functional unit.
• Used for the synthesis of DSP architectures that can be operated with single or multiple clocks.
• Used to reduce the number of hardware functional units (FUs), such as adders and multipliers, by a factor of N at the expense of increasing the computation time by a factor of N.
• Folding can lead to an architecture that uses a large number of registers, so a register minimization technique sometimes needs to be applied.
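Hardware Mapped vs. Time Multiplexed
FIR: $y(n) = \sum_{k=0}^{N-1} h_k \, x(n-k)$
[Figure: left, a hardware-mapped FIR filter (delay line x(n) -> D -> D -> D, taps h0 ... h3, adders producing y(n)); right, a time-multiplexed FIR filter (MUX, coefficient memory c, one multiplier, one adder, accumulator REG).]
Hardware mapped:
• 1 sample/cc
• N fixed multipliers
• N-1 adders
Time multiplexed:
• N cc/sample
• 1 generalized multiplier
• 1 adder
• 1 coefficient memory
• Controller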
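The two FIR structures can be compared with a small behavioural model. This is only a sketch: the function names and the test data are made up for illustration, and the model counts clock cycles instead of describing real hardware.

```python
def fir_hardware_mapped(x, h):
    """Hardware mapped: all N products are formed in parallel, 1 sample per cycle."""
    n_taps = len(h)
    return [sum(h[k] * x[n - k] for k in range(n_taps) if n - k >= 0)
            for n in range(len(x))]


def fir_time_multiplexed(x, h):
    """Time multiplexed: one multiply-accumulate per cycle, N cycles per sample."""
    n_taps = len(h)
    y, cycles = [], 0
    for n in range(len(x)):
        acc = 0                                     # accumulator register (REG)
        for k in range(n_taps):                     # controller steps through the taps
            sample = x[n - k] if n - k >= 0 else 0  # MUX selects the delayed sample
            acc += h[k] * sample                    # the single multiplier and adder
            cycles += 1
        y.append(acc)
    return y, cycles


x = [1, 2, 3, 4, 5]          # illustrative input samples
h = [1, 0, -1, 2]            # illustrative coefficients, N = 4
y_par = fir_hardware_mapped(x, h)
y_mux, cycles = fir_time_multiplexed(x, h)
print(y_par, y_mux, cycles)
```

Both functions return the same samples; the time-multiplexed version needs N clock cycles per sample, which is the price of the N-fold reduction in multipliers.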
Folding – Time-shared Architecture
$y(n) = a(n) + b(n) + c(n)$
[Figure: left, two adders computing y(n) from a(n), b(n) and c(n); right, the 2-times folded architecture with a single adder.]
• Folding is a technique to reduce silicon area by time-multiplexing many operations onto a single functional unit.
• The right-hand figure shows a 2-times folded architecture in which the 2 additions are folded, or time-multiplexed, onto a single adder.
• Folding introduces registers (storage).
• Computation time increases: one output sample is produced every 2 cc (one input sample is consumed every 2 cc).
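A minimal behavioural sketch of the 2-times folded adder follows. It assumes that the single adder computes a(l) + b(l) in time partition 2l+0 and adds c(l) in time partition 2l+1; the exact register timing of the figure is not reproduced, and all names are illustrative.

```python
def direct(a, b, c):
    """Two adders: one output sample per clock cycle."""
    return [ai + bi + ci for ai, bi, ci in zip(a, b, c)]


def folded(a, b, c):
    """One adder shared by both additions: one output sample every 2 clock cycles."""
    reg = 0                          # register D holding the partial sum
    outputs = []
    for cycle in range(2 * len(a)):
        l, phase = divmod(cycle, 2)
        if phase == 0:               # time partition 2l+0: a(l) + b(l)
            reg = a[l] + b[l]
        else:                        # time partition 2l+1: partial sum + c(l)
            reg = reg + c[l]
            outputs.append(reg)      # y(l) is complete after the second cycle
    return outputs


a, b, c = [1, 2, 3], [10, 20, 30], [100, 200, 300]
y_direct = direct(a, b, c)
y_folded = folded(a, b, c)
print(y_direct, y_folded)
```

The register `reg` is the storage that folding introduces: the partial sum has to be kept for one cycle until the shared adder is free again.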
Folding Example – A more detailed look!
$y(n) = a(n) + b(n) + c(n)$
[Figure: the folded adder shown over four clock cycles; the switches are labelled with the time partitions 2l+0 and 2l+1. Recoverable labels: cycle 0 (2l+0): a(0), b(0); cycle 1 (2l+1): c(0), a(0)+b(0), y(-1); cycle 2 (2l+0): a(1), b(1); cycle 3 (2l+1): c(1), a(0)+b(0)+c(0).]
What's related to folding
• Hardware is reduced by N-folding.
• The computation time T_computation increases by a factor of N => latency increases.
• Extremes:
  - Fully parallel
  - Time multiplexed = 1 unit per type of algorithmic operation
• Folding =>
  - extra registers, i.e. extra storage
  - a more complex control unit
  - more latency
Folding Transformation

Folding Set
• A folding set is an ordered set of operations to be executed on the same functional unit.
• Folding sets are typically obtained from a scheduling and allocation algorithm (ref. Appendix B).
• The folding sets represent the underlying folding transformation.
• Each set contains N entries, where N is the folding factor.
• Example (N = 3), with folding order 0, 1, 2 (..., N-1):
  $S_1 = \{A_1, \varnothing, A_2\}$
  A_1 has folding order 0, A_2 has folding order 2, and position 1 holds a null operation.
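As a small illustration, a folding set can be stored as an ordered list whose index is the folding order. The `NULL` marker and the helper names below are hypothetical and only serve this sketch.

```python
NULL = None  # hypothetical marker for a null operation (functional unit idle)


def schedule(folding_sets):
    """Map every operation to (functional unit, folding order u)."""
    table = {}
    for unit, operations in folding_sets.items():
        for u, op in enumerate(operations):
            if op is not NULL:
                table[op] = (unit, u)
    return table


def execution_time(table, op, iteration, n):
    """Iteration l of an operation with folding order u runs in time unit N*l + u."""
    _unit, u = table[op]
    return n * iteration + u


N = 3
folding_sets = {"S1": ["A1", NULL, "A2"]}   # S1 = {A1, null, A2}
table = schedule(folding_sets)
print(table)
print(execution_time(table, "A2", 4, N))
```

Because the null operation occupies position 1, the functional unit for S1 is idle in every time unit 3l + 1.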

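[Figure: an edge U -> V with w(e) delays in the original DFG, and the corresponding folded edge from hardware unit H_U (followed by P_U pipeline delays) to hardware unit H_V.]
• N = folding factor = number of operations folded onto a single hardware unit.
• l = iteration index.
• Iteration l of node U is executed in time unit Nl + u on H_U, and iteration l of node V in time unit Nl + v on H_V, with $0 \le u, v \le N-1$.
• P_U = level of pipelining of the hardware unit H_U, so the result of U is available in time unit Nl + u + P_U.
• The result of iteration l of U is used by iteration l + w(e) of V, which is executed in time unit N(l + w(e)) + v.
• Delays in the folded graph:
  $D_F(U \xrightarrow{e} V) = [N(l + w(e)) + v] - [Nl + u + P_U] = N w(e) - P_U + v - u$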
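The folding equation is simple enough to evaluate directly. The sketch below follows the derivation (time of consumption minus time of production); the function name and the sample numbers are illustrative only.

```python
def folded_delay(n, w, p_u, u, v):
    """D_F(U -> V) = N*w(e) - P_U + v - u.

    n: folding factor, w: delays on the edge, p_u: pipeline levels of H_U,
    u, v: folding orders of U and V (0 <= u, v <= n - 1).
    """
    produced = u + p_u        # result of iteration l of U, relative to N*l
    consumed = n * w + v      # start of iteration l + w(e) of V, relative to N*l
    return consumed - produced


d_ok = folded_delay(n=4, w=2, p_u=1, u=3, v=1)    # 8 + 1 - 4
d_bad = folded_delay(n=4, w=0, p_u=2, u=2, v=0)   # 0 + 0 - 4
print(d_ok, d_bad)
```

A negative result, as in the second call, means the value would be needed before it has been produced, so that edge cannot be folded without retiming.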
Ex. Folding of Biquad filter
[Wiki] A digital biquad filter is a second-order recursive linear filter (IIR filter), containing two poles and two zeros. "Biquad" is an abbreviation of "biquadratic", which refers to the fact that in the Z domain its transfer function is the ratio of two quadratic functions.
[Figure: biquad filter DFG with input In, output Out, nodes 1 to 8 (adders 1-4, multipliers 5-8 with coefficients a, b, c, d) and two delay elements D.]
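Ex. Folding of Biquad filter
[Figure: the biquad DFG with every node annotated by its folding set and folding order.]
$T_{adder} = 1$, $T_{mult} = 2$; $P_{adder} = 1$, $P_{mult} = 2$
Additions: $S_1 = \{4, 2, 3, 1\}$, i.e. node 4 -> (S1|0), node 2 -> (S1|1), node 3 -> (S1|2), node 1 -> (S1|3)
Multiplications: $S_2 = \{5, 8, 6, 7\}$, i.e. node 5 -> (S2|0), node 8 -> (S2|1), node 6 -> (S2|2), node 7 -> (S2|3)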
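Ex. Folding of Biquad filter, N = 4
Folding equations: $D_F(U \xrightarrow{e} V) = N w(e) - P_U + v - u$, with $0 \le u, v \le N-1$ (v belongs to the receiving node, u to the sending node).
[Figure: the biquad DFG annotated with (S1|0) ... (S1|3) and (S2|0) ... (S2|3).]
A folding is valid only if $D_F(U \xrightarrow{e} V) \ge 0$ for every edge. For the DFG as drawn this does not hold => not a valid folding.
The number of delays on an edge cannot be negative!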
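The check can be sketched in a few lines. The folding orders and pipeline levels are taken from the slides; the edge list with its delay counts w(e) is an assumption, because the figure is not recoverable from the extracted text. It follows the usual textbook form of this biquad example.

```python
N = 4  # folding factor

# Folding order of each node, read off S1 = {4, 2, 3, 1} and S2 = {5, 8, 6, 7}.
ORDER = {4: 0, 2: 1, 3: 2, 1: 3, 5: 0, 8: 1, 6: 2, 7: 3}

# Pipeline levels: adders (nodes 1-4) P = 1, multipliers (nodes 5-8) P = 2.
PIPE = {1: 1, 2: 1, 3: 1, 4: 1, 5: 2, 6: 2, 7: 2, 8: 2}

# ASSUMPTION: edges (U, V, w(e)) of the biquad DFG before retiming.
EDGES = [(1, 2, 0), (1, 5, 1), (1, 6, 1), (1, 7, 2), (1, 8, 2),
         (3, 1, 0), (4, 2, 0), (5, 3, 0), (6, 4, 0), (7, 3, 0), (8, 4, 0)]


def folding_equations(edges):
    """Return {(U, V): D_F} with D_F = N*w(e) - P_U + v - u."""
    return {(u, v): N * w - PIPE[u] + ORDER[v] - ORDER[u] for u, v, w in edges}


delays = folding_equations(EDGES)
negative = {edge: d for edge, d in delays.items() if d < 0}

for (u, v), d in delays.items():
    flag = "  <-- negative" if d < 0 else ""
    print(f"D_F({u} -> {v}) = {d}{flag}")
print("valid folding:", not negative)
```

Under these assumed edge weights four edges get a negative delay, which is why the DFG has to be retimed before it can be folded with these folding sets.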
Retiming: Folding of Biquad filter, N = 4
[Figure sequence: the biquad DFG, annotated with its folding sets, is made foldable in three steps. (1) Feedforward cutset => pipelining. (2) Retiming: split and move delay. (3) Resulting DFG with additional delays D, for which every edge satisfies $D_F(U \xrightarrow{e} V) \ge 0$, with $0 \le u, v \le N-1$ => valid folding.]
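Systematic way of Retiming for Folding
• After retiming with r, the folded delay of an edge is $D'_F(U \xrightarrow{e} V) = D_F(U \xrightarrow{e} V) + N\,(r(V) - r(U))$.
• Requiring $D'_F(U \xrightarrow{e} V) \ge 0$ for every edge gives $r(U) - r(V) \le \lfloor D_F(U \xrightarrow{e} V) / N \rfloor$ (retiming for folding constraint).
• Then solve the system of inequalities!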
Using either the Bellman-Ford or the Floyd-Warshall algorithm, we find that the set of constraints has a solution, and one solution is the retiming
r(1) = -1    r(5) = -1
r(2) = 0     r(6) = -1
r(3) = -1    r(7) = -2
r(4) = 0     r(8) = -1
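A sketch of how such a solution can be found with Bellman-Ford is given below. Each constraint r(U) - r(V) <= floor(D_F / N) is treated as a shortest-path edge from V to U. The D_F values are an assumption (the folding equations of the un-retimed biquad are not recoverable from the slides); they are chosen to agree with the usual textbook form of this example.

```python
N = 4

# ASSUMPTION: folding-equation values D_F(U -> V) of the biquad before retiming.
D_F = {(1, 2): -3, (1, 5): 0, (1, 6): 2, (1, 7): 7, (1, 8): 5,
       (3, 1): 0, (4, 2): 0, (5, 3): 0, (6, 4): -4, (7, 3): -3, (8, 4): -3}

# One constraint per edge: r(U) - r(V) <= floor(D_F / N).
CONSTRAINTS = [(u, v, d // N) for (u, v), d in D_F.items()]


def bellman_ford(nodes, constraints):
    """Solve r[u] - r[v] <= c for all (u, v, c); return r, or None if infeasible."""
    nodes = list(nodes)
    r = {n: 0 for n in nodes}            # virtual source with 0-weight edges to all nodes
    for _ in range(len(nodes)):
        changed = False
        for u, v, c in constraints:
            if r[v] + c < r[u]:          # relax the edge v -> u of weight c
                r[u] = r[v] + c
                changed = True
        if not changed:
            return r
    return None                          # still changing: negative cycle, no solution


r = bellman_ford(range(1, 9), CONSTRAINTS)
slide = {1: -1, 2: 0, 3: -1, 4: 0, 5: -1, 6: -1, 7: -2, 8: -1}

# Folded delays after retiming: D_F' = D_F + N * (r(V) - r(U)).
retimed = {(u, v): d + N * (r[v] - r[u]) for (u, v), d in D_F.items()}
print(r)
print(retimed)
```

With these assumed values the solver returns the same retiming as the slide, and all retimed folding equations are non-negative. Other solutions exist; any r that satisfies every constraint is acceptable.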
[Figure: the retimed biquad DFG with its folding sets; every edge satisfies $D_F(U \xrightarrow{e} V) \ge 0$, with $0 \le u, v \le N-1$ => valid folding.]
Recall: Folding Transformation
Delays in the folded graph: $D_F(U \xrightarrow{e} V) = N w(e) - P_U + v - u$, where N is the folding factor (number of operations folded onto a single hardware unit), l the iteration index, w(e) the number of delays on the edge, P_U the level of pipelining of the hardware unit H_U, and $0 \le u, v \le N-1$.
[Figure sequence: step-by-step construction of the folded biquad architecture from the folding equations, for $S_1 = \{4, 2, 3, 1\}$ and $S_2 = \{5, 8, 6, 7\}$ (folding order 0 1 2 3). One adder and one multiplier are connected through delay chains (D, D, D, 2D) and switches that close at the time instances {0}, {1}, {2}, {3}, {0,2} and {1,3}; In and Out are attached in the last two steps.]
6.3 Register Minimization Techniques
• Folding may insert registers. Lifetime analysis is used to minimize the number of registers in DSP hardware.
• A data sample (also called a variable) is live from the time it is produced until the time it is consumed. After that it is dead.
• Linear lifetime chart: represents the lifetime of the variables in a linear fashion.
• Convention: a variable is
  - not live during the clock cycle when it is produced => the variable does not need to be stored during that time;
  - but live during the clock cycle when it is consumed.
[Figure: linear lifetime chart; one iteration = 6 cc => N = 6.]
Register Minimization
• Max. number of live variables = min. number of registers.
• [Figure: linear lifetime chart with period 6 cc.] Within a single iteration at most 2 variables are live, but 3 when several overlapping iterations are considered.
• Use the contribution of the previous iteration to avoid drawing the lifetime chart over several iterations.
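Example of a systematic way of working with lifetime charts
3x3 Matrix Transpose
$\begin{bmatrix} a & b & c \\ d & e & f \\ g & h & i \end{bmatrix} \Rightarrow \begin{bmatrix} a & d & g \\ b & e & h \\ c & f & i \end{bmatrix}$
Input sequence i | h | g | f | e | d | c | b | a -> Matrix Transposer -> output sequence i | f | c | h | e | b | g | d | a (a is first in both sequences)
One iteration = 9 clock cycles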
Lifetime Table - 3x3 Matrix Transpose
Input and output time slots of the Matrix Transposer: 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0

Sample  Tin  Tzlout
a       0    0
b       1    3
c       2    6
d       3    1
e       4    4
f       5    7
g       6    2
h       7    5
i       8    8

Tzlout = zero-latency output time
Tdiff = Tzlout - Tin (for example, d has Tdiff = 1 - 3 = -2: with zero latency it would be output before it is input)
Tout = Tzlout + max{-Tdiff}
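The three definitions can be turned into a short calculation for the 3x3 transposer. This is a sketch; it assumes the samples arrive row by row (a first) and leave column by column, as in the input and output sequences of the example.

```python
SAMPLES = "abcdefghi"
SIZE = 3


def lifetime_table():
    """Return rows (sample, Tin, Tzlout, Tdiff, Tout) and the latency max{-Tdiff}."""
    rows = []
    for t_in, name in enumerate(SAMPLES):
        row, col = divmod(t_in, SIZE)      # position in the input matrix
        t_zlout = col * SIZE + row         # position in the transposed output order
        rows.append((name, t_in, t_zlout, t_zlout - t_in))
    latency = max(-t_diff for _, _, _, t_diff in rows)
    return [(n, ti, tz, td, tz + latency) for n, ti, tz, td in rows], latency


table, latency = lifetime_table()
print("latency =", latency)
for name, t_in, t_zlout, t_diff, t_out in table:
    print(f"{name}: Tin={t_in} Tzlout={t_zlout} Tdiff={t_diff} Tout={t_out}")
```

The latency is set by the sample with the most negative Tdiff (here g), so that no sample is output before it has been input.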
Lifetime Table - 3x3 Matrix Transpose

Sample  Tin  Tzlout  Tdiff  Tout  Life period
a       0    0        0     4     0 -> 4
b       1    3        2     7     1 -> 7
c       2    6        4     10    2 -> 10
d       3    1       -2     5     3 -> 5
e       4    4        0     8     4 -> 8
f       5    7        2     11    5 -> 11
g       6    2       -4     6     6 -> 6
h       7    5       -2     9     7 -> 9
i       8    8        0     12    8 -> 12

Tdiff = Tzlout - Tin
max{-Tdiff} = 4, so Tout = Tzlout + 4
Life period = Tin -> Tout
Lifetime chart - 3x3 Matrix Transpose
One iteration = 9 clock cycles, so the chart repeats with period 9 and the end of one iteration (cycles 9 to 12) overlaps the start of the next iteration.

Cycle  #live (one iteration)  Contribution from next iteration  Total
0      0
1      1
2      2
3      3
4      4
5      4
6      4
7      4
8      4
9      4                      +0                                4
10     3                      +1                                4
11     2                      +2                                4
12     1                      +3                                4

Drawn over several iterations, the chart shows 4 live variables in every cycle from cycle 4 onward.
max #live = 4 => 4 registers (min)
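The #live column can be reproduced from the life periods with the convention that a variable is not live in the cycle in which it is produced but is live in the cycle in which it is consumed. The sketch below also folds the chart with period 9 to account for overlapping iterations; the function names are illustrative.

```python
# Life periods (Tin, Tout) from the lifetime table of the 3x3 transposer.
LIFE = {"a": (0, 4), "b": (1, 7), "c": (2, 10), "d": (3, 5), "e": (4, 8),
        "f": (5, 11), "g": (6, 6), "h": (7, 9), "i": (8, 12)}
PERIOD = 9  # one iteration = 9 clock cycles


def live_one_iteration(life, cycle):
    """Variables of a single iteration that are live in `cycle`."""
    return sum(1 for t_in, t_out in life.values() if t_in < cycle <= t_out)


def live_steady_state(life, period, phase):
    """Live variables in cycle `phase` (mod period) when all iterations overlap."""
    last = max(t_out for _, t_out in life.values())
    return sum(live_one_iteration(life, t) for t in range(phase, last + 1, period))


single = [live_one_iteration(LIFE, c) for c in range(13)]
steady = [live_steady_state(LIFE, PERIOD, p) for p in range(PERIOD)]
min_registers = max(steady)
print(single)
print(steady, min_registers)
```

Note that g never needs a register: it is consumed in the same cycle in which it is produced.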
Circular lifetime chart
[Figure: circular lifetime chart of the 3x3 matrix transposer.]
Forward Backward Register Allocation Technique
[Figure: the steps of the technique, not recoverable from the extracted text.]
Forward Backward - Matrix Transposer
[Figure, left: the allocation table for the 3x3 matrix transposer after steps 1 through 4 of forward-backward register allocation have been performed (forward allocations only).]
[Figure, right: the allocation table for the 3x3 matrix transposer after the allocation has been completed (forward and backward allocations, outputs marked Out).]
Folded Architecture for Matrix Transposer
[Figure sequence: four registers R1, R2, R3, R4 between IN and OUT. The architecture is derived from the allocation table step by step: the direct path (output = input), the outputs taken from R2 and from R4, and further connections labelled R1 and R3.]
Controller for Folded Architecture
[Figure: the folded matrix transposer with a controller for the switches.]
Example: Forward Backward
• N = 6
• #reg = 3
[Figure: the allocation table and the resulting architecture, with a controller for the switches and outputs taken from R2 and from R3.]
6.4 Register Minimization in Folded Architecture
Steps:
1. Perform retiming for folding
2. Write the folding equations
3. Use the folding equations to construct a lifetime table
4. Draw the lifetime chart and determine the required number of registers
5. Perform forward-backward register allocation
6. Draw the folded architecture that uses the minimum number of registers.
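Steps 3 and 4 can be sketched for the biquad example with N = 4. The folding orders and pipeline levels come from the folding sets used in this chapter; the folding-equation values of the retimed DFG are an assumption (the figure is not recoverable), chosen to agree with the usual textbook form of the example. A node output is produced in cycle u + P_U and last consumed max D_F cycles later.

```python
N = 4
ORDER = {4: 0, 2: 1, 3: 2, 1: 3, 5: 0, 8: 1, 6: 2, 7: 3}   # from S1 and S2
PIPE = {1: 1, 2: 1, 3: 1, 4: 1, 5: 2, 6: 2, 7: 2, 8: 2}    # adders 1, multipliers 2

# ASSUMPTION: folding equations D_F(U -> V) of the retimed biquad (step 2).
D_F = {(1, 2): 1, (1, 5): 0, (1, 6): 2, (1, 7): 3, (1, 8): 5,
       (3, 1): 0, (4, 2): 0, (5, 3): 0, (6, 4): 0, (7, 3): 1, (8, 4): 1}


def lifetime_table(d_f):
    """Step 3: {node: (Tin, Tout)} with Tin = u + P_U and Tout = Tin + max D_F."""
    table = {}
    for (u, _v), d in d_f.items():
        t_in = ORDER[u] + PIPE[u]
        t_out = max(table.get(u, (t_in, t_in))[1], t_in + d)
        table[u] = (t_in, t_out)
    return table


def min_registers(table, period):
    """Step 4: maximum number of live variables when all iterations overlap."""
    last = max(t_out for _, t_out in table.values())

    def live(cycle):
        return sum(1 for t_in, t_out in table.values() if t_in < cycle <= t_out)

    return max(sum(live(t) for t in range(phase, last + 1, period))
               for phase in range(period))


table = lifetime_table(D_F)
registers = min_registers(table, N)
print(table)
print("minimum number of registers:", registers)
```

Node 2 drives only the filter output and therefore has no entry in the table. Under these assumptions the sketch gives 2 registers, in agreement with the minimum stated for the folded biquad architecture.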
Register Minimization of Biquad filter (1/14)
Folding factor N = 4, $S_1 = \{4, 2, 3, 1\}$, $S_2 = \{5, 8, 6, 7\}$ (folding order 0 1 2 3)
[Figure: lifetime table and lifetime chart of the folded biquad with the node outputs n1 ... n8 (lifetimes of 2 cc, 3 cc and 5 cc are marked), and the allocation table in which n1 appears at the output of R1 and at the output of R2.]
Note:
The table shows that the variable n1 is output in cycle 9. Note that this variable is also output in cycles 5, 6, 7, and 8. For the sake of clarity, the table only shows the latest output time of each variable.
[Figure sequence: step-by-step construction of the folded biquad architecture with the two registers R1 and R2, using the allocation table (variables entering R1, then R2). Additions use $S_1 = \{4, 2, 3, 1\}$ and multiplications use $S_2 = \{5, 8, 6, 7\}$ (folding order 0 1 2 3); the switches close at the time instances {0}, {1}, {2}, {3}, {0,2}, {1,3}, {1,2,3} and {0,1,2}; IN and OUT are attached in the last step.]
[Figure: the two folded biquad architectures. The switch inputs are labelled with the node outputs {n1} ... {n8} and the data paths add -> add, add -> mul and mul -> add. Additions: $S_1 = \{4, 2, 3, 1\}$, multiplications: $S_2 = \{5, 8, 6, 7\}$ (folding order 0 1 2 3).]
• A folded biquad filter architecture implemented using the minimum number of registers, which is 2.
• A folded biquad filter architecture implemented without the register minimization technique, which uses 6 registers (the 3 pipelining registers that are internal to the adder and the multiplier are not counted).
IIR Filter Example
• Assume that
  - Folding factor N = 2
  - Addition and multiplication require 1 and 2 u.t., respectively.
  - 1-stage pipelined adders and 2-stage pipelined multipliers are available.
IIR Filter Example (1/4)
[Figures: the IIR filter DFG, the retiming solution, the lifetime analysis (variables n2 and n3) and the folded architecture.]
6.5 Conclusions
• Presented a systematic transformation for deriving time-multiplexed architectures.
• Explored folding techniques to reduce the number of functional units.
• Explored register minimization techniques to reduce the number of registers.
