FPGA - Ch0 - Folding
Department of Electronics (BMĐT)
Instructor: Hồ Trung Mỹ
Ch.06
Folding
References:
1. Slides from Prof. Parhi's book
2. Slides by Prof. Fredrik Edman
3. Slides by Prof. Viktor Öwall
4. Slides by Prof. Lan-Da Van
Outline
6.1 Introduction
6.2 Folding Transformation
6.3 Register Minimization Techniques
6.4 Register Minimization in Folded Architecture
6.5 Conclusions
What is folding?
Folding is the "inverse" of unfolding.
[Figure: folding by N (N = folding factor) maps the nodes A0, A1, ..., AJ-1 onto a single node A; unfolding by J (J = unfolding factor) maps node A onto the nodes A0, A1, ..., AJ-1]
Folding?
Used to minimize silicon area (trading area for time).
A systematic way to determine the control circuits in DSP
architectures: the folding transformation time-multiplexes
multiple algorithm operations onto a single functional unit.
Used for the synthesis of DSP architectures that can be operated
at single or multiple clocks.
Used to reduce the number of hardware functional units (FUs),
such as adders and multipliers, by a factor of N at the
expense of increasing the computation time by a factor of N.
Folding can lead to an architecture that uses a large number of
registers, so a register minimization technique sometimes
needs to be applied.
Hardware Mapped vs. Time multiplexed
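FIR: $y(n) = \sum_{k=0}^{N-1} h_k \, x(n-k)$
[Figure: hardware-mapped FIR filter (input x(n), delay line D, D, D, coefficients h0 to h3, output y(n)) next to a time-multiplexed FIR filter (MUX, one multiplier, accumulator register REG)]
Hardware mapped:
1 sample/cc
N fixed multipliers
N-1 adders
Time multiplexed:
N cc/sample
1 generalized multiplier
1 adder
1 coefficient memory
Controller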
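The two FIR organizations compared on this slide can be sketched in software. This is only a behavioral illustration, not the slide's hardware: the function names and the cycle counter are assumptions made for this example, and samples before n = 0 are assumed to be zero.

```python
from typing import List, Tuple


def fir_hardware_mapped(x: List[float], h: List[float]) -> List[float]:
    """Fully parallel FIR: every tap has its own multiplier, 1 sample/cc."""
    y = []
    for n in range(len(x)):
        # All N products are formed in the same clock cycle.
        products = [h[k] * x[n - k] for k in range(len(h)) if n - k >= 0]
        y.append(sum(products))
    return y


def fir_time_multiplexed(x: List[float], h: List[float]) -> Tuple[List[float], int]:
    """Folded FIR: one multiplier and one adder reused N times per sample."""
    y = []
    cycles = 0
    for n in range(len(x)):
        acc = 0  # accumulator register (REG in the figure)
        for k in range(len(h)):  # controller steps through the coefficient memory
            sample = x[n - k] if n - k >= 0 else 0
            acc += h[k] * sample  # one multiply-accumulate per clock cycle
            cycles += 1
        y.append(acc)
    return y, cycles
```

Both functions return the same samples; the time-multiplexed version spends N clock cycles per output, which is the area-for-time trade described above.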
Folding – Time-shared Architecture
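$y(n) = a(n) + b(n) + c(n)$
[Figure: left, two cascaded adders with inputs a(n), b(n), c(n) and output y(n); right, the folded architecture with a single time-shared adder]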
Folding is a technique to reduce the silicon area by time-
multiplexing many operations into single functional units.
The right figure shows a 2 times folded architecture where 2 additions
are folded, or time-multiplexed, to a single adder
Folding introduces registers/storage
Computation time increased, e.g. one output sample every 2 cc (one
input signal consumed every 2cc)
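A minimal behavioral sketch of the 2-times folded adder, assuming that the partial sum a(n) + b(n) is stored in one register between the two cycles. The function name and the cycle log are hypothetical, introduced only for this illustration.

```python
def folded_adder(a, b, c):
    """Compute y(n) = a(n) + b(n) + c(n) on one adder with folding factor N = 2."""
    outputs = []
    cycle_log = []  # (clock cycle, value produced by the shared adder)
    for n in range(len(a)):
        # Cycle 2l+0: the adder forms the partial sum, which is stored in a register.
        reg = a[n] + b[n]
        cycle_log.append((2 * n, reg))
        # Cycle 2l+1: the same adder adds c(n) to the stored partial sum.
        y = reg + c[n]
        cycle_log.append((2 * n + 1, y))
        outputs.append(y)
    return outputs, cycle_log
```

Calling `folded_adder([1, 2], [10, 20], [100, 200])` produces two outputs in four clock cycles, that is, one output sample every 2 cc.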
Folding Example – A more detailed look!
$y(n) = a(n) + b(n) + c(n)$
[Figure: cycle-by-cycle operation of the folded adder; the switches are labeled with the time partitions 2l+0 and 2l+1]
Cycle 0 (2l+0): inputs a(0) and b(0)
Cycle 1 (2l+1): input c(0) is added to the stored a(0)+b(0); the output register holds y(-1)
Cycle 2 (2l+0): inputs a(1) and b(1)
Cycle 3 (2l+1): input c(1); the output register holds a(0)+b(0)+c(0)
What’s related to folding
Hardware is reduced by folding by N
T_computation is increased by a factor of N (higher latency)
Extremes
Fully parallel
Time multiplexed = 1 unit per algorithmic operation
Folding
extra registers, i.e. extra storage
a more complex control unit
more latency
Folding Transformation
Folding Set
A folding set is an ordered set of operations to be executed on
the same functional unit.
Folding sets are typically obtained from a scheduling and
allocation algorithm (ref. Appendix B).
The folding sets represent the underlying folding transformation.
Each set contains N entries, N = folding factor.
Example
Folding order: 0, 1, 2 (… N-1)
$S_1 = \{A_1, \emptyset, A_2\}$, where $\emptyset$ denotes a null operation (an idle slot)
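A folding set can be represented as an ordered list whose index is the folding order. The sketch below uses the slide's example with N = 3 and assumes that the middle entry is a null operation, written as `None`; the helper name `folding_orders` is hypothetical.

```python
from typing import Dict, List, Optional, Tuple


def folding_orders(
    folding_sets: Dict[str, List[Optional[str]]]
) -> Dict[str, Tuple[str, int]]:
    """Map each operation to (functional unit, folding order)."""
    placement = {}
    for unit, operations in folding_sets.items():
        for order, op in enumerate(operations):
            if op is not None:  # None marks a null operation (idle time slot)
                placement[op] = (unit, order)
    return placement


# Example from the slide: A1 has folding order 0, A2 has folding order 2.
sets = {"S1": ["A1", None, "A2"]}
placement = folding_orders(sets)
```

In the notation used on the later slides, `placement["A2"] == ("S1", 2)` corresponds to writing (S1|2) next to node A2.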
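Folding Transformation
N = folding factor (number of operations folded to a single HW unit)
l = iteration
[Figure: an edge e from node U to node V with w(e) delays in the original DFG. In the folded architecture, U executes at time unit Nl + u on functional unit H_U (P_U pipeline delays) and V executes at time unit Nl + v on functional unit H_V (P_V pipeline delays); the two units are connected through $D_F(U \xrightarrow{e} V)$ delays.]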
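The quantities on this slide are tied together by the standard folding equation. The derivation below is a sketch in the slide's notation, assuming that u and v are the folding orders of U and V and that the result of U becomes available P_U cycles after U starts.

```latex
% Iteration l of U starts at Nl + u, so its result is ready at Nl + u + P_U.
% That result is used by iteration l + w(e) of V, which starts at N(l + w(e)) + v.
% The number of delays needed on the folded edge is the difference:
\begin{aligned}
D_F(U \xrightarrow{e} V) &= \bigl[N(l + w(e)) + v\bigr] - \bigl[Nl + u + P_U\bigr] \\
                         &= N\,w(e) - P_U + v - u, \qquad 0 \le u, v \le N-1 .
\end{aligned}
```

The result does not depend on the iteration index l, which is what allows a single time-multiplexed circuit to serve all iterations.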
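[Figure: biquad filter DFG with input In, output Out, nodes 1 to 8, multiplier coefficients a, b, c, d, and delay elements D]
[Wiki] A digital biquad filter is a second order recursive linear filter (IIR filter), containing two poles and two zeros. "Biquad" is an abbreviation of "biquadratic", which refers to the fact that in the Z domain, its transfer function is the ratio of two quadratic functions:
$H(z) = \frac{b_0 + b_1 z^{-1} + b_2 z^{-2}}{a_0 + a_1 z^{-1} + a_2 z^{-2}}$
(the coefficients in this formula follow the usual textbook notation and are not the multiplier labels a to d in the figure)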
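Ex. Folding of Biquad filter
[Figure: biquad filter DFG annotated with the folding sets: node 1 (S1|3), node 2 (S1|1), node 3 (S1|2), node 4 (S1|0), node 5 (S2|0), node 6 (S2|2), node 7 (S2|3), node 8 (S2|1)]
$T_{adder} = 1$, $T_{mult} = 2$
$P_{adder} = 1$, $P_{mult} = 2$
Additions: $S_1 = \{4, 2, 3, 1\}$
Multiplications: $S_2 = \{5, 8, 6, 7\}$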
Ex. Folding of Biquad filter, N = 4
Folding equations (v: folding order of the receiving node, u: folding order of the sending node)
$0 \le u, v \le N-1$
[Figure: biquad filter DFG annotated with the folding sets (S1|0) to (S1|3) and (S2|0) to (S2|3)]
$D_F(U \to V) < 0$: not a valid folding
A delay between two edges cannot be negative!
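The check on this slide can be reproduced numerically. The sketch below evaluates N·w(e) − P_U + v − u for every edge of the biquad filter, using the folding sets and pipeline depths given on the slides. The edge list itself is an assumption: the figure did not survive extraction, so the edges and delay counts are reconstructed from the standard textbook biquad example.

```python
N = 4  # folding factor

# Folding sets from the slides; the list index is the folding order.
S1 = [4, 2, 3, 1]  # additions, executed on the adder
S2 = [5, 8, 6, 7]  # multiplications, executed on the multiplier

order = {node: i for s in (S1, S2) for i, node in enumerate(s)}
P = {node: 1 for node in S1}        # P_adder = 1
P.update({node: 2 for node in S2})  # P_mult = 2

# ASSUMED edge list (U, V, w(e)) of the biquad DFG before retiming.
edges = [
    (1, 2, 0), (1, 5, 1), (1, 6, 1), (1, 7, 2), (1, 8, 2),
    (3, 1, 0), (4, 2, 0), (5, 3, 0), (6, 4, 0), (7, 3, 0), (8, 4, 0),
]


def folded_delay(U: int, V: int, w: int) -> int:
    """D_F(U -> V) = N*w(e) - P_U + v - u."""
    return N * w - P[U] + order[V] - order[U]


delays = {(U, V): folded_delay(U, V, w) for U, V, w in edges}
negative = {edge: d for edge, d in delays.items() if d < 0}
```

Under these assumptions four edges come out negative, so this folding is not valid until the DFG is retimed.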
Retiming: Folding of Biquad filter, N=4
[Figure: the biquad filter DFG with its folding sets, shown before and after the delays are redistributed. Labels in the figure: Retiming, Feedforward cutset, Pipelining, Split and move delay. In the second version additional delay elements D appear on several edges.]
Folding equations for the retimed DFG (v: receive, u: send), $0 \le u, v \le N-1$
[Figure: retimed biquad filter DFG annotated with the folding sets (S1|0) to (S1|3) and (S2|0) to (S2|3)]
$D_F(U \to V) \ge 0$: valid folding
Systematic way of Retiming for Folding
Retiming solution:
r(1) = -1 r(5) = -1
r(2) = 0 r(6) = -1
r(3) = -1 r(7) = -2
r(4) = 0 r(8) = -1
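The retiming values above can be checked against the usual retiming-for-folding constraint, r(U) − r(V) ≤ ⌊D_F(U→V)/N⌋, which follows from requiring N·w_r(e) − P_U + v − u ≥ 0 with w_r(e) = w(e) + r(V) − r(U). In this sketch the folding orders, pipeline depths, and r values come from the slides, while the edge list is an assumed reconstruction of the biquad DFG (the figure is not recoverable from the text).

```python
N = 4
order = {4: 0, 2: 1, 3: 2, 1: 3, 5: 0, 8: 1, 6: 2, 7: 3}  # from S1 and S2
P = {1: 1, 2: 1, 3: 1, 4: 1, 5: 2, 6: 2, 7: 2, 8: 2}      # adders 1, multipliers 2

# ASSUMED edge list (U, V, w(e)) of the biquad DFG before retiming.
edges = [
    (1, 2, 0), (1, 5, 1), (1, 6, 1), (1, 7, 2), (1, 8, 2),
    (3, 1, 0), (4, 2, 0), (5, 3, 0), (6, 4, 0), (7, 3, 0), (8, 4, 0),
]

# Retiming solution listed on the slide.
r = {1: -1, 2: 0, 3: -1, 4: 0, 5: -1, 6: -1, 7: -2, 8: -1}


def folded_delay(U: int, V: int, w: int) -> int:
    """D_F(U -> V) = N*w(e) - P_U + v - u."""
    return N * w - P[U] + order[V] - order[U]


# Constraint check; // is floor division, also for negative numbers.
constraints_ok = all(
    r[U] - r[V] <= folded_delay(U, V, w) // N for U, V, w in edges
)

# Retimed edge weights and the folded delays they produce.
retimed_edges = [(U, V, w + r[V] - r[U]) for U, V, w in edges]
retimed_delays = {(U, V): folded_delay(U, V, wr) for U, V, wr in retimed_edges}
```

With these inputs every retimed edge weight and every folded delay is nonnegative, which matches the "valid folding" verdict for the retimed filter.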
[Figure, built up over several slides: construction of the folded biquad filter architecture for N = 4. The adder executes the additions $S_1 = \{4, 2, 3, 1\}$ and the multiplier executes the multiplications $S_2 = \{5, 8, 6, 7\}$, both with folding orders 0, 1, 2, 3. The switches are labeled with the time partitions {0}, {1}, {2}, {3}, and {0,2}, and the delay elements are labeled D and 2D.]
[Figure: lifetime chart examples, each annotated with "2 live variables" and a duration of 6 cc]
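[Figure: 3x3 matrix transposer]
Input matrix:
a b c
d e f
g h i
Output matrix:
a d g
b e h
c f i
Input stream: i | h | g | f | e | d | c | b | a (a enters first)
Output stream: i | f | c | h | e | b | g | d | a (a leaves first)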
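The lifetime analysis for this transposer can be sketched as follows. One sample is assumed to arrive per clock cycle, and a variable is assumed to occupy a register from the cycle after it arrives until the cycle in which it is output; the variable names (`t_zlout`, `latency`, and so on) are chosen for this illustration.

```python
inputs = "abcdefghi"   # row-major arrival order, one sample per cycle
outputs = "adgbehcfi"  # transposed (column-major) departure order

t_in = {s: t for t, s in enumerate(inputs)}
t_zlout = {s: t for t, s in enumerate(outputs)}  # zero-latency output times

# A negative difference would mean a sample leaves before it arrives,
# so every output is shifted by the smallest latency that makes it causal.
t_diff = {s: t_zlout[s] - t_in[s] for s in inputs}
latency = -min(t_diff.values())
t_out = {s: t_zlout[s] + latency for s in inputs}

# Count live variables per cycle; matrices arrive back to back,
# so the lifetimes repeat with a period of 9 cycles.
period = len(inputs)
live = [0] * period
for s in inputs:
    for t in range(t_in[s] + 1, t_out[s] + 1):
        live[t % period] += 1

min_registers = max(live)
```

Under these assumptions the latency is 4 cycles and at most 4 variables are live at once, which agrees with the four registers R1 to R4 in the folded architecture shown later.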
Forward Backward Register Allocation Technique
[Figure: allocation tables for the matrix transposer with input stream i | h | g | f | e | d | c | b | a and output stream i | f | c | h | e | b | g | d | a; the allocation steps are labeled Forward, Backward, and Out]
Left: the allocation table for the 3x3 matrix transposer after steps 1 through 4 of forward-backward register allocation have been performed.
Right: the allocation table for the 3x3 matrix transposer after the allocation has been completed.
Folded Architecture for Matrix Transposer
[Figure, built up over six slides: register chain IN, R1, R2, R3, R4 with a switched output OUT. The build-up steps are labeled: output = input; outputs from R2; outputs from R4; R1; R3.]
Controller for Folded Architecture
[Figure: controller for the switches of the folded architecture]
Example: Forward Backward Register Allocation Technique
N = 6, #reg = 3
[Figure: allocation example followed by the resulting folded architecture with a controller for the switches; outputs are taken from R2 and from R3]
6.4 Register Minimization in Folded Architecture
Steps:
1. Perform retiming for folding
2. Write the folding equations
3. Use the folding equations to construct a lifetime table
4. Draw the lifetime chart and determine the required
number of registers
5. Perform forward-backward register allocation
6. Draw the folded architecture that uses the minimum
number of registers.
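Steps 2 to 4 can be sketched for the biquad example. The assumptions are: a variable produced by node U becomes available at u + P_U, it is last needed at u + P_U + max D_F(U→V), and it occupies a register from the cycle after it is produced. The retimed edge list is a reconstruction of the textbook example (the figures are not recoverable from the text); folding orders and pipeline depths are taken from the slides.

```python
N = 4
order = {4: 0, 2: 1, 3: 2, 1: 3, 5: 0, 8: 1, 6: 2, 7: 3}  # from S1 and S2
P = {1: 1, 2: 1, 3: 1, 4: 1, 5: 2, 6: 2, 7: 2, 8: 2}      # adders 1, multipliers 2

# ASSUMED retimed edge list (U, V, w_r(e)).
retimed_edges = [
    (1, 2, 1), (1, 5, 1), (1, 6, 1), (1, 7, 1), (1, 8, 2),
    (3, 1, 0), (4, 2, 0), (5, 3, 0), (6, 4, 1), (7, 3, 1), (8, 4, 1),
]


def folded_delay(U: int, V: int, w: int) -> int:
    """Step 2: D_F(U -> V) = N*w(e) - P_U + v - u."""
    return N * w - P[U] + order[V] - order[U]


# Step 3: lifetime table, node -> (T_input, T_output).
lifetime = {}
for U, V, w in retimed_edges:
    t_input = order[U] + P[U]
    t_output = t_input + folded_delay(U, V, w)
    previous = lifetime.get(U, (t_input, t_input))[1]
    lifetime[U] = (t_input, max(previous, t_output))

# Step 4: live variables per cycle, repeating with period N.
live = [0] * N
for t_input, t_output in lifetime.values():
    for t in range(t_input + 1, t_output + 1):
        live[t % N] += 1

registers = max(live)
```

With these inputs node 1 has the lifetime 4 to 9 and the maximum number of live variables is 2, consistent with the two registers R1 and R2 used on the following slides.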
Register Minimization of Biquad filter (1/14 to 3/14)
[Figure: biquad filter example with its folding sets]
Additions: $S_1 = \{4, 2, 3, 1\}$ (folding orders 0, 1, 2, 3)
Multiplications: $S_2 = \{5, 8, 6, 7\}$ (folding orders 0, 1, 2, 3)
Register Minimization of Biquad filter (4/14)
Folding factor N = 4
[Figure: lifetime table and allocation for variables such as n1 and n8, annotated with durations of 2 cc, 3 cc, and 5 cc; n1 is read from the output of R1 and later from the output of R2]
Note:
The table shows that the variable n1 is output in cycle 9. Note that this variable is also output in cycles 5, 6, 7, and 8. For the sake of clarity, the table only shows the latest output time of each variable.
Register Minimization of Biquad filter (5/14 to 8/14)
[Figure, built up over four slides: registers R1 and R2 are added to the folded architecture; the slides mark the variables entering R1, then those entering R2, and then a switch labeled {1}. Duration annotations: 2 cc, 3 cc, 5 cc.]
Additions: $S_1 = \{4, 2, 3, 1\}$ (folding orders 0, 1, 2, 3)
Multiplications: $S_2 = \{5, 8, 6, 7\}$ (folding orders 0, 1, 2, 3)
Register Minimization of Biquad filter (9/14 to 13/14)
[Figure, built up over five slides: switches labeled {0}, {1}, {1,3}, {1,2,3}, and {0,1,2} are added around R1 and R2, ending with the output OUT. Duration annotations: 2 cc, 3 cc, 5 cc.]
Additions: $S_1 = \{4, 2, 3, 1\}$ (folding orders 0, 1, 2, 3)
Multiplications: $S_2 = \{5, 8, 6, 7\}$ (folding orders 0, 1, 2, 3)
Register Minimization of Biquad filter (14/14)
[Figure: the two folded biquad filter architectures side by side; switch inputs are labeled {n1}, {n5}, {n7, n8}, {n6, n7, n8}, and {n2, n3, n4}]
A folded biquad filter architecture implemented using the minimum number of registers, which is 2.
A folded biquad filter architecture implemented without the register minimization technique uses 6 registers (the 3 pipelining registers that are internal to the adder and the multiplier are not counted).
Additions: $S_1 = \{4, 2, 3, 1\}$ (folding orders 0, 1, 2, 3)
Multiplications: $S_2 = \{5, 8, 6, 7\}$ (folding orders 0, 1, 2, 3)
IIR Filter Example
Assume that
Folding factor N = 2
Addition and multiplication require 1 and 2 u.t., respectively.
1-stage adders and 2-stage pipelined multipliers are available
IIR Filter Example (1/4 to 4/4)
[Figures: (1/4) retiming solution; (2/4) and (4/4) not recoverable; (3/4) a diagram involving the variables n2 and n3]
6.5 Conclusions
Presented a systematic transformation for deriving time-multiplexed architectures
Explored folding techniques to reduce the number of functional units
Explored register minimization techniques to reduce the number of registers