FPGA - Ch0 - Folding

Ho Chi Minh City University of Technology (ĐHBK Tp HCM)

Department of Electronics (BMĐT)
Lecturer: Hồ Trung Mỹ

Ch.06
Folding

References:
1. Slides from Prof. Parhi's book
2. Slides by Prof. Fredrik Edman
3. Slides by Prof. Viktor Öwall
4. Slides by Prof. Lan-Da Van
Outline
6.1 Introduction
6.2 Folding Transformation
6.3 Register Minimization Techniques
6.4 Register Minimization in Folded Architecture
6.5 Conclusions

What is folding?
Folding is the "inverse" of unfolding.

[Figure: unfolding by J (J = unfolding factor) expands node A into nodes A0, A1, ..., AJ-1; folding by N (N = folding factor) maps such nodes back onto a single node A.]
Folding?
• Used to minimize silicon area at the cost of a longer computation time.
• A systematic way to derive the control circuits in DSP architectures: the folding transformation time-multiplexes multiple algorithm operations onto a single functional unit.
• Used for the synthesis of DSP architectures that can be operated with single or multiple clocks.
• Used to reduce the number of hardware functional units (FUs), such as adders and multipliers, by a factor of N at the expense of increasing the computation time by a factor of N.
• Folding can lead to an architecture that uses a large number of registers, so a register minimization technique sometimes needs to be applied.

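Hardware Mapped vs. Time multiplexed
FIR: $y(n) = \sum_{k=0}^{N-1} h_k \, x(n-k)$

[Figure: FIR filter block diagrams with input x(n), delay line D D D, coefficients h0 to h3, a MUX, a coefficient input c, a register REG, and output y(n).]

Hardware mapped:
• 1 sample/cc
• N fixed multipliers
• N-1 adders

Time multiplexed:
• N cc/sample
• 1 generalized multiplier
• 1 adder
• 1 coefficient memory
• Controller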
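The two FIR realizations can be compared with a small behavioural model. This is only a sketch: the function names, the test coefficients, and the zero initial state are assumptions made for illustration, and clock cycles are simply counted rather than simulated at register level.

```python
def fir_hardware_mapped(h, x):
    """Hardware-mapped FIR: all N products are formed in the same clock cycle."""
    N = len(h)
    y = []
    for n in range(len(x)):
        # N fixed multipliers and N-1 adders work in parallel: 1 sample/cc.
        y.append(sum(h[k] * x[n - k] for k in range(N) if n - k >= 0))
    return y


def fir_time_multiplexed(h, x):
    """Time-multiplexed FIR: one multiplier, one adder, one accumulator register."""
    N = len(h)
    y, cycles = [], 0
    for n in range(len(x)):
        acc = 0  # accumulator register, cleared for every new sample
        for k in range(N):  # the controller steps the MUX and the coefficient memory
            sample = x[n - k] if n - k >= 0 else 0  # assumed zero initial state
            acc += h[k] * sample  # one multiply-accumulate per clock cycle
            cycles += 1
        y.append(acc)
    return y, cycles


h = [1, 2, 3, 4]          # illustrative coefficients h0..h3
x = [1, 0, 2, -1, 3]      # illustrative input samples
y_par = fir_hardware_mapped(h, x)
y_mux, cycles = fir_time_multiplexed(h, x)
print(y_par, y_mux, cycles)
```

Both models produce the same output samples; the time-multiplexed one needs N = 4 clock cycles per sample, which is the area-for-time trade listed above.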
Folding – Time-shared Architecture

[Figure: left, two adders computing y(n) from a(n), b(n), c(n); right, the folded architecture with a single adder.]

$y(n) = a(n) + b(n) + c(n)$
• Folding is a technique to reduce the silicon area by time-multiplexing many operations onto a single functional unit.
• The right figure shows a 2-times folded architecture in which the 2 additions are folded, or time-multiplexed, onto a single adder.
• Folding introduces registers/storage.
• Computation time increases, e.g. one output sample every 2 cc (one input sample consumed every 2 cc).
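A minimal behavioural sketch of the 2-times folded adder is given here, assuming an unpipelined adder and a single register for the partial sum (the function name and the sample values are illustrative, not taken from the slides):

```python
def folded_adder(a, b, c):
    """Share one adder between both additions of y(n) = a(n) + b(n) + c(n)."""
    reg = 0        # register that stores the partial sum a(l) + b(l)
    outputs = []
    for cycle in range(2 * len(a)):
        l, phase = divmod(cycle, 2)   # cycle = 2l + phase
        if phase == 0:
            # time instance 2l+0: the switches route a(l) and b(l) to the adder
            reg = a[l] + b[l]
        else:
            # time instance 2l+1: the switches route the stored sum and c(l)
            outputs.append(reg + c[l])
    return outputs


y = folded_adder([1, 2, 3], [10, 20, 30], [100, 200, 300])
print(y)
```

One output is produced every 2 clock cycles, and the register is the extra storage that folding introduces.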
Folding Example – A more detailed look!

$y(n) = a(n) + b(n) + c(n)$

[Figure: cycle-by-cycle operation of the folded adder; the switches are labelled with the time instances 2l+0 and 2l+1.]
• Cycles 0 and 1: inputs a(0), b(0), c(0); the register holds a(0)+b(0); y(-1) appears at the output register.
• Cycles 2 and 3: inputs a(1), b(1), c(1); a(0)+b(0)+c(0) appears at the output register.

What’s related to folding
• Reduce hardware by N-folding
• Computation time T_computation increases by a factor of N, so latency increases
• Extremes
  • Fully parallel = 1 unit per algorithmic operation
  • Fully time-multiplexed = 1 unit shared by all operations of the same type
• Folding leads to
  • extra registers, i.e. extra storage
  • a more complex control unit
  • more latency

Folding Transformation

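Folding Set
• A folding set is an ordered set of operations to be executed on the same functional unit.
• Folding sets are typically obtained from a scheduling and allocation algorithm (ref. Appendix B).
• The folding sets represent the underlying folding transformation.
• Each set contains N entries, where N = folding factor. The position of an operation in the set (0, 1, 2, ..., N-1) is its folding order.
• Example (N = 3): $S_1 = \{A_1, \varnothing, A_2\}$, where $\varnothing$ is a null operation. A1 has folding order 0, no operation is executed at folding order 1, and A2 has folding order 2.
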
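Folding Transformation
• N = folding factor, i.e. the number of operations folded onto a single hardware unit; l = iteration index.
• Original DFG: an edge e from node U to node V with w(e) delays.
• The l-th iteration of U is executed at time unit Nl + u on hardware unit H_U, and the l-th iteration of V at time unit Nl + v on H_V, where u and v are the folding orders, 0 ≤ u, v ≤ N − 1.
• H_U is pipelined by P_U levels, so the result of U is available at time unit Nl + u + P_U. It is consumed by iteration l + w(e) of V at time unit N(l + w(e)) + v.
• Number of delays on the corresponding edge of the folded graph:
  $D_F(U \xrightarrow{e} V) = N\,w(e) - P_U + v - u$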
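The folding equation can be checked against the schedule with a few lines of Python. The edge used here is hypothetical (N = 4, one delay, a 1-stage unit, folding orders 3 and 1), and the helper name `folded_delay` is an assumption for illustration only.

```python
def folded_delay(N, w, P_u, u, v):
    """Folding equation: D_F(U -> V) = N*w(e) - P_U + v - u."""
    return N * w - P_u + v - u


# Hypothetical edge: folding factor 4, w(e) = 1, U on a 1-stage unit
# with folding order u = 3, V with folding order v = 1.
N, w, P_u, u, v = 4, 1, 1, 3, 1
d = folded_delay(N, w, P_u, u, v)

# Cross-check against the schedule for a few iterations l.
for l in range(3):
    produced = N * l + u + P_u      # result of U (iteration l) leaves the pipeline
    consumed = N * (l + w) + v      # V (iteration l + w) reads that result
    assert consumed - produced == d

print(d)
```

The number of folded delays is independent of the iteration index l, which is why one register chain per edge suffices in the folded architecture.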
Ex. Folding of Biquad filter

[Wiki] A digital biquad filter is a second-order recursive linear filter (IIR filter), containing two poles and two zeros. "Biquad" is an abbreviation of "biquadratic", which refers to the fact that in the Z domain, its transfer function is the ratio of two quadratic functions.

[Figure: biquad filter DFG with nodes 1 to 8, multiplier coefficients a, b, c, d, delay elements D, input In and output Out.]
Ex. Folding of Biquad filter

[Figure: biquad filter DFG annotated with the folding sets. Nodes 4, 2, 3, 1 carry (S1|0), (S1|1), (S1|2), (S1|3); nodes 5, 8, 6, 7 carry (S2|0), (S2|1), (S2|2), (S2|3).]

• Additions: T_adder = 1, P_adder = 1, S1 = {4, 2, 3, 1}
• Multiplications: T_mult = 2, P_mult = 2, S2 = {5, 8, 6, 7}
Ex. Folding of Biquad filter, N = 4
Folding equations (terms labelled "receive" and "send"), with 0 ≤ u, v ≤ N − 1

[Figure: the biquad DFG annotated with the folding sets (S1|0) to (S1|3) and (S2|0) to (S2|3).]

The requirement D_F(U → V) ≥ 0 is not met on every edge: not a valid folding.
The number of delays on an edge cannot be negative!
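The invalid folding can be reproduced numerically. The edge list below is an assumption: the DFG figure did not survive extraction, so the edges and delay counts are reconstructed from the standard textbook biquad that matches the node numbering, folding sets, and retiming values on these slides. The folding sets and pipeline levels are the ones stated above.

```python
N = 4
P = {"add": 1, "mul": 2}            # pipeline levels: P_adder = 1, P_mult = 2
S1 = [4, 2, 3, 1]                   # adder folding set
S2 = [5, 8, 6, 7]                   # multiplier folding set
order = {node: i for s in (S1, S2) for i, node in enumerate(s)}
kind = {**{n: "add" for n in S1}, **{n: "mul" for n in S2}}

# ASSUMED edge list (U, V, w(e)) of the biquad before retiming.
edges = [(1, 2, 0), (1, 5, 1), (1, 6, 1), (1, 7, 2), (1, 8, 2), (3, 1, 0),
         (4, 2, 0), (5, 3, 0), (6, 4, 0), (7, 3, 0), (8, 4, 0)]


def folded_delay(U, V, w):
    """D_F(U -> V) = N*w(e) - P_U + v - u."""
    return N * w - P[kind[U]] + order[V] - order[U]


delays = {(U, V): folded_delay(U, V, w) for U, V, w in edges}
negative = {edge: d for edge, d in delays.items() if d < 0}
print(negative)
```

Under these assumptions four edges get a negative number of folded delays, so the DFG has to be retimed before it can be folded with these folding sets.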
Retiming: Folding of Biquad filter, N=4

[Figure sequence: the annotated biquad DFG is retimed for folding. Labels: "Retiming: split and move delay" and "Feedforward cutset → Pipelining". The second figure shows the additional delays D that result.]
Retiming: Folding of Biquad filter, N=4
Folding equations (terms labelled "receive" and "send"), with 0 ≤ u, v ≤ N − 1

[Figure: the retimed biquad DFG annotated with the folding sets (S1|0) to (S1|3) and (S2|0) to (S2|3).]

D_F(U → V) ≥ 0 holds on every edge: valid folding.
Systematic way of Retiming for Folding

Retiming for folding constraint: for every edge U → V, r(U) − r(V) ≤ ⌊D_F(U → V) / N⌋.

• Then solve the system of inequalities!
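The retiming for folding constraint follows from requiring a nonnegative number of folded delays after retiming. A short derivation, assuming the usual retiming convention $w_r(e) = w(e) + r(V) - r(U)$ (the convention itself is not spelled out in the extracted slide text):

```latex
\begin{align*}
D_F'(U \xrightarrow{e} V) &= N\,w_r(e) - P_U + v - u \\
  &= N\bigl(w(e) + r(V) - r(U)\bigr) - P_U + v - u \\
  &= D_F(U \xrightarrow{e} V) + N\bigl(r(V) - r(U)\bigr) \\[4pt]
D_F'(U \xrightarrow{e} V) \ge 0
  &\iff r(U) - r(V) \le \frac{D_F(U \xrightarrow{e} V)}{N} \\
  &\iff r(U) - r(V) \le \left\lfloor \frac{D_F(U \xrightarrow{e} V)}{N} \right\rfloor
\end{align*}
```

The floor can be taken in the last step because the retiming values r(U) and r(V) are integers.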
Using either the Bellman-Ford or the Floyd-Warshall algorithm, we find that the set of constraints has a solution, and one solution is given by the following retiming values.

Retiming solution:
r(1) = -1    r(5) = -1
r(2) = 0     r(6) = -1
r(3) = -1    r(7) = -2
r(4) = 0     r(8) = -1
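The constraint system can be solved with a small Bellman-Ford routine, sketched here under stated assumptions: the edge list of the unretimed biquad is reconstructed (the figure was lost in extraction), and the constraint graph uses an implicit virtual source connected to every node with weight 0. All names are illustrative.

```python
N = 4
P = {"add": 1, "mul": 2}
S1, S2 = [4, 2, 3, 1], [5, 8, 6, 7]
order = {n: i for s in (S1, S2) for i, n in enumerate(s)}
kind = {**{n: "add" for n in S1}, **{n: "mul" for n in S2}}

# ASSUMED edge list (U, V, w(e)) of the biquad before retiming.
edges = [(1, 2, 0), (1, 5, 1), (1, 6, 1), (1, 7, 2), (1, 8, 2), (3, 1, 0),
         (4, 2, 0), (5, 3, 0), (6, 4, 0), (7, 3, 0), (8, 4, 0)]


def folded_delay(U, V, w):
    return N * w - P[kind[U]] + order[V] - order[U]


# r(U) - r(V) <= floor(D_F / N) becomes an edge V -> U with that weight.
constraints = [(V, U, folded_delay(U, V, w) // N) for U, V, w in edges]


def bellman_ford(nodes, cons):
    dist = {n: 0 for n in nodes}    # distances from the virtual source
    for _ in range(len(nodes)):
        changed = False
        for src, dst, weight in cons:
            if dist[src] + weight < dist[dst]:
                dist[dst] = dist[src] + weight
                changed = True
        if not changed:
            return dist             # shortest distances are a feasible retiming
    return None                     # negative cycle: no solution exists


r = bellman_ford(sorted(order), constraints)
retimed = {(U, V): folded_delay(U, V, w + r[V] - r[U]) for U, V, w in edges}
print(r)
print(retimed)
```

With the assumed edge list, the shortest-path solution coincides with the retiming values listed above, and every folded delay of the retimed graph is nonnegative.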
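Recall: Folding Transformation
• N = folding factor, i.e. the number of operations folded onto a single hardware unit; l = iteration index.
• Edge U → V with w(e) delays; U is executed at time unit Nl + u on hardware unit H_U (P_U pipeline levels), V at time unit Nl + v on H_V, with 0 ≤ u, v ≤ N − 1.
• Delays in the folded graph: $D_F(U \xrightarrow{e} V) = N\,w(e) - P_U + v - u$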
[Figure sequence: step-by-step construction of the folded biquad architecture from the folding equations. Folding sets: additions S1 = {4, 2, 3, 1}, multiplications S2 = {5, 8, 6, 7}, folding orders 0 to 3. The slides successively add a register chain D, D, D, 2D with taps switched at {0}, {2}, {3}, {1}, one further register D, the switch instances {0,2}, {1,3}, {1}, {2}, {3}, and finally the In and Out connections.]
6.3 Register Minimization Techniques
• Folding may insert registers. Lifetime analysis is used to minimize the number of registers in DSP hardware.
• A data sample (also called a variable) is live from the time it is produced until the time it is consumed. After that it is dead.
• Linear lifetime chart: represents the lifetimes of the variables in a linear fashion.
• Convention: a variable is
  • not live during the clock cycle when it is produced, so it does not need to be stored during that cycle,
  • but live during the clock cycle when it is consumed.

[Figure: linear lifetime chart; one iteration = 6 cc, so N = 6.]
Register Minimization
Max. number of live variables = min. number of registers

[Figure: linear lifetime chart with a period of 6 cc. Within a single iteration at most 2 variables are live, but 3 are live when several iterations are taken into account.]

• Use the contribution of the previous iteration to avoid drawing the lifetime chart over several iterations.
Example of a systematic way of working with lifetime charts
3x3 Matrix Transpose

$\begin{bmatrix} a & b & c \\ d & e & f \\ g & h & i \end{bmatrix}^{T} = \begin{bmatrix} a & d & g \\ b & e & h \\ c & f & i \end{bmatrix}$

Input sequence:  i | h | g | f | e | d | c | b | a  → Matrix Transposer →  output sequence:  i | f | c | h | e | b | g | d | a

One iteration = 9 clock cycles

Lifetime Table - 3x3 Matrix Transpose

Time index:  8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0
Input:       i | h | g | f | e | d | c | b | a
Output:      i | f | c | h | e | b | g | d | a

Sample  Tin  Tzlout  Tdiff
a       0    0        0
b       1    3        2
c       2    6        4
d       3    1       -2
e       4    4        0
f       5    7        2
g       6    2       -4
h       7    5       -2
i       8    8        0

A negative Tdiff (e.g. d: Tzlout = 1, Tin = 3) means the sample would be output before it is input.

Tzlout = zero-latency output time
Tdiff = Tzlout – Tinput
Toutput = Tzlout + max{-Tdiff}
3x3 Matrix Transpose

Sample  Tin  Tzlout  Tdiff  Tout  Life Period
a       0    0        0      4    0 → 4
b       1    3        2      7    1 → 7
c       2    6        4     10    2 → 10
d       3    1       -2      5    3 → 5
e       4    4        0      8    4 → 8
f       5    7        2     11    5 → 11
g       6    2       -4      6    6 → 6
h       7    5       -2      9    7 → 9
i       8    8        0     12    8 → 12

Tzlout = zero-latency output time
Tdiff = Tzlout – Tinput
Toutput = Tzlout + max{-Tdiff}
Life Period = Tin → Tout
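The lifetime table can be regenerated from the input and output orders alone. The following sketch uses only the definitions above; the variable names are illustrative.

```python
inputs = list("abcdefghi")    # samples arrive row by row, a at time 0
outputs = list("adgbehcfi")   # transposed order, a leaves first

t_in = {s: t for t, s in enumerate(inputs)}
t_zlout = {s: t for t, s in enumerate(outputs)}            # zero-latency output time
t_diff = {s: t_zlout[s] - t_in[s] for s in inputs}
latency = max(-d for d in t_diff.values())                 # max{-Tdiff}
t_out = {s: t_zlout[s] + latency for s in inputs}

for s in inputs:
    print(s, t_in[s], t_zlout[s], t_diff[s], t_out[s], f"{t_in[s]} -> {t_out[s]}")
```

The latency max{-Tdiff} = 4 is set by sample g, which is needed 4 cycles before it arrives when no latency is allowed.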
Lifetime chart 3x3 Matrix Transpose

Cycle  #live  Contribution from next iteration  Total
0      0
1      1
2      2
3      3
4      4
5      4
6      4
7      4
8      4
9      4      +0                                4
10     3      +1                                4
11     2      +2                                4
12     1      +3                                4

One iteration = 9 clock cycles
Lifetime chart over consecutive iterations (one iteration = 9 clock cycles): adding the contribution from the next iteration (+0, +1, +2, +3 in cycles 9 to 12) gives #live = 4 in every cycle from cycle 4 onward (cycles 4 to 18 in the chart).
Lifetime chart 3x3 Matrix Transpose

max #live = 4, so 4 registers (the minimum) are required.
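The register bound can be computed directly from the life periods by counting live variables modulo the iteration period. This is a sketch using the convention stated earlier (not live in the production cycle, live in the consumption cycle); the names are illustrative.

```python
from collections import Counter

period = 9   # one iteration = 9 clock cycles
life = {"a": (0, 4), "b": (1, 7), "c": (2, 10), "d": (3, 5), "e": (4, 8),
        "f": (5, 11), "g": (6, 6), "h": (7, 9), "i": (8, 12)}

linear = Counter()     # live variables of a single iteration
circular = Counter()   # live variables with overlapping iterations folded in
for t_in, t_out in life.values():
    for t in range(t_in + 1, t_out + 1):   # live in cycles Tin+1 .. Tout
        linear[t] += 1
        circular[t % period] += 1

min_registers = max(circular.values())
print([linear[t] for t in range(13)], min_registers)
```

Taking cycle numbers modulo the period adds the contribution of the neighbouring iteration automatically, so the chart never has to be drawn over several iterations.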
Circular lifetime chart

Forward Backward
Register Allocation Technique

Input sequence i | h | g | f | e | d | c | b | a → Matrix Transposer → output sequence i | f | c | h | e | b | g | d | a

[Table: the allocation table for the 3x3 matrix transposer after steps 1 through 4 of forward-backward register allocation have been performed.]

[Table: the allocation table for the 3x3 matrix transposer after the allocation has been completed.]

The table entries are marked Forward, Backward, and Out.
Folded Architecture for Matrix Transposer

[Figure sequence: the folded matrix transposer is built around a chain of four registers R1, R2, R3, R4 between IN and OUT. The successive slides add the paths to OUT: directly from the input (output = input), from R2, and from R4. The last two slides add the connections labelled R1 and R3.]

Controller for Folded Architecture

[Figure: controller for the switches]

Example: Forward Backward
Register Allocation Technique
• N = 6
• #reg = 3

[Figure: folded architecture with a controller for the switches; outputs are taken from R2 and from R3.]
6.4 Register Minimization in Folded Architecture
Steps:
1. Perform retiming for folding.
2. Write the folding equations.
3. Use the folding equations to construct a lifetime table.
4. Draw the lifetime chart and determine the required number of registers.
5. Perform forward-backward register allocation.
6. Draw the folded architecture that uses the minimum number of registers.
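Steps 2 to 4 can be sketched for the biquad example. The retimed edge weights below are an assumption (reconstructed, since the retimed DFG figure is not recoverable from the text); the folding sets and pipeline levels are the ones stated on the slides. A node's result is taken to be produced at u + P_U and last consumed max D_F cycles later.

```python
from collections import Counter

N = 4
P = {"add": 1, "mul": 2}
S1, S2 = [4, 2, 3, 1], [5, 8, 6, 7]
order = {n: i for s in (S1, S2) for i, n in enumerate(s)}
kind = {**{n: "add" for n in S1}, **{n: "mul" for n in S2}}

# ASSUMED edge list (U, V, w_r(e)) of the biquad after retiming for folding.
edges = [(1, 2, 1), (1, 5, 1), (1, 6, 1), (1, 7, 1), (1, 8, 2), (3, 1, 0),
         (4, 2, 0), (5, 3, 0), (6, 4, 1), (7, 3, 1), (8, 4, 1)]


def folded_delay(U, V, w):
    """Step 2, folding equation: D_F(U -> V) = N*w(e) - P_U + v - u."""
    return N * w - P[kind[U]] + order[V] - order[U]


# Step 3, lifetime table: Tin = u + P_U, Tout = Tin + max D_F over outgoing edges.
lifetimes = {}
for U in sorted(order):
    out = [folded_delay(U, V, w) for src, V, w in edges if src == U]
    if out:                          # node 2 drives only the filter output
        t_in = order[U] + P[kind[U]]
        lifetimes[U] = (t_in, t_in + max(out))

# Step 4, lifetime chart folded modulo N.
live = Counter()
for t_in, t_out in lifetimes.values():
    for t in range(t_in + 1, t_out + 1):
        live[t % N] += 1
min_registers = max(live.values())
print(lifetimes, min_registers)
```

Under these assumptions the output of node 1 is live from cycle 4 to cycle 9 and the bound evaluates to 2 registers.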
Register Minimization of Biquad filter (1/14 to 4/14)

[Figures: the retimed biquad DFG and its folding equations, with the folding sets S1 = {4, 2, 3, 1} (additions) and S2 = {5, 8, 6, 7} (multiplications), folding orders 0 to 3.]

Folding factor N = 4

[Figure: lifetime table and chart for the node outputs n1 to n8. The chart marks intervals of 2 cc, 3 cc, and 5 cc, and labels n1 as the output of R1 and of R2.]

Note: The table shows that the variable n1 is output in cycle 9. Note that this variable is also output in cycles 5, 6, 7, and 8. For the sake of clarity, the table only shows the latest output time of each variable.
[Figure sequence (5/14 to 13/14): step-by-step construction of the folded biquad architecture with two registers R1 and R2. The switch labels that appear are {0}, {2}, {3}, {1}, {1,2,3}, {1,3}, {0,2}, and {0,1,2}; the register inputs are marked "entering R1" and "entering R2"; IN and OUT are attached in the last step. The intervals of 2 cc, 3 cc, and 5 cc and the folding sets S1 = {4, 2, 3, 1} and S2 = {5, 8, 6, 7} are repeated on each slide.]
Register Minimization of Biquad filter (14/14)

[Figure: a folded biquad filter architecture implemented using the minimum number of registers, which is 2. The switch inputs are labelled with the variables they carry, e.g. {n1}, {n5}, {n1, n2}, {n3, n4}, {n5, n6}, {n7, n8}, {n2, n3, n4}, {n6, n7, n8}, and with the edge types add→add, add→mul, mul→add.]

[Figure: a folded biquad filter architecture implemented without using the register minimization technique, which uses 6 registers (the 3 pipelining registers that are internal to the adder and the multiplier are not counted).]

Folding sets: additions S1 = {4, 2, 3, 1}, multiplications S2 = {5, 8, 6, 7}, folding orders 0 to 3.
IIR Filter Example
• Assume that
  • Folding factor N = 2
  • Addition and multiplication require 1 and 2 u.t. (units of time), respectively
  • 1-stage pipelined adders and 2-stage pipelined multipliers are available

IIR Filter Example (1/4)
[Figure: retiming solution]

IIR Filter Example (2/4)

IIR Filter Example (3/4)
[Figure with labels n2 and n3]

IIR Filter Example (4/4)

6.5 Conclusions
• Presented a systematic transformation for deriving time-multiplexed architectures
• Explored folding techniques to reduce the number of functional units
• Explored register minimization techniques to reduce the number of registers