
ECE 565

High-Level Synthesis: An Introduction

Shantanu Dutt
ECE Dept., UIC

HLS Flow
Code/Algorithm → Architecture (interconnected functional units (FUs) and memory units (MUs), connected via muxes, demuxes, tristate buffers, buses, and dedicated interconnects)

Classically, these 3 stages were performed sequentially, but they are currently performed together (which leads to better optimization).

HLS Flow (contd)

HLS Flow (contd)

(Binding)

Allocation: Simple counting of FUs after the above 2 stages

Simple HLS Examples

Simple HLS Examples (contd)


2) Mapping to h/w w/ constraints: use only 1 (X) and 1 (+) w/ X delay of 2
ccs and + delay of 1 cc
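(a) Scheduling
i) Non-overlapped pipelined scheduling
[Figure: schedule over ccs 1, 2, 3, ... showing operations c1, c2, c3 for iteration 1 (c1(1), c2(1), c3(1)) and iteration 2 (c1(2), c2(2), c3(2))]

(b) Arch. Synthesis
[Figure: datapath with input registers a, b, c, d (load signals lda, ldb, ldc, ldd), one multiplier feeding register x (ldx), and one adder whose inputs come from mux1 (I0 = d, I1 = x) and mux2 (I0 = c, I1 = y) and whose output goes through a demux (outputs O0, O1) to registers y (ldy) and z (ldz)]

Note: Unspecified control signals have either an inactive value or, if such a concept doesn't exist for the control signal, the don't-care value.

(c) Controller FSM Synthesis
[Figure: controller FSM entered on Reset, with states labeled cc 3i, cc 3(i+1) and cc 3(i+2)]
Control-signal settings shown in the FSM:
mux1=0, mux2=0, demux=0, ldy=1   [y ← c+d] (c2)
ldx=1   [x ← a × b] (c1)
lda=1, ldb=1, ldc=1, ldd=1, mux1=1, mux2=1, demux=1, ldz=1   [z ← x+y] (c3)

Note: A register is loaded at the +ve/-ve edge (in a +ve/-ve edge-triggered system) of the cc after the one in which its load signal is asserted.
[Figure: timing diagram: lda = 1 during one cc, reg. a loaded at the edge that starts the next cc]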
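The control-signal settings for the non-overlapped schedule can be exercised with a small simulation. This is only a sketch under stated assumptions: the order of the three control words (y ← c+d, then ldx, then z ← x+y), the function names `step` and `run_nonoverlapped`, and the default value 0 for unspecified signals are illustrative choices, and the loading of the input registers (lda to ldd) and the one-cc load delay are not modeled.

```python
# Assumed order of control words over the 3 ccs of one iteration.
# Signals not listed in a word take the inactive value 0 (see the note on
# unspecified control signals).
CONTROL_WORDS = [
    {"mux1": 0, "mux2": 0, "demux": 0, "ldy": 1},  # c2: y <- c + d
    {"ldx": 1},                                    # c1: x <- a * b (2nd cc of X)
    {"mux1": 1, "mux2": 1, "demux": 1, "ldz": 1},  # c3: z <- x + y
]


def step(regs, signals):
    """Apply one cc of control signals to the datapath registers."""
    s = {"mux1": 0, "mux2": 0, "demux": 0, "ldx": 0, "ldy": 0, "ldz": 0}
    s.update(signals)
    # mux1 selects d (I0) or x (I1); mux2 selects c (I0) or y (I1).
    left = regs["x"] if s["mux1"] else regs["d"]
    right = regs["y"] if s["mux2"] else regs["c"]
    if s["ldx"]:
        regs["x"] = regs["a"] * regs["b"]
    if s["ldy"] or s["ldz"]:
        total = left + right
        # The demux routes the adder output to y (0) or z (1).
        if s["demux"] == 0 and s["ldy"]:
            regs["y"] = total
        if s["demux"] == 1 and s["ldz"]:
            regs["z"] = total


def run_nonoverlapped(inputs):
    """inputs: list of (a, b, c, d) tuples. Returns (list of z, total ccs)."""
    outputs, ccs = [], 0
    for a, b, c, d in inputs:
        regs = {"a": a, "b": b, "c": c, "d": d, "x": None, "y": None, "z": None}
        for word in CONTROL_WORDS:
            step(regs, word)
            ccs += 1
        outputs.append(regs["z"])
    return outputs, ccs


if __name__ == "__main__":
    print(run_nonoverlapped([(2, 3, 4, 5), (1, 1, 1, 1)]))
```

Each iteration consumes three control words, which matches the 3 ccs per iteration of the non-overlapped schedule.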

Simple HLS Examples (contd)


2) Mapping to h/w w/ constraints: use only 1 (X) and 1 (+) (contd)
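(a) Scheduling
ii) Overlapped pipelined scheduling
[Figure: schedule over ccs with one row per FU: X executes c1(1), c1(2), ...; + executes c2(1), c3(1), c2(2), c3(2), ...]

(b) Arch. Synthesis
[Figure: same datapath as before: input registers a, b, c, d (lda, ldb, ldc, ldd), one multiplier feeding register x (ldx), one adder fed by mux1 (I0 = d, I1 = x) and mux2 (I0 = c, I1 = y), and a demux routing the adder output to registers y (ldy) and z (ldz)]

(c) Controller FSM Synthesis
[Figure: controller FSM entered on Reset, with states labeled cc 3i and cc 3(i+1)]
cc 3i: lda=1, ldb=1, mux1=0, mux2=0, demux=0, ldy=1, ldx=1   [y ← c+d, x ← a × b] (c1, c2)
cc 3(i+1): ldc=1, ldd=1, mux1=1, mux2=1, demux=1, ldz=1   [z ← x+y] (c3)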

For 4 iterations, the overlapped schedule takes 9 ccs versus 12 ccs for the non-overlapped schedule.
Overlapped sched.: Time for n iterations = 2n+1 ccs; throughput = n/(2n+1) ~ 0.5 outputs/cc
Non-overlapped sched.: Time for n iterations = 3n ccs; throughput = n/3n ~ 0.33 outputs/cc
~ 33% throughput improvement with the overlapped schedule for 4 iterations (12/9 ~ 1.33), approaching 50% for large n (0.5 vs. 0.33 outputs/cc)
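The cycle counts and throughputs quoted above can be checked with a few lines of arithmetic. The function names below are illustrative; the formulas 3n and 2n+1 are the ones given on the slide.

```python
from fractions import Fraction


def nonoverlapped_time(n):
    """ccs needed for n iterations when iterations do not overlap."""
    return 3 * n


def overlapped_time(n):
    """ccs needed for n iterations with the overlapped schedule."""
    return 2 * n + 1


def throughput(n, time_fn):
    """Outputs per cc for n iterations, kept exact as a fraction."""
    return Fraction(n, time_fn(n))


def improvement(n):
    """Ratio of overlapped to non-overlapped throughput for n iterations."""
    return throughput(n, overlapped_time) / throughput(n, nonoverlapped_time)


if __name__ == "__main__":
    for n in (4, 100, 10_000):
        print(n, overlapped_time(n), nonoverlapped_time(n), float(improvement(n)))
```

The ratio is 3n/(2n+1), which equals 4/3 for n = 4 and tends to 1.5 as n grows.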

Simple HLS Examples (contd)


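Some DFG control operation nodes:
[Figure: Selector node: data inputs in1 and in2 (T and F branches), a Condition (T/F) input, and one output out]
[Figure: Distributor node: one data input in, a Condition (T/F) input, and two outputs out1 and out2 (T and F branches)]

Conditional code:

If (a > b) then
  c ← a-b;
Else
  c ← b-a;

Possible DFGs corresponding to the above conditional code:
[Figure: DFGs built from the Selector and Distributor nodes]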
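The behavior of the two control nodes can be sketched in software. This is an illustration only: the polarity (T selects in1, T routes to the first output) and the two DFG shapes below are assumptions, since the slide's actual DFGs are in a figure.

```python
def selector(cond, in1, in2):
    """Selector node: forwards in1 when cond is true, else in2 (assumed polarity)."""
    return in1 if cond else in2


def distributor(cond, value):
    """Distributor node: routes value to the T output or the F output.

    Returns (t_out, f_out); the output that is not driven is None.
    """
    return (value, None) if cond else (None, value)


def cond_with_selector(a, b):
    """Compute both a-b and b-a, then let a selector pick the result."""
    return selector(a > b, a - b, b - a)


def cond_with_distributors(a, b):
    """Route the operands to only one subtraction, then merge with a selector."""
    cond = a > b
    a_t, a_f = distributor(cond, a)
    b_t, b_f = distributor(cond, b)
    t_result = a_t - b_t if cond else None   # executes only on the T branch
    f_result = b_f - a_f if not cond else None  # executes only on the F branch
    return selector(cond, t_result, f_result)


if __name__ == "__main__":
    print(cond_with_selector(7, 3), cond_with_distributors(3, 7))
```

The selector-only form evaluates both subtractions every time, while the distributor form activates only the subtraction on the taken branch.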

Simple HLS Examples (contd)


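Iterative code:
while (a > b)
  a ← a-b;

(a) Scheduling (using only 1 adder/sub)
[Figure: DFG for the loop with a selector (T sel F), a > comparison and a distributor (T dist F), with operations assigned to ccs c1 and c2; one input is marked "Initialized to F"]
Scheduling & binding:
[Figure: the single adder/sub (+) is used in ccs c1 and c2 of each iteration]

(b) Arch. Synthesis
[Figure: datapath with registers a (lda), b (ldb), r1 (ldr1) and final a (ldfina), a mux, the adder/sub with carry-in cin, and a demux; the comparison result goes to the FSM]
Annotations in the figure:
b' + 1 = 2's compl. representation of -b (b' = bitwise complement of b; the +1 is supplied as cin = 1)
s xor ovfl = 1: result is -ve; s xor ovfl = 0: result is +ve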
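The idea of doing both the comparison and the subtraction on one adder/sub can be sketched as follows. This is a minimal model under assumptions not stated on the slide: an 8-bit default width, the comparison a > b implemented as "b - a is negative", and the names `add_sub` and `repeated_subtract`. It also assumes b > 0 and that all values fit in the chosen width.

```python
def add_sub(a, b, subtract, width=8):
    """Model a width-bit 2's complement adder/subtractor.

    Subtraction is a + (bitwise complement of b) + cin with cin = 1.
    Returns (result bits, s xor ovfl); s xor ovfl = 1 means the true
    result is negative even if the width-bit result overflowed.
    """
    mask = (1 << width) - 1
    a &= mask
    b_in = (~b & mask) if subtract else (b & mask)
    cin = 1 if subtract else 0
    result = (a + b_in + cin) & mask

    def sign(v):
        return (v >> (width - 1)) & 1

    s = sign(result)
    # Overflow: both adder inputs have the same sign but the result differs.
    ovfl = int(sign(a) == sign(b_in) and s != sign(a))
    return result, s ^ ovfl


def repeated_subtract(a, b, width=8):
    """while (a > b): a <- a - b, using add_sub twice per iteration."""
    while True:
        # Assumed use of one cc: b - a is negative exactly when a > b.
        _, a_gt_b = add_sub(b, a, subtract=True, width=width)
        if not a_gt_b:
            return a
        # Assumed use of the other cc: the actual update a <- a - b.
        a, _ = add_sub(a, b, subtract=True, width=width)


if __name__ == "__main__":
    print(repeated_subtract(17, 5))
```

Using s xor ovfl rather than the sign bit alone keeps the comparison correct when the subtraction overflows, for example when comparing -100 with 100 in 8 bits.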

Delay Nodes in DFGs

A delay node is generally implemented as a register; a delay node thus becomes a state
variable.
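A minimal software sketch of this idea follows. The running-sum DFG y[n] = x[n] + y[n-1] and the class name `DelayNode` are hypothetical examples chosen for illustration; they do not come from the slides.

```python
class DelayNode:
    """A delay node modeled as a register holding one state variable."""

    def __init__(self, initial=0):
        self.state = initial

    def read(self):
        """Value stored during the previous iteration."""
        return self.state

    def load(self, value):
        """Store the value to be seen in the next iteration."""
        self.state = value


def running_sum(samples):
    """Evaluate y[n] = x[n] + y[n-1] with one delay node on the feedback edge."""
    delay = DelayNode(initial=0)
    outputs = []
    for x in samples:
        y = x + delay.read()  # the adder sees the delayed output
        delay.load(y)         # the register is updated for the next iteration
        outputs.append(y)
    return outputs


if __name__ == "__main__":
    print(running_sum([1, 2, 3, 4]))
```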

Delay Nodes in DFGs (contd)

register

Transformation in the DFG

Mapping to the architecture

Detailed HLS Example

Detailed HLS Example (contd)


Different paths (i/p → o/p) in the DFG

Scheduling heuristic: Among the available opers, schedule on the available FUs those whose delay to the o/p is the highest. Break ties in favor of those opers u whose sibling o/ps (o/ps going to the same children) that are available, or will be available at u's earliest finish, will have the largest lifetime at that point.
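The primary rule of this heuristic (highest delay to the output first) can be sketched as a small list scheduler. Several things here are assumptions: the lifetime-based tie-break is omitted (ties are broken by operation name), the DFG must be acyclic, and the example DFG is the three-operation one from the simple examples (c1: x ← a × b, c2: y ← c+d, c3: z ← x+y), not the DFG of this detailed example.

```python
def delay_to_output(ops):
    """Longest delay from the start of each operation to the end of the DFG."""
    succs = {u: [] for u in ops}
    for u, (_, _, preds) in ops.items():
        for p in preds:
            succs[p].append(u)
    memo = {}

    def dist(u):
        if u not in memo:
            memo[u] = ops[u][1] + max((dist(v) for v in succs[u]), default=0)
        return memo[u]

    return {u: dist(u) for u in ops}


def list_schedule(ops, fu_count):
    """ops: name -> (fu_type, delay_in_ccs, [predecessor names]).

    Returns (start, finish): the first and last cc of each operation.
    """
    prio = delay_to_output(ops)
    start, finish = {}, {}
    # Last busy cc of every FU instance, per FU type.
    busy_until = {fu: [0] * n for fu, n in fu_count.items()}
    cc = 1
    while len(start) < len(ops):
        ready = [u for u in ops if u not in start
                 and all(p in finish and finish[p] < cc for p in ops[u][2])]
        # Highest delay to output first; name order breaks ties.
        for u in sorted(ready, key=lambda u: (-prio[u], u)):
            fu, delay, _ = ops[u]
            for i, last in enumerate(busy_until[fu]):
                if last < cc:  # this FU instance is free in the current cc
                    start[u] = cc
                    finish[u] = cc + delay - 1
                    busy_until[fu][i] = finish[u]
                    break
        cc += 1
    return start, finish


if __name__ == "__main__":
    ops = {
        "c1": ("mul", 2, []),
        "c2": ("add", 1, []),
        "c3": ("add", 1, ["c1", "c2"]),
    }
    print(list_schedule(ops, {"mul": 1, "add": 1}))
```

With one multiplier (2 ccs) and one adder (1 cc), this small DFG is scheduled in 3 ccs, the same latency as one iteration of the earlier example.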

(a) Scheduling w/ one X (2 ccs) & one + (1 cc); goal: min. latency
(b) Reg. alloc. for o/p of operations
[Figure annotation: For WAR constraint]
(c) Arch. synthesis

Note: It is not clear how register allocation has been done here. It is sub-optimal (4 non-primary i/p regs. needed).
[Figure: The synthesized architecture]

Detailed HLS Example (contd)

Detailed HLS Example: Register Allocation

Detailed HLS Example: Register Allocation (contd)


Scheduling heuristic: Among the available opers, schedule on the avail. FUs those whose delay to the o/p is the highest. Break ties in favor of those opers u whose sibling o/ps (o/ps going to the same children) that are avail., or will be avail. at u's earliest finish, will have the largest lifetime at that point.

3 non-primary i/p regs. needed

In the conflict graph (one per FU), there is an edge between 2 var. nodes if their lifetimes overlap (indicating that different registers need to be allocated to them).
Graph coloring (using the min. # of colors to color the nodes s.t. connected node pairs have different colors) is NP-hard in general.
The above type of conflict graph is called an interval graph (derived from the 1-dimensional intervals of the lifetimes).
Min. graph coloring can be solved optimally in linear time for interval graphs.
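Register allocation on lifetime intervals can be sketched with a greedy pass over the variables in order of start time (often called a left-edge style assignment), which colors an interval graph with the minimum number of colors. The lifetimes below are hypothetical, the intervals are taken as half-open [start, end) in ccs, and this version sorts first, so it runs in O(n log n) rather than linear time.

```python
import heapq


def allocate_registers(lifetimes):
    """lifetimes: var -> (start, end), half-open interval [start, end) in ccs.

    Returns (assignment var -> register index, number of registers used).
    """
    assignment = {}
    busy = []      # min-heap of (end, register) for registers in use
    free = []      # min-heap of released register indices
    next_reg = 0
    for var, (start, end) in sorted(lifetimes.items(), key=lambda kv: kv[1]):
        # Release every register whose variable is dead by this start time.
        while busy and busy[0][0] <= start:
            _, reg = heapq.heappop(busy)
            heapq.heappush(free, reg)
        if free:
            reg = heapq.heappop(free)
        else:
            reg = next_reg
            next_reg += 1
        assignment[var] = reg
        heapq.heappush(busy, (end, reg))
    return assignment, next_reg


def overlaps(x, y):
    """True if two half-open lifetimes share at least one cc (a conflict edge)."""
    return x[0] < y[1] and y[0] < x[1]


if __name__ == "__main__":
    # Hypothetical lifetimes, not taken from the slides.
    lifetimes = {"A": (1, 3), "B": (2, 4), "C": (3, 5), "D": (4, 6), "E": (1, 2)}
    print(allocate_registers(lifetimes))
```

At most two of these example lifetimes overlap at any cc, so two registers suffice, and variables joined by a conflict edge never share a register.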

Detailed HLS Example: Register Allocation (contd)

d0

3 non-primary i/p
regs. needed

Scheduling heuristic: Among the available opers, schedule on the available FUs those whose delay to the o/p is the highest, breaking ties arbitrarily: B's lifetime increases, but D's (dep. of B) decreases.
