
Dynamatic

Reference:
https://dynamatic.epfl.ch/papers/FPGA_20_Dynamatic_From_C_C_to_Dynamically_Scheduled_Circuits.pdf
https://dynamatic.epfl.ch/
Like other HLS tools, Dynamatic generates RTL logic from high-level languages such as C/C++. However, the circuits it generates are dynamically scheduled rather than statically scheduled, which gives significant improvements in the speed and performance of the RTL logic. The gains are especially large for code that is control-heavy and has many irregular memory accesses.

The RTL logic targets Xilinx FPGAs. The generated RTL can then be used as a source and, with the addition of constraints, implemented on Xilinx FPGAs (using Vivado).
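As a concrete illustration (our own example, not taken from the paper), a histogram-style loop is exactly the kind of control-heavy, irregular-memory code the text refers to: the address hist[x[i]] depends on runtime data, so a static schedule must assume every iteration may conflict with the previous one, while a dynamically scheduled circuit stalls only on actual conflicts.

```c
#include <assert.h>

/* Hypothetical kernel of the kind Dynamatic handles well: the
 * access hist[x[i]] is data-dependent, so a static HLS schedule
 * must serialise iterations conservatively, while a dynamically
 * scheduled circuit can overlap independent iterations. */
void histogram(const int *x, int *hist, int n)
{
    for (int i = 0; i < n; ++i)
        hist[x[i]] += 1;   /* irregular, data-dependent address */
}
```

Functionally the kernel is trivial; the point is that its inter-iteration dependences are only known at run time.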

Dynamic Scheduling
Reference: https://www.cs.umd.edu/~meesh/411/CA-online/chapter/advanced-concepts-of-ilp-dynamic-scheduling/index.html

Instruction-Level Parallelism (ILP) is essentially the attempt to parallelise the execution of multiple instructions at a time. So the issue of instructions is in order, but execution of instructions is out of order; the end of execution (commit) is again in order. ILP can be exploited
• dynamically, where the hardware itself looks for opportunities for parallelism;
• statically, where the compiler looks for opportunities for parallelism.
However, the actual flow of instructions, along with the actual names of the registers, is always preserved.

Static Scheduling: This is achieved by the compiler itself. It includes register renaming, loop unrolling, instruction rearrangement, etc. These do not require hardware intervention. Static scheduling is often used to remove hazards during program execution, thus reducing the number of stalls in the pipeline.

Dynamic Scheduling: This is the hardware-driven instruction-level parallelism studied earlier, including Tomasulo's algorithm and dynamic branch prediction. All instructions go through in-order issue, out-of-order execution, and in-order finish. As in Tomasulo's algorithm, there are reservation stations and buffers to reorder data and store it when required.

There is an issue buffer into which instructions are issued (loaded from memory) and stored. When an instruction has all of its operands available for execution, it is sent to the reservation station of the appropriate execution unit. Once execution is over, the result is placed in the reorder buffer; the execution itself is, of course, out of order. The reorder buffer rearranges the results and outputs them in order. If some instruction in the issue buffer requires a result, it is forwarded to that instruction directly, before being written to memory. This is called data forwarding.

Since execution in the processor is divided into three stages, the control logic is also divided into three:
• The issue-stage control block issues instructions while the issue buffer has free slots.
• The control block in the execution stage transfers an instruction from the issue buffer to a reservation station when the reservation station is empty and the instruction's operands are available.
• The control block in the reorder buffer reorders the results before finishing (committing) the execution.
Register renaming can be handled by the hardware as well; the reservation stations help with hardware renaming. The reservation stations store the registers to be renamed. When execution completes, the result is published on the common data bus (CDB) so it can be picked up by whichever block requires it. [Look at the Tomasulo notes]
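The in-order-commit behaviour of the reorder buffer can be sketched in a few lines of C (a deliberate simplification with names of our own choosing, not a processor model): results may complete out of order, but commit only advances past entries that are done.

```c
#include <assert.h>
#include <stdbool.h>

#define ROB_SIZE 8

/* One reorder-buffer entry: a completion flag and the result. */
typedef struct {
    bool done;
    int  value;
} RobEntry;

typedef struct {
    RobEntry entry[ROB_SIZE];
    int head;   /* oldest uncommitted instruction */
    int tail;   /* next free slot                 */
} Rob;

int rob_issue(Rob *r)               /* allocate a slot, in order */
{
    int idx = r->tail;
    r->entry[idx].done = false;
    r->tail = (r->tail + 1) % ROB_SIZE;
    return idx;
}

void rob_complete(Rob *r, int idx, int value)  /* out of order */
{
    r->entry[idx].value = value;
    r->entry[idx].done = true;
}

/* Commit as many in-order results as are ready; returns count. */
int rob_commit(Rob *r, int *out, int max)
{
    int n = 0;
    while (n < max && r->head != r->tail && r->entry[r->head].done) {
        out[n++] = r->entry[r->head].value;
        r->head = (r->head + 1) % ROB_SIZE;
    }
    return n;
}
```

A younger instruction that finishes first simply waits in the buffer: nothing commits until the oldest entry is done.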

Conventional HLS tools, when provided with high-level code, generate a statically scheduled circuit. The compiler uses static techniques to increase performance (register renaming, loop unrolling, etc.), but the generated RTL circuit is statically scheduled (in-order issue, execution, and finish); it has none of the dynamic scheduling described above. Had the same code been run on a general-purpose processor, it could have taken advantage of the dynamic scheduling implemented in the processor.

Dynamatic compiles the code and, after applying static optimisations, generates a circuit that can also perform dynamic scheduling, thus improving performance over normal HLS designs.

Dataflow Components
Dynamatic uses tokens for data delivery. There are blocks (each of which can implement any digital circuit), which receive a data token from the source or from other blocks and, after computation, pass a data token on to other blocks or to the sink. Along with the data, control signals are also transferred to the next blocks. There are buffers between blocks: when a block has produced a data token, it is stored in the buffer, and when the receiver is ready for the token, it signals the buffer and receives the data from it. This allows the blocks to have variable execution latencies; the protocol is thus latency-insensitive.
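The handshake just described can be modelled as a one-slot elastic buffer (a sketch under our own naming, not Dynamatic's RTL): the producer may push only when the slot is empty, and the consumer pulls when it is ready, so neither side needs to know the other's latency.

```c
#include <assert.h>
#include <stdbool.h>

/* One-slot buffer between a producer block and a consumer block. */
typedef struct {
    bool full;
    int  token;
} Buffer;

bool buf_push(Buffer *b, int token)   /* producer side */
{
    if (b->full)
        return false;                 /* back-pressure: producer stalls */
    b->token = token;
    b->full = true;
    return true;
}

bool buf_pop(Buffer *b, int *token)   /* consumer side */
{
    if (!b->full)
        return false;                 /* no token available yet */
    *token = b->token;
    b->full = false;
    return true;
}
```

The return values play the role of the ready/valid control signals mentioned in the text.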

Dynamatic uses special dataflow components to control the flow of data between the blocks of execution. Like the functional units, these dataflow components are implemented as RTL circuits. They include (a diagram in the original tutorial illustrates them):
• Fork: Replicates every input token it receives and sends it to all of its outputs. It cannot accept a fresh input until all of its receivers have received the output.
◦ Eager fork: Transmits the token to each receiver block as soon as that receiver is ready. It waits until all receiver blocks have received the token before accepting the next input.
◦ Lazy fork: Transmits the token to all receivers only once all of them are ready.
• Join: Merges the inputs from multiple sources into a single output. It stores the inputs and, only after all sources have sent their tokens, merges them and transfers the result to the output.
• Merge: Passes any input token through to the single output. It does not join tokens; it simply forwards them.
• Mux: Depending on a select signal, transfers one of the input tokens to the output.
• Control merge (cmerge): Allows one of the inputs (uncontrolled) to pass to the output, and additionally outputs a signal indicating which input was transferred.
• Branch: Transfers the input to one of its outputs depending on the condition signal the branch block receives.
• Source: The ultimate source of tokens.
• Sink: The final destination of all tokens.
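The difference between the two fork variants can be made concrete with a small sketch (our own simplification, not Dynamatic's implementation): ready[i] says whether output i can currently accept the token, sent[i] records which outputs have already received it, and the fork is done once every output has it.

```c
#include <assert.h>
#include <stdbool.h>

/* Lazy fork: transfers the token only when ALL receivers are
 * ready, in a single step. Returns true when the fork is done. */
bool lazy_fork(const bool ready[2], bool sent[2])
{
    if (!(ready[0] && ready[1]))
        return false;              /* someone not ready: wait    */
    sent[0] = sent[1] = true;
    return true;
}

/* Eager fork: sends to whichever receivers are ready now and
 * remembers (via sent[]) who already has the token, so it can
 * finish the transfer on later attempts. */
bool eager_fork(const bool ready[2], bool sent[2])
{
    for (int i = 0; i < 2; ++i)
        if (ready[i])
            sent[i] = true;        /* deliver to ready receivers */
    return sent[0] && sent[1];     /* done when all have got it  */
}
```

The eager fork makes partial progress on each attempt, which is why it can outperform the lazy fork when receivers become ready at different times.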

The high-level code is divided into basic blocks (BBs); a BB contains no conditionals inside it. Each BB is implemented as a single block, which is then connected to the other blocks using the dataflow components above. Each BB gives out two signals:
• one when it is ready to receive a token,
• one when it is ready to send out a token.
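To see where basic-block boundaries fall, consider a small example of our own (the BB numbering is illustrative): every conditional or loop back edge starts a new block.

```c
#include <assert.h>

/* A simple loop and its basic-block decomposition (as comments). */
int sum_positive(const int *a, int n)
{
    int s = 0;                      /* BB0: entry, initialisation      */
    for (int i = 0; i < n; ++i) {   /* BB1: loop-condition check,
                                       ends at the conditional branch  */
        if (a[i] > 0)               /* branch out of BB1               */
            s += a[i];              /* BB2: taken only when a[i] > 0   */
    }                               /* BB3: i++ and back edge to BB1   */
    return s;                       /* BB4: exit block                 */
}
```

In a Dynamatic circuit each of these BBs would become one block, stitched to its neighbours with branch, mux, and cmerge components.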

The entire circuit, including the dataflow components and the fundamental components, is represented by Dynamatic as a Data Flow Graph (DFG). This is exported as a DOT-language file representing the DFG of the circuit, which can then be used to develop the circuit.
Code to Data Flow Circuit
Once we have the C/C++ code, we need to convert it into a dataflow graph built from the fundamental logic components and the dataflow components.

To make sure that instructions execute in the correct order, some conventions are followed:
• Every BB must send its output to its successor BBs, so every BB has a branch block at its output.

• Every BB must receive tokens from its predecessor BBs or from some source, so every BB has a mux at its input.

• Some BBs have no inputs (only constant inputs) but generate outputs for other BBs, so there is no mux block at the start of such a BB.

• Along with data tokens, control also flows through the BBs, telling each BB which operation it needs to perform. The control tokens enter a BB through a cmerge. The output signal of the cmerge drives the select input of the BB's mux, which helps the BB select the input required for the correct order of execution. This control token flows through the whole datapath and is modified slightly by each BB. [Understand again]

• Each internal BB is then designed by simply converting its logic into hardware using basic functional components.

• All exchange of tokens between BBs happens through handshaking: the BB producing the output generates a "data availability" signal, and the BB wanting to accept the token generates a "data required" signal.

Buffers
The dynamatic generated circuit also places buffers between BBs. The buffers are basically registers like in the
pipelined data path of a processor. This helps divide a single long data path into smaller parts thus allowing pipelined
execution of the circuit (with multiple inputs). Some effects of the buffers on the datapath includes:

• They divide one long critical path of the circuit into smaller parts. Thus, while one set of inputs is executing in one BB, another set of inputs can execute in a different BB. The total latency for a single input increases, but the throughput of the circuit also increases.

• Not all datapaths between the buffers have the same delay; some paths are faster than others. A faster datapath cannot accept tokens from, or give tokens to, a slower datapath until the latter has finished executing. This can be resolved by inserting buffers built as FIFOs, which store tokens from faster datapaths and provide them to slower datapaths when needed. This acts like the cache in a CPU, removing a speed bottleneck.

Dynamatic optimizes the placement of buffers to improve throughput at given clock speeds.
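A FIFO buffer of the kind described above can be sketched as a small circular queue (our own illustrative code, with an arbitrary depth): a fast producer path may run ahead by up to FIFO_DEPTH tokens before it has to stall for the slow consumer path.

```c
#include <assert.h>
#include <stdbool.h>

#define FIFO_DEPTH 4

/* Circular FIFO giving slack between a fast and a slow datapath. */
typedef struct {
    int data[FIFO_DEPTH];
    int head, tail, count;
} Fifo;

bool fifo_push(Fifo *f, int token)
{
    if (f->count == FIFO_DEPTH)
        return false;                   /* full: fast path stalls */
    f->data[f->tail] = token;
    f->tail = (f->tail + 1) % FIFO_DEPTH;
    f->count++;
    return true;
}

bool fifo_pop(Fifo *f, int *token)
{
    if (f->count == 0)
        return false;                   /* empty: slow path waits */
    *token = f->data[f->head];
    f->head = (f->head + 1) % FIFO_DEPTH;
    f->count--;
    return true;
}
```

Choosing FIFO depths is part of the buffer-placement optimisation the text mentions: deeper FIFOs give more slack at the cost of area.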

Basic Flow of Dynamatic


The steps followed by Dynamatic when converting a C/C++ file to a VHDL file:
• First the C files are checked for correctness, compiled, and metadata is added.
• The LLVM compiler framework parses the program and produces an intermediate representation (IR) of the code.
• The IR is then passed through Clang/LLVM optimisation passes.
• The IR is then sent through a custom pass which adds the dataflow components mentioned above to the design. The output of this stage is a DOT netlist file (a graph representation of the design).
• The IR is then passed through further custom passes, which add buffers and memory interfaces and optimise them.
• Finally, the DOT file is converted into a VHDL file of dataflow components. This file can be added as a source in Vivado to generate a bitstream to be placed on an FPGA.

Running Dynamatic
Some conditions must be followed regarding the code fed to Dynamatic. The documentation lists all of Dynamatic's commands.

The different steps can be run automatically with the script synthesis.tcl. After synthesis of the circuit from the code, the output is a DOT-language file: a digraph containing all the dataflow and fundamental components, connected by edges through which data flows. Each component and channel in the DOT netlist has attributes as mentioned in table 2. [Please go through the dot file in the same directory as this, and also table 2 mentioned in the tutorial.]
The DOT file consists of nodes connected by edges; each node can have the attributes mentioned in table 2. The channels do not have any attributes specific to RapidWright. These attributes drive the later steps of the synthesis process, describing the type of each component and its role in the channel.

We can convert the netlist (in DOT language) into a viewable image using: dot -Tpng file.dot > file.png. This outputs a PNG file in which the netlist can be viewed. The netlist is then passed through custom Dynamatic passes which place buffers and memory interfaces and optimise the circuit. The final DOT file can then be translated into a VHDL circuit, which can be used as a source in Vivado for generating a bitstream, or in ModelSim to simulate its functionality.
