
DESIGN AND IMPLEMENTATION OF ENERGY EFFICIENT VECTOR PROCESSOR WITH ARIANE CORE

Main Project Report


Submitted in partial fulfillment of the requirement for the award of the degree of
M.Tech in Micro and Nano Electronics
of the APJ Abdul Kalam Technological University

Submitted By:
THASREEF T C
Register No: TVE19ECMN18

Fourth Semester
M. Tech in Electronics and Communication Engineering with specialization in
Micro and Nano Electronics

Guided By:
Dr. Suresh Kumar E.
Professor

DEPARTMENT OF ELECTRONICS AND COMMUNICATION ENGINEERING

COLLEGE OF ENGINEERING TRIVANDRUM


KERALA
January 2021
DEPARTMENT OF ELECTRONICS AND COMMUNICATION ENGINEERING
COLLEGE OF ENGINEERING
TRIVANDRUM

CERTIFICATE

This is to certify that this project report entitled “Design and Implementation of Energy
Efficient Vector Processor with Ariane Core” is a bonafide record of the work done by
THASREEF T. C., under our guidance towards partial fulfilment of the requirements for
the award of the Degree of Master of Technology in Electronics and Communication with
specialization in Micro and Nano Electronics, of the A P J Abdul Kalam Technological
University during the year 2019-2021.

Dr. Suresh Kumar E. Dr. Shajahan E. S. Dr. Sanil K Daniel.


Professor Associate Professor Associate Professor
Dept. of ECE Dept. of ECE Dept. of ECE
(Project Guide) (Project Co-ordinator) (Project Co-ordinator)

Dr. Hari R. Dr. Biji Jacob


Professor Professor
Dept. of ECE Dept. of ECE
(PG Co-ordinator) (Head of the Department)
ACKNOWLEDGEMENTS

I would like to express my sincere gratitude and heartfelt indebtedness to my
guide Dr. Suresh Kumar E., Professor, Department of Electronics and Communication,
for his valuable guidance and encouragement in pursuing this project.
I am also very much thankful to Dr. Biji Jacob, Head of the Department, and
Dr. Hari R., PG Coordinator, Department of Electronics and Communication, for their
help and support.
I also extend my gratitude to the Project Co-ordinators, Dr. Shajahan E. S.,
Associate Professor, Department of Electronics and Communication, and Dr. Sanil
K. Daniel, Associate Professor, Department of Electronics and Communication, College
of Engineering Trivandrum, for providing necessary facilities and their sincere co-operation.
I extend my sincere thanks to all the teachers of the Department of Electronics
and Communication and to all my dear friends for their help and support in completing
this project.
Above all, I thank the Almighty for the immense grace and blessings at all stages of this
project.

THASREEF T C
TVE19ECMN18

ii
ABSTRACT

The end of Dennard scaling caused the race for performance through higher frequencies
to halt more than a decade ago, when an increasing integration density stopped
translating into proportionate increases in performance or energy efficiency. Processor
frequencies plateaued, inciting interest in parallel multi-core architectures. These
architectures, however, fail to address the efficiency limitation created by the inherent
fetching and decoding of elementary instructions, which keep the processor datapath
busy for only a very short period of time. Moreover, power dissipation limits how
much integrated logic can be turned on simultaneously, increasing the energy efficiency
requirements of modern systems. The performance of an interconnection network is
determined not only by its architecture, but also by the routing algorithm employed.
The XY routing method is the most basic routing technique for mesh topologies in
networks-on-chip. It has been shown that a level-based routing algorithm is more
efficient than the XY routing method. In this work, a dynamic-programming-based
level-based routing method is developed. The suggested routing method proves to be more
computationally efficient, obtaining a speed increase of up to two times in a manycore
processor based on the Ariane core, an open-source, in-order, single-issue, 64-bit
application-class processor.

iii
TABLE OF CONTENTS

1 INTRODUCTION 1
1.1 Report Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3

2 LITERATURE REVIEW 4
2.1 Ara : A SIMD Vector Processor . . . . . . . . . . . . . . . . . . . 6
2.2 Baseline MIMD-2 Processor . . . . . . . . . . . . . . . . . . . . . 8

3 PROPOSED PROJECT 11
3.1 Underlying architecture and Routing algorithm . . . . . . . . . . . 12
3.1.1 A two-dimensional mesh topology . . . . . . . . . . . . . . 12
3.1.2 XY Routing algorithm . . . . . . . . . . . . . . . . . . . . 13
3.1.3 Level based routing using Dynamic programming . . . . . . 14
3.2 Openpiton+Ariane Architecture . . . . . . . . . . . . . . . . . . . 15
3.2.1 Cache Hierarchy . . . . . . . . . . . . . . . . . . . . . . . 16
3.2.2 Network On-chip (NoC) . . . . . . . . . . . . . . . . . . . 17
3.3 Ariane Core . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18

4 METHODOLOGY 22
4.1 Verilog HDL for RTL Design . . . . . . . . . . . . . . . . . . . . . 22
4.2 Verilator 4.104 . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
4.3 Build and Simulation steps . . . . . . . . . . . . . . . . . . . . . . 25

5 RESULT AND ANALYSIS 26

6 CONCLUSION 28

BIBLIOGRAPHY 29

iv
LIST OF FIGURES

2.1 Execution pattern on an array processor. . . . . . . . . . . . . . . . 4


2.2 Execution pattern on a vector processor. . . . . . . . . . . . . . . . 5
2.3 Block diagram of an Ara instance with N parallel lanes. . . . . . . . 8
2.4 The MIMD-2 configuration’s memory system . . . . . . . . . . . . . . 9

3.1 Two-dimensional mesh topology . . . . . . . . . . . . . . . . . . . 12


3.2 Overview of OpenPiton+Ariane architecture . . . . . . . . . . . . . . . 16
3.3 OpenPiton’s Memory Hierarchy Datapath . . . . . . . . . . . . . . . . 17
3.4 Ariane SoC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21

4.1 Design flow for the proposed project . . . . . . . . . . . . . . . . . 22

5.1 Simulation output of Openpiton+Ariane . . . . . . . . . . . . . . . 26


5.2 Simulation output of Openpiton+Ariane . . . . . . . . . . . . . . . 26
5.3 Simulation output of Openpiton+Ariane . . . . . . . . . . . . . . . 27

v
CHAPTER 1
INTRODUCTION

Modern computer platforms of all sizes must fulfil energy efficiency standards while
still attempting to provide increasingly complex functionality. Due to the end of Den-
nard scaling, architects have been obliged to expose some degree of parallelism to the
user in order to continue to deliver higher computing performance. Energy efficiency
will be a significant problem in the future as mobile devices run increasingly complex
algorithms on bigger quantities of data.
The easiest and most common approach to exploit parallelism in embedded systems
nowadays is to copy an energy-efficient, lightweight in-order core multiple times to
generate an array of processors that function independently. This multiple instruction,
multiple data (MIMD) paradigm has allowed for continuing speed growth, although it
is far from ideal. The majority of the software running on these devices - from DSP
code in radio controllers to convolutional neural networks in more modern internet-of-things
(IoT) endpoints - consists largely of numerically intensive, highly data-parallel
kernels.
The obvious way to reduce this waste is to execute a single instruction on multiple
data items. However, SIMD, or packed SIMD as it is generally understood, is also an
inefficient architecture for embedded devices. SIMD is normally implemented by instantiating
several ALUs and providing instructions that supply multiple source operands
to these ALUs simultaneously, which then execute in lock-step. SIMD ISAs usually
include a different opcode for every possible length of input vector. As a result, they are
difficult to program, and adding more ALUs cannot improve performance unless software
is rewritten to take advantage of the microarchitectural improvements. Worse, they
impose a heavy area burden on embedded processors, because they require replicating
compute resources for each element of the longest supported vector type. These copies
necessarily remain idle during instructions that use less than the full width. Furthermore,
SIMD architectures cannot amortize the overhead of instruction fetch over as many
elements as a vector machine, since the degree of amortization is tied directly to the
amount of duplication.

1
Vector processors are a far more effective architecture for exploiting data-level parallelism
(DLP). In addition to spatial replication of compute resources, vector processors enable
temporal reuse of a decoded instruction by setting a single ALU to execute the instruction
across numerous data elements over a number of cycles. Furthermore, as a microarchitectural
characteristic, they can conceal the real number of available compute units,
allowing software to merely define the amount of work to be done. This frequently allows
a single binary to operate at near-optimal efficiency on cores with various degrees
of parallelism.
The packet routing algorithm of the combined OpenPiton+Ariane platform is modified
in this work. OpenPiton+Ariane is a permissively licenced open-source framework meant
to allow scalable architectural research prototypes. It is the world's first open-source,
SMP Linux-booting RISC-V system that scales from single-core to manycore, thanks
to the addition, in release 11, of SMP Linux running on FPGA. As a result,
OpenPiton+Ariane is an excellent RISC-V hardware research platform. OpenPiton is
scalable and portable; the architecture allows addressing for up to 500 million cores,
supports shared memory both within and across chips, and was designed to readily
enable high-performance 1000+ core microprocessors and beyond.
The interconnection network is one of the most important parts of a digital system. For
communication between components in smaller systems, standard bus designs have been
adopted. However, as the desire for greater computing speed grows, processors and memory
devices get quicker, necessitating the use of dedicated connections between the
system's different components. In the current circumstance, additional processors can
be employed to increase speed. These many processors must interact with one another,
which necessitates the use of a topology and a routing algorithm to route messages.
Various routing techniques exist based on topology and may be appropriate in specific
circumstances. With crucial knowledge about the underlying hardware, the routing
algorithm is responsible for routing the packet from the source node to the destination
node.
The OpenPiton processing system is a tiled architecture that allows a variety of
network-on-chip (NoC) topologies to link a variable number of processor tiles. The
default setup makes use of a 2D mesh topology, and the OpenPiton+Ariane version
allows for the instantiation of an Ariane RISC-V core within each tile. Each tile also has a
private L1.5 cache, NoC routers, and a shared L2 cache slice. The chipset includes
2
platform peripherals such as the DDR memory controller, UART, and RISC-V-specific
peripherals. Ariane is a single-issue, 64-bit RISC-V core (RV64GC). It supports hardware
multiply/divide, atomic memory operations, and an IEEE-compliant Floating
Point Unit (FPU). It also supports both the compressed instruction set extension and
the complete privileged instruction set extension. It uses the 39-bit, page-based virtual
memory scheme SV39 to boot Linux single-core on an FPGA.
In this work a routing method for the 2-D mesh topology is presented. The XY routing
algorithm for the 2-D mesh topology is included by default in OpenPiton+Ariane. Packet
routing is instead accomplished using a level-based approach combined with dynamic
programming. In terms of execution time, we evaluate and contrast the existing and
suggested systems. The suggested routing method is more computationally efficient
and is up to two times faster.

1.1 Report Outline


This report mainly consists of 6 chapters. The outline of the report is given below.
Chapter 1 gives a detailed introduction to the proposed work. It includes the
motivation for undertaking this project and the problem definition. The final two sections
of this chapter go over the objectives to be addressed as well as the orientation of
the report.
Chapter 2 presents the thorough literature review performed on the topic. This review
examines the different data-parallel approaches that have previously acted as specialised
platforms for running certain algorithms, as well as the many integration techniques
employed. This chapter also includes a comprehensive examination of vector processors.
Chapter 3 contains a description of the design's many components, as well
as its architecture and specifications. It includes a thorough description of the architecture of
OpenPiton and Ariane, as well as how the suggested routing algorithm for the mesh topology
works.
Chapter 4 details the methodology and the tools associated with each stage of
the proposed design flow.
Chapter 5 outlines the simulation results in achieving the various work objectives,
the steps undertaken to acquire the various results, and the simulation tools used at each
level.
Chapter 6 concludes the report by explaining the effectiveness of the proposed
work in executing complex algorithms.

3
CHAPTER 2
LITERATURE REVIEW

Single instruction, multiple data (SIMD) architectures share, and thus amortize, the
instruction fetch among multiple identical processing units. This architectural model can be
seen as instructions operating on vectors of operands. The approach works well as long
as the control flow is regular, i.e., it is possible to formulate the problem in terms of
vector operations.

A.ARRAY PROCESSORS
Array processors implement a packed-SIMD architecture. This type of processor
has several independent but identical processing elements (PEs), all operating on commands
from a shared control unit. Figure 2.1 shows an execution pattern for a dummy
instruction sequence. The number of PEs determines the vector length, and the architecture
can be seen as a wide datapath encompassing all subwords, each handled by a
PE.

Figure 2.1: Execution pattern on an array processor.

A limitation of such an architecture is that the vector length is fixed. It is encoded
into the instruction itself, meaning that each expansion of the vector length comes with
another ISA extension. For instance, Intel's first version of the Streaming SIMD Extensions
(SSE) operates on 128-bit registers, whereas the Advanced Vector Extensions (AVX)
and AVX-512 evolutions operate on 256- and 512-bit wide registers, respectively. ARM
provides packed-SIMD capability via the "Neon" extension, operating on 128-bit wide
registers. RISC-V also supports packed-SIMD via DSP extensions.

4
B.VECTOR PROCESSOR
Vector processors are time-multiplexed versions of array processors, implement-
ing vector-SIMD instructions. Several specialized functional units stream the micro-
operations on consecutive cycles, as shown in Figure 2.2. By doing so, the number
of functional units no longer constrains the vector length, which can be dynamically
configured. As opposed to packed-SIMD, long vectors do not need to be subdivided
into fixed-size chunks, but can be issued using a single vector instruction. Hence, vec-
tor processors are potentially more energy efficient than an equivalent array processor
since many control signals can be kept constant throughout the computation, and the
instruction fetch cost is amortized among many cycles.

Figure 2.2: Execution pattern on a vector processor.

C.SIMT
SIMT architectures represent an amalgamation of the flexibility of multiple instruc-
tion, multiple data (MIMD) and the efficiency of SIMD designs. While SIMD architec-
tures apply one instruction to multiple data lanes, SIMT designs apply one instruction to
multiple independent threads in parallel [8]. The NVIDIA Volta GV100 GPU is a state-of-the-art
example of this architecture, with 64 "processing blocks," called Streaming
Multiprocessors (SMs) by NVIDIA, each handling 32 threads.
A SIMD instruction exposes the vector length to the programmer and requires manual
branching control, usually by setting flags that indicate which lanes are active for
a given vector instruction. SIMT designs, on the other hand, allow the threads to diverge,
although substantial performance improvement can be achieved if they remain
synchronized. SIMD and SIMT designs also handle data accesses differently. Since

5
GPUs lack a control processor, hardware is necessary to dynamically coalesce memory
accesses into large contiguous chunks. While this approach simplifies the programming
model, it also incurs a considerable energy overhead.

D.VECTOR THREAD
Another compromise between SIMD and MIMD are vector thread (VT) architectures,
which support loops with cross-iteration dependencies and arbitrary internal control
flow. Similar to SIMT designs, and unlike SIMD, VT architectures leverage the
threading concept instead of the more rigid notion of lanes, and hence provide a mechanism
to handle program divergence. The main difference between SIMT and VT is
that in the latter the vector instructions reside in another thread, and scalar bookkeeping
instructions can potentially run concurrently with the vector ones. This division alleviates
the problem of SIMT threads running redundant scalar instructions that must be
later coalesced in hardware. Hwacha is a VT architecture based on a custom RISC-V
extension, recently achieving 64 double-precision GFLOPS in ST 28 nm FD-SOI technology.
Other vector and data-parallel architectures are too expensive for this use case. Because
vectors give superior performance across the board, embedded system designers
should progressively adopt vector architectures for computers of this scale, even if it
means less programmability in some situations.

2.1 Ara : A SIMD Vector Processor


Ara is a scalable high-performance vector unit whose microarchitecture is based on the
vector extension of RISC-V. Ara collaborates with Ariane, an open-source Linux-capable
application-class core, as shown in Figure 2.3. To this end, Ariane has been enhanced to
drive the accompanying vector unit as a tightly coupled coprocessor.

ARIANE
Ariane is a free and open-source in-order, single-issue, 64-bit application-class processor
that implements RV64GC. It features hardware multiply/divide and atomic memory
support, as well as an IEEE-compliant FPU. It was built in GlobalFoundries
22FDX FD-SOI technology, with a maximum clock speed of 1.7 GHz and an energy
efficiency of up to 40 GOPS/W. The core features a six-stage pipeline that includes
Program Counter (PC) Generation, Instruction Fetch, Instruction Decode, Issue Stage,
Execute Stage, and Commit Stage. The first two stages are referred to as Ariane's front
end, which is in charge of the instruction fetch interface, while the following four are

6
referred to as its back end.
Ariane requires some architectural adjustments to power the vector unit, all of
which are confined to the back end. Vector instructions are partially decoded in Ariane's
Instruction Decoder, just enough to determine that they are vector instructions, and then
fully decoded in Ara's specialised Vector Instruction Decoder. The reason for this split
decoding is the huge number of vector Control and Status Registers, one for each of the
32 vector registers, that must be taken into consideration before properly decoding such
instructions.
The dispatcher manages the interaction between Ara’s dedicated scoreboard port
and Ariane’s. In Ariane, instructions can retire from functional units out of sequence,
whereas Ara executes instructions non-speculatively. The dispatcher operates specula-
tively as well, but it waits until a vector instruction reaches the top of the scoreboard
(i.e., it is no longer speculative) before pushing it into the instruction queue, along with
the contents of any scalar registers read by the vector instruction. Ara reads from this
queue, recognises the instruction (if necessary, e.g., the vector instruction generates a
scalar result), and then propagates any potential exceptions back to Ariane’s scoreboard.
Instructions are accepted after Ara decides that they will not cause any exceptions.
This occurs early in their execution, often after decoding. Because vector instructions
can run for a long time, they may be recognised many cycles before their execution is
complete, possibly allowing the scalar cores to resume processing of their instruction
stream. The detached execution works fine, except when Ariane requires a result from
Ara, such as accessing a vector register entry.
The Ariane-Ara interface is lightweight, similar to the Rocket Custom Coprocessor
Interface (RoCC) for use with the Rocket Chip. The dispatcher sends the decoded in-
struction to Ara, whereas RoCC delegated the whole decoding work to the coprocessor.

7
Figure 2.3: Block diagram of an Ara instance with N parallel lanes.

2.2 Baseline MIMD-2 Processor


MIMD (multiple instruction, multiple data) is a method used in computing to cre-
ate parallelism. MIMD-enabled machines feature a number of processors that operate
asynchronously and independently. At any given time, many processors may be exe-
cuting different instructions on various bits of data. MIMD designs can be utilised in
a variety of applications, including computer-aided design/manufacturing, simulation,
modelling, and as communication switches. MIMD machines can be classified as either

8
shared memory or distributed memory. These groups are based on how MIMD processors
access memory. Shared memory machines can be bus-based, extended, or hierarchical
in nature. Hypercube or mesh connectivity methods can be used in distributed
memory machines.

Figure 2.4: The MIMD-2 configuration’s memory system

Figure 2.4 depicts our basic MIMD microarchitecture, which comprises two instances
of the Ariane single-issue, in-order pipeline. Each Ariane core includes split L1
data and instruction caches of 16 KiB each, a floating point unit, and a coherent interface to a
shared multi-banked L2 cache.

Having more than one processor improves efficiency. Each CPU has its own local
bus for accessing the local memory and I/O devices. This makes parallel processing very
simple. The system structure is adaptable, which means that the failure of one module
does not result in the failure of the entire system; the damaged module may be replaced
later. MIMD machines with shared memory use processors that share a central, shared
memory. In its most basic form, all processors are connected to a bus that connects
them to memory. This implies that each computer with shared memory has a unique

9
CM, or common bus system, for all clients. MIMD machines with hierarchical shared
memory employ a bus hierarchy (as in a "fat tree") to let processors access one
another's memory. Inter-nodal buses allow processors on different boards to interact with
one another. Buses help boards communicate with one another. With this sort of design,
the machine may be able to accommodate over 9,000 processors.
Each processor in a distributed memory MIMD system has its own private memory.
Each CPU has no direct knowledge of the memory of the other processors.
Data must be sent as a message from one processor to another in order to be shared.
Concurrency is less of an issue on these machines since there is no shared memory.
Connecting a huge number of processors directly to each other is not economically
viable. To prevent this slew of direct connections, each CPU is linked to only a few
others. Because of the additional time necessary to send a message from one processor
to another along the message channel, this architecture can be wasteful. Processors
can spend a significant amount of time doing simple message routing. Hypercube and
mesh are two prominent connectivity methods that were developed to decrease this time
waste.

10
CHAPTER 3
PROPOSED PROJECT

The interconnection network is the most important component of digital systems. In


smaller systems, we employed common bus designs to communicate between components.
However, as the desire for greater computing speed grows, processors and
memory devices get quicker, necessitating the use of dedicated connections between
the various system components. In the current circumstance, additional processors can
be employed to increase speed. These many processors must interact with one another,
which necessitates the use of a topology and a routing algorithm to route messages.
Various routing techniques exist based on topology and may be appropriate in specific
circumstances. The mesh topology is the most common and simplest topology, while
the XY Routing method is the simplest Routing algorithm. With crucial knowledge
about the underlying hardware, the Routing algorithm is responsible for routing the
packet from the source node to the destination node. Routing methods are employed in
both sorts of networks, whether they are regular or irregular.
Various mesh topology variants have been proposed in the past; these prove to be
more efficient than the plain mesh topology and employ routing derived from the
XY routing method. This emphasises the need for thoroughly researching XY routing,
since the research will not only support the mesh topology, but will also serve as the
foundation for creating routing algorithms for mesh topology variations. Various
parameters were utilised to evaluate the routing algorithm's performance.
Because a routing algorithm is executed for every packet, the major goal in creating
a new routing algorithm should be to minimise time and space complexity.
These two elements have an indirect impact on the router's different design characteristics.
A complex routing algorithm also needs a high number of logic gates to
complete its task, increasing chip area. Executing a complicated routing
algorithm additionally increases the routing decision time, necessitates more power,
and generates more heat. The major goal of our research is to investigate existing XY
routing and its alternatives, and to propose a quicker and more space-efficient method
for networks-on-chip.

11
3.1 Underlying architecture and Routing algorithm
A deep analysis of routing algorithms requires a thorough grasp of the mesh
topology as well as the router architecture used.
3.1.1 A two-dimensional mesh topology
Figure 3.1 illustrates a two-dimensional mesh structure. The processors are depicted
by circles, while the routers are represented by squares. These components
are linked to one another through connections, and routers are linked to their horizontal
and vertical neighbours. The labelling of the nodes is determined by the routing algorithm
in use. There are numerous mesh topology routing methods, which may
be basic or adaptive in nature. The table-based routing algorithms popular in general
networking are not ideal for networks-on-chip, since routing tables must be maintained
at each node. This increases the area of the router, which is an important consideration
when building the network-on-chip.

Figure 3.1: Two-dimensional mesh topology

The labels in the XY routing method are coordinate-based, which requires memory
blocks that store each node's X and Y coordinates (router + processing
element). In the case of level-based or index-based routing algorithms, the label is a
single integer value ranging from 0 to n - 1.

12
3.1.2 XY Routing algorithm
The XY routing algorithm, which is a sequence of nested if-else comparisons, has been
thoroughly documented in the literature. In XY routing, the current router's
X coordinate is first compared to the destination's X coordinate. Based on the
comparison, the packet is transported in either the eastward or westward direction.
Once the current router's X coordinate matches that of the destination address,
a comparison based on the Y coordinates is performed. This comparison transports
the packet to the router's north, south, or local port. A minimal sketch of this decision
logic is given below.
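As a concrete illustration, the comparison chain described above can be written as a small
combinational block. The following SystemVerilog sketch is illustrative only, and is not
the OpenPiton router code: the module name, the five-port encoding, and the convention
that the Y coordinate grows towards the south are assumptions introduced here.

// Minimal XY routing sketch (illustrative only, not the OpenPiton router).
// Assumptions: five output ports, Y coordinate grows towards the south.
module xy_route #(
  parameter int XW = 4,                  // bits for the X coordinate
  parameter int YW = 4                   // bits for the Y coordinate
) (
  input  logic [XW-1:0] cur_x, dst_x,    // current and destination X
  input  logic [YW-1:0] cur_y, dst_y,    // current and destination Y
  output logic [2:0]    out_port         // selected output port
);
  localparam logic [2:0] LOCAL = 3'd0, EAST = 3'd1, WEST = 3'd2,
                         NORTH = 3'd3, SOUTH = 3'd4;

  always_comb begin
    // Resolve the X dimension first, then the Y dimension.
    if      (dst_x > cur_x) out_port = EAST;
    else if (dst_x < cur_x) out_port = WEST;
    else if (dst_y > cur_y) out_port = SOUTH;
    else if (dst_y < cur_y) out_port = NORTH;
    else                    out_port = LOCAL;  // packet has reached its node
  end
endmodule

Because the X offset is always resolved before the Y offset, XY routing is deterministic
and, on a 2-D mesh, deadlock-free.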

13
3.1.3 Level based routing using Dynamic programming
The Level-Based routing with Dynamic Programming (LBDP) technique is based
on the basic concept of dynamic programming: redundant calculations are saved rather
than recomputed again and again, to minimise the cost of execution. In level-based
routing, we first compute the level of the present node and then compare it to the level
of the destination node. The computation of the current level is unnecessary for each
packet, since it can be calculated once and then reused for all subsequent packets.
The routing algorithm therefore uses less time. To achieve this, we have made the
router's level number a constant. Another change made to the level-based routing
algorithm improves its correctness: the existing level-based routing was unable to route
packets appropriately, since the code for routing a packet to the router's local port did
not exist.

14
We consider two situations to demonstrate the correctness of the proposed algorithm.
There are three sub-cases where the current and destination addresses are at the same
level:
a. Present address is greater than the destination address: the packet is routed to
the west port, since the node to the left of the current node always has a smaller label
within a level.
b. Current address is smaller than the destination address: the packet is sent to the east
port, since the node to the right of the current node within a level is always greater.
c. Current address equals the destination address: the packet has arrived at the destination
node and is forwarded to the local port.
When the present address and the destination address are not on the same level, there
are two possibilities:
d. The current node is at a lower level than the target router: the packet must be routed
through the south port, since nodes above a given node are always at a lower level.
e. The current node is at a higher level than the target router: the packet must be routed
through the north port, since nodes above a given node are always at a lower level.
Because all five scenarios route packets successfully, the given routing method works
appropriately. A minimal sketch of the decision logic covering these cases follows.
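To make the five cases concrete, a minimal SystemVerilog sketch of the decision logic is
given below. It is illustrative only and not the modified OpenPiton router: the module and
signal names, the row-major labelling (level = label divided by the mesh width), and the
five-port encoding are assumptions introduced here.

// Minimal level-based routing sketch with the current level precomputed
// (illustrative only; parameter names and port encoding are assumptions).
module lbdp_route #(
  parameter int MESH_W  = 4,        // nodes per row, i.e. the level width
  parameter int NODE_ID = 0,        // this router's label, 0 .. n-1
  parameter int IDW     = 16        // width of a node label
) (
  input  logic [IDW-1:0] dst_id,    // destination label from the packet header
  output logic [2:0]     out_port
);
  localparam logic [2:0] LOCAL = 3'd0, EAST = 3'd1, WEST = 3'd2,
                         NORTH = 3'd3, SOUTH = 3'd4;

  // Dynamic-programming reuse: the current node's level is evaluated once,
  // at elaboration time, instead of being recomputed for every packet.
  localparam int unsigned CUR_LEVEL = NODE_ID / MESH_W;

  logic [IDW-1:0] dst_level;
  assign dst_level = dst_id / MESH_W;          // destination level (row)

  always_comb begin
    if (dst_level == CUR_LEVEL) begin          // same level: cases (a)-(c)
      if      (dst_id > NODE_ID) out_port = EAST;
      else if (dst_id < NODE_ID) out_port = WEST;
      else                       out_port = LOCAL;
    end else if (dst_level > CUR_LEVEL) begin  // case (d): destination below
      out_port = SOUTH;
    end else begin                             // case (e): destination above
      out_port = NORTH;
    end
  end
endmodule

Since CUR_LEVEL is an elaboration-time constant, each packet only pays for deriving the
destination level and one comparison chain; when the mesh width is a power of two, the
division reduces to discarding the low-order bits of the label.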

3.2 Openpiton+Ariane Architecture


OpenPiton+Ariane, the combined platform, is a permissively licenced open-source
framework meant to allow scalable architectural research prototypes. OpenPiton+Ariane
is the first Linux-booting, open-source, RISC-V system that grows from single-core to
manycore, thanks to the recent addition of SMP Linux operating on FPGA. As a result,
OpenPiton+Ariane is an excellent RISC-V hardware research platform.
This section gives an overview of the Ariane core's design and changes, as well as the
P-Mesh cache subsystem. The OpenPiton processor system is a tiled architecture that
allows several network-on-chip (NoC) topologies to interconnect a variable number of
processing tiles. OpenPiton+Ariane inherits features from both the Ariane and OpenPiton
projects, including simulation and FPGA emulation infrastructure, as well as ASIC
synthesis and back-end scripting. The OpenPiton P-Mesh cache architecture was improved
with support for RISC-V atomic operations but otherwise remains unchanged, resulting in
a robust, well-validated manycore memory system. Similarly, Ariane's cache subsystem
was changed to link to P-Mesh, but the core was largely unaltered.

15
Figure 3.2: Overview of OpenPiton+Ariane architecture

The default setup employs a 2D mesh structure, as seen in Figure 3.1, and the OpenPiton+Ariane
version allows for the instantiation of an Ariane RISC-V core within each tile.
Each tile also has a private L1.5 cache, NoC routers, and a shared L2 cache slice. The
chipset includes platform peripherals such as the DDR memory controller, UART, and
RISC-V-specific peripherals.

3.2.1 Cache Hierarchy


The cache hierarchy in OpenPiton is made up of three cache levels: private L1 and
L1.5 caches, as well as a distributed, shared L2 cache. Each tile has its own L1 and L1.5
caches and a slice of the shared L2 cache. Figure 3.3 depicts the data flow of the cache
structure. The memory subsystem maintains cache coherence with the P-Mesh coherence
mechanism. It complies with the OpenSPARC T1 memory consistency model. Coherent
communications between the L1.5 and L2 caches are routed through three NoCs that have
been carefully constructed to avoid deadlocks.

16
Figure 3.3: OpenPiton’s Memory Hierarchy Datapath

3.2.2 Network On-chip (NoC)


An OpenPiton chip has three NoCs. The NoCs form a 2D mesh topology that connects
the tiles. The NoCs' primary function is to facilitate communication across tiles for
cache coherence, I/O and memory traffic, and inter-core interrupts. They also carry
traffic intended for off-chip destinations to the chip bridge. The NoCs preserve point-to-point
ordering between a single source and destination, which is frequently used to ensure
TSO consistency. To route data between chips in a multi-chip setup, OpenPiton employs
similarly customizable NoC routers.
The three NoCs are physical networks (no virtual channels), with two 64-bit unidirectional
links, one in each direction. Credit-based flow control is used on the links.
The packet structure retains 29 bits of core addressability, allowing it to scale up to 500
million cores.
To avoid deadlocks, the L1.5 cache, L2 cache, and memory controller assign different
priorities to the NoC channels: NoC3 has the highest priority, followed by NoC2,
while NoC1 has the lowest. As a result, NoC3 will never be blocked. Furthermore, all
hardware components are built in such a way that consuming a high-priority packet is never
contingent on consuming lower-priority traffic. While the cache coherence protocol is
intended to be conceptually deadlock-free, it also relies on the physical layer and routing
to be deadlock-free.
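As a rough illustration of such a priority scheme (a sketch of the general idea only, not
OpenPiton's actual router or cache logic), a fixed-priority arbiter that always serves NoC3
before NoC2, and NoC2 before NoC1, captures the rule that higher-priority traffic is never
blocked behind lower-priority traffic:

// Illustrative fixed-priority selection between the three NoC channels.
// This is a sketch of the concept only, not code from OpenPiton.
module noc_priority_arb (
  input  logic req3, req2, req1,   // pending requests from NoC3, NoC2, NoC1
  output logic gnt3, gnt2, gnt1    // one-hot grant; NoC3 always wins
);
  always_comb begin
    {gnt3, gnt2, gnt1} = 3'b000;
    if      (req3) gnt3 = 1'b1;
    else if (req2) gnt2 = 1'b1;
    else if (req1) gnt1 = 1'b1;
  end
endmodule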

The following criteria are used to map coherence operation classes to the NoCs, as
shown in Figure 3.3:

• Requests from the private cache (L1.5) to the shared cache (L2) initiate NoC1
messages.

17
• The shared cache (L2) initiates NoC2 messages to the private cache (L1.5) or
memory controller.

• NoC3 messages are replies from the private cache (L1.5) or memory controller to
the shared cache (L2).

3.3 Ariane Core


Ariane is a 64-bit, single-issue, in-order RISC-V core that provides hardware multiply/divide
operations, atomic memory operations, and an IEEE-compliant Floating
Point Unit (FPU). It also supports the compressed instruction set extension as well as
the complete privileged instruction set extension, and it provides a 39-bit page-based
virtual memory architecture (SV39). The micro-architecture's major design aim was
to decrease critical path length while reducing Instructions Per Cycle (IPC) losses to
a minimum. A synthesis-driven design method led to a 6-stage pipelined design to
reach the desired performance targets, and the micro-architecture includes a branch predictor
to decrease the penalty of branching.
The six pipeline stages are:
• PC Generation is in charge of selecting the next Program Counter (PC). The next
PC can come from the Control and Status Registers (CSRs) when returning from an
exception, the debug interface, a mispredicted branch, or a sequential fetch.

• Instruction Fetch contains the instruction cache, the fetch logic, and the pre-
decode logic that directs the PC stage’s branch prediction.

• Instruction Decode re-aligns possibly misaligned instructions, and decompresses
and decodes them. Decoded instructions are subsequently placed in an issue queue
in the issue stage.

• Issue Stage contains the issue queue, a scoreboard, and a small Re-order Buffer
(ROB). When all of the operands are ready, the instruction is sent to the
execute stage. The scoreboard tracks dependencies, and operands are forwarded from
the ROB when needed.

18
• Execute Stage houses all functional units. Every functional unit handshakes, and
its readiness is taken into account during instruction issue. The integer
Arithmetic Logic Unit (ALU), multiplier/divider, and CSR processing all
have fixed latency. Currently, the only variable-latency units are the FPU and the
load/store unit (LSU). Instructions can retire from functional units out of
order. The ROB is used to resolve write-back conflicts.

• Commit Stage reads from the ROB and commits all instructions in program order.
Stores and atomic memory operations are held in a store buffer until their
architectural commit is confirmed by the commit stage. Finally, the retiring instruction
updates the register file. The commit stage can commit two instructions per cycle to
reduce artificial starvation caused by a full ROB.

The key features of the Ariane Core are:


• Branch Prediction: Mis-prediction can occur on both the jump target address
(specified by a register value) and the branch outcome. In the event of a misprediction,
the front end, as well as the decode and issue stages, must be flushed, introducing at
least a five-cycle delay in the pipeline, and much more in the event of a TLB or
instruction cache miss. To reduce the negative impact of control flow delays on IPC,
Ariane has three distinct methods for predicting the next PC: a Branch History Table
(BHT), a Branch Target Buffer (BTB), and a Return Address Stack (RAS). Ariane
does light pre-decoding in its fetch interface to detect branches and jumps to aid
branch prediction.

• Virtual Memory: To support an operating system, Ariane includes full hardware
support for address translation through a Memory Management Unit (MMU). It
has separate, configurable data and instruction TLBs. The TLBs are fully
set-associative, flip-flop-based standard-cell memories. They are checked for a valid
address translation on each instruction and data access. If no valid address
translation exists, Ariane's hardware Page Table Walker (PTW) searches the main
memory for one. The replacement policy for TLB entries is Pseudo Least Recently
Used (PLRU).

19
• Exception Handling: Exceptions can occur anywhere in the pipeline and are
thus associated with a specific instruction. The earliest exception can occur during
instruction fetch, when the PTW detects an illegal TLB entry. Illegal instruction
exceptions can arise during decoding, and the LSU can also fail on address
translation or cause an illegal access exception. When an exception occurs, the
corresponding instruction is flagged and auxiliary data is stored. The commit stage
redirects the instruction front end to the exception handler when the excepting
instruction finally retires. Interrupts are asynchronous exceptions that must be associated
with a particular instruction. As a result, the commit stage waits for a valid instruction
to retire before taking an external interrupt and associating the exception with it.

• Scoreboard / Reorder Buffer: The scoreboard, which includes the ROB, is
implemented as a circular buffer that lies logically between the issue and execute
stages and holds issued, decoded, in-flight instructions that are presently being
executed in the execute stage. To track data hazards, the issue stage tracks and
checks source and destination registers. When a new instruction is issued, it
is recorded on the scoreboard. The various functional units return speculative
results. Because each instruction's destination register is known, results are
forwarded to the issue stage when necessary. The commit stage reads completed
instructions and retires them, freeing up space on the scoreboard for future
instructions.

• Functional Units : Ariane contains 6 functional units:


– ALU: The majority of the RISC-V base ISA is covered, including branch
target computation.

– LSU: Manages integer and floating-point load/store operations, as well as
atomic memory operations. The LSU communicates with the data cache
through three master interfaces: one for the PTW, one for the load unit,
and one for the store unit.

– FPU: Ariane includes an IEEE-compliant floating-point unit with cus-


tomised trans-precision extensions.

20
– Branch unit: an ALU extension that performs branch prediction and
correction.

– CSR: RISC-V requires atomic operations on its CSRs, since they must act
on the most recent value. Ariane therefore waits until the commit stage
before reading or writing a CSR. This functional unit buffers the relevant
write data and reads it again when the instruction retires.

– Multiplier/Divider: This functional unit houses the hardware support for
the M extension. The multiplier is a fully pipelined two-stage multiplier.
During synthesis, re-timing is relied upon to move the pipeline register into
the combinational logic. The divider is a bit-serial divider with input
preprocessing. Division can take anywhere from 2 to 64 cycles, depending on
the operand values.

Figure 3.4: Ariane SoC

21
CHAPTER 4
METHODOLOGY

This chapter describes the approach for the suggested design as well as the tools needed
at each step of the project. This design flow is followed for the creation of the level-based
method utilising dynamic programming, a routing algorithm for packet routing in a 2-D
mesh topology in an OpenPiton+Ariane manycore processor (shown in Figure 4.1).

Figure 4.1: Design flow for the proposed project

4.1 Verilog HDL for RTL Design


The Verilog Hardware Description Language (Verilog HDL) is a language used to describe
the behaviour of electrical circuits, most notably digital circuits. Verilog HDL is defined
by IEEE standards; Verilog 1995, Verilog 2001, and the more recent SystemVerilog 2005
are the three most popular versions. Verilog HDL may be used to design hardware as well
as to create test entities to evaluate the behaviour of a piece of hardware. Verilog HDL is
supported by a wide range of EDA tools, including synthesis tools like Quartus® Prime
Integrated Synthesis, simulation tools, and formal verification tools.

22
Verilog is a language for describing hardware. It is used to describe a digital
system such as a network switch, microprocessor, memory, or flip-flop. This means
that we can describe any digital hardware at any level using an HDL. HDL-described
designs are technology-independent, highly straightforward to create and debug, and
are typically more helpful than schematics, particularly for complex circuits.
Verilog can support a design at many different levels of abstraction. The top three
are as follows:
• Behavioral level

• Register-transfer level

• Gate level

Behavioral level
Concurrent algorithms are used to describe a system at this level (Behavioural).
Every algorithm is sequential, meaning it is made up of a series of instructions that are
carried out one after the other. The major elements are functions, tasks, and blocks.
There is no consideration for the design’s structural reality.
Register Transfer level
The Register-Transfer Level describes a circuit in terms of operations and data transfers
between registers. According to the modern definition, "any code that is synthesizable
is considered RTL code."
Gate level
At the gate level, the features of a system are described by logical connections and
their timing properties. All signals are discrete, and only definite logic values (0, 1,
X, Z) are allowed. The usable operations are predefined logic primitives (basic
gates). Gate-level modelling may not be the best option for logic design; the gate-level
netlist is used for gate-level simulation and for the back end, and gate-level code is
generated by tools such as synthesis tools.
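As a small illustration of these abstraction levels (an example of our own, not code from
the project), the same 2-to-1 multiplexer can be written at the register-transfer level with
a single continuous assignment, and at the gate level using predefined primitives:

// Register-transfer level: a synthesizable continuous assignment.
module mux2_rtl (
  input  logic a, b, sel,
  output logic y
);
  assign y = sel ? b : a;
endmodule

// Gate level: the same function built from predefined logic primitives.
module mux2_gate (
  input  wire a, b, sel,
  output wire y
);
  wire nsel, w0, w1;
  not g0 (nsel, sel);
  and g1 (w0, a, nsel);
  and g2 (w1, b, sel);
  or  g3 (y, w0, w1);
endmodule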

4.2 Verilator 4.104


Verilator is invoked using arguments similar to those of GCC or Synopsys' VCS. It
reads the supplied Verilog or SystemVerilog code, performs lint checks, and
optionally inserts assertion checks and coverage-analysis points to "Verilate" it. The
"Verilated" code is produced as single- or multi-threaded .cpp and .h files.
The user creates a small C++/SystemC wrapper file that instantiates the user’s top
level module’s "Verilated" model. A C++ compiler (gcc/clang/MSVC++) then com-

23
piles these C++/SystemC files. The design simulation is carried out by the resultant
executable. Verilator also allows you to link its produced libraries, which may be en-
crypted if desired, with other simulators.
Verilator may not be the ideal solution if you anticipate a full-featured substitute for
NC-Verilog, VCS, or any commercial Verilog simulator, or if you need a behavioural
Verilog simulator, such as for a fast class project (we recommend Icarus Verilog for
this.) However, if you want to migrate SystemVerilog to C++ or SystemC, or if your
team is comfortable creating a little C++ code, Verilator is the tool for you.
Verilator does more than just translate Verilog HDL to C++ or SystemC. Verilator,
on the other hand, converts your code into a considerably faster optimised and poten-
tially thread-partitioned model, which is then wrapped within a C++/SystemC module.
The end result is a built Verilog model that runs more than 10 times faster than stan-
dalone SystemC and more than 100 times faster than interpreted Verilog simulators like
Icarus Verilog. Multithreading might provide an additional 2-10x speedup (yielding
200-1000x total over interpreted simulators).
Verilator generally outperforms closed-source Verilog simulators (Carbon Design
Systems Carbonator, Modelsim, Cadence Incisive/NC-Verilog, Synopsys VCS, VTOC,
and Pragmatic CVer/CVC). However, because Verilator is open-source, you may spend
your money on compute rather than licences. As a result, Verilator provides the best
cycles per dollar. Its key features are:
• It accepts Verilog or SystemVerilog that may be synthesised.
• It conducts lint code-quality checks.
• It creates multithreaded C++ or SystemC code.
• It generates XML to be used as a front end for your own tools.
• It performs better than many commercial simulators.
• It produces single- and multi-threaded output models.
• It has extensive industrial and academic usage.
• It supports Arm and RISC-V vendor IP out of the box.

24
4.3 Build and Simulation steps
Building Manycore Model
• cd $PITON_ROOT

• source piton/ariane_setup.sh

• cd $PITON_ROOT/build

• sims -sys=manycore -vlt_build -ariane

Check Outputs
• sims.log: check for build errors

• build/manycore/rel-0.1/obj_dir/vcmp_top

C Program Simulation
• sims -sys=manycore -vlt_run -ariane hello_world.c rtl_timeout=

- RISC-V assembly tests have the same syntax but end with .s or .riscv
- Check fake_uart.log for output

25
CHAPTER 5
RESULT AND ANALYSIS

Figure 5.1: Simulation output of Openpiton+Ariane

Figure 5.2: Simulation output of Openpiton+Ariane

26
Figure 5.3: Simulation output of Openpiton+Ariane

Table 5.1: Execution time in µs for XY Routing algorithm and Implemented Dynamic
programming based algorithm

N          XY Routing algorithm    Dynamic programming algorithm
10000               424                       169
40000              1205                       626
90000              3020                      1197
160000             4352                      3046
250000            10643                      4451
360000            16252                      6395
490000            21753                      9465
640000            29725                     13566
810000            38090                     16713
1000000           48529                     20739

27
CHAPTER 6
CONCLUSION

The proposed routing algorithm has been recommended as an alternative to the XY
routing method, since it has proven to be efficient in terms of both space and time
complexity. The presented routing algorithm was created with the goal of increasing the
speed by a factor of two. Because the computations in the redundant section of the
algorithm are eliminated by utilising the dynamic programming technique, the faster
algorithm will almost certainly need fewer clock cycles, lowering the hardware cost and
power consumption. In the future, we will concentrate on adaptive routing with
complicated calculations that may contain duplicate computations.

28
BIBLIOGRAPHY

[1] Jonathan Balkind, Katie Lim, Fei Gao, Jinzheng Tu, David Wentzlaff, Michael
Schaffner, Florian Zaruba, and Luca Benini, "OpenPiton+Ariane: The First Open-Source,
SMP Linux-booting RISC-V System Scaling From One to Many Cores,"
CARRV, 2019.

[2] Jonathan Balkind, Michael McKeown, Yaosheng Fu, Tri Nguyen, Yanqi Zhou,
Alexey Lavrov, Mohammad Shahrad, Adi Fuchs, Samuel Payne, Xiaohua Liang,
et al., "OpenPiton: An open source manycore research framework," in ACM
SIGARCH Computer Architecture News, Vol. 44, ACM, pp. 217–232, 2016.

[3] Dally, W. J., and B. Towles, "Principles and Practices of Interconnection Networks,"
Elsevier, 2004.

[4] Duato, J., S. Yalamanchili, and L. Ni, "Interconnection Networks," Elsevier,
2003.

[5] Bhardwaj, V. P., and R. V. Nitin, "On the Minimization of Crosstalk Conflicts
in a Destination Based Modified Omega Network," Journal of Information
Processing Systems, Vol. 9, 2013, No. 3, pp. 301–314.

[6] Nitin, R. V., and D. S. Chauhan, "On a Deadlock and Performance Analysis of
ALBR and DAR Algorithm on X-Torus Topology by Optimal Utilization of Cross
Links and Minimal Lookups," Journal of Supercomputing, Vol. 59, 2010, No. 3,
pp. 1252–1288.

[7] Matheus Cavalcante, Fabian Schuiki, Florian Zaruba, Michael Schaffner, and Luca
Benini, "Ara: A 1 GHz+ Scalable and Energy-Efficient RISC-V Vector
Processor with Multi-Precision Floating Point Support in 22 nm FD-SOI,"
arXiv:1906.00478v3 [cs.AR], 27 Oct 2019.

[8] S. F. Beldianu and S. G. Ziavras, "Performance-energy optimizations for shared
vector accelerators in multicores," IEEE Transactions on Computers, vol. 64, no.
3, pp. 805–817, Mar. 2015.

29
[9] Y. Lee, R. Avizienis, A. Bishara, R. Xia, D. Lockhart, C. Batten, and K. Asanovic,
"Exploring the tradeoffs between programmability and efficiency in data-parallel
accelerators," SIGARCH Comput. Archit. News, vol. 39, no. 3, pp. 129–140, 2011.

[10] R. Krashinsky, C. Batten, M. Hampton, S. Gerding, B. Pharris, J. Casper, and
K. Asanovic, "The vector-thread architecture," SIGARCH Comput. Archit. News,
vol. 32, no. 2, pp. 52–, Mar. 2004. Available:
http://doi.acm.org/10.1145/1028176.1006736

[11] C. Schmidt, A. Ou, and K. Asanovic, "Hwacha: A data-parallel RISC-V extension
and implementation," in Inaugural RISC-V Summit Proceedings, Santa Clara, CA,
USA: RISC-V Foundation, Dec. 2018. Available:
https://content.riscv.org/wp-content/uploads/2018/12/Hwacha-AData-Parallel-RISC-V-Extension-and-Implementation-Schmidt-Ou-.pdf

30
