Riscv Design

Design a Three-Stage Pipelined RISC-V Processor Using SystemVerilog
KTH Thesis Report
Ziyan He

Degree Project in Technology, Second Cycle, 30 Credits
Stockholm, Sweden 2022
KTH Royal Institute of Technology
Electrical Engineering and Computer Science
Authors
Ziyan He <ziyanh@kth.se>
Electrical Engineering and Computer Science
KTH Royal Institute of Technology
Place for Project

Stockholm, Sweden
Examiner
Johnny Öberg
Stockholm, Sweden
Supervisor
Mattias Ekström
Stockholm, Sweden
Abstract
RISC-V is growing in popularity in academia and research as a free and open
RISC Instruction Set Architecture (ISA). Its openness, simplicity,
extensibility, and modularity also make it increasingly attractive to
designers in industry. The aim of this thesis is to design an open-source
RISC-V processor. The processor was developed from the prototype made in the
course IL2232 Embedded Systems Design Project (SoI-CMOS Design group) and
targets an experimental high-temperature SoI CMOS process. SystemVerilog was
used for RTL coding, ModelSim for RTL simulation, Genus for digital synthesis,
and Innovus for digital place & route. The thesis concludes that this RISC-V
processor can run compiled C code produced by the virtual platform tool
Imperas OVP. The processor implements the RV32IM instruction set. Through
simulation, the CPI of the processor was collected while running different
benchmark programs developed in two Master's theses carried out in parallel
with this one. To a certain extent, the CPI reflects the performance of the
processor. However, the actual execution time needs to be measured by running
the processor on hardware. This part is not discussed in this thesis and is
left for future work. The gate count is obtained from digital synthesis and
the corresponding area is obtained after digital place & route.
Keywords
RISC, RISC-V, ISA, SystemVerilog, RTL simulation, RV32IM, CPI.
Sammanfattning (Summary)
RISC-V is growing in popularity as a free and open RISC ISA in academia and
research. Its openness, simplicity, extensibility and modularity, among its
advantages, mean that it is used more and more by designers in industry. The
purpose of this thesis is to design an open-source RISC-V processor. The
development of this RISC-V processor was based on the prototype made in the
course IL2232 Embedded Systems Design Project (SoI-CMOS Design group), against
an experimental high-temperature SoI CMOS process. SystemVerilog was used for
RTL coding. ModelSim was used for RTL simulation. Genus was used for digital
synthesis and Innovus was used for digital place & route. The thesis concludes
that this RISC-V processor can run the compiled C code produced by the virtual
platform tool Imperas OVP. The instruction set RV32IM is the base instruction
set of this processor. Through simulation, the CPI of this RISC-V processor
can be collected while running different benchmark programs developed in two
Master's theses parallel to this one. To some extent, it can reflect the
performance of the processor. However, the actual execution time must be
tested by loading the processor onto hardware. This part is not discussed in
this thesis but is left for future work. The gate count is collected through
digital synthesis and the corresponding area is collected after digital place
& route.
Keywords
RISC, RISC-V, ISA, SystemVerilog, RTL simulation, RV32IM, CPI.
Acknowledgments
This thesis was carried out at KTH.
I am grateful to my examiner Dr. Johnny Öberg for the weekly discussions and
guidance during the project. I also thank my friends and classmates Annan Liu,
Gengwu Du and Gautham Prabhakar, who contributed to this project. I would also
like to thank my supervisor Mattias Ekström for providing useful tutorials on
the synthesis and physical design parts.
Stockholm, June 2022

Ziyan He
Contents
1 Introduction
1.1 Motivation
1.2 Purpose
1.3 Problem Statement
1.4 Goals
1.5 Delimitations
1.6 Structure of the thesis
2 Background Study
2.1 RISC vs CISC
2.2 RISC-V
2.2.1 RV32I
2.2.2 "M" Standard Extension
2.3 Related Work
3 RISC-V Processor Structure
3.1 RISC-V core
3.1.1 PC register and General purpose registers
3.1.2 Pipeline units
3.1.3 Instruction decoder
3.1.4 Execution unit
3.1.5 Division unit
3.1.6 Control unit
3.1.7 CSR register
3.1.8 Interrupt unit
3.2 RISC-V SoC
3.2.1 Bus
3.2.2 ROM and RAM
3.2.3 GPIO
4 RISC-V Processor Implementation
4.1 RTL Development
4.1.1 PC_REG
4.1.2 IFu_IDu & IDu_EXu
4.1.3 Decode Unit
4.1.4 Execution Unit
4.1.5 Control Unit
4.1.6 Division Unit
4.1.7 CSR_REG
4.1.8 Interrupt Unit
4.1.9 Bus
4.1.10 ROM and RAM
4.1.11 GPIO
4.2 Place & Route
5 Results and Analysis
5.1 Verification and Simulation
5.1.1 Verification
5.1.2 Simulation
5.2 Digital Synthesis
5.2.1 SOI_STDLIB
5.2.2 Lsi_10k
5.2.3 Digital Synthesis result
5.3 Place & Route
6 Conclusions and Future work
6.1 Conclusions
6.2 Future work
References
A First Appendix
A.1 RV32I Base Instruction Description
A.2 RV32M Standard Extension Listing
B Second Appendix
B.1 Fibonacci program code
B.2 Speed program code
B.3 Matrix program code
List of Figures
1.1 Machine cycle
2.1 CISC vs RISC
2.2 RISC-V base instruction formats
2.3 RV32I Base Instruction Set
2.4 Multiplication Operations
2.5 Division Operations
2.6 Block Diagram of RI5CY
3.1 RISC-V core structure
3.2 RISC-V Division unit algorithm
3.3 CSR Instructions
3.4 RISC-V SoC structure
3.5 RISC-V SoC without Bus
3.6 RAM
3.7 ROM
3.8 GPIO Module
4.1 PC_REG code
4.2 D flip-flop code
4.3 IFu_IDu code
4.4 IDu_EXu code
4.5 C2 internal signal connection
4.6 Sample waveform
4.7 IDu code
4.8 EXu code
4.9 CTRLu code
4.10 DIVu code
4.11 Write CSR_REG
4.12 Read CSR_REG
4.13 INTu states
4.14 Bus code
4.15 RAM code
4.16 GPIO code
4.17 Floorplan configuration
4.18 Power Rings configuration
5.1 Block diagram of Simulation
5.2 Information from .elf file
5.3 Information from .log file
5.4 Simulation waveform
5.5 Load sources
5.6 Constraints used with SoI_LIB
5.7 Constraints used with Lsi_10K library
5.8 Final routed layout
5.9 Area report of final routed layout
List of Tables
2.1 CISC and RISC comparison
3.1 Registers’ role in the first standard calling convention
3.2 CSR Listing
5.1 Hit rate of the bins defined in the cover group
5.2 Simulation result of each program
5.3 Synthesis result of each library
List of acronyms and abbreviations

ALU     Arithmetic Logic Unit
CCOPT   Clock Concurrent Optimization
CISC    Complex Instruction Set Computer
CMOS    Complementary metal–oxide–semiconductor
CPI     Cycles Per Instruction
CPU     Central Processing Unit
CSR     Control and Status Register
CTS     Clock Tree Synthesis
DUT     Design Under Test
GPIO    General Purpose Input/Output
ISA     Instruction Set Architecture
JTAG    Joint Test Action Group
MSB     Most Significant Bit
PCB     Printed Circuit Board
PR      Place & Route
QNN     Quantized Neural Network
RAM     Random Access Memory
RISC    Reduced Instruction Set Computer
RISC-V  Reduced Instruction Set Computer Five
ROM     Read Only Memory
RTL     Register Transfer Level
RV32I   32-bit Base Integer Instruction Set
RV32IM  32-bit Base Integer Instruction Set with M Standard Extension
SoC     System on Chip
SoI     Silicon on Insulator
SRA     Shift Right Arithmetically
SRL     Shift Right Logically
UART    Universal Asynchronous Receiver-Transmitter

Chapter 1
Introduction
This chapter describes the research area. The problem definition, purpose,
research methodology and goals are discussed. Finally, the structure of this
thesis is presented.
1.1 Motivation
One of NASA’s missions in the 2030s is to send a rover to Venus. However,
the surface temperature of Venus is very high, which makes it difficult to use
standard Complementary metal–oxide–semiconductor (CMOS) circuits and
Printed Circuit Boards (PCBs). The threshold voltage of electronic devices
changes with temperature, which affects the setup and hold times of
traditional digital circuits [1], and commonly used PCB metals will melt.
In order to build electronic circuits that are robust at high temperatures, a
new technique that builds circuits from SiC has been proposed [2]. Because of
its high bandgap, SiC can endure up to 1,000 degrees Celsius. However, this
technology is not currently available for building digital circuits. Another
new technique is Silicon on Insulator (SoI) CMOS. When an insulator is used as
the substrate, there is a higher bandgap, which allows operation at a higher
temperature [3], although not as high as for SiC. The advantage of this
technology is that it can be used to manufacture digital circuits in the KTH
Electrum Laboratory. The downside is that it uses a line width of 1 µm
(around 2500 NAND-equivalents per mm², with a propagation delay of around
300 ps per stage). To reduce the manufacturing cost of the masks, the area
should be limited to 7 × 7 mm² (around 120k gates). To test this new process
and verify that it is realistic to use for common digital circuits, a CPU is
designed.
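As a rough check of the figures quoted above, the gate budget of the die can be sketched in a few lines of Python. The density, die size and per-stage delay are taken from the text. The logic depth used for the clock period estimate is a hypothetical value chosen only for illustration and does not come from the design.

```python
# Figures quoted in the text for the 1 um SoI CMOS process.
GATES_PER_MM2 = 2500        # NAND-equivalents per mm^2
DIE_SIDE_MM = 7             # die limited to 7 x 7 mm
STAGE_DELAY_PS = 300        # propagation delay per logic stage


def gate_budget(side_mm, gates_per_mm2):
    """Upper bound on NAND-equivalent gates for a square die."""
    return side_mm * side_mm * gates_per_mm2


def min_clock_period_ps(logic_depth, stage_delay_ps=STAGE_DELAY_PS):
    """Lower bound on the clock period for a path of `logic_depth` stages.

    `logic_depth` is an assumption: the text does not state the depth of
    the critical path, and wiring and setup time are ignored here.
    """
    return logic_depth * stage_delay_ps


budget = gate_budget(DIE_SIDE_MM, GATES_PER_MM2)
print(f"gate budget: {budget} NAND-equivalents")

# Hypothetical critical path of 50 logic stages (assumption).
period_ps = min_clock_period_ps(50)
print(f"clock period >= {period_ps / 1000} ns for 50 stages")
```

The computed budget of 122,500 NAND-equivalents is consistent with the "around 120k gates" stated above.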
1.2 Purpose
The RISC-V processor has been receiving much attention in recent years.
Embedded RISC-V processors have become common in many new chips.
Many institutions have also successively developed products based on RISC-
V, such as ETH Zurich’s Zero-riscy and GreenWave’s Gap8 CPU [4]. The
biggest advantage of RISC-V is that it is open-source and free, which means
that it can help developers to complete CPU designs at a lower cost. Besides,
the simplicity of the basic instruction set and the flexibility of the coding
process make it even more popular. The purpose of this project is to explore the
advantages and disadvantages of RISC-V by designing an open-source RISC-
V processor that can also be used in future projects.
1.3 Problem Statement

Figure 1.1 illustrates how a computer processor handles the instructions it
receives. The instructions that need to be executed are fetched from memory
and decoded into commands by the control unit. Finally, the Arithmetic Logic
Unit (ALU) processes the information and stores the results in registers or
memory.
Figure 1.1: Machine cycle
There are many CPU vendors, such as Intel, ARM, IBM, and AMD. All their CPUs
can be divided into two types: Complex Instruction Set Computer (CISC)
processors and Reduced Instruction Set Computer (RISC) processors. CISC
usually has a large instruction set, which can contain more than 300
instructions. It is mainly applied in servers and general-purpose PCs.
However, it is not easy to implement as a fast chip because of the complexity
of its internal hardware architecture [5]. In contrast, RISC normally has a
simpler and faster architecture, with a small instruction set that contains
fewer than 100 instructions. This makes it easier for designers to implement
instruction pipelining and to write software [5]. Therefore, it is mainly
used for real-time applications [5].
Based on the above comparison, a RISC processor is more suitable for this
project. To build a RISC processor, there are two problems to consider:
• Can we build a RISC-V processor architecture that can execute all
instructions in the instruction set?
• Can this RISC-V processor design fit in a 7 × 7 mm² chip?
1.4 Goals
The goal of this project is to design a RISC-V processor that can execute
the benchmarks developed in two parallel master theses, using the selected
instruction set. After this, Place & Route should be done against the
experimental SoI library to determine the final area. The work can be divided
into:
1. Design a basic RISC-V processor capable of executing the 32-bit Base
Integer Instruction Set with M Standard Extension (RV32IM) (v2.0).
2. Test the system in SystemVerilog to verify that it works according to the
specification.
3. Run the machine code produced by the Virtual Prototype tool on the
RISC-V processor to get the execution time.
4. Synthesize the RISC-V processor using the given SoI library and do place &
route to get area estimates of the design.
1.5 Delimitations
The focus of this thesis project is the basic functionality of the processor.
The highest priority goal is to execute all instructions of the specified ISA.
In addition, the area of the processor should be limited to 7 × 7 mm². The
small size of the processor means that its performance may be limited. To
improve efficiency, instruction pipelining is implemented. Moreover, the
debugging module, Joint Test Action Group (JTAG), and the asynchronous serial
communication module, Universal Asynchronous Receiver-Transmitter (UART), are
not necessary. Hence, the processor does not have these two modules in this
project. Besides, the given library for synthesis is not yet optimized. Once
the library is optimized, the area can be further reduced.
1.6 Structure of the thesis

The thesis is divided into six chapters. Chapter 2 gives background
information about RISC-V and discusses the specification of the RISC-V
instruction set in detail. Chapter 3 presents the structure of the processor,
including the top-level structure and the core structure, and discusses each
module of the design in turn. Chapter 4 explains the implementation: RTL
coding, RTL simulation, digital synthesis, and Place & Route (PR). Chapter 5
shows the results and analysis of the simulation, synthesis and PR. Chapter 6
summarizes the whole project and discusses options for future work.
Chapter 2
Background Study
This chapter describes the background on RISC for this project. The
literature on the CISC and RISC architectures is discussed, along with the
benefits that the RISC architecture brings to real-time applications.
RISC-V, the RISC architecture used in this thesis project, is then introduced
and its specification is discussed. The instruction set RV32IM, which is used
in the project, is described in detail. Finally, related work on RISC-V is
introduced, including some RISC-V products developed by other researchers and
the role of RISC-V ISA extensions.
2.1 RISC vs CISC

Nowadays, computer architectures are mainly divided into RISC and CISC, and
the two are often compared with each other. Table 2.1 summarizes some of the
differences between them. Figure 2.1 [6] illustrates how CISC and RISC each
execute an instruction.
Table 2.1: CISC and RISC comparison
CISC                                            RISC
More than 300 instructions in instruction set   Fewer than 100 instructions in instruction set
Non-uniform instruction sizes and formats       Fixed instruction sizes and formats
Many transistors used for instruction storage   Many transistors used for more registers
Long cycle time per instruction                 One cycle time per instruction
More addressing modes                           Fewer addressing modes
Difficult to pipeline                           Easy to pipeline
Emphasis on hardware                            Emphasis on software
Figure 2.1: CISC vs RISC
A CISC instruction set usually has a large number of instructions, so more
storage capacity is required to hold them. This is why many transistors are
used for instruction storage. One property of CISC is that the instructions
can be highly complex. They may have variable lengths and formats, and several
low-level operations can be executed in one instruction. CISC instructions may
carry much information, such as addressing mode, operation code and operands
[7]. As a result, a single instruction may take more cycles to execute, and
the number of instructions executed per second is small. Besides, the variable
execution time of each instruction may make scheduling and pipelining
difficult to implement [8]. In general, CISC makes use of microcode to decode
these complex instructions. This is because, in the beginning, some scholars
believed that hardware design was more mature than compiler design. Therefore,
they implemented parts of the functionality in hardware and/or microcode,
rather than in a memory-constrained compiler [9].
A RISC instruction set usually contains fewer instructions. Unlike CISC, the
instructions of a RISC instruction set are simple. All instructions have the
same length and fields, which makes it possible to pipeline them. A new
instruction can start every clock cycle, but each individual instruction
still takes several clock cycles to execute [10]. The number of instructions
executed per second is large. Additionally, RISC implements a load-store
architecture: only load and store instructions can access memory. Arithmetic
operations cannot be performed on memory directly; they are register-to-
register only. Therefore, a large number of registers is required to avoid
frequent interactions with memory [5] [9]. Because the instructions are
simple, decoding does not require complex hardware, and the work done by
hardware is reduced. However, a compiler is needed to break high-level
programs down into simple instructions, which moves the effort to software
[11].
Each architecture has its advantages and disadvantages. CISC processors are
suitable for workstations, PCs and servers. They have higher power consumption
but can realize special functions. RISC processors are mainly used for
real-time applications, including telecommunications, video and image
processing. They have a smaller area and lower power consumption.
2.2 RISC-V
In this thesis project, the RISC-V Instruction Set Architecture (ISA), one kind
of RISC architecture, is used as the ISA of the processor. RISC-V is a free and
open instruction set that debuted in 2011. Because of its high flexibility
and low cost, it is well suited to designing specialized microprocessors for
custom chips that target specific applications [12]. RISC-V defines many
standard extensions. For instance, the "A" standard extension contains
instructions for reading and writing memory atomically, the "M" standard
extension provides integer multiplication and division, and the "F" standard
extension provides floating-point computation. Users can combine the optional
extensions with a base instruction set according to their needs [13].
Therefore, RISC-V has become more and more popular in many applications, such
as portable devices, wearable devices and aerospace equipment. Besides,
because RISC-V is open-source, it has great commercial appeal. However, the
software compatibility of RISC-V is poor at the moment, because most existing
programs are designed for ARM or x86 [14].
2.2.1 RV32I
RV32I is the base instruction set of the processor in this project. The
RISC-V ISA consists of a base integer ISA and some optional standard
extensions. The instructions of the base integer instruction set can be
executed on most processors. Compared with other RISC architectures, all
base integer ISAs of RISC-V have no branch delay slots and support optional
variable-length instruction encodings [15]. They are restricted to a minimal
set of instructions that is sufficient for basic functions. There are four
base ISAs in the RISC-V family: RV32I, RV64I, RV128I and RV32E. Each base
instruction set has its own integer register width, address space size and
number of integer registers. RV32I and RV64I are similar and provide 32-bit
and 64-bit address spaces, respectively. RV32E is a variant of RV32I that has
half the number of integer registers and was originally designed for small
microcontrollers [15]. Two’s complement representation is used for signed
integer values in all base instruction sets.
RV32I has six instruction formats: R-type, I-type, S-type, B-type, U-type and
J-type. All formats have a fixed length of 32 bits, and instructions must be
aligned on a 4-byte boundary in memory. Figure 2.2 shows the formats of the
RISC-V base instruction types. To simplify decoding, the source registers
(rs1 and rs2) and the destination register (rd) are kept in the same position
in every format that uses them. Immediate values are sign-extended, and bit
31 of the instruction is always the sign bit of the immediate.
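The fixed field positions and the sign extension described above can be illustrated with a small decoder for the I-type format. This is a minimal Python sketch, not the decoder of this thesis. The helper names are hypothetical, while the bit positions follow the RISC-V base instruction formats.

```python
def bits(word, hi, lo):
    """Extract bits hi..lo (inclusive) of a 32-bit instruction word."""
    return (word >> lo) & ((1 << (hi - lo + 1)) - 1)


def sign_extend(value, width):
    """Interpret `value` as a two's complement number of `width` bits."""
    sign = 1 << (width - 1)
    return (value ^ sign) - sign


def decode_i_type(word):
    """Split an I-type instruction into its fields."""
    return {
        "opcode": bits(word, 6, 0),
        "rd": bits(word, 11, 7),
        "funct3": bits(word, 14, 12),
        "rs1": bits(word, 19, 15),
        # imm[11:0] sits in bits 31..20, so bit 31 is the sign bit.
        "imm": sign_extend(bits(word, 31, 20), 12),
    }


# addi x1, x2, -1: imm = 0xFFF, rs1 = 2, funct3 = 0, rd = 1, opcode = 0x13
word = (0xFFF << 20) | (2 << 15) | (0 << 12) | (1 << 7) | 0x13
print(hex(word), decode_i_type(word))
```

Because rs1 and rd occupy the same bits in every format that contains them, the same `bits` calls could be reused when decoding the other formats.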
Figure 2.3 shows all instructions in RV32I, which contains 40 unique
instructions. An instruction name ending with I denotes an operation with an
immediate operand. An instruction name ending with U denotes an operation on
unsigned numbers. The descriptions of all the instructions can be found in
Appendix A.1.
Figure 2.2: RISC-V base instruction formats
Figure 2.3: RV32I Base Instruction Set
2.2.2 "M" Standard Extension

The "M" standard extension in RISC-V is used for integer multiplication and
division. During multiplication or division, the operands are held in two
integer registers. Figure 2.4 shows the format of the multiplication
instructions. The basic multiplication in the "M" extension multiplies rs1 by
rs2. The values of rs1 and rs2 are both XLEN bits wide, and the result is the
lower XLEN bits of the product. Figure 2.5 shows the format of the division
instructions. The basic division divides the XLEN-bit value of rs1 by the
XLEN-bit value of rs2, rounding towards zero. The REM instruction returns the
remainder of the corresponding division operation. All the instructions in
the "M" standard extension are listed in Appendix A.2.
Figure 2.4: Multiplication Operations
Figure 2.5: Division Operations
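The semantics of MUL, DIV and REM described above can be summarized in a short Python reference model with XLEN = 32. This is a sketch for illustration only and does not describe the hardware of this thesis. The function names are hypothetical. The behaviour for division by zero (quotient with all bits set, remainder equal to the dividend) is taken from the RISC-V specification rather than from the text above.

```python
XLEN = 32
MASK = (1 << XLEN) - 1


def to_signed(x):
    """Interpret an XLEN-bit pattern as a two's complement integer."""
    x &= MASK
    return x - (1 << XLEN) if x >> (XLEN - 1) else x


def mul(rs1, rs2):
    """MUL: lower XLEN bits of the product."""
    return (rs1 * rs2) & MASK


def div(rs1, rs2):
    """DIV: signed division rounding towards zero."""
    a, b = to_signed(rs1), to_signed(rs2)
    if b == 0:
        return MASK                      # all bits set, per the RISC-V spec
    q = abs(a) // abs(b)                 # truncate the magnitude
    if (a < 0) != (b < 0):
        q = -q
    return q & MASK


def rem(rs1, rs2):
    """REM: signed remainder, sign follows the dividend."""
    a, b = to_signed(rs1), to_signed(rs2)
    if b == 0:
        return rs1 & MASK                # remainder equals the dividend
    r = abs(a) % abs(b)
    return (-r if a < 0 else r) & MASK


print(to_signed(div(-7 & MASK, 2)), to_signed(rem(-7 & MASK, 2)))
```

Note that Python's own `//` operator rounds towards negative infinity, which is why the model divides the magnitudes and restores the sign afterwards.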
In this project, the processor has a division unit but no separate multiplier
unit. Multiplication is written in SystemVerilog with the * operator, and the
EDA tool generates its default multiplier circuit during synthesis. However,
this multiplier cannot handle negative operands. Hence, in the execution
unit, if an operand of a multiplication is negative, its Two’s complement is
taken first. In addition, if the final result of the multiplication is
negative, the Two’s complement of the temporary result is the output.
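The sign handling described above can be sketched as follows. This is a Python illustration under stated assumptions, not the SystemVerilog of the execution unit: `unsigned_multiplier` is a hypothetical stand-in for the synthesized multiplier, and the result is assumed to be negative exactly when the operand signs differ.

```python
XLEN = 32
MASK = (1 << XLEN) - 1
SIGN_BIT = 1 << (XLEN - 1)


def twos_complement(x):
    """Negate an XLEN-bit value: invert all bits and add one."""
    return (~x + 1) & MASK


def unsigned_multiplier(a, b):
    """Stand-in for the tool-generated multiplier (non-negative inputs)."""
    return a * b


def signed_mul_low(rs1, rs2):
    """Multiply two XLEN-bit two's complement values, return the low XLEN bits."""
    neg1 = bool(rs1 & SIGN_BIT)
    neg2 = bool(rs2 & SIGN_BIT)
    # Step 1: convert negative operands to their magnitudes.
    mag1 = twos_complement(rs1) if neg1 else rs1
    mag2 = twos_complement(rs2) if neg2 else rs2
    # Step 2: multiply the magnitudes.
    product = unsigned_multiplier(mag1, mag2)
    # Step 3: negate the temporary result if exactly one operand was negative.
    if neg1 != neg2:
        product = ~product + 1
    return product & MASK


print(hex(signed_mul_low(-3 & MASK, 5)))
```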
2.3 Related Work

Since its advent, RISC-V has developed quickly. More and more scholars and
institutions have researched this architecture, and many researchers are
developing RISC-V products or applying RISC-V to specific applications. In
this section, two RISC-V processor cores, RI5CY and Zero-riscy, are
introduced. In addition, the role that RISC-V processors play in neural
networks is mentioned.
Andreas Traber, who worked in the Integrated Systems Lab of ETH, developed
a four-stage in-order pipelined RISC-V processor core in SystemVerilog in
2016. This core, called RI5CY, became popular for many applications. The
first Core-V core in the OpenHW Group family was based on RI5CY [16].
RI5CY supports RV32IMC, which comprises the base integer instruction set, the
integer multiplication and division instruction set, and the compressed
instruction set. Figure 2.6 shows the block diagram of RI5CY with its
four-stage pipeline. Each stage can work independently, even if the previous
stage is stalled. If one stage is to proceed, the current stage and the next
stage should both be on standby. This is because an instruction can only be
propagated to the next stage when the stages involved are all ready to
receive a new instruction. Additionally, each stage has two control signals.
One is the enable signal, which activates the stage to process an
instruction. The other is the clear signal, which removes completed
instructions from the stage [17]. RI5CY also has a division unit. A division
or remainder instruction takes between 2 and 32 cycles to execute, depending
on the operand values [17].
Figure 2.6: Block Diagram of RI5CY
Zero-riscy was developed based on RI5CY. It is a two-stage in-order pipelined
RISC-V processor core that also supports the RV32IMC instruction set. Two
parameters can be used to enable or disable the "M" standard extension and
the base integer instruction set RV32E [18]. Compared with RI5CY, Zero-riscy
has a smaller area and higher power efficiency. Its two pipeline stages are
the instruction fetch stage and the instruction decode and execution stage.
In the instruction fetch stage, a buffer collects data from the instruction
memory. The buffer is also responsible for generating instruction addresses
and for storing instructions when the pipeline stage is stalled. The second
stage decodes the instructions and reads the operands from the register file.
The operands are moved to the ALU or the multiplication unit before the
instructions are executed. The ALU can fully support the RV32IMC instruction
set, and it consists of one 32-bit adder, one 32-bit shifter and one logic
unit [19].
In [19], Schiavone et al. also compared three RISC-V processor cores, Riscy,
Zero-riscy and Micro-riscy, in terms of energy consumption. Many factors may
affect power efficiency, such as core architecture, voltage, operating
frequency and timing constraints. Schiavone et al. propose a method that
minimizes the power consumption for a given workload.
Garofalo et al. [20] have designed a set of extensions to the RISC-V ISA that
implement low-bitwidth arithmetic instructions, which can improve the
performance of Quantized Neural Networks (QNNs). The RI5CY processor core was
chosen as the core micro-architecture because it has an extension that
targets power-efficient digital signal processing. The proposed extension is
integrated into a RI5CY processor at the RTL level. The authors compared the
original RI5CY core with the extended one, collecting the performance, power
consumption and area estimates of both systems. Because an extra extension is
added, the area of the extended RI5CY is larger than that of the original.
However, the execution latency is much lower, and the energy efficiency is
higher. This is one example of using RISC-V to achieve optimization and
improvement. RISC-V has become more and more popular in recent years, and it
will be applied to more fields in the future.
Chapter 3
RISC-V Processor Structure
This chapter discusses the structure of the RISC-V processor. Section 3.1
presents the core architecture of the processor. In addition, each module of
the core is described in detail. In section 3.2, the structure of RISC-V SoC is
discussed. The extra modules of the RISC-V platform, like peripheral units
and memory, are also explained.
3.1 RISC-V core

There is one RISC-V processing core in this SoC design. A component is
termed a core if it contains an independent instruction fetch unit. The
RISC-V core continuously executes instructions that are stored in memory.
Depending on the instruction, it may load data from, or store data to, the
memory and the peripheral equipment. A RISC-V core might support multiple
hardware threads through multi-threading, but this core does not. In
addition, a RISC-V core might have additional specialized instruction-set
extensions and an added co-processor. Figure 3.1 shows the structure of the
RISC-V core. This RISC-V core has a three-stage pipeline, which gives the
processor higher efficiency. The three stages are 1) instruction fetch,
2) instruction decode and read of information from memory (register bank or
RAM), and 3) computation and write back of data. Each stage of the pipeline
is independent. There are ten components in total: PC_REG (program counter
register), IFu_IDu (instruction fetch and pipeline unit), IDu (instruction
decoder), IDu_EXu (pipeline unit between IDu and EXu), EXu (execution
unit), DIVu (division co-processor), reg_bank (general purpose registers),
CSR_REG (control and status registers), INTu (interrupt unit) and CTRLu
(control unit). Each component is discussed in
the following subsections.
Figure 3.1: RISC-V core structure
3.1.1 PC register and General purpose registers

Because the base instruction set is RV32I, there are 32 registers in the
register bank, which are all 32 bits wide. All these registers can be written
synchronously and read asynchronously.
Table 3.1: Registers’ role in the first standard calling convention

Register        ABI Name   Description
x0              zero       hardwired zero
x1              ra         return address
x2              sp         stack pointer
x3              gp         global pointer
x4              tp         thread pointer
x5-7, x28-31    t0-6       temporary registers
x8              s0/fp      save register/frame pointer
x9, x18-27      s1-11      save register
x10-11          a0-1       function arguments/return values
x12-17          a2-7       function arguments
These registers also serve different purposes. Table 3.1 [21] describes the
role of each type of register. Register x0 is hardwired to the constant zero.
Register x1 (ra) holds the return address used to get back to the caller of
the current subroutine. It is usually written by jump instructions. During a
jump-and-link instruction, the new address is moved into the PC register, and
the address of the instruction following the jump (PC + 4) is saved to
register x1 at the same time. This makes it convenient to get back from the
subroutine to the instruction that follows the jump instruction. Register x2
(sp) is the stack pointer register, which keeps track of the top of the
stack. A call stack usually stores the return address of each invocation and
the variables of the procedure. The stack is accessed through push and pop
operations. The push operation moves sp down and then stores data to the
corresponding address in memory, so sp points to the most recently stored
item. The pop operation loads data and then moves sp up. Register x3 (gp) is
the global pointer register, which is used to access global data and allows
memory accesses to be optimized. In general, it should cover the most
intensively used region of RAM. That means register x3 holds a base address
located in the region of the global variables. Register x4 (tp) is the thread
pointer register, which is used to access thread-specific variables. In
multi-threaded applications, each thread may have a different value in the tp
register. Registers x5-7 (t0-2) and x28-31 (t3-6) are temporary registers.
They are also called caller-saved registers. During execution, these
registers are used to keep intermediate values. If the caller still needs
their values after a function call, it must save them before making the
call. In general, these registers can be used freely, but one must assume
that their contents are destroyed by other functions. Hence, they are
suitable if only a few functions are called [22]. Registers x8-9 (s0-1) and
x18-27 (s2-11) are save registers, which are callee-saved registers. A
function must save the current value of such a register on the stack before
using it and restore it before returning. The caller may therefore assume
that their contents are preserved across function calls. Therefore, if there
are a lot of function calls and some values need to be preserved across them,
it is appropriate to use save registers [22]. Register x8 (s0) also serves as
the frame pointer register. A frame pointer is needed to point to the base of
the stack frame, because it is otherwise difficult to keep track of the
location of data on the stack. Registers x10-17 (a0-7) are function argument
registers, which are used to pass arguments. The arguments to a subroutine
are placed in the argument registers first. If there are return values, they
are passed back in registers x10-11 (a0-1).
There are two ports for reading data and one port for writing data in the
register bank. This is because one instruction may read the values of two
registers simultaneously. Two read ports allow two registers to be read in
one clock cycle, which happens in the instruction decode stage of the
pipeline. To read the value of a register, its 5-bit register address must
be supplied. To write to one of the registers, the signals wregaddr (write
register address), wren (write enable), and wrreg (write data) are required.
The data is written into the register on the next positive edge of the clock.
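The port behaviour described above can be illustrated with a small behavioural model. This is only a sketch in Python, not the SystemVerilog of the thesis: the class name RegisterBank and its methods are hypothetical, while the signal names wren, wregaddr and wrreg are taken from the text.

```python
class RegisterBank:
    """Behavioural sketch: 32 x 32-bit registers, 2 read ports, 1 write port."""

    def __init__(self):
        self.regs = [0] * 32
        self._pending = None  # write request waiting for the clock edge

    def read(self, raddr1, raddr2):
        # Asynchronous read: both ports return the stored values immediately.
        return self.regs[raddr1 & 0x1F], self.regs[raddr2 & 0x1F]

    def request_write(self, wren, wregaddr, wrreg):
        # Drive the write signals; nothing is stored until posedge().
        if wren:
            self._pending = (wregaddr & 0x1F, wrreg & 0xFFFFFFFF)
        else:
            self._pending = None

    def posedge(self):
        # Synchronous write on the positive clock edge; x0 stays zero.
        if self._pending is not None:
            addr, data = self._pending
            if addr != 0:
                self.regs[addr] = data
            self._pending = None
```

A write requested in one cycle only becomes visible to the read ports after posedge() has been called, which mirrors the synchronous-write, asynchronous-read behaviour of the register bank.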
There is one additional register, the program counter (PC) register. It
holds the address of the instruction that is currently being fetched. The
program counter cannot be written or read directly by store and load
instructions; its value changes only as a result of instruction execution.
It increases by 0x4 on each positive edge of the clock in order to point to
the next instruction stored in the ROM. When a jump instruction or a branch
instruction is executed and the jump flag is asserted, the program counter
register takes the jump address as input on the next positive edge instead
of increasing by 0x4. If the hold flag is asserted, the pipeline is stalled:
the PC register takes its current output as input at the next positive edge
instead of increasing by 0x4, which holds the pipeline and prevents the next
instruction from being fetched.
When the active-low reset signal is asserted, the values of all 32 registers
are reset to 0x00000000. The program counter is also reset to 0x00000000,
which points to the first instruction stored in the ROM. All registers start
running on the first rising edge after the reset signal is de-asserted.
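The update rule of the program counter can be summarised in a few lines. The following Python sketch is a behavioural illustration only; the function name next_pc and its argument names are hypothetical, and the priority order (reset, then jump, then hold) is an assumption chosen to match the order in which the cases are described.

```python
PC_RESET = 0x00000000  # reset value stated in the text


def next_pc(pc, rst_n, jump_flag, jump_addr, hold_flag):
    """Value the PC register takes on the next positive clock edge."""
    if not rst_n:                     # active-low reset asserted
        return PC_RESET
    if jump_flag:                     # jump, or branch that is taken
        return jump_addr & 0xFFFFFFFF
    if hold_flag:                     # pipeline stalled: keep current value
        return pc
    return (pc + 4) & 0xFFFFFFFF      # sequential fetch of the next word
```

For example, next_pc(0x10, 1, 0, 0, 0) yields 0x14, while asserting hold_flag keeps the value at 0x10.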
3.1.2 Pipeline units

The RISC-V processor has a three-stage pipeline in this project. However,
many common processors have a five-stage pipeline which consists of fetch
stage, decode stage, execute stage, access memory stage and write back stage.
Compared with the three-stage pipeline, the five-stage pipeline has higher
efficiency but a larger area. Due to the area requirement of this project, the
three-stage pipeline is chosen after consideration.
There are two pipeline units, IFu_IDu and IDu_EXu, in the processor. Both
are sequential logic. In this processor, there is no dedicated instruction
fetch module. The output signal pc_o of the PC register is connected to the
address input of the ROM. Because reading the ROM is a combinational
operation, the instruction output from the ROM is available at the input of
the IFu_IDu unit. Therefore, the role of the IFu_IDu unit is to capture the
instruction from the ROM and pass it to the instruction decoder IDu on each
rising edge. The IDu_EXu unit works in the same way. It captures the decoded
instruction from IDu and the data from the registers, then passes them to
the execution unit EXu. If the reset signal or the hold flag is asserted,
both pipeline units output a NOP instruction to prevent further execution in
the next unit. In addition, all other data output signals are set to zero.
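The behaviour of a pipeline unit on a clock edge can be sketched as follows. This Python model is illustrative only: the function name ifu_idu_next is hypothetical, and the NOP constant uses the canonical RISC-V encoding of ADDI x0, x0, 0, which is an assumption because the text does not state the constant used in the RTL.

```python
NOP = 0x00000013  # canonical RISC-V NOP (ADDI x0, x0, 0); assumed, not from the thesis


def ifu_idu_next(rom_inst, pc, rst_n, hold):
    """Return the (instruction, address) pair latched on a rising clock edge."""
    if not rst_n or hold:
        # Reset or hold: insert a NOP and zero the accompanying address.
        return NOP, 0x00000000
    # Normal operation: forward the ROM output and its address to IDu.
    return rom_inst & 0xFFFFFFFF, pc & 0xFFFFFFFF
```

The IDu_EXu unit would follow the same pattern with a wider set of signals.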
In fact, IFu_IDu and IDu_EXu are two groups of registers that are placed
between large combinational circuits in order to shorten the critical path
and create a pipeline structure in the data path. Hence, these two units are
called pipeline units. Shortening the critical path shortens the clock
period, which allows a higher clock frequency. However, using a pipeline
structure in the processor also causes some problems. First, when a branch
instruction or jump instruction is executed and the jump is taken, the whole
pipeline needs to be emptied, which costs some clock cycles. In addition,
because a division instruction takes more than one clock cycle to execute,
the pipeline needs to be paused during the division operation. Dedicated
control signals are designed to handle these cases.
3.1.3 Instruction decoder

The decoding module IDu is a large, purely combinational logic circuit. It
decodes the input instruction and then passes the decoded instruction to the
IDu_EXu unit. IDu outputs the read register addresses to the register bank
to load the register data, which is then passed to the IDu_EXu unit. If the
instruction requires writing a register, the IDu module passes the register
write enable signal and the write register address to the register bank.
Each instruction can be divided into at most six
parts:
1. Opcode
2. Function 3 (Funct3)
3. Function 7 (Funct7)
4. Source register 1 (rs1)
5. Source register 2 (rs2)
6. Destination register (rd)

The decoder extracts the opcode first to recognize the type of the
instruction. Then function 3 is extracted, which distinguishes between
instructions of the same type. Sometimes function 7 also needs to be
extracted. For example, the instructions ADD and MUL have the same opcode
and function 3, but their function 7 fields are different. After determining
what the instruction is, the decoder extracts the addresses of the necessary
registers and sets the operands required by the instruction. All this
information is passed to the EXu unit.
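The field extraction described above can be sketched in Python. The bit positions follow the standard RV32 R-type layout; the function names decode_fields and classify_r_type are hypothetical, and the sketch only covers the ADD/MUL/SUB case used as an example, not the full decoder of the thesis.

```python
def decode_fields(inst):
    """Split a 32-bit instruction word into its fixed-position fields."""
    return {
        "opcode": inst & 0x7F,          # inst[6:0]
        "rd":     (inst >> 7) & 0x1F,   # inst[11:7]
        "funct3": (inst >> 12) & 0x7,   # inst[14:12]
        "rs1":    (inst >> 15) & 0x1F,  # inst[19:15]
        "rs2":    (inst >> 20) & 0x1F,  # inst[24:20]
        "funct7": (inst >> 25) & 0x7F,  # inst[31:25]
    }


def classify_r_type(inst):
    """Tell ADD, MUL and SUB apart: same opcode and funct3, different funct7."""
    f = decode_fields(inst)
    if f["opcode"] == 0b0110011 and f["funct3"] == 0b000:
        names = {0b0000000: "ADD", 0b0000001: "MUL", 0b0100000: "SUB"}
        return names.get(f["funct7"], "UNKNOWN")
    return "OTHER"
```

For instance, the words 0x002081B3 (add x3, x1, x2) and 0x022081B3 (mul x3, x1, x2) differ only in function 7, so the opcode and function 3 alone cannot separate them.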
3.1.4 Execution unit

The execution unit EXu is also a key combinational circuit, in which the
functions of all instructions are implemented. According to the instruction,
the EXu unit performs the corresponding operation on the input data and
outputs the result. For instance, for an add instruction, the values of
source register rs1 and source register rs2 are added. For a load
instruction, the EXu unit reads the data that is stored at the corresponding
address of the RAM. For a jump or branch instruction, the EXu unit sends a
jump signal. Most instructions are finished by the execution unit in one
clock cycle, but all division and remainder instructions require the DIVu
unit, which needs at least 32 clock cycles to finish. While a division or
remainder instruction is being executed, a hold flag or a jump flag is
needed. In general, the EXu unit fetches the necessary data from the
pipeline unit IDu_EXu, the co-processor DIVu, and the interconnection unit
BUS, then processes the instruction and outputs the result to the
general-purpose registers or the RAM.
3.1.5 Division unit

A co-processor is a unit that is attached to the RISC-V core. It is mostly
sequenced by a RISC-V instruction stream that contains instruction-set extensions.
In this project, there is a divider DIVu attached to the RISC-V core, which
is used to execute division and remainder instructions from "M" standard
extension. Once the execution unit EXu receives a division instruction, it
sends the data to the division unit for calculation. Figure 3.2 indicates how
the division unit DIVu implements integer division.
Figure 3.2: RISC-V Division unit algorithm
Here is an example of a calculation using this method. Let us assume the
dividend is 15 (decimal) and the divisor is 4 (decimal). In binary, the
dividend is 1111 and the divisor is 100. Therefore, m = 1111, n = 100,
k = 4 (the number of bits of m), and s is the quotient, whose initial value
is 0. The minuend is initialized to m[3] = 1. In the first loop, the minuend
is smaller than n, so s[3] = 0; since k > 1, minuend = (minuend, m[2]) = 11
and k = 3. In the second loop, minuend = 11, which is smaller than n. Hence,
s[2] = 0; since k > 1, minuend = (minuend, m[1]) = 111 and k = 2. In the
third loop, minuend = 111, which is larger than n. Hence, s[1] = 1; since
k > 1, minuend = (minuend - n, m[0]) = (11, 1) = 111 and k = 1. In the last
loop, minuend = 111, which is larger than n. Therefore, s[0] = 1, and since
k = 1 the calculation ends. Finally, s (quotient) = 0011, which is equal to
3 (decimal). The remainder = minuend - divisor = 111 - 100 = 11, which is
equal to 3 (decimal). The result is easy to verify: 3 x 4 + 3 = 15. In this
case, the final minuend is larger than the divisor, so the remainder is
equal to the minuend minus the divisor. However, if the final minuend is
smaller than the divisor, the remainder has the same value as that minuend.
Because the dividend is 32 bits wide, each division takes at least 32 cycles
to complete.
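The bit-serial method can be checked with a short behavioural model. This Python sketch is an illustration under stated assumptions, not the DIVu RTL: the function name divu_bit_serial is hypothetical, the comparison is taken as "greater than or equal", only unsigned operands are handled, and the final subtraction is folded into the last loop iteration so that the returned minuend is already the remainder.

```python
def divu_bit_serial(m, n, width=32):
    """Unsigned bit-serial division; returns (quotient, remainder, cycles)."""
    if n == 0:
        raise ZeroDivisionError("division by zero is outside this sketch")
    s = 0                                # quotient, built one bit per loop
    minuend = (m >> (width - 1)) & 1     # start with the MSB of the dividend
    cycles = 0
    for k in range(width, 0, -1):        # k counts down as in the example
        cycles += 1
        bit = 1 if minuend >= n else 0
        s |= bit << (k - 1)
        if bit:
            minuend -= n                 # subtract the divisor when it fits
        if k > 1:
            # Append the next lower bit of the dividend to the minuend.
            minuend = (minuend << 1) | ((m >> (k - 2)) & 1)
    return s, minuend, cycles
```

Calling divu_bit_serial(15, 4, width=4) reproduces the worked example with quotient 3 and remainder 3 after four iterations; with the default width of 32, every division takes 32 iterations regardless of the operand values.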
3.1.6 Control unit

The control unit (CTRLu) handles the jump and hold signals of the
three-stage pipeline. The flags output by the EXu unit, the INTu unit and
the BUS are passed to the CTRLu unit. The CTRLu unit then outputs the hold
flag, the jump flag and the jump address to all pipeline units and to the
program counter register in order to control the pipeline.
3.1.7 CSR register

The RISC-V architecture defines a set of additional registers called Control
and Status Registers (CSRs). They are used to control or monitor the state
of the CPU. RISC-V assigns a separate address space to the CSRs. The address
of each CSR is 12 bits wide, so each hardware thread can have up to 4096
CSRs. Each CSR has its own name and role. The two most significant bits of
the CSR address (bits [11:10]) determine whether the register is read/write
or read-only. When they are equal to 00, 01 or 10, the CSR can be read and
written. If they are equal to 11, the CSR can only be read. The next two
bits of the CSR address (bits [9:8]) indicate the lowest privilege level
required to access the CSR: 00 means user level, 01 means supervisor level,
10 means hypervisor level, and 11 means machine level.
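The address convention can be expressed as a small decoding function. This is a Python sketch for illustration; the function name csr_access_info and the dictionary PRIV_NAMES are hypothetical, while the bit assignments are those given in the text.

```python
PRIV_NAMES = {0b00: "user", 0b01: "supervisor", 0b10: "hypervisor", 0b11: "machine"}


def csr_access_info(addr):
    """Return (read_only, lowest privilege level) for a 12-bit CSR address."""
    addr &= 0xFFF
    read_only = ((addr >> 10) & 0b11) == 0b11   # bits [11:10] == 11
    min_priv = PRIV_NAMES[(addr >> 8) & 0b11]   # bits [9:8]
    return read_only, min_priv
```

For the address 0xC00 the function returns (True, "user"), and for 0x305 it returns (False, "machine").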
Table 3.2: CSR Listing

Address   Privilege   Name
0xC00     URO         cycle
0x305     MRW         mtvec
0x342     MRW         mcause
0x341     MRW         mepc
0x304     MRW         mie
0x300     MRW         mstatus
0x340     MRW         mscratch
In general, the set of available CSRs depends on the privilege level.
Machine mode is the most basic one: all RISC-V processors must implement
machine mode, while the other three modes are optional. In this project, the
machine mode CSR is discussed. Although CSRs can provide many auxiliary
functions, not every RISC-V processor needs all types of CSR. Table 3.2
shows the CSRs that are used in this project. The register cycle counts the
number of clock cycles executed by the processor core since some arbitrary
time in the past [23]. The register mtvec holds the interrupt vector address
for machine mode. The register mcause records the trap cause. The register
mepc is used as the exception program counter. The register mie is the
interrupt enable register. The register mstatus holds the operating state of
the hart in machine mode. The register mscratch is a scratch register for
the machine trap handler.
The CSR registers can be read and written, which can affect the operation of
the processor. However, the base integer instructions cannot read or write the
CSR registers. Only CSR instructions can be used to read and write the CSR
registers. These instructions belong to "Zicsr" standard extension instruction
set. Figure 3.3 shows the structure of six CSR instructions.
Figure 3.3: CSR Instructions
The CSRRW instruction atomically swaps values between a CSR and the
general-purpose registers. It reads the previous value of the CSR and
zero-extends it to 32 bits. The extended value is written to the destination
register rd, while the original value of source register rs1 is written to
the corresponding CSR. For the CSRRWI instruction, the only difference is
that the value of source register rs1 is replaced by an unsigned immediate
number (uimm[4:0]), which is zero-extended to 32 bits. If the destination
register rd of these two instructions is x0, the CSR is not read and no
value is passed to the general-purpose registers. However, the CSR is still
written in this case.
The CSRRS instruction is used to read and set bits in a CSR. It reads the
previous value of the CSR, zero-extends it to a 32-bit value, and saves the
value to the destination register (rd). Unlike in the CSRRW instruction, the
value of source register rs1 is not written directly to the CSR. It is
treated as a bit mask that determines which bits of the CSR value are set.
For instance, if the value of source register rs1 is 11000 (binary), bits 4
and 3 of the value in the CSR are set, provided that these bits are
writable. The CSRRSI instruction has a similar function, but uses an
unsigned immediate number (uimm[4:0]) as the bit mask. Note that this value
is also zero-extended to 32 bits.
The CSRRC instruction is used to read and clear bits in CSR, which also reads
the old value of the CSR register and extends the value to 32 bits. It is similar
to the CSRRS instruction. The only difference is that its role is not to set the bit
but to clear the bit. For the CSRRCI instruction, the extended 32-bit unsigned
immediate number replaces the value of source register rs1. The rest of the
operations are the same as the CSRRC instruction.
For CSRRS and CSRRC, the read operation is unconditional: the CSR is read
regardless of which registers are named as rs1 and rd. However, if the
source register rs1 is x0, the write operation is not performed at all.
CSRRSI works like CSRRS and CSRRCI works like CSRRC; for these two
instructions, the write is skipped when the immediate is zero.
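The read and write rules of the six CSR instructions can be captured in one function. The following Python sketch is a behavioural illustration only; the function name csr_op and its parameters are hypothetical. The register and immediate forms share the same code path, with src being either the value of rs1 or the zero-extended uimm[4:0].

```python
MASK32 = 0xFFFFFFFF


def csr_op(op, csr_old, src, rd_is_x0=False, src_is_zero=False):
    """Return (value written to rd, or None if the CSR is not read; new CSR value).

    op          -- "RW", "RS" or "RC"
    src         -- rs1 value, or uimm[4:0] zero-extended to 32 bits
    rd_is_x0    -- the destination register index is x0
    src_is_zero -- the rs1 index is x0, or the uimm field is zero
    """
    src &= MASK32
    if op == "RW":
        rd_val = None if rd_is_x0 else csr_old    # no CSR read when rd is x0
        return rd_val, src                        # the write always happens
    rd_val = csr_old                              # RS and RC always read
    if src_is_zero:
        return rd_val, csr_old                    # no write is performed
    if op == "RS":
        return rd_val, (csr_old | src) & MASK32   # set the masked bits
    if op == "RC":
        return rd_val, csr_old & ~src & MASK32    # clear the masked bits
    raise ValueError("unknown CSR operation: " + op)
```

With a mask of 0b11000, csr_op("RS", 0b00001, 0b11000) leaves rd with the old value 0b00001 and updates the CSR to 0b11001. The sketch treats every bit as writable, which a real CSR need not do.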
3.1.8 Interrupt unit

Interrupt return (IR) is essentially a jump operation, but it also needs to
read and write the CSR registers. There are two types of RISC-V interrupts.
Synchronous interrupts are generated by instructions such as ECALL and
EBREAK. Asynchronous interrupts are generated by peripherals such as GPIO,
timers and UART. For the interrupt unit design, a simple method is to pause
the entire pipeline when the interrupt return signal is asserted. After
that, the jump address is set to the entry address of the interrupt, and the
necessary CSR registers (mstatus, mepc, mcause, etc.) are read and written.
Once these CSR registers have been accessed, the pipeline suspension can be
cancelled. The processor can then start to fetch instructions from the
interrupt entry address and enter the interrupt service routine.
In some cases, multiple interrupt requests occur at the same time. The INTu
unit takes all the requests and determines which one is sent to the CTRLu
unit. This is possible because the interrupt unit performs arbitration: a
simple conditional check of the different interrupt requests is implemented.
Then the flag is forwarded to the EXu unit. If a synchronous interrupt
request is received during the execution of a division instruction, it is
not processed by the INTu unit until the division instruction has finished.
If, however, an asynchronous interrupt request occurs during the execution
of a division instruction, the interrupt request is processed first. When
the interrupt ends, the calculation continues.
3.2 RISC-V SoC

Besides the RISC-V processor core, there are many other components on a
RISC-V platform. Examples are the various physical memory structures, such
as Random Access Memory (RAM), Direct Access Memory (DAM), Content
Addressable Memory (CAM) or Sequential Access Memory (SAM). A suitable
memory structure can be chosen according to the specific function of the
system. In addition, Input/Output (I/O) devices are needed to serve the
specific application. To let the core communicate with these peripheral
units, an interconnect structure (a bus) is needed. Moreover, some types of
accelerators can be added to improve the performance, but they are optional.
Therefore, accelerators are not discussed in this project.
Figure 3.4 shows the structure of the RISC-V SoC. In this RISC-V SoC, there
is a ROM, which stores all instructions, and a RAM, which is used as the
main memory. In general, the ROM is connected to the bus. However, in this
project, the ROM is connected to the RISC-V core directly. Alternatively,
the ROM could be connected to the bus as the highest-priority slave in this
SoC. In addition, a bus module connects all components together, and a GPIO
component acts as an I/O device, which allows the SoC to interact with
users. All modules are discussed one by one.
Figure 3.4: RISC-V SoC structure
3.2.1 Bus
How would the processor core and the peripherals communicate in an SoC
without a bus? It might look like Figure 3.5, where the processor core
interacts with each peripheral directly. In this project, a bus unit is used
to decode the address. Many standard bus designs are popular today, such as
SPI, Wishbone and AXI, and one of them can be used when designing a CPU.
However, these buses are relatively complicated. Thanks to its comprehensive
protocol, complex operations, and flexible control, AXI has high
performance. However, AXI is large and complex, which makes it difficult to
debug and use [24]. Moreover, an AXI bus may consume too many resources and
too much power. In contrast, the SPI bus is relatively simple. However, SPI
is a serial bus, while data transmission inside the CPU is generally
parallel [24], so a troublesome conversion would be needed. Therefore, a
simpler bus is applied in this project.
Figure 3.5: RISC-V SoC without Bus
The Bus unit is a combinational circuit that connects the bus master RISC-V
core to different bus slaves, such as RAM and GPIO. The bus has a master
interface and a slave interface, which must be matched when a component
wants to connect to the bus. The bus unit supports multi-master and multi-
slave connections, but only supports one master and one slave communication
at the same time. An arbitration mechanism with fixed priority is applied to
each master device on the Bus unit.
When the master component wants to write to a slave, it sets up the relevant
signals, such as the address, input data, bus request, and write enable
signals. Then the Bus unit creates a bus grant signal internally, which
grants this bus access. After that, the necessary signals from the bus, such
as the address, write data, and write enable signals, are passed to the
selected slave, and the write operation is finished. When the master
component wants to read from a slave, it only needs to set up the address,
bus request signal, and bus grant signal. After receiving the request and
signals from the bus, the selected slave passes the requested data to the
master through the bus.
The bus master is selected by fixed-priority bus arbitration. According to
the configuration of the bus, different bus masters have different bus
priorities, and the master with the higher priority is granted first. This
project has only one master, the CPU, so it always has the highest priority.
However, additional masters can be added in the future. The slave is
selected by address segment. Currently, the two Most Significant Bits (MSBs)
of the address are used to select the bus slave. Because only two MSBs are
used, at most four slaves can be connected to the bus. The bus selection
bits take two bits away from the address, so 2^30 bytes can be accessed for
each slave, which means the maximum size of each slave is 2^30 bytes. The
number of slaves can be increased by modifying the code to use more MSBs as
the address segment. The disadvantage of this method is also obvious: it
places a burden on programmers when designing programs, because load and
store instructions can only reach a given slave through addresses within its
specific region. For example, let us assume that the RAM is selected when
the two MSBs of the address are 00. Under this premise, if the two MSBs of
the address used by a load or store instruction are not 00, the instruction
cannot be executed normally. One remedy is for the compiler to add a special
symbol to all load and store instructions, so that the RAM is selected when
the bus detects this symbol.
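The address-segment decoding can be sketched as follows. This Python model is only an illustration: the function name bus_decode is hypothetical, and the mapping in SLAVES is an assumed example (only the RAM at segment 00 is taken from the example in the text).

```python
# Assumed example map; only "RAM at 00" follows the example in the text.
SLAVES = {0b00: "RAM", 0b01: "GPIO0", 0b10: "GPIO1"}


def bus_decode(addr):
    """Return (slave index, byte offset inside the slave) for a 32-bit address."""
    addr &= 0xFFFFFFFF
    slave = addr >> 30           # the two most significant bits select the slave
    offset = addr & 0x3FFFFFFF   # remaining 30 bits: 2**30 bytes per slave
    return slave, offset


def slave_name(addr):
    """Look up the selected slave, or report an unmapped segment."""
    return SLAVES.get(bus_decode(addr)[0], "unmapped")
```

Using more MSBs for the segment would raise the number of slaves and shrink the offset mask accordingly, for example three bits for eight slaves of 2^29 bytes each.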
3.2.2 ROM and RAM

This RISC-V processor has a single byte-addressable address space of 2^32
bytes for all memory addresses. A word of memory is defined as 32 bits (4
bytes). A half word is 16 bits (2 bytes), and a double word is 64 bits (8
bytes). The memory address space is circular, so that the byte at address
2^32 - 1 is adjacent to the byte at address zero. Different address ranges
of the address space may:
1. contain main memory,
2. contain different types of I/O devices,
3. contain nothing at all.
ROM is a program storage module that is used for storing the instruction
list, while RAM is a data storage module that is used for saving data and
communicates only with the Bus. The information about their ports is shown
in Figures 3.6 and 3.7. The inputs of these modules are addr_i and data_i,
which are both 32 bits wide, as well as the write enable signal, reset and
clock. The output is data_o, which is 32 bits wide. Both designs work on the
same principle. Writes are clocked on the rising edge of the clock and
controlled by the write enable signal. Reads are combinational, which means
that they are not affected by the clock. When the reset bit is low, they
immediately output the value stored at the current address.
Figure 3.6: RAM
Figure 3.7: ROM
3.2.3 GPIO
Figure 3.8: GPIO Module

GPIO is a simple I/O port module in this RISC-V processor, mainly used for
light debugging. Both the data and the address of the GPIO are 32 bits wide.
According to the structure of the RISC-V SoC, there are two GPIO modules in
total. The data and address are passed to the GPIO module through the Bus,
and the data in the GPIO is also transmitted to the RISC-V core through the
Bus. The information about its ports is shown in Figure 3.8.
The GPIO module is similar to the ROM and RAM. It has built-in control
registers and data registers. In the control register, every two bits
control the type of one I/O port, so each module can control sixteen I/O
ports. The modes of these I/O ports are input, output, and high-impedance
state. Writes are clocked on the positive edge of the clock and controlled
by the reset bit and the write enable signal. There are two cases. When the
write enable signal is high, the lower four bits of the input address
determine whether the input data is stored in the data register or in the
control register. When the write enable signal is low, io_pin feeds data
into the built-in data register, and the built-in control register
determines which bits of io_pin are passed to the built-in data register.
Reads are not affected by the clock or the write enable signal, but are
affected by the reset signal. The lower four bits of the input address
determine whether data_o comes from the control register or the data
register.
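The two-bits-per-port layout of the control register can be illustrated with a pair of helper functions. This Python sketch is hypothetical throughout: the names pin_mode and set_pin_mode are invented for illustration, and the mode encoding in MODES is an assumption, since the text does not state which bit pattern selects which mode.

```python
# Assumed encoding (not given in the text): 00 high impedance, 01 output, 10 input.
MODES = {0b00: "high-z", 0b01: "output", 0b10: "input"}


def pin_mode(ctrl, pin):
    """Read the mode of one of the sixteen I/O ports from the control word."""
    if not 0 <= pin < 16:
        raise ValueError("pin must be in the range 0..15")
    return MODES.get((ctrl >> (2 * pin)) & 0b11, "reserved")


def set_pin_mode(ctrl, pin, code):
    """Return a new 32-bit control word with the two mode bits of one pin replaced."""
    if not 0 <= pin < 16:
        raise ValueError("pin must be in the range 0..15")
    shift = 2 * pin
    cleared = ctrl & ~(0b11 << shift)
    return (cleared | ((code & 0b11) << shift)) & 0xFFFFFFFF
```

Sixteen ports at two bits each fill the 32-bit control register exactly, which is why one module controls sixteen I/O ports.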
Chapter 4
RISC-V Processor Implementation
This chapter discusses the implementation of this RISC-V processor. For the
RTL development, the code of many modules has been modified compared to the
previous project. All of the changes are discussed in section 4.1. In
addition, place & route is also discussed in this chapter.
4.1 RTL Development

In this section, important parts of the code of each module are explained.
Compared to the previous project, there are some changes in certain modules,
which are discussed in detail. Two new units, CSR_REG and INTu, are also
discussed.
4.1.1 PC_REG
Figure 4.1 shows the code of the PC_REG unit. As shown in line 15, the value
of pc_o is restored to its initial value by the active-low reset signal. The
reset value is `prcreset, which is set to 32'h0 in the define file. In line
17, if the jump flag is asserted, pc_o is set to the destination jump
address, and the processor then fetches the instruction from this address.
In line 19, the value of `pchold is 3'b001. If the hold flag is asserted,
the value of pc_o is held. This hold flag is also used by the IFu_IDu and
IDu_EXu modules. If only the PC register is paused, the IFu_IDu module and
the IDu_EXu module continue to work independently. If the IFu_IDu module is
suspended, the PC register is held at the same time. If the IDu_EXu module
is paused, the whole pipeline is paused. Because RISC-V instructions have a
fixed length of 32 bits, they must be aligned on 32-bit boundaries (i.e. at
memory locations divisible by 4) [15]. Therefore, if the processor is
working normally, pc_o is incremented by four.
Figure 4.1: PC_REG code
4.1.2 IFu_IDu & IDu_EXu
Figure 4.2: D flip-flop code
Figure 4.3: IFu_IDu code
For both pipeline units, one type of D flip-flop is used, which is shown in
figure 4.2. When the reset or the hold flag is asserted, the output is a
preset value. Otherwise, the output is equal to din, which means a delay of
one clock cycle to the next module.
Figure 4.3 shows the three output signals in the different cases. If the
active-low reset signal or the hold flag is asserted, the output instruction
is set to the NOP instruction and the corresponding address is set to all
zeros. Otherwise, the instruction and address are passed to the IDu module
on the next rising clock edge. In addition, `holdIF is set to 3'b010 and
`holdpc is set to 3'b001. Note that the hold signal in this module is
asserted if the value is larger than `holdIF. If the value of the hold
signal is `holdpc, the hold flag is not asserted.
Figure 4.4: IDu_EXu code
A part of the code of IDu_EXu is shown in figure 4.4. This module is similar
to IFu_IDu, that is, the same type of D flip-flop is used. When the reset or
hold flag is asserted, each output signal is set to a defined value.
Otherwise, the output is the input value, which is passed to the EXu module
in the next cycle. In this module, the hold flag is only asserted when the
value is equal to `holdid, which is 3'b100. In that case, the whole pipeline
is paused.
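The three hold levels can be summarised in a small model. This Python sketch is an interpretation, not the RTL: the function name stalled_units is hypothetical, and the comparison for IFu_IDu is assumed to be "greater than or equal to `holdIF", since otherwise the level `holdIF itself would not stall the unit it is named after. The constants reuse the values 3'b001, 3'b010 and 3'b100 given in the text.

```python
HOLD_NONE = 0b000
HOLD_PC = 0b001   # `holdpc: pause the PC register only
HOLD_IF = 0b010   # `holdIF: pause the PC register and IFu_IDu
HOLD_ID = 0b100   # `holdid: pause the whole pipeline


def stalled_units(hold):
    """Report which sequential units hold their state for a given hold code."""
    return {
        "PC_REG": hold >= HOLD_PC,
        "IFu_IDu": hold >= HOLD_IF,   # assumed reading of "larger than `holdIF"
        "IDu_EXu": hold == HOLD_ID,
    }
```

Under this reading, each higher hold level also stalls every unit earlier in the pipeline, which matches the statement that pausing IDu_EXu pauses the whole pipeline.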
As shown in figure 4.5, in the previous project there was a signal C2
connecting the EXu unit and the IDu unit, because some instructions, such as
jump or division instructions, need more than one cycle to execute. While
these instructions are being executed, no further instructions should be
fetched, and the NOP instruction should be passed to the EXu unit instead.
The EXu unit controlled for how many clock cycles C2 was asserted, and the
IDu unit passed either the instruction or a NOP to IDu_EXu depending on the
value of C2. In this project, the C2 signal is removed from the EXu unit and
the IDu unit. The hold flag is used to achieve the same function, and the
outputs of the IFu_IDu unit and the IDu_EXu unit are modified when the hold
flag is asserted. The benefit is that, without the C2 signal, the EXu unit
becomes purely combinational logic, which does not need a clock.
Figure 4.5: C2 internal signal connection
Figure 4.6 indicates how the pipeline units work. When a jump instruction
reaches the execution unit, the jump flag is asserted. In the next cycle, the
output instructions of IFu_IDu and IDu_EXu are both NOP instructions. That
means instruction 3 would not be passed to IDu_EXu. Two clock cycles later,
the instruction with the target address would reach the EXu unit. Therefore,
this processor may take three cycles to execute a jump instruction or a branch
instruction, if the jump flag is asserted.
4.1.3 Decode Unit

As mentioned before, there are six types of instruction formats. Each
instruction can be divided into several fields. For instance, the ADDI
instruction can be divided into five fields. The inst[31:20] is the immediate
number. The inst[19:15] is the address of source register 1 (rs1). The
inst[14:12] is function 3. The inst[11:7] is the address of the destination
register (rd) and the inst[6:0] is the opcode. The decoder recognizes the
ADDI instruction by function 3 and the opcode. After that, the required
information can be extracted, such as which general-purpose register is to
be read or written. Figure 4.7 shows part of the code for decoding I-type
instructions. In line 1, the case (opcode) statement determines which type
of instruction it is. The value of ‘insttypei (7’b0010011) identifies the
I-type register-immediate instructions. Then, the case (fun3) statement
determines exactly which instruction it is. Once the instruction is
determined, the necessary data is extracted and passed to the execution
unit. In line 5, the write enable signal is set to 1, which means a register
needs to be written. In line 6, the address of the written register is set
to rd. In lines 7 and 8, one read address is set to rs1 and the other is set
to zero, because the immediate number is not read from the general-purpose
registers. In lines 9 and 10, the two operands are set. Some instructions
require more signals, such as the LW instruction, which loads a word from
RAM. It needs to send a request to the bus and get the address of the memory
first.
Figure 4.6: sample waveform
Figure 4.7: IDu code
According to Figure 2.3, the instruction format of Shift Right Logical (SRL)
is similar to that of Shift Right Arithmetic (SRA). Therefore, in the decode
unit of this project, these two instructions are both decoded as SR. In the
execution unit, an if statement in the code that executes SR checks bit 30
of the instruction. If bit 30 is zero, SR performs a logical right shift.
Otherwise, it performs an arithmetic right shift. In the same way, SRLI and
SRAI are both decoded as SRI. Therefore, there are just 38 instructions in
the define.sv file.
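The field extraction and the merged SR operation described above can be
illustrated with a short Python sketch. It is a behavioural model rather than
the thesis code: the function names are hypothetical, and sign extension of
the immediate is included because the RV32I specification defines the I-type
immediate as a signed 12-bit value.

```python
def decode_i_type(inst):
    """Split a 32-bit I-type instruction into its five fields."""
    imm = (inst >> 20) & 0xFFF          # inst[31:20]
    if imm & 0x800:                     # sign-extend the 12-bit immediate
        imm -= 0x1000
    return {
        "imm": imm,
        "rs1": (inst >> 15) & 0x1F,     # inst[19:15]
        "funct3": (inst >> 12) & 0x7,   # inst[14:12]
        "rd": (inst >> 7) & 0x1F,       # inst[11:7]
        "opcode": inst & 0x7F,          # inst[6:0]
    }


def shift_right(value, shamt, inst):
    """Merged SR operation: bit 30 of the instruction selects SRL or SRA."""
    value &= 0xFFFFFFFF
    shamt &= 0x1F
    if ((inst >> 30) & 1) == 0:
        return value >> shamt                       # logical shift (SRL)
    signed = value - (1 << 32) if value & 0x80000000 else value
    return (signed >> shamt) & 0xFFFFFFFF           # arithmetic shift (SRA)
```

For example, decode_i_type(0x00040513) gives opcode 7’b0010011, function 3
equal to zero, rd = 10, rs1 = 8 and an immediate of 0, that is, an ADDI that
copies x8 into x10.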
4.1.4 Execution Unit
Figure 4.8: EXu code
Figure 4.8 presents part of the code of the EXu unit, which handles the ADDI
instruction. The ADDI instruction adds an immediate number to the value of
the source register rs1 and then writes the result to the destination
register rd. The code in lines 1 and 3 determines which instruction it is.
Because executing the ADDI instruction does not need to access the main
memory, the memory signals are all set to zero. In addition, no jump or
hold is involved in this instruction, so the jump flag and hold flag are
both set to zero too. The data written to the destination register
(reg_wdata) is equal to operand 1 plus operand 2.
Some instructions do not need to write a register but need to jump to a
target address, for example, the BGE instruction. When the value of source
register rs1 is greater than or equal to the value of source register rs2,
it jumps to a new address that is equal to the current instruction address
plus an immediate number. In this case, reg_wdata is set to zero, and the
data read from the two source registers are compared with each other. If the
condition is met, the jump flag is asserted and the jump address is set to
the target address. Otherwise, both are set to zero and the branch is not
taken.
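The two cases above can be sketched as a pair of pure functions, mirroring
the combinational nature of the EXu unit. This is a Python illustration with
hypothetical function names; the signed comparison for BGE follows the RV32I
specification, and the dictionary keys simply reuse the signal names from the
text.

```python
MASK32 = 0xFFFFFFFF


def to_signed(x):
    """Interpret a 32-bit value as a two's complement signed integer."""
    x &= MASK32
    return x - (1 << 32) if x & 0x80000000 else x


def exec_addi(op1, op2):
    """ADDI: no memory access, no jump, no hold; result is op1 + op2."""
    return {"reg_wdata": (op1 + op2) & MASK32,
            "jump_flag": 0, "jump_addr": 0, "hold_flag": 0}


def exec_bge(rs1_data, rs2_data, inst_addr, imm):
    """BGE: nothing is written back; jump if rs1 >= rs2 (signed)."""
    taken = to_signed(rs1_data) >= to_signed(rs2_data)
    return {"reg_wdata": 0,
            "jump_flag": int(taken),
            "jump_addr": (inst_addr + imm) & MASK32 if taken else 0}
```

Note that 0xFFFFFFFF is treated as -1, so exec_bge(0xFFFFFFFF, 1, ...) does
not take the branch.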
4.1.5 Control Unit

In the execution phase, when a jump happens, the jump flag and the target
address are sent to the CTRLu unit first. Then, the CTRLu unit sends the
hold flag to the PC_REG, IFu_IDu and IDu_EXu modules and also sends the
target address to the PC_REG module. When the next rising edge of the clock
arrives, the values of the local hold flags in the IFu_IDu and IDu_EXu units
are determined respectively. Once both local hold flags are asserted, there
are only NOP instructions in the whole pipeline, i.e., in the decoding stage
and the execution stage.
Figure 4.9: CTRLu code
As shown in Figure 4.9, the default value of the hold flag is set to zero, and
the jump_address and jump flag are passed to the PC_REG unit. The requests
from the different modules are processed according to priority. In line 7, if
the jump flag or the hold flag from the EXu unit or the hold flag from the
interrupt unit is asserted, the pipeline would be paused. In line 10, once the
hold flag from the bus is asserted, the PC_REG unit is paused, which means
the pc address is held. This design can improve the performance of the MCU.
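The priority handling described for Figure 4.9 can be modelled as follows.
This is a Python sketch under stated assumptions: the function and argument
names are hypothetical, and the use of an if/elif chain to express the
priority between the two requests is an assumption about how the figure is
structured.

```python
HOLD_NONE, HOLD_PC, HOLD_IF, HOLD_ID = 0b000, 0b001, 0b010, 0b100


def ctrl_unit(jump_flag, jump_addr, hold_ex, hold_int, hold_bus):
    """Combine the requests into one hold level; jump info goes to PC_REG."""
    hold = HOLD_NONE                       # default value
    if jump_flag or hold_ex or hold_int:
        hold = HOLD_ID                     # pause the whole pipeline
    elif hold_bus:
        hold = HOLD_PC                     # only hold the program counter
    return {"hold_flag": hold, "jump_flag": jump_flag, "jump_addr": jump_addr}
```

With this ordering a bus request alone only freezes PC_REG, while a jump, an
EXu hold or an interrupt hold overrides it and empties both pipeline
registers.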
4.1.6 Division Unit

In the previous project, a latch was placed in the division unit, which was
used to update next_minuend_temp according to the count value. The
next_minuend_temp should be updated before the next consecutive clock
cycle, because it is used to assign the value for the intermediate division
result. In this project, the code has been optimized and is cleaner. In
addition, two’s complement sometimes needs to be applied to the result. In
this project, this part is in the start state instead of the idle state.
Figure 4.10: DIVu code
As shown in Figure 4.10, there are four states in the division unit. The
initial state is STATE_IDLE. When a division instruction reaches the EXu
unit, the control signal start_i in the DIVu unit is asserted. In the next
cycle, the state is STATE_START and the count value is set to 32’h40000000.
After that, it enters STATE_CALC. As mentioned before, the calculation may
take 32 cycles. The last state is STATE_END, in which the result is output.
It should be noted that the jump flag and hold flag are also asserted when
the division instruction arrives. In addition, there are two extra cycles
with inserted NOPs. Therefore, a division takes one clock cycle for
STATE_IDLE, one for STATE_START, 32 for STATE_CALC, one for STATE_END, and
two for the NOPs, so it may take 37 cycles in total to execute one division
instruction.
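The cycle budget above, together with a generic shift-and-subtract division
that produces one quotient bit per calculation cycle, can be sketched in
Python. The restoring algorithm shown here is an assumption chosen for
illustration; the thesis does not state which algorithm the DIVu unit
implements, and the function name is hypothetical.

```python
# Cycle budget of one division instruction, as listed in the text.
DIV_CYCLES = {"STATE_IDLE": 1, "STATE_START": 1, "STATE_CALC": 32,
              "STATE_END": 1, "NOP": 2}


def divu_restoring(dividend, divisor):
    """Unsigned 32-bit restoring division, one quotient bit per iteration.

    Returns (quotient, remainder, iterations). The divisor must be non-zero.
    """
    if divisor == 0:
        raise ValueError("divisor must be non-zero in this sketch")
    quotient = 0
    remainder = 0
    iterations = 0
    for bit in range(31, -1, -1):
        # Shift in the next dividend bit, then try to subtract the divisor.
        remainder = (remainder << 1) | ((dividend >> bit) & 1)
        if remainder >= divisor:
            remainder -= divisor
            quotient |= 1 << bit
        iterations += 1
    return quotient, remainder, iterations


total_cycles = sum(DIV_CYCLES.values())
```

The 32 iterations of the loop correspond to the 32 cycles spent in
STATE_CALC, and the sum of the budget reproduces the 37 cycles quoted above.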
4.1.7 CSR_REG
The CSR_REG unit is similar to the REG_BANK unit. The only difference is
that the CSR_REG unit can be written and read by both the EXu unit and the
INTu unit, but the REG_BANK unit can only be written and read by the EXu
unit. Therefore, only the CSR_REG is discussed in this chapter.
Figure 4.11 shows how the EXu unit and INTu unit write to the CSR registers.
The write is synchronous with the clock. The waddr[11:0] determines which
CSR register is addressed. At the rising edge of the clock, the data is
written to the corresponding register. During the writing process, the
execution unit has a higher priority than the interrupt unit. Figure 4.12
indicates which CSR register should be read. The read operation is
asynchronous, that is, it does not depend on the clock. If the read address
is equal to the write address and a write operation is in progress, the
write data is returned directly.
Figure 4.11: Write CSR_REG

Figure 4.12: Read CSR_REG
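The write priority and the read bypass can be illustrated with a small Python
model. The class and method names are hypothetical, and modelling the
priority as "only the EXu write is performed when both units write in the
same cycle" is an assumption. The addresses used in the usage note (0x341
for mepc, 0x342 for mcause) are the standard machine-mode CSR addresses.

```python
class CsrFile:
    """Behavioural model of a CSR register file (illustrative only)."""

    def __init__(self):
        self.regs = {}  # 12-bit CSR address -> 32-bit value

    def clock_edge(self, ex_we, ex_waddr, ex_wdata,
                   int_we, int_waddr, int_wdata):
        """Synchronous write: the execution unit wins over the interrupt unit."""
        if ex_we:
            self.regs[ex_waddr & 0xFFF] = ex_wdata & 0xFFFFFFFF
        elif int_we:
            self.regs[int_waddr & 0xFFF] = int_wdata & 0xFFFFFFFF

    def read(self, raddr, we=False, waddr=0, wdata=0):
        """Asynchronous read with bypass of a write that is in progress."""
        if we and (raddr & 0xFFF) == (waddr & 0xFFF):
            return wdata & 0xFFFFFFFF
        return self.regs.get(raddr & 0xFFF, 0)
```

For example, reading 0x341 while a write of 0x9C to 0x341 is pending returns
0x9C immediately, before the clock edge has updated the stored value.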
4.1.8 Interrupt Unit

In the interrupt module, four interrupt states and five CSR write states are
defined, as shown in figure 4.13. The interrupt states determine which type
of interrupt has been generated. Once an interrupt occurs, the hold flag is
asserted and some of the CSR registers are written. The CSR write states
determine which CSR register is written and with what content.
Figure 4.13: INTu states
When the int_state and csr_state are both idle, the hold flag is not
asserted. If either state is not idle, the hold flag is asserted. If there
is an ECALL or EBREAK instruction and no division operation is being
performed, the interrupt state is changed to INT_SYN_ASSERT. A division
cannot be interrupted by these two instructions. Once the interrupt state is
set to INT_SYN_ASSERT, the csr_state is set to CSR_MEPC. If the jump flag is
asserted, the interrupt return address is stored. The csr_mepc register is
written to store the previous instruction address. In the next clock cycle,
the csr_state is set to CSR_MSTATUS. The csr_mstatus register is written to
store the operating state of the hardware thread in machine mode. In the
next cycle, the global interrupt is not asserted, the csr_state is changed
to CSR_MCAUSE, and the corresponding register mcause is written to save the
interrupt cause. Finally, the csr_state is changed back to idle.
If the int_flag from the PC_REG unit and the global_int_en from the CSR_REG
unit are both asserted, it would be an asynchronous interrupt that is generated
by peripherals such as GPIO, timer, etc. Then, the interrupt state should
be set to INT_ASYN_ASSERT. Compared to the synchronous interrupt, the
asynchronous interrupt can interrupt the division. The division will continue
until the interrupt is processed. Once the interrupt state is set to INT_ASYN_ASSERT,
the csr_state would be set to CSR_MEPC. Subsequent operations are the same
as synchronous interrupts.
If there is an MRET instruction, which is used to return after handling a
trap, the interrupt state is set to INT_MRET. Then the csr_state is changed
to CSR_MSTATUS_MRET and the mstatus register is written. In the next cycle,
the csr_state returns to the idle state.
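The state sequences described in this subsection can be summarized in a
Python sketch. The names INT_IDLE and CSR_IDLE, the function names, and the
order in which the three conditions are checked are assumptions; only the
state names INT_SYN_ASSERT, INT_ASYN_ASSERT, INT_MRET, CSR_MEPC, CSR_MSTATUS,
CSR_MCAUSE and CSR_MSTATUS_MRET come from the text.

```python
# CSR write states visited, one per clock cycle, for each interrupt state.
CSR_SEQUENCE = {
    "INT_SYN_ASSERT": ["CSR_MEPC", "CSR_MSTATUS", "CSR_MCAUSE"],
    "INT_ASYN_ASSERT": ["CSR_MEPC", "CSR_MSTATUS", "CSR_MCAUSE"],
    "INT_MRET": ["CSR_MSTATUS_MRET"],
}


def select_int_state(ecall_or_ebreak, div_busy, int_flag, global_int_en, mret):
    """Pick the interrupt state from the conditions given in the text."""
    if ecall_or_ebreak and not div_busy:
        return "INT_SYN_ASSERT"      # synchronous: never during a division
    if int_flag and global_int_en:
        return "INT_ASYN_ASSERT"     # asynchronous: from a peripheral
    if mret:
        return "INT_MRET"
    return "INT_IDLE"


def csr_trace(int_state):
    """Sequence of csr_state values, ending back in the idle state."""
    return CSR_SEQUENCE.get(int_state, []) + ["CSR_IDLE"]


def hold_flag(int_state, csr_state):
    """The hold flag is asserted unless both state machines are idle."""
    return not (int_state == "INT_IDLE" and csr_state == "CSR_IDLE")
```

A synchronous or asynchronous interrupt therefore spends three cycles writing
CSR registers, while an MRET spends one.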
4.1.9 Bus
In the bus module, one of the master devices is selected to access the
corresponding slave device. In this RISC-V processor, the only master device
is the PC_REG module, which is the default master device. If there were more
than one master device, the bus module would apply fixed priority
arbitration to select among them. If another master device with higher
priority were selected, the hold flag would be asserted, and the pipeline
would be paused. The limit of four slave devices was a design decision made
by our group during the project course. Figure 4.14 shows how a slave device
is selected. The two most significant bits of the address from the master
device are used as the slave select signal. That means there are up to four
slave devices. However, in this project, there are only two slave devices,
RAM and GPIO. If there are more than four slave devices, the slave select
signal could be extended to four bits, which can support sixteen slave
devices.
Figure 4.14: Bus code
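The address-based slave selection can be sketched in Python as follows. The
function names are hypothetical, and the select_bits parameter is added only
to illustrate the four-bit extension mentioned above. Which select value
belongs to RAM and which to GPIO is not stated in the text, so no mapping is
assumed here.

```python
def slave_select(addr, select_bits=2):
    """Slave index taken from the most significant bits of a 32-bit address."""
    return (addr & 0xFFFFFFFF) >> (32 - select_bits)


def slave_offset(addr, select_bits=2):
    """Remaining low-order address bits passed on to the selected slave."""
    return addr & ((1 << (32 - select_bits)) - 1)
```

With two select bits each slave owns a 2^30-byte address window, and with
four select bits the window shrinks to 2^28 bytes while the number of slaves
grows to sixteen.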
4.1.10 ROM and RAM

As shown in figure 4.15, the size of the RAM is set to 4096. In fact, the
maximum size is 2^30. If the signal rst_n is asserted, all elements are set
to zeroword. If there is a store instruction, the RAM is written in the next
clock cycle. The addr_i, whose length is 32 bits, is equal to the value of
source register 1 (rs1) plus the offset, and the data_i is equal to the
value of source register 2 (rs2). The address of the RAM is the upper 30
bits of addr_i, which means that the destination address in RAM is a quarter
of the input address from the bus. For example, if the input address from
the bus is 3’b100, the destination address in RAM is 3’b001. If there is a
load instruction, the addr_i is also equal to the value of source register 1
(rs1) plus the offset, and the data at the corresponding address is loaded
into the destination register (rd). The ROM module is similar to the RAM
module. Since the ROM is read-only memory, once the instructions are loaded
into ROM, the data at the corresponding addresses cannot be reset or
overwritten.
Figure 4.15: RAM code
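The byte-address to word-address conversion can be made concrete with a
small Python model of a word-addressed memory. The class and method names
are hypothetical, and the model ignores clocking: a store takes effect
immediately rather than at the next clock edge.

```python
class Ram:
    """Word-addressed RAM model; each element holds one 32-bit word."""

    def __init__(self, depth=4096):
        self.mem = [0] * depth

    def reset(self):
        """Set every element to zero, as when rst_n is asserted."""
        self.mem = [0] * len(self.mem)

    @staticmethod
    def word_index(addr_i):
        """Drop the two low bits: the RAM index is addr_i[31:2]."""
        return (addr_i & 0xFFFFFFFF) >> 2

    def store(self, rs1, offset, rs2):
        """Store word: address is rs1 + offset, data is rs2."""
        self.mem[self.word_index(rs1 + offset)] = rs2 & 0xFFFFFFFF

    def load(self, rs1, offset):
        """Load word: address is rs1 + offset, result goes to rd."""
        return self.mem[self.word_index(rs1 + offset)]
```

As in the example from the text, Ram.word_index(0b100) evaluates to 0b001.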
4.1.11 GPIO
In the GPIO, there is a 32-bit control signal and a 32-bit data signal. Each
I/O is controlled by two bits, which select one of three modes: input,
output, or high impedance. Hence, at most 16 I/Os can be controlled by this
module. As shown in figure 4.16, the GPIO can be written. When the write
enable signal is asserted and addr_i[3:0] is 4’b0000, the input data is
written to the control register. If addr_i[3:0] is 4’b0100, the input data
is written to the data register. When the write enable signal is not
asserted and the two control bits of an I/O are 10, the input value io_pin_i
is written to the corresponding I/O. The reading process is similar to the
writing process.
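A behavioural sketch of the two-bit-per-pin control scheme is given below in
Python. Only the encoding 10 for input mode comes from the text; the
encodings chosen for the output and high impedance modes, the mapping of pin
n to data bit n, and all names are assumptions made for illustration.

```python
MODE_HIGH_Z = 0b00   # assumption
MODE_OUTPUT = 0b01   # assumption
MODE_INPUT = 0b10    # from the text: control bits 10 select input


def pin_mode(ctrl, pin):
    """Two control bits per I/O, so a 32-bit register covers 16 pins."""
    return (ctrl >> (2 * pin)) & 0b11


class Gpio:
    def __init__(self):
        self.ctrl = 0
        self.data = 0

    def write(self, we, addr_i, data_i):
        """Register write selected by addr_i[3:0]."""
        if not we:
            return
        if (addr_i & 0xF) == 0b0000:
            self.ctrl = data_i & 0xFFFFFFFF
        elif (addr_i & 0xF) == 0b0100:
            self.data = data_i & 0xFFFFFFFF

    def sample_inputs(self, io_pin_i):
        """With write enable low, copy input pins into their data bits."""
        for pin in range(16):
            if pin_mode(self.ctrl, pin) == MODE_INPUT:
                bit = (io_pin_i >> pin) & 1
                self.data = (self.data & ~(1 << pin)) | (bit << pin)
```

Pins that are not configured as inputs keep their data bits unchanged when
the inputs are sampled.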
4.2 Place & Route

Figure 4.16: GPIO code
In this section, the process of physical design is discussed. As shown in
the list below, there are nine steps in total, which are the same as in the
user manual KISTA 1um CMOS SOI Cadence Digital Flow. Here is a brief
explanation; please refer to the user manual for more detail.
1. Load the design.
2. Floorplan.
3. Set power rings.
4. Routing power and ground nets.
5. Pin placement.
6. Standard cell placement & pre-Clock Tree Synthesis (CTS) optimization.
7. Clock Tree Synthesis (CTS).
8. Route with nanoroute.
9. Placement of filler cells.

Figure 4.17: Floorplan configuration
The first step is to load the design. Once Innovus starts, the TCL command
setDesignMode -process 250 is used to set the process technology to 250 nm.
Then the Verilog netlist from Genus is imported. Furthermore, two .lef
files and an MMMC View Definition file are necessary. After that, the global
nets are connected: the power net is set to VDD and the ground net is set to
VSS.
The second step is the floorplan. Several parameters can be modified
according to the requirements. The ratio (H/W) is set to 1, the core
utilization is set to 0.6, and the core margins on all sides are set to
32.0. The final configuration is shown in figure 4.17. The core utilization
increased to 0.7 after optimization, which is quite high.
The third step is to set the power rings. A generous margin can avoid
electromigration problems. As shown in figure 4.18, the width, spacing and
offset are set to 12, 2 and 4, respectively. They all depend on the
parameters of the PDK in use. In addition, the width also depends on the
power consumption, so the power report from synthesis can be used to
calculate the width. The fourth step is to route the power and ground nets,
which connects VDD and VSS to the power rings.
The fifth step is pin placement. The side of the block on which each pin is
placed can be selected. With automatic placement, the pins are placed on
each partition automatically and can then be adjusted manually as required.
The sixth step is standard cell placement and pre-Clock Tree Synthesis (CTS)
optimization. The placement should be timing-driven, with clock gating
awareness enabled. In addition, the TIEHI and TIELO cells should be placed.
After that, the timing is reported to check the pre-CTS result. If there are
violations, timing optimization should be run.
The seventh step is the CTS. Before creating the clock tree, a Clock
Concurrent Optimization (CCOPT) specification file should be created, in
which the inverter and buffer cells defined in the SoI standard library are
set up. A timing report should be generated after the clock tree is created.
If there are setup violations, the command optDesign -postCTS is run to
optimize the design.
The eighth step is to route with nanoroute. After that, the design should be
checked for connectivity violations. The final step is the placement of
filler cells. The places that are not used by standard cells can be filled
with decoupling capacitors and empty fillers. Finally, a connectivity check
should be done and the design exported.
Figure 4.18: Power Rings configuration

Chapter 5
Results and Analysis
In this chapter, all the necessary results are presented and analysed. In
section 5.1, the verification is discussed briefly. Because most of the
content is similar to the previous project, only the simulation part is
emphasized. The second section records the area estimation and the timing
report for both synthesis runs. The last section illustrates the layout and
area report after physical design.
5.1 Verification and Simulation

In this section, the verification and simulation of the processor are
explained. Compared to the previous project, the simulation part is newly
added. Hence, the simulation part is discussed in detail.
5.1.1 Verification
The verification strategy for each module is the same as in the previous
project. Hence, the individual verification of each module is not discussed
here. This subsection focuses on the verification of the processor core and
the top-level design.
RISC-V core verification strategy

For the RISC-V core verification, the aim of the verification environment is
to test the whole data flow from the input to the output of the RISC-V core.
The functionality of the core together with the bus or other peripherals
such as ROM, RAM and GPIO is not verified by this environment. The PC_REG
unit needs to be modified so that it increments in steps of 1 rather than
steps of 4 because of the temporary ROM in place. This is because the
temporary ROM is an array
whose width is 32 bits.
Since there is no memory that stores instructions and passes them to the
IFu_IDu module, a temporary ROM is used in the verification. In the
testbench, the rst_n signal is asserted for 10 ns. After de-asserting rst_n,
the bus_inst_i input of the IFu_IDu unit is probed to check whether the
instructions are fetched according to the increments of the program counter.
A monitor is used to know which instruction is being executed in which clock
cycle, and the value of each general-purpose register is checked to
determine whether the instruction is executed correctly. In this
verification, the instructions at the input port (bus_inst_i) cannot be
randomized. This is because, when there is a branch instruction, the core
needs to fetch instructions according to the corresponding change in the
program counter, but randomizing instructions would hide this effect.
RISC-V top verification strategy

The top-level verification is used to verify the data flow of the RISC-V
design, which includes the core integrated with the bus, the ROM, the RAM
and the GPIO. The verification plan is divided into two types of test: a
directed test and a randomized test.
For the directed test, the clock port is modified with a multiplexer that
selects between the system clock and the test clock. The test clock is used
to load the instructions into the ROM. After the instructions are loaded,
the system clock is turned on for the core. A multiplexer should be added to
every input of the ROM module to select between the INITIAL_TEST PHASE and
the POST_INITIAL_TEST PHASE. The former is used to load all instructions
into the ROM module, and the latter is where the core takes over the control
signals of the ROM.
The randomized test covers a wider range of instructions, both in terms of
registers and immediate values and in terms of the types of instructions.
The Design Under Test (DUT) clock port and the ROM ports need to be modified
in the same way as for the directed test. A separate class for every type of
instruction should be defined first. Each class of instruction has its own
constraints to ensure the instructions are structured according to the ISA.
However, in order to keep the test easier to control, the branch (B-type)
and the jump (J-type) instructions are not randomized. The other
instructions are generated through constrained randomization and then loaded
into the ROM in order. An assertion is used to check whether randomization
passed for each class. In addition, there is a monitor class to collect
coverage for the source register 1 (rs1), source register 2 (rs2),
destination register (rd), immediate values, and the ROM address written
into.
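The idea of one randomizable class per instruction type can be illustrated
outside SystemVerilog with a short Python sketch. It is not the thesis
testbench: the class and function names are hypothetical, only two
instruction types are shown, and the funct3/funct7 values are the standard
RV32I encodings.

```python
import random

R_FUNCTS = {  # name: (funct3, funct7), standard RV32I encodings
    "ADD": (0b000, 0b0000000),
    "SUB": (0b000, 0b0100000),
    "XOR": (0b100, 0b0000000),
    "OR": (0b110, 0b0000000),
    "AND": (0b111, 0b0000000),
}


class RTypeItem:
    """Randomizable R-type instruction; fields constrained to legal ranges."""
    OPCODE = 0b0110011

    def randomize(self, rng):
        self.name = rng.choice(sorted(R_FUNCTS))
        self.rd = rng.randrange(32)
        self.rs1 = rng.randrange(32)
        self.rs2 = rng.randrange(32)
        return self

    def encode(self):
        funct3, funct7 = R_FUNCTS[self.name]
        return (funct7 << 25 | self.rs2 << 20 | self.rs1 << 15
                | funct3 << 12 | self.rd << 7 | self.OPCODE)


class AddiItem:
    """Randomizable ADDI instruction with a signed 12-bit immediate."""
    OPCODE = 0b0010011

    def randomize(self, rng):
        self.rd = rng.randrange(32)
        self.rs1 = rng.randrange(32)
        self.imm = rng.randint(-2048, 2047)
        return self

    def encode(self):
        return ((self.imm & 0xFFF) << 20 | self.rs1 << 15
                | self.rd << 7 | self.OPCODE)


def build_rom_image(count, seed=0):
    """Generate `count` random instruction words, in ROM loading order."""
    rng = random.Random(seed)
    return [rng.choice((RTypeItem, AddiItem))().randomize(rng).encode()
            for _ in range(count)]
```

Branch and jump classes are deliberately left out, in line with the decision
above not to randomize B-type and J-type instructions.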
Randomized test result

As mentioned before, in order to keep track of all necessary objectives,
the cover groups are defined for the source register 1 and 2 (rs1, rs2), the
destination register (rd), the ROM addresses, and the immediate values.
Table 5.1: Hit rate of the bins defined in the cover group
Covergroup                    Metric   Goal
Coverpoint cg::RS1            68.5%    100
Coverpoint cg::RS2            52.2%    100
Coverpoint cg::RD             70.6%    100
Coverpoint cg::ROM_ADDR       50%      100
/risc_top/monitor_risc/cg     60.1%    100
Table 5.1 shows the hit rate of the bins defined in the cover groups. The
hit rate of the source and destination register bins was expected to be at
least 50%, because the constrained randomization used only subsets of the
registers when generating a few instruction types, such as the R-type and
the S-type. The ROM was about 4K bytes, and it was not possible to exercise
all the locations for sign-off. Only half of the memory was used for storing
instructions, which is consistent with the 50% hit rate of ROM_ADDR.
5.1.2 Simulation
In this subsection, the simulation work is discussed. The main goal is to
compare the designed processor with a RISC-V processor produced by the
virtual platform tool Imperas OVP to check if the designed processor works
correctly.
Figure 5.1: Block diagram of Simulation
Extract necessary information

OVPSim is a simulator that can simulate a RISC-V platform. As shown in
Figure 5.1, the C code is compiled by a cross-compiler called
riscv-none-embed-gcc. The output is an .elf file, which is loaded and then
executed by the provided RISC-V processor model in the virtual RISC-V
platform. The tool riscv-none-embed-objdump can extract all the necessary
instructions with their addresses, as well as the initial values of some
registers, from the original .elf file. In this project, four different C
programs are run on the designed processor in Verilog simulation. The C code
of each program is shown in Appendix B. Two of them output three or five
numbers of the Fibonacci sequence. The speed program contains several
integer computation instructions, which are used to test the performance of
integer computation and of the registers. The matrix program performs matrix
multiplication. The compiler therefore generates four .elf files in total.
Figure 5.2 shows part of the information extracted from one of the .elf
files. Different instructions are responsible for different parts of the
program, and the addresses of all instructions are listed clearly.
Figure 5.2: Information from .elf file
Load instructions and initial value

After extracting the necessary information, the next step is to load the
instructions to the appropriate addresses of the ROM unit and store the
initial values at the target addresses of the RAM unit. This is similar to
the directed test mentioned before. The difference is that the address of
the first instruction in the ROM is not 00000000 but 00000074. In addition,
several initial values should be loaded into the RAM. For OVPSim, the
general-purpose register x2 (sp) is set to fffff000. It is a base address,
to which an offset is usually added to obtain an address in RAM. However, if
the address range of the RAM unit is the same as the one the compiled code
assumes (0 to ffffffff), ModelSim reports an error that the range is too
large. Therefore, the size of the RAM unit is limited to 1500000, and the
general-purpose register x2 (sp) is set to 000ffff0 in the designed
processor.
Check the functionality

For OVPSim, a .log file is obtained after a simulation. It lists the
instructions executed by the processor in order. Some instructions change
the value of a specific general-purpose register, and this information is
also included. Furthermore, some basic information such as the estimated
time and the number of executed instructions is given at the end. Figure 5.3
shows part of the content of the .log file obtained after executing the
Fibonacci_3 program.
Figure 5.3: Information from .log file
In order to ensure that the functionality of the three-stage pipeline
processor is the same as that of other RISC-V processors, the designed
processor should execute the instructions in the same order as in the .log
file. The correctness can be verified by the waveform in ModelSim. The
instruction execution order and the changes in the general-purpose registers
can be compared with the information in the .log file. If they are the same,
the simulation is correct. For instance, the Fibonacci program is simulated
by the designed processor. In figure 5.4, the input instruction address is
0000009c, and the input instruction is 00040513. It corresponds to the first
line of the .log file in figure 5.3. According to the .log file, this
instruction changes the value in the general-purpose register x10 (a0) from
1620 to 0000. As shown in the simulation waveform, the value in the
general-purpose register x10 also changes from 1620 to 0000 in the next
clock cycle. That means the simulation result is the same as the description
in the .log file. In the simulation part, there are four different
simulations. The number of cycles required for each simulation and the
number of instructions executed are both recorded. They are analyzed in the
following part.
Figure 5.4: Simulation waveform
Simulation results
All the simulations passed. The functionality of the designed processor is
satisfactory. In theory, the ideal CPI is equal to 1. However, some
instructions need more than one clock cycle to be executed. For instance,
all the jump instructions such as j, jal and jalr, as well as the RET
instruction, need three clock cycles to be executed. For the branch
instructions, if the jump flag is asserted, these instructions also need
three clock cycles to be executed. Conversely, if the jump flag is not
asserted, inst_addr_i is incremented by 4 and the EXu unit receives the next
instruction. In addition, all the division instructions need 37 clock cycles
to be executed.
Table 5.2: Simulation result of each program
Program name   Number of instructions   Number of cycles   CPI
Fib_3.c        367                      483                1.316
Fib_5.c        675                      879                1.302
Speed.c        430                      692                1.609
Matrix.c       1157                     1843               1.593
Table 5.2 shows all the significant data of every simulation. The CPI of the
last two programs is clearly larger than that of the other two. This is
because the Fibonacci programs only use integer-based instructions, most of
which take one cycle to complete. For example, in the Fib_3 program, among
the 367 instructions, there are 23 jump instructions, 15 RET instructions
and 23 branch instructions. Three of the branch instructions did not assert
the jump flag, so 23 + 15 + 20 = 58 instructions each cost two extra cycles.
Therefore, theoretically, the number of cycles is equal to 367 + 58 ∗ 2,
which is 483 in total. This was confirmed by measurements taken during the
execution of the programs.
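The cycle count for Fib_3 and the CPI column of Table 5.2 can be reproduced
with a few lines of Python. The function name and the two-cycle flush
penalty parameter are illustrative; the penalty follows from the three
cycles quoted above for a taken jump or branch.

```python
def predicted_cycles(instructions, taken_transfers, flush_penalty=2):
    """One cycle per instruction plus the penalty for each taken transfer."""
    return instructions + flush_penalty * taken_transfers


# Fib_3: 23 jumps, 15 RETs, and 23 branches of which 3 are not taken.
fib3_taken = 23 + 15 + (23 - 3)
fib3_cycles = predicted_cycles(367, fib3_taken)

# (instructions, cycles) from Table 5.2.
results = {
    "Fib_3.c": (367, 483),
    "Fib_5.c": (675, 879),
    "Speed.c": (430, 692),
    "Matrix.c": (1157, 1843),
}
cpi = {name: round(cycles / insts, 3)
       for name, (insts, cycles) in results.items()}
```

Here fib3_taken is 58 and fib3_cycles is 483, matching the measured value,
and the cpi dictionary reproduces the last column of Table 5.2.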
In the speed and matrix programs, there are many division and remainder
instructions, which cause the CPI to be larger than for the first two
programs. There are several ways to improve the CPI. The first is to use a
pipeline with more than three stages. The second is to optimize the division
unit, which means using a better algorithm for the computation. The third is
to schedule instructions in the empty slots so that the core does something
useful while it is waiting for the division to finish.
5.2 Digital Synthesis

In this section, the synthesis part is discussed. The ROM and RAM are too
large for the resources allocated, which may cause insufficient permissions
for the synthesis tool, and the top-level synthesis needs too much time.
Hence, only the processor core is synthesized using Genus in this project.
5.2.1 SOI_STDLIB
The SoI standard library is used as the target library in the synthesis. The
main goal is to get the area estimates, check for timing violations, and
obtain the Verilog netlist and constraints. Some of this information is used
for physical design. Figure 5.5 shows how the liberty timing files and the
SystemVerilog code are loaded. After that, constraints should be created for
the clock, inputs, outputs, loads, etc. The constraints are shown in figure
5.6. The period is set to 100 nanoseconds. Finally, the script performs a
generic synthesis, maps the design to the available cells in the library,
and then runs optimization. The Verilog netlist and constraints are written
out, and the area and timing reports are also stored.
5.2.2 Lsi_10k
In the previous project, the Lsi_10K library, which comes with Synopsys DC,
was used as the target library. Therefore, synthesis with the Lsi_10K
library is necessary, as it can be used for comparison. The script for
running synthesis with Lsi_10K is quite similar to the one above. In the
load sources part, only the path of the library and the library name should
be modified. Moreover, the operating condition is set to nominal. For the
constraints part, there are no block constraints for the Lsi_10K library.
The other settings are the same as for the SoI standard library, as shown in
figure 5.7.
Figure 5.5: Load sources
Figure 5.6: Constraints used with SoI_LIB
5.2.3 Digital Synthesis result

Synthesis with the Lsi_10K library produces one timing report and one area
report, which are summarized in table 5.3. The cell count after synthesis is
13292 and the total area is 36077, but the area unit is unknown. The table
also shows the area report of the previous project. The new core clearly has
a larger cell count and area, because two additional modules, CSR_REG and
INTu, have been added. The design also has a positive slack of about 4.3 ns.
Table 5.3 also shows the area and timing reports for the SoI_STD library.
The cell count is 23877. In this library, the area unit is defined as µm².
Therefore, the total area is around 25.177 mm². It is less than 49 mm²
(7 mm x 7 mm), which is the largest area that our in-house SoI technology
allows. The design also has a positive slack of about 0.5 ns.
Figure 5.7: Constraints used with Lsi_10K library
Table 5.3: Synthesis result of each library
Library name    Cell count   Cell area       Slack time
SoI_Lib         23877        25176960 µm²    548 ps
Lsi_10K (new)   13292        36077           4327 ps
Lsi_10K (pre)   9154         24688           8654 ps
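The derived figures for the SoI library can be checked with a short Python
calculation. It assumes that the critical path equals the 100 ns clock
period minus the reported slack, which ignores clock uncertainty and any
input or output delays in the constraints; the function names are
illustrative.

```python
def critical_path_ns(period_ns, slack_ps):
    """Longest path delay implied by the clock period and the setup slack."""
    return period_ns - slack_ps / 1000.0


def max_frequency_mhz(path_ns):
    """Highest clock frequency for which the path still meets timing."""
    return 1000.0 / path_ns


def um2_to_mm2(area_um2):
    """1 mm² equals 10^6 µm²."""
    return area_um2 / 1e6


soi_path = critical_path_ns(100.0, 548)   # about 99.45 ns
soi_fmax = max_frequency_mhz(soi_path)    # about 10.06 MHz
soi_area = um2_to_mm2(25176960)           # about 25.177 mm²
```

Under this assumption the core fits the 49 mm² limit with room to spare and
can be clocked at roughly 10 MHz.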
5.3 Place & Route

The final layout is shown in figure 5.8. The final core utilization is quite
high. The I/O pins are allocated on every side. A more detailed and clearer
picture is available in Innovus.
Figure 5.9 shows the area report of the final layout. The total area is
around 28.31 mm², which is still less than 49 mm². It is larger than the
synthesized area because the optimization requires the insertion of
additional cells such as buffers.
Figure 5.8: Final routed layout
Figure 5.9: Area report of final routed layout

Chapter 6
Conclusions and Future work
This chapter summarizes this project. In addition, the future work is also
discussed in the second section.
6.1 Conclusions
The RISC-V SoC design is a 32-bit core with a three-stage pipeline, which
supports the RV32I base instruction set with the "M" standard extension. Two
pipeline units have been modified, and the pipeline works as expected. The
code of the division unit has been modified, and all division and remainder
instructions are supported. Furthermore, two new modules, CSR_REG and INTu,
have been added in this project. This processor has the ROM as a part of the
core, so ROM accesses do not have to go through the bus. This is a
non-standard arrangement, so it should be modified in the future. The core
can only support up to four slaves. In order to support more slaves, the bus
unit should also be modified.
After the verification and simulation, the functionality of this new design is
complete. It can run the machine code produced by the virtual platform. The
CPI is also collected. The RISC-V core, which is synthesized as an ASIC with
the SoI_STD library, has an area of around 25.177 mm2 and a slack of about
0.5ns. The critical data path is 99.5 ns, so it can run with at most 10Mhz clock
frequency. When the place and route is finished, an area estimation is gained.
It is around 28.31 mm2 . That means the new technique SoI CMOS can be
applied to build a RISC-V processor theoretically.
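The relation between slack, critical path, and maximum clock frequency used above can be sketched in C as follows. The sketch assumes a clock constraint of 100 ns, which is not stated in this chapter but is inferred from the 99.5 ns critical path plus the roughly 0.5 ns slack. The function names are illustrative.

```c
#include <assert.h>

/* Critical path delay: constrained clock period minus the positive slack. */
double critical_path_ns(double period_ns, double slack_ns)
{
    assert(period_ns > slack_ns);
    return period_ns - slack_ns;
}

/* Maximum clock frequency in MHz for a given critical path in ns. */
double fmax_mhz(double critical_ns)
{
    assert(critical_ns > 0.0);
    return 1000.0 / critical_ns;   /* 1/ns is GHz; times 1000 gives MHz */
}
```

With the 548 ps slack from the synthesis report, critical_path_ns(100.0, 0.548) gives about 99.45 ns, and fmax_mhz applied to that value gives just over 10 MHz, consistent with the figure quoted above.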
6.2 Future work

At the RTL level, more modules can be added to the RISC-V SoC in the
future, for example a JTAG debug module or peripheral components such
as a UART or a timer. This would make the processor more fully functional. In
the interrupt unit, more peripheral interrupts can be added. In addition, a
better algorithm can be applied to reduce the number of execution cycles.
Moreover, the number of pipeline stages can be increased, which would
improve the efficiency of the processor by allowing instructions to execute at
a higher clock frequency. Furthermore, the I/O pads and the RAM layout are
still missing and should be added later. The SoI_STD library is incomplete
and should be optimized in order to reduce the area.
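The trade-off behind a deeper pipeline can be illustrated with the usual relation: execution time = instruction count x CPI / clock frequency. A deeper pipeline only helps if the gain in clock frequency outweighs any increase in CPI caused by additional bubbles. The C sketch below expresses this. All numbers passed to it in the usage note are hypothetical and are not measurements from this thesis.

```c
#include <assert.h>

/* Execution time in microseconds: instructions * CPI / f, with f in MHz. */
double exec_time_us(unsigned long instructions, double cpi, double f_mhz)
{
    assert(cpi > 0.0 && f_mhz > 0.0);
    return (double)instructions * cpi / f_mhz;
}

/* Returns 1 if the deeper pipeline finishes the same program sooner. */
int deeper_pipeline_wins(unsigned long instructions,
                         double cpi_base, double f_base_mhz,
                         double cpi_deep, double f_deep_mhz)
{
    return exec_time_us(instructions, cpi_deep, f_deep_mhz)
         < exec_time_us(instructions, cpi_base, f_base_mhz);
}
```

For example, a hypothetical program of 1000 instructions at CPI 1.5 and 10 MHz takes 150 µs. A deeper design at CPI 1.8 and 15 MHz takes 120 µs and wins, whereas one at CPI 2.5 and 15 MHz does not.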
A jump, a branch, or a division instruction leaves some empty slots in the
pipeline, which causes pipeline bubbles. However, some instructions must
execute in order. For example, if a SUB instruction needs the result of the
preceding division instruction, it must wait until the division is complete.
One way to reduce these bubbles is to determine whether the instruction that
follows a division requires the division result. Each division instruction has
a destination register rd; if the following instruction does not access that
register while the division is in progress, it can be fetched.
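The register check described here can be sketched in C as a behavioural model (the actual core is written in SystemVerilog, and this is not the thesis implementation). The field positions follow the standard RISC-V encoding: rd in bits 11:7, rs1 in bits 19:15, rs2 in bits 24:20. As a simplifying assumption, the check is conservative: it compares these bit fields for every instruction format, so immediate bits may occasionally cause an unnecessary stall but never a missed dependency.

```c
#include <assert.h>
#include <stdint.h>

/* Register fields of a 32-bit RISC-V instruction word. */
static unsigned rd_of(uint32_t insn)  { return (insn >> 7)  & 0x1Fu; }
static unsigned rs1_of(uint32_t insn) { return (insn >> 15) & 0x1Fu; }
static unsigned rs2_of(uint32_t insn) { return (insn >> 20) & 0x1Fu; }

/*
 * Returns 1 if `next` must wait for the in-flight division that writes
 * register `div_rd`, and 0 if it can proceed.
 */
int must_stall_for_div(unsigned div_rd, uint32_t next)
{
    assert(div_rd < 32u);
    if (div_rd == 0u)   /* x0 is hard-wired to zero: nothing to wait for */
        return 0;
    return rs1_of(next) == div_rd
        || rs2_of(next) == div_rd
        || rd_of(next)  == div_rd;
}
```

For instance, with a division writing x10 in flight, a following `sub x5, x10, x6` must stall, while the same SUB can proceed if the division writes x7.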
For jump and branch instructions, a branch predictor can be added. The
core can then execute speculatively. If the prediction is wrong, it flushes
the pipeline and fetches the correct instructions.
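The thesis does not specify which kind of branch predictor to use. As one common option, the C sketch below models a table of 2-bit saturating counters indexed by the branch address. The table size, the names, and the initial counter value are assumptions made for illustration only.

```c
#include <assert.h>
#include <stdint.h>

#define BHT_ENTRIES 16u   /* hypothetical table size, must be a power of two */

/* Each counter: 0 or 1 predicts not taken, 2 or 3 predicts taken. */
typedef struct {
    uint8_t counter[BHT_ENTRIES];
} bht_t;

void bht_init(bht_t *t)
{
    for (unsigned i = 0; i < BHT_ENTRIES; i++)
        t->counter[i] = 1;   /* start at "weakly not taken" */
}

static unsigned bht_index(uint32_t pc)
{
    /* RV32IM instructions are 4-byte aligned, so drop the low two bits. */
    return (pc >> 2) & (BHT_ENTRIES - 1u);
}

/* Returns 1 if the branch at `pc` is predicted taken. */
int bht_predict(const bht_t *t, uint32_t pc)
{
    return t->counter[bht_index(pc)] >= 2;
}

/*
 * Trains the counter with the real outcome. Returns 1 if the earlier
 * prediction was wrong, i.e. the pipeline would have to be flushed.
 */
int bht_update(bht_t *t, uint32_t pc, int taken)
{
    uint8_t *c = &t->counter[bht_index(pc)];
    int mispredicted = ((*c >= 2) != (taken != 0));
    if (taken) {
        if (*c < 3) (*c)++;
    } else {
        if (*c > 0) (*c)--;
    }
    return mispredicted;
}
```

The two-bit hysteresis means a single deviating outcome, such as a loop exit, does not immediately flip the prediction for a branch that is otherwise taken.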
The bus unit supports at most four slaves. This is a severe restriction:
the number of slaves should be configurable, together with their
address spaces. One method is to add an ID signal to the bus unit or to
extend the input address of the bus unit to select slaves. Another method is
to adopt a standard bus design. The AXI4 protocol is recommended because it
is widely used and powerful.
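A configurable slave count with configurable address spaces amounts to a table-driven address decoder. The following C sketch is a behavioural model of that idea, not the bus unit of this design and not AXI4. The structure, the function name, and the address map in the usage note are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Address window of one slave: [base, base + size). */
typedef struct {
    uint32_t base;
    uint32_t size;
} slave_region_t;

/*
 * Returns the index of the slave whose window contains `addr`,
 * or -1 if the address is not mapped to any slave.
 */
int select_slave(const slave_region_t *map, size_t n_slaves, uint32_t addr)
{
    assert(map != NULL || n_slaves == 0);
    for (size_t i = 0; i < n_slaves; i++) {
        if (addr >= map[i].base && addr - map[i].base < map[i].size)
            return (int)i;
    }
    return -1;
}
```

Because the number of slaves and their windows are data rather than fixed logic, a map with five or more entries needs no change to the decoder itself. In hardware, the same table would typically become module parameters.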
References
[1] L. J. Caley, “High temperature CMOS silicon carbide asynchronous
circuit design,” Graduate Theses and Dissertations, 2015. [Online].
Available: https://scholarworks.uark.edu/etd/30/
[2] M. Alexandru, V. Banu, X. Jordà, J. Montserrat, M. Vellvehi,
D. Tournier, J. Millán, and P. Godignon, “SiC integrated circuit
control electronics for high-temperature operation,” IEEE Transactions
on Industrial Electronics, vol. 62, no. 5, pp. 3182–3191, 2015. doi:
10.1109/TIE.2014.2379212
[3] S. Cristoloveanu, “Silicon on insulator technologies and devices:
from present to future,” Solid-State Electronics, vol. 45, no. 8, pp.
1403–1411, 2001. doi: https://doi.org/10.1016/S0038-1101(00)00271-9.
[Online]. Available:
https://www.sciencedirect.com/science/article/pii/S0038110100002719
[4] C. Ramírez, A. Castelló, and E. S. Quintana-Ortí, “A BLIS-like matrix
multiplication for machine learning in the RISC-V ISA-based GAP8
processor,” The Journal of Supercomputing, pp. 1–10, 2022.
[5] N. A. Shahla Gul, “A comparison between RISC and CISC
microprocessor architectures,” International Journal of Science
Engineering and Advance Technology, vol. 4, Issue 5, May
2016. [Online]. Available:
https://core.ac.uk/display/235196731?utm_source=pdf&utm_medium=banner&utm_campaign=pdf-decoration-v1
[6] A. Sha. (2020) What is the difference between CISC and RISC architectures?
[Online]. Available:
https://forum.huawei.com/enterprise/en/what-is-the-difference-between-cisc-and-risc-architectures/thread/644537-895
[7] WatElectronics. (2020) What are the differences between microprocessor
and microcontroller? [Online]. Available:
https://www.watelectronics.com/differences-between-microprocessor-and-microcontroller/
[8] S. Chevtchenko and R. Vale, “A comparison of RISC and CISC
architectures,” resource, vol. 2, p. 4, 2013.
[9] F. Masood, “RISC and CISC,” 2011. doi: 10.48550/ARXIV.1101.5364.
[Online]. Available: https://arxiv.org/abs/1101.5364
[10] V. G. Oklobdzija, “Reduced instruction set computers prof,” 2004.
[11] S. Chatterjee. (2018) A beginner’s guide to RISC and CISC
architectures. [Online]. Available:
https://medium.com/@csoham358/a-beginners-guide-to-risc-and-cisc-architectures-fc9af424db3b
[12] S. Greengard, “Will RISC-V revolutionize computing?” Communications
of the ACM, vol. 63, no. 5, pp. 30–32, 2020.
[13] E. Alkim, H. Evkan, N. Lahr, R. Niederhagen, and R. Petri, “ISA
extensions for finite field arithmetic: Accelerating Kyber and NewHope on
RISC-V,” IACR Transactions on Cryptographic Hardware and Embedded
Systems, pp. 219–242, 2020.
[14] Y. Cheng, L. Huang, Y. Cui, S. Ma, Y. Wang, and B. Sui, “Efficient
multiple-ISA embedded processor core design based on RISC-V.”
[15] A. W. K. A. S. Inc. RISC-V ISA specification, volume 1 unprivileged
spec v. [Online]. Available:
https://github.com/riscv/riscv-isa-manual/releases/download/Ratified-IMAFDQC/riscv-spec-20191213.pdf
[16] S. Leonard. (2019) A dive into RI5CY core internals.
[Online]. Available:
https://www.embecosm.com/2019/08/13/a-dive-into-ri5cy-core-internals/
[17] A. Traber. (2019) RI5CY: User manual. [Online]. Available:
https://www.pulp-platform.org/docs/ri5cy_user_manual.pdf
[18] P. D. Schiavone. (2018) zero-riscy: User manual. [Online]. Available:
https://www.pulp-platform.org/docs/user_manual.pdf
[19] P. Davide Schiavone, F. Conti, D. Rossi, M. Gautschi, A. Pullini,
E. Flamand, and L. Benini, “Slow and steady wins the race?
A comparison of ultra-low-power RISC-V cores for Internet-of-Things
applications,” in 2017 27th International Symposium on Power and
Timing Modeling, Optimization and Simulation (PATMOS), 2017,
pp. 1–8. doi: 10.1109/PATMOS.2017.8106976
[20] A. Garofalo, G. Tagliavini, F. Conti, D. Rossi, and L. Benini, “XpulpNN:
Accelerating quantized neural networks on RISC-V processors through ISA
extensions,” in 2020 Design, Automation & Test in Europe Conference
& Exhibition (DATE), 2020, pp. 186–191. doi:
10.23919/DATE48585.2020.9116529
[21] S. D. Team. (2020) RISC-V assembly language programmer manual part
I. [Online]. Available: https://shakti.org.in/docs/risc-v-asm-manual.pdf
[22] J. H. Andrew Waterman, Krste Asanovic. (2021) The RISC-V
instruction set manual volume II: Privileged architecture, document
version 20211203. [Online]. Available:
https://riscv.org/technical/specifications/
[23] R. Logic. (2018) RV12 RISC-V 32/64-bit CPU core datasheet (v1.3).
[Online]. Available:
https://roalogic.github.io/RV12/docs/RoaLogic_RV12_RISCV_Datasheet.pdf
[24] William. (2022) RISC-V bus and pipeline (2) RISC-V CPU bus design.
[Online]. Available:
https://en.ica123.com/risc-v-bus-and-pipeline%EF%BC%882%EF%BC%89risc-v-cpu-bus-design/
Appendix A
First Appendix
A.1 RV32I Base Instruction Description

A.2 RV32M Standard Extension Listing

Appendix B
Second Appendix
B.1 Fibonacci program code
Listing B.1: Fibonacci program code
#include <stdio.h>
#include <stdlib.h>

static int fib(int i) {
    return (i > 1) ? fib(i - 1) + fib(i - 2) : i;
}

void main() {
    int i, j;
    int num = 5;
    // printf("starting fib(%d)...\n", num);
    for (i = 0; i < num; i++) {
        j = fib(i);
    }
    // printf("finishing...\n");
}
B.2 Speed program code
Listing B.2: Speed program code
#define NOINLINE __attribute__((noinline))

int main()
{
    int a=0, b=0, c=0, d=0, e=0, f=0, g=0, h=0,
        i=0, j=0, k=0, l=0, m=0, n=0, o=0, p=0;
    int a2=0, b2=0, c2=0, d2=0, e2=0, f2=0, g2=0, h2=0,
        i2=0, j2=0, k2=0, l2=0, m2=0, n2=0, o2=0, p2=0;
    int count, result;
    int num = 1;
    for (count = 0; count < num; count++) {
        a = i + 1;
        b = j + 2;
        c = k + 3;
        d = l + 4;
        e = m + 5;
        f = n + 6;
        g = o + 7;
        h = p + 8;
        i = a - 1;
        j = e - 2;
        k = b - 3;
        l = f - 4;
        m = c - 5;
        n = g - 6;
        o = d - 7;
        p = h - 9;
        a2 = i2 * 1;
        b2 = j2 * 2;
        c2 = k2 * 3;
        d2 = l2 * 4;
        e2 = m2 * 5;
        f2 = n2 * 6;
        g2 = o2 * 7;
        h2 = p2 * 8;
        i2 = a2 / 1;
        j2 = e2 / 2;
        k2 = b2 / 3;
        l2 = f2 / 4;
        m2 = c2 / 5;
        n2 = g2 / 6;
        o2 = d2 / 7;
        p2 = h2 / 9;
    }
    result = a*b + c*d + e*f + g*h + i*j + k*l + m*n + o*p;
#ifdef MICROBLAZE
    void exit(int);
    exit(0);
#endif
    return result;
}
B.3 Matrix program code
Listing B.3: Matrix program code
#include <math.h>
#include <stdlib.h>   /* needed for rand() */

/* Fill a matrix with pseudo-random values in the range 0..9999. */
void getMatrixElements(int matrix[][2], int row, int column) {
    for (int i = 0; i < row; ++i) {
        for (int j = 0; j < column; ++j) {
            matrix[i][j] = rand() % 10000;
        }
    }
}

/* result = first (r1 x c1) * second (r2 x c2) */
void multiplyMatrices(int first[][2],
                      int second[][2],
                      int result[][2],
                      int r1, int c1, int r2, int c2) {
    /* Clear the result matrix. */
    for (int i = 0; i < r1; ++i) {
        for (int j = 0; j < c2; ++j) {
            result[i][j] = 0;
        }
    }
    /* Accumulate the row-by-column products. */
    for (int i = 0; i < r1; ++i) {
        for (int j = 0; j < c2; ++j) {
            for (int k = 0; k < c1; ++k) {
                result[i][j] += first[i][k] * second[k][j];
            }
        }
    }
}

int main() {
    int first[2][2], second[2][2], result[2][2], r1, c1, r2, c2;
    r1 = 2;
    r2 = 2;
    c1 = 2;
    c2 = 2;
    getMatrixElements(first, r1, c1);
    getMatrixElements(second, r2, c2);
    multiplyMatrices(first, second, result, r1, c1, r2, c2);
    return 0;
}
TRITA-EECS-EX-2022:779
www.kth.se

