Design a Three-Stage
Pipelined RISC-V
Processor Using
SystemVerilog
KTH Thesis Report
<Ziyan He>
Examiner
Johnny Öberg
Stockholm, Sweden
KTH Royal Institute of Technology
Supervisor
Mattias Ekström
Stockholm, Sweden
KTH Royal Institute of Technology
Abstract
RISC-V is growing in popularity as a free and open RISC Instruction Set
Architecture (ISA) in academia and research. Also, the openness, simplicity,
extensibility, and modularity, among its advantages, make it more and more
used by designers in industry. The aim of this thesis is to design an open-source
RISC-V processor. The development of this RISC-V processor was based
on the prototype which was made in the course IL2232 Embedded Systems
Design Project (SoI-CMOS Design group), against an experimental high-
temperature SoC CMOS process. SystemVerilog was used for RTL coding.
ModelSim was used for RTL simulation. Genus was used for digital synthesis
and Innovus was used for digital place & route. The thesis concludes that
this RISC-V processor can run the compiled C-code which has been produced
by the virtual platform tool Imperas OVP. The processor is based on the
RV32IM instruction set. Through simulation, the cycles per instruction (CPI) of
this RISC-V processor can be collected while running different benchmark
programs developed in two Master's theses carried out in parallel with this one.
To a certain extent, the CPI reflects the performance of the processor. However, the actual
execution time needs to be tested by loading the processor to the hardware.
This part will not be discussed in this thesis but is left for future work. The gate
count is collected by digital synthesis and the corresponding area is collected
after digital place & route.
Keywords
RISC, RISC-V, ISA, SystemVerilog, RTL simulation, RV32IM, CPI.
Sammanfattning
RISC-V växer i popularitet som en gratis och öppen RISC ISA inom akademi
och forskning. Öppenheten, enkelheten, utbyggbarheten och modulariteten,
bland dess fördelar, gör att den används mer och mer av designers inom
industrin. Syftet med denna avhandling är att designa en RISC-V-processor
med öppen källkod. Utvecklingen av denna RISC-V-processor baserades på
prototypen som gjordes i kursen IL2232 Embedded Systems Design Project
(SoI-CMOS Design group). Mot en experimentell högtemperatur, SoC CMOS-
process diskuteras. SystemVerilog användes för RTL-kodning. ModelSim
användes för RTL-simulering. Genus användes för digital syntes och Innovus
användes för digital plats & rutt. Avhandlingen drar slutsatsen att denna
RISC-V-processor kan köra den kompilerade C-koden som har producerats
av det virtuella plattformsverktyget Imperas OVP. Instruktionsuppsättningen
RV32IM är instruktionsuppsättningens bas för denna processor. Genom simulering
kan CPI för denna RISC-V-processor samlas in samtidigt som man kör olika
benchmarkprogram utvecklade i två parallella masteruppsatser till denna. Till
viss del kan det spegla processorns prestanda. Den faktiska exekveringstiden
måste dock testas genom att ladda processorn till hårdvaran. Denna del
kommer att diskuteras i denna uppsats men lämnas för framtida arbete.
Grindräkningen samlas in genom digital syntes och motsvarande yta samlas
in efter den digitala platsen & rutten.
Nyckelord
RISC, RISC-V, ISA, SystemVerilog, RTL simulering, RV32IM, CPI.
Acknowledgments
This thesis was carried out at KTH.
Contents
1 Introduction
1.1 Motivation
1.2 Purpose
1.3 Problem Statement
1.4 Goals
1.5 Delimitations
1.6 Structure of the thesis
2 Background Study
2.1 RISC vs CISC
2.2 RISC-V
2.2.1 RV32I
2.2.2 "M" Standard Extension
2.3 Related Work
References
A First Appendix
A.1 RV32I Base Instruction Description
A.2 RV32M Standard Extension Listing
B Second Appendix
B.1 Fibonacci program code
B.2 Speed program code
B.3 Matrix program code
List of Figures
List of Tables
Chapter 1
Introduction
This chapter gives a description of the research area. The problem definition,
purpose, research methodology, and goal are discussed. Finally, the structure
of this thesis is presented.
1.1 Motivation
One of NASA’s missions in the 2030s is to send a rover to Venus. However,
the surface temperature of Venus is very high, which makes it difficult to apply
standard complementary metal-oxide-semiconductor (CMOS) circuits and
printed circuit boards (PCBs). The threshold voltage of electronic devices
changes with temperature, which affects the hold and
setup times of traditional digital circuits [1], and commonly used PCB metals
will melt.
be limited to 7x7 mm² (around 120k gates). For the sake of testing this new
process and verifying its realistic use for common digital circuits, a CPU is
designed.
1.2 Purpose
The RISC-V processor has been receiving much attention in recent years.
Embedded RISC-V processors have become common in many new chips.
Many institutions have also developed products based on RISC-V,
such as ETH Zurich’s Zero-riscy and GreenWaves Technologies’ GAP8 CPU [4]. The
biggest advantage of RISC-V is that it is open-source and free, which means
that it can help developers to complete CPU designs at a lower cost. Besides,
the simplicity of the basic instruction set and the flexibility of the coding
process make it even more popular. The purpose of this project is to explore the
advantages and disadvantages of RISC-V by designing an open-source RISC-
V processor that can also be used in future projects.
Intel, ARM, IBM, and AMD. All these CPUs can be divided into two
types, Complex Instruction Set Computer (CISC) processors and Reduced
Instruction Set Computer (RISC) processors. CISC usually has a large set
Through the above comparison, the RISC processor is more suitable. To build
a RISC processor, there are two problems to consider:
1.4 Goals
The goal of this project is to design a RISC-V processor that can execute
the benchmarks developed in two parallel master theses, using the selected
instruction set. After this, a Place & Route should be done against the
experimental SoC library to determine its final area. The work can be divided
into:
3. Run the machine code produced by the Virtual Prototype tool on the
RISC-V processor to get the execution time.
4. Synthesize the RISC-V using the given SoI library and do place & route
to get the area estimates of the design.
1.5 Delimitations
The focus of this thesis project is the basic functionality of the processor.
The highest priority goal is to execute all instructions of the specified ISA.
In addition, the area of the processor should be limited to 7x7 mm². Such a
small area constrains the achievable performance, so instruction pipelining
is implemented to improve efficiency. Moreover, the debugging module,
Joint Test Action Group (JTAG), and the asynchronous serial communication
module, Universal Asynchronous Receiver-Transmitter (UART), are not
necessary. Hence, the processor does not include these two modules in this
project. Besides, the given library for synthesis is not optimized yet.
Once the library is optimized, the area can be further reduced.
Chapter 2
Background Study
This chapter reviews the background literature on RISC for this project.
The literature on CISC and RISC architectures is discussed, including the
benefits that RISC brings to real-time applications. RISC-V, one kind of
RISC architecture and the one used in this thesis project, is then
introduced together with its specification. The instruction set RV32IM,
which is used in the project, is described in detail. Additionally, some
literature on RISC-V is introduced in the related work, including RISC-V
products developed by other researchers and the role of RISC-V ISA
extensions.
A CISC instruction set usually has a large number of instructions, so more
storage capacity is required to hold them. This is why many transistors
are used for instruction storage. One property of CISC is that the
instructions can be highly complex. They may have variable lengths and
formats, and a single instruction may perform several low-level operations.
CISC instructions may carry a lot of information, such as addressing
mode, operation code, and operands [7]. As a result, a single instruction
may take many clock cycles to execute, and the number of instructions
executed per second is small. Besides, the variable execution time of each
instruction may make
CISC | RISC
More than 300 instructions in instruction set | Less than 100 instructions in instruction set
Non-uniform sizes and formats of instructions | Fixed sizes and formats of instructions
Large number of transistors are used for instructions storage | Large number of transistors are used for more registers
Long cycle time per instruction | One cycle time per instruction
More addressing modes | Less addressing modes
Difficult to pipeline | Easy to pipeline
Emphasis on hardware | Emphasis on software
A RISC instruction set usually contains fewer instructions. Unlike CISC, the
instructions of a RISC instruction set are simple. All instructions have the
same length and a regular field layout, which makes them easy to pipeline.
A new instruction can start every clock cycle, although each individual
instruction still takes several clock cycles to complete [10]. The number of
instructions executed per second is therefore large. Additionally, RISC uses
a load-store architecture: only load and store instructions can access
memory. Arithmetic operations cannot act on memory directly and are
register-to-register only. Therefore, a large number of registers
is required to limit the traffic to memory [5] [9].
Because the instructions are simple, decoding does not require complex
hardware, and the work done by hardware is reduced. However, a compiler is
needed to break down high-level programs into simple instructions, which
shifts the emphasis to software [11].
Each architecture has its merits and disadvantages. CISC processors are
suitable for workstations, PCs, and servers. They have higher power
consumption but can realize special functions. RISC processors are mainly
used for real-time applications, including telecommunications, video, and
image processing. They have a smaller area and lower power consumption.
2.2 RISC-V
In this thesis project, the RISC-V Instruction Set Architecture (ISA), one kind
of RISC architecture, is used as the ISA of the processor. RISC-V is a free and
open instruction set that debuted in 2011. Because of its high flexibility
and low cost, it is well suited to designing specialized microprocessors
that can be applied in custom chips for specific applications [12].
RISC-V defines many standard extensions. For instance, the "A" standard
extension contains instructions for reading and writing memory
atomically, the "M" standard extension provides integer multiplication
and division, and the "F" standard extension provides floating-point
computation. Users can combine the optional extensions with a base instruction
set according to their needs [13]. Therefore, in many applications, such as
portable devices, wearable devices as well as aerospace equipment, RISC-V
has become more and more popular. Besides, one of the benefits of RISC-V
is that it is open-source, which gives it great commercial appeal. However,
the software ecosystem of RISC-V is still limited at the moment, because
most existing programs are built for ARM or x86 [14].
2.2.1 RV32I
RV32I is the base instruction set for the processor in this
project. The RISC-V ISA consists of a base integer ISA and some optional
standard extensions. A base integer instruction set must be present in
every implementation. Compared with other RISC architectures, all
base integer ISAs of RISC-V have no branch delay slots and support optional
variable-length instruction encoding [15]. They are restricted to a minimal set
of instructions sufficient for basic functions. There are four base ISAs in the
RISC-V family: RV32I, RV64I, RV128I, and RV32E. Each base instruction set has its
own integer register width, address space size, and number of integer registers.
RV32I and RV64I are similar. They provide 32-bit and 64-bit address spaces,
respectively. RV32E is a variant of RV32I that has half the number of
integer registers and was originally designed for small microcontrollers [15].
Two’s complement is used to represent signed values
in all base instruction sets.
In RV32I, there are six instruction formats in total: R-type, I-type, S-type,
B-type, U-type, and J-type. All formats have a fixed length of 32 bits, and
instructions must be aligned on a 4-byte boundary in memory.
Figure 2.2 shows all the formats of the RISC-V base instruction types. In order
to simplify decoding, the source registers (rs1 and rs2) and destination
register (rd) are in the same position in all formats. The immediate values
are sign-extended, and bit 31 of the instruction is always the sign bit of
the immediate.
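The fixed field positions and the sign extension of immediates can be illustrated with a small behavioural sketch. It is written in Python rather than the SystemVerilog used for the thesis RTL, and the helper names (sign_extend, decode_fields, imm_i, imm_s) are illustrative assumptions, not identifiers from the design. The bit positions follow the RV32I base encoding.

```python
def sign_extend(value, bits):
    """Interpret the low `bits` bits of value as a two's complement number."""
    value &= (1 << bits) - 1
    if value & (1 << (bits - 1)):
        return value - (1 << bits)
    return value


def decode_fields(instr):
    """Extract the fields that sit at the same position in every format."""
    return {
        "opcode": instr & 0x7F,          # bits 6:0
        "rd": (instr >> 7) & 0x1F,       # bits 11:7
        "funct3": (instr >> 12) & 0x7,   # bits 14:12
        "rs1": (instr >> 15) & 0x1F,     # bits 19:15
        "rs2": (instr >> 20) & 0x1F,     # bits 24:20
        "funct7": (instr >> 25) & 0x7F,  # bits 31:25
    }


def imm_i(instr):
    """I-type immediate: instruction bits 31:20, sign bit is bit 31."""
    return sign_extend(instr >> 20, 12)


def imm_s(instr):
    """S-type immediate: bits 31:25 and 11:7, sign bit is still bit 31."""
    upper = (instr >> 25) & 0x7F
    lower = (instr >> 7) & 0x1F
    return sign_extend((upper << 5) | lower, 12)


# addi x1, x0, -1 is encoded as 0xFFF00093
fields = decode_fields(0xFFF00093)
print(fields["opcode"], fields["rd"], imm_i(0xFFF00093))
```

In both immediate formats the sign comes from instruction bit 31, which is why a decoder can start sign extension before it knows the instruction type.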
In this project, the processor has a division unit but no dedicated
multiplier. Multiplication is written in SystemVerilog with the * operator,
and the EDA tool generates a default multiplier circuit during synthesis.
However, this multiplier cannot handle negative operands. Hence, in the
execution unit, if an operand of a multiplication is negative, its
Two’s complement is taken first. In addition, if the final result of the
multiplication should be negative, the Two’s complement of the temporary
result is the output.
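The sign handling described here can be sketched as a behavioural model. This is a Python illustration under stated assumptions, not the thesis RTL: the names negate and multiply_signed are hypothetical, and the 64-bit result width is chosen here so that both the low and the high half of the product are visible.

```python
MASK32 = 0xFFFFFFFF
MASK64 = 0xFFFFFFFFFFFFFFFF


def negate(value, mask):
    """Two's complement of value within the width given by mask."""
    return (~value + 1) & mask


def multiply_signed(a, b):
    """Signed 32x32 multiply built from an unsigned multiplier.

    a and b are 32-bit patterns. Returns the 64-bit product pattern.
    """
    a &= MASK32
    b &= MASK32
    a_negative = (a >> 31) == 1
    b_negative = (b >> 31) == 1

    # Step 1: convert negative operands to their magnitudes.
    magnitude_a = negate(a, MASK32) if a_negative else a
    magnitude_b = negate(b, MASK32) if b_negative else b

    # Step 2: unsigned multiplication of the magnitudes.
    product = magnitude_a * magnitude_b

    # Step 3: the result is negative only if exactly one operand was.
    if a_negative != b_negative:
        product = negate(product, MASK64)
    return product & MASK64


result = multiply_signed(-3 & MASK32, 5)
print(hex(result & MASK32))  # low 32 bits of -15
```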
specified applications. In this section, two RISC processor cores are introduced.
They are RI5CY and Zero-riscy. In addition, how the RISC-V processor plays
a role in neural networks is also mentioned.
Andreas Traber, who worked in the Integrated Systems Lab of ETH, developed
a four-stage in-order pipelined RISC-V processor core in SystemVerilog in
2016. This core, called RI5CY, became popular in many applications.
The first Core-V core in the OpenHW Group family was based on RI5CY [16].
RI5CY supports RV32IMC, which comprises the base integer instruction set, the
integer multiplication and division instruction set, and the compressed
instruction set. Figure 2.6 shows the block diagram of RI5CY with its
four-stage pipeline. Each stage can work independently, even if the previous
stage is stalled. If one stage wants to work, the next stage and the current
stage should be on standby. This is because an instruction can only be
propagated to the next stage when the next two stages are all ready to
receive a new instruction. Additionally, each stage has two control signals.
One is the enable signal, which activates the stage to process an
instruction. The other is the clear signal, which removes completed
instructions from the stage [17]. There is also a division unit in RI5CY.
A division or remainder instruction takes between 2 and 32 cycles to execute,
depending on the operand values [17].
and the base integer instruction set RV32E [18]. Compared with RI5CY,
Zero-riscy has a smaller area and higher power efficiency. There are two
pipeline stages, the instruction fetch stage and the instruction decode and
execution stage. For the instruction fetch stage, there is a buffer that can
collect data from the instruction memory. The buffer is also responsible for
generating instruction addresses and storing instructions when the pipeline
stage is stalled. The second stage is to decode the instructions and read the
operands from the correct register file. The operands should be moved to
ALU or multiplication unit before executing the instructions. The ALU can
fully support the RV32IMC instruction set, and it consists of one 32-bit adder,
one 32-bit shifter and one logic unit [19].
Chapter 3
RISC-V Processor Structure
This chapter discusses the structure of the RISC-V processor. Section 3.1
presents the core architecture of the processor. In addition, each module of
the core is described in detail. In section 3.2, the structure of RISC-V SoC is
discussed. The extra modules of the RISC-V platform, like peripheral units
and memory, are also explained.
These registers also serve different purposes. Table 3.1 [21] describes the
role of each type of register. Register x0 is hardwired to the constant zero.
Register x1 (ra) holds the return address used to get back to the caller of
the current subroutine. It is usually used in jump instructions. During a
jump instruction, the new address is moved into the PC register, and the
address of the instruction following the jump is saved to register x1 at the
same time. This makes it convenient to return from the subroutine to the
instruction that follows the jump instruction. Register x2 (sp) is the stack
pointer, which keeps track of the top of the stack. The stack grows toward
lower addresses, and sp points to the most recently pushed item. A call
stack usually stores the return address of each invocation and the local
variables of the procedure. The stack is manipulated with push and pop
operations: a push moves sp down and then stores data at the corresponding
address in memory, and a pop loads data and then moves sp up. Register x3
(gp) is the global pointer, which is used to access global data and allows
memory accesses to be optimized. In general, it should cover the most
intensely used region of RAM, so register x3 holds the base address of the
global variables. Register x4 (tp) is the thread pointer, which is used to
access thread-specific variables. In multi-threaded applications, each
thread may have a different value in the tp register. Registers x5-7 (t0-2)
and x28-31 (t3-6) are temporary registers, also called caller-saved
registers. During execution, they hold intermediate values. If the caller
needs their contents after a function call, it must save them before making
the call. In general, these registers can be used freely, but one must
assume that their contents are destroyed by called functions. Hence, they
are suitable if only a few functions are called [22]. Registers x8-9 (s0-1)
and x18-27 (s2-11) are saved registers, which are callee-saved. A function
must save their current values on the stack before using them and restore
them before returning, so a caller may assume that their contents are
preserved across function calls. Therefore, if there are many function
calls and some values need to be preserved across them, it is appropriate
to use saved registers [22]. Register x8 (s0) also serves as the frame
pointer. A frame pointer points to the base of the stack frame, because it
is otherwise difficult to keep track of the location of data on the stack.
Registers x10-17 (a0-7) are function argument registers. The arguments to a
subroutine are placed in the argument registers first. Return values, if
any, are passed back in registers x10-11 (a0-1).
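The push and pop behaviour of the stack pointer can be sketched with a short behavioural model. This is a Python illustration, not part of the thesis design: the class name Stack, the word size of 4 bytes per push, and the dictionary used as memory are assumptions made for the example.

```python
class Stack:
    """Descending stack: push moves sp down and stores, pop loads and moves sp up."""

    WORD_BYTES = 4  # one 32-bit register per push (assumption for this sketch)

    def __init__(self, top_address):
        self.memory = {}       # address -> 32-bit word
        self.sp = top_address  # models register x2 (sp)

    def push(self, value):
        self.sp -= self.WORD_BYTES            # move sp down first
        self.memory[self.sp] = value & 0xFFFFFFFF

    def pop(self):
        value = self.memory[self.sp]          # load first
        self.sp += self.WORD_BYTES            # then move sp up
        return value


stack = Stack(top_address=0x1000)
stack.push(0x11)  # e.g. a callee saving s0
stack.push(0x22)  # e.g. a callee saving s1
print(hex(stack.sp), hex(stack.pop()), hex(stack.pop()), hex(stack.sp))
```

Values come back in the reverse order of the pushes, which is why a callee restores saved registers in the opposite order to the one it used to save them.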
There are two ports for reading data and one port for writing data in the register
bank. This is because one instruction may read the values of two registers
simultaneously. Two read ports allow two registers to be read in one clock
There is one additional register, the program counter (PC). It holds the
address of the instruction currently being fetched. The program counter
cannot be written or read by store and load instructions directly; executing
instructions is the only way to change its value. It increases by 0x4 on
each positive edge of the clock in order to point to the next instruction
stored in the ROM. When executing a jump instruction or a branch
instruction, if the jump flag is asserted, the program counter takes the
jump address as input on the next positive edge instead of increasing by
0x4. If the hold flag is asserted, the pipeline is paused: the PC register
takes its current output as input at the next positive edge instead of
increasing by 0x4, which holds the pipeline and prevents fetching the next
instruction.
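The update rule of the program counter can be summarised in a behavioural sketch. It is written in Python for illustration only. The function name next_pc and its parameter names are hypothetical, and the priority order used here (reset first, then jump, then hold) is an assumption, since the text does not state which flag wins when several are asserted together.

```python
def next_pc(pc, jump_flag=False, jump_addr=0, hold_flag=False, reset_n=True):
    """Value the PC register takes on the next positive clock edge.

    Assumed priority: reset > jump > hold > sequential increment.
    """
    if not reset_n:        # active-low reset asserted
        return 0x00000000
    if jump_flag:          # taken jump or branch
        return jump_addr & 0xFFFFFFFF
    if hold_flag:          # pipeline paused, keep the current address
        return pc
    return (pc + 0x4) & 0xFFFFFFFF


pc = 0x00000010
print(hex(next_pc(pc)))                                  # sequential fetch
print(hex(next_pc(pc, jump_flag=True, jump_addr=0x80)))  # jump taken
print(hex(next_pc(pc, hold_flag=True)))                  # pipeline held
```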
When the active-low reset signal is asserted, all 32 registers are
reset to 0x00000000. The program counter is also reset to 0x00000000,
which points to the first instruction stored in the ROM. All registers start
running on the first rising edge after the reset signal is de-asserted.
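A register bank with two read ports, one write port, a hardwired x0, and a reset to zero can be modelled as follows. This is a minimal Python sketch under stated assumptions: the class and method names are invented for the example, and clock edges are represented by method calls rather than by real timing.

```python
class RegisterFile:
    """Behavioural model of a 32 x 32-bit register bank."""

    def __init__(self):
        self.regs = [0] * 32

    def reset(self):
        """Models the active-low reset being asserted: every register is cleared."""
        self.regs = [0] * 32

    def read(self, rs1, rs2):
        """Two read ports: both source registers are available together."""
        return self.regs[rs1], self.regs[rs2]

    def write(self, rd, value, write_enable=True):
        """Single write port. Writes to x0 are ignored because x0 is constant zero."""
        if write_enable and rd != 0:
            self.regs[rd] = value & 0xFFFFFFFF


bank = RegisterFile()
bank.write(5, 123)
bank.write(0, 999)      # has no effect on x0
print(bank.read(5, 0))  # both operands read in one access
```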
There are two units, IFu_IDu and IDu_EXu, in the processor. Both are
sequential logic. In this processor, there is no separate instruction fetch
module. The output signal pc_o of the PC register is connected to the
address input of the ROM. Because reading the ROM is combinational,
the instruction output from the ROM is ready at the input of the IFu_IDu unit.
Therefore, the role of the IFu_IDu unit is to fetch instructions from ROM and
pass them to the instruction decoder unit IDu on each rising edge. The
IDu_EXu unit works in the same way. It takes the decoded instruction from
IDu and the data from the registers, then passes them to the execution unit
EXu. If the reset signal or the hold flag is asserted, both pipeline units
output a NOP instruction to prevent further execution in the next unit. In
addition, all other data output signals are zero.
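The behaviour of such a pipeline unit on a rising edge can be sketched as below. This is a Python illustration, not the thesis RTL. The class name PipelineRegister and the single data field are simplifications made for the example. The NOP encoding 0x00000013 is the standard RV32I encoding of addi x0, x0, 0; whether the design uses exactly this constant is an assumption.

```python
NOP = 0x00000013  # addi x0, x0, 0


class PipelineRegister:
    """Models one pipeline unit (for example IFu_IDu) as a clocked register."""

    def __init__(self):
        self.instr = NOP
        self.data = 0

    def clock(self, instr_in, data_in, reset_n=True, hold=False):
        """Apply one rising clock edge and return the new outputs."""
        if not reset_n or hold:
            # Insert a bubble: the next unit sees a NOP and zeroed data.
            self.instr, self.data = NOP, 0
        else:
            self.instr, self.data = instr_in, data_in
        return self.instr, self.data


stage = PipelineRegister()
print(stage.clock(0x002081B3, 0x10))             # normal propagation
print(stage.clock(0x022081B3, 0x14, hold=True))  # hold asserted, bubble inserted
```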
In fact, IFu_IDu and IDu_EXu are two groups of registers placed
between large combinational circuits to shorten the critical path
and create a pipeline structure in the data path. Hence, these two units are
called pipeline units. A shorter critical path allows a shorter clock period
and thus a higher clock frequency. However, a pipeline structure also
introduces some problems. First, when a branch instruction or jump
instruction is executed, the whole pipeline needs to be flushed, which
costs some clock cycles. In addition, because a division instruction takes
more than one clock cycle to execute, the pipeline needs to be paused
during the division operation. Control signals are designed to handle
these cases.
1. Opcode
2. Function 3 (Funct3)
3. Function 7 (Funct7)
The decoder extracts the opcode first to recognize the type of the instruction.
Then Funct3 is extracted, which distinguishes instructions of the same type.
Sometimes Funct7 must also be extracted. For example, the ADD and MUL
instructions have the same opcode and Funct3, but their Funct7 fields differ.
After determining which instruction it is, the decoder extracts the addresses
of the necessary registers and sets the operands required by the instruction.
All this information is passed to the EXu unit.
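The three-step decoding order can be illustrated with a small sketch that tells ADD, SUB, and MUL apart. It is a Python model for illustration only. The function name classify_r_type and the lookup table are assumptions. The field values are the standard RV32IM encodings: opcode 0110011, with Funct7 0000000 for ADD, 0100000 for SUB, and 0000001 for MUL.

```python
OPCODE_OP = 0b0110011  # register-register arithmetic

# (funct3, funct7) -> mnemonic, only a few entries for illustration
R_TYPE_TABLE = {
    (0b000, 0b0000000): "ADD",
    (0b000, 0b0100000): "SUB",
    (0b000, 0b0000001): "MUL",
}


def classify_r_type(instr):
    """Return the mnemonic of a known R-type instruction, or None."""
    opcode = instr & 0x7F          # step 1: opcode selects the type
    if opcode != OPCODE_OP:
        return None
    funct3 = (instr >> 12) & 0x7   # step 2: funct3 narrows the choice
    funct7 = (instr >> 25) & 0x7F  # step 3: funct7 resolves ADD vs MUL
    return R_TYPE_TABLE.get((funct3, funct7))


print(classify_r_type(0x002081B3))  # add x3, x1, x2
print(classify_r_type(0x022081B3))  # mul x3, x1, x2
```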
value with that minuend. Because the width of the dividend is 32 bits, each
division takes at least 32 cycles to complete.
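The relation between the 32-bit dividend and the 32-cycle latency can be seen in a bit-serial divider model. The sketch below is a generic restoring division in Python and is an assumption: the exact algorithm of the thesis division unit is not fully visible in the text, and the name divide_unsigned and the cycle counter are invented for the example. One quotient bit is produced per iteration, so a 32-bit dividend needs 32 iterations.

```python
def divide_unsigned(dividend, divisor, width=32):
    """Bit-serial restoring division. Returns (quotient, remainder, cycles)."""
    mask = (1 << width) - 1
    dividend &= mask
    divisor &= mask
    quotient = 0
    remainder = 0
    cycles = 0
    for bit in range(width - 1, -1, -1):
        # Shift the next dividend bit into the partial remainder.
        remainder = (remainder << 1) | ((dividend >> bit) & 1)
        quotient <<= 1
        # Subtract when possible and record a 1 in the quotient.
        if remainder >= divisor:
            remainder -= divisor
            quotient |= 1
        cycles += 1  # one quotient bit per clock cycle
    return quotient, remainder, cycles


print(divide_unsigned(100, 7))
```

With a divisor of zero this loop yields an all-ones quotient and a remainder equal to the dividend, which matches the result the "M" extension defines for unsigned division by zero.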
In general, the type of CSR varies depending on the privilege level. The
machine mode is the most basic one. It means that all RISC-V processors must
be able to implement machine mode. The other three modes are optional. In
this project, the machine mode CSR is discussed. Although CSR can complete
many auxiliary functions, not all RISC-V processors need all types of CSR.
Table 3.2 shows several types of CSR which are used in this project. The
register cycle is used to count the number of clock cycles that have been
executed by the processor core since some arbitrary time in the past [23]. The
register mtvec is used to record the interrupt vector address for the machine
mode. The register mcause is used to record the trap cause. The register
mepc is used as an exception program counter. The register mie is used as
an interrupt enable register. The register mstatus is used to save the operating
state of hart for the machine mode. The register mscratch is used as a scratch
register for machine trap handler.
The CSR registers can be read and written, which can affect the operation of
the processor. However, the base integer instructions cannot read or write the
CSR registers. Only CSR instructions can be used to read and write the CSR
registers. These instructions belong to "Zicsr" standard extension instruction
set. Figure 3.3 shows the structure of six CSR instructions.
The CSRRW instruction atomically swaps values between a CSR and a
general-purpose register. It reads the previous value of the CSR and
zero-extends it to 32 bits. The extended value is written to the
destination register rd, while the original value of source register
rs1 is written to the corresponding CSR. For the CSRRWI instruction,
the only difference is that the value of source register rs1 is replaced by
an unsigned immediate (uimm[4:0]), which is zero-extended to 32 bits.
If the destination register rd in these two instructions is x0, the CSR
is not read and no value is passed to the
The CSRRS instruction reads and sets bits in a CSR. It reads the previous
value of the CSR, zero-extends it to 32 bits, and saves it to the
destination register (rd). Unlike the CSRRW instruction, the value of
source register rs1 is not written directly to the CSR. It is treated as a
bit mask that determines which bits of the CSR should be set. For instance,
if the value of source register rs1 is 11000 in binary, bits 4 and 3 of
the CSR are set, provided those bits are writable.
The CSRRSI instruction works similarly but uses an unsigned
immediate (uimm[4:0]) as the bit mask. This value is also
zero-extended to 32 bits.
The CSRRC instruction reads and clears bits in a CSR. It also reads
the old value of the CSR and zero-extends it to 32 bits. It is similar
to the CSRRS instruction; the only difference is that the bits selected by
the mask are cleared rather than set. For the CSRRCI instruction, the
zero-extended 32-bit unsigned immediate replaces the value of source
register rs1. The rest of the operation is the same as for the CSRRC
instruction.
For CSRRS and CSRRC, the CSR is always read, whichever registers are named
as rs1 and rd; either of them may be x0. However, if the source register
rs1 is x0, the write operation is not performed. CSRRSI and CSRRCI behave
like CSRRS and CSRRC, respectively, and do not write the CSR when the
immediate is zero.
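The read, set, and clear semantics of the six CSR instructions can be condensed into one behavioural function. This is a Python sketch for illustration, not the thesis implementation. The name csr_op, the op codes "RW", "RS", and "RC", and the flags rd_is_x0 and no_write are assumptions of the example. The operand is either the value of rs1 or the zero-extended immediate, so the immediate variants reuse the same function.

```python
MASK32 = 0xFFFFFFFF


def csr_op(op, old_csr, operand, rd_is_x0=False, no_write=False):
    """Model one CSR instruction.

    op:        "RW" (CSRRW/CSRRWI), "RS" (CSRRS/CSRRSI) or "RC" (CSRRC/CSRRCI)
    operand:   rs1 value, or the zero-extended uimm[4:0]
    no_write:  True when rs1 is x0 (or uimm is 0) for the RS and RC forms
    Returns (value written to rd or None, new CSR value).
    """
    old_csr &= MASK32
    operand &= MASK32

    # CSRRW with rd = x0 skips the read; RS and RC always read.
    rd_value = None if (op == "RW" and rd_is_x0) else old_csr

    if op == "RW":
        new_csr = operand                      # plain overwrite
    elif no_write:
        new_csr = old_csr                      # write suppressed
    elif op == "RS":
        new_csr = old_csr | operand            # set the masked bits
    elif op == "RC":
        new_csr = old_csr & ~operand & MASK32  # clear the masked bits
    else:
        raise ValueError("unknown CSR operation: " + op)
    return rd_value, new_csr


# Mask 0b11000 sets bits 4 and 3 and leaves the other bits unchanged.
print(csr_op("RS", 0b00001, 0b11000))
```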
to fetch instructions from the interrupt entry address and enter the interrupt
service routine.
In some cases, multiple interrupt requests occur at the same time. The INTu
unit takes all the requests and determines which one is sent to the
CTRLu unit, because arbitration is performed in the interrupt unit: a simple
conditional check of the different interrupt requests is implemented. Then
the flag is forwarded to the EXu unit. If a synchronous interrupt request is
received during the execution of a division instruction, it is not
processed by the INTu unit until the division instruction has finished.
However, if an asynchronous interrupt request occurs during the execution
of a division instruction, the interrupt request is processed first. When
the interrupt handling ends, the calculation continues.
Figure 3.4 shows the structure of the RISC-V SoC. In this RISC-V SoC,
there is a ROM, which stores all instructions, and a RAM, which is used
as the main memory. In general, the ROM is connected to the bus. However,
in this project, the ROM is connected to the RISC-V core directly.
Alternatively, the ROM could be connected to the bus as the highest-priority
slave in this SoC. In addition, a bus module connects all components
together, and a GPIO component acts as an I/O device that allows the SoC to
interact with users. All modules are discussed one by one.
3.2.1 Bus
Without a bus, how would the processor core communicate with the
peripherals? The SoC might look like Figure 3.5, where the
processor core interacts with each peripheral directly. In this project, a bus
unit is used to decode the address. Many bus designs are standardized
and popular, such as SPI, Wishbone, and AXI, and one of them can be used
when designing a CPU. However, these buses are relatively complicated. With
its comprehensive protocol, complex operations, and flexible control, AXI
offers high performance, but it is large and complex, which makes it
difficult to debug or use [24]. Moreover, an AXI bus may consume too many
resources and too much power. In contrast, the SPI bus is relatively simple.
However, it is a serial bus, whereas data transmission inside the CPU is
generally parallel [24], so the conversion is cumbersome. Therefore,
The Bus unit is a combinational circuit that connects the bus master RISC-V
core to different bus slaves, such as RAM and GPIO. The bus has a master
interface and a slave interface, which must be matched when a component
wants to connect to the bus. The bus unit supports multi-master and multi-
slave connections, but only supports one master and one slave communication
at the same time. An arbitration mechanism with fixed priority is applied to
each master device on the Bus unit.
The bus master is selected by fixed-priority bus arbitration. According
to the configuration of the bus, different bus masters have different
priorities, and the master with the higher priority is granted first. This
project has only one master, the CPU, so it always has the highest priority.
However, the arbitration is available if additional masters are added in
the future. The selection of the slave is made
by address segment. Currently, the first two Most Significant Bits (MSBs) are
used to indicate bus slave. Because only two MSBs are used, at most four
slaves can be connected to the bus. The bus selection bits take two bits away
from the address, so for each slave, 230 bytes can be accessed, which means
the maximum size of each slave is 230 bytes. Actually, the number of slaves
allowed to be held can be increased. The code can be modified to use more
than four slaves. Just use more MSBs as address segments. The disadvantage
of this method is also obvious. The programmers may have a lot of burdens
when designing programs. This is because all the load and store instructions
can only be stored at addresses within a specific region. For example, let us
assume that when the two MSBs of the address are 00, the RAM would be
selected. In this premise, if the two MSBs of the address corresponding to
the load or store instruction are not 00, the instruction could not be executed
normally. One method is that the compiler adds a special symbol to all load
or store instructions. When the bus detects this symbol, the RAM would be
selected.
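The address split described above can be sketched in a few lines of C. This is only an illustrative model, not the thesis RTL: the function names are hypothetical, and the only facts taken from the text are that the two MSBs select the slave and the remaining 30 bits address a location inside it.

```c
#include <stdint.h>

/* Number of address MSBs used as the slave select field (two in the text). */
enum { SLAVE_SELECT_BITS = 2 };

/* Index of the addressed slave: the two most significant address bits.
 * Hypothetical helper name; which slave sits at which index is not modelled. */
unsigned bus_slave_index(uint32_t addr)
{
    return (unsigned)(addr >> (32 - SLAVE_SELECT_BITS));
}

/* Offset inside the selected slave: the remaining 30 bits, 0 .. 2^30 - 1. */
uint32_t bus_slave_offset(uint32_t addr)
{
    return addr & (UINT32_MAX >> SLAVE_SELECT_BITS);
}
```

With this split, address 0x40000004 would reach offset 4 of slave 1, and raising SLAVE_SELECT_BITS to 4 would allow sixteen slaves of 2^28 bytes each.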
The ROM is a program storage module that holds the instruction list, while the
RAM is a data storage module that can only communicate with the Bus. Their
ports are shown in Figures 3.6 and 3.7. The inputs of these modules are the
32-bit addr_i and data_i, the write enable signal, reset and clock. The output
is the 32-bit data_o. Both designs work on the same principle. Writes are
clocked on the rising edge of the clock and controlled by the write enable
signal. Reads are combinational, which means that they are not affected by
the clock. When the reset bit is low, they will immediately output the value
stored at the current address.
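The split between clocked writes and combinational reads can be modelled in C as follows. This is a behavioural sketch under stated assumptions: the depth RAM_WORDS, the word addressing via addr_i >> 2, and the function names are hypothetical, and the reset behaviour is left out.

```c
#include <stdbool.h>
#include <stdint.h>

#define RAM_WORDS 1024u /* hypothetical depth, not taken from the thesis */

typedef struct {
    uint32_t mem[RAM_WORDS];
} ram_t;

/* Combinational read: the output depends only on the current address. */
uint32_t ram_read(const ram_t *ram, uint32_t addr_i)
{
    return ram->mem[(addr_i >> 2) % RAM_WORDS]; /* assumed word addressing */
}

/* Models one rising clock edge: storage changes only if write enable is set. */
void ram_posedge(ram_t *ram, bool we_i, uint32_t addr_i, uint32_t data_i)
{
    if (we_i) {
        ram->mem[(addr_i >> 2) % RAM_WORDS] = data_i;
    }
}
```

Calling ram_read between two clock edges always reflects the last completed write, which is the property the combinational read path provides.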
3.2.3 GPIO
According to the structure of the RISC-V SoC, there are two GPIO modules in
total. The data and address are passed to the GPIO module through the Bus, and
the data in the GPIO is returned to the RISC-V core through the Bus as well.
Its ports are shown in Figure 3.8.
The GPIO is similar to the ROM and RAM. It has built-in control registers and
data registers. Every two control bits set the type of one I/O port, so each
module can control sixteen I/O ports. The modes of these I/O ports are input,
output, and high impedance. Writes are clocked on the positive edge of the
clock and controlled by the reset bit and the write enable signal. There are
two cases. When the write enable signal is high, the lower four bits of the
input address determine whether the input data is stored in the data register
or the control register. When the write enable signal is low, io_pin feeds
data into the built-in data register, and the built-in control register
determines which bits of io_pin are stored. Reads are not affected by the
clock or the write enable signal, but they are affected by the reset signal.
The lower four bits of the input address determine whether data_o comes from
the control register or the data register.
Chapter 4
RISC-V Processor Implementation
4.1.1 PC_REG
Figure 4.1 shows the code of the PC_REG unit. As shown in line 15, the
active-low reset signal restores pc_o to its initial value. The reset value is
`prcreset, which is defined as 32'h0 in the define file. In line 17, if the
jump flag is asserted, pc_o is set to the jump destination address, and the
processor then fetches the instruction from this address. In line 19, the
value of `pchold is 3'b001. If the hold flag is asserted, the value of pc_o is
held. This hold flag is also used by the IFu_IDu and IDu_EXu modules. If only
the PC register is paused, the IFu_IDu and IDu_EXu modules can keep working
independently. If the IFu_IDu module is suspended, the PC register is held at
the same time. If the IDu_EXu module is paused, the whole pipeline is paused.
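The update rule of the program counter can be summarised by a small C model. It is a sketch rather than the code of Figure 4.1: the priority order (reset, then jump, then hold) follows the line order quoted above, the comparison against the hold level and the increment by 4 for sequential fetch are assumptions, and the names PC_RESET, HOLD_PC and pc_next are illustrative.

```c
#include <stdbool.h>
#include <stdint.h>

#define PC_RESET 0x00000000u /* reset value 32'h0 quoted in the text */
#define HOLD_PC  0x1u        /* hold level 3'b001 quoted in the text */

/* Value of pc_o after the next rising clock edge. */
uint32_t pc_next(uint32_t pc, bool rst_n, bool jump_flag,
                 uint32_t jump_addr, unsigned hold_flag)
{
    if (!rst_n) {
        return PC_RESET;  /* active-low reset has the highest priority */
    }
    if (jump_flag) {
        return jump_addr; /* redirect fetch to the jump target */
    }
    if (hold_flag >= HOLD_PC) {
        return pc;        /* assumed: any hold level at or above HOLD_PC */
    }
    return pc + 4u;       /* assumed sequential fetch of 32-bit instructions */
}
```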
RISC-V instructions must be aligned on 32-bit boundaries (i.e. at memory
locations divisible by 4) because they have a fixed length of 32 bits [15], if the processor is
Both pipeline units use the same type of D flip-flop, which is shown in
Figure 4.2. When the reset or the hold flag is asserted, the output is a
preset value. Otherwise, the output is equal to din, which delays the data by
one clock cycle on its way to the next module.
Figure 4.3 indicates the three output signals in different cases. If the reset
signal is low (active) or the hold flag is asserted, the output instruction is
set to the NOP instruction and the corresponding address is set to all zeros.
Otherwise, the instruction and address are passed to the IDu module on the
next rising clock edge. In addition, `holdIF is set to 3'b010 and `holdpc is
set to 3'b001. Note that the hold signal in this module is only asserted if
its value is larger than `holdIF. If the value of the hold signal is `holdpc,
the hold flag is not asserted.
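The behaviour of the IFu_IDu pipeline register can be sketched in C as shown below. Two assumptions are made: "larger than" the IF hold level is read as including equality, and only two of the three outputs (instruction and address) are modelled. The NOP encoding 0x00000013 is the standard RV32I ADDI x0, x0, 0, and all identifiers are illustrative.

```c
#include <stdbool.h>
#include <stdint.h>

#define NOP_INST 0x00000013u /* ADDI x0, x0, 0, the canonical RV32I NOP */
#define HOLD_IF  0x2u        /* hold level 3'b010 quoted in the text */

typedef struct {
    uint32_t inst;
    uint32_t addr;
} if_id_t;

/* Register contents after the next rising clock edge. */
if_id_t if_id_next(bool rst_n, unsigned hold_flag,
                   uint32_t inst_i, uint32_t addr_i)
{
    if_id_t out = { NOP_INST, 0u }; /* preset value on reset or hold */

    /* Assumed: a hold at the PC level (below HOLD_IF) does not stall
     * this register, so the fetched instruction still moves on. */
    if (rst_n && hold_flag < HOLD_IF) {
        out.inst = inst_i;
        out.addr = addr_i;
    }
    return out;
}
```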
Figure 4.6 shows how the pipeline units behave on a jump. When a jump
instruction reaches the execution unit, the jump flag is asserted. In the next
cycle, the output instructions of IFu_IDu and IDu_EXu are both NOP
instructions, so instruction 3 is not passed to IDu_EXu. Two clock cycles
later, the instruction at the target address reaches the EXu unit. Therefore,
this processor takes three cycles to execute a jump instruction, or a branch
instruction whose jump flag is asserted.
and opcode. After that, the required information can be extracted, such as
which general-purpose register is to be read or written. Figure 4.7 shows part
of the code for decoding I-type instructions. In line 1, case (opcode)
determines the type of the instruction. The value of `insttypei (7'b0010011)
is the opcode of these I-type instructions. Then, case (fun3) determines
exactly which instruction it is. Once the instruction is determined, the
necessary data is extracted and passed to the execution unit. In line 5, the
write enable signal is set to 1, which means that a register needs to be
written. In line 6, the address of the register to be written is set to rd.
In lines 7 and 8, one register read address is set to rs1 and the other is set
to zero, because the immediate is not read from a general-purpose register. In
lines 9 and 10, the two operands are set. Some instructions require more
signals, such as the LW instruction, which loads a word from RAM.
It needs to send a request to the bus and get the address of the memory first.
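The field extraction performed by the decoder can be illustrated with a short C function. The bit positions follow the standard RV32I I-type format; the struct and function names are illustrative and do not come from the thesis code.

```c
#include <stdint.h>

typedef struct {
    uint8_t opcode; /* inst[6:0]   */
    uint8_t rd;     /* inst[11:7]  */
    uint8_t funct3; /* inst[14:12] */
    uint8_t rs1;    /* inst[19:15] */
    int32_t imm;    /* inst[31:20], sign-extended */
} itype_t;

/* Splits a 32-bit I-type instruction word into its fields. */
itype_t decode_itype(uint32_t inst)
{
    itype_t d;
    d.opcode = (uint8_t)(inst & 0x7Fu);
    d.rd     = (uint8_t)((inst >> 7) & 0x1Fu);
    d.funct3 = (uint8_t)((inst >> 12) & 0x7u);
    d.rs1    = (uint8_t)((inst >> 15) & 0x1Fu);
    d.imm    = (int32_t)(inst >> 20);  /* 12-bit field, 0 .. 0xFFF */
    if (d.imm & 0x800) {
        d.imm -= 0x1000;               /* sign-extend bit 11 */
    }
    return d;
}
```

For instance, the word 0x00500093 (ADDI x1, x0, 5) decodes to opcode 0x13, rd = 1, rs1 = 0 and an immediate of 5.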
According to Figure 2.3, the instruction format of Shift Right Logical (SRL)
Figure 4.8 presents the part of the EXu unit code that handles the ADDI
instruction. The ADDI instruction adds an immediate to the value of the source
register rs1 and writes the result to the destination register rd. The code in
lines 1 and 3 determines which instruction it is. Because executing ADDI does
not require access to the main memory, the memory signals are all set to zero.
No jump or hold is involved in this instruction either, so the jump flag and
hold flag are both set to zero. The data to be written to the destination
register (reg_wdata) is equal to operand 1 plus operand 2.
Some instructions do not write a register but jump to a target address
instead, for example the BGE instruction. When the value of source register
rs1 is greater than or equal to the value of source register rs2, it jumps to
a new address equal to the current instruction address plus an immediate. In
this case, reg_wdata is set to zero, and the data read from the two source
registers is compared. If the condition is met, the jump flag is asserted and
the jump address is set to the target address. Otherwise, both are set to zero
and the branch is not taken.
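The BGE case described above can be written as a small C function. It is a behavioural sketch with illustrative names; the signed comparison follows the RV32I definition of BGE, and the output fields mirror the signals named in the text.

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    bool     jump_flag;
    uint32_t jump_addr;
    uint32_t reg_wdata;
} ex_out_t;

/* Execution-stage result of BGE: branch if rs1 >= rs2 (signed compare). */
ex_out_t exu_bge(uint32_t inst_addr, int32_t rs1_val, int32_t rs2_val,
                 int32_t imm)
{
    ex_out_t out = { false, 0u, 0u }; /* not taken: everything stays zero */

    if (rs1_val >= rs2_val) {
        out.jump_flag = true;
        /* Unsigned wrap-around addition handles negative offsets. */
        out.jump_addr = inst_addr + (uint32_t)imm;
    }
    return out;
}
```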
As shown in Figure 4.9, the default value of the hold flag is zero, and the
jump_address and jump flag are passed to the PC_REG unit. The requests from
the different modules are processed according to priority. In line 7, if the
jump flag or the hold flag from the EXu unit, or the hold flag from the
interrupt unit, is asserted, the pipeline is paused. In line 10, if the hold
flag from the bus is asserted, the PC_REG unit is paused, which means that
the pc address is held. Because this request holds the PC rather than the
whole pipeline, the design can improve the performance of the MCU.
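The priority between the hold requests can be sketched as follows. Only the ordering comes from the text; the numeric value of HOLD_ALL and the function name are assumptions, since the encoding of the pipeline-wide hold level is not given here.

```c
#include <stdbool.h>

#define HOLD_NONE 0x0u
#define HOLD_PC   0x1u /* 3'b001 quoted in the text: hold the PC only */
#define HOLD_ALL  0x3u /* assumed encoding for a hold of the whole pipeline */

/* Resolves the hold level from the individual requests, highest first. */
unsigned ctrl_hold(bool jump_ex, bool hold_ex, bool hold_int, bool hold_bus)
{
    if (jump_ex || hold_ex || hold_int) {
        return HOLD_ALL; /* EXu or interrupt unit: pause the pipeline */
    }
    if (hold_bus) {
        return HOLD_PC;  /* bus request: only the PC is held */
    }
    return HOLD_NONE;    /* default value of the hold flag */
}
```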
As shown in Figure 4.10, there are four states in the division unit. The
initial state is STATE_IDLE. When a division instruction reaches the EXu unit,
the control signal start_i of the DIVu unit is asserted. In the next cycle,
the state is STATE_START and the count value is set to 32'h40000000. After
that, the unit enters STATE_CALC. As mentioned before, the calculation may
take 32 cycles. The last state is STATE_END, in which the result is output. It
should be noted that the jump flag and hold flag are also asserted when the
division instruction arrives. In addition, there are two extra cycles with
inserted NOPs. One division instruction therefore takes one clock cycle for
STATE_IDLE, one for STATE_START, 32 for STATE_CALC, one for STATE_END, and two
for the NOPs, which is 37 cycles in total.
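To illustrate why a bit-serial divider needs about 32 calculation cycles, the sketch below implements unsigned restoring division in C, producing one quotient bit per loop iteration. The choice of restoring division is an assumption made for illustration; the text does not state which algorithm the DIVu unit uses, and the identifiers are hypothetical.

```c
#include <stdint.h>

/* Cycle budget quoted in the text: IDLE + START + CALC + END + two NOPs. */
enum { DIV_TOTAL_CYCLES = 1 + 1 + 32 + 1 + 2 };

typedef struct {
    uint32_t quotient;
    uint32_t remainder;
} div_result_t;

/* Unsigned restoring division; each iteration stands for one CALC cycle. */
div_result_t div_restoring(uint32_t dividend, uint32_t divisor)
{
    uint64_t rem = 0; /* 64-bit so that the shift below cannot overflow */
    uint32_t quo = 0;

    for (int bit = 31; bit >= 0; --bit) {
        rem = (rem << 1) | ((dividend >> bit) & 1u); /* bring down next bit */
        if (rem >= divisor) {
            rem -= divisor;                          /* subtraction succeeds */
            quo |= (uint32_t)1 << bit;               /* so this bit is 1 */
        }
    }

    div_result_t result = { quo, (uint32_t)rem };
    return result;
}
```

A faster divider, for example one that retires several quotient bits per cycle, would shorten the STATE_CALC phase and therefore the 37-cycle total.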
4.1.7 CSR_REG
The CSR_REG unit is similar to the REG_BANK unit. The only difference is that
the CSR_REG unit can be written and read by both the EXu unit and the INTu
unit, whereas the REG_BANK unit can only be written and read by the EXu unit.
Therefore, only the CSR_REG unit is discussed in this chapter.
Figure 4.11 shows how the EXu unit and the INTu unit write to the CSRs. Writes
are synchronous with the clock. The waddr[11:0] signal determines which CSR is
addressed. On the rising edge of the clock, the data is written to the
corresponding register. During writes, the execution unit has a higher
priority than the interrupt unit. Figure 4.12 indicates which CSR is read. The
read operation is asynchronous, so it does not depend on the clock. If the
read address is equal to the write address and a write operation is in
progress, the write data is returned directly.
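The write priority and the read bypass can be modelled in C as below. Several points are assumptions: that the lower-priority INTu write is dropped when both units write in the same cycle, that the bypass applies to the EXu write port, and that the register file is a flat array indexed by the 12-bit address. All names are illustrative.

```c
#include <stdbool.h>
#include <stdint.h>

#define CSR_COUNT 4096u /* 12-bit address space, waddr[11:0] */

typedef struct {
    bool     we;    /* write enable */
    uint16_t waddr; /* CSR address  */
    uint32_t wdata; /* write data   */
} csr_write_t;

typedef struct {
    uint32_t regs[CSR_COUNT];
} csr_file_t;

/* One rising clock edge: the EXu write wins over the INTu write. */
void csr_posedge(csr_file_t *csr, csr_write_t ex, csr_write_t intr)
{
    if (ex.we) {
        csr->regs[ex.waddr & 0xFFFu] = ex.wdata;
    } else if (intr.we) {
        csr->regs[intr.waddr & 0xFFFu] = intr.wdata;
    }
}

/* Asynchronous read with bypass of a write that is still in progress. */
uint32_t csr_read(const csr_file_t *csr, uint16_t raddr, csr_write_t ex)
{
    if (ex.we && ((ex.waddr ^ raddr) & 0xFFFu) == 0u) {
        return ex.wdata; /* same address: forward the pending write data */
    }
    return csr->regs[raddr & 0xFFFu];
}
```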
When the int_state and csr_state are both idle, the hold flag is not asserted.
If one of the states is not idle, the hold flag would be asserted. If there is
an ECALL or EBREAK instruction and there is no division operation being
If the int_flag from the PC_REG unit and the global_int_en from the CSR_REG
unit are both asserted, an asynchronous interrupt has been generated by a
peripheral such as the GPIO or a timer. The interrupt state is then set to
INT_ASYN_ASSERT. Unlike a synchronous interrupt, an asynchronous interrupt can
interrupt a division. The division will continue until the interrupt is
processed. Once the interrupt state is set to INT_ASYN_ASSERT, the csr_state
is set to CSR_MEPC. Subsequent operations are the same as for synchronous
interrupts.
4.1.9 Bus
In the bus module, one of the master devices is selected to access the
corresponding slave device. In this RISC-V processor, the only master device
is the PC_REG module, which is the default master device. If there is more
than one master device, the bus module applies fixed-priority arbitration to
select among them. If another master device with higher priority is selected,
the hold flag is asserted and the pipeline is paused.
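Fixed-priority arbitration among several masters can be sketched as follows. The convention that a lower master index means a higher priority, the request bitmask, and the function names are assumptions made for this illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Returns the index of the granted master, or -1 if nobody requests the bus.
 * Assumed convention: bit m of req_mask is master m, lower index wins. */
int bus_grant(uint32_t req_mask)
{
    for (int m = 0; m < 32; ++m) {
        if (req_mask & ((uint32_t)1 << m)) {
            return m;
        }
    }
    return -1;
}

/* The core must be held whenever a different master owns the bus. */
bool bus_hold_core(uint32_t req_mask, int core_id)
{
    int granted = bus_grant(req_mask);
    return granted != -1 && granted != core_id;
}
```

With the core as master 1 and a hypothetical higher-priority master 0, a request mask of 0x3 grants master 0 and holds the core.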
The limit of four slave devices was our own design decision during the project
course. Figure 4.14 shows how a slave device is selected. The two most
significant bits of the address from the master device are used as the slave
select signal, which allows up to four slave devices. However, in this project
there are only two slave devices, RAM and GPIO. If more than four slave
devices are needed, the slave select signal could be extended to four bits,
which can support sixteen slave devices.
4.1.11 GPIO
The GPIO has a 32-bit control signal and a 32-bit data signal. Each I/O is
controlled by two bits, which select one of three modes: input, output, or
high impedance. Hence, at most 16 I/Os can be controlled by this module.
Figure 4.16 shows how the GPIO is written. When the write enable signal is
asserted and addr_i[3:0] is 4'b0000, the input data is written to the control
register. If addr_i[3:0] is 4'b0100, the input data is written to the data
register. When the write enable signal is not asserted and the two control
bits of an I/O are 10, the input value io_pin_i is written to the
corresponding I/O. The reading process is similar to the writing process.
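The GPIO write behaviour can be modelled in C as shown below. The register offsets and the input-mode encoding 10 come from the text; the mapping of pin n to bit n of the data register and to control bits [2n+1:2n], as well as all identifiers, are assumptions of this sketch.

```c
#include <stdbool.h>
#include <stdint.h>

#define GPIO_ADDR_CTRL  0x0u /* addr_i[3:0] == 4'b0000 */
#define GPIO_ADDR_DATA  0x4u /* addr_i[3:0] == 4'b0100 */
#define GPIO_MODE_INPUT 0x2u /* control bits 10 */
#define GPIO_PINS       16

typedef struct {
    uint32_t ctrl;
    uint32_t data;
} gpio_t;

/* One rising clock edge of the GPIO module. */
void gpio_posedge(gpio_t *gpio, bool we_i, uint32_t addr_i,
                  uint32_t data_i, uint32_t io_pin_i)
{
    if (we_i) {
        switch (addr_i & 0xFu) {
        case GPIO_ADDR_CTRL: gpio->ctrl = data_i; break;
        case GPIO_ADDR_DATA: gpio->data = data_i; break;
        default: break; /* other offsets are ignored in this sketch */
        }
        return;
    }

    /* No bus write: sample every pin that is configured as an input. */
    for (int pin = 0; pin < GPIO_PINS; ++pin) {
        if (((gpio->ctrl >> (2 * pin)) & 0x3u) == GPIO_MODE_INPUT) {
            uint32_t bit = (uint32_t)1 << pin;
            gpio->data = (gpio->data & ~bit) | (io_pin_i & bit);
        }
    }
}
```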
2. Floorplan.
5. Pin placement.
The first step is to load the design. Once Innovus starts, the TCL command
setDesignMode -process 250 is used to set the process technology to 250 nm.
Then the Verilog netlists from Genus are imported. Furthermore, two .lef
files and an MMMC View Definition file are necessary. After that, the global
nets are connected: the power net is set to VDD and the ground net is set to
VSS. The second step is the floorplan. Several parameters can be modified
according to the requirements. The ratio (H/W) is set to 1, and the core
utilization is set to 0.6. The core margins on all sides are set to 32.0.
The final configuration is shown in Figure 4.17. The core utilization
increased to 0.7 after optimization, which is quite high. The third step is to
set the power rings. A generous margin can avoid electromigration problems. As
shown in Figure 4.18, the width, spacing and offset are set to 12, 2 and 4,
respectively. They all depend on the parameters of the PDK in use. In
addition, the width also depends on the power, so the power report from
synthesis can be used to calculate it. The fourth step is to route the power
and ground nets, which connects VDD and VSS to the power rings. The fifth step
is pin placement, where the side of the block on which each pin is placed can
be selected. With automatic placement, the pins are placed on each partition
automatically. They can then be adjusted manually as required. The next step
is standard cell placement and pre-Clock Tree Synthesis (CTS) optimization.
The placement should be timing-driven, with clock gating awareness enabled.
In addition, the TIEHI and TIELO cells should be placed.
After that, the timing is reported to check the pre-CTS design. If there are
violations, timing optimization should be run. The seventh step is the CTS.
Before creating the clock tree, a Clock Concurrent Optimization (CCOPT)
specification file should be created. In this file, the inverter and buffer
cells that are defined in the SoI standard library should be set up. A timing
report should be generated after the clock tree is created. If there are setup
violations, run the command optDesign -postCTS to optimize. The next step is
to route with NanoRoute. After that, the design should be checked for
connectivity violations. The final step is the placement of filler cells. The
places that are not used by standard cells can be filled with decoupling
capacitors and empty fillers. Finally, a connectivity check should be done and
the design exported.
Chapter 5
Results and Analysis
In this chapter, all the necessary results are presented and analysed. In
section 5.1, the verification is discussed briefly. Because most of the
content is similar to the previous project, only the simulation part is
emphasized. The second section records the area estimation and the timing
report for both syntheses. The last section illustrates the layout and area
report after physical design.
5.1.1 Verification
The verification strategy for each module is the same as in the previous
project. Hence, the individual verification of each module is not discussed
here. This subsection focuses on the verification of the processor core and
the top-level design.
For the direct test, the clock port is modified with a multiplexer that
selects between the system clock and the test clock. The test clock is used
to load the instructions into the ROM. After the instructions are loaded, the
system clock is turned on for the core. Every input of the ROM module is
given a multiplexer, which selects between the INITIAL_TEST PHASE and the
POST_INITIAL_TEST PHASE. The former is used to load all instructions into the
ROM module, and the latter is where the core takes over the control signals of
the ROM.
through constrained randomization and then loaded into the ROM in order.
An assertion is used to check whether randomization passed for each class. In
addition, a monitor class collects coverage for source register 1 (rs1),
source register 2 (rs2), the destination register (rd), the immediate values,
and the ROM address written to.
Table 5.1: Hit rate of the bins defined in the cover group
Covergroup                     Metric   Goal
Coverpoint cg::RS1             68.5%    100
Coverpoint cg::RS2             52.2%    100
Coverpoint cg::RD              70.6%    100
Coverpoint cg::ROM_ADDR        50%      100
/risc_top/monitor_risc/cg      60.1%    100
Table 5.1 shows the hit rate of the bins defined in the cover groups. The hit
rate of the source and destination register bins was expected to be at least
50%, because the constrained randomization used only certain sets of registers
to generate a few instruction types, such as the R-type and the S-type. On the
other hand, the ROM was about 4K bytes, and it was not possible to exercise
all the locations for sign-off. Only half of the memory is used for storing
instructions.
5.1.2 Simulation
In this subsection, the simulation work is discussed. The main goal is to
compare the designed processor with a RISC-V processor produced by the
virtual platform tool Imperas OVP to check if the designed processor works
correctly.
indicates part of the content of a .log file which is obtained after executing
the Fibonacci_3 program.
Simulation results
All the simulations passed, so the functionality of the designed processor is
satisfactory. In theory, the ideal CPI is equal to 1. However, some
instructions need more than one clock cycle to execute. For instance, all the
jump instructions such as j, jal and jalr, as well as the RET instruction,
need three clock cycles. Branch instructions also need three clock cycles if
the jump flag is asserted. Conversely, if the jump flag is not asserted,
inst_addr_i is incremented by 4 and the next instruction proceeds to the EXu
unit. In addition, all the division instructions need 37 clock cycles.
Table 5.2 shows the significant data from every simulation. The CPI of the
last two programs is clearly larger than that of the other two. This is
because the Fibonacci programs only use integer-based instructions, most of
which take one cycle to complete. For example, among the 367 instructions of
the Fib_3 program, there are 23 jump instructions, 15 RET instructions and 23
branch instructions. Three of the branch instructions did not assert the jump
flag, which leaves 23 + 15 + 20 = 58 instructions that each take two extra
cycles. Therefore, theoretically, the number of cycles is equal to
367 + 58 * 2, which is 483 in total. This was confirmed by measurements taken
during the execution of the programs.
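The cycle estimate can be reproduced with a short calculation in C. The cost model is derived from the figures in the text (three cycles per taken jump or branch, 37 cycles per division, one cycle otherwise); the function and constant names are illustrative.

```c
/* Extra cycles on top of the single base cycle of every instruction. */
enum {
    EXTRA_PER_TAKEN_JUMP = 2, /* 3 cycles in total */
    EXTRA_PER_DIVISION   = 36 /* 37 cycles in total */
};

/* Estimated cycle count for a program run. */
unsigned long total_cycles(unsigned long instructions,
                           unsigned long taken_jumps,
                           unsigned long divisions)
{
    return instructions
         + EXTRA_PER_TAKEN_JUMP * taken_jumps
         + EXTRA_PER_DIVISION * divisions;
}

/* Cycles per instruction for a given run. */
double cycles_per_instruction(unsigned long cycles, unsigned long instructions)
{
    return (double)cycles / (double)instructions;
}
```

For Fib_3, total_cycles(367, 23 + 15 + 20, 0) gives 483 cycles, which corresponds to a CPI of roughly 1.32.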
The speed and matrix programs contain many division and remainder
instructions, which makes their CPI larger than that of the first two
programs. There are several ways to improve the CPI. One is to use a pipeline
with more than three stages. The second is to optimize the division unit by
using a better algorithm for the computation. The third is to schedule useful
instructions into the empty slots while the core is waiting for the division
to finish.
5.2.1 SOI_STDLIB
The SoI standard library is used as the target library in the synthesis. The
main goals are to obtain area estimates, check for timing violations, and
generate the Verilog netlists and constraints, some of which are used for
physical design. Figure 5.5 shows how the liberty timing files and the
SystemVerilog code are loaded. After that, constraints are created for the
clock, inputs, outputs, loads, etc. The constraints are shown in Figure 5.6.
The period is set to 100 nanoseconds. Finally, the script performs a generic
synthesis, maps the design to the available cells in the library, and then
runs optimization. The Verilog netlists and constraints are written out, and
the area and timing reports are stored as well.
5.2.2 Lsi_10k
In the previous project, Lsi_10K, which comes with Synopsys DC, was used as
the target library. Synthesis with the Lsi_10K library is therefore necessary
for comparison. The script for running synthesis with Lsi_10K is quite similar
to the one above. In the load sources part, only the library path and the
library name need to be modified. Moreover, the operating condition is set to
nominal. In the constraints part, there are no block constraints for the
Lsi_10K library. The other settings are the same as for the SoI standard
library and are shown in Figure 5.7.
Table 5.3 also lists the area and timing report for the SoI_STD library. The
cell count is around 23877. In this library, the area unit is defined as µm².
The total area is therefore around 25.177 mm², which is less than 49 mm²
(7 mm x 7 mm), the largest area that our in-house SoI technology allows.
The design also has a positive slack of about 0.5 ns.
Figure 5.9 shows the area report of the final layout. The total area is around
28.31 mm², which is larger than the synthesized one but still less than
49 mm². The increase is due to the additional cells, such as buffers, that
the optimization inserts.
Chapter 6
Conclusions and Future work
This chapter summarizes the project. Future work is discussed in the second
section.
6.1 Conclusions
The RISC-V SoC design is a 32-bit core with a three-stage pipeline, which
supports the RV32I base instruction set with the "M" standard extension. Two
pipeline units have been modified, and the pipeline works as expected. The
code of the division unit has been modified so that all division and remainder
instructions are supported. Furthermore, two new modules, CSR_REG and INTu,
have been added in this project. This processor has the ROM as a part of the
core, so ROM accesses do not have to go through the bus. This is a
non-standard arrangement, so it should be modified in the future. The core can
only support up to four slaves. In order to support more slaves, the bus unit
should also be modified.
After verification and simulation, the functionality of this new design is
complete. It can run the machine code produced by the virtual platform, and
the CPI has been collected. The RISC-V core, synthesized as an ASIC with the
SoI_STD library, has an area of around 25.177 mm² and a slack of about 0.5 ns.
The critical data path is 99.5 ns, so it can run at a clock frequency of about
10 MHz at most. After place and route, the estimated area is around
28.31 mm². This means that, in theory, the new SoI CMOS technology can be
used to build a RISC-V processor.
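The relation between the clock constraint, the slack and the maximum clock frequency quoted above can be checked with a one-line calculation. The helper name is hypothetical, and the formula simply inverts the critical path delay, taken as the constrained period minus the positive slack.

```c
/* Maximum clock frequency in MHz for a period and slack given in ns. */
double fmax_mhz(double period_ns, double slack_ns)
{
    double critical_path_ns = period_ns - slack_ns; /* 100 - 0.5 = 99.5 ns */
    return 1000.0 / critical_path_ns;               /* 1 / ns = 1000 MHz */
}
```

For the 100 ns constraint and 0.5 ns slack reported here, the result is slightly above 10 MHz.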
A jump, a branch, or a division instruction leaves some empty slots, which
cause pipeline bubbles. However, some instructions must execute in order. For
example, if a SUB instruction needs the result of the preceding division
instruction, it must wait until the division is complete. One way to resolve
this is to determine whether the instruction following the division requires
its result. Each division instruction has a destination register rd, and if
the new instruction does not access this register during the division, it can
be fetched. For the jump and branch instructions, a branch predictor can be
added. The core can then execute speculatively and, if the prediction is
wrong, flush the pipeline and fetch the correct instructions.
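The register dependency check proposed above for instructions that follow a division could be sketched as follows. This is a possible formulation rather than part of the current design: the function name is hypothetical, and the special case for x0 relies on the RISC-V rule that x0 is hard-wired to zero.

```c
#include <stdbool.h>
#include <stdint.h>

/* Returns true if an instruction with source registers rs1 and rs2 and
 * destination rd may proceed while a division targeting div_rd is running. */
bool can_issue_during_div(uint8_t div_rd, uint8_t rs1, uint8_t rs2, uint8_t rd)
{
    if (div_rd == 0) {
        return true; /* a result written to x0 is discarded: no dependency */
    }
    /* Neither read nor overwrite the register the division will write. */
    return rs1 != div_rd && rs2 != div_rd && rd != div_rd;
}
```

An instruction without a second source or without a destination can pass register 0 for the unused field.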
The bus unit can support at most four slaves, which is a severe restriction.
The number of slaves should be configurable, together with their address
spaces. One method is to add an additional ID signal to the bus unit, or to
extend the input address of the bus unit, in order to select slaves. Another
method is to use a standard bus design. The AXI4 protocol is recommended
because it is very common and powerful.
References
[6] A. Sha. (2020) What is the difference between CISC and RISC architectures?
[Online]. Available: https://forum.huawei.com/enterprise/en/what-is-the-difference-between-cisc-and-risc-architectures/thread/644537-895
[23] Roa Logic. (2018) RV12 RISC-V 32/64-bit CPU core datasheet (v1.3).
[Online]. Available: https://roalogic.github.io/RV12/docs/RoaLogic_RV12_RISCV_Datasheet.pdf
[24] William. (2022) RISC-V bus and pipeline (2): RISC-V CPU bus design.
[Online]. Available: https://en.ica123.com/risc-v-bus-and-pipeline%EF%BC%882%EF%BC%89risc-v-cpu-bus-design/
Appendix A
First Appendix
Appendix B
Second Appendix
#include <stdio.h>
#include <stdlib.h>
void main() {
    int i, j;
    int num = 5;
    for (i = 0; i < num; i++) {
        j = fib(i);
    }
#include <stdio.h>
#include <stdlib.h>
#define NOINLINE __attribute__((noinline))
int main()
{
    int a=0, b=0, c=0, d=0, e=0, f=0, g=0, h=0,
        i=0, j=0, k=0, l=0, m=0, n=0, o=0, p=0;
    int a2=0, b2=0, c2=0, d2=0, e2=0, f2=0, g2=0, h2=0,
        i2=0, j2=0, k2=0, l2=0, m2=0, n2=0, o2=0, p2=0;
    int count, result;
    int num = 1;
    for (count = 0; count < num; count++) {
        a = i + 1;
        b = j + 2;
        c = k + 3;
        d = l + 4;
        e = m + 5;
        f = n + 6;
        g = o + 7;
        h = p + 8;
        i = a - 1;
        j = e - 2;
        k = b - 3;
        l = f - 4;
        m = c - 5;
        n = g - 6;
        o = d - 7;
        p = h - 9;
        a2 = i2 * 1;
        b2 = j2 * 2;
        c2 = k2 * 3;
        d2 = l2 * 4;
        e2 = m2 * 5;
        f2 = n2 * 6;
        g2 = o2 * 7;
        h2 = p2 * 8;
        i2 = a2 / 1;
        j2 = e2 / 2;
        k2 = b2 / 3;
        l2 = f2 / 4;
        m2 = c2 / 5;
        n2 = g2 / 6;
        o2 = d2 / 7;
        p2 = h2 / 9;
    }
#ifdef MICROBLAZE
    void exit(int);
    exit(0);
#endif
    return result;
}
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
void getMatrixElements(int matrix[][2], int row, int column) {
    for (int i = 0; i < row; ++i) {
        for (int j = 0; j < column; ++j) {
            matrix[i][j] = rand() % 10000;
        }
    }
}
void multiplyMatrices(int first[][2], int second[][2], int result[][2],
                      int r1, int c1, int r2, int c2) {
    for (int i = 0; i < r1; ++i) {
        for (int j = 0; j < c2; ++j) {
            result[i][j] = 0;
        }
    }
    for (int i = 0; i < r1; ++i) {
        for (int j = 0; j < c2; ++j) {
            for (int k = 0; k < c1; ++k) {
                result[i][j] += first[i][k] * second[k][j];
            }
        }
    }
}
int main() {
    int first[2][2], second[2][2], result[2][2], r1, c1, r2, c2;
    r1 = 2;
    r2 = 2;
    c1 = 2;
    c2 = 2;
    getMatrixElements(first, r1, c1);
    getMatrixElements(second, r2, c2);
    multiplyMatrices(first, second, result, r1, c1, r2, c2);
    return 0;
}
TRITA-EECS-EX-2022:779
www.kth.se