
2023 The 8th International Conference on Computer and Communication Systems

Survey and Comparison of Pipeline of Some RISC and CISC System Architectures
Yan He
School of Electronic Science and Engineering
Nanjing University
Nanjing, China
MF21230031@smail.nju.edu.cn

Xiangning Chen
School of Electronic Science and Engineering
Nanjing University
Nanjing, China
shining@nju.edu.cn

2023 8th International Conference on Computer and Communication Systems (ICCCS) | 978-1-6654-5612-8/23/$31.00 ©2023 IEEE | DOI: 10.1109/ICCCS57501.2023.10150975

Abstract—An instruction set is the set of instructions a CPU uses to compute and to control the computer system, and it is the interface between hardware and software. There are two common kinds of instruction set: CISC and RISC. Pipeline technology is widely used in processor design to improve the efficiency of instruction execution. This paper describes the differences between CISC and RISC in pipeline implementation, introduces basic pipelining and two advanced forms, superscalar and superpipelining, in detail, and surveys the pipelines of several CISC and RISC processors, including ARM, RISC-V, LoongArch, and x86.

Keywords—Pipeline, superscalar, superpipelining, CISC, RISC, processor

I. INTRODUCTION

If the computer is to obey a command, the command must be in the language of the computer. The basic words in a computer language are called instructions, and all the instructions of a computer are called the instruction set of the computer [1]. In general, an instruction set architecture defines the supported instructions, data types, registers, hardware support for managing main memory, basic features (such as memory consistency, addressing modes, and virtual memory), and the input/output model.

Instruction set architectures are mainly divided into Complex Instruction Set Computer (CISC) and Reduced Instruction Set Computer (RISC). In the 1960s, tasks that originally took multiple instructions were changed to be completed by a single instruction, and a computer that executes such complex instructions is called a CISC. As computer technology developed, the number of instructions kept increasing, making the complex instruction set ever more complex. It was found that only 20% of the instructions defined by a CISC instruction set were frequently used and 80% were rarely used. Therefore, RISC became popular in the 1980s. The biggest difference between RISC and CISC lies in the simplicity of the instructions: RISC uses multiple instructions to complete a task that a complex instruction set can complete with one instruction [2].

Compared with reduced instructions, complex instructions cause problems for the pipeline technology widely used in modern processors. The execution of an instruction in a microprocessor is generally divided into steps such as "prefetch", "fetching operands", "operation", and "storage". First, complex instructions have different execution times. Some can be completed in 4 or 5 clock cycles, while others require dozens. Even for simple instructions, different addressing modes result in different execution times. Worse, instruction lengths also differ, and the length of the same instruction varies with the addressing mode. How long should the pipeline be for such instructions? If the pipeline is designed for the shortest instruction, it will be interrupted when it encounters complex instructions; if it is designed for the longest instruction, some stages will be skipped when executing shorter instructions, so the pipeline cannot be kept full. RISC instructions have a fixed length and few addressing modes. Most of them are simple instructions that can be completed in one clock cycle.

Secondly, CISC has complex instruction formats because of its variable instruction length. In contrast, RISC has few instruction formats, and the source register fields are in the same position in every instruction. This symmetry means that the decode stage can start reading the register file while it is still determining the instruction type. If the instruction format were asymmetric, the decode stage would have to be split into two parts, deepening the pipeline.

Thirdly, CISC has no dedicated memory-access instructions; many instructions can operate on memory. Executing one instruction often requires several consecutive memory operations, making memory accesses frequent and irregular. RISC accesses memory only through dedicated load and store instructions. Other instructions cannot access memory, so memory operands appear only in load and store instructions. All operands must be aligned in memory, so there is no need to worry that a single data transfer instruction will require two memory accesses. The requested data can be transferred between processor and memory in a single pipeline stage [3].

Although RISC has more advantages in using pipelines, this does not mean that CISC cannot use pipelines. To facilitate pipeline design, various CISC CPUs have introduced the concept of micro-operations. In the prefetch and predecode stages of the pipeline, hardware decoders translate each instruction into the corresponding sequence of simple internal instructions (microcode), which is then sent to the
Authorized licensed use limited to: International Institute of Information Technology Bangalore. Downloaded on October 14,2023 at 04:10:15 UTC from IEEE Xplore. Restrictions apply.
processor pipeline for execution. A common CISC instruction is converted into one or more micro-operations of the same length and fixed format, similar in form to RISC instructions, so that the advantages of the RISC architecture can be exploited.

II. PIPELINE

A. Basic Pipeline

Pipelining is a technique for overlapping the execution of multiple instructions, similar to a production line [1]. A pipeline divides the execution of an instruction into several steps called stages, and each stage is completed within one clock cycle. Each stage performs a specific function and produces an (intermediate) result. Each stage consists of an input register followed by a combinational logic data path, which is connected to the input register of the next stage. The clock signal is delivered to the input registers of all stages at the same time, and each clock pulse causes every stage to pass its completed result to the next stage simultaneously [4].

TABLE I. MIPS PIPELINE

The classic MIPS five-stage pipeline is the one mentioned most often in computer architecture textbooks. As shown in Table 1, the life cycle of an instruction in this pipeline is divided into the following steps:
1) IF (instruction fetch): fetch the instruction from memory.
2) ID (instruction decode): decode the instruction while reading the registers. Decoding yields the indices of the operand registers required by the instruction, and these indices are used to fetch the operands from the register file.
3) EX (execution): perform the operation or calculate an address.
4) MEM (memory access): access the operand in data memory.
5) WB (write-back): write the result back to the register. For an ordinary arithmetic instruction, the result is the value calculated in the EX stage; for a load instruction, the result is the data read from memory in the MEM stage [5].

The time each stage needs to complete its task may differ, but the clock cycle must be chosen so that the slowest stage can complete its task. If the pipeline does not stall, it can in theory complete one instruction per clock cycle, increasing the number of instructions executing simultaneously and the rate at which instructions start and finish. Pipelining does not reduce the execution time of a single instruction, also known as latency: an instruction in the MIPS five-stage pipeline still needs 5 clock cycles to complete [6].

B. Superscalar

The superscalar method is a pipeline technique that exploits multiple independent instructions and depends on the ability to execute multiple instructions in parallel. The bottom of Fig. 1 shows a superscalar implementation that executes two instructions in parallel. Its essence is the ability to execute instructions independently in different pipelines. This concept can be further developed to allow instructions to be executed in an order different from the original program order [7].

The superscalar processor fetches multiple instructions at a time and then tries to find several that are independent of each other and can be executed in parallel. If the input of an instruction depends on the output of a previous instruction, that instruction cannot be executed at the same time as, or before, the previous instruction. Once such dependencies have been identified, the processor can issue and complete instructions in an order different from that of the original code. By using more registers, or by renaming register references in the original code, the processor can remove some unnecessary dependencies [8]. Compared with RISC, CISC is more limited in its use of the superscalar method. Because the length of an instruction is not known in advance, it must be at least partially decoded before subsequent instructions can be fetched. This hinders the multiple instruction fetching that superscalar pipelining requires. Although superscalar techniques are better suited to RISC or RISC-like structures, they can also be applied to CISC structures through micro-operations. The most famous example is the Pentium. The original Pentium has a limited superscalar capability, using two separate integer execution units. The Pentium Pro introduced a full superscalar design, and subsequent Pentium models have more sophisticated and powerful superscalar designs [3].

Fig. 1. Superscalar and superpipeline.

C. Super Pipelining

The tasks performed in most pipeline stages need less than half a clock cycle [3]. Thus, in a two-stage superpipelined microprocessor, the task of each pipeline stage can be divided into two non-overlapping parts, each of which executes within half a clock cycle. In this way, as shown in the middle of Fig. 1, doubling the internal clock frequency allows two tasks to be completed in one external clock cycle. The superpipelined processor can issue multiple instructions in a clock cycle. It can also be said that
the machine cycle can be shortened by increasing the number of pipeline stages. In the same amount of time, a superpipelined processor executes more machine instructions [9]. Another advantage of superpipelining is a higher clock frequency. The more stages the pipeline has, the more finely it is cut and the less hardware logic each stage contains. The less hardware logic there is between two registers, the higher the achievable frequency.

Of course, more pipeline stages consume more registers and more area. Another problem with deep pipelines is that the outcome of a conditional branch cannot be known at the instruction fetch stage, so it can only be predicted. Only near the end of the pipeline, after the actual operation, is it known whether the branch is taken. If the prediction is wrong, a "pipeline flush" is needed: all the wrongly prefetched instructions are discarded and the correct instruction stream is fetched again. The deeper the pipeline, the more wrong-path instructions are prefetched, discarded and refetched, wasting power and losing performance [10].

Because pipeline depth has both advantages and disadvantages, and application scenarios differ, the pipeline depth of today's processors is developing towards two extremes. On the one hand, pipelines are getting deeper: in 2004, the Pentium 4 (Prescott) reached 31 stages. On the other hand, pipelines are getting shallower, and the shallowest have only two stages.

III. PIPELINING OF SEVERAL RISC AND CISC PROCESSORS

A. ARM

The ARM core uses a RISC architecture. Most ARM processors support two instruction sets: the 32-bit ARM instruction set and the 16-bit Thumb instruction set.

The classic ARM processor series include ARM7, ARM9, and ARM11. ARM7 has a von Neumann architecture. As shown in Fig. 2, it uses a typical three-stage pipeline divided into fetch, decode, and execute. The execute stage does a lot of work, including reading and writing the registers and memory related to the operands, ALU operations, and data transfers between related devices. Therefore, it takes up multiple clock cycles.

Fig. 2. ARM7 pipeline.

As shown in Fig. 3, ARM9 has a Harvard architecture and uses a five-stage pipeline. After fetch, decode, and execute, ARM9 adds two stages: memory access and write-back. The memory access stage loads and stores the data specified by the instruction, and extracts and sign-extends data loaded by byte or halfword load instructions. However, memory access and write-back are only meaningful for load (LDR) and store (STR) instructions; other instructions do not need these two stages. At the same time, ARM9 moves register reading to the decode stage, which makes the work of the pipeline stages more balanced [11].

Fig. 3. ARM9 pipeline.

As shown in Fig. 4, ARM11 uses a scalar pipeline. At the back end of the pipeline there are three parallel units: ALU (arithmetic logic unit), MAC (multiply-accumulate), and LS (load/store). The LS pipeline is dedicated to memory access instructions, separating data access operations from arithmetic operations so that instructions execute more efficiently. Because different instructions require different execution times, when the three types of instructions are sent into the pipeline one after another they can execute simultaneously, allowing out-of-order execution [12].

Fig. 4. ARM11 pipeline.

With the ARM Cortex-A series, ARM introduced superscalar pipelines, enabling the processor to process more than one instruction in parallel in a cycle. Taking the Cortex-A8 as an example, it has a 13-stage integer pipeline (see Table 2) and a 10-stage NEON multimedia pipeline (see Table 3). For integer instructions, the Cortex-A8 is a dual-issue, in-order pipeline. Unlike previous ARM cores, which can process only one integer instruction at a time, the Cortex-A8 uses superscalar technology to issue two integer instructions together, and the two instructions are processed in parallel by two integer ALU pipelines in one cycle [14].

From the classic ARM series to the current Cortex series, the structure of ARM processors has become increasingly complex, but the relationship between the instruction being fetched and the PC address has not changed. That is, however many pipeline stages there are, the current PC position can be worked out from the behaviour of the original three-stage pipeline. The PC always points to the instruction being fetched, not to the instruction being executed or decoded. By convention, the instruction being executed is taken as the reference point and is called the current, first instruction; the PC therefore always points to the third instruction. During execution, when the processor is in ARM state, the PC always points to the address of the executing instruction + 8 bytes; when the processor is in Thumb state, the PC always points to the address of the executing instruction + 4 bytes. When a branch instruction is
executed or the PC is modified directly to branch, the ARM core flushes its pipeline. An instruction in the execute stage completes its execution even if an interrupt is raised.

TABLE II. CORTEX A INTEGER PIPELINE
13-Stage Integer Pipeline
F0, F1, F2: Instruction Fetch
D0, D1, D2, D3, D4: Instruction Decode with Dual-Issue
E0: Architectural Register File
E1, E2, E3, E4, E5: ALU/MUL Pipeline 0, ALU Pipeline 1, Load/Store Pipeline 0 or 1

TABLE III. CORTEX A NEON PIPELINE
10-Stage NEON Pipeline
Stages: M0, M1, M2, M3, N0, N1, N2, N3, N4, N5
Instruction blocks: NEON Instruction Queue, NEON Instruction Decode, NEON Instruction File
Execution pipes: Integer ALU Pipe, Integer MUL Pipe, Integer Shift Pipe, Non-IEEE FP Add Pipe, Non-IEEE FP Mul Pipe, IEEE FP Engine, Load/Store Permute Pipe

B. RISC-V

Rocket Core is an open source RISC-V processor core developed at Berkeley; it can be generated by Rocket Chip, an SoC generator also developed at Berkeley. RISC-V is an open source RISC instruction set first released by the University of California, Berkeley in 2010. Because Rocket Core was released by Berkeley alongside its development of the RISC-V architecture, it is one of the best-known open source RISC-V cores. As shown in Fig. 5, it uses an in-order five-stage pipeline [15].

Fig. 5. The Rocket Core pipeline.

Berkeley also developed BOOM, a superscalar processor core with out-of-order issue and out-of-order execution, which is likewise generated with Rocket Chip. As shown in Fig. 6, the BOOM pipeline is conceptually divided into 10 stages: Fetch, Decode, Register Rename, Dispatch, Issue, Register Read, Execute, Memory, Writeback, and Commit. In the actual implementation, however, Register Rename is split into two parts that are merged with Decode and Dispatch respectively; the Issue and Register Read stages are also merged; and Commit is asynchronous, so it is not counted as part of the pipeline. The 10 stages are therefore compressed into the 7-stage pipeline shown in the figure [16].

Fig. 6. The Boom Core pipeline.

C. LoongArch

Loongson initially used the MIPS instruction set, but in 2019 it started to design a completely independent new instruction set, LoongArch, achieving fully independent design from the instruction system to the CPU. LoongArch is a RISC architecture, and the Loongson 3A5000 (Godson 3A5000) is the first Loongson processor to implement the LoongArch architecture [17]. The basic pipeline has 12 stages: PC, fetch, predecode, decode I, decode II, register rename, scheduling, issue, register read, execute, commit I, and commit II. In the first stage, PC, the cache is accessed according to the PC value. In the second stage, the instructions are fetched once the cache lookup is complete. The third stage, predecode, parses the instructions and predicts the direction and target of branch instructions, so the PC value of subsequent instructions is predicted directly from the PC value used for fetching. If none of the fetched instructions is a branch predicted taken, fetching continues sequentially; if the fetched instructions include branches predicted taken, the PC is updated according to the earliest of them. The fourth and fifth stages, decode, take instructions from the instruction queue, translate the instruction code into an internal code convenient for the functional units, and identify the instruction type, the register numbers required, and any immediate operand. The sixth stage, register rename, supports out-of-order execution. The seventh stage, scheduling, allocates queue entries, that is, it selects a free entry in the relevant queue for each instruction. The eighth stage, issue, sends instructions from the issue queue to the corresponding functional units. The ninth stage, register read, reads the required operands. The tenth stage, execute, performs operand and address calculation. The 11th and 12th stages, commit I and commit II, handle instructions that have been renamed but not yet committed: after writeback, these instructions are committed in program order.

D. X86

The x86 architecture is a representative variable-length CISC instruction set architecture. CISC instructions vary in length and execution time. For the pipeline to complete one instruction per clock cycle, two things are required: first, integrating the cache into the pipeline; second, pipelining the instruction decoder into two stages to provide a sustained throughput of one instruction per clock while decoding the complex instruction formats of the CISC architecture.
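
The throughput argument above can be made concrete with an idealized timing model. The sketch below is not taken from the paper: the function name `total_cycles`, the stage latencies, and the "single two-cycle decoder" alternative are hypothetical, and the model assumes a linear in-order pipeline with no stalls, hazards, or buffering between stages. Under those assumptions the first instruction needs the sum of all stage latencies, and each later instruction completes one bottleneck-stage latency after its predecessor.

```python
def total_cycles(n_instructions, stage_latencies):
    """Cycles for n instructions to pass through a linear in-order pipeline.

    Idealized model (an assumption, not the paper's): each stage holds one
    instruction for its full latency and nothing else stalls the pipeline.
    """
    if n_instructions <= 0:
        return 0
    fill = sum(stage_latencies)          # time for the first instruction
    bottleneck = max(stage_latencies)    # slowest stage limits throughput
    return fill + (n_instructions - 1) * bottleneck


# Decoder split into two one-cycle stages: Fetch, D1, D2, EX, WB.
split_decode = [1, 1, 1, 1, 1]
# Hypothetical alternative: one decoder that is busy for two cycles.
single_decode = [1, 2, 1, 1]

n = 1000
print(total_cycles(n, split_decode))   # 1004, about 1 instruction per clock
print(total_cycles(n, single_decode))  # 2003, about 1 instruction per 2 clocks
```

Both configurations have the same five-cycle latency for a single instruction; only the split decoder sustains one instruction per clock, which is the property the two-stage decode is meant to provide.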

Intel's first pipeline was introduced in the i486 chip. Its five-stage pipeline consists of Fetch, D1, D2, EX, and WB. Fetch: fetch the instruction from the cache. Since an entire cache line is fetched at once, most instructions do not need this stage; on average, each fetch brings in about 5 instructions. D1: main instruction decode. Up to three instruction bytes can be decoded. This stage determines the operations to be performed in the D2 stage, determines the length of the instruction, and steers the instruction aligner/prefetch queue to the next instruction. D2: secondary instruction decode and memory address computation. Each clock decodes a memory displacement field of 1 to 4 bytes or an immediate constant of 1 to 4 bytes. If an instruction contains both a memory displacement and an immediate constant, decoding requires two D2 cycles [18].

[Figure: pipeline stages PF (fetch and align instruction), D1 (decode instruction, generate control word), D2 (decode control word, generate memory address), E, and WB (write result), with separate U pipe and V pipe columns.]
Fig. 7. Pentium superscalar pipeline.

The classic Pentium pipeline is similar to that of the i486, and each integer pipeline is divided into five stages. Compared with the i486 CPU's integer pipeline, the Pentium microprocessor integrates additional hardware to speed up instruction execution. For example, the i486 CPU needs two clocks to decode several instruction formats, whereas the Pentium CPU needs only one clock, and it executes shift and multiply instructions faster. More importantly, the Pentium processor adds a second independent integer pipeline, making it superscalar. The two pipelines can run in parallel, and each pipeline can have multiple instructions in different pipeline stages at the same time. Fig. 7 shows that the resources used for address generation and ALU functions are duplicated in two independent integer pipelines, called U and V. In the PF and D1 stages, the CPU can fetch and decode two simple instructions in parallel and send them to the U and V pipelines [19]. If possible, the first instruction is issued to the U pipeline and the second to the V pipeline. If not, the first instruction is issued to the U pipeline and no instruction is issued to the V pipeline. Instructions running in the two pipes have exactly the same effect as their sequential execution.

Unlike previous microprocessors, the Pentium 4 has a pipeline of at least 20 stages (see Table 4). In some cases, micro-operations require multiple execution stages, which leads to an even longer pipeline. As a whole, instruction processing in the Pentium 4 can be divided into four steps: 1) fetch instructions in program order; 2) translate each instruction into one or more fixed-length RISC-like micro-operations; 3) execute the micro-operations on a superscalar pipeline organization, possibly out of order; 4) commit the result of each micro-operation to the register set in the order of the original program flow [20].

TABLE IV. PENTIUM 4 PIPELINE
Stages 1-2: TC Next IP
Stages 3-4: TC Fetch
Stage 5: Drive
Stage 6: Alloc
Stages 7-8: Rename
Stage 9: Que
Stages 10-12: Sch
Stages 13-14: Disp
Stages 15-16: RF
Stage 17: Ex
Stage 18: Flgs
Stage 19: Br Ck
Stage 20: Drive
TC Next IP = pointer to the next instruction of the trace cache; TC Fetch = trace cache fetch; Alloc = allocate; Rename = register rename; Que = micro-operation queue; Sch = micro-operation scheduling; Disp = dispatch; RF = register file; Ex = execute; Flgs = flags; Br Ck = branch check

IV. CONCLUSION

Most of the new ISAs conceived in the past have been RISC. RISC technology has formed two technical styles: one is the superpipeline style, which deepens the traditional pipeline, and the other is the superscalar style, which allows multiple instructions to enter the pipeline per clock. Of course, many CPUs now use superscalar and superpipelining techniques together. However, x86, a CISC ISA, is still very popular. It converts x86 macro instructions into micro-operations (Intel's uops and AMD's ROPs). The use of uops (or ROPs) allows RISC-style execution cores to be used to implement superpipelined and superscalar designs. At present, the boundary between RISC and CISC is no longer so clear. RISC and CISC are gradually converging, so pipeline research on variable-length instruction sets and pipeline research on fixed-length instruction sets are equally important.

REFERENCES
[1] Patterson D A, Hennessy J L. Computer organization and design: the hardware/software interface 5th ed[J]. 2014.
[2] Blem E, Menon J, Sankaralingam K. Power struggles: Revisiting the RISC vs. CISC debate on contemporary ARM and x86 architectures[C]//2013 IEEE 19th International Symposium on High Performance Computer Architecture (HPCA). IEEE, 2013: 1-12.
[3] William S. COMPUTER ORGANIZATION AND ARCHITECTURE
[4] DESIGNING FOR PERFORMANCE EIGHTH EDITION[J]. 2010.
[5] Sutherland I E. Micropipelines[J]. Communications of the ACM, 1989, 32(6): 720-738.
[6] Kane G, Heinrich J. MIPS RISC architectures[M]. Prentice-Hall, Inc., 1992.
[7] Hennessy J L, Patterson D A. Computer architecture: a quantitative approach[J]. 2017.
[8] Shen J P, Lipasti M H. Modern processor design: fundamentals of superscalar processors[M]. Waveland Press, 2013.
[9] Omondi A R. The microarchitecture of pipelined and superscalar computers[M]. Springer Science & Business Media, 2013.
[10] Jouppi N P, Wall D W. Available instruction-level parallelism for superscalar and superpipelined machines[J]. ACM SIGARCH Computer Architecture News, 1989, 17(2): 272-282.
[11] Hartstein A, Puzak T R. The optimum pipeline depth for a microprocessor[J]. ACM SIGARCH Computer Architecture News, 2002, 30(2): 7-13.
[12] Sloss A, Symes D, Wright C. ARM system developer's guide: designing and optimizing system software[M]. Elsevier, 2004.
[13] Cormie D. The ARM11 microarchitecture[J]. Retrieved July, 2002, 21: 2004.
[14] Williamson D. Arm cortex-a8: A high-performance processor for low-power applications[M]//Unique Chips and Systems. CRC Press, 2018: 95-122.
[15] Asanovic K, Avizienis R, Bachrach J, et al. The rocket chip generator[J]. EECS Department, University of California, Berkeley, Tech. Rep. UCB/EECS-2016-17, 2016, 4.
[16] Asanovic K, Patterson D A, Celio C. The berkeley out-of-order machine (boom): An industry-competitive, synthesizable, parameterized risc-v processor[R]. University of California at Berkeley Berkeley United States, 2015.
[17] Loongson Technology Corp. Ltd. Loong3A5000 processor[OL] [2022-6-28] htts://www.loongson.cn/productShow/32
[18] Crawford J. The execution pipeline of the intel i486 cpu[C]//1990 Thirty-Fifth IEEE Computer Society International Conference on Intellectual Leverage. IEEE Computer Society, 1990: 254-258.
[19] Alpert D, Avnon D. Architecture of the Pentium microprocessor[J]. IEEE micro, 1993, 13(3): 11-21.
[20] Hinton G, Sager D, Upton M, et al. The microarchitecture of the Pentium® 4 processor[C]//Intel technology journal. 2001.
