
COMPUTER ORGANIZATION AND ARCHITECTURE

II Year / III Semester

UNIT III

TEXTBOOKS & REFERENCES
Text Books
 Carl Hamacher, Zvonko G. Vranesic, Safwat G. Zaky, Computer Organization, 5th edition, McGraw-Hill, 2014.
 David A. Patterson and John L. Hennessy, "Computer Organization and Design", Fifth edition, Morgan Kaufmann / Elsevier, 2014.
 Morris Mano, "Computer System Architecture", Prentice Hall of India, Third Edition, 2008.

Reference Books
 Morris Mano, "Computer System Architecture", Prentice Hall of India, Third Edition, 2008.
 John P. Hayes, Computer Architecture and Organisation, McGraw Hill, 2012.
 William Stallings, Computer Organization and Architecture, 7th edition, Prentice-Hall of India Pvt. Ltd., 2016.
 Vincent P. Heuring, Harry F. Jordan, "Computer System Architecture", Second Edition, Pearson Education, 2005.
 Govindarajalu, "Computer Architecture and Organization, Design Principles and Applications", First edition, Tata McGraw Hill, New Delhi, 2005.

Web References
 http://www.inetdaemon.com/tutorials/computers/hardware/cpu/
 https://inst.eecs.berkeley.edu/~cs152/sp18/
 http://users.ece.cmu.edu/~jhoe/doku/doku.php?id=18-447_introduction_to_computer_architecture
 
UNIT III
PROCESSOR AND CONTROL UNIT
 Fundamental concepts
 Execution of a complete instruction
 Multiple bus organization
 Hardwired control
 Microprogrammed control
 Pipelining: Basic concepts
 Data hazards
 Instruction hazards
 Influence on Instruction sets
 Data path and control consideration
 Parallel Processors.
Overview
 Central Processing Unit (CPU): fetching, decoding, and executing instructions.
 A typical computing task consists of a series of steps specified by a sequence of machine instructions that constitute a program.
 An instruction is executed by carrying out a sequence of more basic operations.
Some Fundamental Concepts
Fundamental Concepts
 The processor fetches one instruction at a time and performs the operation specified.
 Instructions are fetched from successive memory locations until a branch or a jump instruction is encountered.
 The processor keeps track of the address of the memory location containing the next instruction to be fetched using the Program Counter (PC).
 The fetched instruction is held in the Instruction Register (IR).
Executing an Instruction
 Fetch the contents of the memory location pointed
to by the PC. The contents of this location are
loaded into the IR (fetch phase).
IR ← [[PC]]
 Assuming that the memory is byte addressable and that each instruction occupies 4 bytes, increment the contents of the PC by 4 (fetch phase).
PC ← [PC] + 4
 Carry out the actions specified by the instruction in
the IR (execution phase).
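The fetch and execute phases above can be sketched as a minimal loop. This is an illustrative model, not any specific ISA: the dictionary memory, the 4-byte instruction width, and the idea of recording the IR in place of real execution are all assumptions.

```python
# Minimal sketch of the fetch/execute cycle described above.
# Memory is modeled as a dict mapping byte addresses to instruction
# words, one 4-byte instruction per word (an illustrative assumption).

WORD = 4  # bytes per instruction in byte-addressable memory

def fetch_execute(memory, pc, steps):
    """Run `steps` iterations of the fetch/execute cycle."""
    trace = []
    for _ in range(steps):
        ir = memory[pc]    # IR <- [[PC]]   (fetch phase)
        pc = pc + WORD     # PC <- [PC] + 4 (fetch phase)
        trace.append(ir)   # execution phase: here we just record the IR
    return pc, trace

memory = {0: "Load R1", 4: "Add R2", 8: "Store R3"}
pc, trace = fetch_execute(memory, 0, 3)
```

After three cycles the PC points past the last instruction, and the trace holds the instructions in fetch order.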
Executing an Instruction
 Transfer a word of data from one processor
register to another or to the ALU.
 Perform an arithmetic or a logic operation
and store the result in a processor register.
 Fetch the contents of a given memory
location and load them into a processor
register.
 Store a word of data from a processor
register into a given memory location.
Register Transfers
[Figure 7.2: The internal processor bus with input and output gating for each register Ri (Riin, Riout); register Y (Yin) and the constant 4 feed the MUX (Select input), which drives the ALU's A input; the bus drives the B input; the ALU result is gated into register Z (Zin, Zout).]

Figure 7.2. Input and output gating for the registers in Figure 7.1.

Register Transfers
[Figure 7.3: An edge-triggered D flip-flop holding one register bit; input gating (Riin) selects between the bus and the flip-flop's own output, and output gating (Riout) drives the bus.]

 All operations and data transfers are controlled by the processor clock.

Figure 7.3. Input and output gating for one register bit.
Performing an Arithmetic or Logic Operation
 The ALU is a combinational circuit that has no internal storage. It performs arithmetic and logic operations on the two operands applied to its A and B inputs.
 One of the operands is the output of the multiplexer MUX, and the other operand is obtained directly from the bus.
 The result produced by the ALU is stored temporarily in register Z. Therefore, the sequence of operations to add the contents of register R1 to those of register R2 and store the result in register R3 is:
1. R1out, Yin
2. R2out, SelectY, Add, Zin
3. Zout, R3in
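The effect of this single-bus control sequence can be sketched as a small simulation. The dictionary of registers, the Python variables standing in for the bus and the Y and Z latches, and the grouping of gating signals into comment lines are illustrative assumptions, not hardware.

```python
# Sketch: the single-bus control sequence computing R3 <- [R1] + [R2].
# Each commented group below corresponds to one control step; registers,
# the bus, and latches Y and Z are modeled as plain Python values.

def run_add_sequence(regs):
    # Step 1: R1out, Yin -- R1 drives the bus, Y latches the bus
    bus = regs["R1"]
    y = bus
    # Step 2: R2out, SelectY, Add, Zin -- ALU adds Y and the bus into Z
    bus = regs["R2"]
    z = y + bus
    # Step 3: Zout, R3in -- Z drives the bus, R3 latches the bus
    bus = z
    regs["R3"] = bus
    return regs

regs = run_add_sequence({"R1": 5, "R2": 7, "R3": 0})
```

Only one value travels over the bus per step, which is exactly why the single-bus sequences later in this unit are so long.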
Fetching a Word from Memory
 To fetch a word of information from memory, the processor has to specify the address of the memory location where this information is stored and request a Read operation.
 The processor transfers the required address to the
MAR, whose output is connected to the address lines of
the memory bus.
 At the same time, the processor uses the control lines
of the memory bus to indicate that a Read operation is
needed. When the requested data are received from
the memory they are stored in register MDR, from
where they can be transferred to other registers in the
processor.
Fetching a Word from Memory
 Register MDR has four control signals: MDRin and MDRout control the connection to the internal bus, and MDRinE and MDRoutE control the connection to the external bus.

 The input is selected when MDRin = 1


[Figure 7.4: Connection and control signals for register MDR. MDR sits between the memory-bus data lines, gated by MDRinE and MDRoutE, and the internal processor bus, gated by MDRin and MDRout.]
 Address into MAR; issue Read operation; data into MDR.

Figure 7.4. Connection and control signals for register MDR.


Fetching a Word from Memory
 The response time of each memory access varies (cache miss, memory-mapped I/O, …).
 To accommodate this, the processor waits until it receives an indication that the requested operation has been completed (Memory-Function-Completed, MFC).
 Move (R1), R2
 MAR ← [R1]
 Start a Read operation on the memory bus
 Wait for the MFC response from the memory
 Load MDR from the memory bus
 R2 ← [MDR]
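The Move (R1), R2 steps above can be sketched as a polling loop: place the address in MAR, start the Read, and wait for MFC before copying MDR into R2. The dictionary memory model and the fixed MFC latency are illustrative assumptions.

```python
# Sketch of Move (R1), R2: a memory read that waits for the MFC signal.
# Memory is a dict of addresses -> words; `latency` models how many
# cycles the memory takes to assert MFC (an assumption for illustration).

def memory_read(memory, regs, latency=3):
    mar = regs["R1"]             # MAR <- [R1]
    mfc = False                  # start a Read operation on the memory bus
    cycles = 0
    while not mfc:               # wait for the MFC response from the memory
        cycles += 1
        mfc = cycles >= latency  # memory signals completion after `latency` cycles
    mdr = memory[mar]            # load MDR from the memory bus
    regs["R2"] = mdr             # R2 <- [MDR]
    return regs, cycles

regs, waited = memory_read({100: 42}, {"R1": 100, "R2": 0})
```

The wait loop is the WMFC step that appears in the control sequences later in this unit.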
Fetching a Word from Memory
Timing
[Figure 7.5: Timing of a memory Read operation. MAR ← [R1] (MAR is assumed to remain valid on the address lines of the memory bus); start a Read operation on the memory bus; wait for the MFC response from the memory; load MDR from the memory bus; R2 ← [MDR].]
Storing a word in memory
 Writing a word into a memory location follows a similar procedure. The desired address is loaded into MAR. Then, the data to be written are loaded into MDR, and a Write command is issued.
Execution of a Complete Instruction
 Let us now put together the sequence of elementary operations required to execute one instruction. Consider the instruction

Add (R3),R1

 which adds the contents of the memory location pointed to by R3 to register R1.
Execution of a Complete Instruction
 Executing this instruction requires the following actions (on the single-bus organization of the datapath inside the processor, Figure 7.1):

Step Action

1 PCout, MARin, Read, Select4, Add, Zin
2 Zout, PCin, Yin, WMFC
3 MDRout, IRin
4 R3out, MARin, Read
5 R1out, Yin, WMFC
6 MDRout, SelectY, Add, Zin
7 Zout, R1in, End

Figure 7.6. Control sequence for execution of the instruction Add (R3), R1.
Branch Instructions
 A branch instruction replaces the contents of
PC with the branch target address, which is
usually obtained by adding an offset X given
in the branch instruction.
 The offset X is usually the difference between
the branch target address and the address
immediately following the branch instruction.
 Conditional branch
Branch Instructions
 Figure 7.7 gives a control sequence that implements an unconditional branch instruction. Processing starts, as usual, with the fetch phase.
 This phase ends when the instruction is loaded into the IR in step 3. The offset value is extracted from the IR by the instruction decoding circuit, which will also perform sign extension if required.
Execution of Branch Instructions
 Since the value of the updated PC is already available in register Y, the offset X is gated onto the bus in step 4, and an addition operation is performed. The result, which is the branch target address, is loaded into the PC in step 5.
Execution of Branch Instructions

Step Action

1 PCout, MARin, Read, Select4, Add, Zin
2 Zout, PCin, Yin, WMFC
3 MDRout, IRin
4 Offset-field-of-IRout, Add, Zin
5 Zout, PCin, End

Figure 7.7. Control sequence for an unconditional branch instruction.
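The branch-offset arithmetic described above can be checked with a few lines: the offset X stored in the instruction is relative to the updated PC, i.e. the address following the branch. The 4-byte instruction width and the concrete addresses are illustrative assumptions.

```python
# Sketch of the branch-target arithmetic: X = target - (address after
# the branch), so steps 4-5 of Figure 7.7 recover the target as PC + X.

def branch_target(branch_address, target_address, word=4):
    """Compute the offset X an assembler would store, then recover the target."""
    updated_pc = branch_address + word      # PC after fetching the branch
    offset_x = target_address - updated_pc  # X = target - updated PC
    recovered = updated_pc + offset_x       # what the datapath computes: PC <- [PC] + X
    return offset_x, recovered

x, target = branch_target(branch_address=1000, target_address=1100)
```

A backward branch simply yields a negative X, which is why the decoding circuit sign-extends the offset.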


Multiple-Bus Organization
 We used the simple single-bus structure of Figure 7.1 to illustrate the basic ideas.
 The resulting control sequences in Figures 7.6 and 7.7 are quite long because only one data item can be transferred over the bus in a clock cycle.
 To reduce the number of steps needed, most commercial processors provide multiple internal paths that enable several transfers to take place in parallel.
Multiple-Bus Organization
 Consider a three-bus structure used to connect the registers and the ALU of a processor.
 All general-purpose registers are combined into a single block called the register file.
 In VLSI technology, the most efficient way to implement a number of registers is in the form of an array of memory cells similar to those used in the implementation of random-access memories (RAMs).
Multiple-Bus Organization
 The register file has two output ports, allowing the contents of two different registers to be accessed simultaneously and placed on buses A and B.
 A third port allows the data on bus C to be loaded into a third register during the same clock cycle.
Multiple-Bus Organization
 Buses A and B are used to transfer the source operands to the A and B inputs of the ALU, where an arithmetic or logic operation may be performed. The result is transferred to the destination over bus C.
 If needed, the ALU may simply pass one of its two input operands unmodified to bus C. We will call the ALU control signals for such an operation R=A or R=B. The three-bus arrangement obviates the need for registers Y and Z.
Multiple-Bus Organization

 .
In stru ctio n
deco der

IR

MDR

MAR

Memory b us Addres
datalin es lin es

Fi gur e 7.Thr8. e e-usb gani


or zat i on of t he dat apat h.

Multiple-Bus Organization
 Add R4, R5, R6

Step Action

1 PCout, R=B, MARin, Read, IncPC
2 WMFC
3 MDRoutB, R=B, IRin
4 R4outA, R5outB, SelectA, Add, R6in, End

Figure 7.9. Control sequence for the instruction Add R4,R5,R6, for the three-bus organization in Figure 7.8.
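The saving can be quantified from the two figures: the single-bus Add of Figure 7.6 takes 7 control steps, while the three-bus Add of Figure 7.9 takes 4. The comparison below assumes equal-length steps (a simplification: the two instructions differ, since one has a memory operand, and WMFC steps can take longer in practice).

```python
# Sketch: step counts for an Add instruction on the two organizations.
# Equal step times are assumed purely for illustration.

single_bus_steps = 7  # Figure 7.6: Add (R3),R1 on the single-bus datapath
three_bus_steps = 4   # Figure 7.9: Add R4,R5,R6 on the three-bus datapath

speedup = single_bus_steps / three_bus_steps
```

Even under these simplifying assumptions, the parallel transfers on buses A, B, and C cut the step count almost in half.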
Hardwired Control
Hardwired Control
 In the hardwired control organization, the control logic is implemented with gates, flip-flops, decoders, and other digital circuits.
 The following image shows the block diagram of a hardwired control organization.
Hardwired Control
 A hardwired control unit consists of two decoders, a sequence counter, and a number of logic gates.
 An instruction fetched from the memory unit is placed in the instruction register (IR).
 The instruction register holds the I bit, the operation code, and bits 0 through 11.
 The operation code in bits 12 through 14 is decoded with a 3 x 8 decoder.
Hardwired Control
 The outputs of the decoder are designated by the symbols D0 through D7.
 Bit 15 of the instruction is transferred to a flip-flop designated by the symbol I.
 Bits 0 through 11 of the instruction are applied to the control logic gates.
 The sequence counter (SC) can count in binary from 0 through 15.
Hardwired Control
 To execute instructions, the processor must have some means of generating the control signals needed in the proper sequence.
 There are two categories: hardwired control and microprogrammed control.
 A hardwired system can operate at high speed, but with little flexibility.
Hardwired Control
Control Unit Organization
[Figure 7.10: The clock (CLK) drives a control step counter; the step counter, the IR, external inputs, and condition codes feed a decoder/encoder block that generates the control signals.]

Figure 7.10. Control unit organization.


Microprogrammed Control
Microprogrammed Control

 The microprogrammed control organization is implemented using a programming approach (i.e., control signals are generated with the help of programs/software).
 In microprogrammed control, the micro-operations are performed by executing a program consisting of micro-instructions.
 It acts as an intermediate layer between hardware and software.
Microprogrammed Control

 It is easy to design, test, and implement, and it is also flexible to modify.
 The following image shows the block diagram of a microprogrammed control organization.
Microprogrammed Control
Microprogrammed Control

 The control memory address register specifies the address of the micro-instruction.
 The control memory is assumed to be a ROM, within which all control information is permanently stored.
 The control register holds the microinstruction fetched from the memory.
Microprogrammed Control
 The micro-instruction contains a control word that specifies one or more micro-operations for the data processor.
 While the micro-operations are being executed, the next address is computed in the next address generator circuit and then transferred into the control address register to read the next microinstruction.
 The next address generator is often referred to as a micro-program sequencer, as it determines the address sequence that is read from control memory.
Microprogrammed Control

Micro programmed control unit consists of

 Control signal
 Control variables
 Control word
 Control memory
 Micro instructions
 Micro programs
Microprogrammed Control

 Control signal: a group of bits used to select the data path (e.g., whether to route through the multiplexer or the ALU)
 Control variable: a binary variable that specifies a micro-operation
 Control word: a string of ones and zeros representing the control variables
 Control memory: the memory that contains the control words
 Micro-instructions: specify the execution of micro-operations
 Micro-programs: sequences of micro-instructions
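The relationship between these terms can be sketched concretely: a control word is a bit string, each bit asserting one control signal, and a micro-program is a sequence of such words in control memory. The 8-bit width and the particular signal names below are illustrative assumptions, not the format of any real machine.

```python
# Sketch: decoding a control word into the control signals it asserts.
# Bit position i of the word corresponds to SIGNALS[i] (an assumption
# made purely for illustration).

SIGNALS = ["PCout", "MARin", "Read", "Select4", "Add", "Zin", "Zout", "PCin"]

def decode_control_word(word):
    """Return the names of the control signals asserted in `word`."""
    return [name for bit, name in enumerate(SIGNALS) if word & (1 << bit)]

# A micro-program is then just a list of control words in control memory.
control_memory = [
    0b00111111,  # asserts PCout, MARin, Read, Select4, Add, Zin
    0b11000000,  # asserts Zout, PCin
]
step0 = decode_control_word(control_memory[0])
```

Reading successive words out of `control_memory` and decoding them is, in miniature, what the control address register and control register do.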
Microprogrammed Control
 The previous organization cannot handle the situation when the control unit is required to check the status of the condition codes or external inputs to choose between alternative courses of action.
 Use a conditional branch microinstruction.
Address Microinstruction

0 PCout, MARin, Read, Select4, Add, Zin
1 Zout, PCin, Yin, WMFC
2 MDRout, IRin
3 Branch to starting address of appropriate microroutine
...
25 If N=0, then branch to microinstruction 0
26 Offset-field-of-IRout, SelectY, Add, Zin
27 Zout, PCin, End

Figure 7.17. Microroutine for the instruction Branch<0.


Pipelining
Overview
 Pipelining is a technique where multiple instructions are overlapped during execution.
 It allows storing and executing instructions in an orderly process. It is also known as pipeline processing.
 A pipeline is divided into stages, and these stages are connected with one another to form a pipe-like structure.
 Instructions enter from one end and exit from the other end.
Overview
 Pipelining increases the overall instruction throughput.
 In a pipelined system, each segment consists of an input register followed by a combinational circuit.
 The register is used to hold data, and the combinational circuit performs operations on it. The output of the combinational circuit is applied to the input register of the next segment.
Overview
 Pipelining is widely used in modern
processors.
 Pipelining improves system performance in
terms of throughput.
 Pipelined organization requires sophisticated
compilation techniques.
Overview
Basic Concepts
Making the Execution of Programs Faster
 Use faster circuit technology to build the
processor and the main memory.
 Arrange the hardware so that more than one
operation can be performed at the same time.
 In the latter way, the number of operations
performed per second is increased even
though the elapsed time needed to perform
any one operation is not changed.
Traditional Pipeline Concept

 Laundry Example
 Ann, Brian, Cathy, and Dave each have one load of clothes to wash, dry, and fold.
 Washer takes 30 minutes
 Dryer takes 40 minutes
 "Folder" takes 20 minutes


Traditional Pipeline Concept
[Figure: Sequential laundry timeline, 6 PM to midnight; each load occupies its 30 + 40 + 20 minutes back-to-back before the next load starts.]
 Sequential laundry takes 6 hours for 4 loads
 If they learned pipelining, how long would laundry take?
Traditional Pipeline Concept
[Figure: Pipelined laundry timeline; after the first wash, a new load finishes every 40 minutes.]
 Pipelined laundry takes 3.5 hours for 4 loads
Traditional Pipeline Concept
 Pipelining doesn't help the latency of a single task; it helps the throughput of the entire workload
 Pipeline rate is limited by the slowest pipeline stage
 Multiple tasks operate simultaneously using different resources
 Potential speedup = number of pipe stages
 Unbalanced lengths of pipe stages reduce speedup
 Time to "fill" the pipeline and time to "drain" it reduces speedup
 Stall for dependences
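The laundry numbers above can be checked directly: sequentially, each of the 4 loads takes the full 90 minutes; pipelined, the 40-minute dryer (the slowest stage) sets the rate once the pipe is full.

```python
# Sketch of the laundry timings: wash 30, dry 40, fold 20 minutes.

WASH, DRY, FOLD = 30, 40, 20
loads = 4

# Sequential: every load runs all three stages back-to-back.
sequential = loads * (WASH + DRY + FOLD)   # minutes

# Pipelined: the first wash fills the pipe, then one load completes
# per dryer slot, and the last fold drains the pipe.
pipelined = WASH + loads * DRY + FOLD      # minutes
```

This reproduces the 6 hours (360 minutes) versus 3.5 hours (210 minutes) quoted in the slides, and shows why the unbalanced 40-minute stage limits the speedup.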
Use the Idea of Pipelining in a Computer
Fetch + Execution
[Figure 8.1: Basic idea of instruction pipelining. (a) Sequential execution: each instruction completes its fetch (F) and execute (E) steps before the next begins. (b) Hardware organization: an instruction fetch unit and an execution unit separated by interstage buffer B1. (c) Pipelined execution: while I1 executes, I2 is fetched, so F2 overlaps E1, and so on.]

Figure 8.1. Basic idea of instruction pipelining.

Use the Idea of Pipelining in a Computer
Fetch + Decode + Execute + Write
[Figure 8.2: A 4-stage pipeline.]
Role of Cache Memory
 Each pipeline stage is expected to complete in one clock cycle.
 The clock period should be long enough to let the slowest pipeline stage complete.
 Faster stages can only wait for the slowest one to complete.
 Since main memory is very slow compared to execution, if each instruction had to be fetched from main memory, the pipeline would be almost useless.
 Fortunately, we have cache.
Pipeline Performance
 The potential increase in performance
resulting from pipelining is proportional to the
number of pipeline stages.
 However, this increase would be achieved
only if all pipeline stages require the same
time to complete, and there is no interruption
throughout program execution.
 Unfortunately, this is not true.
Pipeline Performance
 The previous pipeline is said to have been stalled for two clock cycles.
 Any condition that causes a pipeline to stall is called a hazard.
 Data hazard: any condition in which either the source or the destination operands of an instruction are not available at the time expected in the pipeline. Some operation has to be delayed, and the pipeline stalls.
 Instruction (control) hazard: a delay in the availability of an instruction causes the pipeline to stall.
 Structural hazard: the situation when two instructions require the use of a given hardware resource at the same time.
Pipeline Performance
 Again, pipelining does not result in individual instructions being executed faster; rather, it is the throughput that increases.
 Throughput is measured by the rate at which instruction execution is completed.
 A pipeline stall causes degradation in pipeline performance.
 We need to identify all hazards that may cause the pipeline to stall and find ways to minimize their impact.
Data Hazards
Hazards
Hazards
 Hazards are problems with the instruction pipeline in CPU microarchitectures that arise when the next instruction cannot execute in its designated clock cycle; they can potentially lead to incorrect computation results.
 Three common types of hazards are data hazards, structural hazards, and control hazards (branching hazards).
Data Hazards
 Data hazards occur when an instruction depends on the result of a previous instruction and that result has not yet been computed. Whenever two different instructions use the same storage location, that location must appear as if the instructions executed in sequential order.
 There are four types of data dependencies: Read after Write (RAW), Write after Read (WAR), Write after Write (WAW), and Read after Read (RAR). These are explained as follows below.
Data Hazards
 Read after Write (RAW):
It is also known as a true dependency or flow dependency. It occurs when a value produced by an instruction is required by a subsequent instruction. For example,
ADD R1, --, --; SUB --, R1, --;

 Write after Read (WAR):
It is also known as an anti dependency. These hazards occur when an instruction writes to a register soon after that register is read by a previous instruction. For example,
ADD --, R1, --; SUB R1, --, --;
Data Hazards
 Write after Write (WAW):
It is also known as an output dependency. These hazards occur when an instruction writes to a register soon after that register was written by a previous instruction. For example,
ADD R1, --, --; SUB R1, --, --;

 Read after Read (RAR):
It occurs when two instructions both read from the same register. For example,
ADD --, R1, --; SUB --, R1, --;
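The four dependency types above can be sketched as a classification over register sets. Modeling an instruction as a pair of (written, read) register-name sets is an illustrative assumption; the function simply mirrors the definitions just given.

```python
# Sketch: classify the dependencies of a second instruction on a first,
# given each instruction's written and read register sets.

def classify(first, second):
    """Return the dependency types of `second` on `first`."""
    w1, r1 = first    # registers written / read by the first instruction
    w2, r2 = second   # registers written / read by the second instruction
    deps = []
    if w1 & r2:
        deps.append("RAW")  # second reads what first writes (true dependency)
    if r1 & w2:
        deps.append("WAR")  # second writes what first reads (anti dependency)
    if w1 & w2:
        deps.append("WAW")  # both write the same register (output dependency)
    if r1 & r2:
        deps.append("RAR")  # both read the same register
    return deps

# ADD R1, --, -- ; SUB --, R1, --  ->  true (RAW) dependency
raw = classify(({"R1"}, set()), (set(), {"R1"}))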
Data Hazards
 Handling Data Hazards:
There are various methods used to handle data hazards: forwarding, code reordering, and stall insertion.
These are explained as follows below.
 Forwarding:
It adds special circuitry to the pipeline. This method works because it takes less time for the required values to travel through a wire than it does for a pipeline segment to compute its result.
Data Hazards
 Code reordering:
We need a special type of software to reorder code. We call this type of software a hardware-dependent compiler.
 Stall insertion:
It inserts one or more stalls (no-op instructions) into the pipeline, which delays the execution of the current instruction until the required operand is written to the register file; this method decreases pipeline efficiency and throughput.
Data Hazards
 We must ensure that the results obtained when instructions are
executed in a pipelined processor are identical to those obtained
when the same instructions are executed sequentially.
 Hazard occurs
A←3+A
B←4×A
 No hazard
A←5×C
B ← 20 + C
 When two operations depend on each other, they must be
executed sequentially in the correct order.
 Another example:
Mul R2, R3, R4
Add R5, R4, R6
Operand Forwarding
 Instead of from the register file, the second
instruction can get data directly from the out-
put of ALU after the previous instruction is
completed.
 A special arrangement needs to be made to
“forward” the output of ALU to the input of
ALU.
Handling Data Hazards in Software
 Let the compiler detect and handle the
hazard:
I1: Mul R2, R3, R4
NOP
NOP
I2: Add R5, R4, R6
 The compiler can reorder the instructions to
perform some useful work during the NOP
slots.
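The compiler pass sketched in the example above can be written out: scan adjacent instructions and pad with NOPs when one reads a register the previous one writes. The two-slot delay, the tuple encoding of instructions, and the `insert_nops` helper name are illustrative assumptions.

```python
# Sketch of compiler NOP insertion for RAW dependences between adjacent
# instructions. An instruction is modeled as (op, dest, sources).

def insert_nops(program, delay=2):
    """Pad `program` with `delay` NOPs before any instruction that
    reads the destination of the immediately preceding instruction."""
    out = []
    for i, (op, dest, srcs) in enumerate(program):
        if i > 0:
            prev_dest = program[i - 1][1]
            if prev_dest in srcs:                 # RAW dependence on previous result
                out.extend([("NOP", None, ())] * delay)
        out.append((op, dest, srcs))
    return out

# I1: Mul R2, R3, R4 (writes R4); I2: Add R5, R4, R6 (reads R4)
prog = [("Mul", "R4", ("R2", "R3")), ("Add", "R6", ("R5", "R4"))]
padded = insert_nops(prog)
```

A smarter compiler would try to fill those two slots with independent instructions instead of NOPs, which is exactly the reordering the text describes.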
Side Effects
 The previous example is explicit and easily detected.
 Sometimes an instruction changes the contents of a register
other than the one named as the destination.
 When a location other than one explicitly named in an instruction
as a destination operand is affected, the instruction is said to
have a side effect. (Example?)
 Example: conditional code flags:
Add R1, R3
AddWithCarry R2, R4
 Instructions designed for execution on pipelined hardware should
have few side effects.
Instruction Hazards
Instruction Hazards
 The instruction fetch unit of the CPU is responsible for
providing a stream of instructions to the execution unit. The
instructions fetched by the fetch unit are in consecutive
memory locations and they are executed.
 However the problem arises when one of the instructions is
a branching instruction to some other memory location.
Thus all the instruction fetched in the pipeline from consecu-
tive memory locations are invalid now and need to
removed(also called flushing of the pipeline).This induces a
stall till new instructions are again fetched from the mem-
ory address specified in the branch instruction
Instruction Hazards
 The effect of branch instructions, and the techniques that can be used for mitigating their impact, are discussed for unconditional branches and conditional branches.
[Figure 8.8: An idle cycle caused by a branch instruction. Instruction I3 is fetched (F3) but discarded (X); execution resumes with the branch target Ik, followed by Ik+1.]

Unconditional Branches
 Unconditional branch instructions such as
GOTO are used to unconditionally
"jump" to (begin execution of) a different
instruction sequence.
 In programming, a GOTO, BRANCH or
JUMP instruction that passes control to a
different part of the program.
Instruction Queue and Prefetching
[Figure 8.10: The instruction fetch unit (F: fetch instruction) fills an instruction queue; a dispatch/decode unit (D) issues instructions to the execute (E) and write-results (W) stages.]

Figure 8.10. Use of an instruction queue in the hardware organization of Figure 8.2b.
Conditional Branches
 A conditional branch instruction introduces
the added hazard caused by the dependency
of the branch condition on the result of a
preceding instruction.
 The decision to branch cannot be made until
the execution of that instruction has been
completed.
 Branch instructions represent about 20% of
the dynamic instruction count of most
programs.
Conditional Branches
 A conditional branch is a programming instruction that directs the computer to another part of the program based on the result of a comparison.
 A conditional branch happens based on some condition, like an if condition in C. Transfer of control of the program depends on the outcome of this condition.
Delayed Branch
 The instructions in the delay slots are always fetched. Therefore, we would like to arrange for them to be fully executed whether or not the branch is taken.
 The objective is to place useful instructions in these slots.
 The effectiveness of the delayed branch approach depends on how often it is possible to reorder instructions.
Delayed Branch
LOOP Shift_left R1
Decrement R2
Branch=0 LOOP
NEXT Add R1,R3

(a) Original program loop

LOOP Decrement R2
Branch=0 LOOP
Shift_left R1
NEXT Add R1,R3

(b) Reordered instructions

Figure 8.12. Reordering of instructions for a delayed branch.


Branch Prediction
 To predict whether or not a particular branch will be taken.
 Simplest form: assume branch will not take place and continue to
fetch instructions in sequential address order.
 Until the branch is evaluated, instruction execution along the
predicted path must be done on a speculative basis.
 Speculative execution: instructions are executed before the processor is certain that they are in the correct execution sequence.
 Need to be careful so that no processor registers or memory
locations are updated until it is confirmed that these instructions
should indeed be executed.
Branch Prediction
 Better performance can be achieved if we arrange
for some branch instructions to be predicted as
taken and others as not taken.
 Use hardware to observe whether the target
address is lower or higher than that of the branch
instruction.
 Let compiler include a branch prediction bit.
 So far the branch prediction decision is always the
same every time a given instruction is executed –
static branch prediction.
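The cost of a static prediction policy can be estimated with the figure given earlier, that branches are about 20% of the dynamic instruction count. The taken fraction and the misprediction penalty below are illustrative assumptions, not measurements.

```python
# Sketch: effective CPI under static predict-not-taken. Branch frequency
# comes from the text; the taken fraction and penalty are assumptions.

branch_fraction = 0.20  # of all executed instructions (from the text)
taken_fraction = 0.60   # assumed fraction of branches that are taken
penalty = 2             # assumed stall cycles per mispredicted branch

# Predict-not-taken mispredicts exactly the taken branches.
effective_cpi = 1 + branch_fraction * taken_fraction * penalty
```

Under these assumptions the pipeline loses about a quarter of a cycle per instruction, which is why predicting some branches as taken, or predicting dynamically, pays off.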
Influence on Instruction Sets
Overview
 The instruction set provides commands to the
processor, to tell it what it needs to do
 Some instructions are much better suited to
pipeline execution than others.
 Addressing modes
 Conditional code flags
Addressing Modes
 Addressing Modes specify the location of an
operand. 
 Addressing modes include simple ones and
complex ones.
 In choosing the addressing modes to be
implemented in a pipelined processor, we
must consider the effect of each addressing
mode on instruction flow in the pipeline:
 Side effects
 The extent to which complex addressing modes cause
the pipeline to stall
 Whether a given mode is likely to be used by compilers
Simple Addressing Mode
Complex Addressing Mode
Addressing Modes
 In a pipelined processor, complex addressing modes do not necessarily lead to faster execution.
 Advantage: reducing the number of instructions / program space
 Disadvantage: cause the pipeline to stall / more hardware to decode / not convenient for compilers to work with
 Conclusion: complex addressing modes are not suitable for pipelined execution.
Addressing Modes
 Good addressing modes should have:
 Access to an operand does not require more than one
access to the memory
 Only load and store instruction access memory operands
 The addressing modes used do not have side effects
 Register, register indirect, index
Condition Codes
 If an optimizing compiler attempts to reorder instructions to avoid stalling the pipeline when branches or data dependencies between successive instructions occur, it must ensure that reordering does not cause a change in the outcome of a computation.
 The dependency introduced by the condition-code flags reduces the flexibility available for the compiler to reorder instructions.
Condition Codes
Add R1,R2
Compare R3,R4
Branch=0 ...

(a) A program fragment

Compare R3,R4
Add R1,R2
Branch=0 ...

(b) Instructions reordered


Figure 8.17. Instruction reordering.
Condition Codes
 Two conclusions:
 To provide flexibility in reordering instructions, the condition-code flags should be affected by as few instructions as possible.
 The compiler should be able to specify in which instructions of a program the condition codes are affected and in which they are not.
Superscalar Operation
Overview
 A more aggressive approach is to equip the processor with multiple processing units to handle several instructions in parallel in each processing stage.
 With this arrangement, several instructions start execution in the same clock cycle, and the process is said to use multiple issue.
 Such processors are capable of achieving an instruction execution throughput of more than one instruction per cycle. They are known as 'Superscalar Processors'.
Figure 8.19. A processor with two execution units.

Superscalar
 The maximum throughput of a pipelined processor is
one instruction per clock cycle.
 If we equip the processor with multiple processing
units to handle several instructions in parallel in each
processing stage, several instructions start
execution in the same clock cycle – multiple-issue.
 Processors are capable of achieving an instruction
execution throughput of more than one instruction per
cycle – superscalar processors.
 Multiple-issue requires a wider path to the cache and
multiple execution units.

Superscalar
Advantages of Superscalar Architecture : 
 The compiler can avoid many hazards through
judicious selection and ordering of instructions.
 The compiler should strive to interleave floating
point and integer instructions. This would enable
the dispatch unit to keep both the integer and
floating point units busy most of the time.
 In general, high performance is achieved if the
compiler is able to arrange program instructions
to take maximum advantage of the available
hardware units.

Superscalar
Disadvantages of Superscalar Architecture : 
 In a Superscalar Processor, the detrimental
effect on performance of various hazards
becomes even more pronounced.
 Due to this type of architecture, problem in
scheduling can occur.

Very long instruction word (VLIW)
 Very long instruction word (VLIW) describes a computer processing architecture in which a language compiler or pre-processor breaks program instructions down into basic operations that can be performed by the processor in parallel (that is, at the same time).
 It is appropriate for performing more than one basic (primitive) instruction at a time.

Very long instruction word (VLIW)
 These processors contain multiple functional units; they fetch from the instruction cache a very long instruction word containing several primitive instructions, and dispatch the whole VLIW for parallel execution.
Very long instruction word (VLIW)

Very long instruction word (VLIW)
Advantages of VLIW architecture
 It can improve performance.
 It is potentially scalable.
 More execution units can be added, and so more operations can be packed into the VLIW instruction.

Very long instruction word (VLIW)
Disadvantages of VLIW architecture
 New programmer and compiler support is required.
 The compiler must keep track of instruction scheduling.
 It can increase memory use.
 It can lead to high power consumption.


Parallel processing
Parallel processing is a method in computing of running two or more processors (CPUs) to handle separate parts of an overall task. Any system that has more than one CPU can perform parallel processing, as can the multi-core processors commonly found in computers today.

Parallel processing

Vector Processing
Vector processing is the processing of sequences of data in a uniform manner, a common occurrence in the manipulation of matrices (whose elements are vectors) or other arrays of data. A vector processor will process sequences of input data as a result of obeying a single vector instruction and generate a result data sequence.