
18PD08 / Computer Architecture

Presented By
Dr T Ravichandran
Professor
Department of Computer Science and Engineering
MODULE 2 (PART 3)

PIPELINING
• Pipelining – Basic concepts, Pipeline organization, Pipelining issues
• Data dependencies
• Memory delays
• Branch delays
• Resource limitations
• Performance evaluation
• Superscalar operation
• Pipelining in CISC processors
Pipelining – Basic concepts, Pipeline organization, Pipelining issues
Understanding Pipelining
Pipelining in computer architecture implements a form of parallelism in instruction execution: a pipelined processor works on multiple instructions at the same time, each in a different stage of completion.
Pipelining - Basic Concepts

There are two ways to improve execution speed:

1. Use faster circuit technology in the implementation
2. Arrange the hardware so that more than one operation is performed at the same time

Advantages:
• More efficient use of the processor
• Faster execution of a large number of instructions
Disadvantages:
• Pipelining involves adding hardware to the chip
• The pipeline cannot always run at full speed, because pipeline hazards disrupt its smooth execution
Pipelining -Organisation
Pipelining creates and organizes a pipeline of instructions that the processor can execute in parallel, improving efficiency. The pipeline is divided into logical stages connected to one another to form a pipe-like structure. Instructions enter at one end, progress through the stages, and exit at the other end.

• In the first stage of the pipeline, the program counter (PC) is used to fetch a new instruction.

• At any given time, each stage of the pipeline is processing a different instruction

Interstage buffers
• Information such as register addresses, immediate data, and the operations to be performed must be carried
through the pipeline.

• Interstage Registers RA, RB, RM, RY, and RZ


Pipelining - Organisation

Interstage buffers – uses

• Buffer B1: feeds the Decode stage with a newly fetched instruction.
• Buffer B2: feeds the Compute stage with the two operands, the immediate value, and the incremented PC value used as the return address.
• Buffer B3: feeds the Memory stage with the result of the ALU operation.
• Buffer B4: feeds the Write stage with the value to be written into the register file.

Pipelining issues

Consider the two instructions:

Add R2, R3, #100        (Ij)
Subtract R9, R2, #30    (Ij+1)

• The result of Ij is not written into the register file until cycle 5, while Ij+1 would read R2 in cycle 3.
• If execution proceeds without waiting, the result will be incorrect.
• Therefore Ij+1 must wait until Ij writes its result; that is, Ij+1 must be stalled.
• Subsequent instructions cannot enter the pipeline, which increases execution time.
Pipelining: Pipelining Issues

Any condition that causes the pipeline to stall is called a HAZARD.

Hazards occur due to:

• Unavailability of data (DATA HAZARD)
• Memory delay (MEMORY HAZARD)
• Branch instructions (BRANCH HAZARD)
Pipelining – Unavailability of data (DATA HAZARD)

Consider the two instructions:

Add R2, R3, #100        (Ij)
Subtract R9, R2, #30    (Ij+1)

Clock cycle        1   2   3   4   5   6   7   8   9
Ij   (Add)         F   D   C   M   W
Ij+1 (Subtract)        F   -   -   -   D   C   M   W

• The Subtract instruction Ij+1 is stalled during cycles 3 to 5, until Ij writes its result in cycle 5; Ij+1 then decodes in cycle 6.
• For each stalled cycle, a NOP is sent into the interstage buffer, creating one clock cycle of idle time (a bubble) as it passes down the pipeline.
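The stall arithmetic in the diagram can be checked with a short sketch (illustrative Python, not part of the slides): in this five-stage pipeline, an instruction issued in cycle t completes its Write stage in cycle t + stalls + 4.

```python
# Sketch (illustrative): completion cycle of an instruction in a
# 5-stage pipeline, with any stall cycles added.
STAGES = 5

def write_cycle(issue_cycle, stalls=0):
    """Cycle in which the instruction completes its Write stage."""
    return issue_cycle + stalls + STAGES - 1

print(write_cycle(1))            # Ij: issued in cycle 1, writes in cycle 5
print(write_cycle(2, stalls=3))  # Ij+1: 3-cycle stall, writes in cycle 9
```

This matches the table above: Ij writes in cycle 5, and the stalled Ij+1 writes in cycle 9.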
Pipelining – Unavailability of data (DATA HAZARD)

Data Dependencies – Operand Forwarding

• This approach is one solution to overcome data dependency.
• It provides the value of the operand to the stage that needs it, before the operand is stored in the register file.

Clock cycle   1   2   3   4   5   6
Ij            F   D   C   M   W
Ij+1              F   D   C   M   W

• The forwarding paths use the interstage registers (RA, RB, RY, RZ).
• The result of the Compute (ALU) stage is passed directly to the stage that needs it, so no stall is required.
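A minimal sketch of the forwarding decision (illustrative Python; the function and the register-pair representation are assumptions, not the slides' hardware): if a source register of the current instruction matches the destination of an instruction still in the pipeline, the value is taken from the interstage register instead of the stale register file.

```python
# Sketch (illustrative) of operand selection with forwarding.
def select_operand(src_reg, reg_file, rz=None, ry=None):
    """rz/ry are (dest_reg, value) pairs held in interstage registers."""
    if rz and rz[0] == src_reg:
        return rz[1]          # forward the just-computed ALU result
    if ry and ry[0] == src_reg:
        return ry[1]          # forward from one stage further down
    return reg_file[src_reg]  # no hazard: read the register file

regs = {"R2": 0, "R3": 50}
# Add R2, R3, #100 has just computed 150 into RZ; Subtract reads R2.
val = select_operand("R2", regs, rz=("R2", 150))
print(val)  # 150, forwarded instead of the stale register-file value 0
```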
DATA DEPENDENCY – Unavailability of data (DATA HAZARD)

• The task of detecting data dependencies can be left to the compiler.
• The compiler identifies a data dependency between two successive instructions Ij and Ij+1 and inserts three explicit NOP (no-operation) instructions between them.
• The NOPs introduce the necessary delay.
• This simplifies the hardware implementation, but increases code size.
• Reordering: the compiler can instead place useful, independent instructions in place of the NOPs.


Pipelining – Memory delays

A delay arises when an instruction's memory access causes the pipeline to stall.

Example:
Load R2, (R3)           (Ij)
Subtract R9, R2, #30    (Ij+1)

• Ij may require more than one clock cycle to obtain its operand (the memory word whose address is in R3).
• The operand may not be found in the cache – a CACHE MISS.
• This causes all subsequent instructions to stall.
• If there is also a data dependency, as with Ij+1 above, an additional one-cycle stall occurs even on a cache hit, because the loaded value is only available after the Memory stage.

The compiler can eliminate the one-cycle data-dependency stall by:

• Reordering – inserting between Ij and Ij+1 a useful instruction that does not depend on the Load.
• Inserting a NOP if no useful instruction is found.

Example: operand forwarding from the Memory stage supplies the loaded value directly to the dependent instruction.
Pipelining – Branch Delay (Control Hazard)

A branch delay occurs for control-transfer instructions such as BRANCH, CALL, and JUMP.

• The pipeline begins fetching the instructions that follow the branch before the target address has been calculated; the wrongly fetched instructions must be discarded, causing delay.
• Consider a sequence in which Ij is Add R8, R8, R9 and Ij+1 is Jump LOOP. The next instruction, Ij+2 (Subtract R8, R5, #8), is fetched in the following cycle; once the jump's target address is known, Ij+2 is discarded and replaced by a NOP, creating a one-cycle bubble. Fetching then resumes at the branch target Ij+k.

Pipeline execution order: Ij → Ij+1 → NOP → Ij+k → Ij+l → …

(Slide figure: F D C M W pipeline rows for each of these instructions, with the discarded Subtract crossed out and a bubble after the Jump.)
Performance Evaluation

Instruction throughput: the number of instructions executed per second.

Non-pipelined instruction throughput: Pnp = R / S

R – clock rate (cycles per second)
S – average number of clock cycles to execute one instruction

Pipelined instruction throughput: Pp = R

• With no cache misses, a five-stage pipeline executes one instruction per cycle without stalls (S = 1).
• The ideal pipeline therefore has S = 1.

Performance is affected by:
i) Data dependencies (data hazard)
ii) Branch penalty (branch or control hazard)
iii) Cache misses (memory hazard)
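The throughput formula above can be sketched as a one-line helper (illustrative Python; the names and the 1 GHz clock are assumptions, not from the slides):

```python
# Sketch: instruction throughput P = R / S, where R is the clock rate
# and S the average number of cycles per instruction.
def throughput(clock_rate_hz, avg_cycles_per_instr):
    return clock_rate_hz / avg_cycles_per_instr

R = 1_000_000_000                # assumed 1 GHz clock
print(throughput(R, 4))          # non-pipelined, S = 4: 250000000.0
print(throughput(R, 1))          # ideal pipeline, S = 1: 1000000000.0
```

At the same clock rate, the ideal pipeline (S = 1) gives four times the throughput of the S = 4 non-pipelined case.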
Performance Evaluation – Data Dependency

Effect of stalls due to data dependencies:

Assume that Load instructions constitute 25 percent of the dynamic instruction count, and that 40 percent of these Loads are followed by a dependent instruction. A one-cycle stall is needed in such cases. Evaluate the increase over the ideal case of S = 1.

δstall = 0.25 × 0.40 × 1 = 0.10

Execution time T increases by 10 percent, and throughput is reduced to:

Pp = R / (1 + δstall) = R / 1.1 = 0.91R
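The slide's arithmetic can be reproduced directly (illustrative Python; variable names are assumptions):

```python
# Data-dependency stall penalty: 25% Loads x 40% dependent x 1 cycle.
delta_stall = 0.25 * 0.40 * 1
relative_throughput = 1 / (1 + delta_stall)   # Pp as a fraction of R
print(round(delta_stall, 2))                  # 0.1
print(round(relative_throughput, 2))          # 0.91
```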

07/07/2024 Dr T Ravichandran HOD / CSE 23


Performance Evaluation – Branch Hazard

Effect of the branch penalty:

Assume that branches constitute 20 percent of the dynamic instruction count of a program, and that the average prediction accuracy for branch instructions is 90 percent, i.e., 10 percent of all executed branch instructions incur a one-cycle penalty due to misprediction. The increase in the average number of cycles (S) is:

δbranch_penalty = 0.20 × 0.10 × 1 = 0.02

Throughput is reduced to:

Pp = R / (1 + δbranch_penalty) = R / 1.02 = 0.98R
Performance Evaluation – Memory Delay

The increase over the ideal case of S = 1 due to cache misses is:

δmiss = (mi + d × md) × pm

where
pm – number of penalty cycles for a cache miss
mi – fraction of fetched instructions that cause a cache miss
d – fraction of instructions that are Loads or Stores
md – fraction of Load/Store data accesses that cause a cache miss
Performance Evaluation – Memory Delay

Suppose that 5 percent of all fetched instructions incur a cache miss, 30 percent of all executed instructions are Load or Store instructions, and 10 percent of their data-operand accesses incur a cache miss. Assume that the penalty to access main memory on a cache miss is 10 cycles.

The increase over the ideal case of S = 1 due to cache misses is:

δmiss = (mi + d × md) × pm = (0.05 + 0.30 × 0.10) × 10 = 0.8

• Compared to δstall and δbranch_penalty, the effect of slow main memory on cache misses (δmiss) is far more significant.
• When all factors are combined, S increases from the ideal value of 1 to 1 + δstall + δbranch_penalty + δmiss.
• The contribution of cache misses is often the dominant one.
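Combining the three penalty terms, using the values from the preceding examples (illustrative Python; variable names are assumptions):

```python
# Combined effect of the three hazards on S.
delta_stall  = 0.25 * 0.40 * 1             # data-dependency stalls
delta_branch = 0.20 * 0.10 * 1             # branch misprediction penalty
delta_miss   = (0.05 + 0.30 * 0.10) * 10   # cache misses, pm = 10 cycles
S = 1 + delta_stall + delta_branch + delta_miss
print(round(delta_miss, 2))                # 0.8
print(round(S, 2))                         # 1.92
```

With these numbers, cache misses contribute 0.8 of the 0.92-cycle increase, confirming that they dominate.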
Superscalar Operation

• This approach equips the processor with multiple execution units.
• Each execution unit is itself pipelined.
• Several instructions can start execution in the same clock cycle.
• This achieves a throughput of more than one instruction per cycle.

Processors organized this way are called "superscalar processors".
Superscalar Operation

• Two instructions are fetched in the same clock cycle.
• Store/Load instructions use the Memory stage.
• Add/Subtract instructions do not perform a memory access, so the two kinds of instruction can proceed in separate execution units.
Superscalar Operation

Out-of-order execution:
• Suppose a Load instruction is fetched before a Subtract.
• The Subtract may write its result before the Load does.
• This becomes a problem when there is a data dependency between them.
• Instructions must be dispatched in the same sequence as they appear in the program, but a superscalar processor may complete them out of order.
Superscalar Operation

Out-of-order execution:
• To guarantee a consistent state, the result of a later instruction must be buffered until the preceding instruction has written its result.
• Temporary registers can be used for this buffering.
• In the example, the result of the Subtract instruction is buffered until the Load instruction completes.
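A minimal sketch of this in-order commit buffering (illustrative Python; the data structures are assumptions, not the slides' hardware): a later instruction's finished result waits in a buffer until every earlier instruction has committed, so the register file is updated in program order.

```python
# Sketch: results are retired strictly in program order.
commit_order = ["load", "subtract"]   # program order
done = {"subtract": 42}               # subtract finished early (out of order)
registers = {}                        # architectural state
pending = list(commit_order)

def try_commit():
    # A finished later instruction stays buffered in `done` until its
    # predecessor at the head of `pending` has committed.
    while pending and pending[0] in done:
        instr = pending.pop(0)
        registers[instr] = done[instr]

try_commit()                 # subtract is done, but load is not: nothing commits
assert registers == {}
done["load"] = 7             # the load finally completes
try_commit()                 # now both commit, in program order
assert registers == {"load": 7, "subtract": 42}
```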
Pipelining in CISC

Complications arise because CISC instructions may:
• Be variable in size
• Have multiple memory operands
• Set condition codes

Instructions that occupy more than one word:
• Take several clock cycles to fetch
• Complicate decoding and operand access
• Complicate dispatch in a superscalar processor
Pipelining in CISC

ColdFire processors:

• V1 – fetched instructions are buffered in a FIFO.
• V2 – two-stage fetch pipeline:
  1. Instructions involving only registers pass through the execution stage once.
  2. Instructions involving memory pass through the execution stage twice.
• V4 – extended to four stages, with branch prediction added.
• V5 – same as V4, but provides superscalar processing.
Pipelining in CISC

Intel processors:

• Superscalar, high-performance designs.
• The Core 2 and Core i7 have a 14-stage pipeline.
• Branch prediction and buffering techniques are used.
• To reduce complexity, CISC instructions are converted into RISC-like micro-operations.
