
Pipelined Processor Design

M S Bhat
Dept. of E&C,
NITK Suratkal
Presentation Outline
 Pipelining versus Serial Execution

 Pipelined Datapath and Control

 Pipeline Hazards

 Data Hazards and Forwarding

 Load Delay, Hazard Detection, and Stall

 Control Hazards

 Delayed Branch and Dynamic Branch Prediction

2/22/2020 EC792 2
Pipelining Example
 Laundry Example: Three Stages

1. Wash dirty load of clothes

2. Dry wet clothes

3. Fold and put clothes into drawers

 Each stage takes 30 minutes to complete


 Four loads of clothes (A, B, C, D) to wash, dry, and fold
Sequential Laundry
[Figure: four loads processed back-to-back, each taking three 30-minute steps, from 6 PM to midnight]

 Sequential laundry takes 6 hours for 4 loads

 Intuitively, we can use pipelining to speed up laundry
Pipelined Laundry: Start Load ASAP
[Figure: the four loads A, B, C, D overlapped one 30-minute step apart, finishing by 9 PM]

 Pipelined laundry takes 3 hours for 4 loads

 Speedup factor is 2 for 4 loads

 Time to wash, dry, and fold one load is still the same (90 minutes)
Serial Execution versus Pipelining
 Consider a task that can be divided into k subtasks
 The k subtasks are executed on k different stages
 Each subtask requires one time unit
 The total execution time of the task is k time units
 Pipelining is to overlap the execution
 The k stages work in parallel on k different tasks
 Tasks enter/leave pipeline at the rate of one task per time unit
[Figure: task timelines. Without pipelining, stages 1…k of each task run back-to-back; with pipelining, consecutive tasks overlap one stage apart]

Without pipelining: one completion every k time units
With pipelining: one completion every 1 time unit (once the pipeline is full)
Synchronous Pipeline
 Uses clocked registers between stages
 Upon arrival of a clock edge …
 All registers hold the results of previous stages simultaneously

 The pipeline stages are combinational logic circuits


 It is desirable to have balanced stages
 Approximately equal delay in all stages

 Clock period is determined by the maximum stage delay


[Figure: Input → S1 → S2 → … → Sk → Output, with a clocked register between consecutive stages and a common Clock driving all registers]
Pipeline Performance
 Let ti = time delay in stage Si
 Clock cycle t = max(ti) is the maximum stage delay
 Clock frequency f = 1/t = 1/max(ti)
 A k-stage pipeline can process n tasks in k + n – 1
cycles
 k cycles are needed to complete the first task
 n – 1 cycles are needed to complete the remaining n – 1 tasks

 Ideal speedup of a k-stage pipeline over serial execution:

Sk = (serial execution in cycles) / (pipelined execution in cycles)
   = nk / (k + n – 1)

Sk → k for large n
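The cycle count and speedup formulas can be checked with a minimal Python sketch (the 5-stage, 100-task values are illustrative, not from the slide):

```python
# A minimal sketch of the formulas above.
def pipeline_cycles(k, n):
    """Cycles for a k-stage pipeline to process n tasks:
    k cycles to fill the pipeline, then one completion per cycle."""
    return k + n - 1

def speedup(k, n):
    """Speedup over serial execution (n tasks, k cycles each)."""
    return (n * k) / pipeline_cycles(k, n)

print(pipeline_cycles(5, 100))    # 104
print(round(speedup(5, 100), 2))  # 4.81, approaching k = 5 for large n
```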

MIPS Processor Pipeline
 Five stages, one cycle per stage
1. IF: Instruction Fetch from instruction memory
2. ID: Instruction Decode, register read, and J/Br address
3. EX: Execute operation or calculate load/store address
4. MEM: Memory access for load and store
5. WB: Write Back result to register

Performance Example
 Assume the following operation times for components:
 Instruction and data memories: 200 ps
 ALU and adders: 180 ps
 Decode and Register file access (read or write): 150 ps
 Ignore the delays in PC, mux, extender, and wires

 Which of the following would be faster and by how much?


 Single-cycle implementation for all instructions
 Multicycle implementation optimized for every class of instructions

 Assume the following instruction mix:


 40% ALU, 20% Loads, 10% stores, 20% branches, & 10% jumps

Single-Cycle vs Multicycle Implementation
 Break instruction execution into five steps
 Instruction fetch
 Instruction decode and register read
 Execution, memory address calculation, or branch completion
 Memory access or ALU instruction completion
 Load instruction completion

 One step = One clock cycle (clock cycle is reduced)


 First 2 steps are the same for all instructions

Instruction    # cycles        Instruction    # cycles
ALU & Store    4               Branch         3
Load           5               Jump           2
Solution
Instruction Class   Instruction Memory   Register Read   ALU Operation   Data Memory   Register Write   Total
ALU                 200                  150             180             —             150              680 ps
Load                200                  150             180             200           150              880 ps
Store               200                  150             180             200           —                730 ps
Branch              200                  150             180             —             —                530 ps
Jump                200                  150 (decode and update PC)      —             —                —                350 ps

 For fixed single-cycle implementation:


 Clock cycle = 880 ps determined by longest delay (load instruction)

 For multi-cycle implementation:


 Clock cycle = max (200, 150, 180) = 200 ps (maximum delay at any step)
 Average CPI = 0.4×4 + 0.2×5 + 0.1×4+ 0.2×3 + 0.1×2 = 3.8

 Speedup = 880 ps / (3.8 × 200 ps) = 880 / 760 = 1.16
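The calculation above can be reproduced with a short Python sketch; the cycle counts, delays, and instruction mix are taken directly from the slide:

```python
# Worked example: instruction mix, per-class cycle counts, and clock delays.
cycles = {"alu": 4, "load": 5, "store": 4, "branch": 3, "jump": 2}
mix    = {"alu": 0.40, "load": 0.20, "store": 0.10, "branch": 0.20, "jump": 0.10}

single_cycle_clock = 880  # ps, set by the slowest instruction (load)
multicycle_clock   = 200  # ps, set by the slowest step (memory access)

avg_cpi = sum(mix[c] * cycles[c] for c in mix)              # 3.8
speedup = single_cycle_clock / (avg_cpi * multicycle_clock)
print(round(avg_cpi, 2), round(speedup, 2))  # 3.8 1.16
```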

Single-Cycle vs Pipelined Performance
 Consider a 5-stage instruction execution in which …
 Instruction fetch = Data memory access = 200 ps
 ALU operation = 180 ps
 Register read = register write = 150 ps
 What is the clock cycle of the single-cycle processor?
 What is the clock cycle of the pipelined processor?
 What is the speedup factor of pipelined execution?
 Solution
Single-cycle clock = 200 + 150 + 180 + 200 + 150 = 880 ps

[Diagram: each instruction occupies one long 880 ps cycle (IF, Reg, ALU, MEM, Reg back-to-back), so a new instruction starts only every 880 ps]
Single-Cycle versus Pipelined – cont’d
 Pipelined clock cycle = max(200, 180, 150) = 200 ps

[Diagram: three instructions overlapped, each stage taking 200 ps; a new instruction starts every 200 ps]

 CPI for pipelined execution = 1


 One instruction is completed in each cycle (ignoring pipeline fill)
 Speedup of pipelined execution = 880 ps / 200 ps = 4.4
 Instruction count and CPI are equal in both cases
 Speedup factor is less than 5 (the number of pipeline stages)
 Because the pipeline stages are not balanced
 Throughput = 1 / max(stage delay) = 1 / 200 ps = 5 × 10^9 instructions/sec
Pipeline Performance Summary
 Pipelining doesn’t improve latency of a single instruction
 However, it improves throughput of entire workload
 Instructions are initiated and completed at a higher rate

 In a k-stage pipeline, k instructions operate in parallel


 Overlapped execution using multiple hardware resources
 Potential speedup = number of pipeline stages k

 Pipeline rate is limited by the slowest pipeline stage

 Unbalanced lengths of pipeline stages reduce speedup
 Also, time to fill and drain the pipeline reduces speedup
Next . . .
 Pipelining versus Serial Execution

 Pipelined Datapath and Control

 Pipeline Hazards

 Data Hazards and Forwarding

 Load Delay, Hazard Detection, and Stall

 Control Hazards

 Delayed Branch and Dynamic Branch Prediction

Single-Cycle Datapath
 Shown below is the single-cycle datapath
 How to pipeline this single-cycle datapath?
Answer: Introduce pipeline register at end of each stage
IF = Instruction Fetch, ID = Decode & Register Read, EX = Execute, MEM = Memory Access, WB = Write Back

[Figure: single-cycle datapath: PC, Instruction Memory, Register File, ALU, Data Memory; Next PC logic selects the jump or branch target address (J, Beq, Bne, PCSrc); control signals RegDst, RegWrite, ExtOp, ALUSrc, ALUCtrl, MemRead, MemWrite, MemtoReg]
Pipelined Datapath
 Pipeline registers are shown in green, including the PC
 Same clock edge updates all pipeline registers, register
file, and data memory (for store instruction)
IF = Instruction Fetch, ID = Decode & Register Read, EX = Execute, MEM = Memory Access, WB = Write Back

[Figure: pipelined datapath: pipeline registers (shown in green, including the PC) separate the five stages; Instruction/NPC after IF, A/B/Imm after ID, ALUout/D after EX, WB Data after MEM]
Problem with Register Destination
 Is there a problem with the register destination address?
 Instruction in the ID stage different from the one in the WB stage
 Instruction in the WB stage is not writing to its destination register
but to the destination of a different instruction in the ID stage
[Figure: pipelined datapath in which the destination register number is taken from the instruction currently in the ID stage, so the WB stage writes its result to the destination of the wrong instruction]
Pipelining the Destination Register
 Destination Register number should be pipelined
 Destination register number is passed from ID to WB stage
 The WB stage writes back data knowing the destination register
[Figure: pipelined datapath in which the destination register number is carried through pipeline registers Rd2, Rd3, Rd4 from the ID stage to the WB stage, so the write-back uses the correct destination register]
Graphically Representing Pipelines
 Multiple instruction execution over multiple clock cycles
 Instructions are listed in execution order from top to bottom
 Clock cycles move from left to right
 Figure shows the use of resources at each stage and each cycle

Time (in cycles) CC1 CC2 CC3 CC4 CC5 CC6 CC7 CC8
Program Execution Order

lw R14, 8(R21) IF Reg ALU DM Reg

add R17,R18,R19 IF Reg ALU DM Reg

ori R20,R11, 7 IF Reg ALU DM Reg

sub R13, R18, R11 IF Reg ALU DM Reg

sw R18, 10(R19) IF Reg ALU DM

Instruction-Time Diagram
 Instruction-Time Diagram shows:
 Which instruction occupying what stage at each clock cycle
 Instruction flow is pipelined over the 5 stages
Up to five instructions can be in the pipeline during the same cycle: Instruction Level Parallelism (ILP). ALU instructions skip the MEM stage; store instructions skip the WB stage.

lw  R15, 8(R19)     IF  ID  EX  MEM  WB
lw  R14, 8(R21)         IF  ID  EX   MEM  WB
ori R12, R19, 7             IF  ID   EX   –   WB
sub R13, R18, R11               IF   ID   EX  –   WB
sw  R18, 10(R19)                     IF   ID  EX  MEM  –

CC1 CC2 CC3 CC4 CC5 CC6 CC7 CC8 CC9  Time
Control Signals
[Figure: pipelined datapath annotated with the control signals RegDst, RegWrite, ExtOp, ALUSrc, ALUCtrl, MemRead, MemWrite, MemtoReg, plus J, Beq, Bne, and PCSrc for the Next PC logic]

Same control signals used in the single-cycle datapath
Pipelined Control

[Figure: pipelined datapath with the Main & ALU Control unit in the ID stage; the control signals are passed along the pipeline registers (EX, MEM, WB) just like the data]
Pipelined Control – Cont'd
 ID stage generates all the control signals
 Pipeline the control signals as the instruction moves
 Extend the pipeline registers to include the control signals

 Each stage uses some of the control signals


 Instruction Decode and Register Read
 Control signals are generated
 RegDst is used in this stage
 Execution Stage => ExtOp, ALUSrc, and ALUCtrl
 Next PC uses J, Beq, Bne, and zero signals for branch control
 Memory Stage => MemRead, MemWrite, and MemtoReg
 Write Back Stage => RegWrite is used in this stage

Control Signals Summary
Control signals by stage: RegDst (decode); ALUSrc, ExtOp, J, Beq, Bne, ALUCtrl (execute); MemRd, MemWr (memory); MemReg, RegWrite (write back)

Op      RegDst  ALUSrc  ExtOp   J  Beq  Bne  ALUCtrl  MemRd  MemWr  MemReg  RegWrite
R-Type  1=Rd    0=Reg   x       0  0    0    func     0      0      0       1
addi    0=Rt    1=Imm   1=sign  0  0    0    ADD      0      0      0       1
slti    0=Rt    1=Imm   1=sign  0  0    0    SLT      0      0      0       1
andi    0=Rt    1=Imm   0=zero  0  0    0    AND      0      0      0       1
ori     0=Rt    1=Imm   0=zero  0  0    0    OR       0      0      0       1
lw      0=Rt    1=Imm   1=sign  0  0    0    ADD      1      0      1       1
sw      x       1=Imm   1=sign  0  0    0    ADD      0      1      x       0
beq     x       0=Reg   x       0  1    0    SUB      0      0      x       0
bne     x       0=Reg   x       0  0    1    SUB      0      0      x       0
j       x       x       x       1  0    0    x        0      0      x       0
Next . . .
 Pipelining versus Serial Execution

 Pipelined Datapath and Control

 Pipeline Hazards

 Data Hazards and Forwarding

 Load Delay, Hazard Detection, and Stall

 Control Hazards

 Delayed Branch and Dynamic Branch Prediction

Pipeline Hazards
 Hazards: situations that would cause incorrect execution
 If next instruction were launched during its designated clock cycle
1. Structural hazards
 Caused by resource contention
 Using same resource by two instructions during the same cycle
2. Data hazards
 An instruction may compute a result needed by next instruction
 Hardware can detect dependencies between instructions
3. Control hazards
 Caused by instructions that change control flow (branches/jumps)
 Delays in changing the flow of control
 Hazards complicate pipeline control and limit performance

Structural Hazards
 Problem
 Attempt to use the same hardware resource by two different
instructions during the same cycle
 Example
 Writing back the ALU result in stage 4
 Conflicts with writing load data in stage 5

Structural hazard: two instructions attempt to write the register file during the same cycle

lw  R14, 8(R21)     IF  ID  EX  MEM  WB
ori R12, R19, 7         IF  ID  EX   WB
sub R13, R18, R19           IF  ID   EX  WB
sw  R18, 10(R19)                IF   ID  EX  MEM

CC1 CC2 CC3 CC4 CC5 CC6 CC7 CC8 CC9  Time
Resolving Structural Hazards
 Serious Hazard:
 Hazard cannot be ignored

 Solution 1: Delay Access to Resource


 Must have mechanism to delay instruction access to resource
 Delay all write backs to the register file to stage 5
 ALU instructions bypass stage 4 (memory) without doing anything

 Solution 2: Add more hardware resources (more costly)


 Add more hardware to eliminate the structural hazard
 Redesign the register file to have two write ports
 First write port can be used to write back ALU results in stage 4
 Second write port can be used to write back load data in stage 5

Next . . .
 Pipelining versus Serial Execution

 Pipelined Datapath and Control

 Pipeline Hazards

 Data Hazards and Forwarding

 Load Delay, Hazard Detection, and Stall

 Control Hazards

 Delayed Branch and Dynamic Branch Prediction

Data Hazards
 Dependency between instructions causes a data hazard
 The dependent instructions are close to each other
 Pipelined execution might change the order of operand access

 Read After Write – RAW Hazard


 Given two instructions I and J, where I comes before J
 Instruction J should read an operand after it is written by I
 Called a data dependence in compiler terminology
I: add R17, R18, R19 # R17 is written
J: sub R20, R17, R19 # R17 is read
 Hazard occurs when J reads the operand before I writes it

Example of a RAW Data Hazard
Time (cycles) CC1 CC2 CC3 CC4 CC5 CC6 CC7 CC8
value of R18 10 10 10 10 10 20 20 20
Program Execution Order

sub R18, R9, R11 IF Reg ALU DM Reg

add R20, R18, R13 IF Reg ALU DM Reg

or R22, R11, R18 IF Reg ALU DM Reg

and R23, R12, R18 IF Reg ALU DM Reg

sw R24, 10(R18) IF Reg ALU DM

 Result of sub is needed by add, or, and, & sw instructions


 Instructions add & or will read old value of R18 from reg file
 During CC5, R18 is written at end of cycle
Solution 1: Stalling the Pipeline
Time (in cycles) CC1 CC2 CC3 CC4 CC5 CC6 CC7 CC8 CC9
value of R18 10 10 10 10 10 20 20 20 20
Instruction Order

sub R18, R9, R11 IF Reg ALU DM Reg

add R20, R18, R13 IF Reg Reg Reg Reg ALU DM Reg

stall stall stall


or R22, R11, R18 IF Reg ALU DM

 Three stall cycles during CC3 thru CC5 (wasting 3 cycles)


 Stall cycles delay execution of add & fetching of or instruction
 The add instruction cannot read R18 until beginning of CC6
 The add instruction remains in the Instruction register until CC6
 The PC register is not modified until beginning of CC6

Solution 2: Forwarding ALU Result
 The ALU result is forwarded (fed back) to the ALU input
 No bubbles are inserted into the pipeline and no cycles are wasted
 ALU result is forwarded from ALU, MEM, and WB stages
Time (cycles) CC1 CC2 CC3 CC4 CC5 CC6 CC7 CC8
value of R18 10 10 10 10 10 20 20 20
Program Execution Order

sub R18, R9, R11 IF Reg ALU DM Reg

add R20, R18, R13 IF Reg ALU DM Reg

or R22, R11, R18 IF Reg ALU DM Reg

and R23, R22, R18 IF Reg ALU DM Reg

sw R24, 10(R18) IF Reg ALU DM

Implementing Forwarding
 Two multiplexers added at the inputs of A & B registers
 Data from ALU stage, MEM stage, and WB stage is fed back
 Two signals: ForwardA and ForwardB control forwarding
[Figure: forwarding hardware: two 4-to-1 multiplexers at the inputs of the A and B pipeline registers select between BusA/BusB (register file), the ALU result (EX stage), the MEM-stage result, and the WB-stage result, controlled by ForwardA and ForwardB]
Forwarding Control Signals
Signal        Explanation
ForwardA = 0  First ALU operand comes from register file = value of (Rs)
ForwardA = 1  Forward result of previous instruction to A (from ALU stage)
ForwardA = 2  Forward result of 2nd previous instruction to A (from MEM stage)
ForwardA = 3  Forward result of 3rd previous instruction to A (from WB stage)
ForwardB = 0  Second ALU operand comes from register file = value of (Rt)
ForwardB = 1  Forward result of previous instruction to B (from ALU stage)
ForwardB = 2  Forward result of 2nd previous instruction to B (from MEM stage)
ForwardB = 3  Forward result of 3rd previous instruction to B (from WB stage)
Forwarding Example
Instruction sequence:
    lw  R12, 4(R8)
    ori R15, R9, 2
    sub R11, R12, R15

When the sub instruction is in the decode stage, ori is in the ALU stage and lw is in the MEM stage.

ForwardA = 2 (R12 from the MEM stage), ForwardB = 1 (R15 from the ALU stage)

[Figure: datapath snapshot with sub R11,R12,R15 in ID, ori R15,R9,2 in EX, and lw R12,4(R8) in MEM, showing the two forwarding paths]
RAW Hazard Detection
 Current instruction being decoded is in Decode stage
 Previous instruction is in the Execute stage
 Second previous instruction is in the Memory stage
 Third previous instruction in the Write Back stage

If      ((Rs != 0) and (Rs == Rd2) and (EX.RegWrite))  ForwardA = 1
Else if ((Rs != 0) and (Rs == Rd3) and (MEM.RegWrite)) ForwardA = 2
Else if ((Rs != 0) and (Rs == Rd4) and (WB.RegWrite))  ForwardA = 3
Else                                                   ForwardA = 0

If      ((Rt != 0) and (Rt == Rd2) and (EX.RegWrite))  ForwardB = 1
Else if ((Rt != 0) and (Rt == Rd3) and (MEM.RegWrite)) ForwardB = 2
Else if ((Rt != 0) and (Rt == Rd4) and (WB.RegWrite))  ForwardB = 3
Else                                                   ForwardB = 0
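The priority logic above maps directly onto a small Python function; the argument names (rd2, rd3, rd4, and the pipelined RegWrite flags) mirror the slide's pipeline registers:

```python
# Sketch of the forwarding-control logic for one ALU operand (Rs or Rt).
def forward_select(src_reg, rd2, ex_regwrite, rd3, mem_regwrite, rd4, wb_regwrite):
    """Return the forwarding mux select (0-3), nearest producer first."""
    if src_reg != 0 and src_reg == rd2 and ex_regwrite:
        return 1   # forward from the ALU (EX) stage
    if src_reg != 0 and src_reg == rd3 and mem_regwrite:
        return 2   # forward from the MEM stage
    if src_reg != 0 and src_reg == rd4 and wb_regwrite:
        return 3   # forward from the WB stage
    return 0       # no hazard: read the register file

# sub R11,R12,R15 in ID, with ori R15 in EX and lw R12 in MEM (earlier example):
print(forward_select(12, 15, True, 12, True, 0, False))  # ForwardA = 2
print(forward_select(15, 15, True, 12, True, 0, False))  # ForwardB = 1
```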

Hazard Detect and Forward Logic
[Figure: datapath with the Hazard Detect and Forward unit: it compares Rs and Rt against Rd2, Rd3, Rd4 and the pipelined RegWrite signals to drive ForwardA and ForwardB]
Next . . .
 Pipelining versus Serial Execution

 Pipelined Datapath and Control

 Pipeline Hazards

 Data Hazards and Forwarding

 Load Delay, Hazard Detection, and Pipeline Stall

 Control Hazards

 Delayed Branch and Dynamic Branch Prediction

Load Delay
 Unfortunately, not all data hazards can be forwarded
 Load has a delay that cannot be eliminated by forwarding
 In the example shown below …
 The LW instruction does not read data until end of CC4
 Cannot forward data to ADD at end of CC3 - NOT possible
Time (cycles):      CC1  CC2  CC3  CC4  CC5  CC6  CC7  CC8

lw  R18, 20(R17)    IF   Reg  ALU  DM   Reg
add R20, R18, R13        IF   Reg  ALU  DM   Reg
or  R14, R11, R18             IF   Reg  ALU  DM   Reg
and R15, R18, R12                  IF   Reg  ALU  DM   Reg

However, load can forward data to the 2nd next and later instructions.
Detecting RAW Hazard after Load
 Detecting a RAW hazard after a Load instruction:
 The load instruction will be in the EX stage
 Instruction that depends on the load data is in the decode stage

 Condition for stalling the pipeline


if ((EX.MemRead == 1) // Detect Load in EX stage
and (ForwardA==1 or ForwardB==1)) Stall // RAW Hazard

 Insert a bubble into the EX stage after a load instruction


 Bubble is a no-op that wastes one clock cycle
 Delays the dependent instruction after the load by one cycle
 Because of the RAW hazard
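The stall condition above can be sketched as a small Python predicate; the names ex_memread, forward_a, forward_b mirror the slide's MemRead and ForwardA/ForwardB signals:

```python
# Sketch of the load-use stall check: stall when the instruction in ID
# needs the result of a load that is still in the EX stage (select == 1).
def must_stall(ex_memread, forward_a, forward_b):
    """True when a bubble must be inserted after a load instruction."""
    return ex_memread and (forward_a == 1 or forward_b == 1)

print(must_stall(True, 1, 0))   # True: load-use hazard, insert a bubble
print(must_stall(False, 1, 0))  # False: an ALU result can be forwarded
```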

Stall the Pipeline for one Cycle
 ADD instruction depends on LW  stall at CC3
 Allow Load instruction in ALU stage to proceed
 Freeze PC and Instruction registers (NO instruction is fetched)
 Introduce a bubble into the ALU stage (bubble is a NO-OP)
 Load can forward data to next instruction after delaying it
Time (cycles):      CC1  CC2  CC3    CC4  CC5  CC6  CC7  CC8

lw  R18, 20(R17)    IF   Reg  ALU    DM   Reg
add R20, R18, R13        IF   stall  Reg  ALU  DM   Reg
or  R14, R11, R18             stall  IF   Reg  ALU  DM   Reg

(a bubble, i.e. a NO-OP, occupies the ALU stage during CC4)
Showing Stall Cycles
 Stall cycles can be shown on instruction-time diagram
 Hazard is detected in the Decode stage
 Stall indicates that instruction is delayed
 Instruction fetching is also delayed after a stall
 Example:
Data forwarding is shown using blue arrows

lw R17, (R13) IF ID EX MEM WB


lw R18, 8(R17) IF Stall ID EX MEM WB
add R2, R18, R11 IF Stall ID EX MEM WB
sub R1, R18, R2 IF ID EX MEM WB

CC1 CC2 CC3 CC4 CC5 CC6 CC7 CC8 CC9 CC10 Time

Hazard Detect, Forward, and Stall
[Figure: datapath extended with stall logic: the Hazard Detect, Forward & Stall unit checks MemRead of the instruction in EX; on a load-use hazard it disables the PC and Instruction registers and a mux inserts a bubble (control signals forced to 0) into the EX stage]
Code Scheduling to Avoid Stalls
 Compilers reorder code in a way to avoid load stalls
 Consider the translation of the following statements:
A = B + C; D = E – F; // A thru F are in Memory

 Slow code:
    lw  R8, 4(R16)     # &B = 4(R16)
    lw  R9, 8(R16)     # &C = 8(R16)
    add R10, R8, R9    # stall cycle
    sw  R10, 0(R16)    # &A = 0(R16)
    lw  R11, 16(R16)   # &E = 16(R16)
    lw  R12, 20(R16)   # &F = 20(R16)
    sub R13, R11, R12  # stall cycle
    sw  R13, 12(R0)    # &D = 12(R0)

 Fast code: No Stalls
    lw  R8, 4(R16)
    lw  R9, 8(R16)
    lw  R11, 16(R16)
    lw  R12, 20(R16)
    add R10, R8, R9
    sw  R10, 0(R16)
    sub R13, R11, R12
    sw  R13, 12(R0)
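A toy stall counter makes the benefit of the reordering concrete. This sketch assumes a simplified model (one stall whenever an instruction reads the destination of the immediately preceding lw, matching the 1-cycle load delay); the tuple encoding of the instructions is illustrative:

```python
# Count load-use stalls under a 1-cycle load-delay model.
def count_load_stalls(program):
    """program: list of (op, dest, sources) tuples in issue order."""
    stalls = 0
    for prev, cur in zip(program, program[1:]):
        if prev[0] == "lw" and prev[1] in cur[2]:
            stalls += 1  # instruction uses the load result too early
    return stalls

slow = [("lw", "R8", []), ("lw", "R9", []), ("add", "R10", ["R8", "R9"]),
        ("sw", None, ["R10", "R16"]), ("lw", "R11", []), ("lw", "R12", []),
        ("sub", "R13", ["R11", "R12"]), ("sw", None, ["R13", "R0"])]

fast = [("lw", "R8", []), ("lw", "R9", []), ("lw", "R11", []), ("lw", "R12", []),
        ("add", "R10", ["R8", "R9"]), ("sw", None, ["R10", "R16"]),
        ("sub", "R13", ["R11", "R12"]), ("sw", None, ["R13", "R0"])]

print(count_load_stalls(slow))  # 2 stall cycles
print(count_load_stalls(fast))  # 0 stall cycles
```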

Name Dependence: Write After Read
 Instruction J writes its result after it is read by I
 Called anti-dependence by compiler writers
I: sub R12, R9, R11 # R9 is read
J: add R9, R10, R11 # R9 is written
 Results from reuse of the name R9
 NOT a data hazard in the 5-stage pipeline because:
 Reads are always in stage 2
 Writes are always in stage 5, and
 Instructions are processed in order
 Anti-dependence can be eliminated by renaming
 Use a different destination register for add (e.g., R13)
Name Dependence: Write After Write
 Same destination register is written by two instructions
 Called output-dependence in compiler terminology
I: sub R9, R12, R11 # R9 is written
J: add R9, R10, R11 # R9 is written again
 Not a data hazard in the 5-stage pipeline because:
 All writes are ordered and always take place in stage 5
 However, can be a hazard in more complex pipelines
 If instructions are allowed to complete out of order, and
 Instruction J completes and writes R9 before instruction I
 Output dependence can be eliminated by renaming R9
 Read After Read is NOT a name dependence
Hazards due to Loads and Stores
 Consider the following statements:
sw R10, 0(R1)
lw R11, 6(R5)

Is there any Data Hazard possible in the above code sequence?


If address 0(R1) equals 6(R5), the two instructions access the same memory location!
 Data dependency through Data Memory
 But no Data Hazard since the memory is not accessed by the two
instructions simultaneously – writing and reading happens in two
consecutive clock cycles
 But, in an out-of-order execution processor, this can lead to a
Hazard!

Next . . .
 Pipelining versus Serial Execution

 Pipelined Datapath and Control

 Pipeline Hazards

 Data Hazards and Forwarding

 Load Delay, Hazard Detection, and Stall

 Control Hazards

 Delayed Branch and Dynamic Branch Prediction

Control Hazards
 Jump and Branch can cause great performance loss
 Jump instruction needs only the jump target address
 Branch instruction needs two things:
 Branch Result Taken or Not Taken
 Branch Target Address
 PC + 4                  if branch is NOT taken
 PC + 4 + 4 × immediate  if branch is taken

 Jump and Branch targets are computed in the ID stage


 At which point a new instruction is already being fetched
 Jump Instruction: 1-cycle delay
 Branch: 2-cycle delay for branch result (taken or not taken)
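The branch-target arithmetic above can be sketched in a few lines of Python (the PC value 0x00400000 is just an illustrative address; the immediate is passed as a raw 16-bit field):

```python
# Branch target = PC + 4 + 4 * sign-extended 16-bit immediate (word offset).
def branch_target(pc, imm16):
    """pc: address of the branch; imm16: 16-bit immediate field (0..0xFFFF)."""
    if imm16 & 0x8000:        # sign-extend the 16-bit immediate
        imm16 -= 0x10000
    return (pc + 4) + (imm16 << 2)

print(hex(branch_target(0x00400000, 3)))       # 0x400010: forward branch
print(hex(branch_target(0x00400000, 0xFFFE)))  # 0x3ffffc: backward branch (loop)
```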

2-Cycle Branch Delay
 Control logic detects a Branch instruction in the 2nd Stage
 ALU computes the Branch outcome in the 3rd Stage
 Next1 and Next2 instructions will be fetched anyway
 Convert Next1 and Next2 into bubbles if branch is taken
                        cc1  cc2  cc3     cc4     cc5     cc6     cc7

Beq R9,R10,L1           IF   Reg  ALU
Next1                        IF   Reg     Bubble  Bubble  Bubble
Next2                             IF      Bubble  Bubble  Bubble  Bubble
L1: target instruction                    IF      Reg     ALU     DM

(the branch target address becomes available after the ALU stage in cc3)
Implementing Jump and Branch

[Figure: datapath implementing jump and branch: the branch target and outcome are computed in the ALU (EX) stage, so the branch delay = 2 cycles; on a jump or taken branch, the two following instructions are converted into bubbles (control signals forced to 0)]
Predict Branch NOT Taken
 Branches can be predicted to be NOT taken
 If branch outcome is NOT taken then
 Next1 and Next2 instructions can be executed
 Do not convert Next1 & Next2 into bubbles
 No wasted cycles
 Else, convert Next1 and Next2 into bubbles – 2 wasted cycles

                 cc1  cc2  cc3  cc4  cc5  cc6  cc7

Beq R9,R10,L1    IF   Reg  ALU   (NOT taken)
Next1                 IF   Reg  ALU  DM   Reg
Next2                      IF   Reg  ALU  DM   Reg
Reducing the Delay of Branches
 Branch delay can be reduced from 2 cycles to just 1 cycle

 Branches can be determined earlier in the Decode stage


 A comparator is used in the decode stage to determine branch
decision, whether the branch is taken or not

 Because of forwarding, the delay of the second stage increases, and this may also increase the clock cycle

 Only one instruction that follows the branch is fetched

 If the branch is taken then only one instruction is flushed

 We should insert a bubble after jump or taken branch


 This will convert the next instruction into a NOP
Reducing Branch Delay to 1 Cycle
[Figure: datapath with the branch decision moved to the decode stage: register values are forwarded and then compared by a dedicated equality comparator (Zero), at the cost of a longer cycle; a Reset signal converts the one instruction fetched after a jump or taken branch into a bubble]
Branch Behavior in Programs
 Based on SPEC benchmarks on DLX
 Branches occur with a frequency of 14% to 16% in integer
programs and 3% to 12% in floating point programs.

 About 75% of the branches are forward branches

 60% of forward branches are taken

 85% of backward branches are taken

 67% of all branches are taken

 Why are branches (especially backward branches) more


likely to be taken than not taken?

Four Branch Hazard Alternatives
#1: Stall until branch direction is clear
#2: Predict Branch Not Taken
 Execute successor instructions in sequence
 “Flush” instructions in pipeline if branch actually taken
 Advantage of late pipeline state update
 33% DLX branches not taken on average
 PC+4 already calculated, so use it to get next instruction
#3: Predict Branch Taken
 67% DLX branches taken on average
#4: Define branch to take place AFTER the n following instructions
(Delayed branch)
Next . . .
 Pipelining versus Serial Execution

 Pipelined Datapath and Control

 Pipeline Hazards

 Data Hazards and Forwarding

 Load Delay, Hazard Detection, and Stall

 Control Hazards

 Delayed Branch and Dynamic Branch Prediction

Branch Hazard Alternatives
 Predict Branch Not Taken (previously discussed)
 Successor instruction is already fetched
 Do NOT Flush instruction after branch if branch is NOT taken
 Flush only instructions appearing after Jump or taken branch

 Delayed Branch
 Define branch to take place AFTER the next instruction
 Compiler/assembler fills the branch delay slot (for 1 delay cycle)

 Dynamic Branch Prediction


 Loop branches are taken most of time
 Must reduce branch delay to 0, but how?
 How to predict branch behavior at runtime?

Delayed Branch
 Define branch to take place after the next instruction
 Instruction in branch delay slot is always executed
 Compiler (tries to) move a useful instruction into the delay slot
 Scheduling from before the branch is always helpful when possible

 For a 1-cycle branch delay, we have one delay slot:
    branch instruction
    branch delay slot (next instruction)
    branch target (if branch is taken)

 Compiler fills the branch delay slot
 By selecting an independent instruction from before the branch
 If no independent instruction is found, the compiler fills the delay slot with a NO-OP

Example (from before the branch):

    Before scheduling:         After scheduling:
        add R10,R11,R12            beq R17,R16,label
        beq R17,R16,label          add R10,R11,R12    # delay slot
        ...                        ...
    label:                     label:
Delayed Branch
From the Target: helps when the branch is taken. May duplicate instructions.

    Before scheduling:         After scheduling:
        ADD R2, R1, R3             ADD R2, R1, R3
        BEQZ R2, L1                BEQZ R2, L2
        DELAY SLOT                 SUB R4, R5, R6    # copied from target
        ...                        ...
    L1: SUB R4, R5, R6         L1: SUB R4, R5, R6
    L2:                        L2:

Instructions between BEQZ and SUB (in the fall-through path) must not use R4. What if R5 or R6 changed?
Why is the instruction at L1 duplicated? Because L1 can be reached by another path!
Delayed Branch
From Fall-Through: helps when the branch is not taken (the default case)

    Before scheduling:         After scheduling:
        ADD R2, R1, R3             ADD R2, R1, R3
        BEQZ R2, L1                BEQZ R2, L1
        DELAY SLOT                 SUB R4, R5, R6    # moved from fall-through
        SUB R4, R5, R6             ...
        ...                    L1:
    L1:

Instructions at the target (L1 and after) must not use R4 until it is written again.
Figure C.14 Scheduling the branch delay slot. The top box in each pair shows the code before scheduling; the bottom box shows
the scheduled code. In (a), the delay slot is scheduled with an independent instruction from before the branch. This is the best
choice. Strategies (b) and (c) are used when (a) is not possible. In the code sequences for (b) and (c), the use of R1 in the branch
condition prevents the DADD instruction (whose destination is R1) from being moved after the branch. In (b), the branch delay slot
is scheduled from the target of the branch; usually the target instruction will need to be copied because it can be reached by another
path. Strategy (b) is preferred when the branch is taken with high probability, such as a loop branch. Finally, the branch may be
scheduled from the not-taken fall-through as in (c). To make this optimization legal for (b) or (c), it must be OK to execute the
moved instruction when the branch goes in the unexpected direction. By OK we mean that the work is wasted, but the program will
still execute correctly. This is the case, for example, in (c) if R7 were an unused temporary register when the branch goes in the
unexpected direction.
Copyright © 2011, Elsevier Inc. All rights Reserved.
Filling delay slots
 Compiler effectiveness for single branch delay slot:
 Fills about 60% of branch delay slots
 About 80% of instructions executed in branch delay slots
are useful in computation
 About 50% (i.e., 60% x 80%) of slots usefully filled
 Canceling or nullifying branches
  Include a prediction of whether the branch is taken or not taken
  If the prediction is correct, the instruction in the delay slot is executed
  If the prediction is incorrect, the instruction in the delay slot is quashed
  Allow more slots to be filled from the target address or the fall-through
2/22/2020 EC792 66
Figure C.17 Misprediction rate on SPEC92 for a profile-based predictor varies widely but is generally better for the floating-
point programs, which have an average misprediction rate of 9% with a standard deviation of 4%, than for the integer
programs, which have an average misprediction rate of 15% with a standard deviation of 5%. The actual performance depends
on both the prediction accuracy and the branch frequency, which vary from 3% to 24%.

Copyright © 2011, Elsevier Inc. All rights Reserved.


Drawback of Delayed Branching
 New meaning for the branch instruction
  Branching takes place after the next instruction (not immediately!)
  Impacts software and the compiler
 Compiler is responsible for filling the branch delay slot
  For a 1-cycle branch delay  one branch delay slot
 However, modern processors are deeply pipelined
  Branch penalty is multiple cycles in deeper pipelines
  Multiple delay slots are difficult to fill with useful instructions
 MIPS used delayed branching in its earlier pipelines
  However, delayed branching is not useful in recent processors
2/22/2020 EC792 68
Compiler “Static” Prediction of
Taken/Untaken Branches
 Two strategies examined
 Backward branch predict taken, forward branch not taken
 Profile-based prediction: record branch behavior, predict
branch based on prior run
[Chart: instructions executed per mispredicted branch, log scale from 1
to 100,000, for the benchmarks gcc, espresso, ora, doduc, compress,
tomcatv, alvinn, hydro2d, swm256, and mdljsp2, comparing profile-based
prediction against direction-based (backward-taken, forward-not-taken)
prediction; profile-based prediction is generally better]
2/22/2020 EC792 69
Zero-Delayed Branching
 How to achieve zero delay for a jump or a taken branch?
  Jump or branch target address is computed in the ID stage
  Next instruction has already been fetched in the IF stage
Solution
 Introduce a Branch Target Buffer (BTB) in the IF stage
  Store the target addresses of recent branch and jump instructions
  Use the lower bits of the PC to index the BTB
  Each BTB entry stores the branch/jump address and its target address
  Check the PC to see if the instruction being fetched is a branch
  Update the PC using the target address stored in the BTB
2/22/2020 EC792 70
Branch Target Buffer
 The branch target buffer is implemented as a small cache
 Stores the target address of recent branches and jumps
 We must also have prediction bits
 To predict whether branches are taken or not taken
 The prediction bits are dynamically determined by the hardware
[Diagram: the Branch Target & Prediction Buffer stores addresses of
recent branches, their target addresses, and predict bits; the low-order
PC bits index the buffer, an equality check on the stored branch address
detects a hit, and predict_taken selects the stored target (versus the
incremented PC) through a mux]
2/22/2020 EC792 71
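A minimal sketch of a BTB lookup can make the mechanism concrete. This is an illustrative model, not the slides' hardware: `BranchTargetBuffer` and its entry layout (branch address, target, prediction) are assumed names, and a direct-mapped table indexed by low-order PC bits stands in for the small cache.

```python
class BranchTargetBuffer:
    """Direct-mapped BTB sketch: low-order PC bits index the table;
    a hit requires the stored branch address to match the full PC."""
    def __init__(self, entries=16):
        self.entries = entries
        self.table = [None] * entries  # each entry: (branch_pc, target, predict_taken)

    def lookup(self, pc):
        entry = self.table[(pc >> 2) % self.entries]
        if entry and entry[0] == pc:   # stored branch address matches -> BTB hit
            return entry
        return None                    # miss: just increment the PC

    def update(self, pc, target, taken):
        self.table[(pc >> 2) % self.entries] = (pc, target, taken)

btb = BranchTargetBuffer()
btb.update(0x400100, 0x400200, True)       # branch resolved once already
hit = btb.lookup(0x400100)                 # IF stage: hit with predict taken
print(hit)                                 # -> (0x400100, 0x400200, True)
```

On a hit with the predict bit set, fetch is redirected to the stored target in the same IF cycle, which is what gives the zero-delay branch.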
Dynamic Branch Prediction
 Prediction of branches at runtime using prediction bits
 Prediction bits are associated with each entry in the BTB
 Prediction bits reflect the recent history of a branch instruction
 Typically a few prediction bits (1 or 2) are used per entry
 We don’t know if the prediction is correct or not
 If correct prediction …
 Continue normal execution – no wasted cycles
 If incorrect prediction (misprediction) …
 Flush the instructions that were incorrectly fetched – wasted cycles
 Update prediction bits and target address for future use

2/22/2020 EC792 72
Dynamic Branch Prediction – Cont’d
 IF: Use the PC to address the Instruction Memory and the Branch Target Buffer
  No BTB entry found with predict taken  increment the PC
  BTB entry found with predict taken  PC = target address
 ID: Determine whether the instruction really is a jump or a taken branch
  Prediction correct  normal execution, no stall cycles
 EX: Handle mispredictions
  Mispredicted jump/branch (not in the BTB): enter the jump/branch
   address and target address, and set the prediction, in a BTB entry;
   flush the fetched instructions and restart the PC at the target address
  Mispredicted branch (predicted taken, actually not taken): update the
   prediction bits; flush the fetched instructions and restart the PC
   after the branch
2/22/2020 EC792 73
1-bit Prediction Scheme
 Prediction is just a hint that is assumed to be correct
 If incorrect then fetched instructions are flushed
 1-bit prediction scheme is simplest to implement
 1 bit per branch instruction (associated with BTB entry)
 Record last outcome of a branch instruction (Taken/Not taken)
 Use last outcome to predict future behavior of a branch
  [State diagram: two states; a Taken outcome moves to (or stays in)
   Predict Taken, a Not Taken outcome moves to (or stays in) Predict
   Not Taken]

2/22/2020 EC792 74
1-Bit Predictor: Shortcoming
Inner loop branch mispredicted twice!
 Mispredict as taken on last iteration of inner loop
 Then mispredict as not taken on first iteration of inner
loop next time around

outer: …

inner: …

bne …, …, inner

bne …, …, outer

2/22/2020 EC792 75
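The double misprediction can be checked with a small simulation. This is a sketch; `simulate_1bit` is an illustrative name, and the inner-loop branch is modeled as taken three times and then not taken, once per outer iteration.

```python
def simulate_1bit(outcomes, last=1):
    """1-bit predictor: predict that a branch does what it did last time."""
    mispredicts = 0
    for taken in outcomes:
        if last != taken:   # prediction (the last outcome) was wrong
            mispredicts += 1
        last = taken        # record the new last outcome
    return mispredicts

# Inner-loop branch: taken 3 times, not taken once, for 3 outer iterations.
inner = [1, 1, 1, 0] * 3
print(simulate_1bit(inner))  # 5: one miss on the first pass, two per pass after
```

The first pass mispredicts only the final (not-taken) iteration, but every later pass mispredicts twice: once on entry (the stored bit says not taken) and once on exit.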
2-bit Prediction Scheme
 1-bit prediction scheme has a performance shortcoming
 2-bit prediction scheme works better and is often used
 4 states: strong and weak predict taken / predict not taken

 Implemented as a saturating counter


 Counter is incremented to max=3 when branch outcome is taken
 Counter is decremented to min=0 when branch is not taken

  [State diagram: four states from Strong Predict Not Taken through Weak
   Predict Not Taken and Weak Predict Taken to Strong Predict Taken;
   each Taken outcome moves one state toward Strong Predict Taken, each
   Not Taken outcome moves one state toward Strong Predict Not Taken]
2/22/2020 EC792 76
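The saturating counter above can be sketched in a few lines (`simulate_2bit` is an illustrative name). On the repeating taken-taken-taken-not-taken pattern it reaches the 75% accuracy the later BHT example quotes for the ‘for’ branch.

```python
def simulate_2bit(outcomes, counter=3):
    """2-bit saturating counter: predict taken when counter >= 2
    (the two Predict Taken states); saturate at 0 and 3."""
    mispredicts = 0
    for taken in outcomes:
        if (counter >= 2) != bool(taken):
            mispredicts += 1
        counter = min(counter + 1, 3) if taken else max(counter - 1, 0)
    return mispredicts

# Loop-style branch pattern: taken 3 times, not taken once.
print(simulate_2bit([1, 1, 1, 0] * 4))  # 4 misses in 16 -> 75% accuracy
```

A single not-taken outcome only drops the counter to the weak-taken state, so the predictor no longer mispredicts on re-entering the loop, unlike the 1-bit scheme.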
2-bit Prediction Scheme
 Alternative state machine

2/22/2020 EC792 77
1-Bit Branch History Table (BHT) Example
    main()
    {
        int i, j;
        while (1)
            for (i = 1; i <= 4; i++) {
                ......
                j = i++;
            }
    }

‘while’ branch: always TAKEN            1111111111111111...
‘for’ branch: TAKEN x 3, NOT TAKEN x 1  1110111011101110...

  branch outcome    1 1 1 0 1 1 1 0 1 1 1 0 1 ...
  last outcome      1 1 1 1 0 1 1 1 0 1 1 1 0 ...
  prediction        T T T T N T T T N T T T N ...
  new last outcome  1 1 1 0 1 1 1 0 1 1 1 0 1 ...
  correctness       O O O X X O O X X O O X X ...

Assume initial last outcome = 1         O : correct, X : mispredict
Prediction accuracy of the ‘for’ branch: 50%
2/22/2020 EC792 78
2-Bit Counter Scheme
 Prediction is made based on the last two branch outcomes
 Each of Branch History Table (BHT) entries consists of 2 bits,
usually a 2-bit counter, which is associated with the state of the
automaton(many different automata are possible)

  [Diagram: the PC indexes the Branch History Table; each entry's 2-bit
   counter drives an automaton whose state yields the prediction and is
   updated by the branch outcome; many different automata are possible]
2/22/2020 EC792 79
2-Bit BHT
 2-bit scheme: change the prediction only on two successive mispredictions
 MSB of the state symbol represents the prediction; 1: TAKEN, 0: NOT TAKEN

[State diagram: states 11 and 10 predict TAKEN, states 01 and 00 predict
NOT TAKEN; a T outcome moves one state toward 11, an NT outcome moves
one state toward 00]

  branch outcome     1  1  1  0  1  1  1  0  1  1  1  0  ...
  counter value     11 11 11 11 10 11 11 11 10 11 11 11  ...
  prediction         T  T  T  T  T  T  T  T  T  T  T  T  ...
  new counter value 11 11 11 10 11 11 11 10 11 11 11 10  ...
  correctness        O  O  O  X  O  O  O  X  O  O  O  X  ...

Assume initial counter value: 11        O : correct, X : mispredict
Prediction accuracy of the ‘for’ branch: 75%
2/22/2020 EC792 80
2-Level Self Prediction
• First level
– Record previous few outcomes(history) of the branch itself
– Branch History Table(BHT) - A shift register
• Second Level
– Predictor Table(PT) indexed by history pattern captured by BHT
– Each entry is a 2-bit counter

2/22/2020 EC792 81
2-Level Self Prediction
Algorithm
• Using the PC, read the self history from the BHT
• Using that history, read the counter value from the PT
• If (MSB == 1), predict T; else predict NT
• After resolving the branch outcome, bookkeeping:
  – If mispredicted, discard all speculative executions
  – Shift the new branch outcome into the entry read from the BHT
  – Update the PT: if (branch outcome == T), increase the counter value
    by 1; else decrease it by 1

  Branch outcome     1    1    1    0    1    1    1    0    1    1    1    0
  Self history (BHT) 0000 1000 1100 1110 0111 1011 1101 1110 0111 1011 1101 1110
  Counter value (PT) 10   10   10   10   10   10   10   01   11   11   11   00
  Prediction         T    T    T    T    T    T    T    NT   T    T    T    NT
  New BHT entry      1000 1100 1110 0111 1011 1101 1110 0111 1011 1101 1110 0111
  New PT entry       11   11   11   01   11   11   11   00   11   11   11   00
  Correctness        O    O    O    X    O    O    O    O    O    O    O    O

Initial BHT entry = 0000, self history length = 4, initial PT value = 10
Prediction accuracy = 100% once the tables are warmed up (only the first
occurrence of the 1110 pattern is mispredicted)
2/22/2020 EC792 82
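The trace above can be reproduced with a short simulation of the scheme. This is a sketch: `two_level_self` is an illustrative name, and it follows the slide's conventions (4-bit local history, 2-bit counters initialized to 10, newest outcome shifted in from the high end).

```python
def two_level_self(outcomes, hist_bits=4):
    """Per-branch 2-level predictor: a local history register indexes a
    Predictor Table (PT) of 2-bit counters, initialized to 10 (= 2)."""
    history = 0                       # initial BHT entry = 0000
    pt = [2] * (1 << hist_bits)
    mispredicts = 0
    for taken in outcomes:
        if (pt[history] >= 2) != bool(taken):  # predict T when MSB == 1
            mispredicts += 1
        pt[history] = min(pt[history] + 1, 3) if taken else max(pt[history] - 1, 0)
        # shift the new outcome in from the high end, as in the slide's trace
        history = (history >> 1) | (taken << (hist_bits - 1))
    return mispredicts

print(two_level_self([1, 1, 1, 0] * 3))  # 1: only the first 0 is mispredicted
```

Once PT entry 1110 has been trained, the history pattern itself predicts the not-taken iteration, which a plain 2-bit counter cannot do.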
Two-Level Branch Predictor
[Diagram: the Branch History Register (BHR) is a shift register holding
the outcomes of the last N branches (Rc-k ... Rc-1); its N-bit pattern
indexes the Pattern History Table (PHT) of 2^N entries, each a small FSM
(e.g., a 2-bit counter) whose current state gives the prediction; the
actual branch outcome Rc drives the PHT update logic and is shifted into
the BHR]
 Generalized correlated branch predictor
  1st level keeps branch history in the Branch History Register (BHR)
  2nd level segregates pattern history in the Pattern History Table (PHT)
2/22/2020 EC792 83
Branch History Register
 An N-bit shift register  2^N patterns in the PHT
 Shift in branch outcomes
  1  taken
  0  not taken
 First-In First-Out
 BHR can be
  Global
  Per-set
  Local (per-address)
2/22/2020 EC792 84
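The shift-register update is one line of bit manipulation (`update_bhr` is an illustrative name). One common convention shifts the newest outcome into the low-order bit, as here; the self-history trace on the earlier slide shifts in from the other end, which is equivalent.

```python
def update_bhr(bhr, taken, n=8):
    """Shift the newest outcome (1 = taken, 0 = not taken) into an
    N-bit Branch History Register, discarding the oldest bit."""
    return ((bhr << 1) | taken) & ((1 << n) - 1)

h = 0b1011
h = update_bhr(h, 1)
print(bin(h))  # 0b10111
```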
Pattern History Table
 2^N entries addressed by the N-bit BHR
 Each entry keeps a counter (2-bit or more) for prediction
 Counter update: the same as 2-bit counter
 Can be initialized in alternate patterns (01, 10, 01, 10, ..)
 Alias (or interference) problem

2/22/2020 EC792 85
Two-Level Branch Prediction
[Diagram: for PC = 0x4001000C, the per-set BHR holds the history pattern
0110; combined with set-selection bits from the PC it indexes PHT entry
00110110, whose 2-bit counter reads 10; MSB = 1, so predict Taken]
2/22/2020 EC792 86
Predictor Update (Actually, Not Taken)
[Diagram: once the branch resolves as not taken, the counter in PHT
entry 00110110 is decremented from 10 to 01, and the BHR shifts from
0110 to 0011, so the next lookup indexes PHT entry 00110011]
• Update the predictor after the branch is resolved
2/22/2020 EC792 87
Correlating Branches
 Basic two-bit predictor schemes
  use the recent behavior of a branch to predict its future behavior
 Improve the prediction accuracy
  by also looking at the recent behavior of other branches

    if (aa == 2) aa = 0;
    if (bb == 2) bb = 0;
    if (aa != bb) { }

        subi R3, R1, #2      ; aa in R1
        bnez R3, L1          ; b1
        add  R1, R0, R0
    L1: subi R3, R2, #2      ; bb in R2
        bnez R3, L2          ; b2
        add  R2, R0, R0
    L2: sub  R3, R1, R2
        beqz R3, L3          ; b3

b3 is correlated with b1 and b2: if b1 and b2 are both untaken in the
above code, then b3 will be taken.
 Use correlating predictors or two-level predictors.
2/22/2020 EC792 88
Correlating Branches
Example:
    if (d == 0)
        d = 1;
    if (d == 1)
        .....

        BNEZ R1, b1      ; branch b1 (d != 0)
        ADDI R1, R1, #1  ; since d == 0, make d = 1
    b1: SUBI R3, R1, #1
        BNEZ R3, b2      ; branch b2 (d != 1)
        .....
    b2:

  Initial value of d   d==0?   b1   Value of d before b2   d==1?   b2
  0                    Y       NT   1                      Y       NT
  1                    N       T    1                      Y       NT
  2                    N       T    2                      N       T

If b1 is NT, then b2 is NT.

1-bit self-history predictor, for the sequence d = 2, 0, 2, 0, 2, 0, ...:

  d=?   b1 prediction   b1 action   new b1 prediction   b2 prediction   b2 action   new b2 prediction
  2     NT              T           T                   NT              T           T
  0     T               NT          NT                  T               NT          NT
  2     NT              T           T                   NT              T           T
  0     T               NT          NT                  T               NT          NT

All branches are mispredicted!
2/22/2020 EC792 90
Correlating Branches
Example (1-bit predictors with 1 bit of global correlation):
    if (d == 0)
        d = 1;
    if (d == 1)
        .....

        BNEZ R1, b1      ; branch b1 (d != 0)
        ADDI R1, R1, #1  ; since d == 0, make d = 1
    b1: SUBI R3, R1, #1
        BNEZ R3, b2      ; branch b2 (d != 1)
        .....
    b2:

  Self/global prediction   Prediction if last          Prediction if last
  bits (X/Y)               branch action was NT        branch action was T
  NT/NT                    NT                          NT
  NT/T                     NT                          T
  T/NT                     T                           NT
  T/T                      T                           T

  d=?   b1 prediction   b1 action   new b1 prediction   b2 prediction   b2 action   new b2 prediction
  2     NT/NT           T           T/NT                NT/NT           T           NT/T
  0     T/NT            NT          T/NT                NT/T            NT          NT/T
  2     T/NT            T           T/NT                NT/T            T           NT/T
  0     T/NT            NT          T/NT                NT/T            NT          NT/T
  3     T/NT            T           T/NT                NT/T            T           NT/T

Initial self prediction bits are NT/NT and the initial last branch was NT
(the prediction actually used is shown in red on the slide).
Mispredictions occur only in the first iteration!
2/22/2020 EC792 91
2/22/2020 EC792 91
Local/Global Predictors
 Instead of maintaining a counter for each branch to
capture the common case,
 Maintain a counter for each branch and surrounding pattern
 If the surrounding pattern belongs to the branch being predicted,
the predictor is referred to as a local predictor
 If the surrounding pattern includes neighboring branches, the
predictor is referred to as a global predictor

2/22/2020 EC792 92
Correlated Branch Prediction
 Idea: record the m most recently executed branches as taken
  or not taken, and use that pattern to select the proper n-bit
  branch history table
 In general, an (m,n) predictor records the last m branches to
  select among 2^m history tables, each with n-bit counters
  Thus, the old 2-bit BHT is a (0,2) predictor
 Global branch history: an m-bit shift register keeping the T/NT
  status of the last m branches
 Each entry in the table has 2^m n-bit predictors
2/22/2020 EC792 93
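An (m,n) predictor as described above can be sketched directly (class and parameter names are illustrative): the last m global outcomes select one of 2^m n-bit saturating counters within each branch-table entry.

```python
class CorrelatingPredictor:
    """(m, n) predictor sketch: a global m-bit history selects among
    2**m n-bit saturating counters per branch-table entry."""
    def __init__(self, m=2, n=2, entries=64):
        self.m = m
        self.max_ctr = (1 << n) - 1
        self.threshold = 1 << (n - 1)   # predict taken when the MSB is set
        self.ghr = 0                    # m-bit global history register
        self.table = [[self.threshold] * (1 << m) for _ in range(entries)]

    def predict(self, pc):
        entry = self.table[pc % len(self.table)]
        return entry[self.ghr] >= self.threshold

    def update(self, pc, taken):
        entry = self.table[pc % len(self.table)]
        c = entry[self.ghr]
        entry[self.ghr] = min(c + 1, self.max_ctr) if taken else max(c - 1, 0)
        # shift the outcome into the global history
        self.ghr = ((self.ghr << 1) | taken) & ((1 << self.m) - 1)
```

With m = 0 and n = 2 this degenerates to the plain 2-bit BHT, matching the (0,2) remark above.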
Correlated Branch Predictor
[Diagram: in the plain 2-bit saturating counter scheme, the branch PC
hashes into a table of 2-bit counters; in the (2,2) correlation scheme,
a 2-bit shift register of global branch history additionally selects one
of four 2-bit counters in each of the 2^w entries]
 (M,N) correlation scheme
  M: shift register size (number of bits)
  N: N-bit counter
2/22/2020 EC792 94
Correlating Branches
(2,2) predictor
[Diagram: 4 low-order branch-address bits and the 2-bit global branch
history together index a table of 2-bit per-branch predictors; the
behavior of the two most recent branches selects among four predictions
of the next branch, updating just that prediction]
2/22/2020 EC792 95
Tournament Predictors
• A local predictor might work well for some branches or
  programs, while a global predictor might work well for others
• Provide one of each and maintain another predictor to
  identify which predictor is best for each branch
[Diagram: the branch PC indexes a table of 2-bit saturating counters
(the tournament predictor), which drives a mux selecting between the
outputs of the local and global predictors]
2/22/2020 EC792 96
Tournament Predictors
 Multilevel branch predictor
 Selector for the Global and Local predictors of correlating branch
prediction
 Use n-bit saturating counter to choose between predictors
 Usual choice between global and local predictors

2/22/2020 EC792 97
Tournament Predictors
• Advantage of a tournament predictor is the ability to select the
  right predictor for a particular branch
• A typical tournament predictor selects the global predictor 40% of
  the time for the SPEC integer benchmarks
• AMD Opteron and Phenom use tournament-style predictors
2/22/2020 EC792 100
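The selection mechanism can be sketched in a few lines (names illustrative, and the component predictors are deliberately simple): a 2-bit chooser moves toward whichever component predictor was right whenever the two disagree.

```python
class TournamentEntry:
    """One tournament entry: a 2-bit chooser selects between a global
    2-bit saturating counter and a simple local last-outcome predictor."""
    def __init__(self):
        self.chooser = 2      # >= 2 -> trust the global predictor
        self.global_ctr = 2   # 2-bit saturating counter
        self.local_last = 1   # 1-bit local predictor (last outcome)

    def predict(self):
        g = self.global_ctr >= 2
        l = bool(self.local_last)
        return g if self.chooser >= 2 else l

    def update(self, taken):
        g, l = self.global_ctr >= 2, bool(self.local_last)
        if g != l:            # components disagreed: reward the correct one
            if g == bool(taken):
                self.chooser = min(self.chooser + 1, 3)
            else:
                self.chooser = max(self.chooser - 1, 0)
        # update both component predictors with the actual outcome
        self.global_ctr = min(self.global_ctr + 1, 3) if taken \
                          else max(self.global_ctr - 1, 0)
        self.local_last = 1 if taken else 0
```

On an alternating taken/not-taken pattern the last-outcome component is often wrong while the saturating counter settles into predicting taken, so the chooser drifts toward the global side.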
Accuracy v. Size (SPEC89)
[Chart: conditional branch misprediction rate (0% to 10%) versus total
predictor size (0 to 128 Kbits); the local 2-bit counter predictor is
worst, the correlating (2,2) scheme is better, and the tournament
predictor is best; beyond a modest size, adding bits yields little
further improvement]
2/22/2020 EC792 101
Tournament Predictors (Intel Core i7)
• Based on predictors used in Core 2 Duo chip
• Combines three different predictors
• Two-bit
• Global history
• Loop exit predictor
• Uses a counter to predict the exact number of taken
branches (number of loop iterations) for a branch
that is detected as a loop branch
• Tournament: Tracks accuracy of each predictor
• Main problem of speculation:
• A mispredicted branch may lead to another branch
being mispredicted !

2/22/2020 EC792 102
Accuracy of Different Schemes
[Chart: frequency of mispredictions (0% to 18%) on SPEC89 benchmarks
(nasa7, matrix300, tomcatv, doduc, spice, fpppp, gcc, espresso, eqntott,
li) for three predictors: a 4,096-entry 2-bit BHT, an unlimited-entry
2-bit BHT, and a 1,024-entry (2,2) BHT; the (2,2) correlating predictor
generally does best, while the unlimited 2-bit BHT barely improves on
4,096 entries]
2/22/2020 EC792 103


Branch Prediction is More Important Today
Conditional branches still comprise about 20% of instructions
Correct predictions are more important today - why?
• pipelines deeper
branch not resolved until more cycles from fetching - therefore the
misprediction penalty greater
 cycle times smaller - more emphasis on throughput (performance)
 more functionality between fetch & execute
• multiple instruction issue (superscalars & VLIW)
branch occurs almost every cycle
 flushing & refetching more instructions
• object-oriented programming
more indirect branches - which are harder to predict
• dual of Amdahl’s Law
other forms of pipeline stalling are being addressed - so the portion
of CPI due to branch delays is relatively larger
All this means that the potential stalling due to branches is greater
2/22/2020 EC792 104
Branch Prediction is More Important Today
On the other hand,
• Chips are denser so we can consider sophisticated
HW solutions
• Hardware cost is small compared to the performance
gain

2/22/2020 EC792 105


Directions in Branch Prediction
1: Improve the prediction
 correlated (2-level) predictor (Pentium III - 512 entries, 2-bit,
Pentium Pro - 4 history bits)
• hybrid local/global predictor (Alpha 21264)
• confidence predictors
2: Determine the target earlier
• branch target buffer (Pentium Pro, IA-64 Itanium)
• next address in I-cache (Alpha 21264, UltraSPARC)
• return address stack (Alpha 21264, IA-64 Itanium, MIPS R10000,
Pentium Pro, UltraSPARC-3)
3: Reduce misprediction penalty
• fetch both instruction streams (IBM mainframes, SuperSPARC)
4: Eliminate the branch
• predicated execution (IA-64 Itanium, Alpha 21264)

2/22/2020 EC792 106


Predicated Execution
 Predicated instructions execute conditionally
 Some other (previous) instruction sets a condition
 Predicated instruction tests the condition & executes if the condition is
true
 If the condition is false, predicated instruction isn’t executed
 i.e., instruction execution is predicated on the condition
 Eliminates conditional branch (expensive if mispredicted)
 changes a control hazard to a data hazard
 Fetch both true & false paths

2/22/2020 EC792 107
Predicated Execution
 Example:
  Without predicated execution:
          sub  R10, R4, R5
          beqz R10, Label
          sub  R2, R1, R6
   Label: or   R3, R2, R7

  With predicated execution:
          sub  R10, R4, R5
          sub  R2, R1, R6, R10   ; commits only if R10 is true (nonzero)
          or   R3, R2, R7

 Advantages
  No branch hazard
  Creates straight-line code
  More independent instructions
 Disadvantages
  Instructions on both paths are executed
  Hard to add predicated instructions to an existing instruction set
  Additional register pressure
  Complex conditions if nested loops
2/22/2020 EC792 108
Conditional Execution in ARM ISA
ARM ISA Example: most ARM instructions can be conditionally executed by
appending a condition code, e.g., ADDEQ R0, R0, R1 performs the add only
when the Z flag is set; otherwise it behaves as a NOP.
2/22/2020 EC792 109


Pipelining Complications
 Exceptions: events other than branches or jumps that change the
  normal flow of instruction execution. Some types of exceptions:
  I/O device request
  Invoking an OS service from a user program
  Tracing instruction execution
  Breakpoint (programmer-requested interrupt)
  Integer arithmetic overflow
  FP arithmetic anomaly
  Page fault (page not in main memory)
  Misaligned memory accesses
  Memory protection violation
  Use of an undefined instruction
  Hardware malfunction
  Power failure
2/22/2020 EC792 110
Pipelining Complications
 Exceptions: events other than branches or jumps that change the
  normal flow of instruction execution
 With 5 instructions executing in the 5-stage pipeline:
  How to stop the pipeline?
  Who caused the interrupt?
  How to restart the pipeline?

  Stage   Problems causing the interrupts
  IF      Page fault on instruction fetch; misaligned memory
          access; memory-protection violation
  ID      Undefined or illegal opcode
  EX      Arithmetic interrupt
  MEM     Page fault on data fetch; misaligned memory access;
          memory-protection violation
2/22/2020 EC792 111
Pipelining Complications
 Simultaneous exceptions in more than one pipeline stage,
e.g.,
 LOAD with data page fault in MEM stage
 ADD with instruction page fault in IF stage
 Solution #1
 Interrupt status vector per instruction
 Defer check until last stage, kill state update if exception
 Solution #2
 Interrupt ASAP
 Restart everything that is incomplete

2/22/2020 EC792 112


Pipelining Complications
 Our DLX pipeline only writes results at the end of the
  instruction’s execution. Not all processors do this.
 Addressing modes: auto-increment causes a register change
  during instruction execution
  Interrupts: need to restore register state
  Adds WAR and WAW hazards, since writes happen not only in
   the last stage
 Memory-memory move instructions
  Must be able to handle multiple page faults
  VAX and x86 store values temporarily in registers
 Condition codes
  Need to detect the last instruction to change the condition codes
2/22/2020 EC792 113


Fallacies and Pitfalls
Pipelining is easy!
 The basic idea is easy
 The devil is in the details
  e.g., detecting data hazards and stalling the pipeline
Poor ISA design can make pipelining harder
 Complex instruction sets (Intel IA-32)
  Significant overhead to make pipelining work
  IA-32 micro-op approach
 Complex addressing modes
  Register update side effects, memory indirection
2/22/2020 EC792 114


Pipeline Hazards Summary
 Three types of pipeline hazards
  Structural hazards: conflicts using a resource during the same cycle
  Data hazards: due to data dependences between instructions
  Control hazards: due to branch and jump instructions
 Hazards limit performance and complicate the design
  Structural hazards: eliminated by careful design or more hardware
  Data hazards are eliminated by data forwarding
   However, the load delay cannot be completely eliminated
  Delayed branching can be a solution for control hazards
   A BTB with branch prediction can reduce the branch delay to zero
   A branch misprediction must flush the wrongly fetched instructions
2/22/2020 EC792 115