Lec15 - Tournament Predictor and Loop Unrolling


VLSI Architecture

ES ZG642 / MEL ZG 642


Session 15
Pawan Sharma
BITS Pilani ps@pilani.bits-pilani.ac.in
Pilani Campus 11/11/2023
Last Lecture

Overcoming control hazards


• 1-bit predictor
• 2-bit predictor
• Correlating predictor



Today’s lecture

Overcoming Control Hazards


• Tournament Predictor

More ILP techniques


• Loop unrolling
• Tomasulo Algorithm (overcoming data hazards using
dynamic scheduling)



Tournament Predictors
• Motivation from correlating branch predictors:
• The 2-bit local predictor improved on the 1-bit local predictor but still failed on important branches;
• By adding global information, performance was improved
• Under some scenarios, however, local predictors work better than global predictors
• So, exploiting the best features of both, and making a selection to choose one of them per branch, gives the tournament predictor.



Tournament predictors use two predictors, one based on global information and one based on local information, and choose between them with a selector.
Features:
• Combine global and local predictors
• Use a selector to pick the prediction of one of them for each branch
• Update both predictors as well as the selector for each branch
Selector update:
• The selector is updated based on the prediction correctness of the two predictors
Predictor Indexing

• Local predictor: indexed by the LSBs of the branch PC

• Global predictor: indexed by the global history register (an m-bit shift register, where m is the number of previous branches considered; the number of bits used for indexing therefore equals the number of previous branch outcomes recorded), as sketched in the code below.
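A minimal C sketch of the two indexing schemes (the table sizes, the word-aligned-PC assumption, and all names here are illustrative, not taken from the lecture):

```c
#include <stdint.h>

/* Hypothetical sizes: a 1K-entry local table indexed by PC bits,
   and a global table indexed by an m = 12-bit history register. */
#define LOCAL_ENTRIES       1024
#define GLOBAL_HISTORY_BITS 12
#define GLOBAL_ENTRIES      (1 << GLOBAL_HISTORY_BITS)

static uint16_t global_history;   /* m-bit shift register of branch outcomes */

/* Local predictor index: low-order bits of the branch PC
   (instructions are word aligned, so drop the two LSBs first). */
static unsigned local_index(uint32_t pc) {
    return (pc >> 2) & (LOCAL_ENTRIES - 1);
}

/* Global predictor index: the last m branch outcomes. */
static unsigned global_index(void) {
    return global_history & (GLOBAL_ENTRIES - 1);
}

/* After each branch resolves, shift its outcome (1 = taken) into the
   global history register. */
static void update_history(int taken) {
    global_history = (uint16_t)(((global_history << 1) | (taken & 1))
                                & (GLOBAL_ENTRIES - 1));
}
```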


Block diagram

[Figure: block diagram of the tournament predictor, with the two component predictors feeding a selector.]


Tournament Predictor in Alpha 21264
A 4K-entry, 2-bit selector chooses between a global predictor and a local predictor.
The global predictor also has 4K entries and is indexed by the history of the last 12 branches; each entry in the global predictor is a standard 2-bit predictor.
– 12-bit pattern: the ith bit is 0 if the ith prior branch was not taken, and 1 if it was taken.
The selector is updated with the same approach used for the state diagram of the 2-bit predictor. In that diagram, c1/c2 denotes the correctness of predictor 1 / correctness of predictor 2; Use1 means use the global predictor, Use2 means use the local predictor. The selector goes from Use1 to Use2 (and back) only after the prediction is wrong twice; the transition condition is the correctness of the two predictors.

[Figure: the 12-bit global history register indexes the 4K × 2-bit global prediction table.]
Tournament Predictor in Alpha 21264
The local predictor is a two-level predictor:
– Top level: a local history table of 1024 10-bit entries; each 10-bit entry records the 10 most recent branch outcomes for that entry. The 10-bit history allows patterns over the last 10 branches to be discovered and predicted. Indexed by the local branch address.
– Next level: the selected entry from the local history table is used to index a table of 1K 3-bit saturating counters, which provide the local prediction.
Total size: 4K×2 (selector) + 4K×2 (global predictor) + 1K×10 (local history table) + 1K×3 (local counters) = 29K bits (~180K transistors). A sketch of the overall selection and update logic follows.
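To make the selection and update rules concrete, here is a minimal C sketch of a tournament predictor in the spirit of this organization. The table sizes match the slide, but the code itself, its function names, and details such as indexing the selector with the global history are an illustrative assumption, not the actual Alpha 21264 implementation.

```c
#include <stdint.h>
#include <stdbool.h>

#define SEL_ENTRIES   4096   /* 4K x 2-bit selector             */
#define GLOB_ENTRIES  4096   /* 4K x 2-bit global predictor     */
#define LHIST_ENTRIES 1024   /* 1K x 10-bit local history table */
#define LPRED_ENTRIES 1024   /* 1K x 3-bit local counters       */

static uint8_t  selector[SEL_ENTRIES];     /* 0..3: prefer global ... prefer local */
static uint8_t  global_pred[GLOB_ENTRIES]; /* 2-bit saturating counters */
static uint16_t local_hist[LHIST_ENTRIES]; /* last 10 outcomes of each branch */
static uint8_t  local_pred[LPRED_ENTRIES]; /* 3-bit saturating counters */
static uint16_t ghist;                     /* 12-bit global history register */

static void bump(uint8_t *c, bool up, uint8_t max) {
    if (up)  { if (*c < max) (*c)++; }   /* saturate at max */
    else     { if (*c > 0)   (*c)--; }   /* saturate at 0   */
}

/* Predict a branch: the selector chooses the local or the global prediction. */
static bool predict(uint32_t pc) {
    unsigned gi = ghist & (GLOB_ENTRIES - 1);
    unsigned li = local_hist[(pc >> 2) & (LHIST_ENTRIES - 1)];
    bool g = global_pred[gi] >= 2;   /* MSB of the 2-bit counter */
    bool l = local_pred[li]  >= 4;   /* MSB of the 3-bit counter */
    return (selector[gi] >= 2) ? l : g;
}

/* After the branch resolves, update both predictors, the selector,
   and both history registers. */
static void update(uint32_t pc, bool taken) {
    unsigned hi = (pc >> 2) & (LHIST_ENTRIES - 1);
    unsigned gi = ghist & (GLOB_ENTRIES - 1);
    unsigned li = local_hist[hi];

    bool g_correct = (global_pred[gi] >= 2) == taken;
    bool l_correct = (local_pred[li]  >= 4) == taken;

    bump(&global_pred[gi], taken, 3);
    bump(&local_pred[li],  taken, 7);

    /* Move the selector toward the predictor that was right, but only
       when the two disagree in correctness. */
    if (g_correct != l_correct)
        bump(&selector[gi], l_correct, 3);

    ghist          = (uint16_t)(((ghist << 1) | taken) & 0xFFF);
    local_hist[hi] = (uint16_t)(((local_hist[hi] << 1) | taken) & 0x3FF);
}
```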


Accuracy of Branch Prediction
[Figure: bar chart of branch prediction accuracy for profile-based, 2-bit counter, and tournament predictors on SPEC benchmarks (tomcatv, doduc, fpppp, li, espresso, gcc); the tournament predictor is the most accurate on every benchmark, ranging from about 94% on gcc to 100% on tomcatv.]

Profile: branch profile from last execution
Overview

• Pipelining exploited the concept of ILP – multiple operations are overlapped

• 2 methods of increasing ILP:
• Making the pipeline deeper: more stages → more overlap between operations or instructions → shorter clock cycle → speedup → higher throughput

• Replicating internal components to launch multiple instructions in every pipeline stage (multiple issue). For example, a multiple-issue laundry would have three washers and three dryers → extra manpower is needed to fold and put away three times as much laundry in the same amount of time → extra overhead is needed to keep the machines busy and to transfer loads to the other pipeline stages



• Launching multiple instructions per stage allows the instruction execution rate to exceed the clock rate or, equivalently, makes the CPI less than 1.
• It is sometimes useful to flip the metric from CPI to IPC, instructions per clock cycle.
• A 4 GHz four-way multiple-issue microprocessor can execute instructions at a peak rate of 16 billion instructions per second and has a best-case CPI of 0.25, i.e., an IPC of 4 (see the worked numbers below).
• Assuming a five-stage pipeline, such a processor would have 20 instructions in execution at any given time.
• Today’s high-end microprocessors attempt to issue from three to six instructions in every clock cycle.
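Restating the arithmetic from the bullets above as equations:

\[
\text{Peak rate} = 4\,\text{GHz} \times 4\ \tfrac{\text{instructions}}{\text{cycle}} = 16 \times 10^{9}\ \tfrac{\text{instructions}}{\text{s}},
\qquad
\text{CPI}_{\min} = \frac{1}{\text{IPC}_{\max}} = \frac{1}{4} = 0.25,
\qquad
\text{in flight} = 5\ \text{stages} \times 4\ \tfrac{\text{instructions}}{\text{stage}} = 20 .
\]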



Recall from Pipelining Review

• Pipeline CPI = Ideal pipeline CPI + Structural Stalls + Data Hazard Stalls +
Control Stalls
• Ideal pipeline CPI: measure of the maximum performance attainable by
the implementation. By reducing each of the terms of the right-hand
side, we decrease the overall pipeline CPI or, alternatively, increase the
IPC (instructions per clock).
• Structural hazards: HW cannot support this combination of
instructions
• Data hazards: Instruction depends on result of prior instruction still in
the pipeline
• Control hazards: Caused by delay between the fetching of
instructions and decisions about changes in control flow
(branches and jumps)
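As a purely hypothetical illustration of how the terms of the CPI equation above combine (the stall components below are made-up numbers, not measurements from the lecture):

\[
\text{Pipeline CPI} = \underbrace{1.0}_{\text{ideal}} + \underbrace{0.05}_{\text{structural}} + \underbrace{0.20}_{\text{data}} + \underbrace{0.15}_{\text{control}} = 1.40,
\qquad
\text{IPC} = \frac{1}{1.40} \approx 0.71 .
\]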
Instruction Level Parallelism
• Instruction-Level Parallelism (ILP): overlap the execution
of instructions to improve performance
• 2 approaches to exploit ILP:
1) Rely on hardware to discover and exploit the parallelism dynamically (e.g., Pentium 4, AMD Opteron, IBM Power, Intel Core series), and
2) Rely on software technology to find the parallelism statically at compile time (e.g., Itanium 2)
Loop-Level Parallelism
• Exploit loop-level parallelism by “unrolling the loop”, either statically by the compiler or dynamically by the hardware
• Determining instruction dependence is critical to Loop
Level Parallelism
• If 2 instructions are
– parallel, they can execute simultaneously in a pipeline of
arbitrary depth without causing any stalls (assuming no
structural hazards)
– dependent, they are not parallel and must be executed
in order, although they may often be partially
overlapped
Compiler techniques to increase ILP
Use of simple compiler technology to enhance a processor’s ability to exploit ILP.

Basic Pipeline Scheduling and Loop Unrolling

• To avoid a pipeline stall, the execution of a dependent instruction must be separated from the source instruction by a distance in clock cycles equal to the pipeline latency of that source instruction.

• A compiler’s ability to perform this scheduling depends both on the amount of ILP available in the program and on the latencies of the functional units in the pipeline.
Software Techniques - Example
• This code adds a scalar to a vector:
for (i=999; i>=0; i=i-1)
x[i] = x[i] + s;

• Assume the standard five-stage integer pipeline, so that branches have a delay of one clock cycle, and no structural hazards.

Latencies used (number of intervening clock cycles needed to avoid a stall):

Instruction producing result   Instruction using result   Stalls between (cycles)
FP ALU op                      Another FP ALU op          3
FP ALU op                      Store double               2
Load double                    FP ALU op                  1
Load double                    Store double               0
Integer op                     Integer op                 0

These numbers are similar to the average latencies of an FP unit. The latency of a floating-point load to a store is 0, since the result of the load can be bypassed without stalling the store. We will continue to assume an integer load latency of 1 and an integer ALU operation latency of 0.
FP Loop: Where are the Hazards?
• First translate into MIPS code:
– To simplify, assume 8 is the lowest address

Loop: L.D    F0,0(R1)    ;F0 = vector element
      ADD.D  F4,F0,F2    ;add scalar in F2
      S.D    0(R1),F4    ;store result
      DADDUI R1,R1,#-8   ;decrement pointer by 8 bytes (DW)
      BNEZ   R1,Loop     ;branch if R1 != zero

R1 is initially the address of the element in the array with the highest address, and F2 contains the scalar value s.

Let’s see how well this loop runs when it is scheduled on a simple pipeline for MIPS with the latencies mentioned before.
FP Loop Showing Stalls (without any scheduling)
ignoring delayed branches.

1 Loop: L.D    F0,0(R1)    ;F0 = vector element
2       stall
3       ADD.D  F4,F0,F2    ;add scalar in F2
4       stall
5       stall
6       S.D    0(R1),F4    ;store result
7       DADDUI R1,R1,#-8   ;decrement pointer by 8 bytes (DW)
8       stall              ;assumes can't forward to branch
9       BNEZ   R1,Loop     ;branch if R1 != zero

Instruction producing result   Instruction using result   Latency in clock cycles
FP ALU op                      Another FP ALU op          3
FP ALU op                      Store double               2
Load double                    FP ALU op                  1

• 9 clock cycles: rewrite the code to minimize stalls?


Revised FP Loop Minimizing Stalls
1 Loop: L.D    F0,0(R1)
2       DADDUI R1,R1,#-8
3       ADD.D  F4,F0,F2
4       stall
5       stall
6       S.D    8(R1),F4    ;altered offset
7       BNEZ   R1,Loop

Swap DADDUI and S.D by changing the address used by S.D.

Instruction producing result   Instruction using result   Latency in clock cycles
FP ALU op                      Another FP ALU op          3
FP ALU op                      Store double               2
Load double                    FP ALU op                  1

We complete one loop iteration and store back one array element every seven clock cycles, but the actual work of operating on the array element takes just three of those seven clock cycles (the load, add, and store). The remaining four clock cycles are loop overhead (the DADDUI and BNEZ) and the two stalls. To eliminate these four clock cycles we need to get more operations relative to the number of overhead instructions.
• Loop Unrolling or loop unwinding

• A simple yet powerful technique to reduce loop overhead
• Place more than one iteration inside the new loop, i.e. replicate the loop body
• The loop unrolling factor is the number of copies of the loop body
• If unrolled twice (two copies of the body), update the loop counter by 2, thereby doing half as many iterations as before
• To do so, the loop termination code and some addresses must be adjusted
Unrolled twice (source level):
for (i=999; i>=0; i=i-2) {
    x[i]   = x[i] + s;
    x[i-1] = x[i-1] + s;
}

Unrolled twice (MIPS; the numbers are clock cycles, gaps are stall cycles):
1  Loop: L.D    F0,0(R1)
3        ADD.D  F4,F0,F2
6        S.D    0(R1),F4
7        L.D    F6,-8(R1)
9        ADD.D  F8,F6,F2
12       S.D    -8(R1),F8
13       DADDUI R1,R1,#-16
15       BNEZ   R1,Loop

Unrolled thrice (source level):
for (i=999; i>=0; i=i-3) {
    x[i]   = x[i] + s;
    x[i-1] = x[i-1] + s;
    x[i-2] = x[i-2] + s;
}
Unroll Loop Four Times (straightforward way, unscheduled)

1  Loop: L.D    F0,0(R1)
2        stall                  ;1 cycle stall after each load
3        ADD.D  F4,F0,F2
4        stall                  ;2 cycle stall after each add
5        stall
6        S.D    0(R1),F4        ;drop DADDUI & BNEZ
7        L.D    F6,-8(R1)
8        stall
9        ADD.D  F8,F6,F2
10       stall
11       stall
12       S.D    -8(R1),F8       ;drop DADDUI & BNEZ
13       L.D    F10,-16(R1)
14       stall
15       ADD.D  F12,F10,F2
16       stall
17       stall
18       S.D    -16(R1),F12     ;drop DADDUI & BNEZ
19       L.D    F14,-24(R1)
20       stall
21       ADD.D  F16,F14,F2
22       stall
23       stall
24       S.D    -24(R1),F16
25       DADDUI R1,R1,#-32      ;alter to 4*8
26       stall
27       BNEZ   R1,LOOP

We have eliminated three branches and three decrements of R1. The addresses on the loads and stores have been compensated to allow the DADDUI instructions on R1 to be merged. We cannot use the same registers in each unrolled iteration.

Without scheduling, every operation in the unrolled loop is followed by a dependent operation and thus causes a stall: 27 clock cycles (14 instructions plus 13 stalls) for 4 iterations, i.e. an unroll factor of 4, or 6.75 (27/4) cycles per iteration, with 3 stalls per x[i] iteration. The loop can be scheduled to minimize or remove the stalls and improve performance significantly.
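For comparison, the same transformation at the source level, as a C sketch (the array length of 1000 matches the running example; since 1000 is a multiple of 4, no cleanup code is needed for this trip count):

```c
/* Original loop: add scalar s to each element of x[0..999]. */
void add_scalar(double x[1000], double s) {
    for (int i = 999; i >= 0; i = i - 1)
        x[i] = x[i] + s;
}

/* Unrolled four times: one loop-counter update and one branch now
   cover four array elements. */
void add_scalar_unrolled4(double x[1000], double s) {
    for (int i = 999; i >= 0; i = i - 4) {
        x[i]     = x[i]     + s;
        x[i - 1] = x[i - 1] + s;
        x[i - 2] = x[i - 2] + s;
        x[i - 3] = x[i - 3] + s;
    }
}
```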
Loop Unrolling

• To eliminate these four overhead clock cycles we need to get more operations relative to the number of overhead instructions.
• A simple scheme for increasing the number of instructions relative to the branch and overhead instructions is loop unrolling.
• Unrolling simply replicates the loop body multiple times, adjusting the loop termination code.
• Loop unrolling can also be used to improve scheduling: because it eliminates the branch, it allows instructions from different iterations to be scheduled together.
• In this case, we can eliminate the data-use stalls by creating additional independent instructions within the loop body.
• If we simply replicated the instructions when we unrolled the loop, the resulting use of the same registers could prevent us from effectively scheduling the loop.
• Thus, we will want to use different registers (register renaming) for each iteration, increasing the required number of registers.
Unrolled Loop Detail
• The upper bound of the loop count is usually not known at compile time
• Suppose the loop count is n, and we would like to unroll the loop to make k copies of the body (where n may not be a multiple of k)
• Instead of a single unrolled loop, we generate a pair of consecutive loops:
– 1st executes (n mod k) times and has a body that is the original loop
– 2nd is the unrolled body, surrounded by an outer loop that iterates (n div k) times
• For large values of n, most of the execution time will be
spent in the unrolled loop
Original loop:
for (i = n-1; i >= 0; i = i-1)
    x[i] = x[i] + s;

Cleanup loop (executes n mod k times; body is the original loop):
for (i = n-1; i >= n - (n % k); i = i-1)
    x[i] = x[i] + s;

Unrolled loop (iterates n div k times; the initialization is empty because i keeps the value left by the cleanup loop):
for ( ; i >= 0; i = i-k) {
    x[i]       = x[i]       + s;
    x[i-1]     = x[i-1]     + s;
    ---
    x[i-(k-1)] = x[i-(k-1)] + s;
}

The unrolled loop decrements the counter i by k in every iteration, and each iteration updates the elements x[i], x[i-1], ..., x[i-(k-1)]. The smallest value i takes inside the unrolled loop is k-1, so the last index i-(k-1) is still at least 0 and the array index stays within bounds. The cleanup loop is a copy of the original and performs the remaining (n mod k) iterations first; the loop index i is not reinitialized at the beginning of the unrolled for loop (the empty initialization part) because it already holds the correct value after the cleanup loop completes.
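A concrete C sketch of this two-loop scheme with k = 4 and an arbitrary trip count n (function and variable names are illustrative):

```c
/* Add scalar s to x[0..n-1], unrolled by k = 4, with a cleanup loop
   for the n mod 4 leftover iterations. */
void add_scalar_any_n(double *x, int n, double s) {
    int i = n - 1;

    /* Cleanup loop: runs n mod 4 times; body is the original loop. */
    for (; i >= n - (n % 4); i = i - 1)
        x[i] = x[i] + s;

    /* Unrolled loop: runs n div 4 times; i carries over from above. */
    for (; i >= 0; i = i - 4) {
        x[i]     = x[i]     + s;
        x[i - 1] = x[i - 1] + s;
        x[i - 2] = x[i - 2] + s;
        x[i - 3] = x[i - 3] + s;
    }
}
```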
Schedule Unrolled Loop That Removes Stalls

Loop: L.D    F0,0(R1)
      L.D    F6,-8(R1)
      L.D    F10,-16(R1)
      L.D    F14,-24(R1)
      ADD.D  F4,F0,F2
      ADD.D  F8,F6,F2
      ADD.D  F12,F10,F2
      ADD.D  F16,F14,F2
      S.D    F4,0(R1)
      S.D    F8,-8(R1)
      DADDUI R1,R1,#-32
      S.D    F12,16(R1)   ;16 compensates for the -32 (old offset -16)
      BNEZ   R1,Loop
      S.D    F16,8(R1)    ;8 compensates for the -32 (old offset -24)

14 clock cycles, or 3.5 cycles per iteration
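Pulling together the cycle counts quoted on the preceding slides for one element of x:

\[
\underbrace{9}_{\text{unscheduled}} \;\rightarrow\; \underbrace{7}_{\text{scheduled}} \;\rightarrow\; \underbrace{\tfrac{27}{4} = 6.75}_{\text{unrolled}\times 4} \;\rightarrow\; \underbrace{\tfrac{14}{4} = 3.5}_{\text{unrolled}\times 4\ \text{and scheduled}} \quad \text{clock cycles per element.}
\]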


5 Loop Unrolling Decisions
• Requires understanding how one instruction depends on another
and how the instructions can be changed or reordered given the
dependences:
1. Determine if loop unrolling is useful by finding that loop
iterations were independent (except for maintenance code)
2. Use different registers to avoid unnecessary constraints forced by
using same registers for different computations
3. Eliminate the extra test and branch instructions and adjust the loop
termination and iteration code
4. Determine that loads and stores in unrolled loop can be
interchanged by observing that loads and stores from different
iterations are independent
• Transformation requires analyzing memory addresses and finding that they do
not refer to the same address
5. Schedule the code, preserving any dependences needed to yield the
same result as the original code
3 Limits to Loop Unrolling
1. The reduction in loop overhead diminishes with each additional unrolling
2. Growth in code size
• For larger loops, concern it increases the instruction cache miss rate
3. Register pressure: potential shortfall in registers created
by aggressive unrolling and scheduling
• It may not be possible to allocate all live values to registers, may lose
some or all of its advantage
• Loop unrolling reduces impact of branches on pipeline;
another way is branch prediction
Data Dependence and Hazards
• InstrJ is data dependent (aka true dependence) on InstrI:
1. InstrJ tries to read operand before InstrI writes it
I: add r1,r2,r3
J: sub r4,r1,r3
2. or InstrJ is data dependent on InstrK which is dependent on InstrI
• The second condition simply states that one instruction is
dependent on another if there exists a chain of dependences
of the first type between the two instructions. This
dependence chain can be as long as the entire program.
• If two instructions are data dependent, they cannot execute
simultaneously or be completely overlapped
• If a data dependence causes a hazard in the pipeline, it is called a Read After Write (RAW) hazard
ILP and Data Dependencies, Hazards
• HW/SW must preserve program order: the order in which instructions would execute if executed sequentially, as determined by the original source program
– Dependences are a property of programs
• Presence of dependence indicates potential for a hazard, but
actual hazard and length of any stall is property of the pipeline
• Importance of the data dependencies
1) indicates the possibility of a hazard
2) determines order in which results must be calculated
3) sets an upper bound on how much parallelism can possibly be exploited
• HW/SW goal: exploit parallelism by preserving program order
only where it affects the outcome of the program
Name Dependence #1: Anti-dependence
• Name dependence occurs when 2 instructions use same register or memory location,
called a name, but there is no flow of data between the instructions associated with that
name;
• 2 types of name dependence between an instruction i that precedes instruction j in
program order:
• Anti-dependence: InstrJ writes a register or memory location that InstrI reads

I: sub r4,r1,r3
J: add r1,r2,r3
K: mul r6,r1,r7
• A hazard arises if j writes the destination before it is read by i, so that i incorrectly gets the new value.
• This hazard arises from an antidependence (a name dependence).
• WAR hazards cannot occur in most static-issue pipelines (even deeper pipelines or floating-point pipelines), because all reads are early (in ID) and all writes are late (in WB).
• A WAR hazard occurs either when some instructions write results early in the pipeline and other instructions read a source late in the pipeline, or when instructions are reordered.
• Called an “anti-dependence” by compiler writers. It results from reuse of the name “r1”. The original ordering must be preserved to ensure that i reads the correct value.
• If an anti-dependence causes a hazard in the pipeline, it is called a Write After Read (WAR) hazard.
Name Dependence #2: Output dependence
An output dependence occurs when instruction i and instruction j write the same register or
memory location.
• InstrJ writes operand before InstrI writes it.

I: sub r1,r4,r3
J: add r1,r2,r3
K: mul r6,r1,r7

• The writes end up being performed in the wrong order, leaving the value written by i rather than
the value written by j in the destination. This hazard corresponds to an output dependence.
• WAW hazards are present only in pipelines that write in more than one pipe stage or allow an
instruction to proceed even when a previous instruction is stalled
• The ordering between the instructions must be preserved to ensure that the value finally written
corresponds to instruction j.
• Called an “output dependence” by compiler writers. This also results from the reuse of name “r1”
• If an output dependence causes a hazard in the pipeline, it is called a Write After Write (WAW) hazard
• Because a name dependence is not a true dependence,
instructions involved in a name dependence can execute
simultaneously or be reordered, if the name (register number
or memory location) used in the instructions is changed so
the instructions do not conflict.

• This renaming can be done more easily for register operands, where it is called register renaming (illustrated in the sketch below).

• Register renaming can be done either statically by a compiler or dynamically by the hardware.

• Note that the RAR (read after read) case is not a hazard.
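The same reuse-of-a-name situation, written as a minimal C sketch (variable names mirror the registers above and are purely illustrative; in hardware the renaming applies to registers):

```c
/* Before renaming: the name r1 is reused, creating an anti-dependence
   (I reads r1, J then overwrites it) on top of the true dependence J -> K. */
static double before(double r1, double r2, double r3, double r7, double *r4) {
    *r4 = r1 - r3;        /* I: sub r4,r1,r3 */
    r1  = r2 + r3;        /* J: add r1,r2,r3 (overwrites the name r1) */
    return r1 * r7;       /* K: mul r6,r1,r7 */
}

/* After renaming: J writes a fresh name t instead of r1, so I and J no
   longer share a name and could be reordered or overlapped; only the
   true dependence J -> K remains. */
static double after(double r1, double r2, double r3, double r7, double *r4) {
    *r4 = r1 - r3;        /* I */
    double t = r2 + r3;   /* J: renamed destination */
    return t * r7;        /* K */
}
```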
Control Dependencies

• The last type of dependence is a control dependence.


• A control dependence determines the ordering of an instruction, i, with
respect to a branch instruction so that instruction i is executed in
correct program order and only when it should be.

• If an instruction is control dependent on some set of branches, then, in


general, these control dependencies must be preserved to preserve
program order
if p1 {
    S1;
};
if p2 {
    S2;
}
• S1 is control dependent on p1, and S2 is control dependent on p2
but not on p1.
• In general, two constraints are imposed by control
dependences:
• An instruction that is control dependent on a branch cannot
be moved before the branch so that its execution is no longer
controlled by the branch. For example, we cannot take an
instruction from the then portion of an if statement and move
it before the if statement.
• An instruction that is not control dependent on a branch
cannot be moved after the branch so that its execution is
controlled by the branch. For example, we cannot take a
statement before the if statement and move it into the then
portion.
• When processors preserve strict program order, they ensure
that control dependences are also preserved.
• We may be willing to execute instructions that should not have been executed, thereby violating the control dependences; if we can do so without affecting the correctness of the program, that is acceptable.
• Thus, control dependence is not the critical property that
must be preserved.
• Instead, the two properties critical to program correctness—
and normally preserved by maintaining both data and control
dependences—are the exception behavior and the data flow.
Overcoming Data Hazards with Dynamic Scheduling
Why Dynamic Scheduling…?

[Flowchart: with static pipeline scheduling, when a data hazard is detected the hardware checks whether a bypass (forwarding) is possible; if yes, the result is forwarded and the instruction proceeds; if no, the pipeline stalls the instruction. Dynamic scheduling reduces this stall.]

ILP: Instruction Level Parallelism

Goal of ILP: to get as many instructions as possible executing in parallel while respecting dependencies
Advantages of Dynamic Scheduling
• Dynamic scheduling - hardware rearranges the instruction
execution to reduce stalls while maintaining data flow and
exception behavior
• It handles cases when dependences unknown at compile
time
– it allows the processor to tolerate unpredictable delays such as cache
misses, by executing other code while waiting for the miss to resolve
• It allows code that was compiled for one pipeline to run efficiently on a different pipeline
• It simplifies the compiler
• Hardware speculation, a technique with significant
performance advantages, builds on dynamic scheduling
HW Schemes: Instruction Parallelism
• Key idea: Allow instructions behind stall to proceed
DIVD F0,F2,F4
ADDD F10,F0,F8
SUBD F12,F8,F14
• Enables out-of-order execution and allows out-of-order
completion (e.g., SUBD)
– In a dynamically scheduled pipeline, all instructions still pass through issue
stage in order (in-order issue)
• We will distinguish when an instruction begins execution and when it completes execution; between those two times, the instruction is in execution
• Note: Dynamic execution creates WAR and WAW hazards and
makes exceptions harder
• Use reservation stations.
Dynamic Scheduling

• The simple pipeline had 1 stage to check both structural and data hazards: Instruction Decode (ID), also called Instruction Issue
• Split the ID pipe stage of the simple 5-stage pipeline into 2 stages:
• Issue: decode instructions, check for structural hazards
• Read operands: wait until no data hazards, then read operands
Tomasulo Algorithm

• Control & buffers are distributed with the Function Units (FU)
– FU buffers are called “reservation stations”; they hold pending operands
• Registers in instructions are replaced by values or pointers to reservation stations (RS); this is called register renaming
– Renaming avoids WAR and WAW hazards
– There are more reservation stations than registers, so this can do optimizations compilers can’t
• Results go to the FUs from the RSs, not through registers, over a Common Data Bus that broadcasts results to all FUs
– RAW hazards are avoided by executing an instruction only when its operands are available
• Loads and stores are treated as FUs with RSs as well
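A rough sketch of what one reservation-station entry holds, using the usual textbook field names (Op, Vj, Vk, Qj, Qk, A, Busy); this is an assumption-level illustration, not the actual hardware structure:

```c
#include <stdbool.h>

/* One reservation-station entry in a Tomasulo-style machine. */
typedef struct {
    bool   busy;    /* entry currently holds an issued instruction           */
    int    op;      /* operation to perform (e.g. an ADD.D or MUL.D opcode)  */
    double Vj, Vk;  /* operand values, valid once available                  */
    int    Qj, Qk;  /* RS numbers producing the operands (0 = value ready)   */
    long   A;       /* immediate / effective address (loads and stores)      */
} ReservationStation;

/* An instruction may begin execution only when both source operands
   are actual values rather than pending tags. */
static bool ready_to_execute(const ReservationStation *rs) {
    return rs->busy && rs->Qj == 0 && rs->Qk == 0;
}
```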
Tomasulo Organization

[Figure: Tomasulo organization. An FP operation queue issues instructions; load buffers (Load1–Load6) receive data from memory and store buffers send data to memory; the FP register file supplies operands; reservation stations (Add1–Add3 feeding the FP adders, Mult1–Mult2 feeding the FP multipliers) hold waiting operations; the Common Data Bus (CDB) broadcasts results to the reservation stations, buffers, and registers.]
