Lec15 - Tournament Predictor and Loop Unrolling
[Figure: tournament predictor selector — a 4K × 2-bit saturating-counter table chooses between the two predictors A1 and A2]
Transition condition is correctness of predictors
BITS Pilani, Pilani Campus
Tournament Predictor in Alpha 21264
The local predictor is itself a 2-level predictor:
– Top level: a local history table of 1024 10-bit entries; each 10-bit entry records the outcomes of the 10 most recent executions of that branch. A 10-bit history allows patterns among the last 10 occurrences of a branch to be discovered and predicted. Indexed by the branch address.
– Next level: the selected entry from the local history table indexes a 1K-entry table of 3-bit saturating counters, which provide the local prediction.
Total size: 4K×2 (global) + 4K×2 (choice) + 1K×10 (local history) + 1K×3 (local counters) = 29K bits! (~180K transistors)
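The choice predictor's 2-bit saturating counters can be sketched as follows. This is a minimal illustration (names and the mapping of counter values to predictors are assumptions, not the 21264's exact encoding): the counter moves toward whichever predictor was correct when the other was wrong, which is the "transition condition is correctness of predictors" rule above.

```c
/* One 2-bit saturating counter from the 4K x 2-bit selector table.
 * Values 0-1 choose predictor 1 (e.g. local), 2-3 choose predictor 2
 * (e.g. global). Illustrative sketch only. */
typedef unsigned char counter_t;   /* holds 0..3 */

int choose_predictor2(counter_t c) { return c >= 2; }

counter_t update_selector(counter_t c, int p1_correct, int p2_correct)
{
    if (p1_correct && !p2_correct && c > 0) c--;  /* shift toward predictor 1 */
    if (p2_correct && !p1_correct && c < 3) c++;  /* shift toward predictor 2 */
    return c;                                     /* both right or both wrong: no change */
}
```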
• Pipeline CPI = Ideal pipeline CPI + Structural Stalls + Data Hazard Stalls +
Control Stalls
• Ideal pipeline CPI: measure of the maximum performance attainable by
the implementation. By reducing each of the terms of the right-hand
side, we decrease the overall pipeline CPI or, alternatively, increase the
IPC (instructions per clock).
• Structural hazards: HW cannot support this combination of
instructions
• Data hazards: Instruction depends on result of prior instruction still in
the pipeline
• Control hazards: Caused by delay between the fetching of
instructions and decisions about changes in control flow
(branches and jumps)
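The CPI equation above can be exercised with a small sketch; the stall numbers below are purely illustrative assumptions, not measurements of any machine:

```c
/* Pipeline CPI = ideal pipeline CPI + structural stalls
 *              + data-hazard stalls + control stalls.
 * All terms are in stall cycles per instruction. */
double pipeline_cpi(double ideal, double structural, double data, double control)
{
    return ideal + structural + data + control;
}

double ipc_from_cpi(double cpi)
{
    return 1.0 / cpi;   /* IPC rises as each stall term shrinks */
}
```

For example, with an ideal CPI of 1.0, no structural stalls, 0.15 data-hazard stalls and 0.10 control stalls per instruction, the pipeline CPI is 1.25 and the IPC is 0.8.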
Instruction Level Parallelism
• Instruction-Level Parallelism (ILP): overlap the execution
of instructions to improve performance
• 2 approaches to exploit ILP:
1) Rely on hardware to discover and exploit the parallelism dynamically (e.g., Pentium 4, AMD Opteron, IBM Power, Intel Core series), and
2) Rely on software technology to find the parallelism statically at compile time (e.g., Itanium 2)
Loop-Level Parallelism
• Exploit loop-level parallelism by “unrolling” the loop, either statically, by the compiler, or dynamically, by the hardware
• Determining instruction dependence is critical to Loop
Level Parallelism
• If 2 instructions are
– parallel, they can execute simultaneously in a pipeline of
arbitrary depth without causing any stalls (assuming no
structural hazards)
– dependent, they are not parallel and must be executed
in order, although they may often be partially
overlapped
Compiler techniques to increase ILP
Use of simple compiler technology to enhance a processor’s ability to exploit ILP
Basic Pipeline Scheduling and Loop Unrolling
R1 is initially the address of the element in the array with the highest address,
and F2 contains the scalar value s.
Let’s see how well this loop will run when it is scheduled on a simple
pipeline for MIPS with the latencies mentioned before.
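The loop the slide describes — R1 walking an array downward from its highest address, adding the scalar s held in F2 to each element — can be sketched in C, together with an unrolled version. Names (`x`, `n`, `s`) are illustrative:

```c
#include <stddef.h>

/* Add scalar s to every element of x, walking from the highest
 * address downward, as R1 does in the MIPS version. */
void scalar_add(double *x, size_t n, double s)
{
    for (size_t i = n; i-- > 0; )
        x[i] = x[i] + s;
}

/* The same loop unrolled by 4: fewer branch and loop-overhead
 * instructions per element, and four independent adds per iteration
 * for the scheduler to overlap. A cleanup loop handles any leftover
 * iterations when n is not a multiple of 4. */
void scalar_add_unrolled(double *x, size_t n, double s)
{
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        x[i]     += s;
        x[i + 1] += s;
        x[i + 2] += s;
        x[i + 3] += s;
    }
    for (; i < n; i++)          /* cleanup iterations */
        x[i] += s;
}
```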
FP Loop Showing Stalls (without any scheduling), ignoring delayed branches
Name Dependence #1: Antidependence
An antidependence occurs when instruction j writes a register or memory location that instruction i reads.
• Instr J writes the operand before Instr I reads it.
I: sub r4,r1,r3
J: add r1,r2,r3
K: mul r6,r1,r7
• j tries to write a destination before it is read by i, so i incorrectly gets the new value.
• This hazard arises from an antidependence (or name dependence).
• WAR hazards cannot occur in most static-issue pipelines — even deeper pipelines or floating-point pipelines — because all reads are early (in ID) and all writes are late (in WB).
• A WAR hazard occurs either when there are some instructions that write results early in the
instruction pipeline and other instructions that read a source late in the pipeline, or when
instructions are reordered
• Called an “anti-dependence” by compiler writers. This results from reuse of the name “r1”. The
original ordering must be preserved to ensure that i reads the correct value
• If anti-dependence caused a hazard in the pipeline, called a Write After Read (WAR) hazard
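The antidependence above has a direct analogue in ordinary code. In this illustrative C version (register names mapped to variables), instruction i must read the old value of r1 before j overwrites it; reversing or overlapping the two would give i the wrong value:

```c
/* C analogue of the antidependence example: i reads r1, then j
 * overwrites r1. Results are returned through out-parameters so the
 * ordering effect is visible. */
void war_example(int *r4_out, int *r6_out)
{
    int r1 = 10, r2 = 3, r3 = 4, r7 = 2;

    int r4 = r1 - r3;   /* i: must read the old r1 (10) */
    r1 = r2 + r3;       /* j: writes r1 -- antidependence with i */
    int r6 = r1 * r7;   /* k: truly depends on j's new r1 */

    *r4_out = r4;
    *r6_out = r6;
}
```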
Name Dependence #2: Output dependence
An output dependence occurs when instruction i and instruction j write the same register or
memory location.
• InstrJ writes operand before InstrI writes it.
I: sub r1,r4,r3
J: add r1,r2,r3
K: mul r6,r1,r7
• The writes end up being performed in the wrong order, leaving the value written by i rather than
the value written by j in the destination. This hazard corresponds to an output dependence.
• WAW hazards are present only in pipelines that write in more than one pipe stage or allow an
instruction to proceed even when a previous instruction is stalled
• The ordering between the instructions must be preserved to ensure that the value finally written
corresponds to instruction j.
• Called an “output dependence” by compiler writers. This also results from the reuse of name “r1”
• If an output dependence causes a hazard in the pipeline, it is called a Write After Write (WAW) hazard
• Because a name dependence is not a true dependence,
instructions involved in a name dependence can execute
simultaneously or be reordered, if the name (register number
or memory location) used in the instructions is changed so
the instructions do not conflict.
• Note that the RAR (read after read) case is not a hazard.
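Renaming can be illustrated on the WAW example above. In this hypothetical sketch, j's write of r1 (and k's read of it) are renamed to a fresh name r1b, after which i and j no longer conflict and could execute in either order or simultaneously:

```c
/* WAW example after renaming: i still writes r1, but j's destination is
 * renamed to the fresh name r1b, so the two writes no longer conflict.
 * k reads j's value through the new name. Illustrative only. */
void renamed_waw(int *r1_out, int *r6_out)
{
    int r4 = 5, r3 = 2, r2 = 3, r7 = 2;

    int r1  = r4 - r3;   /* i  */
    int r1b = r2 + r3;   /* j: destination r1 renamed to r1b */
    int r6  = r1b * r7;  /* k: reads j's result via the new name */

    *r1_out = r1;
    *r6_out = r6;
}
```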
Control Dependencies
[Figure: a stalled pipeline — an instruction waiting on an operand stalls the instructions behind it; dynamic scheduling reduces this stall]
[Figure: reservation stations Add1–Add3 feeding the FP adders and Mult1–Mult2 feeding the FP multipliers; results go to memory]