Instruction Level Parallelism: Pipelining

Programmer's model: one instruction is fetched and executed at a time.

Computer architect's model: The effects of a program's execution are given by the programmer's
model, but the implementation may be different.
To make execution of programs faster, we attempt to exploit parallelism: doing more than one
thing at one time.
program level parallelism: Have one program run parts of itself on more than one
computer. The different parts occasionally synch up (if needed), but they run at the same
time.
instruction level parallelism (ILP): Have more than one instruction within a single
program executing at the same time.
Pipelining (ILP)
The concept:
A task is broken down into steps. Assume that there are N steps, each takes the same amount of
time.
(Mark Hill's) EXAMPLE: car wash
steps:
P -- prep
W -- wash
R -- rinse
D -- dry
X -- wax
assume each step takes 1 time unit

time to wash 1 car (red) = 5 time units
time to wash 3 cars (red, green, blue) = 15 time units
                 time units
which car    1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
red          P  W  R  D  X
green                       P  W  R  D  X
blue                                       P  W  R  D  X
A pipeline overlaps the steps.

               time units
which car    1  2  3  4  5
red          P  W  R  D  X
green           P  W  R  D
blue               P  W  R
yellow                P  W
etc.
IT STILL TAKES 5 TIME UNITS TO WASH 1 CAR, BUT THE RATE OF CAR WASHES GOES UP!
Two very important terms when discussing pipelining:
latency: The time it takes (from beginning to end) to complete a task.

throughput: The rate of task completions.
Pipelining does not affect the latency of car washes. It increases the throughput of car washes.
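The car wash arithmetic can be checked with a short calculation. This is a minimal sketch, assuming every step takes the same amount of time and a new car can enter as soon as the first step is free; the function name wash_times and its parameters are made up for this illustration.

```python
def wash_times(num_cars, num_steps=5, step_time=1):
    """Return (unpipelined_total, pipelined_total) in time units."""
    # Without overlap, each car occupies the whole car wash for all its steps.
    unpipelined = num_cars * num_steps * step_time
    # With overlap, the first car needs num_steps steps; every later car
    # finishes one step time after the car ahead of it.
    pipelined = (num_steps + (num_cars - 1)) * step_time
    return unpipelined, pipelined


for cars in (1, 3, 100):
    slow, fast = wash_times(cars)
    print(cars, slow, fast)
# prints:
# 1 5 5
# 3 15 7
# 100 500 104
```

One car takes 5 time units either way (latency is unchanged), but for a long line of cars the pipelined car wash approaches 1 finished car per time unit (throughput goes up).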
Pipelining can be done in computer hardware.
2-stage pipeline
steps:
F -- instruction fetch (and PC update!)
E -- instruction execute (everything else)
which            time units
instruction   1   2   3   4   5   6   7   8 . . .
1             F   E
2                 F   E
3                     F   E
4                         F   E

time for 1 instruction = 2 time units
(INSTRUCTION LATENCY)

rate of instruction execution = pipeline depth * (1 / time for 1 instruction)
(INSTRUCTION THROUGHPUT)      = 2 * (1 / 2)
                              = 1 per time unit
5-stage pipeline
A popular pipelined implementation:
(Note: the R2000/3000 has 5 stages, the R6000 has 5 stages (but different), and the R4000 has 8
stages)
steps:
IF -- instruction fetch (and PC update)
ID -- instruction decode (and get operands from registers)
EX -- ALU operation (can be effective address calculation)
MA -- memory access
WB -- write back (results written to register(s))

which            time units
instruction   1   2   3   4   5   6   7   8 . . .
1             IF  ID  EX  MA  WB
2                 IF  ID  EX  MA  WB
3                     IF  ID  EX  MA  WB
INSTRUCTION LATENCY = 5 time units

INSTRUCTION THROUGHPUT = 5 * (1 / 5) = 1 instruction per time unit
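An ideal (stall-free) diagram like this can be generated mechanically. The following is a sketch only; the names STAGES, ideal_schedule and render are invented here, and it assumes that one instruction enters the pipe per time unit and nothing ever stalls.

```python
STAGES = ["IF", "ID", "EX", "MA", "WB"]


def ideal_schedule(num_instructions, stages=STAGES):
    """Map each instruction number to {time unit: stage}, with no stalls."""
    schedule = {}
    for i in range(num_instructions):
        # Instruction i (0-based) enters the first stage at time unit i + 1.
        schedule[i + 1] = {i + 1 + s: name for s, name in enumerate(stages)}
    return schedule


def render(schedule):
    """Format a schedule as a text diagram, one row per instruction."""
    last = max(max(row) for row in schedule.values())
    header = "".join(f"{t:<4}" for t in range(1, last + 1)).rstrip()
    lines = ["    " + header]
    for instr, row in schedule.items():
        cells = "".join(f"{row.get(t, ''):<4}" for t in range(1, last + 1))
        lines.append(f"{instr:<4}" + cells.rstrip())
    return "\n".join(lines)


print(render(ideal_schedule(3)))
```

Each instruction still spends 5 time units in the pipe, but once the pipe is full one instruction reaches WB in every time unit.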
Unfortunately, pipelining introduces other difficulties. . .

Data dependencies
Suppose we have the following code:
lw   $8, data1
addi $9, $8, 1
The data loaded does not get written to $8 until WB, but the addi instruction wants to get the data
out of $8 in its ID stage. . .
which            time units
instruction   1   2   3   4   5   6   7   8 . . .
lw            IF  ID  EX  MA  WB
                              ^^
addi              IF  ID  EX  MA  WB
                      ^^
The simplest solution is to STALL the pipeline. (Also called HOLES, HICCOUGHS or
BUBBLES in the pipe.)
which            time units
instruction   1   2   3   4   5   6   7   8 . . .
lw            IF  ID  EX  MA  WB
                              ^^
addi              IF  ID  ID  ID  EX  MA  WB
                      ^^  ^^  ^^ (pipeline stalling)
A data dependency (also called a hazard) causes performance to decrease.
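The stall count can be worked out with a small model of the 5-stage pipe. This is a sketch under stated assumptions, not a description of any real MIPS implementation: there is no forwarding, instructions stay in order, and a register written in WB can be read by an ID stage in that same time unit (which matches the lw/addi stall diagram). The function name stall_schedule and the tuple layout are made up for the example.

```python
def stall_schedule(program):
    """Return [(name, if_time, stall_units, wb_time)] for each instruction.

    program is a list of (name, dest_register_or_None, source_registers).
    """
    ready = {}          # register -> time unit of the WB that writes it
    rows = []
    next_if = 1         # first time unit at which the IF stage is free
    prev_id_done = 0    # time unit in which the previous instruction left ID
    for name, dest, sources in program:
        if_time = next_if
        # ID cannot start until the instruction ahead has left ID.
        id_first = max(if_time + 1, prev_id_done + 1)
        # Repeat ID until every source register has been written.
        id_done = max([id_first] + [ready.get(r, 0) for r in sources])
        wb_time = id_done + 3       # EX, MA and WB follow the last ID
        if dest is not None:
            ready[dest] = wb_time
        rows.append((name, if_time, id_done - id_first, wb_time))
        next_if = id_first          # IF frees up when this one enters ID
        prev_id_done = id_done
    return rows


code = [("lw", "$8", []), ("addi", "$9", ["$8"])]
print(stall_schedule(code))
# prints [('lw', 1, 0, 5), ('addi', 2, 2, 8)]
```

The addi repeats its ID stage for 2 extra time units and reaches WB at time unit 8 instead of 6. If the addi did not read $8, the same function would report 0 stall units for it.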

Classification of data dependencies:
Read After Write (RAW), the example given. A read of data is needed before it has been
written.
Write After Read (WAR). Given for completeness; not a difficulty for current pipelines in
practice, since the only writing occurs in the last stage.
Write After Write (WAW). Given for completeness; not a difficulty for current pipelines in
practice, since the only writing occurs in the last stage.
NOTE: there is no difficulty implementing a 2-stage pipeline due to data dependencies!
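The three classes above can be told apart just by comparing which registers two instructions read and write. A minimal sketch, assuming each instruction is described by a hypothetical (destination, sources) pair and that the first instruction comes earlier in program order:

```python
def classify(first, second):
    """Return the set of dependency types from `first` to a later `second`.

    Each instruction is (dest_register_or_None, [source_registers]).
    """
    first_dest, first_srcs = first
    second_dest, second_srcs = second
    kinds = set()
    if first_dest is not None and first_dest in second_srcs:
        kinds.add("RAW")    # second reads what first writes
    if second_dest is not None and second_dest in first_srcs:
        kinds.add("WAR")    # second writes what first reads
    if first_dest is not None and first_dest == second_dest:
        kinds.add("WAW")    # both write the same register
    return kinds


lw = ("$8", [])             # lw   $8, data1
addi = ("$9", ["$8"])       # addi $9, $8, 1
print(classify(lw, addi))
# prints {'RAW'}
```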
Control dependencies
What happens to a pipeline in the case of branch instructions?
MAL CODE SEQUENCE:
b label1
addi $9, $8, 1
label1: mult $8, $9
which            time units
instruction   1   2   3   4   5   6   7   8 . . .
b             IF  ID  EX  MA  WB
                              ^^ (PC changed here)
addi              IF  ID  EX  MA  WB
                  ^^ (WRONG instruction fetched here!)
Whenever the PC changes (except for PC <- PC + 4), we have a control dependency.
Control dependencies break pipelines. They cause performance to plummet.
So, lots of (partial) solutions have been implemented to try to help the situation. Worst case, the
pipeline must be stalled such that instructions are going through sequentially.
Note that just stalling does not really help, since the (potentially) wrong instruction is fetched before
it is determined that the previous instruction is a branch.
How to minimize the effect of control dependencies on pipelines.
Easiest solution (poor performance):
Cancel anything (later) in the pipe when a branch (jump) is decoded. This works as long as
nothing changes the program's state before the cancellation. Then let the branch instruction
finish ("flush the pipe"), and start up again.
which            time units
instruction   1   2   3   4   5   6   7   8   9   10 . . .
b             IF  ID  EX  MA  WB
                              ^^ (PC changed here)
addi              IF
                  ^^ (cancelled)
mult                                  IF  ID  EX  MA  WB
Branch prediction (static or dynamic):

Add lots of extra hardware to try to help.
static branch prediction
Assume that the branch will not be taken.
When the decision is made, the hw "knows" if the correct instruction has been
partially executed.
If the correct instruction is currently in the pipe, let it (and all those after it) continue.
Then, there will be NO holes in the pipe. If the incorrect instruction is currently in
the pipe, (meaning that the branch was taken), then all instructions currently in the
pipe subsequent to the branch must be BACKED OUT.
NOTE: static branch prediction works quite well with currently popular pipeline
solutions, because no state information is changed until the very last stage of an
instruction. As long as the last stage has not started, backing out is a matter of
stopping the last stage from occurring and getting the PC right.
dynamic branch prediction
Have some extra hw that keeps track of which branches have been taken in the recent
past. Design the hw to presume that a branch will be taken the same way it was
previously. If the guess is wrong, back out as in static branch prediction.
Question for the advanced student: Which is better, static branch prediction or dynamic
branch prediction? Why?
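To experiment with the two schemes, the guesses can be counted over a recorded list of branch outcomes. This is only a sketch: it follows a single branch, the dynamic scheme remembers just the most recent outcome, and both traces are hypothetical. The function names are invented for the illustration.

```python
def static_not_taken(outcomes):
    """Count correct guesses when every branch is predicted not taken."""
    return sum(1 for taken in outcomes if not taken)


def last_outcome(outcomes, initial_guess=False):
    """Count correct guesses when predicting the branch's previous outcome."""
    guess = initial_guess
    correct = 0
    for taken in outcomes:
        if guess == taken:
            correct += 1
        guess = taken       # remember what the branch actually did
    return correct


# Hypothetical trace: a loop-closing branch taken 9 times, then not taken.
loop_branch = [True] * 9 + [False]
# Hypothetical trace: a branch that alternates taken / not taken.
alternating = [True, False] * 5

print(static_not_taken(loop_branch), last_outcome(loop_branch))
# prints 1 8
print(static_not_taken(alternating), last_outcome(alternating))
# prints 5 0
```

The two traces give opposite rankings, so the answer depends on how the branches in real programs behave.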
separate the test from the branch
Make the conditional test and address calculation separate instructions from the one that
changes the PC. This reduces the number of holes in the pipe.
delayed branch, the MIPS solution.

The concept: prediction is always wrong some of the time. There will be holes in the pipe when the
prediction is wrong. So the goal is to reduce (eliminate?) the number of holes in the case of a
branch.
The mechanism:
Have the effect of a branch (the change of the PC) be delayed until a subsequent instruction.
This means that the instruction following a branch is executed independent of whether the
branch is to be taken or not.
(NOTE: the simulator completely ignores this delayed branch mechanism!)
code example:
add $8, $9, $10
beq $3, $4, label
move $18, $5
.
.
.
label: sub $20, $21, $22
Note that in this code example, we want one of two possibilities for the code that gets
executed:
1)
add $8, $9, $10
beq $3, $4, label
move $18, $5
or 2)
add $8, $9, $10

beq $3, $4, label
sub $20, $21, $22
In both cases, the add and beq instructions are executed.

The code is turned into the following by a MIPS assembler:
add $8, $9, $10
beq $3, $4, label
nop                 # really a pipeline hole, the DELAY SLOT
move $18, $5
.
.
.
label: sub $20, $21, $22
If the assembler has any smarts at all, it would REARRANGE the code to be
beq $3, $4, label
add $8, $9, $10
move $18, $5
.
.
.
label: sub $20, $21, $22
This code can be rearranged only if there are no data dependencies between the branch and
the add instructions. In fact, any instruction from before the branch (and after any previous
branch) can be moved into the DELAY SLOT, as long as there are no dependencies on it.
Delayed branching depends on a smart assembler (sw) to make the hardware perform at
peak efficiency. This is a general trend in the field of computer science. Let the sw do more
and more to improve performance of the hw.
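The check an assembler would have to make before filling the delay slot can be sketched as follows. This is an illustration under simplifying assumptions, not how any particular MIPS assembler works: it only considers the instruction immediately before the branch, and it assumes a conditional branch such as beq that reads registers but writes none. The function name and the (destination, sources) representation are hypothetical.

```python
def can_fill_delay_slot(candidate, branch):
    """True if `candidate`, which directly precedes `branch`, may move after it.

    Each instruction is (dest_register_or_None, [source_registers]).
    The branch must not read a register that the candidate writes.
    """
    cand_dest, _ = candidate
    _, branch_srcs = branch
    return cand_dest is None or cand_dest not in branch_srcs


beq = (None, ["$3", "$4"])                              # beq $3, $4, label
print(can_fill_delay_slot(("$8", ["$9", "$10"]), beq))  # add $8, $9, $10
# prints True
print(can_fill_delay_slot(("$3", ["$9", "$10"]), beq))  # add $3, $9, $10
# prints False
```

In the second case the branch tests $3, so moving the add after the branch would make the test use the old value; the assembler would have to leave a nop in the delay slot.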
An aside, on condition codes
A historically significant way of branching. Condition codes were used on MANY machines before
pipelining became popular.
4 1-bit registers (condition code register):
N -- negative
V -- overflow
P -- positive
Z -- zero
The result of an instruction set these 4 bits. Conditional branches were then based on these flags.
Example: bn label # branch to label if the N bit is set
Earlier computers had virtually every instruction set the condition codes. This had the effect that the
test (for the branch) needed to come directly before the branch.
Example:
sub r3, r4, r5
bn label
# blt $4, $5, label
A performance improvement (sometimes) to this allowed the programmer to explicitly specify
which instructions should set the condition codes. In this way, (on a pipelined machine) the test
could be separated from the branch, resulting in fewer pipeline holes due to data dependencies.
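How a subtract could set the four condition code bits can be sketched like this. The exact rules differed from machine to machine, so this is one plausible definition rather than any specific architecture: N, Z and P describe the stored result as a two's complement number, and V is set when the true difference does not fit in the register. The function name sub_flags is made up, and the inputs are assumed to be in range for the given width.

```python
def sub_flags(a, b, bits=32):
    """Return (result, flags) for a - b on two's complement values."""
    mask = (1 << bits) - 1
    sign = 1 << (bits - 1)
    raw = (a - b) & mask                                # keep low `bits` bits
    result = raw - (1 << bits) if raw & sign else raw   # reinterpret as signed
    flags = {
        "N": result < 0,
        "Z": result == 0,
        "P": result > 0,
        "V": result != a - b,   # the true difference did not fit
    }
    return result, flags


result, flags = sub_flags(3, 5)     # like sub r3, r4, r5 with r4 = 3, r5 = 5
print(result, flags["N"])
# prints -2 True
```

With N set, a following bn label would branch, which is the effect of blt on the same two operands.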
Copyright Karen Miller, 2006
