Advanced Topics in Computer Architecture ECE 7373


Pauline Markenscoff
N320 Engineering Building 1
Multiple issue processors
Single issue processors:
Eliminate data and control stalls to achieve an ideal CPI of 1.
Multiple issue processors:
Reduce CPI below 1.

Multiple issue processors:

Superscalar processors (Dynamic issue capability)

- Statically scheduled (in-order execution)
  - Diminishing advantages as the issue width grows;
    used primarily for narrow issue widths, usually just two instructions
  - Early superscalar processors
  - Embedded processors
- Dynamically scheduled (out-of-order execution)
  - Uses techniques based on Tomasulo's algorithm
  - Most leading-edge desktops and servers
  - Typical superscalars issue from 0 to 8 instructions per clock
VLIW (Very long instruction word) processors
(Static issue capability)
- Inherently statically scheduled
VLIW processors

Issue a fixed number of instructions formatted as either

- one large instruction or
- a fixed instruction packet with the parallelism among
instructions explicitly indicated by the instruction
(EPIC: Explicitly Parallel Instruction Computers)
Intel IA-64 (Itanium)
A superscalar has dynamic issue capability.
The hardware makes all decisions about multiple issue dynamically.

A VLIW processor has static issue capability.

The compiler makes all decisions about multiple issue.
Approaches to Multiple Issue

Fig. 3.15
Statically Scheduled Superscalar Processors

Instructions issue in order

All pipeline hazards are checked at issue time
Pipeline control logic must check for hazards
- among the instructions being issued in a given clock cycle,
- between the issuing instructions and all those still in execution.

Statically Scheduled Superscalar Processors

If some instruction in the instruction stream

is dependent (i.e., will cause a data hazard) or
does not meet the issue criteria (structural hazard),
then only the instructions preceding it will be issued.

The pipeline receives from one to k instructions from the instruction
fetch unit, where k is the width of the issue packet.
Issue packet
- The set of instructions that could potentially issue.
If an instruction would cause a structural hazard or data hazard
- either due to an earlier instruction already in execution
- or earlier in the issue packet
then the instruction is not issued.
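
The in-order issue rule above can be sketched in Python. This is a minimal illustration, not the actual issue logic of any processor: the `Instr` record, the `issue_in_order` function, and the model of one functional unit per kind are all assumptions made for the example, and only RAW dependences are checked.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Instr:
    """Hypothetical instruction record used only for this sketch."""
    name: str
    unit: str                       # functional unit needed, e.g. "int" or "fp"
    dest: Optional[str] = None      # destination register, if any
    srcs: List[str] = field(default_factory=list)


def issue_in_order(packet, busy_units, pending_dests):
    """Return the leading instructions of `packet` that can issue this cycle.

    busy_units:    units occupied by instructions already in execution
    pending_dests: registers whose results are not yet available
    """
    issued = []
    units_taken = set(busy_units)
    results_pending = set(pending_dests)
    for instr in packet:
        structural = instr.unit in units_taken
        raw = any(src in results_pending for src in instr.srcs)
        if structural or raw:
            break  # in-order issue: nothing after this instruction may issue
        issued.append(instr)
        units_taken.add(instr.unit)
        if instr.dest is not None:
            results_pending.add(instr.dest)
    return issued
```

With a packet holding `L.D F0, 0(R1)` followed by `ADD.D F4, F0, F2`, only the load issues, because the add depends on a result produced earlier in the same packet.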

Issue checks are complex

Performing them in one clock cycle would mean that the
issue logic determines the minimum clock cycle length.

In many statically scheduled and all dynamically scheduled superscalars,
the issue stage is split and pipelined so that it can issue every
clock cycle.
A Statically Scheduled Superscalar MIPS Processor
Two instructions can be issued per clock cycle
One of the instructions can be an integer operation
- a load, store, move (Integer or FP)
- branch or integer ALU
The other can be any FP operation.
Issuing an integer operation in parallel with an FP operation is much
simpler and less demanding than arbitrary dual issue.

Integer and FP operations use different register sets and different
functional units.
Most hazard possibilities within the issue packet are eliminated
- Sufficient in many cases to look only at the opcodes of the instructions
The need for additional hardware is minimized.
The only difficulties arise when the integer instruction is an FP load, store, or move:

If the first instruction is an FP load and the second an FP operation, or

if the first instruction is an FP operation and the second an FP store,
then there is the possibility of a

RAW hazard
- When the second instruction of the pair depends on the first
Structural hazard
- Contention for the FP register ports

Allowing FP loads and stores to issue with FP operations

creates the need for an additional read/write port on the FP
register file
increases the need for bypass paths
- to avoid RAW hazards
Highly desirable capability for performance reasons.
There is also the possibility of WAR and WAW hazards across
issue packet boundaries.

The use of the restriction that one instruction is

integer and the other one is FP

represents a structural hazard but

reduces complexity of hazard detection
It is common in multiple-issue processors.
Issuing two instructions per cycle will require
fetching and decoding 64 bits of instructions.

Early superscalars often limited the placement of the

instruction types
Integer instruction must be first
Modern superscalars dropped this restriction.
Assuming instruction placement is not limited

Steps in fetch and issue:

- Fetch two instructions from the cache

- Determine whether zero, one or two instructions can issue
- Issue them to the correct functional unit
Superscalar pipeline

All FP ops are adds (3 execution clock cycles)
Integer instruction is always shown first, although it may be the second
instruction in the issue packet.
The rate at which instructions can be issued has been substantially boosted.
To improve the rate at which instructions are executed
Pipelined FP units
Multiple independent FP units.
Maintaining precise exception model

Possibility of an imprecise exception:

An FP instruction can finish execution after an integer instruction that is
later in the program
The FP instruction exception could be detected after the integer
instruction completed.
Need to
restore a precise exception state before resuming execution or
delay instruction completion until we know an exception is impossible
Maintaining peak throughput (CPI = 0.5) for a dual-issue
pipeline is much harder than for a single-issue pipeline.
Single issue pipeline
Loads had a latency of one clock cycle, which prevented one instruction from
using the result of the load without stalling.
Dual issue pipeline
The result of a load cannot be used in the same clock cycle or the next clock
cycle, and hence the next two or three instructions cannot use the result of the
load without stalling, depending on whether the load is the first or second
instruction in the pair.
The branch delay for a taken branch
becomes either two or three instructions,
depending on whether the branch
is the first or the second instruction
of a pair.
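
The "one instruction" versus "two or three instructions" counts above follow from simple slot arithmetic. The sketch below is one way to express it; the function name `stalled_slots` and its parameters are hypothetical, and the formula is a generalization assumed for illustration rather than something stated in the slides.

```python
def stalled_slots(issue_width, position, latency_cycles):
    """Count the instruction slots after a producer that cannot use its
    result without stalling.

    issue_width:    instructions issued per clock cycle
    position:       0-based slot of the producer within its issue packet
    latency_cycles: later clock cycles in which the result is unavailable
    """
    if not 0 <= position < issue_width:
        raise ValueError("position must lie inside the issue packet")
    rest_of_packet = issue_width - 1 - position   # slots issued in the same cycle
    return rest_of_packet + latency_cycles * issue_width


single_issue = stalled_slots(1, 0, 1)   # 1 slot
dual_first = stalled_slots(2, 0, 1)     # 3 slots: load is first in the pair
dual_second = stalled_slots(2, 1, 1)    # 2 slots: load is second in the pair
```

The same arithmetic gives the two-or-three-instruction delay for a taken branch.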
To effectively exploit parallelism available in a superscalar
processor we need
- more ambitious compiler or
- hardware scheduling techniques.
Because of the diminishing advantages of a statically
scheduled superscalar as the issue width grows, most
designers choose to implement either

a VLIW or
a dynamically scheduled superscalar.
Multiple Instruction Issue with Dynamic Scheduling

Dynamic scheduling can increase performance

Even in the presence of hazards

Allows the processor to eliminate the issue restrictions until the
hardware runs out of reservation stations.
Extend Tomasulo's algorithm
to support a dual-issue superscalar pipeline

Instructions are issued to the reservation stations in-order

(otherwise we would have violation of program semantics).
How is branch prediction integrated into a
dynamically scheduled pipeline?

- Instructions are fetched and issued based on branch
predictions, but executed only after the branch has
completed (IBM 360/91):
  - Static branch prediction scheme.
- Instructions are executed based on branch predictions.
  - Speculation

Multiple issue with Speculation

- Process multiple instructions per clock, assigning
reservation stations and reorder buffers to the instructions
- Must be able to handle multiple commits per clock

Consider the following loop:

Loop: LD     R2, 0(R1)     ; R2 = array element
      DADDIU R2, R2, #1    ; increment array element
      SD     R2, 0(R1)     ; store result
      DADDIU R1, R1, #8    ; increment pointer
      BNE    R2, R3, Loop  ; branch if not last element

Assume separate integer functional units for

Effective address calculation

ALU operations
Branch condition evaluation
A dual issue dynamically scheduled processor
- Without and with speculation
Up to two instructions of any type can commit per clock.
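
The two-commits-per-clock limit can be pictured as draining a reorder buffer from its head. The following is a rough sketch under assumed names (`RobEntry`, `commit_cycle`); it ignores branch mispredictions and exceptions and models only the in-order, bounded-width commit.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class RobEntry:
    """Hypothetical reorder buffer entry for this sketch."""
    name: str
    done: bool   # True once the instruction has written its result


def commit_cycle(rob, max_commits=2):
    """Commit up to `max_commits` finished instructions from the head of
    the reorder buffer, in program order, and return their names."""
    committed = []
    while rob and len(committed) < max_commits and rob[0].done:
        committed.append(rob.popleft().name)
    return committed
```

An unfinished instruction at the head blocks every younger instruction behind it, even if those have finished, which is what keeps commit in program order.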
Time of Issue, Execution, and Writing result for a
Dual-issue version of our pipeline without speculation

Fig. 3.19

Any instructions following a branch cannot start execution until after the
branch condition has been evaluated.
For 3 iterations:
Issue rate: 14 instructions in 8 clock cycles = 14/8 = 1.75
Execution rate: 15 instructions in 19 clock cycles = 15/19 = 0.79
Time of Issue, Execution, and Writing result for a
Dual-issue version of our pipeline without speculation

Fig. 3.19

For 3 iterations:
Issue rate = 1.75 (14 instructions in 8 clock cycles = 14/8 = 1.75)
Execution rate = 0.79 (15 instructions in 19 clock cycles = 15/19 = 0.79)
Because the completion rate rapidly falls behind the issue rate, the
nonspeculative processor will stall after a few more iterations are issued.
Performance of nonspeculative processor can be
improved by allowing memory access instructions to
complete effective address calculation before a
branch is decided.

Improvement will be small, unless speculative

memory accesses are allowed.
Time of Issue, Execution, and Writing result for a
Dual-issue version of our pipeline with speculation

Fig. 3.20

Instructions following a branch can start execution before the branch

condition has been evaluated.
Issue rate: 14 instructions in 8 clock cycles = 14/8 = 1.75
Execution rate: 15 instructions in 13 clock cycles = 15/13 = 1.15
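
The rates quoted for the two pipelines can be checked with a few lines of Python. The instruction and cycle counts are taken from the slides; the helper name `rate` and the derived speedup figure are additions for illustration.

```python
def rate(instructions, clock_cycles):
    """Instructions per clock cycle."""
    return instructions / clock_cycles


issue_rate = rate(14, 8)                 # same for both pipelines
exec_without_speculation = rate(15, 19)
exec_with_speculation = rate(15, 13)

# Three iterations finish in 13 cycles instead of 19.
speedup = 19 / 13

print(f"issue rate:            {issue_rate:.2f}")
print(f"without speculation:   {exec_without_speculation:.2f}")
print(f"with speculation:      {exec_with_speculation:.2f}")
print(f"speedup from speculation: {speedup:.2f}")
```

The issue rate is identical in both cases; speculation helps only by letting execution keep closer pace with issue.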
The VLIW Approach
Major factor limiting wider-issue superscalar processors:
Growth in overhead
VLIW processors
Issue a fixed number of instructions formatted as either
- one large instruction or
- a fixed instruction packet with the parallelism among instructions
explicitly indicated by the instruction
Use multiple independent functional units
- To keep functional units busy there must be enough parallelism in a code
sequence to fill the available operation slots.

The VLIW Approach

Parallelism is uncovered by the compiler by unrolling loops

and scheduling the code.

Local scheduling techniques

- If unrolling generates straight line code
Global scheduling techniques
- Scheduling code across branches.

Consider again the loop that

Increments a vector of values by a scalar stored in F2;
Starts with the element of the vector at location 0(R1), which is the highest
address, and ends at 8(R2).

Loop: L.D    F0, 0(R1)     ; F0 = array element
      ADD.D  F4, F0, F2    ; add scalar in F2
      S.D    F4, 0(R1)     ; store result
      DADDUI R1, R1, #-8   ; decrement pointer by 8 bytes (per DW)
      BNE    R1, R2, Loop  ; branch if R1 != R2
Without any scheduling

                           Clock cycle issued
Loop: L.D    F0, 0(R1)     1
      stall                2
      ADD.D  F4, F0, F2    3
      stall                4
      stall                5
      S.D    F4, 0(R1)     6
      DADDUI R1, R1, #-8   7
      stall                8
      BNE    R1, R2, Loop  9
      stall                10

10 clock cycles per iteration
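
The unscheduled timing above can be reproduced with a small single-issue model. This is only a sketch: the stall counts in `STALLS` and `BRANCH_DELAY` are assumptions chosen to match the stalls shown in the schedule, and the names `LOOP`, `issue_cycles`, and `cycles_per_iteration` are hypothetical.

```python
# Stall cycles required between a producer kind and a dependent consumer kind.
# Assumed values, picked to match the stalls in the schedule above.
STALLS = {
    ("load", "fp_alu"): 1,
    ("fp_alu", "store"): 2,
    ("int_alu", "branch"): 1,
}
BRANCH_DELAY = 1  # assumed: one stall cycle after the branch

LOOP = [
    # (text, kind, destination register, source registers)
    ("L.D    F0, 0(R1)", "load", "F0", ["R1"]),
    ("ADD.D  F4, F0, F2", "fp_alu", "F4", ["F0", "F2"]),
    ("S.D    F4, 0(R1)", "store", None, ["F4", "R1"]),
    ("DADDUI R1, R1, #-8", "int_alu", "R1", ["R1"]),
    ("BNE    R1, R2, Loop", "branch", None, ["R1", "R2"]),
]


def issue_cycles(loop):
    """Return the clock cycle in which each instruction issues (in order,
    one instruction per cycle, stalling on data dependences)."""
    last_writer = {}   # register -> (issue cycle, kind) of its latest producer
    cycles = []
    cycle = 0
    for _text, kind, dest, srcs in loop:
        earliest = cycle + 1
        for reg in srcs:
            if reg in last_writer:
                p_cycle, p_kind = last_writer[reg]
                needed = p_cycle + 1 + STALLS.get((p_kind, kind), 0)
                earliest = max(earliest, needed)
        cycle = earliest
        cycles.append(cycle)
        if dest is not None:
            last_writer[dest] = (cycle, kind)
    return cycles


def cycles_per_iteration(loop):
    """Issue cycle of the final branch plus the branch delay."""
    return issue_cycles(loop)[-1] + BRANCH_DELAY
```

Under these assumptions the model issues the five instructions in cycles 1, 3, 6, 7, and 9 and reports 10 cycles per iteration, matching the table.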
The VLIW Approach
Consider a VLIW processor that can issue in every clock cycle
Two memory references
Two FP operations
One integer operation/branch
The instruction would have a set of fields for each functional unit:
16-24 bits per unit
Instruction length of between 80 and 120 bits.
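
The 80 to 120 bit range follows directly from the slot counts listed above. A short check, with hypothetical constant and function names:

```python
# Operation slots per VLIW instruction, as listed above.
MEMORY_SLOTS = 2
FP_SLOTS = 2
INTEGER_OR_BRANCH_SLOTS = 1
TOTAL_SLOTS = MEMORY_SLOTS + FP_SLOTS + INTEGER_OR_BRANCH_SLOTS


def instruction_length_bits(bits_per_unit):
    """Length of one VLIW instruction if every slot uses the same field width."""
    return TOTAL_SLOTS * bits_per_unit


shortest = instruction_length_bits(16)   # 5 slots x 16 bits
longest = instruction_length_bits(24)    # 5 slots x 24 bits
```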

Fig. 3.16

Issue rate: 23 operations in 9 clock cycles = 2.5 operations per cycle

Execution rate: 7 results in 9 clock cycles = 0.77 results per cycle
Efficiency: percentage of available slots that contain an operation, about 60%
Loop: L.D    F0, 0(R1)
      ADD.D  F4, F0, F2
      S.D    F4, 0(R1)
      DADDUI R1, R1, #-8
      BNE    R1, R2, Loop

Unroll as many times as

necessary to eliminate any stalls

Fig. 3.16

Increase in code size
- Ambitious unrolling of loops
- Instructions might not be full, and unused functional units translate to
wasted bits.
Limitations of lockstep operation
- A stall in any functional unit must cause the entire processor to stall.
