
Tcpu = IC x CPI x Tc

Amdahl's law: Speedup = Told / Tnew = 1 / ((1 - f) + f/s),
where f is the fraction of execution that is enhanced, s is the speedup of that fraction, and the sequential part is 1 - f.
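
The two formulas above can be checked with a quick numeric sketch (all numbers are hypothetical and the helper names `cpu_time` / `amdahl_speedup` are illustrative):

```python
def cpu_time(ic, cpi, tc):
    # Tcpu = IC x CPI x Tc
    return ic * cpi * tc

def amdahl_speedup(f, s):
    # f: fraction of execution that is enhanced (parallelizable)
    # s: speedup of that fraction; (1 - f) is the sequential part
    return 1.0 / ((1.0 - f) + f / s)

# 1M instructions, CPI = 2, 1 ns clock -> 2 ms total
print(cpu_time(1_000_000, 2.0, 1e-9))

# 90% of the program sped up 10x gives only ~5.26x overall,
# because the 10% sequential part dominates
print(amdahl_speedup(0.9, 10))
```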

Hazard: Situations that prevent starting the next instruction in the next cycle
• Structure hazards: A required resource is busy.
• Data hazard (RAW, WAR, WAW): Need to wait for previous instruction to complete its data read/write.
• Control hazard: Deciding on control action depends on previous instruction.

Instruction-level parallelism (ILP): a measure of how many instructions in a computer program
can be executed at the same time; exploited to increase performance.
More issue slots increase the complexity of the control logic, and that will limit the performance increase.

Pipelining: overlapping the execution of multiple instructions to exploit ILP and reduce cycle time.
Deeper pipeline: Less work per stage (more stages) => shorter clock cycle.
Multiple issue: Replicate pipeline stages (multiple pipelines)
Start multiple instructions per clock cycle.
CPI < 1, so use Instructions Per Cycle (IPC).
But dependencies reduce this in practice.
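
The IPC idea above can be made concrete with a tiny sketch (`ipc` and the cycle counts are illustrative, not measured):

```python
def ipc(instructions, cycles):
    # With multiple issue, CPI < 1, so instructions per cycle (IPC)
    # is the more natural metric.
    return instructions / cycles

# A 2-issue pipeline ideally retires 2 instructions per cycle...
ideal = ipc(200, 100)    # 2.0
# ...but dependencies add stall cycles and lower the achieved IPC
actual = ipc(200, 160)   # 1.25
```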
1. Static multiple issue (Software approach - VLIW) "the tasks performed by the compiler, e.g. in loop unrolling":
1- Compiler groups instructions to be issued together.
2- Packages them into "issue slots".
3- Compiler detects and avoids hazards. (Scheduling)
2. Dynamic multiple issue (Hardware approach - Superscalar)
-CPU examines instruction stream and chooses instructions to issue each cycle.
-Avoids relying on the compiler to reorder instructions (though it may still help).
-CPU resolves hazards using advanced techniques at runtime.

Speculation: guess the outcome of an instruction whose result is not yet available.
1- Start the operation as soon as possible. 2- Check whether the guess was right.
If so, complete the operation.
If not, roll back and do the right thing.
The result of the guess must be buffered until we are sure it is correct.
Advantage: saves time if the guess is correct. Disadvantage: if the guess is wrong, the buffered work is discarded, so CPU power was consumed for nothing.
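
A toy Python sketch of this buffer-then-commit-or-squash idea (not any real pipeline; `speculate` and its arguments are hypothetical names):

```python
def speculate(compute_guess, guess_was_right):
    buffered = compute_guess()        # start the operation early, buffer result
    if guess_was_right():             # later, check whether the guess was right
        return ("commit", buffered)   # correct: complete the operation
    return ("squash", None)           # wrong: roll back, buffered work discarded

# A correct guess saves time; a wrong one wasted the work (and power).
print(speculate(lambda: 42, lambda: True))
print(speculate(lambda: 42, lambda: False))
```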

Scheduling Static Multiple Issue: Compiler must remove some/all hazards by:
1- Speculation: reorder instructions into issue packets. (Move a load before the branch; can include fix-up
instructions to recover from an incorrect guess.)
2- Pad with nops if necessary.
3- No dependencies within a packet. (Possibly some dependencies between packets)

Hardware can look ahead for instructions to execute: - Buffer the results until it determines they are
actually needed. - Flush buffers on incorrect speculation

What if an exception occurs on a speculatively executed instruction?


Static speculation: Can add ISA support for deferring exceptions.
Dynamic speculation: Can buffer exceptions until instruction completion (which may not occur).

Loop Unrolling: Replicate loop body to expose more parallelism.


Register renaming: Use different registers per replication.
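
The unrolling-plus-renaming idea can be sketched in Python (`sum_unrolled` is an illustrative name; real unrolling is done by the compiler on machine code, with separate registers playing the role of the four accumulators):

```python
def sum_unrolled(xs):
    s0 = s1 = s2 = s3 = 0      # four "renamed registers", one per replica
    i, n = 0, len(xs)
    while i + 4 <= n:          # loop body replicated (unrolled) 4x
        s0 += xs[i]            # the four adds are independent, so they
        s1 += xs[i + 1]        # expose parallelism to the hardware
        s2 += xs[i + 2]
        s3 += xs[i + 3]
        i += 4
    tail = sum(xs[i:])         # leftover iterations when n is not a multiple of 4
    return s0 + s1 + s2 + s3 + tail
```

Using one shared accumulator instead would reintroduce a dependence between the replicas; the separate accumulators are what register renaming buys.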
VLIW processor allows programs to explicitly specify instructions that will be executed at the same time (in
parallel). Package multiple operations into one instruction.

Advantages: -Reduce hardware complexity -Less design time. -Shorter cycle time.
-Better performance. -Reduced power consumption.

Disadvantages/downsides of VLIW / loop unrolling:


-Larger Code size. -No hazard detection hardware/some dependencies aren’t predictable.
-Statically finding parallelism. -Binary code compatibility.

How do VLIW designs reduce hardware complexity?


-less multiple-issue hardware. -Simpler instruction dispatch
-no dependence checking for instructions within a VLIW instruction (bundle)
-can be fewer paths between instruction issue slots & FUs
-no out-of-order execution, no instruction grouping.

Dynamic Multiple Issue “Superscalar”:


Disadvantage: consumes more power. Advantage: no need for compiler support.
If the operand is available in the register file or reorder buffer:
-Copied to the reservation station
-No longer required in the register; it can be overwritten

If the operand is not yet available:


-It will be provided to the reservation station by a function unit
-Register update may not be required

Reservation station: Holds instructions waiting to execute.


Provides forwarding to reduce RAW hazards.
Provides out-of-order execution.

Re-order buffer (ROB): a circular queue with head and tail pointers, used in a superscalar processor.
It stores the results of instructions, reorders them, and writes them back in program order even if they
were executed out of order. It uses register renaming to remove some dependences. Sends missing operands to the reservation stations.
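
A toy sketch of the reorder-buffer queue described above, assuming in-order allocation and commit with out-of-order completion (the `ROB` class and its method names are illustrative, not a real design):

```python
from collections import deque

class ROB:
    def __init__(self):
        self.entries = deque()                # queue: head = left, tail = right

    def allocate(self, tag):
        # Entries are allocated at the tail in program order.
        self.entries.append({"tag": tag, "value": None, "done": False})

    def complete(self, tag, value):
        # Results may arrive out of order from the functional units.
        for e in self.entries:
            if e["tag"] == tag:
                e["value"], e["done"] = value, True

    def commit(self):
        # Retirement is strictly in order from the head: stop at the
        # first entry that has not completed yet.
        committed = []
        while self.entries and self.entries[0]["done"]:
            committed.append(self.entries.popleft()["value"])
        return committed
```

Even though the second instruction finishes first, nothing commits until the head entry is done, which is how in-order state is preserved.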

Why do dynamic scheduling? Why not just let the compiler schedule code?
-Not all stalls are predictable (cache misses).
-Can’t always schedule around branches (Branch outcome is dynamically determined)
-Different implementations of an ISA have different latencies and hazards.

Does multiple issue work?


Yes, but not as much as we’d like
-Programs have real dependencies that limit ILP
-Some dependencies are hard to eliminate (pointer aliasing).
-Some parallelism is hard to expose (Limited window size during instruction issue).
-Memory delays and limited bandwidth (Hard to keep pipelines full).
-Speculation can help if done well.

Nop: inserted by the compiler before execution time; it is executed as a normal instruction and moves
through all stages.
Stall: created by the hardware at runtime; it moves through only some of the stages, not all of them.
Parallel programming:
Difficulties? Partitioning - Coordination - Communications overhead
Why? To save time and money by having many resources work together.
Applications? Databases and data mining - real-time simulation of systems - science and engineering - graphics.

SPMD: Single Program Multiple Data


-A parallel program on a MIMD computer
-Conditional code for different processors

Why is SIMD more energy-efficient than MIMD?


-Only needs to fetch one instruction per data operation
-Makes SIMD attractive for personal mobile devices
-SIMD allows programmer to continue to think sequentially

SIMD-based architectures: vector-SIMD, subword-SIMD, SIMT/GPUs

SIMD architectures can exploit significant data-level parallelism for:


- matrix-oriented scientific computing
- media-oriented image and sound processors

Scalar processor: implements instructions that operate on single data items.


Vector processor or array processor: a CPU that implements an instruction set that operates on one-
dimensional arrays of data called vectors.
Registers are controlled by compiler - Used to hide memory latency - Leverage memory bandwidth

Advantages of vector instructions:


-A single instruction specifies a great deal of work (reducing code size)
- simpler hardware - energy-efficient – more lanes
- Since each loop iteration must contain no data dependences on other loop iterations:
- No need to check for data hazards between loop iterations
- Only one check required between two vector instructions
- Loop branches eliminated
Disadvantages:
-Programmer in charge of giving hints to the compiler
-Design issues: number of lanes, functional units, and registers, length of vector registers, exception
handling, conditional operations
-The fundamental design issue is memory bandwidth ( virtual address translation and caching ).
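
The "one instruction specifies a great deal of work" advantage can be sketched with a DAXPY-style operation (`vaxpy` is a hypothetical name; a real vector unit would do this in hardware, one element per lane per clock):

```python
def vaxpy(a, x, y):
    # One "vector instruction": y = a*x + y over whole vectors.
    # No per-element hazard check and no loop branch per element,
    # unlike the scalar loop it replaces.
    return [a * xi + yi for xi, yi in zip(x, y)]

print(vaxpy(2, [1, 2, 3], [10, 20, 30]))
```

The scalar equivalent would execute a multiply-add, an index increment, a compare, and a branch for every element.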

Execution time depends on:
- Length of the operand vectors
- Structural hazards
- Data dependencies
Execution time in VMIPS is approximately the vector length.

Chaining: Allows a vector operation to start as soon as the missing operand appears by forwarding the
results from the first functional unit to the second unit.

Multiple lanes of hardware :


- No communication between lanes - Little increase in control overhead - No need to change machine code

Adding more lanes allows trading off clock rate and energy without reducing performance!

Vector instructions vs. multimedia extensions:

- Multimedia extensions encode the number of data operands into the opcode
- Multimedia extensions have no sophisticated addressing modes
- Vector instructions have a variable vector width; multimedia extensions have a fixed width
- Vector instructions support strided access; multimedia extensions do not

Advantages of multimedia extensions:
- Cost little to add to the standard ALU and are easy to implement
- Require little extra state -> easy to context-switch
- Require little extra memory bandwidth
- No virtual-memory problems of cross-page access and page faults

Thread-level parallelism (TLP): execute independent programs/threads at the same time using separate execution
resources, to increase performance.

Multithreading: multiple threads share the functional units of one processor via overlapping.
Thread: has its own PC, instructions, and data. It may be part of a parallel program of
multiple processes, or it may be an independent program.

Advantages of using multiple instruction streams: improve 1. throughput of computers that run many
programs and 2. execution time of multi-threaded programs; this increases TLP. Every active thread has its own PC,
instructions, and data.

Two types: fine-grained / coarse-grained.

MIMD Multiprocessors:
SMP: Shared-Memory Multiprocessors = Symmetric MultiProcessors:
- Processors communicate with each other through a global shared memory.
- Access to memory from all processors is symmetric.
- Not easy to scale, since there is a single address space for all processors.
- Uniform Memory Access (UMA)

Distributed-memory multiprocessor:
- Scales better with an increased number of processors compared to SMPs
- Communicate by messages through the network.
-Each processor has a private physical address space.
- Non-Uniform Memory Access (NUMA)

Sum Reduction on MIMD:


- Recursively: half of the processors send, the other half receive and add, with synchronization between steps.
synch(): Processors must synchronize before the consumer processor tries to read the results from the
memory location written by the producer processor.
Barrier synchronization: a synchronization scheme where processors wait at the barrier, not proceeding
until every processor has reached it
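
The recursive-halving reduction can be sketched sequentially in Python (`tree_reduce` is an illustrative name; the inner loop stands in for the "receive and add" half of the processors, and on real hardware a barrier would separate the halving steps):

```python
def tree_reduce(partials):
    # partials: one partial sum per "processor"
    vals = list(partials)
    while len(vals) > 1:
        half = (len(vals) + 1) // 2
        # Receivers (first half) add the values sent by the second half;
        # a barrier would go here so no step starts early.
        for i in range(len(vals) - half):
            vals[i] += vals[i + half]
        vals = vals[:half]          # the senders drop out of the next step
    return vals[0] if vals else 0

print(tree_reduce([1, 2, 3, 4, 5]))
```

With p processors this takes about log2(p) steps instead of p - 1 sequential additions.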
