
Datapath - manipulates the data coming through the processor. It also provides a small amount of temporary data storage.
Control - generates control signals that direct the operation of memory and the datapath.
Memory - holds instructions and most of the data for currently executing programs.
Input - external devices such as keyboards, mice, disks, and networks that provide input to the processor.
Output - external devices such as displays, printers, disks, and networks that receive data from the processor.
Why is single cycle inefficient? The longest possible path in the processor determines the clock cycle.
PIPELINING: an implementation technique in which multiple instructions are overlapped in execution.
Time between inst. for pipeline = time between inst. for nonpipeline / # of stages (ideal case).
Pipelining improves performance by ↑ instruction throughput, as opposed to ↓ the execution time of an individual instruction.
Designing instruction sets for pipelining - things that are better in MIPS: all MIPS instructions are the same length; MIPS has only a few instruction formats, with the source register fields located in the same place in each instruction (symmetry); memory operands only appear in loads or stores; operands must be aligned in memory.
PIPELINE HAZARDS: situations where the next instruction cannot execute in the following clock cycle. Structural / Data / Control.
Structural hazard: the hardware does not support the combination of instructions that are set to execute. Example: we have a single memory instead of two (IF and MEM overlap). A nop cannot be used to fix it.
STATIC MULTIPLE ISSUE: • Compiler groups instructions into "issue packets" – a group of instructions that can be issued in a single cycle; independent instructions – determined by the pipeline resources required. • Think of an issue packet as a very long instruction – specifies multiple concurrent operations – ⇒ Very Long Instruction Word (VLIW).
Scheduling Static Multiple Issue • Compiler must remove some/all hazards – reorder instructions into issue packets – no dependencies within a packet – possibly some dependencies between packets (varies between ISAs; compiler must know!) – pad with nop if necessary.
MIPS with Static Dual Issue • Two-issue packets – one ALU/branch instruction – one load/store instruction – 64-bit aligned VLIW (ALU/branch first, then load/store; pad an unused slot with nop).
An important compiler technique to get more performance from loops is loop unrolling, where multiple copies of the loop body are made. After unrolling, there is more ILP available by overlapping instructions from different iterations.
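A minimal C sketch of loop unrolling (the function and variable names are illustrative, not from the notes): the same add-a-constant loop written once normally and once unrolled 4x, which exposes four independent adds per iteration and pays the loop-control overhead once per four elements.

    #include <stddef.h>

    /* Original loop: one add (and one index update / branch) per element. */
    void add_scalar(int *a, size_t n, int s2) {
        for (size_t i = 0; i < n; i++)
            a[i] += s2;
    }

    /* Unrolled 4x: four independent adds per iteration expose more ILP.
       Assumes n is a multiple of 4 to keep the sketch short. */
    void add_scalar_unrolled(int *a, size_t n, int s2) {
        for (size_t i = 0; i < n; i += 4) {
            a[i]     += s2;
            a[i + 1] += s2;
            a[i + 2] += s2;
            a[i + 3] += s2;
        }
    }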
Register renaming: the renaming of registers by the compiler or hardware to remove antidependences (also called name dependences). A name dependence is an ordering forced by the reuse of a name, typically a register, rather than by a true dependence that carries a value between two instructions.
Unrolled (4x) loop scheduled for dual issue:
      ALU/BRA                 LW/SW
L:    addi s1, s1, -16        lw  t0, 0(s1)
      nop                     lw  t1, 4(s1)
      addu t0, t0, s2         lw  t2, 8(s1)
      addu t1, t1, s2         lw  t3, 12(s1)
      addu t2, t2, s2         sw  t0, 16(s1)
      addu t3, t3, s2         sw  t1, 4(s1)
      nop                     sw  t2, 8(s1)
      bne  s1, zero, L        sw  t3, 12(s1)
CPI is 8/14 ≈ 0.57 (14 instructions in 8 cycles). Loop unrolling and scheduling with dual issue gave us an improvement factor of almost 2, partly from ↓ the loop-control instructions and partly from dual-issue execution. The cost of this performance improvement is using 4 temporary registers rather than 1, as well as a significant ↑ in code size.
Data hazard: occurs when the pipeline must be stalled because one step must wait for another to complete. Example: add $s0, $t0, $t1 followed by sub $t2, $s0, $t3 - the second waits for the WB stage of the first. As soon as the ALU creates the sum for the add, we can supply it as an input for the sub. Adding extra hardware to get the missing item early is called forwarding or bypassing.
Forwarding paths are valid only if the destination stage is later in time than the source stage. For example, there cannot be a valid forwarding path from the output of the MEM stage of the first instruction to the input of the EX stage of the following one.
Even with forwarding, we would have to stall one stage for a load-use data hazard.
Control hazard: arises from the need to make a decision based on the results of one instruction while others are executing. One solution is to stall immediately after we fetch a branch. Another is branch prediction. A third is delayed decision (the delayed branch).
The WB stage can lead to a data hazard. The selection of the next value of the PC can lead to a control hazard.
nop is an instruction that does no operation to change state.
Moving the branch execution to the ID stage is an improvement, because it reduces the penalty of a branch to only one instruction if the branch is taken, namely, the one currently being fetched.
DYNAMIC MULTIPLE ISSUE: • "Superscalar" processors • the CPU decides whether to issue 0, 1, 2, … instructions each cycle (avoiding structural and data hazards) • avoids the need for compiler scheduling (though it may still help – code semantics are ensured by the CPU).
Dynamic Branch Prediction • In deeper and superscalar pipelines, the branch penalty is more significant • Use dynamic prediction – branch prediction buffer (aka branch history table) – indexed by recent branch instruction addresses – stores the outcome (taken/not taken) – to execute a branch: • check the table, expect the same outcome • start fetching from fall-through or target • if wrong, flush the pipeline and flip the prediction.
Even with a predictor, we still need to calculate the target address: 1-cycle penalty for a taken branch.
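A minimal C sketch of the 1-bit branch prediction buffer described above: a small table indexed by low-order bits of the branch address, remembering only the last outcome and flipping it on a mispredict. The table size and indexing scheme are illustrative assumptions.

    #include <stdbool.h>
    #include <stdint.h>

    #define BHT_ENTRIES 1024u          /* illustrative size; a power of two */

    static bool bht[BHT_ENTRIES];      /* true = predict taken */

    /* Index with low-order bits of the (word-aligned) branch address. */
    static unsigned bht_index(uint32_t pc) {
        return (pc >> 2) & (BHT_ENTRIES - 1);
    }

    bool predict_taken(uint32_t pc) {
        return bht[bht_index(pc)];
    }

    /* Called once the branch resolves: on a wrong guess the pipeline is
       flushed (not modeled here) and the stored prediction is flipped;
       a 1-bit scheme simply remembers the most recent outcome. */
    void train(uint32_t pc, bool taken) {
        bht[bht_index(pc)] = taken;
    }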
Delayed branching is inefficient for longer pipelines.
An exception arises within the CPU. An interrupt comes from an external I/O controller or a hardware malfunction.
The two types of exceptions that our current implementation can generate are execution of an undefined instruction and an arithmetic overflow. The basic action that the processor must perform when an exception occurs is to save the address of the offending instruction in the exception program counter (EPC) and then transfer control to the operating system at some specified address. The operating system can then take the appropriate action, which may involve providing some service to the user program, taking some predefined action in response to an overflow, or stopping the execution of the program and reporting an error. After performing whatever action is required because of the exception, the operating system can terminate the program or may continue its execution, using the EPC to determine where to restart the execution of the program. For the operating system to handle the exception, it must know the reason for the exception, in addition to the instruction that caused it. There are two main methods used to communicate the reason for an exception. The method used in the MIPS architecture is to include a status register (called the Cause register), which holds a field that indicates the reason for the exception. A second method is to use vectored interrupts. In a vectored interrupt, the address to which control is transferred is determined by the cause of the exception.
Why Do Dynamic Scheduling? • Not all stalls are predictable (cache misses) • Can't always schedule around branches (the branch outcome is dynamically determined) • Different implementations of an ISA have different latencies and hazards.
Does Multiple Issue Work? • Yes, but not as much as we'd like • Programs have real dependencies that limit ILP • Some dependencies are hard to eliminate – e.g., pointer aliasing • Some parallelism is hard to expose – limited window size during instruction issue • Memory delays and limited bandwidth – hard to keep pipelines full • Speculation can help if done well.
CHAPTER 5
Temporal locality (locality in time): if an item is referenced, it will tend to be referenced again soon. Spatial locality (locality in space): if an item is referenced, items whose addresses are close by will tend to be referenced soon.
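A tiny C illustration of both kinds of locality (the array size is arbitrary): the running sum is reused every iteration (temporal locality), and row-major traversal touches consecutive addresses, so neighbouring words of the same cache block are used (spatial locality).

    /* Row-major traversal: consecutive elements sit at consecutive
       addresses, so each cache block fetched is fully used. */
    long sum_matrix(int a[256][256]) {
        long sum = 0;                     /* reused every iteration (temporal) */
        for (int i = 0; i < 256; i++)
            for (int j = 0; j < 256; j++)
                sum += a[i][j];           /* neighbouring words (spatial) */
        return sum;
    }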
Overflow on add in the EX stage: prevent $1 from being clobbered – complete the previous instructions – flush the add and subsequent instructions – set the Cause and EPC register values – transfer control to the handler.
Exception Properties • Restartable exceptions – the pipeline can flush the instruction – the handler executes, then returns to the instruction, which is refetched and executed from scratch • PC is saved in the EPC register – identifies the causing instruction – actually PC + 4 is saved, so the handler must adjust.
PARALLELISM VIA INSTRUCTIONS: Pipelining exploits the potential parallelism among instructions. This parallelism is called instruction-level parallelism (ILP). To increase ILP:
– Deeper pipeline • less work per stage ⇒ shorter clock cycle.
– Multiple issue • replicate pipeline stages ⇒ multiple pipelines • start multiple instructions per clock cycle • CPI < 1, so use instructions per cycle (IPC) • e.g., a 4 GHz 4-way multiple-issue processor: 16 BIPS, peak CPI = 0.25, peak IPC = 4.
Multiple Issue: Static and Dynamic.
• Static multiple issue – compiler groups instructions to be issued together – packages them into "issue slots" – compiler detects and avoids hazards.
• Dynamic multiple issue – CPU examines the instruction stream and chooses instructions to issue each cycle – compiler can help by reordering instructions – CPU resolves hazards using advanced techniques at runtime.
Fastest / smallest / highest cost: SRAM (lowest access time).
**Block: the minimum unit of information that can be either present or not present in a cache. **Hit rate: the fraction of memory accesses found in a level of the memory hierarchy. **Miss rate = 1 – hit rate. **Hit time: the time to access the upper level of the memory hierarchy, which includes the time needed to determine whether the access is a hit or a miss. **The miss penalty is the time to replace a block in the upper level with the corresponding block from the lower level, plus the time to deliver this block to the processor.
Direct-mapped cache: a cache in which each memory location is mapped to exactly one location in the cache: (block address) % (# blocks in the cache).
The tags contain the address information required to identify whether a word in the cache corresponds to the requested word. The valid bit indicates whether an entry contains a valid address. The tag field of the address is compared with the tag field stored in the cache; the cache index is used to select the block.
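A small C sketch of the address breakdown just described for a direct-mapped cache: drop the byte/word offset bits, take the low-order bits of the block address as the index ((block address) % (# blocks)), and keep the remaining upper bits as the tag. The geometry below (64 blocks of 16 bytes) is only an example.

    #include <stdint.h>
    #include <stdio.h>

    #define INDEX_BITS  6                   /* n: 2^6 = 64 blocks          */
    #define WORD_BITS   2                   /* m: 2^2 = 4 words per block  */
    #define OFFSET_BITS (WORD_BITS + 2)     /* + 2 byte-offset bits        */

    void map_address(uint32_t addr) {
        uint32_t block_addr = addr >> OFFSET_BITS;               /* byte addr / block size   */
        uint32_t index = block_addr & ((1u << INDEX_BITS) - 1);  /* block addr % # of blocks */
        uint32_t tag   = addr >> (OFFSET_BITS + INDEX_BITS);     /* remaining upper bits     */
        printf("addr %u -> tag %u, index %u\n",
               (unsigned)addr, (unsigned)tag, (unsigned)index);
    }

    int main(void) {
        map_address(1200);   /* 1200 / 16 = 75, 75 % 64 = 11 -> index 11 */
        return 0;
    }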
Speculation: an approach whereby the compiler or processor guesses the outcome of an instruction to remove it as a dependence in executing other instructions. Any speculation mechanism must include both a method to check if the guess was right and a method to unroll or back out the effects of the instructions that were executed speculatively. Speculation may be done in the compiler or by the hardware. For example, the compiler can use speculation to reorder instructions, moving an instruction across a branch or a load across a store. The processor hardware can perform the same transformation at runtime.
The recovery mechanisms used for incorrect speculation are rather different. In the case of speculation in software, the compiler usually inserts additional instructions that check the accuracy of the speculation and provide a fix-up routine to use when the speculation is incorrect. In hardware speculation, the processor usually buffers the speculative results until it knows they are no longer speculative. If the speculation is correct, the instructions are completed by allowing the contents of the buffers to be written to the registers or memory. If the speculation is incorrect, the hardware flushes the buffers and re-executes the correct instruction sequence.
Scheduling the simple loop for dual issue:
Loop: lw   $t0, 0($s1)       (I1)
      addu $t0,$t0,$s2       (I2)
      sw   $t0, 0($s1)       (I3)
      addi $s1,$s1,–4        (I4)
      bne  $s1,$zero,Loop    (I5)
      ALU/BRA                    LW/SW                  CLK CYCLE
Loop: nop                        I1: lw $t0, 0($s1)     1
      I4: addi $s1,$s1,–4        nop                    2
      I2: addu $t0,$t0,$s2       nop                    3
      I5: bne $s1,$zero,Loop     I3: sw $t0, 4($s1)     4
HAZARDS in Dual-Issue MIPS: EX data hazards (ALU), load-use hazards (2-instruction stall), so more aggressive scheduling is required.
*** In the MIPS architecture, since words are aligned to multiples of 4 bytes, the least significant 2 bits of every address specify a byte within a word. Hence, the least significant two bits are ignored when selecting a word in the block.
*** ■ 32-bit addresses ■ a direct-mapped cache ■ the cache size is 2^n blocks, so n bits are used for the index ■ the block size is 2^m words (2^(m+2) bytes), so m bits are used for the word within the block and 2 bits are used for the byte offset. The size of the tag field is 32 - (n + m + 2). The total number of bits in a direct-mapped cache is 2^n x (2^m x 32 + (32 - n - m - 2) + 1). Such a cache holds 2^(n+m+2) bytes of data.
EX: How many total bits are required for a direct-mapped cache with 16 KiB of data and 4-word blocks, assuming a 32-bit address? (16-byte blocks, 4 offset bits.) 2^10 blocks, 4 bits for offset, 18 bits for tag, 1 valid bit: 2^10 x (4x32 + 18 + 1) = 147 Kibit, or about 18.4 KiB of storage for 16 KiB of cached data.
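A small C check of the sizing formula above, using the numbers from this example (n = 10 index bits, m = 2 word bits); the constants are taken straight from the worked example.

    #include <stdio.h>

    int main(void) {
        int n = 10;                      /* 2^10 = 1024 blocks (16 KiB / 16 B) */
        int m = 2;                       /* 2^2  = 4 words per block           */
        int tag_bits = 32 - (n + m + 2); /* = 18                               */

        long bits_per_block = (1L << m) * 32 + tag_bits + 1;  /* data + tag + valid = 147 */
        long total_bits     = (1L << n) * bits_per_block;     /* 150528 bits              */

        printf("tag = %d bits, total = %ld bits = %ld Kibit (%.1f KiB)\n",
               tag_bits, total_bits, total_bits / 1024, total_bits / 8.0 / 1024.0);
        return 0;
    }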
EX: Consider a cache with 64 blocks and a block size of 16 bytes. To what block number does byte address 1200 map? 1200 / 16 = 75; 75 % 64 = 11.
Increasing the block size usually decreases the miss rate. The miss rate may go up eventually if the block size becomes a significant fraction of the cache size, because the number of blocks that can be held in the cache will become small, and there will be a great deal of competition for those blocks. Unless we change the memory system, the transfer time, and hence the miss penalty, will likely increase as the block size increases. If we design the memory to transfer larger blocks more efficiently, we can increase the block size and obtain further improvements in cache performance.
Instruction cache miss: 1. Send the original PC value (current PC – 4) to the memory. 2. Instruct main memory to perform a read and wait for the memory to complete its access. 3. Write the cache entry, putting the data from memory in the data portion of the entry, writing the upper bits of the address into the tag field, and turning the valid bit on. 4. Restart the instruction execution.
DEPENDABLE MEMORY HIERARCHY
The one great idea for dependability is redundancy.
1) Service accomplishment: service delivered as specified. 2) Service interruption: deviation from specified service. The 1→2 transition is a failure; the 2→1 transition is a restoration.
***Fault: failure of a component – may or may not lead to system failure.
Dependability Measures:
• Reliability: mean time to failure (MTTF); a measure of continuous service accomplishment.
• Service interruption: mean time to repair (MTTR).
• Mean time between failures: MTBF = MTTF + MTTR.
• Availability = MTTF / (MTTF + MTTR).
• Improving availability – increase MTTF: fault avoidance, fault tolerance (using redundancy to allow the service to comply with the service specification despite faults occurring), fault forecasting – reduce MTTR: improved tools and processes for diagnosis and repair.
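A trivial C sketch plugging hypothetical numbers (not from the notes) into the definitions above.

    #include <stdio.h>

    int main(void) {
        double mttf_h = 1000.0;  /* hypothetical mean time to failure, hours */
        double mttr_h = 2.0;     /* hypothetical mean time to repair, hours  */

        double mtbf         = mttf_h + mttr_h;            /* MTBF = MTTF + MTTR   */
        double availability = mttf_h / (mttf_h + mttr_h); /* MTTF / (MTTF + MTTR) */

        printf("MTBF = %.1f h, availability = %.4f\n", mtbf, availability);
        return 0;
    }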
What happens on a write?
• Writes take longer because you have to check the tags first. • Two strategies for writes: – write-back (write to the cache, and write back to memory only when the block has to be replaced) – write-through (write to the cache and to memory).
Write-through: • advantage: main memory always has a current copy • disadvantage: high bandwidth requirement.
Write-back: • advantage: writes are as fast as reads • disadvantage: read misses can result in a massive write-back when a block is replaced.
• Three reasons for cache misses: – the first reference to a block (compulsory) – the cache cannot contain all blocks (capacity) – due to a direct-mapped cache (conflict).
THE HAMMING SEC CODE
•  Hamming distance – the # of bits that are different between 2 bit patterns •  Minimum distance = 2 provides single-bit error detection (e.g. a parity code) •  Minimum distance = 3 provides single-error correction and 2-bit error detection •  To calculate the Hamming code: – number the bit positions from 1 on the left – all bit positions that are a power of 2 are (even) parity bits.
Encoding: p1 = parity of positions 3,5,7; p2 = 3,6,7; p4 = 5,6,7.
Checking/decoding: recompute p1 over 1,3,5,7; p2 over 2,3,6,7; p4 over 4,5,6,7 – if there is no error, all checks are 0.
SEC/DED Code •  Single Error Correcting, Double Error Detecting •  Add an additional parity bit for the whole word (pn) •  Makes the Hamming distance = 4.
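A minimal C sketch of the Hamming(7,4) encode/check procedure just described (positions numbered from 1, parity bits at positions 1, 2, 4); the data values are arbitrary.

    #include <stdio.h>

    static int parity3(int a, int b, int c) { return a ^ b ^ c; }

    /* bit[1..7]; positions 3,5,6,7 carry data, positions 1,2,4 are computed here. */
    void encode(int bit[8]) {
        bit[1] = parity3(bit[3], bit[5], bit[7]);   /* p1 covers 3,5,7 */
        bit[2] = parity3(bit[3], bit[6], bit[7]);   /* p2 covers 3,6,7 */
        bit[4] = parity3(bit[5], bit[6], bit[7]);   /* p4 covers 5,6,7 */
    }

    /* Recompute each parity over every position it covers (including itself).
       Read as a binary number c4 c2 c1, the result is the position of a
       single-bit error; 0 means no error. */
    int syndrome(const int bit[8]) {
        int c1 = bit[1] ^ bit[3] ^ bit[5] ^ bit[7];
        int c2 = bit[2] ^ bit[3] ^ bit[6] ^ bit[7];
        int c4 = bit[4] ^ bit[5] ^ bit[6] ^ bit[7];
        return (c4 << 2) | (c2 << 1) | c1;
    }

    int main(void) {
        int bit[8] = {0};
        bit[3] = 1; bit[5] = 0; bit[6] = 1; bit[7] = 1;    /* data bits       */
        encode(bit);
        bit[6] ^= 1;                                       /* inject an error */
        printf("error at position %d\n", syndrome(bit));   /* prints 6        */
        return 0;
    }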
THIS PART IS RELATED TO BUS AND MEMORY DESIGN
EX: cache block read – 1 bus cycle for the address transfer – 15 bus cycles per DRAM access – 1 bus cycle per data transfer.
• 1-word-wide memory – miss penalty = 1 + 15 + 1 = 17 bus cycles – bandwidth = 4 bytes / 17 cycles ≈ 0.24 B/cycle.
• 4-word-wide memory – miss penalty = 1 + 15 + 1 = 17 bus cycles – bandwidth = 16 bytes / 17 cycles ≈ 0.94 B/cycle.
• 4-bank interleaved memory – miss penalty = 1 + 15 + 4x1 = 20 bus cycles – bandwidth = 16 bytes / 20 cycles = 0.8 B/cycle.
VIRTUAL MACHINES: •  Host computer emulates the guest operating system and machine resources –  improved isolation of multiple guests –  avoids security and reliability problems –  aids sharing of resources •  Virtualization has some performance impact.
VIRTUAL MACHINE MONITOR •  Maps virtual resources to physical resources •  Guest code runs on the native machine in user mode •  Guest OS may be different from the host OS •  VMM handles the real I/O devices.
VIRTUAL MEMORY •  Use main memory as a "cache" for secondary (disk) storage •  Programs share main memory •  CPU and OS translate virtual addresses to physical addresses – a VM "block" is called a page – a VM translation "miss" is called a page fault. On a page fault, the page must be fetched from disk. Try to minimize the page fault rate – fully associative placement and smart replacement algorithms can be used.
MEASURING AND IMPROVING CACHE PERFORMANCE
AMAT (average memory access time) = hit time + miss rate x miss penalty, where miss penalty = access time + transfer time.
CPU time = (program execution cycles + memory stall cycles) x clock cycle time.
EX: instruction cache miss rate = 2%, data cache miss rate = 4%, CPI without memory stalls = 2, miss penalty = 100 cycles, 36% of instructions are lw/sw.
Instruction miss cycles = I x 2% x 100 = 2I; data miss cycles = I x 36% x 4% x 100 = 1.44I.
CPI with stalls = 2 + 2 + 1.44 = 5.44, so performance with a perfect cache is 5.44 / 2 = 2.72 times better.
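A small C sketch of the same calculation, with the parameters taken from the example above.

    #include <stdio.h>

    int main(void) {
        double base_cpi      = 2.0;    /* CPI without memory stalls      */
        double i_miss_rate   = 0.02;   /* instruction cache miss rate    */
        double d_miss_rate   = 0.04;   /* data cache miss rate           */
        double mem_inst_frac = 0.36;   /* fraction of lw/sw instructions */
        double miss_penalty  = 100.0;  /* cycles                         */

        double i_stalls = i_miss_rate * miss_penalty;                  /* 2.00 per inst */
        double d_stalls = mem_inst_frac * d_miss_rate * miss_penalty;  /* 1.44 per inst */
        double cpi = base_cpi + i_stalls + d_stalls;                   /* 5.44          */

        printf("CPI with stalls = %.2f, perfect-cache speedup = %.2fx\n",
               cpi, cpi / base_cpi);
        return 0;
    }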
– As CPU performance increases, the miss penalty becomes more significant. – Decreasing the base CPI: a greater proportion of time is spent on memory stalls. – Increasing the clock rate: memory stalls account for more CPU cycles.
3 Strategies for Placement (1st way to improve cache performance):
1. Direct-mapped cache: each block has exactly one place it can go.
2. Fully associative cache: a block can be anywhere; uses FIFO or LRU replacement.
3. Set-associative cache: a block can go in a set of locations. A block is directly mapped into a set, and then all the blocks in the set are searched for a match.
PAGE TABLES •  Stores placement information – an array of page table entries, indexed by virtual page number – a page table register in the CPU points to the page table in physical memory •  If the page is present in memory – the PTE stores the physical page number – plus other status bits (referenced, dirty, …) •  If the page is not present – the PTE can refer to a location in swap space on disk.
How large is the page table? 32-bit virtual address, 4 KiB pages, 4 bytes per page table entry: # of page table entries = 2^32 / 2^12 = 2^20.
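A small C continuation of the calculation above: with 4-byte PTEs, the per-process page table occupies 2^20 x 4 B = 4 MiB (the 4 MiB figure follows from the numbers above; it is not stated in the notes).

    #include <stdio.h>

    int main(void) {
        int va_bits = 32, page_bits = 12;                 /* 4 KiB pages                */
        long pte_bytes = 4;                               /* bytes per page table entry */

        long entries     = 1L << (va_bits - page_bits);   /* 2^20 entries               */
        long table_bytes = entries * pte_bytes;           /* 4 MiB per process          */

        printf("entries = %ld, page table size = %ld MiB\n", entries, table_bytes >> 20);
        return 0;
    }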
Increased associativity decreases the miss rate – but with diminishing returns.
***Size of tags versus set associativity: if the set associativity ↑, the # of index bits ↓ and the # of tag bits ↑.

(2nd way to improve cache performance: multilevel caches.)
Multilevel Cache Example: CPU base CPI = 1, clock rate = 4 GHz – miss rate/instruction = 2% – main memory access time = 100 ns.
– With just the primary cache: effective CPI = 1 + 0.02 x 400 = 9.
Now add an L-2 cache – access time = 5 ns – global miss rate to main memory = 0.5%.
• Primary miss with L-2 hit – penalty = 5 ns / 0.25 ns = 20 cycles.
• Primary miss with L-2 miss – extra penalty = 400 cycles (100 ns of main memory access).
• CPI = 1 + 0.02 x 20 + 0.005 x 400 = 3.4 • Performance ratio = 9 / 3.4 = 2.6.
*** Primary cache – focus on minimal hit time • L-2 cache – focus on a low miss rate to avoid main memory accesses.
Cache Design Trade-offs
PARALLELISM AND MEMORY HIERARCHY
•  In multi-core systems, a process's code may be run on different cores in parallel •  These cores will access the same data •  Each core has its own cache •  Multiple copies of the same data item on different caches!
CACHE COHERENCE PROTOCOLS
•  Operations performed by caches in multiprocessors to ensure coherence: –  migration of data to local caches (reduces bandwidth for shared memory)
–  replication of read-shared data (reduces contention for access)
•  Snooping protocols – each cache monitors bus reads/writes •  Directory-based protocols – caches and memory record the sharing status of blocks in a directory.
1.  Write broadcast: updates everyone's copy. 2.  Write invalidate: the writing processor causes all copies to be invalidated before changing its copy.
THINGS RELATED TO INTERRUPTS
• Different ways of doing I/O: 1. Programmed I/O with busy waiting 2. Interrupt-driven I/O 3. Direct Memory Access (DMA).
Programmed I/O • Simplest; you wait in a loop and periodically check whether the device is ready – busy waiting • If the CPU has other things to do, such as running other programs, this is wasteful.
Interrupt-driven I/O • Start the I/O device and tell it to generate an interrupt when it is done, by enabling interrupts • When the device is ready, it generates an interrupt • The interrupt service routine services the interrupt.
• Enable interrupts • Hardware sends an interrupt-request signal to the processor at the appropriate
time, much like a phone call. • Meanwhile, processor performs useful tasks
Enable INT: ANDCC #$BF • Stack pointer initialization: LDS #$2000 • RTI (return from interrupt)
Direct Memory Access • This is like programmed I/O; but having somebody else do it: the DMA
controller • The CPU initializes the DMA; and then does something else while the DMA is busy waiting
• When done, the DMA sends an interrupt to CPU • Cycle stealing vs data channels
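A hypothetical C sketch contrasting the first two approaches; the device addresses, register layout, and handler-registration details are invented for illustration.

    #include <stdint.h>

    /* Hypothetical memory-mapped device registers (addresses are made up). */
    #define DEV_STATUS ((volatile uint32_t *)0x10000000u)
    #define DEV_DATA   ((volatile uint32_t *)0x10000004u)
    #define READY_BIT  0x1u

    /* Programmed I/O: spin until the device reports ready, then read.
       The CPU does nothing useful while it waits. */
    uint32_t read_polled(void) {
        while ((*DEV_STATUS & READY_BIT) == 0)
            ;                              /* busy waiting */
        return *DEV_DATA;
    }

    /* Interrupt-driven I/O: the device raises an interrupt when ready and this
       handler services it (how it gets registered is platform-specific and
       omitted), so the CPU can run other work in the meantime. */
    volatile uint32_t last_value;
    void device_isr(void) {
        last_value = *DEV_DATA;            /* service the interrupt */
    }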
