CS 322M Digital Logic & Computer Architecture: Cache Optimization Techniques-II


CS 322M Digital Logic & Computer Architecture

Lecture 25 [28.10.2019]
Cache Optimization Techniques-II

John Jose
Assistant Professor
Department of Computer Science & Engineering
Indian Institute of Technology Guwahati, Assam.
Accessing Cache Memory
[Figure: CPU accesses the cache; a hit takes the hit time, a miss goes to memory and incurs the miss penalty]

Average memory access time (AMAT) = Hit time + (Miss rate × Miss penalty)

 Hit Time: Time to find the block in the cache and return it to the processor [indexing, tag comparison, transfer].
 Miss Rate: Fraction of cache accesses that result in a miss.
 Miss Penalty: Number of cycles required to fetch the block from the next level of the memory hierarchy. It is the extra (not total) time for a miss, in addition to the hit time that is incurred by all accesses. (A worked example follows below.)
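As a worked example with assumed numbers (not from the lecture): for a hit time of 1 cycle, a miss rate of 5%, and a miss penalty of 100 cycles, AMAT = 1 + 0.05 × 100 = 6 cycles.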
How to optimize the cache?
 Reduce Average Memory Access Time
 AMAT = Hit Time + (Miss Rate × Miss Penalty)
 Approaches:
Reducing the miss rate
Reducing the miss penalty
Reducing the hit time
Multi-banked Caches
 Multi-banked caches increase cache bandwidth
 Rather than a single monolithic unit, divide the cache into many banks that can support simultaneous accesses.
ARM Cortex-A8 supports 1-4 banks for L2
Intel i7 supports 4 banks for L1 and 8 banks for L2
 Interleave banks according to block address
 Sequential interleaving: consecutive block addresses map to consecutive banks (see the sketch below)
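A minimal sketch of sequential interleaving, assuming a 64-byte block and four banks (both sizes are illustrative, not from the lecture):

#include <stdint.h>
#include <stdio.h>

#define BLOCK_SIZE 64   /* bytes per cache block (assumed) */
#define NUM_BANKS   4   /* number of cache banks (assumed) */

/* Sequential interleaving: consecutive block addresses map to
   consecutive banks, so block address mod NUM_BANKS picks the bank. */
static unsigned bank_of(uint64_t byte_addr) {
    uint64_t block_addr = byte_addr / BLOCK_SIZE;
    return (unsigned)(block_addr % NUM_BANKS);
}

int main(void) {
    /* Four consecutive blocks land in four different banks,
       so they can be accessed simultaneously. */
    for (uint64_t addr = 0; addr < 4 * BLOCK_SIZE; addr += BLOCK_SIZE)
        printf("block at 0x%llx -> bank %u\n",
               (unsigned long long)addr, bank_of(addr));
    return 0;
}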
Non-blocking Caches
 Non-blocking caches increase cache bandwidth
 Caches can serve hits while one or more cache misses are in progress
(a) Hit under miss (b) Hit under multiple misses
 Essential for out-of-order (OOO) superscalar processors to increase IPC
 L2 must support it, with L1 MSHRs (Miss Status Holding Registers)
 On an L1 miss, allocate an MSHR entry; clear it when L2 responds with the cache block.
 The L1 miss penalty can be hidden to some extent.
Non-blocking Caches
 Non-blocking caches increase cache bandwidth
 Processors can hide the L1 miss penalty but not the L2 miss penalty
 Reduces the effective miss penalty by overlapping miss latencies
 Significantly increases the complexity of the cache controller, as there can be multiple outstanding memory accesses
 Requires a pipelined or banked memory system (an MSHR sketch follows below)
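A minimal sketch of an MSHR table, with illustrative field names and sizes (none of these come from the lecture; real MSHRs also merge secondary misses to the same block, which is omitted here):

#include <stdbool.h>
#include <stdint.h>

#define NUM_MSHRS 8   /* outstanding misses supported (assumed) */

/* One Miss Status Holding Register: tracks an in-flight miss so the
   cache can keep serving hits while the block is fetched from L2. */
typedef struct {
    bool     valid;       /* entry in use */
    uint64_t block_addr;  /* address of the missing block */
    uint8_t  dest_reg;    /* where to deliver the word on refill */
} mshr_t;

static mshr_t mshrs[NUM_MSHRS];

/* On an L1 miss: allocate an entry; if the table is full the cache
   must stall until an outstanding miss completes. */
static int mshr_alloc(uint64_t block_addr, uint8_t dest_reg) {
    for (int i = 0; i < NUM_MSHRS; i++) {
        if (!mshrs[i].valid) {
            mshrs[i] = (mshr_t){ true, block_addr, dest_reg };
            return i;          /* MSHR index travels with the L2 request */
        }
    }
    return -1;                 /* no free MSHR: structural stall */
}

/* When L2 responds with the cache block, free the entry. */
static void mshr_release(int idx) {
    mshrs[idx].valid = false;
}

int main(void) {
    int idx = mshr_alloc(0x1000 / 64, 7);  /* miss on an example block */
    /* ... hits continue to be served; later, L2 delivers the block ... */
    if (idx >= 0) mshr_release(idx);
    return 0;
}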
Early Restart
 Early restart reduces miss penalty
 The CPU does not wait for the entire block to be loaded
 Early restart
Request the words in normal order
Send the missed word to the processor as soon as it arrives
Generally useful with large blocks (worked example below)
The L2 controller is not involved in this technique
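As a worked example with assumed numbers: for an 8-word block, a 10-cycle access latency, and 1 word transferred per cycle, a miss whose needed word is word 5 of the block stalls for 10 + 6 = 16 cycles with early restart, instead of 10 + 8 = 18 cycles waiting for the full block.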
Critical Word First
 Critical word first reduces miss penalty
 Critical word first
Request the missed word from memory first
Send it to the processor as soon as it arrives
The processor resumes while the rest of the block is filled into the cache
The L2 cache controller sends words out of order.
The L1 cache controller must re-arrange the words in the block (fill-order sketch below)
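A minimal sketch of the wrap-around fill order used by critical word first, assuming an 8-word block (an illustrative size):

#include <stdio.h>

#define WORDS_PER_BLOCK 8   /* block size in words (assumed) */

/* Critical word first: memory returns the missed word first, then the
   remaining words wrapping around the block boundary. The L1 controller
   places transfer i at offset (critical + i) % WORDS_PER_BLOCK. */
int main(void) {
    unsigned critical = 5;   /* word offset that missed (example) */
    for (unsigned i = 0; i < WORDS_PER_BLOCK; i++)
        printf("transfer %u carries word %u\n",
               i, (critical + i) % WORDS_PER_BLOCK);
    return 0;
}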
Merging Write Buffer
 Write buffer merging reduces miss penalty
 A write buffer allows the processor to continue without waiting for writes to complete.
 When performing a store/write to a block that is already pending in the write buffer, update the existing write buffer entry
 Reduces stalls due to a full write buffer and improves buffer efficiency
 If the buffer is full, writes incur a processor stall. (A merging sketch follows after the figure.)

[Figure: write buffer contents without merging vs. with merging]
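A minimal sketch of merging, with an assumed entry layout where each entry covers one block and tracks which words are valid (all names and sizes are illustrative):

#include <stdbool.h>
#include <stdint.h>

#define WB_ENTRIES       4   /* write buffer depth (assumed) */
#define WORDS_PER_BLOCK  8   /* block size in words (assumed) */

typedef struct {
    bool     valid;
    uint64_t block_addr;                 /* which block this entry holds */
    bool     word_valid[WORDS_PER_BLOCK];
    uint64_t data[WORDS_PER_BLOCK];
} wb_entry_t;

static wb_entry_t wb[WB_ENTRIES];

/* On a store: merge into a pending entry for the same block if one
   exists; otherwise allocate a new entry; if full, the CPU stalls. */
static bool wb_store(uint64_t block_addr, unsigned word, uint64_t value) {
    for (int i = 0; i < WB_ENTRIES; i++) {
        if (wb[i].valid && wb[i].block_addr == block_addr) {
            wb[i].word_valid[word] = true;   /* merge: reuse the entry */
            wb[i].data[word] = value;
            return true;
        }
    }
    for (int i = 0; i < WB_ENTRIES; i++) {
        if (!wb[i].valid) {                  /* allocate a fresh entry */
            wb[i].valid = true;
            wb[i].block_addr = block_addr;
            wb[i].word_valid[word] = true;
            wb[i].data[word] = value;
            return true;
        }
    }
    return false;                            /* buffer full: stall */
}

int main(void) {
    wb_store(0x40, 1, 0xAA);   /* allocates an entry for block 0x40 */
    wb_store(0x40, 2, 0xBB);   /* merges into the same entry */
    return 0;
}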
Hardware Prefetching
 Pre-fetching reduces miss rate and miss penalty.
 Pre-fetch items before the processor requests them.
 Fetch more blocks on a miss, including the next sequential block
 The requested block is kept in the I-cache and the next block in a stream buffer.
 If a missed block is found in the stream buffer, the cache miss is cancelled (lookup sketch below)
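A minimal sketch of the stream-buffer check on a miss, with an assumed one-block-deep buffer (the structure and the fetch hook are illustrative, not from the lecture):

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* One stream buffer entry holding the next sequential block. */
typedef struct {
    bool     valid;
    uint64_t block_addr;
} stream_buf_t;

static stream_buf_t sbuf;

/* Stand-in for issuing a memory request (hypothetical hook). */
static void start_fetch(uint64_t block_addr) {
    printf("fetch block %llu\n", (unsigned long long)block_addr);
}

/* On a cache miss: if the stream buffer already holds the block, the
   miss is cancelled; either way, prefetch the next sequential block. */
static bool handle_miss(uint64_t block_addr) {
    bool cancelled = sbuf.valid && sbuf.block_addr == block_addr;
    if (!cancelled)
        start_fetch(block_addr);        /* normal miss handling */
    sbuf.valid = true;                  /* refill the stream buffer */
    sbuf.block_addr = block_addr + 1;
    start_fetch(sbuf.block_addr);
    return cancelled;
}

int main(void) {
    handle_miss(100);   /* normal miss: fetch 100, prefetch 101 */
    handle_miss(101);   /* found in stream buffer: miss cancelled */
    return 0;
}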
Compiler Optimizations
 Compiler optimizations reduce miss rate
 Loop Interchange
Swap nested loops to access memory in sequential order
Maximize the use of data in the cache before it is discarded (example below)
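A standard loop-interchange example in C (row-major layout; the array size is illustrative):

#include <stdio.h>

#define N 1024

static double x[N][N];

int main(void) {
    /* Before: column-wise traversal of a row-major array. Consecutive
       iterations touch x[0][j], x[1][j], ... which are N*8 bytes apart,
       so nearly every access misses. */
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            x[i][j] = x[i][j] * 2.0;

    /* After interchange: the inner loop walks one row sequentially,
       so every word of each fetched block is used before eviction. */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            x[i][j] = x[i][j] * 2.0;

    printf("%f\n", x[0][0]);
    return 0;
}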
Compiler Optimizations
 Blocking
Instead of accessing entire rows or columns, subdivide the matrices into blocks
Requires more accesses but improves the locality of those accesses (sketch below)
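A minimal blocked matrix-multiply sketch, with an assumed blocking factor B (all sizes are illustrative):

#include <stdio.h>

#define N 512
#define B 32   /* blocking factor, chosen so the working tiles fit in cache */

static double a[N][N], b[N][N], c[N][N];

int main(void) {
    /* Operate on BxB sub-matrices so each tile of b is reused many
       times while it is still resident in the cache. */
    for (int jj = 0; jj < N; jj += B)
        for (int kk = 0; kk < N; kk += B)
            for (int i = 0; i < N; i++)
                for (int j = jj; j < jj + B; j++) {
                    double sum = 0.0;
                    for (int k = kk; k < kk + B; k++)
                        sum += a[i][k] * b[k][j];
                    c[i][j] += sum;
                }
    printf("%f\n", c[0][0]);
    return 0;
}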
Compiler Controlled Pre-fetching
 Pre-fetching reduces miss rate and miss penalty.
 Insert pre-fetch instructions before the data is needed
 Pre-fetching improves performance only if the processor can read from the cache and keep executing while the pre-fetch is in progress.
 Register pre-fetch
Loads the data into a register
 Cache pre-fetch
Loads the data into the cache
 Use loop unrolling and scheduling to pre-fetch data for adjacent iterations (sketch below)
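A minimal cache pre-fetch sketch using the GCC/Clang __builtin_prefetch intrinsic (the prefetch distance and array size are assumptions to be tuned, not values from the lecture):

#include <stdio.h>

#define N 100000
#define DIST 16   /* prefetch distance in elements (assumed) */

static double a[N];

int main(void) {
    double sum = 0.0;
    /* Unrolled by 4 and scheduled so one prefetch per iteration covers
       data that later iterations will need DIST elements ahead. */
    for (int i = 0; i + 4 <= N; i += 4) {
        if (i + DIST < N)
            __builtin_prefetch(&a[i + DIST], 0 /* read */, 3);
        sum += a[i] + a[i + 1] + a[i + 2] + a[i + 3];
    }
    printf("%f\n", sum);
    return 0;
}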
johnjose@iitg.ac.in
http://www.iitg.ac.in/johnjose/
