COMP 740: Computer Architecture and Implementation
Montek Singh
Sep 14, 2016
Outline
Cache Performance
Means of improving performance
How to Improve Cache Performance
Latency
Reduce miss rate
Reduce miss penalty
Reduce hit time
Bandwidth
Increase hit bandwidth
Increase miss bandwidth
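These levers combine through the average memory access time; as a minimal sketch in C (a generic helper, not code from the course):

/* Average Memory Access Time, the figure of merit used throughout:
   AMAT = hit time + miss rate * miss penalty (times in cycles) */
double amat(double hit_time, double miss_rate, double miss_penalty)
{
    return hit_time + miss_rate * miss_penalty;
}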
1. Reduce Misses via Larger Block Size
[Figure B.10: Miss rate versus block size for five different-sized caches. Miss rate actually goes up if the block size is too large relative to the cache size; each line represents a cache of a different size. Figure B.11 shows the underlying data. SPEC2000 traces would take too long if block size were varied, so these data are based on SPEC92 on a DECstation 5000 (Gee et al., 1993).]
2. Reduce Misses by Increasing Cache Size
Increasing cache size reduces cache misses
both capacity misses and conflict misses reduced
[Figure: miss rate per type (compulsory, capacity, and conflict for 1-way through 8-way associativity) versus cache size, 1 KB to 128 KB]
3. Reduce Misses via Higher Associativity
2:1 Cache Rule
miss rate of a direct-mapped cache of size N ≈ miss rate of a 2-way set-associative cache of size N/2
Not merely empirical
Theoretical justification in Sleator and Tarjan, “Amortized efficiency of list update and paging rules”, CACM, 28(2):202-208, 1985
Beware: execution time is the only final measure!
Will clock cycle time increase?
Hill [1988] suggested hit time is ~10% higher for 2-way vs. 1-way
[Figure repeated: miss rate per type versus cache size, as above]
Example: Ave Mem Access Time vs. Miss Rate
Example: assume the clock cycle time is 1.10 for 2-way, 1.12 for 4-way, and 1.14 for 8-way, relative to a clock cycle time of 1.0 for direct mapped
[Table omitted: A.M.A.T. for various cache sizes and associativities; red entries marked cases where A.M.A.T. is not improved by more associativity]
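A worked version of this example in C, with made-up miss rates chosen purely for illustration (the original table's values are not reproduced here):

#include <stdio.h>

int main(void)
{
    /* clock cycle times relative to direct mapped (from the slide);
       miss rates below are hypothetical illustrative values */
    const char *name[4] = { "1-way", "2-way", "4-way", "8-way" };
    double cycle[4]     = { 1.00, 1.10, 1.12, 1.14 };
    double miss_rate[4] = { 0.050, 0.044, 0.042, 0.0415 };
    double miss_penalty = 25.0;   /* in direct-mapped clock cycles */

    for (int i = 0; i < 4; i++)   /* hit time = 1 cycle, scaled by cycle[i] */
        printf("%s: A.M.A.T. = %.4f\n", name[i],
               cycle[i] + miss_rate[i] * miss_penalty);
    return 0;
}

With these numbers, 8-way comes out slower than 4-way: the longer clock cycle outweighs the small miss-rate gain.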
4. Miss Penalty Reduction: L2 Cache
L2 Equations:
AMAT = Hit Time_L1 + Miss Rate_L1 × Miss Penalty_L1
Miss Penalty_L1 = Hit Time_L2 + Miss Rate_L2 × Miss Penalty_L2
so: AMAT = Hit Time_L1 + Miss Rate_L1 × (Hit Time_L2 + Miss Rate_L2 × Miss Penalty_L2)
Definitions:
Local miss rate — misses in this cache divided by the total number of memory accesses to this cache (Miss Rate_L2)
Global miss rate — misses in this cache divided by the total number of memory accesses generated by the CPU (Miss Rate_L1 × Miss Rate_L2)
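These equations as a small C sketch (the latencies and miss rates are placeholders, not course data):

#include <stdio.h>

int main(void)
{
    double hit_l1 = 1.0,  miss_rate_l1 = 0.04;   /* L1: local = global */
    double hit_l2 = 10.0, miss_rate_l2 = 0.25;   /* local L2 miss rate */
    double miss_penalty_l2 = 100.0;              /* main memory, cycles */

    /* Miss Penalty_L1 = Hit Time_L2 + Miss Rate_L2 x Miss Penalty_L2 */
    double miss_penalty_l1 = hit_l2 + miss_rate_l2 * miss_penalty_l2;
    double amat = hit_l1 + miss_rate_l1 * miss_penalty_l1;

    printf("AMAT = %.2f cycles\n", amat);               /* 2.40 */
    printf("global L2 miss rate = %.3f\n",
           miss_rate_l1 * miss_rate_l2);                /* 0.010 */
    return 0;
}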
5. Reducing Miss Penalty
Read Priority over Write on Miss:
Goal: allow reads to be served before writes have completed
Challenges:
Write-through caches:
Using write buffers: RAW conflicts between buffered writes and reads on cache misses
simply waiting for the write buffer to empty might increase the read miss penalty by 50% (old MIPS 1000)
better: check the write buffer contents before the read; if no conflicts, let the memory access continue (see the sketch below)
Write-back caches:
read miss replacing a dirty block
normal: write the dirty block to memory, and then do the read
instead: copy the dirty block to a write buffer, then do the read, and then do the write
CPU stalls less since it restarts as soon as the read completes
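A minimal C model of the write-buffer check (the structure is hypothetical; real hardware compares all entries in parallel):

#include <stdbool.h>
#include <stdint.h>

#define WB_ENTRIES 4

struct write_buffer {
    uint32_t addr[WB_ENTRIES];    /* block-aligned write addresses */
    bool     valid[WB_ENTRIES];
};

/* A read miss may go to memory ahead of buffered writes only if
   no pending write targets the same block (the RAW check). */
bool read_may_bypass(const struct write_buffer *wb, uint32_t block_addr)
{
    for (int i = 0; i < WB_ENTRIES; i++)
        if (wb->valid[i] && wb->addr[i] == block_addr)
            return false;   /* conflict: drain or forward this write first */
    return true;
}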
Summary of Basic Optimizations
Six basic cache optimizations:
1. Larger block size
Reduces compulsory misses
Increases capacity and conflict misses, increases miss penalty
2. Larger total cache capacity to reduce miss rate
Increases hit time, increases power consumption
3. Higher associativity
Reduces conflict misses
Increases hit time, increases power consumption
4. Higher number of cache levels
Reduces overall memory access time
5. Giving priority to read misses over writes
Reduces miss penalty
6. Avoiding address translation in cache indexing (later)
Reduces hit time
More advanced optimizations
1. Fast Hit Times via Small, Simple Caches
Simple caches can be faster
cache hit time is increasingly a bottleneck to CPU performance
set associativity requires complex tag matching → slower
direct-mapped caches are simpler → faster → shorter CPU cycle times
tag check can be overlapped with transmission of the data
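A sketch of why the direct-mapped case is simple (block and index sizes are assumptions): the index selects exactly one frame, so the data can be sent toward the CPU while the single tag comparison completes.

#include <stdint.h>

#define BLOCK_BITS 6    /* 64-byte blocks (assumed) */
#define INDEX_BITS 10   /* 1024 frames -> 64 KB cache (assumed) */

/* One candidate frame per address: no multiplexing across ways. */
static inline uint32_t dm_index(uint32_t addr)
{
    return (addr >> BLOCK_BITS) & ((1u << INDEX_BITS) - 1);
}

static inline uint32_t dm_tag(uint32_t addr)
{
    return addr >> (BLOCK_BITS + INDEX_BITS);
}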
Simple and small is fast
Simple and small is energy-efficient
2. Way Prediction
Way prediction to improve hit time
Goal: reduce conflict misses, yet maintain hit speed of a
direct-mapped cache
Approach: keep extra bits to predict the “way” within the set
the output multiplexor is pre-set to select the desired block
if block is correct one, fast hit time of 1 clock cycle
if block isn’t correct, check other blocks in 2nd clock cycle
Mis-prediction gives longer hit time
Prediction accuracy
> 90% for two-way
> 80% for four-way
I-cache has better accuracy than D-cache
First used on MIPS R10000 in mid-90s
Used on ARM Cortex-A8
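A C sketch of the way-predicted lookup (data structures hypothetical):

#include <stdbool.h>
#include <stdint.h>

#define WAYS 2

struct cache_set {
    uint32_t tag[WAYS];
    bool     valid[WAYS];
    uint8_t  predicted_way;   /* the extra prediction bits, per set */
};

/* Returns the hit latency in cycles, or -1 on a miss. */
int way_predicted_lookup(struct cache_set *s, uint32_t tag)
{
    int w = s->predicted_way;
    if (s->valid[w] && s->tag[w] == tag)
        return 1;                            /* fast hit: 1 clock cycle */
    for (int i = 0; i < WAYS; i++)           /* 2nd cycle: other blocks */
        if (i != w && s->valid[i] && s->tag[i] == tag) {
            s->predicted_way = (uint8_t)i;   /* retrain the predictor */
            return 2;                        /* slower hit */
        }
    return -1;                               /* miss */
}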
2a. Way Selection
Extension of way prediction
Idea:
Instead of pre-setting the output multiplexor to select the correct
block out of many…
… only the ONE predicted block is actually read from the cache
Pros: energy efficient
only reading one block (assuming prediction is correct)
Cons: longer latency on misprediction
if prediction was wrong, the other block(s) have to be read and
their tags checked
3. Pipelining Cache
Pipeline cache access to improve bandwidth
For faster clock cycle time:
allow L1 hit time to be multiple clock cycles (instead of 1 cycle)
make cache pipelined, so it still has high bandwidth
Examples:
Pentium: 1 cycle
Pentium Pro – Pentium III: 2 cycles
Pentium 4 – Core i7: 4 cycles
Cons:
increases number of pipeline stages for an instruction
longer branch mis-prediction penalty
more clock cycles between “load” and receiving the data
Pros:
allows faster clock rate for the processor
makes it easier to increase associativity
4. Non-blocking Caches
Non-blocking cache or lockup-free cache allows the
data cache to continue to supply cache hits during a
miss
“Hit under miss”
reduces the effective miss penalty by being helpful during a miss
instead of ignoring the requests of the CPU
“Hit under multiple miss” or “miss under miss”
may further lower the effective miss penalty by overlapping
multiple misses
Significantly increases the complexity of the cache controller
as there can be multiple outstanding memory accesses
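This bookkeeping is commonly organized as miss status holding registers (MSHRs); a rough C sketch (fields and sizes are assumptions):

#include <stdbool.h>
#include <stdint.h>

#define MSHRS 8   /* maximum outstanding misses (assumed) */

struct mshr {
    bool     busy;
    uint32_t block_addr;   /* block being fetched from memory */
    /* plus per-request bookkeeping: destination register, offset, ... */
};

/* On a miss: piggyback on an in-flight fetch of the same block,
   allocate a free MSHR, or report a structural stall. */
int handle_miss(struct mshr t[], uint32_t block_addr)
{
    int free_slot = -1;
    for (int i = 0; i < MSHRS; i++) {
        if (t[i].busy && t[i].block_addr == block_addr)
            return i;                  /* miss under miss: merge requests */
        if (!t[i].busy && free_slot < 0)
            free_slot = i;
    }
    if (free_slot < 0)
        return -1;                     /* all MSHRs busy: stall */
    t[free_slot].busy = true;
    t[free_slot].block_addr = block_addr;
    return free_slot;
}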
Value of Hit Under Miss for SPEC
6. Early Restart and Critical Word First
Don’t wait for full block to be loaded before
restarting CPU
Early Restart—As soon as the requested word of the block
arrives, send it to the CPU and let the CPU continue
execution
Critical Word First—Request the missed word first from
memory and send it to the CPU as soon as it arrives
let the CPU continue while filling the rest of the words in the block.
also called “wrapped fetch” and “requested word first”
Generally useful only with large blocks
Spatial locality can be a problem
programs tend to want the next sequential word anyway, so it is not clear early restart helps
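The wrapped fetch order itself is simple; a C sketch (block size assumed):

#include <stdio.h>

#define WORDS_PER_BLOCK 8   /* assumed block geometry */

/* Critical word first: start at the requested word, wrap around. */
void fill_order(int requested_word)
{
    for (int i = 0; i < WORDS_PER_BLOCK; i++) {
        int w = (requested_word + i) % WORDS_PER_BLOCK;
        printf("word %d%s\n", w, i == 0 ? "  <- forwarded to CPU" : "");
    }
}

For example, fill_order(5) fetches words 5, 6, 7, 0, 1, 2, 3, 4, and the CPU restarts as soon as word 5 arrives.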
7. Merging Write Buffer
Write buffers used in both write-through and write-back
write-through: write sent to buffer so memory update can happen in
background
write-back: when a dirty block is replaced, write sent to buffer
Merging writes:
when updating a location that is already pending in the write buffer,
update the existing entry instead of creating a new one
[Figure: write buffer contents without merging vs. with merging]
Merging Write Buffer (contd.)
Pros: reduces stalls due to write buffer being full
But: I/O writes cannot be merged
with memory-mapped I/O, I/O writes become memory writes
they should not be merged because I/O has different semantics
each I/O event must be kept distinct (see the sketch below)
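A C sketch of one merging write-buffer entry with per-word valid bits (layout hypothetical), including the no-merge rule for memory-mapped I/O:

#include <stdbool.h>
#include <stdint.h>

#define WORDS_PER_ENTRY 4   /* assumed entry width: 4 x 32-bit words */

struct wb_entry {
    bool     valid;
    bool     no_merge;                   /* set for memory-mapped I/O */
    uint32_t base;                       /* entry-aligned base address */
    uint32_t word[WORDS_PER_ENTRY];
    bool     word_valid[WORDS_PER_ENTRY];
};

/* Try to fold a one-word write into an existing entry. */
bool try_merge(struct wb_entry *e, uint32_t addr, uint32_t data)
{
    uint32_t base = addr & ~(uint32_t)(4 * WORDS_PER_ENTRY - 1);
    if (!e->valid || e->no_merge || e->base != base)
        return false;                    /* caller allocates a new entry */
    int w = (addr >> 2) & (WORDS_PER_ENTRY - 1);
    e->word[w] = data;                   /* overwrite or fill this word */
    e->word_valid[w] = true;
    return true;
}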
8. Reduce Misses by Compiler Optzns.
Instructions
Reorder procedures in memory so as to reduce misses
Profiling to look at conflicts
McFarling [1989] reduced cache misses by 75% on an 8 KB direct-mapped cache with 4-byte blocks
Data
Merging Arrays
Improve spatial locality by single array of compound elements vs. 2 arrays
Loop Interchange
Change nesting of loops to access data in order stored in memory
Loop Fusion
Combine two independent loops that have the same looping structure and overlap some variables
Blocking
Improve temporal locality by accessing “blocks” of data repeatedly vs.
going down whole columns or rows
Merging Arrays Example
/* Before */
int val[SIZE];
int key[SIZE];

/* After */
struct merge {
    int val;
    int key;
};
struct merge merged_array[SIZE];
Loop Interchange Example
/* Before: inner loop varies the row index i,
   so successive accesses are 100 words apart */
for (k = 0; k < 100; k++)
    for (j = 0; j < 100; j++)
        for (i = 0; i < 5000; i++)
            x[i][j] = 2 * x[i][j];

/* After: inner loop varies the column index j,
   so accesses walk sequentially through memory */
for (k = 0; k < 100; k++)
    for (i = 0; i < 5000; i++)
        for (j = 0; j < 100; j++)
            x[i][j] = 2 * x[i][j];
Loop Fusion Example
/* Before: two separate passes over a and c */
for (i = 0; i < N; i++)
    for (j = 0; j < N; j++)
        a[i][j] = 1/b[i][j] * c[i][j];
for (i = 0; i < N; i++)
    for (j = 0; j < N; j++)
        d[i][j] = a[i][j] + c[i][j];

/* After: one pass; a[i][j] and c[i][j] are reused while still in cache */
for (i = 0; i < N; i++)
    for (j = 0; j < N; j++) {
        a[i][j] = 1/b[i][j] * c[i][j];
        d[i][j] = a[i][j] + c[i][j];
    }
Blocking Example
/* Before: matrix multiply; the inner loops read all N*N elements
   of z while repeatedly sweeping the rows of y */
for (i = 0; i < N; i++)
    for (j = 0; j < N; j++) {
        r = 0;
        for (k = 0; k < N; k++)
            r = r + y[i][k]*z[k][j];
        x[i][j] = r;
    }
Blocking Example (contd.)
/* After: blocked version; B is the blocking factor and min()
   is assumed to be defined, e.g. as a macro.
   x must be zero-initialized, since it accumulates partial sums. */
for (jj = 0; jj < N; jj = jj+B)
    for (kk = 0; kk < N; kk = kk+B)
        for (i = 0; i < N; i++)
            for (j = jj; j < min(jj+B, N); j++) {
                r = 0;
                for (k = kk; k < min(kk+B, N); k++)
                    r = r + y[i][k]*z[k][j];
                x[i][j] = x[i][j] + r;
            }
Blocking Example (contd.)
[Figure: access pattern of arrays x, y, and z in the blocked version]
Summary: Compiler Optimizations to Reduce Cache Misses
9. Reduce Misses by Hardware Prefetching
Prefetching done by hardware outside of the cache
Instruction prefetching
Alpha 21064 fetches 2 blocks on a miss
Extra block placed in stream buffer
On miss check stream buffer
Pentium 4 Pre-fetching
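A rough C model of a one-entry stream buffer (structure hypothetical):

#include <stdbool.h>
#include <stdint.h>

struct stream_buffer {
    bool     valid;
    uint32_t block_addr;   /* the sequentially prefetched block */
};

/* On a cache miss: if the stream buffer holds the block, install it
   in the cache; either way, prefetch the next sequential block. */
bool stream_buffer_probe(struct stream_buffer *sb, uint32_t miss_block)
{
    bool hit = sb->valid && sb->block_addr == miss_block;
    /* ... move the block into the cache (from buffer or memory) ... */
    sb->valid = true;
    sb->block_addr = miss_block + 1;   /* keep streaming ahead */
    return hit;
}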
10. Reducing Misses by Software Prefetching
Data prefetch
Compiler inserts special “prefetch” instructions into program
Load data into register (HP PA-RISC loads)
Cache Prefetch: load into cache (MIPS IV,PowerPC,SPARC v9)
A form of speculative execution
don’t really know if data is needed or if not in cache already
Most effective prefetches are “semantically invisible” to the program
does not change registers or memory
cannot cause a fault/exception
if they would fault, they are simply turned into NOP’s
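With GCC or Clang, the cache-prefetch flavor can be written with __builtin_prefetch, which compiles to a non-faulting prefetch instruction where one exists; a sketch (the prefetch distance of 16 is an assumed tuning parameter):

/* Software prefetching ahead of a streaming read. */
double sum(const double *a, int n)
{
    double s = 0.0;
    for (int i = 0; i < n; i++) {
        if (i + 16 < n)                           /* stay in bounds */
            __builtin_prefetch(&a[i + 16], 0, 1); /* read, low reuse */
        s += a[i];
    }
    return s;
}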
Reduce Conflict Misses via Victim Cache
How to combine the fast hit time of direct mapped yet avoid conflict misses?
Add a small, highly associative buffer to hold data discarded from the cache
Jouppi [1990]: a 4-entry victim cache removed 20% to 95% of conflicts for a 4 KB direct-mapped data cache
[Diagram: direct-mapped cache (TAG/DATA) backed by a small victim cache (TAG/DATA) in front of memory]
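A C sketch of the victim-cache probe on a main-cache miss (sizes hypothetical):

#include <stdbool.h>
#include <stdint.h>

#define VC_ENTRIES 4   /* small and fully associative */

struct victim_cache {
    bool     valid[VC_ENTRIES];
    uint32_t block_addr[VC_ENTRIES];
    /* data lines omitted */
};

/* Probe the victim cache before going to memory; on a victim hit,
   the victim block and the displaced main-cache block are swapped. */
int victim_probe(const struct victim_cache *vc, uint32_t block_addr)
{
    for (int i = 0; i < VC_ENTRIES; i++)
        if (vc->valid[i] && vc->block_addr[i] == block_addr)
            return i;   /* hit in victim cache: swap with main cache */
    return -1;          /* true miss: fetch from memory */
}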
Reduce Conflict Misses via Pseudo-Assoc.
How to combine the fast hit time of direct mapped with the lower conflict misses of a 2-way SA cache?
Divide the cache: on a miss, check the other half of the cache to see if the block is there; if so, we have a “pseudo-hit” (slow hit)
Review: Improving Cache Performance
1. Reduce the miss rate,
2. Reduce the miss penalty, or
3. Reduce the time to hit in the cache.
Summary