COMP 740: Computer Architecture and Implementation

Montek Singh
Sep 14, 2016

Topic: Optimization of Cache Performance

1
Outline
 Cache Performance
 Means of improving performance

Read textbook Appendix B.3 and Ch. 2.2

2
How to Improve Cache Performance
 Latency
 Reduce miss rate
 Reduce miss penalty
 Reduce hit time

 Bandwidth
 Increase hit bandwidth
 Increase miss bandwidth

3
1. Reduce Misses via Larger Block Size

Figure B.10 Miss rate versus block size for five different-sized caches. Note that miss rate actually goes up if the block size is
too large relative to the cache size. Each line represents a cache of different size. Figure B.11 shows the data used to plot these
lines. Unfortunately, SPEC2000 traces would take too long if block size were included, so these data are based on SPEC92 on a
DECstation 5000 [Gee et al. 1993].
2. Reduce Misses by Increasing Cache Size
 Increasing cache size reduces cache misses
 both capacity misses and conflict misses reduced

[Figure: miss rate per type (conflict for 1-, 2-, 4-, and 8-way, plus capacity and compulsory) vs. cache size from 1 KB to 128 KB]
3. Reduce Misses via Higher Associativity
 2:1 Cache Rule
 Miss rate of a direct-mapped cache of size N ≈ miss rate of a 2-way set-associative cache of size N/2
 Not merely empirical
 Theoretical justification in Sleator and Tarjan, “Amortized efficiency of list
update and paging rules”, CACM, 28(2):202-208, 1985
 Beware: Execution time is the only final measure!
 Will clock cycle time increase?
 Hill [1988] suggested hit time is ~10% higher for 2-way vs. 1-way

[Figure: miss rate per type (conflict for 1-, 2-, 4-, and 8-way, plus capacity and compulsory) vs. cache size from 1 KB to 128 KB]

6
Example: Ave Mem Access Time vs. Miss Rate
Example: assume the clock cycle time is 1.10 for 2-way, 1.12 for 4-way, and
1.14 for 8-way, relative to a clock cycle time of 1.00 for direct mapped
(Red means A.M.A.T. is not improved by more associativity)

Cache size (KB)   1-way   2-way   4-way   8-way
      1           2.33    2.15    2.07    2.01
      2           1.98    1.86    1.76    1.68
      4           1.72    1.67    1.61    1.53
      8           1.46    1.48    1.47    1.43
     16           1.29    1.32    1.32    1.32
     32           1.20    1.24    1.25    1.27
     64           1.14    1.20    1.21    1.23
    128           1.10    1.17    1.18    1.20

7
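To make the A.M.A.T. arithmetic concrete, here is a minimal C sketch of AMAT = hit time + miss rate × miss penalty across associativities. The hit times follow the clock-cycle assumptions above; the miss rates and miss penalty are invented illustrative values, not the data behind the table.

/* Minimal sketch: AMAT = hit time + miss rate * miss penalty.
 * Hit times follow the clock-cycle assumptions above (1.00, 1.10,
 * 1.12, 1.14); miss rates and miss penalty are assumed values and
 * do NOT reproduce the table entries. */
#include <stdio.h>

int main(void) {
    const char *assoc[] = {"1-way", "2-way", "4-way", "8-way"};
    double hit_time[]   = {1.00, 1.10, 1.12, 1.14};     /* relative cycles */
    double miss_rate[]  = {0.021, 0.019, 0.018, 0.017}; /* assumed */
    double miss_penalty = 25.0;                         /* assumed, cycles */

    for (int i = 0; i < 4; i++)
        printf("%s: AMAT = %.2f cycles\n",
               assoc[i], hit_time[i] + miss_rate[i] * miss_penalty);
    return 0;
}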
4. Miss Penalty Reduction: L2 Cache
L2 Equations:

AMAT = Hit Time_L1 + Miss Rate_L1 × Miss Penalty_L1

Miss Penalty_L1 = Hit Time_L2 + Miss Rate_L2 × Miss Penalty_L2

AMAT = Hit Time_L1 + Miss Rate_L1 × (Hit Time_L2 + Miss Rate_L2 × Miss Penalty_L2)

Definitions:
Local miss rate — misses in this cache divided by the total number
of memory accesses to this cache (Miss Rate_L2)
Global miss rate — misses in this cache divided by the total number
of memory accesses generated by the CPU
(Miss Rate_L1 × Miss Rate_L2)

8
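A small C sketch of the two-level equations above; all numeric inputs are assumptions chosen purely for illustration. It also prints the global L2 miss rate as defined above.

/* Sketch of the two-level AMAT equations. All numbers are assumed
 * for illustration, not data from the lecture. */
#include <stdio.h>

int main(void) {
    double hit_time_L1     = 1.0;    /* cycles, assumed */
    double miss_rate_L1    = 0.04;   /* local = global for L1, assumed */
    double hit_time_L2     = 10.0;   /* cycles, assumed */
    double miss_rate_L2    = 0.25;   /* LOCAL miss rate of L2, assumed */
    double miss_penalty_L2 = 100.0;  /* cycles to main memory, assumed */

    double miss_penalty_L1 = hit_time_L2 + miss_rate_L2 * miss_penalty_L2;
    double amat            = hit_time_L1 + miss_rate_L1 * miss_penalty_L1;

    printf("Miss Penalty_L1     = %.1f cycles\n", miss_penalty_L1); /* 35.0 */
    printf("AMAT                = %.1f cycles\n", amat);            /*  2.4 */
    printf("Global L2 miss rate = %.3f\n", miss_rate_L1 * miss_rate_L2);
    return 0;
}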
5. Reducing Miss Penalty
Read Priority over Write on Miss:
 Goal: allow reads to be served before writes have completed

 Challenges:
 Write-through caches:
 Using write buffers can create RAW conflicts between reads (on cache
misses) and buffered writes
 Simply waiting for the write buffer to empty might increase the read miss
penalty by 50% (old MIPS 1000)
 Instead, check the write buffer contents before the read;
if there are no conflicts, let the memory access continue (see the sketch below)
 Write-back caches:
 Read miss replacing a dirty block
 Normal: write the dirty block to memory, and then do the read
 Instead, copy the dirty block to a write buffer, then do the read,
and then do the write
 CPU stalls less since it restarts as soon as the read completes
9
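The write-buffer check mentioned above can be sketched as follows; the 4-entry buffer and single-word payload are simplifying assumptions, not a description of any particular machine.

/* Hedged sketch of "check write buffer before read": on a read miss,
 * scan the pending write-buffer entries for a block-address match
 * (a potential RAW hazard). Sizes and fields are hypothetical. */
#include <stdbool.h>
#include <stdint.h>

#define WB_ENTRIES 4

typedef struct {
    bool     valid;
    uint64_t block_addr;   /* block address of the pending write */
    uint64_t data;         /* payload, simplified to one word    */
} wb_entry_t;

static wb_entry_t write_buffer[WB_ENTRIES];

/* Returns true (and the buffered data) if the read miss conflicts with a
 * pending write; the read can then be serviced from the buffer or stalled
 * instead of draining the whole buffer. False: let the read go to memory. */
bool read_hits_write_buffer(uint64_t block_addr, uint64_t *data_out) {
    for (int i = 0; i < WB_ENTRIES; i++) {
        if (write_buffer[i].valid && write_buffer[i].block_addr == block_addr) {
            *data_out = write_buffer[i].data;
            return true;
        }
    }
    return false;
}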
Summary of Basic Optimizations
 Six basic cache optimizations:
1. Larger block size
 Reduces compulsory misses
 Increases capacity and conflict misses, increases miss penalty
2. Larger total cache capacity to reduce miss rate
 Increases hit time, increases power consumption
3. Higher associativity
 Reduces conflict misses
 Increases hit time, increases power consumption
4. Higher number of cache levels
 Reduces overall memory access time
5. Giving priority to read misses over writes
 Reduces miss penalty
6. Avoiding address translation in cache indexing (later)
 Reduces hit time
More advanced optimizations

11
1. Fast Hit Times via Small, Simple Caches
 Simple caches can be faster
 cache hit time is increasingly a bottleneck to CPU performance
 set associativity requires more complex tag matching ⇒ slower
 direct-mapped caches are simpler ⇒ faster ⇒ shorter CPU cycle time
– tag check can be overlapped with transmission of data

 Smaller caches can be faster


 can fit on the same chip as CPU
 avoid penalty of going off-chip
 for L2 caches: compromise
 keep tags on chip, and data off chip
– fast tag check, yet greater cache capacity
 L1 data cache reduced from 16KB in Pentium III to 8KB in
Pentium IV

12
Simple and small is fast

Access time vs. size and associativity

13
Simple and small is energy-efficient

Energy per read vs. size and associativity

14
2. Way Prediction
 Way prediction to improve hit time
 Goal: reduce conflict misses, yet maintain hit speed of a
direct-mapped cache
 Approach: keep extra bits to predict the “way” within the set
 the output multiplexor is pre-set to select the desired block
 if block is correct one, fast hit time of 1 clock cycle
 if block isn’t correct, check other blocks in 2nd clock cycle
 Mis-prediction gives longer hit time
 Prediction accuracy
 > 90% for two-way
 > 80% for four-way
 I-cache has better accuracy than D-cache
 First used on MIPS R10000 in mid-90s
 Used on ARM Cortex-A8

15
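A rough C sketch of a way-predicted 2-way lookup; the structures and the single prediction bit per set are illustrative assumptions, not the R10000 or Cortex-A8 implementation.

/* Sketch of a way-predicted lookup in a 2-way set-associative cache.
 * One prediction bit per set; all sizes are assumptions. */
#include <stdbool.h>
#include <stdint.h>

#define NUM_SETS 256
#define WAYS     2

typedef struct {
    bool     valid;
    uint64_t tag;
} line_t;

static line_t  cache[NUM_SETS][WAYS];
static uint8_t predicted_way[NUM_SETS];   /* 1 prediction bit per set */

/* Returns 1 on a fast hit (predicted way correct), 2 on a slow hit
 * (other way checked in a second cycle), 0 on a miss. */
int lookup(uint64_t set, uint64_t tag) {
    int w = predicted_way[set];
    if (cache[set][w].valid && cache[set][w].tag == tag)
        return 1;                          /* fast hit: 1 clock cycle */
    int other = w ^ 1;
    if (cache[set][other].valid && cache[set][other].tag == tag) {
        predicted_way[set] = other;        /* retrain the predictor */
        return 2;                          /* slow hit: extra cycle */
    }
    return 0;                              /* miss */
}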
2a. Way Selection
 Extension of way prediction
 Idea:
 Instead of merely pre-setting the output multiplexor to select the correct
block out of many…
 … only the ONE predicted block is actually read from the cache
 Pros: energy efficient
 only one block is read (assuming the prediction is correct)
 Cons: longer latency on misprediction
 if the prediction was wrong, the other block(s) now have to be read and
their tags checked

16
3. Pipelining Cache
 Pipeline cache access to improve bandwidth
 For faster clock cycle time:
allow L1 hit time to be multiple clock cycles (instead of 1 cycle)
make cache pipelined, so it still has high bandwidth
 Examples:
 Pentium: 1 cycle
 Pentium Pro – Pentium III: 2 cycles
 Pentium 4 – Core i7: 4 cycles
 Cons:
 increases number of pipeline stages for an instruction
 longer branch mis-prediction penalty
 more clock cycles between “load” and receiving the data
 Pros:
 allows faster clock rate for the processor
 makes it easier to increase associativity
17
4. Non-blocking Caches
 Non-blocking cache or lockup-free cache allows the
data cache to continue to supply cache hits during a
miss
 “Hit under miss”
 reduces the effective miss penalty by being helpful during a miss
instead of ignoring the requests of the CPU
 “Hit under multiple miss” or “miss under miss”
 may further lower the effective miss penalty by overlapping
multiple misses
 Significantly increases the complexity of the cache controller
as there can be multiple outstanding memory accesses

18
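One common way to track the outstanding misses mentioned above is with Miss Status Handling Registers (MSHRs); the sketch below is a simplified illustration with assumed sizes, not a description of any particular controller.

/* Rough sketch of the bookkeeping a non-blocking cache needs: MSHRs
 * track outstanding misses so later hits (and further misses) can
 * proceed. Sizes and fields are assumptions for illustration. */
#include <stdbool.h>
#include <stdint.h>

#define NUM_MSHR 8

typedef struct {
    bool     valid;
    uint64_t block_addr;   /* block being fetched from memory        */
    int      pending;      /* CPU requests waiting on this block     */
} mshr_t;

static mshr_t mshr[NUM_MSHR];

/* Called on a cache miss. Returns true if the miss can be tracked
 * (merged with an in-flight miss, or given a free MSHR); false means
 * the cache must stall because all MSHRs are busy. */
bool handle_miss(uint64_t block_addr) {
    for (int i = 0; i < NUM_MSHR; i++)         /* miss under miss: merge */
        if (mshr[i].valid && mshr[i].block_addr == block_addr) {
            mshr[i].pending++;
            return true;
        }
    for (int i = 0; i < NUM_MSHR; i++)         /* allocate a new entry */
        if (!mshr[i].valid) {
            mshr[i] = (mshr_t){ true, block_addr, 1 };
            return true;
        }
    return false;                              /* structural stall */
}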
Value of Hit Under Miss for SPEC

 Hit under 1 miss, 2 misses, and 64 misses

 Hit under 1 miss
 miss penalty reduced 9% for integer and 12.5% for floating-pt
 Hit under 2 misses
 benefit is slightly higher: 10% and 16% respectively
 No further benefit for 64 misses
19
5. Multibanked Caches
 Organize cache as independent banks to support
simultaneous access
 originally banks were used only for main memory
 now common for L2 caches
 ARM Cortex-A8 supports 1-4 banks for L2
 Intel i7 supports 4 banks for L1 and 8 banks for L2
 Interleave banks according to block address (see the sketch below)
 banks can be accessed in parallel

20
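A tiny sketch of block-address interleaving; the 64-byte block size and 4 banks are assumptions for illustration. With these parameters, addresses 0, 64, 128, and 192 map to banks 0, 1, 2, and 3, so accesses to them can proceed in parallel.

/* Sequential interleaving by block address: consecutive blocks map to
 * consecutive banks. Block size and bank count are assumed. */
#include <stdint.h>

#define BLOCK_SIZE 64
#define NUM_BANKS  4

static inline unsigned bank_of(uint64_t addr) {
    uint64_t block_addr = addr / BLOCK_SIZE;    /* drop the block offset */
    return (unsigned)(block_addr % NUM_BANKS);  /* interleave by block   */
}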
6. Early Restart and Critical Word First
 Don’t wait for the full block to be loaded before
restarting the CPU
 Early Restart—As soon as the requested word of the block
arrives, send it to the CPU and let the CPU continue
execution
 Critical Word First—Request the missed word first from
memory and send it to the CPU as soon as it arrives
 let the CPU continue while filling the rest of the words in the block.
 also called “wrapped fetch” and “requested word first”
 Generally useful only for large blocks
 Spatial locality can be a problem
 the CPU tends to want the next sequential word anyway, so it is not clear
how much early restart helps

21
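The wrapped fetch order used by critical word first can be sketched in a few lines of C; the 8-word block is an assumption for illustration.

/* Critical word first / wrapped fetch: the missed word is requested
 * first, then the remaining words of the block wrap around.
 * An 8-word block is assumed. */
#include <stdio.h>

#define WORDS_PER_BLOCK 8

int main(void) {
    int critical = 5;   /* word within the block that the CPU asked for */
    printf("fetch order:");
    for (int i = 0; i < WORDS_PER_BLOCK; i++)
        printf(" %d", (critical + i) % WORDS_PER_BLOCK);
    printf("\n");       /* prints: 5 6 7 0 1 2 3 4 */
    return 0;
}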
7. Merging Write Buffer
 Write buffers used in both write-through and write-back
 write-through: write sent to buffer so memory update can happen in
background
 write-back: when a dirty block is replaced, write sent to buffer
 Merging writes:
 when updating a location that is already pending in the write buffer,
update write buffer, instead of creating a new entry in write buffer

[Figure: write buffer contents without merging vs. with write buffer merging]

22
Merging Write Buffer (contd.)
 Pros: reduces stalls due to write buffer being full
 But: I/O writes cannot be merged
 memory-mapped I/O
 I/O writes become memory writes
 should not be merged because I/O has different semantics
 want to keep each I/O event distinct

[Figure: write buffer contents without merging vs. with write buffer merging]

23
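A sketch of the merging logic described above: a write to a block that already has a pending buffer entry just updates that entry's word and per-word valid bits. The entry count, 8-word (64-byte) blocks with 8-byte words, and field layout are illustrative assumptions.

/* Write-buffer merging sketch: merge into an existing entry for the
 * same block if possible, otherwise allocate a new entry.
 * Assumes 8-byte words and 8 words per block. */
#include <stdbool.h>
#include <stdint.h>

#define WB_ENTRIES      4
#define WORDS_PER_BLOCK 8

typedef struct {
    bool     valid;
    uint64_t block_addr;
    uint8_t  word_valid;                  /* one bit per word in the block */
    uint64_t words[WORDS_PER_BLOCK];
} merge_entry_t;

static merge_entry_t wb[WB_ENTRIES];

/* Returns true if the write was buffered (merged or newly allocated);
 * false means the buffer is full and the CPU must stall. */
bool buffer_write(uint64_t addr, uint64_t data) {
    uint64_t block = addr / (WORDS_PER_BLOCK * 8);   /* 8-byte words */
    unsigned word  = (addr / 8) % WORDS_PER_BLOCK;

    for (int i = 0; i < WB_ENTRIES; i++)             /* try to merge first */
        if (wb[i].valid && wb[i].block_addr == block) {
            wb[i].words[word] = data;
            wb[i].word_valid |= (uint8_t)(1u << word);
            return true;
        }
    for (int i = 0; i < WB_ENTRIES; i++)             /* else allocate */
        if (!wb[i].valid) {
            wb[i].valid       = true;
            wb[i].block_addr  = block;
            wb[i].word_valid  = (uint8_t)(1u << word);
            wb[i].words[word] = data;
            return true;
        }
    return false;                                    /* buffer full: stall */
}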
8. Reduce Misses by Compiler Optzns.
 Instructions
 Reorder procedures in memory so as to reduce misses
 Profiling to look at conflicts
 McFarling [1989] reduced cache misses by 75% on an 8KB direct-
mapped cache with 4-byte blocks
 Data
 Merging Arrays
 Improve spatial locality by single array of compound elements vs. 2 arrays
 Loop Interchange
 Change nesting of loops to access data in order stored in memory
 Loop Fusion
 Combine two independent loops that have same looping and some
variables overlap
 Blocking
 Improve temporal locality by accessing “blocks” of data repeatedly vs.
going down whole columns or rows
24
Merging Arrays Example
/* Before */
int val[SIZE];
int key[SIZE];

/* After */
struct merge {
    int val;
    int key;
};
struct merge merged_array[SIZE];

 Reduces conflicts between val and key
 Addressing expressions are different

25
Loop Interchange Example
/* Before */
for (k = 0; k < 100; k++)
    for (j = 0; j < 100; j++)
        for (i = 0; i < 5000; i++)
            x[i][j] = 2 * x[i][j];

/* After */
for (k = 0; k < 100; k++)
    for (i = 0; i < 5000; i++)
        for (j = 0; j < 100; j++)
            x[i][j] = 2 * x[i][j];

 Sequential accesses instead of striding through
memory every 100 words

26
Loop Fusion Example
/* Before */
for (i = 0; i < N; i++)
    for (j = 0; j < N; j++)
        a[i][j] = 1/b[i][j] * c[i][j];
for (i = 0; i < N; i++)
    for (j = 0; j < N; j++)
        d[i][j] = a[i][j] + c[i][j];

/* After */
for (i = 0; i < N; i++)
    for (j = 0; j < N; j++) {
        a[i][j] = 1/b[i][j] * c[i][j];
        d[i][j] = a[i][j] + c[i][j];
    }

 Before: 2 misses per access to a and c
 After: 1 miss per access to a and c

27
Blocking Example
/* Before */
for (i = 0; i < N; i++)
    for (j = 0; j < N; j++) {
        r = 0;
        for (k = 0; k < N; k++)
            r = r + y[i][k]*z[k][j];
        x[i][j] = r;
    }

 Two Inner Loops:
 Read all NxN elements of z[]
 Read N elements of 1 row of y[] repeatedly
 Write N elements of 1 row of x[]
 Capacity misses are a function of N and cache size
 if the cache can hold all 3 NxN matrices, no capacity misses; otherwise ...
 Idea: compute on a BxB submatrix that fits in the cache
28
Blocking Example (contd.)
 Age of accesses
 White means not touched yet
 Light gray means touched a while ago
 Dark gray means newer accesses

29
Blocking Example (contd.)
/* After */
for (jj = 0; jj < N; jj = jj+B)
    for (kk = 0; kk < N; kk = kk+B)
        for (i = 0; i < N; i++)
            for (j = jj; j < min(jj+B, N); j++) {
                r = 0;
                for (k = kk; k < min(kk+B, N); k++)
                    r = r + y[i][k]*z[k][j];
                x[i][j] = x[i][j] + r;
            }

 Work with BxB submatrices
 smaller working set can fit within the cache
 fewer capacity misses

30
Blocking Example (contd.)

 Capacity reqd. goes from (2N³ + N²) to (2N³/B + N²)
 B = “blocking factor”

31
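For a rough sense of scale (illustrative numbers, not from the lecture): with N = 1000 and B = 50, the quantity falls from 2N³ + N² ≈ 2.0 × 10⁹ to 2N³/B + N² ≈ 4.1 × 10⁷, roughly a 50× reduction.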
Summary: Compiler Optimizations to
Reduce Cache Misses

32
9. Reduce Misses by Hardware Prefetching
 Prefetching done by hardware outside of the cache
 Instruction prefetching
 Alpha 21064 fetches 2 blocks on a miss
 Extra block placed in stream buffer
 On miss, check the stream buffer

 Works with data blocks too
 Jouppi [1990]: 1 data stream buffer caught 25% of the misses from a 4KB
cache; 4 stream buffers caught 43%
 Palacharla & Kessler [1994]: for scientific programs, 8 stream buffers
caught 50% to 70% of the misses from two 64KB, 4-way set-associative
caches
 Prefetching relies on extra memory bandwidth that
can be used without penalty
 e.g., up to 8 prefetch stream buffers in the UltraSPARC III

33
Hardware Prefetching: Benefit
 Fetch two blocks on miss (include next sequential
block)

Pentium 4 Pre-fetching

34
10. Reducing Misses by Software Prefetching
 Data prefetch
 Compiler inserts special “prefetch” instructions into the program
 Load data into register (HP PA-RISC loads)
 Cache prefetch: load into cache (MIPS IV, PowerPC, SPARC v9)
 A form of speculative execution
 we don’t really know if the data is needed, or whether it is already in the cache
 Most effective prefetches are “semantically invisible” to the program
 does not change registers or memory
 cannot cause a fault/exception
 if they would fault, they are simply turned into NOP’s

 Issuing prefetch instructions takes time
 Is the cost of issuing prefetches < the savings from reduced misses?
 Combine with loop unrolling and software pipelining (see the sketch below)
35
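As a hedged illustration of compiler-style prefetching combined with loop unrolling, the loop below uses GCC/Clang's __builtin_prefetch; the prefetch distance and unroll factor are tuning assumptions, not values from the lecture.

/* Software prefetching sketch: unroll by 4 and prefetch PD iterations
 * ahead. PD and the unroll factor are illustrative tuning choices. */
#define PD 16                        /* prefetch distance, in iterations */

void scale(double *a, const double *b, int n) {
    int i;
    for (i = 0; i + 4 <= n; i += 4) {                /* unrolled by 4 */
        if (i + PD < n) {                            /* stay within arrays */
            __builtin_prefetch(&b[i + PD], 0, 1);    /* prefetch for read  */
            __builtin_prefetch(&a[i + PD], 1, 1);    /* prefetch for write */
        }
        a[i]     = 2.0 * b[i];
        a[i + 1] = 2.0 * b[i + 1];
        a[i + 2] = 2.0 * b[i + 2];
        a[i + 3] = 2.0 * b[i + 3];
    }
    for (; i < n; i++)                               /* remainder loop */
        a[i] = 2.0 * b[i];
}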
A couple other optimizations

36
Reduce Conflict Misses via Victim Cache
 How to combine the fast hit time of direct mapped, yet avoid conflict misses?
 Add a small, highly associative buffer to hold data discarded from the cache
 Jouppi [1990]: a 4-entry victim cache removed 20% to 95% of conflicts for a
4 KB direct-mapped data cache

[Figure: the victim cache (tag + data) sits between the direct-mapped cache and memory and is checked on a miss]

37
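A minimal sketch of the victim-cache check performed after a direct-mapped miss; the 4-entry size follows Jouppi [1990], but the structures are simplified for illustration (the swap on a victim hit is only noted in a comment).

/* Victim-cache lookup sketch: a small fully associative buffer of
 * recently evicted lines is searched only after the direct-mapped
 * cache misses. Structures are simplified assumptions. */
#include <stdbool.h>
#include <stdint.h>

#define VC_ENTRIES 4

typedef struct {
    bool     valid;
    uint64_t block_addr;   /* full block address of the evicted line */
} victim_t;

static victim_t victim_cache[VC_ENTRIES];

/* Returns true if the block is found in the victim cache (a slow hit),
 * avoiding an access to the next memory level. */
bool victim_lookup(uint64_t block_addr) {
    for (int i = 0; i < VC_ENTRIES; i++)
        if (victim_cache[i].valid && victim_cache[i].block_addr == block_addr)
            return true;   /* swap this entry with the conflicting DM line */
    return false;          /* true miss: fetch from the next level */
}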
Reduce Conflict Misses via Pseudo-Assoc.
 How to combine the fast hit time of direct mapped with
the lower conflict misses of a 2-way SA cache?
 Divide the cache: on a miss, check the other half of the cache to
see if the block is there; if so, it is a pseudo-hit (slow hit)

[Figure: access timeline showing Hit Time, then Pseudo Hit Time, then Miss Penalty]

 Drawback: the CPU pipeline is hard to design if a hit takes 1 or 2
cycles
 Better for caches not tied directly to the processor
38
Fetching Subblocks to Reduce Miss Penalty
 Don’t have to load the full block on a miss
 Have a valid bit per subblock to indicate which subblocks are present

Tag    Subblock valid bits
100    1 1 1 0
200    0 1 0 1
300    0 0 0 1

39
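A small sketch of the sub-block validity check implied by the table above: one tag per block, one valid bit per sub-block, so a miss need only fetch the missing sub-block. Four sub-blocks per block are assumed, matching the figure.

/* Sub-block (sector) validity sketch: hit requires both a tag match and
 * the addressed sub-block's valid bit. A tag match with the bit clear
 * means fetching one sub-block, not the whole block. */
#include <stdbool.h>
#include <stdint.h>

#define SUBBLOCKS 4

typedef struct {
    uint64_t tag;
    uint8_t  sub_valid;        /* one valid bit per sub-block */
} sector_line_t;

bool subblock_hit(const sector_line_t *line, uint64_t tag, unsigned sub) {
    return line->tag == tag && ((line->sub_valid >> sub) & 1u);
}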
Review: Improving Cache Performance
1. Reduce the miss rate,
2. Reduce the miss penalty, or
3. Reduce the time to hit in the cache.

40
Summary

41
