L08 Caches 2: Cache Optimizations
Joel Emer
6.823, October 5, 2005
CPU-Cache Interaction (5-stage pipeline)
[Figure: a 5-stage pipeline with a primary instruction cache at the fetch stage and a primary data cache at the memory stage; on a data cache miss the entire CPU stalls, and the miss is handled by the memory control.]
Write Performance
[Figure: direct-mapped cache. The address splits into tag (t bits), index (k bits), and block offset (b bits); the index selects one of 2^k lines, the stored tag is compared with the address tag, and the match signal drives the data RAM write enable (WE).]
The difficulty: a store cannot safely write the data RAM until the tag check confirms a hit, so the tag check and the data write are serialized.
Solutions:
• Design a data RAM that can perform a read and a write in one cycle, writing speculatively and restoring the old value after a tag miss
• Pipeline the writes, as in the figure below
[Figure: write pipeline. Load/store tags are checked (=?) while the data for the previous store, held in a buffer, is written into the data array; the tag check of the current store thus overlaps the data write of the last one.]
Write Policy
• Cache hit:
– write through: write both cache & memory
• generally higher traffic but simplifies cache coherence
– write back: write cache only
(memory is written only when the entry is evicted)
• a dirty bit per block can further reduce the traffic
• Cache miss:
– no write allocate: only write to main memory
– write allocate (aka fetch on write): fetch into cache
• Common combinations:
– write through and no write allocate
– write back with write allocate
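A minimal C sketch of the two common combinations (the data layout, helper names, and 64-byte block size are illustrative assumptions, not from the slides):

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define BLOCK 64

enum policy { WRITE_THROUGH_NO_ALLOCATE, WRITE_BACK_ALLOCATE };

struct line {
    bool     valid;
    bool     dirty;                 /* meaningful only for write-back */
    uint32_t base;                  /* block-aligned address of the cached block */
    uint8_t  data[BLOCK];
};

/* Stand-ins for the next level of the hierarchy. */
static void mem_write_word(uint32_t a, uint32_t w)        { (void)w; printf("mem write word  %#x\n", (unsigned)a); }
static void mem_fetch_block(uint32_t a, uint8_t *d)       { memset(d, 0, BLOCK); printf("mem fetch block %#x\n", (unsigned)a); }
static void mem_write_block(uint32_t a, const uint8_t *d) { (void)d; printf("mem write block %#x\n", (unsigned)a); }

static void store(struct line *l, uint32_t addr, uint32_t word, enum policy p)
{
    uint32_t base = addr & ~(uint32_t)(BLOCK - 1);
    bool hit = l->valid && l->base == base;

    if (p == WRITE_THROUGH_NO_ALLOCATE) {
        if (hit)
            memcpy(&l->data[addr - base], &word, sizeof word);
        mem_write_word(addr, word);          /* memory is always written */
    } else {                                 /* write back + write allocate */
        if (!hit) {
            if (l->valid && l->dirty)        /* dirty bit avoids clean write-backs */
                mem_write_block(l->base, l->data);
            mem_fetch_block(base, l->data);  /* fetch on write */
            l->valid = true;
            l->base  = base;
        }
        memcpy(&l->data[addr - base], &word, sizeof word);
        l->dirty = true;                     /* memory written only on eviction */
    }
}

int main(void)
{
    struct line l = {0};
    store(&l, 0x1004, 42, WRITE_BACK_ALLOCATE);
    store(&l, 0x2004,  7, WRITE_BACK_ALLOCATE);   /* evicts the dirty 0x1000 block */
    return 0;
}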
Average Cache Read Latency
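The formula itself did not survive extraction; the standard expression, whose three terms the levers below correspond to, is:

$t_{avg} = t_{hit} + (\text{miss rate}) \times (\text{miss penalty})$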
To improve performance:
• reduce the miss rate (e.g., larger cache)
• reduce the miss penalty (e.g., L2 cache)
• reduce the hit time
To reduce the miss rate:
• Higher associativity (fewer conflict misses)
Block-level Optimizations
• Tags are too large, i.e., too much overhead
– Simple solution: larger blocks, but the miss penalty could be large.
• Sub-block placement (aka sector cache): one tag covers the whole block, with a valid bit per sub-block, so only the needed sub-block is fetched on a miss

  Tag   Sub-block valid bits
  100   1 1 1 1
  300   1 1 0 0
  204   0 1 0 1
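A minimal C sketch of the table above (the four sub-blocks per block come from the example rows; names and the sub-block size are assumptions):

#include <stdbool.h>
#include <stdint.h>

#define SUBBLOCKS 4                  /* matches the four valid bits per row above */

struct sector_line {
    uint32_t tag;                    /* one tag covers the whole block */
    uint8_t  valid;                  /* one valid bit per sub-block (bit i) */
    uint8_t  data[SUBBLOCKS][16];    /* 16-byte sub-blocks: an assumed size */
};

/* A reference hits only if the tag matches AND its sub-block is valid; on a
   tag match with an invalid sub-block, only that sub-block must be fetched. */
static bool sector_hit(const struct sector_line *l, uint32_t tag, unsigned sub)
{
    return l->tag == tag && ((l->valid >> sub) & 1);
}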
Energy in Set-Associative Caches
• Reading a tag and a data word from every way in parallel is not energy-efficient.
• Two-phase approach: first read and compare the tags, then read data only from the selected way.
  – More energy-efficient
  – Doubles the latency in L1
  – OK for L2 and above. Why?
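A C sketch of the two-phase lookup (array shapes and sizes are illustrative):

#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define SETS 256
#define WAYS 4

static bool     valid[SETS][WAYS];
static uint32_t tags[SETS][WAYS];
static uint8_t  data[SETS][WAYS][64];

/* Returns the hit way (or -1): phase 1 touches only the small tag arrays;
   phase 2 reads a single data way instead of all WAYS data arrays at once. */
static int phased_read(uint32_t set, uint32_t tag, uint8_t out[64])
{
    int hit_way = -1;
    for (int w = 0; w < WAYS; w++)           /* phase 1: tag compares (=?) */
        if (valid[set][w] && tags[set][w] == tag)
            hit_way = w;
    if (hit_way < 0)
        return -1;                           /* miss */
    memcpy(out, data[set][hit_way], 64);     /* phase 2: read only the hit way */
    return hit_way;
}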
Way-Predicting Caches
[Figure: a prediction table picks one way of the set to read first. If that way's tag matches: HIT. If another way of the set matches: SLOW HIT, and the entry in the prediction table is changed. If no way matches: MISS, and the block of data is read from the next level of cache.]
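A sketch of the lookup flow in the figure, assuming a 2-way set-associative cache and a per-set prediction table (all names and sizes are illustrative):

#include <stdbool.h>
#include <stdint.h>

#define SETS 64
#define WAYS 2

struct way { bool valid; uint32_t tag; };

static struct way cache[SETS][WAYS];
static uint8_t    predicted_way[SETS];       /* the prediction table */

/* Returns 0 = fast HIT, 1 = SLOW HIT (wrong way predicted), 2 = MISS. */
static int lookup(uint32_t set, uint32_t tag)
{
    uint8_t p = predicted_way[set];
    if (cache[set][p].valid && cache[set][p].tag == tag)
        return 0;                            /* HIT: only one way was read */

    for (int w = 0; w < WAYS; w++) {         /* probe the remaining ways */
        if (w != (int)p && cache[set][w].valid && cache[set][w].tag == tag) {
            predicted_way[set] = (uint8_t)w; /* change entry in prediction table */
            return 1;                        /* SLOW HIT */
        }
    }
    return 2;        /* MISS: read the block from the next level of cache */
}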
Way-Predicting Instruction Cache
[Figure: the primary instruction cache is read with a predicted way; under jump control, the next fetch address and its way are selected between the sequential path (PC+4, sequential way) and the jump target.]
Victim Caches
[Figure: a small fully-associative (FA) victim cache of 4 blocks sits beside the direct-mapped L1 data cache, in front of the unified L2. Data evicted from L1 goes into the victim cache; on an L1 miss that hits in the VC, the data is supplied from the VC; data evicted from the VC goes where?]
A victim cache is a small associative backup cache, added to a direct-mapped cache, which holds recently evicted lines:
• First look up in the direct-mapped cache
• If miss, look in the victim cache
• If hit in the victim cache, swap the hit line with the line now evicted from L1
• If miss in the victim cache too, L1 victim -> VC, VC victim -> ?
This gives the fast hit time of a direct-mapped cache but with reduced conflict misses.
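A minimal C sketch of the lookup/swap sequence above (the 4-block FA victim cache matches the figure; all other names and the replacement choice are assumptions):

#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define L1_SETS   64
#define VC_BLOCKS 4

struct block { bool valid; uint32_t base; uint8_t data[64]; };

static struct block l1[L1_SETS];         /* direct-mapped L1 */
static struct block vc[VC_BLOCKS];       /* fully-associative victim cache */

/* Stand-in for an L2 fill. */
static void fetch_from_l2(uint32_t base, struct block *b)
{
    b->valid = true;
    b->base  = base;
    memset(b->data, 0, sizeof b->data);
}

void cache_access(uint32_t addr)
{
    uint32_t base = addr & ~63u;                 /* block-aligned address */
    struct block *l = &l1[(base / 64) % L1_SETS];

    if (l->valid && l->base == base)             /* 1. look up direct-mapped L1 */
        return;

    for (int i = 0; i < VC_BLOCKS; i++) {        /* 2. miss: look in the VC */
        if (vc[i].valid && vc[i].base == base) {
            struct block tmp = *l;               /* 3. swap the VC hit with the */
            *l = vc[i];                          /*    line now evicted from L1 */
            vc[i] = tmp;
            return;
        }
    }
    vc[0] = *l;              /* 4. miss in VC too: L1 victim -> VC; the old VC */
                             /*    victim is simply dropped in this sketch     */
    fetch_from_l2(base, l);
}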
Multilevel Caches
• A memory cannot be large and fast
[Figure: memory hierarchy CPU -> L1 -> L2 -> DRAM]
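A standard two-level expansion of the earlier read-latency formula, where $m_{L1}$ and $m_{L2}$ are the local miss rates of each level:

$t_{avg} = t_{hit,L1} + m_{L1}\,\bigl(t_{hit,L2} + m_{L2}\, t_{DRAM}\bigr)$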
Inclusion Policy
• Inclusive multilevel cache:
  – The inner cache holds copies of data in the outer cache
  – External (extra-CPU) accesses need only check the outer cache
  – Most common case
Joel Emer
CPU Unified
Data L2
Cache Cache
Write
RF
buffer
Prefetching
• Varieties of prefetching
– Hardware prefetching
– Software prefetching
– Mixed schemes
Issues in Prefetching
• Usefulness – should produce hits
• Timeliness – not late and not too early
• Cache and bandwidth pollution
[Figure: CPU with register file, L1 instruction and L1 data caches, and a unified L2; prefetched data is placed into the caches.]
Hardware Instruction Prefetching
[Figure: on an instruction fetch miss, the requested block goes into the L1 instruction cache while the prefetched next block is placed in a 4-block stream buffer between the L1 instruction cache and the unified L2.]
Hardware Data Prefetching
• Strided prefetch
  – If a sequence of accesses to blocks b, b+N, b+2N is detected, then prefetch b+3N, and so on.
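A C sketch of such a stride detector (the PC-indexed table, its size, and the confirmation threshold are illustrative assumptions):

#include <stdint.h>
#include <stdio.h>

/* Stand-in for the hardware prefetch action. */
static void prefetch(uint64_t addr) { printf("prefetch %#llx\n", (unsigned long long)addr); }

struct stride_entry {
    uint64_t last_addr;    /* last address seen for this PC */
    int64_t  stride;       /* candidate stride N */
    int      confirmed;    /* consecutive repeats of the stride */
};

static struct stride_entry table[256];    /* indexed by hashed load/store PC */

/* Call on every memory access with its PC and effective (block) address. */
static void observe(uint64_t pc, uint64_t addr)
{
    struct stride_entry *e = &table[pc & 255];
    int64_t s = (int64_t)(addr - e->last_addr);

    if (s != 0 && s == e->stride)
        e->confirmed++;                   /* the stride repeated */
    else {
        e->stride    = s;                 /* adopt a new candidate stride */
        e->confirmed = 0;
    }
    e->last_addr = addr;

    if (e->confirmed >= 1)                /* two equal strides: saw b, b+N, b+2N */
        prefetch(addr + (uint64_t)e->stride);   /* so prefetch b+3N */
}

int main(void)
{
    for (uint64_t a = 0x1000; a < 0x1400; a += 0x40)
        observe(0x400, a);                /* a strided stream triggers prefetches */
    return 0;
}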
Software Prefetching
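The slide's code did not survive extraction; a minimal sketch using the GCC/Clang intrinsic __builtin_prefetch, with an assumed prefetch distance of 16 iterations:

/* Prefetch the data for a later iteration while computing the current one;
   prefetching past the array end is harmless, since prefetches never fault. */
double dot(const double *a, const double *b, int n)
{
    double sum = 0.0;
    for (int i = 0; i < n; i++) {
        __builtin_prefetch(&a[i + 16]);   /* fetch ahead for timeliness */
        __builtin_prefetch(&b[i + 16]);
        sum += a[i] * b[i];
    }
    return sum;
}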
Compiler Optimizations
Loop Interchange

/* before: the inner loop strides down a column of the row-major array */
for(j=0; j < N; j++) {
  for(i=0; i < N; i++) {
    x[i][j] = 2 * x[i][j]; } }

/* after interchange: the inner loop walks along a row, improving spatial locality */
for(i=0; i < N; i++) {
  for(j=0; j < N; j++) {
    x[i][j] = 2 * x[i][j]; } }
Loop Fusion
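The slide's example did not survive extraction; a standard illustration with placeholder arrays, fusing two loops so each element is reused while it is still in the cache:

/* before fusion: a[i] and c[i] are loaded again by the second loop */
for (i = 0; i < N; i++) a[i] = b[i] * c[i];
for (i = 0; i < N; i++) d[i] = a[i] + c[i];

/* after fusion: a[i] and c[i] are reused while still cached */
for (i = 0; i < N; i++) {
    a[i] = b[i] * c[i];
    d[i] = a[i] + c[i];
}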
Blocking

/* unblocked N x N matrix multiply x = y * z; for each i, all of z is
   swept column by column, so z gets no reuse once it outgrows the cache */
for(i=0; i < N; i++)
  for(j=0; j < N; j++) {
    r = 0;
    for(k=0; k < N; k++)
      r = r + y[i][k] * z[k][j];
    x[i][j] = r;
  }
[Figure: access patterns of x, y, and z in the loop above; y is read along rows, z down columns.]

Blocking (continued)

[Figure: the same access patterns after blocking, with y and z touched in cache-sized sub-blocks.]
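The blocked loop itself did not survive extraction; a standard formulation (B is the tile size, chosen so B x B sub-blocks of y and z fit in the cache; x must start zero-initialized because partial sums are accumulated):

#define MIN(a,b) ((a) < (b) ? (a) : (b))

for(jj=0; jj < N; jj += B)
  for(kk=0; kk < N; kk += B)
    for(i=0; i < N; i++)
      for(j=jj; j < MIN(jj+B, N); j++) {
        r = 0;
        for(k=kk; k < MIN(kk+B, N); k++)
          r = r + y[i][k] * z[k][j];
        x[i][j] = x[i][j] + r;   /* accumulate across kk tiles */
      }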
Thank you!
Extras
Further Issues
stay tuned