L14: Advanced Memory
Joel Emer
Computer Science and Artificial Intelligence Laboratory
Massachusetts Institute of Technology
http://www.csg.csail.mit.edu/6.823
Direct-Mapped Cache

[Figure: direct-mapped cache. The address splits into a t-bit tag, a k-bit index, and a b-bit block offset; the index selects one of 2^k lines, each holding a valid bit (V), a tag, and a data block, and a comparator (=) checks the stored tag against the address tag to signal a hit.]
Write Performance
[Figure: the same 2^k-line cache array with a write-enable (WE) input. A store must check the tag before writing the data array, so only write on a hit; a naive design therefore needs two array accesses, i.e., two cycles, per store.]
Solutions:
• Design data RAM that can perform read and write in
one cycle, restore old value after tag miss
• Pipeline the write: check the tag in one cycle and write the data in the next, holding the store's data in a buffer until the data array is free (e.g., during the following store's tag check); loads must also check the buffer

  Cycle:    1    2    3    4    5    6
  Tag:     LD0  ST1  ST2  LD3  ST4  LD5
  Data:    LD0   -   ST1  LD3  ST2  LD5
  Buffer:   -   ST1  ST2  ST2  ST4  ST4
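A minimal model of this buffered-write scheme (my own sketch, assuming a one-entry buffer that drains into the data array when the next store does its tag check, with loads forwarding from the buffer):

```python
class DelayedWriteCache:
    def __init__(self):
        self.data = {}    # data array: addr -> value
        self.buf = None   # pending (addr, value) store awaiting its write slot

    def store(self, addr, value):
        # drain the previous store into the data array while this store
        # uses the array only for its tag check
        if self.buf is not None:
            a, v = self.buf
            self.data[a] = v
        self.buf = (addr, value)

    def load(self, addr):
        # loads must check the buffer first (it holds the newest value)
        if self.buf is not None and self.buf[0] == addr:
            return self.buf[1]
        return self.data.get(addr)

c = DelayedWriteCache()
c.store(1, 'A')       # 'A' sits in the buffer
print(c.load(1))      # → A  (forwarded from the buffer)
c.store(2, 'B')       # draining writes 'A' into the data array
print(c.load(1))      # → A  (now read from the data array itself)
```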
Write Policy

• Cache hit:
– write through: write both cache & memory
• generally higher traffic but simplifies cache coherence
– write back: write cache only
(memory is written only when the entry is evicted)
• a dirty bit per block can further reduce the traffic
• Cache miss:
– no write allocate: only write to main memory
– write allocate (aka fetch on write): fetch into cache
• Common combinations:
– write through and no write allocate
– write back with write allocate
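A toy model can make the traffic difference concrete. The sketch below (my own, assuming a one-line cache so every distinct address conflicts) counts writes that reach main memory under each common combination:

```python
def memory_writes(accesses, write_back):
    """Count main-memory writes for a trace of ('ld'|'st', addr) pairs.
    write_back=False models write-through + no-write-allocate;
    write_back=True models write-back + write-allocate."""
    line, dirty, writes = None, False, 0   # a single cache line, for brevity
    for op, addr in accesses:
        hit = (line == addr)
        if op == 'st':
            if hit:
                if write_back:
                    dirty = True           # defer: memory written on eviction
                else:
                    writes += 1            # write-through: memory written too
            elif write_back:               # write-allocate: fetch into cache
                if dirty:
                    writes += 1            # write back the dirty victim
                line, dirty = addr, True
            else:
                writes += 1                # no-write-allocate: memory only
        else:                              # load always allocates on a miss
            if not hit:
                if dirty:
                    writes += 1            # write back the dirty victim
                line, dirty = addr, False
    return writes

trace = [('st', 0), ('st', 0), ('st', 0), ('ld', 0)]
print(memory_writes(trace, write_back=False))  # → 3: one memory write per store
print(memory_writes(trace, write_back=True))   # → 0: all three coalesce in the line
```

Repeated stores to one line are exactly where write-back wins; write-through pays traffic per store but keeps memory current, which is what simplifies coherence.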
[Figure: out-of-order datapath. A rename table maps registers to tags (r1 → ti, r2 → tj, ...), with snapshots (t1, t2, ..., tn) saved for mispredict recovery; the tags feed a register file, functional units (FUs), and load and store units, connected by bypass paths.]
WAW Hazards
• On store execute:
– mark valid and speculative; save tag, data and instruction number.
• On store commit:
– clear speculative bit and eventually move data to cache
• On store abort:
– clear valid bit
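The three transitions above can be sketched as a toy single-entry speculative store buffer (field and function names are illustrative; a dict stands in for the data cache):

```python
cache = {}   # stand-in for the data cache
entry = {'valid': False, 'spec': False, 'inum': None, 'tag': None, 'data': None}

def store_execute(inum, tag, data):
    """On store execute: mark valid and speculative; save tag, data, inum."""
    entry.update(valid=True, spec=True, inum=inum, tag=tag, data=data)

def store_commit():
    """On store commit: clear the speculative bit and move data to the cache."""
    entry['spec'] = False
    cache[entry['tag']] = entry['data']
    entry['valid'] = False            # the entry can now be reused

def store_abort():
    """On store abort: clear the valid bit; the data is simply discarded."""
    entry['valid'] = False

store_execute(inum=5, tag=0x80, data=42)
store_commit()
print(cache[0x80])    # → 42
```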
Memory Dependencies
For registers we used tags or register numbers
to determine dependencies. What about
memory operations?
st r1, (r2)
ld r3, (r4)
When r2 == r4 the load depends on the store, but the addresses are not known until the instructions execute.
=> Conservative solution: a load or store cannot leave the ROB for execution until all previous loads and stores have completed execution
Address Speculation
st r1, (r2)
ld r3, (r4)
• Guess that r4 != r2, and execute load before store address known
• If r4 != r2, the guess was right: commit as usual
• If r4 == r2, the load read stale data: squash it and re-execute
Watch for stores whose address arrives after a load that needed their data has already executed.
• On load execute:
– mark the entry valid, and record the load's instruction number and the tag of its data
• On load commit:
– clear valid bit
• On load abort:
– clear valid bit
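Putting the load-buffer bullets together, violation detection can be sketched as follows (a minimal model, function names my own):

```python
load_buffer = []   # speculative load buffer: (inum, addr) per executed load

def on_load_execute(inum, addr):
    """On load execute: record the load's instruction number and address."""
    load_buffer.append((inum, addr))

def violations_for_store(inum, addr):
    """When a store's address resolves: any *younger* load (larger inum)
    to the same address that already executed read stale data."""
    return [li for li, la in load_buffer if la == addr and li > inum]

on_load_execute(7, addr=0x40)              # load speculated past an older store
print(violations_for_store(3, addr=0x40))  # → [7]: squash load 7 and re-execute
```

On commit or abort, the load's entry would simply be removed from the buffer, matching the "clear valid bit" bullets above.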
[Figure: speculative load buffer. Each entry holds a valid bit (V), the load's instruction number (Inum), and the tag of its data.]
Store Sets
(Alpha 21464)
[Figure: program order with stores at PCs 0, 4, 8, 12 and loads at PCs 28, 32, 36, 40. A store may have multiple readers, i.e., several later loads that depend on it; a single location may have multiple writers, e.g., stores on different code paths or stores to different components of the location.]

[Figure: store-set tables. Load and store PCs index entries with a valid bit (V); the writer (store) and readers (loads) that have conflicted are linked into the same store set (Store Set A), so a load waits for the stores in its set before executing.]

A load's store set is built from experience: when a load is squashed because it executed before an older store to the same address, that store is added to the load's set, and the load subsequently waits for in-flight stores in its set.
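A minimal sketch of the store-set idea (illustrative dictionaries, not the actual Alpha 21464 SSIT/LFST tables):

```python
store_set = {}     # PC -> store-set id (rough stand-in for a PC-indexed table)
_next_id = 0

def record_violation(load_pc, store_pc):
    """A memory-order violation merges the load and store into one set."""
    global _next_id
    sid = store_set.get(load_pc)
    if sid is None:
        sid = store_set.get(store_pc)
    if sid is None:
        sid = _next_id
        _next_id += 1
    store_set[load_pc] = sid
    store_set[store_pc] = sid

def must_wait(load_pc, inflight_store_pcs):
    """Should this load wait? Yes iff some in-flight store shares its set."""
    sid = store_set.get(load_pc)
    return sid is not None and any(store_set.get(pc) == sid
                                   for pc in inflight_store_pcs)

record_violation(load_pc=28, store_pc=8)   # load at PC 28 once conflicted with store at PC 8
print(must_wait(28, [8]))    # → True: the load must wait for store 8
print(must_wait(32, [8]))    # → False: the load at PC 32 has no store set yet
```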
Prefetching
• Speculate on future instruction and
data accesses and fetch them into
cache(s)
– Instruction accesses easier to predict
than data accesses
• Varieties of prefetching
– Hardware prefetching
– Software prefetching
– Mixed schemes
Issues in Prefetching
• Usefulness – should produce hits
• Timeliness – not late and not too early
• Cache and bandwidth pollution
Hardware Instruction Prefetching

[Figure: CPU with register file (RF), split L1 instruction and data caches, and a unified L2 cache. On an L1 instruction miss, the demanded block is fetched from the L2 into the L1 while the next sequential block is prefetched from the L2 into a stream buffer; a later request that hits in the stream buffer supplies the prefetched instruction block without going to the L2.]
• Strided prefetch
– If we observe a sequence of accesses to blocks b, b+N, b+2N, then prefetch b+3N, and so on
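A per-PC stride detector along these lines might look like the sketch below (my own; a real prefetcher would also track confidence and prefetch several strides ahead):

```python
class StridePrefetcher:
    def __init__(self):
        self.table = {}    # pc -> (last_addr, last_stride)

    def access(self, pc, addr):
        """Record an access; return an address to prefetch, or None."""
        last_addr, stride = self.table.get(pc, (None, None))
        new_stride = None if last_addr is None else addr - last_addr
        self.table[pc] = (addr, new_stride)
        if stride is not None and stride == new_stride:
            return addr + stride       # same stride seen twice: prefetch next
        return None

p = StridePrefetcher()
p.access(pc=0x400, addr=0)              # first access: no prediction yet
p.access(pc=0x400, addr=64)             # stride 64 observed once
print(p.access(pc=0x400, addr=128))     # → 192, i.e., b+3N after b, b+N, b+2N
```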
Thank you!
http://www.csg.csail.mit.edu/6.823