Chapter 5
Memory Hierarchy Design
December 2004
5.1 Introduction
The five classic components of a computer:
Input
Output
Memory
Processor (comprising the Control and the Datapath)
Build a memory hierarchy that includes main memory and caches (internal
memory) and the hard disk (external memory)
Instructions are first fetched from external storage such as the hard disk
and kept in main memory; before they reach the CPU, they are typically
copied into the caches
Technology Trends

           Capacity          Speed (latency)
Logic:     2x in 1.5 years   2x in 1.5 years
DRAM:      4x in 3 years     2x in 10 years
Disk:      4x in 3 years     2x in 10 years

DRAM generations:

Year    Size      Cycle Time
1980    64 Kb     250 ns
1983    256 Kb    220 ns
1986    1 Mb      190 ns
1989    4 Mb      165 ns
1992    16 Mb     145 ns
1995    64 Mb     120 ns
2000    256 Mb    100 ns

Improvement ratio over two decades: 4000:1 in size, but only 2.5:1 in cycle time!

[Figure: processor-memory performance gap; memory latency improves only about 7%/yr]
Memory Hierarchy
Levels of the Memory Hierarchy
Level           Capacity     Access Time
CPU Registers   500 bytes    0.25 ns
Cache           64 KB        1 ns
Main Memory     512 MB       100 ns
Disk            100 GB       5 ms
I/O Devices     ???

Data moves between levels in Blocks (between cache and main memory), Pages
(between main memory and disk), and Files (between disk and I/O devices).
Upper levels are faster but smaller; lower levels are larger but slower.
[Figure: block transfer between cache and main memory; Blk X resides in the
cache, while Blk Y must be fetched from main memory when the processor
requests it.]
Cache Measures
CPU execution time incorporating cache performance:
CPU execution time = (CPU clock cycles + Memory stall cycles)
* Clock cycle time
Memory stall cycles: number of cycles during which the CPU is stalled
waiting for a memory access
Memory stall clock cycles = Number of misses * miss penalty
= IC*(Misses/Instruction)*Miss penalty
= IC*(Memory accesses/Instruction)*Miss rate*Miss penalty
= IC * Reads per instruction * Read miss rate * Read miss penalty
+IC * Writes per instruction * Write miss rate * Write miss penalty
Memory access consists of fetching instructions and reading/writing data
P.395 Example
Example Assume we have a computer where the CPI is 1.0 when all memory accesses
hit the cache. The only data accesses are loads and stores, and these total 50% of
the instructions. If the miss penalty is 25 clock cycles and the miss rate is 2%,
how much faster would the computer be if all accesses hit in the cache?
Answer:
(A) If all accesses hit in the cache, CPI = 1.0 and there are no memory stalls:
CPU(A) = (IC*CPI + 0)*clock cycle time = IC*clock cycle time
(B) With a 2% miss rate and CPI = 1.0, we must add the memory stall cycles:
Memory stalls = IC*(Memory accesses/Instruction)*miss rate*miss penalty
= IC*(1+50%)*2%*25 = IC*0.75
then CPU(B) = (IC + IC*0.75)*clock cycle time
= 1.75*IC*clock cycle time
The performance ratio is the inverse ratio of the CPU execution times:
CPU(B)/CPU(A) = 1.75
The computer with no cache misses is 1.75 times faster.
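The arithmetic above can be sketched as a small C helper (the function and parameter names are mine, not from the text):

```c
/* Speedup of a perfect cache over a cache with misses, per the example:
   equal IC and clock cycle time cancel, leaving a ratio of per-instruction
   cycles. mem_accesses_per_instr = 1 instruction fetch + data accesses. */
double cpu_time_ratio(double cpi, double mem_accesses_per_instr,
                      double miss_rate, double miss_penalty) {
    /* Stall cycles added per instruction by cache misses */
    double stall_cpi = mem_accesses_per_instr * miss_rate * miss_penalty;
    /* CPU(B)/CPU(A) = (CPI + stalls)/CPI */
    return (cpi + stall_cpi) / cpi;
}
```

With the example's numbers (CPI 1.0, 1.5 accesses/instruction, 2% miss rate, 25-cycle penalty) this returns the 1.75 derived above.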
[Figure: memory blocks mapping into a 4-block direct-mapped cache (block
indices 0 to 3).]

[Figure: a 1 KB direct-mapped cache with 32-byte blocks. The 32-bit address
divides into Cache Tag (bits 31-10, e.g. 0x50), Cache Index (bits 9-5, e.g.
0x01), and Byte Select (bits 4-0, e.g. 0x00). A Valid Bit and the Cache Tag
are stored as part of the cache state alongside the Cache Data array (Byte 0
through Byte 1023, 32 bytes per block).]
Address fields: Cache Tag | Cache/Set Index | Block Offset (block size)
The Block Offset selects the desired data from the block, the index field
selects the set, and the tag field is compared against the CPU address for a hit:
Use the Cache Index to select the cache set
Check the Tag on each block in that set
No need to check the index or block offset
A valid bit is added to the Tag to indicate whether or not this entry
contains a valid address
Select the desired bytes using the Block Offset
Increasing associativity => shrinks index, expands tag
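The field extraction described above can be sketched in C for the figure's 1 KB direct-mapped cache with 32-byte blocks (macro and function names are my own):

```c
#include <stdint.h>

#define OFFSET_BITS 5   /* 32-byte block  -> Byte Select, bits 4..0 */
#define INDEX_BITS  5   /* 32 blocks      -> Cache Index, bits 9..5 */

/* Byte Select: the low OFFSET_BITS bits of the address */
static inline uint32_t byte_select(uint32_t addr) {
    return addr & ((1u << OFFSET_BITS) - 1);
}
/* Cache Index: the next INDEX_BITS bits, selecting the block/set */
static inline uint32_t cache_index(uint32_t addr) {
    return (addr >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1);
}
/* Cache Tag: everything above index and offset, compared for a hit */
static inline uint32_t cache_tag(uint32_t addr) {
    return addr >> (OFFSET_BITS + INDEX_BITS);
}
```

For the figure's example address (tag 0x50, index 0x01, byte select 0x00, i.e. address 0x14020) these three functions recover exactly those field values.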
[Figure: a two-way set-associative cache. The Cache Index selects a set; the
address tag (Adr Tag, e.g. 0x50) is compared in parallel against the stored
Cache Tag of each of the two blocks (Compare, gated by the Valid bit). The OR
of the two compare results yields Hit, and a mux (Sel1/Sel0) steers the
matching Cache Block's data out.]
Data cache misses per 1000 instructions for various replacement strategies
(LRU, Random, FIFO), for associativities of 2-way, 4-way, and 8-way and cache
sizes of 16 KB, 64 KB, and 256 KB.
There is little difference between LRU and random for the largest cache size,
with LRU outperforming the others for smaller caches. FIFO generally
outperforms random at the smaller cache sizes.
[Figure: a write buffer sits between the cache and DRAM; the processor writes
into the cache and the write buffer, and the buffer drains to DRAM.]
Blocks stay out of the cache under no-write allocate until the program tries
to read them; with write allocate, even blocks that are only written reside
in the cache
Answer:
Sequence: Write Mem[100]; Write Mem[100]; Read Mem[200]; Write Mem[200]; Write Mem[100]

No-write allocate: write miss, write miss, read miss, write hit, write miss
=> 4 misses; 1 hit
Write allocate: write miss, write hit, read miss, write hit, write hit
=> 2 misses; 3 hits
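The hit/miss counts can be checked with a toy model (a minimal sketch of mine; the structure and names are not from the text, and the cache is assumed big enough that blocks 100 and 200 never conflict and starts empty):

```c
#include <stdbool.h>

#define MAX_BLOCKS 8

typedef struct {
    int blocks[MAX_BLOCKS]; /* addresses of resident blocks */
    int n;                  /* number of resident blocks */
    int hits, misses;
    bool write_allocate;
} toy_cache;

static bool present(const toy_cache *c, int addr) {
    for (int i = 0; i < c->n; i++)
        if (c->blocks[i] == addr) return true;
    return false;
}

static void insert(toy_cache *c, int addr) {
    if (c->n < MAX_BLOCKS) c->blocks[c->n++] = addr;
}

/* One memory access: reads always allocate the block on a miss;
   writes allocate only under the write-allocate policy. */
void cache_access(toy_cache *c, int addr, bool is_write) {
    if (present(c, addr)) { c->hits++; return; }
    c->misses++;
    if (!is_write || c->write_allocate) insert(c, addr);
}
```

Running the five-access sequence through both policies reproduces the 4 misses / 1 hit and 2 misses / 3 hits tallies above.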
Miss rates: 0.4% for the instruction cache and 11.4% for the data cache,
versus 3.18% for a 32 KB unified cache.
Assume
A hit takes 1 clock cycle and the miss penalty is 100 cycles
A load or store takes 1 extra clock cycle on a unified cache since
there is only one cache port
36% of the instructions are data transfer instructions.
About 74% of the memory accesses are instruction references
Answer:
Average memory access time (split)
= % instructions x (Hit time + Instruction miss rate x Miss penalty)
+ % data x (Hit time + Data miss rate x Miss penalty)
= 74% x (1 + 0.4% x 100) + 26% x (1 + 11.4% x 100) = 4.24
Average memory access time (unified)
= 74% x (1 + 3.18% x 100) + 26% x (1 + 1 + 3.18% x 100) = 4.44
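As a sketch, the two AMAT formulas can be coded directly (function and parameter names are mine; note that with the rounded 74%/26% split the computed split-cache value comes out near 4.26 rather than the slide's 4.24, while the unified value matches 4.44):

```c
/* fi: fraction of instruction references; mi/md: I/D-cache miss rates;
   hit time is 1 cycle as the slide assumes. */
double amat_split(double fi, double mi, double md, double penalty) {
    return fi * (1.0 + mi * penalty) + (1.0 - fi) * (1.0 + md * penalty);
}

/* Unified cache: data accesses pay one extra cycle because there is
   only one cache port contended with instruction fetch. */
double amat_unified(double fi, double m, double penalty) {
    return fi * (1.0 + m * penalty) + (1.0 - fi) * (2.0 + m * penalty);
}
```

Despite the extra port-contention cycle only on data accesses, the unified cache still loses here because its single miss rate applies to every reference.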
Average Memory Access Time = Hit Time + Miss Rate x Miss Penalty
(hit time: Section 5.7; miss rate: Section 5.5; miss penalty: Section 5.4)

CPU time = IC x (CPI_execution + Memory accesses/Instruction x Miss rate
x Miss penalty) x Clock cycle time
Design of L2 Cache
Size
Since everything in the L1 cache is likely to be in the L2 cache, the L2
cache should be much bigger than L1
Whether data in L1 is also in L2
novice approach: design L1 and L2 independently
multilevel inclusion: L1 data are always present in L2
Advantage: easy consistency between I/O and cache (checking L2 only)
Drawback: L2 must invalidate all L1 blocks that map onto a 2nd-level
block being replaced => slightly higher 1st-level miss rate
e.g. Intel Pentium 4: 64-byte blocks in L1 and 128-byte blocks in L2
Problem: write through with write buffers creates read-after-write (RAW)
conflicts between buffered writes and main memory reads on cache misses
Simply waiting for the write buffer to empty can increase the read miss
penalty (by 50% on the old MIPS 1000)
Instead, check the write buffer contents before the read; if there are no
conflicts, let the memory access continue
Write Back
Suppose a read miss will replace a dirty block
Normal: write the dirty block to memory, and then do the read
Instead: copy the dirty block to a write buffer, do the read, and then
do the write
The CPU stalls less, since it restarts as soon as the read is done
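The "check write buffer contents before read" step might look like this sketch (the buffer structure and names are my assumptions, not from the text):

```c
#include <stdbool.h>
#include <stdint.h>

#define WB_ENTRIES 4

/* Pending writes waiting to drain to memory */
typedef struct {
    uint32_t addr[WB_ENTRIES];
    uint32_t data[WB_ENTRIES];
    bool     valid[WB_ENTRIES];
} write_buffer;

/* On a read miss, scan the buffer: a matching pending write can forward
   its value (resolving the RAW conflict); with no match, the memory read
   may proceed immediately instead of waiting for the buffer to drain. */
bool wb_forward(const write_buffer *wb, uint32_t addr, uint32_t *out) {
    for (int i = 0; i < WB_ENTRIES; i++) {
        if (wb->valid[i] && wb->addr[i] == addr) {
            *out = wb->data[i];
            return true;
        }
    }
    return false;
}
```

A real buffer would compare at block granularity and handle partial overlaps; the single-address match here is only the idea in miniature.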
Write merging: a write buffer with 4 entries, each holding four 64-bit words.
(Left) without merging, each word occupies its own entry; (right) the four
writes are merged into a single entry.
Writing multiple words at the same time is faster than writing them one at a time
3 Cs of Cache Miss; 2:1 Cache Rule

[Figure: total miss rate (0 to 0.14) versus cache size (16, 32, 64, 128 KB),
decomposed into compulsory, capacity, and conflict components; the conflict
component shrinks as associativity grows from 1-way to 8-way. Compulsory
misses are vanishingly small.]
[Figure: miss rate per type as a percentage of the total (0% to 100%) versus
cache size (16, 32, 64, 128 KB): the relative shares of conflict misses
(1-way through 8-way), capacity misses, and compulsory misses.]
[Figure: miss rate (0% to 20%) versus block size (16 to 256 bytes) for cache
sizes 4K, 16K, 64K, and 256K.]
[Figure: miss rate (0 to 0.12) versus cache size (16 to 128 KB), decomposed
into compulsory, capacity, and conflict (1-way to 8-way) components.]
On a miss, a 2nd cache entry is checked before going to the next lower
level: one fast hit and one slow hit
Invert the most significant index bit to find the other block in the pseudo-set
The miss penalty may become slightly longer
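Inverting the most significant index bit can be sketched as follows (the index width is an assumed parameter of the direct-mapped cache, and the names are mine):

```c
#include <stdint.h>

#define PSEUDO_INDEX_BITS 5   /* assumed index width: 32 sets */

/* The second ("pseudo-set") location to probe on a miss: the same index
   with its most significant bit inverted. Applying it twice returns the
   original index, so the two locations are buddies of each other. */
static inline uint32_t pseudo_set_buddy(uint32_t index) {
    return index ^ (1u << (PSEUDO_INDEX_BITS - 1));
}
```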
Maximize accesses to the data loaded into the cache before it is replaced
Improves temporal locality

X = Y*Z

/* Before */
for (i = 0; i < N; i = i+1)
    for (j = 0; j < N; j = j+1) {
        r = 0;
        for (k = 0; k < N; k = k+1)
            r = r + y[i][k]*z[k][j];
        x[i][j] = r;
    }

/* After: B = blocking factor */
for (jj = 0; jj < N; jj = jj+B)
    for (kk = 0; kk < N; kk = kk+B)
        for (i = 0; i < N; i = i+1)
            for (j = jj; j < min(jj+B, N); j = j+1) {
                r = 0;
                for (k = kk; k < min(kk+B, N); k = k+1)
                    r = r + y[i][k]*z[k][j];
                x[i][j] = x[i][j] + r;
            }
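The equivalence of the two loop nests can be checked with a small self-contained version (N, B, and the function names are illustrative choices of mine; the blocked version must start from a zeroed x because it accumulates partial sums):

```c
#include <string.h>

#define MM_N 8
#define MM_B 4

static int mm_min(int a, int b) { return a < b ? a : b; }

/* The "Before" loop nest: straightforward x = y*z */
void matmul_naive(double x[MM_N][MM_N], double y[MM_N][MM_N],
                  double z[MM_N][MM_N]) {
    for (int i = 0; i < MM_N; i++)
        for (int j = 0; j < MM_N; j++) {
            double r = 0;
            for (int k = 0; k < MM_N; k++)
                r += y[i][k] * z[k][j];
            x[i][j] = r;
        }
}

/* The "After" loop nest: j and k are blocked with factor B so the touched
   submatrices of y, z, and x fit in cache across iterations. */
void matmul_blocked(double x[MM_N][MM_N], double y[MM_N][MM_N],
                    double z[MM_N][MM_N]) {
    memset(x, 0, sizeof(double) * MM_N * MM_N); /* accumulates into x */
    for (int jj = 0; jj < MM_N; jj += MM_B)
        for (int kk = 0; kk < MM_N; kk += MM_B)
            for (int i = 0; i < MM_N; i++)
                for (int j = jj; j < mm_min(jj + MM_B, MM_N); j++) {
                    double r = 0;
                    for (int k = kk; k < mm_min(kk + MM_B, MM_N); k++)
                        r += y[i][k] * z[k][j];
                    x[i][j] += r;
                }
}
```

Both versions perform the same multiply-adds; blocking only reorders them so each B-wide slice of z is reused before being evicted.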
hit under miss: the D-cache continues to supply cache hits during a miss
hit under multiple misses or miss under miss: overlap multiple misses
Ratio of the average memory stall time of a blocking cache to that of the
hit-under-miss schemes:
the first 14 are FP programs
average: 76% for 1 miss, 51% for 2 misses, 39% for 64 misses
the final 4 are INT programs
average: 81%, 78%, and 78%
Up to 8 simultaneous prefetches
Prefetching may interfere with demand misses and lower performance
Four techniques:
1. Small and simple caches
2. Avoiding address translation during indexing of the cache
3. Pipelined cache access
4. Trace caches
General design:
a small and simple cache for the 1st-level cache
keeping the tags on chip and the data off chip for 2nd-level caches
The recent emphasis is on fast clock time, while hiding L1 misses with
dynamic execution and using L2 caches to avoid going to memory
4. I/O typically uses physical addresses (PA), so it needs to interact with the cache (see Section 5.12)
[Figure: three cache organizations. Conventional organization: CPU -> TB
(translation buffer) -> physically tagged cache (PA tags) -> MEM. Virtually
addressed cache: CPU -> virtually tagged cache (VA tags) -> TB -> MEM, so
translation happens only on a miss. Overlapped organization: the cache is
indexed with the VA while the TB translates in parallel, with a physically
addressed L2 cache behind it.]

Overlapping cache access with VA translation requires the cache index to
remain invariant across translation
Cache Optimization Summary
5.4 miss penalty
5.6 parallelism
Summary
Chapter 5 Memory Hierarchy Design
5.1 Introduction
5.2 Review of the ABCs of Caches
5.3 Cache Performance
5.4 Reducing Cache Miss Penalty
5.5 Reducing Cache Miss Rate
5.6 Reducing Cache Miss Penalty/Miss Rate via Parallelism
5.7 Reducing Hit Time
5.8 Main Memory and Organizations for Improving Performance
5.9 Memory Technology
5.10 Virtual Memory
5.11 Protection and Examples of Virtual Memory