
EEF011 Computer Architecture

Chapter 5
Memory Hierarchy Design

December 2004

Chapter 5 Memory Hierarchy Design


5.1 Introduction
5.2 Review of the ABCs of Caches
5.3 Cache Performance
5.4 Reducing Cache Miss Penalty
5.5 Reducing Cache Miss Rate
5.6 Reducing Cache Miss Penalty or Miss Rate via Parallelism
5.7 Reducing Hit Time
5.8 Main Memory and Organizations for Improving Performance
5.9 Memory Technology
5.10 Virtual Memory
5.11 Protection and Examples of Virtual Memory
2

5.1 Introduction
The five classic components of a computer: Input, Output, Memory, Datapath, and Control (Datapath and Control together form the Processor).

Where do we fetch instructions to execute?

Build a memory hierarchy that includes main memory and caches (internal memory) and the hard disk (external memory).
Instructions are first fetched from external storage such as the hard disk and kept in main memory. Before they go to the CPU, they are typically brought into the caches.
3

Technology Trends

                  Capacity           Speed (latency)
CPU (logic):      2x in 1.5 years    2x in 1.5 years
DRAM:             4x in 3 years      2x in 10 years
Disk:             4x in 3 years      2x in 10 years

DRAM generations:
Year     Size      Cycle Time
1980     64 Kb     250 ns
1983     256 Kb    220 ns
1986     1 Mb      190 ns
1989     4 Mb      165 ns
1992     16 Mb     145 ns
1995     64 Mb     120 ns
2000     256 Mb    100 ns

Over these generations DRAM capacity improved about 4000:1, while cycle time improved only about 2.5:1.
4

Performance Gap between CPUs and Memory


(Figure: CPU performance improves about 1.35X/yr, later 1.55X/yr, while memory (DRAM) improves only about 7%/yr.)
The gap (latency) grows about 50% per year!


5

Memory Hierarchy
Levels of the Memory Hierarchy

Level             Capacity     Access Time   Moved in units of
CPU Registers     500 bytes    0.25 ns       -
Cache             64 KB        1 ns          cache blocks
Main Memory       512 MB       100 ns        pages
Disk              100 GB       5 ms          files
I/O Devices       ???          ???

Upper levels are faster but smaller; lower levels have larger capacity but longer access times.
6

5.2 ABCs of Caches


Cache:
  In this textbook it mainly means the first level of the memory hierarchy encountered once the address leaves the CPU
  The term is also applied whenever buffering is employed to reuse commonly occurring items, e.g., file caches, name caches, and so on
Principle of Locality:
  Programs access a relatively small portion of the address space at any instant of time.
Two Different Types of Locality:
  Temporal Locality (Locality in Time): if an item is referenced, it will tend to be referenced again soon (e.g., loops, reuse)
  Spatial Locality (Locality in Space): if an item is referenced, items whose addresses are close by tend to be referenced soon (e.g., straight-line code, array access)

Memory Hierarchy: Terminology


Hit: data appears in some block in the cache (example: Block X)
  Hit Rate: the fraction of memory accesses found in the cache
  Hit Time: time to access the upper level, which consists of
    RAM access time + time to determine hit/miss
Miss: data must be retrieved from a block in main memory (Block Y)
  Miss Rate = 1 - (Hit Rate)
  Miss Penalty: time to replace a block in the cache
    + time to deliver the block to the processor
Hit Time << Miss Penalty (e.g., 1 clock cycle vs. 40 clock cycles)
(Figure: data moves between the processor and the cache (Blk X); on a miss, Blk Y is brought from main memory into the cache.)

Cache Measures
CPU execution time incorporating cache performance:
CPU execution time = (CPU clock cycles + Memory stall cycles)
* Clock cycle time
Memory stall cycles: number of cycles during which the CPU is stalled
waiting for a memory access
Memory stall clock cycles = Number of misses * miss penalty
= IC*(Misses/Instruction)*Miss penalty
= IC*(Memory accesses/Instruction)*Miss rate*Miss penalty
= IC * Reads per instruction * Read miss rate * Read miss penalty
+IC * Writes per instruction * Write miss rate * Write miss penalty
Memory accesses consist of fetching instructions and reading/writing data

P.395 Example
Example: Assume we have a computer where the CPI is 1.0 when all memory accesses hit the cache. The only data accesses are loads and stores, and these total 50% of the instructions. If the miss penalty is 25 clock cycles and the miss rate is 2%, how much faster would the computer be if all memory accesses hit in the cache?

Answer:
(A) If all accesses hit in the cache, CPI = 1.0 and there are no memory stalls, so
  CPU time(A) = (IC*CPI + 0)*Clock cycle time = IC*Clock cycle time
(B) With a 2% miss rate and CPI = 1.0, we must add the memory stalls:
  Memory stall cycles = IC*(Memory accesses/Instruction)*Miss rate*Miss penalty
    = IC*(1 + 50%)*2%*25 = IC*0.75
  so CPU time(B) = (IC + IC*0.75)*Clock cycle time = 1.75*IC*Clock cycle time
The performance ratio is the inverse of the CPU execution times:
  CPU time(B)/CPU time(A) = 1.75
The computer with no cache misses is 1.75 times faster.
10
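A minimal C sketch of this calculation, using the numbers from the example (variable names are ours; IC and the clock cycle time are normalized to 1):

#include <stdio.h>

/* P.395 example: CPU time per instruction = CPI + memory stalls per instruction. */
int main(void) {
    double cpi = 1.0;                              /* CPI when every access hits        */
    double mem_accesses_per_instr = 1.0 + 0.50;    /* 1 instruction fetch + 50% loads/stores */
    double miss_rate = 0.02;
    double miss_penalty = 25.0;

    double stalls_per_instr = mem_accesses_per_instr * miss_rate * miss_penalty;
    double cpu_time_all_hits = cpi;                        /* (A) no memory stalls   */
    double cpu_time_with_misses = cpi + stalls_per_instr;  /* (B) 1 + 0.75 = 1.75    */

    printf("stalls per instruction = %.2f\n", stalls_per_instr);
    printf("speedup of the all-hit machine = %.2f\n",
           cpu_time_with_misses / cpu_time_all_hits);      /* 1.75 */
    return 0;
}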

Four Memory Hierarchy Questions


Q1 (block placement):
Where can a block be placed in the upper level?
Q2 (block identification):
How is a block found if it is in the upper level?
Q3 (block replacement):
Which block should be replaced on a miss?
Q4 (write strategy):
What happens on a write?

11

Q1(block placement): Where can a block be placed?


Direct mapped: (Block number) MOD (Number of blocks in cache)
Set associative: (Block number) MOD (Number of sets in cache)
  (Number of sets) x (blocks per set) = (Number of blocks in cache)
  n-way: n blocks in a set
  1-way = direct mapped
Fully associative: # of sets = 1 (a block may go anywhere)

Example: where can block 12 be placed in an 8-block cache? (See the sketch below.)

12
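A small C sketch of where block 12 can land under the three placement policies for an 8-block cache (the 2-way organization and the frame numbering within a set are illustrative assumptions):

#include <stdio.h>

/* Where can memory block 12 be placed in an 8-block cache? */
int main(void) {
    int block = 12, cache_blocks = 8;

    /* Direct mapped: exactly one frame */
    printf("direct mapped: frame %d\n", block % cache_blocks);        /* 12 mod 8 = 4 */

    /* 2-way set associative: 4 sets of 2 frames each (frame numbering is illustrative) */
    int ways = 2, sets = cache_blocks / ways;
    int set = block % sets;                                           /* 12 mod 4 = 0 */
    printf("2-way set associative: frames %d..%d (set %d)\n",
           set * ways, set * ways + ways - 1, set);

    /* Fully associative: one set containing all frames */
    printf("fully associative: any of frames 0..%d\n", cache_blocks - 1);
    return 0;
}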

Simplest Cache: Direct Mapped (1-way)


(Figure: memory blocks 0 through F mapped onto a 4-block direct-mapped cache; memory block i goes to cache index i MOD 4.)

Each block has only one place it can appear in the cache. The mapping is usually
(Block address) MOD (Number of blocks in cache)
13

Example: 1 KB Direct Mapped Cache, 32B Blocks


For a 2N byte cache:
The uppermost (32 - N) bits are always the Cache Tag
The lowest M bits are the Byte Select (Block Size = 2M)

Example: 0x50

Stored as part
of the cache state
Valid Bit

Cache Tag
0x50

Cache Data
Byte 31
Byte 63

4
0
Byte Select
Ex: 0x00

Byte 1 Byte 0 0
Byte 33 Byte 32 1
2
3

:
Byte 1023

Cache Tag

9
Cache Index
Ex: 0x01

: :

31

Byte 992 31
14
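A C sketch of how an address is decomposed for this 1 KB direct-mapped cache with 32-byte blocks (field widths follow the slide; the function name and the sample address are ours):

#include <stdint.h>
#include <stdio.h>

/* 1 KB direct-mapped cache, 32-byte blocks: 22-bit tag, 5-bit index, 5-bit byte select. */
static void split_address(uint32_t addr) {
    uint32_t byte_select = addr & 0x1F;          /* bits  4..0  */
    uint32_t index       = (addr >> 5) & 0x1F;   /* bits  9..5  */
    uint32_t tag         = addr >> 10;           /* bits 31..10 */
    printf("addr 0x%08x -> tag 0x%x, index 0x%x, byte 0x%x\n",
           addr, tag, index, byte_select);
}

int main(void) {
    split_address(0x00014020u);   /* tag 0x50, index 0x01, byte select 0x00 */
    return 0;
}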

Q2 (block identification): How is a block found?


Three portions of an address in a set-associative or direct-mapped cache:
  | Tag | Cache/Set Index | Block Offset |
  The Tag and the Index together form the Block Address; the Block Offset addresses within the block (block size).
The Block Offset selects the desired data from the block, the Index field selects the set, and the Tag field is compared against the CPU address for a hit:
  Use the Cache Index to select the cache set
  Check the Tag on each block in that set (there is no need to check the index or block offset)
  A valid bit is added to the tag to indicate whether or not this entry contains a valid address
  Select the desired bytes using the Block Offset
Increasing associativity shrinks the index and expands the tag
15

Example: Two-way set associative cache


Cache Index selects a set from the cache
The two tags in the set are compared in parallel
Data is selected based on the tag result
(Figure: the Cache Index (e.g. 0x01) selects one set; the two Cache Tags in that set, qualified by their Valid bits, are compared in parallel against the address tag (e.g. 0x50). The compare results are ORed to produce Hit and drive the Sel0/Sel1 inputs of a multiplexer that selects the data of the matching Cache Block; the Byte Select (e.g. 0x00) then picks the byte within the block.)

16
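A C sketch of the two-way set-associative lookup shown above; the structure layout, sizes, and names are illustrative, not a model of any particular processor:

#include <stdbool.h>
#include <stdint.h>

/* Two-way set-associative lookup: the index selects a set, both tags are compared,
   and the matching way supplies the data. */
#define NUM_SETS    32
#define BLOCK_BYTES 32

struct way { bool valid; uint32_t tag; uint8_t data[BLOCK_BYTES]; };
struct set { struct way way[2]; };
static struct set cache[NUM_SETS];

/* Returns true on a hit and sets *byte to the requested data. */
bool lookup(uint32_t addr, uint8_t *byte) {
    uint32_t offset = addr & (BLOCK_BYTES - 1);
    uint32_t index  = (addr / BLOCK_BYTES) % NUM_SETS;
    uint32_t tag    = addr / (BLOCK_BYTES * NUM_SETS);

    struct set *s = &cache[index];
    for (int w = 0; w < 2; w++) {                 /* the two compares happen in parallel in hardware */
        if (s->way[w].valid && s->way[w].tag == tag) {
            *byte = s->way[w].data[offset];       /* the mux selects the hitting way */
            return true;                          /* OR of the compare results = Hit */
        }
    }
    return false;                                 /* miss: go to the next level */
}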

Disadvantage of Set Associative Cache


N-way Set Associative Cache vs. Direct Mapped Cache:
N comparators vs. 1
Extra MUX delay for the data
Data comes AFTER Hit/Miss
In a direct mapped cache, Cache Block is available BEFORE Hit/Miss:
Possible to assume a hit and continue. Recover later if miss.

(Figure: the same two-way set-associative organization as on the previous slide; the extra tag compare and the Sel0/Sel1 output multiplexer sit in the data path, so the data arrive only after the hit/miss decision.)
17

Q3 (block replacement): Which block should be replaced on a cache miss?

Easy for direct mapped: hardware decisions are simplified
  Only one block frame is checked, and only that block can be replaced
Set associative or fully associative:
  There are many blocks to choose from on a miss
Three primary strategies for selecting the block to be replaced:
  Random: a candidate block is selected at random
  LRU: the Least Recently Used block is replaced
  FIFO (first in, first out): the oldest block is replaced

Data cache misses per 1000 instructions for various replacement strategies:

               2-way                  4-way                  8-way
Size        LRU    Random  FIFO    LRU    Random  FIFO    LRU    Random  FIFO
16 KB       114.1  117.3   115.5   111.7  115.1   113.3   109.0  111.8   110.4
64 KB       103.4  104.3   103.9   102.4  102.3   103.1   99.7   100.5   100.3
256 KB      92.2   92.1    92.5    92.1   92.1    92.5    92.1   92.1    92.5

There is little difference between LRU and random for the largest cache, with LRU outperforming the others for the smaller caches. FIFO generally outperforms random for the smaller cache sizes.

18
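A minimal C sketch of LRU bookkeeping within one set, using per-way timestamps (the 4-way organization, counter scheme, and names are our own illustration; real hardware typically uses cheaper approximations):

#include <stdint.h>

/* On every access the touched way records the current "time"; on a miss the way
   with the oldest timestamp is the victim. */
#define WAYS 4

struct set_lru {
    uint64_t last_used[WAYS];   /* timestamp of the most recent access per way */
    uint64_t now;               /* per-set access counter                      */
};

void lru_touch(struct set_lru *s, int way) {
    s->last_used[way] = ++s->now;       /* mark this way most recently used */
}

int lru_victim(const struct set_lru *s) {
    int victim = 0;
    for (int w = 1; w < WAYS; w++)      /* pick the least recently used way */
        if (s->last_used[w] < s->last_used[victim])
            victim = w;
    return victim;
}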

Q4(write strategy): What happens on a write?


Reads dominate processor cache accesses.
  E.g., writes are about 7% of the overall memory traffic and about 21% of data cache accesses
Two options we can adopt when writing to the cache:
  Write through: the information is written to both the block in the cache and the block in the lower-level memory.
  Write back: the information is written only to the block in the cache. The modified cache block is written to main memory only when it is replaced.
    To reduce the frequency of writing back blocks on replacement, a dirty bit is used to indicate whether the block was modified in the cache (dirty) or not (clean). If the block is clean, it is not written back, since the lower level already holds identical information.
Pros and cons:
  WT: simple to implement. The cache is always clean, so read misses never cause writes to the lower level.
  WB: writes occur at the speed of the cache, and multiple writes within a block require only one write to the lower-level memory.

19

Write Stall and Write Buffer


When the CPU must wait for writes to complete during write through, the CPU is said to have a write stall.
  A common optimization to reduce write stalls is a write buffer, which allows the processor to continue as soon as the data are written to the buffer, thereby overlapping processor execution with memory updating.

(Figure: Processor -> Cache -> DRAM, with a Write Buffer between the cache and DRAM. A write buffer is needed between the cache and memory.)


Processor: writes data into the cache and the write buffer
Memory controller: writes the contents of the buffer to memory
The write buffer is just a FIFO:
  Typical number of entries: 4
20

Write-Miss Policy: Write Allocate vs. Not Allocate


Two options on a write miss:
  Write allocate: the block is allocated on a write miss, followed by the write hit actions above
    Write misses act like read misses
  No-write allocate: write misses do not affect the cache; the block is modified only in the lower-level memory
    Blocks stay out of the cache in no-write allocate until the program tries to read them, but with write allocate even blocks that are only written will still be in the cache

21

Write-Miss Policy Example


Example: Assume a fully associative write-back cache with many cache entries that starts empty. Below is a sequence of five memory operations:
  Write Mem[100];
  Write Mem[100];
  Read Mem[200];
  Write Mem[200];
  Write Mem[100].
What are the numbers of hits and misses (including both reads and writes) when using no-write allocate versus write allocate?

Answer:
No-write allocate:
  Write Mem[100];  1 write miss
  Write Mem[100];  1 write miss
  Read Mem[200];   1 read miss
  Write Mem[200];  1 write hit
  Write Mem[100];  1 write miss
  => 4 misses; 1 hit

Write allocate:
  Write Mem[100];  1 write miss
  Write Mem[100];  1 write hit
  Read Mem[200];   1 read miss
  Write Mem[200];  1 write hit
  Write Mem[100];  1 write hit
  => 2 misses; 3 hits
22
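A small C sketch that replays this five-operation trace against an initially empty, fully associative cache to count hits and misses under both write-miss policies (the tiny fixed-size table and the names are illustrative):

#include <stdbool.h>
#include <stdio.h>

enum { READ, WRITE };
static int  entries[16], count;
static bool present(int addr) {
    for (int i = 0; i < count; i++) if (entries[i] == addr) return true;
    return false;
}
static void allocate(int addr) { if (!present(addr)) entries[count++] = addr; }

static void run(bool write_allocate) {
    int ops[5][2] = {{WRITE,100},{WRITE,100},{READ,200},{WRITE,200},{WRITE,100}};
    int hits = 0, misses = 0;
    count = 0;                                   /* the cache starts empty */
    for (int i = 0; i < 5; i++) {
        int type = ops[i][0], addr = ops[i][1];
        if (present(addr)) { hits++; continue; } /* read or write hit */
        misses++;
        if (type == READ || write_allocate)      /* reads always allocate;      */
            allocate(addr);                      /* writes only if write allocate */
    }
    printf("%-18s %d misses, %d hits\n",
           write_allocate ? "write allocate:" : "no-write allocate:", misses, hits);
}

int main(void) {
    run(false);   /* 4 misses, 1 hit  */
    run(true);    /* 2 misses, 3 hits */
    return 0;
}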

5.3 Cache Performance


Example: Split Cache vs. Unified Cache
Which has the better avg. memory access time?
A 16-KB instruction cache with a 16-KB data cache (split cache), or
A 32-KB unified cache?
Miss rates:
Size      Instruction Cache   Data Cache   Unified Cache
16 KB     0.4%                11.4%        -
32 KB     -                   -            3.18%
Assume
A hit takes 1 clock cycle and the miss penalty is 100 cycles
A load or store takes 1 extra clock cycle on a unified cache since
there is only one cache port
36% of the instructions are data transfer instructions.
About 74% of the memory accesses are instruction references

Answer:
Average memory access time (split)
  = % instructions x (Hit time + Instruction miss rate x Miss penalty)
    + % data x (Hit time + Data miss rate x Miss penalty)
  = 74% x (1 + 0.4% x 100) + 26% x (1 + 11.4% x 100) = 4.24
Average memory access time (unified)
  = 74% x (1 + 3.18% x 100) + 26% x (1 + 1 + 3.18% x 100) = 4.44

23
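A C sketch of the two average-memory-access-time calculations above (the percentages and penalties are the ones given in the example; with these rounded miss rates the split result comes out near 4.26, vs. 4.24 on the slide):

#include <stdio.h>

int main(void) {
    double instr_frac = 0.74, data_frac = 0.26;
    double hit_time = 1.0, miss_penalty = 100.0;

    /* Split: 16 KB I-cache (0.4% misses) + 16 KB D-cache (11.4% misses) */
    double amat_split = instr_frac * (hit_time + 0.004 * miss_penalty)
                      + data_frac  * (hit_time + 0.114 * miss_penalty);

    /* Unified 32 KB cache (3.18% misses); data accesses pay 1 extra cycle
       because the single port is shared with instruction fetches. */
    double amat_unified = instr_frac * (hit_time + 0.0318 * miss_penalty)
                        + data_frac  * (hit_time + 1.0 + 0.0318 * miss_penalty);

    printf("split:   %.2f cycles\n", amat_split);    /* about 4.26 with these rounded rates; slide: 4.24 */
    printf("unified: %.2f cycles\n", amat_unified);  /* about 4.44 */
    return 0;
}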

Impact of Memory Access on CPU Performance


Example: Suppose a processor:
Ideal CPI = 1.0 (ignoring memory stalls)
Avg. miss rate is 2%
Avg. memory references per instruction is 1.5
Miss penalty is 100 cycles
What is the impact on performance when the behavior of the cache is included?
Answer:
CPI = CPU execution cycles per instr. + Memory stall cycles per instr.
= CPI execution + Miss rate x Memory accesses per instr. x Miss penalty
CPI with cache = 1.0 + 2% x 1.5 x 100 = 4
CPI without cache = 1.0 + 1.5 x 100 = 151
CPU time with cache = IC x CPI x Clock cycle time = IC x 4.0 x Clock cycle time
CPU time without cache = IC x 151 x Clock cycle time
Without the cache, the CPI of the processor increases from 1 to 151!
With the cache, 75% of the time the processor is stalled waiting for memory (CPI goes from 1 to 4).
24

Impact of Cache Organizations on CPU Performance


Example: What is the impact of two different cache organizations (direct
mapped vs. 2-way set associative) on the performance of a CPU?
Ideal CPI = 2.0 (ignoring memory stalls)
Clock cycle time is 1.0 ns
Avg. memory references per instruction is 1.5
Cache size: 64 KB, block size: 64 bytes
For set-associative, assume the clock cycle time is stretched 1.25 times to
accommodate the selection multiplexer
Cache miss penalty is 75 ns
Hit time is 1 clock cycle
Miss rate: direct mapped 1.4%; 2-way set-associative 1.0%.
Answer:
Avg. memory access time (1-way) = 1.0 + (0.014 x 75) = 2.05 ns
Avg. memory access time (2-way) = 1.0 x 1.25 + (0.01 x 75) = 2.00 ns
CPU time (1-way) = IC x (CPI execution + Miss rate x Memory accesses per instruction x Miss penalty) x Clock cycle time
  = IC x (2.0 x 1.0 + (1.5 x 0.014 x 75)) = 3.58 IC
CPU time (2-way) = IC x (2.0 x 1.0 x 1.25 + (1.5 x 0.01 x 75)) = 3.63 IC
25

Summary of Performance Equations

26

Improving Cache Performance


The next few sections in the textbook look at ways to improve cache and memory access times.

Average Memory Access Time = Hit Time + Miss Rate * Miss Penalty

  (Hit Time: Section 5.7, Miss Rate: Section 5.5, Miss Penalty: Section 5.4)

CPU Time = IC * (CPI Execution + Memory Accesses/Instruction * Miss Rate * Miss Penalty) * Clock Cycle Time

27

5.4 Reducing Cache Miss Penalty


Time to handle a miss is becoming more and more the
controlling factor. This is because of the great improvement in
speed of processors as compared to the speed of memory.
Average Memory Access Time
= Hit Time + Miss Rate * Miss Penalty
Five optimizations
1. Multilevel caches
2. Critical word first and early restart
3. Giving priority to read misses over writes
4. Merging write buffer
5. Victim caches

28

O1: Multilevel Caches


Approaches
Make the cache faster to keep pace with the speed of CPUs
Make the cache larger to overcome the widening gap
L1: fast hits, L2: fewer misses
L2 Equations
Average Memory Access Time = Hit Time(L1) + Miss Rate(L1) x Miss Penalty(L1)
Miss Penalty(L1) = Hit Time(L2) + Miss Rate(L2) x Miss Penalty(L2)
Average Memory Access Time = Hit Time(L1) + Miss Rate(L1) x (Hit Time(L2) + Miss Rate(L2) x Miss Penalty(L2))
Hit Time(L1) << Hit Time(L2) << Hit Time(Mem)
Miss Rate(L1) < Miss Rate(L2)
Definitions:
  Local miss rate: misses in this cache divided by the total number of memory accesses to this cache (Miss rate(L1), Miss rate(L2))
    The L1 cache skims the cream of the memory accesses
  Global miss rate: misses in this cache divided by the total number of memory accesses generated by the CPU (Miss rate(L1); Miss rate(L1) x Miss rate(L2))
    Indicates what fraction of the memory accesses that leave the CPU go all the way to memory
29
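A C sketch that plugs made-up numbers into the L2 equations above, and also computes the global L2 miss rate from the local miss rates (all values are purely illustrative):

#include <stdio.h>

int main(void) {
    double hit_l1 = 1.0,  miss_rate_l1 = 0.04;   /* local = global for L1     */
    double hit_l2 = 10.0, miss_rate_l2 = 0.50;   /* local miss rate of L2     */
    double miss_penalty_l2 = 100.0;

    double miss_penalty_l1 = hit_l2 + miss_rate_l2 * miss_penalty_l2;
    double amat = hit_l1 + miss_rate_l1 * miss_penalty_l1;

    /* Fraction of the CPU's memory accesses that go all the way to memory. */
    double global_miss_l2 = miss_rate_l1 * miss_rate_l2;

    printf("AMAT = %.2f cycles, global L2 miss rate = %.2f%%\n",
           amat, 100.0 * global_miss_l2);
    return 0;
}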

Design of L2 Cache
Size
Since everything in L1 cache is likely to be in L2 cache, L2 cache
should be much bigger than L1

Whether data in L1 is in L2
novice approach: design L1 and L2 independently
multilevel inclusion: L1 data are always present in L2
Advantage: easy for consistency between I/O and cache (checking L2 only)
Drawback: L2 must invalidate all L1 blocks that map onto the 2nd-level
block to be replaced => slightly higher 1st-level miss rate
e.g., Intel Pentium 4: 64-byte blocks in L1 and 128-byte blocks in L2

multilevel exclusion: L1 data is never found in L2


A cache miss in L1 results in a swap of blocks between L1 and L2
Advantage: prevent wasting space in L2
e.g., AMD Athlon: 64 KB L1 and 256 KB L2
30

O2: Critical Word First and Early Restart


Don't wait for the full block to be loaded before restarting the CPU
  Critical word first: request the missed word first from memory and send it to the CPU as soon as it arrives; let the CPU continue execution while the rest of the words in the block are filled in. Also called wrapped fetch and requested word first
  Early restart: as soon as the requested word of the block arrives, send it to the CPU and let the CPU continue execution
Given spatial locality, the CPU tends to want the next sequential word, so it is not clear how much early restart helps
Generally useful only with large blocks

31

O3: Giving Priority to Read Misses over Writes


Serve reads before writes have been completed
Write through with write buffers
SW R3, 512(R0)    ; M[512] <- R3    (cache index 0)
LW R1, 1024(R0)   ; R1 <- M[1024]   (cache index 0)
LW R2, 512(R0)    ; R2 <- M[512]    (cache index 0)

Problem: write through with write buffers creates RAW conflicts between main memory reads on cache misses and the pending writes
  If we simply wait for the write buffer to empty, the read miss penalty may increase (by 50% on the old MIPS 1000)
  Instead, check the write buffer contents before the read; if there are no conflicts, let the memory access continue
Write back
  Suppose a read miss will replace a dirty block
  Normal: write the dirty block to memory, and then do the read
  Instead: copy the dirty block to a write buffer, do the read, and then do the write
  The CPU stalls less since it can restart as soon as the read is done
32

O4: Merging Write Buffer


If the write buffer is empty, the data and the full address are written into the buffer, and the write is finished from the CPU's perspective
Usually a write buffer entry holds multiple words
Write merging: the addresses in the write buffer are checked to see whether the address of the new data matches the address of a valid write buffer entry. If so, the new data are combined with that entry

(Figure: a write buffer with 4 entries, each holding four 64-bit words; on the left, without merging, four writes occupy four entries; on the right, the four writes are merged into a single entry.)
Writing multiple words at the same time is faster than writing them one at a time
33
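A C sketch of the merging check: before allocating a new buffer entry, the write is merged into an existing valid entry whose block it falls into (the 4-entry, four-64-bit-word layout follows the slide; the structure and names are ours):

#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define ENTRIES 4
#define WORDS   4          /* four 64-bit words = 32-byte block per entry */

struct wb_entry {
    bool     valid;
    uint64_t block_addr;           /* address of the 32-byte block */
    bool     word_valid[WORDS];
    uint64_t word[WORDS];
};
static struct wb_entry buf[ENTRIES];

/* Returns false if the buffer is full and the CPU must stall. */
bool write_buffer_put(uint64_t addr, uint64_t data) {
    uint64_t block = addr / 32, word = (addr / 8) % WORDS;
    for (int i = 0; i < ENTRIES; i++)                  /* try to merge first */
        if (buf[i].valid && buf[i].block_addr == block) {
            buf[i].word[word] = data;
            buf[i].word_valid[word] = true;
            return true;
        }
    for (int i = 0; i < ENTRIES; i++)                  /* otherwise take a free entry */
        if (!buf[i].valid) {
            memset(&buf[i], 0, sizeof buf[i]);
            buf[i].valid = true;
            buf[i].block_addr = block;
            buf[i].word[word] = data;
            buf[i].word_valid[word] = true;
            return true;
        }
    return false;                                      /* full: write stall */
}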

O5: Victim Caches

Idea of recycling: remember what was most recently discarded on a cache miss in case it is needed again, rather than simply discarding it or swapping it into L2
Victim cache: a small, fully associative cache between a cache and its refill path
  It contains only blocks that were discarded from the cache because of a miss (the "victims")
  It is checked on a miss before going to the next lower-level memory
Victim caches of 1 to 5 entries are effective at reducing misses, especially for small, direct-mapped data caches
  AMD Athlon: 8 entries
34

5.5 Reducing Miss Rate


3 Cs of Cache Miss
Compulsory: the first access to a block cannot be in the cache, so the block must be brought into the cache. Also called cold-start misses or first-reference misses.
  (Misses that occur even in an infinite cache)
Capacity: if the cache cannot contain all the blocks needed during execution of a program, capacity misses will occur due to blocks being discarded and later retrieved.
  (Misses in a fully associative cache of size X)
Conflict: if the block-placement strategy is set associative or direct mapped, conflict misses (in addition to compulsory and capacity misses) will occur because a block can be discarded and later retrieved if too many blocks map to its set. Also called collision misses or interference misses.
  (Misses in an N-way associative cache that would hit in a fully associative cache of size X)
35

3 Cs of Cache Miss: 3Cs Absolute Miss Rate (SPEC92)

2:1 Cache Rule:
  miss rate of a 1-way associative cache of size X
  = miss rate of a 2-way associative cache of size X/2

(Figure: miss rate per type vs. cache size in KB for 1-way, 2-way, 4-way, and 8-way associativity. Conflict misses shrink as associativity increases, capacity misses dominate, and compulsory misses are vanishingly small.)

36

3Cs Relative Miss Rate

(Figure: the same data as the previous chart, with each cache size normalized to 100% of its total miss rate, for 1-way through 8-way associativity; the conflict fraction shrinks with associativity while capacity and compulsory remain.)
Flaw: assumes a fixed block size
Good: insight => invention
37

Five Techniques to Reduce Miss Rate

1. Larger block size
2. Larger caches
3. Higher associativity
4. Way prediction and pseudoassociative caches
5. Compiler optimizations

38

O1: Larger Block Size


(Figure: miss rate vs. block size (16 to 256 bytes) for cache sizes of 1K, 4K, 16K, 64K, and 256K bytes.)

Take advantage of spatial locality:
  Using the principle of locality: the larger the block, the greater the chance that parts of it will be used again
  Larger blocks increase the miss penalty, and the number of blocks is reduced for a cache of the same size
  It may increase conflict misses, and even capacity misses if the cache is small
  Usually, high latency and high bandwidth encourage a large block size
39

O2: Larger Caches


(Figure: miss rate per type vs. cache size in KB for 1-way through 8-way associativity, as in Figures 5.14 and 5.15.)

Increasing the capacity of the cache reduces capacity misses (Figures 5.14 and 5.15)
Drawbacks: possibly longer hit time and higher cost
Trend: larger L2 or L3 off-chip caches

40

O3: Higher Associativity


Figures 5.14 and 5.15 show how miss rates improve with higher associativity
  8-way set associative is as effective as fully associative for practical purposes
2:1 Cache Rule:
  Miss rate of a direct-mapped cache of size N = miss rate of a 2-way cache of size N/2
Tradeoff: a more associative cache complicates the circuitry
  It may lengthen the clock cycle
Beware: execution time is the only final measure!
  Will the clock cycle time increase as a result of having a more complicated cache?
  Hill [1988] suggested the hit time for 2-way vs. 1-way is about +10% for an external cache and +2% for an internal cache
41

O4: Way Prediction & Pseudoassociative Caches


Way prediction: extra bits are kept in the cache to predict the way, or block within the set, of the next cache access
  Example: the 2-way I-cache of the Alpha 21264
    If the predictor is correct, the I-cache latency is 1 clock cycle
    If incorrect, it tries the other block, changes the way predictor, and has a latency of 3 clock cycles
    Accuracy in excess of 85%
  Reduces conflict misses while maintaining the hit speed of a direct-mapped cache
Pseudoassociative or column associative caches
  On a miss, a second cache entry is checked before going to the next lower level
    One fast hit and one slow hit
  Invert the most significant bit of the index to find the other block in the "pseudoset"
  The miss penalty may become slightly longer

42

O5: Compiler Optimizations


Improve the hit rate by compile-time optimization
  Reordering instructions with profiling information (McFarling [1989])
    Reduced misses by 50% for a 2KB direct-mapped I-cache with 4-byte blocks, and by 75% for an 8KB cache
    Got the best performance when it was possible to prevent some instructions from entering the cache
  Aligning basic blocks: the entry point is placed at the beginning of a cache block
    Decreases the chance of a cache miss for sequential code
  Loop interchange: exchanging the nesting of loops
    Improves spatial locality => reduces misses
    Accesses data in the order in which they are stored
    => maximizes use of the data in a cache block before it is discarded
/* Before: skips through memory in strides of 100 words */
for (j = 0; j < 100; j = j + 1)
    for (i = 0; i < 5000; i = i + 1)
        x[i][j] = 2 * x[i][j];

/* After: accesses all words in a cache block before moving on */
for (i = 0; i < 5000; i = i + 1)
    for (j = 0; j < 100; j = j + 1)
        x[i][j] = 2 * x[i][j];


43

Blocking: operating on submatrices or blocks
  Maximize accesses to the data loaded into the cache before they are replaced
  Improves temporal locality

/* Before: X = Y*Z */
for (i = 0; i < N; i = i + 1)
    for (j = 0; j < N; j = j + 1) {
        r = 0;
        for (k = 0; k < N; k = k + 1)
            r = r + y[i][k] * z[k][j];
        x[i][j] = r;
    }

/* After: B = blocking factor */
for (jj = 0; jj < N; jj = jj + B)
    for (kk = 0; kk < N; kk = kk + B)
        for (i = 0; i < N; i = i + 1)
            for (j = jj; j < min(jj + B, N); j = j + 1) {
                r = 0;
                for (k = kk; k < min(kk + B, N); k = k + 1)
                    r = r + y[i][k] * z[k][j];
                x[i][j] = x[i][j] + r;
            }

The number of capacity misses depends on N and the cache size
The total number of memory words accessed by the blocked version is 2N^3/B + N^2
  y benefits from spatial locality; z benefits from temporal locality
44

5.6 Reducing Cache Miss Penalty or Miss Rate via Parallelism

Three techniques that overlap the execution of instructions with memory accesses:
1. Nonblocking caches to reduce stalls on cache misses
   (to match out-of-order processors)
2. Hardware prefetching of instructions and data
3. Compiler-controlled prefetching

45

O1: Nonblocking cache to reduce stalls on cache miss


For pipelined computers that allow out-of-order completion, the CPU need not stall on a cache miss
  With separate I-cache and D-cache, the CPU can continue fetching instructions from the I-cache while waiting for the D-cache to return the missing data
Nonblocking cache (lockup-free cache)
  Hit under miss: the D-cache continues to supply cache hits during a miss
  Hit under multiple miss or miss under miss: overlap multiple misses

(Figure: ratio of the average memory stall time of a blocking cache to that of hit-under-miss schemes. For the first 14 (FP) programs the averages are 76% for hit-under-1-miss, 51% for 2 misses, and 39% for 64 misses; for the final 4 (integer) programs the averages are 81%, 78%, and 78%.)
46

O2: Hardware Prefetching of Instructions and Data


Prefetch instructions or data before requested by the CPU
either directly into the caches or into an external buffer (faster than
accessing main memory)

Instruction prefetch: frequently done in hardware outside cache


Fetch two blocks on a miss
the requested block is placed in I-cache when it returns
the prefetched block is placed in instruction stream buffer (ISB)
A single ISB would catch 15% to 25% of the misses from a 4KB direct-mapped I-cache with 16-byte blocks; 4 stream buffers increased the data hit rate to 43% (Jouppi [1990])

UltraSPARC III: data prefetch


If a load hits in the prefetch cache
the block is read from the prefetch cache
the next prefetch request is issued: calculating the stride of the next
prefetched block using the difference between the current address and the
previous address

Up to 8 simultaneous prefetches
It may interfere with demand misses, resulting in lower performance
47

O3: Compiler-Controlled Prefetching


Compiler-controlled prefetching
Register prefetch: load the value into a register
Cache prefetch: load data only into the cache (not register)

Faulting vs. nonfaulting: the address does or does not cause an


exception for virtual address faults and protection violations
normal load instruction = faulting register prefetch instruction

Most effective prefetch: semantically invisible to a program


doesn't change the contents of registers and memory, and
cannot cause virtual memory faults

nonbinding prefetch: nonfaulting cache prefetch


Overlapping execution: CPU proceeds while the prefetched data are being
fetched
Advantage: The compiler may avoid unnecessary prefetches in hardware
Drawback: prefetch instructions incur instruction overhead

48
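A small illustration of compiler-inserted prefetching in C. __builtin_prefetch is a GCC/Clang builtin that issues a nonfaulting, nonbinding cache prefetch; the loop and the 16-element lookahead distance are made-up tuning choices, not recommendations from the text:

/* Prefetch data a fixed distance ahead so it arrives while earlier iterations execute. */
double sum_with_prefetch(const double *a, int n) {
    double sum = 0.0;
    for (int i = 0; i < n; i++) {
        if (i + 16 < n)
            __builtin_prefetch(&a[i + 16], 0 /* read */, 1 /* low temporal locality */);
        sum += a[i];   /* the prefetched block is filling the cache in the background */
    }
    return sum;
}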

5.7 Reducing Hit Time


Importance of cache hit time
Average Memory Access Time
= Hit Time + Miss Rate * Miss Penalty
More importantly, cache access time limits the clock cycle rate in
many processors today!

Fast hit time:


Quickly and efficiently find out if data is in the cache, and
if it is, get that data out of the cache

Four techniques:
1.Small and simple caches
2.Avoiding address translation during indexing of the cache
3.Pipelined cache access
4.Trace caches
49

O1: Small and Simple Caches


A time-consuming portion of a cache hit is using the index portion of the
address to read the tag memory and then compare it to the address

Guideline: smaller hardware is faster


Why does the Alpha 21164 have 8KB instruction and 8KB data caches plus a 96KB second-level cache?
  A small cache allows a fast clock rate

Guideline: simpler hardware is faster


Direct Mapped, on chip

General design:
small and simple cache for 1st-level cache
Keeping the tags on chip and the data off chip for 2nd-level caches
The emphasis recently is on fast clock time while hiding L1 misses with
dynamic execution and using L2 caches to avoid going to memory

50

O2: Avoiding address translation during cache indexing


Two tasks: indexing the cache and comparing addresses
virtually vs. physically addressed cache
virtual cache: use virtual address (VA) for the cache
physical cache: use physical address (PA) after translating virtual address

Challenges to virtual cache


1. Protection: page-level protection (RW/RO/Invalid) must be checked
   It is normally checked as part of the virtual-to-physical address translation
   Solution: an additional field copies the protection information from the TLB and is checked on every access to the cache
2. Context switching: the same VA in different processes refers to different PAs, requiring the cache to be flushed
   Solution: increase the width of the cache address tag with a process-identifier tag (PID)
3. Synonyms or aliases: two different VAs for the same PA
   Inconsistency problem: two copies of the same data in a virtual cache
   Hardware antialiasing solution: guarantee every cache block a unique PA
     Alpha 21264: check all possible locations; if one is found, it is invalidated
   Software page-coloring solution: force aliases to share some address bits
     Sun's Solaris: all aliases must be identical in the last 18 bits => no duplicate physical addresses in caches of 256 KB or less
4. I/O: typically uses PAs, so it needs to interact with the cache (see Section 5.12)
51

Virtually indexed, physically tagged cache


(Figure: three organizations.
  Conventional organization: CPU -> VA -> TB (translation buffer) -> PA -> physically tagged cache -> MEM.
  Virtually addressed cache: CPU -> VA -> virtually tagged cache; translation is done only on a miss; suffers from the synonym problem.
  Virtually indexed, physically tagged: the cache is indexed with the VA in parallel with the TB translation and tagged with the PA, overlapping cache access with address translation; this requires the cache index to remain invariant across translation. An L2 cache is then accessed with the PA.)
52

O3: Pipelined Cache Access


Simply pipeline the cache access
  A 1st-level cache hit then takes multiple clock cycles
Tradeoff: fast clock cycle time but slow hits
  Example: clock cycles to access instructions from the I-cache
    Pentium: 1 clock cycle
    Pentium Pro through Pentium III: 2 clocks
    Pentium 4: 4 clocks
Drawback: increasing the number of pipeline stages leads to
  a greater penalty on mispredicted branches and
  more clock cycles between the issue of a load and the use of the data
Note that pipelining increases the bandwidth of instructions rather than decreasing the actual latency of a cache hit
53

O4: Trace Caches


A trace cache for instructions finds a dynamic sequence of instructions, including taken branches, to load into a cache block
  The cache blocks contain
    dynamic traces of the executed instructions as determined by the CPU
    rather than static sequences of instructions as determined by memory layout
  Branch prediction is folded into the cache: predictions are validated along with the addresses to ensure a valid fetch
  E.g., the Intel NetBurst microarchitecture
Advantage: better utilization of the cache block
  Trace caches store instructions only from the branch entry point to the exit of the trace
  In a conventional I-cache, the unused part of a long block entered or exited by a taken branch may be fetched but never used
Downside: the same instructions may be stored multiple times


54

Cache Optimization Summary
(Figure: table summarizing the optimizations of Section 5.4 (miss penalty), Section 5.5 (miss rate), Section 5.6 (miss penalty/miss rate via parallelism), and Section 5.7 (hit time).)

55

Summary
Chapter 5 Memory Hierarchy Design
5.1  Introduction
5.2  Review of the ABCs of Caches
5.3  Cache Performance
5.4  Reducing Cache Miss Penalty
5.5  Reducing Cache Miss Rate
5.6  Reducing Cache Miss Penalty/Miss Rate via Parallelism
5.7  Reducing Hit Time
5.8  Main Memory and Organizations for Improving Performance
5.9  Memory Technology
5.10 Virtual Memory
5.11 Protection and Examples of Virtual Memory
56
