
EEF011 Computer Architecture

Chapter 5
Memory Hierarchy Design

December 2004

Chapter 5 Memory Hierarchy Design


5.1 Introduction
5.2 Review of the ABCs of Caches
5.3 Cache Performance
5.4 Reducing Cache Miss Penalty
5.5 Reducing Cache Miss Rate
5.6 Reducing Cache Miss Penalty or Miss Rate via Parallelism
5.7 Reducing Hit Time
5.8 Main Memory and Organizations for Improving Performance
5.9 Memory Technology
5.10 Virtual Memory
5.11 Protection and Examples of Virtual Memory
2

5.1 Introduction
The five classic components of a computer: Input, Output, Memory, Datapath, and Control (Datapath and Control together form the Processor).

Where do we fetch instructions to execute?

Build a memory hierarchy that includes main memory and caches (internal memory) and the hard disk (external memory).
Instructions are first fetched from external storage such as the hard disk and kept in main memory. Before they go to the CPU, they are typically brought into the caches.
3

Technology Trends

                  Capacity           Speed (latency)
CPU (logic):      2x in 1.5 years    2x in 1.5 years
DRAM:             4x in 3 years      2x in 10 years
Disk:             4x in 3 years      2x in 10 years

DRAM generations:
Year     Size      Cycle Time
1980     64 Kb     250 ns
1983     256 Kb    220 ns
1986     1 Mb      190 ns
1989     4 Mb      165 ns
1992     16 Mb     145 ns
1995     64 Mb     120 ns
2000     256 Mb    100 ns

Over these generations DRAM capacity improved about 4000:1, while cycle time improved only about 2.5:1.
4

Performance Gap between CPUs and Memory


(Figure: CPU performance improves about 1.35X/yr, later 1.55X/yr, while memory (DRAM) improves only about 7%/yr.)
The gap (latency) grows about 50% per year!


5

Memory Hierarchy
Levels of the Memory Hierarchy

Level             Capacity     Access Time   Moved in units of
CPU Registers     500 bytes    0.25 ns       -
Cache             64 KB        1 ns          cache blocks
Main Memory       512 MB       100 ns        pages
Disk              100 GB       5 ms          files
I/O Devices       ???          ???

Upper levels are faster but smaller; lower levels have larger capacity but longer access times.
6

5.2 ABCs of Caches


Cache:
  In this textbook it mainly means the first level of the memory hierarchy encountered once the address leaves the CPU
  The term is also applied whenever buffering is employed to reuse commonly occurring items, e.g., file caches, name caches, and so on
Principle of Locality:
  Programs access a relatively small portion of the address space at any instant of time.
Two Different Types of Locality:
  Temporal Locality (Locality in Time): if an item is referenced, it will tend to be referenced again soon (e.g., loops, reuse)
  Spatial Locality (Locality in Space): if an item is referenced, items whose addresses are close by tend to be referenced soon (e.g., straight-line code, array access)

Memory Hierarchy: Terminology


Hit: data appears in some block in the cache (example: Block X)
  Hit Rate: the fraction of memory accesses found in the cache
  Hit Time: time to access the upper level, which consists of
    RAM access time + time to determine hit/miss
Miss: data must be retrieved from a block in main memory (Block Y)
  Miss Rate = 1 - (Hit Rate)
  Miss Penalty: time to replace a block in the cache
    + time to deliver the block to the processor
Hit Time << Miss Penalty (e.g., 1 clock cycle vs. 40 clock cycles)
(Figure: data moves between the processor and the cache (Blk X); on a miss, Blk Y is brought from main memory into the cache.)

Cache Measures
CPU execution time incorporating cache performance:
CPU execution time = (CPU clock cycles + Memory stall cycles)
* Clock cycle time
Memory stall cycles: number of cycles during which the CPU is stalled
waiting for a memory access
Memory stall clock cycles = Number of misses * miss penalty
= IC*(Misses/Instruction)*Miss penalty
= IC*(Memory accesses/Instruction)*Miss rate*Miss penalty
= IC * Reads per instruction * Read miss rate * Read miss penalty
+IC * Writes per instruction * Write miss rate * Write miss penalty
Memory accesses consist of fetching instructions and reading/writing data

P.395 Example
Example: Assume we have a computer where the CPI is 1.0 when all memory accesses hit the cache. The only data accesses are loads and stores, and these total 50% of the instructions. If the miss penalty is 25 clock cycles and the miss rate is 2%, how much faster would the computer be if all memory accesses hit in the cache?

Answer:
(A) If all accesses hit in the cache, CPI = 1.0 and there are no memory stalls, so
  CPU time(A) = (IC*CPI + 0)*Clock cycle time = IC*Clock cycle time
(B) With a 2% miss rate and CPI = 1.0, we must add the memory stalls:
  Memory stall cycles = IC*(Memory accesses/Instruction)*Miss rate*Miss penalty
    = IC*(1 + 50%)*2%*25 = IC*0.75
  so CPU time(B) = (IC + IC*0.75)*Clock cycle time = 1.75*IC*Clock cycle time
The performance ratio is the inverse of the CPU execution times:
  CPU time(B)/CPU time(A) = 1.75
The computer with no cache misses is 1.75 times faster.
10
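A minimal C sketch of this calculation, using the numbers from the example (variable names are ours; IC and the clock cycle time are normalized to 1):

#include <stdio.h>

/* P.395 example: CPU time per instruction = CPI + memory stalls per instruction. */
int main(void) {
    double cpi = 1.0;                              /* CPI when every access hits        */
    double mem_accesses_per_instr = 1.0 + 0.50;    /* 1 instruction fetch + 50% loads/stores */
    double miss_rate = 0.02;
    double miss_penalty = 25.0;

    double stalls_per_instr = mem_accesses_per_instr * miss_rate * miss_penalty;
    double cpu_time_all_hits = cpi;                        /* (A) no memory stalls   */
    double cpu_time_with_misses = cpi + stalls_per_instr;  /* (B) 1 + 0.75 = 1.75    */

    printf("stalls per instruction = %.2f\n", stalls_per_instr);
    printf("speedup of the all-hit machine = %.2f\n",
           cpu_time_with_misses / cpu_time_all_hits);      /* 1.75 */
    return 0;
}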

Four Memory Hierarchy Questions


Q1 (block placement):
Where can a block be placed in the upper level?
Q2 (block identification):
How is a block found if it is in the upper level?
Q3 (block replacement):
Which block should be replaced on a miss?
Q4 (write strategy):
What happens on a write?

11

Q1(block placement): Where can a block be placed?


Direct mapped: (Block number) MOD (Number of blocks in cache)
Set associative: (Block number) MOD (Number of sets in cache)
  (Number of sets) x (blocks per set) = (Number of blocks in cache)
  n-way: n blocks in a set
  1-way = direct mapped
Fully associative: # of sets = 1 (a block may go anywhere)

Example: where can block 12 be placed in an 8-block cache? (See the sketch below.)

12
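A small C sketch of where block 12 can land under the three placement policies for an 8-block cache (the 2-way organization and the frame numbering within a set are illustrative assumptions):

#include <stdio.h>

/* Where can memory block 12 be placed in an 8-block cache? */
int main(void) {
    int block = 12, cache_blocks = 8;

    /* Direct mapped: exactly one frame */
    printf("direct mapped: frame %d\n", block % cache_blocks);        /* 12 mod 8 = 4 */

    /* 2-way set associative: 4 sets of 2 frames each (frame numbering is illustrative) */
    int ways = 2, sets = cache_blocks / ways;
    int set = block % sets;                                           /* 12 mod 4 = 0 */
    printf("2-way set associative: frames %d..%d (set %d)\n",
           set * ways, set * ways + ways - 1, set);

    /* Fully associative: one set containing all frames */
    printf("fully associative: any of frames 0..%d\n", cache_blocks - 1);
    return 0;
}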

Simplest Cache: Direct Mapped (1-way)


(Figure: memory blocks 0 through F mapped onto a 4-block direct-mapped cache; memory block i goes to cache index i MOD 4.)

Each block has only one place it can appear in the cache. The mapping is usually
(Block address) MOD (Number of blocks in cache)
13

Example: 1 KB Direct Mapped Cache, 32B Blocks


For a 2N byte cache:
The uppermost (32 - N) bits are always the Cache Tag
The lowest M bits are the Byte Select (Block Size = 2M)

Example: 0x50

Stored as part
of the cache state
Valid Bit

Cache Tag
0x50

Cache Data
Byte 31
Byte 63

4
0
Byte Select
Ex: 0x00

Byte 1 Byte 0 0
Byte 33 Byte 32 1
2
3

:
Byte 1023

Cache Tag

9
Cache Index
Ex: 0x01

: :

31

Byte 992 31
14
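A C sketch of how an address is decomposed for this 1 KB direct-mapped cache with 32-byte blocks (field widths follow the slide; the function name and the sample address are ours):

#include <stdint.h>
#include <stdio.h>

/* 1 KB direct-mapped cache, 32-byte blocks: 22-bit tag, 5-bit index, 5-bit byte select. */
static void split_address(uint32_t addr) {
    uint32_t byte_select = addr & 0x1F;          /* bits  4..0  */
    uint32_t index       = (addr >> 5) & 0x1F;   /* bits  9..5  */
    uint32_t tag         = addr >> 10;           /* bits 31..10 */
    printf("addr 0x%08x -> tag 0x%x, index 0x%x, byte 0x%x\n",
           addr, tag, index, byte_select);
}

int main(void) {
    split_address(0x00014020u);   /* tag 0x50, index 0x01, byte select 0x00 */
    return 0;
}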

Q2 (block identification): How is a block found?


Three portions of an address in a set-associative or direct-mapped cache:
  | Tag | Cache/Set Index | Block Offset |
  The Tag and the Index together form the Block Address; the Block Offset addresses within the block (block size).
The Block Offset selects the desired data from the block, the Index field selects the set, and the Tag field is compared against the CPU address for a hit:
  Use the Cache Index to select the cache set
  Check the Tag on each block in that set (there is no need to check the index or block offset)
  A valid bit is added to the tag to indicate whether or not this entry contains a valid address
  Select the desired bytes using the Block Offset
Increasing associativity shrinks the index and expands the tag
15

Example: Two-way set associative cache


Cache Index selects a set from the cache
The two tags in the set are compared in parallel
Data is selected based on the tag result
(Figure: the Cache Index (e.g. 0x01) selects one set; the two Cache Tags in that set, qualified by their Valid bits, are compared in parallel against the address tag (e.g. 0x50). The compare results are ORed to produce Hit and drive the Sel0/Sel1 inputs of a multiplexer that selects the data of the matching Cache Block; the Byte Select (e.g. 0x00) then picks the byte within the block.)

16
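A C sketch of the two-way set-associative lookup shown above; the structure layout, sizes, and names are illustrative, not a model of any particular processor:

#include <stdbool.h>
#include <stdint.h>

/* Two-way set-associative lookup: the index selects a set, both tags are compared,
   and the matching way supplies the data. */
#define NUM_SETS    32
#define BLOCK_BYTES 32

struct way { bool valid; uint32_t tag; uint8_t data[BLOCK_BYTES]; };
struct set { struct way way[2]; };
static struct set cache[NUM_SETS];

/* Returns true on a hit and sets *byte to the requested data. */
bool lookup(uint32_t addr, uint8_t *byte) {
    uint32_t offset = addr & (BLOCK_BYTES - 1);
    uint32_t index  = (addr / BLOCK_BYTES) % NUM_SETS;
    uint32_t tag    = addr / (BLOCK_BYTES * NUM_SETS);

    struct set *s = &cache[index];
    for (int w = 0; w < 2; w++) {                 /* the two compares happen in parallel in hardware */
        if (s->way[w].valid && s->way[w].tag == tag) {
            *byte = s->way[w].data[offset];       /* the mux selects the hitting way */
            return true;                          /* OR of the compare results = Hit */
        }
    }
    return false;                                 /* miss: go to the next level */
}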

Disadvantage of Set Associative Cache


N-way Set Associative Cache vs. Direct Mapped Cache:
N comparators vs. 1
Extra MUX delay for the data
Data comes AFTER Hit/Miss
In a direct mapped cache, Cache Block is available BEFORE Hit/Miss:
Possible to assume a hit and continue. Recover later if miss.

(Figure: the same two-way set-associative organization as on the previous slide; the extra tag compare and the Sel0/Sel1 output multiplexer sit in the data path, so the data arrive only after the hit/miss decision.)
17

Q3 (block replacement): Which block should be replaced on a cache miss?

Easy for direct mapped: hardware decisions are simplified
  Only one block frame is checked, and only that block can be replaced
Set associative or fully associative:
  There are many blocks to choose from on a miss
Three primary strategies for selecting the block to be replaced:
  Random: a candidate block is selected at random
  LRU: the Least Recently Used block is replaced
  FIFO (first in, first out): the oldest block is replaced

Data cache misses per 1000 instructions for various replacement strategies:

               2-way                  4-way                  8-way
Size        LRU    Random  FIFO    LRU    Random  FIFO    LRU    Random  FIFO
16 KB       114.1  117.3   115.5   111.7  115.1   113.3   109.0  111.8   110.4
64 KB       103.4  104.3   103.9   102.4  102.3   103.1   99.7   100.5   100.3
256 KB      92.2   92.1    92.5    92.1   92.1    92.5    92.1   92.1    92.5

There is little difference between LRU and random for the largest cache, with LRU outperforming the others for the smaller caches. FIFO generally outperforms random for the smaller cache sizes.

18
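A minimal C sketch of LRU bookkeeping within one set, using per-way timestamps (the 4-way organization, counter scheme, and names are our own illustration; real hardware typically uses cheaper approximations):

#include <stdint.h>

/* On every access the touched way records the current "time"; on a miss the way
   with the oldest timestamp is the victim. */
#define WAYS 4

struct set_lru {
    uint64_t last_used[WAYS];   /* timestamp of the most recent access per way */
    uint64_t now;               /* per-set access counter                      */
};

void lru_touch(struct set_lru *s, int way) {
    s->last_used[way] = ++s->now;       /* mark this way most recently used */
}

int lru_victim(const struct set_lru *s) {
    int victim = 0;
    for (int w = 1; w < WAYS; w++)      /* pick the least recently used way */
        if (s->last_used[w] < s->last_used[victim])
            victim = w;
    return victim;
}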

Q4(write strategy): What happens on a write?


Reads dominate processor cache accesses.
  E.g., writes are about 7% of the overall memory traffic and about 21% of data cache accesses
Two options we can adopt when writing to the cache:
  Write through: the information is written to both the block in the cache and the block in the lower-level memory.
  Write back: the information is written only to the block in the cache. The modified cache block is written to main memory only when it is replaced.
    To reduce the frequency of writing back blocks on replacement, a dirty bit is used to indicate whether the block was modified in the cache (dirty) or not (clean). If the block is clean, it is not written back, since the lower level already holds identical information.
Pros and cons:
  WT: simple to implement. The cache is always clean, so read misses never cause writes to the lower level.
  WB: writes occur at the speed of the cache, and multiple writes within a block require only one write to the lower-level memory.

19

Write Stall and Write Buffer


When the CPU must wait for writes to complete during write through, the CPU is said to have a write stall.
  A common optimization to reduce write stalls is a write buffer, which allows the processor to continue as soon as the data are written to the buffer, thereby overlapping processor execution with memory updating.

(Figure: Processor -> Cache -> DRAM, with a Write Buffer between the cache and DRAM. A write buffer is needed between the cache and memory.)


Processor: writes data into the cache and the write buffer
Memory controller: writes the contents of the buffer to memory
The write buffer is just a FIFO:
  Typical number of entries: 4
20

Write-Miss Policy: Write Allocate vs. Not Allocate


Two options on a write miss:
  Write allocate: the block is allocated on a write miss, followed by the write hit actions above
    Write misses act like read misses
  No-write allocate: write misses do not affect the cache; the block is modified only in the lower-level memory
    Blocks stay out of the cache in no-write allocate until the program tries to read them, but with write allocate even blocks that are only written will still be in the cache

21

Write-Miss Policy Example


Example: Assume a fully associative write-back cache with many cache entries that starts empty. Below is a sequence of five memory operations:
  Write Mem[100];
  Write Mem[100];
  Read Mem[200];
  Write Mem[200];
  Write Mem[100].
What are the numbers of hits and misses (including both reads and writes) when using no-write allocate versus write allocate?

Answer:
No-write allocate:
  Write Mem[100];  1 write miss
  Write Mem[100];  1 write miss
  Read Mem[200];   1 read miss
  Write Mem[200];  1 write hit
  Write Mem[100];  1 write miss
  => 4 misses; 1 hit

Write allocate:
  Write Mem[100];  1 write miss
  Write Mem[100];  1 write hit
  Read Mem[200];   1 read miss
  Write Mem[200];  1 write hit
  Write Mem[100];  1 write hit
  => 2 misses; 3 hits
22
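A small C sketch that replays this five-operation trace against an initially empty, fully associative cache to count hits and misses under both write-miss policies (the tiny fixed-size table and the names are illustrative):

#include <stdbool.h>
#include <stdio.h>

enum { READ, WRITE };
static int  entries[16], count;
static bool present(int addr) {
    for (int i = 0; i < count; i++) if (entries[i] == addr) return true;
    return false;
}
static void allocate(int addr) { if (!present(addr)) entries[count++] = addr; }

static void run(bool write_allocate) {
    int ops[5][2] = {{WRITE,100},{WRITE,100},{READ,200},{WRITE,200},{WRITE,100}};
    int hits = 0, misses = 0;
    count = 0;                                   /* the cache starts empty */
    for (int i = 0; i < 5; i++) {
        int type = ops[i][0], addr = ops[i][1];
        if (present(addr)) { hits++; continue; } /* read or write hit */
        misses++;
        if (type == READ || write_allocate)      /* reads always allocate;      */
            allocate(addr);                      /* writes only if write allocate */
    }
    printf("%-18s %d misses, %d hits\n",
           write_allocate ? "write allocate:" : "no-write allocate:", misses, hits);
}

int main(void) {
    run(false);   /* 4 misses, 1 hit  */
    run(true);    /* 2 misses, 3 hits */
    return 0;
}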

5.3 Cache Performance


Example: Split Cache vs. Unified Cache
Which has the better avg. memory access time?
A 16-KB instruction cache with a 16-KB data cache (split cache), or
A 32-KB unified cache?
Miss rates:
Size      Instruction Cache   Data Cache   Unified Cache
16 KB     0.4%                11.4%        -
32 KB     -                   -            3.18%
Assume
A hit takes 1 clock cycle and the miss penalty is 100 cycles
A load or store takes 1 extra clock cycle on a unified cache since
there is only one cache port
36% of the instructions are data transfer instructions.
About 74% of the memory accesses are instruction references

Answer:
Average memory access time (split)
  = % instructions x (Hit time + Instruction miss rate x Miss penalty)
    + % data x (Hit time + Data miss rate x Miss penalty)
  = 74% x (1 + 0.4% x 100) + 26% x (1 + 11.4% x 100) = 4.24
Average memory access time (unified)
  = 74% x (1 + 3.18% x 100) + 26% x (1 + 1 + 3.18% x 100) = 4.44

23
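A C sketch of the two average-memory-access-time calculations above (the percentages and penalties are the ones given in the example; with these rounded miss rates the split result comes out near 4.26, vs. 4.24 on the slide):

#include <stdio.h>

int main(void) {
    double instr_frac = 0.74, data_frac = 0.26;
    double hit_time = 1.0, miss_penalty = 100.0;

    /* Split: 16 KB I-cache (0.4% misses) + 16 KB D-cache (11.4% misses) */
    double amat_split = instr_frac * (hit_time + 0.004 * miss_penalty)
                      + data_frac  * (hit_time + 0.114 * miss_penalty);

    /* Unified 32 KB cache (3.18% misses); data accesses pay 1 extra cycle
       because the single port is shared with instruction fetches. */
    double amat_unified = instr_frac * (hit_time + 0.0318 * miss_penalty)
                        + data_frac  * (hit_time + 1.0 + 0.0318 * miss_penalty);

    printf("split:   %.2f cycles\n", amat_split);    /* about 4.26 with these rounded rates; slide: 4.24 */
    printf("unified: %.2f cycles\n", amat_unified);  /* about 4.44 */
    return 0;
}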

Impact of Memory Access on CPU Performance


Example: Suppose a processor:
Ideal CPI = 1.0 (ignoring memory stalls)
Avg. miss rate is 2%
Avg. memory references per instruction is 1.5
Miss penalty is 100 cycles
What is the impact on performance when the behavior of the cache is included?
Answer:
CPI = CPU execution cycles per instr. + Memory stall cycles per instr.
= CPI execution + Miss rate x Memory accesses per instr. x Miss penalty
CPI with cache = 1.0 + 2% x 1.5 x 100 = 4
CPI without cache = 1.0 + 1.5 x 100 = 151
CPU time with cache = IC x CPI x Clock cycle time = IC x 4.0 x Clock cycle time
CPU time without cache = IC x 151 x Clock cycle time
Without the cache, the CPI of the processor increases from 1 to 151!
With the cache, 75% of the time the processor is stalled waiting for memory (CPI goes from 1 to 4).
24

Impact of Cache Organizations on CPU Performance


Example: What is the impact of two different cache organizations (direct
mapped vs. 2-way set associative) on the performance of a CPU?
Ideal CPI = 2.0 (ignoring memory stalls)
Clock cycle time is 1.0 ns
Avg. memory references per instruction is 1.5
Cache size: 64 KB, block size: 64 bytes
For set-associative, assume the clock cycle time is stretched 1.25 times to
accommodate the selection multiplexer
Cache miss penalty is 75 ns
Hit time is 1 clock cycle
Miss rate: direct mapped 1.4%; 2-way set-associative 1.0%.
Answer:
Avg. memory access time (1-way) = 1.0 + (0.014 x 75) = 2.05 ns
Avg. memory access time (2-way) = 1.0 x 1.25 + (0.01 x 75) = 2.00 ns
CPU time (1-way) = IC x (CPI execution + Miss rate x Memory accesses per instruction x Miss penalty) x Clock cycle time
  = IC x (2.0 x 1.0 + (1.5 x 0.014 x 75)) = 3.58 IC
CPU time (2-way) = IC x (2.0 x 1.0 x 1.25 + (1.5 x 0.01 x 75)) = 3.63 IC
25

Summary of Performance Equations

26

Improving Cache Performance


The next few sections in the textbook look at ways to improve cache and memory access times.

Average Memory Access Time = Hit Time + Miss Rate * Miss Penalty

  (Hit Time: Section 5.7, Miss Rate: Section 5.5, Miss Penalty: Section 5.4)

CPU Time = IC * (CPI Execution + Memory Accesses/Instruction * Miss Rate * Miss Penalty) * Clock Cycle Time

27

5.4 Reducing Cache Miss Penalty


Time to handle a miss is becoming more and more the
controlling factor. This is because of the great improvement in
speed of processors as compared to the speed of memory.
Average Memory Access Time
= Hit Time + Miss Rate * Miss Penalty
Five optimizations
1. Multilevel caches
2. Critical word first and early restart
3. Giving priority to read misses over writes
4. Merging write buffer
5. Victim caches

28

O1: Multilevel Caches


Approaches
Make the cache faster to keep pace with the speed of CPUs
Make the cache larger to overcome the widening gap
L1: fast hits, L2: fewer misses
L2 Equations
Average Memory Access Time = Hit Time(L1) + Miss Rate(L1) x Miss Penalty(L1)
Miss Penalty(L1) = Hit Time(L2) + Miss Rate(L2) x Miss Penalty(L2)
Average Memory Access Time = Hit Time(L1) + Miss Rate(L1) x (Hit Time(L2) + Miss Rate(L2) x Miss Penalty(L2))
Hit Time(L1) << Hit Time(L2) << Hit Time(Mem)
Miss Rate(L1) < Miss Rate(L2)
Definitions:
  Local miss rate: misses in this cache divided by the total number of memory accesses to this cache (Miss rate(L1), Miss rate(L2))
    The L1 cache skims the cream of the memory accesses
  Global miss rate: misses in this cache divided by the total number of memory accesses generated by the CPU (Miss rate(L1); Miss rate(L1) x Miss rate(L2))
    Indicates what fraction of the memory accesses that leave the CPU go all the way to memory
29
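A C sketch that plugs made-up numbers into the L2 equations above, and also computes the global L2 miss rate from the local miss rates (all values are purely illustrative):

#include <stdio.h>

int main(void) {
    double hit_l1 = 1.0,  miss_rate_l1 = 0.04;   /* local = global for L1     */
    double hit_l2 = 10.0, miss_rate_l2 = 0.50;   /* local miss rate of L2     */
    double miss_penalty_l2 = 100.0;

    double miss_penalty_l1 = hit_l2 + miss_rate_l2 * miss_penalty_l2;
    double amat = hit_l1 + miss_rate_l1 * miss_penalty_l1;

    /* Fraction of the CPU's memory accesses that go all the way to memory. */
    double global_miss_l2 = miss_rate_l1 * miss_rate_l2;

    printf("AMAT = %.2f cycles, global L2 miss rate = %.2f%%\n",
           amat, 100.0 * global_miss_l2);
    return 0;
}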

Design of L2 Cache
Size
Since everything in L1 cache is likely to be in L2 cache, L2 cache
should be much bigger than L1

Whether data in L1 is in L2
novice approach: design L1 and L2 independently
multilevel inclusion: L1 data are always present in L2
Advantage: easy for consistency between I/O and cache (checking L2 only)
Drawback: L2 must invalidate all L1 blocks that map onto the 2nd-level
block to be replaced => slightly higher 1st-level miss rate
e.g., Intel Pentium 4: 64-byte blocks in L1 and 128-byte blocks in L2

multilevel exclusion: L1 data is never found in L2


A cache miss in L1 results in a swap of blocks between L1 and L2
Advantage: prevent wasting space in L2
e.g., AMD Athlon: 64 KB L1 and 256 KB L2
30

O2: Critical Word First and Early Restart


Don't wait for the full block to be loaded before restarting the CPU
  Critical word first: request the missed word first from memory and send it to the CPU as soon as it arrives; let the CPU continue execution while the rest of the words in the block are filled in. Also called wrapped fetch and requested word first
  Early restart: as soon as the requested word of the block arrives, send it to the CPU and let the CPU continue execution
Given spatial locality, the CPU tends to want the next sequential word, so it is not clear how much early restart helps
Generally useful only with large blocks

31

O3: Giving Priority to Read Misses over Writes


Serve reads before writes have been completed
Write through with write buffers
SW R3, 512(R0)    ; M[512] <- R3    (cache index 0)
LW R1, 1024(R0)   ; R1 <- M[1024]   (cache index 0)
LW R2, 512(R0)    ; R2 <- M[512]    (cache index 0)

Problem: write through with write buffers creates RAW conflicts between main memory reads on cache misses and the pending writes
  If we simply wait for the write buffer to empty, the read miss penalty may increase (by 50% on the old MIPS 1000)
  Instead, check the write buffer contents before the read; if there are no conflicts, let the memory access continue
Write back
  Suppose a read miss will replace a dirty block
  Normal: write the dirty block to memory, and then do the read
  Instead: copy the dirty block to a write buffer, do the read, and then do the write
  The CPU stalls less since it can restart as soon as the read is done
32

O4: Merging Write Buffer


If the write buffer is empty, the data and the full address are written into the buffer, and the write is finished from the CPU's perspective
Usually a write buffer entry holds multiple words
Write merging: the addresses in the write buffer are checked to see whether the address of the new data matches the address of a valid write buffer entry. If so, the new data are combined with that entry

(Figure: a write buffer with 4 entries, each holding four 64-bit words; on the left, without merging, four writes occupy four entries; on the right, the four writes are merged into a single entry.)
Writing multiple words at the same time is faster than writing them one at a time
33
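A C sketch of the merging check: before allocating a new buffer entry, the write is merged into an existing valid entry whose block it falls into (the 4-entry, four-64-bit-word layout follows the slide; the structure and names are ours):

#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define ENTRIES 4
#define WORDS   4          /* four 64-bit words = 32-byte block per entry */

struct wb_entry {
    bool     valid;
    uint64_t block_addr;           /* address of the 32-byte block */
    bool     word_valid[WORDS];
    uint64_t word[WORDS];
};
static struct wb_entry buf[ENTRIES];

/* Returns false if the buffer is full and the CPU must stall. */
bool write_buffer_put(uint64_t addr, uint64_t data) {
    uint64_t block = addr / 32, word = (addr / 8) % WORDS;
    for (int i = 0; i < ENTRIES; i++)                  /* try to merge first */
        if (buf[i].valid && buf[i].block_addr == block) {
            buf[i].word[word] = data;
            buf[i].word_valid[word] = true;
            return true;
        }
    for (int i = 0; i < ENTRIES; i++)                  /* otherwise take a free entry */
        if (!buf[i].valid) {
            memset(&buf[i], 0, sizeof buf[i]);
            buf[i].valid = true;
            buf[i].block_addr = block;
            buf[i].word[word] = data;
            buf[i].word_valid[word] = true;
            return true;
        }
    return false;                                      /* full: write stall */
}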

O5: Victim Caches

Idea of recycling: remember what was most recently discarded on a cache miss in case it is needed again, rather than simply discarding it or swapping it into L2
Victim cache: a small, fully associative cache between a cache and its refill path
  It contains only blocks that were discarded from the cache because of a miss (the "victims")
  It is checked on a miss before going to the next lower-level memory
Victim caches of 1 to 5 entries are effective at reducing misses, especially for small, direct-mapped data caches
  AMD Athlon: 8 entries
34

5.5 Reducing Miss Rate


3 Cs of Cache Miss
Compulsory: the first access to a block cannot be in the cache, so the block must be brought into the cache. Also called cold-start misses or first-reference misses.
  (Misses that occur even in an infinite cache)
Capacity: if the cache cannot contain all the blocks needed during execution of a program, capacity misses will occur due to blocks being discarded and later retrieved.
  (Misses in a fully associative cache of size X)
Conflict: if the block-placement strategy is set associative or direct mapped, conflict misses (in addition to compulsory and capacity misses) will occur because a block can be discarded and later retrieved if too many blocks map to its set. Also called collision misses or interference misses.
  (Misses in an N-way associative cache that would hit in a fully associative cache of size X)
35

3 Cs of Cache Miss: 3Cs Absolute Miss Rate (SPEC92)

2:1 Cache Rule:
  miss rate of a 1-way associative cache of size X
  = miss rate of a 2-way associative cache of size X/2

(Figure: miss rate per type vs. cache size in KB for 1-way, 2-way, 4-way, and 8-way associativity. Conflict misses shrink as associativity increases, capacity misses dominate, and compulsory misses are vanishingly small.)

36

3Cs Relative Miss Rate

(Figure: the same data as the previous chart, with each cache size normalized to 100% of its total miss rate, for 1-way through 8-way associativity; the conflict fraction shrinks with associativity while capacity and compulsory remain.)
Flaw: assumes a fixed block size
Good: insight => invention
37

Five Techniques to Reduce Miss Rate

1. Larger block size
2. Larger caches
3. Higher associativity
4. Way prediction and pseudoassociative caches
5. Compiler optimizations

38

O1: Larger Block Size


(Figure: miss rate vs. block size (16 to 256 bytes) for cache sizes of 1K, 4K, 16K, 64K, and 256K bytes.)

Take advantage of spatial locality:
  Using the principle of locality: the larger the block, the greater the chance that parts of it will be used again
  Larger blocks increase the miss penalty, and the number of blocks is reduced for a cache of the same size
  It may increase conflict misses, and even capacity misses if the cache is small
  Usually, high latency and high bandwidth encourage a large block size
39

O2: Larger Caches


(Figure: miss rate per type vs. cache size in KB for 1-way through 8-way associativity, as in Figures 5.14 and 5.15.)

Increasing the capacity of the cache reduces capacity misses (Figures 5.14 and 5.15)
Drawbacks: possibly longer hit time and higher cost
Trend: larger L2 or L3 off-chip caches

40

O3: Higher Associativity


Figures 5.14 and 5.15 show how miss rates improve with higher associativity
  8-way set associative is as effective as fully associative for practical purposes
2:1 Cache Rule:
  Miss rate of a direct-mapped cache of size N = miss rate of a 2-way cache of size N/2
Tradeoff: a more associative cache complicates the circuitry
  It may lengthen the clock cycle
Beware: execution time is the only final measure!
  Will the clock cycle time increase as a result of having a more complicated cache?
  Hill [1988] suggested the hit time for 2-way vs. 1-way is about +10% for an external cache and +2% for an internal cache
41

O4: Way Prediction & Pseudoassociative Caches


Way prediction: extra bits are kept in the cache to predict the way, or block within the set, of the next cache access
  Example: the 2-way I-cache of the Alpha 21264
    If the predictor is correct, the I-cache latency is 1 clock cycle
    If incorrect, it tries the other block, changes the way predictor, and has a latency of 3 clock cycles
    Accuracy in excess of 85%
  Reduces conflict misses while maintaining the hit speed of a direct-mapped cache
Pseudoassociative or column associative caches
  On a miss, a second cache entry is checked before going to the next lower level
    One fast hit and one slow hit
  Invert the most significant bit of the index to find the other block in the "pseudoset"
  The miss penalty may become slightly longer

42

O5: Compiler Optimizations


Improve the hit rate by compile-time optimization
  Reordering instructions with profiling information (McFarling [1989])
    Reduced misses by 50% for a 2KB direct-mapped I-cache with 4-byte blocks, and by 75% for an 8KB cache
    Got the best performance when it was possible to prevent some instructions from entering the cache
  Aligning basic blocks: the entry point is placed at the beginning of a cache block
    Decreases the chance of a cache miss for sequential code
  Loop interchange: exchanging the nesting of loops
    Improves spatial locality => reduces misses
    Accesses data in the order in which they are stored
    => maximizes use of the data in a cache block before it is discarded
/* Before: skips through memory in strides of 100 words */
for (j = 0; j < 100; j = j + 1)
    for (i = 0; i < 5000; i = i + 1)
        x[i][j] = 2 * x[i][j];

/* After: accesses all words in a cache block before moving on */
for (i = 0; i < 5000; i = i + 1)
    for (j = 0; j < 100; j = j + 1)
        x[i][j] = 2 * x[i][j];


43

Blocking: operating on submatrices or blocks
  Maximize accesses to the data loaded into the cache before they are replaced
  Improves temporal locality

/* Before: X = Y*Z */
for (i = 0; i < N; i = i + 1)
    for (j = 0; j < N; j = j + 1) {
        r = 0;
        for (k = 0; k < N; k = k + 1)
            r = r + y[i][k] * z[k][j];
        x[i][j] = r;
    }

/* After: B = blocking factor */
for (jj = 0; jj < N; jj = jj + B)
    for (kk = 0; kk < N; kk = kk + B)
        for (i = 0; i < N; i = i + 1)
            for (j = jj; j < min(jj + B, N); j = j + 1) {
                r = 0;
                for (k = kk; k < min(kk + B, N); k = k + 1)
                    r = r + y[i][k] * z[k][j];
                x[i][j] = x[i][j] + r;
            }

The number of capacity misses depends on N and the cache size
The total number of memory words accessed by the blocked version is 2N^3/B + N^2
  y benefits from spatial locality; z benefits from temporal locality
44

5.6 Reducing Cache Miss Penalty or Miss Rate via Parallelism

Three techniques that overlap the execution of instructions with memory accesses:
1. Nonblocking caches to reduce stalls on cache misses
   (to match out-of-order processors)
2. Hardware prefetching of instructions and data
3. Compiler-controlled prefetching

45

O1: Nonblocking cache to reduce stalls on cache miss


For pipelined computers that allow out-of-order completion, the CPU need not stall on a cache miss
  With separate I-cache and D-cache, the CPU can continue fetching instructions from the I-cache while waiting for the D-cache to return the missing data
Nonblocking cache (lockup-free cache)
  Hit under miss: the D-cache continues to supply cache hits during a miss
  Hit under multiple miss or miss under miss: overlap multiple misses

(Figure: ratio of the average memory stall time of a blocking cache to that of hit-under-miss schemes. For the first 14 (FP) programs the averages are 76% for hit-under-1-miss, 51% for 2 misses, and 39% for 64 misses; for the final 4 (integer) programs the averages are 81%, 78%, and 78%.)
46

O2: Hardware Prefetching of Instructions and Data


Prefetch instructions or data before requested by the CPU
either directly into the caches or into an external buffer (faster than
accessing main memory)

Instruction prefetch: frequently done in hardware outside cache


Fetch two blocks on a miss
the requested block is placed in I-cache when it returns
the prefetched block is placed in instruction stream buffer (ISB)
A single ISB would catch 15% to 25% of the misses from a 4KB direct-mapped I-cache with 16-byte blocks; 4 stream buffers increased the data hit rate to 43% (Jouppi [1990])

UltraSPARC III: data prefetch


If a load hits in the prefetch cache
the block is read from the prefetch cache
the next prefetch request is issued: calculating the stride of the next
prefetched block using the difference between the current address and the
previous address

Up to 8 simultaneous prefetches
It may interfere with demand misses, resulting in lower performance
47

O3: Compiler-Controlled Prefetching


Compiler-controlled prefetching
Register prefetch: load the value into a register
Cache prefetch: load data only into the cache (not register)

Faulting vs. nonfaulting: the address does or does not cause an


exception for virtual address faults and protection violations
normal load instruction = faulting register prefetch instruction

Most effective prefetch: semantically invisible to a program


doesn't change the contents of registers and memory, and
cannot cause virtual memory faults

nonbinding prefetch: nonfaulting cache prefetch


Overlapping execution: CPU proceeds while the prefetched data are being
fetched
Advantage: The compiler may avoid unnecessary prefetches in hardware
Drawback: prefetch instructions incur instruction overhead

48
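A small illustration of compiler-inserted prefetching in C. __builtin_prefetch is a GCC/Clang builtin that issues a nonfaulting, nonbinding cache prefetch; the loop and the 16-element lookahead distance are made-up tuning choices, not recommendations from the text:

/* Prefetch data a fixed distance ahead so it arrives while earlier iterations execute. */
double sum_with_prefetch(const double *a, int n) {
    double sum = 0.0;
    for (int i = 0; i < n; i++) {
        if (i + 16 < n)
            __builtin_prefetch(&a[i + 16], 0 /* read */, 1 /* low temporal locality */);
        sum += a[i];   /* the prefetched block is filling the cache in the background */
    }
    return sum;
}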

5.7 Reducing Hit Time


Importance of cache hit time
Average Memory Access Time
= Hit Time + Miss Rate * Miss Penalty
More importantly, cache access time limits the clock cycle rate in
many processors today!

Fast hit time:


Quickly and efficiently find out if data is in the cache, and
if it is, get that data out of the cache

Four techniques:
1.Small and simple caches
2.Avoiding address translation during indexing of the cache
3.Pipelined cache access
4.Trace caches
49

O1: Small and Simple Caches


A time-consuming portion of a cache hit is using the index portion of the
address to read the tag memory and then compare it to the address

Guideline: smaller hardware is faster


Why does the Alpha 21164 have 8KB instruction and 8KB data caches plus a 96KB second-level cache?
  A small cache allows a fast clock rate

Guideline: simpler hardware is faster


Direct Mapped, on chip

General design:
small and simple cache for 1st-level cache
Keeping the tags on chip and the data off chip for 2nd-level caches
The emphasis recently is on fast clock time while hiding L1 misses with
dynamic execution and using L2 caches to avoid going to memory

50

O2: Avoiding address translation during cache indexing


Two tasks: indexing the cache and comparing addresses
virtually vs. physically addressed cache
virtual cache: use virtual address (VA) for the cache
physical cache: use physical address (PA) after translating virtual address

Challenges to virtual cache


1. Protection: page-level protection (RW/RO/Invalid) must be checked
   It is normally checked as part of the virtual-to-physical address translation
   Solution: an additional field copies the protection information from the TLB and is checked on every access to the cache
2. Context switching: the same VA in different processes refers to different PAs, requiring the cache to be flushed
   Solution: increase the width of the cache address tag with a process-identifier tag (PID)
3. Synonyms or aliases: two different VAs for the same PA
   Inconsistency problem: two copies of the same data in a virtual cache
   Hardware antialiasing solution: guarantee every cache block a unique PA
     Alpha 21264: check all possible locations; if one is found, it is invalidated
   Software page-coloring solution: force aliases to share some address bits
     Sun's Solaris: all aliases must be identical in the last 18 bits => no duplicate physical addresses in caches of 256 KB or less
4. I/O: typically uses PAs, so it needs to interact with the cache (see Section 5.12)
51

Virtually indexed, physically tagged cache


(Figure: three organizations.
  Conventional organization: CPU -> VA -> TB (translation buffer) -> PA -> physically tagged cache -> MEM.
  Virtually addressed cache: CPU -> VA -> virtually tagged cache; translation is done only on a miss; suffers from the synonym problem.
  Virtually indexed, physically tagged: the cache is indexed with the VA in parallel with the TB translation and tagged with the PA, overlapping cache access with address translation; this requires the cache index to remain invariant across translation. An L2 cache is then accessed with the PA.)
52

O3: Pipelined Cache Access


Simply pipeline the cache access
  A 1st-level cache hit then takes multiple clock cycles
Tradeoff: fast clock cycle time but slow hits
  Example: clock cycles to access instructions from the I-cache
    Pentium: 1 clock cycle
    Pentium Pro through Pentium III: 2 clocks
    Pentium 4: 4 clocks
Drawback: increasing the number of pipeline stages leads to
  a greater penalty on mispredicted branches and
  more clock cycles between the issue of a load and the use of the data
Note that pipelining increases the bandwidth of instructions rather than decreasing the actual latency of a cache hit
53

O4: Trace Caches


A trace cache for instructions finds a dynamic sequence of instructions, including taken branches, to load into a cache block
  The cache blocks contain
    dynamic traces of the executed instructions as determined by the CPU
    rather than static sequences of instructions as determined by memory layout
  Branch prediction is folded into the cache: predictions are validated along with the addresses to ensure a valid fetch
  E.g., the Intel NetBurst microarchitecture
Advantage: better utilization of the cache block
  Trace caches store instructions only from the branch entry point to the exit of the trace
  In a conventional I-cache, the unused part of a long block entered or exited by a taken branch may be fetched but never used
Downside: the same instructions may be stored multiple times


54

Cache Optimization Summary
(Figure: table summarizing the optimizations of Section 5.4 (miss penalty), Section 5.5 (miss rate), Section 5.6 (miss penalty/miss rate via parallelism), and Section 5.7 (hit time).)

55

Summary
Chapter 5 Memory Hierarchy Design
5.1  Introduction
5.2  Review of the ABCs of Caches
5.3  Cache Performance
5.4  Reducing Cache Miss Penalty
5.5  Reducing Cache Miss Rate
5.6  Reducing Cache Miss Penalty/Miss Rate via Parallelism
5.7  Reducing Hit Time
5.8  Main Memory and Organizations for Improving Performance
5.9  Memory Technology
5.10 Virtual Memory
5.11 Protection and Examples of Virtual Memory
56
