
ADVANCED SYSTEM ARCHITECTURES
Memory Hierarchy Design

Faculty of Computer Science and Engineering (Khoa Khoa học và Kỹ thuật Máy tính)
Department of Computer Engineering (BM Kỹ thuật Máy tính)
HCMUT (BK TP.HCM)

Trần Ngọc Thịnh
http://www.cse.hcmut.edu.vn/~tnthinh
©2019, dce

Since 1980, CPU has outpaced DRAM

[Figure: processor vs. DRAM performance (1/latency) over time. CPU performance improves ~60% per year (2X in 1.5 years); DRAM improves only ~9% per year (2X in 10 years); the gap grew ~50% per year.]

A four-issue 2 GHz superscalar accessing 100 ns DRAM could execute 800 instructions during the time for one memory access!

Processor-DRAM Performance Gap Impact

To illustrate the performance impact, assume a single-issue pipelined CPU with CPI = 1 using non-ideal memory. Ignoring other factors, the minimum cost of a full memory access, in terms of the number of wasted CPU cycles (or instructions), is:

Year    CPU speed (MHz)   CPU cycle (ns)   Memory access (ns)   Minimum CPU memory stall cycles (instructions wasted)
1986          8              125                 190            190/125 - 1 = 0.5
1989         33               30                 165            165/30 - 1 = 4.5
1992         60               16.6               120            120/16.6 - 1 = 6.2
1996        200                5                 110            110/5 - 1 = 21
1998        300                3.33              100            100/3.33 - 1 = 29
2000       1000                1                  90            90/1 - 1 = 89
2002       2000                0.5                80            80/0.5 - 1 = 159
2004       3000                0.333              60            60/0.333 - 1 = 179
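Every row applies one formula: stall cycles = memory access time / CPU cycle time - 1 (the issuing cycle itself is not counted as a stall). A minimal C sketch of my own that reproduces the table's arithmetic:

```c
#include <stdio.h>

/* Each table row applies the same formula:
 * wasted cycles = memory access time / CPU cycle time - 1 */
int main(void) {
    struct { int year; double cycle_ns, mem_ns; } r[] = {
        {1986, 125, 190},  {1989, 30, 165},   {1992, 16.6, 120},
        {1996, 5, 110},    {1998, 3.33, 100}, {2000, 1, 90},
        {2002, 0.5, 80},   {2004, 0.333, 60},
    };
    for (int i = 0; i < 8; i++)
        printf("%d: %.1f stall cycles per access\n",
               r[i].year, r[i].mem_ns / r[i].cycle_ns - 1);
    return 0;
}
```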

Addressing the Processor-Memory Performance Gap

• Goal: the illusion of a large, fast, cheap memory. Let programs address a memory space that scales to the disk size, at a speed that is usually as fast as register access.
• Solution: put smaller, faster "cache" memories between the CPU and DRAM. Create a "memory hierarchy".

Levels of the Memory Hierarchy

Level          Capacity      Access time              Cost
Registers      100s bytes    <10s ns
Cache          K bytes       10-100 ns                1-0.1 cents/bit
Main memory    M bytes       200 ns - 500 ns          0.0001-0.00001 cents/bit
Disk           G bytes       10 ms (10,000,000 ns)    10^-5 - 10^-6 cents/bit
Tape           infinite      sec-min                  10^-8 cents/bit

Staging transfer units between adjacent levels:
– Registers <-> Cache: instruction operands, 1-8 bytes (program/compiler)
– Cache <-> Main memory: blocks, 8-128 bytes (cache controller)
– Main memory <-> Disk: pages, 512-4K bytes (OS)
– Disk <-> Tape: files, Mbytes (user/operator)

Upper levels are faster; lower levels are larger. Today's focus: the cache.

Common Predictable Patterns

Two predictable properties of memory references:

• Temporal Locality: if a location is referenced, it is likely to be referenced again in the near future (e.g., loops, reuse).
• Spatial Locality: if a location is referenced, it is likely that locations near it will be referenced in the near future (e.g., straight-line code, array access).

Caches

Caches exploit both types of predictability:
– Exploit temporal locality by remembering the contents of recently accessed locations.
– Exploit spatial locality by fetching blocks of data around recently accessed locations.
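As a small illustration (my own sketch, not from the slides), both kinds of locality appear in a simple summation loop:

```c
/* Locality in a summation loop (illustrative sketch). */
double sum(const double *a, int n) {
    double s = 0.0;              /* s is re-referenced every iteration: temporal   */
    for (int i = 0; i < n; i++)  /* the loop code itself is re-fetched: temporal   */
        s += a[i];               /* a[0], a[1], ... neighboring addresses: spatial */
    return s;
}
```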

Simple view of cache

[Figure: Processor <-> CACHE <-> Main Memory, connected by address and data lines.]

• The processor accesses the cache first
• Cache hit: just use the data
• Cache miss: replace a block in the cache with a block from main memory, then use the data
• Data is transferred between the cache and main memory in blocks, under the control of independent hardware

• Hit rate: the fraction of accesses that hit in the cache
• Miss rate: 1 - hit rate
• Miss penalty: the time to replace a block plus the time to deliver the data to the processor
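These quantities combine into the standard average memory access time formula from the Hennessy & Patterson text the course references (not stated on this slide): AMAT = hit time + miss rate x miss penalty. A one-line C sketch, with hypothetical example numbers:

```c
/* Standard AMAT formula (textbook; the example numbers are hypothetical). */
double amat(double hit_time, double miss_rate, double miss_penalty) {
    return hit_time + miss_rate * miss_penalty;
}
/* e.g., amat(1.0, 0.05, 100.0) == 6.0 cycles per access on average */
```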

Simple view of cache

• Example: for (i = 0; i < 10; i++) S = S + A[i];
• No cache: at least 12 accesses to main memory (10 reads of A[i], plus reading and writing S)
• With a cache: if A[i] and S fit in a single block (e.g., 32 bytes), 1 access to load the block into the cache and 1 access to write the block back to main memory
• Access to S: temporal locality
• Access to A[i]: spatial locality

Replacement

[Figure: memory blocks 0-31 and a 4-block cache holding blocks 0-3; the CPU needs a block that is not currently cached.]

• The cache cannot hold all blocks
• A cached block must be replaced by another block that the CPU currently needs

Basic Cache Design & Operation Issues

• Q1: Where can a block be placed in the cache?
  (Block placement strategy & cache organization)
  – Fully associative, set associative, direct mapped
• Q2: How is a block found if it is in the cache?
  (Block identification)
  – Tag/block
• Q3: Which block should be replaced on a miss?
  (Block replacement)
  – Random, LRU, FIFO
• Q4: What happens on a write?
  (Cache write policy)
  – Write through, write back

Q1: Where can a block be placed?

[Figure: memory blocks 0-31 and an 8-block cache. Block 12 can be placed:
  – fully associative: anywhere;
  – 2-way set associative: anywhere in set 0 (12 mod 4);
  – direct mapped: only into block 4 (12 mod 8).]
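Using the figure's numbers, the three placement rules reduce to two modulo operations (a small C sketch of my own):

```c
/* Placement of memory block 12 in an 8-block cache (numbers from the figure). */
enum { BLOCK = 12, LINES = 8, WAYS = 2, SETS = LINES / WAYS /* = 4 */ };
int direct_mapped_line = BLOCK % LINES;  /* 12 mod 8 = 4: only line 4         */
int set_assoc_set      = BLOCK % SETS;   /* 12 mod 4 = 0: anywhere in set 0   */
/* fully associative: block 12 can go into any of the 8 lines                 */
```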

Direct-Mapped Cache

[Figure: the address is split into Tag (t bits), Index (k bits), and block Offset (b bits). The index selects one of 2^k lines, each holding {V, Tag, Data Block}; the stored tag is compared with the address tag, and on a match the cache signals HIT and returns the selected data word or byte.]

• Address: N bits (2^N bytes)
• The cache has 2^k lines (blocks)
• Each line holds 2^b bytes
• Block M is mapped to line M % 2^k
• t = N - k - b tag bits are needed to identify the memory block
• Advantage: simple
• Disadvantage: high miss rate
• What if the CPU accesses blocks N0 and N1, and N0 % 2^k = N1 % 2^k?
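A minimal C sketch of this address split (field names are mine; k and b are the slide's index and offset widths):

```c
#include <stdint.h>

/* Split a 32-bit address into tag/index/offset for a direct-mapped cache
 * with 2^k lines of 2^b bytes each (a sketch of the slide's field layout). */
typedef struct { uint32_t tag, index, offset; } fields;

fields split(uint32_t addr, unsigned k, unsigned b) {
    fields f;
    f.offset = addr & ((1u << b) - 1);        /* low b bits: byte within block */
    f.index  = (addr >> b) & ((1u << k) - 1); /* next k bits: cache line       */
    f.tag    = addr >> (k + b);               /* remaining t = N-k-b bits      */
    return f;
}
```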

Direct-mapped Cache

[Figure: memory blocks 0-31 mapping onto an 8-line direct-mapped cache.]

• Access N0, then N1, where N0 % 2^k = N1 % 2^k (i.e., the same index field, the k low bits of the block address). Hit or miss? (See the sketch after the 4KB example below.)
• A block is replaced even while plenty of other lines are free!

4KB Direct Mapped Cache Example

[Figure: 32-bit address (bit positions 31..0) split into a 20-bit tag, a 10-bit index, and a 2-bit byte offset. The index selects one of 1024 {Valid, Tag, Data} entries; a 20-bit tag comparison produces Hit, and the data is one 32-bit word.]

• 1K = 1024 blocks, each block = one word
• Can cache up to 2^32 bytes = 4 GB of memory
• Mapping function: cache block frame number = (block address) MOD 1024, i.e., the index field, the 10 low bits of the block address
• Block address = 30 bits: tag = 20 bits, index = 10 bits; byte offset = 2 bits
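To make the "hit or miss?" question concrete for the 4KB cache above: any two block addresses congruent mod 1024 collide. A hedged sketch (the block numbers are mine):

```c
/* Two memory blocks conflict in the 4KB direct-mapped cache whenever their
 * block addresses are congruent mod 1024 (the number of lines). */
enum { CACHE_LINES = 1024 };
unsigned n0 = 7;                 /* maps to line 7    % 1024 = 7 */
unsigned n1 = 7 + CACHE_LINES;   /* maps to line 1031 % 1024 = 7 */
/* Alternating accesses n0, n1, n0, n1, ... miss every time: each access
 * evicts the other block from line 7 (the "ping-pong" effect). */
```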

64KB Direct Mapped Cache Example

[Figure: 32-bit address split into a 16-bit tag, a 12-bit index, a 2-bit word select, and a 2-bit byte offset. The index selects one of 4K {V, 16-bit Tag, 128-bit Data} entries; a 4-to-1 multiplexer picks one of the four 32-bit words; the tag comparison produces Hit.]

• 4K = 4096 blocks, each block = four words = 16 bytes
• Can cache up to 2^32 bytes = 4 GB of memory
• Block address = 28 bits: tag = 16 bits, index = 12 bits; block offset = 4 bits (word select + byte offset)
• Mapping function: cache block frame number = (block address) MOD 4096, i.e., the index field, the 12 low bits of the block address
• Larger cache blocks take better advantage of spatial locality and thus may result in a lower miss rate

Fully Associative Cache

[Figure: the address is split into Tag (t bits) and block Offset (b bits) only, with no index. Every line's {V, Tag, Data Block} is compared against the address tag in parallel (one comparator per line); on a match the cache signals HIT and a multiplexer returns the selected word or byte.]

Fully associative cache

• CAM: Content Addressable Memory
• Each block can be mapped to any line in the cache
• Tag bits: t = N - b, compared against the tags of all lines
• Advantage: replacement occurs only when no room is available
• Disadvantage: resource consumption, and delay from comparing against many entries

Set-Associative Cache

[Figure: the address is split into Tag (t bits), Index (k bits), and block Offset (b bits). The index selects a set; each way of the set holds {V, Tag, Data Block}, and all tags in the set are compared in parallel. On a match the cache signals HIT and returns the selected word or byte.]
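A C sketch of the lookup just pictured, 4-way for concreteness (layout and names are mine; hardware compares the ways of a set in parallel, not in a loop). One set (k = 0) degenerates to a fully associative cache; one way per set is direct mapped:

```c
#include <stdint.h>

/* Sketch of a 4-way set-associative lookup. */
typedef struct { int valid; uint32_t tag; } way_t;

int lookup(way_t sets[][4], unsigned k, unsigned b, uint32_t addr) {
    uint32_t set = (addr >> b) & ((1u << k) - 1);  /* index picks the set */
    uint32_t tag = addr >> (k + b);
    for (int w = 0; w < 4; w++)                    /* parallel in hardware */
        if (sets[set][w].valid && sets[set][w].tag == tag)
            return w;                              /* hit: matching way    */
    return -1;                                     /* miss                 */
}
```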

W-way Set-associative Cache

• A balance between a direct-mapped cache and a fully associative cache
• The cache has 2^k sets; each set has w lines
• Block M is mapped to one of the w lines in set M % 2^k
• Tag bits: t = N - k - b
• A set-associative cache requires parallel tag matching and more complex hit logic, which may increase the hit time
• Currently widely used (Intel, AMD, ...)

4K Four-Way Set Associative Cache: MIPS Implementation Example

[Figure: 32-bit address split into a 22-bit tag, an 8-bit index, and a 2-bit block offset. The index selects one of 256 sets; each set holds four {V, Tag, Data} entries whose tags are compared in parallel, and a 4-to-1 multiplexer delivers the hit data.]

• 1024 block frames, each block = one word, 4-way set associative: 1024 / 4 = 256 sets
• Can cache up to 2^32 bytes = 4 GB of memory
• Block address = 30 bits: tag = 22 bits, index = 8 bits; block offset = 2 bits
• Mapping function: cache set number = index = (block address) MOD 256
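Plugging the example's field widths into code (a quick sketch, derived from the slide's numbers):

```c
#include <stdint.h>

/* Field helpers for this example: byte offset = 2 bits,
 * index = 8 bits (256 sets), tag = 32 - 8 - 2 = 22 bits. */
uint32_t set_of(uint32_t addr) { return (addr >> 2) & 0xFFu; } /* (block address) MOD 256 */
uint32_t tag_of(uint32_t addr) { return addr >> 10; }          /* top 22 bits             */
```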

Q2: How is a block found?

• The index selects which set to look in
• The tags are compared to find the block
• Increasing associativity shrinks the index and expands the tag (see the sketch after this slide pair). Fully associative caches have no index field.
• Direct-mapped: 1-way set associative?
• Fully associative: 1 set?

Memory address: | Tag | Index | Block Offset |   (tag + index = block address)

What causes a MISS?

• Three major categories of cache misses:
  – Compulsory misses: the first access to a block
  – Capacity misses: the cache cannot contain all the blocks needed to execute the program
  – Conflict misses: a block is replaced by another block and then retrieved again later (affects set-associative or direct-mapped caches)
• Nightmare scenario: the ping-pong effect!
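The index/tag trade-off can be checked numerically. A small C sketch (helper names are mine) for a 4KB cache with one-word blocks at three associativities:

```c
#include <stdio.h>

/* Field widths for a 32-bit address, 'size'-byte cache, 'block'-byte
 * blocks, and 'ways' lines per set (all sizes powers of two). */
static unsigned log2u(unsigned x) { unsigned n = 0; while (x >>= 1) n++; return n; }

static void widths(unsigned size, unsigned block, unsigned ways) {
    unsigned sets = size / (block * ways);   /* sets == 1 -> fully associative */
    unsigned b = log2u(block), k = log2u(sets);
    printf("%4u-way: tag %2u, index %2u, offset %u\n", ways, 32 - k - b, k, b);
}

int main(void) {
    widths(4096, 4, 1);     /* direct mapped: tag 20, index 10, offset 2      */
    widths(4096, 4, 4);     /* 4-way:         tag 22, index  8, offset 2      */
    widths(4096, 4, 1024);  /* fully assoc.:  tag 30, index  0 (no index)     */
    return 0;
}
```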

Block Size and Spatial Locality

A block is the unit of transfer between the cache and memory.

[Figure: a 4-word block (b = 2): | Tag | Word0 | Word1 | Word2 | Word3 |. The CPU address is split into a block address (32 - b bits) and an offset (b bits); 2^b = block size, a.k.a. line size (in bytes).]

Larger block sizes have distinct hardware advantages:
• less tag overhead
• exploit fast burst transfers from DRAM
• exploit fast burst transfers over wide busses

What are the disadvantages of increasing the block size? Fewer blocks => more conflicts. Can waste bandwidth.

Q3: Which block should be replaced on a miss?

• Easy for direct-mapped caches
• Set associative or fully associative:
  – Random
  – Least Recently Used (LRU)
    • LRU cache state must be updated on every access
    • a true implementation is only feasible for small sets (2-way, 4-way)
    • a pseudo-LRU binary tree is often used for 4-8 ways
  – First In, First Out (FIFO), a.k.a. round-robin
    • used in highly associative caches
• The replacement policy has only a second-order effect, since replacement happens only on misses
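A true-LRU implementation for one small set, which the slide notes is feasible only for 2-4 ways (a sketch of one common age-counter scheme; names are mine):

```c
/* True LRU for one W-way set via age counters.
 * age[i] == 0 marks the most recently used way. */
#define W 4
static unsigned age[W] = {0, 1, 2, 3};   /* any permutation of 0..W-1     */

void touch(int i) {                      /* call on every access to way i */
    for (int j = 0; j < W; j++)
        if (age[j] < age[i]) age[j]++;   /* everyone younger ages by one  */
    age[i] = 0;                          /* way i becomes the youngest    */
}

int victim(void) {                       /* on a miss, evict the oldest   */
    int v = 0;
    for (int j = 1; j < W; j++)
        if (age[j] > age[v]) v = j;
    return v;
}
```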

Q4: What happens on a write?

• Cache hit:
  – write through: write both the cache and memory
    • generally higher traffic, but simplifies cache coherence
  – write back: write the cache only (memory is written only when the entry is evicted)
    • a dirty bit per block can further reduce the traffic
• Cache miss:
  – no write allocate: write only to main memory
  – write allocate (a.k.a. fetch on write): fetch the block into the cache
• Common combinations (sketched in code after the reading list below):
  – write through and no write allocate
  – write back with write allocate

Reading assignment 1

• Cache performance
  – Replacement policy (algorithms)
  – Optimization (miss rate, miss penalty, ...)
• References
  – Hennessy & Patterson, Computer Architecture: A Quantitative Approach
  – www2.lns.mit.edu/~avinatan/research/cache.pdf
  – More on the internet
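Returning to the two common policy combinations from the Q4 slide, here is a hedged micro-model (all names are mine; a single-line cache of 32-byte blocks, addresses below 2^16, just to contrast the behaviors):

```c
#include <stdint.h>
#include <string.h>

static uint8_t memory[1 << 16];
/* With a single line there is no index, so the whole block number (a / 32)
 * serves as the tag. */
static struct { int valid, dirty; uint32_t tag; uint8_t data[32]; } line;

/* Write through + no write allocate: memory is always updated;
 * the cached copy is updated only if the address already hits. */
void store_wt(uint32_t a, uint8_t v) {
    if (line.valid && line.tag == a / 32) line.data[a % 32] = v;
    memory[a] = v;
}

/* Write back + write allocate: on a miss, write back the old block if
 * dirty and fetch the new one; then write only the cache and mark dirty. */
void store_wb(uint32_t a, uint8_t v) {
    if (!line.valid || line.tag != a / 32) {
        if (line.valid && line.dirty)
            memcpy(&memory[line.tag * 32], line.data, 32); /* write back   */
        memcpy(line.data, &memory[(a / 32) * 32], 32);     /* allocate     */
        line.tag = a / 32; line.valid = 1;
    }
    line.data[a % 32] = v;
    line.dirty = 1;              /* memory is updated only on eviction     */
}
```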

Reading Log 1

• List some replacement-policy algorithms. Can the replacement policy affect cache performance?
• What is cache performance?
• What criteria can affect cache performance?
• Explain an optimization technique for each criterion
• Multi-level caches
