
Caches

Contents

Memory hierarchy
How do we build a memory that is both large and fast?
Combining SRAM, DRAM, and hard disks
Caching
Small memories that hold the important data
Example
How does a cache work?
Tags
Blocks (lines)
Implementation
3 cache types: fully-associative, set-associative, direct-mapped
Performance

Memory hierarchy

The problem

We need a memory that is both large and fast
Large: the ISA allows 2^32 memory addresses (4 GB)
Fast: about 33% of instructions are loads/stores, and 100% of instructions must be fetched into the instruction register
Is there a memory that is both large and fast to access?

Large and fast memory

What memory technologies do we have?
Hard disk: Huge (1000 GB), super slow (1M cycles)
Flash: Big (100 GB), very slow (1k cycles)
DRAM: Medium (10 GB), slow (100 cycles)
SRAM: Small (10 MB), fast (1-10 cycles)
We need memory that is both fast and large
Cannot use only SRAM (too small)
Cannot use only DRAM (too slow and small)
Cannot use only Flash/hard disk (way too slow)
Can we combine them?
Speed from the (small) SRAMs
Size from the (big) DRAM and hard disk
Build a hierarchy that uses different technologies to exploit the strengths of each available memory.

Memory hierarchy

What we have:
Small and fast: SRAM
Slower: DRAM
Hard disk: large capacity but very slow
What we want:
Very large
Very fast (on average)
How?
Keep the important data in fast memory.
Move the unimportant data to slow memory.

Example: video editing

The video is large (bigger than DRAM)
Store it on the hard disk
Load the part being edited into DRAM
The CPU loads the data it is processing into the cache
Move new data into DRAM and the cache as the video is processed
Note:
Keep the important data in fast memory
Move the unimportant data to slow memory

The memory hierarchy today

(Figure: memory hierarchy of an Intel Nehalem processor)

Comparing technology trends

How can we get the capacity of larger SRAMs and faster access than DRAM?

The basic idea of caching

Put the important data in a small, fast memory (the cache).
If we access (load/store) the important data, the access is fast.
If we access (load/store) other data, move that data into the cache.
If we put exactly the data we need into the cache, then most accesses will find useful data in the cache and will be fast.

Performance of caches

A data access to DRAM takes 100 cycles
A data access to the cache (SRAM) takes 1 cycle
33% of instructions are loads/stores
Ignore instruction fetches for now (they would need a second memory; covered later)

Calculating cache performance

Using DRAM only:
A data access takes 100 cycles
(33% loads/stores) * 100 cycles = 33 cycles of memory access per instruction

Using a perfect SRAM cache (data is in the cache 100% of the time):
(33% loads/stores) * 1 cycle = 0.33 cycles of memory access per instruction

Using a more realistic SRAM cache (data is in the cache 90% of the time):
(33% loads/stores) * (1 cycle 90% of the time + 100 cycles 10% of the time)
= 0.33 * (1*0.9 + 100*0.1) = 0.33 * 10.9 = 3.6 cycles of memory access per instruction
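The same arithmetic as a small C sketch (the 33% load/store fraction, 1-cycle cache, 100-cycle DRAM, and 90% hit rate are the figures assumed above):

#include <stdio.h>

/* Memory-access cycles per instruction:
   (fraction of loads/stores) * (average cycles per data access) */
static double mem_cycles_per_instr(double ls_fraction, double hit_rate,
                                   double cache_cycles, double dram_cycles)
{
    double avg_access = hit_rate * cache_cycles + (1.0 - hit_rate) * dram_cycles;
    return ls_fraction * avg_access;
}

int main(void)
{
    printf("DRAM only:     %.2f\n", mem_cycles_per_instr(0.33, 0.0, 1.0, 100.0));
    printf("Perfect cache: %.2f\n", mem_cycles_per_instr(0.33, 1.0, 1.0, 100.0));
    printf("90%% hit rate:  %.2f\n", mem_cycles_per_instr(0.33, 0.9, 1.0, 100.0));
    return 0;
}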

Example: caches

Example: caching instructions

How does a cache work?


An associative memory:
Stores a tag that identifies which memory location is held in the cache.
The tag is the address (or part of the address) of the data stored in the cache.
The rest of the cache stores the data itself.
How do we know whether the data is valid?
The cache starts up with all tags equal to 0
How do we tell whether that means "the instruction at address 0" or "nothing yet"?
Add a marker bit: the valid/invalid bit
The bit is 1 when the data is valid and 0 when it is not.
If the matching tag has its valid bit set to 0 (invalid), ignore the data.

What is in the cache?

Storing data in the cache

Load the data from memory
Store the tag (0x8)
Store the data (add r2, r2, r1)
Set the valid bit (1)

Q: When are the valid bits cleared?
1. Never
2. When the cache is flushed
3. After a load completes
A: 2
The cache is flushed when switching between programs or when the computer goes to standby. It must be cleared because its data is no longer valid.

Reading data from the cache

Check the tag (0x8) in the cache
Check that the valid bit is 1
Read the data from the cache
Q: What do we do if the valid bit is 0?
1. Ignore the data.
2. Set the bit back to 1.
3. Fetch the data from DRAM.
A: Fetch the data from DRAM
If the valid bit is not set, the data in the cache is not valid. In that case we must fetch the data from DRAM and put it into the cache.

Cache blocks (lines)


Improve efficiency by storing more data per tag

How efficient is this cache?

Searching an n-entry cache requires either:
n comparators (check every entry at the same time), or
n cycles (check one entry per cycle)
Expensive!
Storage usage per entry:
Data: 32 bits (one word)
Tag: 30 bits (one address)
Valid: 1 bit
63 bits per entry
Only 32 of them store data
Very inefficient!
Q: How many bits do we need for a tag?
1. 4
2. 30
3. 32
A: 30
The tag identifies the address of the data held in the cache, so we need enough bits to identify a word address.
(We would need 32 bits if the memory were not word-aligned.)

Too much space is spent on tags

Caches have two overheads:
Space spent on tags
Logic (comparators) spent on searching
Reducing the tag overhead:
1 word / tag: 1/2 wasted
2 words / tag: 1/3 wasted
4 words / tag: 1/5 wasted
8 words / tag: 1/9 wasted
16 words / tag: 1/17 wasted (used in today's processors)
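A quick C check of the "wasted" fractions above, assuming 32-bit words, a 30-bit tag for single-word blocks, and one valid bit per block (the tag loses one bit each time the block size doubles):

#include <math.h>
#include <stdio.h>

/* Fraction of storage spent on tag + valid bit for a block of b words. */
static double overhead(int b)
{
    double tag_bits  = 30 - log2(b);      /* 30-bit tag for 1-word blocks */
    double meta_bits = tag_bits + 1;      /* plus the valid bit           */
    double data_bits = 32.0 * b;
    return meta_bits / (meta_bits + data_bits);
}

int main(void)
{
    for (int b = 1; b <= 16; b *= 2)
        printf("%2d words/tag: %4.1f%% wasted\n", b, 100 * overhead(b));
    return 0;   /* prints roughly 1/2, 1/3, 1/5, 1/9, 1/17 */
}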

The fix?

Larger cache blocks mean fewer tags, so more of the storage holds useful data

Reading data from the cache

Same as before:
Check whether the tag (0x8) is in the cache
Check that the valid bit is 1
Read the data from the cache
But there are now several words in each cache block:
Use address bits to select the right word with a multiplexer (MUX)
The tag only covers the address bits that are the same for every word in the block

Addressing caches with different block sizes

The address:
Last 2 bits: select the byte within the word
(not needed here, because memory is accessed in whole words)
Next N bits: select the word within the block
(e.g., a block size of 4 words needs N = 2 bits)
Remaining 32-N-2 bits: the tag
(needed to identify the memory address)
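A sketch of that breakdown in C for a block of 4 words (N = 2), as in the example; the address value itself is arbitrary:

#include <stdio.h>

#define WORDS_PER_BLOCK 4
#define WORD_BITS       2   /* N = log2(WORDS_PER_BLOCK) */

int main(void)
{
    unsigned addr = 0x1234;                                       /* arbitrary example */
    unsigned byte_in_word  = addr & 0x3;                          /* last 2 bits       */
    unsigned word_in_block = (addr >> 2) & (WORDS_PER_BLOCK - 1); /* next N bits       */
    unsigned tag           = addr >> (2 + WORD_BITS);             /* remaining bits    */
    printf("byte %u, word %u, tag 0x%x\n", byte_in_word, word_in_block, tag);
    return 0;
}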

Example: cache block size

Q: Why is a block size of 4 words a good idea?
1. It takes less space
2. It loads data we will need later, before we need it
3. The tags have fewer bits
A: 2
By loading the next 3 words along with the word we need, the hit rate goes up. This works because of spatial locality: data near recently used data is very likely to be used soon.

Larger cache blocks

Advantages:
Less space spent on tags (higher % of the storage is data)
Spatial locality: the next N-1 words are likely to be used soon
Disadvantages:
If the extra words in a block are never used, they waste cache space
Example:
for (int i = 0; i < 100; i += 2) {
a[i]++;
}
Half of each block is wasted (only every other word is touched)
64 bytes (16 words) is a standard block size
Intel actually loads 2 cache blocks at a time, i.e. 128 bytes or 32 words

Direct-mapped caches
Making the search more efficient

The problem

Searching for a block
requires n comparators,
or takes n cycles
What were we after?
The ability to place data anywhere in the cache: flexibility
With n blocks, each cache location can hold any one of the blocks in memory, so every location must be checked.

Q: What is the downside?
1. A little wasted space
2. Only one comparator is needed
3. Every block must be searched
A: Every block must be searched
We have to search the entire cache, either with n comparators or with one comparator over n cycles.

Block placement in a direct-mapped cache

Each address maps to exactly one cache block
Modulo addressing: K = i mod n
K: the cache location the block goes into
i: the block number in main memory
n: the number of blocks in the cache
Example (n = 8):
0 -> (0 mod 8) = 0
1 -> (1 mod 8) = 1
2 -> (2 mod 8) = 2
8 -> (8 mod 8) = 0
9 -> (9 mod 8) = 1
18 -> (18 mod 8) = 2
Equivalently, just look at the last 3 bits (the rest are the tag bits):
0 -> 000000 -> 0
1 -> 000001 -> 1
2 -> 000010 -> 2
8 -> 001000 -> 0
9 -> 001001 -> 1
18 -> 010010 -> 2

(byte addressing ignored here)
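A minimal C sketch of the same mapping for the 8-block cache above; the modulo and the last-3-bits view give the same index:

#include <stdio.h>

#define NUM_BLOCKS 8   /* direct-mapped cache with 8 blocks, as above */

int main(void)
{
    unsigned blocks[] = {0, 1, 2, 8, 9, 18};   /* block numbers from the example */
    for (int i = 0; i < 6; i++) {
        unsigned index = blocks[i] % NUM_BLOCKS;        /* K = i mod n        */
        unsigned bits  = blocks[i] & (NUM_BLOCKS - 1);  /* = last 3 bits      */
        unsigned tag   = blocks[i] / NUM_BLOCKS;        /* the remaining bits */
        printf("block %2u -> index %u (%u), tag %u\n", blocks[i], index, bits, tag);
    }
    return 0;
}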

Direct-mapped cache example

Advantage: the lookup is easy
Use the address to find the one possible location (modulo addressing)
Compare the tag
Disadvantage: inflexible (one cycle searches one block, but...)
If two addresses map to the same cache line, only one of them can be stored
Conflicts occur
If two memory blocks have the same modulo address, they cannot both be in the cache at the same time, even if the rest of the cache is empty. (Inflexible -> Conflicts)

Set-associative caches
More flexibility

Can we compromise?

Can we combine the easy lookup of a direct-mapped cache with the flexibility of a fully-associative cache?
Yes: the set-associative cache
Each block maps to one set (a set is a group of blocks that share the same modulo address)
Each set holds several blocks (searched associatively)
Easier lookup:
The set is determined directly by the address
Only the blocks in that set need to be checked
Flexibility:
A block can go anywhere within its set
Fewer cache misses

Set-associative caches

Each address maps to one set
Modulo addressing: K = i mod s
K: the set the block goes into
i: the block number in main memory
s: the number of sets in the cache
Example (s = 4):
0 -> (0 mod 4) = anywhere in Set 0
1 -> (1 mod 4) = anywhere in Set 1
2 -> (2 mod 4) = anywhere in Set 2
8 -> (8 mod 4) = anywhere in Set 0
9 -> (9 mod 4) = anywhere in Set 1
18 -> (18 mod 4) = anywhere in Set 2
Equivalently, look at the last 2 bits (the rest are the tag bits):
0 -> 000000 -> Set 0
1 -> 000001 -> Set 1
2 -> 000010 -> Set 2
8 -> 001000 -> Set 0
9 -> 001001 -> Set 1
18 -> 010010 -> Set 2

(Byte addressing ignored for now)
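A sketch of a set-associative lookup in C, assuming 4 sets (as above) and a hypothetical 2-way associativity; the set index comes from the low bits and only the ways of that set are compared:

#include <stdbool.h>
#include <stdio.h>

#define NUM_SETS 4   /* as in the example above         */
#define NUM_WAYS 2   /* assumed 2-way, for illustration */

struct line { bool valid; unsigned tag; unsigned data[4]; };
static struct line cache[NUM_SETS][NUM_WAYS];

/* Look up a block number (byte/word offsets ignored, as in the slides). */
static bool lookup(unsigned block, unsigned **data_out)
{
    unsigned set = block % NUM_SETS;   /* which set to search */
    unsigned tag = block / NUM_SETS;   /* remaining tag bits  */
    for (int way = 0; way < NUM_WAYS; way++) {
        if (cache[set][way].valid && cache[set][way].tag == tag) {
            *data_out = cache[set][way].data;   /* hit */
            return true;
        }
    }
    return false;                               /* miss */
}

int main(void)
{
    unsigned *d;
    printf("block 9: %s\n", lookup(9, &d) ? "hit" : "miss");  /* empty cache: miss */
    return 0;
}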

Example: set-associative

Advantages: easy lookup
Use the address to find the set
Compare all the tags in that set
Disadvantages:
Less flexible than fully-associative (a block can only go anywhere within its set)
More complex (more tags to check)
If two addresses have the same modulo address, they can both live in the same set. (More flexible -> Fewer conflicts)

Summary: cache types

Direct-mapped: each address maps to exactly one block
(9 mod 8) = 1
Easy lookup, not flexible
Set-associative: each address maps to one set
(9 mod 4) = Set 1
Fairly easy lookup, more flexible
Fully-associative: an address can map to any location in the cache
Hard to search, very flexible

Cache design

Pros and cons of the cache types

Fully-associative
Pro: flexible, data can go anywhere (no conflicts)
Con: hard to search (slow/expensive)
Set-associative
Pro: fairly flexible, data can go anywhere within its set
Con: has to search several places within the set (reasonable speed/complexity)
Direct-mapped
Pro: easy lookup, only one place to check (fast/simple)
Con: conflicts occur

Note: the same block size is used across the whole hierarchy.

Caches today

Intel Ivy Bridge (2012, 4-core x86):
L1 Data: 32kB, 8-way set-associative, 64-byte line
L1 Instruction: 32kB, 8-way set-associative, 64-byte line
L2 I&D: 256kB, 8-way set-associative, 64-byte line
L3 I&D: 8MB, 16-way set-associative, 64-byte line

Qualcomm Krait (2012, 4-core ARM):
L0 Data: 4kB, direct-mapped, 64-byte line
L0 Instruction: 4kB, direct-mapped, 64-byte line
L1 Data: 16kB, 4-way set-associative, 64-byte line
L1 Instruction: 16kB, 4-way set-associative, 64-byte line
L2 I&D: 1MB, 8-way set-associative, 64-byte line

Note the higher associativity for the larger caches.
The tiny L0 caches are filter caches, designed to save energy for small pieces of code.

Cache capacity
What happens when the cache is full?

Cache capacity

What happens when the cache is full?
Replacement policy
Some data in the cache must be chosen for eviction
Ideally, evict the data that has gone unused the longest
In practice:
Direct-mapped: only one block can be evicted (each block maps to exactly one location)
Set-/fully-associative:
Pick a random block to evict, or
Evict the least recently used (LRU) block

Replacement policies

Four main strategies for choosing the block to replace:
Random replacement: spread replacements evenly by picking the victim block at random
Least Recently Used (LRU): exploits temporal locality; replace the block that has gone unused the longest
First-In First-Out (FIFO): the block that entered the cache first is the first to be replaced
Least Frequently Used (LFU): replace the block that has been referenced the least
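A minimal sketch of LRU bookkeeping for one set, using access timestamps (real hardware typically uses cheaper approximations such as pseudo-LRU):

#include <stdio.h>

#define NUM_WAYS 4   /* one set of a hypothetical 4-way cache */

static unsigned long last_used[NUM_WAYS];
static unsigned long now;

/* Stamp a way with the current "time" on every access. */
static void touch(int way) { last_used[way] = ++now; }

/* The victim is the way with the oldest timestamp. */
static int lru_victim(void)
{
    int victim = 0;
    for (int way = 1; way < NUM_WAYS; way++)
        if (last_used[way] < last_used[victim])
            victim = way;
    return victim;
}

int main(void)
{
    touch(0); touch(1); touch(2); touch(3);
    touch(1);                                /* way 1 is now the most recent  */
    printf("evict way %d\n", lru_victim());  /* way 0 has been unused longest */
    return 0;
}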

Random replacement

Pick a random block to evict
Example:
A black loop and a blue loop in the code
When switching between the two loops, the data is always reused
Problem: the evicted block may be exactly the data we are about to reuse

Least recently used (LRU) replacement

Evict the block that has gone unused the longest
Example:
The black loop and the blue loop
When switching between the loops, all of the blue loop's instructions remain in the cache

Writing to the cache
What should the write policy be?

Writing to the cache

On a read (instruction fetch or load): put the data in the cache
What about writes?
Write-through:
The data is written to both the cache block and the block in main memory
Simple, but slow (every write waits for DRAM)
Write-back:
The data is written only to the cache
If the block is not in the cache, it is first brought into the cache
The cache block must be marked as modified (dirty)
When a dirty block is evicted, it is written back to main memory
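A toy C sketch of the two store paths, using a single cache line and a tiny array as "DRAM" (names and sizes are just for illustration):

#include <stdbool.h>
#include <stdio.h>

struct line { bool valid, dirty; unsigned addr, data; };
static struct line cache_line;          /* a one-line "cache" */
static unsigned dram[16];               /* a tiny "DRAM"      */

/* Write-through: every store goes to DRAM; also update the cache on a hit. */
static void store_write_through(unsigned a, unsigned v)
{
    if (cache_line.valid && cache_line.addr == a)
        cache_line.data = v;
    dram[a] = v;                        /* always pay the DRAM latency */
}

/* Write-back: the store only touches the cache and marks the line dirty;
   DRAM is updated later, when the dirty line is evicted. */
static void store_write_back(unsigned a, unsigned v)
{
    if (!(cache_line.valid && cache_line.addr == a)) {
        if (cache_line.valid && cache_line.dirty)
            dram[cache_line.addr] = cache_line.data;            /* write back victim */
        cache_line = (struct line){ true, false, a, dram[a] };  /* allocate          */
    }
    cache_line.data  = v;
    cache_line.dirty = true;
}

int main(void)
{
    store_write_through(3, 7);   /* dram[3] is updated immediately     */
    store_write_back(5, 9);      /* dram[5] stays stale until eviction */
    printf("dram[3]=%u dram[5]=%u\n", dram[3], dram[5]);
    return 0;
}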

Write-through

Always write to DRAM
If the data is also in the cache, write to the cache at the same time
Q: Writes are slow. Why?
1. It depends on where the data is in the cache
2. Slow because of the write to DRAM
3. Slow because of the writes to both DRAM and the cache
A: Slow because of the write to DRAM
Every single write goes to DRAM.

Write-through with allocate-on-write

Always write to DRAM
If the data is in the cache, also write to the cache
Allocate-on-write:
If the data is not in the cache, load the block into the cache as well
Q: Why is this better?
1. It isn't
2. Faster to read and write than just write
3. Subsequent reads will hit in the cache
A: Subsequent reads will hit in the cache
We usually access data again after writing it, and a write typically touches only part of a block (a word or a byte), so bringing the block into the cache makes the later accesses fast.

Write-back

Always write to the cache
If the data is not in the cache, load it into the cache first
Note: the cache and DRAM no longer hold the same data!
When we evict a line, how do we know whether it differs from the copy in DRAM? (This is what the dirty bit is for.)

Writing to the cache

Write-through is simple, but slow
Every write must also go to DRAM
Write-back is more complex, but fast
Dirty data must be marked in the cache and written back to main memory when it is evicted
Writes themselves are fast
Allocate-on-write loads the block into the cache on a write miss
Otherwise blocks are only loaded into the cache on reads
Caches today are write-back
(Intel's new Xeon Phi has a write-through L2 cache)

Cache performance

Miss ratio

Miss ratio = % of cache misses = (# cache misses / # memory accesses)
Q: How does performance change when the miss ratio drops from 10.5% to 3.5%?
1. Slower
2. Stays the same
3. Faster
A: Faster
The application will run faster, but the miss ratio alone does not tell us how much faster. (Note that the hit ratio is far larger than the miss ratio.)

Average Memory Access Time (AMAT)

Average number of cycles per memory access:
AMAT = (hit time) + (miss %) * (miss time + miss penalty)

Example:
Hit time = 1 (a hit takes 1 cycle)
Miss time = 1 (a miss also takes 1 cycle to detect)
Miss penalty = 3 (extra cycles after the miss)
% miss = 2/6 = 33%
AMAT = (1) + (33%)*(1 + 3) = 2.3 cycles per access

Note: the miss penalty is the time spent after the cache lookup has already happened.
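The formula as a small C helper, checked against the example above:

#include <stdio.h>

/* AMAT = hit time + miss ratio * (miss time + miss penalty) */
static double amat(double hit_time, double miss_ratio,
                   double miss_time, double miss_penalty)
{
    return hit_time + miss_ratio * (miss_time + miss_penalty);
}

int main(void)
{
    /* 1-cycle hit, 1-cycle miss detection, 3-cycle penalty, 2 misses out of 6 */
    printf("AMAT = %.2f cycles per access\n", amat(1.0, 2.0 / 6.0, 1.0, 3.0));
    return 0;
}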

Example: AMAT

Machine 1:
100 cycles to DRAM
1-cycle cache access (hit or miss)
AMAT for lbm?
With a 256kB cache: 6% miss ratio
(1) + (6%)*(1+100) = 7.06 cycles per memory access
With an 8MB cache: 3% miss ratio
(1) + (3%)*(1+100) = 4.03 cycles per memory access

Machine 2:
100 cycles to DRAM
2-cycle cache access time (hit or miss)
AMAT for bzip2?
With a 256kB cache: 1.5% miss ratio
(2) + (1.5%)*(2+100) = 3.53 cycles per memory access

Q: Why is bzip2's AMAT with a 256kB cache so close to lbm's with an 8MB cache?
1. Slower cache
2. Different miss ratios
3. Different applications
A: Slower cache
The cache takes 2 cycles on Machine 2 but only 1 cycle on Machine 1, which pushes the AMAT up. The applications also differ: the 8MB cache sees a 3% miss ratio for lbm, while bzip2 sees only a 1.5% miss ratio with the 256kB cache.

Performance impact of memory access time

How do cache misses affect performance (CPI)?
100-cycle miss penalty
3% miss ratio
1.33 memory accesses per instruction (1 for the instruction fetch + 33% for data)
CPI = 1.0 for ideal execution (no misses)
How does performance change?
CPI = CPI_execution + memory stall cycles per instruction = 1.0 + 1.33*0.03*100 = 4.99
This memory hierarchy makes the processor run 5x slower!
Caches matter enormously: they are the most important factor for performance here.
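The same CPI calculation as a short C sketch, using the numbers above:

#include <stdio.h>

int main(void)
{
    double base_cpi     = 1.0;    /* ideal execution                 */
    double accesses     = 1.33;   /* memory accesses per instruction */
    double miss_ratio   = 0.03;
    double miss_penalty = 100.0;  /* cycles                          */

    double cpi = base_cpi + accesses * miss_ratio * miss_penalty;
    printf("CPI = %.2f (%.1fx slower than ideal)\n", cpi, cpi / base_cpi);
    return 0;
}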

How much does cache matter?

Split instruction and data caches

We need to fetch an instruction (IF) and access data (MEM) in the same cycle
So we need two caches:
Instruction cache (just instructions)
Data cache (just data)
AMAT with split caches:
AMAT = (% instruction accesses)*(hit time + (I-cache miss ratio)*(miss time + miss penalty))
     + (% data accesses)*(hit time + (D-cache miss ratio)*(miss time + miss penalty))
Example:
Cache access time is 1 cycle (hit or miss) and memory access time is 100 cycles
I-cache: 1% miss ratio, D-cache: 5% miss ratio
33% of instructions are loads/stores, so 25% of accesses are data / 75% are instructions
(75%)*(1 + (1%)*(1+100)) + (25%)*(1 + (5%)*(1+100)) = 1.5 + 1.5 = 3.0
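A C sketch of the split-cache AMAT formula, using the example's numbers (it prints 3.02, which the slide rounds to 3.0):

#include <stdio.h>

/* Weight the I-cache and D-cache terms by the fraction of accesses each serves. */
static double split_amat(double i_frac, double i_miss,
                         double d_frac, double d_miss,
                         double hit, double miss_time, double penalty)
{
    return i_frac * (hit + i_miss * (miss_time + penalty))
         + d_frac * (hit + d_miss * (miss_time + penalty));
}

int main(void)
{
    /* 75% instruction / 25% data accesses, 1% and 5% miss ratios,
       1-cycle cache access and 100-cycle memory */
    printf("AMAT = %.2f\n", split_amat(0.75, 0.01, 0.25, 0.05, 1.0, 1.0, 100.0));
    return 0;
}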

AMD/Intel caches

Heterogeneous processor
caches

Memory hierarchy
We want big and fast
Build a hierarchy where we keep the most important data in fast
memory
Other data goes in slow memory
If we move the data correctly we provide the illusion of fast and
big
Registers: 3 accesses/cycle, 32-64
Cache: 1-10 cycles, 8kB-256kB
Cache: 40 cycles, 4-20MB
DRAM: 200 cycles, 4-16GB
Flash: 1000+ cycles, 64-512GB
Hard disk: 1M+ cycles, 2-4TB

Summary: how to make the memory hierarchy work
3 different cache types
Fully-associative: have to search all blocks, but very flexible
Direct-mapped: only one place for each block, no flexibility
Set-associative: only have to search one set for each block, flexible
We can adjust the block (line) size to reduce the overhead of tags
We figure out where data goes in a cache by looking at the address
Last 2 bits are the byte in the word
Next N bits are the word in the cache block
Remaining bits are for the tag
We have different write policies
Write-through: slow, simple
Write-back: fast (keeps the data just in the cache), more complex
Performance effects are due to the average memory access time
