
Carnegie Mellon

Cache Memories
15-213: Introduction to Computer Systems
10th Lecture, Sep. 23, 2010
Instructors: Randy Bryant and Dave O'Hallaron

Today
- Cache memory organization and operation
- Performance impact of caches
  - The memory mountain
  - Rearranging loops to improve spatial locality
  - Using blocking to improve temporal locality

Cache Memories
- Cache memories are small, fast SRAM-based memories managed automatically in hardware.
  - Hold frequently accessed blocks of main memory
- CPU looks first for data in caches (e.g., L1, L2, and L3), then in main memory.
- Typical system structure:
[Figure: CPU chip containing the register file, ALU, cache memories, and bus interface. The bus interface connects over the system bus to the I/O bridge, which connects over the memory bus to main memory.]

General Cache Organization (S, E, B)
- $S = 2^s$ sets
- $E = 2^e$ lines per set
- $B = 2^b$ bytes per cache block (the data)
- Each line: valid bit v | tag | bytes 0 1 2 ... B-1
Cache size: $C = S \times E \times B$ data bytes

Cache Read
- $S = 2^s$ sets, $E = 2^e$ lines per set, $B = 2^b$ bytes per cache block (the data)
- Each line: valid bit v | tag | bytes 0 1 2 ... B-1
Address of word: t bits (tag) | s bits (set index) | b bits (block offset)
- Locate set
- Check if any line in set has matching tag
- Yes + line valid: hit
- Locate data starting at offset (data begins at this offset)

Example: Direct Mapped Cache (E = 1)
Direct mapped: One line per set
Assume: cache block size 8 bytes
$S = 2^s$ sets; each line holds: v | tag | bytes 0 1 2 3 4 5 6 7
Address of int: t bits | 0...01 | 100
1. Find set: the set index bits 0...01 select set 1.
2. valid? + tag match: assume yes = hit.
3. Block offset 100 (binary) = 4: the int (4 bytes) is in bytes 4-7 of the block.
No match: old line is evicted and replaced.
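
The tag / set index / block offset split used in the lookup above can be sketched in C. This is a minimal illustration, not code from the lecture: the type `addr_parts` and the function `split_address` are hypothetical names, and the sketch assumes the address fits in 64 bits.

```c
#include <assert.h>
#include <stdint.h>

/* The three fields of an address for a cache with
   S = 2^s sets and B = 2^b bytes per block. */
typedef struct {
    uint64_t tag;     /* upper t bits */
    uint64_t set;     /* middle s bits: the set index */
    uint64_t offset;  /* low b bits: the block offset */
} addr_parts;

/* Split addr into tag, set index, and block offset (hypothetical helper). */
addr_parts split_address(uint64_t addr, unsigned s, unsigned b)
{
    addr_parts p;
    assert(s + b < 64);  /* keeps every shift below well defined */
    p.offset = addr & ((1ULL << b) - 1);
    p.set    = (addr >> b) & ((1ULL << s) - 1);
    p.tag    = addr >> (s + b);
    return p;
}
```

For example, with s = 2 and b = 1, address 8 (binary 1000) splits into tag 1, set 0, offset 0.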

Direct-Mapped Cache Simulation
M = 16 byte addresses, B = 2 bytes/block, S = 4 sets, E = 1 block/set
Address bits: t = 1, s = 2, b = 1 (x | xx | x)
Address trace (reads, one byte per read):
  0 [0000₂]  miss
  1 [0001₂]  hit
  7 [0111₂]  miss
  8 [1000₂]  miss
  0 [0000₂]  miss
Cache contents after the trace (set 0 held M[0-1], then M[8-9] with tag 1, then M[0-1] again):
         v  Tag  Block
  Set 0  1  0    M[0-1]
  Set 1  0  ?    ?
  Set 2  0  ?    ?
  Set 3  1  0    M[6-7]
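
The trace above can be replayed with a small simulator. The sketch below hard-codes this slide's geometry (s = 2, b = 1, one line per set). The names `dm_line` and `simulate_direct_mapped` are made up for illustration.

```c
#include <assert.h>
#include <string.h>

enum { S_BITS = 2, B_BITS = 1, NUM_SETS = 1 << S_BITS };

typedef struct {
    int valid;
    unsigned tag;
} dm_line;

/* Replays n byte reads on a cold direct-mapped cache.
   Writes 'h' (hit) or 'm' (miss) per access into out, which needs
   room for n + 1 chars. Returns the number of misses. */
int simulate_direct_mapped(const unsigned *trace, int n, char *out)
{
    dm_line cache[NUM_SETS];
    int misses = 0;
    assert(n >= 0);
    memset(cache, 0, sizeof cache);  /* cold cache: all valid bits 0 */
    for (int k = 0; k < n; k++) {
        unsigned set = (trace[k] >> B_BITS) & (NUM_SETS - 1);
        unsigned tag = trace[k] >> (B_BITS + S_BITS);
        if (cache[set].valid && cache[set].tag == tag) {
            out[k] = 'h';
        } else {
            out[k] = 'm';
            misses++;
            cache[set].valid = 1;    /* replace whatever the set held */
            cache[set].tag = tag;
        }
    }
    out[n] = '\0';
    return misses;
}
```

Feeding it the trace {0, 1, 7, 8, 0} reproduces the slide: miss, hit, miss, miss, miss. The last miss is a conflict miss, because addresses 0 and 8 map to the same set.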

A Higher Level Example
Assume: cold (empty) cache, a[0][0] goes at the start of a block
Block size 32 B = 4 doubles
Ignore the variables sum, i, j

double sum_array_rows(double a[16][16])
{
    int i, j;
    double sum = 0;

    /* Row-wise traversal: stride-1 accesses */
    for (i = 0; i < 16; i++)
        for (j = 0; j < 16; j++)
            sum += a[i][j];
    return sum;
}

double sum_array_cols(double a[16][16])
{
    int i, j;
    double sum = 0;

    /* Column-wise traversal: consecutive accesses are one row (16 doubles) apart */
    for (j = 0; j < 16; j++)
        for (i = 0; i < 16; i++)
            sum += a[i][j];
    return sum;
}

(worked on the blackboard)

E-way Set Associative Cache (Here: E = 2)
E = 2: Two lines per set
Assume: cache block size 8 bytes
Each line holds: v | tag | bytes 0 1 2 3 4 5 6 7
Address of short int: t bits | 0...01 | 100
1. Find set: the set index bits 0...01 select set 1.
2. Compare both lines in the set: valid? + tag match: yes = hit.
3. Block offset 100 (binary) = 4: the short int (2 bytes) is in bytes 4-5 of the block.
No match:
- One line in set is selected for eviction and replacement
- Replacement policies: random, least recently used (LRU), ...

2-Way Set Associative Cache Simulation
M = 16 byte addresses, B = 2 bytes/block, S = 2 sets, E = 2 blocks/set
Address bits: t = 2, s = 1, b = 1 (xx | x | x)
Address trace (reads, one byte per read):
  0 [0000₂]  miss
  1 [0001₂]  hit
  7 [0111₂]  miss
  8 [1000₂]  miss
  0 [0000₂]  hit
Cache contents after the trace:
         v  Tag  Block
  Set 0  1  00   M[0-1]
         1  10   M[8-9]
  Set 1  1  01   M[6-7]
         0  ?    ?
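
The same trace can be replayed on a 2-way cache. The sketch below assumes LRU replacement, which is one of the policies the lecture lists. The slide itself does not depend on the policy, because no set ever overflows in this trace. The names `sa_line` and `simulate_two_way` are made up for illustration.

```c
#include <assert.h>
#include <string.h>

enum { SA_S_BITS = 1, SA_B_BITS = 1, SA_SETS = 1 << SA_S_BITS, SA_WAYS = 2 };

typedef struct {
    int valid;
    unsigned tag;
    int last_used;  /* 0 = never used; otherwise 1 + index of last access */
} sa_line;

/* Replays n byte reads on a cold 2-way set associative cache with LRU.
   Writes 'h' or 'm' per access into out (room for n + 1 chars needed).
   Returns the number of misses. */
int simulate_two_way(const unsigned *trace, int n, char *out)
{
    sa_line cache[SA_SETS][SA_WAYS];
    int misses = 0;
    assert(n >= 0);
    memset(cache, 0, sizeof cache);
    for (int k = 0; k < n; k++) {
        unsigned set = (trace[k] >> SA_B_BITS) & (SA_SETS - 1);
        unsigned tag = trace[k] >> (SA_B_BITS + SA_S_BITS);
        sa_line *lines = cache[set];
        int hit = -1, victim = 0;
        for (int w = 0; w < SA_WAYS; w++) {
            if (lines[w].valid && lines[w].tag == tag)
                hit = w;
            /* Smallest timestamp wins; invalid lines have timestamp 0. */
            if (lines[w].last_used < lines[victim].last_used)
                victim = w;
        }
        if (hit < 0) {
            out[k] = 'm';
            misses++;
            hit = victim;
            lines[hit].valid = 1;
            lines[hit].tag = tag;
        } else {
            out[k] = 'h';
        }
        lines[hit].last_used = k + 1;
    }
    out[n] = '\0';
    return misses;
}
```

On {0, 1, 7, 8, 0} this gives miss, hit, miss, miss, hit. The final read of address 0 now hits because M[0-1] and M[8-9] can share set 0.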

A Higher Level Example
Assume: cold (empty) cache, a[0][0] goes at the start of a block
Block size 32 B = 4 doubles
Ignore the variables sum, i, j

double sum_array_rows(double a[16][16])
{
    int i, j;
    double sum = 0;

    /* Row-wise traversal: stride-1 accesses */
    for (i = 0; i < 16; i++)
        for (j = 0; j < 16; j++)
            sum += a[i][j];
    return sum;
}

double sum_array_cols(double a[16][16])
{
    int i, j;
    double sum = 0;

    /* Column-wise traversal: consecutive accesses are one row (16 doubles) apart */
    for (j = 0; j < 16; j++)
        for (i = 0; i < 16; i++)
            sum += a[i][j];
    return sum;
}

(worked on the blackboard)

What about writes?
- Multiple copies of data exist:
  - L1, L2, Main Memory, Disk
- What to do on a write-hit?
  - Write-through (write immediately to memory)
  - Write-back (defer write to memory until replacement of line)
    - Need a dirty bit (line different from memory or not)
- What to do on a write-miss?
  - Write-allocate (load into cache, update line in cache)
    - Good if more writes to the location follow
  - No-write-allocate (writes immediately to memory)
- Typical
  - Write-through + No-write-allocate
  - Write-back + Write-allocate
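
To see why write-back pairs well with write-allocate, the sketch below counts memory writes for a sequence of stores under both typical combinations. It is a deliberately tiny model and not from the lecture: the cache holds a single line, block loads are not counted, and the names `wb_line`, `writes_write_through`, and `writes_write_back` are hypothetical.

```c
#include <assert.h>

typedef struct {
    int valid;
    int dirty;       /* 1 if the cached block differs from memory */
    unsigned block;  /* number of the block currently cached */
} wb_line;

/* Write-through + no-write-allocate: every store goes straight to memory. */
int writes_write_through(int n)
{
    assert(n >= 0);
    return n;
}

/* Write-back + write-allocate on a one-line cache.
   blocks[k] is the block number that store k targets.
   Returns the number of block writes to memory, including a final flush. */
int writes_write_back(const unsigned *blocks, int n)
{
    wb_line line = {0, 0, 0};
    int mem_writes = 0;
    assert(n >= 0);
    for (int k = 0; k < n; k++) {
        if (!line.valid || line.block != blocks[k]) {
            if (line.valid && line.dirty)
                mem_writes++;      /* write the dirty victim back */
            line.valid = 1;        /* write-allocate: load the block */
            line.block = blocks[k];
        }
        line.dirty = 1;            /* the store updates only the cache */
    }
    if (line.valid && line.dirty)
        mem_writes++;              /* flush the last dirty line */
    return mem_writes;
}
```

For stores to blocks {3, 3, 3, 3, 5}, write-through issues 5 memory writes, while this write-back model issues 2: one when block 3 is evicted and one at the final flush.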

Intel Core i7 Cache Hierarchy
[Figure: processor package with cores 0 through 3. Each core has registers, an L1 d-cache, an L1 i-cache, and an L2 unified cache. All cores share an L3 unified cache, which connects to main memory.]
L1 i-cache and d-cache: 32 KB, 8-way, access: 4 cycles
L2 unified cache: 256 KB, 8-way, access: 11 cycles
L3 unified cache: 8 MB, 16-way, access: 30-40 cycles
Block size: 64 bytes for all caches.

Cache Performance Metrics
- Miss Rate
  - Fraction of memory references not found in cache (misses / accesses) = 1 - hit rate
  - Typical numbers (in percentages):
    - 3-10% for L1
    - can be quite small (e.g., < 1%) for L2, depending on size, etc.
- Hit Time
  - Time to deliver a line in the cache to the processor
    - includes time to determine whether the line is in the cache
  - Typical numbers:
    - 1-2 clock cycles for L1
    - 3-20 clock cycles for L2
- Miss Penalty
  - Additional time required because of a miss
    - typically 30-200 cycles for main memory (Trend: increasing!)

Let's think about those numbers
- Huge difference between a hit and a miss
  - Could be 100x, if just L1 and main memory
- Would you believe 99% hits is twice as good as 97%?
  - Consider:
    cache hit time of 1 cycle
    miss penalty of 100 cycles
  - Average access time:
    97% hits: 1 cycle + 0.03 * 100 cycles = 4 cycles
    99% hits: 1 cycle + 0.01 * 100 cycles = 2 cycles
- This is why "miss rate" is used instead of "hit rate"
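
The arithmetic above is the usual formula: average access time = hit time + miss rate * miss penalty. A one-function sketch (the name `average_access_time` is chosen here for illustration) makes it easy to try other numbers:

```c
#include <assert.h>

/* Average access time in cycles: hit time + miss rate * miss penalty. */
double average_access_time(double hit_time, double miss_rate,
                           double miss_penalty)
{
    assert(miss_rate >= 0.0 && miss_rate <= 1.0);
    return hit_time + miss_rate * miss_penalty;
}
```

Calling `average_access_time(1.0, 0.03, 100.0)` gives about 4 cycles and `average_access_time(1.0, 0.01, 100.0)` gives about 2 cycles, matching the slide.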

Writing Cache Friendly Code
- Make the common case go fast
  - Focus on the inner loops of the core functions
- Minimize the misses in the inner loops
  - Repeated references to variables are good (temporal locality)
  - Stride-1 reference patterns are good (spatial locality)
Key idea: Our qualitative notion of locality is quantified through our understanding of cache memories.

Today
- Cache organization and operation
- Performance impact of caches
  - The memory mountain
  - Rearranging loops to improve spatial locality
  - Using blocking to improve temporal locality

The Memory Mountain
- Read throughput (read bandwidth)
  - Number of bytes read from memory per second (MB/s)
- Memory mountain: Measured read throughput as a function of spatial and temporal locality.
  - Compact way to characterize memory system performance.

Memory Mountain Test Function
/* The test function; data is an array declared elsewhere */
void test(int elems, int stride) {
    int i, result = 0;
    volatile int sink;

    for (i = 0; i < elems; i += stride)
        result += data[i];
    sink = result; /* So compiler doesn't optimize away the loop */
}

/* Run test(elems, stride) and return read throughput (MB/s);
   fcyc2 is a timing routine not shown on this slide */
double run(int size, int stride, double Mhz)
{
    double cycles;
    int elems = size / sizeof(int);

    test(elems, stride);                     /* warm up the cache */
    cycles = fcyc2(test, elems, stride, 0);  /* call test(elems,stride) */
    return (size / stride) / (cycles / Mhz); /* convert cycles to MB/s */
}

The Memory Mountain
[Figure: 3-D surface plot of read throughput (MB/s, 0 to 7000) against stride (s1 to s32, x8 bytes) and working set size (2 KB to 64 MB).]
Annotations on the plot:
- Slopes of spatial locality
- Ridges of temporal locality, labeled L1, L2, L3, and Mem
Intel Core i7
32 KB L1 i-cache
32 KB L1 d-cache
256 KB unified L2 cache
8 MB unified L3 cache
All caches on-chip

Today
- Cache organization and operation
- Performance impact of caches
  - The memory mountain
  - Rearranging loops to improve spatial locality
  - Using blocking to improve temporal locality

Miss Rate Analysis for Matrix Multiply
- Assume:
  - Line size = 32B (big enough for four 64-bit words)
  - Matrix dimension (n) is very large
    - Approximate 1/n as 0.0
  - Cache is not even big enough to hold multiple rows
- Analysis Method:
  - Look at access pattern of inner loop
[Figure: C (row i, column j) = A (row i, index k) x B (index k, column j)]

Matrix Multiplication Example
- Description:
  - Multiply n x n matrices
  - $O(n^3)$ total operations
  - n reads per source element
  - n values summed per destination
    - but may be able to hold in register

/* ijk */
for (i=0; i<n; i++) {
    for (j=0; j<n; j++) {
        sum = 0.0;                       /* variable sum held in register */
        for (k=0; k<n; k++)
            sum += a[i][k] * b[k][j];
        c[i][j] = sum;
    }
}

Layout of C Arrays in Memory (review)
- C arrays allocated in row-major order
  - each row in contiguous memory locations
- Stepping through columns in one row:
  - for (i = 0; i < N; i++)
        sum += a[0][i];
  - accesses successive elements
  - if block size (B) > 4 bytes, exploit spatial locality
    - compulsory miss rate = 4 bytes / B
- Stepping through rows in one column:
  - for (i = 0; i < n; i++)
        sum += a[i][0];
  - accesses distant elements
  - no spatial locality!
    - compulsory miss rate = 1 (i.e. 100%)
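
The two miss rates above follow directly from the byte stride of each walk. The sketch below is an illustration under the slide's assumptions (int elements, a cold cache, no reuse). The helper names `row_major_offset` and `stride_miss_rate` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Byte offset of a[i][j] from the start of an int array with
   `cols` columns, stored in row-major order. */
size_t row_major_offset(size_t i, size_t j, size_t cols)
{
    return (i * cols + j) * sizeof(int);
}

/* Estimated compulsory miss rate for a walk with a fixed byte stride:
   stride / B, capped at 1 once the stride reaches the block size. */
double stride_miss_rate(size_t stride_bytes, size_t block_bytes)
{
    assert(block_bytes > 0);
    if (stride_bytes >= block_bytes)
        return 1.0;
    return (double)stride_bytes / (double)block_bytes;
}
```

Walking along a row moves `sizeof(int)` bytes per step, so with 4-byte ints and B = 32 the estimate is 4/32 = 0.125. Walking down a column moves `cols * sizeof(int)` bytes per step, which exceeds the block size for any reasonably wide array, so every access misses.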

Matrix Multiplication (ijk)
/* ijk */
for (i=0; i<n; i++) {
    for (j=0; j<n; j++) {
        sum = 0.0;
        for (k=0; k<n; k++)
            sum += a[i][k] * b[k][j];
        c[i][j] = sum;
    }
}

Inner loop:          A          B            C
  Elements touched   (i,*)      (*,j)        (i,j)
  Access pattern     Row-wise   Column-wise  Fixed
  Misses/iteration   0.25       1.0          0.0

Matrix Multiplication (jik)
/* jik */
for (j=0; j<n; j++) {
    for (i=0; i<n; i++) {
        sum = 0.0;
        for (k=0; k<n; k++)
            sum += a[i][k] * b[k][j];
        c[i][j] = sum;
    }
}

Inner loop:          A          B            C
  Elements touched   (i,*)      (*,j)        (i,j)
  Access pattern     Row-wise   Column-wise  Fixed
  Misses/iteration   0.25       1.0          0.0

Matrix Multiplication (kij)
/* kij */
for (k=0; k<n; k++) {
    for (i=0; i<n; i++) {
        r = a[i][k];
        for (j=0; j<n; j++)
            c[i][j] += r * b[k][j];
    }
}

Inner loop:          A          B            C
  Elements touched   (i,k)      (k,*)        (i,*)
  Access pattern     Fixed      Row-wise     Row-wise
  Misses/iteration   0.0        0.25         0.25

Matrix Multiplication (ikj)
/* ikj */
for (i=0; i<n; i++) {
    for (k=0; k<n; k++) {
        r = a[i][k];
        for (j=0; j<n; j++)
            c[i][j] += r * b[k][j];
    }
}

Inner loop:          A          B            C
  Elements touched   (i,k)      (k,*)        (i,*)
  Access pattern     Fixed      Row-wise     Row-wise
  Misses/iteration   0.0        0.25         0.25

Matrix Multiplication (jki)
/* jki */
for (j=0; j<n; j++) {
    for (k=0; k<n; k++) {
        r = b[k][j];
        for (i=0; i<n; i++)
            c[i][j] += a[i][k] * r;
    }
}

Inner loop:          A            B          C
  Elements touched   (*,k)        (k,j)      (*,j)
  Access pattern     Column-wise  Fixed      Column-wise
  Misses/iteration   1.0          0.0        1.0

Matrix Multiplication (kji)
/* kji */
for (k=0; k<n; k++) {
    for (j=0; j<n; j++) {
        r = b[k][j];
        for (i=0; i<n; i++)
            c[i][j] += a[i][k] * r;
    }
}

Inner loop:          A            B          C
  Elements touched   (*,k)        (k,j)      (*,j)
  Access pattern     Column-wise  Fixed      Column-wise
  Misses/iteration   1.0          0.0        1.0

Summary of Matrix Multiplication

ijk (& jik):
- 2 loads, 0 stores
- misses/iter = 1.25

for (i=0; i<n; i++) {
    for (j=0; j<n; j++) {
        sum = 0.0;
        for (k=0; k<n; k++)
            sum += a[i][k] * b[k][j];
        c[i][j] = sum;
    }
}

kij (& ikj):
- 2 loads, 1 store
- misses/iter = 0.5

for (k=0; k<n; k++) {
    for (i=0; i<n; i++) {
        r = a[i][k];
        for (j=0; j<n; j++)
            c[i][j] += r * b[k][j];
    }
}

jki (& kji):
- 2 loads, 1 store
- misses/iter = 2.0

for (j=0; j<n; j++) {
    for (k=0; k<n; k++) {
        r = b[k][j];
        for (i=0; i<n; i++)
            c[i][j] += a[i][k] * r;
    }
}
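
The three loop orders compute the same product and differ only in how they walk memory. The sketch below wraps each in a function and checks that they agree. The fixed size `MM_N`, the `mat` typedef, the integer-valued test inputs, and all function names are assumptions made for this illustration. The kij and jki versions accumulate into c, so the sketch zeroes c first.

```c
#include <assert.h>

enum { MM_N = 8 };  /* small fixed size, chosen only for illustration */
typedef double mat[MM_N][MM_N];

static void zero_matrix(mat c)
{
    for (int i = 0; i < MM_N; i++)
        for (int j = 0; j < MM_N; j++)
            c[i][j] = 0.0;
}

void mm_ijk(mat a, mat b, mat c)
{
    for (int i = 0; i < MM_N; i++)
        for (int j = 0; j < MM_N; j++) {
            double sum = 0.0;
            for (int k = 0; k < MM_N; k++)
                sum += a[i][k] * b[k][j];
            c[i][j] = sum;
        }
}

void mm_kij(mat a, mat b, mat c)
{
    zero_matrix(c);
    for (int k = 0; k < MM_N; k++)
        for (int i = 0; i < MM_N; i++) {
            double r = a[i][k];
            for (int j = 0; j < MM_N; j++)
                c[i][j] += r * b[k][j];
        }
}

void mm_jki(mat a, mat b, mat c)
{
    zero_matrix(c);
    for (int j = 0; j < MM_N; j++)
        for (int k = 0; k < MM_N; k++) {
            double r = b[k][j];
            for (int i = 0; i < MM_N; i++)
                c[i][j] += a[i][k] * r;
        }
}

/* Integer-valued inputs, so sums are exact in any order. */
void fill_inputs(mat a, mat b)
{
    for (int i = 0; i < MM_N; i++)
        for (int j = 0; j < MM_N; j++) {
            a[i][j] = i + j;
            b[i][j] = i - j;
        }
}

/* Aborts via assert if any two orderings disagree on any element. */
void check_orderings_agree(void)
{
    mat a, b, c1, c2, c3;
    fill_inputs(a, b);
    mm_ijk(a, b, c1);
    mm_kij(a, b, c2);
    mm_jki(a, b, c3);
    for (int i = 0; i < MM_N; i++)
        for (int j = 0; j < MM_N; j++)
            assert(c1[i][j] == c2[i][j] && c1[i][j] == c3[i][j]);
}
```

With real-valued inputs the orderings can differ in the last bits because floating-point addition is not associative, which is why the check uses small integers.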

Core i7 Matrix Multiply Performance
[Figure: line plot of cycles per inner loop iteration (0 to 60) against array size n (50 to 750) for the six loop orderings. The curves fall into three pairs: jki / kji, ijk / jik, and kij / ikj.]

Today
- Cache organization and operation
- Performance impact of caches
  - The memory mountain
  - Rearranging loops to improve spatial locality
  - Using blocking to improve temporal locality

Example: Matrix Multiplication
c = (double *) calloc(n*n, sizeof(double));

/* Multiply n x n matrices a and b */
void mmm(double *a, double *b, double *c, int n) {
    int i, j, k;
    for (i = 0; i < n; i++)
        for (j = 0; j < n; j++)
            for (k = 0; k < n; k++)
                c[i*n+j] += a[i*n + k]*b[k*n + j];
}

[Figure: c = a * b; element (i, j) of c is computed from row i of a and column j of b.]

Cache Miss Analysis
- Assume:
  - Matrix elements are doubles
  - Cache block = 8 doubles
  - Cache size C << n (much smaller than n)
- First iteration:
  - $n/8 + n = 9n/8$ misses ($n/8$ for the row of a, $n$ for the column of b)
  - Afterwards in cache (schematic): [Figure: the blocks left in the cache; the strip of b is 8 wide]
- Second iteration:
  - Again: $n/8 + n = 9n/8$ misses
- Total misses:
  - $9n/8 \cdot n^2 = (9/8)\,n^3$

Blocked Matrix Multiplication
c = (double *) calloc(n*n, sizeof(double));

/* Multiply n x n matrices a and b, accumulating into c.
   Assumes the block size B divides n. */
void mmm(double *a, double *b, double *c, int n) {
    int i, j, k, i1, j1, k1;
    for (i = 0; i < n; i+=B)
        for (j = 0; j < n; j+=B)
            for (k = 0; k < n; k+=B)
                /* B x B mini matrix multiplications */
                for (i1 = i; i1 < i+B; i1++)
                    for (j1 = j; j1 < j+B; j1++)
                        for (k1 = k; k1 < k+B; k1++)
                            c[i1*n+j1] += a[i1*n + k1]*b[k1*n + j1];
}

[Figure: c = c + a * b, computed one B x B block at a time; i1 and j1 index within the current block of c. Block size B x B.]
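
A self-contained variant of the blocked loop can be checked against the unblocked one. The sketch below is not the lecture's code: it takes the block size as a runtime parameter `bsize` and clamps the tile bounds so that n need not be a multiple of the block size. The names `mmm_blocked`, `mmm_naive`, and `blocked_matches_naive` are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

static int min_int(int x, int y) { return x < y ? x : y; }

/* c += a * b for n x n row-major matrices, in bsize x bsize tiles. */
void mmm_blocked(const double *a, const double *b, double *c,
                 int n, int bsize)
{
    assert(n > 0 && bsize > 0);
    for (int i = 0; i < n; i += bsize)
        for (int j = 0; j < n; j += bsize)
            for (int k = 0; k < n; k += bsize) {
                /* Clamp so a ragged last tile stays in range. */
                int i_end = min_int(i + bsize, n);
                int j_end = min_int(j + bsize, n);
                int k_end = min_int(k + bsize, n);
                for (int i1 = i; i1 < i_end; i1++)
                    for (int j1 = j; j1 < j_end; j1++)
                        for (int k1 = k; k1 < k_end; k1++)
                            c[i1*n + j1] += a[i1*n + k1] * b[k1*n + j1];
            }
}

/* Reference: the unblocked triple loop. */
void mmm_naive(const double *a, const double *b, double *c, int n)
{
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            for (int k = 0; k < n; k++)
                c[i*n + j] += a[i*n + k] * b[k*n + j];
}

/* Returns 1 if both versions give identical results on integer-valued
   test inputs, 0 on mismatch or allocation failure. */
int blocked_matches_naive(int n, int bsize)
{
    assert(n > 0 && bsize > 0);
    size_t count = (size_t)n * (size_t)n;
    double *a = malloc(count * sizeof *a);
    double *b = malloc(count * sizeof *b);
    double *c1 = calloc(count, sizeof *c1);  /* zeroed accumulators */
    double *c2 = calloc(count, sizeof *c2);
    int same = (a && b && c1 && c2);
    if (same) {
        for (size_t x = 0; x < count; x++) {
            a[x] = (double)(x % 7);
            b[x] = (double)(x % 5) - 2.0;
        }
        mmm_naive(a, b, c1, n);
        mmm_blocked(a, b, c2, n, bsize);
        for (size_t x = 0; x < count; x++)
            if (c1[x] != c2[x])
                same = 0;
    }
    free(a); free(b); free(c1); free(c2);
    return same;
}
```

Clamping costs a few extra comparisons per tile. The alternative is to pad the matrices up to a multiple of the block size.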

Cache Miss Analysis
- Assume:
  - Cache block = 8 doubles
  - Cache size C << n (much smaller than n)
  - Three B x B blocks fit into cache: $3B^2 < C$
- First (block) iteration:
  - $B^2/8$ misses for each block
  - $2n/B \cdot B^2/8 = nB/4$ misses (omitting matrix c)
  - Afterwards in cache (schematic): [Figure: block size B x B, n/B blocks along each dimension]
- Second (block) iteration:
  - Same as first iteration
  - $2n/B \cdot B^2/8 = nB/4$
- Total misses:
  - $nB/4 \cdot (n/B)^2 = n^3/(4B)$

Summary
- No blocking: $(9/8)\,n^3$
- Blocking: $1/(4B) \cdot n^3$
- Suggest largest possible block size B, but limit $3B^2 < C$!
- Reason for dramatic difference:
  - Matrix multiplication has inherent temporal locality:
    - Input data: $3n^2$, computation $2n^3$
    - Every array element used $O(n)$ times!
  - But program has to be written properly
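
The two estimates and the block-size limit can be put into a few lines of C. This sketch assumes the cache capacity C is counted in doubles, which is how the bound $3B^2 < C$ reads when each B x B block holds $B^2$ doubles. The function names are made up for illustration.

```c
#include <assert.h>

/* Estimated misses without blocking: (9/8) * n^3. */
double misses_unblocked(double n)
{
    return 9.0 / 8.0 * n * n * n;
}

/* Estimated misses with B x B blocking: n^3 / (4B). */
double misses_blocked(double n, double b)
{
    assert(b > 0.0);
    return n * n * n / (4.0 * b);
}

/* Largest block size B with 3 * B^2 < C, where C is the cache
   capacity in doubles (an assumed unit). Returns 0 if no B fits. */
int largest_block_size(long cache_doubles)
{
    int b = 0;
    while (3L * (b + 1) * (b + 1) < cache_doubles)
        b++;
    return b;
}
```

Dividing the two estimates gives a ratio of 4.5 * B, independent of n. For a capacity of 4096 doubles (32 KB), `largest_block_size` returns 36, so under this model blocking would cut misses by a factor of roughly 160.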

Concluding Observations
- Programmer can optimize for cache performance
  - How data structures are organized
  - How data are accessed
    - Nested loop structure
    - Blocking is a general technique
- All systems favor "cache friendly code"
  - Getting absolute optimum performance is very platform specific
    - Cache sizes, line sizes, associativities, etc.
  - Can get most of the advantage with generic code
    - Keep working set reasonably small (temporal locality)
    - Use small strides (spatial locality)
