Systems I: Locality and Caching
Locality of reference
Cache principles
Multi-level caches
Locality
Principle of Locality: programs tend to reuse data and instructions near those they have used recently, or that were recently referenced themselves.
Temporal locality: recently referenced items are likely to be referenced again in the near future.
Spatial locality: items with nearby addresses tend to be referenced close together in time.
Locality Example:
sum = 0;
for (i = 0; i < n; i++)
    sum += a[i];
return sum;
Data
Reference array elements in succession
(stride-1 reference pattern): Spatial locality
Reference sum each iteration: Temporal locality
Instructions
Reference instructions in sequence: Spatial locality
Cycle through loop repeatedly: Temporal locality
Locality Example
Claim: Being able to look at code and get a qualitative
sense of its locality is a key skill for a professional
programmer.
Question: Does this function have good locality?
int sumarrayrows(int a[M][N])
{
    int i, j, sum = 0;
    for (i = 0; i < M; i++)
        for (j = 0; j < N; j++)
            sum += a[i][j];
    return sum;
}
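Answer: yes. C stores two-dimensional arrays in row-major order, so the inner loop above walks adjacent words in memory (stride-1). For contrast, here is a sketch of a column-wise variant (the name sumarraycols is ours, not from the slides) whose inner loop jumps N ints per access:

```c
#include <assert.h>

#define M 4
#define N 5

/* Row-major scan: a[i][j] and a[i][j+1] are adjacent in memory,
 * so the inner loop is stride-1 (good spatial locality). */
int sumarrayrows(int a[M][N])
{
    int i, j, sum = 0;
    for (i = 0; i < M; i++)
        for (j = 0; j < N; j++)
            sum += a[i][j];
    return sum;
}

/* Column-wise scan: consecutive accesses are N ints apart
 * (stride-N), so spatial locality is poor when N is large. */
int sumarraycols(int a[M][N])
{
    int i, j, sum = 0;
    for (j = 0; j < N; j++)
        for (i = 0; i < M; i++)
            sum += a[i][j];
    return sum;
}
```

Both functions return the same sum; only the traversal order, and hence the cache behavior, differs.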
Locality Example
Question: Can you permute the loops so that the
function scans the 3-d array a[] with a stride-1
reference pattern (and thus has good spatial
locality)?
int sumarray3d(int a[M][N][N])
{
    int i, j, k, sum = 0;
    for (i = 0; i < M; i++)
        for (j = 0; j < N; j++)
            for (k = 0; k < N; k++)
                sum += a[k][i][j];
    return sum;
}
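One stride-1 permutation is to make the first subscript (k) the outermost loop and the last subscript (j) the innermost, so successive iterations touch adjacent memory locations. Note that k indexes the first dimension, so its bound must be M. A self-contained sketch:

```c
#include <assert.h>

#define M 2
#define N 3

/* Stride-1 scan of a[M][N][N]: loop order matches subscript order
 * in a[k][i][j] (k outermost, j innermost), so the innermost loop
 * walks consecutive ints. */
int sumarray3d(int a[M][N][N])
{
    int i, j, k, sum = 0;
    for (k = 0; k < M; k++)      /* k indexes the first dimension */
        for (i = 0; i < N; i++)
            for (j = 0; j < N; j++)
                sum += a[k][i][j];
    return sum;
}
```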
Memory Hierarchies
Some fundamental and enduring properties of
hardware and software:
Fast storage technologies cost more per byte and have less
capacity.
The hierarchy, from small and fast at the top to larger, slower, and cheaper (per byte) storage devices at the bottom:
L0: registers
L1: on-chip L1 cache (SRAM)
L2: off-chip L2 cache (SRAM)
L3: main memory (DRAM)
L4: local secondary storage (local disks)
L5: remote secondary storage (remote server disks)
Caches
Cache: A smaller, faster storage device that acts as a staging area
for a subset of the data in a larger, slower device.
Fundamental idea of a memory hierarchy:
Programs tend to access the data at level k more often than they
access the data at level k+1.
Thus, the storage at level k+1 can be slower, and therefore larger and cheaper per bit.
Net effect: A large pool of memory that costs as much as the cheap
storage near the bottom, but that serves data to programs at the
rate of the fast storage near the top.
Use combination of small fast memory and big slow memory to
give illusion of big fast memory.
[Figure: general caching concepts. A small cache at level k holds a few blocks; level k+1 holds all blocks 0-15.
Cache hit: the program requests block 14; block 14 is already stored at level k, so the request is satisfied from the cache.
Cache miss: the program requests block 12; block 12 is not at level k, so it is fetched from level k+1 and placed at level k, evicting a resident block.]
Conflict miss: most caches restrict blocks from level k+1 to a small subset of the positions at level k, so blocks that map to the same position can keep evicting one another even when the cache is not full.
Capacity miss: occurs when the set of active cache blocks (the working set) is larger than the cache.
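These miss types can be seen in a toy simulation. Below is a minimal sketch (our own, not from the slides) of a direct-mapped cache with four one-block sets: blocks 0 and 8 both map to set 0, so alternating between them produces conflict misses even though the cache is mostly empty.

```c
#include <assert.h>

#define NSETS 4

/* Direct-mapped cache of NSETS one-block sets:
 * block b maps to set b % NSETS. */
typedef struct {
    int valid[NSETS];
    int tag[NSETS];   /* which block currently occupies each set */
} Cache;

/* Returns 1 on a hit, 0 on a miss (the block is then fetched
 * from level k+1, evicting whatever occupied the set). */
int access_block(Cache *c, int block)
{
    int set = block % NSETS;
    if (c->valid[set] && c->tag[set] == block)
        return 1;              /* hit */
    c->valid[set] = 1;         /* miss: install block in its set */
    c->tag[set] = block;
    return 0;
}
```

Referencing blocks 0, 8, 0, 8, ... misses every time: both map to set 0 and repeatedly evict each other, even though sets 1-3 stay empty.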
Examples of caching in the memory hierarchy:

Cache Type       What Cached            Where Cached          Latency (cycles)   Managed By
Registers        4-byte word            CPU registers         0                  Compiler
TLB              Address translations   On-Chip TLB           0                  Hardware
L1 cache         32-byte block          On-Chip L1            1                  Hardware
L2 cache         32-byte block          Off-Chip L2           10                 Hardware
Virtual Memory   4-KB page              Main memory           100                Hardware+OS
Buffer cache     Parts of files         Main memory           100                OS
Network cache    Parts of files         Local disk            10,000,000         AFS/NFS client
Browser cache    Web pages              Local disk            10,000,000         Web browser
Web cache        Web pages              Remote server disks   1,000,000,000      Web proxy server
Cache Memories
Cache memories are small, fast SRAM-based memories
managed automatically in hardware.
CPU looks first for data in L1, then in L2, then in main
memory.
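The benefit of checking L1 first can be estimated with the average memory access time (AMAT). The latencies and miss rates below are illustrative assumptions, not measurements of any particular machine:

```c
#include <assert.h>

/* Average memory access time for a two-level cache hierarchy:
 * AMAT = L1_time + L1_miss_rate * (L2_time + L2_miss_rate * mem_time)
 * All times are in cycles; miss rates are fractions in [0, 1]. */
double amat(double l1_time, double l1_miss,
            double l2_time, double l2_miss, double mem_time)
{
    return l1_time + l1_miss * (l2_time + l2_miss * mem_time);
}
```

For example, with a 1-cycle L1, 10-cycle L2, 100-cycle memory, a 5% L1 miss rate, and a 20% L2 miss rate, AMAT = 1 + 0.05 * (10 + 0.2 * 100) = 2.5 cycles: most accesses are served at nearly L1 speed.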
Typical bus structure:
[Figure: the CPU chip contains the register file, ALU, and L1 cache; a cache bus connects it to the off-chip L2 cache; the system bus connects the chip's bus interface to an I/O bridge, which connects through the memory bus to main memory.]
Multi-Level Caches
Options: separate data and instruction caches, or a
unified cache
[Figure: processor with registers and split L1 d-cache / L1 i-cache, backed by a unified L2 cache, main memory, and disk; each level down is larger, slower, and cheaper per byte.]

             size          speed   $/Mbyte    line size
Regs         200 B         3 ns               8 B
L1 caches    8-64 KB       3 ns               32 B
Unified L2   1-4 MB SRAM   6 ns    $100/MB    32 B
Memory       128 MB DRAM   60 ns   $1.50/MB   8 KB
Disk         30 GB         8 ms    $0.05/MB
Example cache hierarchy (registers, both L1 caches, and the unified L2 are on the processor chip):
Regs
L1 Data: 16 KB, 4-way assoc, write-through, 32 B lines, 1-cycle latency
L1 Instruction: 16 KB, 4-way assoc, 32 B lines
L2 Unified: 128 KB-2 MB, 4-way assoc, write-back, write-allocate, 32 B lines
Main Memory: up to 4 GB
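The write-through and write-back policies mentioned for the L1 and L2 caches differ in when a store reaches the next level. A toy one-line sketch (our own illustration): write-through updates memory on every store, while write-back only dirties the cached copy and writes it out on eviction.

```c
#include <assert.h>

/* Toy one-word cache line illustrating the two write policies. */
typedef struct {
    int value;   /* cached copy of the word */
    int dirty;   /* write-back only: modified since it was filled? */
} Line;

/* Write-through: every store also updates memory immediately. */
void store_through(Line *l, int *mem, int v)
{
    l->value = v;
    *mem = v;
}

/* Write-back: the store only dirties the cached copy. */
void store_back(Line *l, int v)
{
    l->value = v;
    l->dirty = 1;
}

/* On eviction, a dirty write-back line is flushed to memory. */
void evict(Line *l, int *mem)
{
    if (l->dirty) {
        *mem = l->value;
        l->dirty = 0;
    }
}
```

Write-back reduces bus traffic when the same line is written repeatedly, at the cost of tracking a dirty bit and flushing on eviction.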
Summary
Today
Cache principles
Next Time
Cache organization
Programming considerations