Mei Yang: Adapted from David Patterson's slides on Graduate Computer Architecture
` Introduction
` Ten Advanced Optimizations of Cache Performance
` Memory Technology and Optimizations
` Virtual Memory and Virtual Machines
` Crosscutting Issues
` Memory Hierarchies in ARM Cortex-A8 and Intel Core i7
` Fallacies and Pitfalls
` Miss rate
◦ Fraction of cache accesses that result in a miss
` Causes of misses
◦ Compulsory
x First reference to a block
◦ Capacity
x Blocks discarded and later retrieved
◦ Conflict
x Program makes repeated references to multiple addresses from different blocks that map to the same location in the cache
Figure: a write buffer without write merging vs. with write merging
` Blocking
◦ Instead of accessing entire rows or columns, subdivide matrices into blocks
◦ Requires more memory accesses but improves locality of accesses
Pentium 4 Pre-fetching
` Register prefetch
◦ Loads data into register
` Cache prefetch
◦ Loads data into cache
Since a has 3 rows and 100 columns, its accesses will lead to 3 x (100/2), or 150 misses, while b has 101 misses for accessing b[j+1][0]; 251 misses in total. The code with prefetching:

for (j=0; j<100; j=j+1) {
    prefetch(b[j+7][0]);    // b(j,0) for 7 iterations later
    prefetch(a[0][j+7]);    // a(0,j) for 7 iterations later
    a[0][j] = b[j][0] * b[j+1][0]; };
for (i=0; i<3; i=i+1)
    for (j=0; j<100; j=j+1) {
        prefetch(a[i][j+7]);    // a[i][j] for +7 iterations
        a[i][j] = b[j][0] * b[j+1][0]; }
Prefetch loop:
9 x 100 + 11 x 100 + 2 x 100 x 8 + 8 x 100 = 4400 clock cycles
` Performance metrics
◦ Latency is the concern of caches
◦ Bandwidth is the concern of multiprocessors and I/O
◦ Access time
x Time between read request and when the desired word arrives
◦ Cycle time
x Minimum time between unrelated requests to memory
` DRAM
◦ Must be re-written after being read
◦ Must also be periodically refreshed
x Every ~ 8 ms
x Each row can be refreshed simultaneously
◦ One transistor/bit
◦ Address lines are multiplexed:
x Upper half of address: row access strobe (RAS)
x Lower half of address: column access strobe (CAS)
` Internal organization
` Some optimizations:
◦ Multiple accesses to same row
◦ Synchronous DRAM
x Added clock to DRAM interface
x Burst mode with critical word first
◦ Wider interfaces
◦ Double data rate (DDR)
◦ Multiple banks on each DRAM device
` DDR:
◦ DDR2
x Lower power (2.5 V -> 1.8 V)
x Higher clock rates (266 MHz, 333 MHz, 400 MHz)
◦ DDR3
x 1.5 V
x 800 MHz
◦ DDR4
x 1-1.2 V
x 1600 MHz
` Role of architecture:
◦ Provide user mode and supervisor mode
◦ Protect certain aspects of CPU state
◦ Provide mechanisms for switching between user mode and supervisor mode
◦ Provide mechanisms to limit memory accesses
◦ Provide TLB to translate addresses