Lect 05 Dram
Announcements
Caches
Non-processor caches
Write Buffer
[Figure: Processor → write buffer → DRAM]
Processor writes data into the cache and the write buffer. Memory controller slowly drains the buffer to memory.
Typically holds a small number of writes. Can absorb small bursts, as long as the long-term rate of writing to the buffer does not exceed the maximum rate of writing to DRAM.
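The burst-absorbing behavior can be sketched in a few lines of Python. The buffer depth, drain rate, and burst pattern below are made-up illustrative numbers, not from the lecture:

```python
from collections import deque

def simulate(writes_at_cycle, depth, drain_period, total_cycles):
    """Count processor stalls for a write buffer of a given depth."""
    buf, stalls = deque(), 0
    for cycle in range(total_cycles):
        if cycle % drain_period == 0 and buf:
            buf.popleft()                  # controller drains one write to DRAM
        if cycle in writes_at_cycle:
            if len(buf) < depth:
                buf.append(cycle)          # burst absorbed; processor continues
            else:
                stalls += 1                # buffer full: processor must wait
    return stalls

burst = set(range(4))                      # 4 back-to-back writes at cycles 0-3
print(simulate(burst, depth=4, drain_period=4, total_cycles=32))  # 0 stalls
print(simulate(burst, depth=2, drain_period=4, total_cycles=32))  # 2 stalls
```

With a deep enough buffer the burst is fully absorbed; a shallower buffer forces the processor to stall even though the long-term write rate is sustainable.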
If a write-through cache has a write buffer, what should happen on a read miss?
Hardware guarantees that loads from all cores will return the value of the latest write
Coherence mechanisms
Metadata to track state for cached data. Controller that snoops bus (or interconnect) activity and reacts if needed to adjust the state of the cached data.
Snooping takes care of this. Associates state with each cache line.
(In)consistency example
/* initially A = B = flag = 0 */
Processor 1
A = 1; B = 1; flag = 1;
Processor 2
while (flag == 0); /* spin */ print(A); print(B);
Processor 1
A = 1; B = 1; FENCE; flag = 1;
Processor 2
while (flag == 0); /* spin */ FENCE; print(A); print(B);
Sometimes called Memory Barrier, MemBar, or Sync. Don't allow re-ordering across it.
Caches
Non-processor caches
Disk
Software/OS
Tape
Prefetching?
Coherence? Leases in network filesystems
Associativity?
Coherence? No writes! Did the page change since I last checked? Relaxed coherence via the If-Modified-Since header
Replacement policy? Probably LRU
AMAT?
Size, associativity, granularity. Hit time, miss rate, miss penalty, bandwidth, complexity.
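These metrics combine into average memory access time (AMAT). A quick sketch with made-up hit times and miss rates (the numbers are illustrative, not from the lecture):

```python
def amat(hit_time, miss_rate, miss_penalty):
    """AMAT = hit time + miss rate * miss penalty (all in cycles)."""
    return hit_time + miss_rate * miss_penalty

# Two-level hierarchy: an L2 miss goes to DRAM.
l2_amat = amat(hit_time=10, miss_rate=0.20, miss_penalty=200)
l1_amat = amat(hit_time=1,  miss_rate=0.05, miss_penalty=l2_amat)
print(l2_amat)  # 50.0 cycles seen by an access that reaches L2
print(l1_amat)  # 3.5 cycles on average per load
```

Note how the long DRAM miss penalty is diluted by each level's hit rate; this is why caching in front of DRAM works at all.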
Caches
Non-processor caches
1. Decode row address & drive word-lines
2. Selected bits drive bit-lines
3. Amplify row data
4. Decode column address & select subset of row
5. Send to output
6. Precharge bit-lines for next access
6T vs. 1T1C
[Figure: SRAM (6T) and DRAM (1T1C) bit cells with bitline/_bitline]
DRAM (1T1C): bit cell loses charge when read; bit cell loses charge over time
SRAM (6T): fast access; no refreshes; simpler manufacturing (compatible with logic process); lower density (6 transistors per cell); higher cost
[Figure: DRAM array organization — n address bits select one of 2^n rows in a 2^n-row x 2^m-column bit-cell array (n ~ m to minimize overall latency); sense amps and a mux select the column; RAS strobes the row]
5 basic commands: ACTIVATE, READ, WRITE, PRECHARGE, REFRESH
To reduce pin count, row and column share the same address pins
[Figure: bank array with row decoder and column mux; RAS/CAS command timing]
Example command sequence: ACTIVATE 0, READ 0, READ 1, READ 85, PRECHARGE, ACTIVATE 1, READ 0 (column addresses 0, 1, 85, then 0)
If another row is already active, must first issue PRECHARGE. Then ACTIVATE to open the new row, and READ/WRITE to access the row buffer.
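The PRECHARGE/ACTIVATE/READ rule can be sketched as a toy command generator for a single bank (timings ignored); it reproduces the example sequence from the slide above:

```python
def commands_for(requests):
    """Emit DRAM commands for a list of (row, column) reads to one bank."""
    open_row, cmds = None, []
    for row, col in requests:
        if open_row is None:
            cmds += [f"ACTIVATE {row}"]            # bank precharged: just open
        elif open_row != row:
            cmds += ["PRECHARGE", f"ACTIVATE {row}"]  # row conflict: close first
        cmds += [f"READ {col}"]                    # row buffer now holds the row
        open_row = row
    return cmds

print(commands_for([(0, 0), (0, 1), (0, 85), (1, 0)]))
# ['ACTIVATE 0', 'READ 0', 'READ 1', 'READ 85', 'PRECHARGE', 'ACTIVATE 1', 'READ 0']
```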
DRAM: Burst
Each READ/WRITE command can transfer multiple words (8 in DDR3). The DRAM channel is clocked faster than the DRAM core.
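For example, with a DDR3-1600 part on a 64-bit channel (assumed numbers, not stated on the slide), one burst moves exactly one cache line:

```python
transfers_per_sec = 1600e6   # DDR3-1600: 1600 MT/s on the channel
bus_bytes = 8                # 64-bit data bus
burst_len = 8                # DDR3 minimum burst length

peak_bw_gb = transfers_per_sec * bus_bytes / 1e9
burst_bytes = burst_len * bus_bytes
burst_time_ns = burst_len * 1e9 / transfers_per_sec

print(peak_bw_gb)     # 12.8 GB/s peak channel bandwidth
print(burst_bytes)    # 64 bytes -- one cache line per burst
print(burst_time_ns)  # 5.0 ns to stream the burst on the channel
```

The slow DRAM core supplies a whole row to the row buffer at once; the faster channel then streams it out 8 transfers at a time.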
DRAM: Banks
Each bank can have a different row active. Can overlap ACTIVATE and PRECHARGE latencies (e.g., READ from bank 0 while ACTIVATING bank 1)!
[Figures from Micron datasheet]
DDR3 SDRAM: introduced in 2007. SDRAM = Synchronous DRAM (clocked); DDR = Double Data Rate.
x4, x8, x16 datapath widths. Minimum burst length of 8. 8 banks. 1Gb, 2Gb, 4Gb capacities common. Relative to SDR/DDR/DDR2: + bandwidth, ~ latency.
tCAS = Time between column command and data out
tCCD = Time between column commands
tRP = Time to precharge DRAM array
tRAS = Time between RAS and data restoration in DRAM array (minimum time a row must be open)
tRC = tRAS + tRP = Row cycle time
tWTR = Write-to-read delay
tWR = Time from end of last write to PRECHARGE
tFAW = Four-ACTIVATE window (limits current surge)
These constraints make performance analysis and memory controller design difficult. Datasheets for DRAM devices are freely available:
http://download.micron.com/pdf/datasheets/dram/ddr3/2Gb_DDR3_SDRAM.pdf
Queuing & scheduling delay at the controller. Access converted to basic commands. DRAM latency: tCAS if the row is open, OR tRCD + tCAS if the array is precharged, OR tRP + tRCD + tCAS if another row is open (worst case: tRC + tRCD + tCAS). Data transfer: BurstLen / (MT/s).
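The three DRAM-latency cases can be sketched directly; the timings are assumed round numbers (9 cycles each), not datasheet values:

```python
tCAS, tRCD, tRP = 9, 9, 9   # illustrative timings in DRAM clock cycles

def access_cycles(row_state):
    if row_state == "open":        # requested row already in the row buffer
        return tCAS
    if row_state == "precharged":  # bank idle: ACTIVATE, then column access
        return tRCD + tCAS
    if row_state == "conflict":    # wrong row open: PRECHARGE first
        return tRP + tRCD + tCAS
    raise ValueError(row_state)

print(access_cycles("open"))        # 9
print(access_cycles("precharged"))  # 18
print(access_cycles("conflict"))    # 27
```

A row conflict costs roughly 3x a row hit, which is why row-buffer hit rate matters so much to the controller.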
DRAM Modules
DRAM chips have a narrow interface (typically x4, x8, x16). Multiple chips are put together to form a wide interface.
DIMM: Dual Inline Memory Module. To get a 64-bit DIMM, we need to access 8 chips with 8-bit interfaces. The chips share command/address lines, but not data.
Advantages
Disadvantages
8x power (all 8 chips activate on every access)
[Figure: 8 DRAM chips on a DIMM sharing the Command bus, each contributing 8 bits of the 64-bit Data bus]
Advantages:
Disadvantages:
Interconnect latency, complexity, and energy get higher. Addr/Cmd signal integrity is a challenge.
DRAM Ranks
Ranks provide more banks across multiple chips (but don't confuse rank and bank!)
[Figure: two DIMMs on one channel — DIMM 2 carries Rank 3 and Rank 4]
DRAM Capacity
Can only put a limited number of ranks on a channel (signal and driver limitations)
x2 chips? x1 chips?
More channels?
DRAM Channels
All DIMMs get the same command; one of the ranks replies.
System options
Lock-step: single controller with a wider interface (faster cache line refill!). Sometimes called Gang Mode. Only works if DIMMs are identical (organization, timing). Independent channels: requires multiple controllers.
Tradeoffs
[Figure: independent channels — each memory controller (MC) drives its own DIMMs]
Implication?
[Figure: multi-socket system — each CPU has its own MC; sockets connected via QPI]
Map physical address to DRAM address. Obey timing constraints of DRAM; arbitrate resource conflicts (e.g., bank, channel).
Ensure correct operation of DRAM (refresh). Manage power consumption and thermals in DRAM.
2GB memory, 8 banks, 16K rows & 2K columns per bank. Consecutive rows of memory in consecutive banks.
Row interleaving: Row (14 bits) | Bank (3 bits) | Column (11 bits) | Byte in bus (3 bits)
Cache block interleaving (consecutive blocks in consecutive banks): Row (14 bits) | High Col. (8 bits) | Bank (3 bits) | Low Col. (3 bits) | Byte in bus (3 bits)
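The row-interleaved layout — Row (14 bits) | Bank (3 bits) | Column (11 bits) | Byte in bus (3 bits) — can be decoded with plain shifts and masks; a sketch:

```python
def decode(paddr):
    """Split a physical address per the row-interleaved 2GB/8-bank mapping."""
    byte = paddr         & 0x7      # 3 bits: byte within the 8-byte bus
    col  = (paddr >> 3)  & 0x7FF    # 11 bits: 2K columns
    bank = (paddr >> 14) & 0x7      # 3 bits: 8 banks
    row  = (paddr >> 17) & 0x3FFF   # 14 bits: 16K rows
    return row, bank, col, byte

print(decode(0))          # (0, 0, 0, 0)
print(decode(64))         # (0, 0, 8, 0): nearby blocks stay in the same row
print(decode(16 * 1024))  # (0, 1, 0, 0): next 16KB region -> next bank
```

One row holds 2K columns x 8 bytes = 16KB, so with this mapping each consecutive 16KB region of memory lands in the next bank, matching the slide's "consecutive rows in consecutive banks".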
The DRAM controller can randomize the address mapping to banks so that bank conflicts are less likely
[Figure: bank index formed by XORing 3 row bits into the 3 bank bits]
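A minimal sketch of such randomization; exactly which bits feed the XOR is an assumption here (low 3 row bits folded into the 3 bank bits):

```python
def hashed_bank(row, bank):
    """Randomized bank index: XOR the plain bank bits with 3 row bits."""
    return bank ^ (row & 0x7)

# A power-of-two stride that always hits bank 0 under the plain mapping
# now spreads across all 8 banks as the row advances:
hashed = [hashed_bank(row, 0) for row in range(8)]
print(sorted(hashed))  # [0, 1, 2, 3, 4, 5, 6, 7] -- conflicts spread out
```

The XOR is a permutation for each fixed row, so no two addresses that previously mapped to different banks collide after hashing; it only reshuffles which bank each (row, bank) pair uses.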
[Figure: lock-step channel mapping — channel bit (C) taken from the low column bits (3-bit Low Col.)]
VA → PA (address translation):
Physical frame number (19 bits) | Page offset (12 bits)
DRAM mapping: Row (14 bits) | Bank (3 bits) | Column (11 bits) | Byte in bus (3 bits)
The bank bits fall inside the physical frame number, so the OS chooses the bank when it chooses the frame.
OS can control which bank a virtual page is mapped to. Randomize the page-to-<Bank,Channel> mapping for load-balancing. Optimize page placement for NUMA locality. The app doesn't know which bank/channel it is accessing.
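A sketch of why the OS has this control under the slide's mapping: the 3 bank bits (paddr[16:14]) sit above the 12-bit page offset, so they are determined entirely by which physical frame the OS allocates:

```python
def bank_of_frame(frame_number):
    """Bank used by every byte of a 4KB page under the slide's mapping."""
    paddr = frame_number << 12     # any offset in the page leaves bits 16:14
    return (paddr >> 14) & 0x7     # unchanged, so the whole page shares a bank

print(bank_of_frame(0))   # 0
print(bank_of_frame(4))   # 1 -- every 4th frame moves to the next bank
print(bank_of_frame(32))  # 0 -- wraps around after 8 banks
```

So by steering page allocation (e.g., frame number modulo some pattern), the OS can spread hot pages across banks without the application noticing.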
Open row
Keep the row open after an access. Pro: next access might need the same row → row hit. Con: next access might need a different row → row conflict, wasted energy.
Closed row
Close the row after an access (if no other requests already in the request buffer need the same row). Pro: next access might need a different row → avoid a row conflict. Con: next access might need the same row → extra activate latency.
Adaptive policies
Predict whether or not the next access to the bank will be to the same row.
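The trade-off can be made concrete with a toy cycle count (timings assumed: 9 cycles each for ACTIVATE, column access, and PRECHARGE; single bank):

```python
ACT, CAS, PRE = 9, 9, 9   # illustrative per-command latencies in cycles

def open_row_cycles(trace):
    """Total cycles if the row is kept open between accesses."""
    cycles, open_row = 0, None
    for row in trace:
        if row == open_row:            # row hit
            cycles += CAS
        elif open_row is None:         # bank precharged
            cycles += ACT + CAS
        else:                          # row conflict: close, then open
            cycles += PRE + ACT + CAS
        open_row = row
    return cycles

def closed_row_cycles(trace):
    """Closed-row policy: every access pays ACTIVATE + column access."""
    return len(trace) * (ACT + CAS)

hits  = [0, 0, 0, 0]   # streaming within one row favors open-row
mixed = [0, 1, 2, 3]   # all-conflict trace favors closed-row
print(open_row_cycles(hits),  closed_row_cycles(hits))    # 45 72
print(open_row_cycles(mixed), closed_row_cycles(mixed))   # 99 72
```

Neither policy wins on both traces, which is exactly why adaptive policies try to predict the next access.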
[Table: open-row vs. closed-row policy with Row 0 open — a Row 0 access already in the request buffer is a row hit; a Row 0 access not in the request buffer finds the row closed under the closed-row policy]
1. Row-hit first. 2. Oldest first. Goal: maximize row buffer hit rate → maximize DRAM throughput.
57
Request age Row buffer hit/miss status Request type (prefetch, read, write) Requestor type (load miss or store miss) Request criticality
Oldest miss in the core? How many instructions in core are dependent on it?
Implications on performance?
DRAM bank is unavailable while being refreshed. Long pause times: if we refresh all rows in a burst, every 64ms the DRAM will be unavailable until the refresh ends.
Distributed refresh eliminates long pause times. How else can we reduce the effect of refresh on performance?
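A back-of-the-envelope sketch of the refresh cost; the 8192 refreshes per 64ms window and tRFC = 350 ns are assumed DDR3-style values, not from the slides:

```python
refreshes_per_window = 8192   # assumed refresh commands per window
window_ms = 64                # retention window from the slide
tRFC_ns = 350                 # assumed per-refresh busy time (device-dependent)

busy_ns = refreshes_per_window * tRFC_ns
overhead = busy_ns / (window_ms * 1e6)
print(round(overhead * 100, 2))   # 4.48 -> ~4.5% of time unavailable

# Burst refresh has the same total overhead, but as one long pause per window:
pause_us = busy_ns / 1e3
print(pause_us)                   # 2867.2 microseconds of back-to-back refresh
```

Distributed refresh pays the same ~4.5% bandwidth tax but in tiny slices, instead of a multi-millisecond pause every 64ms.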
Next Lecture
DRAM chips have power modes. Idea: when not accessing a chip, power it down. Power states:
Active (highest power) All banks idle (i.e. precharged) Power-down Self-refresh (lowest power)
State transitions incur latency during which the chip cannot be accessed