

EE282 Lecture 5 Main Memory

Jacob Leverich http://eeclass.stanford.edu/ee282

EE282 Spring 2011 Lecture 05

Announcements

Today's lecture: Main Memory

Caches
- Write buffering
- A couple clarifications
  - Banks vs. sub-arrays
  - Coherence vs. consistency
- Non-processor caches

DRAM technology
DRAM as Main Memory


3

Write Buffer: Avoid Write-Through Stalls


[Diagram: Processor → Cache → Write Buffer → DRAM]

Use Write Buffer between cache and memory


- Processor writes data into the cache and the write buffer
- Memory controller slowly drains the buffer to memory

Write Buffer: a first-in-first-out buffer (FIFO)


- Typically holds a small number of writes
- Can absorb small bursts as long as the long-term rate of writing to the buffer does not exceed the maximum rate of writing to DRAM
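To make the FIFO behavior concrete, here is a minimal C sketch (buffer depth, field names, and the drain loop are illustrative, not from the lecture): the processor enqueues writes unless the buffer is full, and the memory controller drains the oldest entry whenever DRAM can accept one.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define WB_ENTRIES 8                      /* a small number of writes */

typedef struct { uint64_t addr, data; } wb_entry_t;

typedef struct {
    wb_entry_t fifo[WB_ENTRIES];
    int head, tail, count;                /* FIFO state */
} write_buffer_t;

/* Processor side: returns false (processor must stall) only if the buffer is full. */
bool wb_enqueue(write_buffer_t *wb, uint64_t addr, uint64_t data) {
    if (wb->count == WB_ENTRIES)
        return false;                     /* burst exceeded the buffer's capacity */
    wb->fifo[wb->tail] = (wb_entry_t){ addr, data };
    wb->tail = (wb->tail + 1) % WB_ENTRIES;
    wb->count++;
    return true;
}

/* Memory-controller side: drain the oldest write when DRAM can accept it. */
bool wb_drain_one(write_buffer_t *wb) {
    if (wb->count == 0)
        return false;
    wb_entry_t e = wb->fifo[wb->head];
    wb->head = (wb->head + 1) % WB_ENTRIES;
    wb->count--;
    printf("DRAM write: addr=%#llx data=%llu\n",
           (unsigned long long)e.addr, (unsigned long long)e.data);
    return true;
}

int main(void) {
    write_buffer_t wb = { .head = 0, .tail = 0, .count = 0 };
    for (uint64_t i = 0; i < 4; i++)
        wb_enqueue(&wb, 0x1000 + 8 * i, i);   /* a small burst is absorbed */
    while (wb_drain_one(&wb))
        ;                                     /* drained to DRAM in FIFO order */
    return 0;
}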
4

Write Buffers (cont)

If a write-through cache has a write buffer, what should happen on a read miss?

Are write-buffers of any use for write-back caches?

[Diagram: Processor → Cache and Write Buffer → DRAM]

Sub-arrays vs. Banks


Cache logical organization

Sub-arrays vs. Banks


Sub-arrays
- Tag/data arrays divided into sub-arrays
- Shared cache control

Banks
- Independent caches partitioned by address

Hardware Cache Coherence Using Snooping

Hardware guarantees that loads from all cores will return the value of the latest write

Coherence mechanisms

- Metadata to track state for cached data
- Controller that snoops bus (or interconnect) activity and reacts if needed to adjust the state of the cache data

There needs to be a serialization point

Shared L3, memory controller, or memory bus
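As a rough illustration of "snoop and react", a toy MSI-style reaction function for a single cache line is sketched below; the states and transitions are a simplification for illustration, not the exact protocol covered in the lecture.

#include <stdio.h>

typedef enum { INVALID, SHARED, MODIFIED } line_state_t;

/* How a snooping controller might react when another core's request
   for this line appears on the bus. */
line_state_t snoop(line_state_t state, int other_core_writes) {
    if (state == MODIFIED) {
        printf("write back dirty line\n");        /* supply the latest value */
        return other_core_writes ? INVALID : SHARED;
    }
    if (state == SHARED && other_core_writes)
        return INVALID;                           /* writer needs exclusive ownership */
    return state;                                 /* INVALID lines are unaffected */
}

int main(void) {
    line_state_t s = MODIFIED;
    s = snoop(s, 0);          /* remote read: write back, downgrade to SHARED */
    s = snoop(s, 1);          /* remote write: invalidate our copy */
    printf("final state: %d\n", s);
    return 0;
}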


8

Coherence vs. Consistency

This seems necessary but not sufficient

True, but huge can of worms (see CS315A)

Coherence concerns one memory location


- Snooping takes care of this
- Associates state with each cache line

Consistency concerns apparent ordering of all memory locations


9

(In)consistency example
/* initially A = B = flag = 0 */

Processor 1
A = 1; B = 1; flag = 1;

Processor 2
while (flag == 0); /* spin */ print(A); print(B);

Intuition says it will print A = B = 1. Coherence doesn't guarantee this! Why?


10

Dealing With Relaxed Consistency


/* initially A = B = flag = 0 */

Processor 1
A = 1; B = 1; FENCE; flag = 1;

Processor 2
while (flag == 0); /* spin */ FENCE; print(A); print(B);

Sometimes called Memory Barrier, MemBar, Synch. Don't allow re-ordering across it.
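For reference, the same pattern can be expressed with C11 atomics, where atomic_thread_fence plays the role of the FENCE above; this is one possible rendering, not code from the lecture.

#include <stdatomic.h>
#include <pthread.h>
#include <stdio.h>

int A = 0, B = 0;
atomic_int flag = 0;

void *processor1(void *arg) {
    A = 1;
    B = 1;
    atomic_thread_fence(memory_order_release);   /* FENCE: A and B visible before flag */
    atomic_store_explicit(&flag, 1, memory_order_relaxed);
    return NULL;
}

void *processor2(void *arg) {
    while (atomic_load_explicit(&flag, memory_order_relaxed) == 0)
        ;                                        /* spin */
    atomic_thread_fence(memory_order_acquire);   /* FENCE: don't hoist the reads above it */
    printf("A = %d, B = %d\n", A, B);
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t2, NULL, processor2, NULL);
    pthread_create(&t1, NULL, processor1, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}

The release fence on processor 1 and the acquire fence on processor 2 establish the ordering that coherence alone does not, so the prints are guaranteed to show A = 1 and B = 1.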
11

Today's lecture: Main Memory

Caches
- Write buffering
- A couple clarifications
  - Banking vs. sub-banking
  - Coherence vs. consistency
- Non-processor caches

DRAM technology
DRAM as Main Memory


12

Everything is a Cache for Something Else


Level          | Access Time       | Capacity | Managed by
Registers      | 1 cycle           | ~500B    | Software/compiler
Level 1 Cache  | 1-3 cycles        | ~64KB    | Hardware
Level 2 Cache  | 5-10 cycles       | 1-10MB   | Hardware
DRAM           | ~100 cycles       | ~10GB    | Software/OS
Disk           | 10^6-10^7 cycles  | ~1TB     | Software/OS

Further out: Tape, The Interwebs

13

Example: File cache

- Do files exhibit locality?
- Prefetching? Microsoft SuperFetch: load common programs at boot
- Write back or write through? When should we write to disk?
- Coherence? Leases in network filesystems
- Associativity? Place arbitrarily and keep an index
- Most disks have caches


14

Example: Browser cache

- Do web pages you visit exhibit locality?
- Write back or write through? No writes!
- Coherence? Did the page change since I last checked? Relaxed coherence via the If-Modified-Since header
- Replacement policy? Probably LRU
- AMAT?
15

Caching is a ubiquitous tool

Same design issues in system design as in processor design

Placement, lookup, write policies, replacement policies, coherence

Same optimization dimensions


- Size, associativity, granularity
- Hit time, miss rate, miss penalty, bandwidth, complexity
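As a small worked example of those dimensions, average memory access time (AMAT) composes hit time, miss rate, and miss penalty level by level; the numbers below are illustrative, not measurements from the lecture.

#include <stdio.h>

/* AMAT = hit_time + miss_rate * miss_penalty, applied recursively per level. */
int main(void) {
    double l1_hit = 2.0, l1_miss_rate = 0.05;     /* cycles, fraction (assumed) */
    double l2_hit = 10.0, l2_miss_rate = 0.20;
    double dram_latency = 100.0;

    double l2_amat = l2_hit + l2_miss_rate * dram_latency;
    double amat    = l1_hit + l1_miss_rate * l2_amat;

    printf("L2 AMAT  = %.1f cycles\n", l2_amat);  /* 10 + 0.2*100 = 30 */
    printf("CPU AMAT = %.1f cycles\n", amat);     /* 2 + 0.05*30  = 3.5 */
    return 0;
}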
16

Today's lecture: Main Memory

Caches
- Write buffering
- A couple clarifications
  - Banking vs. sub-banking
  - Coherence vs. consistency
- Non-processor caches

DRAM technology
DRAM as Main Memory


17

Main Memory Overview

18

Background: Memory Bank Organization

Read access sequence

1. Decode row address & drive word-lines
2. Selected bits drive bit-lines (entire row read)
3. Amplify row data
4. Decode column address & select subset of row
5. Send to output
6. Precharge bit-lines for next access
19

SRAM vs. DRAM


Static Random Access Memory (SRAM)
- 6T cell, accessed via row select; bitlines driven by transistors
- Larger cell (~6-10x) but faster (~10x)

Dynamic Random Access Memory (DRAM)
- 1T1C cell, accessed via row enable; bit stored as charge on node capacitance (non-restorative)
- Bit cell loses charge when read, and loses charge over time
- Must periodically refresh (once every 10s of ms)

[Figure: 6T SRAM cell vs. 1T1C DRAM cell schematics]


20


SRAM vs. DRAM

SRAM is preferable for register files and L1/L2 caches


- Fast access
- No refreshes
- Simpler manufacturing (compatible with logic process)
- Lower density (6 transistors per cell)
- Higher cost

DRAM is preferable for stand-alone memory chips


- Much higher capacity
- Higher density
- Lower cost


21

DRAM: Memory-Access Protocol

[Diagram: an n-bit row address (RAS) selects one of 2^n rows in a bit-cell array of 2^n rows x 2^m columns; an m-bit column address (CAS) drives the sense amps and mux to pick the output bit. n ~ m to minimize overall latency.]

5 basic commands: ACTIVATE, READ, WRITE, PRECHARGE, REFRESH

To reduce pin count, row and column share the same address pins
- RAS = Row Address Strobe
- CAS = Column Address Strobe

A DRAM die comprises multiple such arrays


22

DRAM: Basic Operation


Example access stream and resulting commands:

Addresses: (Row 0, Column 0), (Row 0, Column 1), (Row 0, Column 85), (Row 1, Column 0)
Commands:  ACTIVATE 0, READ 0, READ 1, READ 85, PRECHARGE, ACTIVATE 1, READ 0

The row buffer (aka sense amps, aka the page) holds the currently open row. Column accesses to the open row are row-buffer HITs; an access to a different row is a CONFLICT and requires PRECHARGE + ACTIVATE before the READ.

[Diagram: row decoder selects a row into the row buffer; column mux selects data out]

23

DRAM: Basic Operation

Access to an open row


- No need for ACTIVATE command
- READ/WRITE to access row buffer

Access to a closed row


- If another row is already active, must first issue PRECHARGE
- ACTIVATE to open the new row
- READ/WRITE to access the row buffer

Optional: PRECHARGE after READ/WRITEs finished
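A minimal sketch of this per-access decision under an open-row policy (single bank, simplified bookkeeping, illustrative names):

#include <stdio.h>
#include <stdbool.h>

typedef struct { bool open; int open_row; } bank_state_t;

/* Emit the DRAM command sequence for one access under an open-row policy. */
void access_row(bank_state_t *bank, int row) {
    if (bank->open && bank->open_row == row) {
        printf("READ/WRITE row %d (row-buffer hit)\n", row);
        return;
    }
    if (bank->open)                           /* another row is active: conflict */
        printf("PRECHARGE (close row %d)\n", bank->open_row);
    printf("ACTIVATE row %d\n", row);         /* open the requested row */
    printf("READ/WRITE row %d\n", row);
    bank->open = true;
    bank->open_row = row;
}

int main(void) {
    bank_state_t bank = { false, -1 };
    access_row(&bank, 0);   /* closed row: ACTIVATE + READ */
    access_row(&bank, 0);   /* hit: READ only */
    access_row(&bank, 1);   /* conflict: PRECHARGE + ACTIVATE + READ */
    return 0;
}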


24

DRAM: Burst

- Each READ/WRITE command can transfer multiple words (8 in DDR3)
- DRAM channel clocked faster than DRAM core

Critical word first?


25

DRAM: Banks

Modern DRAM chips consist of multiple banks

Address = (Bank x, Row y, Column z)

Banks operate independently, but share command/address/data pins


- Each bank can have a different row active
- Can overlap ACTIVATE and PRECHARGE latencies! (e.g. READ to bank 0 while ACTIVATING bank 1)

26

DRAM: Banks

Enable concurrent DRAM accesses (overlapping)

27

2Gb x8 DDR3 Chip

[Micron]

Observe: bank organization

28

2Gb x8 DDR3 Chip

[Micron]

Observe: row width, 64-to-8-bit datapath

29

DDR3 SDRAM: Current standard


- Introduced in 2007
- SDRAM = Synchronous DRAM = Clocked
- DDR = Double Data Rate: data transferred on both clock edges (400 MHz clock = 800 MT/s)
- x4, x8, x16 datapath widths
- Minimum burst length of 8
- 8 banks
- 1Gb, 2Gb, 4Gb capacity common
- Relative to SDR/DDR/DDR2: + bandwidth, ~ latency
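A rough peak-bandwidth sanity check: the 400 MHz clock, double data rate, and 64-bit DIMM width used below come from this slide; the arithmetic itself is just a sketch.

#include <stdio.h>

int main(void) {
    double clock_mhz = 400.0;                    /* DDR3-800 I/O clock */
    double mt_per_s  = 2.0 * clock_mhz;          /* double data rate: 800 MT/s */
    double dimm_bytes_per_transfer = 8.0;        /* 64-bit DIMM */

    double chip_bw = mt_per_s * 1.0;             /* x8 chip: 1 byte per transfer */
    double dimm_bw = mt_per_s * dimm_bytes_per_transfer;

    printf("x8 chip peak bandwidth : %.0f MB/s\n", chip_bw);
    printf("64-bit DIMM peak BW    : %.0f MB/s (%.1f GB/s)\n", dimm_bw, dimm_bw / 1000.0);
    return 0;
}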
30

DRAM: Timing Constraints

Memory controller must respect physical device characteristics

- tRCD = Row-to-Column command delay: how long it takes the row to reach the sense amps
- tCAS = Time between column command and data out
- tCCD = Time between column commands: rate at which column commands can be pipelined
- tRP = Time to precharge the DRAM array
- tRAS = Time between RAS and data restoration in the DRAM array (minimum time a row must be open)
- tRC = tRAS + tRP = Row cycle time: minimum time between accesses to different rows

31

DRAM: Timing Constraints

There are dozens of these


- tWTR = Write-to-read delay
- tWR = Time from end of last write to PRECHARGE
- tFAW = Four-ACTIVATE window (limits current surge)

These make performance analysis and memory controller design difficult.
Datasheets for DRAM devices are freely available:
http://download.micron.com/pdf/datasheets/dram/ddr3/2Gb_DDR3_SDRAM.pdf
32

Latency Components: Basic DRAM Operation


- CPU → controller transfer time
- Controller latency
  - Queuing & scheduling delay at the controller
  - Access converted to basic commands
- DRAM bank latency
  - tCAS if the row is open, OR
  - tRCD + tCAS if the array is precharged, OR
  - tRP + tRCD + tCAS otherwise (worst case: tRC + tRCD + tCAS)
- DRAM data transfer time: BurstLen / (MT/s)
- Controller → CPU transfer time

33
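A hedged sketch of the DRAM-bank-latency cases above; the timing values are placeholders, real ones come from the device datasheet.

#include <stdio.h>

/* Illustrative timings in DRAM clock cycles (not from a datasheet). */
enum { tCAS = 10, tRCD = 10, tRP = 10 };

typedef enum { ROW_OPEN_HIT, ROW_CLOSED, ROW_CONFLICT } row_state_t;

int bank_latency(row_state_t s) {
    switch (s) {
    case ROW_OPEN_HIT: return tCAS;               /* row already in the row buffer */
    case ROW_CLOSED:   return tRCD + tCAS;        /* array precharged: activate first */
    case ROW_CONFLICT: return tRP + tRCD + tCAS;  /* close the old row, then activate */
    }
    return 0;
}

int main(void) {
    int burst_len = 8, transfer_cycles = burst_len / 2;   /* DDR: 2 transfers per cycle */
    printf("hit:      %d cycles + %d transfer\n", bank_latency(ROW_OPEN_HIT), transfer_cycles);
    printf("closed:   %d cycles + %d transfer\n", bank_latency(ROW_CLOSED), transfer_cycles);
    printf("conflict: %d cycles + %d transfer\n", bank_latency(ROW_CONFLICT), transfer_cycles);
    return 0;
}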

Main Memory Overview

34

DRAM Modules

- DRAM chips have a narrow interface (typically x4, x8, x16)
- Multiple chips are put together to form a wide interface

- DIMM: Dual Inline Memory Module
- To get a 64-bit DIMM, we need to access 8 chips with 8-bit interfaces
- Chips share command/address lines, but not data

Advantages

Acts like a high-capacity DRAM chip with a wide interface

8x capacity, 8x bandwidth, same latency

Disadvantages

Granularity: Accesses cannot be smaller than the interface width

8x power
35

A 64-bit Wide DIMM


(physical view)

[Diagram: 8 DRAM chips side by side; the command/address bus fans out to every chip, and each chip supplies 8 bits of the 64-bit data bus]

36

A 64-bit Wide DIMM


(logical view)

37

Multiple DIMMs on a Channel


[Diagram: Rank 1, Rank 2, Rank 3, and Rank 4 sharing one channel]

Advantages:

Enables even higher capacity

Disadvantages:

- Interconnect latency, complexity, and energy get higher
- Addr/Cmd signal integrity is a challenge

38

DRAM Ranks

A DIMM may include multiple Ranks

A 64-bit DIMM using 8 chips with x16 interfaces has 2 ranks

Each 64-bit group of chips is called a rank


All chips in a rank respond to a single command

Different ranks share command/address/data lines

Select between ranks with Chip Select signal

Ranks provide more banks across multiple chips (but don't confuse rank and bank!)
39

Multiple Ranks on a DIMM


[Diagram: DIMM 1 holds Rank 1 and Rank 2; DIMM 2 holds Rank 3 and Rank 4]

40

DRAM Capacity

What if we want more capacity?

Can only put a limited number of ranks on a channel (signal and driver limitations)

x2 chips? x1 chips?

What happens to power consumption?

More channels?
41

Main Memory Overview

42

DRAM Channels

Channel: a set of DIMMs in series

All DIMMs get the same command, one of the ranks replies

System options

- Single-channel system
- Multiple dependent (lock-step) channels
  - Single controller with a wider interface (faster cache line refill!)
  - Sometimes called Gang Mode
  - Only works if DIMMs are identical (organization, timing)
- Multiple independent channels
  - Requires multiple controllers

Tradeoffs
- Cost: pins, wires, controller
- Benefit: higher bandwidth, capacity, flexibility


43

DRAM Channel Options


[Diagram: lock-step configuration (one memory controller, MC, drives both channels) vs. independent configuration (the CPU has a separate MC per channel)]
44

Lock-step DRAM Channels

Assume a lock-step dual channel bus


- Two 64-bit channels
- Minimum burst length of 8 (e.g. DDR3)

What's the minimum transaction size?

Implication?
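One way to check the answer these parameters imply (the 64-byte cache line size is an assumption):

#include <stdio.h>

int main(void) {
    int channels  = 2;     /* lock-step dual channel */
    int bus_bytes = 8;     /* 64-bit channel */
    int burst_len = 8;     /* DDR3 minimum burst length */
    int min_bytes = channels * bus_bytes * burst_len;   /* 2 * 8 * 8 = 128 bytes */
    printf("Minimum transaction: %d bytes (two 64-byte cache lines)\n", min_bytes);
    return 0;
}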

45

Multi-CPU (old school)


[Diagram: two CPUs share a front-side bus to an external memory controller (MC) and DRAM]

- External MC adds latency
- Capacity doesn't grow w/ # of CPUs


46

NUMA Topology (modern)


[Diagram: two CPUs connected by QPI links; each CPU has integrated memory controllers (MC) with locally attached DRAM]

- Capacity grows w/ # of CPUs
- NUMA: Non-Uniform Memory Access


47

Main Memory Overview

48

DRAM Controller Functionality

Translate memory requests into DRAM command sequences


- Map physical address to DRAM address
- Obey timing constraints of DRAM, arbitrate resource conflicts (e.g. bank, channel)

Buffer and schedule requests to improve performance

Row-buffer management and re-ordering

- Ensure correct operation of DRAM (refresh)
- Manage power consumption and thermals in DRAM

Turn on/off DRAM chips, manage power modes


49

A Modern DRAM Controller

50

Address Mapping (Single Channel)


Assume a single-channel system with an 8-byte memory bus
- 2GB memory, 8 banks, 16K rows & 2K columns per bank

Row interleaving: consecutive rows of memory in consecutive banks

  Row (14 bits) | Bank (3 bits) | Column (11 bits) | Byte in bus (3 bits)

Cache block interleaving: consecutive cache block addresses in consecutive banks
- 64-byte cache blocks

  Row (14 bits) | High Column (8 bits) | Bank (3 bits) | Low Col. (3 bits) | Byte in bus (3 bits)

Accesses to consecutive cache blocks can be serviced in parallel

How about random accesses? Strided accesses?
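A small C sketch of both mappings; the bit positions follow directly from the field widths on this slide, while the helper names are illustrative.

#include <stdint.h>
#include <stdio.h>

typedef struct { unsigned row, bank, col; } dram_addr_t;

/* Row interleaving: Row | Bank | Column | Byte-in-bus */
dram_addr_t map_row_interleaved(uint32_t pa) {
    dram_addr_t a;
    a.col  = (pa >> 3)  & 0x7FF;
    a.bank = (pa >> 14) & 0x7;
    a.row  = (pa >> 17) & 0x3FFF;
    return a;
}

/* Cache block interleaving: Row | HighCol(8) | Bank | LowCol(3) | Byte-in-bus */
dram_addr_t map_block_interleaved(uint32_t pa) {
    dram_addr_t a;
    unsigned low  = (pa >> 3) & 0x7;
    a.bank        = (pa >> 6) & 0x7;
    unsigned high = (pa >> 9) & 0xFF;
    a.col = (high << 3) | low;
    a.row = (pa >> 17) & 0x3FFF;
    return a;
}

int main(void) {
    for (uint32_t pa = 0; pa < 4 * 64; pa += 64) {   /* four consecutive cache blocks */
        dram_addr_t a = map_block_interleaved(pa);
        printf("PA %4u -> bank %u, row %u, col %u\n", (unsigned)pa, a.bank, a.row, a.col);
    }
    return 0;
}

With cache block interleaving, consecutive 64-byte blocks fall in consecutive banks and can be serviced in parallel; with row interleaving they all fall in the same row of the same bank.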


51

Bank Mapping Randomization

DRAM controller can randomize the address mapping to banks so that bank conflicts are less likely

[Diagram: the 3-bit bank field is XORed with 3 higher-order address bits to form the Bank index (3 bits); the Column (11 bits) and Byte in bus (3 bits) fields are unchanged]
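A sketch of one such permutation; exactly which higher-order bits feed the XOR is an assumption here, and the point is only that the bank index becomes a function of more address bits.

#include <stdint.h>
#include <stdio.h>

/* Assumed bit positions (row-interleaved mapping from the previous slide):
   bank field at bits 14-16, low row bits at 17-19. */
static unsigned randomized_bank(uint32_t pa) {
    unsigned bank_bits = (pa >> 14) & 0x7;    /* original bank index */
    unsigned row_bits  = (pa >> 17) & 0x7;    /* low bits of the row number */
    return bank_bits ^ row_bits;              /* permuted bank index */
}

int main(void) {
    /* A 128 KB stride keeps the plain bank field constant, but the
       randomized index still spreads accesses across banks. */
    for (uint32_t i = 0; i < 4; i++) {
        uint32_t pa = i * (1u << 17);
        printf("PA %8u -> bank %u\n", (unsigned)pa, randomized_bank(pa));
    }
    return 0;
}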

52

Address Mapping (Multiple Channels)


Where should the channel bit (C) go? Some options:

  C | Row (14 bits) | Bank (3 bits) | Column (11 bits) | Byte in bus (3 bits)
  Row (14 bits) | C | Bank (3 bits) | Column (11 bits) | Byte in bus (3 bits)
  Row (14 bits) | Bank (3 bits) | C | Column (11 bits) | Byte in bus (3 bits)
  Row (14 bits) | Bank (3 bits) | Column (11 bits) | C | Byte in bus (3 bits)

Cache block interleaving (lock-step channels):

  Row (14 bits) | High Col. (8 bits) | Bank (3 bits) | C | Low Col. (3 bits) | Byte in bus (3 bits)

What happens with NUMA topology?

Don't forget ranks, e.g.:

  Row (14 bits) | Rank (2 bits) | High Col. (8 bits) | Bank (3 bits) | C | Low Col. (3 bits) | Byte in bus (3 bits)

53

Interaction with Virtual → Physical Mapping

OS influences where an address maps to in DRAM


VA: Virtual Page number (52 bits) | Page offset (12 bits)
PA: Physical Frame number (19 bits) | Page offset (12 bits)
PA: Row (14 bits) | Bank (3 bits) | Column (11 bits) | Byte in bus (3 bits)

- OS can control which bank a virtual page is mapped to
- Randomize Page → <Bank, Channel> for load-balancing
- Optimize page placement for NUMA locality
- App doesn't know which bank/channel it is accessing
54

Row Buffer Management Policies

Open row

- Keep the row open after an access
- Pro: next access might need the same row → row hit
- Con: next access might need a different row → row conflict, wasted energy

Closed row

- Close the row after an access (if no other requests already in the request buffer need the same row)
- Pro: next access might need a different row → avoid a row conflict
- Con: next access might need the same row → extra activate latency

Adaptive policies

Predict whether or not the next access to the bank will be to the same row
55

Open vs. Closed Row Policies


Policy     | First access | Next access                                      | Commands needed for next access
Open row   | Row 0        | Row 0 (row hit)                                  | Read
Open row   | Row 0        | Row 1 (row conflict)                             | Precharge + Activate Row 1 + Read
Closed row | Row 0        | Row 0 access in request buffer (row hit)         | Read
Closed row | Row 0        | Row 0 access not in request buffer (row closed)  | Activate Row 0 + Read + Precharge
Closed row | Row 0        | Row 1 (row closed)                               | Activate Row 1 + Read + Precharge

56

DRAM Scheduling Policies (I)

FCFS (first come first served)

Oldest request first

FR-FCFS (first ready, first come first served)


1. Row-hit first
2. Oldest first
Goal: maximize row buffer hit rate → maximize DRAM throughput
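A minimal FR-FCFS selection function in C (the request-queue layout is illustrative):

#include <stdbool.h>

typedef struct {
    unsigned row;
    unsigned long arrival_time;   /* smaller = older */
    bool valid;
} request_t;

/* FR-FCFS: among pending requests, pick the oldest row-buffer hit if one exists;
   otherwise pick the oldest request. Returns -1 if the queue is empty. */
int frfcfs_pick(const request_t *q, int n, unsigned open_row) {
    int best_hit = -1, oldest = -1;
    for (int i = 0; i < n; i++) {
        if (!q[i].valid) continue;
        if (oldest < 0 || q[i].arrival_time < q[oldest].arrival_time)
            oldest = i;
        if (q[i].row == open_row &&
            (best_hit < 0 || q[i].arrival_time < q[best_hit].arrival_time))
            best_hit = i;
    }
    return best_hit >= 0 ? best_hit : oldest;
}

int main(void) {
    request_t q[3] = { { .row = 5, .arrival_time = 1, .valid = true },
                       { .row = 7, .arrival_time = 2, .valid = true },
                       { .row = 7, .arrival_time = 3, .valid = true } };
    /* Open row is 7: FR-FCFS picks index 1 (oldest row hit), not index 0 (oldest). */
    return frfcfs_pick(q, 3, 7) == 1 ? 0 : 1;
}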
57

DRAM Scheduling Policies (II)

A scheduling policy is a prioritization order


Prioritization can be based on

- Request age
- Row buffer hit/miss status
- Request type (prefetch, read, write)
- Requestor type (load miss or store miss)
- Request criticality

Oldest miss in the core? How many instructions in core are dependent on it?
58

DRAM Refresh (I)


DRAM capacitor charge leaks over time


The memory controller needs to read each row periodically to restore the charge

- Activate + precharge each row every N ms
- Typical N = 64 ms

Implications on performance?

- DRAM bank unavailable while being refreshed
- Long pause times: if we refresh all rows in a burst, every 64 ms the DRAM will be unavailable until the refresh ends
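A back-of-the-envelope sketch of distributed-refresh overhead; the per-bank row count and the time a refresh command occupies the device (tRFC) are assumed, device-dependent values.

#include <stdio.h>

int main(void) {
    double retention_ms  = 64.0;     /* refresh every row within 64 ms */
    double rows_per_bank = 8192.0;   /* assumed; device dependent */
    double t_rfc_ns      = 300.0;    /* assumed time one refresh occupies the device */

    /* Distributed refresh: one refresh command every 64 ms / 8192 rows = ~7.8 us. */
    double interval_us = retention_ms * 1000.0 / rows_per_bank;
    double overhead    = t_rfc_ns / (interval_us * 1000.0);

    printf("Refresh command every %.1f us\n", interval_us);
    printf("Device busy refreshing %.1f%% of the time\n", overhead * 100.0);
    return 0;
}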
59

DRAM Refresh (II)

Distributed refresh eliminates long pause times. How else can we reduce the effect of refresh on performance?

Can we reduce the number of refreshes?


60

DRAM Controllers are Difficult to Design

Need to obey DRAM timing constraints for correctness

There are many (50+) timing constraints in DRAM

Need to keep track of many resources to prevent conflicts

Channels, banks, ranks, data bus, address bus, row buffers

Need to handle DRAM refresh

Need to optimize for performance (in the presence of constraints)


- Reordering is not simple
- Predicting the future?


61

Next Lecture

62

DRAM Power Management


- DRAM chips have power modes
- Idea: when not accessing a chip, power it down
- Power states
  - Active (highest power)
  - All banks idle (i.e. precharged)
  - Power-down
  - Self-refresh (lowest power)

State transitions incur latency during which the chip cannot be accessed
63
