

EE282 Lecture 5 Main Memory

Jacob Leverich http://eeclass.stanford.edu/ee282

EE282 Spring 2011 Lecture 05

Announcements

Today's lecture: Main Memory

Caches
- Write buffering
- A couple clarifications
  - Banks vs. sub-arrays
  - Coherence vs. consistency
- Non-processor caches

DRAM technology
DRAM as Main Memory


3

Write Buffer: Avoid Write-Through Stalls


[Diagram: Processor → Cache → Write Buffer → DRAM]

Use Write Buffer between cache and memory


- Processor writes data into the cache and the write buffer
- Memory controller slowly drains the buffer to memory

Write Buffer: a first-in-first-out buffer (FIFO)


- Typically holds a small number of writes
- Can absorb small bursts as long as the long-term rate of writing to the buffer does not exceed the maximum rate of writing to DRAM
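To make the FIFO behavior concrete, here is a minimal C sketch (buffer depth, field names, and the drain loop are illustrative, not from the lecture): the processor enqueues writes unless the buffer is full, and the memory controller drains the oldest entry whenever DRAM can accept one.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define WB_ENTRIES 8                      /* a small number of writes */

typedef struct { uint64_t addr, data; } wb_entry_t;

typedef struct {
    wb_entry_t fifo[WB_ENTRIES];
    int head, tail, count;                /* FIFO state */
} write_buffer_t;

/* Processor side: returns false (processor must stall) only if the buffer is full. */
bool wb_enqueue(write_buffer_t *wb, uint64_t addr, uint64_t data) {
    if (wb->count == WB_ENTRIES)
        return false;                     /* burst exceeded the buffer's capacity */
    wb->fifo[wb->tail] = (wb_entry_t){ addr, data };
    wb->tail = (wb->tail + 1) % WB_ENTRIES;
    wb->count++;
    return true;
}

/* Memory-controller side: drain the oldest write when DRAM can accept it. */
bool wb_drain_one(write_buffer_t *wb) {
    if (wb->count == 0)
        return false;
    wb_entry_t e = wb->fifo[wb->head];
    wb->head = (wb->head + 1) % WB_ENTRIES;
    wb->count--;
    printf("DRAM write: addr=%#llx data=%llu\n",
           (unsigned long long)e.addr, (unsigned long long)e.data);
    return true;
}

int main(void) {
    write_buffer_t wb = { .head = 0, .tail = 0, .count = 0 };
    for (uint64_t i = 0; i < 4; i++)
        wb_enqueue(&wb, 0x1000 + 8 * i, i);   /* a small burst is absorbed */
    while (wb_drain_one(&wb))
        ;                                     /* drained to DRAM in FIFO order */
    return 0;
}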
4

Write Buffers (cont)

If a write-through cache has a write buffer, what should happen on a read miss?

Are write-buffers of any use for write-back caches?

[Diagram: Processor → Cache and Write Buffer → DRAM]

Sub-arrays vs. Banks


Cache logical organization

Sub-arrays vs. Banks


Sub-arrays
- Tag/data arrays divided into sub-arrays
- Shared cache control

Banks
- Independent caches partitioned by address

Hardware Cache Coherence Using Snooping

Hardware guarantees that loads from all cores will return the value of the latest write

Coherence mechanisms

- Metadata to track state for cached data
- Controller that snoops bus (or interconnect) activity and reacts if needed to adjust the state of the cache data

There needs to be a serialization point

Shared L3, memory controller, or memory bus
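As a rough illustration of "snoop and react", a toy MSI-style reaction function for a single cache line is sketched below; the states and transitions are a simplification for illustration, not the exact protocol covered in the lecture.

#include <stdio.h>

typedef enum { INVALID, SHARED, MODIFIED } line_state_t;

/* How a snooping controller might react when another core's request
   for this line appears on the bus. */
line_state_t snoop(line_state_t state, int other_core_writes) {
    if (state == MODIFIED) {
        printf("write back dirty line\n");        /* supply the latest value */
        return other_core_writes ? INVALID : SHARED;
    }
    if (state == SHARED && other_core_writes)
        return INVALID;                           /* writer needs exclusive ownership */
    return state;                                 /* INVALID lines are unaffected */
}

int main(void) {
    line_state_t s = MODIFIED;
    s = snoop(s, 0);          /* remote read: write back, downgrade to SHARED */
    s = snoop(s, 1);          /* remote write: invalidate our copy */
    printf("final state: %d\n", s);
    return 0;
}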


8

Coherence vs. Consistency

This seems necessary but not sufficient

True, but huge can of worms (see CS315A)

Coherence concerns one memory location


- Snooping takes care of this
- Associates state with each cache line

Consistency concerns apparent ordering of all memory locations


9

(In)consistency example
/* initially A = B = flag = 0 */

Processor 1
A = 1; B = 1; flag = 1;

Processor 2
while (flag == 0); /* spin */ print(A); print(B);

Intuition says it will print A = B = 1. Coherence doesn't guarantee this! Why?


10

Dealing With Relaxed Consistency


/* initially A = B = flag = 0 */

Processor 1
A = 1; B = 1; FENCE; flag = 1;

Processor 2
while (flag == 0); /* spin */ FENCE; print(A); print(B);

Sometimes called Memory Barrier, MemBar, Synch. Don't allow re-ordering across it.
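For reference, the same pattern can be expressed with C11 atomics, where atomic_thread_fence plays the role of the FENCE above; this is one possible rendering, not code from the lecture.

#include <stdatomic.h>
#include <pthread.h>
#include <stdio.h>

int A = 0, B = 0;
atomic_int flag = 0;

void *processor1(void *arg) {
    A = 1;
    B = 1;
    atomic_thread_fence(memory_order_release);   /* FENCE: A and B visible before flag */
    atomic_store_explicit(&flag, 1, memory_order_relaxed);
    return NULL;
}

void *processor2(void *arg) {
    while (atomic_load_explicit(&flag, memory_order_relaxed) == 0)
        ;                                        /* spin */
    atomic_thread_fence(memory_order_acquire);   /* FENCE: don't hoist the reads above it */
    printf("A = %d, B = %d\n", A, B);
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t2, NULL, processor2, NULL);
    pthread_create(&t1, NULL, processor1, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}

The release fence on processor 1 and the acquire fence on processor 2 establish the ordering that coherence alone does not, so the prints are guaranteed to show A = 1 and B = 1.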
11

Today's lecture: Main Memory

Caches
- Write buffering
- A couple clarifications
  - Banking vs. sub-banking
  - Coherence vs. consistency
- Non-processor caches

DRAM technology
DRAM as Main Memory


12

Everything is a Cache for Something Else


Level          | Access Time       | Capacity | Managed by
Registers      | 1 cycle           | ~500B    | Software/compiler
Level 1 Cache  | 1-3 cycles        | ~64KB    | Hardware
Level 2 Cache  | 5-10 cycles       | 1-10MB   | Hardware
DRAM           | ~100 cycles       | ~10GB    | Software/OS
Disk           | 10^6-10^7 cycles  | ~1TB     | Software/OS

Further out: Tape, The Interwebs

13

Example: File cache

- Do files exhibit locality?
- Prefetching? Microsoft SuperFetch: load common programs at boot
- Write back or write through? When should we write to disk?
- Coherence? Leases in network filesystems
- Associativity? Place arbitrarily and keep an index
- Most disks have caches


14

Example: Browser cache

- Do web pages you visit exhibit locality?
- Write back or write through? No writes!
- Coherence? Did the page change since I last checked? Relaxed coherence via the If-Modified-Since header
- Replacement policy? Probably LRU
- AMAT?
15

Caching is a ubiquitous tool

Same design issues in system design as in processor design

Placement, lookup, write policies, replacement policies, coherence

Same optimization dimensions


- Size, associativity, granularity
- Hit time, miss rate, miss penalty, bandwidth, complexity
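As a small worked example of those dimensions, average memory access time (AMAT) composes hit time, miss rate, and miss penalty level by level; the numbers below are illustrative, not measurements from the lecture.

#include <stdio.h>

/* AMAT = hit_time + miss_rate * miss_penalty, applied recursively per level. */
int main(void) {
    double l1_hit = 2.0, l1_miss_rate = 0.05;     /* cycles, fraction (assumed) */
    double l2_hit = 10.0, l2_miss_rate = 0.20;
    double dram_latency = 100.0;

    double l2_amat = l2_hit + l2_miss_rate * dram_latency;
    double amat    = l1_hit + l1_miss_rate * l2_amat;

    printf("L2 AMAT  = %.1f cycles\n", l2_amat);  /* 10 + 0.2*100 = 30 */
    printf("CPU AMAT = %.1f cycles\n", amat);     /* 2 + 0.05*30  = 3.5 */
    return 0;
}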
16

Today's lecture: Main Memory

Caches
- Write buffering
- A couple clarifications
  - Banking vs. sub-banking
  - Coherence vs. consistency
- Non-processor caches

DRAM technology
DRAM as Main Memory


17

Main Memory Overview

18

Background: Memory Bank Organization

Read access sequence

1. Decode row address & drive word-lines
2. Selected bits drive bit-lines (entire row read)
3. Amplify row data
4. Decode column address & select subset of row
5. Send to output
6. Precharge bit-lines for next access
19

SRAM vs. DRAM


Static Random Access Memory (SRAM)
- 6T cell, accessed via row select; bitlines driven by transistors
- Larger cell (~6-10x) but faster (~10x)

Dynamic Random Access Memory (DRAM)
- 1T1C cell, accessed via row enable; bit stored as charge on node capacitance (non-restorative)
- Bit cell loses charge when read, and loses charge over time
- Must periodically refresh (once every 10s of ms)

[Figure: 6T SRAM cell vs. 1T1C DRAM cell schematics]


20


SRAM vs. DRAM

SRAM is preferable for register files and L1/L2 caches


- Fast access
- No refreshes
- Simpler manufacturing (compatible with logic process)
- Lower density (6 transistors per cell)
- Higher cost

DRAM is preferable for stand-alone memory chips


- Much higher capacity
- Higher density
- Lower cost


21

DRAM: Memory-Access Protocol

[Diagram: an n-bit row address (RAS) selects one of 2^n rows in a bit-cell array of 2^n rows x 2^m columns; an m-bit column address (CAS) drives the sense amps and mux to pick the output bit. n ~ m to minimize overall latency.]

5 basic commands: ACTIVATE, READ, WRITE, PRECHARGE, REFRESH

To reduce pin count, row and column share the same address pins
- RAS = Row Address Strobe
- CAS = Column Address Strobe

A DRAM die comprises multiple such arrays


22

DRAM: Basic Operation


Example access stream and resulting commands:

Addresses: (Row 0, Column 0), (Row 0, Column 1), (Row 0, Column 85), (Row 1, Column 0)
Commands:  ACTIVATE 0, READ 0, READ 1, READ 85, PRECHARGE, ACTIVATE 1, READ 0

The row buffer (aka sense amps, aka the page) holds the currently open row. Column accesses to the open row are row-buffer HITs; an access to a different row is a CONFLICT and requires PRECHARGE + ACTIVATE before the READ.

[Diagram: row decoder selects a row into the row buffer; column mux selects data out]

23

DRAM: Basic Operation

Access to an open row


- No need for ACTIVATE command
- READ/WRITE to access row buffer

Access to a closed row


- If another row is already active, must first issue PRECHARGE
- ACTIVATE to open the new row
- READ/WRITE to access the row buffer

Optional: PRECHARGE after READ/WRITEs finished
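A minimal sketch of this per-access decision under an open-row policy (single bank, simplified bookkeeping, illustrative names):

#include <stdio.h>
#include <stdbool.h>

typedef struct { bool open; int open_row; } bank_state_t;

/* Emit the DRAM command sequence for one access under an open-row policy. */
void access_row(bank_state_t *bank, int row) {
    if (bank->open && bank->open_row == row) {
        printf("READ/WRITE row %d (row-buffer hit)\n", row);
        return;
    }
    if (bank->open)                           /* another row is active: conflict */
        printf("PRECHARGE (close row %d)\n", bank->open_row);
    printf("ACTIVATE row %d\n", row);         /* open the requested row */
    printf("READ/WRITE row %d\n", row);
    bank->open = true;
    bank->open_row = row;
}

int main(void) {
    bank_state_t bank = { false, -1 };
    access_row(&bank, 0);   /* closed row: ACTIVATE + READ */
    access_row(&bank, 0);   /* hit: READ only */
    access_row(&bank, 1);   /* conflict: PRECHARGE + ACTIVATE + READ */
    return 0;
}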


24

DRAM: Burst

- Each READ/WRITE command can transfer multiple words (8 in DDR3)
- DRAM channel clocked faster than DRAM core

Critical word first?


25

DRAM: Banks

Modern DRAM chips consist of multiple banks

Address = (Bank x, Row y, Column z)

Banks operate independently, but share command/address/data pins


- Each bank can have a different row active
- Can overlap ACTIVATE and PRECHARGE latencies! (e.g. READ to bank 0 while ACTIVATING bank 1)

26

DRAM: Banks

Enable concurrent DRAM accesses (overlapping)

27

2Gb x8 DDR3 Chip

[Micron]

Observe: bank organization

28

2Gb x8 DDR3 Chip

[Micron]

Observe: row width, 64-to-8-bit datapath

29

DDR3 SDRAM: Current standard


- Introduced in 2007
- SDRAM = Synchronous DRAM = Clocked
- DDR = Double Data Rate: data transferred on both clock edges (400 MHz clock = 800 MT/s)
- x4, x8, x16 datapath widths
- Minimum burst length of 8
- 8 banks
- 1Gb, 2Gb, 4Gb capacity common
- Relative to SDR/DDR/DDR2: + bandwidth, ~ latency
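A rough peak-bandwidth sanity check: the 400 MHz clock, double data rate, and 64-bit DIMM width used below come from this slide; the arithmetic itself is just a sketch.

#include <stdio.h>

int main(void) {
    double clock_mhz = 400.0;                    /* DDR3-800 I/O clock */
    double mt_per_s  = 2.0 * clock_mhz;          /* double data rate: 800 MT/s */
    double dimm_bytes_per_transfer = 8.0;        /* 64-bit DIMM */

    double chip_bw = mt_per_s * 1.0;             /* x8 chip: 1 byte per transfer */
    double dimm_bw = mt_per_s * dimm_bytes_per_transfer;

    printf("x8 chip peak bandwidth : %.0f MB/s\n", chip_bw);
    printf("64-bit DIMM peak BW    : %.0f MB/s (%.1f GB/s)\n", dimm_bw, dimm_bw / 1000.0);
    return 0;
}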
30

DRAM: Timing Constraints

Memory controller must respect physical device characteristics

- tRCD = Row-to-Column command delay: how long it takes the row to reach the sense amps
- tCAS = Time between column command and data out
- tCCD = Time between column commands: rate at which column commands can be pipelined
- tRP = Time to precharge the DRAM array
- tRAS = Time between RAS and data restoration in the DRAM array (minimum time a row must be open)
- tRC = tRAS + tRP = Row cycle time: minimum time between accesses to different rows

31

DRAM: Timing Constraints

There are dozens of these


- tWTR = Write-to-read delay
- tWR = Time from end of last write to PRECHARGE
- tFAW = Four-ACTIVATE window (limits current surge)

These make performance analysis and memory controller design difficult.
Datasheets for DRAM devices are freely available:
http://download.micron.com/pdf/datasheets/dram/ddr3/2Gb_DDR3_SDRAM.pdf
32

Latency Components: Basic DRAM Operation


- CPU → controller transfer time
- Controller latency
  - Queuing & scheduling delay at the controller
  - Access converted to basic commands
- DRAM bank latency
  - tCAS if the row is open, OR
  - tRCD + tCAS if the array is precharged, OR
  - tRP + tRCD + tCAS otherwise (worst case: tRC + tRCD + tCAS)
- DRAM data transfer time: BurstLen / (MT/s)
- Controller → CPU transfer time

33
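A hedged sketch of the DRAM-bank-latency cases above; the timing values are placeholders, real ones come from the device datasheet.

#include <stdio.h>

/* Illustrative timings in DRAM clock cycles (not from a datasheet). */
enum { tCAS = 10, tRCD = 10, tRP = 10 };

typedef enum { ROW_OPEN_HIT, ROW_CLOSED, ROW_CONFLICT } row_state_t;

int bank_latency(row_state_t s) {
    switch (s) {
    case ROW_OPEN_HIT: return tCAS;               /* row already in the row buffer */
    case ROW_CLOSED:   return tRCD + tCAS;        /* array precharged: activate first */
    case ROW_CONFLICT: return tRP + tRCD + tCAS;  /* close the old row, then activate */
    }
    return 0;
}

int main(void) {
    int burst_len = 8, transfer_cycles = burst_len / 2;   /* DDR: 2 transfers per cycle */
    printf("hit:      %d cycles + %d transfer\n", bank_latency(ROW_OPEN_HIT), transfer_cycles);
    printf("closed:   %d cycles + %d transfer\n", bank_latency(ROW_CLOSED), transfer_cycles);
    printf("conflict: %d cycles + %d transfer\n", bank_latency(ROW_CONFLICT), transfer_cycles);
    return 0;
}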

Main Memory Overview

34

DRAM Modules

- DRAM chips have a narrow interface (typically x4, x8, x16)
- Multiple chips are put together to form a wide interface

- DIMM: Dual Inline Memory Module
- To get a 64-bit DIMM, we need to access 8 chips with 8-bit interfaces
- Chips share command/address lines, but not data

Advantages

Acts like a high-capacity DRAM chip with a wide interface

8x capacity, 8x bandwidth, same latency

Disadvantages

Granularity: Accesses cannot be smaller than the interface width

8x power
35

A 64-bit Wide DIMM


(physical view)

[Diagram: 8 DRAM chips side by side; the command/address bus fans out to every chip, and each chip supplies 8 bits of the 64-bit data bus]

36

A 64-bit Wide DIMM


(logical view)

37

Multiple DIMMs on a Channel


[Diagram: Rank 1, Rank 2, Rank 3, and Rank 4 sharing one channel]

Advantages:

Enables even higher capacity

Disadvantages:

- Interconnect latency, complexity, and energy get higher
- Addr/Cmd signal integrity is a challenge

38

DRAM Ranks

A DIMM may include multiple Ranks

A 64-bit DIMM using 8 chips with x16 interfaces has 2 ranks

Each 64-bit group of chips is called a rank


All chips in a rank respond to a single command

Different ranks share command/address/data lines

Select between ranks with Chip Select signal

Ranks provide more banks across multiple chips (but don't confuse rank and bank!)
39

Multiple Ranks on a DIMM


[Diagram: DIMM 1 holds Rank 1 and Rank 2; DIMM 2 holds Rank 3 and Rank 4]

40

DRAM Capacity

What if we want more capacity?

Can only put a limited number of ranks on a channel (signal and driver limitations)

x2 chips? x1 chips?

What happens to power consumption?

More channels?
41

Main Memory Overview

42

DRAM Channels

Channel: a set of DIMMs in series

All DIMMs get the same command, one of the ranks replies

System options

- Single-channel system
- Multiple dependent (lock-step) channels
  - Single controller with a wider interface (faster cache line refill!)
  - Sometimes called Gang Mode
  - Only works if DIMMs are identical (organization, timing)
- Multiple independent channels
  - Requires multiple controllers

Tradeoffs
- Cost: pins, wires, controller
- Benefit: higher bandwidth, capacity, flexibility


43

DRAM Channel Options


[Diagram: lock-step configuration (one memory controller, MC, drives both channels) vs. independent configuration (the CPU has a separate MC per channel)]
44

Lock-step DRAM Channels

Assume a lock-step dual channel bus


- Two 64-bit channels
- Minimum burst length of 8 (e.g. DDR3)

What's the minimum transaction size?

Implication?
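One way to check the answer these parameters imply (the 64-byte cache line size is an assumption):

#include <stdio.h>

int main(void) {
    int channels  = 2;     /* lock-step dual channel */
    int bus_bytes = 8;     /* 64-bit channel */
    int burst_len = 8;     /* DDR3 minimum burst length */
    int min_bytes = channels * bus_bytes * burst_len;   /* 2 * 8 * 8 = 128 bytes */
    printf("Minimum transaction: %d bytes (two 64-byte cache lines)\n", min_bytes);
    return 0;
}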

45

Multi-CPU (old school)


[Diagram: two CPUs share a front-side bus to an external memory controller (MC) and DRAM]

- External MC adds latency
- Capacity doesn't grow w/ # of CPUs


46

NUMA Topology (modern)


[Diagram: two CPUs connected by QPI links; each CPU has integrated memory controllers (MC) with locally attached DRAM]

- Capacity grows w/ # of CPUs
- NUMA: Non-Uniform Memory Access


47

Main Memory Overview

48

DRAM Controller Functionality

Translate memory requests into DRAM command sequences


- Map physical address to DRAM address
- Obey timing constraints of DRAM, arbitrate resource conflicts (e.g. bank, channel)

Buffer and schedule requests to improve performance

Row-buffer management and re-ordering

- Ensure correct operation of DRAM (refresh)
- Manage power consumption and thermals in DRAM

Turn on/off DRAM chips, manage power modes


49

A Modern DRAM Controller

50

Address Mapping (Single Channel)


Assume a single-channel system with an 8-byte memory bus
- 2GB memory, 8 banks, 16K rows & 2K columns per bank

Row interleaving: consecutive rows of memory in consecutive banks

  Row (14 bits) | Bank (3 bits) | Column (11 bits) | Byte in bus (3 bits)

Cache block interleaving: consecutive cache block addresses in consecutive banks
- 64-byte cache blocks

  Row (14 bits) | High Column (8 bits) | Bank (3 bits) | Low Col. (3 bits) | Byte in bus (3 bits)

Accesses to consecutive cache blocks can be serviced in parallel

How about random accesses? Strided accesses?
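A small C sketch of both mappings; the bit positions follow directly from the field widths on this slide, while the helper names are illustrative.

#include <stdint.h>
#include <stdio.h>

typedef struct { unsigned row, bank, col; } dram_addr_t;

/* Row interleaving: Row | Bank | Column | Byte-in-bus */
dram_addr_t map_row_interleaved(uint32_t pa) {
    dram_addr_t a;
    a.col  = (pa >> 3)  & 0x7FF;
    a.bank = (pa >> 14) & 0x7;
    a.row  = (pa >> 17) & 0x3FFF;
    return a;
}

/* Cache block interleaving: Row | HighCol(8) | Bank | LowCol(3) | Byte-in-bus */
dram_addr_t map_block_interleaved(uint32_t pa) {
    dram_addr_t a;
    unsigned low  = (pa >> 3) & 0x7;
    a.bank        = (pa >> 6) & 0x7;
    unsigned high = (pa >> 9) & 0xFF;
    a.col = (high << 3) | low;
    a.row = (pa >> 17) & 0x3FFF;
    return a;
}

int main(void) {
    for (uint32_t pa = 0; pa < 4 * 64; pa += 64) {   /* four consecutive cache blocks */
        dram_addr_t a = map_block_interleaved(pa);
        printf("PA %4u -> bank %u, row %u, col %u\n", (unsigned)pa, a.bank, a.row, a.col);
    }
    return 0;
}

With cache block interleaving, consecutive 64-byte blocks fall in consecutive banks and can be serviced in parallel; with row interleaving they all fall in the same row of the same bank.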


51

Bank Mapping Randomization

DRAM controller can randomize the address mapping to banks so that bank conflicts are less likely

[Diagram: the 3-bit bank field is XORed with 3 higher-order address bits to form the Bank index (3 bits); the Column (11 bits) and Byte in bus (3 bits) fields are unchanged]
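A sketch of one such permutation; exactly which higher-order bits feed the XOR is an assumption here, and the point is only that the bank index becomes a function of more address bits.

#include <stdint.h>
#include <stdio.h>

/* Assumed bit positions (row-interleaved mapping from the previous slide):
   bank field at bits 14-16, low row bits at 17-19. */
static unsigned randomized_bank(uint32_t pa) {
    unsigned bank_bits = (pa >> 14) & 0x7;    /* original bank index */
    unsigned row_bits  = (pa >> 17) & 0x7;    /* low bits of the row number */
    return bank_bits ^ row_bits;              /* permuted bank index */
}

int main(void) {
    /* A 128 KB stride keeps the plain bank field constant, but the
       randomized index still spreads accesses across banks. */
    for (uint32_t i = 0; i < 4; i++) {
        uint32_t pa = i * (1u << 17);
        printf("PA %8u -> bank %u\n", (unsigned)pa, randomized_bank(pa));
    }
    return 0;
}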

52

Address Mapping (Multiple Channels)


Where should the channel bit (C) go? Some options:

  C | Row (14 bits) | Bank (3 bits) | Column (11 bits) | Byte in bus (3 bits)
  Row (14 bits) | C | Bank (3 bits) | Column (11 bits) | Byte in bus (3 bits)
  Row (14 bits) | Bank (3 bits) | C | Column (11 bits) | Byte in bus (3 bits)
  Row (14 bits) | Bank (3 bits) | Column (11 bits) | C | Byte in bus (3 bits)

Cache block interleaving (lock-step channels):

  Row (14 bits) | High Col. (8 bits) | Bank (3 bits) | C | Low Col. (3 bits) | Byte in bus (3 bits)

What happens with NUMA topology?

Don't forget ranks, e.g.:

  Row (14 bits) | Rank (2 bits) | High Col. (8 bits) | Bank (3 bits) | C | Low Col. (3 bits) | Byte in bus (3 bits)

53

Interaction with Virtual → Physical Mapping

OS influences where an address maps to in DRAM


VA: Virtual Page number (52 bits) | Page offset (12 bits)
PA: Physical Frame number (19 bits) | Page offset (12 bits)
PA: Row (14 bits) | Bank (3 bits) | Column (11 bits) | Byte in bus (3 bits)

- OS can control which bank a virtual page is mapped to
- Randomize Page → <Bank, Channel> for load-balancing
- Optimize page placement for NUMA locality
- App doesn't know which bank/channel it is accessing
54

Row Buffer Management Policies

Open row

- Keep the row open after an access
- Pro: next access might need the same row → row hit
- Con: next access might need a different row → row conflict, wasted energy

Closed row

- Close the row after an access (if no other requests already in the request buffer need the same row)
- Pro: next access might need a different row → avoid a row conflict
- Con: next access might need the same row → extra activate latency

Adaptive policies

Predict whether or not the next access to the bank will be to the same row
55

Open vs. Closed Row Policies


Policy     | First access | Next access                                      | Commands needed for next access
Open row   | Row 0        | Row 0 (row hit)                                  | Read
Open row   | Row 0        | Row 1 (row conflict)                             | Precharge + Activate Row 1 + Read
Closed row | Row 0        | Row 0 access in request buffer (row hit)         | Read
Closed row | Row 0        | Row 0 access not in request buffer (row closed)  | Activate Row 0 + Read + Precharge
Closed row | Row 0        | Row 1 (row closed)                               | Activate Row 1 + Read + Precharge

56

DRAM Scheduling Policies (I)

FCFS (first come first served)

Oldest request first

FR-FCFS (first ready, first come first served)


1. Row-hit first
2. Oldest first
Goal: maximize row buffer hit rate → maximize DRAM throughput
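A minimal FR-FCFS selection function in C (the request-queue layout is illustrative):

#include <stdbool.h>

typedef struct {
    unsigned row;
    unsigned long arrival_time;   /* smaller = older */
    bool valid;
} request_t;

/* FR-FCFS: among pending requests, pick the oldest row-buffer hit if one exists;
   otherwise pick the oldest request. Returns -1 if the queue is empty. */
int frfcfs_pick(const request_t *q, int n, unsigned open_row) {
    int best_hit = -1, oldest = -1;
    for (int i = 0; i < n; i++) {
        if (!q[i].valid) continue;
        if (oldest < 0 || q[i].arrival_time < q[oldest].arrival_time)
            oldest = i;
        if (q[i].row == open_row &&
            (best_hit < 0 || q[i].arrival_time < q[best_hit].arrival_time))
            best_hit = i;
    }
    return best_hit >= 0 ? best_hit : oldest;
}

int main(void) {
    request_t q[3] = { { .row = 5, .arrival_time = 1, .valid = true },
                       { .row = 7, .arrival_time = 2, .valid = true },
                       { .row = 7, .arrival_time = 3, .valid = true } };
    /* Open row is 7: FR-FCFS picks index 1 (oldest row hit), not index 0 (oldest). */
    return frfcfs_pick(q, 3, 7) == 1 ? 0 : 1;
}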
57

DRAM Scheduling Policies (II)

A scheduling policy is a prioritization order


Prioritization can be based on

- Request age
- Row buffer hit/miss status
- Request type (prefetch, read, write)
- Requestor type (load miss or store miss)
- Request criticality

Oldest miss in the core? How many instructions in core are dependent on it?
58

DRAM Refresh (I)


DRAM capacitor charge leaks over time


The memory controller needs to read each row periodically to restore the charge

- Activate + precharge each row every N ms
- Typical N = 64 ms

Implications on performance?

- DRAM bank unavailable while being refreshed
- Long pause times: if we refresh all rows in a burst, every 64 ms the DRAM will be unavailable until the refresh ends
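A back-of-the-envelope sketch of distributed-refresh overhead; the per-bank row count and the time a refresh command occupies the device (tRFC) are assumed, device-dependent values.

#include <stdio.h>

int main(void) {
    double retention_ms  = 64.0;     /* refresh every row within 64 ms */
    double rows_per_bank = 8192.0;   /* assumed; device dependent */
    double t_rfc_ns      = 300.0;    /* assumed time one refresh occupies the device */

    /* Distributed refresh: one refresh command every 64 ms / 8192 rows = ~7.8 us. */
    double interval_us = retention_ms * 1000.0 / rows_per_bank;
    double overhead    = t_rfc_ns / (interval_us * 1000.0);

    printf("Refresh command every %.1f us\n", interval_us);
    printf("Device busy refreshing %.1f%% of the time\n", overhead * 100.0);
    return 0;
}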
59

DRAM Refresh (II)

Distributed refresh eliminates long pause times. How else can we reduce the effect of refresh on performance?

Can we reduce the number of refreshes?


60

DRAM Controllers are Difficult to Design

Need to obey DRAM timing constraints for correctness

There are many (50+) timing constraints in DRAM

Need to keep track of many resources to prevent conflicts

Channels, banks, ranks, data bus, address bus, row buffers

Need to handle DRAM refresh

Need to optimize for performance (in the presence of constraints)


- Reordering is not simple
- Predicting the future?


61

Next Lecture

62

DRAM Power Management


- DRAM chips have power modes
- Idea: when not accessing a chip, power it down
- Power states
  - Active (highest power)
  - All banks idle (i.e. precharged)
  - Power-down
  - Self-refresh (lowest power)

State transitions incur latency during which the chip cannot be accessed
63
