
Memory Scaling is Dead,

Long Live Memory Scaling


Le Memoire Scaling est mort, vive le Memoire Scaling!

Moinuddin K. Qureshi
ECE, Georgia Tech

At Yale’s “Mid Career” Celebration at the University of Texas at Austin, Sept 19, 2014


The Gap in Memory Hierarchy

[Figure: memory hierarchy — L1 (SRAM), eDRAM, DRAM, Flash, HDD — plotted against typical access latency in processor cycles (at 4 GHz), spanning roughly 2^1 to 2^23 cycles, with a large unfilled gap (marked “?????”) between DRAM and Flash.]

Misses in main memory (page faults) degrade performance severely


Main memory system must scale to maintain performance growth
The Memory Capacity Gap
Trends: Core count doubling every 2 years.
DRAM DIMM capacity doubling every 3 years

Lim+ ISCA’09

Memory capacity per core expected to drop by 30% every two years
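A quick compounding check of these two trends, as a minimal sketch (illustrative only; the exact percentage depends on the assumed growth rates, and the ~30% figure comes from the trend data cited from Lim+ ISCA’09):

# Rough compounding of the two trends quoted above (illustrative numbers).
cores_growth_2yr = 2.0             # core count doubles every 2 years
dimm_growth_2yr = 2.0 ** (2 / 3)   # DIMM capacity doubles every 3 years
per_core_ratio = dimm_growth_2yr / cores_growth_2yr

print(f"Capacity per core shrinks to {per_core_ratio:.2f}x every 2 years "
      f"(~{(1 - per_core_ratio) * 100:.0f}% drop with these exact rates)")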
Challenges for DRAM: Scaling Wall

DRAM does not scale well to small feature sizes (sub-1x nm)

Increasing error rates can render DRAM scaling infeasible


Two Roads Diverged …
Starting from the DRAM challenges, two roads diverge:
– Architectural support for DRAM scaling and for reducing refresh overheads
– An alternative memory technology that avoids the problems of DRAM

Important to investigate both approaches


Outline

• Introduction
• ArchShield: Yield Aware Design (architectural support for DRAM)
• Hybrid Memory: Reduce Latency, Energy, Power
• Adaptive Tuning of Systems to Workloads
• Summary
Reasons for DRAM Faults
• Unreliability of ultra-thin dielectric material

• In addition, DRAM cell failures also come from:

– Permanently leaky cells
– Mechanically unstable cells
– Broken links in the DRAM array

[Figure: DRAM cells — a permanently leaky cell (capacitor leaking charge), a mechanically unstable cell (capacitor tilting towards ground), and broken links in the array.]

Permanent faults for future DRAMs expected to be much higher

Row and Column Sparing
• A DRAM chip (organized into rows and columns) has spares

[Figure: a DRAM chip before and after row/column sparing — faulty cells cause entire rows and columns to be replaced by spare rows and spare columns, and the faulty rows/columns are deactivated.]

• Laser fuses enable spare rows/columns


• Entire row/column needs to be sacrificed for a few faulty cells

Row and Column Sparing incurs large area overheads


Commodity ECC-DIMM
• Commodity ECC DIMM with SECDED at 8 bytes (72,64)

• Mainly used for soft-error protection

• For hard errors, high chance of two errors in the same word (birthday paradox)

For an 8GB DIMM → ~1 billion words

Expected number of errors until the first double-error word
= 1.25 · √N ≈ 40K errors → ~0.5 ppm

SECDED is not enough at a high error rate (and what about soft errors?)
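A back-of-the-envelope check of the birthday-paradox estimate above, as a minimal sketch in Python (the 8GB DIMM size and 8-byte word granularity come from the slide; everything else is standard arithmetic):

import math

dimm_bytes = 8 * 2**30          # 8GB DIMM
words = dimm_bytes // 8         # SECDED protects 8-byte words -> ~1 billion words

# Birthday-paradox approximation: with hard faults landing uniformly at random
# across N words, the expected number of faults before two hit the same word
# is about 1.25 * sqrt(N).
faults_until_double_error = 1.25 * math.sqrt(words)

data_bits = words * 64
print(f"words: {words:.2e}")
print(f"expected faults until a double-error word: {faults_until_double_error:.0f}")  # ~40K
print(f"as a bit error rate: {faults_until_double_error / data_bits * 1e6:.1f} ppm")  # ~0.6 ppm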


Dissecting Fault Probabilities
At a bit error rate of 10^-4 (100 ppm) for an 8GB DIMM (1 billion words):

Faulty bits per word (8B)    Probability    Words in 8GB
0                            99.3%          0.99 Billion
1                            0.7%           7.7 Million
2                            26 × 10^-6     28 K
3                            62 × 10^-9     67
4                            ~10^-10        0.1

Most faulty words have a 1-bit error → the skew in fault probability
can be leveraged for low-cost resilience
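These numbers follow from a simple binomial model. A small sketch that roughly reproduces the table (assuming independent bit faults at a 10^-4 bit error rate; using 72 bits per ECC word, i.e. data plus check bits, is an assumption that matches the table’s probabilities closely):

from math import comb

ber = 1e-4            # bit error rate from the slide (100 ppm)
bits_per_word = 72    # assumption: 64 data + 8 SECDED check bits
words_in_dimm = 10**9 # ~1 billion 8B words in an 8GB DIMM

def p_exactly_k_faulty(k):
    """Binomial probability that exactly k bits of a word are faulty."""
    return comb(bits_per_word, k) * ber**k * (1 - ber)**(bits_per_word - k)

for k in range(5):
    p = p_exactly_k_faulty(k)
    print(f"{k} faulty bits: P = {p:.2e}  -> ~{p * words_in_dimm:,.0f} words in the DIMM")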

Tolerate high error rates with commodity ECC DIMMs while
retaining soft-error resilience
ArchShield: Overview
Inspired by how Solid State Drives (SSDs) tolerate high bit-error rates

Expose faulty-cell information to the architecture layer via runtime testing

[Figure: main memory augmented with a Fault Map and a Replication Area managed by ArchShield; the Fault Map is cached.]

Most words will be error-free
1-bit errors are handled with SECDED
Multi-bit errors are handled with replication

ArchShield stores the error mitigation information in memory
ArchShield: Yield Aware Design
When the DIMM is configured, runtime testing is performed. Each 8B word
gets classified into one of three types:

No Error          – Replication not needed; SECDED can correct a soft error
1-bit Error       – SECDED can correct the hard error; replication needed to cover a soft error
Multi-bit Error   – Word gets decommissioned; only the replica is used

(the classification of faulty words can be stored on the hard drive for future use)

Tolerates a 100 ppm fault rate with 1% slowdown and 4% capacity loss
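A toy sketch of this classification step (the function and variable names are illustrative, not ArchShield’s actual structures; the per-word fault counts would come from the runtime test):

def classify_word(faulty_bits):
    """Classify an 8B word after runtime testing (toy version of the scheme above)."""
    if faulty_bits == 0:
        return "NO_ERROR"        # no replication needed; SECDED reserved for soft errors
    elif faulty_bits == 1:
        return "ONE_BIT_ERROR"   # SECDED corrects the hard error; replicate the word so
                                 # a soft error on top of the hard error is still covered
    else:
        return "MULTI_BIT_ERROR" # decommission the word; only its replica is used

# Example: build a (cacheable) fault map from hypothetical test results
test_results = {0x1000: 0, 0x1008: 1, 0x1010: 3}   # address -> faulty bits found
fault_map = {addr: classify_word(n) for addr, n in test_results.items()}
print(fault_map)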


Outline

• Introduction
• ArchShield: Yield Aware Design (architectural support for DRAM)
• Hybrid Memory: Reduce Latency, Energy, Power
• Adaptive Tuning of Systems to Workloads
• Summary
Emerging Technology to aid Scaling
Phase Change Memory (PCM): Scalable to sub 10nm
Resistive memory: High resistance (0), Low resistance (1)

Advantages: scalable, has MLC capability, non-volatile (no leakage)

PCM is attractive for designing scalable memory systems. But …


Challenges for PCM
Key Problems:

1. Higher read latency (compared to DRAM)


2. Limited write endurance (~10-100 million writes per cell; see the rough lifetime sketch after this list)
3. Writes are much slower and power-hungry
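A rough sense of what the endurance limit means at the system level, as a minimal sketch (idealized wear leveling and illustrative capacity and write-rate numbers, not figures from the talk):

# Idealized lifetime estimate: assumes writes are spread perfectly evenly
# over all cells (perfect wear leveling). All numbers are illustrative.
endurance_writes_per_cell = 10**7        # lower end of ~10-100 million
pcm_bytes = 32 * 2**30                   # e.g. a 32GB PCM main memory
write_traffic_bytes_per_s = 1 * 2**30    # e.g. 1 GB/s of sustained writes

lifetime_s = endurance_writes_per_cell * pcm_bytes / write_traffic_bytes_per_s
print(f"Idealized lifetime: {lifetime_s / (365 * 24 * 3600):.1f} years")  # ~10 years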

Replacing DRAM with PCM causes:


High Read Latency,
High Power
High Energy Consumption

How do we design a scalable PCM without these disadvantages?


Hybrid Memory: Best of DRAM and PCM
[Figure: processor backed by a DRAM buffer (T = tag store) in front of PCM main memory with a PCM write queue; Flash or HDD sits below as storage.]

Hybrid Memory System:


1. DRAM as cache to tolerate PCM Rd/Wr latency and Wr bandwidth
2. PCM as main-memory to provide large capacity at good cost/power
3. Write filtering techniques to reduce wasteful writes to PCM
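A minimal sketch of the idea behind points 1 and 3 (a small LRU DRAM buffer in front of PCM that only writes back dirty evictions; the geometry and names are illustrative, not the evaluated design):

from collections import OrderedDict

class HybridMemory:
    """Toy model: small DRAM buffer (LRU) in front of PCM main memory."""

    def __init__(self, dram_lines):
        self.dram = OrderedDict()      # line address -> dirty flag
        self.dram_lines = dram_lines
        self.pcm_reads = 0
        self.pcm_writes = 0

    def access(self, line, is_write):
        if line in self.dram:                 # DRAM hit: PCM untouched
            self.dram.move_to_end(line)
            self.dram[line] |= is_write
            return
        self.pcm_reads += 1                   # DRAM miss: fill from PCM
        self.dram[line] = is_write
        if len(self.dram) > self.dram_lines:  # evict LRU; write back only if dirty
            _, dirty = self.dram.popitem(last=False)
            if dirty:
                self.pcm_writes += 1          # write filtering: clean evictions are free

# Repeated writes to a hot line reach PCM once, on eviction, not once per store
mem = HybridMemory(dram_lines=2)
for _ in range(100):
    mem.access(0xA0, is_write=True)
mem.access(0xB0, is_write=False)
mem.access(0xC0, is_write=False)       # evicts 0xA0 -> one PCM write
print(mem.pcm_reads, mem.pcm_writes)   # 3 reads, 1 write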
Latency, Energy, Power: Lowered
[Figure: normalized execution time for 8GB DRAM, 32GB PCM, 32GB DRAM, and 32GB PCM + 1GB DRAM (hybrid) across db1, db2, qsort, bsearch, kmeans, gauss, daxpy, vdotp, and the geometric mean.]

Hybrid memory provides performance similar to iso-capacity DRAM


Also avoids the energy/power overheads from frequent writes
Outline

• Introduction
• ArchShield: Yield Aware Design (architectural support for DRAM)
• Hybrid Memory: Reduce Latency, Energy, Power
• Adaptive Tuning of Systems to Workloads
• Summary
Workload Adaptive Systems
Different policies work well for different workloads

1. No single replacement policy works well for all workloads


2. Or, the prefetch algorithm
3. Or, the memory scheduling algorithm
4. Or, the coherence algorithm
5. Or, any other policy (write allocate/no allocate?)

Unfortunately, systems are designed to cater to the average case
(a policy that works well enough for all workloads)

Ideally, each workload would have the policy that works best for it
Adaptive Tuning via Runtime Testing
Say we want to select between two policies: P0 and P1
Divide the cache into three:
– Dedicated P0 sets
– Dedicated P1 sets
– Follower sets (use the winner of P0 vs. P1)

n-bit saturating counter:
– misses to P0-sets: counter++
– misses to P1-sets: counter--

The counter decides the policy for the Follower sets:
– MSB = 0: use P0
– MSB = 1: use P1

monitor → choose → apply
(Set Dueling: using a single counter)

Adaptive Tuning can allow dynamic policy selection at low cost
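A minimal sketch of the mechanism in Python (the counter width, number of sets, and the way dedicated sets are sampled here are illustrative):

class SetDueling:
    """Toy set-dueling selector between two policies, P0 and P1."""

    def __init__(self, num_sets=1024, counter_bits=10):
        self.max_count = (1 << counter_bits) - 1
        self.msb_mask = 1 << (counter_bits - 1)
        self.counter = self.max_count // 2      # start roughly unbiased
        # A small sample of sets is dedicated to each policy; the rest follow the winner.
        self.p0_sets = set(range(0, num_sets, 64))
        self.p1_sets = set(range(32, num_sets, 64))

    def policy_for(self, set_index):
        if set_index in self.p0_sets:
            return "P0"
        if set_index in self.p1_sets:
            return "P1"
        # Followers: the MSB of the counter picks the winning policy.
        return "P0" if (self.counter & self.msb_mask) == 0 else "P1"

    def on_miss(self, set_index):
        # Misses in the dedicated sets steer the counter toward the other policy.
        if set_index in self.p0_sets:
            self.counter = min(self.counter + 1, self.max_count)   # P0 looks worse
        elif set_index in self.p1_sets:
            self.counter = max(self.counter - 1, 0)                # P1 looks worse

# Usage: if the P0-dedicated sets miss more often, followers switch to P1.
duel = SetDueling()
for _ in range(600):
    duel.on_miss(0)            # misses in a P0-dedicated set
print(duel.policy_for(5))      # a follower set -> "P1"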


Outline

• Introduction
• ArchShield: Yield Aware Design (architectural support for DRAM)
• Hybrid Memory: Reduce Latency, Energy, Power
• Adaptive Tuning of Systems to Workloads
• Summary
Challenges for Computer Architects
End of: Technology Scaling, Frequency Scaling, Moore’s Law, ????

How do we address these challenges?

The solution for all computer architecture problems is:

Yield Awareness (ArchShield)
Hybrid memory: Latency, Energy, Power reduction for PCM
Workload adaptive systems: low cost “Adaptivity Through Testing”


Happy 75th, Yale!
