
SANTA CLARA UNIVERSITY

Large and Fast: Exploiting Memory Hierarchy

Early Memory Technology-I


UNIVAC I Delay Line Memory

• Mercury delay line memory
• Acoustic pulses circulated through tubes of mercury
• Data was read out after a fixed delay and recirculated
• Used in UNIVAC I (1951)
• 1,000 words × 12 alphanumeric characters
• ≈ 12,000 characters total!
• Bandwidth ≈ 300 Kbytes/sec
Early Memory Technology-II

IBM 650 Magnetic Drum Memory


• Magnetic drum
• Data organized in circular tracks on a rotating drum
• Each instruction pointed to the next instruction (no PC)
• Used in IBM 650 (1953)
• Between 1 and 4 Kbytes
• Average access time 2.5 ms

Early Memory Technology-III


• Random Access Memory (RAM) arrives
• Both mercury delay line and magnetic drum memories were
sequential in nature.
• Access time of a location depended on what was accessed before it
(like current magnetic disk drives).
• Careful data layout optimizations were needed to reduce access time.
• In RAM (a bad name) any location can be accessed
at the same speed, regardless of previous accesses.
Magnetic Core RAM
• 2-dimensional array of tiny doughnut-like rings.
• Currents in two perpendicular wires induce & sense the
magnetic polarization of the rings.
• Non-volatile (does not lose data when power is off).
• Used in the Whirlwind project (MIT)
• 2 Kbytes
• 9 µs access time.
Semiconductor RAM: SRAM
• Most popular design uses 6 transistors per bit.
• Fewer transistors per bit than a flip-flop.
• Savings possible because all bits in a word are read/written
simultaneously, and
• Only a preset maximum number of words can be accessed at the
same time.
• The original semiconductor RAM.
• Volatile, but holds its contents indefinitely as long as power is
applied.

DRAM Technology
• Uses one transistor per bit
• Data stored as a charge in a capacitor
• Must periodically be refreshed
• Read contents and write back
• Performed on a DRAM “row”

Flash Storage
• Nonvolatile semiconductor storage
• 100× – 1000× faster than disk
• Smaller, lower power, more robust
• But more $/GB (between disk and DRAM)

Flash Types
• NOR flash: bit cell like a NOR gate
• Random read/write access
• Used for instruction memory in embedded systems
• NAND flash: bit cell like a NAND gate
• Denser (bits/area), but block-at-a-time access
• Cheaper per GB
• Used for USB keys, media storage, …
• Flash bits wear out after 1000’s of writes (program/erase cycles)
• Not suitable for direct RAM or disk replacement
• Wear leveling: remap data to less-used blocks

Memory Pyramid
• Static RAM (SRAM)
• 0.5ns – 2.5ns, $2000 – $5000 per GB
• Dynamic RAM (DRAM)
• 50ns – 70ns, $20 – $75 per GB
• Magnetic disk
• 5ms – 20ms, $0.20 – $2 per GB

• Ideal (perfect) memory
• Access time of SRAM
• Capacity and cost/GB of disk

Memory Hierarchy
• Multiple levels of memory
• Closest to the CPU:
• SRAM (fast), but only small amounts can be used.
• May be multiple levels (bigger is slower)
• Further away:
• DRAM (slower), but more of it can be used (cheap)
• Farthest from the CPU:
• Disk drive: slowest, but can be huge.

Hierarchy in Action
• Store everything on disk
• Copy recently accessed (and nearby) items from disk to
smaller DRAM memory
• Main memory
• Copy more recently accessed (and nearby) items from
DRAM to smaller SRAM memory
• Cache memory attached to CPU

Cache Use
• Data is copied as needed from memory into the cache.
• Two factors determine how much data should be copied at a time:
• Memory access delay
• Locality of reference
First factor: DRAM access delay
• Access time for the 1st data item:
• Long (100’s of ns)
• Due to the structure of DRAM
• Possible queuing for DRAM access behind other accesses
• Access time of the 2nd, 3rd, … data items:
• Relatively faster (e.g., 10 ns)

DRAM Access Delay Equation


• Access time for n bytes is
• T(n) = a·n + b
• Cost per byte accessed = T(n)/n = a + b/n
• Transfers become more efficient for larger n
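
For concreteness, a minimal sketch of the per-byte cost (the values a = 1 ns/byte and b = 100 ns are illustrative assumptions, not real DRAM timings):

```python
# Amortized DRAM access cost: T(n) = a*n + b, so cost per byte = a + b/n.
# The values of a and b below are illustrative assumptions, not real timings.
a = 1.0    # incremental transfer time per byte (ns)
b = 100.0  # fixed startup latency per access (ns)

for n in (1, 4, 16, 64, 256):
    cost_per_byte = a + b / n
    print(f"n = {n:3d} bytes -> {cost_per_byte:6.2f} ns/byte")
```

The fixed cost b dominates for small n, which is why copying many bytes per access pays off.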

Second Factor: Spatial Locality


• Spatial locality is one of two forms of address locality captured by the
“Principle of Locality of Reference”.
Principle of Locality of Reference

• Small proportion of addresses accessed at any time


• Temporal locality
• Will likely reuse recently accessed items in the short term
• e.g., instructions in a loop, induction variables
• Trace: a, b, a, b, c, d, e, a, c, b, x, a, a, b, d, x, …..
• Spatial locality
• Will likely access items near recently accessed items in the
short term
• e.g., sequential instruction access, array data
• Trace: a, a+2, a+6, b, a-2, b+4, b+6, b+8, a+10, ….
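
Both forms of locality appear in even trivial code. A minimal sketch (the array and loop are purely illustrative):

```python
# Summing an array exhibits both forms of locality:
#  - spatial:  data[0], data[1], ... live at adjacent addresses
#  - temporal: total and i are re-referenced on every iteration
data = list(range(1024))

total = 0
for i in range(len(data)):   # sequential instruction fetches: spatial locality
    total += data[i]         # total and i reused each pass: temporal locality
print(total)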

Unit of Data Transfer between Memory & Cache

• The unit of data transferred between cache and memory is
called a “memory block”.
• A memory “block” maps to a cache “line”.
• The unit of data transferred between memory and disk is
called a “page”. (This concept will be introduced later
in the chapter.)

Effect of Cache Line Size


• Larger cache line sizes should reduce the miss rate
(especially in instruction caches)
• Due to spatial locality
• But when the cache size is fixed:
• Larger line size → fewer cache lines → more
competition for a line → increased miss rate
• Larger cache line size → larger miss penalty
• Can override the benefit of the reduced miss rate

Summing Up
• There are two benefits to copying many bytes from
memory to cache in one access:
• Reducing cost of transfer per byte &
• Exploiting spatial locality
• However:
• If the data brought in is not “re-used” often enough,
copying too much can be wasteful, since some of the
copied data may never be used.

Memory Hierarchy Levels


• Block (aka line): unit of copying
• Must be a multiple of the CPU word size.
• Size is always a power of 2.
• If accessed data is present in
upper level
• Hit: access satisfied by upper level
• Hit ratio: hits/accesses
• If accessed data is absent
• Miss: block copied from lower level
• Time taken: miss penalty
• Miss ratio: misses/accesses
= 1 – hit ratio
• Then accessed data supplied to
CPU from upper level

Cache Memory
• Cache memory
• The level of the memory hierarchy closest to the CPU
• Given accesses X1, …, Xn–1, Xn
• How do we know if the data is present?
• Where do we look?

Direct Mapped Cache


• Location determined by the address
• Direct mapped: each memory block has only one possible cache location
• Cache line address = (Memory block address) modulo (# of lines in cache)
• The number of cache lines is a power of two.
• Low-order bits of the memory block address
determine the cache line address.
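
A minimal sketch of the mapping (the 8-line cache size is an illustrative assumption):

```python
# Direct-mapped placement: line = block_address mod num_lines.
# With num_lines a power of two, the modulo is just the low-order bits.
NUM_LINES = 8  # illustrative cache with 8 lines

for block in (0, 5, 8, 13, 21):
    line = block % NUM_LINES          # same as block & (NUM_LINES - 1)
    print(f"block {block:2d} -> cache line {line}")
```

Note how blocks 5 and 13 collide on line 5: many memory blocks share each cache line.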

Cache Line Tags


• How do we know which particular block is stored in a
cache location?
• Store memory block address in tag array
• Every cache line has a tag:
• All tags of cache are stored in tag “array”
• For a direct-mapped cache, the part of the memory block
address used to select a cache line (the index) need not be part of
the tag.

Valid Bits
• What if there is no meaningful data in a cache line? (e.g.,
at startup time)
• Add one valid bit per cache line.
• Valid bit: 1 = tag is for real data, 0 = junk tag
• Valid bit is initially 0 (at power-up).
• Hit condition:
• Relevant part of the address matches the tag of some cache line,
AND
• Valid bit is 1 for the matched cache line.

Cache Misses
• On a cache hit, the CPU proceeds normally
• On a cache miss, the controller will
• Stall the CPU pipeline
• Fetch the block from the next level of the hierarchy
• Instruction cache miss:
• Restart the instruction fetch
• Data cache miss:
• Complete the data access

Address Breakdown
• Example:
• Direct-mapped cache
• 64 cache lines, 16 bytes/line
• To what cache line number does address 1200 map?
• Memory block address = 1200/16 = 75
• Cache line # to be checked = 75 modulo 64 = 11

• Address fields: Tag = bits 31–10 (22 bits), Index = bits 9–4 (6 bits),
Offset = bits 3–0 (4 bits)
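
A minimal sketch of the field extraction for this example (using the 22/6/4-bit split above):

```python
# Split a 32-bit address into tag / index / offset for a direct-mapped
# cache with 64 lines of 16 bytes (6 index bits, 4 offset bits).
OFFSET_BITS = 4   # 16 bytes/line = 2**4
INDEX_BITS  = 6   # 64 lines      = 2**6

def split_address(addr):
    offset = addr & ((1 << OFFSET_BITS) - 1)
    index  = (addr >> OFFSET_BITS) & ((1 << INDEX_BITS) - 1)
    tag    = addr >> (OFFSET_BITS + INDEX_BITS)
    return tag, index, offset

tag, index, offset = split_address(1200)
print(tag, index, offset)   # -> 1 11 0: block 75 maps to line 11
```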

Fully Associative Caches


• A memory block can be placed anywhere in the cache.
• If we repeat the previous example using a fully-associative cache:
• The access to location 260 did not need to bump a cache line out of the cache
• There were plenty of empty (invalid) cache lines.
• Important: a fully associative cache has no notion of a cache
line index; all lines have the same index = 0.

Fully Associative Caches In Practice


• Requires all line tags to be searched in parallel ➔
• One tag comparator per cache line (very expensive for realistic cache sizes)
• Not really used in processors.
• Value of fully-associative caches:
• In simulation, to provide good miss rate estimates (for comparison purposes).
• The same principle is used in some CPU structures.

Set Associative Caches


• n-way set associative
• Cache is organized as an array of sets × ways
• Each set contains n lines (one per way)
• Can be viewed as:
• n ways, each way like a direct-mapped cache
• s sets, each set like a fully-associative cache
• Number of cache lines = n × s
• The memory block # determines which cache set is searched:
• Set = (Memory block #) modulo (# of sets in cache)
• # of sets = (cache size / line size) / n
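
A minimal sketch of the set computation (the cache and line sizes below are illustrative assumptions):

```python
# Set-associative placement: set = block_number mod num_sets,
# where num_sets = (cache_size / line_size) / n.
CACHE_SIZE = 1024   # bytes (illustrative)
LINE_SIZE  = 16     # bytes (illustrative)
N_WAYS     = 4

num_sets = (CACHE_SIZE // LINE_SIZE) // N_WAYS   # 64 lines / 4 ways = 16 sets

for block in (0, 15, 16, 31, 75):
    print(f"block {block:2d} -> set {block % num_sets}")
```

Blocks 0 and 16 map to the same set, but with 4 ways they can coexist without evicting each other.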

Cache Structure Example

• All caches are the same size (8 lines)

• Direct-mapped cache has 8 addresses (indices)
• Two-way set-associative has 8/2 = 4 addresses
• Fully associative cache has 8/8 = 1 address (no index needs to be specified).

Spectrum of Associativity
• For a cache with 8 lines

How Much Associativity


• Increased associativity decreases the miss rate,
• However, with diminishing returns on miss rate
• Example: system with a 64KB D-cache, 64-byte blocks, SPEC2000
• 1-way: 10.3%   2-way: 8.6%
• 4-way: 8.3%    8-way: 8.1%
• More hardware overhead
• Longer hit time
• More power consumption

4-Way Set Associative Cache Organization



Replacement Policy
• When a memory block is brought into an associative cache, the
controller needs to select which cache line to replace:
• Direct mapped:
• There is a single cache line corresponding to a memory block, so no
policy is needed.
• Set associative:
• A memory block maps to a cache set.
• If the set has an unused line (valid bit is 0), put the memory block
there.
• Otherwise, choose a “victim” line from the lines in the set.
• Needs a replacement policy (i.e., an algorithm)

Choosing a Line to Replace


• Commonly-used policies:
• Least-recently used (LRU)
• Choose the line not accessed for the longest time
• Implementation gets harder as associativity increases
• Counters updated once per access (hit or miss)
• Round Robin (RR)
• A modulo counter per set, incremented on every miss.
• Replace the set member that the counter points to.
• Counter updated once per miss.
• Random
• Often based on a pseudo-random (hash) function of the address.

Pseudo-LRU
• Real LRU:
• Sort all lines in a cache set in a total (linear) order (needs log2(S)
bits per line to store the order)
• Pseudo-LRU:
• Only remember the line that was accessed most recently
• On a miss, select one of the other S−1 lines for replacement
(some pseudo-random choice).
• Needs one ordering bit per line.
• Easier to implement in hardware.
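
A minimal software sketch of true LRU for one cache set (a model of the policy, not of the hardware counters above; the LRUSet class and its sizes are illustrative):

```python
from collections import OrderedDict

# True LRU bookkeeping for one cache set of `ways` lines.
# Keys are memory block numbers; insertion order tracks recency.
class LRUSet:
    def __init__(self, ways):
        self.ways = ways
        self.lines = OrderedDict()       # block -> None, oldest first

    def access(self, block):
        if block in self.lines:          # hit: mark as most recently used
            self.lines.move_to_end(block)
            return "hit"
        if len(self.lines) >= self.ways: # miss with full set: evict LRU line
            self.lines.popitem(last=False)
        self.lines[block] = None
        return "miss"

s = LRUSet(ways=2)
for b in (0, 8, 0, 16, 8):               # block 8 is evicted, then re-fetched
    print(b, s.access(b))
```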

Replacement Policies Comparison

Policy       Miss Rate   Hardware Cost     Power Usage
LRU          Best        Highest           Highest
Pseudo-LRU   Good        Moderate          Moderate
RR           Good        Less expensive    Low
Random       OK          Least expensive   Lowest

Handling Writes
• Cache writes impose a requirement:
• The cache and memory copies of the same variable
(e.g., X) must be kept consistent!
• Techniques for handling:
• Write hits
• Write misses

Write Hit Option 1: Write-Through


• The write-through policy updates memory and the cache simultaneously.
• The write to memory takes longer to complete.
• Possible solution: a write buffer
• Holds data waiting to be written to memory
• The CPU continues immediately
• The CPU only stalls on a write if the write buffer is already full

• When writes are sparse enough, write data is sent to memory and the CPU can
keep running.
• Otherwise, writes can “clog” the cache-memory interface and cause the CPU to
stall.
Write Hit Option 2: Write-Back

• Alternative: on a data-write hit, just update the line in the cache.

• Every cache line has a “dirty” bit.
• Keeps track of whether any byte in the line was modified (written).
• When a line is selected for replacement:
• Write it back to memory if the dirty bit is set.
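
A minimal sketch of the dirty-bit bookkeeping (a software model; the names write_hit and replace are illustrative, not a real controller interface):

```python
# Write-back bookkeeping for one cache line: writes only touch the cache,
# and modified data is flushed to memory when the line is replaced.
class Line:
    def __init__(self):
        self.valid = False
        self.tag = None
        self.dirty = False

def write_hit(line):
    line.dirty = True                    # cached copy updated; memory is now stale

def replace(line, new_tag, memory_writeback):
    if line.valid and line.dirty:
        memory_writeback(line.tag)       # flush modified data before reuse
    line.tag, line.valid, line.dirty = new_tag, True, False

l = Line()
replace(l, 75, lambda t: print("writing back block", t))
write_hit(l)
replace(l, 11, lambda t: print("writing back block", t))  # flushes block 75
```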
Write Miss Option 1: Write Allocation

• What should happen on a write miss?

• Fetch the memory block from memory into the cache.
• Perform the write in the cache line once the block is there.
• Rationale:
• Subsequent read hits to the same cache line will benefit from
the line already being in the cache.
• Read misses cause more grief than write misses because the CPU
needs the data.

Write Miss Option 2: Write Bypass

• Don’t fetch the block.
• Data is NOT written to the cache; it is written directly to memory.
• Rationale: written data (hit or miss) will eventually have to be
written to memory.
• It might then make sense not to waste time fetching the data
from memory, knowing that the write will cost us some
cycles sooner or later.

Selecting Cache Write Options

• There are pros and cons for the different options.
• Analyzing the program workload will help select the
best option.
• Some CPUs allow the cache to be configured (by
software) to use different options at different
times.

Main Memory Supporting Caches


• Use DRAMs for main memory
• Fixed width (e.g., 1 word)
• Connected by a fixed-width clocked bus
• Bus clock is typically slower than the CPU clock
• Example cache block read:
• 32-byte block, 8-byte-wide CPU/DRAM interface
• 1 bus cycle for the address transfer
• 100 bus cycles for the initial DRAM access
• 5 bus cycles per data transfer (32/8 = 4 transfers)
• Miss penalty = 1 + 100 + 4×5 = 121 bus cycles
• Bandwidth = 32 bytes / 121 cycles ≈ 0.26 B/cycle
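
The same arithmetic as a minimal sketch:

```python
# Miss penalty for the block read above: address cycle + initial DRAM
# access + one transfer per bus-width chunk of the block.
BLOCK_BYTES   = 32
BUS_BYTES     = 8
ADDR_CYCLES   = 1
ACCESS_CYCLES = 100
XFER_CYCLES   = 5

transfers    = BLOCK_BYTES // BUS_BYTES                  # 4 transfers
miss_penalty = ADDR_CYCLES + ACCESS_CYCLES + transfers * XFER_CYCLES
print(miss_penalty)                                      # 121 bus cycles
print(BLOCK_BYTES / miss_penalty)                        # ~0.26 bytes/cycle
```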

Average Access Time


• Hit time is also important for performance.
• Average Memory Access Time (AMAT)
= Hit time + Miss rate × Miss penalty
• Example:
• CPU with a 1 ns clock, hit time = 1 CPU cycle, miss penalty = 20 CPU cycles,
cache miss rate = 5%
• AMAT = 1 + 0.05 × 20 = 2 ns = 2 CPU cycles
• Example:
• CPU with a 1 ns clock, hit time = 1 cycle, miss penalty = 200 cycles, cache
miss rate = 5%
• AMAT = 1 + 0.05 × 200 = 11 ns
• A cache with a lower miss rate may not be the best choice if its hardware
complexity causes its hit time to be high.
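
A minimal sketch of the AMAT formula applied to both examples:

```python
# AMAT = hit_time + miss_rate * miss_penalty (all times in CPU cycles;
# at a 1 ns clock, cycles and ns coincide).
def amat(hit_time, miss_rate, miss_penalty):
    return hit_time + miss_rate * miss_penalty

print(amat(1, 0.05, 20))    # 2.0 cycles  = 2 ns
print(amat(1, 0.05, 200))   # 11.0 cycles = 11 ns
```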

Cache Performance Example


• Given: CPI with no misses = 1.3
• Cache miss rate m = 0.03 (3%)
• Cache miss penalty M = 100 CPU cycles
• Find the CPI when memory stalls are taken into consideration:
CPI = base CPI (no misses) + m × M
    = 1.3 + 0.03 × 100 = 4.3
• CPI summary: Ideal = 1; No misses = 1.3; Overall = 4.3
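
The same calculation as a minimal sketch (treating the 3% miss rate as misses per instruction, as the slide does):

```python
# Effective CPI = base CPI + misses/instruction * miss penalty.
def effective_cpi(base_cpi, miss_rate, miss_penalty):
    return base_cpi + miss_rate * miss_penalty

print(effective_cpi(1.3, 0.03, 100))   # 4.3
```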

Performance Summary
• As CPU performance increases:
• The miss penalty becomes more significant
• Decreasing the base CPI:
• A greater proportion of time is spent on memory stalls
• Increasing the clock rate:
• Memory stalls account for more CPU cycles
• Can’t neglect cache behavior when evaluating system
performance

Types of Misses (Tentative Reading)


• Compulsory misses: due to accessing a new address for the first time.
• Capacity misses: due to the cache being smaller than the working set of
addresses.
• Conflict misses: due to uneven distribution of accesses amongst
cache sets; some sets get a larger share of accesses than others.




Two Level Caches (Tentative Reading)


• If there are two levels of caches, L1 and L2, then the relation between
their contents can be:
• Inclusive: everything in L1 must also be in L2.
• Exclusive: any data in L1 cannot be in L2, and vice versa.
• Let it go (laissez faire): no rules applied.




Multilevel On-Chip Caches
