
SANTA CLARA UNIVERSITY

Large and Fast: Exploiting Memory Hierarchy

Early Memory Technology-I


UNIVAC I Delay Line Memory

• Mercury delay line memory
• Acoustic pulses circulated through tubes of mercury
• Data was read out after a fixed delay and recirculated
• Used in UNIVAC I (1951)
• 1,000 words × 12 alphanumeric characters
• ≈ 12,000 characters total!
• Bandwidth ≈ 300 Kbytes/sec
Early Memory Technology-II

IBM 650 Magnetic Drum Memory


• Magnetic drum
• Data organized in circular tracks on a rotating drum
• Each instruction pointed to the next instruction (no PC)
• Used in IBM 650 (1953)
• Between 1 and 4 Kbytes
• Average access time 2.5 ms

Early Memory Technology-III


• Random Access Memory (RAM) arrives
• Both mercury delay line and magnetic drum memories were
sequential in nature.
• Access time of a location depended on what was accessed before it
(like current magnetic disk drives).
• Careful data layout optimizations were needed to reduce access time.
• In RAM (a bad name) any location can be accessed
at the same speed, regardless of previous accesses.
Magnetic Core RAM
• 2-dimensional array of tiny doughnut-like rings.
• Currents in two perpendicular wires induce & sense the
magnetic polarization of the rings.
• Non-volatile (does not lose data when power is off).
• Used in the Whirlwind project (MIT)
• 2 Kbytes
• 9 µs access time.
Semiconductor RAM: SRAM
• Most popular design uses 6 transistors per bit.
• Fewer transistors per bit than a flip-flop.
• Savings possible because all bits in a word are read/written
simultaneously, and
• Only a preset maximum number of words can be accessed at the
same time.
• The original semiconductor RAM.
• Volatile, but holds its contents indefinitely as long as power is
applied.

DRAM Technology
• Uses one transistor per bit
• Data stored as a charge in a capacitor
• Must periodically be refreshed
• Read contents and write back
• Performed on a DRAM “row”

Flash Storage
• Nonvolatile semiconductor storage
• 100× – 1000× faster than disk
• Smaller, lower power, more robust
• But more $/GB (between disk and DRAM)

Flash Types
• NOR flash: bit cell like a NOR gate
• Random read/write access
• Used for instruction memory in embedded systems
• NAND flash: bit cell like a NAND gate
• Denser (bits/area), but block-at-a-time access
• Cheaper per GB
• Used for USB keys, media storage, …
• Flash bits wear out after 1000’s of writes (program/erase cycles)
• Not suitable for direct RAM or disk replacement
• Wear leveling: remap data to less-used blocks

Memory Pyramid
• Static RAM (SRAM)
• 0.5ns – 2.5ns, $2000 – $5000 per GB
• Dynamic RAM (DRAM)
• 50ns – 70ns, $20 – $75 per GB
• Magnetic disk
• 5ms – 20ms, $0.20 – $2 per GB

• Ideal (perfect) memory
• Access time of SRAM
• Capacity and cost/GB of disk

Memory Hierarchy
• Multiple levels of memory
• Closest to the CPU:
• SRAM (fast), but only small amounts can be used.
• May be multiple levels (bigger is slower)
• Further away:
• DRAM (slower), but more of it can be used (cheap)
• Farthest from the CPU:
• Disk drive: slowest, but can be huge.

Hierarchy in Action
• Store everything on disk
• Copy recently accessed (and nearby) items from disk to
smaller DRAM memory
• Main memory
• Copy more recently accessed (and nearby) items from
DRAM to smaller SRAM memory
• Cache memory attached to CPU

Cache Use
• Data is copied as needed from memory into the cache.
• Two factors determine how much data should be copied at a time:
• Memory access delay
• Locality of reference
First factor: DRAM access delay
• Access time for the 1st data item:
• Long (100’s of ns)
• Due to the structure of DRAM
• Possible queuing for DRAM access behind other accesses
• Access time of the 2nd, 3rd, … data items:
• Relatively faster (e.g., 10 ns)

DRAM Access Delay Equation


• Access time for n bytes is
• T(n) = a·n + b
• Cost per byte accessed = T(n)/n = a + b/n
• Transfers become more efficient for larger n
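
For concreteness, a minimal sketch of the per-byte cost (the values a = 1 ns/byte and b = 100 ns are illustrative assumptions, not real DRAM timings):

```python
# Amortized DRAM access cost: T(n) = a*n + b, so cost per byte = a + b/n.
# The values of a and b below are illustrative assumptions, not real timings.
a = 1.0    # incremental transfer time per byte (ns)
b = 100.0  # fixed startup latency per access (ns)

for n in (1, 4, 16, 64, 256):
    cost_per_byte = a + b / n
    print(f"n = {n:3d} bytes -> {cost_per_byte:6.2f} ns/byte")
```

The fixed cost b dominates for small n, which is why copying many bytes per access pays off.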

Second Factor: Spatial Locality


• Spatial locality is one of two forms of address locality captured by the
“Principle of Locality of Reference”.
Principle of Locality of Reference

• Small proportion of addresses accessed at any time


• Temporal locality
• Will likely reuse recently accessed items in the short term
• e.g., instructions in a loop, induction variables
• Trace: a, b, a, b, c, d, e, a, c, b, x, a, a, b, d, x, …..
• Spatial locality
• Will likely access items near recently accessed items in the
short term
• e.g., sequential instruction access, array data
• Trace: a, a+2, a+6, b, a-2, b+4, b+6, b+8, a+10, ….
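
Both forms of locality appear in even trivial code. A minimal sketch (the array and loop are purely illustrative):

```python
# Summing an array exhibits both forms of locality:
#  - spatial:  data[0], data[1], ... live at adjacent addresses
#  - temporal: total and i are re-referenced on every iteration
data = list(range(1024))

total = 0
for i in range(len(data)):   # sequential instruction fetches: spatial locality
    total += data[i]         # total and i reused each pass: temporal locality
print(total)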

Unit of Data Transfer between Memory & Cache

• The unit of data transferred between cache and memory is
called a “memory block”.
• A memory “block” maps to a cache “line”.
• The unit of data transferred between memory and disk is
called a “page”. (This concept will be introduced later
in the chapter.)

Effect of Cache Line Size


• Larger cache line sizes should reduce the miss rate
(especially in instruction caches)
• Due to spatial locality
• But when the cache size is fixed:
• Larger line size → fewer cache lines → more
competition for a line → increased miss rate
• Larger cache line size → larger miss penalty
• Can override the benefit of the reduced miss rate

Summing Up
• There are two benefits to copying many bytes from
memory to cache in one access:
• Reducing cost of transfer per byte &
• Exploiting spatial locality
• However:
• If the data brought in is not “re-used” often enough,
copying too much can be wasteful, since some of the
copied data may never be used.

Memory Hierarchy Levels


• Block (aka line): unit of copying
• Must be a multiple of the CPU word size.
• Size is always a power of 2.
• If accessed data is present in
upper level
• Hit: access satisfied by upper level
• Hit ratio: hits/accesses
• If accessed data is absent
• Miss: block copied from lower level
• Time taken: miss penalty
• Miss ratio: misses/accesses
= 1 – hit ratio
• Then accessed data supplied to
CPU from upper level

Cache Memory
• Cache memory
• The level of the memory hierarchy closest to the CPU
• Given accesses X1, …, Xn–1, Xn
• How do we know if the data is present?
• Where do we look?

Direct Mapped Cache


• Location determined by the address
• Direct mapped: each memory block has only one possible cache location
• Cache line address = (Memory block address) modulo (# of lines in cache)
• The number of cache lines is a power of two.
• Low-order bits of the memory block address
determine the cache line address.
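
A minimal sketch of the mapping (the 8-line cache size is an illustrative assumption):

```python
# Direct-mapped placement: line = block_address mod num_lines.
# With num_lines a power of two, the modulo is just the low-order bits.
NUM_LINES = 8  # illustrative cache with 8 lines

for block in (0, 5, 8, 13, 21):
    line = block % NUM_LINES          # same as block & (NUM_LINES - 1)
    print(f"block {block:2d} -> cache line {line}")
```

Note how blocks 5 and 13 collide on line 5: many memory blocks share each cache line.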

Cache Line Tags


• How do we know which particular block is stored in a
cache location?
• Store memory block address in tag array
• Every cache line has a tag:
• All tags of cache are stored in tag “array”
• For a direct-mapped cache, the part of the memory block
address used to select a cache line (the index) need not be part of
the tag.

Valid Bits
• What if there is no meaningful data in a cache line? (e.g.,
at startup time)
• Add one valid bit per cache line.
• Valid bit: 1 = tag is for real data, 0 = junk tag
• Valid bit is initially 0 (at power-up).
• Hit condition:
• Relevant part of the address matches the tag of some cache line,
AND
• Valid bit is 1 for the matched cache line.

Cache Misses
• On a cache hit, the CPU proceeds normally
• On a cache miss, the controller will
• Stall the CPU pipeline
• Fetch the block from the next level of the hierarchy
• Instruction cache miss:
• Restart the instruction fetch
• Data cache miss:
• Complete the data access

Address Breakdown
• Example:
• Direct-mapped cache
• 64 cache lines, 16 bytes/line
• To what cache line number does address 1200 map?
• Memory block address = 1200/16 = 75
• Cache line # to be checked = 75 modulo 64 = 11

• Address fields: Tag = bits 31–10 (22 bits), Index = bits 9–4 (6 bits),
Offset = bits 3–0 (4 bits)
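
A minimal sketch of the field extraction for this example (using the 22/6/4-bit split above):

```python
# Split a 32-bit address into tag / index / offset for a direct-mapped
# cache with 64 lines of 16 bytes (6 index bits, 4 offset bits).
OFFSET_BITS = 4   # 16 bytes/line = 2**4
INDEX_BITS  = 6   # 64 lines      = 2**6

def split_address(addr):
    offset = addr & ((1 << OFFSET_BITS) - 1)
    index  = (addr >> OFFSET_BITS) & ((1 << INDEX_BITS) - 1)
    tag    = addr >> (OFFSET_BITS + INDEX_BITS)
    return tag, index, offset

tag, index, offset = split_address(1200)
print(tag, index, offset)   # -> 1 11 0: block 75 maps to line 11
```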

Fully Associative Caches


• A memory block can be placed anywhere in the cache.
• If we repeat the previous example using a fully-associative cache:
• The access to location 260 did not need to bump a cache line out of the cache
• There were plenty of empty (invalid) cache lines.
• Important: a fully associative cache has no notion of a cache
line index; all lines have the same index = 0.

Fully Associative Caches In Practice


• Requires all line tags to be searched in parallel ➔
• One tag comparator per cache line (very expensive for realistic cache sizes)
• Not really used in processors.
• Value of fully-associative caches:
• In simulation, to provide good miss rate estimates (for comparison purposes).
• The same principle is used in some CPU structures.

Set Associative Caches


• n-way set associative
• Cache is organized as an array of sets × ways
• Each set contains n lines (one per way)
• Can be viewed as:
• n ways, each way like a direct-mapped cache
• s sets, each set like a fully-associative cache
• Number of cache lines = n × s
• The memory block # determines which cache set is searched:
• Set = (Memory block #) modulo (# of sets in cache)
• # of sets = (cache size / line size) / n
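
A minimal sketch of the set computation (the cache and line sizes below are illustrative assumptions):

```python
# Set-associative placement: set = block_number mod num_sets,
# where num_sets = (cache_size / line_size) / n.
CACHE_SIZE = 1024   # bytes (illustrative)
LINE_SIZE  = 16     # bytes (illustrative)
N_WAYS     = 4

num_sets = (CACHE_SIZE // LINE_SIZE) // N_WAYS   # 64 lines / 4 ways = 16 sets

for block in (0, 15, 16, 31, 75):
    print(f"block {block:2d} -> set {block % num_sets}")
```

Blocks 0 and 16 map to the same set, but with 4 ways they can coexist without evicting each other.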

Cache Structure Example

• All caches are the same size (8 lines)

• Direct-mapped cache has 8 addresses (indices)
• Two-way set-associative has 8/2 = 4 addresses
• Fully associative cache has 8/8 = 1 address (no index needs to be specified).

Spectrum of Associativity
• For a cache with 8 lines

How Much Associativity


• Increased associativity decreases the miss rate,
• However, with diminishing returns on miss rate
• Example: system with a 64KB D-cache, 64-byte blocks, SPEC2000
• 1-way: 10.3%   2-way: 8.6%
• 4-way: 8.3%    8-way: 8.1%
• More hardware overhead
• Longer hit time
• More power consumption

4-Way Set Associative Cache Organization



Replacement Policy
• When a memory block is brought into an associative cache, the
controller needs to select which cache line to replace:
• Direct mapped:
• There is a single cache line corresponding to a memory block, so no
policy is needed.
• Set associative:
• A memory block maps to a cache set.
• If the set has an unused line (valid bit is 0), put the memory block
there.
• Otherwise, choose a “victim” line from the lines in the set.
• Needs a replacement policy (i.e., an algorithm)

Choosing a Line to Replace


• Commonly-used policies:
• Least-recently used (LRU)
• Choose the line not accessed for the longest time
• Implementation gets harder as associativity increases
• Counters updated once per access (hit or miss)
• Round Robin (RR)
• A modulo counter per set, incremented on every miss.
• Replace the set member that the counter points to.
• Counter updated once per miss.
• Random
• Often based on a pseudo-random (hash) function of the address.

Pseudo-LRU
• Real LRU:
• Sort all lines in a cache set in a total (linear) order (needs log2(S)
bits per line to store the order)
• Pseudo-LRU:
• Only remember the line that was accessed most recently
• On a miss, select one of the other S−1 lines for replacement
(some pseudo-random choice).
• Needs one ordering bit per line.
• Easier to implement in hardware.
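
A minimal software sketch of true LRU for one cache set (a model of the policy, not of the hardware counters above; the LRUSet class and its sizes are illustrative):

```python
from collections import OrderedDict

# True LRU bookkeeping for one cache set of `ways` lines.
# Keys are memory block numbers; insertion order tracks recency.
class LRUSet:
    def __init__(self, ways):
        self.ways = ways
        self.lines = OrderedDict()       # block -> None, oldest first

    def access(self, block):
        if block in self.lines:          # hit: mark as most recently used
            self.lines.move_to_end(block)
            return "hit"
        if len(self.lines) >= self.ways: # miss with full set: evict LRU line
            self.lines.popitem(last=False)
        self.lines[block] = None
        return "miss"

s = LRUSet(ways=2)
for b in (0, 8, 0, 16, 8):               # block 8 is evicted, then re-fetched
    print(b, s.access(b))
```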

Replacement Policies Comparison

Policy       Miss Rate   Hardware Cost     Power Usage
LRU          Best        Highest           Highest
Pseudo-LRU   Good        Moderate          Moderate
RR           Good        Less expensive    Low
Random       OK          Least expensive   Lowest

Handling Writes
• Cache writes impose a requirement:
• The cache and memory copies of the same variable
(e.g., X) must be kept consistent!
• Techniques for handling:
• Write hits
• Write misses

Write Hit Option 1: Write-Through


• The write-through policy updates memory and the cache simultaneously.
• The write to memory takes longer to complete.
• Possible solution: a write buffer
• Holds data waiting to be written to memory
• The CPU continues immediately
• The CPU only stalls on a write if the write buffer is already full

• When writes are sparse enough, write data is sent to memory and the CPU can
keep running.
• Otherwise, writes can “clog” the cache-memory interface and cause the CPU to
stall.
Write Hit Option 2: Write-Back

• Alternative: on a data-write hit, just update the line in the cache.

• Every cache line has a “dirty” bit.
• Keeps track of whether any byte in the line was modified (written).
• When a line is selected for replacement:
• Write it back to memory if the dirty bit is set.
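
A minimal sketch of the dirty-bit bookkeeping (a software model; the names write_hit and replace are illustrative, not a real controller interface):

```python
# Write-back bookkeeping for one cache line: writes only touch the cache,
# and modified data is flushed to memory when the line is replaced.
class Line:
    def __init__(self):
        self.valid = False
        self.tag = None
        self.dirty = False

def write_hit(line):
    line.dirty = True                    # cached copy updated; memory is now stale

def replace(line, new_tag, memory_writeback):
    if line.valid and line.dirty:
        memory_writeback(line.tag)       # flush modified data before reuse
    line.tag, line.valid, line.dirty = new_tag, True, False

l = Line()
replace(l, 75, lambda t: print("writing back block", t))
write_hit(l)
replace(l, 11, lambda t: print("writing back block", t))  # flushes block 75
```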
Write Miss Option 1: Write Allocation

• What should happen on a write miss?

• Fetch the memory block from memory into the cache.
• Perform the write in the cache line once the block is there.
• Rationale:
• Subsequent read hits to the same cache line will benefit from
the line already being in the cache.
• Read misses cause more grief than write misses because the CPU
needs the data.

Write Miss Option 2: Write Bypass

• Don’t fetch the block.
• Data is NOT written to the cache; it is written directly to memory.
• Rationale: written data (hit or miss) will eventually have to be
written to memory.
• It might then make sense not to waste time fetching the data
from memory, knowing that the write will cost us some
cycles sooner or later.

Selecting Cache Write Options

• There are pros and cons for the different options.
• Analyzing the program workload will help select the
best option.
• Some CPUs allow the cache to be configured (by
software) to use different options at different
times.

Main Memory Supporting Caches


• Use DRAMs for main memory
• Fixed width (e.g., 1 word)
• Connected by a fixed-width clocked bus
• Bus clock is typically slower than the CPU clock
• Example cache block read:
• 32-byte block, 8-byte-wide CPU/DRAM interface
• 1 bus cycle for the address transfer
• 100 bus cycles for the initial DRAM access
• 5 bus cycles per data transfer (32/8 = 4 transfers)
• Miss penalty = 1 + 100 + 4×5 = 121 bus cycles
• Bandwidth = 32 bytes / 121 cycles ≈ 0.26 B/cycle
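
The same arithmetic as a minimal sketch:

```python
# Miss penalty for the block read above: address cycle + initial DRAM
# access + one transfer per bus-width chunk of the block.
BLOCK_BYTES   = 32
BUS_BYTES     = 8
ADDR_CYCLES   = 1
ACCESS_CYCLES = 100
XFER_CYCLES   = 5

transfers    = BLOCK_BYTES // BUS_BYTES                  # 4 transfers
miss_penalty = ADDR_CYCLES + ACCESS_CYCLES + transfers * XFER_CYCLES
print(miss_penalty)                                      # 121 bus cycles
print(BLOCK_BYTES / miss_penalty)                        # ~0.26 bytes/cycle
```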

Average Access Time


• Hit time is also important for performance.
• Average Memory Access Time (AMAT)
= Hit time + Miss rate × Miss penalty
• Example:
• CPU with a 1 ns clock, hit time = 1 CPU cycle, miss penalty = 20 CPU cycles,
cache miss rate = 5%
• AMAT = 1 + 0.05 × 20 = 2 ns = 2 CPU cycles
• Example:
• CPU with a 1 ns clock, hit time = 1 cycle, miss penalty = 200 cycles, cache
miss rate = 5%
• AMAT = 1 + 0.05 × 200 = 11 ns
• A cache with a lower miss rate may not be the best choice if its hardware
complexity causes its hit time to be high.
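
A minimal sketch of the AMAT formula applied to both examples:

```python
# AMAT = hit_time + miss_rate * miss_penalty (all times in CPU cycles;
# at a 1 ns clock, cycles and ns coincide).
def amat(hit_time, miss_rate, miss_penalty):
    return hit_time + miss_rate * miss_penalty

print(amat(1, 0.05, 20))    # 2.0 cycles  = 2 ns
print(amat(1, 0.05, 200))   # 11.0 cycles = 11 ns
```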

Cache Performance Example


• Given: CPI with no misses = 1.3
• Cache miss rate m = 0.03 (3%)
• Cache miss penalty M = 100 CPU cycles
• Find the CPI when memory stalls are taken into consideration:
CPI = base CPI (no misses) + m × M
    = 1.3 + 0.03 × 100 = 4.3
• CPI summary: Ideal = 1; No misses = 1.3; Overall = 4.3
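
The same calculation as a minimal sketch (treating the 3% miss rate as misses per instruction, as the slide does):

```python
# Effective CPI = base CPI + misses/instruction * miss penalty.
def effective_cpi(base_cpi, miss_rate, miss_penalty):
    return base_cpi + miss_rate * miss_penalty

print(effective_cpi(1.3, 0.03, 100))   # 4.3
```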

Performance Summary
• As CPU performance increases:
• The miss penalty becomes more significant
• Decreasing the base CPI:
• A greater proportion of time is spent on memory stalls
• Increasing the clock rate:
• Memory stalls account for more CPU cycles
• Can’t neglect cache behavior when evaluating system
performance

Types of Misses (Tentative Reading)


• Compulsory misses: due to accessing a new address for the first time.
• Capacity misses: due to the cache being smaller than the working set of
addresses.
• Conflict misses: due to uneven distribution of accesses amongst
cache sets; some sets get a larger share of accesses than others.




Two Level Caches (Tentative Reading)


• If there are two levels of caches, L1 and L2, then the relation between
their contents can be:
• Inclusive: everything in L1 must also be in L2.
• Exclusive: any data in L1 cannot be in L2, and vice versa.
• Let it go (laissez faire): no rules applied.




Multilevel On-Chip Caches
