Chapter 4:

Cache Memory
13th March 2013
Characteristics of Computer Memory Systems

2
Location
• refers to whether memory is internal or
external to the computer.
– Cache is a form of internal memory.
– External memory consists of peripheral storage
devices, such as disk and tape, that are accessible
to the processor via I/O controllers.

3
Capacity
• the amount of information that can be
contained in a memory unit.
– usually in terms of words or bytes
• For internal memory, this is typically
expressed in terms of bytes (1 byte=8 bits)
or words.
• External memory capacity is typically
expressed in terms of bytes.

4
Unit of Transfer
• The number of data elements transferred
at a time
– Usually words in main memory and blocks in
secondary memory

5
Access Methods - how memory
contents are accessed
• Sequential access
– Data does not have a unique address
– Must read all data items in sequence until the desired item
is found
– Access times are highly variable
– Example: tape drive units
• Direct access
– Data items have unique addresses
– Access is by jumping to vicinity plus sequential search
– access time is again variable
– Example: disk drives
6
Access Methods (cont.)
• Random access
– Each location has a unique physical address
– Locations can be accessed in any order and all access times are the same
– Example: Main memory
• Associative
– A variation of random access memory
– Data items are accessed based on their contents rather than their actual
location
– Search all data items in parallel for a match to a given search pattern
– All memory locations searched in parallel without regard to the size of
the memory
• Extremely fast for large memory sizes
– Cost per bit is 5-10 times that of a “normal” RAM cell
– Example: some cache memory units

7
Performance
• Access time (latency)
– For random-access memory, this is the time it takes
to perform read or write operation, that is, the time
from the instant that an address is presented to the
memory to the instant that data have been stored or
made available for use.
– For non-random-access memory, access time is the
time it takes to position the read–write mechanism
at the desired location.

8
Performance (cont.)
• Memory Cycle time
– Access time plus any other time required before
a second access can be started
• Transfer rate
– rate at which data can be transferred into or out of a
memory unit.

9
Physical Types
• Semiconductor
– RAM
• Magnetic
– Disk & Tape
• Optical
– CD & DVD
• Magneto-optical

10
Physical Characteristics
• Volatile
– information decays naturally or is lost when
electrical power is switched off.
• Nonvolatile
– information once recorded remains without
deterioration until deliberately changed; no
electrical power is needed to retain information.
– Magnetic-surface memories are nonvolatile
• Semiconductor memory may be either volatile
or nonvolatile.
11
Organisation
• The physical arrangement of bits to form words
• The obvious arrangement is not always used

12
Objective of Memory System

• Major design objective of any memory system


– to provide adequate storage capacity
– at an acceptable level of performance
– at a reasonable cost
• There is a trade-off among the three key
characteristics of memory: cost, capacity, and
access time.
– Faster access time – greater cost per bit
– Greater capacity – smaller cost per bit
– Greater capacity – slower access time
13
Dilemma of Designer
• The designer would like to use memory
technologies that provide
– large-capacity memory, because the capacity is needed and
the cost per bit is low.
– However, to meet performance requirements, the
designer needs to use expensive, relatively lower-
capacity memories with short access times.
• The way out of this dilemma is not to rely on a
single memory component or technology.
• Employ a memory hierarchy.
14
Memory Hierarchy

15
Memory Hierarchy (cont.)
• Basis of the memory hierarchy
– Registers internal to the CPU for temporary data
storage (small in number but very fast)
– External storage for data and programs (relatively
large and fast)
– External permanent storage (much larger and much
slower)

16
Memory Hierarchy (cont.)
• Characteristics of the memory hierarchy
– Consists of distinct “levels” of memory components
– Each level characterized by its size, access time, and
cost per bit
– Each increasing level in the hierarchy consists of
modules of larger capacity, slower access time, and
lower cost/bit
• Goal of the memory hierarchy
– Try to match the processor speed with the rate of
information transfer from the lowest element in the
hierarchy

17
Memory Hierarchy (cont.)
• As one goes down the hierarchy, the following
occur:
a. Decreasing cost per bit
b. Increasing capacity
c. Increasing access time
d. Decreasing frequency of access of the
memory by the processor
• Thus, smaller, more expensive, faster memories
are supplemented by larger, cheaper, slower
memories.
• The key to the success of this organization is item
(d): decreasing frequency of access.

18
Locality of Reference
• The memory hierarchy works because of locality of
reference
– Memory references made by the processor, for both
instructions and data, tend to cluster together
– Programs contain iterative loops and subroutines -
once a loop or subroutine is entered, there are
repeated references to a small set of instructions
– Operations on tables and arrays involve access to a
clustered set of data words
– Keep these clusters in high speed memory to reduce
the average delay in accessing data
– Over time, the clusters being referenced will change --
memory management must deal with this
19
Cache Memory
• Cache memory is a critical component of the
memory hierarchy
• Placed between the processor and main memory
– Compared to the size of main memory, cache is
relatively small
• Operates at or near the speed of the processor
• Very expensive compared to main memory
• Cache contains a copy of portions of main
memory.
• Located either on the processor chip or on a
separate module
20
Cache and Main Memory

21
Overview of Cache operation
• Processor requests the contents of some memory
location
• The cache is checked for the requested data
– If found, the requested word is delivered to the
processor (fast)
– If not found, a block of main memory is first read into
the cache, then the requested word is delivered to the
processor
• Because of the phenomenon of locality of
reference, when a block of data is fetched into the
cache to satisfy a single memory reference, it is
likely that there will be future references to that
same memory location or to other words in the
block.
22
Multiple levels of Cache
• L2 cache is slower and typically larger than the
L1 cache.
• L3 cache is slower and typically larger than the
L2 cache.

23
Cache/Main Memory Structure

24
Cache/Main Memory Structure
• Memory locations are grouped into blocks of
2^n locations where n represents the number
of bits used to identify a word within a block.
– Example: for n = 2, each block of
memory contains 2^2 = 4 memory locations.
• Cache is organized into lines, each of which
contains enough space to store exactly one
block of data and a tag that uniquely identifies
where that block came from in memory.

25
Cache Read Operation

26
Cache Read Operation
• The processor generates the read address (RA)
of a word to be read.
• If the word is contained in the cache, it is
delivered to the processor.
• Otherwise, the block containing that word is
loaded into the cache, and the word is delivered
to the processor.
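As a rough illustration of this read flow, here is a minimal Python sketch that models the cache as a dictionary of blocks; the block size, function names, and memory layout are illustrative assumptions, not part of the slides.

# Minimal sketch of the cache read flow described above (illustrative names).
BLOCK_SIZE = 4                      # words per block (assumption)
cache_lines = {}                    # block number -> list of words currently cached

def fetch_block_from_memory(main_memory, block_no):
    """Read one whole block from (simulated) main memory."""
    start = block_no * BLOCK_SIZE
    return main_memory[start:start + BLOCK_SIZE]

def read_word(main_memory, address):
    block_no = address // BLOCK_SIZE
    offset = address % BLOCK_SIZE
    if block_no not in cache_lines:              # miss: load the whole block first
        cache_lines[block_no] = fetch_block_from_memory(main_memory, block_no)
    return cache_lines[block_no][offset]         # word delivered to the processor

# e.g. read_word(list(range(64)), 9) -> 9 (first access misses, later ones in the block hit)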

27
Typical Cache Organization

28
Typical Cache Organization
• the cache connects to the processor via data, control,
and address lines.
• The data and address lines also attach to data and
address buffers, which attach to a system bus from which
main memory is reached.
• When a cache hit occurs, the data and address buffers
are disabled and communication is only between
processor and cache, with no system bus traffic.
– Cache hit – a reference to data that is in the cache
• When a cache miss occurs, the desired address is loaded
onto the system bus and the data are returned through
the data buffer to both the cache and the processor.
– Cache miss – a reference to data that is not resident in cache,
but is resident in main memory.
29
Elements of Cache Design

30
Cache Addresses
• Where does cache sit?
– Between processor and virtual memory management unit (MMU)
– Between MMU and main memory
• Logical cache (virtual cache) stores data using virtual addresses
– Processor accesses cache directly without going through the MMU.
– Advantage: cache access speed is faster as the addresses do not have to
be translated by the MMU.
– Disadvantage: virtual addresses use the same address space for different
applications – each application sees a block of memory starting at 0.
• The same virtual address in two applications usually refers to different
physical addresses.
• So either flush the cache with each application context switch, or add extra bits
to each line of the cache to identify which virtual address space the line refers to.
• Physical cache stores data using main memory physical addresses

31
Logical and Physical Caches

32
Cache Size
• Small enough so overall cost/bit is close to that
of main memory
• Large enough so overall average access time is
close to that of the cache alone
• Large caches tend to be slightly slower than
small caches

33
Cache Mapping Schemes
• When accessing data or instructions, the CPU first
generates a main memory address.
• If the data has been copied to cache, the address of the
data in cache is not the same as the main memory
address.
• For example, data located at main memory address 2E3
could be located in the very first location in cache.
• How, then, does the CPU locate data when it has been
copied into cache?
– The CPU uses a specific mapping scheme that
"converts" the main memory address into a cache
location.
34
Cache Mapping Schemes (cont.)
• The address conversion is done by giving special
significance to the bits in the main memory
address.
• First divide the bits into distinct groups called
fields.
• Depending on the mapping scheme, we may have
two or three fields.
• How we use these fields depends on the particular
mapping scheme being used.
• The mapping scheme determines where the data is
placed when it is originally copied into cache and
also provides a method for the CPU to find
previously copied data when searching cache
35
How data is copied into cache?
• Main memory and cache are both divided into the same
size blocks (the size of these blocks varies).
• When a memory address is generated, cache is searched
first to see if the required word exists there.
• When the requested word is not found in cache, the
entire main memory block in which the word resides is
loaded into cache.
• Because of the principle of locality of reference: if a word
was just referenced, there is a good chance that words in
the same general vicinity will soon be referenced as well.
• Therefore, one missed word often results in several
found words.

36
How data is copied into cache? (cont.)

• For example
when you are in the kitchen and you first need
tools, you have a "miss" and must go to the kitchen
store. If you gather up a set of tools that you might
need and return to the kitchen, you hope that you'll
have several "hits" while working in your kitchen
and don't have to make many more trips to the
kitchen store. Because accessing a cache word (a
tool already in the kitchen) is faster than accessing
a main memory word (going to the kitchen store
again!), cache memory speeds up the overall
access time.

37
Cache Mapping Schemes (cont.)
• How do we use fields in the main memory address?
• One field of the main memory address points to a
location in cache in which the data resides if it is resident
in cache (cache hit), or where it is to be placed if it is not
resident (cache miss).
• The cache block referenced is then checked to see if it is
valid.
• This is done by associating a valid bit with each cache
block.
• A valid bit of 0 means the cache block is not valid (cache
miss) and we must access main memory.
• A valid bit of 1 means it is valid (cache hit but we need to
complete one more step before we know for sure).
38
Cache Mapping Schemes (cont.)
• Then compare the tag in the cache block to the tag field
of the address.
– The tag is a special group of bits derived from the main
memory address that is stored with its corresponding block
in cache.
• If the tags are the same, then we have found the desired
cache block (cache hit).
• At this point we need to locate the desired word in the
block; this can be done using a different portion of the
main memory address called the word field.
• All cache mapping schemes require a word field;
however, the remaining fields are determined by the
mapping scheme.
• Three main cache mapping schemes are:
– direct, associative, and set associative
39
Direct Mapping
• This is the simplest form of mapping.
• One block from main memory maps into only one
possible line of cache memory.
• As there are more blocks of main memory than there
are lines of cache, many blocks in main memory can
map to the same line in cache.
• The mapping is expressed as i = j modulo m
where i = cache line number,
j = main memory block number,
m = number of lines in the cache

40
Direct Mapping (cont.)
• Example: if cache contains 10 blocks/lines, then
– main memory block 0 maps to cache block 0,
– main memory block 1 maps to cache block 1, . . . ,
– main memory block 9 maps to cache block 9, and
– main memory block 10 maps to cache block 0.
• Thus, main memory blocks 0 and 10 (and 20, 30,
and so on) all compete for cache block 0.
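The i = j modulo m rule can be checked with a couple of lines of Python, assuming the 10-line cache used in the example above:

m = 10                                   # number of cache lines (from the example)
for j in (0, 1, 9, 10, 20, 30):          # main memory block numbers
    print(f"block {j} -> cache line {j % m}")
# blocks 0, 10, 20, 30 all map to cache line 0, as stated above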

41
Direct
Mapping of
Main Memory
Blocks to
Cache Blocks

42
Direct Mapping (cont.)
• If main memory blocks 0 and 10 both map to cache
block 0, how does the CPU know which block
actually resides in cache block 0 at any given time?
• The answer is that each block is copied to cache and
identified by the tag.
• If we take a closer look at cache, we see that it stores
more than just the data copied from main memory

43
Direct Mapping (cont.)
• There are two valid cache blocks.
• Block 0 contains multiple words from main memory,
identified using the tag "00000000".
• Block 1 contains words identified using tag
"11110101".
• The other two cache blocks are not valid.
• To perform direct mapping, the binary main memory
address is partitioned into fields

44
Direct Mapping (cont.)
• The size of each field depends on the physical
characteristics of main memory and cache.
• The word field (sometimes called the offset field)
uniquely identifies a word from a specific block;
therefore, it must contain the appropriate number of
bits to do this.
• This is also true of the block field: it must select a
unique block of cache.
• The tag field is whatever is left over.
• When a block of main memory is copied to cache, this
tag is stored with the block and uniquely identifies this
block.
• The total of all three fields must, of course, add up to
the number of bits in a main memory address.
45
Direct Mapping - Example
• Assume memory consists of 2^14 words, cache has 16 blocks, and
each block has 8 words.
• From this we determine that memory has 2^11 blocks.
• We know that each main memory address requires 14 bits.
• Of this 14-bit address field, the rightmost 3 bits reflect the word
field (we need 3 bits to uniquely identify one of 8 words in a
block).
• We need 4 bits to select a specific block in cache, so the block
field consists of the middle 4 bits. The remaining 7 bits make up
the tag field.
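A small sketch of how these field widths follow from the sizes given (the values come from the slide; the variable names are mine):

import math

memory_words = 2 ** 14      # main memory size in words
cache_blocks = 16           # lines in the cache
words_per_block = 8

word_bits  = int(math.log2(words_per_block))     # 3 bits to pick a word in a block
block_bits = int(math.log2(cache_blocks))        # 4 bits to pick a cache line
addr_bits  = int(math.log2(memory_words))        # 14-bit main memory address
tag_bits   = addr_bits - block_bits - word_bits  # 7 bits left over for the tag
print(word_bits, block_bits, tag_bits)           # 3 4 7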

46
Direct Mapping – Example (cont.)
• As mentioned previously, the tag for each
block is stored with that block in the cache.
• In this example, because main memory blocks
0 and 16 both map to cache block 0, the tag
field would allow the system to differentiate
between block 0 and block 16.
• The binary addresses in block 0 differ from
those in block 16 in the upper leftmost 7 bits,
so the tags are different and unique.
• To see how these addresses differ, let's look at
a smaller, simpler example in the next slide.
47
Direct Mapping – Example of Main Memory Mapped
to Cache
• Suppose we have 16 = 2^4 words of main memory divided
into 8 blocks (so each block has 2 words).
• Assume the cache is 4 blocks in size (for a total of 8
words).

48
Direct Mapping – Example of Main Memory Mapped
to Cache (cont.)

• We know:
– A main memory address has 4 bits (because there are 2^4 =
16 words in main memory).
– This 4-bit main memory address is divided into three
fields:
– The word field is 1 bit (we need only 1 bit to differentiate
between the two words in a block);
– the block field is 2 bits (we have 4 blocks in cache and
need 2 bits to uniquely identify each block); and
– the tag field has 1 bit (this is all that is left over).

49
Direct Mapping – Example of Main Memory
Mapped to Cache (cont.)
• Suppose the CPU generates the main memory
address 9.
• We can see from the mapping listing that
address 9 is in main memory block 4 and
should map to cache block 0
– which means the contents of main memory block
4 should be copied into cache block 0.
• The computer, however, uses the actual main
memory address to determine the cache
mapping block.
50
Direct Mapping – Example of Main Memory Mapped to
Cache (cont.)

Main Memory Address 9 = 1001 Split into Fields

When the CPU generates this address, it first takes the block field bits
00 and uses these to direct it to the proper block in cache. 00
indicates that cache block 0 should be checked.

If the cache block is valid, it then compares the tag field value of 1
(in the main memory address) to the tag associated with cache
block 0.
51
Direct Mapping – Example of Main Memory
Mapped to Cache (cont.)
• If the cache tag is 1, then block 4 currently resides in cache block
0.
• If the tag is 0, then block 0 from main memory is located in block
0 of cache. (To see this, compare main memory address 9 =
1001, which is in block 4, to main memory address 1 = 0001,
which is in block 0. These two addresses differ only in the
leftmost bit, which is the bit used as the tag by the cache.)
• Assuming the tags match, which means that block 4 from main
memory (with addresses 8 and 9) resides in cache block 0, the
word field value of 1 is used to select one of the two words
residing in the block.
• Because the bit is 1, we select the word with offset 1, which
results in retrieving the data copied from main memory address
9.
52
Direct Mapping – More Example
• Suppose the CPU now generates address 4 =
0100.
• The middle two bits (10) direct the search to
cache block 2.
• If the block is valid, the leftmost tag bit (0)
would be compared to the tag bit stored with
the cache block.
• If they match, the first word in that block (of
offset 0) would be returned to the CPU.
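Both lookups above (address 9 = 1001 and address 4 = 0100) amount to the same bit slicing; a Python sketch of that arithmetic, using illustrative names:

def split_direct(address, word_bits=1, block_bits=2):
    """Split a main memory address into (tag, block, word) fields."""
    word  = address & ((1 << word_bits) - 1)
    block = (address >> word_bits) & ((1 << block_bits) - 1)
    tag   = address >> (word_bits + block_bits)
    return tag, block, word

print(split_direct(0b1001))   # (1, 0, 1): tag 1, cache block 0, word offset 1
print(split_direct(0b0100))   # (0, 2, 0): tag 0, cache block 2, word offset 0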
53
Direct Mapping Cache Organization

54
Direct Mapping Summary

55
Pros & Cons of Direct Mapping
• Relatively inexpensive to implement
– does not require any searching. Each main memory block has a
specific location to which it maps in cache; when a main memory
address is converted to a cache address, the CPU knows exactly
where to look in the cache for that memory block by simply
examining the bits in the block field.
• If a block already occupies the cache location where a new block
must be placed, the block currently in cache is removed (it is
written back to main memory if it has been modified or simply
overwritten if it has not been changed).
• Blocks constantly swapped in and out of cache cause the hit ratio
to be low.

56
Fully Associative Mapping
• Instead of specifying a unique location for each main
memory block, we can allow a main memory block to
be placed anywhere in cache.
• The only way to find a block mapped this way is to
search all of cache.
• the main memory address is partitioned into two
fields, the tag and the word.

57
Fully Associative Mapping - Example
• For example, memory with 2^14 words, a cache with 16
blocks, and blocks of 8 words
– word field is still 3 bits, but now the tag field is 11 bits.
– This tag must be stored with each block in cache.
– When the cache is searched for a specific main memory
block, the tag field of the main memory address is compared
to all the valid tag fields in cache; if a match is found, the
block is found. (Remember, the tag uniquely identifies a main
memory block.) If there is no match, we have a cache miss
and the block must be transferred from main memory.
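A minimal sketch of this associative lookup, modeling the parallel tag comparison with a dictionary keyed by tag (the names and the dictionary model are assumptions for illustration):

# Fully associative lookup: the tag alone identifies the block (3-bit word field).
WORD_BITS = 3
cache = {}   # tag -> block data; stands in for "compare against all valid tags"

def associative_read(main_memory, address):
    tag = address >> WORD_BITS                    # 11-bit tag in this example
    offset = address & ((1 << WORD_BITS) - 1)
    if tag not in cache:                          # miss: no tag matched
        start = tag << WORD_BITS
        cache[tag] = main_memory[start:start + (1 << WORD_BITS)]
    return cache[tag][offset]                     # hit: word returned from the block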

58
Fully Associative Cache Organization

59
Associative Mapping Summary

60
Pros & Cons of Associative Mapping
• Flexible
– flexibility as to which block to replace when a new block is
read into the cache.
• When cache is full, we need a replacement algorithm to decide
which block we wish to throw out of cache (the victim block).
• Quite expensive
– a single search must compare the requested tag to all tags in
cache to determine whether the desired data block is present in
cache.
• Slow process for large caches!
– Requires special hardware to allow associative searching.

61
Set Associative Mapping
• N-way set associative cache mapping, a
combination of two mapping approaches
• This scheme is similar to direct mapped cache
– use the address to map the block to a certain
cache location.
– The important difference is that instead of
mapping to a single cache block, an address maps
to a set of several cache blocks.
– All sets in cache must be the same size. This size
can vary from cache to cache

62
Set Associative Mapping - Example
• In 2-way set associative cache, there are two cache blocks per
set.
• set 0 contains two blocks, one that is valid and holds the data A,
B, C, . . . , and another that is not valid.
• The same is true for Set 1.
• Set 2 and Set 3 also hold two blocks, but currently, only the
second block is valid in each set.
• In 8-way set associative cache, there are 8 cache blocks per set.

63
Memory Address Format for Set Associative
Mapping

• The main memory address is partitioned into
three fields: the tag field, the set field, and the
word field.
• The tag and word fields assume the same roles
as before; the set field indicates into which
cache set the main memory block maps.

64
Memory Address Format for Set Associative
Mapping
• Suppose we are using 2-way set associative mapping
with a main memory of 2^14 words, a cache with 16
blocks, where each block contains 8 words.
• If cache consists of a total of 16 blocks, and each set
has 2 blocks, then there are 8 sets in cache.
• Therefore, the set field is 3 bits, the word field is 3
bits, and the tag field is 8 bits.
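The corresponding field split can again be sketched in a few lines of Python (the field widths come from the slide; the example address and names are mine):

WORD_BITS, SET_BITS = 3, 3          # 8 words/block, 8 sets (16 blocks / 2 ways)

def split_set_assoc(address):
    """Split a 14-bit address into (tag, set, word) fields for 2-way mapping."""
    word = address & ((1 << WORD_BITS) - 1)
    s    = (address >> WORD_BITS) & ((1 << SET_BITS) - 1)
    tag  = address >> (WORD_BITS + SET_BITS)      # 8-bit tag for a 14-bit address
    return tag, s, word

print(split_set_assoc(9000))        # e.g. address 9000 -> (tag=140, set=5, word=0)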

65
N-Way Set Associative Cache Organization

66
Replacement Algorithms
• Once the cache has been filled, when a new
block is brought into the cache, one of the
existing blocks must be replaced.
• For direct mapping:
– there is only one possible action.
– The existing block is kicked out of cache to make
room for the new block.
– there is no need for a replacement algorithm
because the location for each new block is
predetermined

67
Replacement Algorithms (cont.)
• For associative and set associative:
– When an associative cache or a set associative
cache set is full, which line should be replaced by
the new line that is to be read from memory?
– a replacement algorithm is needed.
– To achieve high speed, such an algorithm must be
implemented in hardware.

68
Replacement Algorithms for associative and set
associative

• Least recently used (LRU)
• First in first out (FIFO)
• Least frequently used (LFU)
• Random

69
Least recently used (LRU) Algorithm

• The cache block that has been least recently used is
selected for replacement.
• Among these replacement techniques, the LRU
technique is the most effective. This is because the
history of block usage (as the criterion for
replacement) is taken into consideration.
• The LRU algorithm requires the use of a cache
controller circuit that keeps track of references to all
blocks while residing in the cache.
• This can be achieved with the use of counters. Each
cache block is assigned a counter.
70
Least recently used (LRU) Algorithm(cont.)

• Upon a cache hit
– the counter of the corresponding block is set to 0, all other counters
having a smaller value than the original value in the counter of the hit
block are incremented by 1, and
– all counters having a larger value are kept unchanged.
• Upon a cache miss,
– the block whose counter is showing the maximum value is chosen for
replacement, the counter is set to 0, and all other counters are
incremented by 1.
• For two-way set associative, this is easily implemented.
– Each line includes a USE bit.
– When a line is referenced, its USE bit is set to 1 and the USE bit of the other line in that set is
set to 0.
– When a block is to be read into the set, the line whose USE bit is 0 is used.
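A software sketch of the counter-based LRU scheme described above (this only mirrors the counting rules on the slide; a real cache controller implements them in hardware, and the names are illustrative):

# Counter-based LRU sketch: 0 = most recently used, larger = older.
counters = {}                                          # block id -> age counter

def reference(block, capacity=4):
    if block in counters:                              # cache hit
        old = counters[block]
        for b, c in counters.items():
            if c < old:
                counters[b] = c + 1                    # age only the younger blocks
        counters[block] = 0                            # hit block becomes most recent
    else:                                              # cache miss
        if len(counters) >= capacity:
            victim = max(counters, key=counters.get)   # largest counter = LRU block
            del counters[victim]
        for b in counters:
            counters[b] += 1                           # all residents get one reference older
        counters[block] = 0                            # new block is most recently used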

71
First in first out (FIFO) Algorithm

• The block that has been in cache the longest
(regardless of how recently it has been used)
is selected as the victim to be removed from
cache memory.
• FIFO is easily implemented as a round-robin or
circular buffer technique.
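A minimal FIFO replacement sketch using a queue to track arrival order (the capacity and names are illustrative):

from collections import deque

fifo = deque()        # order in which blocks entered the cache
resident = set()      # blocks currently in the cache
CAPACITY = 4

def fifo_insert(block):
    """Bring `block` into the cache, evicting the oldest resident if full."""
    if block in resident:
        return                      # a hit does not change FIFO order
    if len(resident) >= CAPACITY:
        victim = fifo.popleft()     # oldest block, regardless of recent use
        resident.discard(victim)
    fifo.append(block)
    resident.add(block)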

72
Least frequently used (LFU) and Random Algorithm

• LFU
– replace the block which has had the fewest hits
• Random
– select a victim at random
– sometimes throws out data that will be needed soon

The algorithm selected often depends on how the system will be
used. No single (practical) algorithm is best for all
scenarios. For that reason, designers use algorithms that
perform well under a wide variety of circumstances.
73
Write Policy
• When a block that is resident in the cache is to be
replaced, there are two cases to consider.
– If the old block in the cache has not been altered, then it
may be overwritten with a new block without first
writing out the old block.
– If at least one write operation has been performed on a
word in that line of the cache, then main memory must
be updated by writing the line of cache out to the block
of memory before bringing in the new block.
• Two techniques:
– Write through
– Write back
74
Write Policy
• Write through
– Anytime a word in cache is changed, it is also changed in
main memory
– Both copies always agree
– Generates lots of memory writes to main memory
• Write back
– During a write, only change the contents of the cache
– Update main memory only when the cache line is to be
replaced
– Causes “cache coherency” problems -- different values for
the contents of an address are in the cache and the main
memory
• portions of main memory are invalid, and hence accesses by I/O
modules can be allowed only through the cache.
75
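A rough sketch contrasting the two policies, using a dirty flag for write-back; the block size, names, and the no-write-allocate behaviour on a write-through miss are simplifying assumptions:

cache = {}            # block -> [data list, dirty flag]
memory = [0] * 64     # simulated main memory
BLOCK = 4             # words per block (assumption)

def write_through(address, value):
    """Every write updates main memory; the cache copy is updated if present."""
    b, off = divmod(address, BLOCK)
    if b in cache:
        cache[b][0][off] = value
    memory[address] = value                      # main memory is always current

def write_back(address, value):
    """Writes touch only the cache; memory is updated when the line is evicted."""
    b, off = divmod(address, BLOCK)
    if b not in cache:                           # write-allocate on a miss
        cache[b] = [memory[b * BLOCK:(b + 1) * BLOCK], False]
    cache[b][0][off] = value
    cache[b][1] = True                           # mark the line dirty

def evict(b):
    data, dirty = cache.pop(b)
    if dirty:                                    # write the line out only if modified
        memory[b * BLOCK:(b + 1) * BLOCK] = data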
Line Size
• How much data should be transferred from main
memory to the cache in a single memory reference
• As block size increases,
– Locality of reference predicts that the additional
information transferred will likely be used and thus
increases the hit ratio (good)
– Number of blocks in cache goes down, limiting the total
number of blocks in the cache (bad)
• As the block size gets big,
– the probability of referencing all the data in it goes down
(hit ratio goes down) (bad)
• A size of from 8 to 64 bytes seems reasonably close
to optimum
76
Number of Caches
• Multilevel caches
• Unified vs. Split Caches

77
Number of Caches – Multilevel caches
• Can have two levels of cache in the memory hierarchy.
• The level-1 cache (or L1 cache, or internal
cache) is smaller and faster, and lies on the
same chip as the processor.
• The level-2 cache (or L2 cache, or external
cache) is larger but slower, and lies outside the
processor.
• Memory access first goes to L1 cache. If L1
cache access is a miss, go to L2 cache. If L2
cache is a miss, go to main memory.
78
Number of Caches – Unified vs. Split
Caches
• Unified cache stores data and instructions in 1
cache
– Only 1 cache to design and operate
– Cache is flexible and can balance “allocation” of space
to instructions or data to best fit the execution of the
program -- higher hit ratio
• Split cache uses 2 caches -- 1 for instructions and
1 for data
– Must build and manage 2 caches
– Can outperform a unified cache in systems that support
parallel execution and pipelining (reduces cache
contention)
79
80
