

DIGITAL LOGIC AND

COMPUTER ORGANIZATION
Lecture 22: Caches (P3)
Measuring Performance
ELEC3010
ACKNOWLEDGEMENT

I would like to express my special thanks to Professor Zhiru Zhang, School of Electrical and Computer Engineering, Cornell University, and Professor Rudy Lauwereins, KU Leuven, for sharing their teaching materials.

2
COVERED IN THIS COURSE
Digital logic:
❑ Binary numbers and logic gates
❑ Boolean algebra and combinational logic
❑ Sequential logic and state machines
❑ Binary arithmetic
❑ Memories

Computer organization:
❑ Instruction set architecture
❑ Processor organization
❑ Caches and virtual memory
❑ Input/output
❑ Advanced topics
3
CAN YOU DO IT?
❑ Assuming 16-bit memory addresses, how many bits are
associated with the tag, index, and offset of the following
configurations for a 4-way set associative cache?

(a) 32 blocks, 4 bytes per block

(b) 16 blocks, 8 bytes per block
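One way to check your answers is the short sketch below (the helper name address_split is mine, not from the slides; it just applies the standard breakdown: offset bits from the block size, index bits from the number of sets, and the tag takes the rest):

```python
import math

def address_split(addr_bits, num_blocks, block_bytes, ways):
    """Tag/index/offset bit counts for a set-associative cache."""
    offset_bits = int(math.log2(block_bytes))   # byte within a block
    num_sets = num_blocks // ways               # blocks are grouped into sets
    index_bits = int(math.log2(num_sets))       # which set the block maps to
    tag_bits = addr_bits - index_bits - offset_bits
    return tag_bits, index_bits, offset_bits

# (a) 32 blocks, 4 bytes/block, 4-way: 8 sets
print(address_split(16, 32, 4, 4))   # (11, 3, 2)
# (b) 16 blocks, 8 bytes/block, 4-way: 4 sets
print(address_split(16, 16, 8, 4))   # (11, 2, 3)
```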

4
BLOCK REPLACEMENT POLICY
❑Direct mapped: no choice
❑Set associative and fully associative
▪ Pick non-valid entry, if there is one
▪ Otherwise, choose among entries in the set
❑Least recently used (LRU)
▪ Choose the block unused for the longest time (a minimal sketch follows this list)
▪ Requires extra bits to order the blocks
▪ High overhead beyond 4-way set associativity
❑Random
▪ Performs similarly to LRU at high associativity
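A minimal sketch of one LRU-managed set, using Python's OrderedDict for the recency bookkeeping (the class and method names are illustrative, not part of the course material):

```python
from collections import OrderedDict

class LRUSet:
    """One cache set managed with LRU replacement (illustrative only)."""

    def __init__(self, ways):
        self.ways = ways
        self.blocks = OrderedDict()  # tag -> data, ordered oldest -> newest

    def access(self, tag):
        if tag in self.blocks:
            self.blocks.move_to_end(tag)      # hit: mark as most recently used
            return "hit"
        if len(self.blocks) >= self.ways:
            self.blocks.popitem(last=False)   # set full: evict the LRU block
        self.blocks[tag] = None               # fill the block (data omitted)
        return "miss"

lru_set = LRUSet(ways=4)
for tag in [1, 2, 3, 4, 1, 5]:
    print(tag, lru_set.access(tag))  # the final access evicts tag 2, the LRU block
```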

5
LRU REPLACEMENT EXAMPLE

6
ANOTHER LRU REPLACEMENT EXAMPLE

7
WHAT ABOUT WRITES?
❑Where do we put the result of a store?
❑Cache hit (block is in cache)
▪ Write new data value to the cache
▪ Also write to memory (write through)
▪ Don’t write to memory (write back)
• Requires an additional dirty bit for each cache block
• Writes back to memory when a dirty cache block is evicted
❑Cache miss (block is not in cache)
▪ Allocate the block (bring it into the cache) and write to it (write allocate)
▪ Write to memory without allocating (no write allocate, or write around; see the sketch below)
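A minimal sketch of how these policy combinations differ on a store (the TinyCache class below is my own simplification, not the course's reference design; blocks are tracked at whole-block granularity and memory is just a dict):

```python
class TinyCache:
    """Sketch of how a store is handled under each write-policy combination."""

    def __init__(self, write_through=True, write_allocate=True):
        self.write_through = write_through
        self.write_allocate = write_allocate
        self.blocks = {}  # block address -> {"data": ..., "dirty": bool}

    def store(self, block_addr, data, memory):
        if block_addr in self.blocks:
            # Write hit: update the cache; write-through also updates memory,
            # write-back only marks the block dirty.
            self.blocks[block_addr] = {"data": data, "dirty": not self.write_through}
            if self.write_through:
                memory[block_addr] = data
        elif self.write_allocate:
            # Write miss with allocation: bring the block in, then write.
            # (A real cache would first fetch the rest of the block from memory.)
            self.blocks[block_addr] = {"data": data, "dirty": not self.write_through}
            if self.write_through:
                memory[block_addr] = data
        else:
            # Write miss without allocation ("write around"): memory only.
            memory[block_addr] = data

    def evict(self, block_addr, memory):
        # A write-back cache flushes dirty data only when the block is evicted.
        block = self.blocks.pop(block_addr)
        if block["dirty"]:
            memory[block_addr] = block["data"]
```

For example, with write_through=False the memory copy only becomes current when evict runs on a dirty block, which is exactly the behavior the write-back examples below step through.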
8
WRITE THROUGH EXAMPLE
❑Assume write allocate
❑Size of each block is 8 bytes
❑Cache holds 2 blocks
❑Memory holds 8 blocks
❑6-bit memory address
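With these parameters the 6-bit address splits as 2 tag bits, 1 index bit, and 3 offset bits: 3 offset bits for 8 bytes per block, 1 index bit for 2 cache blocks (assuming the example cache is direct mapped), and the remaining 6 − 1 − 3 = 2 bits form the tag.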

9
WRITE THROUGH EXAMPLE (CONTINUED)
(Step-by-step walkthrough in figures, slides 10–20.)

20
WRITE BACK EXAMPLE
❑Assume write allocate
❑Size of each block is 8 bytes
❑Cache holds 2 blocks
❑Memory holds 8 blocks
❑6-bit memory address

21
WRITE BACK EXAMPLE (CONTINUED)
(Step-by-step walkthrough in figures, slides 22–33.)

33
CACHE HIERARCHY
❑Time to get a block from memory is so long that
performance suffers even with a low miss rate

❑Example: 3% miss rate, 100 cycles to main memory
▪ 0.03 × 100 = 3 extra cycles on average to access instructions or data

❑Solution: Add another level of cache

34
PIPELINE WITH A CACHE HIERARCHY

35
THE MEMORY HIERARCHY

36
THE MEMORY HIERARCHY

37
CACHE HIERARCHY
❑Example: assume 1 cycle to access L1 (3% miss rate), 10
cycles to L2, 10% L2 miss rate, 100 cycles to main memory

❑How many cycles on average for instruction/data access?

1 + 0.03 × (10 + 0.1 × 100) = 1.6 cycles
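The same calculation written as a small helper, in case you want to try other parameters (a sketch; the function name is illustrative):

```python
def avg_access_cycles(l1_hit, l1_miss_rate, l2_access, l2_miss_rate, mem_access):
    """Average cycles per access for a two-level cache hierarchy."""
    l1_miss_penalty = l2_access + l2_miss_rate * mem_access  # cost of an L1 miss
    return l1_hit + l1_miss_rate * l1_miss_penalty

print(avg_access_cycles(1, 0.03, 10, 0.10, 100))  # 1.6
```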

38
HOW DO WE MEASURE PERFORMANCE?

❑Execution time: The time between the start and completion of a program (or task)

❑Throughput: Total amount of work done in a given time

❑ Improving performance means


▪ Reducing execution time, or
▪ Increasing throughput

39
CPU EXECUTION TIME

❑Amount of time the CPU takes to run a program


❑Derivation
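The standard result, which the next three slides break down into its factors, is:

CPU execution time = Instruction count (I) × Cycles per instruction (CPI) × Cycle time (CT) = (I × CPI) / clock frequency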

40
INSTRUCTION COUNT (I)

❑Total number of instructions in the given program


❑Factors
▪ Instruction set
▪ Mix of instructions chosen by the compiler

41
CYCLE TIME (CT)

❑Clock period (1/frequency)


❑Factors
▪ Instruction set
▪ Structure of the processor and memory hierarchy

42
CYCLES PER INSTRUCTION (CPI)

❑Average number of cycles required to execute each instruction
❑Factors
▪ Instruction set
▪ Mix of instructions chosen by the compiler
▪ Ordering of the instructions by the compiler
▪ Structure of the processor and memory hierarchy
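A quick way to see how an instruction mix turns into an average CPI, sketched below (the mix and the per-class CPI values are made-up numbers for illustration, not from the slides):

```python
def average_cpi(mix):
    """Weighted-average CPI from a mix: {class: (fraction, cycles_per_instr)}."""
    return sum(fraction * cpi for fraction, cpi in mix.values())

# Hypothetical instruction mix, for illustration only
mix = {"alu": (0.50, 1), "load": (0.20, 2), "store": (0.10, 2), "branch": (0.20, 2)}
print(average_cpi(mix))  # 0.5*1 + 0.2*2 + 0.1*2 + 0.2*2 = 1.5
```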

43
A ROUGH BREAKDOWN OF CPI

44
IMPACT OF L1 CACHES

❑ With L1 caches
– L1 instruction cache miss rate = 2%
– L1 data cache miss rate = 5%
– Miss penalty = 100 cycles (access main memory)
– 20% of all instructions are loads, 10% are stores

❑ CPI_memhier =
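One way to work this out (assuming every load and store accesses the data cache, and that miss cycles simply add to the base CPI):

CPI_memhier = 0.02 × 100 + (0.20 + 0.10) × 0.05 × 100 = 2.0 + 1.5 = 3.5 extra cycles per instruction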

45
IMPACT OF L1+L2 CACHES

❑ With L1 and L2 caches


– L1 instruction cache miss rate = 2%
– L1 data cache miss rate = 5%
– L2 access time = 15 cycles
– L2 miss rate = 25%
– L2 miss penalty = 100 cycles (access main memory)
– 20% of all instructions are loads, 10% are stores

❑ CPI_memhier =
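One way to work this out (assuming the 25% L2 miss rate applies to both instruction and data accesses, and that the 100-cycle penalty is paid on top of the 15-cycle L2 access):

CPI_memhier = 0.02 × (15 + 0.25 × 100) + (0.20 + 0.10) × 0.05 × (15 + 0.25 × 100) = 0.8 + 0.6 = 1.4 extra cycles per instruction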

47
PROCESSOR ORGANIZATION
IMPACT ON CPI (EXAMPLE 1)

49
PROCESSOR ORGANIZATION
IMPACT ON CPI (EXAMPLE 2)

50
COMPILER IMPACT ON CPI (EXAMPLE 3)

51
BEFORE NEXT CLASS

• Next time: Multicore

52
