

DIGITAL LOGIC AND

COMPUTER ORGANIZATION
Lecture 22: Caches (P3)
Measuring Performance
ELEC3010
ACKNOWLEDGEMENT

I would like to express my special thanks to Professor Zhiru Zhang, School of Electrical and Computer Engineering, Cornell University, and Professor Rudy Lauwereins, KU Leuven, for sharing their teaching materials.

2
COVERED IN THIS COURSE
Digital logic:
❑ Binary numbers and logic gates
❑ Boolean algebra and combinational logic
❑ Sequential logic and state machines
❑ Binary arithmetic
❑ Memories

Computer organization:
❑ Instruction set architecture
❑ Processor organization
❑ Caches and virtual memory
❑ Input/output
❑ Advanced topics
3
CAN YOU DO IT?
❑ Assuming 16-bit memory addresses, how many bits are
associated with the tag, index, and offset of the following
configurations for a 4-way set associative cache?

(a) 32 blocks, 4 bytes per block

(b) 16 blocks, 8 bytes per block
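One way to check your answers is the short sketch below (the helper name address_split is mine, not from the slides; it just applies the standard breakdown: offset bits from the block size, index bits from the number of sets, and the tag takes the rest):

```python
import math

def address_split(addr_bits, num_blocks, block_bytes, ways):
    """Tag/index/offset bit counts for a set-associative cache."""
    offset_bits = int(math.log2(block_bytes))   # byte within a block
    num_sets = num_blocks // ways               # blocks are grouped into sets
    index_bits = int(math.log2(num_sets))       # which set the block maps to
    tag_bits = addr_bits - index_bits - offset_bits
    return tag_bits, index_bits, offset_bits

# (a) 32 blocks, 4 bytes/block, 4-way: 8 sets
print(address_split(16, 32, 4, 4))   # (11, 3, 2)
# (b) 16 blocks, 8 bytes/block, 4-way: 4 sets
print(address_split(16, 16, 8, 4))   # (11, 2, 3)
```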

4
BLOCK REPLACEMENT POLICY
❑Direct mapped: no choice
❑Set associative and fully associative
▪ Pick non-valid entry, if there is one
▪ Otherwise, choose among entries in the set
❑Least recently used (LRU)
▪ Choose the block unused for the longest time (a minimal sketch follows this list)
▪ Requires extra bits to order the blocks
▪ High overhead beyond 4-way set associativity
❑Random
▪ Performs similarly to LRU at high associativity
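A minimal sketch of one LRU-managed set, using Python's OrderedDict for the recency bookkeeping (the class and method names are illustrative, not part of the course material):

```python
from collections import OrderedDict

class LRUSet:
    """One cache set managed with LRU replacement (illustrative only)."""

    def __init__(self, ways):
        self.ways = ways
        self.blocks = OrderedDict()  # tag -> data, ordered oldest -> newest

    def access(self, tag):
        if tag in self.blocks:
            self.blocks.move_to_end(tag)      # hit: mark as most recently used
            return "hit"
        if len(self.blocks) >= self.ways:
            self.blocks.popitem(last=False)   # set full: evict the LRU block
        self.blocks[tag] = None               # fill the block (data omitted)
        return "miss"

lru_set = LRUSet(ways=4)
for tag in [1, 2, 3, 4, 1, 5]:
    print(tag, lru_set.access(tag))  # the final access evicts tag 2, the LRU block
```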

5
LRU REPLACEMENT EXAMPLE

6
ANOTHER LRU REPLACEMENT EXAMPLE

7
WHAT ABOUT WRITES?
❑Where do we put the result of a store?
❑Cache hit (block is in cache)
▪ Write new data value to the cache
▪ Also write to memory (write through)
▪ Don’t write to memory (write back)
• Requires an additional dirty bit for each cache block
• Writes back to memory when a dirty cache block is evicted
❑Cache miss (block is not in cache)
▪ Allocate the block (bring it into the cache) and write to it (write allocate)
▪ Write to memory without allocating (no write allocate, or write around; see the sketch below)
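A minimal sketch of how these policy combinations differ on a store (the TinyCache class below is my own simplification, not the course's reference design; blocks are tracked at whole-block granularity and memory is just a dict):

```python
class TinyCache:
    """Sketch of how a store is handled under each write-policy combination."""

    def __init__(self, write_through=True, write_allocate=True):
        self.write_through = write_through
        self.write_allocate = write_allocate
        self.blocks = {}  # block address -> {"data": ..., "dirty": bool}

    def store(self, block_addr, data, memory):
        if block_addr in self.blocks:
            # Write hit: update the cache; write-through also updates memory,
            # write-back only marks the block dirty.
            self.blocks[block_addr] = {"data": data, "dirty": not self.write_through}
            if self.write_through:
                memory[block_addr] = data
        elif self.write_allocate:
            # Write miss with allocation: bring the block in, then write.
            # (A real cache would first fetch the rest of the block from memory.)
            self.blocks[block_addr] = {"data": data, "dirty": not self.write_through}
            if self.write_through:
                memory[block_addr] = data
        else:
            # Write miss without allocation ("write around"): memory only.
            memory[block_addr] = data

    def evict(self, block_addr, memory):
        # A write-back cache flushes dirty data only when the block is evicted.
        block = self.blocks.pop(block_addr)
        if block["dirty"]:
            memory[block_addr] = block["data"]
```

For example, with write_through=False the memory copy only becomes current when evict runs on a dirty block, which is exactly the behavior the write-back examples below step through.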
8
WRITE THROUGH EXAMPLE
❑Assume write allocate
❑Size of each block is 8 bytes
❑Cache holds 2 blocks
❑Memory holds 8 blocks
❑6-bit memory address
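With these parameters the 6-bit address splits as 2 tag bits, 1 index bit, and 3 offset bits: 3 offset bits for 8 bytes per block, 1 index bit for 2 cache blocks (assuming the example cache is direct mapped), and the remaining 6 − 1 − 3 = 2 bits form the tag.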

9
WRITE THROUGH EXAMPLE (CONTINUED)
(Step-by-step walkthrough in figures, slides 10–20.)

20
WRITE BACK EXAMPLE
❑Assume write allocate
❑Size of each block is 8 bytes
❑Cache holds 2 blocks
❑Memory holds 8 blocks
❑6-bit memory address

21
WRITE BACK EXAMPLE (CONTINUED)
(Step-by-step walkthrough in figures, slides 22–33.)

33
CACHE HIERARCHY
❑Time to get a block from memory is so long that
performance suffers even with a low miss rate

❑Example: 3% miss rate, 100 cycles to main memory
▪ 0.03 × 100 = 3 extra cycles on average to access instructions or data

❑Solution: Add another level of cache

34
PIPELINE WITH A CACHE HIERARCHY

35
THE MEMORY HIERARCHY

36
THE MEMORY HIERARCHY

37
CACHE HIERARCHY
❑Example: assume 1 cycle to access L1 (3% miss rate), 10
cycles to L2, 10% L2 miss rate, 100 cycles to main memory

❑How many cycles on average for instruction/data access?

1 + 0.03 × (10 + 0.1 × 100) = 1.6 cycles
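The same calculation written as a small helper, in case you want to try other parameters (a sketch; the function name is illustrative):

```python
def avg_access_cycles(l1_hit, l1_miss_rate, l2_access, l2_miss_rate, mem_access):
    """Average cycles per access for a two-level cache hierarchy."""
    l1_miss_penalty = l2_access + l2_miss_rate * mem_access  # cost of an L1 miss
    return l1_hit + l1_miss_rate * l1_miss_penalty

print(avg_access_cycles(1, 0.03, 10, 0.10, 100))  # 1.6
```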

38
HOW DO WE MEASURE PERFORMANCE?

❑Execution time: The time between the start and completion of a program (or task)

❑Throughput: Total amount of work done in a given time

❑ Improving performance means


▪ Reducing execution time, or
▪ Increasing throughput

39
CPU EXECUTION TIME

❑Amount of time the CPU takes to run a program


❑Derivation
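The standard result, which the next three slides break down into its factors, is:

CPU execution time = Instruction count (I) × Cycles per instruction (CPI) × Cycle time (CT) = (I × CPI) / clock frequency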

40
INSTRUCTION COUNT (I)

❑Total number of instructions in the given program


❑Factors
▪ Instruction set
▪ Mix of instructions chosen by the compiler

41
CYCLE TIME (CT)

❑Clock period (1/frequency)


❑Factors
▪ Instruction set
▪ Structure of the processor and memory hierarchy

42
CYCLES PER INSTRUCTION (CPI)

❑Average number of cycles required to execute each instruction
❑Factors
▪ Instruction set
▪ Mix of instructions chosen by the compiler
▪ Ordering of the instructions by the compiler
▪ Structure of the processor and memory hierarchy
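A quick way to see how an instruction mix turns into an average CPI, sketched below (the mix and the per-class CPI values are made-up numbers for illustration, not from the slides):

```python
def average_cpi(mix):
    """Weighted-average CPI from a mix: {class: (fraction, cycles_per_instr)}."""
    return sum(fraction * cpi for fraction, cpi in mix.values())

# Hypothetical instruction mix, for illustration only
mix = {"alu": (0.50, 1), "load": (0.20, 2), "store": (0.10, 2), "branch": (0.20, 2)}
print(average_cpi(mix))  # 0.5*1 + 0.2*2 + 0.1*2 + 0.2*2 = 1.5
```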

43
A ROUGH BREAKDOWN OF CPI

44
IMPACT OF L1 CACHES

❑ With L1 caches
– L1 instruction cache miss rate = 2%
– L1 data cache miss rate = 5%
– Miss penalty = 100 cycles (access main memory)
– 20% of all instructions are loads, 10% are stores

❑ CPI_memhier =
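One way to work this out (assuming every load and store accesses the data cache, and that miss cycles simply add to the base CPI):

CPI_memhier = 0.02 × 100 + (0.20 + 0.10) × 0.05 × 100 = 2.0 + 1.5 = 3.5 extra cycles per instruction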

45
IMPACT OF L1+L2 CACHES

❑ With L1 and L2 caches


– L1 instruction cache miss rate = 2%
– L1 data cache miss rate = 5%
– L2 access time = 15 cycles
– L2 miss rate = 25%
– L2 miss penalty = 100 cycles (access main memory)
– 20% of all instructions are loads, 10% are stores

❑ CPI_memhier =
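One way to work this out (assuming the 25% L2 miss rate applies to both instruction and data accesses, and that the 100-cycle penalty is paid on top of the 15-cycle L2 access):

CPI_memhier = 0.02 × (15 + 0.25 × 100) + (0.20 + 0.10) × 0.05 × (15 + 0.25 × 100) = 0.8 + 0.6 = 1.4 extra cycles per instruction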

47
PROCESSOR ORGANIZATION
IMPACT ON CPI (EXAMPLE 1)

49
PROCESSOR ORGANIZATION
IMPACT ON CPI (EXAMPLE 2)

50
COMPILER IMPACT ON CPI (EXAMPLE 3)

51
BEFORE NEXT CLASS

• Next time: Multicore

52
