
Lecture on Global Informatics

and Electronics Ⅱ
Jubee Tada
Graduate School of Science and Engineering,
Yamagata University
Tel:0238-26-3576
E-mail:jubee@yz.yamagata-u.ac.jp
Effects of Memory Hierarchy

• Memory wall problem
• Memory hierarchy
• Cache memory
• Performance improvement of cache memories
A problem with improving performance by using parallelism

• Processor processing power increases
  → The number of instructions and data required at once increases
  → High-speed memories are required

Memory wall problem

Solution:
• Cache memory
Relationship between processor and main memory

• The processor reads instructions and necessary data from memory and writes the results to memory.
• Without data transfer to and from memory, the processor cannot operate.

[Figure: the processor (registers and ALU) exchanges instructions and data, such as variables A and B, with addressed locations in main memory.]
Changes in processor performance and memory access times

[Figure: relative performance of processors versus memory, 1980–2012, on a log scale from 1 to 100,000; processor performance improved far faster than memory access time, opening a widening gap.]
To fill the performance gap

• Processors require memory that is as fast as the processor itself.
• Relationship between capacity and speed
  – High-speed memories cannot be made large.
  – Large-capacity memories cannot be made fast.
• How can a large-capacity, high-speed memory be achieved?
  → Memory hierarchy
Memory hierarchy

• Combination of small-capacity, high-speed memory and large-capacity, slow memory
  – Store the entire data set in the large-capacity memory.
  – Store a small amount of data in the small-capacity memory → cache memory

[Figure: pyramid from the processor down through small-capacity (high-speed), middle-capacity, and large-capacity (low-speed) memories.]
Behavior of cache read

• Block
  – The unit of data exchange in a cache system
• Data exchange in a cache system
  – The processor sends the cache the address of the block containing the required data.
  – If the block exists in the cache, it is sent to the processor. → Cache hit
  – If it does not exist, the block is read from main memory and stored in the cache. → Cache miss

[Figure: processor ↔ cache memory ↔ main memory.]
Memory stall

• Each access to memory becomes an access to the cache.
  – Hit: processing continues without stalling.
  – Miss: the required data is read from memory.
    → Stalls occur: these are called memory stalls.
• Miss penalty
  – The time required to read data from the next memory level

[Figure: processor → cache (1 cycle) → main memory (several to hundreds of cycles).]
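As a hedged aside, the hit time and miss penalty on this slide combine into the standard average memory access time relation; this formula does not appear on the slide, and the 5% miss rate below is an assumed illustration.

```latex
% Standard AMAT relation (assumed here; not given on the slide):
\text{AMAT} = \text{hit time} + \text{miss rate} \times \text{miss penalty}
% e.g. with a 1-cycle hit, an assumed 5% miss rate, and a 100-cycle penalty:
% AMAT = 1 + 0.05 * 100 = 6 cycles on average
```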
Behavior at a cache hit/miss

[Figure: on a hit, the processor requests the data at address 0x2000 and the cache returns it directly; on a miss, the processor requests the data at address 0x0001, the cache forwards the request to main memory, stores the returned block, and then supplies the data to the processor.]
Handling instructions and data

• Unified cache
  – Stores instructions and data in one cache
  – Can reduce hardware cost
• Split cache
  – Stores instructions and data in separate caches
  – Can avoid structural hazards
  – Harvard architecture

[Figure: a unified cache serves both instructions and data; a split cache pairs the processor with separate instruction and data caches.]
Problems of a cache memory

• Only a small amount of data can be stored in the cache.
• What data should be stored in the cache?
  – Data that is likely to be needed
    → This reduces the number of cache misses.
• Multiple pieces of data correspond to one location.
  – How do we decide where to store a block?
    • Derive the location from the data's address.
  – How do we determine whether the stored data is the required data?
    • Store the address together with the data.
Principles of locality

• Only a small amount of data can be stored in the cache.
  – What kind of data should be stored in the cache?
    → Data that is likely to be needed
• Use locality
• Temporal locality
  – Referenced data is likely to be referenced again soon.
• Spatial locality
  – Data that is close to referenced data is more likely to be referenced.
How to choose a storage location

Find the index corresponding to the address and store the block at that location.

[Figure: a large addressed memory (addresses 000...000 through 111...111) mapped onto a small 8-entry cache indexed 000–111; many addresses share each index.]
Equations for the index

• Use the lower bits of the address as the index:

(block address) mod (number of blocks in the cache) = index
number of index bits = log2(number of blocks in the cache)

• A method in which the storage location is uniquely determined by the address
  → Direct mapped
• Multiple blocks are assigned to one location in the cache.
  → It is necessary to determine which address the stored data belongs to.
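A minimal sketch of these equations in C, assuming an illustrative 32-bit address, 1024 cache blocks, and 4-byte blocks (all parameter values are assumptions, not from the slide):

```c
#include <stdint.h>
#include <stdio.h>

/* Illustrative parameters: 1024 blocks of 4 bytes each. */
#define NUM_BLOCKS  1024   /* => 10 index bits = log2(1024) */
#define OFFSET_BITS 2      /* => 4-byte blocks */
#define INDEX_BITS  10

int main(void) {
    uint32_t addr = 0x12345678;
    uint32_t block_addr = addr >> OFFSET_BITS;    /* drop the byte offset */
    uint32_t index = block_addr % NUM_BLOCKS;     /* (block address) mod (blocks) */
    uint32_t tag   = block_addr >> INDEX_BITS;    /* the remaining upper bits */
    printf("tag=0x%x index=%u\n", (unsigned)tag, (unsigned)index);
    return 0;
}
```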
Structure of a cache

• It is necessary to determine which address the stored data belongs to.
  → Hold the address alongside the data: the tag
  – Use the part of the address other than the index.
  – Determine whether the data is the required data by extracting and comparing the tags.
• It is also necessary to indicate whether the data stored in the cache is valid.
  → Valid bit
  – If the valid bit is not set, the data is invalid.
Structure of a cache (direct mapped)

[Figure: a 32-bit address split into a 20-bit tag (bits 31–12), a 10-bit index (bits 11–2), and a 2-bit byte offset; the index selects one of 1024 entries, each holding a valid bit, a 20-bit tag, and 32 bits of data; a comparator checks the stored tag against the address tag to produce the hit signal.]
Behavior at loading

• Get the index from the address.
• Access the location indicated by the index.
• Extract the tag and the data.
• Compare the stored tag with the upper part of the address.
• Equal
  – If the valid bit is 1: hit
  – If the valid bit is 0: miss
• Not equal
  – Miss
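A minimal sketch of this load procedure for a direct-mapped cache, with hypothetical types and the same assumed parameters as before (1024 blocks, 4-byte blocks):

```c
#include <stdbool.h>
#include <stdint.h>

#define NUM_BLOCKS 1024

/* Hypothetical cache line: valid bit, tag, and one 32-bit data word. */
struct line { bool valid; uint32_t tag; uint32_t data; };
static struct line cache[NUM_BLOCKS];

/* Returns true on a hit and fills *out; a miss would trigger a refill. */
bool lookup(uint32_t addr, uint32_t *out) {
    uint32_t index = (addr >> 2) % NUM_BLOCKS;   /* 4-byte blocks assumed */
    uint32_t tag   = (addr >> 2) / NUM_BLOCKS;   /* upper part of the address */
    struct line *l = &cache[index];
    if (l->valid && l->tag == tag) {  /* tags equal and valid bit set: hit */
        *out = l->data;
        return true;
    }
    return false;                     /* tag mismatch or invalid: miss */
}
```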
Behavior at storing

• Execute a store instruction
  – If the data is written only to the cache (not to main memory), what happens?
    → The data differs between the cache and main memory.
  – Coherency is lost.
• How can coherency be kept?
  – One way is to write to both the cache and main memory.
    → Write-through method
Problems with the write-through method

• Every store writes to main memory.
  → Performance decreases drastically.
Example:
• Main memory is 100 times slower than the processor.
• Store instructions account for 10% of the program.
• The original CPI is 1.0.
→ Each write consumes 100 extra cycles:
CPI = 1.0 + 100 × 10% = 11
Performance decreases by about 10×.
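Restating the slide's arithmetic as a general formula, with f_store denoting the fraction of store instructions (the notation is mine, not the slide's):

```latex
\text{CPI}_{\text{eff}} = \text{CPI}_{\text{base}} + f_{\text{store}} \times \text{write penalty}
                        = 1.0 + 0.10 \times 100 = 11.0
```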
Storing methods of a cache

• Write-through
  – Writes to both the cache and main memory
• Write buffer
  – A device that temporarily holds pending writes
• Write-back
  – Writes a block back to main memory only when it is targeted for replacement
Write buffer

• Holds pending writes to main memory.
• The processor writes to the cache and the write buffer.
  → It continues processing.
• The write buffer writes to main memory.
  → The entry is deleted after the write completes.
• Problem: if a store is executed before the write buffer is free, the processor stalls.
  – This can be avoided by allowing the buffer to hold multiple pieces of data (see the sketch below).
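A minimal sketch of this behavior as a FIFO, assuming a hypothetical 4-entry buffer; the names and capacity are illustrative:

```c
#include <stdbool.h>
#include <stdint.h>

#define WB_ENTRIES 4   /* multiple entries reduce stalls, as noted above */

/* Hypothetical FIFO write buffer holding pending stores. */
struct wb_entry { uint32_t addr, data; };
static struct wb_entry buf[WB_ENTRIES];
static int head, count;

/* Processor side: returns false (stall) if the buffer is full. */
bool wb_push(uint32_t addr, uint32_t data) {
    if (count == WB_ENTRIES) return false;        /* processor must stall */
    buf[(head + count) % WB_ENTRIES] = (struct wb_entry){addr, data};
    count++;
    return true;                                  /* processor continues */
}

/* Memory side: drains one entry when main memory is ready. */
bool wb_drain(struct wb_entry *out) {
    if (count == 0) return false;
    *out = buf[head];
    head = (head + 1) % WB_ENTRIES;
    count--;                                      /* delete after the write completes */
    return true;
}
```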
Write-back method

• Writes are made only to the cache.
• Data written to the cache is written to the lower memory level only when it becomes a replacement target.
• Particularly useful when stores occur frequently
• Problem: control becomes complicated.
  – It is necessary to record, for each block in the cache, whether it has been written. → A dirty bit is required.
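A minimal sketch of the dirty-bit bookkeeping, with hypothetical names; only the idea on the slide is implemented:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical line with the dirty bit the slide calls for. */
struct line { bool valid, dirty; uint32_t tag, data; };

/* A store that hits under write-back: update only the cache and
   mark the line dirty; main memory is now stale. */
void store_hit(struct line *l, uint32_t data) {
    l->data  = data;
    l->dirty = true;
}

/* On replacement, a dirty block must first be written back. */
void evict(struct line *l) {
    if (l->valid && l->dirty) {
        /* write l->data back to the lower memory level here */
    }
    l->valid = false;
    l->dirty = false;
}
```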
Multi-level caches

• Memory system performance can be improved by using multiple levels of cache.
• The cache closest to the CPU → L1 cache
• The next level of the memory hierarchy below the L1 cache → L2 cache
• The cache closest to main memory → LLC (Last Level Cache)

[Figure: levels 1 through n between the CPU and main memory; access time grows from short to long, and capacity grows, as the level number increases.]
Methods for improving cache performance

• How can cache performance be improved?
  → Reduce the number of cache misses.
• When does a cache miss occur?
• Types of cache misses
  – Cold-start miss (compulsory miss)
  – Capacity miss
  – Conflict miss
Cold-start miss

• The cache always misses on the first reference to a block.
  – It hits for the first time on the second reference.
• Solution:
  – Increase the block size.
    → Several instructions and several pieces of data can be loaded from main memory at once.
    ☺ Several references can turn into hits.
    ☹ The miss penalty will increase.
    ☹ Conflict misses will increase because the number of blocks decreases.
Exploiting spatial locality

• Data that is close to referenced data is more likely to be referenced (see the sketch below).
• How can spatial locality be exploited?
  – Move more data from memory at once.
    → Increase the block size.
• Problem
  – A lot of data is transferred at once.
  – This increases the miss penalty.
    → High data transfer bandwidth is required.
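A minimal illustration of spatial locality in C: row-major traversal touches consecutive addresses, so each fetched block serves several subsequent accesses, while column-major traversal strides across blocks. The array size is an assumption.

```c
#define N 1024
static int a[N][N];

long sum_row_major(void) {          /* cache-friendly: consecutive addresses */
    long s = 0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            s += a[i][j];
    return s;
}

long sum_col_major(void) {          /* strides N*sizeof(int): poor locality */
    long s = 0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            s += a[i][j];
    return s;
}
```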
Structure of a cache (large block size)

[Figure: a 32-bit address split into an 18-bit tag (bits 31–14), a 10-bit index, a block offset, and a 2-bit byte offset; each of the 1024 entries holds a valid bit, an 18-bit tag, and a multi-word block, and the block offset selects a 32-bit word within the block.]
Effects of block size on a cache miss rate

[Figure: miss rate versus block size, from David A. Patterson and John L. Hennessy, "Computer Organization and Design, Fifth Edition: The Hardware/Software Interface".]
Capacity miss

• The amount of frequently referenced data exceeds the cache capacity.
• Example: data cache capacity < array size

[Figure: an 8-entry cache cannot hold a 100-element array A[0]–A[99]; later elements evict earlier ones.]

• Solution:
  – Increase the cache size.
    ☹ The access time of the cache will increase.
  – Make the working set smaller.
    → Blocking (see the sketch below)
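A minimal sketch of blocking: the array is processed in tiles small enough to stay in the cache, so each element is reused before it is evicted. The matrix size N and tile size B are assumptions (B would be tuned to the cache capacity, and here N is divisible by B).

```c
#define N 1024
#define B 32   /* hypothetical tile size chosen to fit in the cache */

void transpose_blocked(double dst[N][N], const double src[N][N]) {
    for (int ii = 0; ii < N; ii += B)
        for (int jj = 0; jj < N; jj += B)
            /* work entirely inside one B x B tile before moving on */
            for (int i = ii; i < ii + B; i++)
                for (int j = jj; j < jj + B; j++)
                    dst[j][i] = src[i][j];
}
```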
Conflict miss

• With a direct-mapped cache, multiple accesses refer to the same entry.

[Figure: several memory addresses (e.g., 202, 203, and 402) map onto the same entries of an 8-entry direct-mapped cache and repeatedly evict each other.]

• Solution:
  – Adopt a set associative cache.
    ☹ Increases the hardware cost
    ☹ Increases the access time of the cache
Reducing the cache miss rate using set associativity

• Direct-mapped method
  – The storage location is determined by the address.
  – The miss rate increases due to conflict misses,
    because data from multiple addresses is stored in one place.
• Solution:
  – Store a block anywhere in the cache → fully associative method
    • Whether the required data exists cannot be known unless all entries are searched.
      → Applicable only to small caches with few entries
  – Prepare multiple locations where the data for an address can be stored → set associative method
Set associative method

• There are n locations where the data from an address can be stored.
  → n-way set associative cache
• Associativity
  – The number of locations that can store the data at a given address

[Figure: a direct-mapped cache with 8 blocks versus a 2-way set associative cache with 4 sets, each holding two (tag, data) pairs.]
Cache replacement algorithm

• If there are n storage locations, the problem is which one to write to.
  – The data at the chosen location is removed from the cache.
    → The cache replacement algorithm affects the number of misses.
• LRU (Least Recently Used)
  – Write to the way that has been unused the longest.
  – Recently referenced items tend to remain.
    → This exploits temporal locality.
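A minimal sketch of LRU within one set of a 4-way cache, using per-way age counters; real hardware typically uses cheaper approximations, and the names here are hypothetical:

```c
#include <stdbool.h>
#include <stdint.h>

#define WAYS 4

/* Hypothetical n-way set: per-way age counters implement LRU. */
struct way { bool valid; uint32_t tag; uint32_t age; };

/* Pick the victim: any invalid way first, otherwise the oldest one. */
int lru_victim(struct way set[WAYS]) {
    int victim = 0;
    for (int w = 0; w < WAYS; w++) {
        if (!set[w].valid) return w;
        if (set[w].age > set[victim].age) victim = w;
    }
    return victim;
}

/* On every access: reset the used way's age and age all the others. */
void lru_touch(struct way set[WAYS], int used) {
    for (int w = 0; w < WAYS; w++) set[w].age++;
    set[used].age = 0;   /* recently referenced items tend to remain */
}
```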
Difference in associativity

[Figure: the same 8-block cache organized four ways: direct mapped (8 blocks), 2-way set associative (4 sets), 4-way set associative (2 sets), and fully associative (one set of 8 tag/data pairs).]
Structure of a cache (2-way set associative)

[Figure: a 32-bit address split into a 21-bit tag, a 9-bit index (512 sets), and a byte offset; the index selects one set containing two (valid, tag, data) entries, both tags are compared in parallel, and a 2-to-1 multiplexer selects the data of the hitting way.]
Effects of associativity on cache miss rate

[Figure: miss rate versus associativity, from David A. Patterson and John L. Hennessy, "Computer Organization and Design, Fifth Edition: The Hardware/Software Interface".]
Trends of Microprocessors

• History of the Intel Core i series
• Improving branch prediction accuracy
• Extension of SIMD instructions
• Heterogeneous multi-core (big.LITTLE)
• Increase in cache capacity and associativity
• New packaging technologies
History of the Intel Core i processor

• 1st generation: Nehalem/Westmere: Intel Turbo Boost
• 2nd generation: Sandy Bridge: AVX, GPU integration
• 3rd generation: Ivy Bridge: Tri-gate transistors
• 4th generation: Haswell: AVX2
• 5th generation: Broadwell: Radix-1024 divider
• 6th generation: Skylake: AVX-512
• 7th generation: Kaby Lake
• 8th generation: Coffee Lake: up to 8 cores
• 9th generation: Cannon Lake/Coffee Lake Refresh
• 10th generation: Ice Lake/Comet Lake
• 11th generation: Tiger Lake/Rocket Lake: Intel Xe GPU
• 12th generation: Alder Lake: heterogeneous multi-core
• 13th generation: Raptor Lake: increased P-core cache, doubled E-core count
Improving branch prediction accuracy

• AMD Zen uses perceptron branch prediction.
• AMD Zen 2 also features TAGE branch prediction.
  – TAgged GEometric history length predictor
  – When the cases where branch prediction failed were investigated, history lengths of over 1000 bits turned out to be required.
  – Short histories cannot predict well what long histories can capture → use multiple history lengths, increasing geometrically.
  – Preferential use of the perceptron predictor avoids an impact on the clock cycle time.
• Zen → Zen 2 improves IPC by 15%.
Extensions of SIMD instructions

• AVX-512
  – Register length extended from the conventional AVX2 (256 bits) to 512 bits
  – With single precision (32 bits), 16 operations can be performed simultaneously.
  – The number of registers increased from 16 to 32.
  – Various new instructions were added.
  – Predication support (see the sketch below)
    • Using mask registers, whether or not to execute an operation can be set for each element individually.
    • Like ARM's conditional execution, code that includes branch decisions can be implemented without branch instructions.
  – Not supported on Intel's 12th generation Core (Alder Lake)
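A minimal sketch of predication with AVX-512 intrinsics: for each of 16 floats, a[i] += b[i] is executed only where a[i] > 0, with no branch instructions. This is my illustration, not from the slide; it requires an AVX-512F capable CPU and the matching compiler flags.

```c
#include <immintrin.h>

void masked_add(float *a, const float *b) {
    __m512 va = _mm512_loadu_ps(a);
    __m512 vb = _mm512_loadu_ps(b);
    /* Build a 16-bit mask: one bit per lane where a[i] > 0. */
    __mmask16 m = _mm512_cmp_ps_mask(va, _mm512_setzero_ps(), _CMP_GT_OQ);
    /* Masked add: unmasked lanes keep their original value of va. */
    va = _mm512_mask_add_ps(va, m, va, vb);
    _mm512_storeu_ps(a, va);
}
```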
big.LITTLE

• Multi-core processor configurations
  – Homogeneous multi-core: multiple identical cores
  – Heterogeneous multi-core: different types of cores
• big.LITTLE
  – Combines high-performance and low-performance cores.
  – The core count can be increased further than with multiple high-performance cores alone.
  – Considering Pollack's rule, a core with 1/√2 of the performance can be realized in half the area.
  – Developed by ARM and used in the Apple A series, etc.
  – Adopted by Intel in the 12th generation Core i (Alder Lake)
Increasing cache capacity

• As the number of processor cores increases, the load on the LLC shared by multiple cores becomes a problem.
  → The performance of the caches inside each processor core is improved.
• Intel's Raptor Lake increases the P-core L2 cache from 1.25 MB to 2 MB per core.
  – 32 MB of L2 cache across the chip
• AMD's Zen 3 uses 3D V-Cache to increase the L3 cache capacity.
  – Zen 4 achieves up to 96 MB (32 MB + 64 MB) of L3 cache.
Semiconductor chip manufacturing process

[Figure: the wafer-to-die manufacturing flow, from David A. Patterson and John L. Hennessy, "Computer Organization and Design, Fifth Edition: The Hardware/Software Interface".]
Yield and die cost

• If a defect exists, the die becomes defective.
  – As the die area increases, the probability of a defect increases.
  – The cost increases significantly as the die gets larger.

$$\text{Cost of a die} = \frac{\text{Cost of a wafer}}{\text{Dies per wafer} \times \text{Yield}}$$

$$\text{Dies per wafer} \approx \frac{\text{Area of a wafer}}{\text{Area of a die}}$$

$$\text{Yield} = \frac{1}{\left(1 + \text{Defects per unit area} \times \dfrac{\text{Area of a die}}{2}\right)^{2}}$$

• Solution: connect multiple small dies.
  → Connection technologies become the problem.
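A minimal sketch of these formulas in C; the wafer cost, wafer area, and defect density below are assumed values chosen only to show how fast cost grows with die area:

```c
#include <stdio.h>

/* Die cost from the slide's three formulas. */
double die_cost(double wafer_cost, double wafer_area,
                double die_area, double defects_per_area) {
    double dies_per_wafer = wafer_area / die_area;
    double t = 1.0 + defects_per_area * die_area / 2.0;
    double yield = 1.0 / (t * t);
    return wafer_cost / (dies_per_wafer * yield);
}

int main(void) {
    /* Assumed: a 300 mm wafer (~70,700 mm^2), $10,000 per wafer,
       0.02 defects per mm^2. Quadrupling the die area (100 -> 400 mm^2)
       raises the per-die cost far more than 4x. */
    printf("100 mm^2 die: $%.2f\n", die_cost(10000.0, 70700.0, 100.0, 0.02));
    printf("400 mm^2 die: $%.2f\n", die_cost(10000.0, 70700.0, 400.0, 0.02));
    return 0;
}
```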
Three-dimensional stacking technology using TSVs

• TSV (Through-Silicon Via)
  – Enables multiple integrated circuits to be stacked and connected
• HBM (High Bandwidth Memory)
  – Memory using three-dimensional stacking technology
  – Achieves higher bandwidth than traditional memory
  – The NVIDIA H100 is equipped with HBM3.

[Figure: overview of TSVs running through stacked silicon dies; the NVIDIA H100, max performance about 4000 TFLOPS (FP8).]

3D V-Cache

• Stacks L3 cache dies using 3D stacking technology
• A large-capacity cache can be realized at low cost.
Conclusions

• The number of transistors available within one chip has increased.
  – This enables more arithmetic units, larger and more associative caches, the adoption of complex branch predictors, etc.
• Branch mispredictions are reduced by employing complex branch prediction.
  – In modern processors, it is important not to stall the pipeline.
• Peak performance is improved by increasing the number of computing units.
  – Memory system performance is important.
    → Increase cache capacity and associativity.
• Higher costs due to larger chips become a problem.
  – New packaging technologies, such as three-dimensional stacking using TSVs, are expected.
Report

• Describe the following two items on one sheet of A4 paper.
  – Processor performance improvement methods using parallelism
  – The cache memory mechanism and its performance improvement methods
• Submit in PDF format via Webclass.
  – Deadline: February 13th
