
Lecture on Global Informatics

and Electronics Ⅱ
Jubee Tada
Graduate School of Science and Engineering,
Yamagata University
Tel:0238-26-3576
E-mail:jubee@yz.yamagata-u.ac.jp
Effects of Memory Hierarchy

• Memory wall problem
• Memory hierarchy
• Cache memory
• Performance improvement of cache memories
A problem with improving performance by using parallelism

• Processor processing power increases
  → The number of instructions and data required at once increases
  → High-speed memories are required

Memory wall problem

Solution:
• Cache memory
Relationship between processor and main memory

• The processor reads instructions and necessary data from memory and writes the results to memory.
• Without data transfer to and from memory, the processor cannot operate.

[Figure: the processor (registers and ALU) exchanges instructions and data, such as variables A and B, with addressed locations in main memory.]
Changes in processor performance and memory access times

[Figure: relative performance of processors versus memory, 1980–2012, on a log scale from 1 to 100,000; processor performance improved far faster than memory access time, opening a widening gap.]
To fill the performance gap

• Processors require memory that is as fast as the processor itself.
• Relationship between capacity and speed
  – High-speed memories cannot be made large.
  – Large-capacity memories cannot be made fast.
• How can a large-capacity, high-speed memory be achieved?
  → Memory hierarchy
Memory hierarchy

• Combination of small-capacity, high-speed memory and large-capacity, slow memory
  – Store the entire data set in the large-capacity memory.
  – Store a small amount of data in the small-capacity memory → cache memory

[Figure: pyramid from the processor down through small-capacity (high-speed), middle-capacity, and large-capacity (low-speed) memories.]
Behavior of cache read

• Block
  – The unit of data exchange in a cache system
• Data exchange in a cache system
  – The processor sends the cache the address of the block containing the required data.
  – If the block exists in the cache, it is sent to the processor. → Cache hit
  – If it does not exist, the block is read from main memory and stored in the cache. → Cache miss

[Figure: processor ↔ cache memory ↔ main memory.]
Memory stall

• Each access to memory becomes an access to the cache.
  – Hit: processing continues without stalling.
  – Miss: the required data is read from memory.
    → Stalls occur: these are called memory stalls.
• Miss penalty
  – The time required to read data from the next memory level

[Figure: processor → cache (1 cycle) → main memory (several to hundreds of cycles).]
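As a hedged aside, the hit time and miss penalty on this slide combine into the standard average memory access time relation; this formula does not appear on the slide, and the 5% miss rate below is an assumed illustration.

```latex
% Standard AMAT relation (assumed here; not given on the slide):
\text{AMAT} = \text{hit time} + \text{miss rate} \times \text{miss penalty}
% e.g. with a 1-cycle hit, an assumed 5% miss rate, and a 100-cycle penalty:
% AMAT = 1 + 0.05 * 100 = 6 cycles on average
```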
Behavior at a cache hit/miss

[Figure: on a hit, the processor requests the data at address 0x2000 and the cache returns it directly; on a miss, the processor requests the data at address 0x0001, the cache forwards the request to main memory, stores the returned block, and then supplies the data to the processor.]
Handling instructions and data

• Unified cache
  – Stores instructions and data in one cache
  – Can reduce hardware cost
• Split cache
  – Stores instructions and data in separate caches
  – Can avoid structural hazards
  – Harvard architecture

[Figure: a unified cache serves both instructions and data; a split cache pairs the processor with separate instruction and data caches.]
Problems of a cache memory

• Only a small amount of data can be stored in the cache.
• What data should be stored in the cache?
  – Data that is likely to be needed
    → This reduces the number of cache misses.
• Multiple pieces of data correspond to one location.
  – How do we decide where to store a block?
    • Derive the location from the data's address.
  – How do we determine whether the stored data is the required data?
    • Store the address together with the data.
Principles of locality

• Only a small amount of data can be stored in the cache.
  – What kind of data should be stored in the cache?
    → Data that is likely to be needed
• Use locality
• Temporal locality
  – Referenced data is likely to be referenced again soon.
• Spatial locality
  – Data that is close to referenced data is more likely to be referenced.
How to choose a storage location

Find the index corresponding to the address and store the block at that location.

[Figure: a large addressed memory (addresses 000...000 through 111...111) mapped onto a small 8-entry cache indexed 000–111; many addresses share each index.]
Equations for the index

• Use the lower bits of the address as the index:

(block address) mod (number of blocks in the cache) = index
number of index bits = log2(number of blocks in the cache)

• A method in which the storage location is uniquely determined by the address
  → Direct mapped
• Multiple blocks are assigned to one location in the cache.
  → It is necessary to determine which address the stored data belongs to.
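A minimal sketch of these equations in C, assuming an illustrative 32-bit address, 1024 cache blocks, and 4-byte blocks (all parameter values are assumptions, not from the slide):

```c
#include <stdint.h>
#include <stdio.h>

/* Illustrative parameters: 1024 blocks of 4 bytes each. */
#define NUM_BLOCKS  1024   /* => 10 index bits = log2(1024) */
#define OFFSET_BITS 2      /* => 4-byte blocks */
#define INDEX_BITS  10

int main(void) {
    uint32_t addr = 0x12345678;
    uint32_t block_addr = addr >> OFFSET_BITS;    /* drop the byte offset */
    uint32_t index = block_addr % NUM_BLOCKS;     /* (block address) mod (blocks) */
    uint32_t tag   = block_addr >> INDEX_BITS;    /* the remaining upper bits */
    printf("tag=0x%x index=%u\n", (unsigned)tag, (unsigned)index);
    return 0;
}
```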
Structure of a cache

• It is necessary to determine which address the stored data belongs to.
  → Hold the address alongside the data: the tag
  – Use the part of the address other than the index.
  – Determine whether the data is the required data by extracting and comparing the tags.
• It is also necessary to indicate whether the data stored in the cache is valid.
  → Valid bit
  – If the valid bit is not set, the data is invalid.
Structure of a cache (direct mapped)

[Figure: a 32-bit address split into a 20-bit tag (bits 31–12), a 10-bit index (bits 11–2), and a 2-bit byte offset; the index selects one of 1024 entries, each holding a valid bit, a 20-bit tag, and 32 bits of data; a comparator checks the stored tag against the address tag to produce the hit signal.]
Behavior at loading

• Get the index from the address.
• Access the location indicated by the index.
• Extract the tag and the data.
• Compare the stored tag with the upper part of the address.
• Equal
  – If the valid bit is 1: hit
  – If the valid bit is 0: miss
• Not equal
  – Miss
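A minimal sketch of this load procedure for a direct-mapped cache, with hypothetical types and the same assumed parameters as before (1024 blocks, 4-byte blocks):

```c
#include <stdbool.h>
#include <stdint.h>

#define NUM_BLOCKS 1024

/* Hypothetical cache line: valid bit, tag, and one 32-bit data word. */
struct line { bool valid; uint32_t tag; uint32_t data; };
static struct line cache[NUM_BLOCKS];

/* Returns true on a hit and fills *out; a miss would trigger a refill. */
bool lookup(uint32_t addr, uint32_t *out) {
    uint32_t index = (addr >> 2) % NUM_BLOCKS;   /* 4-byte blocks assumed */
    uint32_t tag   = (addr >> 2) / NUM_BLOCKS;   /* upper part of the address */
    struct line *l = &cache[index];
    if (l->valid && l->tag == tag) {  /* tags equal and valid bit set: hit */
        *out = l->data;
        return true;
    }
    return false;                     /* tag mismatch or invalid: miss */
}
```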
Behavior at storing

• Execute a store instruction
  – If the data is written only to the cache (not to main memory), what happens?
    → The data differs between the cache and main memory.
  – Coherency is lost.
• How can coherency be kept?
  – One way is to write to both the cache and main memory.
    → Write-through method
Problems with the write-through method

• Every store writes to main memory.
  → Performance decreases drastically.
Example:
• Main memory is 100 times slower than the processor.
• Store instructions account for 10% of the program.
• The original CPI is 1.0.
→ Each write consumes 100 extra cycles:
CPI = 1.0 + 100 × 10% = 11
Performance decreases by about 10×.
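Restating the slide's arithmetic as a general formula, with f_store denoting the fraction of store instructions (the notation is mine, not the slide's):

```latex
\text{CPI}_{\text{eff}} = \text{CPI}_{\text{base}} + f_{\text{store}} \times \text{write penalty}
                        = 1.0 + 0.10 \times 100 = 11.0
```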
Storing methods of a cache

• Write-through
  – Writes to both the cache and main memory
• Write buffer
  – A device that temporarily holds pending writes
• Write-back
  – Writes a block back to main memory only when it is targeted for replacement
Write buffer

• Holds pending writes to main memory.
• The processor writes to the cache and the write buffer.
  → It continues processing.
• The write buffer writes to main memory.
  → The entry is deleted after the write completes.
• Problem: if a store is executed before the write buffer is free, the processor stalls.
  – This can be avoided by allowing the buffer to hold multiple pieces of data (see the sketch below).
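A minimal sketch of this behavior as a FIFO, assuming a hypothetical 4-entry buffer; the names and capacity are illustrative:

```c
#include <stdbool.h>
#include <stdint.h>

#define WB_ENTRIES 4   /* multiple entries reduce stalls, as noted above */

/* Hypothetical FIFO write buffer holding pending stores. */
struct wb_entry { uint32_t addr, data; };
static struct wb_entry buf[WB_ENTRIES];
static int head, count;

/* Processor side: returns false (stall) if the buffer is full. */
bool wb_push(uint32_t addr, uint32_t data) {
    if (count == WB_ENTRIES) return false;        /* processor must stall */
    buf[(head + count) % WB_ENTRIES] = (struct wb_entry){addr, data};
    count++;
    return true;                                  /* processor continues */
}

/* Memory side: drains one entry when main memory is ready. */
bool wb_drain(struct wb_entry *out) {
    if (count == 0) return false;
    *out = buf[head];
    head = (head + 1) % WB_ENTRIES;
    count--;                                      /* delete after the write completes */
    return true;
}
```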
Write-back method

• Writes are made only to the cache.
• Data written to the cache is written to the lower memory level only when it becomes a replacement target.
• Particularly useful when stores occur frequently
• Problem: control becomes complicated.
  – It is necessary to record, for each block in the cache, whether it has been written. → A dirty bit is required.
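A minimal sketch of the dirty-bit bookkeeping, with hypothetical names; only the idea on the slide is implemented:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical line with the dirty bit the slide calls for. */
struct line { bool valid, dirty; uint32_t tag, data; };

/* A store that hits under write-back: update only the cache and
   mark the line dirty; main memory is now stale. */
void store_hit(struct line *l, uint32_t data) {
    l->data  = data;
    l->dirty = true;
}

/* On replacement, a dirty block must first be written back. */
void evict(struct line *l) {
    if (l->valid && l->dirty) {
        /* write l->data back to the lower memory level here */
    }
    l->valid = false;
    l->dirty = false;
}
```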
Multi-level caches

• Memory system performance can be improved by using multiple levels of cache.
• The cache closest to the CPU → L1 cache
• The next level of the memory hierarchy below the L1 cache → L2 cache
• The cache closest to main memory → LLC (Last Level Cache)

[Figure: levels 1 through n between the CPU and main memory; access time grows from short to long, and capacity grows, as the level number increases.]
Methods for improving cache performance

• How can cache performance be improved?
  → Reduce the number of cache misses.
• When does a cache miss occur?
• Types of cache misses
  – Cold-start miss (compulsory miss)
  – Capacity miss
  – Conflict miss
Cold-start miss

• The cache always misses on the first reference to a block.
  – It hits for the first time on the second reference.
• Solution:
  – Increase the block size.
    → Several instructions and several pieces of data can be loaded from main memory at once.
    ☺ Several references can turn into hits.
    ☹ The miss penalty will increase.
    ☹ Conflict misses will increase because the number of blocks decreases.
Exploiting spatial locality

• Data that is close to referenced data is more likely to be referenced (see the sketch below).
• How can spatial locality be exploited?
  – Move more data from memory at once.
    → Increase the block size.
• Problem
  – A lot of data is transferred at once.
  – This increases the miss penalty.
    → High data transfer bandwidth is required.
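A minimal illustration of spatial locality in C: row-major traversal touches consecutive addresses, so each fetched block serves several subsequent accesses, while column-major traversal strides across blocks. The array size is an assumption.

```c
#define N 1024
static int a[N][N];

long sum_row_major(void) {          /* cache-friendly: consecutive addresses */
    long s = 0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            s += a[i][j];
    return s;
}

long sum_col_major(void) {          /* strides N*sizeof(int): poor locality */
    long s = 0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            s += a[i][j];
    return s;
}
```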
Structure of a cache (large block size)

[Figure: a 32-bit address split into an 18-bit tag (bits 31–14), a 10-bit index, a block offset, and a 2-bit byte offset; each of the 1024 entries holds a valid bit, an 18-bit tag, and a multi-word block, and the block offset selects a 32-bit word within the block.]
Effects of block size on a cache miss rate

[Figure: miss rate versus block size, from David A. Patterson and John L. Hennessy, "Computer Organization and Design, Fifth Edition: The Hardware/Software Interface".]
Capacity miss

• The amount of frequently referenced data exceeds the cache capacity.
• Example: data cache capacity < array size

[Figure: an 8-entry cache cannot hold a 100-element array A[0]–A[99]; later elements evict earlier ones.]

• Solution:
  – Increase the cache size.
    ☹ The access time of the cache will increase.
  – Make the working set smaller.
    → Blocking (see the sketch below)
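A minimal sketch of blocking: the array is processed in tiles small enough to stay in the cache, so each element is reused before it is evicted. The matrix size N and tile size B are assumptions (B would be tuned to the cache capacity, and here N is divisible by B).

```c
#define N 1024
#define B 32   /* hypothetical tile size chosen to fit in the cache */

void transpose_blocked(double dst[N][N], const double src[N][N]) {
    for (int ii = 0; ii < N; ii += B)
        for (int jj = 0; jj < N; jj += B)
            /* work entirely inside one B x B tile before moving on */
            for (int i = ii; i < ii + B; i++)
                for (int j = jj; j < jj + B; j++)
                    dst[j][i] = src[i][j];
}
```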
Conflict miss

• With a direct-mapped cache, multiple accesses refer to the same entry.

[Figure: several memory addresses (e.g., 202, 203, and 402) map onto the same entries of an 8-entry direct-mapped cache and repeatedly evict each other.]

• Solution:
  – Adopt a set associative cache.
    ☹ Increases the hardware cost
    ☹ Increases the access time of the cache
Reducing the cache miss rate using set associativity

• Direct-mapped method
  – The storage location is determined by the address.
  – The miss rate increases due to conflict misses,
    because data from multiple addresses is stored in one place.
• Solution:
  – Store a block anywhere in the cache → fully associative method
    • Whether the required data exists cannot be known unless all entries are searched.
      → Applicable only to small caches with few entries
  – Prepare multiple locations where the data for an address can be stored → set associative method
Set associative method

• There are n locations where the data from an address can be stored.
  → n-way set associative cache
• Associativity
  – The number of locations that can store the data at a given address

[Figure: a direct-mapped cache with 8 blocks versus a 2-way set associative cache with 4 sets, each holding two (tag, data) pairs.]
Cache replacement algorithm

• If there are n storage locations, the problem is which one to write to.
  – The data at the chosen location is removed from the cache.
    → The cache replacement algorithm affects the number of misses.
• LRU (Least Recently Used)
  – Write to the way that has been unused the longest.
  – Recently referenced items tend to remain.
    → This exploits temporal locality.
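A minimal sketch of LRU within one set of a 4-way cache, using per-way age counters; real hardware typically uses cheaper approximations, and the names here are hypothetical:

```c
#include <stdbool.h>
#include <stdint.h>

#define WAYS 4

/* Hypothetical n-way set: per-way age counters implement LRU. */
struct way { bool valid; uint32_t tag; uint32_t age; };

/* Pick the victim: any invalid way first, otherwise the oldest one. */
int lru_victim(struct way set[WAYS]) {
    int victim = 0;
    for (int w = 0; w < WAYS; w++) {
        if (!set[w].valid) return w;
        if (set[w].age > set[victim].age) victim = w;
    }
    return victim;
}

/* On every access: reset the used way's age and age all the others. */
void lru_touch(struct way set[WAYS], int used) {
    for (int w = 0; w < WAYS; w++) set[w].age++;
    set[used].age = 0;   /* recently referenced items tend to remain */
}
```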
Difference in associativity

[Figure: the same 8-block cache organized four ways: direct mapped (8 blocks), 2-way set associative (4 sets), 4-way set associative (2 sets), and fully associative (one set of 8 tag/data pairs).]
Structure of a cache (2-way set associative)

[Figure: a 32-bit address split into a 21-bit tag, a 9-bit index (512 sets), and a byte offset; the index selects one set containing two (valid, tag, data) entries, both tags are compared in parallel, and a 2-to-1 multiplexer selects the data of the hitting way.]
Effects of associativity on cache miss rate

[Figure: miss rate versus associativity, from David A. Patterson and John L. Hennessy, "Computer Organization and Design, Fifth Edition: The Hardware/Software Interface".]
Trends of Microprocessors

• History of the Intel Core i series
• Improving branch prediction accuracy
• Extension of SIMD instructions
• Heterogeneous multi-core (big.LITTLE)
• Increase in cache capacity and associativity
• New packaging technologies
History of the Intel Core i processor

• 1st generation: Nehalem/Westmere: Intel Turbo Boost
• 2nd generation: Sandy Bridge: AVX, GPU integration
• 3rd generation: Ivy Bridge: Tri-gate transistors
• 4th generation: Haswell: AVX2
• 5th generation: Broadwell: Radix-1024 divider
• 6th generation: Skylake: AVX-512
• 7th generation: Kaby Lake
• 8th generation: Coffee Lake: up to 8 cores
• 9th generation: Cannon Lake/Coffee Lake Refresh
• 10th generation: Ice Lake/Comet Lake
• 11th generation: Tiger Lake/Rocket Lake: Intel Xe GPU
• 12th generation: Alder Lake: heterogeneous multi-core
• 13th generation: Raptor Lake: increased P-core cache, doubled E-core count
Improving branch prediction accuracy

• AMD Zen uses perceptron branch prediction.
• AMD Zen 2 also features TAGE branch prediction.
  – TAgged GEometric history length predictor
  – When the cases where branch prediction failed were investigated, history lengths of over 1000 bits turned out to be required.
  – Short histories cannot predict well what long histories can capture → use multiple history lengths, increasing geometrically.
  – Preferential use of the perceptron predictor avoids an impact on the clock cycle time.
• Zen → Zen 2 improves IPC by 15%.
Extensions of SIMD instructions

• AVX-512
  – Register length extended from the conventional AVX2 (256 bits) to 512 bits
  – With single precision (32 bits), 16 operations can be performed simultaneously.
  – The number of registers increased from 16 to 32.
  – Various new instructions were added.
  – Predication support (see the sketch below)
    • Using mask registers, whether or not to execute an operation can be set for each element individually.
    • Like ARM's conditional execution, code that includes branch decisions can be implemented without branch instructions.
  – Not supported on Intel's 12th generation Core (Alder Lake)
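A minimal sketch of predication with AVX-512 intrinsics: for each of 16 floats, a[i] += b[i] is executed only where a[i] > 0, with no branch instructions. This is my illustration, not from the slide; it requires an AVX-512F capable CPU and the matching compiler flags.

```c
#include <immintrin.h>

void masked_add(float *a, const float *b) {
    __m512 va = _mm512_loadu_ps(a);
    __m512 vb = _mm512_loadu_ps(b);
    /* Build a 16-bit mask: one bit per lane where a[i] > 0. */
    __mmask16 m = _mm512_cmp_ps_mask(va, _mm512_setzero_ps(), _CMP_GT_OQ);
    /* Masked add: unmasked lanes keep their original value of va. */
    va = _mm512_mask_add_ps(va, m, va, vb);
    _mm512_storeu_ps(a, va);
}
```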
big.LITTLE

• Multi-core processor configurations
  – Homogeneous multi-core: multiple identical cores
  – Heterogeneous multi-core: different types of cores
• big.LITTLE
  – Combines high-performance and low-performance cores.
  – The core count can be increased further than with multiple high-performance cores alone.
  – Considering Pollack's rule, a core with 1/√2 of the performance can be realized in half the area.
  – Developed by ARM and used in the Apple A series, etc.
  – Adopted by Intel in the 12th generation Core i (Alder Lake)
Increasing cache capacity

• As the number of processor cores increases, the load on the LLC shared by multiple cores becomes a problem.
  → The performance of the caches inside each processor core is improved.
• Intel's Raptor Lake increases the P-core L2 cache from 1.25 MB to 2 MB per core.
  – 32 MB of L2 cache across the chip
• AMD's Zen 3 uses 3D V-Cache to increase the L3 cache capacity.
  – Zen 4 achieves up to 96 MB (32 MB + 64 MB) of L3 cache.
Semiconductor chip manufacturing process

[Figure: the wafer-to-die manufacturing flow, from David A. Patterson and John L. Hennessy, "Computer Organization and Design, Fifth Edition: The Hardware/Software Interface".]
Yield and die cost

• If a defect exists, the die becomes defective.
  – As the die area increases, the probability of a defect increases.
  – The cost increases significantly as the die gets larger.

$$\text{Cost of a die} = \frac{\text{Cost of a wafer}}{\text{Dies per wafer} \times \text{Yield}}$$

$$\text{Dies per wafer} \approx \frac{\text{Area of a wafer}}{\text{Area of a die}}$$

$$\text{Yield} = \frac{1}{\left(1 + \text{Defects per unit area} \times \dfrac{\text{Area of a die}}{2}\right)^{2}}$$

• Solution: connect multiple small dies.
  → Connection technologies become the problem.
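A minimal sketch of these formulas in C; the wafer cost, wafer area, and defect density below are assumed values chosen only to show how fast cost grows with die area:

```c
#include <stdio.h>

/* Die cost from the slide's three formulas. */
double die_cost(double wafer_cost, double wafer_area,
                double die_area, double defects_per_area) {
    double dies_per_wafer = wafer_area / die_area;
    double t = 1.0 + defects_per_area * die_area / 2.0;
    double yield = 1.0 / (t * t);
    return wafer_cost / (dies_per_wafer * yield);
}

int main(void) {
    /* Assumed: a 300 mm wafer (~70,700 mm^2), $10,000 per wafer,
       0.02 defects per mm^2. Quadrupling the die area (100 -> 400 mm^2)
       raises the per-die cost far more than 4x. */
    printf("100 mm^2 die: $%.2f\n", die_cost(10000.0, 70700.0, 100.0, 0.02));
    printf("400 mm^2 die: $%.2f\n", die_cost(10000.0, 70700.0, 400.0, 0.02));
    return 0;
}
```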
Three-dimensional stacking technology using TSVs

• TSV (Through-Silicon Via)
  – Enables multiple integrated circuits to be stacked and connected
• HBM (High Bandwidth Memory)
  – Memory using three-dimensional stacking technology
  – Achieves higher bandwidth than traditional memory
  – The NVIDIA H100 is equipped with HBM3.

[Figure: overview of TSVs running through stacked silicon dies; the NVIDIA H100, max performance about 4000 TFLOPS (FP8).]

3D V-Cache

• Stacks L3 cache dies using 3D stacking technology
• A large-capacity cache can be realized at low cost.
Conclusions

• The number of transistors available within one chip has increased.
  – This enables more arithmetic units, larger and more associative caches, the adoption of complex branch predictors, etc.
• Branch mispredictions are reduced by employing complex branch prediction.
  – In modern processors, it is important not to stall the pipeline.
• Peak performance is improved by increasing the number of computing units.
  – Memory system performance is important.
    → Increase cache capacity and associativity.
• Higher costs due to larger chips become a problem.
  – New packaging technologies, such as three-dimensional stacking using TSVs, are expected.
Report

• Describe the following two items on one sheet of A4 paper.
  – Processor performance improvement methods using parallelism
  – The cache memory mechanism and its performance improvement methods
• Submit in PDF format via Webclass.
  – Deadline: February 13th
