
Parallel Computing

Parallel Computer Memory Architectures
Shared Memory
General Characteristics:

• Shared memory parallel computers vary widely, but generally have in common the ability for all processors to access all memory as global address space.
• Multiple processors can operate independently but share the same memory resources.
• Changes in a memory location effected by one processor are visible to all other processors.
• Shared memory machines can be divided into two main classes based upon memory access times: UMA and NUMA.
[Figure: Shared Memory (UMA)]
[Figure: Shared Memory (NUMA)]
Uniform Memory Access (UMA):
• Most commonly represented today by Symmetric Multiprocessor (SMP) machines
• Identical processors
• Equal access and access times to memory
• Sometimes called CC-UMA - Cache Coherent UMA. Cache coherent means if one processor updates a location in shared memory, all the other processors know about the update. Cache coherency is accomplished at the hardware level.
Non-Uniform Memory Access (NUMA):
• Often made by physically linking two or more SMPs
• One SMP can directly access memory of another SMP
• Not all processors have equal access time to all memories
• Memory access across the link is slower
• If cache coherency is maintained, then may also be called CC-NUMA - Cache Coherent NUMA
Advantages:
• Global address space provides a user-friendly programming perspective to memory
• Data sharing between tasks is both fast and uniform due to the proximity of memory to CPUs
Disadvantages:
• Primary disadvantage is the lack of scalability between memory and CPUs. Adding more CPUs can geometrically increase traffic on the shared memory-CPU path, and for cache coherent systems, geometrically increase traffic associated with cache/memory management.
• Programmer responsibility for synchronization constructs that ensure "correct" access of global memory (see the sketch after this list).
• Expense: it becomes increasingly difficult and expensive to design and produce shared memory machines with ever increasing numbers of processors.
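To make the synchronization point concrete, here is a minimal sketch in C using POSIX threads (an illustrative example, not part of the original slides): several threads update one location in global memory, and the mutex is the construct that ensures "correct" access.

```c
#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4
#define NITERS   100000

static long counter = 0;                          /* shared, globally addressable data */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

/* Each thread increments the shared counter; without the mutex,
   concurrent increments could be lost. */
static void *worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < NITERS; i++) {
        pthread_mutex_lock(&lock);
        counter++;                                /* critical section */
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void)
{
    pthread_t t[NTHREADS];
    for (int i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);
    printf("counter = %ld (expected %d)\n", counter, NTHREADS * NITERS);
    return 0;
}
```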
Distributed Memory
General Characteristics:
Like shared memory systems, distributed memory systems
vary widely but share a common characteristic. Distributed
memory systems require a communication network to
connect inter-processor memory.
Processors have their own local memory. Memory
addresses in one processor do not map to another
processor, so there is no concept of global address space
across all processors.
Because each processor has its own local memory, it
operates independently. Changes it makes to its local
memory have no effect on the memory of other processors.
Hence, the concept of cache coherency does not apply.
Distributed Memory (cont.)
• When a processor needs access to data in another processor, it is usually the task of the programmer to explicitly define how and when data is communicated. Synchronization between tasks is likewise the programmer's responsibility (see the sketch below).
• The network "fabric" used for data transfer varies widely, though it can be as simple as Ethernet.
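The slides do not name a particular programming model, but message passing (for example MPI) is the usual way this explicit communication is expressed. A minimal sketch, assuming two MPI ranks: rank 0 must explicitly send a value because rank 1 has no way to address rank 0's local memory.

```c
#include <mpi.h>
#include <stdio.h>

/* Rank 0 owns the value; rank 1 must request it via the network,
   since there is no global address space. */
int main(int argc, char **argv)
{
    int rank, x;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        x = 24;                                   /* data lives in rank 0's local memory */
        MPI_Send(&x, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&x, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received x = %d\n", x);
    }

    MPI_Finalize();
    return 0;
}
```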
Distributed Memory (cont.)
Advantages:
Memory is scalable with number of processors.
Increase the number of processors and the size of
memory increases proportionately.
Each processor can rapidly access its own memory
without interference and without the overhead
incurred with trying to maintain cache coherency.
Cost effectiveness: can use commodity, off-the-shelf processors and networking
Distributed Memory (cont.)
Disadvantages:
The programmer is responsible for many of the
details associated with data communication
between processors.
It may be difficult to map existing data structures,
based on global memory, to this memory
organization.
Non-uniform memory access (NUMA) times
Hybrid Distributed-Shared
Memory
The largest and fastest computers in the world today
employ both shared and distributed memory
architectures.
Hybrid Distributed-Shared
Memory (cont.)
The shared memory component is usually a cache
coherent SMP machine. Processors on a given SMP
can address that machine's memory as global.
The distributed memory component is the
networking of multiple SMPs. SMPs know only
about their own memory - not the memory on
another SMP. Therefore, network communications
are required to move data from one SMP to
another.
Hybrid Distributed-Shared
Memory (cont.)
Current trends seem to indicate that this type of
memory architecture will continue to prevail and
increase at the high end of computing for the
foreseeable future.
Advantages and Disadvantages: whatever is
common to both shared and distributed memory
architectures.
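A common way to program this hybrid organization is MPI between SMP nodes plus threads (for example OpenMP) within each node. The sketch below is illustrative only and not from the slides: each rank sums locally using shared-memory threads, then the partial sums are combined over the network.

```c
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

/* Hybrid model: MPI ranks map to SMP nodes (distributed memory);
   OpenMP threads share memory within each node. */
int main(int argc, char **argv)
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double local_sum = 0.0;
    #pragma omp parallel for reduction(+:local_sum)
    for (int i = 0; i < 1000; i++)
        local_sum += i;                     /* threads share this node's memory */

    double global_sum = 0.0;                /* ranks combine results over the network */
    MPI_Reduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("global sum = %f\n", global_sum);
    MPI_Finalize();
    return 0;
}
```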
Bus-based Shared Memory Organization

Basic picture is simple:

[Figure: three CPUs, each with a private cache, attached to a shared bus that connects them to a shared memory]
Organization

• Bus is usually simple physical connection (wires)
• Bus bandwidth limits no. of CPUs
• Could be multiple memory elements
• For now, assume that each CPU has only a single level of cache
Problem of Memory Coherence

• Assume just single level caches and main memory
• Processor writes to location in its cache
• Other caches may hold shared copies - these will be out of date
• Updating main memory alone is not enough
Example

[Figure: CPUs 1, 2 and 3, each with a cache, on a shared bus; shared memory holds X: 24]

Processor 1 reads X: obtains 24 from memory and caches it
Processor 2 reads X: obtains 24 from memory and caches it
Processor 1 writes 32 to X: its locally cached copy is updated
Processor 3 reads X: what value should it get?
Memory and processor 2 think it is 24
Processor 1 thinks it is 32
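The same scenario can be mimicked with a toy C program (illustrative only; real caches are hardware) that models each processor's cache as a private copy of X, showing how a stale value arises when nothing propagates the write:

```c
#include <stdio.h>

/* Toy model: main memory plus one private "cache" slot per processor.
   With no coherence protocol, a write by one processor leaves the
   other cached copies (and memory) stale. */
int main(void)
{
    int memory_X = 24;
    int cache[4];                 /* cache[i] = processor i's cached copy of X */

    cache[1] = memory_X;          /* P1 reads X: caches 24                     */
    cache[2] = memory_X;          /* P2 reads X: caches 24                     */

    cache[1] = 32;                /* P1 writes 32: only its own cache updated  */

    cache[3] = memory_X;          /* P3 reads X: gets the stale value          */

    printf("P1 sees %d, P2 sees %d, P3 sees %d, memory holds %d\n",
           cache[1], cache[2], cache[3], memory_X);   /* 32, 24, 24, 24 */
    return 0;
}
```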
Bus Snooping

• A scheme where every CPU knows who has a copy of its cached data is far too complex.
• So each CPU (cache system) ‘snoops’ (i.e. watches continually) for write activity concerned with data addresses which it has cached.
• This assumes a bus structure which is ‘global’, i.e. all communication can be seen by all.
• More scalable solution: ‘directory based’ coherence schemes
MESI Protocol

Any cache line can be in one of 4 states (2 bits):

• Modified - cache line has been modified, is different from main memory - is the only cached copy. (multiprocessor ‘dirty’)
• Exclusive - cache line is the same as main memory and is the only cached copy
• Shared - same as main memory but copies may exist in other caches
• Invalid - line data is not valid (as in a simple cache)
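For reference, the four states could be written down as a small C enum (a hypothetical sketch, not from the slides; in hardware they are simply 2 status bits per line):

```c
#include <stdio.h>

/* The four MESI states of a cache line (2 bits in hardware). */
typedef enum {
    INVALID,    /* line data is not valid                             */
    SHARED,     /* same as main memory; other caches may hold copies  */
    EXCLUSIVE,  /* same as main memory; only cached copy              */
    MODIFIED    /* dirty: differs from main memory; only cached copy  */
} mesi_state;

int main(void)
{
    const char *names[] = { "Invalid", "Shared", "Exclusive", "Modified" };
    for (int s = INVALID; s <= MODIFIED; s++)
        printf("%d = %s\n", s, names[s]);
    return 0;
}
```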
MESI Protocol

• Cache line changes state as a function of memory access events.
• Event may be either
  – Due to local processor activity (i.e. cache access)
  – Due to bus activity - as a result of snooping
MESI Protocol

• Operation can be described informally by looking at action in the local processor
  – Read Hit
  – Read Miss
  – Write Hit
  – Write Miss
MESI Local Read Hit

• Line must be in one of M, E, S
• This must be the correct local value (if M it must have been modified locally)
• Simply return value
• No state change
MESI Local Read Miss

• No other copy in caches
  – Processor makes bus request to memory
  – Value read to local cache, marked E
• One cache has an E copy
  – Processor makes bus request to memory
  – Snooping cache puts copy value on the bus
  – Memory access is abandoned
  – Local processor caches value
  – Both lines set to S
MESI Local Write Hit

Line must be in one of M, E, S

• M
  – line is exclusive and already ‘dirty’
  – Update local cache value
  – no state change
• E
  – Update local cache value
  – State E -> M
MESI Local Write Miss

Detailed action depends on copies in other processors

• No other copies
  – Value read from memory to local cache
  – Value updated
  – Local copy state set to M
Putting it all together

• All of this information can be described compactly using a state transition diagram
• Diagram shows what happens to a cache line in a processor as a result of
  – memory accesses made by that processor (read hit/miss, write hit/miss)
  – memory accesses made by other processors that result in bus transactions observed by this snoopy cache (Mem read, RWITM (read with intent to modify), Invalidate)
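The local-processor half of that diagram can be sketched as a small transition function in C. This is a simplification and an assumption on my part: it covers only the cases listed on the preceding slides, and omits bus-side (snooped) transitions as well as the write-hit-on-S and remaining read-miss cases.

```c
#include <stdio.h>

/* MESI states of a cache line. */
typedef enum { INVALID, SHARED, EXCLUSIVE, MODIFIED } mesi_state;

/* Local-processor events covered by the preceding slides. */
typedef enum {
    READ_HIT,              /* line already in M, E or S         */
    READ_MISS_NO_COPY,     /* no other cache holds the line     */
    READ_MISS_OTHER_COPY,  /* another cache supplies the line   */
    WRITE_HIT,             /* line in M or E                    */
    WRITE_MISS_NO_COPY     /* no other copies exist             */
} local_event;

static mesi_state next_state(mesi_state s, local_event e)
{
    switch (e) {
    case READ_HIT:             return s;          /* value returned, no change   */
    case READ_MISS_NO_COPY:    return EXCLUSIVE;  /* fetched from memory, mark E */
    case READ_MISS_OTHER_COPY: return SHARED;     /* both copies end up S        */
    case WRITE_HIT:            return MODIFIED;   /* M stays M, E -> M           */
    case WRITE_MISS_NO_COPY:   return MODIFIED;   /* fetch, update, mark M       */
    }
    return INVALID;
}

int main(void)
{
    /* Example: a write hit on an Exclusive line moves it to Modified. */
    printf("E + write hit -> %d (MODIFIED = %d)\n",
           next_state(EXCLUSIVE, WRITE_HIT), MODIFIED);
    return 0;
}
```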
MESI notes

• There are minor variations (particularly to do with write miss)
• Normal ‘write back’ when a cache line is evicted is done if line state is M
• Multi-level caches
  – If caches are inclusive, only the lowest level cache needs to snoop on the bus
Directory Schemes

• Snoopy schemes do not scale because they rely on broadcast
• Directory-based schemes allow scaling
  – avoid broadcasts by keeping track of all PEs caching a memory block, and then using point-to-point messages to maintain coherence
  – they allow the flexibility to use any scalable point-to-point network
Basic Scheme (Censier & Feautrier)

[Figure: processors with private caches connected by an interconnection network to memory; alongside memory sits a directory holding presence bits and a dirty bit per block]

• Assume "k" processors.
• With each cache-block in memory: k presence-bits, and 1 dirty-bit
• With each cache-block in cache: 1 valid bit, and 1 dirty (owner) bit

• Read from main memory by PE-i:
  – If dirty-bit is OFF then { read from main memory; turn p[i] ON; }
  – If dirty-bit is ON then { recall line from dirty PE (cache state to shared); update memory; turn dirty-bit OFF; turn p[i] ON; supply recalled data to PE-i; }
• Write to main memory:
  – If dirty-bit OFF then { send invalidations to all PEs caching that block; turn dirty-bit ON; turn p[i] ON; ... }
  – ...
Key Issues

• Scaling of memory and directory bandwidth
  – Cannot have main memory or directory memory centralized
  – Need a distributed memory and directory structure
• Directory memory requirements do not scale well
  – Number of presence bits grows with number of PEs (see the worked example after this list)
  – Many ways to get around this problem
    • limited pointer schemes of many flavors
• Industry standard
  – SCI: Scalable Coherent Interface
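To see why the presence bits become a problem, here is a small illustrative calculation (assuming 64-byte cache blocks; none of these numbers come from the slides): with 1024 PEs, a full bit-vector directory needs roughly 128 bytes of directory state per 64-byte block, i.e. about 200% storage overhead.

```c
#include <stdio.h>

/* Directory storage overhead of a full bit-vector scheme:
   each memory block needs k presence bits plus 1 dirty bit. */
int main(void)
{
    const int block_bytes = 64;                    /* assumed cache block size  */
    for (int pes = 16; pes <= 1024; pes *= 4) {
        double dir_bytes = (pes + 1) / 8.0;        /* presence bits + dirty bit */
        printf("%4d PEs: %7.2f directory bytes per %d-byte block (%5.1f%% overhead)\n",
               pes, dir_bytes, block_bytes, 100.0 * dir_bytes / block_bytes);
    }
    return 0;
}
```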

