UNIT IV
Symmetric and distributed shared memory architectures – Performance issues – Synchronization – Models of memory consistency – Introduction to Multithreading.
Parallel Architectures:
Parallel computer architecture is the method of organizing all the resources of a machine to maximize performance and programmability within the limits imposed by technology and cost at any point in time. It adds a new dimension to the development of computer systems by using increasing numbers of processors.
Flynn’s Classification:
SISD (Single Instruction Single Data) – a conventional uniprocessor
SIMD (Single Instruction Multiple Data) – one instruction stream applied to multiple data streams
MISD (Multiple Instruction Single Data) – multiple processors on a single data stream
MIMD (Multiple Instruction Multiple Data) – Examples: Sun Enterprise 5000, Cray T3D, SGI Origin
• Flexible
• Can use off-the-shelf microprocessors
MIMD is the current winner: the major design emphasis is on MIMD machines with <= 128 processors.
To take advantage of an MIMD multiprocessor with n processors, we must have at least n threads or processes to execute – thread-level parallelism (TLP).
• ILP exploits implicit parallel operations within a loop or straight-line code segment.
• TLP is explicitly represented by the use of multiple threads of execution that are inherently parallel.
• You must rewrite your code to be thread-parallel, as in the sketch below.
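As a minimal sketch of thread-level parallelism (our illustration, not from the original text; the thread count and array size are arbitrary), the following C program splits a summation across POSIX threads, each thread working on a disjoint slice of an array:

    #include <pthread.h>
    #include <stdio.h>

    #define NTHREADS 4
    #define N 1000000            /* divisible by NTHREADS */

    static double data[N];
    static double partial[NTHREADS];

    /* Each thread sums one contiguous slice of the array. */
    static void *worker(void *arg) {
        long id = (long)arg;
        long lo = id * (N / NTHREADS), hi = lo + N / NTHREADS;
        double s = 0.0;
        for (long i = lo; i < hi; i++)
            s += data[i];
        partial[id] = s;         /* each thread writes its own slot: no race */
        return NULL;
    }

    int main(void) {
        pthread_t t[NTHREADS];
        for (long i = 0; i < N; i++) data[i] = 1.0;
        for (long i = 0; i < NTHREADS; i++)
            pthread_create(&t[i], NULL, worker, (void *)i);
        double total = 0.0;
        for (long i = 0; i < NTHREADS; i++) {
            pthread_join(t[i], NULL);
            total += partial[i];
        }
        printf("sum = %.0f\n", total);   /* prints 1000000 */
        return 0;
    }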
Multiprocessor—multiple processors with a single shared address space.
• Symmetric multiprocessors: all memory is the same distance away from all processors (UMA = uniform memory access).
Cluster—multiple computers (each with its own address space) connected over a local area network (LAN), functioning as a single system.
Multiprocessing (MP) involves a computer hardware and software architecture in which multiple (two or more) processing units execute programs for a single operating system.
SMP, i.e. symmetric multiprocessing, refers to a computer architecture in which multiple identical processors are interconnected to a single shared main memory, with full access to all the I/O devices.
Characteristics of SMP
• Identical: All the processors are treated equally i.e. all are identical.
• Communication: Shared memory is the mode of communication among processors.
• Complexity: SMP systems are complex in design, as all units share the same memory and data bus.
• Expensive: they are costlier in nature.
• Unlike asymmetric multiprocessing, where tasks are run only by the master processor, here the tasks of the operating system are handled individually by each processor.
Applications
This concept finds its application in parallel processing, where time-sharing systems(TSS) have assigned
tasks to different processors running in parallel to each other, also in TSS that uses multithreading i.e.
multiple threads running simultaneously.
Advantages of symmetric multiprocessing include the following:
• Increased throughput. Using multiple processors to run tasks decreases the time it takes for the
tasks to execute.
• Reliability. If a processor fails, the whole system does not fail. But efficiency may still be affected.
• Cost-effective. In the long term, SMP is a less expensive way to increase system throughput than a single-processor system, as the processors in an SMP system share data storage, power supplies and other resources.
• Performance. Because of the increased throughput, the performance of an SMP computer system
is significantly higher than a system with only one processor.
• Multiple processors. If a task is taking too long to complete, multiple processors can be added to
speed up the process.
However, SMP also comes with the following disadvantages:
• Memory expense. Because all processors in SMP share common memory, main memory must be
large enough to support all the included processors.
• Compatibility. For SMP to work, the OS, programs and applications all need to support the
architecture.
• Complicated OS. The OS manages all the processors in an SMP system. This means the design
and management of the OS can be complex, as the OS needs to handle all the available processors,
while operating in resource-intensive computing environments.
Parallel computations are made possible by multiprocessor systems. Asymmetric Multiprocessing
and Symmetric Multiprocessing are two types of multiprocessing.
Asymmetric Multiprocessing:
Asymmetric Multiprocessing system is a multiprocessor computer system where not all of the multiple
interconnected central processing units (CPUs) are treated equally. In asymmetric multiprocessing, only a
master processor runs the tasks of the operating system.
The processors in this instance are in a master-slave relationship: one serves as the master or supervisor processor, while the others are treated as slave processors.
Symmetric Multiprocessing:
It involves a multiprocessor computer hardware and software architecture where two or more identical processors are connected to a single, shared main memory, and have full access to all input and output devices.
MIMD Architectures:
Two general classes of MIMD machines:
Centralized Shared-Memory Architectures
• One memory system for the entire multiprocessor system
• Memory references from all of the processors go to that memory system
Distributed Shared-Memory (DSM) Architectures
• Memory is physically distributed among the processors
DSM permits programs running on separate computers to share data without the programmer having to deal with sending messages; instead, the underlying system sends whatever messages are needed to keep the DSM consistent between computers. DSM also permits programs that were written to run on a single computer to be easily adapted to run on separate computers. Programs access what appears to them to be ordinary memory. Hence, programs that use DSM are usually shorter and easier to understand than programs that use message passing.
Distributed-memory machines of this kind are also called NUMA (Non-Uniform Memory Access) architectures, with the following characteristics:
Memory Interconnect: In a NUMA system, processors are grouped together, each with its own set of local
memory modules. These groups are interconnected, allowing processors to access memory local to them or
remote memory through the interconnect.
Memory Access Latency: Accessing local memory has lower latency compared to accessing remote
memory. This latency difference arises due to the varying distances between processors and memory
modules.
Improved Performance: By allowing processors to access local memory with lower latency, NUMA
architectures can enhance overall system performance, especially for applications with high memory access
requirements. This locality of memory access can lead to reduced contention and improved throughput.
Scalability: NUMA architectures provide scalability by allowing additional processors and memory modules
to be added to the system without significantly impacting performance. This scalability is achieved through
the distributed nature of memory access, which avoids centralized bottlenecks often encountered in
symmetric multiprocessing (SMP) systems.
Flexibility: NUMA architectures offer flexibility in system design and resource allocation. System
administrators can configure memory and processor assignments based on workload requirements,
optimizing performance for specific tasks or applications.
High Availability: NUMA architectures can enhance system reliability and availability by isolating failures
to specific nodes or components. If a processor or memory module fails, it typically affects only the local
node, minimizing the impact on the rest of the system.
Overall, NUMA architectures are well-suited for high-performance computing environments, database
servers, virtualization platforms, and other applications that demand both performance and scalability. By
optimizing memory access and resource allocation, NUMA architectures enable efficient utilization of
system resources while maintaining responsiveness and scalability.
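On Linux, the libnuma library exposes this locality to programs. The following C sketch (our illustration; node 0 is an arbitrary choice and may not exist on every machine; link with -lnuma) allocates a buffer whose pages are placed on a specific NUMA node, so threads running on that node see local-memory latency:

    #include <numa.h>
    #include <stdio.h>
    #include <string.h>

    int main(void) {
        if (numa_available() < 0) {            /* NUMA supported here? */
            fprintf(stderr, "NUMA not available on this system\n");
            return 1;
        }
        printf("configured NUMA nodes: %d\n", numa_num_configured_nodes());

        /* Allocate 1 MiB backed by pages on node 0 (illustrative choice). */
        size_t len = 1 << 20;
        void *buf = numa_alloc_onnode(len, 0);
        if (buf == NULL) return 1;
        memset(buf, 0, len);                   /* touch pages to commit them */
        numa_free(buf, len);
        return 0;
    }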
Cache Coherence:
In a multiprocessor system, data inconsistency may occur among adjacent levels or within the same level of the memory hierarchy. In a shared-memory multiprocessor with a separate cache memory for each processor, it is possible to have many copies of any one instruction operand: one copy in the main memory and one in each cache memory. When one copy of an operand is changed, the other copies of the operand must be changed also. Example: the cache and the main memory may have inconsistent copies of the same object.
Cache coherence protocols like MESI (Modified, Exclusive, Shared, Invalid) are used to implement cache
coherence in hardware. Here's how MESI works:
Modified (M):
When a cache line is in the Modified state, it means that the processor has both read and written to the data
in that cache line, and the data in the cache is different from the data in main memory.
Any writes to this cache line are not visible to other caches in the system until the cache line is either written
back to memory or invalidated.
Exclusive (E):
When a cache line is in the Exclusive state, it means that the processor has a copy of the data in that cache line, no other cache in the system has a copy of that data, and the cached copy is consistent with main memory.
The processor can write to the line without notifying other caches; the line then transitions to the Modified state.
Shared (S):
When a cache line is in the Shared state, it means that multiple caches in the system have a copy of the data
in that cache line, and the data is coherent with the data in main memory.
Any read accesses to this cache line can be served from any of the caches holding a copy, and any writes to
this cache line must be broadcast to invalidate or update copies in other caches to maintain coherence.
Invalid (I):
When a cache line is in the Invalid state, it means that the data in that cache line is invalid or stale, and it
cannot be used for read or write accesses until it is updated or reloaded from memory.
This state is typically used to indicate that the cache line does not contain valid data or that the data has been
modified by another processor.
• When a processor wants to read or write to a memory location, it first checks its cache for the
presence of that memory location.
• If the cache line is not present or is in the Invalid state, the processor may need to fetch the data from
main memory or another cache.
• If the cache line is present and in the Shared state, the processor can read from it. If it needs to write to it, it must first invalidate the copies in other caches; the cache line then transitions to the Modified state, and the processor's cache becomes the sole owner of the data.
• If the cache line is present and in the Modified state, the processor can read from or write to it directly
without involving other caches. Any writes are eventually propagated to main memory or other
caches to maintain coherence.
• If another processor wants to read or write to the same memory location, cache coherence mechanisms ensure that the data remains consistent across all caches by updating or invalidating copies as necessary, as sketched in the code below.
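The following C sketch (our own illustration, not real coherence hardware) models these MESI transitions for a single cache line, from the point of view of one cache that sees local processor accesses and snooped bus events:

    #include <stdio.h>

    typedef enum { MODIFIED, EXCLUSIVE, SHARED, INVALID } mesi_t;

    /* Local processor events and snooped bus events for one cache line. */
    typedef enum { PR_READ, PR_WRITE, BUS_READ, BUS_WRITE } event_t;

    /* Next MESI state of this cache's copy of the line. */
    static mesi_t mesi_next(mesi_t s, event_t e, int others_have_copy) {
        switch (e) {
        case PR_READ:
            if (s == INVALID)              /* read miss: fetch the line    */
                return others_have_copy ? SHARED : EXCLUSIVE;
            return s;                      /* read hit: state unchanged    */
        case PR_WRITE:                     /* write: other copies must be  */
            return MODIFIED;               /* invalidated (bus upgrade)    */
        case BUS_READ:                     /* another cache reads the line */
            if (s == MODIFIED || s == EXCLUSIVE)
                return SHARED;             /* if MODIFIED, also write back */
            return s;
        case BUS_WRITE:                    /* another cache writes: stale  */
            return INVALID;
        }
        return s;
    }

    int main(void) {
        mesi_t s = INVALID;
        s = mesi_next(s, PR_READ, 0);   /* miss, no sharers -> EXCLUSIVE */
        s = mesi_next(s, PR_WRITE, 0);  /* silent upgrade   -> MODIFIED  */
        s = mesi_next(s, BUS_READ, 1);  /* snooped read     -> SHARED    */
        printf("final state = %d (SHARED = %d)\n", s, SHARED);
        return 0;
    }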
As multiple processors operate in parallel, and multiple caches may independently hold different copies of the same memory block, a cache coherence problem arises. Cache coherence schemes help to avoid this problem by maintaining a uniform state for each cached block of data.
Let X be an element of shared data which has been referenced by two processors, P1 and P2. In the beginning, three copies of X are consistent. If processor P1 writes new data X1 into its cache, then under a write-through policy the same copy is written immediately into the shared memory; main memory stays consistent with P1's cache, but the copy in P2's cache becomes stale. When a write-back policy is used, the main memory will be updated only when the modified data in the cache is replaced or invalidated, so in the interim both main memory and P2's cache are inconsistent with P1's cache.
Snoopy Cache Coherence Protocol: There are two ways to maintain the coherence requirement. One method
is to ensure that a processor has exclusive access to a data item before it writes that item. This style of protocol
is called a write invalidate protocol because it invalidates other copies on a write. It is the most common
protocol, both for snooping and for directory schemes. Exclusive access ensures that no other readable or
writable copies of an item exist when the write occurs: All other cached copies of the item are invalidated.
Snoopy protocols achieve data consistency between the cache memory and the shared memory through a bus-
based memory system. Write-invalidate and write-update policies are used for maintaining cache
consistency.
Following events and actions occur on the execution of memory-access and invalidation commands −
• Read-miss − When a processor wants to read a block and it is not in the cache, a read-miss occurs. This initiates a bus-read operation. If no dirty copy exists, then the main memory, which has a consistent copy, supplies a copy to the requesting cache memory. If a dirty copy exists in a remote cache memory, that cache inhibits the main memory and sends a copy to the requesting cache memory. In both cases, the cache copy enters the valid state after a read miss.
• Write-hit − If the copy is in the dirty or reserved state, the write is done locally and the new state is dirty. If the copy is in the valid state, a write-invalidate command is broadcast to all the caches, invalidating their copies. The shared memory is then written through, and the resulting state is reserved after this first write.
• Write-miss − If a processor fails to write in the local cache memory, the copy must come either from
the main memory or from a remote cache memory with a dirty block. This is done by sending a read-
invalidate command, which will invalidate all cache copies. Then the local copy is updated with dirty
state.
• Read-hit − Read-hit is always performed in local cache memory without causing a transition of state
or using the snoopy bus for invalidation.
• Block replacement − When a copy is dirty, it has to be written back to the main memory by the block replacement method. However, when the copy is in the valid, reserved, or invalid state, no write-back takes place on replacement.
The various activities of Snoopy protocol are given below:
1. Read request by the processor which is a hit – the cache block can be in the shared state or modified state –
Normal hit operation where the data is read from the local cache.
2. Read request by the processor, which is a miss. This indicates that the cache block can be in any of the
following three states:
a. Invalid – It is a normal miss and the read request is placed on the bus. The requested block will be
brought from memory and the status will become shared.
b. Shared – It is a replacement miss, probably because of an address conflict. The read request is placed
on the bus and the requested block will be brought from memory and the status will become shared.
c. Modified – It is a replacement miss, probably because of an address conflict. The read request is placed
on the bus, the processor -cache holding it in the modified state writes it back to memory and the requested
block will be brought from memory and the status will become shared in both the caches.
3. Write request by the processor which is a hit – the cache block can be in the shared state or modified state.
a. Modified – Normal hit operation where the data is written in the local cache.
b. Shared – A coherence action is required: an invalidate is placed on the bus to invalidate all other cached copies, the data is written in the local cache, and the status becomes modified.
Directory-Based Cache Coherence Protocol:
In a directory-based protocol, a directory keeps track of the state of every block that may be cached. For each memory block, the directory records one of the following states:
➢ Shared: 1 or more processors have the block cached, and the value in memory is up-to-date (as well as in all the caches)
➢ Uncached: no processor has a copy of the cache block (not valid in any cache)
➢ Exclusive: exactly one processor has a copy of the cache block, and it has written the block, so the memory copy is out of date
• The processor is called the owner of the block
• In addition to tracking the state of each cache block, we must track the processors that have
copies of the block when it is shared (usually a bit vector for each memory block: 1 if processor
has copy)
• Keep it simple(r):
– Writes to non-exclusive data => write miss
– The processor blocks until the access completes
– Assume messages are received and acted upon in the order sent
• Compared with a snooping protocol:
– The states are identical
– A write to a shared cache block is treated as a write miss (without fetching the block)
– On a write miss, data fetch and selective invalidate operations are sent by the directory controller (rather than broadcast, as in snooping protocols)
Directory Operations: Requests and Actions
• A message sent to the directory causes two actions: the directory is updated, and further messages are sent to satisfy the request.
• Block is in the Uncached state: the copy in memory is the current value; the only possible requests for that block are:
– Read miss: the requesting processor is sent the data from memory and is made the only sharing node; the state of the block is made Shared.
– Write miss: the requesting processor is sent the value and becomes the sharing node. The block is made Exclusive to indicate that the only valid copy is cached. Sharers indicates the identity of the owner.
• Block is Shared => the memory value is up-to-date:
– Read miss: the requesting processor is sent the data from memory, and the requesting processor is added to the sharing set (see the directory sketch below).
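As an illustrative sketch (our own; the message-sending helper is hypothetical), the following C fragment models a directory entry with a state and a sharer bit vector, and shows how a directory controller might react to read and write misses for a block in the Uncached or Shared state:

    #include <stdint.h>
    #include <stdio.h>

    typedef enum { UNCACHED, DIR_SHARED, DIR_EXCLUSIVE } dir_state_t;

    typedef struct {
        dir_state_t state;
        uint64_t    sharers;   /* bit i set => processor i holds a copy */
    } dir_entry_t;

    /* Hypothetical helper: send the block's data to processor p. */
    static void send_data(int p) { printf("data -> P%d\n", p); }

    static void dir_read_miss(dir_entry_t *e, int p) {
        if (e->state == UNCACHED) {
            send_data(p);                 /* memory copy is current       */
            e->sharers = 1ULL << p;       /* requestor is the only sharer */
            e->state = DIR_SHARED;
        } else if (e->state == DIR_SHARED) {
            send_data(p);                 /* memory is still up-to-date   */
            e->sharers |= 1ULL << p;      /* add requestor to sharing set */
        }
        /* DIR_EXCLUSIVE: would have to fetch the block from the owner.  */
    }

    static void dir_write_miss(dir_entry_t *e, int p) {
        if (e->state == UNCACHED) {
            send_data(p);
            e->sharers = 1ULL << p;       /* sharers now names the owner  */
            e->state = DIR_EXCLUSIVE;     /* only valid copy is cached    */
        }
        /* DIR_SHARED: would have to invalidate all current sharers first. */
    }

    int main(void) {
        dir_entry_t e = { UNCACHED, 0 };
        dir_read_miss(&e, 2);   /* Uncached -> Shared, sharers = {P2}  */
        dir_read_miss(&e, 5);   /* Shared: P5 added to the sharing set */
        dir_write_miss(&e, 5);  /* no-op here; the full protocol would
                                   first invalidate P2's copy          */
        printf("state=%d sharers=0x%llx\n", e.state,
               (unsigned long long)e.sharers);
        return 0;
    }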
Multiprocessor systems use hardware mechanisms to implement low-level synchronization operations. Most
multiprocessors have hardware mechanisms to impose atomic operations such as memory read, write or read-
modify-write operations to implement some synchronization primitives. Other than atomic memory
operations, some inter-processor interrupts are also used for synchronization purposes.
Maintaining cache coherency is a problem in multiprocessor system when the processors contain local cache
memory. Data inconsistency between different caches easily occurs in this system.
When two processors (P1 and P2) have the same data element (X) in their local caches and one processor (P1) writes to the data element (X), then, because P1's cache is write-through, the main memory is also updated. Now when P2 tries to read data element (X), it reads a stale value, because the copy of X in the cache of P2 has become outdated.
Process migration
In the first stage, the cache of P1 has data element X, whereas P2 does not have anything. A process on P2 first writes to X and then migrates to P1. Now the process starts reading data element X, but as processor P1 holds outdated data, the process cannot read the correct value. So, a process on P1 writes to the data element X and then migrates to P2. After migration, the process on P2 starts reading data element X but finds an outdated version of X in the main memory.
I/O activity
Consider a two-processor multiprocessor architecture with an I/O device added to the bus. In the beginning, both caches contain the data element X. When the I/O device receives a new element X, it stores the new element directly in the main memory. Now, when either P1 or P2 (assume P1) tries to read element X, it gets an outdated copy. Suppose P1 then writes to element X; now, if the I/O device tries to transmit X, it gets an outdated copy.
Synchronization:
Synchronization mechanisms are typically built with user-level software routines that rely on hardware-supplied synchronization instructions.
Why Synchronize? We need to know when it is safe for different processes to use shared data.
• Issues for synchronization:
– An uninterruptible instruction to fetch and update memory (an atomic operation)
– A user-level synchronization operation that uses this primitive
– For large-scale MPs, synchronization can be a bottleneck; techniques are needed to reduce the contention and latency of synchronization
Uninterruptible Instruction to Fetch and Update Memory
• Atomic exchange: interchange a value in a register for a value in memory
– 0 ⇒ synchronization variable is free
– 1 ⇒ synchronization variable is locked and unavailable
– Set the register to 1 and swap
– The new value in the register determines success in getting the lock: 0 if you succeeded in setting the lock (you were first), 1 if another processor had already claimed access
– The key is that the exchange operation is indivisible
• Test-and-set: tests a value and sets it if the value passes the test
• Load linked (or load locked) + store conditional
– Load linked returns the initial value
– Store conditional returns 1 if it succeeds (no other store to the same memory location since the preceding load) and 0 otherwise
User-Level Synchronization—Operation Using this Primitive
• Spin locks: the processor continuously tries to acquire the lock, spinning around a loop trying to get it:
        li    R2,#1
lockit: exch  R2,0(R1)   ;atomic exchange
        bnez  R2,lockit  ;already locked?
• What about an MP with cache coherency?
– We want to spin on a cached copy to avoid full memory latency
– We are likely to get cache hits for such variables
• Problem: the exchange includes a write, which invalidates all other copies; this generates considerable bus traffic
• Solution: start by simply repeatedly reading the variable; when it changes, then try the exchange (“test and test&set”), as in the C sketch below:
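A hedged C11 sketch of the test and test&set idea (our illustration; atomic_exchange plays the role of the indivisible exchange instruction, and on machines with load-linked/store-conditional the compiler implements it with those):

    #include <stdatomic.h>

    static atomic_int lock = 0;   /* 0 = free, 1 = locked */

    void acquire(void) {
        for (;;) {
            /* Spin on an ordinary read first: while the lock is held we
               hit our own cached copy and generate no bus traffic.      */
            while (atomic_load_explicit(&lock, memory_order_relaxed) != 0)
                ;                                          /* test ...    */
            /* The lock looks free: now attempt the atomic exchange.     */
            if (atomic_exchange_explicit(&lock, 1,
                                         memory_order_acquire) == 0)
                return;                                    /* ...test&set */
        }
    }

    void release(void) {
        atomic_store_explicit(&lock, 0, memory_order_release);
    }

    int main(void) {
        acquire();
        /* critical section */
        release();
        return 0;
    }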
Synchronization in multithreaded programming is essential for coordinating access to shared resources and ensuring correct behavior in concurrent execution environments. However, it introduces various challenges due to the non-deterministic nature of thread scheduling and the potential for race conditions, deadlocks, and performance overhead. A common mechanism for addressing these challenges is the mutual-exclusion lock, sketched below.
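For example, a race condition on a shared counter can be removed with a mutual-exclusion lock; a minimal POSIX-threads sketch (our illustration):

    #include <pthread.h>
    #include <stdio.h>

    static long counter = 0;
    static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;

    /* Without the lock, counter++ would be a read-modify-write race. */
    static void *inc(void *arg) {
        for (int i = 0; i < 100000; i++) {
            pthread_mutex_lock(&m);
            counter++;                    /* critical section */
            pthread_mutex_unlock(&m);
        }
        return NULL;
    }

    int main(void) {
        pthread_t a, b;
        pthread_create(&a, NULL, inc, NULL);
        pthread_create(&b, NULL, inc, NULL);
        pthread_join(a, NULL);
        pthread_join(b, NULL);
        printf("counter = %ld\n", counter);  /* always 200000 with the lock */
        return 0;
    }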
Models of Memory Consistency:
Programs that are not synchronized are nondeterministic: in a “data race”, the outcome is a function of relative processor speed. Only programs willing to be nondeterministic are left unsynchronized.
• Several relaxed models for memory consistency exist, since most programs are synchronized; the models are characterized by their attitude towards RAR, WAR, RAW, and WAW orderings to different addresses.
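As a hedged illustration of what a consistency model guarantees, the following C11 sketch shows the classic producer/consumer flag idiom: under a relaxed model the consumer could observe flag = 1 before the write to data unless release/acquire ordering is imposed:

    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdio.h>

    static int        data = 0;
    static atomic_int flag = 0;

    static void *producer(void *arg) {
        data = 42;                                   /* ordinary write */
        /* Release: all earlier writes become visible before flag = 1. */
        atomic_store_explicit(&flag, 1, memory_order_release);
        return NULL;
    }

    static void *consumer(void *arg) {
        /* Acquire: after seeing flag = 1, we also see the write to data. */
        while (atomic_load_explicit(&flag, memory_order_acquire) == 0)
            ;
        printf("data = %d\n", data);   /* guaranteed to print 42 */
        return NULL;
    }

    int main(void) {
        pthread_t p, c;
        pthread_create(&c, NULL, consumer, NULL);
        pthread_create(&p, NULL, producer, NULL);
        pthread_join(p, NULL);
        pthread_join(c, NULL);
        return 0;
    }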
Introduction to Multithreading – Simultaneous Multithreading (SMT):
Features:
• These studies present several models of simultaneous multithreading and compare them with alternative architectures: a wide superscalar, a fine-grain multithreaded processor, and single-chip, multiple-issue multiprocessing architectures.
• The results show that both (single-threaded) superscalar and fine-grain multithreaded architectures are limited in their ability to utilize the resources of a wide-issue superscalar processor.
• Simultaneous multithreading has the potential to achieve four times the throughput of a superscalar, and double that of fine-grain multithreading.
• Simultaneous multithreading is also an attractive alternative to single-chip multiprocessors; simultaneous multithreaded processors with a variety of organizations outperform corresponding conventional multiprocessors with similar execution resources.
• These studies also show that the throughput gains from simultaneous multithreading can be achieved without substantial changes to a conventional wide-issue superscalar, either in hardware structures or sizes.
Goals:
An architecture for simultaneous multithreading is presented that achieves the following three goals:
1. It minimizes the architectural impact on a conventional superscalar design.
2. It has minimal performance impact on a single thread executing alone.
3. It achieves significant throughput gains when running multiple threads.
Fundamentals of Simultaneous Multithreading:
• Our simultaneous multithreading architecture achieves a throughput of 5.4 instructions per cycle, a 2.5-fold improvement over an unmodified superscalar with comparable hardware resources.
• This speedup is enabled by an advantage of multithreading previously unexploited in other architectures: the ability to favor, for fetch and issue, those threads that are using the processor most efficiently in each cycle.