
UNIT IV MULTIPROCESSORS AND THREAD LEVEL PARALLELISM

Symmetric and distributed shared memory architectures – Performance issues – Synchronization – Models of memory consistency – Introduction to Multithreading.

Parallel Architectures:

Parallel Computer Architecture is the method of organizing all the resources to maximize performance and programmability within the limits set by technology and cost at any instance of time. It adds a new dimension to the development of computer systems by using an increasing number of processors.

Flynn’s Classification:

SISD (Single Instruction Single Data) – Uniprocessors

MISD (Multiple Instruction Single Data) – multiple processors on a single data stream

SIMD (Single Instruction Multiple Data) – characteristics:

• Simple programming model


• Low overhead
• Flexibility
• All custom integrated circuits (the SIMD label was later reused by Intel marketing for media/vector instructions)

MIMD (Multiple Instruction Multiple Data) – Examples: Sun Enterprise 5000, Cray T3D, SGI Origin

• Flexible
• Use off-the-shelf micros

MIMD is the current winner: the major design emphasis is on MIMD machines with <= 128 processors.



MIMD:

In order to take advantage of an MIMD multiprocessor with n processors, we must have at least n threads or
processes to execute; this is thread-level parallelism.

Thread Level Parallelism

• ILP exploits implicit parallel operations within a loop or straight-line code segment
• TLP is explicitly represented by the use of multiple threads of execution that are inherently parallel
• You must rewrite your code to be thread-parallel.

Goal: Use multiple instruction streams to improve

• Throughput of computers that run many programs


• Execution time of multi-threaded programs
• TLP could be more cost-effective to exploit than ILP

Organizing Many Processors

Multiprocessor — multiple processors with a single shared address space.
• Symmetric multiprocessors: all memory is the same distance away from all processors (UMA = uniform memory access).

Cluster — multiple computers (each with its own address space) connected over a local area network (LAN), functioning as a single system.

Symmetric Multi-Processing (SMP):

Multiprocessing (MP) involves a computer hardware and software architecture in which multiple (two or more) processing units execute programs for a single operating (computer) system.

SMP, i.e. symmetric multiprocessing, refers to a computer architecture in which multiple identical processors are connected to a single shared main memory, with full access to all the I/O devices, unlike asymmetric MP. In other words, all the processors have a common shared memory and the same data path or I/O bus, as shown in the figure.

Characteristics of SMP
• Identical: All the processors are treated equally i.e. all are identical.
• Communication: Shared memory is the mode of communication among processors.
• Complexity: SMP systems are complex in design, as all units share the same memory and data bus.
• Expensive: They are costlier in nature.
• Unlike asymmetric multiprocessing, where a task is done only by the master processor, here the tasks of the operating
system are handled by the individual processors.
Applications
This concept finds its application in parallel processing, where time-sharing systems (TSS) assign tasks to different
processors running in parallel, and also in TSS that use multithreading, i.e. multiple threads running
simultaneously.
Advantages of symmetric multiprocessing include the following:

• Increased throughput. Using multiple processors to run tasks decreases the time it takes for the
tasks to execute.

• Reliability. If a processor fails, the whole system does not fail. But efficiency may still be affected.

• Cost-effective. In the long term, SMP is a less expensive way to increase system throughput than a single-
processor system, because the processors in an SMP system share data storage, power supplies and other
resources.
• Performance. Because of the increased throughput, the performance of an SMP computer system
is significantly higher than a system with only one processor.



• Programming and executing code. A program can run on any processor in the SMP system and
reach about the same level of performance, which makes programming and executing code
relatively straightforward.

• Multiple processors. If a task is taking too long to complete, multiple processors can be added to
speed up the process.
However, SMP also comes with the following disadvantages:
• Memory expense. Because all processors in SMP share common memory, main memory must be
large enough to support all the included processors.
• Compatibility. For SMP to work, the OS, programs and applications all need to support the
architecture.

• Complicated OS. The OS manages all the processors in an SMP system. This means the design
and management of the OS can be complex, as the OS needs to handle all the available processors,
while operating in resource-intensive computing environments.
Parallel computations are made possible by multiprocessor systems. Asymmetric Multiprocessing
and Symmetric Multiprocessing are two types of multiprocessing.
Asymmetric Multiprocessing:
Asymmetric Multiprocessing system is a multiprocessor computer system where not all of the multiple
interconnected central processing units (CPUs) are treated equally. In asymmetric multiprocessing, only a
master processor runs the tasks of the operating system.
The processors in this case are in a master-slave relationship: one serves as the master (supervisor) processor,
while the others are viewed as slave processors.

Symmetric Multiprocessing:
It involves a multiprocessor computer hardware and software architecture where two or more identical
processors are connected to a single, shared main memory and have full access to all input and output
devices. In other words, symmetric multiprocessing is a type of multiprocessing where each processor is
self-scheduling.
Some common performance issues that may arise in symmetric shared memory systems:
In symmetric shared memory systems, where multiple processors or CPU cores share a common memory
space, several performance issues can arise due to contention, synchronization overhead, and resource
utilization. Here are some common performance issues along with potential solutions or mitigation
strategies:
Cache Contention:
Issue: Cache contention occurs when multiple processors access the same cache lines concurrently, leading
to frequent cache invalidations and increased cache coherence traffic.
Solution:
Utilize cache-conscious programming techniques to minimize cache line contention. This involves
optimizing data structures and access patterns to reduce sharing between threads or processes.
Employ data partitioning or data locality techniques to ensure that frequently accessed data is distributed
across different cache lines, reducing contention.
Consider utilizing cache-coherent architectures or cache-coherent interconnects that are designed to
minimize cache contention and improve cache utilization.
False Sharing:
Issue: False sharing occurs when multiple threads or processors modify data located in the same cache line,
even if they are working on different data elements. This can lead to unnecessary cache invalidations and
reduced performance.
Solution:
Align data structures to cache line boundaries to minimize the likelihood of false sharing. This ensures that
each data element resides in its own cache line, reducing contention.
Use padding or data replication techniques to separate data elements that are frequently accessed by different
threads or processors, minimizing the impact of false sharing.
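A small illustrative sketch (not from the original notes; it assumes GCC/Clang attributes and a 64-byte cache line, which is typical of current x86 and ARM parts): padding per-thread counters so that each occupies its own cache line removes the false sharing between threads.

    /* false_sharing.c - pad per-thread counters onto separate cache lines.
       Assumes a 64-byte line; adjust CACHE_LINE for the target CPU. */
    #include <pthread.h>
    #include <stdio.h>

    #define CACHE_LINE 64
    #define NTHREADS   4

    /* Each counter occupies its own cache line, so writes by one thread
       do not invalidate the line holding another thread's counter. */
    struct padded_counter {
        long value;
        char pad[CACHE_LINE - sizeof(long)];
    } __attribute__((aligned(CACHE_LINE)));

    static struct padded_counter counters[NTHREADS];

    static void *worker(void *arg) {
        int id = *(int *)arg;
        for (long i = 0; i < 10000000; i++)
            counters[id].value++;        /* touches only this thread's line */
        return NULL;
    }

    int main(void) {
        pthread_t tid[NTHREADS];
        int ids[NTHREADS];
        for (int i = 0; i < NTHREADS; i++) {
            ids[i] = i;
            pthread_create(&tid[i], NULL, worker, &ids[i]);
        }
        for (int i = 0; i < NTHREADS; i++)
            pthread_join(tid[i], NULL);
        for (int i = 0; i < NTHREADS; i++)
            printf("counter[%d] = %ld\n", i, counters[i].value);
        return 0;
    }

Without the padding, all four counters would share one or two cache lines and every increment would bounce that line between cores.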
Synchronization Overhead:
Issue: Synchronization primitives such as locks, mutexes, and semaphores are used to coordinate access to
shared resources. However, excessive synchronization can introduce overhead and limit scalability.
Solution:
Minimize the use of global locks and employ finer-grained locking strategies to reduce contention. For
example, use reader-writer locks or lock-free data structures where applicable.
Utilize non-blocking synchronization techniques such as atomic operations or transactional memory to
reduce contention and improve scalability.



Explore alternative synchronization mechanisms tailored to specific use cases, such as software transactional
memory (STM) or optimistic concurrency control.
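For instance, a shared counter can be maintained with a single atomic fetch-and-add instead of a lock. This is a minimal C11 sketch (assuming a C11-capable compiler and pthreads), not code from the notes:

    /* atomic_counter.c - contention is resolved by an atomic
       read-modify-write instead of by blocking on a lock. */
    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdio.h>

    static atomic_long counter = 0;

    static void *worker(void *arg) {
        (void)arg;
        for (int i = 0; i < 1000000; i++)
            atomic_fetch_add_explicit(&counter, 1, memory_order_relaxed);
        return NULL;
    }

    int main(void) {
        pthread_t t[4];
        for (int i = 0; i < 4; i++) pthread_create(&t[i], NULL, worker, NULL);
        for (int i = 0; i < 4; i++) pthread_join(t[i], NULL);
        printf("final count = %ld\n", atomic_load(&counter));  /* 4000000 */
        return 0;
    }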
Load Imbalance:
Issue: Load imbalance occurs when the workload is not evenly distributed among processors, leading to
underutilization of some processors and potential bottlenecks.
Solution:
Implement load-balancing algorithms that dynamically distribute tasks among processors based on workload
characteristics and resource availability.
Utilize techniques such as work stealing or task scheduling to dynamically adjust the workload distribution
and maintain high processor utilization.
Profile and optimize application performance to identify and address load imbalance issues, such as uneven
data distribution or inefficient parallelization.
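A minimal sketch of dynamic load balancing, assuming the work items are independent (the array and chunk size below are invented for the example): each thread claims the next fixed-size chunk of the iteration space with an atomic fetch-and-add, so faster threads automatically pick up more work.

    /* dynamic_chunks.c - self-scheduling loop: each thread grabs the next
       chunk of iterations atomically, which evens out load imbalance. */
    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdio.h>

    #define N      1000000
    #define CHUNK  1024

    static atomic_long next_index = 0;
    static double data[N];

    static void *worker(void *arg) {
        (void)arg;
        for (;;) {
            long start = atomic_fetch_add(&next_index, CHUNK);
            if (start >= N) break;                 /* no work left */
            long end = start + CHUNK > N ? N : start + CHUNK;
            for (long i = start; i < end; i++)
                data[i] = data[i] * 2.0 + 1.0;     /* stand-in for real work */
        }
        return NULL;
    }

    int main(void) {
        pthread_t t[4];
        for (int i = 0; i < 4; i++) pthread_create(&t[i], NULL, worker, NULL);
        for (int i = 0; i < 4; i++) pthread_join(t[i], NULL);
        printf("done, data[0] = %f\n", data[0]);
        return 0;
    }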
Memory Bandwidth Saturation:
Issue: Memory bandwidth saturation occurs when the memory subsystem becomes a bottleneck due to high
contention for memory access among multiple processors.
Solution:
Optimize memory access patterns to reduce contention and improve memory locality. This includes
optimizing data placement and access patterns to minimize the number of memory transactions.
Utilize NUMA-aware memory allocation strategies to distribute memory accesses evenly across memory
modules, reducing contention for shared memory resources.
Consider employing memory compression techniques or increasing memory bandwidth through hardware
upgrades to alleviate memory bandwidth saturation.
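As a hedged illustration of NUMA-aware allocation on Linux (assuming libnuma is installed; compile with -lnuma), a buffer can be placed on a chosen node so that threads pinned to that node access local memory:

    /* numa_local.c - allocate one buffer on each NUMA node
       (Linux libnuma sketch; compile with -lnuma). */
    #include <numa.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(void) {
        if (numa_available() < 0) {
            fprintf(stderr, "NUMA is not available on this system\n");
            return 1;
        }
        int nodes = numa_max_node() + 1;
        size_t size = 64 * 1024 * 1024;           /* 64 MB per node */

        /* Place one buffer on each node so that threads pinned to a node
           can work on memory that is local to them. */
        for (int n = 0; n < nodes; n++) {
            void *buf = numa_alloc_onnode(size, n);
            if (!buf) { perror("numa_alloc_onnode"); return 1; }
            printf("allocated %zu bytes on node %d\n", size, n);
            numa_free(buf, size);
        }
        return 0;
    }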
For example, SMP can apply multiple processors to a single problem; this approach is known as parallel programming.



Asymmetric Multiprocessing vs. Symmetric Multiprocessing:
1. Asymmetric: the processors are not treated equally. Symmetric: all the processors are treated equally.
2. Asymmetric: tasks of the operating system are done by the master processor. Symmetric: tasks of the operating system are done by the individual processors.
3. Asymmetric: there is no communication between processors, as they are controlled by the master processor. Symmetric: all processors communicate with one another through shared memory.
4. Asymmetric: the process scheduling approach used is master-slave. Symmetric: each processor takes processes from the ready queue.
5. Asymmetric: asymmetric multiprocessing systems are cheaper. Symmetric: symmetric multiprocessing systems are costlier.
6. Asymmetric: asymmetric systems are easier to design. Symmetric: symmetric systems are complex to design.
7. Asymmetric: the processors can exhibit different architectures. Symmetric: the architecture of each processor is the same.
8. Asymmetric: it is simple, as only the master processor has access to the data structures. Symmetric: it is complex, as synchronization of the processors is required in order to maintain the load balance.
9. Asymmetric: if the master processor malfunctions, a slave processor continues the execution and is turned into the master; if a slave processor fails, the other processors take over its task. Symmetric: in case of processor failure, there is a reduction in the system's computing capacity.
10. Asymmetric: it is suitable for homogeneous or heterogeneous cores. Symmetric: it is suitable for homogeneous cores.

MIMD Architectures:
Two general classes of MIMD machines:
Centralized Shared-Memory Architectures
• One memory system for the entire multiprocessor system
• Memory references from all of the processors go to that memory system



Advantages:
• All of the data in the memory accessible to any processor
• There is never a problem with multiple copies of a given datum existing
Limitations of CSM:
• Bandwidth of the centralized memory system does not grow as the number of processors in the
machine increases
• Latency of the network added to the latency of each memory reference
Distributed Shared Memory (DSM)
implements the shared-memory model in a distributed system that has no physically shared memory. The shared-memory model provides a virtual address space shared among all nodes. To overcome the high cost of communication in distributed systems, DSM systems move data to the location where it is accessed.

DSM lets programs running on separate machines share data without the programmer having to deal with sending messages; the underlying implementation sends the messages needed to keep the DSM consistent between computers. DSM also allows programs that used to run on the same computer to be easily adapted to run on separate machines. Programs access what appears to them to be ordinary memory. Hence, programs that use DSM are usually shorter and easier to understand than programs that use message passing.



Architecture of Distributed Shared Memory (DSM) :
The architecture of a Distributed Shared Memory (DSM) system typically consists of several key
components that work together to provide the illusion of a shared memory space across distributed nodes.
The components of the Distributed Shared Memory architecture are:
1. Nodes: Each node in the distributed system consists of one or more CPUs and a memory unit. These nodes
are connected via a high-speed communication network.
2. Memory Mapping Manager Unit: The memory mapping manager routine in each node is responsible for
mapping the local memory onto the shared memory space. This involves dividing the shared memory space
into blocks and managing the mapping of these blocks to the physical memory of the node.
Caching is employed to reduce operation latency. Each node uses its local memory to cache portions of the
shared memory space. The memory mapping manager treats the local memory as a cache for the shared
memory space, with memory blocks as the basic unit of caching.
3. Communication Network Unit: This unit facilitates communication between nodes. When a process
accesses data in the shared address space, the memory mapping manager maps the shared memory address
to physical memory. The communication network unit handles the communication of data between nodes,
ensuring that data can be accessed remotely when necessary.
A layer of code, either implemented in the operating system kernel or as a runtime routine, is responsible
for managing the mapping between shared memory addresses and physical memory locations.
Each node’s physical memory holds pages of the shared virtual address space. Some pages are local to the
node, while others are remote and stored in the memory of other nodes.
NUMA
Non-Uniform Memory Access (NUMA) architecture is a design approach used in multiprocessing systems
where the memory access time depends on the memory location relative to the processor accessing it. In
NUMA architectures, multiple processors or CPU cores are connected to a common set of memory modules,
but the access time for each processor can vary based on its proximity to different memory modules.

Here's how it typically works:

Memory Interconnect: In a NUMA system, processors are grouped together, each with its own set of local
memory modules. These groups are interconnected, allowing processors to access memory local to them or
remote memory through the interconnect.

Memory Access Latency: Accessing local memory has lower latency compared to accessing remote
memory. This latency difference arises due to the varying distances between processors and memory
modules.



Load Balancing: NUMA architectures often employ techniques to balance the memory and processor load
across the system to minimize the impact of memory access latency. This can involve strategies such as
memory migration or scheduling processes on processors closer to their required memory.

Advantages of NUMA architectures in terms of performance and scalability:

Improved Performance: By allowing processors to access local memory with lower latency, NUMA
architectures can enhance overall system performance, especially for applications with high memory access
requirements. This locality of memory access can lead to reduced contention and improved throughput.

Scalability: NUMA architectures provide scalability by allowing additional processors and memory modules
to be added to the system without significantly impacting performance. This scalability is achieved through
the distributed nature of memory access, which avoids centralized bottlenecks often encountered in
symmetric multiprocessing (SMP) systems.

Flexibility: NUMA architectures offer flexibility in system design and resource allocation. System
administrators can configure memory and processor assignments based on workload requirements,
optimizing performance for specific tasks or applications.

High Availability: NUMA architectures can enhance system reliability and availability by isolating failures
to specific nodes or components. If a processor or memory module fails, it typically affects only the local
node, minimizing the impact on the rest of the system.

Overall, NUMA architectures are well-suited for high-performance computing environments, database
servers, virtualization platforms, and other applications that demand both performance and scalability. By
optimizing memory access and resource allocation, NUMA architectures enable efficient utilization of
system resources while maintaining responsiveness and scalability.
Cache Coherence:
Cache coherence : In a multiprocessor system, data inconsistency may occur among adjacent levels or
within the same level of the memory hierarchy. In a shared memory multiprocessor with a separate cache
memory for each processor, it is possible to have many copies of any one instruction operand: one copy in
the main memory and one in each cache memory. When one copy of an operand is changed, the other copies
of the operand must be changed also. Example : Cache and the main memory may have inconsistent copies
of the same object.



Suppose there are three processors, each having cache. Suppose the following scenario: -

• Processor 1 reads X: it obtains 24 from memory and caches it.
• Processor 2 reads X: it obtains 24 from memory and caches it.
• Processor 1 then writes X = 64; only its locally cached copy is updated. Now, when processor 3 reads X, what value should it get?
• Memory and processor 2 think the value is 24, while processor 1 thinks it is 64.
As multiple processors operate in parallel, and independently multiple caches may possess different copies
of the same memory block, this creates a cache coherence problem. Cache coherence is the discipline that
ensures that changes in the values of shared operands are propagated throughout the system in a timely
fashion.

Cache coherence protocols like MESI (Modified, Exclusive, Shared, Invalid) are used to implement cache
coherence in hardware. Here's how MESI works:

Modified (M):

When a cache line is in the Modified state, it means that the processor has both read and written to the data
in that cache line, and the data in the cache is different from the data in main memory.

Any writes to this cache line are not visible to other caches in the system until the cache line is either written
back to memory or invalidated.

Exclusive (E):

When a cache line is in the Exclusive state, it means that the processor has a copy of the data in that cache
line, and no other cache in the system has a copy of that data.



The data in the cache is coherent with the data in main memory, and any changes made to the data in this
cache line are not visible to other caches until the cache line is either invalidated or evicted from the cache.

Shared (S):

When a cache line is in the Shared state, it means that multiple caches in the system have a copy of the data
in that cache line, and the data is coherent with the data in main memory.

Any read accesses to this cache line can be served from any of the caches holding a copy, and any writes to
this cache line must be broadcast to invalidate or update copies in other caches to maintain coherence.

Invalid (I):

When a cache line is in the Invalid state, it means that the data in that cache line is invalid or stale, and it
cannot be used for read or write accesses until it is updated or reloaded from memory.

This state is typically used to indicate that the cache line does not contain valid data or that the data has been
modified by another processor.

MESI protocol operates as follows to maintain cache coherence:

• When a processor wants to read or write to a memory location, it first checks its cache for the
presence of that memory location.
• If the cache line is not present or is in the Invalid state, the processor may need to fetch the data from
main memory or another cache.
• If the cache line is present and in the Shared state, the processor can read from it. If it needs to write
to it, the cache line transitions to the Modified state, and the processor's cache becomes the sole
owner of the data.
• If the cache line is present and in the Modified state, the processor can read from or write to it directly
without involving other caches. Any writes are eventually propagated to main memory or other
caches to maintain coherence.
• If another processor wants to read or write to the same memory location, cache coherence
mechanisms ensure that the data remains consistent across all caches by updating or invalidating
copies as necessary.
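As an illustration only (not part of the original notes), the local view of these transitions can be captured in a small state-machine sketch in C. Bus-side data movement is omitted, and the Exclusive-on-read case (a read miss when no other cache holds the line) is folded into Shared for brevity.

    /* mesi_states.c - simplified MESI state machine for one cache line,
       covering only the transitions discussed above. */
    #include <stdio.h>

    typedef enum { INVALID, SHARED, EXCLUSIVE, MODIFIED } mesi_t;
    typedef enum { LOCAL_READ, LOCAL_WRITE, BUS_READ, BUS_WRITE } event_t;

    static mesi_t next_state(mesi_t s, event_t e) {
        switch (e) {
        case LOCAL_READ:   /* a read miss fetches the line; hits keep the state */
            return (s == INVALID) ? SHARED : s;
        case LOCAL_WRITE:  /* any local write ends in Modified (others invalidated) */
            return MODIFIED;
        case BUS_READ:     /* another cache reads: M/E lines are downgraded */
            return (s == MODIFIED || s == EXCLUSIVE) ? SHARED : s;
        case BUS_WRITE:    /* another cache writes: our copy becomes stale */
            return INVALID;
        }
        return s;
    }

    int main(void) {
        const char *name[] = { "I", "S", "E", "M" };
        mesi_t s = INVALID;
        s = next_state(s, LOCAL_READ);  printf("after local read  : %s\n", name[s]);
        s = next_state(s, LOCAL_WRITE); printf("after local write : %s\n", name[s]);
        s = next_state(s, BUS_READ);    printf("after remote read : %s\n", name[s]);
        s = next_state(s, BUS_WRITE);   printf("after remote write: %s\n", name[s]);
        return 0;
    }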

Coherency mechanisms: There are three types of coherence mechanisms:


1. Directory-based – In a directory-based system, the data being shared is placed in a common
directory that maintains the coherence between caches. The directory acts as a filter through
which the processor must ask permission to load an entry from the primary memory to its cache.
When an entry is changed, the directory either updates or invalidates the other caches with that
entry.
2. Snooping – First introduced in 1983, snooping is a process where the individual caches
monitor address lines for accesses to memory locations that they have cached. It is called a write-
invalidate protocol: when a write operation is observed to a location of which a cache has a copy,
the cache controller invalidates its own copy of the snooped memory location.
3. Snarfing – It is a mechanism where a cache controller watches both address and data in an
attempt to update its own copy of a memory location when a second master modifies a location
in main memory. When a write operation is observed to a location of which a cache has a copy, the
cache controller updates its own copy of the snarfed memory location with the new data.
The Cache Coherence Problem

In a multiprocessor system, data inconsistency may occur among adjacent levels or within the same level of
the memory hierarchy. For example, the cache and the main memory may have inconsistent copies of the same
object.

As multiple processors operate in parallel, and independently multiple caches may possess different copies of
the same memory block, this creates cache coherence problem. Cache coherence schemes help to avoid this
problem by maintaining a uniform state for each cached block of data.

Let X be an element of shared data which has been referenced by two processors, P1 and P2. In the beginning,
three copies of X are consistent. If the processor P1 writes a new data X1 into the cache, by using write-
through policy, the same copy will be written immediately into the shared memory. In this case, inconsistency
occurs between cache memory and the main memory. When a write-back policy is used, the main memory
will be updated when the modified data in the cache is replaced or invalidated.



In general, there are three sources of inconsistency problem −

• Sharing of writable data


• Process migration
• I/O activity
Snoopy Bus Protocols

Snoopy Cache Coherence Protocol: There are two ways to maintain the coherence requirement. One method
is to ensure that a processor has exclusive access to a data item before it writes that item. This style of protocol
is called a write invalidate protocol because it invalidates other copies on a write. It is the most common
protocol, both for snooping and for directory schemes. Exclusive access ensures that no other readable or
writable copies of an item exist when the write occurs: All other cached copies of the item are invalidated.

Snoopy protocols achieve data consistency between the cache memory and the shared memory through a bus-
based memory system. Write-invalidate and write-update policies are used for maintaining cache
consistency.



In this case, we have three processors P1, P2, and P3 having a consistent copy of data element ‘X’ in their
local cache memory and in the shared memory (Figure-a). Processor P1 writes X1 into its cache memory
using the write-invalidate protocol, so all other copies are invalidated via the bus; they are marked ‘I’ (Figure-
b). Invalidated blocks should not be used. The write-update protocol, by contrast, updates all the cache copies
via the bus; with a write-back cache, the memory copy is also updated (Figure-c).



Cache Events and Actions

Following events and actions occur on the execution of memory-access and invalidation commands −

• Read-miss − When a processor wants to read a block and it is not in the cache, a read-miss occurs.
This initiates a bus-read operation. If no dirty copy exists, then the main memory that has a consistent
copy, supplies a copy to the requesting cache memory. If a dirty copy exists in a remote cache memory,
that cache will inhibit the main memory and supply a copy to the requesting cache memory. In both the
cases, the cache copy will enter the valid state after a read miss.
• Write-hit − If the copy is in dirty or reserved state, write is done locally and the new state is dirty. If
the new state is valid, write-invalidate command is broadcasted to all the caches, invalidating their
copies. When the shared memory is written through, the resulting state is reserved after this first write.
• Write-miss − If a processor fails to write in the local cache memory, the copy must come either from
the main memory or from a remote cache memory with a dirty block. This is done by sending a read-
invalidate command, which will invalidate all cache copies. Then the local copy is updated with dirty
state.
• Read-hit − Read-hit is always performed in local cache memory without causing a transition of state
or using the snoopy bus for invalidation.
• Block replacement − When a copy is dirty, it is to be written back to the main memory by block
replacement method. However, when the copy is either in valid or reserved or invalid state, no
replacement will take place.
The various activities of Snoopy protocol are given below:
1. Read request by the processor which is a hit – the cache block can be in the shared state or modified state –
Normal hit operation where the data is read from the local cache.
2. Read request by the processor, which is a miss. This indicates that the cache block can be in any of the
following three states:
a. Invalid – It is a normal miss and the read request is placed on the bus. The requested block will be
brought from memory and the status will become shared.
b. Shared – It is a replacement miss, probably because of an address conflict. The read request is placed
on the bus and the requested block will be brought from memory and the status will become shared.
c. Modified – It is a replacement miss, probably because of an address conflict. The read request is placed
on the bus, the processor-cache holding it in the modified state writes it back to memory and the requested
block will be brought from memory and the status will become shared in both the caches.
3. Write request by the processor which is a hit – the cache block can be in the shared state or modified state.
a. Modified – Normal hit operation where the data is written in the local cache.



b. Shared – It is a coherence action. The status of the block has to be changed to modified and it is hence
called upgrade or ownership misses. Invalidates will have to be sent on the bus to invalidate all the other copies
in the shared state.
4. Write request by the processor, which is a miss. This indicates that the cache block can be in any of the
following three states:
a. Invalid – It is a normal miss and the write request is placed on the bus. The requested block will be
brought from memory and the status will become modified.
b. Shared – It is a replacement miss, probably because of an address conflict. The write request is placed
on the bus and the requested block will be brought from memory and the status will become modified. The
other shared copies will be invalidated.
c. Modified – It is a replacement miss, probably because of an address conflict. The write request is placed
on the bus, the processor-cache holding it in the modified state writes it back to memory, is invalidated and
the requested block will be brought from memory and the status will become modified in the writing cache.
5. From the bus side, a read miss could be put out, and the cache block can be in the shared state or modified
state
a. Shared – Either one of the caches holding the data in the shared state or the memory will respond to the
miss by sending the block
b. Modified – A coherence action has to take place. The block has to be supplied to the requesting cache
and the status of the block in both the caches is shared.
6. The bus sends out an invalidate when a write request comes for a shared block. The shared block has to be
invalidated and this is a coherence action.
7. From the bus side, a write miss could be put out, and the cache block can be in the shared state or modified
state
a. Shared – It is a write request for a shared block. So, the block has to be invalidated and it is a coherence
action.
b. Modified – A coherence action has to take place. The block has to be written back and its status has to
be invalidated in the original cache.

Directory-Based Protocols
Similar to Snoopy Protocol: Three states

➢ Shared: 1 or more processors have the block cached, and the value in memory is up-to-date (as well as
in all the caches)
➢ Uncached: no processor has a copy of the cache block (not valid in any cache)
➢ Exclusive: Exactly one processor has a copy of the cache block, and it has written the block, so the
memory copy is out of date
• The processor is called the owner of the block
• In addition to tracking the state of each cache block, we must track the processors that have
copies of the block when it is shared (usually a bit vector for each memory block: 1 if processor
has copy)
• Keep it simple(r):
– Writes to non-exclusive data => write miss
– Processor blocks until access completes
– Assume messages are received and acted upon in the order sent

local node: the node where a request originates


home node: the node where the memory location and directory entry of an address reside
remote node: the node that has a copy of a cache block (exclusive or shared)
By using a multistage network for building a large multiprocessor with hundreds of processors, the snoopy
cache protocols need to be modified to suit the network capabilities. Broadcasting being very expensive to
perform in a multistage network, the consistency commands are sent only to those caches that keep a copy of
the block. This is the reason for development of directory-based protocols for network-connected
multiprocessors.
In a directory-based protocols system, data to be shared are placed in a common directory that maintains the
coherence among the caches. Here, the directory acts as a filter where the processors ask permission to load
an entry from the primary memory to its cache memory. If an entry is changed the directory either updates it
or invalidates the other caches with that entry.



Comparing directory protocols to snooping protocols:

– identical states

– the stimulus is almost identical

– a write to a shared cache block is treated as a write miss (without fetching the block)

– the cache block must be in the exclusive state when it is written

– any shared block must be up to date in memory

– on a write miss, data fetch and selective invalidate operations are sent by the directory controller (instead of the broadcast used in snooping protocols)

Directory Operations: Requests and Actions
• A message sent to the directory causes two actions: update the directory, and send additional messages to satisfy the request.
• Block is in the Uncached state: the copy in memory is the current value; the only possible requests for that
block are:
• Read miss: the requesting processor is sent the data from memory and becomes the only sharing node; the state of the
block is made Shared.
• Write miss: the requesting processor is sent the value and becomes the sharing node. The block is made
Exclusive to indicate that the only valid copy is cached. Sharers indicates the identity of the owner.
• Block is Shared => the memory value is up-to-date:
• Read miss: requesting processor is sent back the data from memory & requesting processor is added
to the sharing set.



• Write miss: requesting processor is sent the value. All processors in the set Sharers are sent invalidate
messages, & Sharers is set to identity of requesting processor. The state of the block is made Exclusive.
• Block is Exclusive: current value of the block is held in the cache of the processor identified by the set
Sharers (the owner) => three possible directory requests:
• Read miss: owner processor sent data fetch message, causing state of block in owner’s cache to
transition to Shared and causes owner to send data to directory, where it is written to memory & sent
back to requesting processor. Identity of requesting processor is added to set Sharers, which still
contains the identity of the processor that was the owner (since it still has a readable copy). State is
shared.
• Data write-back: owner processor is replacing the block and hence must write it back, making memory
copy up-to-date (the home directory essentially becomes the owner), the block is now Uncached, and
the Sharer set is empty.
• Write miss: block has a new owner. A message is sent to old owner causing the cache to send the value
of the block to the directory from which it is sent to the requesting processor, which becomes the new
owner. Sharers is set to identity of new owner, and state of block is made Exclusive.
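A data-structure sketch of one directory entry, assuming at most 64 processors and ignoring the actual message traffic (illustrative only; the field and function names are invented for this example):

    /* directory_entry.c - one directory entry: a state plus a bit vector of
       sharers, as described above. */
    #include <stdint.h>
    #include <stdio.h>

    typedef enum { UNCACHED, SHARED_STATE, EXCLUSIVE_STATE } dir_state_t;

    typedef struct {
        dir_state_t state;
        uint64_t    sharers;   /* bit i set => processor i holds a copy */
    } dir_entry_t;

    /* Handle a read miss from 'cpu' for the block tracked by this entry. */
    static void read_miss(dir_entry_t *e, int cpu) {
        if (e->state == EXCLUSIVE_STATE) {
            /* In a real protocol the owner would be asked to write back and
               supply the data; here we only update the bookkeeping. */
            e->state = SHARED_STATE;
        } else if (e->state == UNCACHED) {
            e->state = SHARED_STATE;
        }
        e->sharers |= (1ULL << cpu);      /* requester joins the sharing set */
    }

    /* Handle a write miss: invalidate all other sharers, make 'cpu' the owner. */
    static void write_miss(dir_entry_t *e, int cpu) {
        e->sharers = (1ULL << cpu);       /* invalidations go to old sharers */
        e->state   = EXCLUSIVE_STATE;
    }

    int main(void) {
        dir_entry_t e = { UNCACHED, 0 };
        read_miss(&e, 0);
        read_miss(&e, 2);
        write_miss(&e, 1);
        printf("state=%d sharers=0x%llx\n", e.state,
               (unsigned long long)e.sharers);   /* state=2 sharers=0x2 */
        return 0;
    }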
Hardware Synchronization Mechanisms
Synchronization is a special form of communication in which, instead of data, control information is exchanged
between communicating processes residing in the same or different processors.

Multiprocessor systems use hardware mechanisms to implement low-level synchronization operations. Most
multiprocessors have hardware mechanisms to impose atomic operations such as memory read, write or read-
modify-write operations to implement some synchronization primitives. Other than atomic memory
operations, some inter-processor interrupts are also used for synchronization purposes.

Cache Coherency in Shared Memory Machines

Maintaining cache coherency is a problem in multiprocessor system when the processors contain local cache
memory. Data inconsistency between different caches easily occurs in this system.

The major concern areas are −

• Sharing of writable data


• Process migration
• I/O activity



Sharing of writable data

When two processors (P1 and P2) have same data element (X) in their local caches and one process (P1) writes
to the data element (X), as the caches are write-through local cache of P1, the main memory is also updated.
Now when P2 tries to read data element (X), it does not find X because the data element in the cache of P2
has become outdated.

Process migration

In the first stage, cache of P1 has data element X, whereas P2 does not have anything. A process on P2 first
writes on X and then migrates to P1. Now, the process starts reading data element X, but as the processor P1
has outdated data the process cannot read it. So, a process on P1 writes to the data element X and then migrates
to P2. After migration, a process on P2 starts reading the data element X but it finds an outdated version of X
in the main memory.



I/O activity

As illustrated in the figure, an I/O device is added to the bus in a two-processor multiprocessor architecture.
In the beginning, both the caches contain the data element X. When the I/O device receives a new element X,
it stores the new element directly in the main memory. Now, when either P1 or P2 (assume P1) tries to read
element X it gets an outdated copy. So, P1 writes to element X. Now, if I/O device tries to transmit X it gets
an outdated copy.

Synchronization:

Synchronization mechanisms are typically built with user-level software routines that rely on hardware-supplied
synchronization instructions.

Why Synchronize? Need to know when it is safe for different processes to use shared data

• Issues for synchronization:
– An uninterruptible instruction to fetch and update memory (an atomic operation)
– User-level synchronization operations built using this primitive
– For large-scale MPs, synchronization can be a bottleneck; techniques are needed to reduce the contention and latency of synchronization

Uninterruptible Instruction to Fetch and Update Memory

• Atomic exchange: interchange a value in a register for a value in memory
– 0 ⇒ the synchronization variable is free; 1 ⇒ the synchronization variable is locked and unavailable
– Set the register to 1 and swap
– The new value in the register determines success in getting the lock: 0 if you succeeded in setting the lock (you were first), 1 if another processor had already claimed access
– The key is that the exchange operation is indivisible

• Test-and-set: tests a value and sets it if the value passes the test



• Fetch-and-increment: it returns the value of a memory location and atomically increments it – 0 ⇒
synchronization variable is free

• Hard to have read & write in 1 instruction: use 2 instead

• Load linked (or load locked) + store conditional – Load linked returns the initial value – Store conditional
returns 1 if it succeeds (no other store to same memory location since preceding load) and 0 otherwise

User-Level Synchronization — Operation Using this Primitive

• Spin locks: the processor continuously tries to acquire the lock, spinning around a loop until it succeeds:

        li      R2, #1
    lockit: exch    R2, 0(R1)    ; atomic exchange
        bnez    R2, lockit       ; already locked?

• What about an MP with cache coherency?
– Want to spin on the cache copy to avoid full memory latency
– Likely to get cache hits for such variables
• Problem: the exchange includes a write, which invalidates all other copies; this generates considerable bus traffic
• Solution: start by simply repeatedly reading the variable; when it changes, then try the exchange (“test and test&set”), as sketched below:
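The same test-and-test-and-set idea written with C11 atomics (a sketch under the assumption of a C11 compiler; the type and function names are illustrative): threads spin on plain loads of the lock word and attempt the atomic exchange only when the lock appears free.

    /* ttas_lock.c - test-and-test-and-set spin lock with C11 atomics. */
    #include <stdatomic.h>

    typedef struct { atomic_int locked; } spinlock_t;   /* 0 = free, 1 = held */

    void spin_lock(spinlock_t *l) {
        for (;;) {
            /* Spin on the local cache copy: plain reads cause no bus traffic
               until the holder releases the lock and invalidates the line. */
            while (atomic_load_explicit(&l->locked, memory_order_relaxed) != 0)
                ;
            /* Lock looks free: try the atomic exchange (the "set" step). */
            if (atomic_exchange_explicit(&l->locked, 1, memory_order_acquire) == 0)
                return;                       /* we got the lock */
        }
    }

    void spin_unlock(spinlock_t *l) {
        atomic_store_explicit(&l->locked, 0, memory_order_release);
    }

The acquire/release orderings keep the critical section from being reordered past the lock and unlock operations.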

Synchronization in multithreaded programming is essential for coordinating access to shared resources and
ensuring correct behavior in concurrent execution environments. However, it introduces various challenges
due to the non-deterministic nature of thread scheduling and the potential for race conditions, deadlocks,
and performance overhead. Here are some challenges associated with synchronization in multithreaded
programming along with examples of synchronization mechanisms used to address them:



Race Conditions:
Challenge: Race conditions occur when the outcome of program execution depends on the timing or
interleaving of thread execution, leading to non-deterministic behavior and incorrect results.
Example Solution: Mutexes (Mutual Exclusion): Mutexes are synchronization primitives that allow only
one thread to access a critical section of code or a shared resource at a time. By acquiring a mutex before
accessing the shared resource and releasing it afterward, threads can synchronize their access and prevent
race conditions.
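A minimal pthread sketch of this solution (illustrative; the counter and thread count are arbitrary): the mutex makes the increment a critical section, so the final total is always correct.

    /* mutex_counter.c - a mutex serializes access to the shared counter,
       eliminating the race condition. */
    #include <pthread.h>
    #include <stdio.h>

    static long counter = 0;
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    static void *worker(void *arg) {
        (void)arg;
        for (int i = 0; i < 100000; i++) {
            pthread_mutex_lock(&lock);    /* enter the critical section */
            counter++;                    /* only one thread at a time here */
            pthread_mutex_unlock(&lock);
        }
        return NULL;
    }

    int main(void) {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, worker, NULL);
        pthread_create(&t2, NULL, worker, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        printf("counter = %ld\n", counter);   /* always 200000 */
        return 0;
    }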
Deadlocks:
Challenge: Deadlocks occur when two or more threads are waiting for each other to release resources that
they need, resulting in a cyclic dependency and a state where no thread can proceed.
Example Solution: Lock Ordering: One way to prevent deadlocks is by establishing a global ordering on
locks and requiring threads to acquire locks in a consistent order. This ensures that threads cannot hold one
lock while waiting for another that is held by another thread, thereby preventing cyclic dependencies.
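A short sketch of lock ordering (the account structure and the address-based ordering rule are assumptions for illustration): both transfer directions acquire the two locks in the same global order, so no cyclic wait can form.

    /* lock_ordering.c - always lock the account at the lower address first. */
    #include <pthread.h>

    typedef struct {
        pthread_mutex_t lock;
        long balance;
    } account_t;

    void transfer(account_t *from, account_t *to, long amount) {
        /* Global ordering rule: lock the lower-addressed account first,
           regardless of the direction of the transfer. */
        account_t *first  = (from < to) ? from : to;
        account_t *second = (from < to) ? to : from;

        pthread_mutex_lock(&first->lock);
        pthread_mutex_lock(&second->lock);

        from->balance -= amount;
        to->balance   += amount;

        pthread_mutex_unlock(&second->lock);
        pthread_mutex_unlock(&first->lock);
    }

With this rule, transfer(&a, &b, 10) running concurrently with transfer(&b, &a, 5) cannot deadlock, because both calls lock the same account first.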
Priority Inversion:
Challenge: Priority inversion occurs when a low-priority thread holds a resource required by a high-priority
thread, causing the high-priority thread to wait longer than expected.
Example Solution: Priority Inheritance: Priority inheritance is a synchronization technique where the
priority of a low-priority thread temporarily inherits the priority of a high-priority thread while it holds a
resource required by the high-priority thread. This prevents priority inversion by ensuring that the resource
holder does not block the execution of higher-priority threads.
Performance Overhead:
Challenge: Synchronization mechanisms such as locks and mutexes incur performance overhead due to
context switching, contention, and serialization of execution.
Example Solution: Lock-Free Data Structures: Lock-free data structures, such as lock-free queues or lock-
free linked lists, are designed to allow concurrent access by multiple threads without the need for traditional
locking mechanisms. These data structures use atomic operations or compare-and-swap instructions to
ensure consistency without explicit locking, reducing contention and overhead.
Granularity:
Challenge: Choosing the appropriate granularity of synchronization can be challenging, as coarse-grained
locking may lead to reduced concurrency and scalability, while fine-grained locking may increase overhead
and complexity.
Example Solution: Read-Write Locks: Read-write locks allow multiple threads to concurrently read a shared
resource while ensuring exclusive access for writing. This provides a balance between concurrency and
synchronization overhead by allowing concurrent read access and serializing write access.
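A brief pthread read-write lock sketch (the names and workload are illustrative): multiple readers may hold the lock concurrently, while a writer gets exclusive access.

    /* rwlock_example.c - concurrent readers, exclusive writer. */
    #include <pthread.h>
    #include <stdio.h>

    static pthread_rwlock_t rwlock = PTHREAD_RWLOCK_INITIALIZER;
    static int shared_config = 0;

    static void *reader(void *arg) {
        (void)arg;
        pthread_rwlock_rdlock(&rwlock);       /* may run alongside other readers */
        printf("read config = %d\n", shared_config);
        pthread_rwlock_unlock(&rwlock);
        return NULL;
    }

    static void *writer(void *arg) {
        (void)arg;
        pthread_rwlock_wrlock(&rwlock);       /* exclusive access */
        shared_config++;
        pthread_rwlock_unlock(&rwlock);
        return NULL;
    }

    int main(void) {
        pthread_t r1, r2, w;
        pthread_create(&w,  NULL, writer, NULL);
        pthread_create(&r1, NULL, reader, NULL);
        pthread_create(&r2, NULL, reader, NULL);
        pthread_join(w, NULL);
        pthread_join(r1, NULL);
        pthread_join(r2, NULL);
        return 0;
    }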
Starvation and Fairness:



Challenge: Starvation occurs when a thread is consistently denied access to a shared resource due to the
unfair scheduling of other threads. Ensuring fairness in resource allocation is essential to prevent starvation.
Example Solution: Fair Scheduling Policies: Schedulers can implement fair scheduling policies that
prioritize threads based on factors such as waiting time, thread priority, or the number of times a thread has
been preempted. Fair scheduling helps prevent starvation by ensuring that all threads have an opportunity
to access shared resources.
Memory Consistency Models:
• What is consistency? When must a processor see a new value?
• Is it impossible for both if statements L1 and L2 (in the code fragment sketched below) to be true?
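The two if statements L1 and L2 referred to above come from the standard textbook fragment; a minimal reconstruction in C is sketched here, assuming shared variables A and B both initialised to 0:

    /* Shared variables, both initially 0. */
    int A = 0, B = 0;

    void p1(void) {           /* runs on processor P1 */
        A = 1;
        if (B == 0) {         /* L1 */
            /* ... P1 assumes P2 has not yet started ... */
        }
    }

    void p2(void) {           /* runs on processor P2 */
        B = 1;
        if (A == 0) {         /* L2 */
            /* ... P2 assumes P1 has not yet started ... */
        }
    }

Under sequential consistency at most one of the two tests can succeed: whichever assignment is ordered second must be visible before the other processor evaluates its if statement. If the write invalidations are delayed and each processor reads a stale cached copy, both tests can succeed, which is exactly the situation a memory consistency model must define.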


• What if the write invalidate is delayed and the processor continues?
• Memory consistency models define the rules for such cases.
• Sequential consistency: the result of any execution is the same as if the accesses of each processor were kept in
order and the accesses among different processors were interleaved ⇒ the assignments must complete before the ifs above.
– SC implementation: delay all memory accesses until all invalidates are done.
• Several schemes give faster execution than sequential consistency.
• This is not an issue for most programs, because they are synchronized.
– A program is synchronized if all accesses to shared data are ordered by synchronization operations.
• Only programs willing to be nondeterministic are not synchronized; in a “data race” the outcome is a function of processor speed.
• Several relaxed models for memory consistency exist, since most programs are synchronized; they are characterized by
their attitude towards RAR, WAR, RAW, and WAW orderings to different addresses.



Relaxed Consistency Models : The Basics
• Key idea: allow reads and writes to complete out of order, but to use synchronization operations to enforce
ordering, so that a synchronized program behaves as if the processor were sequentially consistent
• By relaxing orderings, may obtain performance advantages
• Also specifies range of legal compiler optimizations on shared data
• Unless synchronization points are clearly defined and programs are synchronized, compiler could not
interchange read and write of 2 shared data items because might affect the semantics of the program
3 major sets of relaxed orderings:
1. W→R ordering (all writes completed before the next read). Because it retains ordering among writes, many
programs that operate correctly under sequential consistency also operate under this model without additional
synchronization. This is called processor consistency.
2. W → W ordering (all writes completed before next write)
3. R → W and R → R orderings, a variety of models depending on ordering restrictions and how
synchronization operations enforce ordering
• There are many complexities in relaxed consistency models: defining precisely what it means for a write to
complete, and deciding when processors can see values that another processor has written.
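A C11 sketch (illustrative, not from the notes) of how a synchronization operation restores the needed ordering under a relaxed model: the release store on the flag guarantees that the earlier data write is visible to the thread that performs the matching acquire load.

    /* acquire_release.c - relaxed hardware may reorder plain stores, but
       release/acquire on the flag enforces the ordering the program needs. */
    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdio.h>

    static int payload = 0;                  /* ordinary shared data */
    static atomic_int ready = 0;             /* synchronization flag */

    static void *producer(void *arg) {
        (void)arg;
        payload = 42;                                            /* data write */
        atomic_store_explicit(&ready, 1, memory_order_release);  /* publish */
        return NULL;
    }

    static void *consumer(void *arg) {
        (void)arg;
        while (atomic_load_explicit(&ready, memory_order_acquire) == 0)
            ;                                 /* wait for the flag */
        printf("payload = %d\n", payload);    /* guaranteed to print 42 */
        return NULL;
    }

    int main(void) {
        pthread_t p, c;
        pthread_create(&c, NULL, consumer, NULL);
        pthread_create(&p, NULL, producer, NULL);
        pthread_join(p, NULL);
        pthread_join(c, NULL);
        return 0;
    }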
Sequential Consistency vs. Relaxed Memory Consistency:

Definition. Sequential consistency: the execution of a concurrent program is required to appear as if all memory accesses by all threads occur in some sequential order that is consistent with the program's control flow; the result of any execution is the same as if the operations of all processors were executed in some sequential order, with the operations of each individual processor appearing in the order specified by its program. Relaxed consistency: relaxed memory consistency models relax the strict ordering requirements of sequential consistency to improve performance and scalability in concurrent systems; they allow reordering of memory operations by the compiler, processor, or memory subsystem, as long as the observed behavior of the program is consistent with some well-defined memory consistency model.

Ordering guarantees. Sequential consistency provides a strong consistency model where all threads observe the same global ordering of memory operations; reads and writes appear to occur instantaneously and in the order specified by the program. Relaxed consistency provides weaker guarantees: reads and writes may be reordered or observed out of order by different threads.

Synchronization overhead. Achieving sequential consistency may require additional synchronization mechanisms, such as memory barriers or locks, to enforce the required memory orderings. Relaxed memory consistency models may reduce the need for explicit synchronization mechanisms, leading to improved performance and scalability; however, reasoning about the correctness of programs under relaxed consistency models can be more challenging.

Example. Sequential consistency: consider a multi-threaded program where multiple threads update a shared counter variable; the final value of the counter will be the sum of all increments and decrements in the order they were performed by the threads. Relaxed consistency: consider a program where one thread produces data and another thread consumes it; the consumer thread may observe the produced data out of order with respect to the producer thread's execution. As long as the program's logic is designed to handle such out-of-order observations correctly, relaxed consistency can improve performance by allowing more opportunities for parallel execution and optimization.

Out-of-order execution is a performance optimization technique used in modern processors to improve
instruction-level parallelism and overall throughput. In traditional processors, instructions are executed in
the order specified by the program, but out-of-order execution allows the processor to execute instructions
in an order that maximizes resource utilization and reduces pipeline stalls.
Here's how out-of-order execution typically works:
Instruction Dispatch: Instructions are fetched from memory and decoded, and their dependencies are
analyzed to determine whether they can be executed.
Instruction Scheduling: Instructions are dispatched to execution units based on availability and
dependencies. If an instruction's operands are not ready, it may be delayed until the required data becomes
available.
Out-of-Order Execution: Instructions are executed independently and out of program order whenever
possible, as long as data dependencies are satisfied. This allows the processor to utilize idle execution units
and keep the pipeline busy.
Retirement: Completed instructions are committed to architectural state and memory in program order to
maintain the illusion of sequential execution to software.
While out-of-order execution improves performance by exploiting instruction-level parallelism, it
introduces challenges related to memory consistency, particularly in concurrent programming environments
with shared memory. Memory consistency refers to the order in which memory operations are observed by
different threads or processors in a multiprocessor system.
Out-of-order execution can lead to memory consistency violations in situations where:
Reordering of Memory Operations: Out-of-order execution may cause memory operations (e.g., loads and
stores) to be executed out of program order. While this reordering is transparent to the executing thread, it
can lead to inconsistencies when observed by other threads or processors accessing shared memory.
Store Buffering: Modern processors often employ store buffers to temporarily hold store operations before
they are committed to memory. This buffering allows stores to be executed out of order with respect to other
instructions. However, if a subsequent load operation observes the value from the store buffer before it is
committed to memory, it can result in an inconsistent view of memory for other threads.



Weak Memory Ordering: Some processors employ weak memory consistency models, such as Total Store
Order (TSO) or Partial Store Order (PSO), where certain memory operations may not be globally ordered.
Out-of-order execution can exacerbate the effects of weak memory ordering by allowing memory operations
to be observed out of order by different threads or processors.
Simultaneous Multithreading:
Simultaneous multithreading (SMT) allows multiple independent threads to issue instructions to a superscalar
processor's functional units in a single cycle. Simultaneous multithreading significantly increases processor
utilization in the face of both long instruction latencies and limited available parallelism per thread.

Features:
• These studies present several models of simultaneous multithreading and compare them with alternative
organizations: a wide superscalar, a fine-grain multithreaded processor, and single-chip, multiple-issue
multiprocessing architectures.
• The results show that both (single-threaded) superscalar and fine-grain multithreaded architectures are
limited in their ability to utilize the resources of a wide-issue superscalar processor.
• Simultaneous multithreading has the potential to achieve four times the throughput of a superscalar, and
double that of fine-grain multithreading.
• Simultaneous multithreading is also an attractive alternative to single-chip multiprocessors; simultaneous
multithreaded processors with a wide range of organizations outperform corresponding conventional
multiprocessors with similar execution resources.
• These results also show that the throughput gains from simultaneous multithreading can be achieved
without extensive changes to a conventional wide-issue superscalar, either in hardware structures or sizes.
Goals:
An architecture for simultaneous multithreading should achieve three goals:
1. It minimizes the architectural impact on a conventional superscalar design.
2. It has minimal performance impact on a single thread executing alone.
3. It achieves significant throughput gains when running multiple threads.
Fundamentals of Simultaneous Multithreading:
• The simultaneous multithreading architecture studied achieves a throughput of 5.4 instructions per cycle,
a 2.5-fold improvement over an unmodified superscalar with comparable hardware resources.
• This speedup is enhanced by an advantage of multithreading previously unexploited in other
architectures: the ability to choose for fetch and issue those threads that can use the processor most
efficiently in each cycle, thereby providing the “best” instructions to the processor.
• An analytic response-time model shows that the benefits of simultaneous multithreading in
multiprogrammed environments are not limited to increased throughput.
• Those throughput increases cause significant reductions in queueing time for runnable processes,
leading to response-time improvements that in many cases are considerably greater than the throughput
improvements themselves.
• Simultaneous multithreading (SMT) is a processor design that combines hardware multithreading with
superscalar processor technology. Simultaneous multithreading can use multiple threads to issue
instructions every cycle.
• In earlier hardware multithreaded architectures, only a single hardware context (thread) is active on any
cycle. SMT allows all thread contexts to simultaneously compete for and share processor resources.
Unlike conventional superscalar processors, which suffer from a lack of per-thread instruction-level
parallelism, simultaneous multithreading uses multiple threads to compensate for low single-thread
instruction-level parallelism.
• Simultaneous multithreading uses more than one thread to run different instructions in the same clock
cycle, making use of the execution units left idle by the primary thread.
Fundamental Processor Structure:
For simultaneous multithreading, the modifications required to the basic processor structure are as follows.
• The ability to fetch instructions from multiple threads in a cycle.
• A larger register file to hold data from multiple threads.
Overall Performance Benefits:
The performance benefits of a system that can run simultaneous multithreading are as follows.
• Higher instruction throughput.
• Programs run faster for several workloads, including commercial databases, web servers, and scientific
applications, in both multiprogrammed and parallel environments.
Why SMT is good:
SMT implementations can be very efficient in terms of die area and power consumption, at least when
compared with fully duplicating processor resources. With less than a 5% increase in die area, Intel claims
a roughly 30% performance boost from using SMT for multithreaded workloads.
Does SMT increase performance:
In this case, SMT gives a substantial performance-per-watt increase. On average there are modest gains
(+22% on multithreaded workloads) to be had, and gaming performance is not disturbed, so it is well worth
keeping SMT enabled on Zen 3.

