Unit 4 - Analytical Modeling of Parallel Programs
College of Engineering
Subject: High Performance Computing
Unit 4
Analytical Modeling of
Parallel Programs
By- Prof. Gunjan Deshmukh
SNJB’s KBJ CoE | Civil | Computer | E&TC | Mechanical | AIDS | MBA | Visit Us @: www.snjb.org
SNJB’s Late Sau. K. B. J. College of Engineering
Syllabus
Sources of Overhead in Parallel Programs
● Interprocess communication: Parallel programming involves executing multiple tasks simultaneously on multiple processors or nodes. These nodes need to communicate with each other to exchange data and coordinate their actions. This communication can be time-consuming and can introduce overhead, particularly when nodes need to synchronize their actions to ensure that they work correctly.
● Synchronization overhead: Synchronization can introduce overhead because it requires the processors to wait for each other before proceeding.
● Load imbalance: Different processors may have different workloads, which can lead to load imbalance. This can introduce overhead because some processors may be idle while others are busy, which can result in wasted computing capacity.
● I/O overhead: In parallel programs, input and output (I/O) operations can be a significant source of overhead, especially when many processors share the same I/O channels.
● Memory contention: Multiple processors may access the same memory location simultaneously. This can introduce overhead because the processors need to coordinate their access to ensure that data is not corrupted or lost.
● Startup and shutdown overhead: In some parallel programs, significant overhead can be introduced during program startup and shutdown, particularly when multiple processors or nodes have to be initialized or terminated.
Performance Measures and Analysis: Amdahl's Laws
● Amdahl’s law was presented by Gene Amdahl at the AFIPS Spring Joint Computer Conference in 1967.
● It is a formula that gives the theoretical speedup in latency of the execution of a task at a fixed workload that can be expected of a system whose resources are improved.
● In other words, it is a formula used to find the maximum improvement possible by enhancing just one part of a system. Speedup can be defined as the ratio of the execution time for the entire task without using the enhancement to the execution time for the entire task using the enhancement.
● If Pe is the performance for the entire task using the enhancement when possible, Pw is the performance for the entire task without using the enhancement, Ew is the execution time for the entire task without using the enhancement, and Ee is the execution time for the entire task using the enhancement, then Speedup = Pe / Pw = Ew / Ee.
● The first law, also known as the strong scaling law, states that the maximum speedup that can be achieved by parallelizing a computation is limited by the portion of the computation that cannot be parallelized. Mathematically, the speedup is given by:
● speedup = (serial fraction + parallel fraction) / (serial fraction + parallel fraction/N)
● where N is the number of processors used to parallelize the computation, and the serial fraction is the portion of the computation that cannot be parallelized. This law implies that the speedup is limited by the serial fraction, which is a fixed value and cannot be reduced through parallelization.
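The first-law formula can be sketched in a few lines of Python (an illustrative sketch; the function name and the normalisation serial fraction + parallel fraction = 1 are assumptions, not from the slides):

```python
def amdahl_speedup(serial_fraction, n_processors):
    """Speedup predicted by Amdahl's first law.

    Assumes the fractions are normalised so that
    serial_fraction + parallel_fraction = 1.
    """
    parallel_fraction = 1.0 - serial_fraction
    return (serial_fraction + parallel_fraction) / (
        serial_fraction + parallel_fraction / n_processors
    )

# With a 10% serial fraction the speedup saturates near 1/0.1 = 10,
# no matter how many processors are added.
print(round(amdahl_speedup(0.1, 4), 3))     # 3.077
print(round(amdahl_speedup(0.1, 1000), 3))  # 9.911
```

The second print illustrates the law's main message: the serial fraction caps the speedup regardless of N.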
Performance Measures and Analysis: Amdahl's Laws
● The second law, also known as the weak scaling law, states that the maximum speedup that can be achieved by increasing the number of processors is limited by the communication overhead between them. Mathematically:
● speedup = (serial fraction + parallel fraction) / (serial fraction + parallel fraction/N + overhead)
● where overhead is the time spent on communication between the processors. This law implies that
the speedup is limited by the communication overhead, which increases as the number of processors
increases.
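The effect of the overhead term can be illustrated in the same way (a sketch; the linear overhead model 0.001 × N is an assumption chosen only to show the turning point):

```python
def speedup_with_overhead(serial_fraction, n_processors, overhead):
    """Speedup when a communication-overhead term is added to the
    denominator, as in the second formula above."""
    parallel_fraction = 1.0 - serial_fraction
    return (serial_fraction + parallel_fraction) / (
        serial_fraction + parallel_fraction / n_processors + overhead
    )

# If overhead grows with the processor count (assumed linear here),
# adding processors eventually makes the program slower again.
best_n = max(range(1, 257),
             key=lambda n: speedup_with_overhead(0.05, n, 0.001 * n))
print(best_n)  # 31
```

Past that processor count, the growing overhead term outweighs the shrinking parallel term.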
Performance Measures and Analysis: Amdahl's Laws
● The overall speedup is the ratio of the execution time without the enhancement to the execution time with the enhancement:
● Overall Speedup = 1 / ((1 − Fraction enhanced) + Fraction enhanced / Speedup enhanced)
Performance Measures and Analysis: Amdahl's Laws
● Amdahl’s law uses two factors to find speedup from some enhancement:
● Fraction enhanced – The fraction of the computation time in the original computer that can be
converted to take advantage of the enhancement.
● For example, if 10 seconds of the execution time of a program that takes 40 seconds in total can use an enhancement, the fraction is 10/40. This obtained value is the Fraction enhanced. Fraction enhanced is always less than 1.
● Speedup enhanced – The improvement gained by the enhanced execution mode; that is, how much
faster the task would run if the enhanced mode were used for the entire program.
● For example, if the enhanced mode takes, say, 3 seconds for a portion of the program that takes 6 seconds in the original mode, the improvement is 6/3. This value is the Speedup enhanced. Speedup enhanced is always greater than 1.
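Combining the two example values above (Fraction enhanced = 10/40 = 0.25, Speedup enhanced = 6/3 = 2) in a small Python check (the function name is an assumption for illustration):

```python
def overall_speedup(fraction_enhanced, speedup_enhanced):
    """Amdahl's overall speedup computed from the two factors
    defined above."""
    return 1.0 / ((1.0 - fraction_enhanced)
                  + fraction_enhanced / speedup_enhanced)

# 10 s of a 40 s program runs 2x faster: 1 / (0.75 + 0.125) = 8/7
print(round(overall_speedup(10 / 40, 6 / 3), 4))  # 1.1429
```

Doubling the speed of a quarter of the program yields only about a 14% overall improvement.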
Performance Measures and Analysis: Gustafson's Laws
● Amdahl’s law is suitable for applications where the response time for a fixed problem size is critical. On the other hand, there are many applications in which the problem size is scaled up as more resources become available, for example to obtain higher accuracy in the resultant output; Gustafson’s law addresses this setting.
● The basic idea behind Gustafson’s laws is that the speedup achieved by parallelizing a computation can increase as the problem size increases, unlike in Amdahl’s laws, where the speedup is bounded by a fixed serial fraction.
● The laws were proposed by computer scientist John Gustafson in 1988 and have since become a standard tool for reasoning about scaled problem sizes.
● The first law states that the amount of work that can be parallelized increases with the size of the problem. Mathematically, the scaled speedup is given by:
● Scaled Speedup = S + P × N
● where S is the serial fraction of the computation, P is the parallelizable fraction (with S + P = 1), and N is the number of processors.
● This law implies that the parallel fraction can increase as the problem size increases, which allows the achievable speedup to grow with the number of processors.
● The second law states that the total execution time decreases as the problem size increases, assuming that the parallel fraction is increased proportionally to the problem size. Mathematically, the execution time can be written as:
● Execution Time = S + T(N)
● where T(N) is the time required to complete the parallelizable fraction on N processors.
● This law implies that the execution time per unit of work can decrease as the problem size increases, which makes larger problems practical on larger machines.
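The first law can be sketched as follows (assuming, as above, that S + P = 1; the function name is an assumption):

```python
def gustafson_scaled_speedup(serial_fraction, n_processors):
    """Scaled speedup S + P * N from Gustafson's first law,
    assuming serial_fraction + parallel_fraction = 1."""
    parallel_fraction = 1.0 - serial_fraction
    return serial_fraction + parallel_fraction * n_processors

# Unlike Amdahl's bound, the scaled speedup keeps growing with N:
print(round(gustafson_scaled_speedup(0.1, 10), 1))   # 9.1
print(round(gustafson_scaled_speedup(0.1, 100), 1))  # 90.1
```

Contrast this with the Amdahl formula, where the same 10% serial fraction caps the speedup at 10.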
Speedup and Efficiency
● Speedup (S) is defined as the ratio of the time taken to run the sequential algorithm to the time taken to run the parallel algorithm:
● S = T(1) / T(N)
● where T(1) is the time taken to run the algorithm on a single processor and T(N) is the time taken to run it on N processors.
● Efficiency is defined as the ratio of the speedup factor to the number of processors used. Mathematically, efficiency (E) is given by:
● E = (S / N) x 100%
● In other words, efficiency measures how well the parallel algorithm scales with the number of processors used. Ideally, an efficient parallel algorithm should achieve a speedup factor that is close to linear in the number of processors used, resulting in a high efficiency.
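Both metrics can be computed directly from measured runtimes (a minimal sketch; the example timings are assumed):

```python
def speedup(t_serial, t_parallel):
    """S = T(1) / T(N)."""
    return t_serial / t_parallel

def efficiency_percent(t_serial, t_parallel, n_processors):
    """E = (S / N) x 100%."""
    return speedup(t_serial, t_parallel) / n_processors * 100.0

# A 100 s job finishing in 30 s on 4 processors:
print(round(speedup(100, 30), 3))                # 3.333
print(round(efficiency_percent(100, 30, 4), 1))  # 83.3
```

A speedup of 3.33 on 4 processors means each processor is doing useful work about 83% of the time.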
Cost and Utilization
● Cost refers to the total cost of ownership and operation of the computing system, including
hardware, software, maintenance, and energy costs.
● The cost of a high performance computing system can vary widely depending on factors such as
the size and complexity of the system, the type of hardware used, and the software applications
running on it.
● Utilization refers to the extent to which the computing resources are being used to perform
useful work. High utilization is desirable as it indicates that the system is being used efficiently
and effectively.
● However, achieving high utilization can be challenging due to factors such as workload
variability, load balancing, and scheduling overhead.
Execution Rate and Redundancy
● Execution rate refers to the speed at which computations can be performed in a parallel
computing system.
● This is often measured in terms of floating point operations per second (FLOPS) or
instructions per second (IPS).
● Higher execution rates can be achieved through the use of more powerful hardware, such as
multi-core processors or specialized accelerators like GPUs or FPGAs.
● Redundancy, on the other hand, refers to the duplication of computing resources or data to
improve the reliability and availability of the system.
● Redundancy can be implemented at various levels, including hardware redundancy (such as
redundant power supplies or disk drives) and software redundancy (such as redundant data storage
or backup processes).
The Effect of Granularity on Performance
● Granularity refers to the size of the computational tasks or data elements that are processed in a
parallel system.
● The choice of granularity can have a significant impact on system performance, as it affects the balance between computation and communication.
● When tasks are too fine-grained, there may be too much communication overhead between processors, which can slow down the overall performance of the system.
● On the other hand, when tasks are too coarse-grained, there may be load imbalance between processors, leaving some of them idle.
● Scalability refers to the ability of a system to maintain or improve performance as the size of the problem or the number of processors increases.
● Achieving good scalability is essential for achieving high performance in large-scale parallel
systems.
● Scalability can be classified into two categories: strong scalability and weak scalability.
● Strong scalability refers to the ability of a system to maintain a fixed problem size and to reduce the execution time proportionally as the number of processors increases.
● Weak scalability refers to the ability of a system to maintain a fixed workload per processor and to keep the execution time constant as the number of processors increases.
● In other words, if the number of processors is doubled and the size of the problem is also doubled, the execution time should remain roughly the same.
● Achieving weak scalability is generally easier than strong scalability, as it only requires
maintaining the same workload per processor as the number of processors increases.
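The two notions of scalability can be checked from measured runtimes (a sketch; the timing values are assumed examples):

```python
def strong_scaling_efficiency(t1, tn, n):
    """Fixed problem size: ideal runtime on N processors is T1 / N."""
    return t1 / (n * tn)

def weak_scaling_efficiency(t1, tn):
    """Fixed workload per processor: ideal runtime stays equal to T1."""
    return t1 / tn

# Strong scaling: 100 s on 1 processor, 30 s on 4 processors.
print(round(strong_scaling_efficiency(100, 30, 4), 2))  # 0.83
# Weak scaling: per-processor workload kept fixed, runtime grew
# from 100 s to 110 s when processors (and problem size) doubled.
print(round(weak_scaling_efficiency(100, 110), 2))  # 0.91
```

An efficiency of 1.0 in either metric would mean perfect scaling; real programs fall below it because of the overheads discussed earlier.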
Minimum Execution Time and Minimum Cost
● Minimum Execution Time: The minimum execution time (also known as the latency or response time) is the shortest time in which a given task or program can be executed.
● In HPC, reducing the execution time is important for achieving higher performance and throughput.
● This can be achieved through various techniques such as parallel processing, optimized algorithms, and efficient use of resources.
● Parallel processing involves breaking down a large task into smaller sub-tasks that can be executed simultaneously on multiple processors.
● Similarly, optimized algorithms can help in reducing the number of computations required for a task.
● Efficient use of resources such as memory, I/O, and network can also help in reducing the execution time.
● We can determine the minimum parallel runtime TPmin for a given problem size W by differentiating the expression for the parallel runtime TP with respect to the number of processors p and setting the derivative to zero.
● Minimum Cost: The minimum cost is the lowest cost at which a task can be executed with the available resources. The cost of a system includes various components such as hardware, software, power, and maintenance.
● In general, reducing the cost involves minimizing the hardware and software expenses while maintaining the required level of performance.
● Additionally, optimizing the use of resources and reducing idle time can also help in reducing the overall cost.
● In other words, the optimal execution time represents the sweet spot between the minimum execution time and the minimum cost.
● Achieving the optimal execution time involves finding the right balance between the resources used and the performance achieved.
● This can be achieved by optimizing the use of resources, such as processors, memory, I/O, and network, while ensuring that the performance requirements are met.
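The differentiation step can be illustrated with an assumed runtime model TP = W/p + t_o·p, where t_o is a per-processor overhead constant (the model and the constants are assumptions, not from the slides):

```python
import math

def tp(W, p, t_o):
    """Assumed parallel runtime model: computation W/p plus a
    communication term that grows linearly with p."""
    return W / p + t_o * p

# d(TP)/dp = -W/p^2 + t_o = 0  =>  p_opt = sqrt(W / t_o)
W, t_o = 10_000.0, 0.1
p_opt = math.sqrt(W / t_o)
print(round(p_opt, 2))              # 316.23
print(round(tp(W, p_opt, t_o), 2))  # 63.25  (TPmin for this model)
```

In practice p must be an integer, so one would evaluate TP at the two integers nearest p_opt and take the smaller.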
Optimal Execution Time
● The difference between the optimal execution time and the minimum execution time is that the minimum execution time represents the absolute fastest time in which a task or program can be executed, while the optimal execution time represents the most efficient use of resources while still meeting the performance requirements.
● Achieving the minimum execution time may require using more resources than necessary, while achieving the optimal execution time requires using the minimum amount of resources that still meets those requirements.
● Parallel programs are designed to execute tasks simultaneously across multiple processors or nodes.
● However, the performance gain of parallel programs is not always proportional to the number of processors used.
● As the number of processors increases, the communication overhead and contention for shared resources can limit the scalability and performance of the program.
Asymptotic Analysis of Parallel Programs
● Asymptotic analysis can help in identifying these limitations by analyzing the growth rate of
the computation and communication overhead as the input size and number of processors
increase.
● Consider the problem of sorting a list of n numbers. The fastest serial programs for this problem
run in time Θ(n log n). Consider four parallel algorithms, A1, A2, A3, and A4 as follows:
● The table below compares the four algorithms, showing for each the number of processing elements p, the parallel runtime TP, the speedup S, the efficiency E, and the pTP product (cost); the values follow the standard treatment of this example:

          A1           A2          A3            A4
p         n^2          log n       n             √n
TP        1            n           √n            √n log n
S         n log n      log n       √n log n      √n
E         (log n)/n    1           (log n)/√n    1
pTP       n^2          n log n     n√n           n log n
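Given an assumed (p, TP) pair for one algorithm, the speedup, efficiency, and pTP cost can be computed mechanically against the Θ(n log n) serial runtime (the sample pair p = n, TP = √n is an assumption chosen for illustration):

```python
import math

def metrics(n, p, t_p):
    """Speedup, efficiency and cost (pTP) for a parallel sort,
    relative to the Theta(n log n) serial runtime."""
    t_serial = n * math.log2(n)
    speedup = t_serial / t_p
    efficiency = speedup / p
    cost = p * t_p
    return speedup, efficiency, cost

# Assumed example: n = 1024 keys, p = n processors, TP = sqrt(n).
n = 1024
s, e, c = metrics(n, p=n, t_p=math.sqrt(n))
print(round(s, 1), round(e, 4), round(c, 1))  # 320.0 0.3125 32768.0
```

Note how a large speedup (320x) can coexist with a low efficiency (about 31%): asymptotic analysis compares exactly these trade-offs.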
Matrix Computation
● Matrix computations involve the manipulation of large matrices and the efficient use of
parallel computing to perform matrix operations.
● Matrix computations are often used in scientific simulations and data analysis.
● In matrix computation, parallel computing allows the computation to be split across multiple
processors or computer systems, enabling faster computation of large matrices.
● To optimize matrix computation in HPC, specialized libraries such as BLAS (Basic Linear
Algebra Subprograms) and LAPACK (Linear Algebra Package) are used.
Matrix Computation
1. Matrix-Vector Multiplication
a. Row Wise 1-D Partitioning
b. 2-D Partitioning
2. Matrix-Matrix Multiplication
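Row-wise 1-D partitioning can be sketched as a serial simulation in which each "processor" owns a contiguous block of rows (the function and the block assignment are illustrative, not an MPI implementation):

```python
def matvec_rowwise(A, x, n_procs):
    """Simulate matrix-vector multiplication y = A x with row-wise
    1-D partitioning: processor `proc` owns a contiguous block of
    rows and computes its slice of y independently."""
    n = len(A)
    y = [0] * n
    rows_per_proc = (n + n_procs - 1) // n_procs
    for proc in range(n_procs):
        lo = proc * rows_per_proc
        hi = min(lo + rows_per_proc, n)
        for i in range(lo, hi):  # this block would run on processor `proc`
            y[i] = sum(A[i][j] * x[j] for j in range(len(x)))
    return y

A = [[1, 2], [3, 4]]
x = [1, 1]
print(matvec_rowwise(A, x, n_procs=2))  # [3, 7]
```

In a real distributed implementation, the vector x would first be broadcast to all processors and the slices of y gathered at the end; those two communication steps are the source of the overhead analysed earlier.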
Matrix-Vector Multiplication