Unit 4 - Analytical Modeling of Parallel Programs
College of Engineering
Subject: High Performance Computing
Unit 4
Analytical Modeling of
Parallel Programs
By- Prof. Gunjan Deshmukh
SNJB’s KBJ CoE | Civil | Computer | E&TC | Mechanical | AIDS | MBA | Visit Us @: www.snjb.org
SNJB’s Late Sau. K. B. J. College of Engineering
Syllabus
Sources of Overhead in Parallel Programs
● Interprocess communication: Parallel programming involves executing multiple tasks simultaneously on multiple processors or nodes. These nodes need to communicate with each other to exchange data and coordinate their actions. This communication can be time-consuming and can introduce overhead, particularly when nodes need to synchronize their actions to ensure that they work correctly.
● Synchronization overhead: Synchronization can introduce overhead because it requires the processors to wait for each other before proceeding.
● Load imbalance: Different processors may have different workloads, which can lead to load imbalance. This can introduce overhead because some processors may be idle while others are busy, which can result in wasted computing capacity.
● I/O overhead: In parallel programs, input and output (I/O) operations can be a significant source of overhead, especially when many processors share the same I/O channels.
● Memory contention: Multiple processors may access the same memory location simultaneously. This can introduce overhead because the processors need to coordinate their access to ensure that data is not corrupted or lost.
● Startup and shutdown overhead: In some parallel programs, significant overhead can be introduced during program startup and shutdown, particularly when multiple processors or nodes have to be initialized or terminated.
Performance Measures and Analysis: Amdahl's Laws
● Amdahl’s law was presented by Gene Amdahl at the AFIPS Spring Joint Computer Conference in 1967.
● It is a formula that gives the theoretical speedup in latency of the execution of a task at a fixed workload that can be expected of a system whose resources are improved.
● In other words, it is a formula used to find the maximum improvement possible by enhancing just one part of a system. Speedup can be defined as the ratio of the execution time for the entire task without using the enhancement to the execution time for the entire task using the enhancement.
● If Pe is the performance for the entire task using the enhancement when possible, Pw is the performance for the entire task without using the enhancement, Ew is the execution time for the entire task without using the enhancement, and Ee is the execution time for the entire task using the enhancement, then Speedup = Pe / Pw = Ew / Ee.
● The first law, also known as the strong scaling law, states that the maximum speedup that can be achieved by parallelizing a computation is limited by the portion of the computation that cannot be parallelized. Mathematically, the speedup is given by:
● speedup = (serial fraction + parallel fraction) / (serial fraction + parallel fraction/N)
● where N is the number of processors used to parallelize the computation, and the serial fraction is the portion of the computation that cannot be parallelized. This law implies that the speedup is limited by the serial fraction, which is a fixed value and cannot be reduced through parallelization.
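The first-law formula can be sketched in a few lines of Python (an illustrative sketch; the function name and the normalisation serial fraction + parallel fraction = 1 are assumptions, not from the slides):

```python
def amdahl_speedup(serial_fraction, n_processors):
    """Speedup predicted by Amdahl's first law.

    Assumes the fractions are normalised so that
    serial_fraction + parallel_fraction = 1.
    """
    parallel_fraction = 1.0 - serial_fraction
    return (serial_fraction + parallel_fraction) / (
        serial_fraction + parallel_fraction / n_processors
    )

# With a 10% serial fraction the speedup saturates near 1/0.1 = 10,
# no matter how many processors are added.
print(round(amdahl_speedup(0.1, 4), 3))     # 3.077
print(round(amdahl_speedup(0.1, 1000), 3))  # 9.911
```

The second print illustrates the law's main message: the serial fraction caps the speedup regardless of N.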
Performance Measures and Analysis: Amdahl's Laws
● The second law, also known as the weak scaling law, states that the maximum speedup that can be achieved by increasing the number of processors is limited by the communication overhead between them. Mathematically:
● speedup = (serial fraction + parallel fraction) / (serial fraction + parallel fraction/N + overhead)
● where overhead is the time spent on communication between the processors. This law implies that
the speedup is limited by the communication overhead, which increases as the number of processors
increases.
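The effect of the overhead term can be illustrated in the same way (a sketch; the linear overhead model 0.001 × N is an assumption chosen only to show the turning point):

```python
def speedup_with_overhead(serial_fraction, n_processors, overhead):
    """Speedup when a communication-overhead term is added to the
    denominator, as in the second formula above."""
    parallel_fraction = 1.0 - serial_fraction
    return (serial_fraction + parallel_fraction) / (
        serial_fraction + parallel_fraction / n_processors + overhead
    )

# If overhead grows with the processor count (assumed linear here),
# adding processors eventually makes the program slower again.
best_n = max(range(1, 257),
             key=lambda n: speedup_with_overhead(0.05, n, 0.001 * n))
print(best_n)  # 31
```

Past that processor count, the growing overhead term outweighs the shrinking parallel term.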
Performance Measures and Analysis: Amdahl's Laws
● The overall speedup is the ratio of the execution time without the enhancement to the execution time with the enhancement:
● Overall Speedup = 1 / ((1 − Fraction enhanced) + Fraction enhanced / Speedup enhanced)
Performance Measures and Analysis: Amdahl's Laws
● Amdahl’s law uses two factors to find speedup from some enhancement:
● Fraction enhanced – The fraction of the computation time in the original computer that can be
converted to take advantage of the enhancement.
● For example, if 10 seconds of the execution time of a program that takes 40 seconds in total can use an enhancement, the fraction is 10/40. This obtained value is the Fraction enhanced. Fraction enhanced is always less than 1.
● Speedup enhanced – The improvement gained by the enhanced execution mode; that is, how much
faster the task would run if the enhanced mode were used for the entire program.
● For example, if the enhanced mode takes, say, 3 seconds for a portion of the program that takes 6 seconds in the original mode, the improvement is 6/3. This value is the Speedup enhanced. Speedup enhanced is always greater than 1.
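Combining the two example values above (Fraction enhanced = 10/40 = 0.25, Speedup enhanced = 6/3 = 2) in a small Python check (the function name is an assumption for illustration):

```python
def overall_speedup(fraction_enhanced, speedup_enhanced):
    """Amdahl's overall speedup computed from the two factors
    defined above."""
    return 1.0 / ((1.0 - fraction_enhanced)
                  + fraction_enhanced / speedup_enhanced)

# 10 s of a 40 s program runs 2x faster: 1 / (0.75 + 0.125) = 8/7
print(round(overall_speedup(10 / 40, 6 / 3), 4))  # 1.1429
```

Doubling the speed of a quarter of the program yields only about a 14% overall improvement.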
Performance Measures and Analysis: Gustafson's Laws
● Amdahl’s law is suitable for applications where the response time for a fixed problem size is critical. On the other hand, there are many applications in which the problem size is scaled up as more resources become available, for example to obtain higher accuracy in the resultant output; Gustafson’s law addresses this setting.
● The basic idea behind Gustafson’s laws is that the speedup achieved by parallelizing a computation can increase as the problem size increases, unlike in Amdahl’s laws, where the speedup is bounded by a fixed serial fraction.
● The laws were proposed by computer scientist John Gustafson in 1988 and have since become a standard tool for reasoning about scaled problem sizes.
● The first law states that the amount of work that can be parallelized increases with the size of the problem. Mathematically, the scaled speedup is given by:
● Scaled Speedup = S + P × N
● where S is the serial fraction of the computation, P is the parallelizable fraction (with S + P = 1), and N is the number of processors.
● This law implies that the parallel fraction can increase as the problem size increases, which allows the achievable speedup to grow with the number of processors.
● The second law states that the total execution time decreases as the problem size increases, assuming that the parallel fraction is increased proportionally to the problem size. Mathematically, the execution time can be written as:
● Execution Time = S + T(N)
● where T(N) is the time required to complete the parallelizable fraction on N processors.
● This law implies that the execution time per unit of work can decrease as the problem size increases, which makes larger problems practical on larger machines.
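The first law can be sketched as follows (assuming, as above, that S + P = 1; the function name is an assumption):

```python
def gustafson_scaled_speedup(serial_fraction, n_processors):
    """Scaled speedup S + P * N from Gustafson's first law,
    assuming serial_fraction + parallel_fraction = 1."""
    parallel_fraction = 1.0 - serial_fraction
    return serial_fraction + parallel_fraction * n_processors

# Unlike Amdahl's bound, the scaled speedup keeps growing with N:
print(round(gustafson_scaled_speedup(0.1, 10), 1))   # 9.1
print(round(gustafson_scaled_speedup(0.1, 100), 1))  # 90.1
```

Contrast this with the Amdahl formula, where the same 10% serial fraction caps the speedup at 10.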
Speedup and Efficiency
● Speedup (S) is defined as the ratio of the time taken to run the sequential algorithm to the time taken to run the parallel algorithm:
● S = T(1) / T(N)
● where T(1) is the time taken to run the algorithm on a single processor and T(N) is the time taken to run it on N processors.
● Efficiency is defined as the ratio of the speedup factor to the number of processors used. Mathematically, efficiency (E) is given by:
● E = (S / N) x 100%
● In other words, efficiency measures how well the parallel algorithm scales with the number of processors used. Ideally, an efficient parallel algorithm should achieve a speedup factor that is close to linear in the number of processors used, resulting in a high efficiency.
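Both metrics can be computed directly from measured runtimes (a minimal sketch; the example timings are assumed):

```python
def speedup(t_serial, t_parallel):
    """S = T(1) / T(N)."""
    return t_serial / t_parallel

def efficiency_percent(t_serial, t_parallel, n_processors):
    """E = (S / N) x 100%."""
    return speedup(t_serial, t_parallel) / n_processors * 100.0

# A 100 s job finishing in 30 s on 4 processors:
print(round(speedup(100, 30), 3))                # 3.333
print(round(efficiency_percent(100, 30, 4), 1))  # 83.3
```

A speedup of 3.33 on 4 processors means each processor is doing useful work about 83% of the time.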
Cost and Utilization
● Cost refers to the total cost of ownership and operation of the computing system, including
hardware, software, maintenance, and energy costs.
● The cost of a high performance computing system can vary widely depending on factors such as
the size and complexity of the system, the type of hardware used, and the software applications
running on it.
● Utilization refers to the extent to which the computing resources are being used to perform
useful work. High utilization is desirable as it indicates that the system is being used efficiently
and effectively.
● However, achieving high utilization can be challenging due to factors such as workload
variability, load balancing, and scheduling overhead.
Execution Rate and Redundancy
● Execution rate refers to the speed at which computations can be performed in a parallel
computing system.
● This is often measured in terms of floating point operations per second (FLOPS) or
instructions per second (IPS).
● Higher execution rates can be achieved through the use of more powerful hardware, such as
multi-core processors or specialized accelerators like GPUs or FPGAs.
● Redundancy, on the other hand, refers to the duplication of computing resources or data to
improve the reliability and availability of the system.
● Redundancy can be implemented at various levels, including hardware redundancy (such as
redundant power supplies or disk drives) and software redundancy (such as redundant data storage
or backup processes).
The Effect of Granularity on Performance
● Granularity refers to the size of the computational tasks or data elements that are processed in a
parallel system.
● The choice of granularity can have a significant impact on system performance, as it affects the balance between computation and communication.
● When tasks are too fine-grained, there may be too much communication overhead between processors, which can slow down the overall performance of the system.
● On the other hand, when tasks are too coarse-grained, there may be load imbalance between processors, leaving some of them idle.
● Scalability refers to the ability of a system to maintain or improve performance as the size of the problem or the number of processors increases.
● Achieving good scalability is essential for achieving high performance in large-scale parallel
systems.
● Scalability can be classified into two categories: strong scalability and weak scalability.
● Strong scalability refers to the ability of a system to maintain a fixed problem size and to reduce the execution time proportionally as the number of processors increases.
● Weak scalability refers to the ability of a system to maintain a fixed workload per processor and to keep the execution time constant as the number of processors increases.
● In other words, if the number of processors is doubled and the size of the problem is also doubled, the execution time should remain roughly the same.
● Achieving weak scalability is generally easier than strong scalability, as it only requires
maintaining the same workload per processor as the number of processors increases.
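The two notions of scalability can be checked from measured runtimes (a sketch; the timing values are assumed examples):

```python
def strong_scaling_efficiency(t1, tn, n):
    """Fixed problem size: ideal runtime on N processors is T1 / N."""
    return t1 / (n * tn)

def weak_scaling_efficiency(t1, tn):
    """Fixed workload per processor: ideal runtime stays equal to T1."""
    return t1 / tn

# Strong scaling: 100 s on 1 processor, 30 s on 4 processors.
print(round(strong_scaling_efficiency(100, 30, 4), 2))  # 0.83
# Weak scaling: per-processor workload kept fixed, runtime grew
# from 100 s to 110 s when processors (and problem size) doubled.
print(round(weak_scaling_efficiency(100, 110), 2))  # 0.91
```

An efficiency of 1.0 in either metric would mean perfect scaling; real programs fall below it because of the overheads discussed earlier.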
Minimum Execution Time and Minimum Cost
● Minimum Execution Time: The minimum execution time (also known as the latency or response time) is the shortest time in which a given task or program can be executed.
● In HPC, reducing the execution time is important for achieving higher performance and throughput.
● This can be achieved through various techniques such as parallel processing, optimized algorithms, and efficient use of resources.
● Parallel processing involves breaking down a large task into smaller sub-tasks that can be executed simultaneously on multiple processors.
● Similarly, optimized algorithms can help in reducing the number of computations required for a task.
● Efficient use of resources such as memory, I/O, and network can also help in reducing the execution time.
● We can determine the minimum parallel runtime TPmin for a given problem size W by differentiating the expression for the parallel runtime TP with respect to the number of processors p and setting the derivative to zero.
● Minimum Cost: The minimum cost is the lowest cost at which a task can be executed with the available resources. The cost of a system includes various components such as hardware, software, power, and maintenance.
● In general, reducing the cost involves minimizing the hardware and software expenses while maintaining the required level of performance.
● Additionally, optimizing the use of resources and reducing idle time can also help in reducing the overall cost.
● In other words, the optimal execution time represents the sweet spot between the minimum execution time and the minimum cost.
● Achieving the optimal execution time involves finding the right balance between the resources used and the performance achieved.
● This can be achieved by optimizing the use of resources, such as processors, memory, I/O, and network, while ensuring that the performance requirements are met.
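The differentiation step can be illustrated with an assumed runtime model TP = W/p + t_o·p, where t_o is a per-processor overhead constant (the model and the constants are assumptions, not from the slides):

```python
import math

def tp(W, p, t_o):
    """Assumed parallel runtime model: computation W/p plus a
    communication term that grows linearly with p."""
    return W / p + t_o * p

# d(TP)/dp = -W/p^2 + t_o = 0  =>  p_opt = sqrt(W / t_o)
W, t_o = 10_000.0, 0.1
p_opt = math.sqrt(W / t_o)
print(round(p_opt, 2))              # 316.23
print(round(tp(W, p_opt, t_o), 2))  # 63.25  (TPmin for this model)
```

In practice p must be an integer, so one would evaluate TP at the two integers nearest p_opt and take the smaller.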
Optimal Execution Time
● The difference between the optimal execution time and the minimum execution time is that the minimum execution time represents the absolute fastest time in which a task or program can be executed, while the optimal execution time represents the most efficient use of resources while still meeting the performance requirements.
● Achieving the minimum execution time may require using more resources than necessary, while achieving the optimal execution time requires using the minimum amount of resources that still meets those requirements.
● Parallel programs are designed to execute tasks simultaneously across multiple processors or nodes.
● However, the performance gain of parallel programs is not always proportional to the number of processors used.
● As the number of processors increases, the communication overhead and contention for shared resources can limit the scalability and performance of the program.
Asymptotic Analysis of Parallel Programs
● Asymptotic analysis can help in identifying these limitations by analyzing the growth rate of
the computation and communication overhead as the input size and number of processors
increase.
● Consider the problem of sorting a list of n numbers. The fastest serial programs for this problem
run in time Θ(n log n). Consider four parallel algorithms, A1, A2, A3, and A4 as follows:
● The table below compares the four algorithms, showing for each the number of processing elements p, the parallel runtime TP, the speedup S, the efficiency E, and the pTP product (cost); the values follow the standard treatment of this example:

          A1           A2          A3            A4
p         n^2          log n       n             √n
TP        1            n           √n            √n log n
S         n log n      log n       √n log n      √n
E         (log n)/n    1           (log n)/√n    1
pTP       n^2          n log n     n√n           n log n
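Given an assumed (p, TP) pair for one algorithm, the speedup, efficiency, and pTP cost can be computed mechanically against the Θ(n log n) serial runtime (the sample pair p = n, TP = √n is an assumption chosen for illustration):

```python
import math

def metrics(n, p, t_p):
    """Speedup, efficiency and cost (pTP) for a parallel sort,
    relative to the Theta(n log n) serial runtime."""
    t_serial = n * math.log2(n)
    speedup = t_serial / t_p
    efficiency = speedup / p
    cost = p * t_p
    return speedup, efficiency, cost

# Assumed example: n = 1024 keys, p = n processors, TP = sqrt(n).
n = 1024
s, e, c = metrics(n, p=n, t_p=math.sqrt(n))
print(round(s, 1), round(e, 4), round(c, 1))  # 320.0 0.3125 32768.0
```

Note how a large speedup (320x) can coexist with a low efficiency (about 31%): asymptotic analysis compares exactly these trade-offs.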
Matrix Computation
● Matrix computations involve the manipulation of large matrices and the efficient use of
parallel computing to perform matrix operations.
● Matrix computations are often used in scientific simulations and data analysis.
● In matrix computation, parallel computing allows the computation to be split across multiple
processors or computer systems, enabling faster computation of large matrices.
● To optimize matrix computation in HPC, specialized libraries such as BLAS (Basic Linear
Algebra Subprograms) and LAPACK (Linear Algebra Package) are used.
Matrix Computation
1. Matrix-Vector Multiplication
a. Row Wise 1-D Partitioning
b. 2-D Partitioning
2. Matrix-Matrix Multiplication
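Row-wise 1-D partitioning can be sketched as a serial simulation in which each "processor" owns a contiguous block of rows (the function and the block assignment are illustrative, not an MPI implementation):

```python
def matvec_rowwise(A, x, n_procs):
    """Simulate matrix-vector multiplication y = A x with row-wise
    1-D partitioning: processor `proc` owns a contiguous block of
    rows and computes its slice of y independently."""
    n = len(A)
    y = [0] * n
    rows_per_proc = (n + n_procs - 1) // n_procs
    for proc in range(n_procs):
        lo = proc * rows_per_proc
        hi = min(lo + rows_per_proc, n)
        for i in range(lo, hi):  # this block would run on processor `proc`
            y[i] = sum(A[i][j] * x[j] for j in range(len(x)))
    return y

A = [[1, 2], [3, 4]]
x = [1, 1]
print(matvec_rowwise(A, x, n_procs=2))  # [3, 7]
```

In a real distributed implementation, the vector x would first be broadcast to all processors and the slices of y gathered at the end; those two communication steps are the source of the overhead analysed earlier.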
Matrix-Vector Multiplication