
SNJB's Late Sau. K. B. J. College of Engineering

Subject: High Performance Computing

Unit 4
Analytical Modeling of Parallel Programs

By: Prof. Gunjan Deshmukh

Syllabus

1. Sources of Overhead in Parallel Programs
2. Performance Measures and Analysis: Amdahl's and Gustafson's Laws
3. Speedup Factor and Efficiency
4. Cost and Utilization
5. Execution Rate and Redundancy
6. The Effect of Granularity on Performance
7. Scalability of Parallel Systems
8. Minimum Execution Time and Minimum Cost
9. Optimal Execution Time
10. Asymptotic Analysis of Parallel Programs
11. Matrix Computation: Matrix-Vector Multiplication and Matrix-Matrix Multiplication

Course Objectives & Outcomes

CO4: Analyze and measure performance of modern parallel computing systems.

Sources of Overhead in Parallel Programs
● Parallel programming involves executing multiple tasks simultaneously on multiple processors, which can lead to significant improvements in performance compared to sequential programming. However, parallel programs also introduce additional overhead that can affect their performance.

● Communication overhead: In a parallel program, the different processors or computing nodes need to communicate with each other to exchange data and coordinate their actions. This communication can be time-consuming and can introduce overhead, particularly when large amounts of data need to be transferred.


● Synchronization overhead: In some parallel programs, different processors or computing nodes need to synchronize their actions to ensure that they work correctly. Synchronization can introduce overhead because it requires the processors to wait for each other before proceeding with their tasks.

● Load balancing overhead: In a parallel program, different processors or computing nodes may have different workloads, which can lead to load imbalance. This can introduce overhead because some processors may be idle while others are busy, resulting in wasted computational resources.


● Parallelism overhead: Although parallel programming can improve performance by allowing multiple tasks to be executed simultaneously, it can also introduce overhead because of the additional processing required to coordinate the parallel execution.

● I/O overhead: In parallel programs, input and output (I/O) operations can be a significant source of overhead, particularly when multiple processors or computing nodes are accessing the same file or data storage device.


● Memory access overhead: In a parallel program, multiple processors or computing nodes may access the same memory location simultaneously. This can introduce overhead because the processors need to coordinate their access to ensure that data is not corrupted or lost.

● Startup and shutdown overhead: In some parallel programs, significant overhead can be introduced during program startup and shutdown, particularly when multiple processors or computing nodes need to be initialized or shut down in a coordinated manner.


Performance Measures and Analysis: Amdahl's Law
● It is named after computer scientist Gene Amdahl, who presented it at the AFIPS Spring Joint Computer Conference in 1967. It is also known as Amdahl's argument.

● It is a formula that gives the theoretical speedup in latency of the execution of a task at a fixed workload that can be expected of a system whose resources are improved.

● In other words, it is a formula used to find the maximum improvement possible by improving just a particular part of a system.


● Speedup: Speedup is defined as the ratio of the performance for the entire task using the enhancement to the performance for the entire task without the enhancement. Equivalently, it is the ratio of the execution time for the entire task without the enhancement to the execution time for the entire task using the enhancement.

● If Pe is the performance for the entire task using the enhancement when possible, Pw is the performance without the enhancement, Ew is the execution time without the enhancement, and Ee is the execution time using the enhancement when possible, then: Speedup = Pe / Pw or Speedup = Ew / Ee


● The first law, also known as the strong scaling law, states that the maximum speedup that can be achieved by parallelizing a computation is limited by the portion of the computation that cannot be parallelized. Mathematically, the speedup is given by:

speedup = 1 / (serial fraction + parallel fraction / N)

● where N is the number of processors used to parallelize the computation, and the serial fraction is the portion of the computation that cannot be parallelized. This law implies that the speedup is limited by the serial fraction, which is fixed and cannot be reduced through parallelization (see the sketch below).
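As a minimal sketch of this formula, the code below evaluates the strong-scaling speedup for an assumed serial fraction of 0.1; the fraction and processor counts are arbitrary example values, not taken from the slides.

```python
def amdahl_speedup(serial_fraction, n_procs):
    """Amdahl's law: speedup = 1 / (s + (1 - s)/N)."""
    parallel_fraction = 1.0 - serial_fraction
    return 1.0 / (serial_fraction + parallel_fraction / n_procs)

# Example: 10% of the work is inherently serial.
for n in (1, 2, 4, 8, 16, 1024):
    print(n, round(amdahl_speedup(0.1, n), 2))
# The speedup approaches 1/0.1 = 10 no matter how many processors are added.
```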
● The second law, also known as the weak scaling law, states that the maximum speedup that can be achieved by increasing the number of processors used to parallelize a computation is limited by the communication overhead between the processors. Mathematically, the speedup is given by:

speedup = (serial fraction + parallel fraction) / (serial fraction + parallel fraction / N + overhead)

● where overhead is the time spent on communication between the processors. This law implies that the speedup is limited by the communication overhead, which increases as the number of processors increases.
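The effect of a growing overhead term can be illustrated with a small sketch; the logarithmic overhead model used here (0.01 · log2 N) is purely an assumption for illustration, not something specified on the slides.

```python
import math

def speedup_with_overhead(serial_fraction, n_procs, overhead):
    """Speedup = (s + p) / (s + p/N + overhead), with s + p = 1."""
    p = 1.0 - serial_fraction
    return 1.0 / (serial_fraction + p / n_procs + overhead)

# Assume the communication overhead grows as 0.01 * log2(N).
for n in (2, 8, 32, 128, 512):
    oh = 0.01 * math.log2(n)
    print(n, round(speedup_with_overhead(0.05, n, oh), 2))
# Beyond some N the overhead term dominates: the speedup curve flattens, then falls.
```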
● The overall speedup is the ratio of the execution times:

Overall speedup = Old execution time / New execution time = 1 / ((1 − Fraction enhanced) + Fraction enhanced / Speedup enhanced)
● Amdahl's law uses two factors to find the speedup from some enhancement:
● Fraction enhanced – The fraction of the computation time in the original computer that can be converted to take advantage of the enhancement.
● For example, if 10 seconds of the execution time of a program that takes 40 seconds in total can use an enhancement, the fraction is 10/40. This value is the Fraction enhanced, and it is always less than 1.
● Speedup enhanced – The improvement gained by the enhanced execution mode; that is, how much faster the task would run if the enhanced mode were used for the entire program.
● For example, if the enhanced mode takes 3 seconds for a portion of the program that takes 6 seconds in the original mode, the improvement is 6/3. This value is the Speedup enhanced, and it is always greater than 1. (A worked calculation with these numbers follows below.)
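Putting the two example figures from the slide together (Fraction enhanced = 10/40, Speedup enhanced = 6/3) gives the short calculation below; the helper function is just a restatement of the overall-speedup formula above.

```python
def overall_speedup(fraction_enhanced, speedup_enhanced):
    """Amdahl's law in the Fraction/Speedup-enhanced form."""
    return 1.0 / ((1.0 - fraction_enhanced) + fraction_enhanced / speedup_enhanced)

# Fraction enhanced = 10 s out of 40 s = 0.25; Speedup enhanced = 6 s / 3 s = 2.
print(overall_speedup(10 / 40, 6 / 3))  # 1 / (0.75 + 0.25/2) = 1 / 0.875 ≈ 1.14
```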
Performance Measures and Analysis: Gustafson's Law
● Amdahl's law is suitable for applications where response time is critical. On the other hand, there are many applications which require that the accuracy of the resulting output be high.

● The basic idea behind Gustafson's law is that the speedup achieved by parallelizing a computation can increase as the problem size increases, unlike in Amdahl's law, where the serial fraction remains a fixed value.

● The law was proposed by computer scientist John Gustafson in 1988 and has since become a popular alternative to Amdahl's law.


● The first law states that the amount of work that can be parallelized increases with the size of the problem. If S is the serial fraction of the work and P = 1 − S is the parallelizable fraction, the scaled speedup on N processors is:

scaled speedup = S + P × N = 1 + P(N − 1)

● where N is the number of processors used to parallelize the computation.

● This law implies that the achievable speedup can keep growing as the problem size (and with it the parallel portion of the work) increases, which allows for greater speedup (see the sketch below).
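A minimal sketch of the scaled-speedup formula follows; the 10% serial fraction is an assumed example value.

```python
def gustafson_speedup(serial_fraction, n_procs):
    """Gustafson's law: scaled speedup = S + P*N, with P = 1 - S."""
    p = 1.0 - serial_fraction
    return serial_fraction + p * n_procs

# With a 10% serial fraction the scaled speedup keeps growing with N,
# in contrast to the fixed Amdahl limit of 1/0.1 = 10.
for n in (2, 8, 32, 128):
    print(n, gustafson_speedup(0.1, n))
```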


● The second law states that the total execution time decreases as the problem size increases, assuming that the parallel fraction is increased proportionally to the problem size. Mathematically, the execution time is given by:

execution time = S + P · N · T(N)

● where T(N) is the time required to complete the parallelizable fraction on N processors.

● This law implies that the execution time can decrease as the problem size increases, which allows for greater efficiency.


Speedup Factor and Efficiency
● Speedup: The speedup factor is a measure of how much faster a parallel algorithm runs compared to the same algorithm executed sequentially on a single processor. It is calculated as the ratio of the time taken to run the sequential algorithm to the time taken to run the parallel algorithm on N processors. Mathematically, the speedup factor (S) is given by:

● S = T(1) / T(N)

● where T(1) is the time taken to run the algorithm on a single processor and T(N) is the time taken to run the algorithm on N processors.


● Efficiency: Efficiency is a measure of how effectively a parallel algorithm uses the available resources, and is typically expressed as a percentage. It is calculated as the ratio of the speedup factor to the number of processors used. Mathematically, efficiency (E) is given by:

● E = (S / N) × 100%

● In other words, efficiency measures how well the parallel algorithm scales with the number of processors used. Ideally, an efficient parallel algorithm should achieve a speedup factor that is close to linear in the number of processors used, resulting in a high efficiency (see the sketch below).
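Both metrics can be computed directly from measured runtimes; the timings in this sketch are made-up example values, not measurements from any real system.

```python
def speedup(t1, tn):
    """S = T(1) / T(N)."""
    return t1 / tn

def efficiency(t1, tn, n_procs):
    """E = S / N, expressed as a percentage."""
    return speedup(t1, tn) / n_procs * 100.0

# Example: a job that takes 120 s on 1 processor and 20 s on 8 processors.
print(speedup(120.0, 20.0))        # 6.0
print(efficiency(120.0, 20.0, 8))  # 75.0 (percent)
```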


Cost and Utilization
● Cost refers to the total cost of ownership and operation of the computing system, including
hardware, software, maintenance, and energy costs.
● The cost of a high performance computing system can vary widely depending on factors such as
the size and complexity of the system, the type of hardware used, and the software applications
running on it.
● Utilization refers to the extent to which the computing resources are being used to perform
useful work. High utilization is desirable as it indicates that the system is being used efficiently
and effectively.
● However, achieving high utilization can be challenging due to factors such as workload
variability, load balancing, and scheduling overhead.
Execution Rate and Redundancy
● Execution rate refers to the speed at which computations can be performed in a parallel
computing system.
● This is often measured in terms of floating point operations per second (FLOPS) or
instructions per second (IPS).
● Higher execution rates can be achieved through the use of more powerful hardware, such as
multi-core processors or specialized accelerators like GPUs or FPGAs.
● Redundancy, on the other hand, refers to the duplication of computing resources or data to
improve the reliability and availability of the system.
● Redundancy can be implemented at various levels, including hardware redundancy (such as
redundant power supplies or disk drives) and software redundancy (such as redundant data storage
or backup processes).
The Effect of Granularity on Performance
● Granularity refers to the size of the computational tasks or data elements that are processed in a parallel system.

● The choice of granularity can have a significant impact on system performance, as it affects factors such as communication overhead, load balancing, and parallel efficiency.

● When tasks are too fine-grained, there may be too much communication overhead between processors, which can slow down the overall performance of the system.

● On the other hand, when tasks are too coarse-grained, there may be load imbalance between processors, which can result in underutilization of resources (see the sketch below).
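The trade-off can be made concrete with a toy cost model; the per-task overhead, total work, and processor count below are all assumed values chosen purely for illustration.

```python
import math

def parallel_time(total_work, grain, n_procs, per_task_overhead):
    """Toy model: split total_work into tasks of size `grain`,
    pay a fixed overhead per task, and schedule tasks round-robin."""
    n_tasks = math.ceil(total_work / grain)
    tasks_per_proc = math.ceil(n_tasks / n_procs)   # load imbalance appears for coarse grains
    return tasks_per_proc * (grain + per_task_overhead)

# 10,000 work units on 16 processors, 5 units of overhead per task.
for grain in (1, 10, 100, 1000, 5000):
    print(grain, parallel_time(10_000, grain, 16, 5))
# Very fine grain: overhead dominates; very coarse grain: most processors sit idle.
```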


Scalability of Parallel Systems
● Scalability is an important characteristic of parallel systems in high performance computing, as it refers to the ability of a system to maintain or improve performance as the size of the problem or the number of processors increases.

● Achieving good scalability is essential for achieving high performance in large-scale parallel systems.

● Scalability can be classified into two categories: strong scalability and weak scalability.

● Strong scalability refers to the ability of a system to maintain a fixed problem size and to achieve proportional speedup as the number of processors increases.

● Weak scalability refers to the ability of a system to handle a proportionally larger problem, with a fixed workload per processor, as the number of processors increases.

● In other words, if the number of processors is doubled and the size of the problem is also doubled, the runtime of the problem should remain the same.

● Achieving weak scalability is generally easier than strong scalability, as it only requires maintaining the same workload per processor as the number of processors increases.
Minimum Execution Time and Minimum Cost
● Minimum Execution Time: The minimum execution time (also known as the latency or response time) is the time taken by a task or a program to complete its execution.

● In HPC, reducing the execution time is important for achieving higher performance and throughput.

● This can be achieved through various techniques such as parallel processing, optimized algorithms, and efficient use of resources.

● Parallel processing involves breaking down a large task into smaller sub-tasks that can be executed simultaneously by multiple processors or cores.

● This can significantly reduce the execution time for the overall task.

● Similarly, optimized algorithms can help in reducing the number of computations required for a task, thus reducing the execution time.

● Efficient use of resources such as memory, I/O, and network can also help in reducing the execution time by minimizing bottlenecks and reducing idle time.

● We can determine the minimum parallel runtime TPmin for a given problem size W by differentiating the expression for TP with respect to p and equating it to zero, as in the sketch below.
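As a worked example of this differentiation, the sketch below uses the common textbook cost model for adding n numbers on p processors, TP = n/p + 2 log p; the model is assumed here for illustration, and the natural logarithm is used so the algebra stays simple (this only changes constant factors).

```python
import sympy as sp

n, p = sp.symbols('n p', positive=True)
TP = n / p + 2 * sp.log(p)               # assumed cost model: computation + communication

p_opt = sp.solve(sp.diff(TP, p), p)[0]   # dTP/dp = -n/p**2 + 2/p = 0  ->  p = n/2
TP_min = sp.simplify(TP.subs(p, p_opt))  # 2 + 2*log(n/2), i.e. Theta(log n)
print(p_opt, TP_min)
```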


● Minimum Cost: The minimum cost is another important metric, particularly for organizations with limited resources. The cost of a system includes various components such as hardware, software, power, cooling, and maintenance.

● In general, reducing the cost involves minimizing the hardware and software expenses while maintaining the required performance.

● One way to reduce cost is to use open-source software instead of commercial software.

● Additionally, optimizing the use of resources and reducing idle time can also help in reducing the power and cooling costs.


Optimal Execution Time
● The optimal execution time is the time required to execute a task or a program using the minimum amount of resources while achieving the desired performance level.

● In other words, the optimal execution time represents the sweet spot between the minimum execution time and the minimum cost.

● Achieving the optimal execution time involves finding the right balance between the resources used and the performance achieved.

● This can be achieved by optimizing the use of resources, such as processors, memory, I/O, and network, while ensuring that the performance requirements are met.
● The difference between the optimal execution time and the minimum execution time is that the minimum execution time represents the absolute fastest time in which a task or program can be executed, while the optimal execution time represents the most efficient use of resources while achieving the desired performance level.

● Achieving the minimum execution time may require using more resources than necessary, while achieving the optimal execution time requires using the minimum amount of resources necessary to achieve the desired performance level.


Asymptotic Analysis of Parallel Programs
● Asymptotic analysis is a method used to study the behavior of algorithms and programs as the input size grows to infinity.

● Parallel programs are designed to execute tasks simultaneously across multiple processors or cores, which can lead to significant speedup and performance improvements.

● However, the performance gain of parallel programs is not always proportional to the number of processors used.

● As the number of processors increases, the communication overhead and contention for shared resources can limit the scalability and performance of the program.
● Asymptotic analysis can help in identifying these limitations by analyzing the growth rate of the computation and communication overhead as the input size and number of processors increase.

● Consider the problem of sorting a list of n numbers. The fastest serial programs for this problem run in time Θ(n log n). Consider four parallel algorithms, A1, A2, A3, and A4, as follows:

● The table below compares the four algorithms for sorting a given list of numbers, showing the number of processing elements, parallel runtime, speedup, efficiency, and the pTP product.
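The table referenced here follows the standard comparison in Grama et al., Introduction to Parallel Computing (Ch. 5, asymptotic analysis of parallel programs); the entries below are reproduced from that comparison and should be read as asymptotic quantities.

Algorithm   p          TP              Speedup S        Efficiency E        pTP
A1          n^2        1               n log n          (log n)/n           n^2
A2          log n      n               log n            1                   n log n
A3          n          sqrt(n)         sqrt(n) log n    (log n)/sqrt(n)     n^1.5
A4          sqrt(n)    sqrt(n) log n   sqrt(n)          1/sqrt(n)           n log n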
Matrix Computation
● Matrix computations involve the manipulation of large matrices and the efficient use of parallel computing to perform matrix operations.

● Matrices are rectangular arrays of numbers, arranged in rows and columns.

● They are used to represent data in a wide range of scientific and engineering applications, such as image processing, signal processing, numerical analysis, and machine learning.

● Matrix computations are often used in scientific simulations and data analysis.

● In matrix computation, parallel computing allows the computation to be split across multiple processors or computer systems, enabling faster computation of large matrices.

● To optimize matrix computation in HPC, specialized libraries such as BLAS (Basic Linear Algebra Subprograms) and LAPACK (Linear Algebra Package) are used.

1. Matrix-Vector Multiplication
   a. Row-wise 1-D Partitioning
   b. 2-D Partitioning
2. Matrix-Matrix Multiplication
Matrix-Vector Multiplication
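The original slides for this topic are figures; as a stand-in, the sketch below simulates row-wise 1-D partitioning of an n × n matrix among p processes serially with NumPy (the matrix size and process count are arbitrary example values). Each simulated process owns a contiguous block of rows, needs the full vector x, and computes its own block of the result.

```python
import numpy as np

def rowwise_1d_matvec(A, x, p):
    """Simulate y = A @ x with row-wise 1-D partitioning across p processes."""
    n = A.shape[0]
    rows_per_proc = n // p                 # assume p divides n, for simplicity
    y = np.empty(n)
    for rank in range(p):                  # each iteration plays the role of one process
        lo, hi = rank * rows_per_proc, (rank + 1) * rows_per_proc
        local_A = A[lo:hi, :]              # this process's block of rows
        y[lo:hi] = local_A @ x             # needs all of x (an all-gather in a real MPI code)
    return y

A = np.arange(16.0).reshape(4, 4)
x = np.ones(4)
print(rowwise_1d_matvec(A, x, p=2))
print(A @ x)                               # same result, computed directly
```

In a real distributed implementation (for example with mpi4py), the rows of A would be scattered across processes and x broadcast or all-gathered, so each process performs roughly n²/p of the computation plus the cost of communicating the n-element vector.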

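Matrix-Matrix Multiplication

For the second topic in the list above, the sketch below shows a block-partitioned (2-D, checkerboard-style) matrix-matrix multiplication, again simulated serially with NumPy; the block size and matrix dimensions are assumed example values, and a real parallel version would assign each block of C to a different process.

```python
import numpy as np

def blocked_matmul(A, B, block):
    """C = A @ B computed block by block (2-D partitioning of C)."""
    n = A.shape[0]                          # assume square n x n matrices, block divides n
    C = np.zeros((n, n))
    for i in range(0, n, block):            # each (i, j) block of C is an independent task
        for j in range(0, n, block):
            for k in range(0, n, block):    # accumulate contributions along the k dimension
                C[i:i+block, j:j+block] += A[i:i+block, k:k+block] @ B[k:k+block, j:j+block]
    return C

rng = np.random.default_rng(0)
A = rng.random((8, 8))
B = rng.random((8, 8))
print(np.allclose(blocked_matmul(A, B, block=4), A @ B))   # True
```

Assigning each block of C to a different process is the starting point for standard parallel algorithms such as Cannon's algorithm, which additionally rotates the blocks of A and B among processes to keep per-process memory low.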