Parallel Computing

Lecture #10

Performance Metrics
Performance

Two key goals to be achieved with the design of parallel applications are:

• Performance – the capacity to reduce the time needed to solve a problem as the computing resources increase

• Scalability – the capacity to increase performance as the size of the problem increases
Need for Performance Metrics

 Determining the best algorithm

 Evaluating hardware platforms

 Examining the benefits of parallelism

Performance Metrics

There are 2 distinct classes of performance metrics:


• Performance metrics for processors/cores – assess the
performance of a processing unit, normally done by measuring the
speed or the number of operations that it does in a certain period
of time
• Performance metrics for parallel applications – assess the
performance of a parallel application, normally done by comparing
the execution time with multiple processing units against the
execution time with just one processing unit
Analytical Modeling – Basics

• A sequential algorithm is evaluated by its runtime.

• The parallel runtime of a program depends on the input size, the number of processors, and the communication parameters of the machine.
Sources of Overhead in Parallel Programs

• If I use two processors, shouldn't my program run twice as fast?

• No – a number of overheads, including wasted computation, communication, idling, and contention, cause degradation in performance.

[Figure: Execution Time – the execution profile of a hypothetical parallel program executing on eight processing elements (P0–P7). The profile indicates time spent performing computation (both essential and excess), interprocessor communication, and idling.]
Sources of Overhead in Parallel Programs

• Interprocess interactions: Processors working on any non-trivial parallel problem will need to talk to each other.

• Idling: Processes may idle because of load imbalance, synchronization, or serial components.

• Excess computation: This is computation not performed by the serial version. This might be because the serial algorithm is difficult to parallelize, or because some computations are repeated across processors to minimize communication.
Performance Metrics

A number of metrics have been used, based on the desired outcome of performance analysis:

• Execution Time

• Total Parallel Overhead

• Speedup

• Efficiency

• Cost
Execution Time

• The serial runtime of a program is the time elapsed between the beginning and the end of its execution on a sequential computer.

• The parallel runtime is the time that elapses from the moment the parallel computation starts to the moment the last processor finishes execution.

• We denote the serial runtime by T_S and the parallel runtime by T_P.
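As an illustration (not part of the original slides), here is a minimal Python sketch of how T_S and T_P might be measured; the work() function and the use of multiprocessing.Pool are assumptions chosen only to make the example runnable.

import time
from multiprocessing import Pool

def work(chunk):
    # Illustrative workload: sum a list of numbers.
    return sum(chunk)

if __name__ == "__main__":
    data = list(range(1_000_000))

    # Serial runtime T_S: one processing element does all of the work.
    t0 = time.perf_counter()
    serial_result = work(data)
    T_S = time.perf_counter() - t0

    # Parallel runtime T_P: from the start of the parallel computation
    # until the last worker finishes.
    p = 4
    chunks = [data[i::p] for i in range(p)]
    t0 = time.perf_counter()
    with Pool(p) as pool:
        parallel_result = sum(pool.map(work, chunks))
    T_P = time.perf_counter() - t0

    print("T_S =", T_S, "T_P =", T_P, "same result:", serial_result == parallel_result)
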
Total Parallel Overhead

• The overheads incurred by a parallel program are encapsulated into a single expression referred to as the overhead function.

• We define the overhead function, or total overhead, of a parallel system as the total time collectively spent by all the processing elements over and above that required by the fastest known sequential algorithm for solving the same problem on a single processing element.
Total Parallel Overhead

• Let T_all be the total time collectively spent by all the processing elements.

• T_S is the serial time.

• Observe that T_all − T_S is then the total time spent by all processors combined in non-useful work. This is called the total overhead.

• The total time collectively spent by all the processing elements is T_all = p T_P (where p is the number of processors).

• The overhead function (T_o) is therefore given by

T_o = p T_P − T_S (1)
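A small sketch of how the overhead function in equation (1) could be evaluated; the runtimes below are hypothetical values, not measurements from the slides.

def total_overhead(T_S, T_P, p):
    # T_all = p * T_P is the total time over all p processing elements;
    # T_o = T_all - T_S is the time spent beyond the best serial algorithm.
    T_all = p * T_P
    return T_all - T_S

# Example: T_S = 10 s, T_P = 3 s on p = 4 processors gives T_o = 2 s.
print(total_overhead(10.0, 3.0, 4))
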
Speedup

• Performance gain is achieved by parallelizing a given application over a sequential implementation.

• Speedup is a measure that captures the relative benefit of solving a problem in parallel.

• What is the benefit from parallelism?

• Speedup (S) is defined as the ratio of the time taken to solve a problem on a single processor to the time required to solve the same problem on a parallel computer with p identical processing elements.

• Formally, speedup S is defined as the ratio of the serial runtime of the best sequential algorithm for solving a problem to the time taken by the parallel algorithm to solve the same problem on p processing elements: S = T_S / T_P.
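The following sketch (with made-up timings) simply restates the definition S = T_S / T_P as code:

def speedup(T_S, T_P):
    # Ratio of the best sequential runtime to the parallel runtime.
    return T_S / T_P

print(speedup(10.0, 3.0))   # roughly 3.33 on p processing elements
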
Performance Metrics: Speedup

• For a given problem, there might be many serial algorithms available. These algorithms may have different asymptotic runtimes (the limiting behavior of the execution time of an algorithm as the size of the problem goes to infinity) and may be parallelizable to different degrees.

• For the purpose of computing speedup, we always consider the best sequential program as the baseline.
Speedup: Example 1

• Consider the problem of adding n numbers by using n processing elements.

• If n is a power of two, we can perform this operation in log n steps by propagating partial sums up a logical binary tree of processors.
Performance Metrics: Example (continued)

• If an addition takes constant time, say, t_c, and communication of a single word takes time t_s + t_w, we have the parallel time T_P = Θ(log n).

• We know that T_S = Θ(n).

• Speedup S is given by S = Θ(n / log n).
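The sketch below (a sequential simulation, not a real parallel program) counts the steps of the tree-based sum to confirm the log n behaviour behind T_P = Θ(log n):

import math

def tree_sum_steps(values):
    steps = 0
    while len(values) > 1:
        # One parallel step: disjoint pairs are added concurrently.
        values = [values[i] + values[i + 1] for i in range(0, len(values), 2)]
        steps += 1
    return values[0], steps

n = 16
total, steps = tree_sum_steps(list(range(n)))
print(total, steps, steps == int(math.log2(n)))   # 120 4 True
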
Efficiency

• Only an ideal parallel system containing p processing elements can deliver a speedup equal to p.
• Efficiency is a measure of the fraction of time for which a processing element is usefully employed; it is defined as the ratio of speedup to the number of processing elements.
• In an ideal parallel system, speedup is equal to p and efficiency is equal to one.
• In practice, speedup is less than p and efficiency is between zero and one, depending on the effectiveness with which the processing elements are utilized.
• Efficiency is denoted by the symbol E. Mathematically, it is given by E = S/p.
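As a sketch with illustrative numbers, efficiency follows directly from speedup and the processor count:

def efficiency(S, p):
    # E = S / p; between zero and one in practice.
    return S / p

print(efficiency(3.33, 4))   # ~0.83: each processing element is usefully busy ~83% of the time
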
Performance Metrics: Efficiency

• Efficiency is a measure of the fraction of time for which a processing element is usefully employed.

• Mathematically, it is given by

E = S / p (2)

• Following the bounds on speedup, efficiency can be as low as 0 and as high as 1.
Performance Metrics: Efficiency Example

• The speedup S of adding n numbers on n processors is given by S = Θ(n / log n).

• Efficiency E is given by

E = Θ(n / log n) / n = Θ(1 / log n)
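A quick numeric check (the sample sizes are chosen arbitrarily) of how E = Θ(1/log n) shrinks as n grows:

import math

for n in [16, 1024, 2**20]:
    S = n / math.log2(n)      # speedup of the n-number sum on n processors
    E = S / n                 # p = n processing elements
    print(n, round(E, 3))     # 0.25, 0.1, 0.05
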
Cost of a Parallel System

• Cost is the product of the parallel runtime and the number of processing elements used (p × T_P).

• Cost reflects the sum of the time that each processing element spends solving the problem.

• A parallel system is said to be cost-optimal if the cost of solving a problem on a parallel computer is asymptotically identical to the serial cost.

• Since E = T_S / (p T_P), for cost-optimal systems, E = O(1).

• Cost is sometimes referred to as work or processor-time product.
Cost of a Parallel System: Example

Consider the problem of adding n numbers on n processors.

• We have T_P = log n (for p = n).

• The cost of this system is given by p T_P = n log n.

• Since the serial runtime of this operation is Θ(n), the algorithm is not cost-optimal.
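A short sketch (sample sizes chosen arbitrarily) showing how the parallel cost n log n pulls away from the Θ(n) serial cost:

import math

for n in [16, 1024, 2**20]:
    parallel_cost = n * math.log2(n)   # p * T_P with p = n, T_P = log n
    serial_cost = n                    # T_S = Θ(n)
    print(n, parallel_cost / serial_cost)   # ratio grows like log n, so not cost-optimal
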
GRANULARITY
Parallel tasks

• Can be specified at various levels of granularity.

• At one extreme, each program in a set of programs can be viewed as one parallel task.

• At the other extreme, individual instructions within a program can be viewed as parallel tasks.
Definition

• The granularity of a task is a measure of the amount of work performed by that task, relative to the communication overhead between multiple processors or processing elements.
• Granularity = computation / communication ratio.
• Impact of granularity on performance:
  • Fine grains, or small tasks, expose more parallelism and hence increase the speedup.
  • However, synchronization overhead, scheduling strategies, etc. can negatively impact the performance of fine-grained tasks.
  • Increasing parallelism alone cannot give the best performance.
• Impact of granularity on performance:
  • In order to reduce the communication overhead, granularity can be increased.
  • Coarse-grained tasks have less communication overhead, but they often cause load imbalance.
  • Optimal performance is achieved between the two extremes of fine-grained and coarse-grained parallelism.
Granularity

The number and size of the tasks into which a problem is decomposed determine the granularity of the decomposition.

 Granularity is a measure of the ratio of computation to communication.

In parallel computing, the granularity (or grain size) of a task is a measure of the amount of work (or computation) which is performed by that task.
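As a toy illustration (the per-task timings are invented), granularity can be read off as a computation-to-communication ratio:

def granularity(compute_time, comm_time):
    # Ratio of useful computation time to communication time for a task.
    return compute_time / comm_time

print(granularity(0.9, 0.1))   # coarse-grained task: ratio 9.0
print(granularity(0.1, 0.9))   # fine-grained task: ratio ~0.11
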
Effect of Granularity on Performance

• n elements on n processors: communication overhead.

• Often, using fewer processors improves the performance of parallel systems.

• Using fewer than the maximum possible number of processing elements to execute a parallel algorithm is called scaling down a parallel system (p < n).

• If there are n inputs and only p processing elements (p < n), we can use the
parallel algorithm designed for n processing elements by assuming n virtual
processing elements and having each of the p physical processing elements
simulate n/p virtual processing elements.

• Since the number of processing elements decreases by a factor of n/p, the computation at each processing element increases by a factor of n/p.

• If virtual processing elements are mapped appropriately onto physical processing elements, the overall communication time does not grow by more than a factor of n/p.
Building Granularity: Example

Consider the problem of adding n numbers on p processing elements such that p < n and both n and p are powers of 2.

• Use the parallel algorithm for n processors, except, in this case, we think of them as virtual processors.

• Each of the p processors is now assigned n/p virtual processors.

• n = 16 and p = 4.

• Virtual processing element i is simulated by the physical processing element labeled i mod p; the numbers to be added are distributed similarly.
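The cyclic mapping described on this slide can be sketched in a few lines of Python (the dictionary layout is just one way to display the assignment):

n, p = 16, 4
assignment = {phys: [i for i in range(n) if i % p == phys] for phys in range(p)}
for phys, virtuals in assignment.items():
    print(f"physical PE {phys} simulates virtual PEs {virtuals}")
# physical PE 0 simulates virtual PEs [0, 4, 8, 12], and so on.
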
Building Granularity: Example

• The first log p of the log n steps of the original algorithm are simulated in (n/p) log p steps on p processing elements.

• The subsequent log n − log p steps do not require any communication.
Building Granularity: Example (continued)

• The overall parallel execution time of this parallel system is Θ((n/p) log p).

• The cost is Θ(n log p), which is asymptotically higher than the Θ(n) cost of adding n numbers sequentially. Therefore, the parallel system is not cost-optimal.
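A rough sketch of the slide's accounting, keeping only the dominant (n/p) log p term (an approximation; lower-order terms from the final steps are ignored):

import math

def simulated_parallel_time(n, p):
    # First log p steps are simulated in (n/p) log p time; the remaining
    # steps contribute only lower-order terms, so they are omitted here.
    return (n // p) * math.log2(p)

for n, p in [(16, 4), (1024, 32)]:
    T_P = simulated_parallel_time(n, p)
    cost = p * T_P                      # Θ(n log p)
    print(n, p, T_P, cost, "serial cost:", n)   # cost exceeds Θ(n), so not cost-optimal
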
