HPC Notes
Unit 1
➢ One of the key architectural features of vector processors is the use of multi-track
pipelines for the arithmetic operations.
➢ A prototypical vector processor has the following characteristics:
○ Vector Registers:
■ Vector processors operate on vector registers, which can hold a number of
operands (usually between 64 and 256 for double-precision data).
■ The number of elements a vector register can hold is called the vector length, denoted as Lv.
○ Pipelines:
■ There is a dedicated pipeline for each arithmetic operation, such as
addition, multiplication, division, etc.
■ Each pipeline can deliver a certain number of results per cycle, typically
ranging from 2 to 16. This is referred to as the pipeline track-width or
multi-track design.
○ Peak Performance:
■ The peak performance of a vector processor is determined by the number
of these multi-track pipelines, their throughput, and the processor's clock
frequency.
■ For a vector CPU running at 2 GHz with four-track ADD and MULT
pipelines, the peak performance would be 2 (ADD + MULT) × 4 (tracks) × 2
(GHz) = 16 GFlops/sec.
○ Memory Bandwidth:
■ Vector processors often have a memory interface running at the same
frequency as the core, delivering higher bandwidth relative to the peak
performance compared to cache-based microprocessors.
■ For example, a four-track load/store pipeline running at 2 GHz could
achieve a memory bandwidth of 4 (tracks) × 2 (GHz) × 8 (bytes) = 64 GB/sec.
➢ Prefetching for vector processors hides memory latency. Prefetching allows the
processor to fetch data from memory in advance of when it is actually needed by the
application.
➢ This can be done either through compiler-inserted prefetch instructions or by hardware
prefetchers that detect regular access patterns.
➢ The number of outstanding prefetch requests the processor can sustain is crucial for
hiding latency completely.
➢ If the application cannot sustain the required number of outstanding prefetches, the full
memory bandwidth will not be utilized, and the performance will be limited by latency
rather than bandwidth.
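As an illustration of compiler-assisted prefetching, here is a minimal sketch assuming GCC or Clang, whose __builtin_prefetch(addr, rw, locality) builtin emits a (non-faulting) prefetch instruction; the distance PF_DIST is a made-up tuning parameter that must be large enough to cover the memory latency:

#include <stddef.h>

#define PF_DIST 64   /* hypothetical prefetch distance, in elements */

void scale(double *a, const double *b, double s, size_t n) {
    for (size_t i = 0; i < n; ++i) {
        /* request b[i + PF_DIST] ahead of time: read access (0),
           low temporal locality (0); invalid addresses do not fault */
        __builtin_prefetch(&b[i + PF_DIST], 0, 0);
        a[i] = s * b[i];
    }
}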
Unit 2
➢ Parallelism means executing multiple tasks on multiple processors of a
computer at the same time.
➢ Data Parallelism
○ Many problems in scientific computing involve processing of large quantities of
data stored on a computer. If this manipulation can be performed in parallel, i.e.,
by multiple processors working on different parts of the data, it is data
parallelism.
○ As a matter of fact, this is the dominant parallelization concept in scientific
computing on MIMD-type computers. It also goes under the name of SPMD
(Single Program Multiple Data), where the same code is executed on all
processors, with independent instruction pointers.
○ Examples:
○ Medium Grained Loop Parallelism
■ Processing of array data by loops or loop nests is a central component in
most scientific codes. Typical examples are linear algebra operations on
vectors or matrices.
■ Often the computations performed on individual array elements are
independent of each other and are hence typical candidates for parallel
execution by several processors in shared memory (see the C/OpenMP sketch
after this list).
○ Coarse Grained parallelism by domain decomposition
■ Simulations of physical processes often work with a simplified picture of
reality in which a computational domain is represented as a grid that
defines discrete positions for the physical quantities under consideration.
■ The goal of the simulation is usually the computation of observables on
this grid.
■ A straightforward way to distribute the work involved across workers, i.e.,
processors, is to assign a part of the grid to each worker. This is called
domain decomposition.
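A minimal C/OpenMP sketch of the medium-grained loop parallelism mentioned above (the function and array names are illustrative):

void axpy(int n, double alpha, const double *x, double *y) {
    /* each iteration updates a distinct element, so the iterations are
       independent and can be shared among the threads of a team */
    #pragma omp parallel for
    for (int i = 0; i < n; ++i)
        y[i] += alpha * x[i];
}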
➢ Functional Parallelism
○ Sometimes the solution of a “big” numerical problem can be split into disparate
subtasks, which work together by data exchange and synchronization.
○ In this case, the subtasks execute completely different code on different data
items, which is why functional parallelism is also called MPMD (Multiple
Program Multiple Data).
○ When different parts of the problem have different performance properties and
hardware requirements, bottlenecks and load imbalance can easily arise.
○ On the other hand, overlapping tasks that would otherwise be executed
sequentially could accelerate execution considerably.
○ Examples
○ Master Worker Scheme:
■ Reserving one compute element for administrative tasks while all others
solve the actual problem is called the master-worker scheme.
■ The master distributes work and collects results. A typical example is a
parallel ray tracing program (see the MPI sketch after this list).
○ Functional Decomposition:
■ Multiphysics simulations are prominent applications for parallelization
by functional decomposition.
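A compact MPI sketch of the master-worker scheme referenced above; do_task(), NTASKS, and the tag convention are hypothetical placeholders, not code from the source:

#include <stdio.h>
#include <mpi.h>

#define NTASKS   100
#define TAG_WORK 0   /* message carries a task number */
#define TAG_STOP 1   /* message tells the worker to quit */

static double do_task(int t) { return (double)t * t; }  /* stand-in work */

int main(int argc, char **argv) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (rank == 0) {                      /* master: administrate only */
        int next = 0, active = 0;
        for (int w = 1; w < size; ++w) {  /* seed every worker once */
            if (next < NTASKS) {
                MPI_Send(&next, 1, MPI_INT, w, TAG_WORK, MPI_COMM_WORLD);
                ++next; ++active;
            } else {
                MPI_Send(&next, 1, MPI_INT, w, TAG_STOP, MPI_COMM_WORLD);
            }
        }
        while (active > 0) {              /* collect results, re-feed workers */
            double res; MPI_Status st;
            MPI_Recv(&res, 1, MPI_DOUBLE, MPI_ANY_SOURCE, TAG_WORK,
                     MPI_COMM_WORLD, &st);
            if (next < NTASKS) {
                MPI_Send(&next, 1, MPI_INT, st.MPI_SOURCE, TAG_WORK,
                         MPI_COMM_WORLD);
                ++next;
            } else {
                MPI_Send(&next, 1, MPI_INT, st.MPI_SOURCE, TAG_STOP,
                         MPI_COMM_WORLD);
                --active;
            }
        }
    } else {                              /* worker: compute until stopped */
        for (;;) {
            int t; MPI_Status st;
            MPI_Recv(&t, 1, MPI_INT, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &st);
            if (st.MPI_TAG == TAG_STOP) break;
            double res = do_task(t);
            MPI_Send(&res, 1, MPI_DOUBLE, 0, TAG_WORK, MPI_COMM_WORLD);
        }
    }
    MPI_Finalize();
    return 0;
}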
➢ In any OpenMP program, a single thread, the master thread, runs immediately after
startup.
➢ Truly parallel execution happens inside parallel regions, of which an arbitrary number
can exist in a program.
➢ Between two parallel regions, no thread except the master thread executes any code. This
is also called the “fork-join model”
➢ Inside a parallel region, a team of threads executes instruction streams concurrently. The
number of threads in a team may vary among parallel regions.
➢ OpenMP is a layer that adapts the raw OS thread interface to make it more usable with
the typical structures that numerical software tends to employ.
➢ In practice, parallel regions in Fortran are initiated by !$OMP PARALLEL and ended by
!$OMP END PARALLEL directives, respectively.
➢ The !$OMP string is a so-called sentinel, which starts an OpenMP directive.
➢ Inside a parallel region, each thread carries a unique identifier, its thread ID, which runs
from zero to the number of threads minus one, and can be obtained by the
omp_get_thread_num() API function.
➢ The omp_get_num_threads() function returns the number of active threads in the
current parallel region.
➢ Code between OMP PARALLEL and OMP END PARALLEL, including subroutine calls,
is executed by every thread.
➢ In the simplest case, the thread ID can be used to distinguish the tasks to be performed
on different threads.
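For illustration, a minimal C version of such a program (C uses the #pragma omp prefix in place of the Fortran !$OMP sentinel):

#include <stdio.h>
#include <omp.h>

int main(void) {
    #pragma omp parallel
    {
        int id  = omp_get_thread_num();    /* 0 .. number of threads - 1 */
        int nth = omp_get_num_threads();   /* threads in the current team */
        printf("Thread %d of %d\n", id, nth);
    }   /* implicit barrier and join at the end of the parallel region */
    return 0;
}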
➢ Critical Regions:
○ Concurrent write access to a shared variable or, in more general terms, a shared
resource, must be avoided to circumvent race conditions.
○ Critical regions solve this problem by making sure that at most one thread at a
time executes some piece of code.
○ If a thread is executing code inside a critical region, and another thread wants to
enter, the latter must wait (block) until the former has left the region.
○ The order in which threads enter the critical region is undefined, and can change
from run to run.
○ Consequently, the definition of a “correct result” must encompass the possibility
that the partial sums are accumulated in a random order, and the usual
reservations regarding floating-point accuracy apply.
○ Critical regions hold the danger of deadlocks when used inappropriately.
○ A deadlock arises when one or more “agents” (threads in this case) wait for
resources that will never become available, a situation that is easily generated
with badly arranged CRITICAL directives.
○ OpenMP has a simple solution for this problem: A critical region may be given a
name that distinguishes it from others.
○ The name is specified in parentheses after the CRITICAL directive.
○ Without names on the two critical regions, such code deadlocks: a thread
that calls func() while already inside a critical region immediately
encounters the second critical region inside func() and waits indefinitely
for itself to free the resource (see the sketch below).
○ With the names, the second critical region is understood to protect a different
resource than the first.
○ A disadvantage of named critical regions is that the names are unique identifiers.
○ It is not possible to have them indexed by an integer variable, for instance.
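A C sketch of this situation (hypothetical functions; the Fortran CRITICAL directives discussed above map to #pragma omp critical (name) in C):

#include <omp.h>

static double global_count = 0.0;

double func(double x) {
    #pragma omp critical (count_update)   /* protects global_count */
    global_count += 1.0;
    return x * x;
}

void accumulate(double *sum, double x) {
    #pragma omp critical (sum_update)     /* protects *sum; without the two
                                             distinct names, calling func()
                                             here would self-deadlock */
    *sum += func(x);
}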
➢ Barriers:
○ If, at a certain point in the parallel execution, it is necessary to synchronize all
threads, a BARRIER can be used.
○ The barrier is a synchronization point, which guarantees that all threads have
reached it before any thread goes on executing the code below it.
○ Certainly, it must be ensured that every thread hits the barrier, or a deadlock may
occur.
○ Barriers should be used with caution in OpenMP programs, partly because of
their potential to cause deadlocks, but also due to their performance impact
(synchronization is overhead).
○ Every parallel region executes an implicit barrier at its end, which cannot be
removed.
○ There is also a default implicit barrier at the end of work-sharing loops and some
other constructs to prevent race conditions.
○ It can be eliminated by specifying the NOWAIT clause.
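A C sketch of both constructs (array names and sizes are illustrative); the NOWAIT clause appears as the lowercase nowait clause in C:

#define N 1000

void two_phases(double *a, double *b) {
    #pragma omp parallel
    {
        #pragma omp for nowait        /* nowait removes the implied barrier */
        for (int i = 0; i < N; ++i)
            a[i] = 2.0 * i;

        /* ...work that does not touch a[] could overlap here... */

        #pragma omp barrier           /* required: b[i] reads a "foreign"
                                         element of a[] written elsewhere */
        #pragma omp for
        for (int i = 0; i < N; ++i)
            b[i] = a[N - 1 - i];
    }
}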
➢ The simplest possibility is STATIC, which divides the loop into contiguous chunks of
(roughly) equal size. Each thread then executes on exactly one chunk.
➢ If for some reason the amount of work per loop iteration is not constant but, e.g.,
decreases with loop index, this strategy is suboptimal because different threads will get
vastly different workloads, which leads to load imbalance.
➢ One solution would be to use a chunksize, as in “STATIC,1”, dictating that chunks of
size 1 be distributed across threads in a round-robin manner.
➢ The chunksize may not only be a constant but any valid integer-valued expression.
➢ Dynamic scheduling assigns a chunk of work, whose size is defined by the chunksize, to
the next thread that has finished its current chunk.
➢ This allows for a very flexible distribution which is usually not reproduced from run to
run.
➢ Threads that get assigned “easier” chunks will end up completing more of them, and
load imbalance is greatly reduced.
➢ The downside is that dynamic scheduling generates significant overhead if the chunks
are too small in terms of execution time.
➢ This is why it is often desirable to use a moderately large chunksize on tight loops,
which in turn leads to more load imbalance.
➢ In cases where this is a problem, the guided schedule may help. Again, threads request
new chunks dynamically, but the chunksize is always proportional to the remaining
number of iterations divided by the number of threads.
➢ The smallest chunksize is specified in the schedule clause (default is 1). Despite the
dynamic assignment of chunks, scheduling overhead is kept under control.
➢ Of course, there is also the possibility of “explicit” scheduling, using the thread ID
number to assign work to threads.
➢ For debugging and profiling purposes, OpenMP provides a facility to determine the loop
scheduling at runtime.
➢ If the scheduling clause specifies “RUNTIME,” the loop is scheduled according to the
contents of the OMP_SCHEDULE shell variable.
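A C sketch of these clauses (loop bodies are illustrative): the first loop has a triangular workload, for which a plain static schedule would be imbalanced; the second defers the decision to the OMP_SCHEDULE shell variable, e.g. export OMP_SCHEDULE="guided,10".

#define N 2000

void fill(double m[N][N]) {
    #pragma omp parallel for schedule(dynamic, 16)  /* chunks of 16 on demand */
    for (int i = 0; i < N; ++i)
        for (int j = i; j < N; ++j)                 /* work shrinks with i */
            m[i][j] = 1.0 / (1.0 + i + j);
}

void fill_first_column(double m[N][N]) {
    #pragma omp parallel for schedule(runtime)      /* read from OMP_SCHEDULE */
    for (int i = 0; i < N; ++i)
        m[i][0] = 0.0;
}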
Unit 3
Question 1: Describe the method for determining OpenMP overhead for short
loops.
➢ In general, the overhead consists of a constant part and a part that depends on the
number of threads.
➢ There are vast differences from system to system as to how large it can be, but it is
usually of the order of at least hundreds if not thousands of CPU cycles.
➢ It can be determined easily by fitting a simple performance model to data obtained from
a low-level benchmark.
➢ As an example we pick the vector triad with short vector lengths and static scheduling so
that the parallel run scales across threads if each core has its own L1:
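A C rendering of such a benchmark (the original is typically written in Fortran; N, NITER, and the printf guard are placeholders):

#include <stdio.h>
#include <omp.h>

#define N     1000      /* short vectors: loop overhead is not negligible */
#define NITER 100000

int main(void) {
    static double a[N], b[N], c[N], d[N];
    for (int i = 0; i < N; ++i) b[i] = c[i] = d[i] = 1.0;

    double start = omp_get_wtime();
    #pragma omp parallel
    {
        for (int j = 0; j < NITER; ++j) {
            #pragma omp for schedule(static) nowait  /* NOWAIT drops barrier */
            for (int i = 0; i < N; ++i)
                a[i] = b[i] + c[i] * d[i];
            if (a[N/2] < 0.0) printf("-");  /* never true; defeats dead-code
                                               elimination of the triad loop */
        }
    }
    double stop = omp_get_wtime();
    printf("MFlops/sec: %.1f\n",
           2.0 * N * (double)NITER / (stop - start) / 1e6);
    return 0;
}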
➢ NITER is chosen so that overall runtime can be accurately measured and one-time
startup effects (loading data into cache the first time etc.) become unimportant. The
NOWAIT clause is optional and is used here only to demonstrate the impact of the
implied barrier at the end of the loop worksharing construct.
➢ The performance model for the above program is obtained as follows:
○ The benchmark is measured under different scheduling strategies in OpenMP,
such as static, dynamic, and guided scheduling.
○ By fitting the measured performance data to a model that accounts for the
OpenMP overhead, one can estimate the constant and thread-dependent
parts of the overhead.
➢ Surprisingly, there is a measurable overhead for running with a single OpenMP thread
versus purely serial mode. However, as the two versions reach the same asymptotic
performance for N ≳ 1000, this effect cannot be attributed to inefficient scalar code,
although OpenMP does sometimes prevent advanced loop optimizations. The
single-thread overhead originates from the inability of the compiler to generate a
separate code version for serial execution.
➢ The barrier is negligible with a single thread, and accounts for most of the overhead at
four threads. But even without it, about 190 ns (570 CPU cycles) are spent setting up
the worksharing loop in that case. There is an additional cost of nearly 460 ns (1380
cycles) for restarting the team, which proves that it can indeed be beneficial to minimize
the number of parallel regions in an OpenMP program.
➢ Similar to the experiments performed with dynamic scheduling in the previous section,
the actual overhead incurred for small OpenMP loops depends on many factors like the
compiler and runtime libraries, organization of caches, and the general node structure. It
also makes a significant difference whether the thread team resides inside a cache group
or encompasses several groups.
➢ It is a general result, however, that implicit (and explicit) barriers are the largest
contributors to OpenMP-specific parallel overhead.
➢ Serialization:
○ Implicit or unintended serialization can occur in parallel programs when there
are false assumptions about how message transfers or synchronization operations
work. One example is the ring shift communication pattern, where the message
transfers are serialized even though there is no actual deadlock.
○ In MPI, such implicit serialization can happen if the programmer makes
incorrect assumptions about the order or concurrency of message transfers.
○ Similarly, in OpenMP, the careless use of synchronization constructs like
CRITICAL regions can effectively serialize parallel execution.
○ Serialization negates the potential benefits of parallelization and should be
avoided through careful analysis of the communication and synchronization
behavior of the parallel algorithm.
○ Solutions:
■ Carefully analyze the communication and synchronization patterns in the
parallel algorithm to identify potential sources of implicit serialization.
■ Avoid making false assumptions about the order or concurrency of
operations.
■ Use appropriate MPI constructs (e.g., nonblocking communication) to
overlap communication and computation where possible.
■ In OpenMP, be cautious about the use of synchronization constructs like
CRITICAL regions, as they can serialize execution.
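For the ring shift mentioned above, one common remedy is MPI_Sendrecv, which lets the library pair the transfers so that no process blocks on an un-posted receive. A sketch with placeholder buffers:

#include <mpi.h>

void ring_shift(double *sendbuf, double *recvbuf, int n, MPI_Comm comm) {
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);
    int right = (rank + 1) % size;         /* destination */
    int left  = (rank - 1 + size) % size;  /* source */
    /* send to the right neighbor and receive from the left one in a single
       call; the library resolves the ordering, avoiding serialization */
    MPI_Sendrecv(sendbuf, n, MPI_DOUBLE, right, 0,
                 recvbuf, n, MPI_DOUBLE, left,  0,
                 comm, MPI_STATUS_IGNORE);
}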
➢ False Sharing:
○ False sharing occurs when multiple threads or processes are continuously
modifying data that reside in the same cache line, causing frequent cache
coherence traffic. This can happen, for example, in a parallel histogram
computation where each thread updates a shared histogram array.
○ The cache coherence logic is forced to constantly evict and reload the cache lines,
leading to severe performance degradation compared to the serial case where the
data fits in a single CPU's cache.
○ Compiler optimizations like register allocation can sometimes eliminate false
sharing, but in more complex cases, manual code restructuring may be necessary.
○ False sharing can be a significant issue in shared-memory parallel programming
models like OpenMP, where the cache coherence mechanisms are transparent to
the programmer.
○ Solutions:
■ Identify data structures that are being modified by multiple threads or
processes and reside in the same cache lines.
■ Restructure the data layout or access patterns to avoid this problem, for
example, by using thread-private data structures for intermediate
computations.
■ Leverage compiler optimizations like register allocation when possible to
eliminate false sharing.
■ Manually optimize the code to avoid false sharing if necessary, even
though this can be a challenge for complex data structures and access
patterns.
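A C/OpenMP sketch of the thread-private remedy for the histogram example above (bin count and data layout are illustrative):

#include <string.h>

#define NBINS 16

void histogram(const unsigned char *data, long n, long *hist) {
    memset(hist, 0, NBINS * sizeof(long));
    #pragma omp parallel
    {
        long priv[NBINS] = {0};            /* thread-private: no shared cache
                                              lines, no coherence traffic */
        #pragma omp for nowait
        for (long i = 0; i < n; ++i)
            priv[data[i] % NBINS]++;
        #pragma omp critical (hist_merge)  /* short, infrequent merge step */
        for (int b = 0; b < NBINS; ++b)
            hist[b] += priv[b];
    }
}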
Question 4: Write an MPI program in which each process prints “Hello world”
together with its rank.
Fortran Code
C Code
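One possible C version (a Fortran variant would be analogous):

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv) {
    int rank, size;
    MPI_Init(&argc, &argv);                  /* initialize the MPI environment */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);    /* rank of this process */
    MPI_Comm_size(MPI_COMM_WORLD, &size);    /* total number of processes */
    printf("Hello world! I am rank %d of %d\n", rank, size);
    MPI_Finalize();                          /* shut down MPI */
    return 0;
}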
Output
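A possible output with four processes (the ordering of the lines is nondeterministic):

Hello world! I am rank 1 of 4
Hello world! I am rank 3 of 4
Hello world! I am rank 0 of 4
Hello world! I am rank 2 of 4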
Question 5: Write a note on (i) profiling (ii) performance tuning.
➢ Profiling:
○ Profiling is a key technique for understanding a program's behavior and
identifying performance bottlenecks. There are two main approaches to profiling:
○ Instrumentation:
■ Instrumentation involves modifying the compiled code to insert logging
or timing calls at function or basic block boundaries.
■ This approach can provide detailed information about function calls,
execution times, and call graphs, but it also incurs overhead that can
distort the program's actual behavior.
○ Sampling:
■ Sampling periodically interrupts the program and records the current
program counter and call stack.
■ Sampling is less intrusive than instrumentation, but the results are
statistical in nature and require longer running times to obtain accurate
profiles.
○ One of the popular profiling tools is gprof from the GNU binutils package, which
uses both instrumentation and sampling to generate flat function profiles and
call graphs.
○ Another profiling tool is OProfile, which is a sampling-based profiler that can
provide annotated source code with line-by-line execution statistics without
requiring any special instrumentation of the binary.
○ Using hardware performance counters to understand resource utilization is
preferable to relying solely on execution time. Tools like PAPI can be used
to access these hardware counters from within a program.
➢ Performance Tuning:
○ Identifying performance bottlenecks through profiling is just the first step in the
optimization process. The actual reasons for poor performance may not be
obvious from the profiling data alone.
○ Common sense guidelines for performance tuning:
■ Avoid unnecessary memory accesses:
● Redundant memory loads and stores can significantly impact
performance, especially in memory-bound applications.
● Techniques like loop unrolling, common subexpression
elimination, and scalar replacement can help reduce memory
traffic.
■ Leverage the memory hierarchy:
● Ensure that data structures and access patterns are designed to
take advantage of the CPU's cache hierarchy.
● Avoid false sharing and unnecessary synchronization, which can
cause cache coherence issues and degrade performance.
■ Optimize control flow:
● Minimize branching and branch mispredictions, which can stall
the CPU pipeline.
● Consider loop transformations like blocking, tiling, and unrolling
to improve instruction-level parallelism.
■ Use SIMD instructions effectively:
● Take advantage of the CPU's vector processing capabilities by
writing code that can exploit SIMD instructions.
● Compiler auto-vectorization can help, but manual tuning may be
necessary for complex algorithms.
■ Consider the compiler's code generation:
● Understand the strengths and weaknesses of the compiler you are
using and adjust your coding style accordingly.
● Experiment with different compiler flags and optimization levels
to find the best balance of performance and code size.
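A small C illustration of the first guideline, using scalar replacement (names illustrative): accumulating into a scalar keeps the running sum in a register instead of storing to, and possibly reloading, y[i] in every inner iteration, which the compiler may not do on its own if it must assume aliasing.

void matvec(int n, const double *a, const double *x, double *y) {
    for (int i = 0; i < n; ++i) {
        double s = 0.0;                 /* register-resident accumulator */
        for (int j = 0; j < n; ++j)
            s += a[i * n + j] * x[j];
        y[i] = s;                       /* a single store per row */
    }
}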
Unit 4
➢ Collective Communication:
○ Collective communication in MPI involves operations where multiple processes
collaborate to perform a common communication task.
○ These operations are designed to synchronize data exchange among processes
and are particularly useful for scenarios where all processes need to participate in
the communication.
○ Examples of collective communication operations include broadcasting data to
all processes, reducing data across all processes, scattering data from one process
to all others, and gathering data from all processes to one.
○ In collective operations, all processes must call the same routine to ensure proper
coordination and data consistency. This synchronization ensures that all
processes reach a common communication point before proceeding further,
which is crucial for maintaining the integrity of the exchanged data.
○ MPI provides a range of collective communication routines that simplify the
implementation of common communication patterns. For instance, MPI_Bcast
allows a single process to broadcast data to all other processes in the
communicator, while MPI_Reduce aggregates data from all processes to a single
process using a specified operation (e.g., sum, max, min).
○ Similarly, MPI_Scatter distributes data from one process to all others, and
MPI_Gather collects data from all processes to a single process.
○ Collective communication is well-suited for scenarios where all processes need to
participate in a communication task and synchronization is critical for data
consistency.
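A minimal C sketch combining two of the routines named above: the root broadcasts a parameter, each rank computes a stand-in partial result, and MPI_Reduce sums the results onto the root. Note that every rank must call both collectives.

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv) {
    int rank, size;
    double param = 0.0, part, total;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (rank == 0) param = 3.14;                  /* only the root sets it */
    MPI_Bcast(&param, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);
    part = param * rank;                          /* stand-in local work */
    MPI_Reduce(&part, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0) printf("total = %f\n", total);
    MPI_Finalize();
    return 0;
}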
➢ Non-blocking point-to-point communication
○ Non-blocking point-to-point communication in MPI offers a more flexible and
asynchronous communication model.
○ With non-blocking communication, processes can initiate send and receive
operations and continue with other computations without waiting for the
communication to complete.
○ This asynchronous nature of non-blocking communication allows processes to
overlap communication and computation, thereby improving overall performance
and efficiency.
○ Non-blocking point-to-point communication calls, such as MPI_Isend and
MPI_Irecv, return immediately after posting the communication request. This
enables processes to continue with independent computations while the
communication progresses in the background.
○ Once the communication operation is completed, processes can synchronize and
exchange data as needed.
○ The ability to overlap communication with computation is a significant
advantage of non-blocking communication in MPI. By allowing processes to
perform other tasks while waiting for communication to finish, non-blocking
communication helps reduce idle time and maximize resource utilization in
parallel computing environments.
○ Non-blocking point-to-point communication is preferred when processes can
continue with independent computations while waiting for communication to
complete.
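A C sketch of the overlap pattern (buffers and neighbor ranks are placeholders): post the receive and send, compute on data that does not depend on the messages, then wait before touching the received data.

#include <mpi.h>

void exchange_and_compute(double *halo_in, double *halo_out, int n,
                          int left, int right, MPI_Comm comm) {
    MPI_Request req[2];
    MPI_Irecv(halo_in,  n, MPI_DOUBLE, left,  0, comm, &req[0]);
    MPI_Isend(halo_out, n, MPI_DOUBLE, right, 0, comm, &req[1]);

    /* ...compute on interior points that need no halo data... */

    MPI_Waitall(2, req, MPI_STATUSES_IGNORE);
    /* ...now compute the boundary points that use halo_in... */
}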
➢ When figuring out the optimal distribution of threads and processes across the cores and
nodes of a system, it is often assumed that any intranode MPI communication is
infinitely fast.
➢ Surprisingly, this is not true in general, especially with regard to bandwidth.
➢ Although even a single core can today draw a bandwidth of multiple GB/sec out of a
chip’s memory interface, inefficient intranode communication mechanisms used by the
MPI implementation can harm performance dramatically.
➢ The simplest “bug” in this respect can arise when the MPI library is not aware of the fact
that two communicating processes run on the same shared-memory node.
➢ In this case, relatively slow network protocols are used instead of memory-to-memory
copies.
➢ But even if the library does employ shared memory communication where applicable,
there is a spectrum of possible strategies:
➢ Nontemporal stores or cache line zero may be used or not, probably depending on
message and cache sizes. If a message is small and both processes run in a cache group,
using nontemporal stores is usually counterproductive because it generates additional
memory traffic. However, if there is no shared cache or the message is large, the data
must be written to main memory anyway, and nontemporal stores avoid the write
allocate that would otherwise be required.
➢ The data transfer may be “single-copy,” meaning that a simple block copy operation is
sufficient to copy the message from the send buffer to the receive buffer (implicitly
implementing a synchronizing rendezvous protocol), or an intermediate (internal) buffer
may be used. The latter strategy requires additional copy operations, which can
drastically diminish communication bandwidth if the network is fast.
➢ There may be hardware support for intranode memory-to-memory transfers. In
situations where shared caches are unimportant for communication performance, using
dedicated hardware facilities can result in superior point-to-point bandwidth.
➢ The behavior of MPI libraries with respect to the above issues can sometimes be influenced
by tunable parameters, but the rapid evolution of multicore processor architectures with
complex cache hierarchies and system designs also makes it a subject of intense
development.
➢ The simple Ping-Pong benchmark from the IMB suite can be used to fathom the
properties of intranode MPI communication.
➢ Example: Cray XT5 system:
➢ One XT5 node comprises two AMD Opteron chips with a 2 MB quad-core L3 group each.
These nodes are connected via a 3D torus network.
➢ Due to this structure one can expect three different levels of point-to-point
communication characteristics, depending on whether message transfer occurs inside an
L3 group (intranode-intrasocket), between cores on different sockets
(intranode-intersocket), or between different nodes (internode).
➢ Communication characteristics are quite different between internode and intranode
situations for small and intermediate-length messages. Two cores on the same socket
can really benefit from the shared L3 cache, leading to a peak bandwidth of over
3 GBytes/sec.
➢ There are two reasons for the performance drop at larger message sizes:
○ First, the L3 cache (2 MB) is too small to hold both the local receive
buffer and the remote send buffer, or even one of them.
○ Second, the IMB is run in such a way that the number of repetitions
decreases with increasing message size, until for large messages only a
single iteration (the initial copy operation through the network) remains.
Question 3: Explain the different decompositions of a cubic domain of size L³
into N subdomains
➢ Depending on whether the domain cuts are performed in one, two, or all three
dimensions (top to bottom), the number of elements that must be communicated by one
process with its neighbors, c(L,N), changes its dependence on N.
➢ The best behavior, i.e., the steepest decline, is achieved with cubic subdomains. We are
neglecting here that the possible options for decomposition depend on the prime factors
of N and the actual shape of the overall domain (which may not be cubic).
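Under these idealized assumptions (cubic overall domain, periodic boundaries, w words communicated per boundary site), the halo data volume per process scales roughly as:

$$
c(L,N) \approx
\begin{cases}
2wL^2 & \text{slabs (1D cuts), independent of } N\\
4wL^2\,N^{-1/2} & \text{poles (2D cuts)}\\
6wL^2\,N^{-2/3} & \text{cubes (3D cuts)}
\end{cases}
$$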
➢ The MPI_Dims_create() function tries, by default, to make the subdomains “as cubic as
possible,” under the assumption that the complete domain is cubic.
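A short C sketch of its use together with MPI_Cart_create: a dims[] entry of 0 tells MPI_Dims_create to choose that dimension, producing a near-cubic process grid.

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv) {
    int size, dims[3] = {0, 0, 0};     /* 0 = let MPI choose this dimension */
    int periods[3] = {1, 1, 1};        /* periodic boundaries assumed here */
    MPI_Comm cart;
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Dims_create(size, 3, dims);    /* e.g., 12 processes -> 3 x 2 x 2 */
    MPI_Cart_create(MPI_COMM_WORLD, 3, dims, periods, 1, &cart);
    printf("process grid: %d x %d x %d\n", dims[0], dims[1], dims[2]);
    MPI_Finalize();
    return 0;
}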
➢ As a result, although much easier to implement, “slab” subdomains should not be used in
general because they incur a much larger and, more importantly, N-independent
overhead compared to pole-shaped or cubic ones.
➢ A constant cost of communication per subdomain will greatly harm strong scalability
because performance saturates at a lower level, determined by the message size (i.e., the
slab area) instead of the latency.
➢ The negative power of N appearing in the halo volume expressions for pole- and
cube-shaped subdomains will dampen the overhead, but still the surface-to-volume ratio
will grow with N. Even worse, scaling up the number of processors at constant problem
size “rides the Ping-Pong curve” down towards smaller messages and, ultimately, into
the latency-dominated regime.
➢ In the absence of overlap effects, each of the six halo communications is subject to latency;
if latency dominates the overhead, “optimal” 3D decomposition may even be
counterproductive because of the larger number of neighbors for each domain.
➢ The communication volume per site (w) depends on the problem.
➢ If an algorithm requires higher-order derivatives or if there is some long-range
interaction, w is larger.
➢ The same is true if one grid point is a more complicated data structure than just a scalar.
Question 4: Briefly explain with diagrams (i) Perfect mapping of MPI ranks (ii)
Round robin mapping
➢ Up to now we have presupposed that the MPI subsystem assigns successive ranks to the
same node when possible, i.e., filling one node before using the next.
➢ Although this may be a reasonable assumption on many parallel computers, it should by
no means be taken for granted. In case of a round-robin, or cyclic distribution, where
successive ranks are mapped to successive nodes, the “best” solution from Figure 1 will
be turned into the worst possible alternative.
➢ Figure 2 illustrates that each node now has 32 internode connections.
➢ Similar considerations apply to other types of parallel systems, like architectures based
on (hyper-)cubic mesh networks, on which next-neighbor communication is often
favored.
➢ If the Cartesian topology does not match the mapping of MPI ranks to nodes, the
resulting long-distance connections can result in painfully slow communication, as
measured by the capabilities of the network.
Question 5: Describe the different communication parameters with diagrams