
UNIT 4

1. What are the sources of overhead in Parallel Programs?


=>
Inter-process communication:
Parallel programs often require communication between different processes or threads.
It involves the exchange of data or messages between processors.
Time spent sending and receiving messages and coordinating operations introduces
overhead in a parallel program.

Idling:
Idling refers to situations where processors or threads remain inactive or have no work to
perform during parallel execution.
This can happen due to load imbalance, and it results in wasted resources and decreased
efficiency.

Excess computation:
Excess computation overhead arises from redundant or unnecessary calculations in a
parallel program.
It can occur due to improper task decomposition, redundant calculations, or inefficient
algorithms.
It leads to additional computational resources being used and a longer execution time.

2. Explain Performance Metrics for Parallel Systems


=>
Ts = Serial Runtime
Tp = Parallel Runtime
p = Number of processing elements

Performance metrics:
• Execution time:
The total time taken to execute a parallel program.

• Total overhead:
The additional time or resources consumed by a parallel program beyond the
actual computation.

• Speedup:
The performance improvement achieved by parallelizing a program compared to
its sequential version.

• Efficiency:
It measures how effectively the computational resources are utilized to solve the
problem (standard formulas are summarized after this list).

• Scalability:
The ability of a parallel system to maintain or improve its performance as the number
of processors/threads increases.
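
As a quick reference, the standard formulas for these metrics in the Ts/Tp/p notation above are:

Total overhead: To = p·Tp − Ts
Speedup: S = Ts / Tp
Efficiency: E = S / p = Ts / (p·Tp)
Cost: p·Tp (the system is cost-optimal when p·Tp = Θ(Ts))

For example, if Ts = 100 s and Tp = 20 s on p = 8 processing elements, then S = 5 and E = 5/8 ≈ 0.62.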

3. What is the effect of Granularity on Performance?


=>
Scaling down:
Using fewer processors often improves the efficiency (cost-effectiveness) of a parallel system.
Scaling down refers to using fewer processors or processing elements than the maximum
possible number in a parallel system.
It means reducing the number of processing elements from 'n' to 'p'.

• Communication decrease:
When scaling down, the communication cost decreases by a factor of 'n/p'.

• Computation increase:
Computation at each processing element increases by a factor of 'n/p' when scaling
down, because the total workload is divided among a reduced number of processors.
Total cost (naive scaling down): Θ(n log p)

Cost-optimal way:
Total cost: Θ(n + p log p)
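
As a concrete illustration (the standard example of adding n numbers on p processing elements, assuming a block mapping of n/p numbers per element):

Naive scaling down (simulating the n-element algorithm on p elements):
Tp = Θ((n/p) log p), so total cost = p·Tp = Θ(n log p), which is not cost-optimal.

Cost-optimal way (each element first adds its own n/p numbers, then the p partial sums are combined in log p steps):
Tp = Θ(n/p + log p), so total cost = Θ(n + p log p), which is cost-optimal provided n = Ω(p log p).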

https://youtu.be/gOv7t5yYvmo

4. Describe Minimum Execution Time and Minimum Cost-Optimal Execution Time
=>
Minimum Execution Time (Tp_min):
To find the minimum execution time, differentiate the expression for Tp (parallel
runtime) with respect to p (number of processing elements) and set the derivative to zero:
d(Tp)/dp = 0

Minimum Cost-Optimal Parallel Time (Tp_cost_opt):
The smallest parallel runtime achievable while the system remains cost-optimal.
If W is the problem size (work) and Θ(f(p)) is the isoefficiency function, then
for cost-optimality, p = O(f⁻¹(W))
Tp_cost_opt = Θ(W/p)
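
As a worked illustration (the standard example of adding n numbers, assuming Tp = n/p + 2 log p):

d(Tp)/dp = −n/p² + 2/p = 0  ⇒  p = n/2
Substituting back gives Tp_min = 2 + 2 log n − 2 = Θ(log n).

For cost-optimality the same problem requires p = O(n / log n), which gives
Tp_cost_opt = Θ(n/p + log p) = Θ(log n) as well in this particular case.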

5. Define Matrix-Vector Multiplication with examples


a) Row-wise 1D partitioning
b) 2D partitioning
https://youtu.be/0-uajl0skxA
https://youtu.be/4py-tfXTld8
https://youtu.be/kpQheszFaEI
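
A minimal sketch of the row-wise 1D idea (a CUDA kernel, offered as an illustration rather than the formulation from the lectures; each thread owns one row of an n × n row-major matrix A and multiplies it with the vector x):

__global__ void matvec_rowwise(const float *A, const float *x, float *y, int n)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;   // the row assigned to this thread
    if (row < n) {
        float sum = 0.0f;
        for (int col = 0; col < n; ++col)              // dot product of row 'row' with x
            sum += A[row * n + col] * x[col];
        y[row] = sum;
    }
}

In a 2D partitioning, A would instead be divided into blocks, with each processing element owning one block together with the corresponding pieces of x and y.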

6. Define Matrix-Matrix Multiplication with examples


https://youtu.be/YmugJ2SLA5g
https://youtu.be/ejLoqAE1vgc
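
A minimal sketch of a 2D partitioning of the output matrix (a CUDA kernel, offered as an illustration; each thread computes one element of C = A × B for n × n row-major matrices):

__global__ void matmul(const float *A, const float *B, float *C, int n)
{
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < n && col < n) {
        float sum = 0.0f;
        for (int k = 0; k < n; ++k)
            sum += A[row * n + k] * B[k * n + col];    // inner product of row and column
        C[row * n + col] = sum;
    }
}

// Launch (illustrative): dim3 block(16, 16);
//                        dim3 grid((n + 15) / 16, (n + 15) / 16);
//                        matmul<<<grid, block>>>(d_A, d_B, d_C, n);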

7. State and explain Cannon's Algorithm


https://youtu.be/ZaNxMTjUB0w
https://youtu.be/Z-LWWpH2ScE

Or

https://youtu.be/Vk-mJcj7y_I

8. State and explain Dense Matrix Algorithm


=>
Characteristics:
Operate on matrices in which most elements are non-zero
Utilize parallelism for efficient computation
Break tasks down for concurrent processing

Advantages:
Efficient
High performance
Faster processing

Limitations:
Require significant memory
High computational complexity for large matrices
Not suitable for sparse matrices with mostly zero elements

Applications:
Scientific simulations, data analytics, and machine learning
Image/signal processing, data compression, and recommendation systems

9. Explain Scalability in Parallel Systems



UNIT 5

1. What are the issues in Sorting on Parallel Computers?


=>
Issues:
Where the input and output are stored:
Determining the optimal storage location for input and output data in a parallel sorting
algorithm is crucial for efficient data access.

How comparisons are performed:


Sorting algorithms rely on comparisons between elements to arrange them in the desired
order.
Techniques like data partitioning and efficient communication protocols are used to
perform these comparisons efficiently across multiple processing elements.

Load Balancing:
Load balancing involves distributing the computational workload evenly across the
available processing elements.
Load balancing techniques aim to minimize idling caused by uneven workloads and
differing computational complexities.

Data Locality:
Data locality refers to the proximity of data to the processing element that operates on it.
Efficient utilization of data locality can minimize data transfer and communication
overhead, leading to improved performance.
Techniques like data partitioning and data replication are used to enhance data locality in
parallel sorting.

Scalability:
Scalability refers to the ability of a parallel sorting algorithm to handle larger problem sizes
and utilize increasing computational resources effectively.
Achieving this requires the algorithm to be designed with scalability in mind.

2. Explain Parallelizing Quick Sort


=>
Parallelizing Quick Sort involves dividing the sorting task among multiple processors to
accelerate the sorting process.

The basic steps of Quick Sort are as follows:

Choose a pivot element from the array.
Partition the array into two sub-arrays, one with elements smaller than the pivot and the
other with elements larger than the pivot.
Recursively apply Quick Sort to the two sub-arrays.

To parallelize Quick Sort:


Partitioning Phase:
Each processor takes a subset of the array.
Each processor selects a pivot element from its subset.
Processors exchange elements based on the pivot to separate smaller and larger
elements.

Recursive Sorting Phase:


Sub-arrays are created with elements less than or greater than the pivot.
Each processor independently applies Quick Sort on its subset.

Combining Phase:
The sorted sub-arrays need to be combined into one sorted array.
Use a parallel merge algorithm or communication/synchronization between processors
(a minimal shared-memory sketch is given after this answer).

Advantages:
Improved performance
Scalability
Efficient resource utilization

Challenges:
Load balancing
Data partitioning
Communication and synchronization overhead

https://youtu.be/UO5cQ5G9DFI
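
Below is a minimal shared-memory sketch of the recursive-sorting idea (plain C++ with std::async; it illustrates task parallelism on one machine, not the message-passing formulation described above):

#include <algorithm>
#include <future>
#include <vector>

// Sort a[lo, hi). The two partitions are sorted concurrently on the first
// few recursion levels; deeper levels fall back to sequential recursion.
void parallel_quicksort(std::vector<int>& a, long lo, long hi, int depth)
{
    if (hi - lo < 2) return;
    int pivot = a[lo + (hi - lo) / 2];

    // Three-way partition: [lo, m1) < pivot, [m1, m2) == pivot, [m2, hi) > pivot.
    auto m1 = std::partition(a.begin() + lo, a.begin() + hi,
                             [pivot](int x) { return x < pivot; });
    auto m2 = std::partition(m1, a.begin() + hi,
                             [pivot](int x) { return x == pivot; });
    long left_end = m1 - a.begin(), right_begin = m2 - a.begin();

    if (depth < 3) {
        // Spawn a task for the left part; sort the right part on this thread.
        auto left = std::async(std::launch::async, parallel_quicksort,
                               std::ref(a), lo, left_end, depth + 1);
        parallel_quicksort(a, right_begin, hi, depth + 1);
        left.get();   // wait for the spawned task (the "combining" step here)
    } else {
        parallel_quicksort(a, lo, left_end, depth + 1);
        parallel_quicksort(a, right_begin, hi, depth + 1);
    }
}
// Usage: parallel_quicksort(v, 0, (long)v.size(), 0);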

3. Explain All-Pairs Shortest Paths?


=>
The all-pairs shortest paths problem asks for the shortest path between every pair of
vertices in a weighted directed graph. One approach is to run Dijkstra's single-source
algorithm once from every vertex.
Serial execution time (Ts): Θ(n³)
It can be parallelized using two approaches: source partitioned and source parallel.

a) Source partitioned:
The source vertices are partitioned among the processors; each processor runs the
sequential single-source algorithm from its assigned sources (p ≤ n).
Parallel execution time (Tp): Θ(n²) when p = n

Advantages:
Clearly defined partitioning
Each processor can compute shortest path independently
Suitable for distributed memory systems

Limitations:
Overhead of exchanging information between processors
Load imbalancing

b) Source parallel:
Used when p > n: the processors are split into n groups, and each group runs a parallel
formulation of the single-source algorithm for one source vertex.

Advantages:
Parallelism is achieved by distributing the workload among multiple processors.
Improved performance.
Suitable for large graphs

Limitations:
Communication and synchronization overhead
Load imbalancing

Dijkstra Algorithm
https://youtu.be/84y-fHI008M

Floyd Warshall Algorithm


https://youtu.be/waG9itqH-EI
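
For reference, a minimal serial sketch of Floyd–Warshall (C++; d is the n × n distance matrix initialized from the edge weights, with INF as an assumed "no edge" marker). For a fixed k, the (i, j) updates are independent, which is what parallel formulations exploit:

#include <algorithm>
#include <vector>

const int INF = 1000000000;   // "no edge" marker (assumed convention)

void floyd_warshall(std::vector<std::vector<int>>& d, int n)
{
    for (int k = 0; k < n; ++k)            // allow vertex k as an intermediate
        for (int i = 0; i < n; ++i)
            for (int j = 0; j < n; ++j)
                if (d[i][k] < INF && d[k][j] < INF)
                    d[i][j] = std::min(d[i][j], d[i][k] + d[k][j]);
}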

4. Explain dynamic load balancing scheme


=>
Asynchronous round robin:
Each processor maintains its own local counter (target). When a processor runs out of
work, it sends a work request to the processor indicated by its counter and then
increments the counter modulo p, so successive requests go to processors in round-robin order.

Global round robin:


The system maintains a global counter. Each processor requests task in round-robin
fashion, ensuring that tasks are distributed equally among processors globally.

Random polling:
A processor needing work sends its request to a randomly selected processor. Over time
this spreads requests evenly and helps prevent any particular processor from becoming a
bottleneck (a small target-selection sketch follows this answer).
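
A small sketch of how the donor (target) processor might be selected under two of these schemes (plain C++; my_id, p > 1 and the counter are assumed per-processor state, and the actual work transfer is omitted):

#include <cstdlib>

// Asynchronous round robin: advance a per-processor counter, skipping ourselves.
int next_target_arr(int& local_counter, int my_id, int p)
{
    int target = local_counter;
    local_counter = (local_counter + 1) % p;
    if (target == my_id) {                       // never request work from ourselves
        target = local_counter;
        local_counter = (local_counter + 1) % p;
    }
    return target;
}

// Random polling: pick any other processor uniformly at random (assumes p > 1).
int next_target_random(int my_id, int p)
{
    int target = std::rand() % p;
    while (target == my_id)
        target = std::rand() % p;
    return target;
}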

5. Compare various communication strategies


=>
Message Passing Interface (MPI):
Widely used for communication in distributed memory systems.
Message-based communication between processes.
Explicit coding required.
Flexible and portable.

Shared Memory:
Processes access a shared memory region.
Communication through reading and writing to shared memory.
Simple implementation, high performance on shared memory systems.
Requires synchronization.

Remote Procedure Calls (RPC):


Processes call functions on remote processes.
Abstracts communication as local function calls.
Simplifies programming.
Used in distributed and shared memory systems.

Publish-Subscribe:
Publishers send messages to a central broker, subscribers receive relevant messages.
Loosely coupled communication.
Simplifies system design, dynamic scalability.
Used in event-driven systems and publish-subscribe frameworks.

Direct Memory Access (DMA):


Direct data transfers between I/O devices and memory without CPU involvement.
Reduces CPU overhead, improves data transfer efficiency.
Used in high-performance computing and data-intensive applications.

6. Explain Bubble Sort and its variants


https://youtu.be/5-xExK9Wf1o

https://youtu.be/Sazh4Y-WlDk

7. Explain Algorithm for sparse graph


https://youtu.be/ShYMsLp8rAA

8. Dense Graph vs Sparse Graph



9. Describe Parallel Depth-First Search


https://youtu.be/dkp9KvUtrWo

https://youtu.be/embRDiiH-ts

10. Explain Parallel Best-First Search


https://youtu.be/alxqDHJg_q0

UNIT 6

1. Explain CUDA architecture with example and its application


=>
CUDA (Compute Unified Device Architecture) is a parallel computing platform and
programming model developed by NVIDIA.
It allows developers to harness the power of NVIDIA GPUs for general-purpose
computing tasks.

Components:
• Host (CPU):
It is responsible for managing the overall execution of the program and coordinating tasks.

• Device (GPU):
It performs parallel computations

• Kernels:
Kernels are parallel functions that are executed on the GPU
Written in the CUDA C/C++ language

• Thread Hierarchy:
Threads are lightweight, independent units of execution that run on the GPU. Threads
are organized hierarchically: individual threads are grouped into thread blocks, and
thread blocks are grouped into a grid (see the indexing sketch after this list).

• Grid:
A grid is the collection of thread blocks that execute independently on the GPU.

• Memory Hierarchy:
The CUDA architecture includes various memory types that are accessible by
threads, including registers (private to each thread), shared memory (shared
within a thread block) and global memory (accessible by all threads).
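
As a small illustration of this hierarchy (a hedged sketch: the kernel only records each thread's global index, computed from its block and thread coordinates):

__global__ void show_hierarchy(int *global_ids, int n)
{
    // threadIdx.x: position inside the block; blockIdx.x: position of the block in the grid.
    int global_id = blockIdx.x * blockDim.x + threadIdx.x;
    if (global_id < n)
        global_ids[global_id] = global_id;       // one entry per thread in the grid
}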

CUDA Applications
Medical Imaging
Computational Finance
Oil and Gas Exploration
Data science and analytics
Deep learning and machine learning

Benefits:
Massive Parallelism
Accelerated Performance
Heterogeneous Computing
Programming Flexibility
Wide Adoption and Support

Limitations:
Hardware Dependency
Learning Curve
Memory Limitations
Limited Software Support
Development Complexity

https://youtu.be/Ongct-wmYxo

2. Explain Heterogeneous system architecture


=>

3. Explain CUDA program flow:


=>

• Load data into CPU memory
• Copy data from CPU to GPU memory
• Call the GPU kernel, passing the device (GPU) variables
• Copy results from GPU back to CPU memory
• Use the results on the CPU (a complete end-to-end sketch follows below)
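
A minimal end-to-end sketch of this flow (a hypothetical kernel that doubles each element of a small array; error checking omitted for brevity):

#include <cuda_runtime.h>
#include <cstdio>

__global__ void double_elements(float *data, int n)
{
    int i = threadIdx.x;                 // one block, one thread per element
    if (i < n)
        data[i] = 2.0f * data[i];
}

int main()
{
    const int n = 256;
    float h_data[n];                                     // step 1: data in CPU memory
    for (int i = 0; i < n; ++i) h_data[i] = (float)i;

    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));
    cudaMemcpy(d_data, h_data, n * sizeof(float),
               cudaMemcpyHostToDevice);                  // step 2: CPU -> GPU copy

    double_elements<<<1, n>>>(d_data, n);                // step 3: launch the kernel

    cudaMemcpy(h_data, d_data, n * sizeof(float),
               cudaMemcpyDeviceToHost);                  // step 4: GPU -> CPU copy
    cudaFree(d_data);

    printf("h_data[10] = %f\n", h_data[10]);             // step 5: use results on CPU
    return 0;
}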

4. CUDA programming model


https://youtu.be/X2hcpD7pUAM

5. Write short note on CUDA C


=>
CUDA C is a programming language and framework for GPU programming.
It is an extension of the C programming language that provides a set of APIs and libraries
allowing developers to utilize the parallel processing power of GPUs.
It allows developers to write kernel functions that can be executed in parallel on the
GPU.

Key features of CUDA C:


• Kernel Functions:
It allows writing parallel kernel functions that run on GPU

• Thread Hierarchy:
It organizes threads into blocks, and blocks into a grid.

• Memory Management:
It offers explicit memory management for global memory (accessible by all
threads) and shared memory (accessible within a block)

• Libraries and Tools:
It provides a collection of libraries and tools that simplify development.

Benefits of CUDA C:
High Performance
Flexibility
Broad GPU Support

Limitations of CUDA C:
Dependency on NVIDIA GPUs
Steeper learning curve
Limited portability

6. Explain CUDA Kernel


=>
A CUDA kernel is a function in CUDA C/C++ that is designed to be executed in parallel on
the GPU.
It is the code that runs on the GPU and performs the computation for a specific task.

Writing CUDA Kernel:


Define the kernel function using the __global__ specifier:
__global__ void kernel_name (argument_list)
Functions can also be qualified with __host__ or __device__.
Specify the input and output parameters.
Write the computation or task that needs to be performed by each thread.

Launching CUDA Kernel:


Allocate memory on the GPU for the input and output data.
Copy the input data from CPU to GPU memory using cudaMemcpy.
Call the CUDA kernel function using the <<< >>> syntax, passing the required
parameters:
kernel_name <<< grid, block >>> (argument_list)
The GPU launches multiple threads to execute the kernel function in parallel.
After the kernel completes, copy the results from GPU to CPU memory using cudaMemcpy
(a concrete example follows below).
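
A concrete example of the definition and launch-configuration step (a hedged sketch; saxpy, the block size of 256, and the device pointers d_x and d_y are illustrative choices, assumed already allocated and filled):

__global__ void saxpy(int n, float a, const float *x, float *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index
    if (i < n)                                       // guard: the last block may have extra threads
        y[i] = a * x[i] + y[i];
}

// Launch: enough blocks of 256 threads to cover all n elements.
// int block = 256;
// int grid  = (n + block - 1) / block;
// saxpy<<<grid, block>>>(n, 2.0f, d_x, d_y);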

7. Managing Communication and Synchronization


=>
Data Transfer:
Efficiently transferring data between CPU and GPU memories using cudaMemcpy, taking
into account data size and memory access patterns.

Inter-Thread Communication:
Implementing mechanisms for inter-thread communication, such as shared memory, to
facilitate data sharing and coordination between threads within a block.

__syncthreads():
It is a barrier synchronization primitive in CUDA: every thread of a block must reach the
barrier before any of them continues, which makes writes to shared memory visible to the
other threads in the block. It is typically used together with shared memory when the
threads of a block exchange intermediate results (see the reduction sketch below).
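
A minimal sketch of shared memory plus __syncthreads() in action (a per-block sum reduction; the block size is assumed to be a power of two):

__global__ void block_sum(const float *in, float *block_results, int n)
{
    extern __shared__ float sdata[];                 // shared memory, one float per thread
    int tid = threadIdx.x;
    int i   = blockIdx.x * blockDim.x + threadIdx.x;

    sdata[tid] = (i < n) ? in[i] : 0.0f;             // each thread loads one element
    __syncthreads();                                 // make all loads visible to the block

    for (int s = blockDim.x / 2; s > 0; s >>= 1) {   // tree reduction in shared memory
        if (tid < s)
            sdata[tid] += sdata[tid + s];
        __syncthreads();                             // wait before the next round
    }
    if (tid == 0)
        block_results[blockIdx.x] = sdata[0];        // one partial sum per block
}
// Launch (illustrative): block_sum<<<grid, block, block * sizeof(float)>>>(d_in, d_partial, n);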

8. Explain Apache Hadoop


=>
Apache Hadoop is an open-source framework for distributed storage and processing of
large datasets. It provides a scalable and fault-tolerant platform for handling big data
analytics

Components:

• HDFS:
Distributed file system for storing and accessing large data files.

• MapReduce:
Programming model for parallel processing and analysis of large datasets.

• YARN:
Resource management framework for scheduling jobs and allocating resources in
a Hadoop cluster.

• Hadoop Ecosystem:
Collection of tools and frameworks that extend Hadoop's capabilities.

• Fault-tolerance and Scalability:


Hadoop handles failures without data loss and can scale by adding more
machines to the cluster.

9. Explain Apache Spark


=>
Apache Spark is an open-source distributed computing system designed for big data
processing and analytics. It provides a fast and flexible framework for processing large
datasets in parallel

Features:

• Speed:
Fast in-memory processing.

• Distributed computing:
It distributes data and computation across machines for efficient parallel
processing.

• Resilient Distributed Datasets (RDDs):

Fault-tolerant, distributed collections of data that can be cached in memory and
recomputed from their lineage after a failure.

• Data processing capabilities:


Spark offers APIs for batch processing, real-time streaming, ML and SQL queries.

• Scalability:
Handles large data and scales from single machines to clusters.

• Integration:
Works well with popular big data tools.

• Ease of use:
User-friendly API supporting multiple languages.

10. Explain Apache Flink


=>
Apache Flink is an open-source stream processing and batch processing framework
designed to process large-scale data in real-time.

Features:
• Stream Processing:
It enables real-time data processing and analysis

• Batch Processing:
It efficiently executes complex batch jobs.

• Fault Tolerance:
It allows recovery from failures without data loss.

• Event Time Processing:

It processes events according to their event timestamps, handling out-of-order and late
data accurately.

• Scalability:
It scales horizontally to handle large data volumes, automatically parallelizing
computations

• Integration:
It seamlessly integrates with other big data ecosystems, like Kafka, Hadoop, etc.

• Dynamic Updates:
It supports dynamic updates to running jobs.

11. Explain OpenCL


=>
OpenCL (Open Computing Language) is an open standard for parallel programming
across heterogeneous devices, including CPUs, GPUs, and other accelerators. It allows
developers to write programs that can utilize the processing power of multiple devices

Features:
• Platform and Device Independence:
It allows developers to write code that runs on different hardware platforms and
devices.

• Heterogeneous Computing:
Utilize multiple devices simultaneously for efficient processing.

• Parallel Execution:
Work is divided into many small work-items that are executed concurrently.

• Portability:
Write code once and run it on various hardware platforms.

• Community and Ecosystem:


Active community support and rich set of tools and libraries.
