
Karim Salah Ahmed

202003047

1. Compare between: 1) Problem-constrained scaling, 2) Memory-constrained scaling, and 3) Time-constrained scaling (TC).

1. Problem-Constrained Scaling (PC)

Definition: Problem-constrained scaling keeps the problem size fixed and adds processors with the goal of solving the same problem in less time. This is commonly called strong scaling.

Characteristics:

Workload: The total problem size is fixed, so the work per processor shrinks as the number of processors increases.

Objective: To evaluate how much faster the same problem can be solved as more resources are added.

Performance Metric: Speedup, the ratio of the single-processor execution time to the p-processor execution time, together with the corresponding parallel efficiency.

Use Cases: Suitable for scenarios where the problem size is dictated by the application and time-to-solution is what matters, such as meeting a real-time or turnaround deadline.

Advantages:

Directly reflects the practical benefit of reducing time-to-solution for an existing workload.

Easy to measure and interpret.

Disadvantages:

Speedup is limited by the serial fraction (Amdahl's Law) and by communication overhead; as processors are added, the work per processor eventually becomes too small to amortize these costs.

……………………………………………………………………………………………………..

2. Memory-Constrained Scaling (MC)

Definition: Memory-constrained scaling grows the problem as processors are added so that the memory usage per processor stays constant; the aggregate memory of the machine determines how large a problem can be solved. This is closely related to weak scaling.

Characteristics:

Workload: The problem size grows with the number of processors so that each processor's share of the data roughly fills its local memory.

Objective: To solve the largest problem that fits in the combined memory of all processors, including problems too large for any single node.

Performance Metric: Scaled speedup, measured as the increase in work completed per unit time relative to a single processor.

Advantages:

Allows solving problems that require more memory than is available on a single processor.

Effective for memory-intensive applications such as large databases and high-resolution scientific simulations.

Disadvantages:

Execution time can grow if the computational work increases faster than the memory footprint (for example, O(n^3) work on an O(n^2) data set).

Performance can also be limited by communication overhead between processors.

………………………………………………………………………………….

3. Time-Constrained Scaling (TC)

Definition: Time-constrained scaling holds the execution time fixed and asks how much larger a problem can be solved in that time as processors are added. This is the viewpoint behind Gustafson's scaled speedup.

Characteristics:

Workload: The problem size grows with the number of processors so that the total execution time stays approximately constant.

Objective: To evaluate how much more work can be completed within a fixed time budget as resources are added.

Performance Metric: Scaled speedup, the ratio of the work completed on p processors to the work completed on one processor in the same amount of time.

Advantages:

Matches many practical settings where users have a fixed time budget and want finer resolution, larger data sets, or more complex models.

Gives a less pessimistic picture of large machines than fixed-size (Amdahl-style) analysis.

Disadvantages:

Deciding how to grow the problem in a meaningful way is application-dependent.

Memory, rather than time, may become the limiting factor as the problem is scaled up.
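For reference, one common way to express the speedup in each model (following the PC/TC/MC taxonomy, with Time and Work denoting execution time and total work as functions of the processor count p for the problem size each model prescribes):

\[
S_{\mathrm{PC}}(p) = \frac{\mathrm{Time}(1)}{\mathrm{Time}(p)}, \qquad
S_{\mathrm{TC}}(p) = \frac{\mathrm{Work}(p)}{\mathrm{Work}(1)}, \qquad
S_{\mathrm{MC}}(p) = \frac{\mathrm{Work}(p)/\mathrm{Time}(p)}{\mathrm{Work}(1)/\mathrm{Time}(1)}.
\]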
2. Write pseudocode for summing M numbers with N processors using shared and distributed memory.

Shared Memory Approach


global_sum = 0

parallel for i from 0 to N-1:
    local_sum = 0
    start_index = (M / N) * i
    end_index   = (M / N) * (i + 1)
    if i == N - 1:                  # last processor takes any remainder
        end_index = M

    for j from start_index to end_index - 1:
        local_sum += numbers[j]

    acquire(lock)                   # protect the shared accumulator
    global_sum += local_sum
    release(lock)
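A minimal C/OpenMP sketch of this shared-memory approach; the array name numbers and the length M are carried over from the pseudocode and are illustrative:

#include <omp.h>

/* Shared-memory sum: each thread accumulates a private partial sum over
 * its share of the iterations, then adds it to the shared total inside
 * a critical section. */
double shared_memory_sum(const double *numbers, long M) {
    double global_sum = 0.0;

    #pragma omp parallel
    {
        double local_sum = 0.0;

        #pragma omp for
        for (long j = 0; j < M; ++j)
            local_sum += numbers[j];

        /* serialize the final accumulation into the shared variable */
        #pragma omp critical
        global_sum += local_sum;
    }
    return global_sum;
}

The same effect can be obtained more concisely with a reduction(+:global_sum) clause, which is exactly the construct discussed in question 7.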

Distributed Memory Approach


master_process:
    split numbers into N subarrays of size M / N (the last one takes any remainder)
    send subarray i to processor i, for i from 0 to N-1

each processor i (in parallel):
    receive subarray from master
    local_sum = 0
    for j from 0 to length(subarray) - 1:
        local_sum += subarray[j]
    send local_sum to master

master_process:
    global_sum = 0
    for each processor i from 0 to N-1:
        receive local_sum from processor i
        global_sum += local_sum
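A compact MPI sketch of the distributed-memory version; it assumes for simplicity that M is divisible by the number of processes, and uses MPI_Scatter/MPI_Reduce in place of the explicit send/receive loops (the function name is illustrative):

#include <mpi.h>
#include <stdlib.h>

/* Distributed-memory sum: rank 0 scatters equal-sized chunks, every rank
 * sums its own chunk, and the partial sums are combined with a reduction
 * on rank 0. */
double distributed_sum(const double *numbers, long M) {
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    const int chunk = (int)(M / nprocs);        /* assume M % nprocs == 0 */
    double *sub = malloc(chunk * sizeof(double));

    /* distribute the data: each rank receives its own chunk */
    MPI_Scatter(numbers, chunk, MPI_DOUBLE,
                sub, chunk, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    double local_sum = 0.0;
    for (int j = 0; j < chunk; ++j)
        local_sum += sub[j];

    double global_sum = 0.0;                    /* meaningful on rank 0 only */
    MPI_Reduce(&local_sum, &global_sum, 1, MPI_DOUBLE,
               MPI_SUM, 0, MPI_COMM_WORLD);

    free(sub);
    return global_sum;
}

Here MPI_Reduce plays the role of the explicit receive-and-accumulate loop in the master process of the pseudocode above.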

3. Is a program that obtains linear speedup strongly scalable? Explain your answer.

Yes, a program that obtains linear speedup is strongly scalable.

Explanation:

Strong scalability (strong scaling) refers to the ability of a parallel program to keep its efficiency constant as the number of processors increases while the problem size is held fixed. Linear speedup means the speedup equals the number of processors, so the parallel efficiency stays at 1 for any processor count (see the short derivation below). The program therefore satisfies the definition of strong scalability.
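In symbols, with T(p) the execution time on p processors for the fixed problem size:

\[
S(p) = \frac{T(1)}{T(p)} = p
\quad\Longrightarrow\quad
E(p) = \frac{S(p)}{p} = 1 \ \text{for every } p.
\]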
4. Assume the runtime of a program is 100 seconds for a problem of size 1. The program consists of an initialization phase, which lasts 10 seconds and cannot be parallelized, and a problem-solving phase, which can be perfectly parallelized and whose runtime grows quadratically with increasing problem size.

i. What is the speedup of the program as a function of the number of processors p and the problem size n?

ii. What is the execution time and speedup of the program with problem size 1, if it is parallelized and run on 4
processors?

iii. What is the execution time of the program if the problem size is increased by a factor of 4 and it is run on 4 processors? And on 16 processors? What is the speedup in both measurements?
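A worked sketch of the three parts, assuming the 10-second initialization is purely serial, the solving phase (90 seconds at size 1) parallelizes perfectly and grows quadratically with n, and "increased by a factor of 4" means n = 4:

\[
T(n,p) = 10 + \frac{90\,n^2}{p}, \qquad
S(n,p) = \frac{T(n,1)}{T(n,p)} = \frac{10 + 90\,n^2}{10 + 90\,n^2/p}.
\]

ii. For n = 1 and p = 4: T = 10 + 90/4 = 32.5 s, so S = 100 / 32.5 ≈ 3.08.

iii. For n = 4, the solving phase takes 90 × 16 = 1440 s on one processor, so T(4,1) = 1450 s. On 4 processors: T = 10 + 1440/4 = 370 s and S = 1450/370 ≈ 3.92. On 16 processors: T = 10 + 1440/16 = 100 s and S = 1450/100 = 14.5.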
5. A program P consists of parts A, B, and C. Part A is not parallelized, while parts B and C are parallelized. The program is run on a computer with 10 cores. For a given problem size, the execution of the program on one of the cores takes 10, 120, and 53 seconds for parts A, B, and C, respectively. Answer the following questions:

i. What is the minimum execution time we can attain for each part if we now execute the same program with the same problem size on 5 of the cores? What is the best speedup we can expect for the entire program?

ii. What if we run it using the 10 cores?
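A worked sketch, assuming part A remains serial, parts B and C parallelize perfectly across the cores, and the three parts run one after another (single-core total: 10 + 120 + 53 = 183 s):

\[
T_5 = 10 + \frac{120}{5} + \frac{53}{5} = 44.6\ \text{s}, \qquad
S_5 = \frac{183}{44.6} \approx 4.1,
\]
\[
T_{10} = 10 + \frac{120}{10} + \frac{53}{10} = 27.3\ \text{s}, \qquad
S_{10} = \frac{183}{27.3} \approx 6.7.
\]

So on 5 cores the minimum per-part times are 10 s, 24 s, and 10.6 s; on 10 cores they are 10 s, 12 s, and 5.3 s.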

6. Design a multi-threaded, superscalar dual-core processor. The processor executes up to two instructions per clock from one instruction stream on each core (one SIMD instruction + one scalar instruction). It can also switch to executing the other instruction stream when faced with a stall.
7. Given the following program parallelized with OpenMP:

Write an equivalent version of the program without the reduction annotation.

#include <omp.h>

double calculate_pi(double step) {
    int i;
    double sum = 0.0;

    #pragma omp parallel private(i)
    {
        double x, local_sum = 0.0;

        #pragma omp for
        for (i = 0; i < 1000000; ++i) {
            x = (i + 0.5) * step;
            local_sum += 2.0 / (1.0 + x * x);
        }

        /* replaces the reduction: each thread adds its private partial
         * sum to the shared accumulator inside a critical section */
        #pragma omp critical
        {
            sum += local_sum;
        }
    }

    return step * sum;
}
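For comparison, the reduction-based program that the question refers to is not reproduced in this document; a plausible form of it, consistent with the critical-section version above, would be:

#include <omp.h>

/* Hypothetical reconstruction of the original version, using the
 * reduction clause instead of a private partial sum plus a critical
 * section. */
double calculate_pi(double step) {
    int i;
    double x, sum = 0.0;

    #pragma omp parallel for private(x) reduction(+:sum)
    for (i = 0; i < 1000000; ++i) {
        x = (i + 0.5) * step;
        sum += 2.0 / (1.0 + x * x);
    }

    return step * sum;
}

Both versions compute the same result; the reduction clause simply lets the compiler generate the per-thread partial sums and the final combination automatically.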

8. The pseudocode for the sequential algorithm is provided below for a 2D-grid-based solver (a partial differential equation (PDE) on an (N+2) x (N+2) grid).

i. Write the pseudocode for the shared address space solver.

#include <omp.h>
#include <math.h>                   // for fabs()
#include <stdbool.h>

extern const int n;                 // interior grid dimension N, defined elsewhere
extern float* A;                    // assume allocated to a grid of (N+2) x (N+2) elements

void solve(float* A) {
    float diff, prev;
    bool done = false;

    #pragma omp parallel
    {
        while (!done) {             // outermost loop: sweep until converged

            // one thread resets the shared accumulator; the implicit barrier
            // at the end of 'single' keeps the other threads from racing ahead
            #pragma omp single
            diff = 0.0f;

            #pragma omp for reduction(+:diff) private(prev)
            for (int i = 1; i <= n; i++) {          // iterate over non-border points of the grid
                for (int j = 1; j <= n; j++) {
                    prev = A[i * (n + 2) + j];
                    A[i * (n + 2) + j] = 0.2f * (A[(i - 1) * (n + 2) + j] + A[(i + 1) * (n + 2) + j] +
                                                 A[i * (n + 2) + (j - 1)] + A[i * (n + 2) + (j + 1)] +
                                                 prev);
                    diff += fabs(A[i * (n + 2) + j] - prev);    // accumulate amount of change
                }
            }   // implicit barrier of the work-sharing loop: diff now holds the full sum

            #pragma omp single
            {
                if (diff / (n * n) < TOLERANCE) {   // quit if converged (TOLERANCE defined elsewhere)
                    done = true;
                }
            }

            #pragma omp barrier     // redundant with the implicit barrier of 'single' above,
                                    // kept so every thread clearly re-reads 'done' afterwards
        }
    }
}

ii. Compare between the shared address space and message passing programming models in terms of computation, communication, and synchronization.
Computation:

Shared Address Space:
Threads share a common address space, making it easy to access and update shared data.
Parallel regions and work-sharing constructs divide the work among threads (e.g., OpenMP).

Message Passing:
Each process has its own local memory, and data exchange happens through messages.
Explicit communication is required to share data between processes (e.g., MPI).

Communication:

Shared Address Space:
Implicit communication through reads and writes of shared variables.
Easier to program for simple data sharing, but requires careful synchronization to avoid race conditions.

Message Passing:
Explicit communication via sending and receiving messages.
Clearer control over data movement, which can help performance tuning but requires more complex code.

Synchronization:

Shared Address Space:
Synchronization is typically managed with locks, barriers, and atomic operations.
Easier to run into issues such as deadlocks and race conditions.

Message Passing:
Synchronization is achieved through the exchange of messages (e.g., blocking sends/receives and collective operations).
No locks are needed, but the communication pattern must be designed carefully to avoid bottlenecks and ensure correctness.
9. For the multi-core processor shown below, find the following.

i. Number of ALUs per core and the total number of ALUs.


In the diagram, each core appears to have 4 ALUs (the yellow boxes), and there are 8 cores in total (2 rows of 4). Therefore:
Number of ALUs per core = 4
Total number of ALUs = 4 × 8 = 32

ii. Number of threads per core.

Each core appears to support 2 threads (indicated by the two blue sections).

iii. Number of simultaneous and concurrent instruction streams.

Number of instruction streams per core = 2
Total number of instruction streams = 2 × 8 = 16

iv. Number of independent pieces of work needed to run the chip with maximal latency-hiding ability.

Given that each core has 2 threads and there are 8 cores:
Number of independent pieces of work = 2 × 8 = 16


Examine the task graph shown below. Each task, which you can think of as an async, is labeled with its runtime. Answer
the following four questions about the program’s runtime. In all cases you may ignore any work scheduling or task
spawning overheads.

i. Assuming a single worker thread (X10_NTHREADS=1), what is the runtime of this program?

ii. What is the speedup when X10_NTHREADS=8?


iii. If each parallel task (PN = X ms) were parallelized further to become two parallel tasks (QN = X/2 ms, RN = X/2 ms), and the program were again run with X10_NTHREADS=8, what would the runtime be? What is the speedup relative to the previous run?

iv. Why is the speedup not 2X?

The speedup is not 2X because of the overheads involved in parallelization and the presence of serial components in the task graph. The serial tasks (S1, S2) and the synchronization points before and after the parallel tasks contribute a non-parallelizable portion to the total runtime. This inherent serialization limits the achievable speedup, in line with Amdahl's Law.
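For reference, Amdahl's Law makes this limit precise: if only a fraction f of the original runtime benefits from a speedup factor s (here, only the parallel tasks are halved, so f < 1), the overall speedup is bounded by

\[
S = \frac{1}{(1 - f) + f/s} \;<\; \frac{1}{1 - f}.
\]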
