Mid Sem QP & Solution


Mid-Semester Examination, February 2023

VIII Semester Computer Science and Engineering

Subject: Introduction to Parallel Processing (CSL447)    Slot: D


Date/Time: 24/02/2023 (3:00 – 4:30 pm)    Max. Marks: 25
---------------------------------------------------------------------------------------------------------------------

Q1.

a. Consider a memory system with a cache of 64 KB and a DRAM of 1024 MB, with the
processor operating at 2 GHz. The latency to the cache is two cycles and the latency to
DRAM is 200 cycles. In each memory cycle, the processor fetches four words (cache line
size of four words). Calculate the peak achievable performance of the following piece of
code for the dot product of two vectors. CO-1, CO-2 (3)

/* dot product loop */

for (i = 0; i < n; i++)
    dot_prod += a[i] * b[i];

Ans. The two vectors lie in different cache lines, so the processor fetches four words of
a in 200 cycles and four words of b in another 200 cycles; hence four multiply-add
operations can be performed every 400 cycles. (The cache adds only 2 cycles of latency;
at 2 GHz this is 0.5 ns + 0.5 ns = 1 ns, which is negligible.)
At 2 GHz, 400 cycles take 200 ns. This corresponds to 1 FLOP (counting one multiply-add
as one FLOP) every 50 ns, for a peak speed of 1/(50 ns) = 20 MFLOPS.
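
The same arithmetic written out (a worked restatement of the solution's numbers,
counting one multiply-add as one FLOP):

    (4 FLOPs / 400 cycles) × (2 × 10^9 cycles/s) = 2 × 10^7 FLOP/s = 20 MFLOPS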
Q1.

b. Using a suitable example (parallel code), explain in which situations the firstprivate
and lastprivate clauses are used in OpenMP. Also show how they affect the scope of the
variable. CO-1 (2+2)

Ans. A private variable in an enclosed parallel region is undefined before and after the
construct, even if it was defined prior to the parallel region.
Thus, if its value is needed after the parallel region, or if it must be initialized before
the loop so that all threads can use it, the lastprivate and firstprivate clauses are used,
respectively.

firstprivate clause: each thread's private copy is pre-initialized with the value of the
variable of the same name before the construct.
lastprivate clause: the value of the private copy from the sequentially last iteration is
retained in the variable of the same name after the construct.

Ex of lastprivate:

#pragma omp parallel for private(i) lastprivate(a)
for (i = 0; i < n; i++) {
    a = i + 1;
    printf("Thread %d has a value of a = %d for i = %d\n",
           omp_get_thread_num(), a, i);
}
printf("Value of a after parallel for: a = %d\n", a); /* a == n; 5 when n == 5 */

Ex of firstprivate:

indx = 4;
#pragma omp parallel default(none) firstprivate(indx) private(i, TID) shared(n, a)
{
    TID = omp_get_thread_num();
    indx += n * TID;   /* indx starts at 4 in every thread, copied in by firstprivate */
    for (i = indx; i < indx + n; i++)
        a[i] = TID + 1;
}
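
For reference, a minimal self-contained program combining both clauses (a sketch; the
value n = 5 is an assumption chosen to match the a == 5 comment above):

#include <stdio.h>
#include <omp.h>

int main(void) {
    int i, a = 0, indx = 4;
    const int n = 5;

    /* lastprivate: the value from the sequentially last iteration
       (i = n - 1) is copied back into a after the loop. */
    #pragma omp parallel for private(i) lastprivate(a)
    for (i = 0; i < n; i++)
        a = i + 1;
    printf("a after parallel for = %d\n", a);        /* prints 5 */

    /* firstprivate: every thread's private copy of indx starts
       with the value indx had before the region (4). */
    #pragma omp parallel firstprivate(indx)
    {
        indx += n * omp_get_thread_num();
        printf("Thread %d: indx = %d\n", omp_get_thread_num(), indx);
    }
    printf("indx after the region = %d\n", indx);    /* still 4 */
    return 0;
}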
Q2.

a. Is a program that obtains linear speedup strongly scalable? CO-2 (3)

Ans. Scalability is the ability of hardware and software to deliver greater computational
power as the amount of resources/hardware is increased.
For software, scalability refers to parallelization efficiency: speedup = t1/tp.
Linear speedup means speedup = P (the number of processors).
Amdahl's law states that the speedup is limited by the fraction of the code that is serial
and cannot be parallelized, and not by the number of processors.
So the answer is NO.
For a fixed problem, the upper limit of the speedup is determined by the serial fraction
of the code. This is called strong scaling.
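
As a worked restatement (a standard form of Amdahl's law, where s is the serial fraction
of the code):

    S(P) = 1 / (s + (1 - s)/P)  <=  1/s

so S(P) = P (linear speedup) is only attainable when s = 0, and for a fixed problem the
speedup saturates at 1/s no matter how many processors are added.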

Q2.

b. Formulate and define various parameters to gauge the performance of interconnection
networks in distributed-memory architectures. Also draw the networks.
CO-1 (4)
Ans.

- The latency, characterized by the diameter (the longest of the shortest paths between
any pair of nodes);
- The number of links, which determines the cost;
- The degree, which determines how practical it is to build such a network;
- The bisection bandwidth (bisection width), which determines how much data we can push
through the network.

(The network drawings are not reproduced here; the mesh and torus are √n × √n grids of
nodes, the complete network connects every pair of nodes, and the linear array is a chain
of n nodes.)

Network                 Latency = Diameter   #links           Degree   Bisection width
2D mesh (√n × √n)       2(√n − 1)            2(n − √n)        4        √n
2D torus (√n × √n)      2⌊√n/2⌋ ≈ √n         2n               4        2√n
Complete (static)       1                    nC2 = n(n−1)/2   n − 1    n²/4
Linear array (static)   n − 1                n − 1            2        1
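
A small sanity check of these formulas (a sketch; assumes n is a perfect square, e.g.
n = 16 processors in a 4 × 4 layout):

#include <stdio.h>
#include <math.h>

int main(void) {
    int n = 16;
    int side = (int)sqrt((double)n);     /* sqrt(n) nodes per row/column */

    printf("2D mesh : diameter=%d links=%d degree=4 bisection=%d\n",
           2 * (side - 1), 2 * (n - side), side);
    printf("2D torus: diameter=%d links=%d degree=4 bisection=%d\n",
           2 * (side / 2), 2 * n, 2 * side);
    printf("complete: diameter=1 links=%d degree=%d bisection=%d\n",
           n * (n - 1) / 2, n - 1, (n * n) / 4);
    printf("linear  : diameter=%d links=%d degree=2 bisection=1\n",
           n - 1, n - 1);
    return 0;
}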


Q3.

a. Using a suitable example (i.e., OpenMP code), show the similarity and difference between
the #pragma omp barrier and #pragma omp taskwait constructs.
CO-1 (3)

Ans.
Similarity between #pragma omp barrier and #pragma omp taskwait:
Both are synchronization points at which execution waits for outstanding work (threads or
spawned tasks) to complete.

Differences:
#pragma omp barrier: every thread in the team waits at the barrier until all threads have
reached it.
#pragma omp taskwait: the encountering task waits only for those child tasks it has itself
spawned.
An example of each is expected; see the sketch below.
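
A minimal sketch contrasting the two constructs (work1 and work2 are hypothetical
placeholder functions):

#include <omp.h>

void work1(void);
void work2(void);

void demo(void) {
    #pragma omp parallel
    {
        work1();
        /* barrier: every thread in the team stops here until
           ALL threads have finished work1(). */
        #pragma omp barrier
        work2();

        #pragma omp single
        {
            #pragma omp task
            work1();
            #pragma omp task
            work2();
            /* taskwait: only the generating task waits here, and
               only for its own child tasks (the two just spawned). */
            #pragma omp taskwait
        }
    }
}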
Q3.

b. Assume the matrix-vector multiplication problem with matrix size 8000 × 8000 elements.
Suppose thread 0 and thread 2 (out of 4 threads in total) are assigned to different
processors. If a cache line contains 64 bytes (8 doubles), is it possible for false sharing
between threads 0 and 2 to occur for any part of the output vector y? Why? What if thread 0
and thread 3 are assigned to different processors; is it possible for false sharing to occur
between them for any part of the output vector y?
CO-3 (5)

Ans.
With the input matrix having 8000 × 8000 elements, the y vector will have 8000 elements.
Thus, with four threads and the default block scheduling of 8000/t elements, where t is
thread_count, the output vector is divided as 8000/4 = 2000 consecutive elements per
thread: thread 0 gets y[0..1999], thread 1 gets y[2000..3999], thread 2 gets y[4000..5999]
and thread 3 gets y[6000..7999].

False sharing occurs when two threads on different processors write to distinct elements
that happen to lie in the same cache line. For example, if thread 0 updates an element
whose cache line also contains an element assigned to thread 1, the update invalidates
that line in the other processor's cache, and thread 1 has to reload the line from memory
even though its own element never changed.

For false sharing to occur between thread 0 and thread 2, there must be elements of y that
belong to the same cache line but are assigned to these two different threads. Assuming
row-major storage, double values, and a 64-byte cache line holding 8 values, this can only
happen at the edge between two threads' blocks. The highest index assigned to thread 0 is
1999, while the lowest index of an element of y assigned to thread 2 is 4000; they are
separated by all 2000 elements of thread 1, far more than one 8-element cache line, so
there can't possibly be a cache line that has elements belonging to both thread 0 and
thread 2.
The same approach applies to threads 0 and 3, which have their least elements as 0 and
6000 respectively.
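
A minimal sketch of the loop the question assumes (the names A, x and y are assumptions,
and the block distribution relies on the typical static default schedule, under which
thread t computes y[2000·t .. 2000·t + 1999]):

#include <omp.h>

#define N 8000

/* y = A * x, with rows (and hence elements of y) block-distributed
   over 4 threads. */
void matvec(const double A[N][N], const double x[N], double y[N]) {
    #pragma omp parallel for num_threads(4)
    for (int i = 0; i < N; i++) {
        double tmp = 0.0;            /* private accumulator, never shared */
        for (int j = 0; j < N; j++)
            tmp += A[i][j] * x[j];
        y[i] = tmp;                  /* the only shared write: the false-sharing candidate */
    }
}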
Q4.

In the following two versions of a program to execute two tasks:

i. Why is nowait used in the second pragma?
ii. What is the difference between the two versions? CO-2 (3)

Version 1:

#pragma omp parallel
{
    #pragma omp single nowait
    {
        #pragma omp task
        a = function1();
        #pragma omp task
        b = function2();
    }
}

Version 2:

#pragma omp parallel
{
    #pragma omp single nowait
    {
        #pragma omp task
        a = function1();
        b = function2();
    }
}

Ans.

The question checks whether the student knows there is an implicit barrier at the end of
the single region. Since another implicit barrier follows immediately at the end of the
parallel construct, the barrier of single is not necessary, so it can be skipped with
nowait. In this example it does not make much difference, but such details are taken into
account to improve the performance of a program.
i. single with nowait behaves like the master construct (no implied barrier).
ii. Because of the single construct there is effectively no difference: in Version 1
function2() runs as a task, while in Version 2 it is executed directly by the thread that
executes the single region. The program works fine in both cases.
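
For completeness, Version 1 as a self-contained sketch (function1 and function2 are the
placeholders used in the question, declared here as prototypes):

#include <omp.h>

int function1(void);
int function2(void);

void run(void) {
    int a, b;
    #pragma omp parallel shared(a, b)
    {
        #pragma omp single nowait
        {
            #pragma omp task
            a = function1();
            #pragma omp task
            b = function2();
        }
        /* The implicit barrier at the end of the parallel region is
           where all threads wait and help execute the two tasks, so
           the barrier of single is redundant and nowait skips it. */
    }
}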
