
SRI RAMAKRISHNA ENGINEERING COLLEGE

[Educational Service: SNR Sons Charitable Trust]


[Autonomous Institution, Reaccredited by NAAC with ‘A+’ Grade]
[Approved by AICTE and Permanently Affiliated to Anna University, Chennai]
[ISO 9001:2015 Certified and all eligible programmes Accredited by NBA]
VATTAMALAIPALAYAM, N.G.G.O. COLONY POST, COIMBATORE – 641 022.

Department of Computer Science and Engineering


Internal Test I – Retest
Date 13.04.2024 Department CSE
Semester V Class/section III Year B.E & M.Tech
Duration 2:00 Hours Maximum marks 50
Course Code &Title: 20CS257 High Performance Computing – Answer key
Course Outcomes Addressed:
CO1: Understand the popular parallel programming paradigms concepts.
CO2: Describe the concept of modern processors and its performance.
CO3: Apply parallelism to extract maximum performance in multicore and shared memory processor.

S.No    Questions                                                  Cognitive Level/CO

PART – A (Answer All Questions) (10*1 = 10 Marks)

1. The average number of tasks completed by the server over a time period is
called __________. U/ CO1

a) Reliability b) Bandwidth c) Throughput d) Latency

2. In parallelization, if P is the proportion of a system or program that can be
made parallel, and 1-P is the proportion that remains serial, then the maximum U/ CO2
speedup that can be achieved using N processors is 1/((1-P)+(P/N)). Choose the
respective law.

a) Newton’s law b) Ohm’s law c) Amdahl’s Law d) Moore’s Law
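As a quick worked check of the formula (added for illustration): with P = 0.9
and N = 4, the speedup is 1/((1-0.9) + (0.9/4)) = 1/0.325 ≈ 3.08, and even with
unlimited processors it is bounded by 1/(1-P) = 10.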

3. Suppose we run a parallel program with a fixed number of processes/threads
and a fixed input size, and we obtain an efficiency E. Suppose we now increase U/ CO1
the number of processes/threads used by the program and find a corresponding
rate of increase in the problem size so that the program always has efficiency
E. Which of the following suits such a program?

a) Not scalable b) Scalable c) May be scalable d) Optimizable
4. If the main memory is of 8K bytes and the cache memory is of 2K words, and
associative mapping is used, what is the size of each word of cache memory? U/CO2

a) 11 bits b) 16 bits c) 21 bits d) 20 bits
5. In a __________ system, all caches on the bus monitor the bus to determine
if they have a copy of the block of data that is requested on the bus. R/ CO2
a) Coherence b) Recurrence c) Replication d) Snooping

6. Identify the directive that forces threads to wait till all are done.
U/ CO3
a) #pragma omp parallel b) #pragma omp barrier c) #pragma omp critical
d) #pragma omp sections

7. All the components of a CPU core can operate at some maximum speed called
__________. R/ CO1
a) Accelerated Performance b) Peak Performance c) High Performance
d) Scalable Performance
8. Cache memory works on the principle of locality of reference. R/CO1
9. The OpenMP library function omp_set_num_threads( ) is used to set the
number of threads in upcoming parallel regions. R/ CO1
10. Parallel efficiency is defined as the speedup divided by the number of
processors used: E = S/N. R/ CO2
PART – B (Answer All Questions) (5*2 = 10 Marks)                Cognitive Level/CO
11. Outline parallel scalability. U/CO3
(Explanation: 2 Marks)
The scalability of a parallel algorithm on a parallel architecture is a measure of
its capacity to effectively utilize an increasing number of processors. It is the
ratio between the actual speedup and the ideal speedup obtained when using a
certain number of processors.
12. Distinguish between data and functional parallelism. U/CO4
(Difference: 2 marks)

Data-level parallelism                        Functional parallelism

Data parallelism refers to concurrent         Task parallelism means concurrent
execution of the same instruction             execution of different tasks on
stream on multiple data.                      multiple computing cores.

SIMD (Single Instruction Multiple Data)       MPMD (Multiple Program Multiple Data):
instructions issue identical operations       a big problem is divided into subtasks
on a whole array of integer or                that execute completely different code
floating-point operands, usually in           on different data items.
special registers.

They improve arithmetic peak                  Overlapping tasks that would otherwise
performance without the requirement           be executed sequentially can accelerate
for increased superscalarity.                 execution considerably.
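A minimal OpenMP sketch (added for illustration, not part of the original key)
contrasting the two: the first loop applies the same operation to many elements
(data parallelism), while omp sections runs two different functions
concurrently (functional parallelism). The array size and function names are
illustrative.

// contrast.c -- compile with: gcc -fopenmp contrast.c
#include <stdio.h>
#include <omp.h>
#define N 1000

void task_a(void) { printf("task A on thread %d\n", omp_get_thread_num()); }
void task_b(void) { printf("task B on thread %d\n", omp_get_thread_num()); }

int main(void)
{
    static double x[N];

    // Data parallelism: the same statement on many data items
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        x[i] = 2.0 * i;

    // Functional (task) parallelism: different work in each section
    #pragma omp parallel sections
    {
        #pragma omp section
        task_a();
        #pragma omp section
        task_b();
    }
    return 0;
}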

13. Compare shared memory and distributed memory computers. U/ CO3
(Comparison: 2 marks)

Shared-memory computers                       Distributed-memory computers

A shared-memory parallel computer is a        In distributed-memory computers, each
system in which a number of CPUs work         processor is connected to exclusive
on a common, shared physical address          local memory; no other CPU has direct
space.                                        access to it.

Uniform Memory Access (UMA) systems           No Remote Memory Access (NORMA). Each
exhibit a “flat” memory model: latency        node comprises at least one network
and bandwidth are the same for all            interface (NI) that mediates the
processors and all memory locations.          connection to a communication network.
This is also called symmetric                 A serial process runs on each CPU and
multiprocessing (SMP).                        communicates with processes on other
                                              CPUs by means of the network.

On cache-coherent Nonuniform Memory           Since there is no remote memory access
Access (ccNUMA) machines, memory is           on distributed-memory machines, a
physically distributed but logically          problem has to be solved cooperatively
shared.                                       by sending messages back and forth
                                              between processes, e.g., using MPI.
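A minimal message-passing sketch (added for illustration, assuming an MPI
installation): each process owns its local data and exchanges it explicitly
over the network.

// mpi_msg.c -- compile with: mpicc mpi_msg.c ; run with: mpirun -np 2 ./a.out
#include <stdio.h>
#include <mpi.h>

int main(int argc, char* argv[])
{
    int rank, size, msg;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) {
        msg = 42;   /* the value lives only in rank 0's local memory */
        MPI_Send(&msg, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&msg, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 of %d received %d\n", size, msg);
    }
    MPI_Finalize();
    return 0;
}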

14. Consider the following statements: U/CO3
(i) Pthreads requires that the programmer explicitly specify the behavior of
each thread.
(ii) Pthreads does not allow the programmer to explicitly specify the behavior
of each thread.
(iii) OpenMP allows the programmer to simply state that a block of code should
be executed in parallel.
(iv) A Pthreads program can be run with a C compiler.
Which of the above statements is/are true?

Develop a hello world program using OpenMP.
(Program: 2 marks)

// OpenMP program to print Hello World using C language


// OpenMP header
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char* argv[])


{
    // Beginning of parallel region
    #pragma omp parallel
    {
        printf("Hello World... from thread = %d\n",
               omp_get_thread_num());
    }
    // End of parallel region
    return 0;
}
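The program can be compiled with any OpenMP-capable compiler, e.g.
gcc -fopenmp hello.c, and the number of threads can be chosen at run time
with the OMP_NUM_THREADS environment variable (for example
OMP_NUM_THREADS=4 ./a.out).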

15. Consider the following code segment. Ap/CO3
(Answer: 2 Marks)

#pragma omp parallel for private(i)
for (int i = 0; i < 100; i++)
{
    a[i] = i;
}

How many iterations are executed by each thread if four threads execute the
above program?
a) 20
b) 40
c) 25
d) 35
Answer: c) 25
If four threads execute the program, the loop is split among the four threads
(100/4 = 25), so each thread executes 25 iterations.

PART – C (3*10 = 30 Marks)


16. Compulsory Question:
Build matrix multiplication using OpenMP (Column Order, Row Order, Block Ap/CO3
Matrix).
(row-major: 4 marks, column-major: 3 marks, block matrices: 3 marks)
ROW MAJOR MATRICES:
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>
#include <sys/time.h>
#define N 600

int A[N][N];
int B[N][N];
int C[N][N];

int main()
{
    int i, j, k;
    struct timeval tv1, tv2;
    struct timezone tz;
    double elapsed;
    omp_set_num_threads(4);

    // Initialize the input matrices
    for (i = 0; i < N; i++) {
        for (j = 0; j < N; j++) {
            A[i][j] = 2;
            B[i][j] = 4;
        }
    }

    gettimeofday(&tv1, &tz);
    // Each thread computes a contiguous range of rows of C; the loops walk
    // A and C row by row (row-major order).
    #pragma omp parallel for private(i,j,k) shared(A,B,C)
    for (i = 0; i < N; i++) {
        for (j = 0; j < N; j++) {
            for (k = 0; k < N; k++) {
                C[i][j] += A[i][k] * B[k][j];
            }
        }
    }
    gettimeofday(&tv2, &tz);

    elapsed = (double) (tv2.tv_sec - tv1.tv_sec)
            + (double) (tv2.tv_usec - tv1.tv_usec) * 1.e-6;
    printf("Elapsed time = %f seconds.\n", elapsed);

    /* Optional: print the result matrix
    for (i = 0; i < N; i++) {
        for (j = 0; j < N; j++)
            printf("%d\t", C[i][j]);
        printf("\n");
    }
    */
    return 0;
}
COLUMN MAJOR MATRICES:
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>
#include <sys/time.h>

int main(int argc, const char* argv[]) {
    int rows = 2;
    int cols = 3;
    int i, j, k;
    struct timeval tv1, tv2;
    struct timezone tz;
    double elapsed;
    omp_set_num_threads(4);

    // A is rows x rows, B is rows x cols, res is rows x cols; all three are
    // stored in column-major order: element (i,j) is at index [i + j*rows].
    int A[rows*rows];
    int B[rows*cols];
    int res[rows*cols];
    A[0] = 1; A[1] = 2; A[2] = 3; A[3] = 4;
    B[0] = 5; B[1] = 6; B[2] = 2; B[3] = 3; B[4] = 1; B[5] = 7;

    gettimeofday(&tv1, &tz);
    // Multiplication in column-major order: res = A * B
    #pragma omp parallel for private(i,j,k) shared(A,B,res)
    for (i = 0; i < rows; i++) {
        for (j = 0; j < cols; j++) {
            res[i + j*rows] = 0;
            for (k = 0; k < rows; k++) {
                res[i + j*rows] += A[i + k*rows] * B[k + j*rows];
            }
        }
    }
    gettimeofday(&tv2, &tz);

    elapsed = (double) (tv2.tv_sec - tv1.tv_sec)
            + (double) (tv2.tv_usec - tv1.tv_usec) * 1.e-6;
    printf("Elapsed time = %f seconds.\n", elapsed);
    for (i = 0; i < rows*cols; i++) {
        printf("res[%d] = %d\n", i, res[i]);
    }
    return 0;
}
MATRIX MULTIPLICATION:
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>
#include <sys/time.h>
#define N 100

int A[N][N];
int B[N][N];
int C[N][N];

int main()
{
    int i, j, k;
    struct timeval tv1, tv2;
    struct timezone tz;
    double elapsed;
    omp_set_num_threads(100);

    for (i = 0; i < N; i++) {
        for (j = 0; j < N; j++) {
            A[i][j] = 2;
            B[i][j] = 2;
        }
    }

    gettimeofday(&tv1, &tz);
    #pragma omp parallel for private(i,j,k) shared(A,B,C)
    for (i = 0; i < N; ++i) {
        for (j = 0; j < N; ++j) {
            for (k = 0; k < N; ++k) {
                C[i][j] += A[i][k] * B[k][j];
            }
        }
    }
    gettimeofday(&tv2, &tz);

    elapsed = (double) (tv2.tv_sec - tv1.tv_sec)
            + (double) (tv2.tv_usec - tv1.tv_usec) * 1.e-6;
    printf("Elapsed time = %f seconds.\n", elapsed);
    return 0;
}
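The question also allots 3 marks for block matrix multiplication, which the
third listing above does not actually implement; the following is a minimal
tiled sketch of that scheme (added for illustration, with an illustrative
block size BS that is assumed to divide N evenly):

#include <stdio.h>
#include <omp.h>
#define N 100
#define BS 20   /* illustrative block size; assumed to divide N evenly */

int A[N][N], B[N][N], C[N][N];

int main(void)
{
    // Fill the inputs with the same constants used in the listings above
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            A[i][j] = 2;
            B[i][j] = 2;
        }

    // Blocked (tiled) multiplication: each thread owns whole BS x BS tiles
    // of C, so updates never race, and the tiles stay resident in cache.
    #pragma omp parallel for collapse(2)
    for (int ii = 0; ii < N; ii += BS)
        for (int jj = 0; jj < N; jj += BS)
            for (int kk = 0; kk < N; kk += BS)
                for (int i = ii; i < ii + BS; i++)
                    for (int j = jj; j < jj + BS; j++)
                        for (int k = kk; k < kk + BS; k++)
                            C[i][j] += A[i][k] * B[k][j];

    printf("C[0][0] = %d\n", C[0][0]);   // expect 2*2*N = 400
    return 0;
}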

Any Two Questions

17. Compare multicore, multithreaded and vector processors. U/ CO1

Multithreaded processors (e.g., simultaneous multithreading): a single CPU core
that can execute multiple threads simultaneously. Multithreading is running
multiple tasks within a process. Threads are of two types, namely user-level
threads and kernel-level threads. Multithreading is economical, responsive,
scalable, efficient, and allows resource sharing. There are three models in
multithreading: the many-to-many model, the many-to-one model, and the
one-to-one model. All modern processors are heavily pipelined, which opens the
possibility for high performance if the pipelines can actually be used.
- Hyperthreading (or) SMT (3 Marks)

Multicore processors: multiprocessors where the CPU cores coexist on a single
processor chip. The first challenge of the multicore transition is the absolute
necessity to put those resources to efficient use by parallel programming,
instead of relying on single-core performance. Another challenge posed by
multicore is the gradual reduction in main memory bandwidth and cache size
available per core. Finally, the complex structure of shared and nonshared
caches on current multicore chips makes communication characteristics between
different cores highly nonisotropic. If there is a shared cache, two cores can
exchange certain amounts of information much faster; e.g., they can synchronize
via a variable in cache instead of having to exchange data over the memory bus.
At the time of writing, there are very few truly “multicore-aware” programming
techniques that explicitly exploit this most important feature to improve
performance of parallel code.
(3 Marks)

Vector processors show a much better ratio of real application performance to
peak performance. They follow the SIMD (Single Instruction Multiple Data)
paradigm, which demands that a single machine instruction be automatically
applied to a presumably large number of arguments of the same type, i.e., a
vector. Vector computers have much more massive parallelism built into the
execution units and, more importantly, the memory subsystem. All needed
elements of the required argument vectors are first collected into vector
registers (gather), then the vector operation is executed on them, and finally
the results are stored back (scatter). In contrast to cache-based processors,
where such operations are extremely expensive due to the cache line concept,
vector machines can economically perform gather/scatter (although stride-one
access is still most efficient).
(4 Marks)
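A minimal illustration (added, not from the original key) of SIMD-friendly
code: a stride-one loop in which every iteration applies the same operation to
independent elements, so a vectorizing compiler can execute several iterations
per vector instruction. OpenMP's portable hint is used; the function name axpy
is illustrative.

#include <stdio.h>

// Compile with e.g. gcc -fopenmp -O2; each iteration is independent and the
// accesses are stride-one, so the loop maps directly onto vector instructions.
void axpy(int n, double a, const double *x, double *y)
{
    #pragma omp simd
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];   /* stride-one access: most efficient */
}

int main(void)
{
    double x[8] = {1, 2, 3, 4, 5, 6, 7, 8}, y[8] = {0};
    axpy(8, 2.0, x, y);
    printf("y[7] = %f\n", y[7]);   // 16.0
    return 0;
}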

18. Explain memory hierarchies in detail. U/ CO1

Data can be stored in a computer system in many different ways. CPUs have a set
of registers, which can be accessed without delay. In addition, there are one
or more small but very fast caches holding copies of recently used data items.
Main memory is much slower, but also much larger, than cache. Finally, data can
be stored on disk and copied to main memory as needed. This forms a complex
hierarchy, and understanding the data transfer between its levels is vital in
order to identify performance bottlenecks.
(2 marks)

Explanation of Cache (3 marks)
Explanation of Cache mapping (3 marks)
Explanation of Prefetch (2 marks)
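A small demonstration (added for illustration) of why the hierarchy matters:
traversing a matrix row by row hits consecutive addresses and reuses cache
lines, while column order jumps N*sizeof(double) bytes per access and misses
far more often, so the second loop nest typically runs several times slower.

#include <stdio.h>
#define N 2000

static double m[N][N];

int main(void)
{
    double sum = 0.0;

    // Row-major traversal: consecutive memory addresses, cache friendly
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            sum += m[i][j];

    // Column-major traversal of the same data: strided access, roughly one
    // cache miss per element once a row of the matrix exceeds the cache
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            sum += m[i][j];

    printf("%f\n", sum);
    return 0;
}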

19. Describe shared memory parallel programming with OpenMP. U/ CO3

Shared memory opens the possibility of immediate access to all data from all
processors without explicit communication. OpenMP is a set of compiler
directives.
Model for OpenMP thread operations: the master thread “forks” a team of
threads, which work on shared memory in a parallel region. After the parallel
region, the threads are “joined,” i.e., terminated or put to sleep, until the
next parallel region starts. The number of running threads may vary among
parallel regions. (3 Marks)

- Parallel execution (7 Marks)
- Data scoping
- OpenMP worksharing for loops
- Synchronization
- Reductions
- Loop scheduling
- Tasking
A short sketch combining several of these constructs is given below.
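A minimal sketch (added for illustration, not part of the original key)
touching worksharing, data scoping, loop scheduling, and a reduction; the
array contents are illustrative.

#include <stdio.h>
#include <omp.h>
#define N 1000

int main(void)
{
    double x[N], sum = 0.0;

    // Worksharing loop with explicit data scoping and a reduction;
    // schedule(static) hands each thread a contiguous chunk of iterations.
    #pragma omp parallel for default(none) shared(x) reduction(+:sum) schedule(static)
    for (int i = 0; i < N; i++) {
        x[i] = 1.0 / (i + 1);   // each thread fills its own chunk
        sum += x[i];            // partial sums are combined at the join
    }

    printf("sum = %f, threads available = %d\n", sum, omp_get_max_threads());
    return 0;
}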

Course Instructors Programme Assessment Committee HOD/CSE


(Dr.M.S.Geetha Devasena, Prof/CSE)
(Dr. R. Kingsy Grace, Asso. Prof./CSE)
