
SRI RAMAKRISHNA ENGINEERING COLLEGE

[Educational Service: SNR Sons Charitable Trust]


[Autonomous Institution, Reaccredited by NAAC with ‘A+’ Grade]
[Approved by AICTE and Permanently Affiliated to Anna University, Chennai]
[ISO 9001:2015 Certified and all eligible programmes Accredited by NBA]
VATTAMALAIPALAYAM, N.G.G.O. COLONY POST, COIMBATORE – 641 022.

Department of Computer Science and Engineering


Internal Test I – Retest
Date 13.04.2024 Department CSE
Semester V Class/section III Year B.E & M.Tech
Duration 2:00 Hours Maximum marks 50
Course Code &Title: 20CS257 High Performance Computing – Answer key
Course Outcomes Addressed:
CO1: Understand the popular parallel programming paradigms concepts.
CO2: Describe the concept of modern processors and its performance.
CO3: Apply parallelism to extract maximum performance in multicore and shared memory processor.

S.No    Questions                                                  Cognitive Level/CO

PART – A (Answer All Questions) (10*1 = 10 Marks)

1. The average number of tasks completed by the server over a time period is
called __________. U/ CO1

a) Reliability b) Bandwidth c) Throughput d) Latency

2. In parallelization, if P is the proportion of a system or program that can be
made parallel, and 1-P is the proportion that remains serial, then the maximum U/ CO2
speedup that can be achieved using N processors is 1/((1-P)+(P/N)). Choose the
respective law.

a) Newton’s law b) Ohm’s law c) Amdahl’s Law d) Moore’s Law
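As a quick worked check of the formula (added for illustration): with P = 0.9
and N = 4, the speedup is 1/((1-0.9) + (0.9/4)) = 1/0.325 ≈ 3.08, and even with
unlimited processors it is bounded by 1/(1-P) = 10.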

3. Suppose we run a parallel program with a fixed number of processes/threads
and a fixed input size, and we obtain an efficiency E. Suppose we now increase U/ CO1
the number of processes/threads used by the program and find a corresponding
rate of increase in the problem size so that the program always has efficiency
E. Which of the following suits such a program?

a) Not scalable b) Scalable c) May be scalable d) Optimizable
4. If the main memory is of 8K bytes and the cache memory is of 2K words, and
associative mapping is used, what is the size of each word of cache memory? U/CO2

a) 11 bits b) 16 bits c) 21 bits d) 20 bits
5. In a __________ system, all caches on the bus monitor the bus to determine
if they have a copy of the block of data that is requested on the bus. R/ CO2
a) Coherence b) Recurrence c) Replication d) Snooping

6. Identify the directive that forces threads to wait till all are done.
U/ CO3
a) #pragma omp parallel b) #pragma omp barrier c) #pragma omp critical
d) #pragma omp sections

7. All the components of a CPU core can operate at some maximum speed called
__________. R/ CO1
a) Accelerated Performance b) Peak Performance c) High Performance
d) Scalable Performance
8. Cache memory works on the principle of locality of reference. R/CO1
9. The OpenMP library function omp_set_num_threads( ) is used to set the
number of threads in upcoming parallel regions. R/ CO1
10. Parallel efficiency is defined as the speedup divided by the number of
processors used: E = S/N. R/ CO2
PART – B (Answer All Questions) (5*2 = 10 Marks)                Cognitive Level/CO
11. Outline parallel scalability. U/CO3
(Explanation: 2 Marks)
The scalability of a parallel algorithm on a parallel architecture is a measure of
its capacity to effectively utilize an increasing number of processors. It is the
ratio between the actual speedup and the ideal speedup obtained when using a
certain number of processors.
12. Distinguish between data and functional parallelism. U/CO4
(Difference: 2 marks)

Data-level parallelism                        Functional parallelism

Data parallelism refers to concurrent         Task parallelism means concurrent
execution of the same instruction             execution of different tasks on
stream on multiple data.                      multiple computing cores.

SIMD (Single Instruction Multiple Data)       MPMD (Multiple Program Multiple Data):
instructions issue identical operations       a big problem is divided into subtasks
on a whole array of integer or                that execute completely different code
floating-point operands, usually in           on different data items.
special registers.

They improve arithmetic peak                  Overlapping tasks that would otherwise
performance without the requirement           be executed sequentially can accelerate
for increased superscalarity.                 execution considerably.
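A minimal OpenMP sketch (added for illustration, not part of the original key)
contrasting the two: the first loop applies the same operation to many elements
(data parallelism), while omp sections runs two different functions
concurrently (functional parallelism). The array size and function names are
illustrative.

// contrast.c -- compile with: gcc -fopenmp contrast.c
#include <stdio.h>
#include <omp.h>
#define N 1000

void task_a(void) { printf("task A on thread %d\n", omp_get_thread_num()); }
void task_b(void) { printf("task B on thread %d\n", omp_get_thread_num()); }

int main(void)
{
    static double x[N];

    // Data parallelism: the same statement on many data items
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        x[i] = 2.0 * i;

    // Functional (task) parallelism: different work in each section
    #pragma omp parallel sections
    {
        #pragma omp section
        task_a();
        #pragma omp section
        task_b();
    }
    return 0;
}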

13. Compare shared memory and distributed memory computers. U/ CO3
(Comparison: 2 marks)

Shared-memory computers                       Distributed-memory computers

A shared-memory parallel computer is a        In distributed-memory computers, each
system in which a number of CPUs work         processor is connected to exclusive
on a common, shared physical address          local memory; no other CPU has direct
space.                                        access to it.

Uniform Memory Access (UMA) systems           No Remote Memory Access (NORMA). Each
exhibit a “flat” memory model: latency        node comprises at least one network
and bandwidth are the same for all            interface (NI) that mediates the
processors and all memory locations.          connection to a communication network.
This is also called symmetric                 A serial process runs on each CPU and
multiprocessing (SMP).                        communicates with processes on other
                                              CPUs by means of the network.

On cache-coherent Nonuniform Memory           Since there is no remote memory access
Access (ccNUMA) machines, memory is           on distributed-memory machines, a
physically distributed but logically          problem has to be solved cooperatively
shared.                                       by sending messages back and forth
                                              between processes, e.g., using MPI.
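A minimal message-passing sketch (added for illustration, assuming an MPI
installation): each process owns its local data and exchanges it explicitly
over the network.

// mpi_msg.c -- compile with: mpicc mpi_msg.c ; run with: mpirun -np 2 ./a.out
#include <stdio.h>
#include <mpi.h>

int main(int argc, char* argv[])
{
    int rank, size, msg;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) {
        msg = 42;   /* the value lives only in rank 0's local memory */
        MPI_Send(&msg, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&msg, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 of %d received %d\n", size, msg);
    }
    MPI_Finalize();
    return 0;
}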

14. Consider the following statements: U/CO3
(i) Pthreads requires that the programmer explicitly specify the behavior of
each thread.
(ii) Pthreads does not allow the programmer to explicitly specify the behavior
of each thread.
(iii) OpenMP allows the programmer to simply state that a block of code should
be executed in parallel.
(iv) A Pthreads program can be run with a C compiler.
Which of the above statements is/are true?

Develop a hello world program using OpenMP.
(Program: 2 marks)

// OpenMP program to print Hello World using C language


// OpenMP header
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char* argv[])


{
    // Beginning of parallel region
    #pragma omp parallel
    {
        printf("Hello World... from thread = %d\n",
               omp_get_thread_num());
    }
    // End of parallel region
    return 0;
}
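The program can be compiled with any OpenMP-capable compiler, e.g.
gcc -fopenmp hello.c, and the number of threads can be chosen at run time
with the OMP_NUM_THREADS environment variable (for example
OMP_NUM_THREADS=4 ./a.out).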

15. Consider the following code segment. Ap/CO3
(Answer: 2 Marks)

#pragma omp parallel for private(i)
for (int i = 0; i < 100; i++)
{
    a[i] = i;
}

How many iterations are executed by each thread if four threads execute the
above program?
a) 20
b) 40
c) 25
d) 35
Answer: c) 25
If four threads execute the program, the loop is split among the four threads
(100/4 = 25), so each thread executes 25 iterations.

PART – C (3*10 = 30 Marks)


16. Compulsory Question:
Build matrix multiplication using OpenMP (Column Order, Row Order, Block Ap/CO3
Matrix).
(row-major: 4 marks, column-major: 3 marks, block matrices: 3 marks)
ROW MAJOR MATRICES:
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>
#include <sys/time.h>
#define N 600

int A[N][N];
int B[N][N];
int C[N][N];

int main()
{
    int i, j, k;
    struct timeval tv1, tv2;
    struct timezone tz;
    double elapsed;
    omp_set_num_threads(4);

    // Initialize the input matrices
    for (i = 0; i < N; i++) {
        for (j = 0; j < N; j++) {
            A[i][j] = 2;
            B[i][j] = 4;
        }
    }

    gettimeofday(&tv1, &tz);
    // Each thread computes a contiguous range of rows of C; the loops walk
    // A and C row by row (row-major order).
    #pragma omp parallel for private(i,j,k) shared(A,B,C)
    for (i = 0; i < N; i++) {
        for (j = 0; j < N; j++) {
            for (k = 0; k < N; k++) {
                C[i][j] += A[i][k] * B[k][j];
            }
        }
    }
    gettimeofday(&tv2, &tz);

    elapsed = (double) (tv2.tv_sec - tv1.tv_sec)
            + (double) (tv2.tv_usec - tv1.tv_usec) * 1.e-6;
    printf("Elapsed time = %f seconds.\n", elapsed);

    /* Optional: print the result matrix
    for (i = 0; i < N; i++) {
        for (j = 0; j < N; j++)
            printf("%d\t", C[i][j]);
        printf("\n");
    }
    */
    return 0;
}
COLUMN MAJOR MATRICES:
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>
#include <sys/time.h>

int main(int argc, const char* argv[]) {
    int rows = 2;
    int cols = 3;
    int i, j, k;
    struct timeval tv1, tv2;
    struct timezone tz;
    double elapsed;
    omp_set_num_threads(4);

    // A is rows x rows, B is rows x cols, res is rows x cols; all three are
    // stored in column-major order: element (i,j) is at index [i + j*rows].
    int A[rows*rows];
    int B[rows*cols];
    int res[rows*cols];
    A[0] = 1; A[1] = 2; A[2] = 3; A[3] = 4;
    B[0] = 5; B[1] = 6; B[2] = 2; B[3] = 3; B[4] = 1; B[5] = 7;

    gettimeofday(&tv1, &tz);
    // Multiplication in column-major order: res = A * B
    #pragma omp parallel for private(i,j,k) shared(A,B,res)
    for (i = 0; i < rows; i++) {
        for (j = 0; j < cols; j++) {
            res[i + j*rows] = 0;
            for (k = 0; k < rows; k++) {
                res[i + j*rows] += A[i + k*rows] * B[k + j*rows];
            }
        }
    }
    gettimeofday(&tv2, &tz);

    elapsed = (double) (tv2.tv_sec - tv1.tv_sec)
            + (double) (tv2.tv_usec - tv1.tv_usec) * 1.e-6;
    printf("Elapsed time = %f seconds.\n", elapsed);
    for (i = 0; i < rows*cols; i++) {
        printf("res[%d] = %d\n", i, res[i]);
    }
    return 0;
}
MATRIX MULTIPLICATION:
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>
#include <sys/time.h>
#define N 100

int A[N][N];
int B[N][N];
int C[N][N];

int main()
{
    int i, j, k;
    struct timeval tv1, tv2;
    struct timezone tz;
    double elapsed;
    omp_set_num_threads(100);

    for (i = 0; i < N; i++) {
        for (j = 0; j < N; j++) {
            A[i][j] = 2;
            B[i][j] = 2;
        }
    }

    gettimeofday(&tv1, &tz);
    #pragma omp parallel for private(i,j,k) shared(A,B,C)
    for (i = 0; i < N; ++i) {
        for (j = 0; j < N; ++j) {
            for (k = 0; k < N; ++k) {
                C[i][j] += A[i][k] * B[k][j];
            }
        }
    }
    gettimeofday(&tv2, &tz);

    elapsed = (double) (tv2.tv_sec - tv1.tv_sec)
            + (double) (tv2.tv_usec - tv1.tv_usec) * 1.e-6;
    printf("Elapsed time = %f seconds.\n", elapsed);
    return 0;
}
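The question also allots 3 marks for block matrix multiplication, which the
third listing above does not actually implement; the following is a minimal
tiled sketch of that scheme (added for illustration, with an illustrative
block size BS that is assumed to divide N evenly):

#include <stdio.h>
#include <omp.h>
#define N 100
#define BS 20   /* illustrative block size; assumed to divide N evenly */

int A[N][N], B[N][N], C[N][N];

int main(void)
{
    // Fill the inputs with the same constants used in the listings above
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            A[i][j] = 2;
            B[i][j] = 2;
        }

    // Blocked (tiled) multiplication: each thread owns whole BS x BS tiles
    // of C, so updates never race, and the tiles stay resident in cache.
    #pragma omp parallel for collapse(2)
    for (int ii = 0; ii < N; ii += BS)
        for (int jj = 0; jj < N; jj += BS)
            for (int kk = 0; kk < N; kk += BS)
                for (int i = ii; i < ii + BS; i++)
                    for (int j = jj; j < jj + BS; j++)
                        for (int k = kk; k < kk + BS; k++)
                            C[i][j] += A[i][k] * B[k][j];

    printf("C[0][0] = %d\n", C[0][0]);   // expect 2*2*N = 400
    return 0;
}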

Any Two Questions

17. Compare multicore, multithreaded and vector processors. U/ CO1

Multithreaded processors (e.g., simultaneous multithreading): a single CPU core
that can execute multiple threads simultaneously. Multithreading is running
multiple tasks within a process. Threads are of two types, namely user-level
threads and kernel-level threads. Multithreading is economical, responsive,
scalable, efficient, and allows resource sharing. There are three models in
multithreading: the many-to-many model, the many-to-one model, and the
one-to-one model. All modern processors are heavily pipelined, which opens the
possibility for high performance if the pipelines can actually be used.
- Hyperthreading (or) SMT (3 Marks)

Multicore processors: multiprocessors where the CPU cores coexist on a single
processor chip. The first challenge of the multicore transition is the absolute
necessity to put those resources to efficient use by parallel programming,
instead of relying on single-core performance. Another challenge posed by
multicore is the gradual reduction in main memory bandwidth and cache size
available per core. Finally, the complex structure of shared and nonshared
caches on current multicore chips makes communication characteristics between
different cores highly nonisotropic. If there is a shared cache, two cores can
exchange certain amounts of information much faster; e.g., they can synchronize
via a variable in cache instead of having to exchange data over the memory bus.
At the time of writing, there are very few truly “multicore-aware” programming
techniques that explicitly exploit this most important feature to improve
performance of parallel code.
(3 Marks)

Vector processors show a much better ratio of real application performance to
peak performance. They follow the SIMD (Single Instruction Multiple Data)
paradigm, which demands that a single machine instruction be automatically
applied to a presumably large number of arguments of the same type, i.e., a
vector. Vector computers have much more massive parallelism built into the
execution units and, more importantly, the memory subsystem. All needed
elements of the required argument vectors are first collected into vector
registers (gather), then the vector operation is executed on them, and finally
the results are stored back (scatter). In contrast to cache-based processors,
where such operations are extremely expensive due to the cache line concept,
vector machines can economically perform gather/scatter (although stride-one
access is still most efficient).
(4 Marks)
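A minimal illustration (added, not from the original key) of SIMD-friendly
code: a stride-one loop in which every iteration applies the same operation to
independent elements, so a vectorizing compiler can execute several iterations
per vector instruction. OpenMP's portable hint is used; the function name axpy
is illustrative.

#include <stdio.h>

// Compile with e.g. gcc -fopenmp -O2; each iteration is independent and the
// accesses are stride-one, so the loop maps directly onto vector instructions.
void axpy(int n, double a, const double *x, double *y)
{
    #pragma omp simd
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];   /* stride-one access: most efficient */
}

int main(void)
{
    double x[8] = {1, 2, 3, 4, 5, 6, 7, 8}, y[8] = {0};
    axpy(8, 2.0, x, y);
    printf("y[7] = %f\n", y[7]);   // 16.0
    return 0;
}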

18. Explain memory hierarchies in detail. U/ CO1

Data can be stored in a computer system in many different ways. CPUs have a set
of registers, which can be accessed without delay. In addition, there are one
or more small but very fast caches holding copies of recently used data items.
Main memory is much slower, but also much larger, than cache. Finally, data can
be stored on disk and copied to main memory as needed. This forms a complex
hierarchy, and understanding the data transfer between its levels is vital in
order to identify performance bottlenecks.
(2 marks)

Explanation of Cache (3 marks)
Explanation of Cache mapping (3 marks)
Explanation of Prefetch (2 marks)
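A small demonstration (added for illustration) of why the hierarchy matters:
traversing a matrix row by row hits consecutive addresses and reuses cache
lines, while column order jumps N*sizeof(double) bytes per access and misses
far more often, so the second loop nest typically runs several times slower.

#include <stdio.h>
#define N 2000

static double m[N][N];

int main(void)
{
    double sum = 0.0;

    // Row-major traversal: consecutive memory addresses, cache friendly
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            sum += m[i][j];

    // Column-major traversal of the same data: strided access, roughly one
    // cache miss per element once a row of the matrix exceeds the cache
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            sum += m[i][j];

    printf("%f\n", sum);
    return 0;
}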

19. Describe shared memory parallel programming with OpenMP. U/ CO3

Shared memory opens the possibility of immediate access to all data from all
processors without explicit communication. OpenMP is a set of compiler
directives.
Model for OpenMP thread operations: the master thread “forks” a team of
threads, which work on shared memory in a parallel region. After the parallel
region, the threads are “joined,” i.e., terminated or put to sleep, until the
next parallel region starts. The number of running threads may vary among
parallel regions. (3 Marks)

- Parallel execution (7 Marks)
- Data scoping
- OpenMP worksharing for loops
- Synchronization
- Reductions
- Loop scheduling
- Tasking
A short sketch combining several of these constructs is given below.
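A minimal sketch (added for illustration, not part of the original key)
touching worksharing, data scoping, loop scheduling, and a reduction; the
array contents are illustrative.

#include <stdio.h>
#include <omp.h>
#define N 1000

int main(void)
{
    double x[N], sum = 0.0;

    // Worksharing loop with explicit data scoping and a reduction;
    // schedule(static) hands each thread a contiguous chunk of iterations.
    #pragma omp parallel for default(none) shared(x) reduction(+:sum) schedule(static)
    for (int i = 0; i < N; i++) {
        x[i] = 1.0 / (i + 1);   // each thread fills its own chunk
        sum += x[i];            // partial sums are combined at the join
    }

    printf("sum = %f, threads available = %d\n", sum, omp_get_max_threads());
    return 0;
}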

Course Instructors Programme Assessment Committee HOD/CSE


(Dr.M.S.Geetha Devasena, Prof/CSE)
(Dr. R. Kingsy Grace, Asso. Prof./CSE)
