
Concurrent Programming*

* This assignment will have about 4 questions, which will be added during the course. Student
groups (4 students each) can work on this assignment. The final submission date will be announced later.

Q1.) Write a well-optimized parallel algorithm with cache blocking for the following matrix
operation using OpenMP, Intel Threading Building Blocks and Cilk++.
A = B*C + B*D’

All matrices are of the same dimension (n x n). You must provide a performance analysis with different
matrix dimensions and block sizes.
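
As a starting point for Q1, here is a minimal sketch of a cache-blocked version with OpenMP. It reads D’ as the transpose of D (an assumption about the notation) and uses the identity B*C + B*D’ = B*(C + D’), so only one matrix product is needed. The names `Matrix` and `blocked_bc_plus_bdt`, the row-major flat layout, and the block-size parameter `bs` are all choices made for this illustration, not part of the assignment.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Row-major n x n matrix in a flat vector (layout assumed for this sketch).
using Matrix = std::vector<double>;

// Hypothetical helper: returns A = B*C + B*D^T, computed as B*(C + D^T).
// `bs` is the tile edge length, i.e. the block size to vary in experiments.
Matrix blocked_bc_plus_bdt(const Matrix& B, const Matrix& C, const Matrix& D,
                           std::size_t n, std::size_t bs) {
    assert(bs > 0);
    assert(B.size() == n * n && C.size() == n * n && D.size() == n * n);

    // E = C + D^T is formed once in O(n^2) time.
    Matrix E(n * n);
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < n; ++j)
            E[i * n + j] = C[i * n + j] + D[j * n + i];

    Matrix A(n * n, 0.0);
    const long nblocks = static_cast<long>((n + bs - 1) / bs);

    // Each bi-iteration writes its own band of rows of A, so the row bands
    // can be divided among threads without any locking.
    #pragma omp parallel for schedule(static)
    for (long bi = 0; bi < nblocks; ++bi) {
        const std::size_t i0 = static_cast<std::size_t>(bi) * bs;
        const std::size_t i1 = std::min(i0 + bs, n);
        for (std::size_t k0 = 0; k0 < n; k0 += bs) {
            const std::size_t k1 = std::min(k0 + bs, n);
            for (std::size_t j0 = 0; j0 < n; j0 += bs) {
                const std::size_t j1 = std::min(j0 + bs, n);
                // Multiply one tile of B by one tile of E into a tile of A.
                for (std::size_t i = i0; i < i1; ++i)
                    for (std::size_t k = k0; k < k1; ++k) {
                        const double b = B[i * n + k];
                        for (std::size_t j = j0; j < j1; ++j)
                            A[i * n + j] += b * E[k * n + j];
                    }
            }
        }
    }
    return A;
}
```

With GCC or Clang this would typically be compiled with `-fopenmp`; without that flag the pragma is ignored and the loop runs sequentially, which gives a convenient baseline. Timing the call over a grid of `n` and `bs` values is one way to gather the data for the performance analysis.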

Optional (will not be evaluated, but highly encouraged)

Try the above implementation on a GPU with CUDA/OpenCL programming. Learn the different memory
structures on the GPU (e.g. texture memory) and warp methods, and test them with your implementation.
(http://developer.nvidia.com/object/cuda_3_2_downloads.html)

Q2) Bitonic sort is a sorting-network algorithm that is efficient on multiprocessors. Study the bitonic sort
algorithm (http://en.wikipedia.org/wiki/Bitonic_sorter) and analyze it using the divide and conquer
pattern. Use the fork-join pattern to implement the algorithm. Use Intel Threading Building Blocks and its
task-based programming model.
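
To make the divide and conquer structure concrete, here is a minimal sequential sketch of recursive bitonic sort, with the fork and join points marked in comments. The function names are hypothetical, and the sketch assumes the input length is a power of two. It deliberately leaves out the TBB part: in a fork-join version, each marked pair of independent recursive calls could be handed to something like `tbb::parallel_invoke`, probably with a sequential cutoff for small subproblems.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Turns the bitonic run a[lo, lo+cnt) into a fully sorted run.
void bitonic_merge(std::vector<int>& a, std::size_t lo, std::size_t cnt,
                   bool ascending) {
    if (cnt <= 1) return;
    const std::size_t half = cnt / 2;
    // Compare-exchange step: element i against element i + half.
    for (std::size_t i = lo; i < lo + half; ++i)
        if ((a[i] > a[i + half]) == ascending) std::swap(a[i], a[i + half]);
    // Fork point: the two halves are independent bitonic runs.
    bitonic_merge(a, lo, half, ascending);
    bitonic_merge(a, lo + half, half, ascending);
    // Join point: both halves are done when the calls return.
}

// Sorts a[lo, lo+cnt); cnt must be a power of two.
void bitonic_sort_range(std::vector<int>& a, std::size_t lo, std::size_t cnt,
                        bool ascending) {
    if (cnt <= 1) return;
    const std::size_t half = cnt / 2;
    // Fork point: sort the halves in opposite directions so that together
    // they form a bitonic sequence.
    bitonic_sort_range(a, lo, half, true);
    bitonic_sort_range(a, lo + half, half, false);
    // Join point, then combine.
    bitonic_merge(a, lo, cnt, ascending);
}

// Convenience wrapper (hypothetical name): ascending sort of the whole vector.
void bitonic_sort_pow2(std::vector<int>& a) {
    const std::size_t n = a.size();
    assert(n == 0 || (n & (n - 1)) == 0);  // power-of-two length assumed
    bitonic_sort_range(a, 0, n, true);
}
```

The recursion has depth log n for the sort and another log n inside each merge, which is where the O(n log^2 n) comparison count of the sorting network comes from.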

Q3) Assume you are given a file that contains lines of numbers separated by spaces. Each line consists
of 1000-2000 or more numbers, and there should be about 10000 or more such lines (you have to create
such a file). You have to design a program that reads each line from the file, sorts it, and writes the sorted
lines to another file. For the sorting you have to use a bitonic sorter. You are required to:

A.) Implement the program for sequential processing.
B.) Implement the program using the algorithm you developed in Q2) above.
C.) Implement the program using the pipeline pattern with Intel Threading Building Blocks. Design
appropriate pipeline stages for best performance gains. Apply cache optimization if possible.
Experiment with pipeline stages within the bitonic sort itself.
D.) Show the performance of each implementation. Test your program on different multicore
machines.
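
A minimal sketch of the sequential variant (part A) is given below. One detail the task leaves open is that a bitonic sorter needs a power-of-two input length, while the lines hold 1000-2000 numbers. This sketch pads each line with the largest `int` value and drops the padding after sorting, which is one possible choice among several. The function names, the use of `int` values, and the stream-based interface are assumptions made for illustration.

```cpp
#include <cstddef>
#include <istream>
#include <limits>
#include <ostream>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Iterative bitonic sorting network; a.size() must be a power of two.
void bitonic_sort_network(std::vector<int>& a) {
    const std::size_t n = a.size();
    for (std::size_t k = 2; k <= n; k *= 2)          // size of sorted runs
        for (std::size_t j = k / 2; j > 0; j /= 2)   // compare distance
            for (std::size_t i = 0; i < n; ++i) {
                const std::size_t l = i ^ j;         // partner index
                if (l > i) {
                    const bool ascending = (i & k) == 0;
                    if ((a[i] > a[l]) == ascending) std::swap(a[i], a[l]);
                }
            }
}

// Parses one line of space-separated integers and returns it sorted.
std::string sort_line(const std::string& line) {
    std::istringstream in(line);
    std::vector<int> values;
    for (int x; in >> x;) values.push_back(x);

    const std::size_t count = values.size();
    std::size_t padded = 1;
    while (padded < count) padded *= 2;
    // Pad with the maximum int so the padding sorts to the end.
    values.resize(padded, std::numeric_limits<int>::max());
    bitonic_sort_network(values);
    values.resize(count);  // drop the padding again

    std::ostringstream out;
    for (std::size_t i = 0; i < count; ++i) {
        if (i > 0) out << ' ';
        out << values[i];
    }
    return out.str();
}

// Reads lines from `in`, writes each one sorted to `out`.
void sort_lines(std::istream& in, std::ostream& out) {
    for (std::string line; std::getline(in, line);)
        out << sort_line(line) << '\n';
}
```

Passing `std::ifstream` and `std::ofstream` objects to `sort_lines` gives the file-to-file program. The three steps here (read a line, sort it, write it) also suggest a natural first cut for the pipeline in part C: a serial read stage, a parallel sort stage, and a serial in-order write stage.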
