
Faculty of Engineering & Technology

Semester: 6th Year: 3rd


B.Tech CSE-AI
Subject Name: GPU
Subject Code : 203105398

FACULTY OF ENGINEERING AND TECHNOLOGY

BACHELOR OF TECHNOLOGY

GPU COMPUTING
(203105398)

LAB MANUAL

6th SEMESTER
COMPUTER SCIENCE & ENGINEERING DEPARTMENT


CERTIFICATE

This is to certify that Mr./Ms. ………………………… with
Enrolment No. 200303124548 has successfully completed his/her
laboratory experiments in GPU COMPUTING (203105398) in the
Department of PIT-CSE (AI) during the academic year 2022-23.

Date of Submission: ......................... Staff In Charge: ...........................

Head of Department: ...........................................


INDEX

Sr. No. | Experiment Title | Page No. | Starting Date | Ending Date | Grade | Sign

1. Understand the system by various Linux/Windows commands & GPU and CUDA architectures.
2. Understand Google Colab.
3. Analyze the program using gprof profiles.
4. WAP to demonstrate the addition of an array using CUDA code.
5. WAP to demonstrate squaring an array using a simple CUDA kernel.
6. WAP to demonstrate vector-matrix multiplication using GPU global memory.
7. WAP for vector-matrix multiplication, measuring time using CUDA events and using shared memory.
8. WAP to demonstrate vector-matrix multiplication using GPU constant memory; vector v is stored in GPU constant memory.
9. Analyse the program using NVIDIA profilers.
10. Develop a mini project with the help of GPU libraries such as Keras, TensorFlow, GANs, etc.

(The page number, date, grade, and sign columns are left blank to be filled in.)

PRACTICAL: 01
AIM: Understand the system by various Linux/Windows commands & GPU
and CUDA architectures.

GPU ARCHITECTURE:
A Graphics Processing Unit (GPU) is best known as the hardware used to run applications that
weigh heavily on graphics, e.g. 3D modeling software or VDI infrastructures. In the consumer
market, a GPU is mostly used to accelerate gaming graphics. Today, GPGPUs (General-Purpose
GPUs) are the hardware of choice for accelerating computational workloads in modern High
Performance Computing (HPC) landscapes. The GPU architecture is tolerant of memory latency:
compared to a CPU, a GPU works with fewer, and relatively small, memory cache layers. The
reason is that a GPU dedicates more transistors to computation, so it matters less how long it
takes to retrieve data from memory.

Figure :1.1 GPU Architecture


Compute Unified Device Architecture (CUDA) ARCHITECTURE:

CUDA stands for Compute Unified Device Architecture. It is a parallel computing platform and
API (Application Programming Interface) model developed by Nvidia, exposed as an extension
of C/C++ programming. CUDA programs use the Graphics Processing Unit (GPU) to perform
computations in parallel while providing good speed. Using CUDA, one can harness the power
of an Nvidia GPU to perform common computing tasks, such as processing matrices and other
linear algebra operations, rather than simply performing graphical calculations.

Figure :1.2 CUDA Architecture

WORKING OF CUDA:
 GPUs run one kernel (a grid of tasks) at a time.
 Each kernel consists of blocks, which are independent groups of threads.
 Each block contains threads, the basic units of computation.
 The threads in each block typically work together to calculate a value.
 Threads in the same block can share memory.
 In CUDA, sending data between the CPU and the GPU is often the most expensive part
of the computation.


Applications of CUDA
1. Computational finance
2. Climate, weather, and ocean modeling
3. Data science and analytics
4. Deep learning and machine learning
5. Defense and intelligence
6. Manufacturing/AEC
7. Media and entertainment
8. Medical imaging
9. Oil and gas
10. Research
11. Safety and security
12. Tools and management
Linux Commands:

Figure :1.3 pwd command

Figure :1.4 cd command

Figure :1.5 cd .. command

Figure :1.6 ls command


Figure :1.7 mkdir command

Figure :1.8 rmdir command

Figure :1.9 echo command

Figure :1.10 echo command

Figure :1.11 command


PRACTICAL: 03
AIM: Analyze the program using gprof profiles.
To obtain the profile, the program is compiled with the -pg flag (gcc -pg prog.c -o prog), run once to produce gmon.out, and then examined with gprof prog gmon.out.
CODE:
#include <stdio.h>

int binarySearch(int array[], int x, int low, int high) {


while (low <= high) {
int mid = low + (high - low) / 2;

if (array[mid] == x)
return mid;

if (array[mid] < x)
low = mid + 1;

else
high = mid - 1;
}

return -1;
}

int main(void) {
int array[] = {3, 4, 5, 6, 7, 8, 9};
int n = sizeof(array) / sizeof(array[0]);
int x = 4;
int result = binarySearch(array, x, 0, n - 1);
if (result == -1)


printf("Not found");
else
printf("Element is found at index %d", result);
return 0;
}


OUTPUT

Figure :2.1 gprof profiles


PRACTICAL: 02
AIM: Understand Google Colab.

What is Colaboratory?
Colaboratory, or “Colab” for short, is a product from Google Research. Colab allows anybody
to write and execute arbitrary Python code through the browser, and is especially well suited to
machine learning, data analysis, and education. More technically, Colab is a hosted Jupyter
notebook service that requires no setup to use, while providing free access to computing
resources, including GPUs.
Is it really free of charge to use?

Yes. Colab is free of charge to use.

Seems too good to be true. What are the limitations?

Colab resources are not guaranteed and not unlimited, and the usage limits sometimes fluctuate.
This is necessary for Colab to be able to provide resources free of charge. For more details,
see Resource Limits.

Users who are interested in more reliable access to better resources may be interested in Colab
Pro.

Resources in Colab are prioritized for interactive use cases. We prohibit actions associated with
bulk compute, actions that negatively impact others, as well as actions associated with bypassing
our policies. The following are disallowed from Colab runtimes:

 file hosting, media serving, or other web service offerings not related to interactive
compute with Colab
 downloading torrents or engaging in peer-to-peer file-sharing
 using a remote desktop or SSH
 connecting to remote proxies
 mining cryptocurrency
 running denial-of-service attacks
 password cracking


Figure: 3.1 Enabling gpu

Figure: 3.2 Simple python codes


Figure:3.3 Mathematical equation

Figure:3.4 Mathematical equation

Figure:3.5 Mathematical equation

Figure:3.6 Testing of gpu


PRACTICAL: 04
AIM: W.A.P to demonstrate the addition of an array using CUDA code.

CODE:
%%cu
#include <stdio.h>

int main()
{
    int arr[] = {1, 2, 3, 4, 5};
    int sum = 0;
    int length = sizeof(arr) / sizeof(arr[0]);

    for (int i = 0; i < length; i++) {
        sum = sum + arr[i];
    }

    printf("Sum of all the elements of an array: %d\n", sum);

    return 0;
}


OUTPUT

Figure: 4.1 Addition of Arrays


PRACTICAL: 05
AIM: W.A.P to demonstrate squaring an array using a simple CUDA kernel.

CODE:
%%cu
#include <stdio.h>

int main()
{
    int arr[5] = {1, 2, 3, 4, 5};
    int i = 0;

    printf("Array elements:\n");
    for (i = 0; i < 5; i++)
        printf("%d ", arr[i]);

    printf("\nSquare of array elements:\n");
    for (i = 0; i < 5; i++)
        printf("%d ", arr[i] * arr[i]);

    printf("\n");

    return 0;
}


OUTPUT

Figure: 5.1 Squaring an Array


PRACTICAL: 06
AIM: W.A.P to demonstrate vector-matrix multiplication using GPU global
memory.
CODE:
%%cu
#include<stdio.h>
#include<stdlib.h>

__global__ void arradd(int* md, int* nd, int* pd, int size)
{
    // Get unique identification number for a given thread
    int myid = blockIdx.x * blockDim.x + threadIdx.x;

    pd[myid] = md[myid] + nd[myid];
}

int main()
{
int size = 2000 * sizeof(int);
int m[2000], n[2000], p[2000],*md, *nd,*pd;
int i=0;

//Initialize the arrays


for(i=0; i<2000; i++ )
{
m[i] = i;
n[i] = i;
p[i] = 0;
}

// Allocate memory on GPU and transfer the data


cudaMalloc(&md, size);


cudaMemcpy(md, m, size, cudaMemcpyHostToDevice);

cudaMalloc(&nd, size);
cudaMemcpy(nd, n, size, cudaMemcpyHostToDevice);

cudaMalloc(&pd, size);

// Define number of threads and blocks


dim3 DimGrid(10, 1);
dim3 DimBlock(200, 1);

// Launch the GPU kernel function


arradd<<< DimGrid,DimBlock >>>(md,nd,pd,size);

// Transfer the results back to CPU memory


cudaMemcpy(p, pd, size, cudaMemcpyDeviceToHost);

// Free GPU arrays


cudaFree(md);
cudaFree(nd);
cudaFree (pd);

// Print the results


for(i=0; i<2000; i++ )
{
printf("\t%d",p[i]);
}
}


OUTPUT

Figure: 6.1 Vector Matrix Multiplication


PRACTICAL: 07
AIM: W.A.P for vector-matrix multiplication, measuring time using CUDA
events and using shared memory.
CODE:
%%cu
#include<stdio.h>
#include<stdlib.h>

__global__ void arradd(int* md, int* nd, int* pd, int size)
{
    // Get unique identification number for a given thread
    int myid = blockIdx.x * blockDim.x + threadIdx.x;

    pd[myid] = md[myid] * nd[myid];
}

int main()
{
int size = 2000 * sizeof(int);
int m[2000], n[2000], p[2000],*md, *nd,*pd;
int i=0;

//Initialize the arrays


for(i=0; i<2000; i++ )

{
m[i] = i;


n[i] = i;
p[i] = 0;
}

// Allocate memory on GPU and transfer the data


cudaMalloc(&md, size);
cudaMemcpy(md, m, size, cudaMemcpyHostToDevice);

cudaMalloc(&nd, size);
cudaMemcpy(nd, n, size, cudaMemcpyHostToDevice);

cudaMalloc(&pd, size);
dim3 DimGrid(10, 1);
dim3 DimBlock(200, 1);

arradd<<< DimGrid,DimBlock >>>(md,nd,pd,size);


cudaMemcpy(p, pd, size, cudaMemcpyDeviceToHost);
cudaFree(md);
cudaFree(nd);
cudaFree (pd);
for(i=0; i<2000; i++ )
{
printf("\t%d",p[i]);
}
}


OUTPUT

Figure: 7.1 vector matrix multiplication uses shared memory


PRACTICAL: 08
AIM: W.A.P to demonstrate vector-matrix multiplication using GPU constant
memory; the vector v is stored in GPU constant memory.
CODE:


Figure: 8.1 Code


OUTPUT

Figure: 8.2 Vector-matrix multiplication using GPU


PRACTICAL: 09
AIM: Analyse the program using NVIDIA profilers.
The nvidia-smi utility reports the GPU model, driver and CUDA versions, memory usage, and utilization; kernel-level profiling is done with NVIDIA's dedicated profilers such as nvprof or Nsight Systems.
CODE:
!nvidia-smi

OUTPUT

Figure: 9.1 NVIDIA Profile

