Laboratory Practice I (410246)


WELCOME

Laboratory Practice I
[410246]

Ms. R. S. Shishupal
Assistant Professor
Dept. of Computer Engineering,
Sinhgad Institute of Technology, Lonavala
rss.sit@sinhgad.edu
Cell. +91 9011909490



Assignment-1
High Performance Computing

Write a CUDA program that, given an N-element vector, finds:

• The maximum element in the vector
• The minimum element in the vector
• The arithmetic mean of the vector
• The standard deviation of the values in the vector
• Test for an input N and generate a randomized vector V of length
N (N should be large). The program should output each computed
value along with the time taken to find it (a minimal reduction
sketch follows below).
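A parallel reduction is the standard approach for all four statistics. Below is a minimal sketch for the maximum only, using illustrative names (maxReduce, BLOCK) that are not part of the assignment text, with CUDA event timing; the minimum is analogous with fminf(), and the mean and standard deviation follow from parallel sums of x and x².

#include <stdio.h>
#include <stdlib.h>
#include <float.h>
#include <cuda_runtime.h>

#define BLOCK 256

__global__ void maxReduce(const float *in, float *out, int n)
{
    __shared__ float cache[BLOCK];
    int tid = threadIdx.x;
    float v = -FLT_MAX;
    // grid-stride loop: handles any n with a fixed grid size
    for (int i = blockIdx.x * blockDim.x + tid; i < n; i += gridDim.x * blockDim.x)
        v = fmaxf(v, in[i]);
    cache[tid] = v;
    __syncthreads();
    // tree reduction within the block
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) cache[tid] = fmaxf(cache[tid], cache[tid + s]);
        __syncthreads();
    }
    if (tid == 0) out[blockIdx.x] = cache[0];   // one partial max per block
}

int main(void)
{
    int n = 1 << 20, blocks = 128;              // large N
    size_t bytes = n * sizeof(float);
    float *h = (float*)malloc(bytes), *d, *dpart;
    for (int i = 0; i < n; i++) h[i] = rand() / (float)RAND_MAX;
    cudaMalloc((void**)&d, bytes);
    cudaMalloc((void**)&dpart, blocks * sizeof(float));
    cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice);

    cudaEvent_t t0, t1;
    cudaEventCreate(&t0); cudaEventCreate(&t1);
    cudaEventRecord(t0);
    maxReduce<<<blocks, BLOCK>>>(d, dpart, n);
    maxReduce<<<1, BLOCK>>>(dpart, dpart, blocks);  // second pass over the partials
    cudaEventRecord(t1);
    cudaEventSynchronize(t1);

    float ms, result;
    cudaEventElapsedTime(&ms, t0, t1);
    cudaMemcpy(&result, dpart, sizeof(float), cudaMemcpyDeviceToHost);
    printf("max = %f (%.3f ms)\n", result, ms);

    cudaFree(d); cudaFree(dpart); free(h);
    return 0;
}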



Steps to Execute CUDA C on Google Colab

• Step 1: Check the currently installed CUDA version.
!nvcc --version

• Step 2: Remove any existing CUDA installation.
!apt-get --purge remove cuda nvidia* libnvidia-*
!dpkg -l | grep cuda- | awk '{print $2}' | xargs -n1 dpkg --purge
!apt-get remove cuda-*
!apt autoremove
!apt-get update



• Step 3: Download and install the CUDA 9.2 toolkit.
!wget https://developer.nvidia.com/compute/cuda/9.2/Prod/local_installers/cuda-repo-ubuntu1604-9-2-local_9.2.88-1_amd64 -O cuda-repo-ubuntu1604-9-2-local_9.2.88-1_amd64.deb
!dpkg -i cuda-repo-ubuntu1604-9-2-local_9.2.88-1_amd64.deb
!apt-key add /var/cuda-repo-9-2-local/7fa2af80.pub
!apt-get update
!apt-get install cuda-9.2



• Step 4: Verify the new installation.
!nvcc --version

• Step 5: Install the nvcc4jupyter plugin (use https; GitHub no longer serves the git:// protocol).
!pip install git+https://github.com/andreinechaev/nvcc4jupyter.git

• Step 6: Load the plugin.
%load_ext nvcc_plugin
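Once the extension is loaded, the plugin exposes a %%cu cell magic that compiles the cell with nvcc; a minimal check (assuming the plugin loaded cleanly):

%%cu
#include <stdio.h>
int main(void)
{
    printf("nvcc is working\n");
    return 0;
}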



Introduction to CUDA

Compute Unified Device Architecture


• CUDA is a parallel computing platform and
programming model created by NVIDIA.
• It is used to run general-purpose programs that can execute in
parallel on the GPU



CUDA C: The Basics
Terminology
• Host – The CPU and its memory (host memory)
• Device – The GPU and its memory (device memory)



CUDA C: Hello World!

#include <stdio.h>

int main(void)
{
    printf("Hello, World!\n");
    return 0;
}

• This basic program is just standard C that runs on the host
• NVIDIA's compiler (nvcc) will not complain about CUDA programs with no device code
• At its simplest, CUDA C is just C!





CUDA Programming Model
Definitions:

Device = GPU
Host = CPU
Kernel = function that runs on the device



Introduction to CUDA: Writing CUDA C Kernels

• CUDA kernels:
  – CUDA defines an extension to the C language used to invoke a kernel.
  – A kernel is just a name for a function that executes on the GPU.



Hello World! With Device Code

#include <stdio.h>

__global__ void kernel(void)
{
}

int main(void)
{
    kernel<<<1,1>>>();
    printf("Hello, World!\n");
    return 0;
}

Two notable additions to the original "Hello, World!": the __global__ qualifier, which marks kernel() as code that runs on the device, and the <<<1,1>>> angle-bracket syntax, which launches it from the host.


Basic CUDA API for dealing with device memory


• cudaMalloc(), cudaFree(), cudaMemcpy()
• Similar to their C equivalents, malloc(), free(), memcpy()
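The analogy is direct; a minimal sketch with illustrative names (h, h2, d, nbytes are not from the slides):

/* standard C (host memory) */
float *h = (float*)malloc(nbytes);
memcpy(h2, h, nbytes);
free(h);

/* CUDA equivalent (device memory) */
float *d;
cudaMalloc((void**)&d, nbytes);
cudaMemcpy(d, h, nbytes, cudaMemcpyHostToDevice);
cudaFree(d);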



cudaMalloc(void **pointer, size_t nbytes)
• Allocates an object in device global memory
• Requires two parameters
  – Address of a pointer to the allocated object
  – Size of the allocated object



cudaFree(void *pointer)
• Frees an object from device global memory
• Requires one parameter: pointer to the object to free

Example:
int WIDTH = 64;
float *Md;
int size = WIDTH * WIDTH * sizeof(float);
cudaMalloc((void**)&Md, size);
cudaFree(Md);
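Every CUDA API call returns a cudaError_t, and checking it is good practice; a minimal sketch extending the example above:

cudaError_t err = cudaMalloc((void**)&Md, size);
if (err != cudaSuccess) {
    printf("cudaMalloc failed: %s\n", cudaGetErrorString(err));
    return 1;
}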



cudaMemcpy(void *dst, void *src, size_t nbytes, enum cudaMemcpyKind direction);
• Memory data transfer
• Requires four parameters
  i. Pointer to destination
  ii. Pointer to source
  iii. Number of bytes copied
  iv. Type of transfer
    – Host to Host
    – Host to Device
    – Device to Host
    – Device to Device
• cudaMemcpy() is synchronous with respect to the host; cudaMemcpyAsync() is the asynchronous variant
Example:

//cudaMemcpy(destination pointer, source pointer, size in bytes, direction)
cudaMemcpy(Md, M, size, cudaMemcpyHostToDevice);
cudaMemcpy(M, Md, size, cudaMemcpyDeviceToHost);



Executing Code on the GPU
1. Allocate CPU Data Structure
2. Initialize Data on CPU
3. Allocate GPU Data Structure
   cudaMalloc(void **pointer, size_t nbytes)
4. Copy Data from CPU to GPU
   cudaMemcpy(void *dst, void *src, size_t nbytes, enum cudaMemcpyKind direction);
5. Define Execution Configuration
6. Run Kernel
7. CPU synchronizes with GPU
8. Copy Data from GPU to CPU
9. De-allocate GPU and CPU memory
   cudaFree(void *pointer)



Executing Code on the GPU

1. Allocate CPU Data Structure


//allocate mem on host
a=(int*)malloc(size);
b=(int*)malloc(size);
c=(int*)malloc(size);



2. Initialize Data on CPU
for(i=0;i<N;i++)
{
a[i]=i;
b[i]=i;
}



3. Allocate GPU Data Structure
//allocate mem on device
cudaMalloc((void**) &ad,size);
cudaMalloc((void**)&bd,size);
cudaMalloc((void**)&cd,size);



4. Copy Data from CPU to GPU
//copy from host to device
cudaMemcpy(ad,a,size,cudaMemcpyHostToDevice);
cudaMemcpy(bd,b,size,cudaMemcpyHostToDevice);



5. Define Execution Configuration
6. Run Kernel
7. CPU synchronizes with GPU
//launch add() kernel on GPU: N blocks of 1 thread each
add<<<N,1>>>(ad,bd,cd);
(Kernel launches are asynchronous; the cudaMemcpy in step 8 implicitly synchronizes the CPU with the GPU.)
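The <<<N,1>>> configuration launches N blocks of one thread each. A common alternative (a sketch, not from the slides; assumes N is a #define as in the surrounding fragments) uses several threads per block and computes a global index in the kernel:

__global__ void add(int *a, int *b, int *c)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < N) c[i] = a[i] + b[i];                  // guard against overrun
}

//launch enough 256-thread blocks to cover N elements
add<<<(N + 255) / 256, 256>>>(ad, bd, cd);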



8. Copy Data from GPU to CPU
//copy result to host
cudaMemcpy(c,cd,size,cudaMemcpyDeviceToHost);



9. De-allocate GPU and CPU memory
free(a);
free(b);
free(c);
cudaFree(ad);
cudaFree(bd);
cudaFree(cd);
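Assembled into one program, the nine steps look like this (a minimal sketch; error checking omitted):

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

#define N 512

__global__ void add(int *a, int *b, int *c)
{
    int i = blockIdx.x;                // one thread per block, as in <<<N,1>>>
    if (i < N) c[i] = a[i] + b[i];
}

int main(void)
{
    int *a, *b, *c, *ad, *bd, *cd;
    int size = N * sizeof(int);

    a = (int*)malloc(size);                              // 1. allocate on host
    b = (int*)malloc(size);
    c = (int*)malloc(size);
    for (int i = 0; i < N; i++) { a[i] = i; b[i] = i; }  // 2. initialize

    cudaMalloc((void**)&ad, size);                       // 3. allocate on device
    cudaMalloc((void**)&bd, size);
    cudaMalloc((void**)&cd, size);
    cudaMemcpy(ad, a, size, cudaMemcpyHostToDevice);     // 4. copy to device
    cudaMemcpy(bd, b, size, cudaMemcpyHostToDevice);

    add<<<N,1>>>(ad, bd, cd);                            // 5-6. configure and run
    cudaMemcpy(c, cd, size, cudaMemcpyDeviceToHost);     // 7-8. sync and copy back

    printf("c[0]=%d c[%d]=%d\n", c[0], N-1, c[N-1]);
    free(a); free(b); free(c);                           // 9. de-allocate
    cudaFree(ad); cudaFree(bd); cudaFree(cd);
    return 0;
}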



CUDA Programming Model

 A kernel is executed by a grid of thread blocks

 A thread block is a batch of threads that can cooperate with each other by:
   – Sharing data through shared memory
   – Synchronizing their execution

 Threads from different blocks cannot cooperate (a shared-memory sketch follows below)
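A minimal sketch of both cooperation mechanisms (illustrative kernel name; each 256-thread block reverses the elements it loaded):

__global__ void reverseBlock(int *d)
{
    __shared__ int s[256];            // shared memory: visible to the whole block
    int t = threadIdx.x;
    s[t] = d[t];
    __syncthreads();                  // wait until every thread has written
    d[t] = s[blockDim.x - 1 - t];     // read a value written by another thread
}

//launch as a single block of 256 threads
reverseBlock<<<1, 256>>>(d);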



CUDA Kernels and Threads

 Parallel portions of an application are executed on the device as kernels
   – One kernel is executed at a time (on older GPUs; newer GPUs can also run kernels concurrently)
   – Many threads execute each kernel

 Differences between CUDA and CPU threads
   – CUDA threads are extremely lightweight: very little creation overhead, instant switching
   – CUDA uses thousands of threads to achieve efficiency, while multi-core CPUs can use only a few (see the indexing sketch below)
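Launching thousands of lightweight threads is the normal pattern; a minimal sketch (illustrative names) in which roughly a million threads each handle one element:

__global__ void scale(float *x, float k, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // one element per thread
    if (i < n) x[i] *= k;
}

//4096 blocks x 256 threads = 1,048,576 threads
scale<<<4096, 256>>>(dx, 2.0f, 1 << 20);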

