Laboratory Practice I (410246)


WELCOME

Laboratory Practice I
[410246]

Ms. R. S. Shishupal
Assistant Professor
Dept. of Computer Engineering,
Sinhgad Institute of Technology, Lonavala
rss.sit@sinhgad.edu
Cell. +91 9011909490



Assignment-1
High Performance Computing

Write a CUDA program that, given an N-element vector, finds:

• The maximum element in the vector
• The minimum element in the vector
• The arithmetic mean of the vector
• The standard deviation of the values in the vector
• Test for an input N and generate a randomized vector V of length
N (N should be large). The program should output each computed
value along with the time taken to find it (a minimal reduction
sketch follows below).
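A parallel reduction is the standard approach for all four statistics. Below is a minimal sketch for the maximum only, using illustrative names (maxReduce, BLOCK) that are not part of the assignment text, with CUDA event timing; the minimum is analogous with fminf(), and the mean and standard deviation follow from parallel sums of x and x².

#include <stdio.h>
#include <stdlib.h>
#include <float.h>
#include <cuda_runtime.h>

#define BLOCK 256

__global__ void maxReduce(const float *in, float *out, int n)
{
    __shared__ float cache[BLOCK];
    int tid = threadIdx.x;
    float v = -FLT_MAX;
    // grid-stride loop: handles any n with a fixed grid size
    for (int i = blockIdx.x * blockDim.x + tid; i < n; i += gridDim.x * blockDim.x)
        v = fmaxf(v, in[i]);
    cache[tid] = v;
    __syncthreads();
    // tree reduction within the block
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) cache[tid] = fmaxf(cache[tid], cache[tid + s]);
        __syncthreads();
    }
    if (tid == 0) out[blockIdx.x] = cache[0];   // one partial max per block
}

int main(void)
{
    int n = 1 << 20, blocks = 128;              // large N
    size_t bytes = n * sizeof(float);
    float *h = (float*)malloc(bytes), *d, *dpart;
    for (int i = 0; i < n; i++) h[i] = rand() / (float)RAND_MAX;
    cudaMalloc((void**)&d, bytes);
    cudaMalloc((void**)&dpart, blocks * sizeof(float));
    cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice);

    cudaEvent_t t0, t1;
    cudaEventCreate(&t0); cudaEventCreate(&t1);
    cudaEventRecord(t0);
    maxReduce<<<blocks, BLOCK>>>(d, dpart, n);
    maxReduce<<<1, BLOCK>>>(dpart, dpart, blocks);  // second pass over the partials
    cudaEventRecord(t1);
    cudaEventSynchronize(t1);

    float ms, result;
    cudaEventElapsedTime(&ms, t0, t1);
    cudaMemcpy(&result, dpart, sizeof(float), cudaMemcpyDeviceToHost);
    printf("max = %f (%.3f ms)\n", result, ms);

    cudaFree(d); cudaFree(dpart); free(h);
    return 0;
}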



Steps to Execute CUDA C on Google Colab

• Step 1: Check the currently installed CUDA version.
!nvcc --version

• Step 2: Remove any existing CUDA installation.
!apt-get --purge remove cuda nvidia* libnvidia-*
!dpkg -l | grep cuda- | awk '{print $2}' | xargs -n1 dpkg --purge
!apt-get remove cuda-*
!apt autoremove
!apt-get update



• Step 3: Download and install the CUDA 9.2 toolkit.
!wget https://developer.nvidia.com/compute/cuda/9.2/Prod/local_installers/cuda-repo-ubuntu1604-9-2-local_9.2.88-1_amd64 -O cuda-repo-ubuntu1604-9-2-local_9.2.88-1_amd64.deb
!dpkg -i cuda-repo-ubuntu1604-9-2-local_9.2.88-1_amd64.deb
!apt-key add /var/cuda-repo-9-2-local/7fa2af80.pub
!apt-get update
!apt-get install cuda-9.2



• Step 4: Verify the new installation.
!nvcc --version

• Step 5: Install the nvcc4jupyter plugin (use https; GitHub no longer serves the git:// protocol).
!pip install git+https://github.com/andreinechaev/nvcc4jupyter.git

• Step 6: Load the plugin.
%load_ext nvcc_plugin
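Once the extension is loaded, the plugin exposes a %%cu cell magic that compiles the cell with nvcc; a minimal check (assuming the plugin loaded cleanly):

%%cu
#include <stdio.h>
int main(void)
{
    printf("nvcc is working\n");
    return 0;
}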



Introduction to CUDA

Compute Unified Device Architecture


• CUDA is a parallel computing platform and
programming model created by NVIDIA.
• It is used to run general-purpose programs that can execute in
parallel on the GPU



CUDA C: The Basics
Terminology
• Host – The CPU and its memory (host memory)
• Device – The GPU and its memory (device memory)



CUDA C: Hello World!

#include <stdio.h>

int main(void)
{
    printf("Hello, World!\n");
    return 0;
}

• This basic program is just standard C that runs on the host
• NVIDIA's compiler (nvcc) will not complain about CUDA programs with no device code
• At its simplest, CUDA C is just C!





CUDA Programming Model
Definitions:

Device = GPU
Host = CPU
Kernel = function that runs on the device



Introduction to CUDA: Writing CUDA C Kernels

• CUDA kernels:
  – CUDA defines an extension to the C language used to invoke a kernel.
  – A kernel is just a name for a function that executes on the GPU.



Hello World! With Device Code

#include <stdio.h>

__global__ void kernel(void)
{
}

int main(void)
{
    kernel<<<1,1>>>();
    printf("Hello, World!\n");
    return 0;
}

Two notable additions to the original "Hello, World!": the __global__ qualifier, which marks kernel() as code that runs on the device, and the <<<1,1>>> angle-bracket syntax, which launches it from the host.


Basic CUDA API for dealing with device memory


• cudaMalloc(), cudaFree(), cudaMemcpy()
• Similar to their C equivalents, malloc(), free(), memcpy()
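The analogy is direct; a minimal sketch with illustrative names (h, h2, d, nbytes are not from the slides):

/* standard C (host memory) */
float *h = (float*)malloc(nbytes);
memcpy(h2, h, nbytes);
free(h);

/* CUDA equivalent (device memory) */
float *d;
cudaMalloc((void**)&d, nbytes);
cudaMemcpy(d, h, nbytes, cudaMemcpyHostToDevice);
cudaFree(d);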



cudaMalloc(void **pointer, size_t nbytes)
• Allocates an object in device global memory
• Requires two parameters
  – Address of a pointer to the allocated object
  – Size of the allocated object



cudaFree(void *pointer)
• Frees an object from device global memory
• Requires one parameter: pointer to the object to free

Example:
int WIDTH = 64;
float *Md;
int size = WIDTH * WIDTH * sizeof(float);
cudaMalloc((void**)&Md, size);
cudaFree(Md);
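Every CUDA API call returns a cudaError_t, and checking it is good practice; a minimal sketch extending the example above:

cudaError_t err = cudaMalloc((void**)&Md, size);
if (err != cudaSuccess) {
    printf("cudaMalloc failed: %s\n", cudaGetErrorString(err));
    return 1;
}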



cudaMemcpy(void *dst, void *src, size_t nbytes, enum cudaMemcpyKind direction);
• Memory data transfer
• Requires four parameters
  i. Pointer to destination
  ii. Pointer to source
  iii. Number of bytes copied
  iv. Type of transfer
    – Host to Host
    – Host to Device
    – Device to Host
    – Device to Device
• cudaMemcpy() is synchronous with respect to the host; cudaMemcpyAsync() is the asynchronous variant
Example:

//cudaMemcpy(destination pointer, source pointer, size in bytes, direction)
cudaMemcpy(Md, M, size, cudaMemcpyHostToDevice);
cudaMemcpy(M, Md, size, cudaMemcpyDeviceToHost);



Executing Code on the GPU
1. Allocate CPU Data Structure
2. Initialize Data on CPU
3. Allocate GPU Data Structure
   cudaMalloc(void **pointer, size_t nbytes)
4. Copy Data from CPU to GPU
   cudaMemcpy(void *dst, void *src, size_t nbytes, enum cudaMemcpyKind direction);
5. Define Execution Configuration
6. Run Kernel
7. CPU synchronizes with GPU
8. Copy Data from GPU to CPU
9. De-allocate GPU and CPU memory
   cudaFree(void *pointer)



Executing Code on the GPU

1. Allocate CPU Data Structure


//allocate mem on host
a=(int*)malloc(size);
b=(int*)malloc(size);
c=(int*)malloc(size);



2. Initialize Data on CPU
for(i=0;i<N;i++)
{
a[i]=i;
b[i]=i;
}



3. Allocate GPU Data Structure
//allocate mem on device
cudaMalloc((void**) &ad,size);
cudaMalloc((void**)&bd,size);
cudaMalloc((void**)&cd,size);



4. Copy Data from CPU to GPU
//copy from host to device
cudaMemcpy(ad,a,size,cudaMemcpyHostToDevice);
cudaMemcpy(bd,b,size,cudaMemcpyHostToDevice);



5. Define Execution Configuration
6. Run Kernel
7. CPU synchronizes with GPU
//launch add() kernel on GPU: N blocks of 1 thread each
add<<<N,1>>>(ad,bd,cd);
(Kernel launches are asynchronous; the cudaMemcpy in step 8 implicitly synchronizes the CPU with the GPU.)
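The <<<N,1>>> configuration launches N blocks of one thread each. A common alternative (a sketch, not from the slides; assumes N is a #define as in the surrounding fragments) uses several threads per block and computes a global index in the kernel:

__global__ void add(int *a, int *b, int *c)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < N) c[i] = a[i] + b[i];                  // guard against overrun
}

//launch enough 256-thread blocks to cover N elements
add<<<(N + 255) / 256, 256>>>(ad, bd, cd);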



8. Copy Data from GPU to CPU
//copy result to host
cudaMemcpy(c,cd,size,cudaMemcpyDeviceToHost);



9. De-allocate GPU and CPU memory
free(a);
free(b);
free(c);
cudaFree(ad);
cudaFree(bd);
cudaFree(cd);
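Assembled into one program, the nine steps look like this (a minimal sketch; error checking omitted):

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

#define N 512

__global__ void add(int *a, int *b, int *c)
{
    int i = blockIdx.x;                // one thread per block, as in <<<N,1>>>
    if (i < N) c[i] = a[i] + b[i];
}

int main(void)
{
    int *a, *b, *c, *ad, *bd, *cd;
    int size = N * sizeof(int);

    a = (int*)malloc(size);                              // 1. allocate on host
    b = (int*)malloc(size);
    c = (int*)malloc(size);
    for (int i = 0; i < N; i++) { a[i] = i; b[i] = i; }  // 2. initialize

    cudaMalloc((void**)&ad, size);                       // 3. allocate on device
    cudaMalloc((void**)&bd, size);
    cudaMalloc((void**)&cd, size);
    cudaMemcpy(ad, a, size, cudaMemcpyHostToDevice);     // 4. copy to device
    cudaMemcpy(bd, b, size, cudaMemcpyHostToDevice);

    add<<<N,1>>>(ad, bd, cd);                            // 5-6. configure and run
    cudaMemcpy(c, cd, size, cudaMemcpyDeviceToHost);     // 7-8. sync and copy back

    printf("c[0]=%d c[%d]=%d\n", c[0], N-1, c[N-1]);
    free(a); free(b); free(c);                           // 9. de-allocate
    cudaFree(ad); cudaFree(bd); cudaFree(cd);
    return 0;
}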



CUDA Programming Model

 A kernel is executed by a grid of thread blocks

 A thread block is a batch of threads that can cooperate with each other by:
   – Sharing data through shared memory
   – Synchronizing their execution

 Threads from different blocks cannot cooperate (a shared-memory sketch follows below)
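A minimal sketch of both cooperation mechanisms (illustrative kernel name; each 256-thread block reverses the elements it loaded):

__global__ void reverseBlock(int *d)
{
    __shared__ int s[256];            // shared memory: visible to the whole block
    int t = threadIdx.x;
    s[t] = d[t];
    __syncthreads();                  // wait until every thread has written
    d[t] = s[blockDim.x - 1 - t];     // read a value written by another thread
}

//launch as a single block of 256 threads
reverseBlock<<<1, 256>>>(d);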



CUDA Kernels and Threads

 Parallel portions of an application are executed on the device as kernels
   – One kernel is executed at a time (on older GPUs; newer GPUs can also run kernels concurrently)
   – Many threads execute each kernel

 Differences between CUDA and CPU threads
   – CUDA threads are extremely lightweight: very little creation overhead, instant switching
   – CUDA uses thousands of threads to achieve efficiency, while multi-core CPUs can use only a few (see the indexing sketch below)
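Launching thousands of lightweight threads is the normal pattern; a minimal sketch (illustrative names) in which roughly a million threads each handle one element:

__global__ void scale(float *x, float k, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // one element per thread
    if (i < n) x[i] *= k;
}

//4096 blocks x 256 threads = 1,048,576 threads
scale<<<4096, 256>>>(dx, 2.0f, 1 << 20);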

