Introduction CUDA

Visual Computing – GPU Computing
Introduction
Frauke Sprengel
Outline
1 Lecture Organization
2 Contents
3 Parallel computing
4 Why GPU Computing?
5 Examples and Applications
6 Literature
Hochschule Hannover, Fak. IV, F. Sprengel, Visual Computing – GPU Computing, Introduction, 2
1 Lecture Organization
Visual Computing
Prof. Dr. Frauke Sprengel

Phone 0511/9296-1812
Room 1H2.49
frauke.sprengel@hs-hannover.de
This year’s topic: GPU Computing
Assessment: oral examination in January; two larger exercises during the
semester
Lecture notes and exercises: Moodle course
2 Contents
Contents
Contents of lecture
• Why GPU Computing?

• Hardware architecture of GPUs
• CUDA C programming language
• OpenCL kernel language and API
• From CUDA to OpenCL
• Memory layout
• Concurrency and synchronization
• Graphics interoperability
• Application examples
• Computational thinking
Contents
Contents of exercises
• The focus of this course is on practical exercises!

• Programming with C++, CUDA, and OpenGL under Linux
• Programming with C/C++, OpenCL, OpenGL, Java under Linux
• Larger projects at the end of each of the two parts
• No graded project work, but the exercises are a prerequisite for the
examination
3 Parallel computing
• Why parallel computing?
• Identification of parallelizable problem parts
• Flynn’s Classical Taxonomy
Why parallel computing?
How can we run a program faster?

• run it on a faster processor
  – faster clock
    – shorter time for each computation
    – increases power consumption (limited)
  – more work per step (clock cycle)
    – also limited
• run it on more (simple) processors
Serial and Parallel Problems
Figure 1.1: Serial (left) and parallel (right) problems in principle (Barney 2016).
Problem: Identify independent problem parts which can be solved in parallel.
Matrix Addition
Example: Matrix Addition

/* every element c[i][j] is computed independently of all others */
for (i = 0; i < M; i++) {
    for (j = 0; j < N; j++) {
        c[i][j] = a[i][j] + b[i][j];
    }
}
Matrix Multiplication
Example: Matrix Multiplication

/* the elements c[i][j] are independent; the k loop accumulates a sum */
for (i = 0; i < M; i++) {
    for (j = 0; j < N; j++) {
        c[i][j] = 0;
        for (k = 0; k < P; k++)
            c[i][j] += a[i][k] * b[k][j];
    }
}
Inner Product
Example: Inner Product

/* every iteration updates the same variable dot */
dot = 0;
for (i = 0; i < N; i++) {
    dot += a[i] * b[i];
}
Review
Question (Flynn)
How is computing hardware classified according to Flynn?
Flynn’s Classical Taxonomy
According to Flynn (1966), multi-processor computer architectures are
distinguished according to
• instruction stream
• data stream
Figure 1.2: Flynn’s Taxonomy (Barney 2016).
Single Instruction - Single Data
• single instruction: one instruction stream is being acted on by the CPU
• single data: one data stream is being used as input
• deterministic execution
• serial computer: older generation mainframes, single processor/core PCs.
Figure 1.3: Single Instruction, Single Data (Flow, Example) (Barney 2016).
Multiple Instruction - Single Data
• multiple instruction: each processing unit operates on the data
independently via separate instruction streams.
• single data: a single data stream is fed into multiple processing units.
• of little practical relevance (no actual computers known)
Figure 1.4: Multiple Instruction, Single Data (Flow, Example) (Barney 2016).
Single Instruction - Multiple Data
• single instruction: all processing units execute the same instruction at
any given clock cycle
• multiple data: each processing unit can operate on a different data
element
• best suited for specialized problems characterized by a high degree of
regularity, such as graphics/image processing.
Figure 1.5: Single Instruction, Multiple Data (Flow, Example) (Barney 2016).
Single Instruction - Multiple Data
• synchronous (lockstep) and deterministic execution
• two varieties: processor arrays, vector pipelines (e.g. Cray X-MP)
• most modern computers, particularly those with graphics processor units
(GPUs), employ SIMD instructions and execution units.
Figure 1.6: Single Instruction, Multiple Data Example (Barney 2016).
Figure 1.7: Cray X-MP, GPU (Barney 2016).
Multiple Instruction - Multiple Data
• multiple instruction: every processor may be executing a different
instruction stream
• multiple data: every processor may be working with a different data
stream
• execution can be synchronous or asynchronous, deterministic or
non-deterministic
Figure 1.8: Multiple Instruction, Multiple Data (Flow, Example) (Barney 2016).
Multiple Instruction - Multiple Data
• currently the most common type of parallel computer: most modern
supercomputers, networked parallel computer clusters and "grids", and
multi-core PCs fall into this category
• note: many MIMD architectures also include SIMD execution
sub-components
• distinguish further: shared memory (e.g. multi-core PCs), distributed
memory (e.g. clusters), hybrid
Figure 1.9: Hybrid distributed-shared memory architectures (Barney 2016).
4 Why GPU Computing?
• CPU vs. GPU Performance
• GPU Architecture
• A Brief History of GPU Computing
Speculation
Question
Where does parallelism come into play when using GPUs?
Introduction
Floating Point Operations per Second
Figure 1.10: Peak processing speed (in GFLOP/s) for NVIDIA GPUs and Intel
CPUs (NVIDIA 2015a).
Introduction
Memory Bandwidth
Figure 1.11: Memory bandwidth (GB/s) for NVIDIA GPUs and Intel CPUs
(NVIDIA 2015a).
The reason behind the discrepancy in floating-point capability between the
CPU and the GPU is that the GPU is specialized for compute-intensive, highly
parallel computation - exactly what graphics rendering is about - and
therefore designed such that more transistors are devoted to data processing
rather than data caching and flow control, as schematically illustrated by
Figure 3 (NVIDIA 2015a).
CPU vs. GPU Architecture
CPU and GPU are designed very differently
• CPU: latency oriented cores
  – chip, core, local cache, registers, SIMD unit, control
• GPU: throughput oriented cores
  – chip, compute unit, cache/local memory, registers, SIMD unit, threading
Figure 1.12: Architecture of a CPU vs. a GPU (Nvidia 2016a).
CPU vs. GPU Architecture
CPUs: Latency Oriented Design
• Powerful ALU
  – Reduced operation latency
• Large caches
  – Convert long latency memory accesses to short latency cache accesses
• Sophisticated control
  – Branch prediction for reduced branch latency
  – Data forwarding for reduced data latency
Figure 1.13: Architecture of a quad-core CPU (Nvidia 2016a).
CPU vs. GPU Architecture
GPUs: Throughput Oriented Design
• Small caches
  – To boost memory throughput
• Simple control
  – No branch prediction
  – No data forwarding
• Energy efficient ALUs
  – Many, long latency but heavily pipelined for high throughput
• Require massive number of threads to tolerate latencies
  – Threading logic
  – Thread state
Figure 1.14: Architecture of a GPU with 8 processors, each of which consists
of 16 cores (Nvidia 2016a).
A Brief History of GPU Computing
2001 NVIDIA GeForce 3 allows shader programming in assembler.
2002 High-level shading languages GLSL (OpenGL),
     HLSL (DirectX 9), Cg (NVIDIA)
2004 GPGPU (general purpose computation on GPUs) course at
     SIGGRAPH 2004 using shading languages
     (http://gpgpu.org/s2004);
     BrookGPU project at Stanford University
2006 Unified shader architecture;
     GPU computing languages CUDA C (NVIDIA),
     Stream SDK/Brook+ (ATI/AMD)
2008 Folding@Home project at Stanford University, using GPU
     computing techniques on internet-connected PCs for distributed
     simulation of protein folding
2009 GPU computing languages DirectCompute (DirectX 11),
     OpenCL (Khronos, cross-platform computing language)
GPGPU and GPU computing
GPGPU: General purpose computation on GPUs using shading languages
(e.g., GLSL) or the graphics API itself (e.g., OpenGL functions)
GPU computing: General purpose computation on GPUs using computing
languages (e.g., CUDA)
5 Examples and Applications
CUDA C Examples I
Figure 1.15: Fluid simulation (left) and smoke particles (right) from the
CUDA SDK.
CUDA C Examples II
Figure 1.16: Ocean (left) and Mandelbrot fractal (right) from the CUDA SDK.
Real-world Applications of GPU Computing
• Engineering: computational fluid dynamics, FEM simulations

• Finance: option pricing, market analysis
• Geology: oil and gas exploration
• Medical imaging: volume rendering, CT reconstruction
• Life sciences: gene sequence alignment, protein folding
• “Consumer applications”: Photoshop, 3ds Max, Final Cut Pro, Matlab,
Mathematica
6 Literature
• Books
• Websites and other Sources
Literature: Books and Reports I
Kirk, David B. and Wen-mei W. Hwu (2010).

Programming Massively Parallel Processors.
Hands-on textbook on CUDA, supported by NVIDIA.
Burlington, MA: Morgan Kaufmann.
Nguyen, Hubert, ed. (2008).
GPU Gems 3.
Chapters 29 to 41 on GPU computing (mostly) with CUDA, available online at
http://http.developer.nvidia.com/GPUGems3/gpugems3_part01.html.
Upper Saddle River, NJ: Addison-Wesley.
Literature: Books and Reports II
Tsuchiyama, Ryoji, Takashi Nakamura, Takuro Iizuka, Akihiro Asahara,
and Satoshi Miki (2012).
The OpenCL Programming Book – Parallel Programming for MultiCore
CPU and GPU.
2010 version available as html from
https://www.fixstars.com/en/opencl/book/OpenCLProgrammingBook/contents/.
Japan: Fixstars Corporation.
Munshi, Aaftab, Benedict Gaster, Timothy G. Mattson, James Fung,
and Dan Ginsburg (2011).
OpenCL Programming Guide.
Addison-Wesley Professional.
648 pp.
ISBN: 0321749642.
Literature: Books and Reports III
Banger, Ravishekhar and Koushik Bhattacharyya (2013).

OpenCL Programming by Example.
Packt Publishing.
Pharr, Matt, ed. (2005).
GPU Gems 2.
Chapters 29 to 36 on GPGPU before CUDA, available online at
http://http.developer.nvidia.com/GPUGems2/gpugems2_part01.html.
Upper Saddle River, NJ: Addison-Wesley.
Literature: Websites and Articles I
Barney, Blaise (2016).

Introduction to Parallel Computing.
https://computing.llnl.gov/tutorials/parallel_comp.
NVIDIA (2015a).
NVIDIA CUDA C Programming Guide. Version 7.5.
Tech. rep.
Moodle course,
http://docs.nvidia.com/cuda/cuda-c-programming-guide/.
NVIDIA Corp.
Nvidia (2016a).
GPU Teaching Kit.
Literature: Websites and Articles II
Segal, Mark and Kurt Akeley (2015).
The OpenGL Graphics System: A Specification (Version 4.5).
Tech. rep.
The definitive OpenGL reference,
https://www.opengl.org/documentation/.
Khronos Group.
Wikipedia (2016).
OpenCL.
en.wikipedia.org/wiki/OpenCL.
AMD (2010a).
Introduction to OpenCL Programming.
developer.amd.com/zones/OpenCLZone/courses/Documents/Introduction_to_OpenCL_Programming(201005).pdf.
Literature: Websites and Articles III
AMD (2010b).
AMD OpenCL Tutorial at SAAHPC2010 (Benedict R. Gaster and Lee
Howes).
developer.amd.com/zones/OpenCLZone/courses/Documents/AMD_OpenCL_Tutorial_SAAHPC2010.pdf.
NVIDIA (2015b).
NVIDIA CUDA Runtime API (Reference Manual). Version 7.5.
Tech. rep.
Moodle course,
http://docs.nvidia.com/cuda/cuda-runtime-api/.
NVIDIA Corp.
Literature: Websites and Articles IV
NVIDIA (2015c).
NVIDIA CUDA C Best Practices Guide. Version 7.5.
Tech. rep.
Moodle course,
http://docs.nvidia.com/cuda/cuda-c-best-practices-guide/.
NVIDIA Corp.
NVIDIA (2015d).
The CUDA Compiler Driver NVCC.
Tech. rep.
Useful details and options of NVCC,
http://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc.
NVIDIA Corp.
Literature: Websites and Articles V
Kessenich, John (2016).
The OpenGL Shading Language (Language Version 4.4).
Tech. rep.
The definitive GLSL reference,
https://www.opengl.org/documentation/glsl/.
Khronos Group.
GPGPU.org (2015).
General-Purpose Computation on Graphics Hardware.
gpgpu.org.
Khronos Group (2016).
OpenCL Specification.
www.khronos.org/opencl.
Literature: Websites and Articles VI
Nvidia (2016b).
Nvidia’s OpenCL samples.
https://developer.nvidia.com/opencl.
AMD (2016).
AMD’s OpenCL tutorials, drivers, SDK and more.
developer.amd.com/zones/OpenCLZone.
jogamp.org (2016).
OpenGL and OpenCL bindings for Java (high level).
jogamp.org.
jocl.org (2016).
OpenCL bindings for Java (rather close to C).
jocl.org.
Literature: Websites and Articles VII
AMD (2010c).
ATI Stream SDK OpenCL Programming Guide.
developer.amd.com/gpu_assets/ATI_Stream_SDK_OpenCL_Programming_Guide.pdf.
