
TECHNOSOFT REPORT

CUDA
A technology that can make supercomputers
personal…

‘The soul of a supercomputer is in the body of a GPU’

SUBMITTED BY
KUNAL GARG

CSE(A)-VI Sem.

2507276

UIET KU

Kurukshetra, India
INDEX
Abstract

Supercomputer

GPU

GPU Computing

History of GPU Computing

GPGPU

CUDA

Advantages of CUDA

CUDA Programming Model

CUDA Architecture

Tesla 10-Series

Tesla 10-Series Architecture

Thread Hierarchy

Execution Model

Warps and Half-Warps

GPU Memory Allocation/Release

Next Generation CUDA Architecture

Applications

Why should I use a GPU as a processor?

Bibliography
ABSTRACT
Today’s supercomputers are the ordinary computers of tomorrow, and GPUs are the bridge
between the two today. A graphics processing unit, or GPU, is a specialized processor that
offloads 3D or 2D graphics rendering from the microprocessor. GPU computing is the use of a
GPU to do general-purpose scientific and engineering computing.

The model for GPU computing is to use a CPU and GPU together in a heterogeneous computing
model. The sequential part of the application runs on the CPU and the computationally intensive
part runs on the GPU. From the user’s perspective, the application simply runs faster because it
harnesses the high performance of the GPU. Computing is evolving from
"central processing" on the CPU to "co-processing" on the CPU and GPU. To enable this new
computing paradigm, NVIDIA invented the CUDA (Compute Unified Device Architecture)
parallel computing architecture.

The NVIDIA® Tesla™ 20-series is designed from the ground up for high-performance
computing and is based on the next-generation CUDA GPU architecture. When compared to the
latest quad-core CPU, Tesla 20-series GPU computing processors deliver equivalent performance
at 1/20th the power consumption and 1/10th the cost; in other words, they put the power of a
supercomputer in a PC-based workstation.

“By 2012, three of the top five supercomputers in the world will have graphics processors using
parallel computing applications for computing,” Nvidia chief scientist David Kirk predicted.
SUPERCOMPUTER
A supercomputer is a computer that is at the frontline of current processing capacity, particularly
speed of calculation. Today, supercomputers are typically one-of-a-kind custom designs
produced by "traditional" companies such as Cray, IBM and Hewlett-Packard, who had
purchased many of the 1980s companies to gain their experience. As of November 2009, the
Cray Jaguar is the fastest supercomputer in the world.

The term supercomputer itself is rather fluid, and today's supercomputer tends to become
tomorrow's ordinary computer.

Supercomputers are used for highly calculation-intensive tasks such as problems involving
quantum physics, weather forecasting, climate research, molecular modeling (computing the
structures and properties of chemical compounds, biological macromolecules, polymers, and
crystals), and physical simulations (such as simulating airplanes in wind tunnels, simulating the
detonation of nuclear weapons, and researching nuclear fusion).

In November 2009, the AMD Opteron-based Cray XT5 Jaguar at the Oak Ridge National
Laboratory was announced as the fastest operational supercomputer, with a sustained processing
rate of 1.759 PFLOPS.

GPU
A graphics processing unit or GPU (also occasionally called visual processing unit or VPU) is a
specialized processor that offloads 3D or 2D graphics rendering from the microprocessor. It is
used in embedded systems, mobile phones, personal computers, workstations, and game
consoles. Modern GPUs are very efficient at manipulating computer graphics, and their highly
parallel structure makes them more effective than general-purpose CPUs for a range of complex
algorithms. In a personal computer, a GPU can be present on a video card, or it can be on the
motherboard. More than 90% of new desktop and notebook computers have integrated GPUs,
which are usually far less powerful than those on a dedicated video card.

GPUs can be of the following types:

1. Dedicated video cards: these have their own dedicated memory.
2. Integrated graphics processors: these share a portion of the system’s RAM.
3. Hybrid: these share system RAM while having their own small cache memory.

GPU Computing
GPU computing is the use of a GPU (graphics processing unit) to do general purpose scientific
and engineering computing.
The model for GPU computing is to use a CPU and GPU together in a heterogeneous computing
model. The sequential part of the application runs on the CPU and the computationally intensive
part runs on the GPU. From the user’s perspective, the application simply runs faster because it
harnesses the high performance of the GPU.

The application developer has to modify the application to take the compute-intensive kernels
and map them to the GPU; the rest of the application remains on the CPU. Mapping a function
to the GPU involves rewriting it to expose its parallelism and adding C-language keywords to
move data to and from the GPU, as the sketch below illustrates.
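
As a minimal sketch of this mapping (the function name and sizes are illustrative, not taken from any particular application), consider scaling an array: the CPU version is a sequential loop, while the GPU version drops the loop and assigns one element to each thread.

    // Plain C version: runs sequentially on the CPU.
    void scale_cpu(float *data, float factor, int n)
    {
        for (int i = 0; i < n; ++i)
            data[i] *= factor;
    }

    // The same function mapped to the GPU. The __global__ keyword marks it
    // as a kernel; the loop disappears and each thread handles one element.
    __global__ void scale_gpu(float *data, float factor, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;  // this thread's element
        if (i < n)                                      // the grid may overshoot n
            data[i] *= factor;
    }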

GPU computing is enabled by the massively parallel architecture of NVIDIA’s GPUs, called the
CUDA architecture. The CUDA architecture consists of hundreds of processor cores that operate
together to crunch through the application’s data set.

The Tesla 10-series GPU is the second-generation CUDA architecture, with features optimized
for scientific applications such as hardware support for IEEE-standard double-precision floating
point, local data caches in the form of shared memory dispersed throughout the GPU, coalesced
memory accesses, and so on.

"GPUs have evolved to the point where many real-world applications are easily implemented on
them and run significantly faster than on multi-core systems. Future computing architectures will
be hybrid systems with parallel-core GPUs working in tandem with multi-core CPUs," Prof.
Jack Dongarra predicted.

HISTORY OF GPU COMPUTING

Graphics chips started as fixed-function graphics pipelines. Over the years, these chips became
increasingly programmable, which led NVIDIA to introduce the first GPU, or Graphics
Processing Unit. In the 1999-2000 timeframe, computer scientists, along with researchers in
fields such as medical imaging and electromagnetics, started using GPUs to run general-purpose
computational applications. They found that the excellent floating-point performance of GPUs
led to a huge performance boost for a range of scientific applications. This was the advent of the
movement called GPGPU, or General-Purpose computing on GPUs.

The problem was that GPGPU required using graphics programming languages like OpenGL and
Cg to program the GPU. Developers had to make their scientific applications look like graphics
applications and map them onto problems that drew triangles and polygons. This limited the
accessibility of the GPU’s tremendous performance for science.

NVIDIA realized the potential of bringing this performance to the larger scientific community
and decided to invest in making the GPU fully programmable for scientific applications, adding
support for high-level languages like C and C++. This led to the CUDA architecture for the
GPU.

GPGPU

General-purpose computing on graphics processing units (GPGPU) is the technique of using a
GPU, which typically handles computation only for computer graphics, to perform computation
in applications traditionally handled by the CPU.

Because of their nature, GPUs are only effective at tackling problems that can be solved using
stream processing, and the hardware can only be used in certain ways.

CUDA
CUDA (an acronym for Compute Unified Device Architecture) is a parallel computing
architecture developed by NVIDIA. CUDA is the computing engine in NVIDIA graphics
processing units (GPUs) that is accessible to software developers through industry-standard
programming languages. Programmers use ‘C for CUDA’ (C with NVIDIA extensions),
compiled through a PathScale Open64 C compiler, to code algorithms for execution on the GPU.
The CUDA architecture shares a range of computational interfaces with two competitors: the
Khronos Group’s Open Computing Language (OpenCL) and Microsoft’s DirectCompute.
Third-party wrappers are also available for Python, Fortran, Java and MATLAB.

The latest drivers all contain the necessary CUDA components. CUDA works with all NVIDIA
GPUs from the G8X series onwards, including the GeForce, Quadro and Tesla lines. NVIDIA
states that programs developed for the GeForce 8 series will also work without modification on
all future NVIDIA video cards, due to binary compatibility. CUDA gives developers access to
the native instruction set and memory of the parallel computational elements in CUDA GPUs.
Using CUDA, the latest NVIDIA GPUs effectively become open architectures like CPUs. Unlike
CPUs, however, GPUs have a parallel "many-core" architecture that can run thousands of threads
simultaneously; if an application is suited to this kind of architecture, the GPU can offer large
performance benefits.

Advantages of CUDA
CUDA has several advantages over traditional general-purpose computation on GPUs (GPGPU)
using graphics APIs:

 Scattered reads – code can read from arbitrary addresses in memory.

 Shared memory – CUDA exposes a fast shared memory region (16 KB per multiprocessor)
that can be shared amongst threads. This can be used as a user-managed cache, enabling
higher bandwidth than is possible using texture lookups.

 Faster downloads and readbacks to and from the GPU.

 Full support for integer and bitwise operations, including integer texture lookups.

CUDA Programming Model

 Parallel code (a kernel) is launched and executed on the device by many threads.

 Threads are grouped into thread blocks.

 Parallel code is written for a single thread.

 Each thread is free to execute a unique code path.

 Built-in thread and block ID variables identify each thread, as in the sketch below.
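
A minimal sketch tying these points together (the kernel name and block size are illustrative): the kernel is written from the point of view of a single thread, and the built-in blockIdx, blockDim and threadIdx variables give every thread a unique element to work on.

    // Kernel: the same code runs in every thread of the launch.
    __global__ void vec_add(const float *a, const float *b, float *c, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;  // unique global index
        if (i < n)
            c[i] = a[i] + b[i];
    }

    // Host-side launch: n threads grouped into thread blocks of 256.
    // (a, b and c must already point to device memory.)
    void launch_vec_add(const float *a, const float *b, float *c, int n)
    {
        int threadsPerBlock = 256;
        int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;  // round up
        vec_add<<<blocks, threadsPerBlock>>>(a, b, c, n);
    }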

CUDA Architecture
The CUDA architecture consists of several components:

 Parallel compute engines inside NVIDIA GPUs

 OS kernel-level support for hardware initialization, configuration, etc.

 User-mode driver, which provides a device-level API for developers

 PTX instruction set architecture (ISA) for parallel computing kernels and functions

Tesla 10-Series
CUDA computing with the Tesla T10:

 240 single-precision (SP) processors at 1.45 GHz: roughly 1 TFLOPS peak
(240 × 1.45 GHz × 3 floating-point operations per clock ≈ 1,044 GFLOPS)

 30 double-precision (DP) processors at 1.44 GHz: roughly 86 GFLOPS peak
(30 × 1.44 GHz × 2 operations per clock ≈ 86.4 GFLOPS)

 128 threads per processor: 30,720 threads in total (240 × 128)

Tesla 10-Series Architecture

 240 thread processors execute kernel threads.

 30 multiprocessors, each of which contains:

   8 thread processors

   one double-precision unit

   shared memory that enables thread cooperation

Thread Hierarchy
 Threads launched for a parallel section are partitioned into thread blocks.

 A grid is the set of all blocks for a given launch.

 A thread block is a group of threads that can:

   synchronize their execution

   communicate via shared memory, as in the sketch below
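
The sketch below shows both abilities at once (names are illustrative, and the block size is assumed to be 256 threads): each block cooperatively sums its slice of the input through shared memory, synchronizing with __syncthreads() between steps.

    // Each block sums 256 consecutive elements of 'in' and writes one
    // partial sum per block to 'out'. Launch with 256 threads per block.
    __global__ void block_sum(const float *in, float *out)
    {
        __shared__ float buf[256];          // visible to all threads in this block
        int tid = threadIdx.x;
        buf[tid] = in[blockIdx.x * blockDim.x + tid];
        __syncthreads();                    // wait until every thread has stored

        // Tree reduction within the block.
        for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
            if (tid < stride)
                buf[tid] += buf[tid + stride];
            __syncthreads();                // finish each step before the next
        }

        if (tid == 0)
            out[blockIdx.x] = buf[0];       // thread 0 writes the block's result
    }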


Execution Model
Thread blocks from a kernel launch are distributed across the GPU’s multiprocessors; each
multiprocessor executes the threads of its resident blocks directly in hardware, so thread creation
and scheduling cost essentially nothing.

Warps and Half-Warps
A multiprocessor executes threads in groups of 32 called warps, which run in lockstep on its
thread processors. A half-warp (16 threads) is the granularity at which the Tesla 10-series
hardware coalesces memory accesses, which is why contiguous, aligned access patterns matter
for performance.

GPU Memory Allocation / Release


Host (CPU) code manages device (GPU) memory through the CUDA runtime calls below; a
usage sketch follows the list:

 cudaMalloc(void **pointer, size_t nbytes)

 cudaMemset(void *pointer, int value, size_t count)

 cudaFree(void *pointer)
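
A minimal usage sketch of these calls (the size is illustrative); note that only the pointer variable lives on the CPU, while the memory it points to lives on the GPU.

    #include <cuda_runtime.h>

    int main(void)
    {
        int n = 1024;                         // illustrative element count
        size_t nbytes = n * sizeof(float);
        float *d_data = 0;                    // will receive a device pointer

        cudaMalloc((void **)&d_data, nbytes); // allocate GPU memory
        cudaMemset(d_data, 0, nbytes);        // initialize it to zero

        /* ... kernel launches that read and write d_data ... */

        cudaFree(d_data);                     // release GPU memory
        return 0;
    }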

Next Generation CUDA Architecture


The next-generation CUDA architecture, codenamed "Fermi", is the most advanced GPU
architecture NVIDIA has built to date. Its features include:

 3.2 billion transistors

 512 CUDA cores, with optimized performance and accuracy and up to 8x faster double
precision

 NVIDIA Parallel DataCache technology – the first GPU architecture to support a true cache
hierarchy in combination with on-chip shared memory

 NVIDIA GigaThread engine – increased efficiency through concurrent kernel execution

 ECC support – detects and corrects errors before they affect the system

Applications
 Accelerated rendering of 3D graphics
 Video Forensics
 Molecular Dynamics
 Computational Chemistry
 Life Sciences
 Bioinformatics
 Electrodynamics
 Medical Imaging
 Oil and Gas
 Weather and Ocean Modeling
 Electronic Design Automation
 Video Imaging
 Video Acceleration

Why should I use a GPU as a processor?

 When compared to the latest quad-core CPU, Tesla 20-series GPU computing processors
deliver equivalent performance at 1/20th the power consumption and 1/10th the cost.

 A representative computational fluid dynamics problem takes:

   9 minutes on a Tesla S870 (4 GPUs)

   12 hours on one 2.5 GHz CPU core

 Raw throughput tells the same story: the Intel Core i7-980XE reaches 107.6 GFLOPS in
double precision, while NVIDIA’s Tesla C1060 GPU computing card performs around
933 GFLOPS in single precision (78 GFLOPS in double precision) and AMD’s Hemlock XT
5970 reaches 4,640 GFLOPS in single precision (928 GFLOPS in double precision).

 Peak throughput of some earlier processors:

   GeForce 8800 GTX – 346 GFLOPS

   Radeon HD 2900 XT – 475 GFLOPS

   PS3 Cell – 154 GFLOPS

   Core 2 Duo E6600 – 38 GFLOPS

   Athlon 64 X2 4600+ – 19 GFLOPS

 After all, it’s a supercomputer.

Bibliography
The data has been collected from:

 Wikipedia

 Nvidia.com

 Google

 Intel.com
