
CS6963: Parallel Programming for Graphics Processing Units (GPUs)

Lecture 1: Introduction

Course Information

• CS6963: Parallel Programming for GPUs, MW 10:45-12:05, MEB 3105
• Website: http://www.eng.utah.edu/~cs6963/
• Professor: Mary Hall
  MEB 3466, mhall@cs.utah.edu, 5-1039
  Office hours: 12:20-1:20 PM, Mondays
• Teaching Assistant: Sriram Aananthakrishnan
  MEB 3157, sriram@cs.utah.edu
  Office hours: 2-3 PM, Thursdays
• Lectures (slides and MP3) will be posted on the website.

Source Materials for Today’s Lecture

• Wen-mei Hwu (UIUC) and David Kirk (NVIDIA)
  http://courses.ece.uiuc.edu/ece498/al1/
• Jim Demmel (UCB) and Kathy Yelick (UCB, NERSC)
  http://www.eecs.berkeley.edu/~yelick/cs267_sp07/lectures
• NVIDIA: http://www.nvidia.com
• Others as noted

Course Objectives

• Learn how to program “graphics” processors for general-purpose multi-core computing applications
  – Learn how to think in parallel and write correct parallel programs
  – Achieve performance and scalability through understanding of architecture and software mapping
• Significant hands-on programming experience
  – Develop real applications on real hardware
• Discuss the current parallel computing context
  – What are the drivers that make this course timely?
  – Contemporary programming models and architectures, and where the field is going
Grading Criteria

• Homeworks and mini-projects: 25%
• Midterm test: 15%
• Project proposal: 10%
• Project design review: 10%
• Project presentation/demo: 15%
• Project final report: 20%
• Class participation: 5%

Primary Grade: Team Projects

• Some logistical issues:
  – 2-3 person teams
  – Projects will start in late February
• Three parts:
  – (1) Proposal; (2) Design review; (3) Final report and demo
• Application code:
  – Most students will work on MPM, a particle-in-cell code.
  – Alternative applications must be approved by me (start early).

Collaboration Policy

• I encourage discussion and exchange of information between students.
• But the final work must be your own.
  – Do not copy code, tests, assignments or written reports.
  – Do not allow others to copy your code, tests, assignments or written reports.

Lab Information

Primary
• “lab6” in WEB 130
• Windows machines
• Accounts are supposed to be set up for all who were registered as of Friday
• Contact opers@eng.utah.edu with questions
Secondary
• Until we get to timing experiments, assignments can be completed on any machine running CUDA 2.0 (Linux, Windows, Mac OS)
Tertiary
• Tesla S1070 system expected soon
Text and Notes

1. NVIDIA, CUDA Programming Guide, available from http://www.nvidia.com/object/cuda_develop.html for CUDA 2.0 and Windows, Linux or Mac OS.
2. [Recommended] M. Pharr (ed.), GPU Gems 2 – Programming Techniques for High Performance Graphics and General-Purpose Computation, Addison-Wesley, 2005.
   http://http.developer.nvidia.com/GPUGems2/gpugems2_part01.html
3. [Additional] A. Grama, A. Gupta, G. Karypis, and V. Kumar, Introduction to Parallel Computing, 2nd Ed., Addison-Wesley, 2003.
4. Additional readings associated with lectures.

Schedule: A Few Make-up Classes

A few make-up classes are needed due to my travel.
– Time slot: Friday, 10:45-12:05, MEB 3105
– Dates: February 20, March 13, April 3, April 24

Today’s Lecture

• Overview of course (done)
• Important problems require powerful computers …
  – … and powerful computers must be parallel.
  – Increasing importance of educating parallel programmers (you!)
• Why graphics processors?
• Opportunities and limitations
• Developing high-performance parallel applications
  – An optimization perspective (my research area)

Parallel and Distributed Computing

• Limited to supercomputers?
  – No! Everywhere!
• Scientific applications?
  – These are still important, but many new commercial and consumer applications are also going to emerge.
• Programming tools adequate and established?
  – No! Many new research challenges
Why we need powerful computers

Scientific Simulation:
The Third Pillar of Science

• Traditional scientific and engineering paradigm:
  1) Do theory or paper design.
  2) Perform experiments or build system.
• Limitations:
  – Too difficult -- build large wind tunnels.
  – Too expensive -- build a throw-away passenger jet.
  – Too slow -- wait for climate or galactic evolution.
  – Too dangerous -- weapons, drug design, climate experimentation.
• Computational science paradigm:
  3) Use high-performance computer systems to simulate the phenomenon
     • Based on known physical laws and efficient numerical methods.

Slide source: Jim Demmel, UC Berkeley

The quest for increasingly more powerful machines

• Scientific simulation will continue to push on system requirements:
  – To increase the precision of the result
  – To get to an answer sooner (e.g., climate modeling, disaster modeling)
• The U.S. will continue to acquire systems of increasing scale
  – For the above reasons
  – And to maintain competitiveness

A Similar Phenomenon in Commodity Systems

• More capabilities in software
• Integration across software
• Faster response
• More realistic graphics
• …
Why powerful computers must be parallel

Technology Trends: Moore’s Law

[Chart: transistor count still rising, but clock speed flattening sharply. Slide from Maurice Herlihy]

Technology Trends: Power Issues

[Figure: processor power trends; from www.electronics-cooling.com/.../jan00_a2f2.jpg]

Power Perspective

[Chart, log scale: Performance (GFlop/s) and Power (MegaWatts) from 1960 to 2020; both curves rise by many orders of magnitude. Slide source: Bob Lucas]
The Multi-Core Paradigm Shift

What to do with all these transistors?

• Key ideas:
  – Movement away from increasingly complex processor design and faster clocks
  – Replicated functionality (i.e., parallel) is simpler to design
  – Resources more efficiently utilized
  – Huge power management advantages

All Computers are Parallel Computers.

Who Should Care About Performance Now?

• Everyone! (Almost)
  – Sequential programs will not get faster
    • If individual processors are simplified and compete for shared resources.
    • And forget about adding new capabilities to the software!
  – Parallel programs will also get slower
    • Quest for coarse-grain parallelism at odds with smaller storage structures and limited bandwidth.
  – Managing locality is even more important than parallelism!
    • Hierarchies of storage and compute structures
• Small concession: some programs are nevertheless fast enough

Why Massively Parallel Processors

• A quiet revolution and potential build-up
  – Calculation: 367 GFLOPS vs. 32 GFLOPS
  – Memory bandwidth: 86.4 GB/s vs. 8.4 GB/s
  – Until last year, programmed through graphics API
  – GPU in every PC and workstation – massive volume and potential impact

[Chart: GFLOPS over time, GPUs pulling away from CPUs. G80 = GeForce 8800 GTX; G71 = GeForce 7900 GTX; G70 = GeForce 7800 GTX; NV40 = GeForce 6800 Ultra; NV35 = GeForce FX 5950 Ultra; NV30 = GeForce FX 5800]

GeForce 8800

16 highly threaded SMs, >128 FPUs, 367 GFLOPS, 768 MB DRAM, 86.4 GB/s memory bandwidth, 4 GB/s bandwidth to CPU

[Diagram: Host feeds an Input Assembler and Thread Execution Manager; an array of streaming multiprocessors with parallel data caches, texture units, and load/store paths shares a Global Memory]

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007
ECE 498AL, University of Illinois, Urbana-Champaign
Concept of GPGPU
(General-Purpose Computing on GPUs)

• Idea:
  – Potential for very high performance at low cost
  – Architecture well suited for certain kinds of parallel applications (data parallel)
  – Demonstrations of 30-100X speedup over CPU
• Early challenges:
  – Architectures very customized to graphics problems (e.g., vertex and fragment processors)
  – Programmed using graphics-specific programming models or libraries
• Recent trends:
  – Some convergence between commodity architectures and GPUs, and their associated parallel programming models

See http://gpgpu.org
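To make “data parallel” concrete, here is a minimal CUDA sketch (not from the slides; kernel and variable names are illustrative): each thread computes exactly one output element, which is the mapping these architectures are built for.

```cuda
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

// One thread per output element -- the data-parallel mapping GPGPU favors.
__global__ void vecAdd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)                               // guard: grid may overshoot n
        c[i] = a[i] + b[i];
}

int main(void) {
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);

    // Host data
    float *ha = (float *)malloc(bytes);
    float *hb = (float *)malloc(bytes);
    float *hc = (float *)malloc(bytes);
    for (int i = 0; i < n; i++) { ha[i] = (float)i; hb[i] = 2.0f * i; }

    // Device data
    float *da, *db, *dc;
    cudaMalloc((void **)&da, bytes);
    cudaMalloc((void **)&db, bytes);
    cudaMalloc((void **)&dc, bytes);
    cudaMemcpy(da, ha, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(db, hb, bytes, cudaMemcpyHostToDevice);

    // Launch enough 256-thread blocks to cover all n elements.
    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    vecAdd<<<blocks, threads>>>(da, db, dc, n);

    cudaMemcpy(hc, dc, bytes, cudaMemcpyDeviceToHost);
    printf("c[123] = %f\n", hc[123]);        // expect 369.0

    cudaFree(da); cudaFree(db); cudaFree(dc);
    free(ha); free(hb); free(hc);
    return 0;
}
```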

Stretching Traditional Architectures

• Traditional parallel architectures cover some super-applications
  – DSP, GPU, network apps, scientific computing, transactions
• The game is to grow mainstream architectures “out” or domain-specific architectures “in”
  – CUDA is the latter

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007
ECE 498AL, University of Illinois, Urbana-Champaign

The fastest computer in the world today

• What is its name? Roadrunner
• Where is it located? Los Alamos National Laboratory
• How many processors does it have? ~19,000 processor chips (~129,600 “processors”)
• What kind of processors? AMD Opterons and IBM Cell/BE (in PlayStations)
• How fast is it? 1.105 Petaflop/second -- about one quadrillion (1.105 x 10^15) operations per second

See http://www.top500.org
Parallel Programming Complexity:
An Analogy to Preparing Thanksgiving Dinner

• Enough parallelism? (Amdahl’s Law)
  – Suppose you want to just serve turkey
• Granularity
  – How frequently must each assistant report to the chef?
    • After each stroke of a knife? Each step of a recipe? Each dish completed?
• Locality
  – Grab the spices one at a time? Or collect the ones that are needed prior to starting a dish?
  – What if you have to go to the grocery store while cooking?
• Load balance
  – Each assistant gets a dish? Preparing stuffing vs. cooking green beans?
• Coordination and synchronization
  – Person chopping onions for stuffing can also supply green beans
  – Start pie after turkey is out of the oven

All of these things make parallel programming even harder than sequential programming.

Finding Enough Parallelism

• Suppose only part of an application seems parallel
• Amdahl’s law
  – Let s be the fraction of work done sequentially, so (1-s) is the fraction parallelizable
  – P = number of processors

  Speedup(P) = Time(1)/Time(P) <= 1/(s + (1-s)/P) <= 1/s

• Even if the parallel part speeds up perfectly, performance is limited by the sequential part
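A quick worked example (numbers chosen here purely for illustration, not from the slides): with a 10% sequential fraction (s = 0.1) on P = 100 processors,

```latex
\mathrm{Speedup}(100) \le \frac{1}{s + \frac{1-s}{P}}
                      = \frac{1}{0.1 + 0.9/100}
                      \approx 9.2,
\qquad
\lim_{P \to \infty} \mathrm{Speedup}(P) \le \frac{1}{s} = 10.
```

So 100 processors buy at most about a 9.2x speedup, and no number of processors can exceed 10x.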

Overhead of Parallelism

• Given enough parallel work, this is the biggest barrier to getting desired speedup
• Parallelism overheads include:
  – cost of starting a thread or process
  – cost of communicating shared data
  – cost of synchronizing
  – extra (redundant) computation
• Each of these can be in the range of milliseconds (= millions of flops) on some systems
• Tradeoff: Algorithm needs sufficiently large units of work to run fast in parallel (i.e., large granularity), but not so large that there is not enough parallel work

Locality and Parallelism

[Diagram: the conventional storage hierarchy (processor, cache, L2 cache, L3 cache, memory) beside the host + GPU hierarchy: a device grid of blocks, each with shared memory and per-thread registers and local memory, backed by global, constant, and texture memories. Courtesy NVIDIA]

• Large memories are slow, fast memories are small
• Storage hierarchies are large and fast on average
• Algorithm should do most work on nearby data (see the sketch below)
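As a minimal sketch of doing most work on nearby data (hypothetical kernel, not from the slides; names and block size are illustrative): a 3-point average that stages each block’s slice of the input in fast on-chip shared memory, so two of every three element reads are served on-chip rather than from slow global memory. Host-side setup would look like the earlier vector-add example.

```cuda
// Hypothetical sketch: stage data in fast on-chip shared memory so each
// block does most of its work on nearby data. Launch with 256-thread blocks.
__global__ void smooth3(const float *in, float *out, int n) {
    __shared__ float tile[258];               // block's slice + 2 halo cells
    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    int lid = threadIdx.x + 1;                // offset past the left halo

    tile[lid] = (gid < n) ? in[gid] : 0.0f;   // one global read per thread
    if (threadIdx.x == 0)                     // left halo cell
        tile[0] = (gid > 0) ? in[gid - 1] : 0.0f;
    if (threadIdx.x == blockDim.x - 1)        // right halo cell
        tile[lid + 1] = (gid + 1 < n) ? in[gid + 1] : 0.0f;
    __syncthreads();                          // wait until the tile is loaded

    if (gid < n)                              // three reads, all served on-chip
        out[gid] = (tile[lid - 1] + tile[lid] + tile[lid + 1]) / 3.0f;
}
```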
Load Imbalance

• Load imbalance is the time that some processors in the system are idle due to
  – insufficient parallelism (during that phase)
  – unequal size tasks
• Examples of the latter
  – different control flow paths on different tasks (see the sketch below)
  – adapting to “interesting parts of a domain”
  – tree-structured computations
  – fundamentally unstructured problems
• Algorithm needs to balance load
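A tiny hypothetical CUDA sketch of the “different control flow paths” case (not from the slides; names are illustrative): when per-element work depends on the input, threads that finish early sit idle until the longest-running ones in their group complete.

```cuda
// Hypothetical sketch: per-element work varies with the data, so some
// threads idle while the longest-running ones finish -- unequal task
// sizes in miniature. Host-side setup as in the vector-add example.
__global__ void varWork(const int *iters, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float x = 1.0f;
    for (int k = 0; k < iters[i]; k++)   // trip count differs per thread
        x = x * 0.999f + 0.001f;
    out[i] = x;
}
```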
Summary of Lecture

• Technology trends have caused the multi-core paradigm shift in computer architecture
  – Every computer architecture is parallel
• Parallel programming is reaching the masses
  – This course will help prepare you for the future of programming.
• We are seeing some convergence of graphics and general-purpose computing
  – Graphics processors can achieve high performance for more general-purpose applications
• GPGPU computing
  – Heterogeneous, suitable for data-parallel applications

Next Time

• Immersion!
  – Introduction to CUDA
