
CS6963: Parallel Programming for Graphics Processing Units (GPUs)

Lecture 1: Introduction

Course Information

• CS6963: Parallel Programming for GPUs, MW 10:45-12:05, MEB 3105
• Website: http://www.eng.utah.edu/~cs6963/
• Professor: Mary Hall
  MEB 3466, mhall@cs.utah.edu, 5-1039
  Office hours: 12:20-1:20 PM, Mondays
• Teaching Assistant: Sriram Aananthakrishnan
  MEB 3157, sriram@cs.utah.edu
  Office hours: 2-3 PM, Thursdays
• Lectures (slides and MP3) will be posted on the website.

Source Materials for Today’s Lecture

• Wen-mei Hwu (UIUC) and David Kirk (NVIDIA)
  http://courses.ece.uiuc.edu/ece498/al1/
• Jim Demmel (UCB) and Kathy Yelick (UCB, NERSC)
  http://www.eecs.berkeley.edu/~yelick/cs267_sp07/lectures
• NVIDIA: http://www.nvidia.com
• Others as noted

Course Objectives

• Learn how to program “graphics” processors for general-purpose multi-core computing applications
  – Learn how to think in parallel and write correct parallel programs
  – Achieve performance and scalability through understanding of architecture and software mapping
• Significant hands-on programming experience
  – Develop real applications on real hardware
• Discuss the current parallel computing context
  – What are the drivers that make this course timely?
  – Contemporary programming models and architectures, and where the field is going
Grading Criteria

• Homeworks and mini-projects: 25%
• Midterm test: 15%
• Project proposal: 10%
• Project design review: 10%
• Project presentation/demo: 15%
• Project final report: 20%
• Class participation: 5%

Primary Grade: Team Projects

• Some logistical issues:
  – 2-3 person teams
  – Projects will start in late February
• Three parts:
  – (1) Proposal; (2) Design review; (3) Final report and demo
• Application code:
  – Most students will work on MPM, a particle-in-cell code.
  – Alternative applications must be approved by me (start early).

Collaboration Policy

• I encourage discussion and exchange of information between students.
• But the final work must be your own.
  – Do not copy code, tests, assignments or written reports.
  – Do not allow others to copy your code, tests, assignments or written reports.

Lab Information

Primary
• “lab6” in WEB 130
• Windows machines
• Accounts are supposed to be set up for all who were registered as of Friday
• Contact opers@eng.utah.edu with questions
Secondary
• Until we get to timing experiments, assignments can be completed on any machine running CUDA 2.0 (Linux, Windows, Mac OS)
Tertiary
• Tesla S1070 system expected soon
Text and Notes

1. NVIDIA, CUDA Programming Guide, available from http://www.nvidia.com/object/cuda_develop.html for CUDA 2.0 and Windows, Linux or Mac OS.
2. [Recommended] M. Pharr (ed.), GPU Gems 2 – Programming Techniques for High Performance Graphics and General-Purpose Computation, Addison-Wesley, 2005.
   http://http.developer.nvidia.com/GPUGems2/gpugems2_part01.html
3. [Additional] A. Grama, A. Gupta, G. Karypis, and V. Kumar, Introduction to Parallel Computing, 2nd Ed., Addison-Wesley, 2003.
4. Additional readings associated with lectures.

Schedule: A Few Make-up Classes

A few make-up classes are needed due to my travel.
– Time slot: Friday, 10:45-12:05, MEB 3105
– Dates: February 20, March 13, April 3, April 24

Today’s Lecture

• Overview of course (done)
• Important problems require powerful computers …
  – … and powerful computers must be parallel.
  – Increasing importance of educating parallel programmers (you!)
• Why graphics processors?
• Opportunities and limitations
• Developing high-performance parallel applications
  – An optimization perspective (my research area)

Parallel and Distributed Computing

• Limited to supercomputers?
  – No! Everywhere!
• Scientific applications?
  – These are still important, but many new commercial and consumer applications are also going to emerge.
• Programming tools adequate and established?
  – No! Many new research challenges
Why we need powerful computers

Scientific Simulation:
The Third Pillar of Science

• Traditional scientific and engineering paradigm:
  1) Do theory or paper design.
  2) Perform experiments or build system.
• Limitations:
  – Too difficult -- build large wind tunnels.
  – Too expensive -- build a throw-away passenger jet.
  – Too slow -- wait for climate or galactic evolution.
  – Too dangerous -- weapons, drug design, climate experimentation.
• Computational science paradigm:
  3) Use high-performance computer systems to simulate the phenomenon
     • Based on known physical laws and efficient numerical methods.

Slide source: Jim Demmel, UC Berkeley

The quest for increasingly more powerful machines

• Scientific simulation will continue to push on system requirements:
  – To increase the precision of the result
  – To get to an answer sooner (e.g., climate modeling, disaster modeling)
• The U.S. will continue to acquire systems of increasing scale
  – For the above reasons
  – And to maintain competitiveness

A Similar Phenomenon in Commodity Systems

• More capabilities in software
• Integration across software
• Faster response
• More realistic graphics
• …
Why powerful computers must be parallel

Technology Trends: Moore’s Law

[Chart: transistor count still rising, but clock speed flattening sharply. Slide from Maurice Herlihy]

Technology Trends: Power Issues

[Figure: processor power trends; from www.electronics-cooling.com/.../jan00_a2f2.jpg]

Power Perspective

[Chart, log scale: Performance (GFlop/s) and Power (MegaWatts) from 1960 to 2020; both curves rise by many orders of magnitude. Slide source: Bob Lucas]
The Multi-Core Paradigm Shift

What to do with all these transistors?

• Key ideas:
  – Movement away from increasingly complex processor design and faster clocks
  – Replicated functionality (i.e., parallel) is simpler to design
  – Resources more efficiently utilized
  – Huge power management advantages

All Computers are Parallel Computers.

Who Should Care About Performance Now?

• Everyone! (Almost)
  – Sequential programs will not get faster
    • If individual processors are simplified and compete for shared resources.
    • And forget about adding new capabilities to the software!
  – Parallel programs will also get slower
    • Quest for coarse-grain parallelism at odds with smaller storage structures and limited bandwidth.
  – Managing locality is even more important than parallelism!
    • Hierarchies of storage and compute structures
• Small concession: some programs are nevertheless fast enough

Why Massively Parallel Processors

• A quiet revolution and potential build-up
  – Calculation: 367 GFLOPS vs. 32 GFLOPS
  – Memory bandwidth: 86.4 GB/s vs. 8.4 GB/s
  – Until last year, programmed through graphics API
  – GPU in every PC and workstation – massive volume and potential impact

[Chart: GFLOPS over time, GPUs pulling away from CPUs. G80 = GeForce 8800 GTX; G71 = GeForce 7900 GTX; G70 = GeForce 7800 GTX; NV40 = GeForce 6800 Ultra; NV35 = GeForce FX 5950 Ultra; NV30 = GeForce FX 5800]

GeForce 8800

16 highly threaded SMs, >128 FPUs, 367 GFLOPS, 768 MB DRAM, 86.4 GB/s memory bandwidth, 4 GB/s bandwidth to CPU

[Diagram: Host feeds an Input Assembler and Thread Execution Manager; an array of streaming multiprocessors with parallel data caches, texture units, and load/store paths shares a Global Memory]

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007
ECE 498AL, University of Illinois, Urbana-Champaign
Concept of GPGPU
(General-Purpose Computing on GPUs)

• Idea:
  – Potential for very high performance at low cost
  – Architecture well suited for certain kinds of parallel applications (data parallel)
  – Demonstrations of 30-100X speedup over CPU
• Early challenges:
  – Architectures very customized to graphics problems (e.g., vertex and fragment processors)
  – Programmed using graphics-specific programming models or libraries
• Recent trends:
  – Some convergence between commodity architectures and GPUs, and their associated parallel programming models

See http://gpgpu.org
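To make “data parallel” concrete, here is a minimal CUDA sketch (not from the slides; kernel and variable names are illustrative): each thread computes exactly one output element, which is the mapping these architectures are built for.

```cuda
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

// One thread per output element -- the data-parallel mapping GPGPU favors.
__global__ void vecAdd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)                               // guard: grid may overshoot n
        c[i] = a[i] + b[i];
}

int main(void) {
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);

    // Host data
    float *ha = (float *)malloc(bytes);
    float *hb = (float *)malloc(bytes);
    float *hc = (float *)malloc(bytes);
    for (int i = 0; i < n; i++) { ha[i] = (float)i; hb[i] = 2.0f * i; }

    // Device data
    float *da, *db, *dc;
    cudaMalloc((void **)&da, bytes);
    cudaMalloc((void **)&db, bytes);
    cudaMalloc((void **)&dc, bytes);
    cudaMemcpy(da, ha, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(db, hb, bytes, cudaMemcpyHostToDevice);

    // Launch enough 256-thread blocks to cover all n elements.
    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    vecAdd<<<blocks, threads>>>(da, db, dc, n);

    cudaMemcpy(hc, dc, bytes, cudaMemcpyDeviceToHost);
    printf("c[123] = %f\n", hc[123]);        // expect 369.0

    cudaFree(da); cudaFree(db); cudaFree(dc);
    free(ha); free(hb); free(hc);
    return 0;
}
```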

Stretching Traditional Architectures

• Traditional parallel architectures cover some super-applications
  – DSP, GPU, network apps, scientific computing, transactions
• The game is to grow mainstream architectures “out” or domain-specific architectures “in”
  – CUDA is the latter

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007
ECE 498AL, University of Illinois, Urbana-Champaign

The fastest computer in the world today

• What is its name? Roadrunner
• Where is it located? Los Alamos National Laboratory
• How many processors does it have? ~19,000 processor chips (~129,600 “processors”)
• What kind of processors? AMD Opterons and IBM Cell/BE (in PlayStations)
• How fast is it? 1.105 Petaflop/second -- about one quadrillion (1.105 x 10^15) operations per second

See http://www.top500.org
Parallel Programming Complexity:
An Analogy to Preparing Thanksgiving Dinner

• Enough parallelism? (Amdahl’s Law)
  – Suppose you want to just serve turkey
• Granularity
  – How frequently must each assistant report to the chef?
    • After each stroke of a knife? Each step of a recipe? Each dish completed?
• Locality
  – Grab the spices one at a time? Or collect the ones that are needed prior to starting a dish?
  – What if you have to go to the grocery store while cooking?
• Load balance
  – Each assistant gets a dish? Preparing stuffing vs. cooking green beans?
• Coordination and synchronization
  – Person chopping onions for stuffing can also supply green beans
  – Start pie after turkey is out of the oven

All of these things make parallel programming even harder than sequential programming.

Finding Enough Parallelism

• Suppose only part of an application seems parallel
• Amdahl’s law
  – Let s be the fraction of work done sequentially, so (1-s) is the fraction parallelizable
  – P = number of processors

  Speedup(P) = Time(1)/Time(P) <= 1/(s + (1-s)/P) <= 1/s

• Even if the parallel part speeds up perfectly, performance is limited by the sequential part
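A quick worked example (numbers chosen here purely for illustration, not from the slides): with a 10% sequential fraction (s = 0.1) on P = 100 processors,

```latex
\mathrm{Speedup}(100) \le \frac{1}{s + \frac{1-s}{P}}
                      = \frac{1}{0.1 + 0.9/100}
                      \approx 9.2,
\qquad
\lim_{P \to \infty} \mathrm{Speedup}(P) \le \frac{1}{s} = 10.
```

So 100 processors buy at most about a 9.2x speedup, and no number of processors can exceed 10x.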

Overhead of Parallelism

• Given enough parallel work, this is the biggest barrier to getting desired speedup
• Parallelism overheads include:
  – cost of starting a thread or process
  – cost of communicating shared data
  – cost of synchronizing
  – extra (redundant) computation
• Each of these can be in the range of milliseconds (= millions of flops) on some systems
• Tradeoff: Algorithm needs sufficiently large units of work to run fast in parallel (i.e., large granularity), but not so large that there is not enough parallel work

Locality and Parallelism

[Diagram: the conventional storage hierarchy (processor, cache, L2 cache, L3 cache, memory) beside the host + GPU hierarchy: a device grid of blocks, each with shared memory and per-thread registers and local memory, backed by global, constant, and texture memories. Courtesy NVIDIA]

• Large memories are slow, fast memories are small
• Storage hierarchies are large and fast on average
• Algorithm should do most work on nearby data (see the sketch below)
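As a minimal sketch of doing most work on nearby data (hypothetical kernel, not from the slides; names and block size are illustrative): a 3-point average that stages each block’s slice of the input in fast on-chip shared memory, so two of every three element reads are served on-chip rather than from slow global memory. Host-side setup would look like the earlier vector-add example.

```cuda
// Hypothetical sketch: stage data in fast on-chip shared memory so each
// block does most of its work on nearby data. Launch with 256-thread blocks.
__global__ void smooth3(const float *in, float *out, int n) {
    __shared__ float tile[258];               // block's slice + 2 halo cells
    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    int lid = threadIdx.x + 1;                // offset past the left halo

    tile[lid] = (gid < n) ? in[gid] : 0.0f;   // one global read per thread
    if (threadIdx.x == 0)                     // left halo cell
        tile[0] = (gid > 0) ? in[gid - 1] : 0.0f;
    if (threadIdx.x == blockDim.x - 1)        // right halo cell
        tile[lid + 1] = (gid + 1 < n) ? in[gid + 1] : 0.0f;
    __syncthreads();                          // wait until the tile is loaded

    if (gid < n)                              // three reads, all served on-chip
        out[gid] = (tile[lid - 1] + tile[lid] + tile[lid + 1]) / 3.0f;
}
```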
Load Imbalance

• Load imbalance is the time that some processors in the system are idle due to
  – insufficient parallelism (during that phase)
  – unequal size tasks
• Examples of the latter
  – different control flow paths on different tasks (see the sketch below)
  – adapting to “interesting parts of a domain”
  – tree-structured computations
  – fundamentally unstructured problems
• Algorithm needs to balance load
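A tiny hypothetical CUDA sketch of the “different control flow paths” case (not from the slides; names are illustrative): when per-element work depends on the input, threads that finish early sit idle until the longest-running ones in their group complete.

```cuda
// Hypothetical sketch: per-element work varies with the data, so some
// threads idle while the longest-running ones finish -- unequal task
// sizes in miniature. Host-side setup as in the vector-add example.
__global__ void varWork(const int *iters, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float x = 1.0f;
    for (int k = 0; k < iters[i]; k++)   // trip count differs per thread
        x = x * 0.999f + 0.001f;
    out[i] = x;
}
```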
Summary of Lecture

• Technology trends have caused the multi-core paradigm shift in computer architecture
  – Every computer architecture is parallel
• Parallel programming is reaching the masses
  – This course will help prepare you for the future of programming.
• We are seeing some convergence of graphics and general-purpose computing
  – Graphics processors can achieve high performance for more general-purpose applications
• GPGPU computing
  – Heterogeneous, suitable for data-parallel applications

Next Time

• Immersion!
  – Introduction to CUDA
