Programming For Graphics Processing Units (Gpus) : Course Information
Lab Information
Primary lab
• “lab6” in WEB 130
• Windows machines
• Accounts are supposed to be set up for all who were registered as of Friday
• Contact opers@eng.utah.edu with questions
Secondary
• Until we get to timing experiments, assignments can be completed on any machine running CUDA 2.0 (Linux, Windows, Mac OS)
Tertiary
• Tesla S1070 system expected soon

Collaboration Policy
• I encourage discussion and exchange of information between students.
• But the final work must be your own.
– Do not copy code, tests, assignments or written reports.
– Do not allow others to copy your code, tests, assignments or written reports.
Clock Speed Flattening
Why powerful computers must be sharply parallel
[Figure: log-scale plot, 1960–2020, of performance (GFlop/s) and power (MegaWatts), showing clock-speed-driven performance growth flattening]
Why Massively Parallel Processor
• A quiet revolution and potential build-up
– Calculation: 367 GFLOPS vs. 32 GFLOPS
– Memory Bandwidth: 86.4 GB/s vs. 8.4 GB/s
– Until last year, programmed through graphics API
• GeForce 8800 (G80 = GeForce 8800 GTX): 16 highly threaded SMs, >128 FPUs, 367 GFLOPS, 768 MB DRAM, 86.4 GB/s memory bandwidth, 4 GB/s bandwidth to CPU
[Figure: G80 block diagram — Host, Input Assembler, and parallel data caches]
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007
ECE 498AL, University of Illinois, Urbana-Champaign
L1: Introduction
Concept of GPGPU
(General-Purpose Computing on GPUs)
• Idea:
• Potential for very high performance at low cost
• Architecture well suited for certain kinds of parallel
applications (data parallel)
• Demonstrations of 30-100X speedup over CPU
• Early challenges:
– Architectures very customized to graphics problems
(e.g., vertex and fragment processors)
– Programmed using graphics-specific programming
models or libraries
• Recent trends:
– Some convergence between commodity CPUs and GPUs and their associated parallel programming models
See http://gpgpu.org
1.105 Petaflop/second
• How fast is it? About 1.105 × 10^15, or just over one quadrillion, operations per second
[Figure: CUDA (Device) Grid memory hierarchy — per-block shared memory, registers, global and texture memory. Courtesy NVIDIA]
• Granularity: need enough work per parallel task to amortize the cost of synchronizing (i.e. large granularity), but not so large that there is not enough parallel work
• Large memories are slow, fast memories are small
• Storage hierarchies are large and fast on average
• Algorithm should do most work on nearby data
Load Imbalance
• Load imbalance is the time that some processors in the system are idle due to
– insufficient parallelism (during that phase)
– unequal size tasks
• Examples of the latter
– different control flow paths on different tasks
– adapting to “interesting parts of a domain”
– tree-structured computations
– fundamentally unstructured problems
• Algorithm needs to balance load

Summary of Lecture
• Technology trends have caused the multi-core paradigm shift in computer architecture
– Every computer architecture is parallel
• Parallel programming is reaching the masses
– This course will help prepare you for the future of programming.
• We are seeing some convergence of graphics and general-purpose computing
– Graphics processors can achieve high performance for more general-purpose applications
• GPGPU computing
– Heterogeneous, suitable for data-parallel applications
Next Time
• Immersion!
– Introduction to CUDA