
Parallel Reduction in CUDA

Week 10, Lecture 30
Parallel Reduction

Common and important data parallel primitive

Easy to implement in CUDA

Serves as a great optimization example

Parallel Reduction

Tree-based approach used within each thread block:

  3   1   7   0   4   1   6   3
    4       7       5       9
        11             14
               25

Need to be able to use multiple thread blocks:
  To process very large arrays
  To keep all multiprocessors on the GPU busy
  Each thread block reduces a portion of the array
But how do we communicate partial results between thread blocks?
Parallel Reduction

Values (shared memory): 10  1  8 -1  0 -2  3  5 -2 -3  2  7  0 11  0  2

Step 1, stride 1 — thread IDs: 0 2 4 6 8 10 12 14
Values: 11  1  7 -1 -2 -2  8  5 -5 -3  9  7 11 11  2  2

Step 2, stride 2 — thread IDs: 0 4 8 12
Values: 18  1  7 -1  6 -2  8  5  4 -3  9  7 13 11  2  2

Step 3, stride 4 — thread IDs: 0 8
Values: 24  1  7 -1  6 -2  8  5 17 -3  9  7 13 11  2  2

Step 4, stride 8 — thread ID: 0
Values: 41  1  7 -1  6 -2  8  5 17 -3  9  7 13 11  2  2
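The interleaved-addressing scheme traced above can be sketched as a CUDA kernel. The kernel name and the `int` element type are illustrative assumptions; the stride loop mirrors the diagram step by step:

```cuda
// Interleaved-addressing reduction (sketch): each block reduces
// blockDim.x elements of g_idata into one partial sum in g_odata.
__global__ void reduce_interleaved(const int *g_idata, int *g_odata)
{
    extern __shared__ int sdata[];         // one int per thread

    unsigned int tid = threadIdx.x;
    unsigned int i   = blockIdx.x * blockDim.x + threadIdx.x;
    sdata[tid] = g_idata[i];               // each thread loads one element
    __syncthreads();

    // Stride doubles each step: 1, 2, 4, 8, ... (as in the diagram)
    for (unsigned int s = 1; s < blockDim.x; s *= 2) {
        if (tid % (2 * s) == 0)            // only multiples of 2s do work
            sdata[tid] += sdata[tid + s];
        __syncthreads();
    }

    if (tid == 0)                          // thread 0 writes the block's sum
        g_odata[blockIdx.x] = sdata[0];
}
```

Note the `tid % (2 * s)` test: threads within a warp take different paths, which causes divergent branching — one of the inefficiencies the later optimization discussion targets.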

Solution: Kernel Decomposition

Decompose the computation into multiple kernel invocations:

Level 0: 8 blocks
  Each block runs the tree-based reduction on its portion of the array:
  3 1 7 0 4 1 6 3  →  4 7 5 9  →  11 14  →  25

Level 1: 1 block
  A second kernel invocation reduces the 8 per-block partial sums to the
  final result, using the same tree-based reduction.

In the case of reductions, the code for all levels is the same:
recursive kernel invocation.
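A minimal host-side driver for this decomposition might look like the sketch below. The kernel name, block size, and buffer handling are assumptions; a production version would guard the tail when n is not a multiple of the block size and ping-pong between two output buffers:

```cuda
__global__ void reduce_interleaved(const int *g_idata, int *g_odata); // tree reduction (assumed)

// Launch the same kernel repeatedly, shrinking the problem to `blocks`
// partial sums each pass, until a single value remains.
void reduce_array(int *d_in, int *d_out, unsigned int n)
{
    const unsigned int threads = 256;      // assumed block size
    while (n > 1) {
        unsigned int blocks = (n + threads - 1) / threads;
        reduce_interleaved<<<blocks, threads, threads * sizeof(int)>>>(d_in, d_out);
        // The `blocks` partial sums become the input of the next pass.
        d_in = d_out;
        n = blocks;
    }
}
```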

What is Our Optimization Goal?

We should strive to reach GPU peak performance.
Choose the right metric:
  GFLOP/s: for compute-bound kernels
  Bandwidth: for memory-bound kernels
Reductions have very low arithmetic intensity:
  1 flop per element loaded (bandwidth-optimal)
Therefore we should strive for peak bandwidth.
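Effective bandwidth can be measured with CUDA events: bytes read and written divided by elapsed time. This sketch assumes a hypothetical reduction kernel `reduce_kernel` declared elsewhere; names and sizes are illustrative:

```cuda
__global__ void reduce_kernel(const int *g_idata, int *g_odata); // assumed

// Effective bandwidth in GB/s of one reduction pass over n ints.
float effective_bandwidth_gb(unsigned int n, int *d_in, int *d_out,
                             unsigned int blocks, unsigned int threads)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    reduce_kernel<<<blocks, threads, threads * sizeof(int)>>>(d_in, d_out);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);  // elapsed time in milliseconds

    // n ints read plus `blocks` partial sums written
    float bytes = (float)(n + blocks) * sizeof(int);
    return (bytes / 1e9f) / (ms / 1e3f);
}
```

Comparing this figure against the device's theoretical peak bandwidth tells you how close a memory-bound kernel is to optimal.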

Types of Optimization

Interesting observation:

Algorithmic optimizations
  Changes to addressing, algorithm cascading

Code optimizations
  Loop unrolling (2.54x speedup, combined)
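As an illustration of the loop-unrolling idea: once the stride drops to 32 or less, only a single warp is still active, so the last steps of the tree are often unrolled into straight-line code. This sketch assumes a block size of at least 64 and warp-synchronous execution in the classic style:

```cuda
// Unrolling the last warp (sketch): when s <= 32 only one warp remains,
// so the loop and its __syncthreads() calls can be removed entirely.
// `volatile` stops the compiler from caching sdata values in registers
// (warp-synchronous style; newer CUDA code would use __syncwarp()).
__device__ void warp_reduce(volatile int *sdata, unsigned int tid)
{
    sdata[tid] += sdata[tid + 32];
    sdata[tid] += sdata[tid + 16];
    sdata[tid] += sdata[tid + 8];
    sdata[tid] += sdata[tid + 4];
    sdata[tid] += sdata[tid + 2];
    sdata[tid] += sdata[tid + 1];
}
```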

Conclusion
Understand CUDA performance characteristics:
  Memory coalescing
  Divergent branching
  Bank conflicts
  Latency hiding
Use peak performance metrics to guide optimization
Understand parallel algorithm complexity theory
Know how to identify the type of bottleneck
  e.g. memory, core computation, or instruction overhead
Optimize your algorithm first, then unroll loops
Use template parameters to generate optimal code
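The template-parameter idea can be sketched as follows: making the block size a compile-time constant lets the compiler resolve every size check statically and eliminate the reduction loop altogether. The kernel name and element type are illustrative:

```cuda
// Block size as a template parameter (sketch): blockSize is known at
// compile time, so each `if (blockSize >= ...)` is evaluated by the
// compiler and dead branches are removed.
template <unsigned int blockSize>
__global__ void reduce_templated(const int *g_idata, int *g_odata)
{
    extern __shared__ int sdata[];
    unsigned int tid = threadIdx.x;
    sdata[tid] = g_idata[blockIdx.x * blockSize + tid];
    __syncthreads();

    if (blockSize >= 512) { if (tid < 256) sdata[tid] += sdata[tid + 256]; __syncthreads(); }
    if (blockSize >= 256) { if (tid < 128) sdata[tid] += sdata[tid + 128]; __syncthreads(); }
    if (blockSize >= 128) { if (tid <  64) sdata[tid] += sdata[tid +  64]; __syncthreads(); }

    // Final 64 elements: single warp, unrolled, no __syncthreads() needed
    if (tid < 32) {
        volatile int *v = sdata;
        v[tid] += v[tid + 32]; v[tid] += v[tid + 16]; v[tid] += v[tid + 8];
        v[tid] += v[tid + 4];  v[tid] += v[tid + 2];  v[tid] += v[tid + 1];
    }

    if (tid == 0) g_odata[blockIdx.x] = sdata[0];
}

// Launch with the block size baked in, e.g.:
//   reduce_templated<256><<<blocks, 256, 256 * sizeof(int)>>>(d_in, d_out);
```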
