Module 4.4 - Memory and Data Locality: GPU Teaching Kit
Accelerated Computing
Loading Input Tile 0 of M (Phase 0)
– Have each thread load an M element and an N element at the same relative position as its P element.

int Row = by * blockDim.y + ty;
int Col = bx * blockDim.x + tx;

2D indexing for accessing Tile 0:
M[Row][tx]
N[ty][Col]

[Figure: M, N, and P matrices of size WIDTH x WIDTH; the TILE_WIDTH x TILE_WIDTH tile of M containing Row is highlighted as the tile loaded in Phase 0]
Loading Input Tile 0 of N (Phase 0)
– Have each thread load an M element and an N element at the same relative position as its P element.

int Row = by * blockDim.y + ty;
int Col = bx * blockDim.x + tx;

2D indexing for accessing Tile 0:
M[Row][tx]
N[ty][Col]

[Figure: same diagram; the TILE_WIDTH x TILE_WIDTH tile of N containing Col is highlighted as the tile loaded in Phase 0]
Loading Input Tile 1 of M (Phase 1)

2D indexing for accessing Tile 1:
M[Row][1*TILE_WIDTH + tx]
N[1*TILE_WIDTH + ty][Col]

[Figure: the second TILE_WIDTH x TILE_WIDTH tile of M, one tile to the right, highlighted as the tile loaded in Phase 1]
Loading Input Tile 1 of N (Phase 1)

2D indexing for accessing Tile 1:
M[Row][1*TILE_WIDTH + tx]
N[1*TILE_WIDTH + ty][Col]

[Figure: the second TILE_WIDTH x TILE_WIDTH tile of N, one tile down, highlighted as the tile loaded in Phase 1]
M and N are dynamically allocated – use 1D indexing:
M[Row][p*TILE_WIDTH+tx]  becomes  M[Row*Width + p*TILE_WIDTH + tx]
N[p*TILE_WIDTH+ty][Col]  becomes  N[(p*TILE_WIDTH+ty)*Width + Col]
Tiled Matrix Multiplication Kernel
__global__ void MatrixMulKernel(float* M, float* N, float* P, int Width)
{
  __shared__ float ds_M[TILE_WIDTH][TILE_WIDTH];
  __shared__ float ds_N[TILE_WIDTH][TILE_WIDTH];

  int bx = blockIdx.x;  int by = blockIdx.y;
  int tx = threadIdx.x; int ty = threadIdx.y;

  int Row = by * blockDim.y + ty;
  int Col = bx * blockDim.x + tx;
  float Pvalue = 0;

  // Loop over the M and N tiles required to compute the P element
  for (int p = 0; p < Width/TILE_WIDTH; ++p) {
    // Collaborative loading of M and N tiles into shared memory
    ds_M[ty][tx] = M[Row*Width + p*TILE_WIDTH + tx];
    ds_N[ty][tx] = N[(p*TILE_WIDTH + ty)*Width + Col];
    __syncthreads();

    for (int i = 0; i < TILE_WIDTH; ++i)
      Pvalue += ds_M[ty][i] * ds_N[i][tx];
    __syncthreads();
  }
  P[Row*Width + Col] = Pvalue;
}
Tile (Thread Block) Size Considerations
– Each thread block should have many threads
– TILE_WIDTH of 16 gives 16*16 = 256 threads
– TILE_WIDTH of 32 gives 32*32 = 1024 threads
– For 16, in each phase, each block performs 2*256 = 512 float
loads from global memory for 256 * (2*16) = 8,192 mul/add
operations. (16 floating-point operations for each memory load)
– For 32, in each phase, each block performs 2*1024 = 2048 float
loads from global memory for 1024 * (2*32) = 65,536 mul/add
operations. (32 floating-point operations for each memory load)
Shared Memory and Threading
– For an SM with 16KB shared memory
– Shared memory size is implementation dependent!
– For TILE_WIDTH = 16, each thread block uses 2*256*4B = 2KB of shared
memory.
– For 16KB shared memory, one can potentially have up to 8 thread blocks
executing
– This allows up to 8*512 = 4,096 pending loads. (2 per thread, 256 threads per block)
– The next TILE_WIDTH of 32 would lead to 2*32*32*4B = 8KB shared memory
usage per thread block, allowing 2 thread blocks active at the same time
– However, in a GPU where the thread count is limited to 1536 threads per SM,
the number of blocks per SM is reduced to one!
– Each __syncthreads() can reduce the number of active threads for a
block
– More thread blocks per SM can therefore be advantageous
GPU Teaching Kit
Accelerated Computing
The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under
the Creative Commons Attribution-NonCommercial 4.0 International License.