Welcome to Scribd!

Experiments With Cache-Oblivious Matrix Multiplication For 18.335 (Optimal) Cache-Oblivious Matrix Multiply

Uploaded by

0% found this document useful (0 votes)

8 views1 page

This document discusses an implementation of cache-oblivious matrix multiplication. It summarizes that dividing the matrices and recursively computing block multiplications achieves optimal cache complexity, but absolute performance still suffers. The issue is that registers act like a very small cache, so nested loops do not make effective use of them. Long blocks of unrolled code loading matrix blocks into registers could improve performance.

Original Description:

cuda

Original Title

Oblivious Matmul Handout

Copyright

Available Formats

PDF, TXT or read online from Scribd

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Report this Document

Copyright:

Attribution Non-Commercial (BY-NC)

Available Formats

Download as PDF, TXT or read online from Scribd

Flag for inappropriate content

Download as pdf or txt

0% found this document useful (0 votes)

8 views1 page

Experiments With Cache-Oblivious Matrix Multiplication For 18.335 (Optimal) Cache-Oblivious Matrix Multiply

Uploaded by

proxymo1

Copyright:

Attribution Non-Commercial (BY-NC)

Available Formats

Download as PDF, TXT or read online from Scribd

Flag for inappropriate content

Download as pdf or txt

Jump to Page

You are on page 1of 1

Search inside document

Experiments with Cache-Oblivious Matrix Multiplication for 18.

335
Steven G. Johnson MIT Applied Math
platform: 2.66GHz Intel Core 2 Duo, GNU/Linux + gcc 4.1.2 (-O3) (64-bit), double precision

(optimal) Cache-Oblivious Matrix Multiply

= C
m!p

A
m!n

B
n!p

divide and conquer: divide C into 4 blocks compute block multiply recursively achieves optimal "(n3/#Z) cache complexity

A little C implementation (~25 lines)

/* C = C + AB, where A is m x n, B is n x p, and C is m x p, in row-major order. Actually, the physical size of A, B, and C are m x fdA, n x fdB, and m x fdC, but only the first n/p/p columns are used, respectively. */ void add_matmul_rec(const double *A, const double *B, double *C, int m, int n, int p, int fdA, int fdB, int fdC) { if (m+n+p <= 48) { /* <= 16x16 matrices "on average" */ int i, j, k; for (i = 0; i < m; ++i) for (k = 0; k < p; ++k) { double sum = 0; for (j = 0; j < n; ++j) sum += A[i*fdA +j] * B[j*fdB + k]; C[i*fdC + k] += sum; } } else { /* divide and conquer */ int m2 = m/2, n2 = n/2, p2 = p/2;

No Cache-based Performance Drops!

note: base case is ~16!16

recursing down to 1!1 would kill performance
(1 function call per element, no register re-use)

add_matmul_rec(A, B, C, m2, n2, p2, fdA, fdB, fdC); add_matmul_rec(A+n2, B+n2*fdB, C, m2, n-n2, p2, fdA, fdB, fdC); add_matmul_rec(A, B+p2, C+p2, m2, n2, p-p2, fdA, fdB, fdC); add_matmul_rec(A+n2, B+p2+n2*fdB, C, m2, n-n2, p-p2, fdA, fdB, fdC); add_matmul_rec(A+m2*fdA, B, C+m2*fdC, m-m2, n2, p2, fdA, fdB, fdC); add_matmul_rec(A+m2*fdA+n2, B+n2*fdB, C+m2*fdC, m-m2, n-n2, p2, fdA, fdB, fdC);

dividing C into 4
note that, instead, for very non-square matrices, we might want to divide C in 2 along longest axis

add_matmul_rec(A+m2*fdA, B+p2, C+m2*fdC+p2, m-m2, n2, p-p2, fdA, fdB, fdC); add_matmul_rec(A+m2*fdA+n2, B+p2+n2*fdB, C+m2*fdC, m-m2, n-n2, p-p2, fdA, fdB, fdC); } } void matmul_rec(const double *A, const double *B, double *C, int m, int n, int p) { memset(C, 0, sizeof(double) * m*p); add_matmul_rec(A, B, C, m, n, p, n, p, p); }

but absolute performance still sucks

Registers .EQ. Cache

The registers (~100) form a very small, almost ideal cache
Three nested loops is not the right way to use this cache for the same reason as with other caches

(of course, there are lots of little optimizations, but there must be something big?)

Need long blocks of unrolled code: load blocks of matrix into local variables (= registers), do matrix multiply, write results
Loop-free blocks = many optimized hard-coded base cases of recursion for different-sized blocks often automatically generated (ATLAS) Unrolled n!n multiply has (n3)! possible code orderings compiler cannot find optimal schedule (NP hard) cacheoblivious scheduling can help (c.f. FFTW), but ultimately requires some experimentation (automated in ATLAS)

unfair factor of 2 from using SSE2 instructions

if this difference is not L1/L2 cache, what is it?

CLRS 4 1 2
Document7 pages
CLRS 4 1 2
MathKilledYouBro
No ratings yet
ATI TEAS Calculation Workbook: 300 Questions to Prepare for the TEAS (2023 Edition)
From Everand
ATI TEAS Calculation Workbook: 300 Questions to Prepare for the TEAS (2023 Edition)
Coventry House Publishing
No ratings yet
Lab Report 5 Zaryab Rauf Fa17-Ece-046
Document8 pages
Lab Report 5 Zaryab Rauf Fa17-Ece-046
HAMZA ALI
No ratings yet
5-8086 Assembly Language Programming
Document16 pages
5-8086 Assembly Language Programming
afzal_a
83% (6)
TAFJ R18 Release Notes
Document10 pages
TAFJ R18 Release Notes
Gopal Arunachalam
No ratings yet
Scribe Week4
Document3 pages
Scribe Week4
Edward
No ratings yet
Rational Functions: 5.4 Complex Arithmetic
Document3 pages
Rational Functions: 5.4 Complex Arithmetic
Vinay Gupta
No ratings yet
Algorithms - Exam 2021-2022 Model Answer
Document4 pages
Algorithms - Exam 2021-2022 Model Answer
Ahmed Mabrok
No ratings yet
Algo VC Lecture24
Document32 pages
Algo VC Lecture24
M HUZAIFA
No ratings yet
Data Structures and Algorithms: (CS210/ESO207/ESO211)
Document24 pages
Data Structures and Algorithms: (CS210/ESO207/ESO211)
Moazzam Hussain
No ratings yet
Assignment #2: Due March 14, 2016
Document14 pages
Assignment #2: Due March 14, 2016
Ahmed Hamouda
No ratings yet
Libya Free High Study Academy / Misrata: - Report of Address
Document19 pages
Libya Free High Study Academy / Misrata: - Report of Address
Basma Mohamed Elg
No ratings yet
DSA Final
Document37 pages
DSA Final
0 0
No ratings yet
Assignment On Bisection Method GIVEN ON 06/10/2020: Program
Document13 pages
Assignment On Bisection Method GIVEN ON 06/10/2020: Program
Rahul
No ratings yet
CT2 - C Answer Key
Document12 pages
CT2 - C Answer Key
Kriti Thirumurugan
No ratings yet
2011 Daa End-Regular
Document4 pages
2011 Daa End-Regular
8207dayaan Bashir
No ratings yet
Week 12 Assignment Solution
Document4 pages
Week 12 Assignment Solution
stephen neal
No ratings yet
Lesson2 4DistributedMartrixMultiply PDF
Document7 pages
Lesson2 4DistributedMartrixMultiply PDF
Projectlouis
No ratings yet
MIT18 335JF10 Lec2a Hand
Document7 pages
MIT18 335JF10 Lec2a Hand
DevendraReddyPoreddy
No ratings yet
Class18 - Linalg II Handout PDF
Document48 pages
Class18 - Linalg II Handout PDF
Pavan Behara
No ratings yet
Answer All Questions. 2. Write Legibly On Both Sides of The Answer Book. 3. Write Relevant Question Numbers Before Writing The Answer
Document4 pages
Answer All Questions. 2. Write Legibly On Both Sides of The Answer Book. 3. Write Relevant Question Numbers Before Writing The Answer
Anonymous BOreSF
No ratings yet
Introduction To Tools - Python: 1 Assignment
Document7 pages
Introduction To Tools - Python: 1 Assignment
Rahul Vasanth
No ratings yet
Gói Dsa
Document33 pages
Gói Dsa
Hồng Hải
No ratings yet
Rps School System: Final TERM (2018)
Document5 pages
Rps School System: Final TERM (2018)
Zahra Ahmed
No ratings yet
Daa MCQ - Sample-2020 PDF
Document40 pages
Daa MCQ - Sample-2020 PDF
Rishabh Raj
100% (4)
CDS S5
Document6 pages
CDS S5
Jaydeep S
No ratings yet
BPJ Lesson 6
Document4 pages
BPJ Lesson 6
api-307093783
No ratings yet
CS502 GQ DEC 2020, MCS of Virtuallians
Document41 pages
CS502 GQ DEC 2020, MCS of Virtuallians
Ch HamXa Aziz (HeMi)
100% (1)
DS MCQ
Document9 pages
DS MCQ
phani
No ratings yet
Security in Communications and Storage
Document18 pages
Security in Communications and Storage
Afonso Delgado
No ratings yet
CSEN701 PA5 Solution 27551 29906
Document5 pages
CSEN701 PA5 Solution 27551 29906
no name
No ratings yet
Lab Report 8 Nmce: (Numerical Methods in Chemical Engineering) by Muhammad Jawad
Document10 pages
Lab Report 8 Nmce: (Numerical Methods in Chemical Engineering) by Muhammad Jawad
muhammad jawad
No ratings yet
IA2 DAA Question Paper Updated
Document5 pages
IA2 DAA Question Paper Updated
Ambrish Reddy
No ratings yet
Matrix Mult
Document55 pages
Matrix Mult
sundeepadapa
100% (1)
Assignment-1: - : Basics
Document4 pages
Assignment-1: - : Basics
BHAVIK KALPESH
No ratings yet
Man 108
Document4 pages
Man 108
senthil.jpin8830
No ratings yet
2 EC Cryptography: 2.1 Elliptic Curve Arithmetic
Document8 pages
2 EC Cryptography: 2.1 Elliptic Curve Arithmetic
Hadis Mehari
No ratings yet
Tutorial Week-6 Set-1 Solution
Document4 pages
Tutorial Week-6 Set-1 Solution
devc96048
No ratings yet
Chapter 4 - Finite Fields
Document19 pages
Chapter 4 - Finite Fields
rishabhdubey
No ratings yet
GCD Program: C Program To Find GCD of Two Numbers
Document5 pages
GCD Program: C Program To Find GCD of Two Numbers
Chintha Venu
No ratings yet
Problem Set 2
Document5 pages
Problem Set 2
Eren Güngör (Student)
No ratings yet
(@bohring - Bot × @JEE - Tests) 04 RA - (P - C, BT) - E
Document2 pages
(@bohring - Bot × @JEE - Tests) 04 RA - (P - C, BT) - E
Rishi Jangid
No ratings yet
Performance Experiments With Matrix Multiplication A Trivial Problem?
Document1 page
Performance Experiments With Matrix Multiplication A Trivial Problem?
proxymo1
No ratings yet
Tutorial 2: Basic C++ Elements: Csc126 Fundamentals of Algorithms and Computer Problem Solving
Document4 pages
Tutorial 2: Basic C++ Elements: Csc126 Fundamentals of Algorithms and Computer Problem Solving
Leo Pietro
No ratings yet
Dual-Field Multiplier Architecture For Cryptographic Applications
Document5 pages
Dual-Field Multiplier Architecture For Cryptographic Applications
Akshansh Dwivedi
No ratings yet
Basic Computational Number Theory (Jacques Amsel)
Document8 pages
Basic Computational Number Theory (Jacques Amsel)
Gonzalo Rodriguez
No ratings yet
Questions Endsem
Document8 pages
Questions Endsem
Ritik Chaturvedi
No ratings yet
Design-And-Analysis-Of-Algorithms (Set 1)
Document26 pages
Design-And-Analysis-Of-Algorithms (Set 1)
Rana Hossam Elden
No ratings yet
Numerical Methods: Some Example Applications in C++
Document27 pages
Numerical Methods: Some Example Applications in C++
Sanjiv Ch
No ratings yet
Dosang Joe and Hyungju Park: Nal Varieties. Rational Parametrizations of The Irreducible Components
Document12 pages
Dosang Joe and Hyungju Park: Nal Varieties. Rational Parametrizations of The Irreducible Components
Mohammad Mofeez Alam
No ratings yet
Down Load QPaper View
Document4 pages
Down Load QPaper View
Pintu Das
No ratings yet
Solving Zero-Dimensional Algebraic Systems
Document15 pages
Solving Zero-Dimensional Algebraic Systems
Anonymous Tph9x741
No ratings yet
Cs502 Midterm Solved Mcqs by Junaid
Document47 pages
Cs502 Midterm Solved Mcqs by Junaid
Abdullah Naeem
100% (1)
MCC Esa99 Final
Document14 pages
MCC Esa99 Final
Ademar Cardoso
No ratings yet
Types of Algorithms Solutions: 1. Which of The Following Algorithms Is NOT A Divide & Conquer Algorithm by Nature?
Document8 pages
Types of Algorithms Solutions: 1. Which of The Following Algorithms Is NOT A Divide & Conquer Algorithm by Nature?
ergrehge
No ratings yet
True or False: Stack
Document15 pages
True or False: Stack
Amro Abosaif
No ratings yet
Cs502 Grand Quiz by Junaid
Document81 pages
Cs502 Grand Quiz by Junaid
Atif Mubashar
No ratings yet
Design and Analysis of Algorithms Solved MCQs (Set-1) PDF
Document7 pages
Design and Analysis of Algorithms Solved MCQs (Set-1) PDF
kantipucappa 19
No ratings yet
Assignment 1
Document2 pages
Assignment 1
karthikvs88
No ratings yet
Trigonometric Ratios to Transformations (Trigonometry) Mathematics E-Book For Public Exams
From Everand
Trigonometric Ratios to Transformations (Trigonometry) Mathematics E-Book For Public Exams
Mohmmad Khaja Shareef
Rating: 5 out of 5 stars
5/5 (1)
Math Practice Tests For The ACT
From Everand
Math Practice Tests For The ACT
Vibrant Publishers
No ratings yet
Metaheuristics for String Problems in Bio-informatics
From Everand
Metaheuristics for String Problems in Bio-informatics
Christian Blum
No ratings yet
Android Debug Bridge - Android Developers
Document11 pages
Android Debug Bridge - Android Developers
proxymo1
No ratings yet
CUDA Compute Unified Device Architecture
Document26 pages
CUDA Compute Unified Device Architecture
proxymo1
No ratings yet
Extending OpenMP To Clusters
Document1 page
Extending OpenMP To Clusters
proxymo1
No ratings yet
Parallel Novikov Si 1-2-2013
Document14 pages
Parallel Novikov Si 1-2-2013
proxymo1
No ratings yet
Khem Raj Embedded Linux Conference 2014, San Jose, CA
Document29 pages
Khem Raj Embedded Linux Conference 2014, San Jose, CA
proxymo1
No ratings yet
Device Tree Migration
Document19 pages
Device Tree Migration
proxymo1
No ratings yet
Factor of Safety and Stress Analysis of Fuselage Bulkhead Using Composite
Document7 pages
Factor of Safety and Stress Analysis of Fuselage Bulkhead Using Composite
proxymo1
No ratings yet
OpenCL Jumpstart Guide
Document17 pages
OpenCL Jumpstart Guide
proxymo1
No ratings yet
Model 0710 & Experiment Overview
Document10 pages
Model 0710 & Experiment Overview
proxymo1
No ratings yet
Aircraft Structures II Lab
Document15 pages
Aircraft Structures II Lab
proxymo1
0% (1)
GBA Programming Manual v1.22
Document172 pages
GBA Programming Manual v1.22
proxymo1
No ratings yet
Network World Magazine Article 1328
Document4 pages
Network World Magazine Article 1328
proxymo1
No ratings yet
Anupam Raj Resume
Document1 page
Anupam Raj Resume
Shivakant Upadhyay
No ratings yet
OWASP TOP 10 - A8-2017-Insecure Deserialization
Document12 pages
OWASP TOP 10 - A8-2017-Insecure Deserialization
apel
No ratings yet
Functional Design and Architecture
Document495 pages
Functional Design and Architecture
Suren Kirakosyan
No ratings yet
Concurrent Processes and Real-Time Scheduling: Concurrency in Embedded Systems
Document16 pages
Concurrent Processes and Real-Time Scheduling: Concurrency in Embedded Systems
fcmandi
No ratings yet
Speeding Up The Hungarian Algorithm: Computers & Operations Research December 1990
Document3 pages
Speeding Up The Hungarian Algorithm: Computers & Operations Research December 1990
James Templeton
No ratings yet
UFMFF8-30-1 Exam 2018
Document7 pages
UFMFF8-30-1 Exam 2018
Irfan Memon
No ratings yet
Socket Io
Document15 pages
Socket Io
verrymariyanto
No ratings yet
Powershell Gallery Powershellget 2.x
Document147 pages
Powershell Gallery Powershellget 2.x
mggs
No ratings yet
cs8492 Notes PDF
Document126 pages
cs8492 Notes PDF
Anand
No ratings yet
GeorgiaTech CS-6515: Graduate Algorithms: Graphs-Book Notes Flashcards by Calvin Hoang - Brainscape
Document10 pages
GeorgiaTech CS-6515: Graduate Algorithms: Graphs-Book Notes Flashcards by Calvin Hoang - Brainscape
Jan_Su
No ratings yet
Hardware Descriptive Language
Document3 pages
Hardware Descriptive Language
ShayneMayores
No ratings yet
"Implementation On An Approach For Mining of Datasets Using APRIORI Hybrid
Document5 pages
"Implementation On An Approach For Mining of Datasets Using APRIORI Hybrid
International Journal of Scholarly Research
No ratings yet
BSC Computer Science Full Syllabus
Document58 pages
BSC Computer Science Full Syllabus
JANARDHANA
No ratings yet
IM201-Fundamentals of Database Systems
Document128 pages
IM201-Fundamentals of Database Systems
Jc Barrera
No ratings yet
Oracle Interview
Document13 pages
Oracle Interview
subash pradhan
No ratings yet
Python: Under The Guidance of
Document16 pages
Python: Under The Guidance of
Shiva
No ratings yet
Inter Process Communication Tutorial PDF
Document20 pages
Inter Process Communication Tutorial PDF
manisha patwari
No ratings yet
Program I:: Int N
Document6 pages
Program I:: Int N
monalisa
No ratings yet
Understanding Probability 3rd Edition. Henk Tijms. Cambridge University Press 2012.
Document6 pages
Understanding Probability 3rd Edition. Henk Tijms. Cambridge University Press 2012.
Chavobar
0% (2)
Lead .Net Developer
Document7 pages
Lead .Net Developer
Karthik Reddy
No ratings yet
Assignment A1
Document7 pages
Assignment A1
pw6233
No ratings yet
Deepa CV
Document2 pages
Deepa CV
Shrutee Patil
No ratings yet
ITDBS Lab Session 03
Document8 pages
ITDBS Lab Session 03
Waqar
No ratings yet
4-Instruction Set - IAS Computer-18-Jul-2019Material I 18-Jul-2019 Instruction Set Ias
Document9 pages
4-Instruction Set - IAS Computer-18-Jul-2019Material I 18-Jul-2019 Instruction Set Ias
Vivank Sharma
No ratings yet
NFC Read and Write
Document7 pages
NFC Read and Write
Rafael Villarreal
No ratings yet
Sap SD en Hen Cements
Document11 pages
Sap SD en Hen Cements
Rahul Mohapatra
No ratings yet
Data Structure Using C (Unit I)
Document23 pages
Data Structure Using C (Unit I)
Uttkarsh Pandey
No ratings yet