
ST7: High Performance Simulation

Examples of serial optimization and vectorization with SIMD units

Stéphane Vialle

Stephane.Vialle@centralesupelec.fr
http://www.metz.supelec.fr/~vialle
HPC programming strategy

Numerical algorithm

Optimized code on one core:
- Optimized compilation
- Serial optimizations
- Vectorization

Parallelized code on one node:
- Multithreading
- Minimal/relaxed synchronization
- Data-computation locality
- NUMA effect avoidance

Distributed code on a cluster:
- Message passing across processes
- Load-balanced computations
- Minimal communications
- Overlapped computations and communications
Optimization and vectorization of a Dense Matrix Product

1 – Strengths and weaknesses of the naive version
2 – 1st solution: evolution of the data storage
3 – 2nd solution: evolution of the loop order
Strengths and weaknesses of the naive version

[Figure: matrix product C = A × B; i indexes the rows of A and C, j the columns of B and C]

Strengths:
The naive version directly implements the mathematical expression
Cij = Σk Aik × Bkj

for (int i = 0; i < N; i++)
  for (int j = 0; j < N; j++)
    for (int k = 0; k < N; k++)
      C[i][j] += A[i][k]*B[k][j];

 Code is:
• easy to implement
• easy to maintain

Weaknesses:
1. The internal loop does not access the B[k][j] elements in contiguous order
    non-optimal use of the cache memory
2. The internal loop writes to the same variable C[i][j] at each iteration
    limits/stops the vectorization (iterations are not independent)
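To reproduce this kind of measurement, here is a minimal compilable harness around the naive kernel (a sketch: the matrix size, the initialization values and the clock()-based timing are illustrative assumptions, not taken from the course):

#include <stdio.h>
#include <time.h>

#define N 1024
static double A[N][N], B[N][N], C[N][N];   // C starts at zero (static storage)

int main(void) {
  // Arbitrary initialization of A and B (illustrative values).
  for (int i = 0; i < N; i++)
    for (int j = 0; j < N; j++) {
      A[i][j] = (double)(i + j);
      B[i][j] = (double)(i - j);
    }

  clock_t t0 = clock();
  // Naive triple loop: a direct transcription of C = A×B.
  for (int i = 0; i < N; i++)
    for (int j = 0; j < N; j++)
      for (int k = 0; k < N; k++)
        C[i][j] += A[i][k]*B[k][j];
  double secs = (double)(clock() - t0)/CLOCKS_PER_SEC;

  // 2·N^3 flops: one multiply and one add per k iteration.
  printf("%.2f s  %.2f Gflops\n", secs, 2.0*N*N*N/secs/1e9);
  return 0;
}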
Optimization and vectorization of a Dense Matrix Product

1 – Strengths and weaknesses of the naive version
2 – 1st solution: evolution of the data storage
3 – 2nd solution: evolution of the loop order
1st solution: evolution of the data storage
Identification of « cache misses »

for i
  for j
    for (int k=0; k<n; k++)
      Cij += Aik * Bkj

Considering successive iterations of the k-loop:
[Figure: A is read along a row (k, k+1, …), B is read down a column (k, k+1, …)]
• In A: access to successive elements in RAM  takes advantage of the cache memory mechanism
• In B: access to non-contiguous elements in RAM  many « cache misses », poor usage of the cache memory
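The contiguity argument follows from C's row-major storage. A tiny illustrative sketch (not from the course) showing the two strides involved:

#include <stdio.h>

#define N 1024
static double B[N][N];

int main(void) {
  // Row-major layout: consecutive j elements of a row are adjacent,
  // consecutive k elements of a column are N doubles (8*N bytes) apart.
  printf("&B[0][1] - &B[0][0] = %td doubles\n", &B[0][1] - &B[0][0]);  // prints 1
  printf("&B[1][0] - &B[0][0] = %td doubles\n", &B[1][0] - &B[0][0]);  // prints 1024
  return 0;
}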
1st solution: evolution of the data storage
Avoiding « cache misses »

for i
  for j
    for (int k=0; k<n; k++)
      Cij += Aik * TBjk

Considering successive iterations of the k-loop:
[Figure: A and TB are both read along a row (k, k+1, …)]
• In A: access to successive elements in RAM  takes advantage of the cache memory mechanism
• In TB: access to successive elements in RAM  takes advantage of the cache memory mechanism
1st solution: evolution of the data storage
Avoiding « cache misses »

Source code with the new data storage:
// Dense matrix product: C = A×B
for (int i = 0; i < n; i++)
  for (int j = 0; j < n; j++) {
    for (int k = 0; k < n; k++) {
      C[i][j] += A[i][k]*TB[j][k];
    }
  }
Compilation: gcc -O3 pgm.c -o pgm

This source code version is closer to the architecture of the processor:
• faster
• more complex to understand and to maintain
• requires TB … (store TB instead of B? transpose B when needed?)
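One possible answer to the last question, as a sketch: build TB once before the product (« transpose B when needed »). The helper name and VLA signature below are illustrative; the O(n²) transposition cost is quickly amortized by the O(n³) product.

// Build TB once before the product: TB[j][k] == B[k][j].
void transpose(int n, double TB[n][n], double B[n][n]) {
  for (int k = 0; k < n; k++)
    for (int j = 0; j < n; j++)
      TB[j][k] = B[k][j];
}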
1st solution: evolution of the data storage
Identification of a vectorization lock

for i
  for j
    for (int k=0; k<n; k++)
      Cij += Aik * TBjk

Considering successive iterations of the k-loop, at each iteration:
• identical computation (no if-then-else, no divergence)
• access to successive array indexes
• write/accumulation in the same variable (Cij):
  • iterations are not independent
  • vectorization is limited
1st solution: evolution of the data storage
Unlocking the vectorization

• Loop unrolling
• Accumulation in a vector of buffers

If n = 4·q:
for i
  for j
    double Acc[4] = {0}
    for (k=0; k<n; k+=4)
      Acc[0] += Aik+0 * TBjk+0
      Acc[1] += Aik+1 * TBjk+1
      Acc[2] += Aik+2 * TBjk+2
      Acc[3] += Aik+3 * TBjk+3
    Cij = Acc[0]+Acc[1]+Acc[2]+Acc[3]

Each k-loop iteration:
• includes 4 identical & independent instructions
• reads and writes successive array indexes
 The compiler can vectorize each k-loop iteration
1st solution: evolution of the data storage
Unlocking the vectorization

Source code 1: with new data storage & loop unrolling
• Loop unrolling with an 8-factor (in case of long AVX units)
• Integer division: (900/8)*8 = 896
• The residual k-loop is a generic solution that runs for any value of n

// Dense matrix product: C = A×B
for (int i = 0; i < n; i++)
  for (int j = 0; j < n; j++) {
    double accu[8] = {0.0};
    for (int k = 0; k < (n/8)*8; k += 8) {
      accu[0] += A[i][k+0]*TB[j][k+0];
      accu[1] += A[i][k+1]*TB[j][k+1];
      accu[2] += A[i][k+2]*TB[j][k+2];
      accu[3] += A[i][k+3]*TB[j][k+3];
      accu[4] += A[i][k+4]*TB[j][k+4];
      accu[5] += A[i][k+5]*TB[j][k+5];
      accu[6] += A[i][k+6]*TB[j][k+6];
      accu[7] += A[i][k+7]*TB[j][k+7];
    }
    // Residual loop when n is not a multiple of 8.
    for (int k = (n/8)*8; k < n; k++)
      accu[0] += A[i][k]*TB[j][k];
    C[i][j] = accu[0] + accu[1] + accu[2] + accu[3]
            + accu[4] + accu[5] + accu[6] + accu[7];
  }
Compilation: gcc -O3 -funroll-loops pgm.c -o pgm
(specific compilation option, to improve loop unrolling)
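An alternative to hand-written accumulators (not shown in the course): keep a single accumulator and declare the reduction safe to vectorize with OpenMP's simd pragma, compiling with gcc -O3 -fopenmp-simd. A sketch, with an illustrative function name:

// Same kernel with one accumulator; the pragma declares the accumulation
// safe to reorder, so the compiler may vectorize the k-loop itself.
void matprod_simd(int n, double A[n][n], double TB[n][n], double C[n][n]) {
  for (int i = 0; i < n; i++)
    for (int j = 0; j < n; j++) {
      double acc = 0.0;
      #pragma omp simd reduction(+:acc)
      for (int k = 0; k < n; k++)
        acc += A[i][k]*TB[j][k];
      C[i][j] = acc;
    }
}

Note that any reordered accumulation (manual or pragma-based) may round slightly differently from the strictly serial sum.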
1st solution: evolution of the data storage
Unlocking the vectorization

Source code 2: with new data storage & loop unrolling
• Loop unrolling with an 8-factor (in case of long AVX units)
• Integer division: (900/8)*8 = 896
• The residual k-loop is a generic solution that runs for any value of n

// Dense matrix product: C = A×B
for (int i = 0; i < n; i++)
  for (int j = 0; j < n; j++) {
    for (int k = 0; k < (n/8)*8; k += 8) {
      C[i][j] += A[i][k+0]*TB[j][k+0] +
                 A[i][k+1]*TB[j][k+1] +
                 A[i][k+2]*TB[j][k+2] +
                 A[i][k+3]*TB[j][k+3] +
                 A[i][k+4]*TB[j][k+4] +
                 A[i][k+5]*TB[j][k+5] +
                 A[i][k+6]*TB[j][k+6] +
                 A[i][k+7]*TB[j][k+7];
    }
    for (int k = (n/8)*8; k < n; k++)
      C[i][j] += A[i][k]*TB[j][k];
  }
Compilation: gcc -O3 -funroll-loops pgm.c -o pgm

Implements loop unrolling & one big instruction, grouping many identical operations on successive array indexes.
 Better or worse than the previous solution: depends on the compiler
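To decide between the two variants on a given compiler, GCC can report its auto-vectorization decisions; these flags exist in the gcc versions used in the experiments below:

gcc -O3 -funroll-loops -fopt-info-vec-optimized pgm.c -o pgm
    reports each loop the compiler managed to vectorize
gcc -O3 -funroll-loops -fopt-info-vec-missed pgm.c -o pgm
    reports the loops it failed to vectorize, and why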
Optimization and vectorization of a Dense Matrix Product

1 – Strengths and weaknesses of the naive version
2 – 1st solution: evolution of the data storage
3 – 2nd solution: evolution of the loop order
2nd solution: evolution of the loop order
Inversion of the j and k loops

for i
  for k
    for j
      Cij += Aik * Bkj

Considering successive iterations of the j-loop:
[Figure: B and C are traversed along a row (j, j+1, …); A[i][k] stays fixed]
• In B: access to successive array indexes  right use of the cache memory
• In C: access to successive array indexes  right use of the cache memory
• In A: access to only one element  no cache miss problem
+ Independent & identical operations
+ No write conflict
 auto-vectorization
+ No explicit loop unrolling needed (use only the -funroll-loops option of the compiler)
2nd solution: evolution of the loop order
Inversion of the j and k loops

Source code with the new loop order: « ikj »
// Dense matrix product: C = A×B
for (int i = 0; i < n; i++)
  for (int k = 0; k < n; k++)
    for (int j = 0; j < n; j++)
      C[i][j] += A[i][k]*B[k][j];
Compilation: gcc -O3 -funroll-loops pgm.c -o pgm
(usually faster when compiled with the -funroll-loops option)

• Right use of the cache memory
• Suppression of the write conflict during vectorization of the inner loop
 Decreases the number of cache misses
 Enables auto-vectorization

Elegant and efficient! … but not always so simple!
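For illustration, here is roughly what the auto-vectorizer produces for the inner j-loop, hand-written with AVX intrinsics (a sketch assuming n is a multiple of 4, row-major flat arrays, C zeroed beforehand, and compilation with gcc -O3 -mavx; the function name is illustrative):

#include <immintrin.h>

// ikj product with the j-loop vectorized by hand: 4 doubles per AVX register.
void matprod_ikj_avx(int n, const double *A, const double *B, double *C) {
  for (int i = 0; i < n; i++)
    for (int k = 0; k < n; k++) {
      __m256d a = _mm256_set1_pd(A[i*n + k]);        // broadcast A[i][k]
      for (int j = 0; j < n; j += 4) {
        __m256d b = _mm256_loadu_pd(&B[k*n + j]);    // 4 contiguous B elements
        __m256d c = _mm256_loadu_pd(&C[i*n + j]);    // 4 contiguous C elements
        c = _mm256_add_pd(c, _mm256_mul_pd(a, b));   // C[i][j..j+3] += A[i][k]*B[k][j..j+3]
        _mm256_storeu_pd(&C[i*n + j], c);
      }
    }
}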
2nd solution: evolution of the loop order
Investigating all possible inner loops

Statement: Cij += Aik * Bkj

Inner loop = i:
• Right use of cache:     Cij: NO! (cache misses)    Aik: NO! (cache misses)    Bkj: OK (access same elt)
• Vectorization enabled:  Cij: NO (not contiguous)   Aik: NO (not contiguous)   Bkj: OK (access same elt)

Inner loop = j:
• Right use of cache:     Cij: OK (contiguous)       Aik: OK (access same elt)  Bkj: OK (contiguous)
• Vectorization enabled:  Cij: OK (contiguous, no write conflict)   Aik: OK (access same elt)   Bkj: OK (contiguous)

Inner loop = k:
• Right use of cache:     Cij: OK (access same elt)  Aik: OK (contiguous)       Bkj: NO! (cache misses)
• Vectorization enabled:  Cij: NO! (write conflict)  Aik: OK (contiguous)       Bkj: NO (not contiguous)

 Inner loop = j-loop: the only right solution (without changing the data storage)
Experiments

Exp. on an Intel Xeon Haswell

Experiments (2018):
Dense matrix product: 4096x4096, double precision
Processor: Intel Xeon Haswell E5-2637 v3 - 2014 (4 physical cores - 2 threads/core)

           Seq. naive   Seq. naive   -O3 + optimized code    BLAS, monothread
           -O0          -O3          + vectorization         (OpenBLAS)
Gflops     0.12         0.35         3.10                    46.3
Speedup    ×1.0         ×2.9         ×25.8                   ×385.8
                                     2 threads/core          1 thread/core
                                     (best configuration)    (best configuration)

 Use optimized HPC libraries when available
 Optimize your source code when an HPC library does not exist
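For reference, the OpenBLAS measurement of the last column boils down to a single call to the standard CBLAS interface (a sketch assuming row-major n×n arrays of double; link with -lopenblas; the wrapper name is illustrative):

#include <cblas.h>

// C = 1.0*A*B + 0.0*C, all matrices n×n in row-major order.
void matprod_blas(int n, const double *A, const double *B, double *C) {
  cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
              n, n, n,
              1.0, A, n, B, n,
              0.0, C, n);
}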
Exp. on 2 Intel Xeon architectures

Experiments on kernel K0 (section 2) - 2019
• Sarah: quad-core at 3.5 GHz (E5-2637 v3 « haswell », 15MB cache, 2014 - gcc 5.4.0)
• Kyle: octo-core at 2.1 GHz (Silver 4110 CPU « skylake », 11MB cache, 2017 - gcc 7.3.0)

[Bar charts: GFlops (0 to 8) of the 1024x1024 and 2048x2048 products on Sarah and Kyle,
compiled with -O0 and -O3, for the successive versions: Naive, +Accu, +TB, +Loop-unroll,
+Vect-Accu, +Long op, ikj]

 The K0-ikj algorithm appears the best
 Significant parts of small matrices fit in the cache: higher K0 performances
Exp. on 2 Intel Xeon architectures

Experiments on kernel K1 (OpenBLAS) - 2019
• Sarah: quad-core at 3.5 GHz (E5-2637 v3 « haswell », 15MB cache, 2014 - gcc 5.4.0)
• Kyle: octo-core at 2.1 GHz (Silver 4110 CPU « skylake », 11MB cache, 2017 - gcc 7.3.0)

[Bar charts: GFlops (0 to 60) of the 1024x1024, 2048x2048 and 4096x4096 products on
Sarah and Kyle, comparing the best K0 (TP) with K1 (OpenBLAS)]

 The "BLAS" are the result of long developments: use this library!
Examples of serial optimization and vectorization with SIMD units

Questions ?