
ST7: High Performance Simulation

Examples of serial optimization and vectorization with SIMD units

Stéphane Vialle

Stephane.Vialle@centralesupelec.fr
http://www.metz.supelec.fr/~vialle
HPC programming strategy

Numerical algorithm

Optimized code on one core:
- Optimized compilation
- Serial optimizations
- Vectorization

Parallelized code on one node:
- Multithreading
- Minimal/relaxed synchronization
- Data-computation locality
- NUMA effect avoidance

Distributed code on a cluster:
- Message passing across processes
- Load-balanced computations
- Minimal communications
- Overlapped computations and communications
Optimization and vectorization of a Dense Matrix Product

1 – Strengths and weaknesses of the naive version
2 – 1st solution: evolution of the data storage
3 – 2nd solution: evolution of the loop order
Strengths and weaknesses of the naive version

[Figure: matrix product C = A × B; i indexes the rows of A and C, j the columns of B and C]

Strengths:
The naive version directly implements the mathematical expression
Cij = Σk Aik × Bkj

for (int i = 0; i < N; i++)
  for (int j = 0; j < N; j++)
    for (int k = 0; k < N; k++)
      C[i][j] += A[i][k]*B[k][j];

 Code is:
• easy to implement
• easy to maintain

Weaknesses:
1. The internal loop does not access the B[k][j] elements in contiguous order
    non-optimal use of the cache memory
2. The internal loop writes to the same variable C[i][j] at each iteration
    limits/stops the vectorization (iterations are not independent)
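To reproduce this kind of measurement, here is a minimal compilable harness around the naive kernel (a sketch: the matrix size, the initialization values and the clock()-based timing are illustrative assumptions, not taken from the course):

#include <stdio.h>
#include <time.h>

#define N 1024
static double A[N][N], B[N][N], C[N][N];   // C starts at zero (static storage)

int main(void) {
  // Arbitrary initialization of A and B (illustrative values).
  for (int i = 0; i < N; i++)
    for (int j = 0; j < N; j++) {
      A[i][j] = (double)(i + j);
      B[i][j] = (double)(i - j);
    }

  clock_t t0 = clock();
  // Naive triple loop: a direct transcription of C = A×B.
  for (int i = 0; i < N; i++)
    for (int j = 0; j < N; j++)
      for (int k = 0; k < N; k++)
        C[i][j] += A[i][k]*B[k][j];
  double secs = (double)(clock() - t0)/CLOCKS_PER_SEC;

  // 2·N^3 flops: one multiply and one add per k iteration.
  printf("%.2f s  %.2f Gflops\n", secs, 2.0*N*N*N/secs/1e9);
  return 0;
}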
Optimization and vectorization of a Dense Matrix Product

1 – Strengths and weaknesses of the naive version
2 – 1st solution: evolution of the data storage
3 – 2nd solution: evolution of the loop order
1st solution: evolution of the data storage
Identification of « cache misses »

for i
  for j
    for (int k=0; k<n; k++)
      Cij += Aik * Bkj

Considering successive iterations of the k-loop:
[Figure: A is read along a row (k, k+1, …), B is read down a column (k, k+1, …)]
• In A: access to successive elements in RAM  takes advantage of the cache memory mechanism
• In B: access to non-contiguous elements in RAM  many « cache misses », poor usage of the cache memory
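The contiguity argument follows from C's row-major storage. A tiny illustrative sketch (not from the course) showing the two strides involved:

#include <stdio.h>

#define N 1024
static double B[N][N];

int main(void) {
  // Row-major layout: consecutive j elements of a row are adjacent,
  // consecutive k elements of a column are N doubles (8*N bytes) apart.
  printf("&B[0][1] - &B[0][0] = %td doubles\n", &B[0][1] - &B[0][0]);  // prints 1
  printf("&B[1][0] - &B[0][0] = %td doubles\n", &B[1][0] - &B[0][0]);  // prints 1024
  return 0;
}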
1st solution: evolution of the data storage
Avoiding « cache misses »

for i
  for j
    for (int k=0; k<n; k++)
      Cij += Aik * TBjk

Considering successive iterations of the k-loop:
[Figure: A and TB are both read along a row (k, k+1, …)]
• In A: access to successive elements in RAM  takes advantage of the cache memory mechanism
• In TB: access to successive elements in RAM  takes advantage of the cache memory mechanism
1st solution: evolution of the data storage
Avoiding « cache misses »

Source code with the new data storage:
// Dense matrix product: C = A×B
for (int i = 0; i < n; i++)
  for (int j = 0; j < n; j++) {
    for (int k = 0; k < n; k++) {
      C[i][j] += A[i][k]*TB[j][k];
    }
  }
Compilation: gcc -O3 pgm.c -o pgm

This source code version is closer to the architecture of the processor:
• faster
• more complex to understand and to maintain
• requires TB … (store TB instead of B? transpose B when needed?)
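One possible answer to the last question, as a sketch: build TB once before the product (« transpose B when needed »). The helper name and VLA signature below are illustrative; the O(n²) transposition cost is quickly amortized by the O(n³) product.

// Build TB once before the product: TB[j][k] == B[k][j].
void transpose(int n, double TB[n][n], double B[n][n]) {
  for (int k = 0; k < n; k++)
    for (int j = 0; j < n; j++)
      TB[j][k] = B[k][j];
}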
1st solution: evolution of the data storage
Identification of a vectorization lock

for i
  for j
    for (int k=0; k<n; k++)
      Cij += Aik * TBjk

Considering successive iterations of the k-loop, at each iteration:
• identical computation (no if-then-else, no divergence)
• access to successive array indexes
• write/accumulation in the same variable (Cij):
  • iterations are not independent
  • vectorization is limited
1st solution: evolution of the data storage
Unlocking the vectorization

• Loop unrolling
• Accumulation in a vector of buffers

If n = 4·q:
for i
  for j
    double Acc[4] = {0}
    for (k=0; k<n; k+=4)
      Acc[0] += Aik+0 * TBjk+0
      Acc[1] += Aik+1 * TBjk+1
      Acc[2] += Aik+2 * TBjk+2
      Acc[3] += Aik+3 * TBjk+3
    Cij = Acc[0]+Acc[1]+Acc[2]+Acc[3]

Each k-loop iteration:
• includes 4 identical & independent instructions
• reads and writes successive array indexes
 The compiler can vectorize each k-loop iteration
1st solution: evolution of the data storage
Unlocking the vectorization

Source code 1: with new data storage & loop unrolling
• Loop unrolling with an 8-factor (in case of long AVX units)
• Integer division: (900/8)*8 = 896
• The residual k-loop is a generic solution that runs for any value of n

// Dense matrix product: C = A×B
for (int i = 0; i < n; i++)
  for (int j = 0; j < n; j++) {
    double accu[8] = {0.0};
    for (int k = 0; k < (n/8)*8; k += 8) {
      accu[0] += A[i][k+0]*TB[j][k+0];
      accu[1] += A[i][k+1]*TB[j][k+1];
      accu[2] += A[i][k+2]*TB[j][k+2];
      accu[3] += A[i][k+3]*TB[j][k+3];
      accu[4] += A[i][k+4]*TB[j][k+4];
      accu[5] += A[i][k+5]*TB[j][k+5];
      accu[6] += A[i][k+6]*TB[j][k+6];
      accu[7] += A[i][k+7]*TB[j][k+7];
    }
    // Residual loop when n is not a multiple of 8.
    for (int k = (n/8)*8; k < n; k++)
      accu[0] += A[i][k]*TB[j][k];
    C[i][j] = accu[0] + accu[1] + accu[2] + accu[3]
            + accu[4] + accu[5] + accu[6] + accu[7];
  }
Compilation: gcc -O3 -funroll-loops pgm.c -o pgm
(specific compilation option, to improve loop unrolling)
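An alternative to hand-written accumulators (not shown in the course): keep a single accumulator and declare the reduction safe to vectorize with OpenMP's simd pragma, compiling with gcc -O3 -fopenmp-simd. A sketch, with an illustrative function name:

// Same kernel with one accumulator; the pragma declares the accumulation
// safe to reorder, so the compiler may vectorize the k-loop itself.
void matprod_simd(int n, double A[n][n], double TB[n][n], double C[n][n]) {
  for (int i = 0; i < n; i++)
    for (int j = 0; j < n; j++) {
      double acc = 0.0;
      #pragma omp simd reduction(+:acc)
      for (int k = 0; k < n; k++)
        acc += A[i][k]*TB[j][k];
      C[i][j] = acc;
    }
}

Note that any reordered accumulation (manual or pragma-based) may round slightly differently from the strictly serial sum.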
1st solution: evolution of the data storage
Unlocking the vectorization

Source code 2: with new data storage & loop unrolling
• Loop unrolling with an 8-factor (in case of long AVX units)
• Integer division: (900/8)*8 = 896
• The residual k-loop is a generic solution that runs for any value of n

// Dense matrix product: C = A×B
for (int i = 0; i < n; i++)
  for (int j = 0; j < n; j++) {
    for (int k = 0; k < (n/8)*8; k += 8) {
      C[i][j] += A[i][k+0]*TB[j][k+0] +
                 A[i][k+1]*TB[j][k+1] +
                 A[i][k+2]*TB[j][k+2] +
                 A[i][k+3]*TB[j][k+3] +
                 A[i][k+4]*TB[j][k+4] +
                 A[i][k+5]*TB[j][k+5] +
                 A[i][k+6]*TB[j][k+6] +
                 A[i][k+7]*TB[j][k+7];
    }
    for (int k = (n/8)*8; k < n; k++)
      C[i][j] += A[i][k]*TB[j][k];
  }
Compilation: gcc -O3 -funroll-loops pgm.c -o pgm

Implements loop unrolling & one big instruction, grouping many identical operations on successive array indexes.
 Better or worse than the previous solution: depends on the compiler
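To decide between the two variants on a given compiler, GCC can report its auto-vectorization decisions; these flags exist in the gcc versions used in the experiments below:

gcc -O3 -funroll-loops -fopt-info-vec-optimized pgm.c -o pgm
    reports each loop the compiler managed to vectorize
gcc -O3 -funroll-loops -fopt-info-vec-missed pgm.c -o pgm
    reports the loops it failed to vectorize, and why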
Optimization and vectorization of a Dense Matrix Product

1 – Strengths and weaknesses of the naive version
2 – 1st solution: evolution of the data storage
3 – 2nd solution: evolution of the loop order
2nd solution: evolution of the loop order
Inversion of the j and k loops

for i
  for k
    for j
      Cij += Aik * Bkj

Considering successive iterations of the j-loop:
[Figure: B and C are traversed along a row (j, j+1, …); A[i][k] stays fixed]
• In B: access to successive array indexes  right use of the cache memory
• In C: access to successive array indexes  right use of the cache memory
• In A: access to only one element  no cache miss problem
+ Independent & identical operations
+ No write conflict
 auto-vectorization
+ No explicit loop unrolling needed (use only the -funroll-loops option of the compiler)
2nd solution: evolution of the loop order
Inversion of the j and k loops

Source code with the new loop order: « ikj »
// Dense matrix product: C = A×B
for (int i = 0; i < n; i++)
  for (int k = 0; k < n; k++)
    for (int j = 0; j < n; j++)
      C[i][j] += A[i][k]*B[k][j];
Compilation: gcc -O3 -funroll-loops pgm.c -o pgm
(usually faster when compiled with the -funroll-loops option)

• Right use of the cache memory
• Suppression of the write conflict during vectorization of the inner loop
 Decreases the number of cache misses
 Enables auto-vectorization

Elegant and efficient! … but not always so simple!
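For illustration, here is roughly what the auto-vectorizer produces for the inner j-loop, hand-written with AVX intrinsics (a sketch assuming n is a multiple of 4, row-major flat arrays, C zeroed beforehand, and compilation with gcc -O3 -mavx; the function name is illustrative):

#include <immintrin.h>

// ikj product with the j-loop vectorized by hand: 4 doubles per AVX register.
void matprod_ikj_avx(int n, const double *A, const double *B, double *C) {
  for (int i = 0; i < n; i++)
    for (int k = 0; k < n; k++) {
      __m256d a = _mm256_set1_pd(A[i*n + k]);        // broadcast A[i][k]
      for (int j = 0; j < n; j += 4) {
        __m256d b = _mm256_loadu_pd(&B[k*n + j]);    // 4 contiguous B elements
        __m256d c = _mm256_loadu_pd(&C[i*n + j]);    // 4 contiguous C elements
        c = _mm256_add_pd(c, _mm256_mul_pd(a, b));   // C[i][j..j+3] += A[i][k]*B[k][j..j+3]
        _mm256_storeu_pd(&C[i*n + j], c);
      }
    }
}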
2nd solution: evolution of the loop order
Investigating all possible inner loops

Statement: Cij += Aik * Bkj

Inner loop = i:
• Right use of cache:     Cij: NO! (cache misses)    Aik: NO! (cache misses)    Bkj: OK (access same elt)
• Vectorization enabled:  Cij: NO (not contiguous)   Aik: NO (not contiguous)   Bkj: OK (access same elt)

Inner loop = j:
• Right use of cache:     Cij: OK (contiguous)       Aik: OK (access same elt)  Bkj: OK (contiguous)
• Vectorization enabled:  Cij: OK (contiguous, no write conflict)   Aik: OK (access same elt)   Bkj: OK (contiguous)

Inner loop = k:
• Right use of cache:     Cij: OK (access same elt)  Aik: OK (contiguous)       Bkj: NO! (cache misses)
• Vectorization enabled:  Cij: NO! (write conflict)  Aik: OK (contiguous)       Bkj: NO (not contiguous)

 Inner loop = j-loop: the only right solution (without changing the data storage)
Experiments

Exp. on an Intel Xeon Haswell

Experiments (2018):
Dense matrix product: 4096x4096, double precision
Processor: Intel Xeon Haswell E5-2637 v3 - 2014 (4 physical cores - 2 threads/core)

           Seq. naive   Seq. naive   -O3 + optimized code    BLAS, monothread
           -O0          -O3          + vectorization         (OpenBLAS)
Gflops     0.12         0.35         3.10                    46.3
Speedup    ×1.0         ×2.9         ×25.8                   ×385.8
                                     2 threads/core          1 thread/core
                                     (best configuration)    (best configuration)

 Use optimized HPC libraries when available
 Optimize your source code when an HPC library does not exist
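For reference, the OpenBLAS measurement of the last column boils down to a single call to the standard CBLAS interface (a sketch assuming row-major n×n arrays of double; link with -lopenblas; the wrapper name is illustrative):

#include <cblas.h>

// C = 1.0*A*B + 0.0*C, all matrices n×n in row-major order.
void matprod_blas(int n, const double *A, const double *B, double *C) {
  cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
              n, n, n,
              1.0, A, n, B, n,
              0.0, C, n);
}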
Exp. on 2 Intel Xeon architectures

Experiments on kernel K0 (section 2) - 2019
• Sarah: quad-core at 3.5 GHz (E5-2637 v3 « haswell », 15MB cache, 2014 - gcc 5.4.0)
• Kyle: octo-core at 2.1 GHz (Silver 4110 CPU « skylake », 11MB cache, 2017 - gcc 7.3.0)

[Bar charts: GFlops (0 to 8) of the 1024x1024 and 2048x2048 products on Sarah and Kyle,
compiled with -O0 and -O3, for the successive versions: Naive, +Accu, +TB, +Loop-unroll,
+Vect-Accu, +Long op, ikj]

 The K0-ikj algorithm appears the best
 Significant parts of small matrices fit in the cache: higher K0 performances
Exp. on 2 Intel Xeon architectures

Experiments on kernel K1 (OpenBLAS) - 2019
• Sarah: quad-core at 3.5 GHz (E5-2637 v3 « haswell », 15MB cache, 2014 - gcc 5.4.0)
• Kyle: octo-core at 2.1 GHz (Silver 4110 CPU « skylake », 11MB cache, 2017 - gcc 7.3.0)

[Bar charts: GFlops (0 to 60) of the 1024x1024, 2048x2048 and 4096x4096 products on
Sarah and Kyle, comparing the best K0 (TP) with K1 (OpenBLAS)]

 The "BLAS" are the result of long developments: use this library!
Examples of serial optimization and vectorization with SIMD units

Questions ?