
PGI Compilers

Tools for Scientists and Engineers


September 2006

Brent Leback, brent.leback@pgroup.com
Dave Norton, dave.norton@pgroup.com


www.pgroup.com

Outline of Today's Topics


Introduction to PGI Compilers and Tools
Documentation, Getting Help
Basic Compiler Options
Optimization Strategies
6.2 Features and Roadmap
Questions and Answers

PGI Compilers and Tools, features


Optimization: state-of-the-art vector, parallel, IPA, feedback, cross-platform
AMD & Intel, 32/64-bit, Linux & Windows
PGI Unified Binary for AMD and Intel processors
Tools: integrated OpenMP/MPI debug & profile, IDE integration
Parallel: MPI, OpenMP 2.5, auto-parallel for multi-core
Comprehensive OS support: Red Hat 7.3 through 9.0, RHEL 3.0/4.0, Fedora Core 2/3/4/5, SuSE 7.1 through 10.1, SLES 8/9/10, Windows XP, Windows x64

PGI Tools Enable Developers to:


View x64 as a unified CPU architecture
Extract peak performance from x64 CPUs
Ride innovation waves from both Intel and AMD
Use a single source base and toolset across Linux and Windows
Develop, debug, and tune parallel applications for multi-core, multi-core SMP, and clustered multi-core SMP

PGI Documentation and Support


PGI-provided documentation
PGI User Forums, at www.pgroup.com
PGI FAQs, Tips & Techniques pages
Email support, via trs@pgroup.com
Web support, a form-based system similar to email support
Fax support

PGI Docs & Support, cont.


Legacy phone support, direct access, etc.
PGI download web page
PGI prepared/personalized training
PGI ISV program
PGI Premier Service program

PGI Basic Compiler Options


Basic usage
Language dialects
Target architectures
Debugging aids
Optimization switches

PGI Basic Compiler Usage


A compiler driver interprets options and invokes preprocessors, compilers, assembler, linker, etc.
Option precedence: if options conflict, the last option on the command line takes precedence
Use -Minfo to see a listing of optimizations and transformations performed by the compiler
Use -help to list all options, or to see details on how to use a given option, e.g. pgf90 -Mvect -help
Use man pages for more details on options, e.g. man pgf90
Use -v to see under the hood

Flags to support language dialects


Fortran
    pgf77, pgf90, pgf95, pghpf tools
    Suffixes .f, .F, .for, .fpp, .f90, .F90, .f95, .F95, .hpf, .HPF
    -Mextend, -Mfixed, -Mfreeform
    Type sizes: -i2, -i4, -i8, -r4, -r8, etc.
    -Mcray, -Mbyteswapio, -Mupcase, -Mnomain, -Mrecursive, etc.

C/C++
    pgcc, pgCC (aka pgcpp)
    Suffixes .c, .C, .cc, .cpp, .i
    -B, -c89, -c9x, -Xa, -Xc, -Xs, -Xt
    -Msignextend, -Mfcon, -Msingle, -Muchar, -Mgccbugs

Specifying the target architecture


Not an issue on the XT3; the compiler defaults to the type of processor/OS you are running on. Otherwise, use the -tp switch:
    -tp k8-64, -tp p7-64, or -tp core2-64 for 64-bit code
    -tp amd64e for AMD Opteron Rev E or later
    -tp x64 for a unified binary
    -tp k8-32, k7, p7, piv, piii, p6, p5, px for 32-bit code

Flags for debugging aids


-g generates symbolic debug information for use by a debugger
-gopt generates debug information in the presence of optimization
-Mbounds adds array bounds checking
-v gives verbose output, useful for debugging system or build problems
-Mlist generates a listing file
-Minfo provides feedback on optimizations made by the compiler
-S or -Mkeepasm to see the exact assembly generated

Basic optimization switches


Traditional optimization is controlled through -O[<n>], where n is 0 to 4.
The -fast switch combines a common set into one simple switch; it is equal to -O2 -Munroll=c:1 -Mnoframe -Mlre
    For -Munroll, c specifies completely unrolling loops with this loop count or less
    -Munroll=n:<m> unrolls other loops m times
    -Mnoframe does not set up a stack frame
    -Mlre enables loop-carried redundancy elimination

Basic optimization switches, cont.


The -fastsse switch is commonly used; it extends -fast to SSE hardware and vectorization.
-fastsse is equal to -O2 -Munroll=c:1 -Mnoframe -Mlre (-fast) plus -Mvect=sse, -Mscalarsse, -Mcache_align, and -Mflushz
    -Mcache_align aligns top-level arrays and objects on cache-line boundaries
    -Mflushz flushes SSE denormal numbers to zero

Optimization Strategies
Establish a workload
Optimize from the top down
Use proper tools and methods
Processor-level optimizations, parallel methods
Different flags/features for different types of code

Node level tuning


Vectorization: packed SSE instructions maximize performance
Interprocedural Analysis (IPA): use it! motivating examples
Function inlining: especially important for C and C++
Parallelization: for Cray XD1 and multi-core processors
Miscellaneous optimizations: hit or miss, but worth a try

Vectorizable F90 Array Syntax (Data is REAL*4)


350 !
351 !   Initialize vertex, similarity and coordinate arrays
352 !
353       Do Index = 1, NodeCount
354         IX = MOD (Index - 1, NodesX) + 1
355         IY = ((Index - 1) / NodesX) + 1
356         CoordX (IX, IY) = Position (1) + (IX - 1) * StepX
357         CoordY (IX, IY) = Position (2) + (IY - 1) * StepY
358         JetSim (Index) = SUM (Graph (:, :, Index) * &
359     &       GaborTrafo (:, :, CoordX(IX,IY), CoordY(IX,IY)))
360         VertexX (Index) = MOD (Params%Graph%RandomIndex (Index) - 1, NodesX) + 1
361         VertexY (Index) = ((Params%Graph%RandomIndex (Index) - 1) / NodesX) + 1
362       End Do

The inner loop at line 358 is vectorizable and can use packed SSE instructions


-fastsse to Enable SSE Vectorization, -Minfo to List Optimizations to stderr


% pgf95 -fastsse -Mipa=fast -Minfo -S graphRoutines.f90
localmove:
    334, Loop unrolled 1 times (completely unrolled)
    343, Loop unrolled 2 times (completely unrolled)
    358, Generated an alternate loop for the inner loop
         Generated vector sse code for inner loop
         Generated 2 prefetch instructions for this loop
         Generated vector sse code for inner loop
         Generated 2 prefetch instructions for this loop

Scalar SSE:
.LB6_668:               # lineno: 358
        movss   -12(%rax),%xmm2
        movss   -4(%rax),%xmm3
        subl    $1,%edx
        mulss   -12(%rcx),%xmm2
        addss   %xmm0,%xmm2
        mulss   -4(%rcx),%xmm3
        movss   -8(%rax),%xmm0
        mulss   -8(%rcx),%xmm0
        addss   %xmm0,%xmm2
        movss   (%rax),%xmm0
        addq    $16,%rax
        addss   %xmm3,%xmm2
        mulss   (%rcx),%xmm0
        addq    $16,%rcx
        testl   %edx,%edx
        addss   %xmm0,%xmm2
        movaps  %xmm2,%xmm0
        jg      .LB6_625

Vector SSE:
.LB6_1245:              # lineno: 358
        movlps  (%rdx,%rcx),%xmm2
        subl    $8,%eax
        movlps  16(%rcx,%rdx),%xmm3
        prefetcht0 64(%rcx,%rsi)
        prefetcht0 64(%rcx,%rdx)
        movhps  8(%rcx,%rdx),%xmm2
        mulps   (%rsi,%rcx),%xmm2
        movhps  24(%rcx,%rdx),%xmm3
        addps   %xmm2,%xmm0
        mulps   16(%rcx,%rsi),%xmm3
        addq    $32,%rcx
        testl   %eax,%eax
        addps   %xmm3,%xmm0
        jg      .LB6_1245

Facerec Scalar: 104.2 sec Facerec Vector: 84.3 sec



Vectorizable C Code Fragment?


217 void func4(float *u1, float *u2, float *u3,
221   for (i = -NE+1, p1 = u2-ny, p2 = u2+ny; i < nx+NE-1; i++)
222     u3[i] += clz * (p1[i] + p2[i]);
223   for (i = -NI+1; i < nx+NE-1; i++) {
224     float vdt = v[i] * dt;
225     u3[i] = 2.*u2[i] - u1[i] + vdt*vdt*u3[i];
226   }

% pgcc -fastsse -Minfo functions.c
func4:
    221, Loop unrolled 4 times
    221, Loop not vectorized due to data dependency
    223, Loop not vectorized due to data dependency

Pointer Arguments Inhibit Vectorization


217 void func4(float *u1, float *u2, float *u3,
221   for (i = -NE+1, p1 = u2-ny, p2 = u2+ny; i < nx+NE-1; i++)
222     u3[i] += clz * (p1[i] + p2[i]);
223   for (i = -NI+1; i < nx+NE-1; i++) {
224     float vdt = v[i] * dt;
225     u3[i] = 2.*u2[i] - u1[i] + vdt*vdt*u3[i];
226   }

% pgcc -fastsse -Msafeptr -Minfo functions.c
func4:
    221, Generated vector SSE code for inner loop
         Generated 3 prefetch instructions for this loop
    223, Unrolled inner loop 4 times

C Constant Inhibits Vectorization


217 void func4(float *u1, float *u2, float *u3,
221   for (i = -NE+1, p1 = u2-ny, p2 = u2+ny; i < nx+NE-1; i++)
222     u3[i] += clz * (p1[i] + p2[i]);
223   for (i = -NI+1; i < nx+NE-1; i++) {
224     float vdt = v[i] * dt;
225     u3[i] = 2.*u2[i] - u1[i] + vdt*vdt*u3[i];
226   }

% pgcc -fastsse -Msafeptr -Mfcon -Minfo functions.c
func4:
    221, Generated vector SSE code for inner loop
         Generated 3 prefetch instructions for this loop
    223, Generated vector SSE code for inner loop
         Generated 4 prefetch instructions for this loop

-Msafeptr Option and Pragma


-M[no]safeptr[=all | arg | auto | dummy | local | static | global]

    all      All pointers are safe
    arg      Argument pointers are safe
    local    Local pointers are safe
    static   Static local pointers are safe
    global   Global pointers are safe

#pragma [scope] [no]safeptr={arg | local | global | static | all},
where scope is global, routine, or loop

Common Barriers to SSE Vectorization

Potential dependencies & C pointers: give the compiler more info with -Msafeptr, pragmas, or the restrict type qualifier
Function calls: try inlining with -Minline or -Mipa=inline
Type conversions: manually convert constants or use flags
Large number of statements: try -Mvect=nosizelimit
Too few iterations: usually better to unroll the loop
Real dependencies: must restructure the loop, if possible

Barriers to Efficient Execution of Vector SSE Loops

Not enough work: vectors are too short
Vectors not aligned to a cache-line boundary
Non-unit strides
Code bloat if altcode is generated

Vectorization: packed SSE instructions maximize performance
Interprocedural Analysis (IPA): use it! motivating examples
Function inlining: especially important for C and C++
Parallelization: for Cray XD1 and multi-core processors
Miscellaneous optimizations: hit or miss, but worth a try

What can Interprocedural Analysis and Optimization with -Mipa do for You?

Interprocedural constant propagation
Pointer disambiguation
Alignment detection and propagation
Global variable mod/ref detection
F90 shape propagation
Function inlining
IPA optimization of libraries, including inlining

Effect of IPA on the WUPWISE Benchmark


    PGF95 Compiler Options         Execution Time in Seconds
    -fastsse                       156.49
    -fastsse -Mipa=fast            121.65
    -fastsse -Mipa=fast,inline      91.72

-Mipa=fast => constant propagation => the compiler sees the complex matrices are all 4x3 => completely unrolls loops
-Mipa=fast,inline => small matrix multiplies are all inlined

Using Interprocedural Analysis


Must be used at both compile time and link time
Non-disruptive to the edit/build/run development process
Speed-ups of 5% - 10% are common
-Mipa=safe:<name>: safe to optimize functions which call or are called from unknown function/library <name>
-Mipa=libopt: perform IPA optimizations on libraries
-Mipa=libinline: perform IPA inlining from libraries

Vectorization: packed SSE instructions maximize performance
Interprocedural Analysis (IPA): use it! motivating examples
Function inlining: especially important for C and C++
SMP parallelization: for Cray XD1 and multi-core processors
Miscellaneous optimizations: hit or miss, but worth a try

Explicit Function Inlining


-Minline[=[lib:]<inlib> | [name:]<func> | except:<func> | size:<n> | levels:<n>]

    [lib:]<inlib>    Inline extracted functions from inlib
    [name:]<func>    Inline function func
    except:<func>    Do not inline function func
    size:<n>         Inline only functions smaller than n statements (approximate)
    levels:<n>       Inline n levels of functions

For C++ codes, PGI recommends IPA-based inlining or -Minline=levels:10!

Other C++ recommendations


Encapsulation, data hiding: small functions, inline!
Exception handling: use --no_exceptions until 7.0
Overloaded operators, overloaded functions: okay
Pointer chasing: -Msafeptr, restrict qualifier, 32 bits?
Templates, generic programming: now okay
Inheritance, polymorphism, virtual functions: runtime lookup or check, no inlining, potential performance penalties

Vectorization: packed SSE instructions maximize performance
Interprocedural Analysis (IPA): use it! motivating examples
Function inlining: especially important for C and C++
SMP parallelization: for Cray XD1 and multi-core processors
Miscellaneous optimizations: hit or miss, but worth a try

SMP Parallelization

-Mconcur for auto-parallelization on multi-core
    Compiler strives for parallel outer loops, vector SSE inner loops
    -Mconcur=innermost forces a vector/parallel innermost loop
    -Mconcur=cncall enables parallelization of loops with calls
-mp to enable the OpenMP 2.5 parallel programming model
    See the PGI User's Guide or the OpenMP 2.5 standard
    OpenMP programs compiled without -mp just work
    Not supported on Cray XT3; would require some custom work
-Mconcur and -mp can be used together!

MGRID Benchmark Main Loop

      DO 10 I3=2,N-1
      DO 10 I2=2,N-1
      DO 10 I1=2,N-1
 10   R(I1,I2,I3)=V(I1,I2,I3)
     &   -A(0)*( U(I1,I2,I3) )
     &   -A(1)*( U(I1-1,I2,I3) + U(I1+1,I2,I3)
     &         + U(I1,I2-1,I3) + U(I1,I2+1,I3)
     &         + U(I1,I2,I3-1) + U(I1,I2,I3+1) )
     &   -A(2)*( U(I1-1,I2-1,I3) + U(I1+1,I2-1,I3)
     &         + U(I1-1,I2+1,I3) + U(I1+1,I2+1,I3)
     &         + U(I1,I2-1,I3-1) + U(I1,I2+1,I3-1)
     &         + U(I1,I2-1,I3+1) + U(I1,I2+1,I3+1)
     &         + U(I1-1,I2,I3-1) + U(I1-1,I2,I3+1)
     &         + U(I1+1,I2,I3-1) + U(I1+1,I2,I3+1) )
     &   -A(3)*( U(I1-1,I2-1,I3-1) + U(I1+1,I2-1,I3-1)
     &         + U(I1-1,I2+1,I3-1) + U(I1+1,I2+1,I3-1)
     &         + U(I1-1,I2-1,I3+1) + U(I1+1,I2-1,I3+1)
     &         + U(I1-1,I2+1,I3+1) + U(I1+1,I2+1,I3+1) )

Auto-parallel MGRID: overall speed-up is 40% on a dual-core AMD Opteron

% pgf95 -fastsse -Mipa=fast,inline -Minfo -Mconcur mgrid.f
resid:
    ...
    189, Parallel code for non-innermost loop activated if loop count >= 33; block distribution
    291, 4 loop-carried redundant expressions removed with 12 operations and 16 arrays
         Generated vector SSE code for inner loop
         Generated 8 prefetch instructions for this loop
         Generated vector SSE code for inner loop
         Generated 8 prefetch instructions for this loop

Vectorization: packed SSE instructions maximize performance
Interprocedural Analysis (IPA): use it! motivating examples
Function inlining: especially important for C and C++
SMP parallelization: for Cray XD1 and multi-core processors
Miscellaneous optimizations: hit or miss, but worth a try

Miscellaneous Optimizations (1)

-Mfprelaxed: single-precision sqrt, rsqrt, and div performed using a reduced-precision reciprocal approximation
-lacml and -lacml_mp: link in the AMD Core Math Library
-Mprefetch=d:<p>,n:<q>: control the prefetching distance and the max number of prefetch instructions per loop
-tp k8-32: can result in a big performance win on some C/C++ codes that don't require > 2GB addressing; pointer and long data become 32 bits

Miscellaneous Optimizations (2)

-O3: more aggressive hoisting and scalar replacement; not part of -fastsse, so always time your code to make sure it's faster
For C++ codes: --no_exceptions, -Minline=levels:10
-M[no]movnt: disable / force non-temporal moves
-V[version]: switch between PGI releases at the file level
-Mvect=noaltcode: disable multiple versions of loops

What's New in PGI 6.2

Industry-leading SPECFP06 and SPECINT06 performance
PGI Visual Fortran for Windows x64 & Windows XP
Full-featured PGI Workstation/Server for 32-bit Windows XP
PGI Unified Binary performance enhancements
More gcc extensions/compatibility
New SSE intrinsics
PGI CDK ROLL for ROCKS clusters
MPICH1 and MPICH2 support in the PGI CDK
Incremental debugger/profiler enhancements
Limited tuning for Intel Core2 (Woodcrest et al.)

PGI Visual Fortran 6.2

Deep integration with Visual Studio 2005
PGI-custom Fortran-aware text editor: syntax coloring, keyword completion, Fortran 95 intrinsics tips
PGI-custom project system and icons
PGI-custom property pages
One-touch project build/execute
PGI Unified Binary executables
Auto-parallel for multi-core CPUs
Native OpenMP 2.5 parallelization
World-class performance
64-bit Windows x64 support
32-bit Windows 2000/XP support
MS Visual C++ interoperability: mixed VC++ / PGI Fortran applications
PGI-custom parallel F95 debug engine: OpenMP 2.5 / threads debugging
Just-in-time debugging features
DVF/CVF compatibility features
Win32 API support
Optimization/support for AMD64
Optimization/support for Intel EM64T
DEC/IBM/Cray compatibility features
cpp-compatible pre-processing
Visual Studio 2005 bundled*
MSDN Library bundled*
GUI parallel debugging/profiling*
Assembly-optimized BLAS/LAPACK/FFTs*
Boxed CD-ROM/manuals media kit*

Complete (Visual Studio bundled) and Standard (no Visual Studio) versions
*PVF Workstation Complete only

On the PGI Roadmap


PGI Unified Binary directives and enhancements
Aggressive Intel Core2 and next-gen AMD64 tuning
Industry-leading SPECFP06 and SPECINT06 performance on Linux/Windows, AMD/Intel, 32/64-bit
Incremental PGDBG enhancements, improved C++ support
MPI debugging/profiling for Windows x64 CCS clusters
All-new cross-platform PGPROF performance profiler
Fortran 2003/C99 language features
GCC front-end compatibility, g++ interoperability
PGC++ tuning, PGC++/VC++ interoperability
Windows SUA and Apple/Mac OS platform support
De facto standard scalable C/Fortran language/tools extensions

Questions?

Reach me at brent.leback@pgroup.com
Thanks for your time
