Cellprocessors

| Cell Broadoand Engine Optimizing Compiler for the Cell Processor Alexandre Eichenberger, Kathryn O’Brien, Kevin O’Brien, Peng Wu, Tong Chen, Peter Oden, Daniel Prener, Janice Shepherd, Byoungro So, Zehra Sura, Amy Wang, Tao Zhang, Peng Zhao, and Michael Gschwind www.research.ibm.com/cellcompiler/compiler.htm PACT, Tuesday, September 20th, 2005Q Multiprocessor on a chip > 241M transistors, 2: > 200 GFlops (SP) @3.2GHz > 200 GB/s bus (internal) @ 3.2GHz ie Orc) Pea eaae rs pees eS eee al| O Power Proc. Element (PPE) Sabre | > general purpose > running full-fledged OSs a on Cea QO Synergistic Proc. Element (SPE) > optimized for compute densityrE eceaaeaa ig Cell Broadband Engine Overview I 8 Bytes (per dir) alee 16Bytes (one dir) oa) a fe) £ ay ri Cy ro 128Bytes (one dir) Q Heterogeneous, multi-core engine > 1 multi-threaded power processor > up to 8 compute-intensive-ISA engines Q Local Memories » fast access to 256KB local memories > globally coherent DMA to transfer data Q Pervasive SIMD >» PPE has VMX » SPEs are SIMD-only engines Q High bandwidth » fast internal bus (200GB/s) > dual XDR™ controller (25.6GB/s) > two configurable interfaces (76.8GB/s) > numbers based on 3.2GHz clock rate eeeSupporting a Broad Range of Expertise to Program Cell Highest performance with help from programmers Multiple-ISA hand-tuned Explicit SIMD coding Explicit parallelization with programs local memories a fe) si Shared jared memory, es f=) SIMD/alignment iS : ie io a in] single program (a) 2 directives a 7 & E§ abstraction im ia ed Automatic tuning foreach ISA Automatic simdization Automatic parallelization Highest Productivity with fully automatic compiler technologyOutline lame} ECR ena single program laa ned Automatic SPE tuning Automatic simdizationArchitecture: Relevant SPE Features Synergistic Processing Element (SPE) © SIMD-only functional units > 16-bytes register/memory accesses Q Simplitied branch architecture » no hardware branch predictor > compiler managed hint/predication Q Duatl-issue for instructions » full dependence check in hardware > must be parallel & properly aligned branch: 1,2 Q Single-ported local memory branch hint: 1,2 a instr. fetch: 2 » aligned accesses only dma request: 3 » contentions alleviated by compiler pees 16 bytes 128 bytes (per dir) (one dir) (one ir) eeeFeature 1: Single-Ported Local Memory SPE Q Local store is single ported » less expensive hardware > asymmetric port = 16 bytes for load/store ops = 128 bytes for IFETCH/DMA > static priority Tel * DMA> MEM > IFETCH Seema) eel (128 x 16Byte register) Q If we are not careful, we may starve for instructions Die branch: 1,2 branch hint: 1,2 (elm ca) instr. fetch: 2 dma request: 3 Hoes 16Bytes 128Bytes (per dir) (one dir) (one dir) nailer for rocessor cenInstruction Starvation Situation nitiate vefill after ralf ampty Q There are 2 instruction buffers > up to 64 ops along the fall-through path Q First buffer is half-empty > can initiate refill Q When MEM port is continuously used » starvation occurs (no ops left in buffers) eeeInstruction Starvation Prevention nitiate vefill after ralf ampty Q SPE has an explicit IFETCH op » which initiates a instruction fetch before Scheduler monitors starvation situation itis too » when MEM port is continuously used late to » insert IFETCH op within the (red) window hide latency refill IFETCH latency Q Compiler design » scheduler must keep track of code layoutQ Branch architecture > no hardware branch-predictor, but > compare/select ops for predication > software-managed branch-hint > one hint active at a time Q Lowering overhead by cating small if-then-else > hinting predictably taken branches branch hint: 1,2 instr. fetch 16 bytes les ne dirQ SPE provides a HINT operation > fetches the branch target into HINT buffer > no penalty for correctly predicted branches IFETCH window refill latency BRANCH if true target > compiler inserts hints when beneficial Q Impact on instruction starvation rectly hinted branch, IFETCH is smaller > after windrecruited SPE Optimization Results 1.0 09 08 07 06 05 A Relative reductions in execution time Original @ +Bundle @ +Branch Hint C +Ifetch 04 mem ||| € € F&F 2 * gs Ss s er < * x = RV Ay oe . & eX FF XK és single SPE performance, optimized, simdized code (avg 1.00 + 0.78) 12 Spiinizig. Compterdica Colleen seaee rrercuck=aen Outline lame} Shared memory & single program abstr. laa ned Automatic SPE tuning Automatic simdization Multiple-ISA hand-tuned Explicit SIMD coding Explicit parallelization with programs Bs A local memories D ‘ompller for eenrecruited Single Instruction Multiple Data (SIMD) Computation Process multiple “b[i]+c[i]” data per operations 16-byte boundaries bi0 ] Ri R2 16-byte boundariesind Engine Successful Simdization Extract Parallelism loop level for (i=0; 1<256; i++) i basic-block level (c (i205 i Satisfy Con alignment constraints straints aT fear] data size conversion multiple targets GENERIC VMX SPEExample of SIMD-Parallelism Extraction loop level QO Loop level i<256; i++) » SIMD for a single statement across consecutive iterations » successful at: = efficiently handling misaligned data = pattern recognition (reduction, linear recursion) = leverage loop transformations in most compilers [Bik et al, IJPP 2002] [VAST compiler, 2004] (ce (i=0; 1<8; i++) [Eichenberger et al, PLD! 2004] [Wu et al, CGO 2005] afi] = [Naishlos, GCC Developer's Summit 2004] entire short loopExample of SIMD Constraints - Alignment in SIMD units matters: alignment constraints > consider “b[i+1] + ofi+0]” feel] 16-byte boundaries bO | b1 | 62 | B39 64 | b5 | b6 | b7 this is not b[1] + c[0] Rt 16-byte boundaries eeeExample of SIMD Constraints (cont.) Q Alignment in SIMD units matters alignment constraints » when alignments within inputs do not match » must realign the data feel] 16-byte boundaries eeeAutomatic Simdization for Cell Q Integrated Approach > extract at multiple levels > satisfy all SIMD constraints > use ‘virtual SIMD vector" as glue Q Minimize alignment overhead » lazily insert data reorganization » handle compile time & runtime alignment » simdize prologue/epilogue for SPEs = memory accesses are always safe on SPE Q Full throughput computations » even in presence of data conversions » manually unrolled loops... es(eee SPE Simdization Results 30 25 20 15 Speedup factors 26.2 25.3, - 99 114 75 4 24 25 29 29 i | Linpack Swiml2 FIR Auteor = Dot’ Checksum Alpha Product Blending single SPE, optimized, automatic simdization vs. scalar code Saxpy Mat Mult Average ProcessorrE eceaaeaa ig Outline Llama} Shared memory & single program abstr. Part 1: Automatic SPE tuning alae Pai colasreeeciianolr4=lte)A) Multiple-ISA hand-tuned Explicit SIMD coding Explicit parallelization with programs Be local memories cenCompeting for the SPE Local Store Local store is fast, need support when full. irregular data = Local Store Provided compiler support: Q SPE code too large » compiler partitions code > partition manager pulls in code as needed Q Data with regular accesses is too large » compiler stages data in & out > using static buffering » can hide latencies by using double buffering Q Data with irregular accesses is present > e.g. indirection, runtime pointers... > use a software cache approach to pull the data in & out (last resort solution)Software Cache for Irregular Accesses QO Data with irregular accesses > cannot reside permanently in the SPE’s local memory (typically) » thus reside in global memory > when accessed, = must translate the global address into a local store address * must pull the data in/out when its not already present QUse a software cache > managed by the SPEs in the local store » generate DMA requests to transfer data to/from global memory » use 4-way set associative cache to naturally use the SIMD units of the SPECell Broadband En: Software Cache Achitecture Q Cache directory » 128-set, 4-way set associative > pointers to data lines > use 16KByte of data set tags data ptrs. dirty bits... data array Q Data in a separate structure > 512 x 1288 lines >» use 64KByte of data eeerE eceaaeaa ig translate this global address Software Cache Access subset of addr used by tags data ptrs. dirty bits... addr offset SIMD comparisor when successful oMTTT)-O1T) hit latency: ~ 20 extra cycles cen“Single Source” Compiler, using OpenMP #pragma Omp parallel for outline omp for( i =0; i<10000; i++) region Ali] = x" Bl; void foo() #pragma Omp parallel for for( i =0; i<10000; i++) Ali = x* Bl; code with OpenMP directives clone for SPE clone for PPE void foo_PPE() init_omp_rte(); for( isLB; icUB; i++) Ali] = x* Bl; void foo_SPE() for( k=LB; k dense compute architecture O Present the application writer with a wide range of tool > from support to extract maximum performance > to support to achieve maximum productivity with automatic tools O Shown respectable speedups * using automatic tuning, simdization, and support for shared-memory abstraction fieQuestions For additional info: www.research.ibm.com/cellcompiler/compiler.htmExtra a(eErecreacnan Compiler Support For Single Source Program 73 Object PPE (eleita JOU SaS a oiteras JadwoO ads Jeppequiz dS PPE Source Object PPE PPE feelers aa SER sree Jai0WO9 3dd 2 v m ia a o eee

Cellprocessors

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Cellprocessors

Uploaded by

Copyright:

Available Formats

You might also like