Concepts of Parallel Computing

George Mozdzynski March 2012


Outline
- What is parallel computing? Why do we need it?
- Types of computer
- Parallel computers today
- Challenges in parallel computing
- Parallel programming languages: OpenMP and message passing
- Terminology

What is Parallel Computing?

The simultaneous use of more than one processor or computer to solve a problem


Why do we need Parallel Computing?

- Serial computing is too slow
- Need for large amounts of memory not accessible by a single processor


An IFS T2047L149 forecast model takes about 5000 seconds wall time for a 10 day forecast using 128 nodes of an IBM Power6 cluster.

How long would this model take using a fast PC with sufficient memory? (e.g. dual core Dell desktop)


Answer: about 1 year.

- This PC would also need ~2000 GB of memory (4 GB is usual)
- 1 year is too long for a 10-day forecast!
- 5000 seconds is also too long
- See www.spec.org for CPU performance data (e.g. SPECfp2006)
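
(A year is roughly 3.15 x 10**7 seconds, so the single PC would be about 6,000 times slower than the 5000-second run on 128 Power6 nodes.)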

Some Terminology
Hardware:
- CPU = Core = Processor = PE (Processing Element)
- Socket = a chip with 1 or more cores (typically 2 or 4 today); more correctly, the socket is what the chip fits into

Software:
- Process (Unix/Linux) = Task (IBM)
- MPI = Message Passing Interface, the standard for programming processes (tasks) on systems with distributed memory
- OpenMP = the standard for shared memory programming (threads)
- Thread = some code / unit of work that can be scheduled (threads are cheap to start/stop compared with processes/tasks)
- User Threads = tasks * (threads per task)
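
For example, the T799 runs shown later in this course use 192 tasks with 4 threads per task, i.e. 768 user threads.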

IFS Operational Forecast Model (T1279L91, 2 days, Power6)

User Threads            1024    1536    2048    2560    3072    3584    3840    4096
Actual Wall Time (s)    1675.7  1138.0  899.9   725.1   619.7   555.8   533.3   518.8

Amdahl's Law: Wall Time = S + P/NCPUS, with Serial S = 114 secs and Parallel P = 1591806 secs
(calculated from the measured wall times using Excel's LINEST function)

Amdahl's Law (formal, named after Gene Amdahl): if F is the fraction of a calculation that is sequential, and (1-F) is the fraction that can be parallelised, then the maximum speedup that can be achieved by using N processors is 1/(F+(1-F)/N).
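
As a worked example with these fitted values: F = 114/(114 + 1591806) ~ 7.2 x 10**-5, so even with an unlimited number of processors the speedup could never exceed 1/F ~ 14,000. At N = 4096 the formula predicts a speedup of 1/(F + (1-F)/N) ~ 3,170, i.e. a wall time of about 500 seconds, consistent with the table that follows.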

Power6: SpeedUp and Efficiency, T1279L91 model, 2-day forecast (CY36R4)
(serial = 114 secs, parallel = 1591806 secs, from the LINEST fit)

User Threads               1024    1536    2048    2560    3072    3584    3840    4096
Actual Wall Time (s)       1675.7  1138.0  899.9   725.1   619.7   555.8   533.3   518.8
Calculated Wall Time (s)   1668    1150    891     736     632     558     528     502
Calculated SpeedUp         950     1399    1769    2195    2569    2864    2985    3068
Calculated Efficiency %    92.8    91.1    86.4    85.8    83.6    79.9    77.7    74.9

10-day forecast ~ 45 min
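
For illustration, a small Fortran program (not part of the original course material) that reproduces the calculated columns above from the fitted values of S and P:

  program amdahl_fit
    implicit none
    integer, parameter :: n = 8
    integer, parameter :: nthreads(n) = (/ 1024, 1536, 2048, 2560, 3072, 3584, 3840, 4096 /)
    real, parameter :: awall(n) = (/ 1675.7, 1138.0, 899.9, 725.1, 619.7, 555.8, 533.3, 518.8 /)
    real, parameter :: s = 114.0, p = 1591806.0   ! serial and parallel seconds from the LINEST fit
    real :: cwall, speedup, eff
    integer :: i

    do i = 1, n
       cwall   = s + p / real(nthreads(i))        ! Amdahl: Wall Time = S + P/NCPUS
       speedup = (s + p) / awall(i)               ! speedup relative to the fitted 1-thread time
       eff     = 100.0 * speedup / real(nthreads(i))
       print '(i6,2f10.1,f10.0,f8.1)', nthreads(i), awall(i), cwall, speedup, eff
    end do
  end program amdahl_fit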



IFS speedup on Power6

[Figure: speedup (up to ~13,000) versus user threads for the T2047L149 and T1279L91 models, compared with ideal speedup.]

IFS world record set on 10 March 2012: a T2047L137 model ran on a CRAY XE6 (HECToR) using 53,248 cores (user threads).

Measuring Performance
- Wall clock
- Floating point operations per second (FLOPS or FLOP/S)
  - Peak (hardware), Sustained (application)
- SI prefixes:
  - Mega   Mflops   10**6
  - Giga   Gflops   10**9
  - Tera   Tflops   10**12    (ECMWF: 2 x 156 Tflops peak (P6), 2008-2010 early systems)
  - Peta   Pflops   10**15
  - then Exa, Zetta, Yotta
- Instructions per second, MIPS, etc.
- Transactions per second (databases)



CRAY2 (1985, $20M, 2 Gflop/s peak, 2GB Memory, 2 x 600MB Disk)


George's Antique Home PC (2005, £700, 6 Gflop/s peak, 2 GB Memory, 250 GB Disk)

Comparing with the CRAY2:
- similar performance
- 10,000X less expensive
- 200X more disk space
- 5,000X less power
- 1000X less volume
- in 2012 you can buy a PC with 2-3 times the performance for about £400 (2 cores, 4 GB memory, 500 GB disk)

Types of Parallel Computer

[Diagram: a shared memory computer (processors P connected through a switch S to a common memory M) and a distributed memory computer (each processor P with its own memory M, connected to the others by a switch S). P = Processor, M = Memory, S = Switch.]


IBM Cluster (Distributed + Shared memory)

[Diagram: several shared memory nodes, each with processors P sharing the node's memory M, connected to one another through a switch S. P = Processor, M = Memory, S = Switch.]


IBM Power6 Clusters at ECMWF

This is just one of the TWO identical clusters



... and the world's fastest and largest supercomputer: the Fujitsu K computer

705,024 Sparc64 processor cores


ECMWF supercomputers

1979   CRAY 1A                    Vector
       CRAY XMP-2                 }
       CRAY XMP-4                 } Vector + Shared Memory Parallel
       CRAY YMP-8                 }
       CRAY C90-16                }
1996   Fujitsu VPP700             } Vector + MPI Parallel
       Fujitsu VPP5000            }
2002+  IBM Cluster (P4,5,6,7)     Scalar + MPI + Shared Memory Parallel


ECMWF's first Supercomputer

CRAY-1 (1979)


Types of Processor

  DO J=1,1000
    A(J) = B(J) + C
  ENDDO

SCALAR PROCESSOR - a single instruction processes one element:

  LOAD   B(J)
  FADD   C
  STORE  A(J)
  INCR   J
  TEST

VECTOR PROCESSOR - a single instruction processes many elements:

  LOADV   B -> V1
  FADDV   V1, C -> V2
  STOREV  V2 -> A


Parallel Computers Today

Less general purpose, higher numbers of cores => less memory per core:
- Fujitsu K-Computer (SPARC)
- IBM BlueGene
- IBM RoadRunner / Cell (PS3)

More general purpose:
- Cray XT6, XE6 (AMD Opteron)
- Hitachi (Opteron)
- HP 3000 (Xeon)
- IBM Power6 (e.g. ECMWF)
- Fujitsu (SPARC)
- SGI (Xeon)
- Sun (Opteron)
- NEC SX8, SX9
- Bull (Xeon)


Performance

The TOP500 project
- started in 1993; the top 500 sites are reported
- report produced twice a year:
  - EUROPE in JUNE
  - USA in NOV
- performance based on the LINPACK benchmark
  - dominated by matrix multiply (DGEMM)
- see also the HPC Challenge Benchmark
- http://www.top500.org/



ECMWF in Top 500

[Chart: ECMWF systems in the Top 500 lists, showing performance (TFlops) and power (KW).]

- Rmax: Tflop/sec achieved with the LINPACK Benchmark
- Rpeak: peak hardware Tflop/sec (that will never be reached!)
- In the June 2012 Top 500 list ECMWF expect to have 2 Power7 clusters, EACH with ~24000 cores


Why is Matrix Multiply (DGEMM) so efficient?

VECTOR:
- VL is the vector register length
- a vector of VL x 1 results costs VL FMAs and (VL + 1) LDs
- FMAs ~= LDs

SCALAR / CACHE:
- keep an m x n block of results in registers, with (m * n) + (m + n) < # registers
- each update of the block costs m * n FMAs and only m + n LDs
- FMAs >> LDs
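
For the scalar/cache case this corresponds to a register-blocked inner kernel. A minimal Fortran sketch (illustrative only, not the actual library DGEMM; it assumes n is a multiple of the block size):

  ! Register blocking in matrix multiply: an MB x NB block of C is updated
  ! with MB*NB fused multiply-adds per K step, but only MB+NB loads.
  subroutine block_gemm(a, b, c, n)
    implicit none
    integer, intent(in)    :: n
    real(8), intent(in)    :: a(n,n), b(n,n)
    real(8), intent(inout) :: c(n,n)
    integer, parameter :: mb = 4, nb = 4      ! (mb*nb)+(mb+nb) must fit in the registers
    real(8) :: cblk(mb,nb), acol(mb), brow(nb)
    integer :: i, j, k, ii, jj

    do j = 1, n, nb
      do i = 1, n, mb
        cblk = c(i:i+mb-1, j:j+nb-1)          ! load the C block once
        do k = 1, n
          acol = a(i:i+mb-1, k)               ! mb loads from A
          brow = b(k, j:j+nb-1)               ! nb loads from B
          do jj = 1, nb
            do ii = 1, mb
              cblk(ii,jj) = cblk(ii,jj) + acol(ii)*brow(jj)   ! mb*nb FMAs
            end do
          end do
        end do
        c(i:i+mb-1, j:j+nb-1) = cblk          ! store the C block once
      end do
    end do
  end subroutine block_gemm

With mb = nb = 4 there are 16 FMAs for 8 loads per K step; larger blocks push the ratio further, up to the register limit.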

GPU programming (NVIDIA Tesla C1060 GPU)

- GPU = Graphics Processing Unit
- programmed using CUDA, OpenCL
- high performance, low power, but challenging to programme for large applications: separate memory, GPU/CPU interface
- expect GPU technology to be more easily useable on future HPCs
- http://gpgpu.org/developer
- see the GPU talks from the ECMWF HPC workshop (final slide):
  - Mark Govett (NOAA Earth System Research Laboratory): Using GPUs to run weather prediction models
  - Tom Henderson (NOAA Earth System Research Laboratory): Progress on the GPU parallelization and optimization of the NIM global weather model
  - Dave Norton (The Portland Group): Accelerating weather models with GPGPU's

Key Architectural Features of a Supercomputer
- CPU performance
- Interconnect latency / bandwidth
- Memory latency / bandwidth
- Parallel file-system performance
... a balancing act to achieve good sustained performance


What performance do Meteorological Applications achieve?

Vector computers
- about 20 to 30 percent of peak performance (single node)
- relatively more expensive
- also have front-end scalar nodes (compiling, post-processing)

Scalar computers
- about 5 to 10 percent of peak performance
- relatively less expensive

Both vector and scalar computers are being used in Met NWP centres around the world.

Is it harder to parallelize than vectorize?
- Vectorization is mainly a compiler responsibility
- Parallelization is mainly the user's responsibility

Challenges in parallel computing

Parallel computers
- have ever increasing processors, memory and performance, but
- need more space (new computer halls = $)
- need more power (MWs = $)

Parallel computers require/produce a lot of data (I/O)
- require parallel file systems (GPFS, Lustre) + archive store

Applications need to scale to increasing numbers of processors; problem areas are
- load imbalance, serial sections, global communications

Debugging parallel applications (totalview, ddt)

We are going to be using more processors in the future!
- more cores per socket, little/no clock speed improvements

Parallel Programming Languages?

OpenMP
- directive based
- support for Fortran 90/95/2003 and C/C++
- shared memory programming only
- http://www.openmp.org

PGAS Languages (Partitioned Global Address Space)
- UPC, CAF, Titanium, Co-array Fortran (F2008)
- one programming model for inter and intra node parallelism

MPI is not a programming language!
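
To make "directive based" concrete, here is a minimal OpenMP example in Fortran (an illustrative sketch, not taken from the IFS), parallelising the same loop used earlier to introduce scalar and vector processors; compile with the compiler's OpenMP option:

  program omp_example
    use omp_lib                        ! OpenMP run-time library routines
    implicit none
    integer, parameter :: n = 1000
    real :: a(n), b(n), c
    integer :: j

    b = 1.0
    c = 2.0

    ! The directive asks the compiler to share the loop iterations
    ! among the threads of the team.
  !$OMP PARALLEL DO PRIVATE(j) SHARED(a, b, c)
    do j = 1, n
       a(j) = b(j) + c
    end do
  !$OMP END PARALLEL DO

    print *, 'threads available:', omp_get_max_threads(), ' a(1) =', a(1)
  end program omp_example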

Most Parallel Programmers use ...

Fortran 90/95/2003, C/C++ with MPI for communicating between tasks (processes)
- works for applications running on shared and distributed memory systems

Fortran 90/95/2003, C/C++ with OpenMP
- for applications that need performance that is satisfied by a single node (shared memory)

Hybrid combination of MPI/OpenMP
- ECMWF's IFS uses this approach
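
As an illustration of the hybrid approach, a minimal MPI + OpenMP skeleton in Fortran (a sketch of the general pattern, not IFS code): each MPI task would work on its own part of the data, and OpenMP threads share the work inside each task.

  program hybrid_skeleton
    use mpi
    use omp_lib
    implicit none
    integer :: ierr, rank, ntasks, provided

    ! Ask for an MPI library that tolerates threaded callers.
    call mpi_init_thread(MPI_THREAD_FUNNELED, provided, ierr)
    call mpi_comm_rank(MPI_COMM_WORLD, rank, ierr)
    call mpi_comm_size(MPI_COMM_WORLD, ntasks, ierr)

    ! Each task's subdomain would be processed here, with OpenMP
    ! parallel regions sharing the loops among the task's threads.
  !$OMP PARALLEL
    if (omp_get_thread_num() == 0) then
       print *, 'task', rank, 'of', ntasks, 'is running', omp_get_num_threads(), 'threads'
    end if
  !$OMP END PARALLEL

    call mpi_finalize(ierr)
  end program hybrid_skeleton

Launched with, say, 192 MPI tasks of 4 threads each, it corresponds to 768 user threads in the terminology used earlier.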


More Parallel Computing (Terminology)
- Cache, cache line
- Domain decomposition
- Halo, halo exchange (a minimal example follows below)
- Load imbalance
- Synchronization
- Barrier
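
To give a flavour of what a halo exchange involves, a 1D sketch in Fortran with MPI (illustrative only, not IFS code): each task owns n interior points plus one halo point at each end, which it refreshes from its neighbours.

  subroutine halo_exchange(u, n, left, right, comm)
    use mpi
    implicit none
    integer, intent(in)    :: n, left, right, comm   ! neighbour ranks (MPI_PROC_NULL at the ends)
    real(8), intent(inout) :: u(0:n+1)               ! u(0) and u(n+1) are the halo points
    integer :: ierr, status(MPI_STATUS_SIZE)

    ! Send my first interior point to the left neighbour,
    ! receive my right halo from the right neighbour.
    call mpi_sendrecv(u(1),   1, MPI_DOUBLE_PRECISION, left,  0, &
                      u(n+1), 1, MPI_DOUBLE_PRECISION, right, 0, &
                      comm, status, ierr)
    ! Send my last interior point to the right neighbour,
    ! receive my left halo from the left neighbour.
    call mpi_sendrecv(u(n),   1, MPI_DOUBLE_PRECISION, right, 1, &
                      u(0),   1, MPI_DOUBLE_PRECISION, left,  1, &
                      comm, status, ierr)
  end subroutine halo_exchange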


Cache

[Diagram: processors P, each with its own cache C (a private C1 per processor and a shared C2) between the processors and the memory. P = Processor, C = Cache, M = Memory.]


IBM Power architecture (3 levels of $)

[Diagram: eight processors P, each with its own C1 cache; pairs of processors share one of four C2 caches; all share a single C3 cache in front of the Memory.]


Cache on scalar systems
- Processors are 100s of cycles away from memory
- Cache is a small (and fast) memory closer to the processor
- Cache line typically 128 bytes
- Good for cache performance:
  - single stride (stride-1) access is always the best
  - run the inner loop over the leftmost index (Fortran); Fortran arrays are stored column-major, so the leftmost index is contiguous in memory

BETTER:
  DO J=1,N
    DO I=1,M
      A(I,J) = . . .
    ENDDO
  ENDDO

WORSE:
  DO J=1,N
    DO I=1,M
      A(J,I) = . . .
    ENDDO
  ENDDO


IFS Grid-Point Calculations (an example of blocking for cache)

[Diagram: the grid-point space of NLAT x NLON points (NGPTOT in total), with "Scalar" and "Vector" block shapes marked.]

U(NGPTOT,NLEV), where NGPTOT = NLAT * NLON and NLEV = vertical levels

  DO J=1, NGPTOT, NPROMA    ! lots of work, independent for each J
    CALL GP_CALCS
  ENDDO

  SUBROUTINE GP_CALCS
    DO I=1,NPROMA
      ...
    ENDDO
  END
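
A self-contained sketch of this blocking pattern (illustrative only; the variable names follow the slide and the real grid-point calculations are replaced by a trivial update):

  program nproma_blocking
    implicit none
    integer, parameter :: nlat = 10, nlon = 20, nlev = 5
    integer, parameter :: ngptot = nlat*nlon
    integer, parameter :: nproma = 24                ! grid-space blocking factor
    real(8) :: u(ngptot, nlev)
    integer :: j, jend

    u = 0.0d0
    do j = 1, ngptot, nproma                         ! independent blocks of grid points
       jend = min(j+nproma-1, ngptot)
       call gp_calcs(u(j:jend,:), jend-j+1, nlev)    ! all the work for one block
    end do
    print *, 'sum(u) =', sum(u)

  contains

    subroutine gp_calcs(pu, kpts, klev)
      integer, intent(in)    :: kpts, klev
      real(8), intent(inout) :: pu(kpts, klev)
      integer :: i, k
      do k = 1, klev
         do i = 1, kpts                              ! inner loop over the block (1..NPROMA)
            pu(i,k) = pu(i,k) + 1.0d0                ! placeholder for the real calculations
         end do
      end do
    end subroutine gp_calcs

  end program nproma_blocking

A small NPROMA keeps each block in cache on a scalar machine; a vector machine would prefer a much larger NPROMA.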

Grid point space blocking for Cache (Power5)

[Figure: RAPS9 FC T799L91, 192 tasks x 4 threads; run time in SECONDS (about 200-550) versus the grid space blocking factor NPROMA (1 to 1000, logarithmic axis). The best NPROMA balances optimal use of cache against subroutine call overhead.]

T799 FC 192x4 (10 runs)

[Figure: run time in SECONDS (about 226-246) versus NPROMA (20 to 60) for 10 individual runs.]

T799 FC 192x4 (average)

[Figure: average PERCENT WALLTIME (0-7%) versus NPROMA (20 to 60).]

TL799 1024 tasks 2D partitioning

- 2D partitioning results in a non-optimal semi-Lagrangian comms requirement at the poles and equator!
- Square shaped partitions are better than rectangular shaped partitions.

[Diagram: an MPI task partition with the departure point, mid-point and arrival point of a semi-Lagrangian trajectory marked.]

eq_regions algorithm


2D partitioning T799 1024 tasks (NS=32 x EW=32)


eq_regions partitioning T799 1024 tasks

N_REGIONS( 1)=  1    N_REGIONS( 2)=  7    N_REGIONS( 3)= 13    N_REGIONS( 4)= 19
N_REGIONS( 5)= 25    N_REGIONS( 6)= 31    N_REGIONS( 7)= 35    N_REGIONS( 8)= 41
N_REGIONS( 9)= 45    N_REGIONS(10)= 48    N_REGIONS(11)= 52    N_REGIONS(12)= 54
N_REGIONS(13)= 56    N_REGIONS(14)= 56    N_REGIONS(15)= 58    N_REGIONS(16)= 56
N_REGIONS(17)= 56    N_REGIONS(18)= 54    N_REGIONS(19)= 52    N_REGIONS(20)= 48
N_REGIONS(21)= 45    N_REGIONS(22)= 41    N_REGIONS(23)= 35    N_REGIONS(24)= 31
N_REGIONS(25)= 25    N_REGIONS(26)= 19    N_REGIONS(27)= 13    N_REGIONS(28)=  7
N_REGIONS(29)=  1
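
The band counts are symmetric from pole to pole and sum to 1 + 7 + 13 + ... + 13 + 7 + 1 = 1024, i.e. one region per MPI task.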


IFS physics computational imbalance (T799L91, 384 tasks): ~11% imbalance in physics, ~5% imbalance (total)


http://en.wikipedia.org/wiki/Parallel_computing


http://www.ecmwf.int/newsevents/meetings/workshops/2010/high_performance_computing_14th/index.html - a useful source of info

ECMWF HPC in Meteorology workshop
- every 2 years, early Nov, most recently Nov 2010
- use the above link to get access to all talks
- HPC vendors - CRAY, IBM, INTEL, FUJITSU, SGI, BULL
- NWP centre talks
- talks on early experience with GPUs