Concepts of Parallel Computing

George Mozdzynski March 2012


Outline
- What is parallel computing? Why do we need it?
- Types of computer
- Parallel computers today
- Challenges in parallel computing
- Parallel programming languages: OpenMP and message passing
- Terminology

What is Parallel Computing?

The simultaneous use of more than one processor or computer to solve a problem


Why do we need Parallel Computing?

- Serial computing is too slow
- Need for large amounts of memory not accessible by a single processor


An IFS T2047L149 forecast model takes about 5000 seconds wall time for a 10 day forecast using 128 nodes of an IBM Power6 cluster.

How long would this model take using a fast PC with sufficient memory? (e.g. dual core Dell desktop)


Answer: about 1 year.

- This PC would also need ~2000 GB of memory (4 GB is usual)
- 1 year is too long for a 10-day forecast!
- 5000 seconds is also too long
- See www.spec.org for CPU performance data (e.g. SPECfp2006)
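
(A year is roughly 3.15 x 10**7 seconds, so the single PC would be about 6,000 times slower than the 5000-second run on 128 Power6 nodes.)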

Some Terminology
Hardware:
- CPU = Core = Processor = PE (Processing Element)
- Socket = a chip with 1 or more cores (typically 2 or 4 today); more correctly, the socket is what the chip fits into

Software:
- Process (Unix/Linux) = Task (IBM)
- MPI = Message Passing Interface, the standard for programming processes (tasks) on systems with distributed memory
- OpenMP = the standard for shared memory programming (threads)
- Thread = some code / unit of work that can be scheduled (threads are cheap to start/stop compared with processes/tasks)
- User Threads = tasks * (threads per task)
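
For example, the T799 runs shown later in this course use 192 tasks with 4 threads per task, i.e. 768 user threads.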

IFS Operational Forecast Model (T1279L91, 2 days, Power6)

User Threads            1024    1536    2048    2560    3072    3584    3840    4096
Actual Wall Time (s)    1675.7  1138.0  899.9   725.1   619.7   555.8   533.3   518.8

Amdahl's Law: Wall Time = S + P/NCPUS, with Serial S = 114 secs and Parallel P = 1591806 secs
(calculated from the measured wall times using Excel's LINEST function)

Amdahl's Law (formal, named after Gene Amdahl): if F is the fraction of a calculation that is sequential, and (1-F) is the fraction that can be parallelised, then the maximum speedup that can be achieved by using N processors is 1/(F+(1-F)/N).
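
As a worked example with these fitted values: F = 114/(114 + 1591806) ~ 7.2 x 10**-5, so even with an unlimited number of processors the speedup could never exceed 1/F ~ 14,000. At N = 4096 the formula predicts a speedup of 1/(F + (1-F)/N) ~ 3,170, i.e. a wall time of about 500 seconds, consistent with the table that follows.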

Power6: SpeedUp and Efficiency, T1279L91 model, 2-day forecast (CY36R4)
(serial = 114 secs, parallel = 1591806 secs, from the LINEST fit)

User Threads               1024    1536    2048    2560    3072    3584    3840    4096
Actual Wall Time (s)       1675.7  1138.0  899.9   725.1   619.7   555.8   533.3   518.8
Calculated Wall Time (s)   1668    1150    891     736     632     558     528     502
Calculated SpeedUp         950     1399    1769    2195    2569    2864    2985    3068
Calculated Efficiency %    92.8    91.1    86.4    85.8    83.6    79.9    77.7    74.9

10-day forecast ~ 45 min
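
For illustration, a small Fortran program (not part of the original course material) that reproduces the calculated columns above from the fitted values of S and P:

  program amdahl_fit
    implicit none
    integer, parameter :: n = 8
    integer, parameter :: nthreads(n) = (/ 1024, 1536, 2048, 2560, 3072, 3584, 3840, 4096 /)
    real, parameter :: awall(n) = (/ 1675.7, 1138.0, 899.9, 725.1, 619.7, 555.8, 533.3, 518.8 /)
    real, parameter :: s = 114.0, p = 1591806.0   ! serial and parallel seconds from the LINEST fit
    real :: cwall, speedup, eff
    integer :: i

    do i = 1, n
       cwall   = s + p / real(nthreads(i))        ! Amdahl: Wall Time = S + P/NCPUS
       speedup = (s + p) / awall(i)               ! speedup relative to the fitted 1-thread time
       eff     = 100.0 * speedup / real(nthreads(i))
       print '(i6,2f10.1,f10.0,f8.1)', nthreads(i), awall(i), cwall, speedup, eff
    end do
  end program amdahl_fit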



IFS speedup on Power6

[Figure: speedup (up to ~13,000) versus user threads for the T2047L149 and T1279L91 models, compared with ideal speedup.]

IFS world record set on 10 March 2012: a T2047L137 model ran on a CRAY XE6 (HECToR) using 53,248 cores (user threads).

Measuring Performance
- Wall clock
- Floating point operations per second (FLOPS or FLOP/S)
  - Peak (hardware), Sustained (application)
- SI prefixes:
  - Mega   Mflops   10**6
  - Giga   Gflops   10**9
  - Tera   Tflops   10**12    (ECMWF: 2 x 156 Tflops peak (P6), 2008-2010 early systems)
  - Peta   Pflops   10**15
  - then Exa, Zetta, Yotta
- Instructions per second, MIPS, etc.
- Transactions per second (databases)



CRAY2 (1985, $20M, 2 Gflop/s peak, 2GB Memory, 2 x 600MB Disk)


George's Antique Home PC (2005, £700, 6 Gflop/s peak, 2 GB Memory, 250 GB Disk)

Comparing with the CRAY2:
- similar performance
- 10,000X less expensive
- 200X more disk space
- 5,000X less power
- 1000X less volume
- in 2012 you can buy a PC with 2-3 times the performance for about £400 (2 cores, 4 GB memory, 500 GB disk)

Types of Parallel Computer

[Diagram: a shared memory computer (processors P connected through a switch S to a common memory M) and a distributed memory computer (each processor P with its own memory M, connected to the others by a switch S). P = Processor, M = Memory, S = Switch.]


IBM Cluster (Distributed + Shared memory)

[Diagram: several shared memory nodes, each with processors P sharing the node's memory M, connected to one another through a switch S. P = Processor, M = Memory, S = Switch.]


IBM Power6 Clusters at ECMWF

This is just one of the TWO identical clusters



... and the world's fastest and largest supercomputer: the Fujitsu K computer

705,024 Sparc64 processor cores


ECMWF supercomputers

1979   CRAY 1A                    Vector
       CRAY XMP-2                 }
       CRAY XMP-4                 } Vector + Shared Memory Parallel
       CRAY YMP-8                 }
       CRAY C90-16                }
1996   Fujitsu VPP700             } Vector + MPI Parallel
       Fujitsu VPP5000            }
2002+  IBM Cluster (P4,5,6,7)     Scalar + MPI + Shared Memory Parallel


ECMWF's first Supercomputer

CRAY-1 (1979)


Types of Processor

  DO J=1,1000
    A(J) = B(J) + C
  ENDDO

SCALAR PROCESSOR - a single instruction processes one element:

  LOAD   B(J)
  FADD   C
  STORE  A(J)
  INCR   J
  TEST

VECTOR PROCESSOR - a single instruction processes many elements:

  LOADV   B -> V1
  FADDV   V1, C -> V2
  STOREV  V2 -> A


Parallel Computers Today

Less general purpose, higher numbers of cores => less memory per core:
- Fujitsu K-Computer (SPARC)
- IBM BlueGene
- IBM RoadRunner / Cell (PS3)

More general purpose:
- Cray XT6, XE6 (AMD Opteron)
- Hitachi (Opteron)
- HP 3000 (Xeon)
- IBM Power6 (e.g. ECMWF)
- Fujitsu (SPARC)
- SGI (Xeon)
- Sun (Opteron)
- NEC SX8, SX9
- Bull (Xeon)


Performance

The TOP500 project
- started in 1993; the top 500 sites are reported
- report produced twice a year:
  - EUROPE in JUNE
  - USA in NOV
- performance based on the LINPACK benchmark
  - dominated by matrix multiply (DGEMM)
- see also the HPC Challenge Benchmark
- http://www.top500.org/



ECMWF in Top 500

[Chart: ECMWF systems in the Top 500 lists, showing performance (TFlops) and power (KW).]

- Rmax: Tflop/sec achieved with the LINPACK Benchmark
- Rpeak: peak hardware Tflop/sec (that will never be reached!)
- In the June 2012 Top 500 list ECMWF expect to have 2 Power7 clusters, EACH with ~24000 cores


Why is Matrix Multiply (DGEMM) so efficient?

VECTOR:
- VL is the vector register length
- a vector of VL x 1 results costs VL FMAs and (VL + 1) LDs
- FMAs ~= LDs

SCALAR / CACHE:
- keep an m x n block of results in registers, with (m * n) + (m + n) < # registers
- each update of the block costs m * n FMAs and only m + n LDs
- FMAs >> LDs
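
For the scalar/cache case this corresponds to a register-blocked inner kernel. A minimal Fortran sketch (illustrative only, not the actual library DGEMM; it assumes n is a multiple of the block size):

  ! Register blocking in matrix multiply: an MB x NB block of C is updated
  ! with MB*NB fused multiply-adds per K step, but only MB+NB loads.
  subroutine block_gemm(a, b, c, n)
    implicit none
    integer, intent(in)    :: n
    real(8), intent(in)    :: a(n,n), b(n,n)
    real(8), intent(inout) :: c(n,n)
    integer, parameter :: mb = 4, nb = 4      ! (mb*nb)+(mb+nb) must fit in the registers
    real(8) :: cblk(mb,nb), acol(mb), brow(nb)
    integer :: i, j, k, ii, jj

    do j = 1, n, nb
      do i = 1, n, mb
        cblk = c(i:i+mb-1, j:j+nb-1)          ! load the C block once
        do k = 1, n
          acol = a(i:i+mb-1, k)               ! mb loads from A
          brow = b(k, j:j+nb-1)               ! nb loads from B
          do jj = 1, nb
            do ii = 1, mb
              cblk(ii,jj) = cblk(ii,jj) + acol(ii)*brow(jj)   ! mb*nb FMAs
            end do
          end do
        end do
        c(i:i+mb-1, j:j+nb-1) = cblk          ! store the C block once
      end do
    end do
  end subroutine block_gemm

With mb = nb = 4 there are 16 FMAs for 8 loads per K step; larger blocks push the ratio further, up to the register limit.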

GPU programming (NVIDIA Tesla C1060 GPU)

- GPU = Graphics Processing Unit
- programmed using CUDA, OpenCL
- high performance, low power, but challenging to programme for large applications: separate memory, GPU/CPU interface
- expect GPU technology to be more easily useable on future HPCs
- http://gpgpu.org/developer
- see the GPU talks from the ECMWF HPC workshop (final slide):
  - Mark Govett (NOAA Earth System Research Laboratory): Using GPUs to run weather prediction models
  - Tom Henderson (NOAA Earth System Research Laboratory): Progress on the GPU parallelization and optimization of the NIM global weather model
  - Dave Norton (The Portland Group): Accelerating weather models with GPGPU's

Key Architectural Features of a Supercomputer
- CPU performance
- Interconnect latency / bandwidth
- Memory latency / bandwidth
- Parallel file-system performance
... a balancing act to achieve good sustained performance


What performance do Meteorological Applications achieve?

Vector computers
- about 20 to 30 percent of peak performance (single node)
- relatively more expensive
- also have front-end scalar nodes (compiling, post-processing)

Scalar computers
- about 5 to 10 percent of peak performance
- relatively less expensive

Both vector and scalar computers are being used in Met NWP centres around the world.

Is it harder to parallelize than vectorize?
- Vectorization is mainly a compiler responsibility
- Parallelization is mainly the user's responsibility

Challenges in parallel computing

Parallel computers
- have ever increasing processors, memory and performance, but
- need more space (new computer halls = $)
- need more power (MWs = $)

Parallel computers require/produce a lot of data (I/O)
- require parallel file systems (GPFS, Lustre) + archive store

Applications need to scale to increasing numbers of processors; problem areas are
- load imbalance, serial sections, global communications

Debugging parallel applications (totalview, ddt)

We are going to be using more processors in the future!
- more cores per socket, little/no clock speed improvements

Parallel Programming Languages?

OpenMP
- directive based
- support for Fortran 90/95/2003 and C/C++
- shared memory programming only
- http://www.openmp.org

PGAS Languages (Partitioned Global Address Space)
- UPC, CAF, Titanium, Co-array Fortran (F2008)
- one programming model for inter and intra node parallelism

MPI is not a programming language!
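
To make "directive based" concrete, here is a minimal OpenMP example in Fortran (an illustrative sketch, not taken from the IFS), parallelising the same loop used earlier to introduce scalar and vector processors; compile with the compiler's OpenMP option:

  program omp_example
    use omp_lib                        ! OpenMP run-time library routines
    implicit none
    integer, parameter :: n = 1000
    real :: a(n), b(n), c
    integer :: j

    b = 1.0
    c = 2.0

    ! The directive asks the compiler to share the loop iterations
    ! among the threads of the team.
  !$OMP PARALLEL DO PRIVATE(j) SHARED(a, b, c)
    do j = 1, n
       a(j) = b(j) + c
    end do
  !$OMP END PARALLEL DO

    print *, 'threads available:', omp_get_max_threads(), ' a(1) =', a(1)
  end program omp_example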

Most Parallel Programmers use ...

Fortran 90/95/2003, C/C++ with MPI for communicating between tasks (processes)
- works for applications running on shared and distributed memory systems

Fortran 90/95/2003, C/C++ with OpenMP
- for applications that need performance that is satisfied by a single node (shared memory)

Hybrid combination of MPI/OpenMP
- ECMWF's IFS uses this approach
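
As an illustration of the hybrid approach, a minimal MPI + OpenMP skeleton in Fortran (a sketch of the general pattern, not IFS code): each MPI task would work on its own part of the data, and OpenMP threads share the work inside each task.

  program hybrid_skeleton
    use mpi
    use omp_lib
    implicit none
    integer :: ierr, rank, ntasks, provided

    ! Ask for an MPI library that tolerates threaded callers.
    call mpi_init_thread(MPI_THREAD_FUNNELED, provided, ierr)
    call mpi_comm_rank(MPI_COMM_WORLD, rank, ierr)
    call mpi_comm_size(MPI_COMM_WORLD, ntasks, ierr)

    ! Each task's subdomain would be processed here, with OpenMP
    ! parallel regions sharing the loops among the task's threads.
  !$OMP PARALLEL
    if (omp_get_thread_num() == 0) then
       print *, 'task', rank, 'of', ntasks, 'is running', omp_get_num_threads(), 'threads'
    end if
  !$OMP END PARALLEL

    call mpi_finalize(ierr)
  end program hybrid_skeleton

Launched with, say, 192 MPI tasks of 4 threads each, it corresponds to 768 user threads in the terminology used earlier.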


More Parallel Computing (Terminology)
- Cache, cache line
- Domain decomposition
- Halo, halo exchange (a minimal example follows below)
- Load imbalance
- Synchronization
- Barrier
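
To give a flavour of what a halo exchange involves, a 1D sketch in Fortran with MPI (illustrative only, not IFS code): each task owns n interior points plus one halo point at each end, which it refreshes from its neighbours.

  subroutine halo_exchange(u, n, left, right, comm)
    use mpi
    implicit none
    integer, intent(in)    :: n, left, right, comm   ! neighbour ranks (MPI_PROC_NULL at the ends)
    real(8), intent(inout) :: u(0:n+1)               ! u(0) and u(n+1) are the halo points
    integer :: ierr, status(MPI_STATUS_SIZE)

    ! Send my first interior point to the left neighbour,
    ! receive my right halo from the right neighbour.
    call mpi_sendrecv(u(1),   1, MPI_DOUBLE_PRECISION, left,  0, &
                      u(n+1), 1, MPI_DOUBLE_PRECISION, right, 0, &
                      comm, status, ierr)
    ! Send my last interior point to the right neighbour,
    ! receive my left halo from the left neighbour.
    call mpi_sendrecv(u(n),   1, MPI_DOUBLE_PRECISION, right, 1, &
                      u(0),   1, MPI_DOUBLE_PRECISION, left,  1, &
                      comm, status, ierr)
  end subroutine halo_exchange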


Cache

[Diagram: processors P, each with its own cache C (a private C1 per processor and a shared C2) between the processors and the memory. P = Processor, C = Cache, M = Memory.]


IBM Power architecture (3 levels of $)

[Diagram: eight processors P, each with its own C1 cache; pairs of processors share one of four C2 caches; all share a single C3 cache in front of the Memory.]


Cache on scalar systems
- Processors are 100s of cycles away from memory
- Cache is a small (and fast) memory closer to the processor
- Cache line typically 128 bytes
- Good for cache performance:
  - single stride (stride-1) access is always the best
  - run the inner loop over the leftmost index (Fortran); Fortran arrays are stored column-major, so the leftmost index is contiguous in memory

BETTER:
  DO J=1,N
    DO I=1,M
      A(I,J) = . . .
    ENDDO
  ENDDO

WORSE:
  DO J=1,N
    DO I=1,M
      A(J,I) = . . .
    ENDDO
  ENDDO


IFS Grid-Point Calculations (an example of blocking for cache)

[Diagram: the grid-point space of NLAT x NLON points (NGPTOT in total), with "Scalar" and "Vector" block shapes marked.]

U(NGPTOT,NLEV), where NGPTOT = NLAT * NLON and NLEV = vertical levels

  DO J=1, NGPTOT, NPROMA    ! lots of work, independent for each J
    CALL GP_CALCS
  ENDDO

  SUBROUTINE GP_CALCS
    DO I=1,NPROMA
      ...
    ENDDO
  END
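
A self-contained sketch of this blocking pattern (illustrative only; the variable names follow the slide and the real grid-point calculations are replaced by a trivial update):

  program nproma_blocking
    implicit none
    integer, parameter :: nlat = 10, nlon = 20, nlev = 5
    integer, parameter :: ngptot = nlat*nlon
    integer, parameter :: nproma = 24                ! grid-space blocking factor
    real(8) :: u(ngptot, nlev)
    integer :: j, jend

    u = 0.0d0
    do j = 1, ngptot, nproma                         ! independent blocks of grid points
       jend = min(j+nproma-1, ngptot)
       call gp_calcs(u(j:jend,:), jend-j+1, nlev)    ! all the work for one block
    end do
    print *, 'sum(u) =', sum(u)

  contains

    subroutine gp_calcs(pu, kpts, klev)
      integer, intent(in)    :: kpts, klev
      real(8), intent(inout) :: pu(kpts, klev)
      integer :: i, k
      do k = 1, klev
         do i = 1, kpts                              ! inner loop over the block (1..NPROMA)
            pu(i,k) = pu(i,k) + 1.0d0                ! placeholder for the real calculations
         end do
      end do
    end subroutine gp_calcs

  end program nproma_blocking

A small NPROMA keeps each block in cache on a scalar machine; a vector machine would prefer a much larger NPROMA.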

Grid point space blocking for Cache (Power5)

[Figure: RAPS9 FC T799L91, 192 tasks x 4 threads; run time in SECONDS (about 200-550) versus the grid space blocking factor NPROMA (1 to 1000, logarithmic axis). The best NPROMA balances optimal use of cache against subroutine call overhead.]

T799 FC 192x4 (10 runs)

[Figure: run time in SECONDS (about 226-246) versus NPROMA (20 to 60) for 10 individual runs.]

T799 FC 192x4 (average)

[Figure: average PERCENT WALLTIME (0-7%) versus NPROMA (20 to 60).]

TL799 1024 tasks 2D partitioning

- 2D partitioning results in a non-optimal semi-Lagrangian comms requirement at the poles and equator!
- Square shaped partitions are better than rectangular shaped partitions.

[Diagram: an MPI task partition with the departure point, mid-point and arrival point of a semi-Lagrangian trajectory marked.]

eq_regions algorithm


2D partitioning T799 1024 tasks (NS=32 x EW=32)


eq_regions partitioning T799 1024 tasks

N_REGIONS( 1)=  1    N_REGIONS( 2)=  7    N_REGIONS( 3)= 13    N_REGIONS( 4)= 19
N_REGIONS( 5)= 25    N_REGIONS( 6)= 31    N_REGIONS( 7)= 35    N_REGIONS( 8)= 41
N_REGIONS( 9)= 45    N_REGIONS(10)= 48    N_REGIONS(11)= 52    N_REGIONS(12)= 54
N_REGIONS(13)= 56    N_REGIONS(14)= 56    N_REGIONS(15)= 58    N_REGIONS(16)= 56
N_REGIONS(17)= 56    N_REGIONS(18)= 54    N_REGIONS(19)= 52    N_REGIONS(20)= 48
N_REGIONS(21)= 45    N_REGIONS(22)= 41    N_REGIONS(23)= 35    N_REGIONS(24)= 31
N_REGIONS(25)= 25    N_REGIONS(26)= 19    N_REGIONS(27)= 13    N_REGIONS(28)=  7
N_REGIONS(29)=  1
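
The band counts are symmetric from pole to pole and sum to 1 + 7 + 13 + ... + 13 + 7 + 1 = 1024, i.e. one region per MPI task.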


IFS physics computational imbalance (T799L91, 384 tasks): ~11% imbalance in physics, ~5% imbalance (total)


http://en.wikipedia.org/wiki/Parallel_computing


http://www.ecmwf.int/newsevents/meetings/workshops/2010/high_performance_computing_14th/index.html - a useful source of info

ECMWF HPC in Meteorology workshop
- every 2 years, early Nov, most recently Nov 2010
- use the above link to get access to all talks
- HPC vendors - CRAY, IBM, INTEL, FUJITSU, SGI, BULL
- NWP centre talks
- talks on early experience with GPUs