Concepts of Parallel Computing
Outline
- Types of computer
- Parallel computers today
- Challenges in parallel computing
- Parallel programming languages: OpenMP and message passing
- Terminology
What is parallel computing?
The simultaneous use of more than one processor or computer to solve a problem.
Why use parallel computing?
- Serial computing is too slow
- Need for large amounts of memory not accessible by a single processor
An IFS T2047L149 forecast model takes about 5000 seconds wall time for a 10-day forecast using 128 nodes of an IBM Power6 cluster.
How long would this model take using a fast PC with sufficient memory (e.g. a dual-core Dell desktop)?
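A rough scaling argument (illustrative assumptions, not from the course): if each Power6 node has 32 cores, the forecast uses 128 * 32 = 4096 cores. Even with perfect parallel scaling, 2 cores would need about 5000 * 4096 / 2 = 10 million seconds, roughly 4 months for a 10-day forecast; in reality a desktop core is also slower than a Power6 core, so the true answer is even worse.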
Some Terminology
Hardware:
- CPU = Core = Processor = PE (Processing Element)
- Socket = a chip with 1 or more cores (typically 2 or 4 today); more correctly, the socket is what the chip fits into

Software:
- Process (Unix/Linux) = Task (IBM)
- MPI = Message Passing Interface, standard for programming processes (tasks) on systems with distributed memory
- OpenMP = standard for shared memory programming (threads)
- Thread = some code / unit of work that can be scheduled (threads are cheap to start/stop compared with processes/tasks)
- User threads = tasks * (threads per task)
Amdahl's Law:

Wall Time = S + P/NCPUS

where S is the serial time and P the parallelisable time. Fitting measured IFS wall times gives S = 114 secs and P = 1591806 secs (calculated using Excel's LINEST function).

Amdahl's Law (formal, named after Gene Amdahl): if F is the fraction of a calculation that is sequential, and (1-F) is the fraction that can be parallelised, then the maximum speedup that can be achieved by using N processors is 1/(F + (1-F)/N).
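A minimal Fortran sketch of these two formulas (S and P are the fitted values above; everything else is illustrative):

      ! Evaluate the fitted wall-time model S + P/NCPUS and Amdahl's
      ! maximum speedup 1/(F + (1-F)/N) for a range of processor counts.
      PROGRAM AMDAHL
        IMPLICIT NONE
        REAL :: S, P, F, WALL, SPEEDUP
        INTEGER :: N
        S = 114.0            ! fitted serial seconds
        P = 1591806.0        ! fitted parallelisable seconds
        F = S / (S + P)      ! sequential fraction of a 1-processor run
        DO N = 512, 4096, 512
           WALL    = S + P / REAL(N)
           SPEEDUP = 1.0 / (F + (1.0 - F) / REAL(N))
           PRINT '(I6, F10.1, F8.1)', N, WALL, SPEEDUP
        END DO
      END PROGRAM AMDAHL

Note how the serial 114 secs comes to dominate: even with perfect parallel scaling, the speedup can never exceed 1/F, here roughly 14000.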
[Table: IFS wall time against user threads (1024, 1536, 2048, 2560, 3072, 3584, 3840, 4096); the recorded times fall from 725.1 secs down through 619.7, 555.8 and 533.3 to 518.8 secs]
IFS world record set on 10 March 2012: a T2047L137 model ran on a CRAY XE6 (HECToR) using 53,248 cores (user threads).
[Graph: speedup vs. user threads for the T2047L149 and T1279L91 models, plotted against ideal linear speedup]
Measuring Performance
- Wall clock time
- Floating point operations per second (FLOPS or FLOP/S)
  - Peak (hardware), Sustained (application)
- SI prefixes:
  - Mega: Mflops = 10**6
  - Giga: Gflops = 10**9
  - Tera: Tflops = 10**12
  - Peta: Pflops = 10**15
- ECMWF: 2 * 156 Tflops peak (P6), 2008-2010 (early systems)
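Peak flops is simple arithmetic on the hardware specification: peak = cores * clock frequency * flops per cycle per core. As an illustrative (assumed) decomposition of the figure above, roughly 8300 Power6 cores * 4.7 GHz * 4 flops/cycle ~= 156 Tflops; sustained application performance is typically only a small fraction of this (see the scalar computers slide later).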
George's Antique Home PC (2005, 700, 6 Gflop/s peak, 2 GB memory, 250 GB disk)
Shared memory vs. distributed memory
- Shared memory: all processors access one common memory.
- Distributed memory: each node has its own memory (M); nodes communicate by passing messages over an interconnect.
[Diagrams: a shared-memory system, and distributed-memory nodes each with their own memory M]
ECMWF supercomputers
1979 onwards: CRAY 1A, CRAY XMP-2, CRAY XMP-4, CRAY YMP-8, CRAY C90-16 (all vector machines)
CRAY-1 (1979)
Types of Processor
[Diagram: the loop A(J) = B(J) + C executed on a SCALAR PROCESSOR (one element per instruction) and on a VECTOR PROCESSOR (many elements per instruction)]
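The loop in the diagram, written out in Fortran (C here is a scalar added to every element):

      DO J = 1, N
         A(J) = B(J) + C
      ENDDO

A scalar processor executes one iteration at a time; a vector processor loads, adds and stores VL (vector length) elements per instruction.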
[Table: scalar processor vendors and chips, e.g. Fujitsu (SPARC), SGI (Xeon), Sun (Opteron)]
Performance
The Top 500 list:
- Rmax: Tflop/sec achieved with the LINPACK benchmark
- Rpeak: peak hardware Tflop/sec (that will never be reached!)
- In the June 2012 Top 500 list ECMWF expect to have 2 Power7 clusters, EACH with ~24,000 cores
SCALAR / CACHE vs. VECTOR
[Diagram: register blocking on a scalar/cache processor. An m x n block is chosen so that (m * n) + (m + n) values fit in the registers; each pass then performs m * n FMAs for only m + n LDs (loads). On a vector processor (vector length VL), FMAs ~= LDs.]
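A hedged Fortran sketch of the scalar/cache idea with m = n = 2 (a hypothetical register-blocked matrix multiply, not IFS code): the four partial sums stay in registers across the inner loop, so each K iteration does m*n = 4 FMAs for m+n = 4 loads, instead of 1 FMA per 2 loads in the naive loop.

      ! Hypothetical 2 x 2 register-blocked matrix multiply:
      ! (m*n) + (m+n) = 8 values held in registers in the K loop.
      SUBROUTINE MXM22(A, B, C, N)
        IMPLICIT NONE
        INTEGER, INTENT(IN) :: N          ! assumed even here
        REAL, INTENT(IN)    :: A(N,N), B(N,N)
        REAL, INTENT(INOUT) :: C(N,N)
        INTEGER :: I, J, K
        REAL :: C11, C21, C12, C22, A1, A2, B1, B2
        DO J = 1, N, 2
           DO I = 1, N, 2
              C11 = C(I,J);   C21 = C(I+1,J)
              C12 = C(I,J+1); C22 = C(I+1,J+1)
              DO K = 1, N
                 A1 = A(I,K);  A2 = A(I+1,K)    ! m = 2 loads
                 B1 = B(K,J);  B2 = B(K,J+1)    ! n = 2 loads
                 C11 = C11 + A1*B1              ! m*n = 4 FMAs
                 C21 = C21 + A2*B1
                 C12 = C12 + A1*B2
                 C22 = C22 + A2*B2
              ENDDO
              C(I,J) = C11;   C(I+1,J) = C21
              C(I,J+1) = C12; C(I+1,J+1) = C22
           ENDDO
        ENDDO
      END SUBROUTINE MXM22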
GPU programming
GPU = Graphics Processing Unit
- Programmed using CUDA, OpenCL
- High performance, low power, but challenging to program for large applications: separate memory, GPU/CPU interface
- Expect GPU technology to be more easily usable on future HPCs
- http://gpgpu.org/developer
- See GPU talks from the ECMWF HPC workshop (final slide):
  - Mark Govett (NOAA Earth System Research Laboratory): Using GPUs to run weather prediction models
  - Tom Henderson (NOAA Earth System Research Laboratory): Progress on the GPU parallelization and optimization of the NIM global weather model
  - Dave Norton (The Portland Group): Accelerating weather models with GPGPUs
Key architectural features of a supercomputer:
- CPU performance
- Interconnect latency / bandwidth
- Memory latency / bandwidth
A balancing act to achieve good sustained performance.
Scalar computers
- About 5 to 10 percent of peak performance
- Relatively less expensive

Both vector and scalar computers are being used in Met NWP centres around the world.

Is it harder to parallelize than to vectorize?
- Vectorization is mainly a compiler responsibility
- Parallelization is mainly the user's responsibility
Parallel programming models:
- OpenMP: directive based, for shared memory parallelism (threads)
- Fortran 90/95/2003, C/C++ with MPI for communicating between tasks (processes)
  - works for applications running on shared and distributed memory systems
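A minimal hybrid sketch (an assumed example, not IFS source) showing both models at once, and where the "user threads = tasks * (threads per task)" count comes from:

      ! Each MPI task reports its rank and its OpenMP thread count.
      PROGRAM HELLO_HYBRID
        USE MPI
        USE OMP_LIB
        IMPLICIT NONE
        INTEGER :: IERR, RANK, NTASKS, NTHREADS
        CALL MPI_INIT(IERR)
        CALL MPI_COMM_RANK(MPI_COMM_WORLD, RANK, IERR)
        CALL MPI_COMM_SIZE(MPI_COMM_WORLD, NTASKS, IERR)
      !$OMP PARALLEL
      !$OMP SINGLE
        NTHREADS = OMP_GET_NUM_THREADS()
        PRINT *, 'task', RANK, 'of', NTASKS, 'runs', NTHREADS, 'threads'
      !$OMP END SINGLE
      !$OMP END PARALLEL
        CALL MPI_FINALIZE(IERR)
      END PROGRAM HELLO_HYBRID

Run with, say, 4 tasks of 8 threads each and the user thread count is 32.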
More parallel computing terminology:
- Cache, cache line
- Domain decomposition
- Halo, halo exchange (see the sketch below)
- Load imbalance
- Synchronization, barrier
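Halo exchange deserves a concrete picture. A minimal sketch (assuming a 1-D domain decomposition; LEFT and RIGHT are neighbour ranks, or MPI_PROC_NULL at the domain edges):

      ! Exchange one halo point with each neighbour using MPI_SENDRECV.
      SUBROUTINE HALO_EXCHANGE(U, NLOC, LEFT, RIGHT, COMM)
        USE MPI
        IMPLICIT NONE
        INTEGER, INTENT(IN) :: NLOC, LEFT, RIGHT, COMM
        REAL, INTENT(INOUT) :: U(0:NLOC+1)   ! U(0), U(NLOC+1) are halos
        INTEGER :: IERR, STAT(MPI_STATUS_SIZE)
        ! Send rightmost interior point right, receive left halo from left.
        CALL MPI_SENDRECV(U(NLOC), 1, MPI_REAL, RIGHT, 0, &
                          U(0),    1, MPI_REAL, LEFT,  0, COMM, STAT, IERR)
        ! Send leftmost interior point left, receive right halo from right.
        CALL MPI_SENDRECV(U(1),      1, MPI_REAL, LEFT,  1, &
                          U(NLOC+1), 1, MPI_REAL, RIGHT, 1, COMM, STAT, IERR)
      END SUBROUTINE HALO_EXCHANGE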
Cache
[Diagrams: a processor (P) with one cache (C); a processor with separate level 1 (C1) and level 2 (C2) caches]
[Diagram: eight processors (P), each with a private level 1 cache (C1), sharing level 2 caches (C2), a level 3 cache (C3) and main memory]
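Why cache lines matter, in one hedged Fortran illustration: arrays are stored column-major, so the inner loop should run over the first index to walk memory with stride 1 and use every element of each cache line that is fetched.

      DO J = 1, N        ! columns in the outer loop
         DO I = 1, N     ! rows innermost: contiguous in memory
            A(I,J) = B(I,J) + C(I,J)
         ENDDO
      ENDDO

Swapping the two loops makes every access jump N elements through memory, fetching a whole cache line to use a single value.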
Grid-point calculations are blocked over NPROMA:

      DO J = 1, NGPTOT, NPROMA
         CALL GP_CALCS
      ENDDO

      SUB GP_CALCS
         DO I = 1, NPROMA
            ...
         ENDDO
      END

Lots of work, and independent for each J. [Diagram: the grid (NLAT latitude rows) cut into NPROMA-length blocks; scalar/cache machines favour small NPROMA, vector machines favour large NPROMA]
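A hedged sketch of how such a driver loop is typically threaded (GP_DRIVER, FIELDS and the GP_CALCS interface are illustrative names, not IFS source): since each J block is independent, OpenMP can share the blocks among threads.

      SUBROUTINE GP_DRIVER(FIELDS, NGPTOT, NPROMA)
        IMPLICIT NONE
        INTEGER, INTENT(IN) :: NGPTOT, NPROMA
        REAL, INTENT(INOUT) :: FIELDS(NGPTOT)
        INTEGER :: J, JLEN
      !$OMP PARALLEL DO PRIVATE(J, JLEN)
        DO J = 1, NGPTOT, NPROMA
           JLEN = MIN(NPROMA, NGPTOT - J + 1)   ! last block may be short
           CALL GP_CALCS(FIELDS(J:J+JLEN-1), JLEN)
        ENDDO
      !$OMP END PARALLEL DO
      END SUBROUTINE GP_DRIVER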
[Graphs: wall time (SECONDS, log scale 10 to 1000) and relative overhead (0% to 5%) as functions of NPROMA, for NPROMA values from 20 to 60]
[Diagram: a trajectory with its departure point, mid-point and arrival point (x)]
eq_regions algorithm: partitions the sphere into latitude bands, each divided into equal-area regions, giving a well-balanced domain decomposition. The number of regions per band:
N_REGIONS( 7)=35   N_REGIONS( 8)=41   N_REGIONS( 9)=45   N_REGIONS(10)=48
N_REGIONS(11)=52   N_REGIONS(12)=54   N_REGIONS(13)=56   N_REGIONS(14)=56
N_REGIONS(15)=58   N_REGIONS(16)=56   N_REGIONS(17)=56   N_REGIONS(18)=54
N_REGIONS(19)=52   N_REGIONS(20)=48   N_REGIONS(21)=45   N_REGIONS(22)=41
N_REGIONS(23)=35   N_REGIONS(24)=31   N_REGIONS(25)=25   N_REGIONS(26)=19
N_REGIONS(27)=13   N_REGIONS(28)= 7   N_REGIONS(29)= 1
IFS physics computational imbalance (T799L91, 384 tasks): ~11% imbalance in physics, ~5% imbalance in total.
http://en.wikipedia.org/wiki/Parallel_computing
ECMWF HPC in Meteorology workshop
- Every 2 years, early November; most recently November 2010
- Use the above link to get access to all talks
- HPC vendors: CRAY, IBM, INTEL, FUJITSU, SGI, BULL
- NWP centre talks
- Talks on early experience with GPUs