Parallel Processing

Kudang B. Seminar

The Need for High-Performance Computers


Weather forecasting
Artificial intelligence: robotics
Genetic engineering

The example applications above involve intensive computation and require

Example 1: Weather Prediction

Area, segments
3000*3000*11 cubic miles
0.1*0.1*0.1 cubic mile per segment: ~10^11 segments

Two day prediction

half hour periods: ~ 100 periods

Computation per segment

Temp, Pressure, Humidity, Wind speed, Wind
Assume ~ 100 FLOPs

Performance: Weather
Computational requirement: 10^15 FLOPs
Serial supercomputer: 10^9 instr/sec
Total serial time: 10^6 sec = ~280 hours
Not too good for 48 hour weather
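
The serial estimate above can be reproduced with a few lines of arithmetic. This is a rough sketch using only the figures on the slides; the constant names are illustrative, and one FLOP is treated as one instruction, as the slides implicitly do.

```python
# Back-of-the-envelope serial estimate for the weather example.
SEGMENTS = 10**11          # ~ (3000*3000*11) / (0.1*0.1*0.1), rounded
PERIODS = 100              # half-hour steps over two days, rounded up
FLOPS_PER_SEGMENT = 100    # assumed cost of updating one segment
SERIAL_RATE = 10**9        # instructions per second

total_ops = SEGMENTS * PERIODS * FLOPS_PER_SEGMENT
serial_seconds = total_ops // SERIAL_RATE
serial_hours = serial_seconds / 3600

print(f"total operations: {total_ops:.0e}")
print(f"serial time: {serial_seconds:.0e} s, about {serial_hours:.0f} hours")
```

The result is far longer than the 48 hours being forecast, which is the point of the slide.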

Parallel Weather Prediction

1 K workstations, grid connected

10^8 segment computations per processor

10^8 instructions per second
100 instructions per segment computation
100 time steps: 10^4 seconds = ~3 hours
Much more acceptable
Assumption: Communication not a problem here

More workstations:
finer grid
better accuracy
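
The parallel estimate, and the remark about finer grids, can be sketched the same way. The function name and its parameters are illustrative. Communication cost is ignored, as the slides assume, and the finer-grid case assumes that halving the grid spacing in all three dimensions multiplies the segment count by 8 without changing the number of time steps.

```python
def parallel_seconds(segments, workers, instr_per_segment=100,
                     steps=100, rate=10**8):
    """Run time per workstation, ignoring communication."""
    segments_per_worker = segments // workers
    return segments_per_worker * instr_per_segment * steps / rate

# 1 K workstations on the original grid.
base = parallel_seconds(10**11, 1_000)

# 8x the segments (finer grid) on 8x the workstations.
finer = parallel_seconds(8 * 10**11, 8_000)

print(f"original grid: {base:.0f} s, about {base / 3600:.1f} hours")
print(f"finer grid:    {finer:.0f} s")
```

Under these assumptions the extra workstations buy a finer grid at the same run time rather than a shorter run time.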

Example 2: N body problem

Astronomy: bodies in space
Attract each other: Newton's gravitational force
O(n*n) calculations per snapshot
Galaxy: ~10^11 bodies -> ~10^22 calculations
Calculation: 1 microsecond
Snapshot: 10^16 secs = ~10^11 days = ~3*10^8 years
Is parallelism going to help us? NO
What does help? Better algorithm: Barnes Hut
Divides the space into a quadtree
Treats far-away quads as one body
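
A short calculation shows why more processors alone do not rescue the direct method. The sketch below uses the slide figures for the O(n*n) cost. For the tree method it assumes the commonly quoted O(n log n) cost of Barnes-Hut with a constant factor of 1, which is an idealization and not a figure from the slides. Function names are illustrative.

```python
import math

SECONDS_PER_DAY = 86_400
SECONDS_PER_YEAR = 365 * SECONDS_PER_DAY

def direct_seconds(n, seconds_per_calc=1e-6, processors=1):
    """All-pairs cost: n*n calculations, perfectly divided among processors."""
    return n * n * seconds_per_calc / processors

def tree_seconds(n, seconds_per_calc=1e-6):
    """Assumed n*log2(n) cost for a tree-based method, constant factor 1."""
    return n * math.log2(n) * seconds_per_calc

n = 10**11
serial_years = direct_seconds(n) / SECONDS_PER_YEAR
million_cpu_years = direct_seconds(n, processors=10**6) / SECONDS_PER_YEAR
tree_days = tree_seconds(n) / SECONDS_PER_DAY

print(f"direct, 1 processor:       {serial_years:.1e} years")
print(f"direct, 10^6 processors:   {million_cpu_years:.0f} years")
print(f"tree method, 1 processor:  {tree_days:.0f} days")
```

Even a million perfectly efficient processors leave the direct method at centuries per snapshot, while the algorithmic change brings a single processor down to weeks.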

Other Challenging Applications
Satellite data acquisition: billions of bits / sec
Satellite data processing
Pollution levels, Remote sensing of materials
Image recognition

Discrete optimization problems

Planning, Scheduling, VLSI design

Material modeling
Nuclear weapons modeling (ASCI)
Airplane/Satellite/Vehicle design

Application Specific
Mapping an algorithm directly onto hardware

ASICs: Application Specific Integrated Circuits

Levels of specificity
Full custom ASICs
Standard cell ASICs
Field programmable gate arrays
Computational models
Dataflow graphs
Systolic arrays
Orders of magnitude better performance
Orders of magnitude lower power

ASICs (cont.)
How much faster than general purpose?
Example: 1D 1024-point FFT
General purpose machine (G4): 25 microseconds
ASIC device (MIT Lincoln Labs): 32 nanoseconds
ASIC device uses 20 milliwatts (100x less power)
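
The FFT figures can be turned into a speedup and an energy-per-transform comparison. This is a sketch: the general purpose power of 2 W is not stated on the slide but inferred from "100x less power" than 20 mW, and the variable names are illustrative.

```python
# Times in nanoseconds, power in milliwatts; mW * ns gives picojoules.
GENERAL_TIME_NS = 25_000      # 25 microseconds
ASIC_TIME_NS = 32
ASIC_POWER_MW = 20
GENERAL_POWER_MW = 100 * ASIC_POWER_MW   # inferred from "100x less power"

speedup = GENERAL_TIME_NS / ASIC_TIME_NS

asic_energy_pj = ASIC_POWER_MW * ASIC_TIME_NS
general_energy_pj = GENERAL_POWER_MW * GENERAL_TIME_NS
energy_ratio = general_energy_pj / asic_energy_pj

print(f"speedup: {speedup:.0f}x")
print(f"energy per FFT: {energy_ratio:.0f}x lower on the ASIC")
```

Because the ASIC is both faster and lower power, the energy advantage per transform is the product of the two ratios.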

Future designs:

2 tera-ops in a small (< 1 cubic ft) device

Target applications
Finite Impulse Response (FIR) Filters
Matrix multiply
QR decomposition

Real Examples
A 24-hour weather forecast for the UK involves about 10^12
operations. This takes about 2.7 hours
on a Cray-1 (capable of 10^8 operations per second).

How many operations are needed for weekly,
monthly, and yearly forecasts?
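
One way to approach the question is to assume that the work grows linearly with the forecast horizon, at the same resolution. That linear scaling is an assumption made here for illustration, not something the slides state; the names below are also illustrative.

```python
OPS_PER_DAY_FORECAST = 10**12   # from the slide: 24-hour UK forecast
CRAY1_OPS_PER_SEC = 10**8       # from the slide

def forecast_cost(days):
    """Operations and Cray-1 hours, assuming cost is linear in the horizon."""
    ops = days * OPS_PER_DAY_FORECAST
    hours = ops / CRAY1_OPS_PER_SEC / 3600
    return ops, hours

for label, days in [("daily", 1), ("weekly", 7), ("monthly", 30), ("yearly", 365)]:
    ops, hours = forecast_cost(days)
    print(f"{label:8s} {ops:.1e} operations  {hours:8.1f} hours")
```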

According to Einstein, the speed of light is 3 x 10^8 m/s. Consider two
electronic devices, each capable of
10^12 operations per second, separated by a distance of 0.5
mm. In this case a signal takes longer
to travel between the two devices
than either device takes
to execute one operation (10^-12 seconds).

So the limiting factor
is the speed of light.
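
The comparison can be checked numerically. The sketch below uses the figures from the slide; the constant names are illustrative.

```python
SPEED_OF_LIGHT = 3e8     # m/s
DISTANCE = 0.5e-3        # m, separation between the two devices
OP_TIME = 1e-12          # s, one operation at 10^12 operations per second

# Best-case signal travel time between the devices.
travel_time = DISTANCE / SPEED_OF_LIGHT

# Farthest a signal can travel during one operation, in millimetres.
max_distance_mm = SPEED_OF_LIGHT * OP_TIME * 1000

print(f"signal travel time: {travel_time:.2e} s")
print(f"operation time:     {OP_TIME:.2e} s")
print(f"distance covered in one operation: {max_distance_mm:.1f} mm")
```

The signal needs roughly 1.7 times as long as one operation, so the devices would have to sit within about 0.3 mm of each other for communication to keep pace.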

SOLUTION: make use of

Motivation of Parallel
Parallel Computing is cost effective

Off the shelf, commodity processors are very fast

Memory is very cheap
Building a processor that is a small factor faster
costs an order of magnitude more
NoW is the time!
Cheapest way to get more performance: multiprocessor
NoW: Networks of workstations
Workstation can be an SMP
SMP: Symmetric Multi Processor
Shared memory

Wile E. Coyote's Parallel


Get a lot of the fastest processors

Get a lot of memory per processor
Get the fastest network
Hook it all together
And then what ???

Now you need to program

Parallel programming introduces:

Task partitioning, task scheduling

Data partitioning
Load balancing
Latency issues

Problem with Wile E. Coyote

Von Neumann machines are not built for parallelism
To get high speed, processors have lots of state
Cache, stack, global memory

To tolerate latency, we need fast context switch. WHY?

No free lunch: can't have both
Certainly not if the processor was not designed for both

Memory wall: memory gets slower and slower

in terms of number of cycles it takes to access

Memory hierarchy gets more and more complex

Memory accesses block
No split phase memory access

Sequential vs Parallel
Efficient Parallel Algorithms

Maximize parallelism
Minimize synchronization, remote accesses
Efficiency is Architecture Dependent

Efficient Sequential Algorithms

Minimize time, space
Efficiency is portable
Efficient C program on Pentium ~ Efficient C program on

Ideal: n processors give n-fold speedup

Ideal not always possible. WHY?

Tasks are data dependent
Not all processors are always busy
Remote data

Super linear speedup: >n speedup

Nonsense! Because we can execute the faster
parallel program sequentially
No nonsense!! Because parallel computers do not
just have more processors, they have more caches
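
Speedup and efficiency can be written down directly. The sketch below uses the usual definitions (speedup = serial time / parallel time, efficiency = speedup / processors). The timings are hypothetical values chosen for illustration, not measurements.

```python
def speedup(t_serial, t_parallel):
    """How many times faster the parallel run is."""
    return t_serial / t_parallel

def efficiency(t_serial, t_parallel, processors):
    """Fraction of the ideal n-fold speedup actually achieved."""
    return speedup(t_serial, t_parallel) / processors

# Hypothetical timings in seconds on 8 processors.
typical = speedup(120.0, 20.0)        # below the ideal of 8
typical_eff = efficiency(120.0, 20.0, 8)
superlinear = speedup(120.0, 12.0)    # above 8, e.g. data now fits in the caches

print(typical, typical_eff, superlinear)
```

An efficiency below 1 reflects idle processors, data dependences and remote data; a speedup above n is the super linear case discussed above.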

Parallel Programming
Parallel Programming Paradigms
Super compilers
20 years of parallelizing compilers and what do we get?
..not much: we understand loops (a bit)

Threads
Pthreads, Solaris threads: not much difference
Message Passing
MPI rules, ..well, there is PVM (parallel virtual machine)
Data parallel programming
Niche work, but important

Implicit vs Explicit Parallelism

Implicit: super compilers
Extract parallelism from sequential program
The general case is too hard
pointers, aliases, recursion, separate compilation
dynamic dependence distances in array references

Explicit Parallelism: threads or messages

Complicates programming
creation, allocation, scheduling of processes
data partitioning
Synchronization ( locks, messages )
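
The extra work that explicit parallelism puts on the programmer shows up even in a tiny example. The sketch below uses Python's standard `threading` module to sum a list: the data is partitioned by hand, threads are created and joined explicitly, and a lock protects the shared total. The function name and chunking scheme are illustrative. In CPython the global interpreter lock prevents a real speedup for this kind of CPU-bound work, so the example shows structure only.

```python
import threading

def parallel_sum(data, n_threads=4):
    total = 0
    lock = threading.Lock()
    # Data partitioning: split into n_threads nearly equal chunks.
    chunk = (len(data) + n_threads - 1) // n_threads

    def worker(part):
        nonlocal total
        local = sum(part)        # independent work, no sharing
        with lock:               # synchronization on the shared result
            total += local

    # Process creation and scheduling are the programmer's job.
    threads = [
        threading.Thread(target=worker, args=(data[i * chunk:(i + 1) * chunk],))
        for i in range(n_threads)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return total

print(parallel_sum(list(range(1, 101))))
```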

Sequential Processing &


3 x more

Machine Classification
Models of Computation (Flynn, 1966)
1. Single Instruction Stream, Single Data Stream : SISD.
2. Multiple Instruction Stream, Single Data Stream : MISD.
3. Single Instruction Stream, Multiple Data Stream : SIMD.
4. Multiple Instruction Stream, Multiple Data Stream : MIMD.
5. Single Program Multiple Data: SPMD.

SISD Computers

The operation a1 + a2 + a3 + ... + an
requires n memory accesses
by the processor and n-1
additions. So the time complexity
of the operation is O(n).
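
The counts above can be confirmed by instrumenting a sequential sum. This is a minimal sketch; the function name and the counters are illustrative, and the input is assumed to be non-empty.

```python
def sequential_sum(values):
    """Sum a non-empty sequence, counting memory reads and additions."""
    it = iter(values)
    total = next(it)      # first operand: one read, no addition yet
    reads, adds = 1, 0
    for v in it:
        reads += 1        # one memory access per remaining operand
        total += v
        adds += 1         # one addition per remaining operand
    return total, reads, adds

print(sequential_sum([1, 2, 3, 4, 5]))
```

For n = 5 this reports 5 reads and 4 additions, matching n and n-1.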

von Neumann Architecture


MISD Computers
N processors, each with its own control unit, share
a common memory (shared memory).

Parallelism is obtained by having all processors
carry out different operations/tasks simultaneously on
the same data.
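
The MISD idea can be imitated in software by running different operations on one shared data set at the same time. The sketch below uses `concurrent.futures.ThreadPoolExecutor` from the standard library. It is a software analogy only, and the data values and operation names are illustrative.

```python
from concurrent.futures import ThreadPoolExecutor

# One shared, read-only data set.
data = (4, 8, 15, 16, 23, 42)

# Each worker plays the role of a processor with its own instruction stream.
operations = {"sum": sum, "min": min, "max": max}

with ThreadPoolExecutor(max_workers=len(operations)) as pool:
    futures = {name: pool.submit(op, data) for name, op in operations.items()}
    results = {name: f.result() for name, f in futures.items()}

print(results)
```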

SIMD Computers

N processors operate under the control of a single

instruction stream issued by a central
control unit.

The processors operate synchronously and a

global clock is used to ensure lockstep operation.
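
Lockstep execution can be emulated sequentially to make the model concrete. In the sketch below a loop stands in for the central control unit, and each list element stands for the local data of one processor. The function name and the two-instruction program are illustrative.

```python
def simd_run(instructions, lanes):
    """Apply each instruction to every lane before issuing the next one."""
    for instr in instructions:                # control unit issues one instruction
        lanes = [instr(x) for x in lanes]     # all processors execute it in lockstep
    return lanes

# A tiny program: double every value, then add one.
program = [lambda x: x * 2, lambda x: x + 1]

print(simd_run(program, [1, 2, 3, 4]))
```

Every lane executes the same instruction at each step; only the data differs.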

MIMD Computers

Potential of the 4 classes


SPMD Computers

The same program is executed on the computer's processors

SPMD is not a hardware paradigm; it is the
software equivalent of SIMD, but is

Consider the instruction IF X = 0 THEN S1 ELSE S2

Assume X = 0 on processor P1 and X != 0 on
processor P2
Processor P1 executes S1 in parallel with processor P2
executing S2 (this cannot happen on SIMD)
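
The IF example can be sketched with two workers running the same program on different local data. Threads from the standard library stand in for the processors P1 and P2. The names `s1`, `s2`, `program` and the value 7 for X on P2 are illustrative.

```python
from concurrent.futures import ThreadPoolExecutor

def s1():
    return "S1"

def s2():
    return "S2"

def program(x):
    """The single program every processor runs: IF X = 0 THEN S1 ELSE S2."""
    return s1() if x == 0 else s2()

# Each processor holds its own value of X.
local_x = {"P1": 0, "P2": 7}

with ThreadPoolExecutor(max_workers=len(local_x)) as pool:
    futures = {p: pool.submit(program, x) for p, x in local_x.items()}
    executed = {p: f.result() for p, f in futures.items()}

print(executed)
```

Both workers run identical code, yet P1 takes the S1 branch while P2 takes S2, because each follows its own control flow.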

