
High Performance Computing
LECTURE #2

1
Agenda
o What is parallel computing?
o Why Parallel Computers?
o Motivation
o Inevitability of parallel computing
o Application demands
o Technology and architecture trends
o Terminologies
o How is parallelism expressed in a program
o Challenges
2
What is parallel computing?
Multiple processors cooperating concurrently to solve one problem.

3
What is parallel computing?

“A parallel computer is a collection of processing elements that can communicate and cooperate to solve large problems fast”
Almasi/Gottlieb

“communicate and cooperate”
• Nodes and interconnect architecture
• Problem partitioning (coordination of events in a process)

“large problems fast”
• Programming model
• Match of model and architecture

4
What is parallel computing?

Some broad issues:

• Resource Allocation:
– How large a collection?
– How powerful are the elements?

• Data access, Communication and Synchronization
– How are data transmitted between processors?
– How do the elements cooperate and communicate?
– What are the abstractions and primitives for cooperation?

• Performance and Scalability
– How does it all translate into performance?
– How does it scale? (A service is said to be scalable when adding processing elements yields a proportional gain in performance.)
5
Why Parallel Computer?
• Tremendous advances in microprocessor technology,
e.g., clock rates of processors increased from 40 MHz (MIPS R3000, 1988)
to 2.0 GHz (Pentium 4, 2002)
to, nowadays, 8.429 GHz (AMD's Bulldozer-based FX chips, 2012)

• Processors are now capable of executing multiple instructions in the same cycle

◦ The instruction cycle is the fundamental sequence of steps that a CPU performs. Also known as the "fetch-execute cycle," it is the time in which a single instruction is fetched from memory, decoded and executed.

◦ The first half of the cycle transfers the instruction from memory to the instruction register and decodes it; the second half executes the instruction.
6
Why parallel computing?

• The ability of the memory system to feed data to the processor at the required rate has increased

• In addition, significant innovations in architecture and software have addressed the bottlenecks posed by the data path and memory

◦ Hence, a multiplicity of data paths is used to increase access to storage elements (memory & disk)
7
Motivation
• Sequential architectures are reaching their physical limitations.

• Uniprocessor architectures will not be able to sustain the current rate of performance increases in the future.

• Computation requirements are ever increasing:
-- visualization, distributed databases,
-- simulations, scientific prediction (earthquakes), etc.

• Accelerating applications.

8
Inevitability of parallel computing
Application demand for performance
• Scientific: weather forecasting, pharmaceutical design, genomics
• Commercial: OLTP, search engine, decision support, data mining
• Scalable web servers

Technology and architecture trends
• Limits to sequential CPU, memory, and storage performance
• Parallelism is an effective way of utilizing the growing number of transistors
• Low incremental cost of supporting parallelism

9
Application Demand: Inevitability of Parallel Computing

Engineering
• Earthquake and structural modeling
• Design and simulation of micro- and nano-scale systems

Computational Sciences
• Bioinformatics: functional and structural characterization of genes and proteins
• Astrophysics: exploring the evolution of galaxies
• Weather modeling, flood/tornado prediction

Commercial
• Data mining and analysis for optimizing business and marketing decisions
• Database and Web servers for online transaction processing

Computers
• Embedded systems increasingly rely on distributed control algorithms
• Network intrusion detection, cryptography, etc.
• Optimizing performance of modern automobiles
• Networks, mail servers, search engines, …
• Visualization architectures & entertainment
• Simulation

Traditional scientific and engineering paradigm:
 1) Do theory or paper design.
 2) Perform experiments or build system.
Limitations:
 – Too difficult -- build large wind tunnels.
 – Too expensive -- build a throw-away passenger jet.
 – Too slow -- wait for climate or galactic evolution.
 – Too dangerous -- weapons, drug design, climate experimentation.
10
Technology and architecture

1- Processor Capacity

11
2- Transistor Count

40% more functions can be performed by a CPU per year

Fundamentally, the use of more transistors improves performance in two ways:
◦ Parallelism: multiple operations done at once (less processing time)
◦ Locality: data references performed close to the processor (less memory latency)
12
3- Clock Rate

Clock rates improved by roughly 30% per year ---> today’s PC is yesterday’s supercomputer

13
4- Similar Story for Memory and Disk
❖ Divergence between memory capacity and speed
o Capacity increased by 1000X from 1980-95, speed only 2X
o Larger memories are slower, while processors get faster: the “memory wall”
- Need to transfer more data in parallel
- Need deeper cache hierarchies
- Parallelism helps hide memory latency

❖ Parallelism within memory systems too
o New designs fetch many bits within the memory chip, followed by a fast pipelined transfer across a narrower interface

14
5- Role of Architecture
Greatest trend in VLSI is an increase in the exploited parallelism
• Up to 1985: bit level parallelism:
– 4-bit -> 8-bit -> 16-bit; slows after 32-bit

• Mid-80s to mid-90s: Instruction-Level Parallelism (ILP)
– pipelining and simple instruction sets (RISC)
– on-chip caches and functional units => superscalar execution
– greater sophistication: out-of-order execution, speculation

• Nowadays:
– Hyper-threading
– Multi-core
15
❖ Definition
High-performance computing (HPC) is the use of parallel processing for
running advanced application programs efficiently, reliably and quickly.

17
Conclusions
• The hardware evolution, driven by Moore’s law, was geared toward two
things:
– exploiting parallelism
– dealing with memory (latency, capacity)

18
Terminologies
❑ Core: a single computing unit with its own independent control.

❑ Multicore: a processor having several cores that can access the same memory concurrently.

❑ A computation is decomposed into several parts, called tasks, that can be computed in parallel.

❑ Finding enough parallelism is (one of the) critical steps for high performance (Amdahl’s law).

19
Performance Metrics

❑ Execution time:
The time elapsed between the beginning and the end of a program’s execution.

❑ Speedup:
The ratio of serial to parallel execution time.
Speedup = Ts / Tp

❑ Efficiency:
The ratio of speedup to the number of processors.
Efficiency = Speedup / P
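
As a minimal sketch (an addition to the slide, not part of it), the two formulas translate directly into code; the timings and processor count below are made-up placeholder values:

#include <stdio.h>

int main(void) {
    /* Placeholder measurements (seconds); assumed values for illustration only. */
    double ts = 100.0;  /* serial execution time Ts   */
    double tp = 30.0;   /* parallel execution time Tp */
    int p = 4;          /* number of processors P     */

    double speedup = ts / tp;         /* Speedup = Ts / Tp        */
    double efficiency = speedup / p;  /* Efficiency = Speedup / P */

    printf("Speedup    = %.2f\n", speedup);
    printf("Efficiency = %.2f\n", efficiency);
    return 0;
}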
20
❑ Amdahl’s Law
Used to predict maximum speedup using multiple processors.
• Let f = fraction of work performed sequentially.
• (1 - f) = fraction of work that is parallelizable.
• P = number of processors
On 1 CPU: T1 = f + (1 - f) = 1
On P processors: Tp = f + (1 - f)/P

• Speedup = T1/Tp = 1 / (f + (1 - f)/P) < 1/f

Speedup limited by sequential part
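
To make the bound concrete, here is a small illustrative C sketch (an addition, not from the slide) that evaluates the Amdahl speedup for an assumed sequential fraction f = 0.10 and several processor counts:

#include <stdio.h>

/* Amdahl speedup: T1 / Tp = 1 / (f + (1 - f) / p) */
static double amdahl_speedup(double f, int p) {
    return 1.0 / (f + (1.0 - f) / p);
}

int main(void) {
    double f = 0.10;  /* assumed: 10% of the work is strictly sequential */
    int procs[] = {1, 2, 4, 8, 16, 1024};
    int n = (int)(sizeof procs / sizeof procs[0]);

    for (int i = 0; i < n; i++)
        printf("P = %4d  speedup = %7.2f\n", procs[i], amdahl_speedup(f, procs[i]));

    /* No matter how many processors, the speedup stays below 1/f = 10 here. */
    printf("Upper bound (P -> infinity) = %.2f\n", 1.0 / f);
    return 0;
}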

21
How is parallelism expressed in a program

IMPLICITLY
❑ Define tasks only, rest implied; or define tasks and work decomposition, rest implied.
❑ OpenMP is a high-level parallel programming model, which is mostly an implicit model.

EXPLICITLY
❑ Define tasks, work decomposition, data decomposition, communication, synchronization.
❑ MPI is a library for fully explicit parallelization.

23
1- IMPLICITLY
❑ It is a characteristic of a programming language that allows a compiler or interpreter to automatically exploit the parallelism inherent in the computations expressed by some of the language's constructs.

❑ A pure implicitly parallel language does not need special directives, operators or functions to enable parallel execution.

❑ Programming languages with implicit parallelism include Axum, HPF, Id, LabVIEW, and MATLAB M-code.

❑ Example: taking the sine or logarithm of a group of numbers, a language that provides implicit parallelism might allow the programmer to write the instruction thus:
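
The slide's original snippet is not reproduced here; as a rough analogue in C with OpenMP (which these slides describe as a mostly implicit model), a single directive leaves thread creation, work division and communication to the compiler and runtime. The array size N and the gcc -fopenmp flag are illustration choices, not from the slide:

#include <math.h>
#include <stdio.h>

#define N 1000000

int main(void) {
    static double x[N], y[N];

    for (int i = 0; i < N; i++)
        x[i] = (double)i / N;

    /* One directive; how the iterations are split across threads stays implicit.
       Compile with, e.g., gcc -fopenmp -lm */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        y[i] = sin(x[i]);

    printf("y[N-1] = %f\n", y[N - 1]);
    return 0;
}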

24
Advantages
❑ A programmer does not need to worry about task division or process communication,
❑ focusing instead on the problem that his or her program is intended to solve.
❑ It generally facilitates the design of parallel programs.

Disadvantages
❑ It reduces the control that the programmer has over the parallel execution of the program,
❑ sometimes resulting in less-than-optimal parallel efficiency.
❑ Sometimes debugging is difficult.

25
2- EXPLICITLY
How is parallelism expressed in a program
❑ It is the representation of concurrent computations by means of primitives in the form of special-purpose directives or function calls.

❑ Most parallel primitives are related to process synchronization, communication, or task partitioning, as in the sketch below.
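
A minimal MPI sketch in C (an illustrative addition, assuming an MPI implementation such as MPICH or Open MPI is installed): every process is addressed explicitly by its rank, and the communication step is spelled out as a library call:

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* who am I?           */
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* how many processes? */

    /* Explicit communication: sum every rank's contribution on rank 0. */
    int local = rank + 1, total = 0;
    MPI_Reduce(&local, &total, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("Sum over %d processes = %d\n", size, total);

    MPI_Finalize();
    return 0;
}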

Advantages
❑ The absolute programmer control over the parallel execution.
❑ A skilled parallel programmer takes advantage of explicit parallelism to produce very efficient code.

Disadvantages
❑ Programming with explicit parallelism is often difficult, especially for non-computing specialists,
❑ because of the extra work involved in planning the task division and synchronization of concurrent tasks.
26
Think Different
How many people doing the work → (Degree of Parallelism)

What is needed to begin the work → (Initialization)

Who does what → (Work distribution)

Access to work part → (Data/IO access)

Whether they need info from each other to finish their own job → (Communication)

When are they all done → (Synchronization)

What needs to be done to collate the result
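
A hedged sketch (my addition, not from the slides) marking where each of these questions shows up in a small OpenMP program in C; the thread count of 4 and the array size are arbitrary illustration values (compile with gcc -fopenmp):

#include <stdio.h>
#include <omp.h>

#define N 1000

int main(void) {
    double a[N], sum = 0.0;

    /* Initialization: what is needed to begin the work. */
    for (int i = 0; i < N; i++)
        a[i] = 1.0;

    omp_set_num_threads(4);   /* Degree of parallelism: how many "people" do the work. */

    /* Work distribution: loop iterations are divided among the threads.
       Data/IO access: each thread reads its own slice of a[].
       Communication/collation: reduction(+:sum) combines the partial sums.
       Synchronization: the implicit barrier at the end of the loop marks "all done". */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; i++)
        sum += a[i];

    printf("sum = %f\n", sum);   /* Collated result. */
    return 0;
}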

27
Challenges
All parallel programs contain:
❑ Parallel sections
❑ Serial sections
❑ Serial sections are where work is being duplicated or no useful work is being done (e.g., waiting for others)

Building efficient algorithms means avoiding:
❑ Communication delay
❑ Idling
❑ Synchronization

28
Sources of overhead in parallel programs
❑ Inter-process interaction:
The time spent communicating data between processing elements is
usually the most significant source of parallel processing overhead.

❑ Idling:
Processes may become idle due to many reasons such as load
imbalance, synchronization, and presence of serial components in a
program.

❑ Excess Computation:
The fastest known sequential algorithm for a problem may be difficult
or impossible to parallelize, forcing us to use a parallel algorithm
based on a poorer but easily parallelizable sequential algorithm.

29
