
UNIT - IV

PARALLELISM
1. Parallelism
2. Parallel Processing
3. Amdahl’s Law
4. Flynn's Classification
5. Hardware Multithreading
6. Multicore Processors
7. Shared Memory Multiprocessors
8. Graphics Processing Units
9. Clusters
10. Warehouse Scale Computers
11. Message-Passing Multiprocessors
1. PARALLELISM
• ILP – Instruction Level Parallelism
• ILP is a measure of how many instructions can be executed simultaneously.
• There are two approaches to instruction-level parallelism: hardware and software.
• The hardware level works upon dynamic parallelism.
• The software level works on static parallelism.
• Dynamic parallelism means the processor decides which instructions to execute in parallel.
• Static parallelism means the compiler decides which instructions to execute in parallel.
Two methods for increasing the amount of ILP:
1. First approach – increase the depth of the pipeline to overlap more instructions.
2. Second approach – replicate the internal components of the computer. This technique is called multiple issue.

There are two major ways to implement a multiple-issue processor:
• Static Multiple Issue: decisions are made statically (at compile time, before execution).
• Dynamic Multiple Issue: decisions are made dynamically (at execution time).
SPECULATION
• An approach that allows the compiler or the processor to “guess” about the properties of an instruction.
• Speculation may be done in two ways: by hardware or by the compiler.
• In hardware speculation, the instructions are executed and their results are stored in buffers until the guess is known to be correct.
• In compiler-based speculation, special speculation support is added while executing the instructions.
2. PARALLEL PROCESSING
• Processing data concurrently is called Parallel Processing.
• There are two ways of achieving parallelism:
1. Multiple Functional Units: the system may have two or more ALUs so that it can execute two or more instructions at the same time.
2. Multiple Processors: the system may have two or more processors operating concurrently.
2.1 PARALLEL PROCESSING CHALLENGES
• It is difficult to write software that uses multiple processors.
• The parallel version must deliver better performance or better energy efficiency to be worth the effort.
• It is difficult to write parallel processing programs that are fast.
• Other challenges include scheduling, load balancing, synchronization, and the overhead of communication.
3. AMDAHL’S LAW
• Used to predict the theoretical speedup when using multiple processors.
• Used in parallel computing.
• Even a small part of a program that cannot be parallelized will limit the overall speedup.
SPEEDUP
• Speedup is defined as the ratio of the execution time for the entire task without using the enhancement to the execution time for the entire task using the enhancement.
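Written as a formula: if a fraction f of the execution time benefits from the enhancement (for example, parallel execution on n processors), then

Speedup = 1 / ((1 − f) + f / n)

A minimal sketch in C (the function name amdahl_speedup is illustrative, not from the source):

#include <stdio.h>

/* Amdahl's Law: predicted speedup when a fraction f of the work is
   enhanced (parallelized across n processors); (1 - f) stays serial. */
double amdahl_speedup(double f, int n) {
    return 1.0 / ((1.0 - f) + f / n);
}

int main(void) {
    /* With f = 0.95, the speedup can never exceed 1 / (1 - 0.95) = 20. */
    for (int n = 2; n <= 1024; n *= 2)
        printf("n = %4d  speedup = %.2f\n", n, amdahl_speedup(0.95, n));
    return 0;
}

With f = 0.95 the speedup approaches, but never exceeds, 1 / (1 − 0.95) = 20, no matter how many processors are added.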
ACHIEVING SPEED UP / TYPES OF SCALING
SPEEDUP CHALLENGES
SOLVED PROBLEMS - AMDAHL’S LAW
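A representative worked problem (the numbers here are illustrative, not taken from the original slides): suppose 90% of a task can be parallelized and the parallel portion runs on 10 processors. Then

Speedup = 1 / ((1 − 0.9) + 0.9 / 10) = 1 / (0.1 + 0.09) = 1 / 0.19 ≈ 5.26

Ten processors therefore yield only about a 5.3× speedup, because the 10% sequential portion dominates the execution time.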
SOLVED PROBLEMS ON PROTEIN STRING MATCHING CODE
4. FLYNN’S CLASSIFICATION / TAXONOMY
4.1 - Single Instruction, Single Data (SISD)
• Executes a single instruction on a single data stream using a
single processor.
• Based on traditional Von Neumann uniprocessor architecture
• Instructions are executed sequentially
• E.g - IBM 704, VAX 11/780, CRAY-1, Older mainframe computers

4.2 - Single Instruction, Multiple Data (SIMD)
• Executes a single instruction on multiple data values simultaneously
using many processors.
• A single control unit does the fetch and decoding for all processors.
• SIMD architectures include array processors.
• E.g – ILLIAC-IV, MPP, CM-2, STARAN

4.3 - Multiple Instruction, Single Data (MISD)
• Executes different instructions, all of them operating on the same data stream.
• This structure has not been commercially implemented.
• The systolic array is one example of an MISD architecture.

4.4 - Multiple Instruction, Multiple Data (MIMD)
• Executes multiple instructions simultaneously on multiple data streams.
• Each processor must include its own control unit.
• Organized as a shared-memory multiprocessor or a distributed-memory multicomputer.
• E.g – CRAY X-MP, IBM 370/168 MP

4.5 - SINGLE PROGRAM, MULTIPLE DATA (SPMD)

• It is a subcategory of MIMD.
• Tasks are split up and run simultaneously on multiple
processors with different input in order to obtain results
faster.
5. HARDWARE MULTITHREADING
• Hardware multithreading allows multiple threads to share the functional units of a single processor in an overlapping fashion.
• The instruction stream is divided into several smaller streams, known as threads, such that the threads can be executed in parallel.
• There are two main approaches to hardware multithreading:
1. Fine-grained multithreading / Interleaved Multithreading
2. Coarse-grained multithreading / Blocking Multithreading
FINE-GRAINED MULTITHREADING
• The processor switches between threads after each instruction.
• Also called Interleaved Multithreading.
• Switching is done in a round-robin fashion.
• In a pipelined architecture with k stages and k threads to execute, there can be no hazards due to dependences, and the pipeline never stalls.
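For illustration (an assumed example with k = 4 pipeline stages and 4 threads T1–T4, not taken from the source), round-robin interleaving issues instructions like this:

Cycle : 1  2  3  4  5  6  7  8
Issue : T1 T2 T3 T4 T1 T2 T3 T4

Each thread issues only once every 4 cycles, so an instruction from T1 has already left the 4-stage pipeline before T1's next instruction enters it; no stall due to dependences can occur.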
COARSE-GRAINED MULTITHREADING
• The processor switches threads only when a thread stalls waiting for a time-consuming operation to complete.
• Also called Blocking Multithreading.
• A switch is made to another thread; when this thread in turn causes a stall, a third thread is scheduled, and so on.
SIMULTANEOUS MULTITHREADING (SMT)
• A variation of hardware
multithreading.
• Allows multiple threads to
execute simultaneously.
• SMT = ILP + TLP
• SMT is a multiple-issue,
dynamically scheduled processor
• Uses register renaming and
dynamic scheduling techniques
• (Register Renaming - all
WAW and WAR hazards are
avoided)
APPROACHES TO EXECUTE MULTIPLE THREADS
Two Approaches:
1. Approach to execute threads on a Scalar (Single-Issue) Processor
2. Approach to execute threads on a Superscalar (Multiple-Issue) Processor
APPROACHES WITH A SCALAR PROCESSOR
APPROACHES WITH A SUPERSCALAR PROCESSOR
DIFFERENT APPROACHES TO EXECUTE MULTIPLE THREADS
6. MULTICORE PROCESSORS
• A multicore computer, also known as a chip multiprocessor, combines two or more processors (called cores) on a single IC.
• Each core consists of registers, an ALU, pipeline hardware, and a control unit, plus L1, L2 and L3 (instruction and data) caches.
A Typical Multi-core Structure
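A minimal sketch of how software can discover the cores on such a chip (assuming a Linux/glibc system, where the _SC_NPROCESSORS_ONLN query is available):

#include <stdio.h>
#include <unistd.h>

int main(void) {
    /* _SC_NPROCESSORS_ONLN reports the number of cores currently online. */
    long cores = sysconf(_SC_NPROCESSORS_ONLN);
    printf("This chip multiprocessor exposes %ld cores\n", cores);
    return 0;
}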
MULTICORE ORGANIZATION

• The four general organizations for multicore


systems are :

• Dedicated L1 cache
• Dedicated L2 cache
• Shared L2 cache
• Shared L3 cache
DEDICATED L1 CACHE
• On-chip cache
• Size – 256KB to 1MB
• Divided into instruction and data caches.
• Holds the data the CPU is most likely to need while completing a certain task.
• Example: ARM11 MPCore
DEDICATED L2 CACHE
• No shared on-chip cache; each core has its own L2.
• Size – 256KB to 8MB
• Slower but bigger than the L1 cache
• Holds data that is likely to be accessed by the CPU next.
• Example: AMD Opteron
SHARED L2 CACHE
• The L2 cache is shared among the cores.
• Example: Intel Core Duo
SHARED L3 CACHE
• Largest cache
• Slowest one
• Size - 4MB to 50MB.
• Better Performance
• Example :
AMD K10
HARDWARE PERFORMANCE ISSUES

• Increase in Parallelism and Complexity

• Power Consumption
SOFTWARE PERFORMANCE ISSUES

• Multi-threaded applications

• Multi-process applications

• Multi-instance applications
7. SHARED MEMORY MULTIPROCESSORS
• A parallel processor with a single address space across all processors.
• Processors communicate through shared variables in memory (see the sketch after this list).
• Processors access any memory location via loads and stores.
• Processors can run independent jobs in their own virtual address spaces.
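A minimal sketch using POSIX threads (the variable names are illustrative; compile with cc -pthread). Both threads see the same address space, so they communicate through a shared variable, and a mutex provides the synchronization:

#include <pthread.h>
#include <stdio.h>

/* One address space: every thread sees this shared variable. */
long counter = 0;
pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

void *worker(void *arg) {
    (void)arg;                        /* unused */
    for (int i = 0; i < 100000; i++) {
        pthread_mutex_lock(&lock);    /* synchronize the shared update */
        counter++;
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("counter = %ld\n", counter);  /* 200000: no update was lost */
    return 0;
}

Without the mutex, the two threads could interleave their load/store sequences and lose updates, which is exactly the synchronization challenge noted earlier.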
TYPES OF SHARED MEMORY MULTIPROCESSORS
• It comes in three styles:
– Uniform Memory Access Multiprocessors (UMA)
– Non-Uniform Memory Access Multiprocessors (NUMA)
– Cache Only Memory Access (COMA)
Uniform Memory Access Multiprocessors (UMA)
• All accesses to main memory take about the same amount of time, no matter which processor requests them.
Non-Uniform Memory Access Multiprocessors (NUMA)
• Some memory accesses are much faster than others, depending on which processor asks for which word.
Cache Only Memory Access (COMA)
• Data have no specific “permanent” location.
• The entire physical address space is considered one huge, single cache.
• Data can be read into the local caches and/or modified and then updated at their “permanent” location.
• Data can migrate and/or be replicated in the various memory banks of the central main memory.
TYPES OF SHARED MEMORY ARCHITECTURE
There are two types of shared memory architecture:
• Symmetric Shared-Memory Architectures
• Distributed Shared-Memory Architectures
Symmetric Shared-Memory Architectures
• Consist of several processors with a single physical memory shared by all processors through a shared bus.
• The cores have private level-1 caches, while other caches may or may not be shared between the cores.
Distributed Shared-Memory Architectures
• Consist of multiple independent processing nodes with local memory modules, connected by a general interconnection network.
• Distributed-memory systems are also called clusters or hybrid systems.
• Each node of a cluster has access to shared memory in addition to each node's non-shared private memory.
8. GRAPHICS PROCESSING UNIT (GPU)
• Also called a Visual Processing Unit
• A graphics coprocessor / accelerator
• Built with hundreds of processing cores
• Handles large numbers of floating-point operations in parallel
• Used in mobile phones, game consoles, embedded systems, PCs and servers
• Does not rely on multilevel caches; relies on hardware multithreading instead
• Main memory size – 4 to 6 GB or less
Difference Between CPU and GPU
GPU ARCHITECTURE
• The first GPU was the GeForce 256, released by NVIDIA in 1999.
• These GPU chips can process a minimum of 10 million polygons per second.
• The NVIDIA GPU described here has 128 cores on a single chip.
• Each core can handle 8 threads of instructions.
• In total, 1,024 (8 × 128) threads can be executed concurrently on a single GPU.
Streaming Multiprocessors (SM)
• A GPU consists of many Streaming Multiprocessors (SMs).
• Multiple SMs can be built on a single GPU chip.
• Each SM is associated with a private L1 data cache.
• Each SM has 16 load/store units, allowing source and destination addresses to be calculated for 16 threads per clock.
• Each SM has 32 CUDA cores (with 16 SMs, 16 × 32 = 512 CUDA cores in total).
Memory Controller (MC)
• Every Memory Controller (MC) is associated with a shared L2 cache for faster access to the cached data.
• Both the MC and the L2 cache are on-chip.
GPU PROGRAMMING MODEL - CUDA
• The GPU uses a programming model called CUDA (Compute Unified Device Architecture).
• CUDA is an extension of the C language.
• It enables the programmer to write C programs that execute on GPUs.
• It is used to control the device.
• The programmer specifies which functions run on the CPU and which on the GPU (a sketch follows):
Host code (CPU) can be C++
Device code (GPU) may only be C
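A minimal CUDA C sketch (the kernel name add and the array names are illustrative; compile with nvcc). The __global__ qualifier marks device code, and the <<<blocks, threads>>> syntax launches the kernel from host code:

#include <stdio.h>
#include <cuda_runtime.h>

/* Device code: each GPU thread computes one array element. */
__global__ void add(int n, const float *a, const float *b, float *c) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main(void) {
    int n = 1 << 20;
    float *a, *b, *c;
    /* Unified memory is visible to both host (CPU) and device (GPU). */
    cudaMallocManaged(&a, n * sizeof(float));
    cudaMallocManaged(&b, n * sizeof(float));
    cudaMallocManaged(&c, n * sizeof(float));
    for (int i = 0; i < n; i++) { a[i] = 1.0f; b[i] = 2.0f; }

    /* Host code launches the kernel: enough 256-thread blocks to cover n. */
    add<<<(n + 255) / 256, 256>>>(n, a, b, c);
    cudaDeviceSynchronize();          /* wait for the GPU to finish */

    printf("c[0] = %f\n", c[0]);      /* prints 3.000000 */
    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}

Each GPU thread derives its own index from its block and thread IDs and handles one element, which is how thousands of threads cooperate on a single loop.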
GPU MEMORY SYSTEM
• It has a multi-banked memory structure.
• GPU memory systems are designed for data throughput, with wide memory buses.
• Their bandwidth is much larger than that of typical CPUs – typically 6 to 8 times larger.
• The on-chip SMEM memory is local to each Streaming Multiprocessor.
• The off-chip global memory is shared by the whole GPU and all thread blocks.
9. CLUSTERS
• A cluster is a set of loosely (or tightly) connected computers (nodes) working together as a unified computing resource.
• Each node performs the same task.
• Nodes are controlled by software.
• Nodes are connected to each other by an I/O interconnect via standard network switches and cables.
• Each computer has private memory and its own OS.
• It is easier to expand the system.
• It is also easier to replace a computer.
• It is easy to scale down gracefully.
TYPES OF CLUSTERED SYSTEMS
1. Asymmetric Clustering System
• One of the nodes is in standby mode while all the others run the required applications.
• The standby node continuously monitors the server, and if the server fails, the standby node takes its place.
2. Symmetric Clustering System
• All the nodes run applications as well as monitor each other.
• More efficient, as it doesn't keep a node merely as a standby.
ATTRIBUTES/TYPES OF CLUSTERED SYSTEMS
1. Load-Balancing Clusters
• Share the workload to provide better performance.
• System performance is optimized.
• Use a round-robin mechanism.
2. High-Availability Clusters
• Improve the availability of the clustered system.
• Have extra nodes to be used if some of the system components fail.
• Remove single points of failure.
• Also known as failover clusters or HA clusters.
BENEFITS OF CLUSTERED SYSTEMS

• Performance

• Fault Tolerance

• Scalability
APPLICATIONS OF CLUSTERS
• Amazon
• Facebook
• Google
• Microsoft
These companies have multiple datacenters, each with clusters of tens of thousands of servers.
10. WAREHOUSE SCALE COMPUTERS (WSC)
• Large clusters are called Warehouse-Scale Computers.
• A WSC is a cluster comprised of tens of thousands of servers.
• A WSC acts as one giant computer.
• Internet services necessitated the construction of new buildings to house, power, and cool 100,000 servers.
• WSCs often use a hierarchy of networks for interconnection.
Applications of WSC
A WSC can be used to provide internet services:
• Search – Google
• Social Networking – Facebook
• Video Sharing – YouTube
• Online Sales – Amazon
• Cloud Computing Services – Rackspace
Goals / Design factors for WSC
• Cost-performance
• Energy efficiency
• Dependability via redundancy
• Network I/O
• Interactive and batch processing workloads
• Ample computational parallelism
• Operational costs count
• Power consumption is a primary, not secondary, constraint when designing the system
• Scale and its opportunities and problems
• Can afford to build customized systems since WSC
require volume purchase
Programming Model of WSC
• WSC uses the MapReduce programming model.
• The MapReduce runtime environment schedules map and reduce tasks to WSC nodes.
• MapReduce programs work in two phases: 1. Map Phase  2. Reduce Phase (a sketch follows this list).
• The input to each phase is a set of (key, value) pairs.
• The Map function runs on thousands of servers to produce an intermediate result of key-value pairs.
• The Reduce function collects the output of those distributed tasks and collapses them.
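A minimal, sequential C sketch of the two phases for a word count (all names are illustrative; a real MapReduce runtime would distribute the map and reduce calls across thousands of servers):

#include <stdio.h>
#include <string.h>

#define MAX 16

struct kv { const char *key; int value; };

/* Map phase: emit a (word, 1) pair for every word in the input. */
int map(const char *words[], int n, struct kv out[]) {
    for (int i = 0; i < n; i++) { out[i].key = words[i]; out[i].value = 1; }
    return n;
}

/* Reduce phase: collapse all pairs with the same key into one sum. */
void reduce(struct kv in[], int n) {
    for (int i = 0; i < n; i++) {
        if (in[i].value == 0) continue;           /* already merged */
        for (int j = i + 1; j < n; j++)
            if (strcmp(in[i].key, in[j].key) == 0) {
                in[i].value += in[j].value;       /* combine counts */
                in[j].value = 0;
            }
        printf("%s -> %d\n", in[i].key, in[i].value);
    }
}

int main(void) {
    const char *doc[] = { "map", "reduce", "map", "map" };
    struct kv pairs[MAX];
    int n = map(doc, 4, pairs);
    reduce(pairs, n);   /* prints: map -> 3, reduce -> 1 */
    return 0;
}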
11. MESSAGE PASSING MULTIPROCESSORS (MPP)
• An alternative method for communication and movement of data among multiprocessors.
• An MPP combines local memory and a processor at each node of the interconnection network.
• Message passing is a way of communicating between multiple processors by explicitly sending and receiving messages.
• The system has functions to send and receive messages (a sketch follows this list).
• Coordination is built in with message passing.
• If the sender needs confirmation that the message has arrived, the receiving processor can send an acknowledgment message back to the sender.
• Message passing can be synchronous or asynchronous.
• Synchronous message passing systems require the sender and receiver to wait for each other while transferring the message.
• In asynchronous message passing, the sender and receiver do not wait for each other and can carry on their own computations while the transfer of messages takes place.
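A minimal sketch of explicit send and receive, assuming an MPI library such as Open MPI or MPICH (compile with mpicc, run with mpirun -np 2):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, value;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);  /* which processor am I? */

    if (rank == 0) {
        value = 42;
        /* Blocking send: returns once the buffer can be reused. */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* Blocking receive: waits until the message arrives. */
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("received %d\n", value);
    }
    MPI_Finalize();
    return 0;
}

Both calls here are blocking, so the receiver waits for the sender, illustrating the synchronous style described above.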
Advantages of Message Passing Model
 Easier to build than scalable shared memory machines
 Easy to scale
 Coherency and synchronization are the responsibility of the user, so the system designer need not worry about them.

Disadvantages of Message Passing Model
 Large overhead: copying of buffers requires large data transfers.
 Programming is more difficult.
 The blocking nature of SEND/RECEIVE can cause increased latency and deadlock issues.
