
UNIT - IV

PARALLELISM
1. Parallelism
2. Parallel Processing
3. Amdahl’s Law
4. Flynn's Classification
5. Hardware Multithreading
6. Multicore Processors
7. Shared Memory Multiprocessors
8. Graphics Processing Units
9. Clusters
10. Warehouse Scale Computers
11. Message-Passing Multiprocessors
1. PARALLELISM
• ILP – Instruction Level Parallelism
• ILP is a measure of how many instructions can be executed simultaneously.
• There are two approaches to instruction-level parallelism: hardware and software.
• The hardware level works upon dynamic parallelism.
• The software level works on static parallelism.
• Dynamic parallelism means the processor decides which instructions to execute in parallel.
• Static parallelism means the compiler decides which instructions to execute in parallel.
Two methods for increasing the amount of ILP:
1. First approach – increase the depth of the pipeline to overlap more instructions.
2. Second approach – replicate the internal components of the computer. This technique is called multiple issue.

There are two major ways to implement a multiple-issue processor:
• Static Multiple Issue: decisions are made statically (at compile time, before execution).
• Dynamic Multiple Issue: decisions are made dynamically (at execution time).
SPECULATION
• An approach that allows the compiler or the processor to “guess” about the properties of an instruction.
• Speculation may be done in two ways: by hardware or by the compiler.
• In hardware speculation, the instructions are executed and their results are stored in buffers until the guess is known to be correct.
• In compiler-based speculation, special speculation support is added while executing the instructions.
2. PARALLEL PROCESSING
• Processing data concurrently is called Parallel Processing.
• There are two ways of achieving parallelism:
1. Multiple Functional Units: the system may have two or more ALUs so that it can execute two or more instructions at the same time.
2. Multiple Processors: the system may have two or more processors operating concurrently.
2.1 PARALLEL PROCESSING CHALLENGES
• It is difficult to write software that uses multiple processors.
• The parallel version must deliver better performance or better energy efficiency to be worth the effort.
• It is difficult to write parallel processing programs that are fast.
• Other challenges include scheduling, load balancing, synchronization, and the overhead of communication.
3. AMDAHL’S LAW
• Used to predict the theoretical speedup when using multiple processors.
• Used in parallel computing.
• Even a small part of a program that cannot be parallelized will limit the overall speedup.
SPEEDUP
• Speedup is defined as the ratio of the execution time for the entire task without using the enhancement to the execution time for the entire task using the enhancement.
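Written as a formula: if a fraction f of the execution time benefits from the enhancement (for example, parallel execution on n processors), then

Speedup = 1 / ((1 − f) + f / n)

A minimal sketch in C (the function name amdahl_speedup is illustrative, not from the source):

#include <stdio.h>

/* Amdahl's Law: predicted speedup when a fraction f of the work is
   enhanced (parallelized across n processors); (1 - f) stays serial. */
double amdahl_speedup(double f, int n) {
    return 1.0 / ((1.0 - f) + f / n);
}

int main(void) {
    /* With f = 0.95, the speedup can never exceed 1 / (1 - 0.95) = 20. */
    for (int n = 2; n <= 1024; n *= 2)
        printf("n = %4d  speedup = %.2f\n", n, amdahl_speedup(0.95, n));
    return 0;
}

With f = 0.95 the speedup approaches, but never exceeds, 1 / (1 − 0.95) = 20, no matter how many processors are added.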
ACHIEVING SPEED UP / TYPES OF SCALING
SPEEDUP CHALLENGES
SOLVED PROBLEMS - AMDAHL’S LAW
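A representative worked problem (the numbers here are illustrative, not taken from the original slides): suppose 90% of a task can be parallelized and the parallel portion runs on 10 processors. Then

Speedup = 1 / ((1 − 0.9) + 0.9 / 10) = 1 / (0.1 + 0.09) = 1 / 0.19 ≈ 5.26

Ten processors therefore yield only about a 5.3× speedup, because the 10% sequential portion dominates the execution time.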
SOLVED PROBLEMS ON PROTEIN STRING MATCHING CODE
4. FLYNN’S CLASSIFICATION / TAXONOMY
4.1 - Single Instruction, Single Data (SISD)
• Executes a single instruction on a single data stream using a
single processor.
• Based on traditional Von Neumann uniprocessor architecture
• Instructions are executed sequentially
• E.g - IBM 704, VAX 11/780, CRAY-1, Older mainframe computers

4.2 - Single Instruction, Multiple Data (SIMD)
• Executes a single instruction on multiple data values simultaneously
using many processors.
• A single control unit does the fetch and decoding for all processors.
• SIMD architectures include array processors.
• E.g – ILLIAC-IV, MPP, CM-2, STARAN

4.3 - Multiple Instruction, Single Data (MISD)
• Executes different instructions, all of them operating on the same data stream.
• This structure has not been commercially implemented.
• The systolic array is one example of an MISD architecture.

4.4 - Multiple Instruction, Multiple Data (MIMD)
• Executes multiple instructions simultaneously on multiple data streams.
• Each processor must include its own control unit.
• Organized as a shared-memory multiprocessor or a distributed-memory multicomputer.
• E.g – CRAY X-MP, IBM 370/168 MP

4.5 - SINGLE PROGRAM, MULTIPLE DATA (SPMD)

• It is a subcategory of MIMD.
• Tasks are split up and run simultaneously on multiple
processors with different input in order to obtain results
faster.
5. HARDWARE MULTITHREADING
• Hardware multithreading allows multiple threads to share the functional units of a single processor in an overlapping fashion.
• The instruction stream is divided into several smaller streams, known as threads, such that the threads can be executed in parallel.
• There are two main approaches to hardware multithreading:
1. Fine-grained multithreading / Interleaved Multithreading
2. Coarse-grained multithreading / Blocking Multithreading
FINE-GRAINED MULTITHREADING
• The processor switches between threads after each instruction.
• Also called Interleaved Multithreading.
• Switching is done in a round-robin fashion.
• In a pipelined architecture with k stages and k threads to execute, there can be no hazards due to dependences, and the pipeline never stalls.
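For illustration (an assumed example with k = 4 pipeline stages and 4 threads T1–T4, not taken from the source), round-robin interleaving issues instructions like this:

Cycle : 1  2  3  4  5  6  7  8
Issue : T1 T2 T3 T4 T1 T2 T3 T4

Each thread issues only once every 4 cycles, so an instruction from T1 has already left the 4-stage pipeline before T1's next instruction enters it; no stall due to dependences can occur.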
COARSE-GRAINED MULTITHREADING
• The processor switches threads only when a thread stalls waiting for a time-consuming operation to complete.
• Also called Blocking Multithreading.
• A switch is made to another thread; when this thread in turn causes a stall, a third thread is scheduled, and so on.
SIMULTANEOUS MULTITHREADING (SMT)
• A variation of hardware
multithreading.
• Allows multiple threads to
execute simultaneously.
• SMT = ILP + TLP
• SMT is a multiple-issue,
dynamically scheduled processor
• Uses register renaming and
dynamic scheduling techniques
• (Register Renaming - all
WAW and WAR hazards are
avoided)
APPROACHES TO EXECUTE MULTIPLE THREADS
Two Approaches:
1. Approach to execute threads on a Scalar (Single-Issue) Processor
2. Approach to execute threads on a Superscalar (Multiple-Issue) Processor
APPROACHES WITH A SCALAR PROCESSOR
APPROACHES WITH A SUPERSCALAR PROCESSOR
DIFFERENT APPROACHES TO EXECUTE MULTIPLE THREADS
6. MULTICORE PROCESSORS
• A multicore computer, also known as a chip multiprocessor, combines two or more processors (called cores) on a single IC.
• Each core consists of registers, an ALU, pipeline hardware, and a control unit, plus L1, L2 and L3 (instruction and data) caches.
A Typical Multi-core Structure
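A minimal sketch of how software can discover the cores on such a chip (assuming a Linux/glibc system, where the _SC_NPROCESSORS_ONLN query is available):

#include <stdio.h>
#include <unistd.h>

int main(void) {
    /* _SC_NPROCESSORS_ONLN reports the number of cores currently online. */
    long cores = sysconf(_SC_NPROCESSORS_ONLN);
    printf("This chip multiprocessor exposes %ld cores\n", cores);
    return 0;
}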
MULTICORE ORGANIZATION

• The four general organizations for multicore


systems are :

• Dedicated L1 cache
• Dedicated L2 cache
• Shared L2 cache
• Shared L3 cache
DEDICATED L1 CACHE
• On-chip cache
• Size – 256KB to 1MB
• Divided into instruction and data caches.
• Holds the data the CPU is most likely to need while completing a certain task.
• Example: ARM11 MPCore
DEDICATED L2 CACHE
• No shared on-chip cache; each core has its own L2.
• Size – 256KB to 8MB
• Slower but bigger than the L1 cache
• Holds data that is likely to be accessed by the CPU next.
• Example: AMD Opteron
SHARED L2 CACHE
• The L2 cache is shared among the cores.
• Example: Intel Core Duo
SHARED L3 CACHE
• Largest cache
• Slowest one
• Size - 4MB to 50MB.
• Better Performance
• Example :
AMD K10
HARDWARE PERFORMANCE ISSUES

• Increase in Parallelism and Complexity

• Power Consumption
SOFTWARE PERFORMANCE ISSUES

• Multi-threaded applications

• Multi-process applications

• Multi-instance applications
7. SHARED MEMORY MULTIPROCESSORS
• A parallel processor with a single address space across all processors.
• Processors communicate through shared variables in memory (see the sketch after this list).
• Processors access any memory location via loads and stores.
• Processors can run independent jobs in their own virtual address spaces.
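A minimal sketch using POSIX threads (the variable names are illustrative; compile with cc -pthread). Both threads see the same address space, so they communicate through a shared variable, and a mutex provides the synchronization:

#include <pthread.h>
#include <stdio.h>

/* One address space: every thread sees this shared variable. */
long counter = 0;
pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

void *worker(void *arg) {
    (void)arg;                        /* unused */
    for (int i = 0; i < 100000; i++) {
        pthread_mutex_lock(&lock);    /* synchronize the shared update */
        counter++;
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("counter = %ld\n", counter);  /* 200000: no update was lost */
    return 0;
}

Without the mutex, the two threads could interleave their load/store sequences and lose updates, which is exactly the synchronization challenge noted earlier.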
TYPES OF SHARED MEMORY MULTIPROCESSORS
• It comes in three styles:
– Uniform Memory Access Multiprocessors (UMA)
– Non-Uniform Memory Access Multiprocessors (NUMA)
– Cache Only Memory Access (COMA)
Uniform Memory Access Multiprocessors (UMA)
• All accesses to main memory take about the same amount of time, no matter which processor requests them.
Non-Uniform Memory Access Multiprocessors (NUMA)
• Some memory accesses are much faster than others, depending on which processor asks for which word.
Cache Only Memory Access (COMA)
• Data have no specific “permanent” location.
• The entire physical address space is considered one huge, single cache.
• Data can be read into the local caches and/or modified and then updated at their “permanent” location.
• Data can migrate and/or be replicated in the various memory banks of the central main memory.
TYPES OF SHARED MEMORY ARCHITECTURE
There are two types of shared memory architecture:
• Symmetric Shared-Memory Architectures
• Distributed Shared-Memory Architectures
Symmetric Shared-Memory Architectures
• Consist of several processors with a single physical memory shared by all processors through a shared bus.
• The cores have private level-1 caches, while other caches may or may not be shared between the cores.
Distributed Shared-Memory Architectures
• Consist of multiple independent processing nodes with local memory modules, connected by a general interconnection network.
• Distributed-memory systems are also called clusters or hybrid systems.
• Each node of a cluster has access to shared memory in addition to each node's non-shared private memory.
8. GRAPHICS PROCESSING UNIT (GPU)
• Also called a Visual Processing Unit
• A graphics coprocessor / accelerator
• Built with hundreds of processing cores
• Handles large numbers of floating-point operations in parallel
• Used in mobile phones, game consoles, embedded systems, PCs and servers
• Does not rely on multilevel caches; relies on hardware multithreading instead
• Main memory size – 4 to 6 GB or less
Difference Between CPU and GPU
GPU ARCHITECTURE
• The first GPU was the GeForce 256, released by NVIDIA in 1999.
• These GPU chips can process a minimum of 10 million polygons per second.
• The NVIDIA GPU described here has 128 cores on a single chip.
• Each core can handle 8 threads of instructions.
• In total, 1,024 (8 × 128) threads can be executed concurrently on a single GPU.
Streaming Multiprocessors (SM)
• A GPU consists of many Streaming Multiprocessors (SMs).
• Multiple SMs can be built on a single GPU chip.
• Each SM is associated with a private L1 data cache.
• Each SM has 16 load/store units, allowing source and destination addresses to be calculated for 16 threads per clock.
• Each SM has 32 CUDA cores (with 16 SMs, 16 × 32 = 512 CUDA cores in total).
Memory Controller (MC)
• Every Memory Controller (MC) is associated with a shared L2 cache for faster access to the cached data.
• Both the MC and the L2 cache are on-chip.
GPU PROGRAMMING MODEL - CUDA
• The GPU uses a programming model called CUDA (Compute Unified Device Architecture).
• CUDA is an extension of the C language.
• It enables the programmer to write C programs that execute on GPUs.
• It is used to control the device.
• The programmer specifies which functions run on the CPU and which on the GPU (a sketch follows):
Host code (CPU) can be C++
Device code (GPU) may only be C
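A minimal CUDA C sketch (the kernel name add and the array names are illustrative; compile with nvcc). The __global__ qualifier marks device code, and the <<<blocks, threads>>> syntax launches the kernel from host code:

#include <stdio.h>
#include <cuda_runtime.h>

/* Device code: each GPU thread computes one array element. */
__global__ void add(int n, const float *a, const float *b, float *c) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main(void) {
    int n = 1 << 20;
    float *a, *b, *c;
    /* Unified memory is visible to both host (CPU) and device (GPU). */
    cudaMallocManaged(&a, n * sizeof(float));
    cudaMallocManaged(&b, n * sizeof(float));
    cudaMallocManaged(&c, n * sizeof(float));
    for (int i = 0; i < n; i++) { a[i] = 1.0f; b[i] = 2.0f; }

    /* Host code launches the kernel: enough 256-thread blocks to cover n. */
    add<<<(n + 255) / 256, 256>>>(n, a, b, c);
    cudaDeviceSynchronize();          /* wait for the GPU to finish */

    printf("c[0] = %f\n", c[0]);      /* prints 3.000000 */
    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}

Each GPU thread derives its own index from its block and thread IDs and handles one element, which is how thousands of threads cooperate on a single loop.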
GPU MEMORY SYSTEM
• It has a multi-banked memory structure.
• GPU memory systems are designed for data throughput, with wide memory buses.
• Their bandwidth is much larger than that of typical CPUs – typically 6 to 8 times larger.
• The on-chip SMEM memory is local to each Streaming Multiprocessor.
• The off-chip global memory is shared by the whole GPU and all thread blocks.
9. CLUSTERS
• A cluster is a set of loosely (or tightly) connected computers (nodes) working together as a unified computing resource.
• Each node performs the same task.
• Nodes are controlled by software.
• Nodes are connected to each other by an I/O interconnect via standard network switches and cables.
• Each computer has private memory and its own OS.
• It is easier to expand the system.
• It is also easier to replace a computer.
• It is easy to scale down gracefully.
TYPES OF CLUSTERED SYSTEMS
1. Asymmetric Clustering System
• One of the nodes is in standby mode while all the others run the required applications.
• The standby node continuously monitors the server, and if the server fails, the standby node takes its place.
2. Symmetric Clustering System
• All the nodes run applications as well as monitor each other.
• More efficient, as it doesn't keep a node merely as a standby.
ATTRIBUTES/TYPES OF CLUSTERED SYSTEMS
1. Load-Balancing Clusters
• Share the workload to provide better performance.
• System performance is optimized.
• Use a round-robin mechanism.
2. High-Availability Clusters
• Improve the availability of the clustered system.
• Have extra nodes to be used if some of the system components fail.
• Remove single points of failure.
• Also known as failover clusters or HA clusters.
BENEFITS OF CLUSTERED SYSTEMS

• Performance

• Fault Tolerance

• Scalability
APPLICATIONS OF CLUSTERS
• Amazon
• Facebook
• Google
• Microsoft
These companies have multiple datacenters, each with clusters of tens of thousands of servers.
10. WAREHOUSE SCALE COMPUTERS (WSC)
• Large clusters are called Warehouse-Scale Computers.
• A WSC is a cluster comprised of tens of thousands of servers.
• A WSC acts as one giant computer.
• Internet services necessitated the construction of new buildings to house, power, and cool 100,000 servers.
• WSCs often use a hierarchy of networks for interconnection.
Applications of WSC
A WSC can be used to provide internet services:
• Search – Google
• Social Networking – Facebook
• Video Sharing – YouTube
• Online Sales – Amazon
• Cloud Computing Services – Rackspace
Goals / Design factors for WSC
• Cost-performance
• Energy efficiency
• Dependability via redundancy
• Network I/O
• Interactive and batch processing workloads
• Ample computational parallelism
• Operational costs count
• Power consumption is a primary, not secondary, constraint when designing the system
• Scale and its opportunities and problems
• Can afford to build customized systems since WSC
require volume purchase
Programming Model of WSC
• WSC uses the MapReduce programming model.
• The MapReduce runtime environment schedules map and reduce tasks to WSC nodes.
• MapReduce programs work in two phases: 1. Map Phase  2. Reduce Phase (a sketch follows this list).
• The input to each phase is a set of (key, value) pairs.
• The Map function runs on thousands of servers to produce an intermediate result of key-value pairs.
• The Reduce function collects the output of those distributed tasks and collapses them.
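A minimal, sequential C sketch of the two phases for a word count (all names are illustrative; a real MapReduce runtime would distribute the map and reduce calls across thousands of servers):

#include <stdio.h>
#include <string.h>

#define MAX 16

struct kv { const char *key; int value; };

/* Map phase: emit a (word, 1) pair for every word in the input. */
int map(const char *words[], int n, struct kv out[]) {
    for (int i = 0; i < n; i++) { out[i].key = words[i]; out[i].value = 1; }
    return n;
}

/* Reduce phase: collapse all pairs with the same key into one sum. */
void reduce(struct kv in[], int n) {
    for (int i = 0; i < n; i++) {
        if (in[i].value == 0) continue;           /* already merged */
        for (int j = i + 1; j < n; j++)
            if (strcmp(in[i].key, in[j].key) == 0) {
                in[i].value += in[j].value;       /* combine counts */
                in[j].value = 0;
            }
        printf("%s -> %d\n", in[i].key, in[i].value);
    }
}

int main(void) {
    const char *doc[] = { "map", "reduce", "map", "map" };
    struct kv pairs[MAX];
    int n = map(doc, 4, pairs);
    reduce(pairs, n);   /* prints: map -> 3, reduce -> 1 */
    return 0;
}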
11. MESSAGE PASSING MULTIPROCESSORS (MPP)
• An alternative method for communication and movement of data among multiprocessors.
• An MPP combines local memory and a processor at each node of the interconnection network.
• Message passing is a way of communicating between multiple processors by explicitly sending and receiving messages.
• The system has functions to send and receive messages (a sketch follows this list).
• Coordination is built in with message passing.
• If the sender needs confirmation that the message has arrived, the receiving processor can send an acknowledgment message back to the sender.
• Message passing can be synchronous or asynchronous.
• Synchronous message passing systems require the sender and receiver to wait for each other while transferring the message.
• In asynchronous message passing, the sender and receiver do not wait for each other and can carry on their own computations while the transfer of messages takes place.
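A minimal sketch of explicit send and receive, assuming an MPI library such as Open MPI or MPICH (compile with mpicc, run with mpirun -np 2):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, value;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);  /* which processor am I? */

    if (rank == 0) {
        value = 42;
        /* Blocking send: returns once the buffer can be reused. */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* Blocking receive: waits until the message arrives. */
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("received %d\n", value);
    }
    MPI_Finalize();
    return 0;
}

Both calls here are blocking, so the receiver waits for the sender, illustrating the synchronous style described above.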
Advantages of Message Passing Model
 Easier to build than scalable shared memory machines
 Easy to scale
 Coherency and synchronization are the responsibility of the user, so the system designer need not worry about them.

Disadvantages of Message Passing Model
 Large overhead: copying of buffers requires large data transfers.
 Programming is more difficult.
 The blocking nature of SEND/RECEIVE can cause increased latency and deadlock issues.
