
UNIT IV PARALLELISM

Parallel processing challenges – Flynn's classification – SISD, MIMD, SIMD, SPMD, and Vector
Architectures - Hardware multithreading – Multi-core processors and other Shared Memory
Multiprocessors - Introduction to Graphics Processing Units, Clusters, Warehouse Scale
Computers and other Message-Passing Multiprocessors.
*************************************************************************************
Parallelism - Introduction:
- Parallelism means executing machine instructions in parallel.
Instruction-level-parallelism (ILP):
 Pipelining exploits the potential parallelism among instructions.
 This parallelism is called instruction-level parallelism (ILP).

Figure: Pipelining

Figure: Instruction Level Parallelism


 Multiple operations will execute in parallel (simultaneously)
 Goal: Speed Up the execution
************************************************************************************
Parallel processing challenges:
 multiprocessor
 A computer system with at least two processors
 Replacing large inefficient processors with many smaller, efficient processors can deliver
better performance, if software can efficiently use them.
 task-level parallelism or process-level parallelism:
 Utilizing multiple processors by running independent programs simultaneously.

 parallel processing program:
 A single program that runs on multiple processors simultaneously.
 cluster :
 A set of computers connected over a local area network that function as a single
large multiprocessor.
 Clusters have been used to solve large scientific problems.
 They are used in search engines, Web servers, email servers, and databases.
 Multicore microprocessor:
 A microprocessor containing multiple processors (“cores”) in a single integrated
circuit.
 Shared Memory Multiprocessor (SMP):
 A parallel processor with a single physical address space.

Difficulty of Creating Parallel Processing Programs:


The difficulty with parallelism is not the hardware; it is that too few important application
programs have been rewritten to complete tasks sooner on multiprocessors. It is difficult to write
software that uses multiple processors to complete one task faster, and the problem gets worse as
the number of processors increases.
Why have parallel processing programs been so much harder to develop than sequential programs?
Reasons
a) The first reason is that you must get better performance or better energy efficiency from a
parallel processing program on a multiprocessor; otherwise, you would just use a sequential
program on a uniprocessor, as sequential programming is simpler. Uniprocessor design techniques
such as superscalar and out-of-order execution already take advantage of instruction-level
parallelism without the programmer's involvement.
b) The second reason is the programming overhead: scheduling, partitioning the work into
parallel pieces, balancing the load evenly between the workers, the time to synchronize, and the
overhead for communication between the parties.
c) The third reason is Amdahl's Law: it reminds us that even small parts of a program must be
parallelized if the program is to make good use of many cores.

Speed-up Challenge:
 Amdahl's Law says:
Execution time after improvement =
(Execution time affected by improvement / Amount of improvement) + Execution time unaffected
 We can reformulate Amdahl's Law in terms of speed-up versus the original execution time:

Speed-up = Execution time before /
((Execution time before − Execution time affected) + Execution time affected / Amount of improvement)

 This formula is usually rewritten assuming that the execution time before is 1 for some unit
of time, and the execution time affected by improvement is considered the fraction of the
original execution time:

Speed-up = 1 / ((1 − Fraction time affected) + Fraction time affected / Amount of improvement)
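
As a quick numerical check, here is a minimal C sketch (our own illustration, not from the text)
that evaluates the rewritten form of Amdahl's Law; the function name amdahl_speedup is ours:

#include <stdio.h>

/* Amdahl's Law: 'fraction' is the part of the original execution time that
 * benefits from the improvement; 'improvement' is the factor by which that
 * part is sped up (e.g., the processor count). */
double amdahl_speedup(double fraction, double improvement) {
    return 1.0 / ((1.0 - fraction) + fraction / improvement);
}

int main(void) {
    /* A 99.89% parallel fraction on 100 processors gives about 90x,
     * matching Problem 1 below. */
    printf("speed-up = %.1f\n", amdahl_speedup(0.9989, 100.0));
    return 0;
}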

Problem 1
Q: Suppose you want to achieve a speed-up of 90 times faster with 100 processors. What percentage
of the original computation can be sequential?
Solution:

Substituting 90 for speed-up and 100 for amount of improvement into the formula above:

90 = 1 / ((1 − Fraction time affected) + Fraction time affected / 100)

Multiplying out: 90 − 90 × Fraction + 0.9 × Fraction = 1, so 89.1 × Fraction = 89 and
Fraction ≈ 0.999. Thus, to achieve a speed-up of 90 with 100 processors, only 0.1% of the
original computation can be sequential.
Problem 2
Q: Suppose you want to perform two sums: one is a sum of 10 scalar variables, and one is a matrix
sum of a pair of two-dimensional arrays, with dimensions 10 by 10. For simplicity, assume only the
additions can be parallelized. What speed-up do you get with 10 versus 40 processors? Next,
calculate the speed-ups assuming the matrices grow to 20 by 20.
Solution:
Let t be the time for one addition. There are 10 + 100 = 110 additions in all.
With 10 processors: time = 10t + 100t/10 = 20t, so speed-up = 110t/20t = 5.5 (55% of the ideal 10).
With 40 processors: time = 10t + 100t/40 = 12.5t, so speed-up = 110t/12.5t = 8.8 (22% of the ideal 40).
For 20-by-20 matrices there are 10 + 400 = 410 additions.
With 10 processors: time = 10t + 400t/10 = 50t, so speed-up = 410t/50t = 8.2 (82% of the ideal 10).
With 40 processors: time = 10t + 400t/40 = 20t, so speed-up = 410t/20t = 20.5 (51% of the ideal 40).
Problem 3
To achieve the speed-up of 20.5 on the previous larger problem with 40 processors, we assumed the load
was perfectly balanced. That is, each of the 40 processors had 2.5% of the work to do. Instead, show the
impact on speed-up if one processor’s load is higher than all the rest. Calculate at twice the load (5%) and
five times the load (12.5%) for that hardest working processor. How well utilized are the rest of the
processors?
Solution:
If one processor has 5% of the parallel load, it must perform 5% × 400 = 20 additions, while the
remaining 39 processors share the other 380. Since all of them operate simultaneously, execution
time = 10t + 20t = 30t, and the speed-up drops to 410t/30t ≈ 14. The remaining 39 processors each
need only 380t/39 ≈ 9.7t of the 20t that the hardest-working processor takes, so they are utilized
less than half the time.
If one processor has 12.5% of the load, it must perform 12.5% × 400 = 50 additions; execution
time = 10t + 50t = 60t, and the speed-up drops to 410t/60t ≈ 7. The rest of the processors each
need about 350t/39 ≈ 9t of the 50t, so they are utilized less than 20% of the time.
 strong scaling:
 Speedup achieved on a multiprocessor without increasing the size of the problem.
 weak scaling:
 Speedup achieved on a multiprocessor while increasing the size of the problem
proportionally to the increase in the number of processors.
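The distinction can be made concrete with a short C sketch (our own numbers, reusing the
scalar-plus-matrix example from the problems above): under strong scaling the problem stays
fixed as processors are added, while under weak scaling the parallel work grows with the
processor count.

#include <stdio.h>

/* Speed-up for 'serial' unparallelizable additions plus 'par' parallelizable
 * additions spread over p processors (time measured in single additions). */
static double speedup(double serial, double par, int p) {
    return (serial + par) / (serial + par / p);
}

int main(void) {
    for (int p = 10; p <= 40; p *= 2) {
        /* Strong scaling: the 10 + 400 addition problem stays fixed. */
        printf("strong, p=%2d: speed-up %.1f\n", p, speedup(10, 400, p));
        /* Weak scaling: parallel work grows with p (10 additions each),
         * so the speed-up keeps growing roughly in proportion to p. */
        printf("weak,   p=%2d: speed-up %.1f\n", p, speedup(10, 10.0 * p, p));
    }
    return 0;
}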
*************************************************************************************

Flynn’s classification:
 In 1966, Michael Flynn proposed a classification for computer architectures based on the
number of instruction streams and data streams (Flynn's Taxonomy).
 SISD (Single Instruction stream, Single Data stream)
 SIMD (Single Instruction stream, Multiple Data streams)
 MISD (Multiple Instruction streams, Single Data stream)
 MIMD (Multiple Instruction streams, Multiple Data streams)

Instruction Stream and Data Stream:


 The term ‘stream’ refers to a sequence or flow of either instructions or data operated on by
the computer.
 In the complete cycle of instruction execution, a flow of instructions from main memory
to the CPU is established. This flow of instructions is called instruction stream.
 Similarly, there is a flow of operands between processor and memory bi-directionally. This
flow of operands is called data stream.


SISD (Single Instruction stream, Single Data stream):


 An SISD machine executes a single instruction on individual data values using a single
processor.
 Based on the traditional Von Neumann uniprocessor architecture, instructions are executed
sequentially or serially, one step after the next.
 Until recently, most computers were of the SISD type.
 Example: a conventional uniprocessor.

Figure: Single Instruction stream, Single Data stream

SIMD (Single Instruction stream, Multiple Data streams):


 An SIMD machine executes a single instruction on multiple data values simultaneously
using many processors.
 Since there is only one instruction stream, the processors do not each fetch and decode
instructions. Instead, a single control unit does the fetching and decoding for all processors.
 SIMD architectures include array processors.
 Data level parallelism:
 Parallelism achieved by performing the same operation on independent data.

Figure: Single Instruction stream, Multiple Data streams
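
Data-level parallelism is easiest to see in a loop that applies one operation to many independent
elements; a vectorizing compiler can map such a loop onto SIMD instructions such as SSE. A
minimal C sketch (the function and array names are our own):

#include <stddef.h>

/* One instruction stream, multiple data: every iteration is independent,
 * so several elements can be processed simultaneously. */
void vec_add(float *c, const float *a, const float *b, size_t n) {
    for (size_t i = 0; i < n; i++)
        c[i] = a[i] + b[i];   /* same operation on independent data */
}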

MISD (Multiple Instruction streams, Single Data stream):


 Each processor executes a different sequence of instructions, while the multiple
processing units all operate on one single data stream.
 Hardly any practical machines of this category have been built; it was included in the
taxonomy for the sake of completeness.
 Real-time computers need to be fault tolerant, so several processors execute the same
data and produce redundant results. This is also known as N-version programming.
 The redundant results are compared and should be the same; otherwise, the faulty
unit is replaced.
 Thus, MISD machines can be applied to fault-tolerant real-time computers.

Figure: Multiple Instruction streams, Single Data stream

MIMD (Multiple Instruction streams, Multiple Data streams):


 MIMD machines are usually referred to as multiprocessors or multicomputers.
 They may execute multiple instructions simultaneously, contrary to SIMD machines.
 Each processor includes its own control unit; the processors can be assigned parts of
one task or entirely separate tasks.
 MIMD has two subclasses: shared memory and distributed memory.

Figure: Multiple Instruction streams, Multiple Data streams
 When the processors communicate through global shared memory modules, the
organization is called a shared memory computer, or tightly coupled system.
 Similarly, when every processor in a multiprocessor system has its own local memory and the
processors communicate via messages transmitted between their local memories, the
organization is called a distributed memory computer, or loosely coupled system.
Hardware categorization:
 SSE (Streaming SIMD Extensions): the SIMD instruction extension of the x86 architecture.

Vector processors:
 A more elegant interpretation of SIMD is called a vector architecture.
 Vector architectures pipeline the ALU to get good performance at lower cost.
 The basic idea is to collect data elements from memory, put them in order into a large set of
registers, operate on them sequentially in registers using pipelined execution units, and
then write the results back to memory.

Figure: Structure of a vector unit containing four lanes
The vector-register storage is divided across the lanes, with each lane holding every fourth
element of each vector register. The above figure shows three vector functional units: an FP add,
an FP multiply, and a load-store unit. Each of the vector arithmetic units contains four execution
pipelines, one per lane, which act in concert to complete a single vector instruction. Note how
each section of the vector-register file only needs to provide enough read and write ports for
functional units local to its lane.
Vector lane:
 One or more vector functional units and a portion of the vector register file.
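
The lane idea can be sketched in plain C as a strip-mined loop: each group of LANES elements
corresponds to one element per lane, processed together. This is only an illustration of how work
maps onto lanes, not actual vector hardware code:

#define LANES 4   /* one element per lane, as in the four-lane figure */

/* DAXPY (y = y + a*x), LANES elements per step; in a real vector unit the
 * four inner operations would proceed in parallel, one in each lane's
 * execution pipeline. */
void daxpy(double *y, const double *x, double a, int n) {
    int i = 0;
    for (; i + LANES <= n; i += LANES)
        for (int lane = 0; lane < LANES; lane++)
            y[i + lane] += a * x[i + lane];
    for (; i < n; i++)        /* leftover elements (strip-mining) */
        y[i] += a * x[i];
}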
*************************************************************************************
Hardware Multithreading:
Basics:
 Thread: a lightweight process
 Instruction stream with state (registers and memory)
 Register state is also called “thread context”
 Threads could be part of the same process (program) or from different programs
 Threads in the same program share the same address space (shared memory model)
 Traditionally, the processor keeps track of the context of a single thread
 Multitasking: When a new thread needs to be executed, the old thread's context in
hardware is written back to memory and the new thread's context is loaded.
 Process: A process includes one or more threads, the address space, and the operating
system state.
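
The shared-address-space point can be demonstrated with POSIX threads: both threads below
update the same global variable, and a lock (defined formally later in this unit) keeps the updates
correct. A minimal sketch:

#include <pthread.h>
#include <stdio.h>

static long counter = 0;                  /* shared by all threads */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg) {
    (void)arg;
    for (int i = 0; i < 100000; i++) {
        pthread_mutex_lock(&lock);        /* one thread at a time */
        counter++;
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("counter = %ld\n", counter);   /* 200000: memory is shared */
    return 0;
}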

Hardware Multithreading:
 Increasing utilization of a processor by switching to another thread when one thread
is stalled.
 General idea: Have multiple thread contexts in a single processor
 How often the hardware switches among those hardware contexts determines the
granularity of multithreading
 Why?
 To tolerate latency (initial motivation)
 Latency of memory operations, dependent instructions, branch resolution
 By utilizing processing resources more efficiently
 To improve system throughput
 By exploiting thread-level parallelism
 By improving superscalar/ processor utilization
 To reduce context switch penalty
 Tolerate latency
 When one thread encounters a long-latency operation, the processor can execute a
useful operation from another thread
 Benefit
 + Latency tolerance
 + Better hardware utilization
 + Reduced context switch penalty
 Cost
 - Requires multiple thread contexts to be implemented in hardware (area, power,
latency cost)
 - Usually reduced single-thread performance
- Resource sharing, contention
- Switching penalty (can be reduced with additional hardware)

Types of Multithreading:
 Fine-grained Multithreading
 Cycle by cycle
 Coarse-grained Multithreading
 Switch on event (e.g., cache miss)
 Switch on quantum/timeout
 Simultaneous Multithreading (SMT)
 Instructions from multiple threads executed concurrently in the same cycle

Fine-grained Multithreading:
 Idea: Switch to another thread every cycle such that no two instructions from the same
thread are in the pipeline concurrently
 Improves pipeline utilization by taking advantage of multiple threads
 Alternative way of looking at it: Tolerates the control and data dependency latencies by
overlapping the latency with useful work from other threads

 Advantages
+ No need for dependency checking between instructions (only one instruction in pipeline
from a single thread)
+ No need for branch prediction logic
+ Otherwise-bubble cycles used for executing useful instructions from different threads
+ Improved system throughput, latency tolerance, utilization
 Disadvantages
- Extra hardware complexity: multiple hardware contexts, thread selection logic
- Reduced single thread performance (one instruction fetched every N cycles)
- Resource contention between threads in caches and memory
- Dependency checking logic between threads remains (load/store)

Coarse-grained Multithreading:
 A version of hardware multithreading that implies switching between threads only
after significant events, such as a last-level cache miss.
 Idea: When a thread is stalled due to some event, switch to a different hardware context
 Switch-on-event multithreading
 Possible stall events
 Cache misses
 Synchronization events (e.g., load an empty location)
 FP operations

Fine-grained vs. Coarse-grained MT


 Fine-grained advantages
+ Simpler to implement, can eliminate dependency checking, branch prediction logic
completely
+ Switching need not have any performance overhead (i.e., no dead cycles)
+ Coarse-grained requires a pipeline flush or a lot of hardware to save pipeline state
 Higher performance overhead with deep pipelines and large windows

 Disadvantages
- Low single thread performance: each thread gets 1/Nth of the bandwidth of the pipeline

Simultaneous Multithreading (SMT):


 A version of multithreading that lowers the cost of multithreading by utilizing the
resources needed for a multiple-issue, dynamically scheduled microarchitecture.
 Instructions from multiple threads issued on same cycle
 Uses register renaming and dynamic scheduling facility of multi-issue architecture
 Needs more hardware support
 Register files, PCs for each thread
 Temporary result registers before commit
 Support to sort out which threads get results from which instructions
 Maximizes utilization of execution units

Fig: Hardware Multithreading options
The four threads at the top show how each would execute running alone on a standard
superscalar processor without multithreading support. The three examples at the bottom show how
they would execute running together in three multithreading options. The horizontal dimension
represents the instruction issue capability in each clock cycle. The vertical dimension represents a
sequence of clock cycles. An empty (white) box indicates that the corresponding issue slot is
unused in that clock cycle. The shades of gray and color correspond to four different threads in the
multithreading processors.

Comparison of Multithreading options:

1. Thread Scheduling Policy
 Fine-grained: issues instructions from a different thread after every cycle, in a round-robin fashion.
 Coarse-grained: one thread runs until it is blocked by an event that normally would create a
long-latency stall.
 Simultaneous: instructions from multiple threads are issued in the same cycle.

2. Pipeline Partitioning
 Fine-grained: dynamic, no flush.
 Coarse-grained: none; flush on switch.
 Simultaneous: dynamic, no flush.

3. Efficiency
 Fine-grained: more efficient than coarse-grained multithreading.
 Coarse-grained: less efficient.
 Simultaneous: the most efficient of the three.

4. Required Threads
 Fine-grained: requires more threads to keep the processor busy.
 Coarse-grained: requires fewer threads to keep the processor busy.
 Simultaneous: requires more threads to keep the processor busy.

5. Hardware Complexity
 Fine-grained: extra hardware is required.
 Coarse-grained: no such extra hardware is required.
 Simultaneous: extra hardware is required.

6. Advantages
 Fine-grained: conceptually simple.
 Coarse-grained: simple implementation, low cost.
 Simultaneous: hides memory latency.

7. Disadvantages
 Fine-grained: very poor single-thread performance.
 Coarse-grained: not suitable for out-of-order execution.
 Simultaneous: increased conflicts in shared resources.

8. Example
 Fine-grained: Sun UltraSPARC T1 (Niagara 1).
 Coarse-grained: IBM Northstar/Pulsar.
 Simultaneous: Intel Pentium 4.

*************************************************************************************

Multicore processors:
 What is a Processor?
 A single chip package that fits in a socket.
 Cores can have functional units, cache, etc. associated with them
 The main goal of the multi-core design is to provide increased processing power in a single
chip.
 A multicore processor is a single computing component with two or more “independent”
processors (called "cores").
 The instructions are ordinary CPU instructions such as add, move data, and branch, but the
multiple cores can run multiple instructions at the same time, increasing overall speed for
programs amenable to parallel computing.
 Manufacturers typically integrate the cores onto a single integrated circuit die (known as a
chip multiprocessor or CMP), or onto multiple dies in a single chip package.
 Examples:
 dual-core processor with 2 cores
e.g. AMD Phenom II X2, Intel Core 2 Duo E8500

 quad-core processor with 4 cores
e.g. AMD Phenom II X4, Intel Core i5 2500T
 hexa-core processor with 6 cores
e.g. AMD Phenom II X6, Intel Core i7 Extreme Ed. 980X
 octa-core processor with 8 cores
e.g. AMD FX-8150, Intel Xeon E7-2820
Figure: Single-core processor

Figure: Multicore processor


 The cores run in parallel
 Within each core, threads are time-sliced (just like on a uniprocessor)
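
On a POSIX-style system, a program can ask how many cores are online; a minimal sketch
(note that _SC_NPROCESSORS_ONLN is a widely supported extension rather than strict POSIX):

#include <stdio.h>
#include <unistd.h>

int main(void) {
    /* Number of processor cores currently online. */
    long cores = sysconf(_SC_NPROCESSORS_ONLN);
    printf("online cores: %ld\n", cores);
    return 0;
}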

*************************************************************************************

Shared Memory Multiprocessors:
A shared memory multiprocessor (SMP) is one that offers the programmer a single
physical address space across all processors. Processors communicate through shared variables in
memory, with all processors capable of accessing any memory location via loads and stores.

Fig.: Classic organization of a shared memory multiprocessor


Single address space multiprocessors come in two styles.
 UMA Architecture: In the first style, the latency to a word in memory does not depend
on which processor asks for it. Such machines are called uniform memory access (UMA)
multiprocessors.
 NUMA/DSMA Architecture: In the second style, some memory accesses are much faster
than others, depending on which processor asks for which word, typically because main
memory is divided and attached to different microprocessors or to different memory
controllers on the same chip. Such machines are called nonuniform memory access
(NUMA) multiprocessors.

The shared-memory multiprocessors fall into two classes, depending on the number of
processors involved, which in turn dictates a memory organization and interconnect strategy.
They are:
1. Centralized shared memory (Uniform Memory Access)
2. Distributed shared memory (NonUniform Memory Access)
1. Centralized shared memory architecture
Centralized shared memory architectures share a single centralized memory and interconnect
the processors and memory by a bus. This style is also known as Uniform Memory Access (UMA)
or Symmetric (shared memory) Multiprocessor (SMP). With large caches, the bus and the single
memory, possibly with multiple banks, can satisfy the memory demands of a small number of
processors. By replacing a single bus with multiple buses, or even a switch, a centralized shared
memory design can be scaled to a few dozen processors. Although scaling beyond that is
technically possible, sharing a centralized memory, even organized as multiple banks, becomes
less attractive as the number of processors sharing it increases. The following figure illustrates the
basic structure of a centralized shared memory multiprocessor architecture.

Figure: Basic structure of a centralized shared-memory multiprocessor based on a
multicore chip
Because there is a single main memory that has a symmetric relationship to all processors and
a uniform access time from any processor, these multiprocessors are often called symmetric
shared-memory multiprocessor (SMP), and this style of architecture is sometimes called UMA
(Uniform Memory Access).

2. Distributed Shared Memory


The alternative design approach consists of multiprocessors with physically distributed
memory, called distributed shared memory (DSM). To support larger processor counts, memory
must be distributed among the processors rather than centralized; otherwise, the memory system
would not be able to support the bandwidth demands of a larger number of processors without
incurring excessively long access latency. The introduction of multicore processors has meant that
even two-chip multiprocessors use distributed memory. Of course, the larger number of processors
raises the need for a high-bandwidth interconnection. Both direct networks (i.e., switches) and
indirect networks (typically multidimensional meshes) are used. The typical distributed-memory
multiprocessor is illustrated in the following diagram.

Figure: The basic architecture of a distributed-memory multiprocessor

 Synchronization:
 The process of coordinating the behavior of two or more processes, which may be running
on different processors.
 Lock:
 A synchronization device that allows access to data to only one processor at a time.
 OpenMP:
 An API for shared memory multiprocessing in C, C++, or FORTRAN that runs on UNIX
and Microsoft platforms. It includes compiler directives, a library, and runtime directives.
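
As a taste of the directive style, here is a minimal OpenMP sketch in C (compile with -fopenmp
under GCC); the reduction clause gives each thread a private partial sum and combines them safely:

#include <stdio.h>
#include <omp.h>

int main(void) {
    double sum = 0.0;

    /* The directive splits the loop iterations across the available
     * threads; 'reduction' handles the synchronization on sum. */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 1; i <= 1000000; i++)
        sum += 1.0 / i;

    printf("threads available: %d, sum = %f\n",
           omp_get_max_threads(), sum);
    return 0;
}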

*************************************************************************************

