CA Unit IV Notes Part 1
Parallel processing challenges – Flynn's classification – SISD, MIMD, SIMD, SPMD, and Vector
Architectures - Hardware multithreading – Multi-core processors and other Shared Memory
Multiprocessors - Introduction to Graphics Processing Units, Clusters, Warehouse Scale
Computers and other Message-Passing Multiprocessors.
*************************************************************************************
Parallelism - Introduction:
Parallel processing executes machine instructions in parallel rather than one at a time.
Instruction-level-parallelism (ILP):
Pipelining exploits the potential parallelism among instructions.
This parallelism is called instruction-level parallelism (ILP).
Figure: Pipelining
parallel processing program:
A single program that runs on multiple processors simultaneously.
cluster :
A set of computers connected over a local area network that function as a single
large multiprocessor.
Used to solve large scientific problems.
Used in search engines, Web servers, email servers, and databases.
Multicore microprocessor:
A microprocessor containing multiple processors (“cores”) in a single integrated
circuit.
Shared Memory Multiprocessor (SMP):
A parallel processor with a single physical address space.
Speed-up Challenge:
Amdahl's Law says

Execution time after improvement =
    (Execution time affected by improvement / Amount of improvement)
    + Execution time unaffected

We can reformulate Amdahl's Law in terms of speed-up versus the original execution time:

Speed-up = Execution time before / Execution time after improvement

This formula is usually rewritten assuming that the execution time before is 1 for some unit
of time, and the execution time affected by improvement is considered the fraction of the
original execution time:

Speed-up = 1 / ((1 − Fraction time affected) + Fraction time affected / Amount of improvement)
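The rewritten form can be sketched as a small helper (a minimal illustration added here; the function name and sample numbers are not from the original notes):

```python
# Amdahl's Law in its rewritten form:
#   speed-up = 1 / ((1 - f) + f / n)
# where f is the fraction of the original execution time that benefits
# from the improvement and n is the amount of improvement.
def amdahl_speedup(f, n):
    return 1.0 / ((1.0 - f) + f / n)

# e.g., speeding up 80% of the program by 100x yields only about 4.8x overall:
print(round(amdahl_speedup(0.80, 100), 2))
```

Even a large improvement to most of a program yields modest overall speed-up, because the unaffected part dominates.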
Problem 1
Q: Suppose you want to achieve a speed-up of 90 times faster with 100 processors. What percentage
of the original computation can be sequential?
Solution:
Substituting 90 for speed-up and 100 for amount of improvement into the formula above:
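The substitution can be sketched as follows, solving the rewritten form of Amdahl's Law for the parallel fraction f (variable names here are illustrative):

```python
# Speed-up = 1 / ((1 - f) + f/n), with speed-up s = 90 and n = 100 processors.
# Solving for f, the fraction that can be parallelized:
#   (1 - f) + f/n = 1/s   ->   f = (1 - 1/s) / (1 - 1/n)
s, n = 90, 100
f = (1 - 1 / s) / (1 - 1 / n)    # parallelizable fraction
sequential = 1 - f               # fraction that may remain sequential
print(f"sequential fraction = {sequential:.4%}")
```

So to reach a 90x speed-up on 100 processors, only about 0.1% of the original computation can be sequential.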
Problem 2
Problem 3
To achieve the speed-up of 20.5 on the previous larger problem with 40 processors, we assumed the load
was perfectly balanced. That is, each of the 40 processors had 2.5% of the work to do. Instead, show the
impact on speed-up if one processor’s load is higher than all the rest. Calculate at twice the load (5%) and
five times the load (12.5%) for that hardest working processor. How well utilized are the rest of the
processors?
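A sketch of the calculation, assuming (as the balanced 20.5x figure implies) a workload of 10 sequential additions plus 400 perfectly parallelizable additions on 40 processors:

```python
# Assumed workload: 410 additions total, of which 10 are sequential and
# 400 are spread over 40 processors. Balanced: each processor does 10
# additions (2.5% of the parallel work), giving 410/20 = 20.5x speed-up.
TOTAL, SEQ, PAR, P = 410, 10, 400, 40

def speedup(busiest_share):
    # the most heavily loaded processor sets the pace of the parallel phase
    return TOTAL / (SEQ + PAR * busiest_share)

def utilization(busiest_share):
    # average busy fraction of the other 39 processors while the busiest
    # one finishes its share
    return (PAR * (1 - busiest_share) / (P - 1)) / (PAR * busiest_share)

balanced = speedup(1 / P)      # 2.5% each  -> 20.5x
twice    = speedup(0.05)       # 5% load    -> ~13.7x
five     = speedup(0.125)      # 12.5% load -> ~6.8x
```

With twice the load on one processor the other 39 sit idle more than half the parallel phase; at five times the load they are utilized less than 20% of the time.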
strong scaling:
Speedup achieved on a multiprocessor without increasing the size of the problem.
weak scaling:
Speedup achieved on a multiprocessor while increasing the size of the problem
proportionally to the increase in the number of processors.
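A tiny numeric sketch contrasts the two notions (the workload numbers are illustrative, not from the original notes):

```python
# Assumed workload: 10 time units of sequential work plus 100 units of
# perfectly parallel work.
def time_strong(p):
    # strong scaling: problem size stays fixed as p grows
    return 10 + 100 / p

def time_weak(p):
    # weak scaling: parallel work grows to 100*p units as p grows
    return 10 + (100 * p) / p

speedup_strong = time_strong(1) / time_strong(10)   # 110 / 20 = 5.5
# under weak scaling, run time stays flat while the problem grows:
flat = (time_weak(1) == time_weak(10) == 110)
```

Strong scaling is limited by the fixed sequential part; weak scaling keeps run time roughly constant by growing the problem with the machine.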
*************************************************************************************
Flynn’s classification:
In 1966, Michael Flynn proposed a classification for computer architectures based on the
number of instruction streams and data streams (Flynn's Taxonomy).
SISD (Single Instruction stream, Single Data stream)
SIMD (Single Instruction stream, Multiple Data streams)
MISD (Multiple Instruction streams, Single Data stream)
MIMD (Multiple Instruction streams, Multiple Data streams)
Figure: Single Instruction stream, Multiple Data streams
Figure: Multiple Instruction streams, Multiple Data streams
When multiprocessors communicate through global shared memory modules, the
organization is called a shared memory computer or tightly coupled system.
Similarly, when every processor in a multiprocessor system has its own local memory and the
processors communicate via messages transmitted between their local memories, the
organization is called a distributed memory computer or loosely coupled system.
Hardware categorization:
Vector processors:
A more elegant interpretation of SIMD is called a vector architecture.
Vector architectures pipeline the ALU to get good performance at lower cost.
The idea is to collect data elements from memory, put them in order into a large set of
registers, operate on them sequentially in registers using pipelined execution units, and
then write the results back to memory.
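The classic textbook vector kernel is DAXPY (Y = a·X + Y); this pure-Python sketch only mimics the element-wise style a vector unit would stream through its pipelined ALU:

```python
# DAXPY: Y = a*X + Y, element by element.
def daxpy(a, x, y):
    # every element is independent, so the multiply-adds can be pipelined
    # or spread across vector lanes (four at a time in the figure below)
    return [a * xi + yi for xi, yi in zip(x, y)]

print(daxpy(2.0, [0.0, 1.0, 2.0, 3.0], [1.0, 1.0, 1.0, 1.0]))
# [1.0, 3.0, 5.0, 7.0]
```

On a real vector machine this loop is a handful of vector instructions (vector load, vector multiply-add, vector store) rather than one scalar instruction per element.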
Figure: Structure of a vector unit containing four lanes
The vector-register storage is divided across the lanes, with each lane holding every fourth
element of each vector register. The above figure shows three vector functional units: an FP add,
an FP multiply, and a load-store unit. Each of the vector arithmetic units contains four execution
pipelines, one per lane, which act in concert to complete a single vector instruction. Note how
each section of the vector-register file only needs to provide enough read and write ports for
functional units local to its lane.
Vector lane:
One or more vector functional units and a portion of the vector register file.
*************************************************************************************
Hardware Multithreading:
Basics:
Thread: a lightweight process
Instruction stream with state (registers and memory)
Register state is also called “thread context”
Threads could be part of the same process (program) or from different programs
Threads in the same program share the same address space (shared memory model)
Traditionally, the processor keeps track of the context of a single thread
Multitasking: When a new thread needs to be executed, old thread’s context in hardware
written back to memory and new thread’s context loaded
Process: A process includes one or more threads, the address space, and the operating
system state.
Hardware Multithreading:
Increasing utilization of a processor by switching to another thread when one thread
is stalled.
General idea: Have multiple thread contexts in a single processor
How often the hardware switches among those hardware contexts determines the
granularity of multithreading
Why?
To tolerate latency (initial motivation)
Latency of memory operations, dependent instructions, branch resolution
By utilizing processing resources more efficiently
To improve system throughput
By exploiting thread-level parallelism
By improving superscalar processor utilization
To reduce context switch penalty
Tolerate latency
When one thread encounters a long-latency operation, the processor can execute a
useful operation from another thread
Benefit
+ Latency tolerance
+ Better hardware utilization (when?)
+ Reduced context switch penalty
Cost
- Requires multiple thread contexts to be implemented in hardware (area, power,
latency cost)
- Usually reduced single-thread performance
- Resource sharing, contention
- Switching penalty (can be reduced with additional hardware)
Types of Multithreading:
Fine-grained Multithreading
Cycle by cycle
Coarse-grained Multithreading
Switch on event (e.g., cache miss)
Switch on quantum/timeout
Simultaneous Multithreading (SMT)
Instructions from multiple threads executed concurrently in the same cycle
Fine-grained Multithreading:
Idea: Switch to another thread every cycle such that no two instructions from the same
thread are in the pipeline concurrently
Improves pipeline utilization by taking advantage of multiple threads
Alternative way of looking at it: Tolerates the control and data dependency latencies by
overlapping the latency with useful work from other threads
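The cycle-by-cycle interleaving can be shown with a toy round-robin scheduler (a simplified single-issue model with illustrative thread and instruction names):

```python
from collections import deque

# Each thread has a list of instructions; issue one instruction from a
# different thread every cycle, round robin.
threads = {"T0": ["i0", "i1"], "T1": ["j0", "j1"], "T2": ["k0", "k1"]}
ready = deque(threads)
schedule = []
while ready:
    t = ready.popleft()
    schedule.append((t, threads[t].pop(0)))  # issue next instruction of t
    if threads[t]:
        ready.append(t)                      # rotate thread to the back

# schedule interleaves the threads cycle by cycle:
# [('T0','i0'), ('T1','j0'), ('T2','k0'), ('T0','i1'), ('T1','j1'), ('T2','k1')]
```

Because consecutive cycles always come from different threads, no two instructions from the same thread are in flight together, which is what removes intra-thread dependency checking.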
Advantages
+ No need for dependency checking between instructions (only one instruction in pipeline
from a single thread)
+ No need for branch prediction logic
+ Otherwise-bubble cycles used for executing useful instructions from different threads
+ Improved system throughput, latency tolerance, utilization
Disadvantages
- Extra hardware complexity: multiple hardware contexts, thread selection logic
- Reduced single thread performance (one instruction fetched every N cycles)
- Resource contention between threads in caches and memory
- Dependency checking logic between threads remains (load/store)
Coarse-grained Multithreading:
A version of hardware multithreading that implies switching between threads only
after significant events, such as a last-level cache miss.
Idea: When a thread is stalled due to some event, switch to a different hardware context
Switch-on-event multithreading
Possible stall events
Cache misses
Synchronization events (e.g., load an empty location)
FP operations
Disadvantages
- Low single thread performance: each thread gets 1/Nth of the bandwidth of the pipeline
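The switch-on-event behavior can be sketched with a toy model (thread and operation names are illustrative; "MISS" stands in for a long-latency stall event):

```python
# Run one thread until it hits a long-latency event, then switch context.
def run_coarse_grained(threads):
    # threads: dict name -> list of ops; "MISS" models a cache miss
    trace, ready = [], list(threads)
    while ready:
        t = ready.pop(0)
        while threads[t]:
            op = threads[t].pop(0)
            trace.append((t, op))
            if op == "MISS":          # switch on event
                if threads[t]:
                    ready.append(t)   # resume this thread later
                break
    return trace

trace = run_coarse_grained({"T0": ["a", "MISS", "b"], "T1": ["x", "y"]})
# T0 runs until its miss, T1 runs to completion, then T0 resumes:
# [('T0','a'), ('T0','MISS'), ('T1','x'), ('T1','y'), ('T0','b')]
```

Unlike fine-grained multithreading, a thread keeps the pipeline to itself between events, so single-thread performance is preserved while it runs.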
Fig: Hardware Multithreading options
The four threads at the top show how each would execute running alone on a standard
superscalar processor without multithreading support. The three examples at the bottom show how
they would execute running together in three multithreading options. The horizontal dimension
represents the instruction issue capability in each clock cycle. The vertical dimension represents a
sequence of clock cycles. An empty (white) box indicates that the corresponding issue slot is
unused in that clock cycle. The shades of gray and color correspond to four different threads in the
multithreading processors.
Comparison of Multithreading options:
1. Thread Scheduling Policy:
   - Fine-grained: issues instructions from a different thread after every cycle, in a round-robin fashion.
   - Coarse-grained: one thread runs until it is blocked by an event that normally would create a long-latency stall.
   - Simultaneous: instructions from multiple threads are issued in the same cycle, in a round-robin fashion.
2. Pipeline Partitioning:
   - Fine-grained: dynamic, no flush.
   - Coarse-grained: none; flush on switch.
   - Simultaneous: dynamic, no flush.
3. Efficiency:
   - Fine-grained: more efficient than coarse-grained multithreading.
   - Coarse-grained: less efficient.
   - Simultaneous: more efficient than the other two.
4. Required Threads:
   - Fine-grained: requires more threads to keep the processor busy.
   - Coarse-grained: requires fewer threads to keep the processor busy.
   - Simultaneous: requires more threads to keep the processor busy.
5. Hardware Complexity:
   - Fine-grained: extra hardware is required.
   - Coarse-grained: no such extra hardware is required.
   - Simultaneous: extra hardware is required.
6. Advantages:
   - Fine-grained: simple implementation, low cost.
   - Coarse-grained: conceptually simple.
   - Simultaneous: hides memory latency.
7. Disadvantages:
   - Fine-grained: very poor single-thread performance.
   - Coarse-grained: not suitable for out-of-order execution.
   - Simultaneous: increased conflicts in shared resources.
8. Example:
   - Fine-grained: Sun UltraSPARC T1 (Niagara 1).
   - Coarse-grained: IBM Northstar/Pulsar.
   - Simultaneous: Intel Pentium 4.
*************************************************************************************
Multicore processors:
What is a Processor?
A single chip package that fits in a socket.
Cores can have functional units, cache, etc. associated with them
The main goal of multicore design is to increase processing power by providing multiple
computing units on one chip.
A multicore processor is a single computing component with two or more “independent”
processors (called "cores").
The instructions are ordinary CPU instructions such as add, move data, and branch, but the
multiple cores can run multiple instructions at the same time, increasing overall speed for
programs amenable to parallel computing.
Manufacturers typically integrate the cores onto a single integrated circuit die (known as a
chip multiprocessor or CMP), or onto multiple dies in a single chip package.
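As a minimal illustration of the programming model (function and variable names are illustrative; Python threads demonstrate concurrent execution of independent tasks, though CPU-bound Python code would normally use processes to occupy several cores):

```python
from concurrent.futures import ThreadPoolExecutor

# Independent tasks with no shared state can be spread across workers,
# just as a program amenable to parallel computing spreads across cores.
def square(n):
    return n * n

with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(square, range(8)))   # tasks run concurrently

print(results)  # [0, 1, 4, 9, 16, 25, 36, 49]
```

The speed-up from multiple cores only materializes for programs whose work can be divided this way; a serial program gains nothing from extra cores.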
Examples:
dual-core processor with 2 cores
e.g. AMD Phenom II X2, Intel Core 2 Duo E8500
quad-core processor with 4 cores
e.g. AMD Phenom II X4, Intel Core i5 2500T)
hexa-core processor with 6 cores
e.g. AMD Phenom II X6, Intel Core i7 Extreme Ed. 980X
octa-core processor with 8 cores
e.g. AMD FX-8150, Intel Xeon E7-2820
Figure: Single-core versus multicore processor organization
*************************************************************************************
Shared Memory Multiprocessors:
A shared memory multiprocessor (SMP) is one that offers the programmer a single
physical address space across all processors. Processors communicate through shared variables in
memory, with all processors capable of accessing any memory location via loads and stores.
The shared-memory multiprocessors fall into two classes, depending on the number of
processors involved, which in turn dictates a memory organization and interconnect strategy.
They are:
1. Centralized shared memory (Uniform Memory Access)
2. Distributed shared memory (NonUniform Memory Access)
1. Centralized shared memory architecture
Centralized shared memory architectures share a single centralized memory, interconnecting
processors and memory by a bus. This is also known as Uniform Memory Access (UMA) or
Symmetric (shared memory) Multiprocessor (SMP). With large caches, the bus and the single
memory, possibly with multiple banks, can satisfy the memory demands of a small number of
processors. By replacing a single bus with multiple buses, or even a switch, a centralized shared
memory design can be scaled to a few dozen processors. Although scaling beyond that is
technically possible, sharing a centralized memory, even organized as multiple banks, becomes
less attractive as the number of processors sharing it increases. The following figure illustrates the
basic structure of a centralized shared memory multiprocessor architecture.
Figure: Basic structure of a centralized shared-memory multiprocessor based on a
multicore chip
Because there is a single main memory that has a symmetric relationship to all processors and
a uniform access time from any processor, these multiprocessors are often called symmetric
shared-memory multiprocessors (SMP), and this style of architecture is sometimes called UMA
(Uniform Memory Access).
Synchronization:
The process of coordinating the behavior of two or more processes, which may be running
on different processors.
Lock:
A synchronization device that allows access to data to only one processor at a time.
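A minimal sketch of a lock protecting a shared variable, using Python's threading module purely as the illustration vehicle:

```python
import threading

# Two threads communicate through a shared variable; the lock ensures
# only one thread updates it at a time, so no increments are lost.
counter = 0
lock = threading.Lock()

def add(n):
    global counter
    for _ in range(n):
        with lock:            # acquire -> critical section -> release
            counter += 1

t1 = threading.Thread(target=add, args=(10000,))
t2 = threading.Thread(target=add, args=(10000,))
t1.start(); t2.start()
t1.join(); t2.join()
print(counter)  # 20000
```

Without the lock, the read-modify-write of `counter += 1` could interleave between threads and drop updates; this is exactly the coordination problem synchronization solves.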
OpenMP:
An API for shared memory multiprocessing in C, C++, or FORTRAN that runs on UNIX
and Microsoft platforms. It includes compiler directives, a library, and runtime directives.
*************************************************************************************