THREAD LEVEL PARALLELISM

NEED FOR MULTIPROCESSORS

 The importance of multiprocessors grew as designers found a way to build servers and supercomputers that achieved higher performance than a single microprocessor
 While exploiting the cost-performance advantages of commodity microprocessors
 Slowdown in uniprocessor performance arising from
 Diminishing returns in exploiting instruction-level parallelism (ILP), combined with growing concern over power
 Leading to a new era in computer architecture, where multiprocessors play a major role from the low end to the high end
FACTORS REFLECTING THE IMPORTANCE OF MULTIPROCESSING
 Finding and exploiting more ILP turned out to be inefficient, since power and silicon costs grew faster than performance
 Other than ILP, the only scalable and general-purpose way to increase performance is through multiprocessing
 A growing interest in high-end servers
 A growth in data-intensive applications
 Increasing performance on the desktop is less important, as highly compute- and data-intensive applications are being done in the cloud
 An improved understanding of how to use multiprocessors effectively
 The advantages of leveraging a design investment by replication rather than unique design
MULTIPROCESSOR
 Thread-level parallelism (TLP) implies the existence of multiple program counters and is exploited through MIMD (multiple instruction, multiple data) processors
 Multiprocessors
 Computers consisting of tightly coupled processors
 Coordination and usage controlled by a single operating system
 Share memory through a shared address space
 Multiprocessing exploits TLP in two different software models
 Parallel processing: execution of a tightly coupled set of threads collaborating on a single task (see the sketch after this slide)
 Request-level parallelism: execution of multiple, relatively independent processes that originate from one or more users
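As a concrete illustration of the parallel-processing model above, the following is a minimal sketch, not part of the original slides: a tightly coupled set of POSIX threads collaborates on a single task, each summing one slice of an array while the parent thread combines the partial results. The thread count, array size, and function names are illustrative assumptions only.

/* Minimal TLP sketch: a few threads collaborate on one task (summing  */
/* an array).  Thread count, array size, and names are illustrative.   */
#include <pthread.h>
#include <stdio.h>

#define N_THREADS 4
#define N_ELEMS   1000000

static double data[N_ELEMS];
static double partial[N_THREADS];

/* Each thread sums its contiguous slice of the array. */
static void *sum_slice(void *arg) {
    long id    = (long)arg;
    long chunk = N_ELEMS / N_THREADS;
    long begin = id * chunk;
    long end   = (id == N_THREADS - 1) ? N_ELEMS : begin + chunk;
    double s = 0.0;
    for (long i = begin; i < end; i++)
        s += data[i];
    partial[id] = s;
    return NULL;
}

int main(void) {
    pthread_t tid[N_THREADS];
    for (long i = 0; i < N_ELEMS; i++)
        data[i] = 1.0;                      /* sample data */

    /* One software thread per (assumed) processor; each has its own PC. */
    for (long t = 0; t < N_THREADS; t++)
        pthread_create(&tid[t], NULL, sum_slice, (void *)t);

    double total = 0.0;
    for (long t = 0; t < N_THREADS; t++) {  /* join threads, combine results */
        pthread_join(tid[t], NULL);
        total += partial[t];
    }
    printf("total = %.0f\n", total);        /* expect 1000000 */
    return 0;
}

Request-level parallelism, by contrast, would simply run many such independent programs (or serve independent user requests) at once, with little or no coordination between them.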
MULTIPROCESSOR
 Multiprocessors typically have from two to dozens of processors
 Communicate and coordinate through the sharing of memory
 Such multiprocessors include both
 single-chip systems with multiple cores
 multiple chips, each of which may be a multicore design

MULTIPROCESSOR ARCHITECTURE
 To take advantage of an MIMD multiprocessor with n processors,
we must usually have at least n threads or processes to execute
 Independent threads within a single process are typically
identified by the programmer or created by the OS
 Grain size
 The amount of computation assigned to a thread
 Important in considering how to exploit TLP efficiently
 Threads consist of hundreds to millions of instructions that may
be executed in parallel

THREADS AND DLP
 Threads can also be used to exploit data-level parallelism (DLP)
 The overhead, however, is likely to be higher than with a SIMD processor or a GPU
 The grain size must be sufficiently large to exploit the parallelism efficiently
 When the parallelism is split among many threads, the grain size may become so small that the overhead makes exploiting the parallelism prohibitively expensive on an MIMD (a sketch follows this slide)

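To make the grain-size trade-off concrete, here is a sketch that is not from the slides; the array size, the grain values, and the run_with_grain helper are illustrative assumptions. The same data-parallel loop is run once with a large grain and once with a small grain; the elapsed time shows how thread create/join overhead grows as the grain shrinks, which is the MIMD overhead the slide contrasts with SIMD or GPU execution.

/* Grain-size sketch (illustrative): the same data-parallel loop is     */
/* split into chunks of "grain" elements, one thread per chunk.  With a */
/* tiny grain, thread create/join overhead dominates the useful work.   */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N_ELEMS 1000000
static float x[N_ELEMS], y[N_ELEMS];

struct chunk { long begin, end; };

/* y[i] = 2 * x[i] over one chunk: the work assigned to a single thread. */
static void *scale_chunk(void *arg) {
    struct chunk *c = arg;
    for (long i = c->begin; i < c->end; i++)
        y[i] = 2.0f * x[i];
    return NULL;
}

/* Run the whole loop with the given grain size and report elapsed time. */
static void run_with_grain(long grain) {
    long n_chunks = (N_ELEMS + grain - 1) / grain;
    pthread_t    *tid = malloc(n_chunks * sizeof *tid);
    struct chunk *chk = malloc(n_chunks * sizeof *chk);
    struct timespec t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long c = 0; c < n_chunks; c++) {
        chk[c].begin = c * grain;
        chk[c].end   = (c + 1) * grain < N_ELEMS ? (c + 1) * grain : N_ELEMS;
        pthread_create(&tid[c], NULL, scale_chunk, &chk[c]);
    }
    for (long c = 0; c < n_chunks; c++)
        pthread_join(tid[c], NULL);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ms = (t1.tv_sec - t0.tv_sec) * 1e3 + (t1.tv_nsec - t0.tv_nsec) / 1e6;
    printf("grain %7ld -> %5ld threads, %8.2f ms\n", grain, n_chunks, ms);
    free(tid);
    free(chk);
}

int main(void) {
    for (long i = 0; i < N_ELEMS; i++)
        x[i] = (float)i;
    run_with_grain(250000);  /* few large grains: thread overhead negligible   */
    run_with_grain(1000);    /* many tiny grains: create/join overhead dominates */
    return 0;
}

Compiled with cc -O2 -pthread, the second call typically spends far more time managing threads than doing arithmetic, which is why too small a grain makes MIMD exploitation of DLP prohibitively expensive.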
CLASSES OF SHARED MEMORY MULTIPROCESSORS
 Classified based on the number of processors involved, which in turn dictates the memory organization and interconnect strategy
 Symmetric (shared-memory) multiprocessors (SMPs), also called centralized shared-memory multiprocessors
 Distributed shared memory (DSM) multiprocessors
 In an SMP:
 Small numbers of cores, typically eight or fewer
 Possible for the processors to share a single centralized memory to which all processors have equal access
 In multicore chips, the memory is effectively shared in a centralized fashion among the cores, and all existing multicores are SMPs
 SMP architectures are also sometimes called uniform memory access (UMA) multiprocessors
UMA
 Multiple processor–cache subsystems share the same physical memory, typically with one level of shared cache and one or more levels of private per-core cache
 The key architectural property is the uniform access time to all of the memory from all of the processors
DSM
 Multiprocessor with physically distributed memory
 Distributing the memory among the nodes both increases the bandwidth and reduces the latency to local memory
 Also called NUMA (nonuniform memory access), since the access time depends on the location of a data word in memory
CHALLENGES OF PARALLEL PROCESSING
 The application of multiprocessors ranges from running independent tasks with essentially no communication to running parallel programs where threads must communicate to complete the task
 Two important hurdles make parallel processing challenging
 The first is the limited parallelism available in programs
 The second arises from the relatively high cost of communication
 Limitations in available parallelism make it difficult to achieve good speedups in any parallel processor

 Suppose you want to achieve a speedup of 80 with 100 processors
 What fraction of the original computation can be sequential? (worked out below)
 Assume that the program operates in only two modes:
 Parallel with all processors fully used (enhanced mode), or
 Serial with only one processor in use
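The step between the question above and the answer below can be filled in with Amdahl's Law; this derivation is a reconstruction and was not part of the extracted slide text:

 Speedup = 1 / (Fraction_parallel / Speedup_parallel + (1 - Fraction_parallel))
 80 = 1 / (Fraction_parallel / 100 + (1 - Fraction_parallel))
 Multiplying through: 0.8 × Fraction_parallel + 80 × (1 - Fraction_parallel) = 1
 80 - 79.2 × Fraction_parallel = 1, so Fraction_parallel = 79 / 79.2 = 0.9975
 Sequential fraction = 1 - 0.9975 = 0.0025, i.e., 0.25%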
 To achieve a speedup of 80 with 100 processors, only 0.25% of the
original computation can be sequential.
