THREAD LEVEL PARALLELISM

NEED FOR MULTIPROCESSORS

 The importance of multiprocessors grew as designers found a way to build servers and supercomputers that achieved higher performance than a single microprocessor
 While exploiting the cost-performance advantages of commodity microprocessors
 Slowdown in uniprocessor performance arising from
 Diminishing returns in exploiting instruction-level parallelism (ILP), combined with growing concern over power
 Leading to a new era in computer architecture, where multiprocessors play a major role from the low end to the high end
FACTORS REFLECTING THE IMPORTANCE OF MULTIPROCESSING
 Finding and exploiting more ILP turned out to be inefficient, since power and silicon costs grew faster than performance
 Other than ILP, the only scalable and general-purpose way to increase performance is through multiprocessing
 A growing interest in high-end servers
 A growth in data-intensive applications
 Increasing performance on the desktop is less important, as highly compute- and data-intensive applications are being done in the cloud
 An improved understanding of how to use multiprocessors effectively
 The advantages of leveraging a design investment by replication rather than unique design
MULTIPROCESSOR
 Thread-level parallelism (TLP) implies the existence of multiple program counters and is exploited through MIMD (multiple instruction, multiple data) processors
 Multiprocessors
 Computers consisting of tightly coupled processors
 Coordination and usage controlled by a single operating system
 Share memory through a shared address space
 Multiprocessing exploits TLP in two different software models
 Parallel processing: execution of a tightly coupled set of threads collaborating on a single task (see the sketch after this slide)
 Request-level parallelism: execution of multiple, relatively independent processes that originate from one or more users
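As a concrete illustration of the parallel-processing model above, the following is a minimal sketch, not part of the original slides: a tightly coupled set of POSIX threads collaborates on a single task, each summing one slice of an array while the parent thread combines the partial results. The thread count, array size, and function names are illustrative assumptions only.

/* Minimal TLP sketch: a few threads collaborate on one task (summing  */
/* an array).  Thread count, array size, and names are illustrative.   */
#include <pthread.h>
#include <stdio.h>

#define N_THREADS 4
#define N_ELEMS   1000000

static double data[N_ELEMS];
static double partial[N_THREADS];

/* Each thread sums its contiguous slice of the array. */
static void *sum_slice(void *arg) {
    long id    = (long)arg;
    long chunk = N_ELEMS / N_THREADS;
    long begin = id * chunk;
    long end   = (id == N_THREADS - 1) ? N_ELEMS : begin + chunk;
    double s = 0.0;
    for (long i = begin; i < end; i++)
        s += data[i];
    partial[id] = s;
    return NULL;
}

int main(void) {
    pthread_t tid[N_THREADS];
    for (long i = 0; i < N_ELEMS; i++)
        data[i] = 1.0;                      /* sample data */

    /* One software thread per (assumed) processor; each has its own PC. */
    for (long t = 0; t < N_THREADS; t++)
        pthread_create(&tid[t], NULL, sum_slice, (void *)t);

    double total = 0.0;
    for (long t = 0; t < N_THREADS; t++) {  /* join threads, combine results */
        pthread_join(tid[t], NULL);
        total += partial[t];
    }
    printf("total = %.0f\n", total);        /* expect 1000000 */
    return 0;
}

Request-level parallelism, by contrast, would simply run many such independent programs (or serve independent user requests) at once, with little or no coordination between them.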
MULTIPROCESSOR
 Multiprocessors typically have from two to dozens of processors
 Communicate and coordinate through the sharing of memory
 Such multiprocessors include both
 single-chip systems with multiple cores
 multiple chips, each of which may be a multicore design

MULTIPROCESSOR ARCHITECTURE
 To take advantage of an MIMD multiprocessor with n processors,
we must usually have at least n threads or processes to execute
 Independent threads within a single process are typically
identified by the programmer or created by the OS
 Grain size
 The amount of computation assigned to a thread
 Important in considering how to exploit TLP efficiently
 Threads consist of hundreds to millions of instructions that may
be executed in parallel

THREADS AND DLP
 Threads can also be used to exploit data-level parallelism (DLP)
 The overhead, however, is likely to be higher than with a SIMD processor or a GPU
 The grain size must be sufficiently large to exploit the parallelism efficiently
 When the parallelism is split among many threads, the grain size may become so small that the overhead makes exploiting the parallelism prohibitively expensive on an MIMD (a sketch follows this slide)

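To make the grain-size trade-off concrete, here is a sketch that is not from the slides; the array size, the grain values, and the run_with_grain helper are illustrative assumptions. The same data-parallel loop is run once with a large grain and once with a small grain; the elapsed time shows how thread create/join overhead grows as the grain shrinks, which is the MIMD overhead the slide contrasts with SIMD or GPU execution.

/* Grain-size sketch (illustrative): the same data-parallel loop is     */
/* split into chunks of "grain" elements, one thread per chunk.  With a */
/* tiny grain, thread create/join overhead dominates the useful work.   */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N_ELEMS 1000000
static float x[N_ELEMS], y[N_ELEMS];

struct chunk { long begin, end; };

/* y[i] = 2 * x[i] over one chunk: the work assigned to a single thread. */
static void *scale_chunk(void *arg) {
    struct chunk *c = arg;
    for (long i = c->begin; i < c->end; i++)
        y[i] = 2.0f * x[i];
    return NULL;
}

/* Run the whole loop with the given grain size and report elapsed time. */
static void run_with_grain(long grain) {
    long n_chunks = (N_ELEMS + grain - 1) / grain;
    pthread_t    *tid = malloc(n_chunks * sizeof *tid);
    struct chunk *chk = malloc(n_chunks * sizeof *chk);
    struct timespec t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long c = 0; c < n_chunks; c++) {
        chk[c].begin = c * grain;
        chk[c].end   = (c + 1) * grain < N_ELEMS ? (c + 1) * grain : N_ELEMS;
        pthread_create(&tid[c], NULL, scale_chunk, &chk[c]);
    }
    for (long c = 0; c < n_chunks; c++)
        pthread_join(tid[c], NULL);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ms = (t1.tv_sec - t0.tv_sec) * 1e3 + (t1.tv_nsec - t0.tv_nsec) / 1e6;
    printf("grain %7ld -> %5ld threads, %8.2f ms\n", grain, n_chunks, ms);
    free(tid);
    free(chk);
}

int main(void) {
    for (long i = 0; i < N_ELEMS; i++)
        x[i] = (float)i;
    run_with_grain(250000);  /* few large grains: thread overhead negligible   */
    run_with_grain(1000);    /* many tiny grains: create/join overhead dominates */
    return 0;
}

Compiled with cc -O2 -pthread, the second call typically spends far more time managing threads than doing arithmetic, which is why too small a grain makes MIMD exploitation of DLP prohibitively expensive.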
CLASSES OF SHARED MEMORY MULTIPROCESSORS
 Classified based on the number of processors involved, which in turn dictates the memory organization and interconnect strategy
 Symmetric (shared-memory) multiprocessors (SMPs), also called centralized shared-memory multiprocessors
 Distributed shared memory (DSM) multiprocessors
 In an SMP:
 Small numbers of cores, typically eight or fewer
 Possible for the processors to share a single centralized memory to which all processors have equal access
 In multicore chips, the memory is effectively shared in a centralized fashion among the cores, and all existing multicores are SMPs
 SMP architectures are also sometimes called uniform memory access (UMA) multiprocessors
UMA
 Multiple processor–cache subsystems share the same physical memory, typically with one level of shared cache and one or more levels of private per-core cache
 The key architectural property is the uniform access time to all of the memory from all of the processors
DSM
 Multiprocessor with physically distributed memory
 Distributing the memory among the nodes both increases the bandwidth and reduces the latency to local memory
 Also called NUMA (nonuniform memory access), since the access time depends on the location of a data word in memory
CHALLENGES OF PARALLEL PROCESSING
 The application of multiprocessors ranges from running independent tasks with essentially no communication to running parallel programs where threads must communicate to complete the task
 Two important hurdles make parallel processing challenging
 The first is the limited parallelism available in programs
 The second arises from the relatively high cost of communication
 Limitations in available parallelism make it difficult to achieve good speedups in any parallel processor

 Suppose you want to achieve a speedup of 80 with 100 processors
 What fraction of the original computation can be sequential? (worked out below)
 Assume that the program operates in only two modes:
 Parallel with all processors fully used (enhanced mode), or
 Serial with only one processor in use
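The step between the question above and the answer below can be filled in with Amdahl's Law; this derivation is a reconstruction and was not part of the extracted slide text:

 Speedup = 1 / (Fraction_parallel / Speedup_parallel + (1 - Fraction_parallel))
 80 = 1 / (Fraction_parallel / 100 + (1 - Fraction_parallel))
 Multiplying through: 0.8 × Fraction_parallel + 80 × (1 - Fraction_parallel) = 1
 80 - 79.2 × Fraction_parallel = 1, so Fraction_parallel = 79 / 79.2 = 0.9975
 Sequential fraction = 1 - 0.9975 = 0.0025, i.e., 0.25%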
 To achieve a speedup of 80 with 100 processors, only 0.25% of the
original computation can be sequential.
