
Today’s topics

• Single processors and the Memory Hierarchy
• Buses and Switched Networks
• Interconnection Network Topologies
• Multiprocessors
• Multicomputers
• Flynn’s Taxonomy
• Modern clusters – hybrid systems
Processors and the Memory Hierarchy
• Registers (1 clock cycle, 100s of bytes)
• 1st level cache (3-5 clock cycles, 100s KBytes)
• 2nd level cache (~10 clock cycles, MBytes)
• Main memory (~100 clock cycles, GBytes)
• Disk (milliseconds, 100 GB to ginormous)
[Figure: CPU registers backed by split 1st-level instruction and data caches and a unified 2nd-level cache (IBM dual-core example).
From the Intel® 64 and IA-32 Architectures Optimization Reference Manual,
http://www.intel.com/design/processor/manuals/248966.pdf]
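The gap between these levels shows up directly in running time. A minimal C sketch (not from the slides; the array size and strides are my own choices): a unit-stride sweep reuses each cache line it fetches, while a stride of 16 ints (64 bytes, one cache line on most current processors) misses on nearly every access, so it takes roughly as long even though it does 1/16 as many loads.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (1 << 26)                      /* 64M ints (~256 MB), larger than any cache */

static double sweep(const int *a, size_t stride)
{
    clock_t t0 = clock();
    long sum = 0;
    for (size_t i = 0; i < N; i += stride)
        sum += a[i];
    volatile long sink = sum;            /* keep the loop from being optimized away */
    (void)sink;
    return (double)(clock() - t0) / CLOCKS_PER_SEC;
}

int main(void)
{
    int *a = malloc((size_t)N * sizeof *a);
    if (!a) return 1;
    for (size_t i = 0; i < N; i++) a[i] = (int)i;

    printf("stride  1: %.3f s  (N loads, mostly cache hits)\n", sweep(a, 1));
    printf("stride 16: %.3f s  (N/16 loads, ~one cache miss each)\n", sweep(a, 16));
    free(a);
    return 0;
}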
Interconnection Network Topologies – Bus
• Bus
  – A single shared data path
  – Pros
    • Simplicity
    • Cache coherence
    • Synchronization
  – Cons
    • Fixed bandwidth
    • Does not scale well

[Figure: CPUs sharing a single bus connected to a global memory]
Interconnection Network Topologies – Switch Based
• Switch Based
  – m×n switches
  – Many possible topologies
• Characterized by
– Diameter
• Worst case number of switches between two processors
• Impacts latency
– Bisection width
• Minimum number of connections that must be removed to
split the network into two
• Communication bandwidth limitation
– Edges per switch
• Best if this is independent of the size of the network
Interconnection Network Topologies – Mesh
• 2-D Mesh
– 2-D array of processors
• Torus/Wraparound Mesh
– Processors on the edges of the mesh are connected by wraparound links
• Characteristics (n nodes)
– Diameter = √n (torus) or 2(√n − 1) (mesh)
– Bisection width = √n
– Switch size = 4
– Number of switches = n
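A small C sketch of the formulas above, assuming n is a perfect square so the network is √n × √n (the example size below is my own choice):

#include <math.h>
#include <stdio.h>

int main(void)
{
    int n = 64;                                    /* number of nodes (assumed square) */
    int side = (int)llround(sqrt((double)n));      /* side length sqrt(n) */

    printf("mesh  diameter  = %d\n", 2 * (side - 1));  /* corner to opposite corner */
    printf("torus diameter  = %d\n", 2 * (side / 2));  /* wraparound halves each axis (~sqrt(n)) */
    printf("bisection width = %d\n", side);            /* links cut to split the network in two */
    printf("switch size = 4, number of switches = %d\n", n);
    return 0;
}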
Interconnection Network Topologies – Hypercube
• Hypercube
  – A d-dimensional hypercube has n = 2^d processors
  – Each processor is directly connected to d other processors
  – The shortest path between a pair of processors is at most d
• Characteristics (n = 2^d nodes)
  – Diameter = d
  – Bisection width = n/2
  – Switch size = d
  – Number of switches = n

[Figures: 3-D and 4-D hypercubes]
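The hypercube structure is easy to express with bit operations: number the nodes 0 … 2^d − 1, connect nodes whose labels differ in exactly one bit, and route by fixing one differing bit per hop. A short sketch (the node labels below are arbitrary examples):

#include <stdio.h>

static int hamming(unsigned x)             /* number of set bits */
{
    int c = 0;
    while (x) { c += x & 1u; x >>= 1; }
    return c;
}

int main(void)
{
    int d = 3;                             /* 3-D hypercube: n = 2^d = 8 nodes */
    unsigned i = 5;                        /* example node 101 (binary) */

    printf("neighbors of node %u:", i);
    for (int k = 0; k < d; k++)
        printf(" %u", i ^ (1u << k));      /* flip one address bit per link */
    printf("\n");

    unsigned j = 2;                        /* example node 010 (binary) */
    printf("shortest path %u -> %u has length %d (diameter is %d)\n",
           i, j, hamming(i ^ j), d);
    return 0;
}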
Multistage Networks
• Butterfly
• Omega
• Perfect shuffle

• Characteristics for an Omega network (n = 2^d nodes)
  – Diameter = d − 1
  – Bisection width = n/2
  – Switch size = 2
  – Number of switches = d × n/2

[Figure: an 8-input, 8-output Omega network of 2×2 switches]
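One standard way such a network is used (an assumption here, not stated on the slide) is destination-tag self-routing: before each column of 2×2 switches the lines are perfect-shuffled (the d-bit line number is rotated left by one), and each switch then sends the message to its upper or lower output according to the next bit of the destination address. A sketch for the 8×8 case:

#include <stdio.h>

int main(void)
{
    int d = 3, n = 1 << d;                  /* 8-input, 8-output Omega network */
    unsigned src = 3, dst = 6;              /* example endpoints (assumed) */

    unsigned line = src;
    for (int k = 0; k < d; k++) {
        /* perfect shuffle: rotate the d-bit line number left by one */
        line = ((line << 1) | (line >> (d - 1))) & (unsigned)(n - 1);
        /* the 2x2 switch replaces the low bit with the next destination bit */
        unsigned bit = (dst >> (d - 1 - k)) & 1u;
        line = (line & ~1u) | bit;
        printf("after stage %d: on line %u (switch chose %s output)\n",
               k, line, bit ? "lower" : "upper");
    }
    printf("message from %u delivered to %u\n", src, dst);
    return 0;
}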
Shared Memory
• One or more memories
• Global address space (all system memory visible to all processors)
• Transfer of data between processors is usually implicit: just read from (or write to) a given address (e.g., OpenMP)
• Cache-coherency protocol to maintain consistency between processors.

UMA (Uniform Memory Access) Shared-memory System

[Figure: CPUs connected through an interconnection network to several memory modules]
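A minimal OpenMP sketch of this model (my own example, compiled with something like gcc -fopenmp): the threads communicate only by reading and writing the shared arrays; no explicit data transfer appears in the code.

#include <omp.h>
#include <stdio.h>

#define N 1000000

int main(void)
{
    static double a[N], b[N];                /* shared by all threads */
    double sum = 0.0;

    #pragma omp parallel for                 /* each thread writes part of a[] and b[] */
    for (int i = 0; i < N; i++)
        a[i] = 2.0 * (b[i] = i);

    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; i++)
        sum += a[i];                         /* implicit communication through memory */

    printf("sum = %.0f using up to %d threads\n", sum, omp_get_max_threads());
    return 0;
}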


Distributed Shared Memory
• Single address space with implicit communication
• Hardware support for read/write to non-local memories, cache
coherency
• Latency for a memory operation is greater when accessing non-local data than when accessing data within a CPU’s own memory

NUMA (Non-Uniform Memory Access) Shared-memory System

[Figure: CPU/memory pairs connected by an interconnection network; every CPU can reach every memory, but remote accesses are slower]
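On such a machine, data placement matters. One common technique (not on the slide, so treat this as an assumption) is first-touch placement with OpenMP: a page is usually allocated in the memory of the CPU whose thread first writes it, so initializing the array with the same parallel loop schedule used later keeps most accesses local.

#include <omp.h>
#include <stdlib.h>

#define N 50000000

int main(void)
{
    double *x = malloc((size_t)N * sizeof *x);
    if (!x) return 1;

    #pragma omp parallel for schedule(static)   /* first touch: each thread places its pages */
    for (long i = 0; i < N; i++)
        x[i] = 0.0;

    #pragma omp parallel for schedule(static)   /* same schedule: accesses stay mostly local */
    for (long i = 0; i < N; i++)
        x[i] = x[i] + 1.0;

    free(x);
    return 0;
}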


Distributed Memory
• Each processor has access to its own memory only
• Data transfer between processors is explicit: the user calls message-passing functions
• Common Libraries for message passing
– MPI, PVM
• User has complete control/responsibility for data placement and
management

[Figure: CPU/memory pairs connected by an interconnection network; each CPU can address only its own memory]
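A minimal MPI sketch of explicit message passing (built with mpicc and run with something like mpirun -np 2): rank 0 must call a send and rank 1 a matching receive; no memory is shared.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double buf[4] = {0};
    if (rank == 0) {
        for (int i = 0; i < 4; i++) buf[i] = i * 1.5;
        MPI_Send(buf, 4, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);      /* explicit send */
    } else if (rank == 1) {
        MPI_Recv(buf, 4, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);                              /* explicit matching receive */
        printf("rank 1 received %.1f %.1f %.1f %.1f\n",
               buf[0], buf[1], buf[2], buf[3]);
    }

    MPI_Finalize();
    return 0;
}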


Hybrid Systems
• Distributed memory system with multiprocessor shared memory nodes.
• The most common architecture for the current generation of parallel machines

[Figure: shared-memory nodes, each with several CPUs, a memory, and a network interface, connected by an interconnection network]
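The usual way to program such a machine (the pairing below is the common convention, not something the slide prescribes) is MPI between nodes and OpenMP within a node, for example one MPI process per node with one thread per core. A sketch compiled with something like mpicc -fopenmp:

#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided, rank;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double local = 0.0;
    #pragma omp parallel for reduction(+:local)   /* shared memory inside the node */
    for (int i = 0; i < 1000; i++)
        local += 1.0;

    double total = 0.0;                           /* message passing between nodes */
    MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("grand total = %.0f (each rank used up to %d threads)\n",
               total, omp_get_max_threads());

    MPI_Finalize();
    return 0;
}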


Flynn’s Taxonomy
(figure 2.20 from Quinn: instruction stream × data stream)
• SISD (single instruction, single data) – uniprocessors
• SIMD (single instruction, multiple data) – processor arrays, pipelined vector processors
• MISD (multiple instruction, single data) – systolic arrays
• MIMD (multiple instruction, multiple data) – multiprocessors, multicomputers
Top 500 List
• Some highlights from http://www.top500.org/
– On the new list, the IBM BlueGene/L system, installed at DOE’s
Lawrence Livermore National Laboratory (LLNL), retains the No. 1 spot
with a Linpack performance of 280.6 teraflops (trillions of calculations
per second, or Tflop/s).
– The new No. 2 system is Sandia National Laboratories’ Cray Red Storm supercomputer, only the second system ever recorded to exceed the 100 Tflops/s mark, with 101.4 Tflops/s. The initial Red Storm system was ranked No. 9 in the last listing.
– Slipping to No. 3 from No. 2 last June is the IBM eServer Blue Gene
Solution system, installed at IBM’s Thomas Watson Research Center
with 91.20 Tflops/s Linpack performance.
– The new No. 5 is the largest system in Europe, an IBM JS21 cluster
installed at the Barcelona Supercomputing Center. The system reached
62.63 Tflops/s.
Linux/Beowulf cluster basics
• Goal
  – Get supercomputing power at the cost of a few PCs
• How
  – Commodity components: PCs and networks
  – Free, open-source software
CPU nodes
• A typical configuration
– Dual-socket nodes
– Dual-core AMD or Intel processors
– 4 GB of memory per node
Network Options

[Figure: cluster network options, from D.K. Panda’s Nowlab website at Ohio State,
http://nowlab.cse.ohio-state.edu/ (Research Overview presentation)]
Challenges
• Cooling
• Power constraints
• Reliability
• System Administration
