
ACA NOTES

Parallel Computing A common way of satisfying the described needs is to use parallel computers. A parallel computer consists of two or more processing units which operate more or less independently in parallel. Using such a computer, a problem can (theoretically) be divided into n subproblems (where n is typically the number of available processing units), and each part of the problem is solved by one of the processing units concurrently. Ideally, the completion time of the computation is t/n, where t is the completion time for the problem on a computer containing only one processing unit. In practice, a completion time of t/n will rarely be achieved, for several reasons: sometimes a problem cannot be divided exactly into n independent parts, there is usually a need for communication between the parallel executing processes (e.g. for data exchange, synchronization, etc.), and some problems contain parts that are inherently sequential and therefore cannot be processed in parallel. This leads us to the term scalability. Scalability is a measure of whether a given problem can be solved faster as more processing units are added to the computer. This applies to both hardware and software.

Scalability A computer system, including all its hardware and software resources, is called scalable if it can scale up (i.e., improve its resources) to accommodate ever-increasing performance and functionality demand, and/or scale down (i.e., decrease its resources) to reduce cost.

Parallel computers can be classified by various aspects of their architecture. Here we present three different classification schemes. In the first, parallel computers are distinguished by the way the processors are connected to the memory. The second scheme ("Flynn's Classification Scheme") takes the number of instruction streams and the number of data streams into account. Finally, the third scheme (ECS, the "Erlangen Classification Scheme") focuses on the number of control units, functional units, and the word size of the computer.
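The effect of these overheads on the achievable completion time can be illustrated with a small calculation. The sketch below compares the ideal completion time t/n with an estimate that keeps a part of the work sequential and charges a per-processor communication cost; the serial fraction and the overhead value are assumptions chosen for illustration, not figures from the text.

```python
# Illustrative sketch: why the ideal completion time t/n is rarely reached.
# The serial fraction and communication overhead are assumed values.

def ideal_time(t, n):
    """Ideal completion time when the work divides perfectly over n units."""
    return t / n

def realistic_time(t, n, serial_fraction=0.05, comm_overhead=0.01):
    """Estimate with a part that stays sequential plus communication cost.

    serial_fraction: share of t that cannot be parallelized (assumed).
    comm_overhead:   extra time per processing unit for data exchange and
                     synchronization (assumed).
    """
    serial = serial_fraction * t
    parallel = (1 - serial_fraction) * t / n
    return serial + parallel + comm_overhead * n

t = 100.0  # completion time on a single processing unit
for n in (1, 2, 4, 8, 16, 64):
    print(n, round(ideal_time(t, n), 2), round(realistic_time(t, n), 2))
```

As n grows, the sequential part and the communication overhead dominate, which is exactly why the scalability of both hardware and software matters.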

Multiprocessors and Multicomputers


Memory-Processor Organization

In terms of memory-processor organization three main groups of architectures can be distinguished. These are

shared memory architectures, distributed memory architectures, and distributed shared memory architectures.

Shared Memory Architectures The main property of shared memory architectures is that all processors in the system have access to the same memory; there is only one global address space. Typically, the main memory consists of several memory modules (whose number is not necessarily equal to the number of processors in the computer, see Figure 2-1). In such a system, communication and synchronization between the processors is done implicitly via shared variables.

The processors are connected to the memory modules via some kind of interconnection network. This type of parallel computer is also called UMA, which stands for uniform memory access, since all processors access every memory module with the same latency and bandwidth.

A big advantage of shared memory computers is that programming them is very convenient, since all data are accessible by all processors and there is no need to copy data. Furthermore, the programmer does not have to care about synchronization, since this is carried out by the system automatically (which makes the hardware more complex and hence more expensive). However, it is very difficult to obtain high degrees of parallelism with shared memory machines; most systems do not have more than 64 processors. This limitation stems from the fact that a centralized memory and the interconnection network are both difficult to scale once built.

Distributed Memory Architectures In the case of a distributed memory computer (in the literature also called a multicomputer), each processor has its own private memory. There is no common address space, i.e. the processors can access only their own memories. Communication and synchronization between the processors is done by exchanging messages over the interconnection network. Figure shows the organization of the processors and memory modules in a distributed memory computer. In contrast to the shared memory architecture, a distributed memory machine scales very well, since all processors have their own local memory and there are no memory access conflicts. Using this architecture, massively parallel processors (MPPs) can be built, with up to several hundred or even thousands of processors.
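To make the difference in programming style concrete, the following sketch contrasts communication through a shared variable with communication by explicit message passing, using Python's standard multiprocessing module; the worker functions and values are invented for this illustration.

```python
# Illustrative sketch: shared-variable vs. message-passing communication.
# Uses only Python's standard multiprocessing module; names are made up.
from multiprocessing import Process, Value, Pipe

def shared_memory_worker(counter):
    # Shared memory style: the worker updates a variable that the parent
    # can read directly; no explicit data transfer is needed.
    with counter.get_lock():
        counter.value += 1

def message_passing_worker(conn):
    # Distributed memory style: the worker owns its data and must send an
    # explicit message over a channel (here a pipe) to communicate.
    local_result = 42
    conn.send(local_result)
    conn.close()

if __name__ == "__main__":
    counter = Value("i", 0)                 # shared variable
    p1 = Process(target=shared_memory_worker, args=(counter,))
    p1.start(); p1.join()
    print("shared variable:", counter.value)

    parent_conn, child_conn = Pipe()        # message channel
    p2 = Process(target=message_passing_worker, args=(child_conn,))
    p2.start()
    print("received message:", parent_conn.recv())
    p2.join()
```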

Typical representatives of a pure distributed memory architecture are clusters of computers, which are becoming more and more important nowadays.

In a cluster each node is a complete computer, and these computers are connected through a low-cost commodity network (e.g. Ethernet, Myrinet, etc.). The big advantage of clusters compared to MPPs is that they have a much better cost/performance ratio.

Distributed Shared Memory Architectures To combine the advantages of the architectures described above, ease of programming on the one hand and high scalability on the other hand, a third kind of architecture has been established: distributed shared memory machines. Here, each processor has its own local memory, but, contrary to the distributed memory architecture, all memory modules form one common address space, i.e. each memory cell has a system-wide unique address. In order to avoid the disadvantage of shared memory computers, namely the low scalability, each processor uses a cache, which keeps the number of memory access conflicts and the network contention low. However, the use of caches introduces a number of problems, for example how to keep the data in memory and the copies in the caches up to date. This problem is solved by using sophisticated cache coherence and consistency protocols. A detailed description of the most important protocols can be found in.
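As a rough idea of what such a protocol has to do, the following sketch models a strongly simplified invalidation-based scheme with Modified/Shared/Invalid line states. The classes and the single cached variable are invented for this example; real protocols (e.g. MESI or directory-based schemes) are considerably more involved.

```python
# Minimal sketch of an invalidation-based coherence idea (MSI-like states).
# Class and method names are illustrative; real protocols are more complex.

class CacheLine:
    def __init__(self):
        self.state = "I"   # I = Invalid, S = Shared, M = Modified
        self.value = None

class Cache:
    def __init__(self, name, bus):
        self.name, self.bus, self.line = name, bus, CacheLine()
        bus.caches.append(self)

    def read(self, memory):
        if self.line.state == "I":            # miss: fetch the current value
            self.line.value = memory["x"]
            self.line.state = "S"
        return self.line.value

    def write(self, memory, value):
        self.bus.invalidate_others(self)      # invalidate all other copies
        self.line.value, self.line.state = value, "M"
        memory["x"] = value                   # write-through for simplicity

class Bus:
    def __init__(self):
        self.caches = []
    def invalidate_others(self, writer):
        for c in self.caches:
            if c is not writer:
                c.line.state = "I"

memory = {"x": 0}
bus = Bus()
c0, c1 = Cache("P0", bus), Cache("P1", bus)
print(c0.read(memory), c1.read(memory))   # both caches hold a shared copy
c0.write(memory, 7)                        # P0 writes, P1's copy is invalidated
print(c1.line.state, c1.read(memory))      # P1 re-reads the up-to-date value
```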

Interconnection Networks Since the processors in a parallel computer need to communicate in order to solve a given problem, there is a need for some kind of communication infrastructure, i.e. the processors need to be connected in some way.

Basically, there are two kinds of interconnection networks: static and dynamic. In a static interconnection network all connections are fixed, i.e. the processors are wired directly to each other, whereas in a dynamic network there are switches in between. The decision whether to use a static or dynamic interconnection network depends on the kind of problem that should be solved with the computer. Generally, static topologies are suitable for problems whose communication patterns can be predicted reasonably well, whereas dynamic topologies (switching networks), though more expensive, are suitable for a wider class of problems [1]. In the following, we give a description of some important static and dynamic topologies, including routing protocols.

Static Topologies Descriptions Meshes and Rings The simplest - and cheapest - way to connect the nodes of a parallel computer is to use a one-dimensional mesh. Each node has two connections, boundary nodes have one. If the boundary nodes are connected to each other, we have a ring, and all nodes have two connections. The one-dimensional mesh can be generalized to a k-dimensional mesh, where each node (except boundary nodes) has 2k connections. Again, boundary nodes can be connected, but there is no general consensus on how to treat them.
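The connection rule of a k-dimensional mesh can be written down directly: a node is connected to every node that differs by one in exactly one coordinate. A minimal sketch, with an assumed side length per dimension and no wraparound links:

```python
# Illustrative sketch: neighbours of a node in a k-dimensional mesh with
# `side` nodes per dimension (no wraparound, so boundary nodes have fewer
# than 2k connections).

def mesh_neighbours(coords, side):
    neighbours = []
    for dim in range(len(coords)):          # one step along each dimension
        for step in (-1, +1):
            candidate = list(coords)
            candidate[dim] += step
            if 0 <= candidate[dim] < side:  # drop links that leave the mesh
                neighbours.append(tuple(candidate))
    return neighbours

# Interior node of a three-dimensional 4x4x4 mesh: 2k = 6 neighbours.
print(mesh_neighbours((1, 2, 1), side=4))
# Corner node: only 3 neighbours, unless the boundary nodes are wrapped around.
print(mesh_neighbours((0, 0, 0), side=4))
```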

However, this type of topology is not suitable for building large-scale computers, since the maximum message latency, that is, the maximum delay of a message from one of the N processors to another, is of the order of k * N^(1/k) hops in a k-dimensional mesh with N nodes. This is bad for two reasons: firstly, there is a wide range of latencies (the latency between neighbouring processors is much lower than between non-neighbours), and secondly, the maximum latency grows with the number of processors.

Stars In a star topology there is one central node, to which all other nodes are connected; each node has one connection, except the centre node, which has N-1 connections.

Stars are also not suitable for large systems, since the centre node will become a bottleneck as the number of processors increases. Hypercubes The hypercube topology is one of the most popular and is used in many large-scale systems. A k-dimensional hypercube has 2^k nodes, each with k connections. In Figure a four-dimensional hypercube is displayed. Hypercubes scale very well; the maximum latency in a k-dimensional hypercube is log2 N, with N = 2^k. An important property of hypercubes is the relationship between a node's number and the nodes it is connected to. The rule is that any two nodes in the hypercube whose binary representations differ in exactly one bit are connected together. For example, in a four-dimensional hypercube, node 0 (0000) is connected to node 1 (0001), node 2 (0010), node 4 (0100) and node 8 (1000). An ordering of the nodes in which successive numbers differ in exactly one bit is called a Gray code.
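The connection rule can be checked directly on the bit patterns: flipping each of the k bits of a node number yields exactly its k neighbours. A minimal sketch using the numbering from the example above:

```python
# Illustrative sketch: neighbours of a node in a k-dimensional hypercube.
# Two nodes are connected iff their binary representations differ in one bit.

def hypercube_neighbours(node, k):
    return [node ^ (1 << bit) for bit in range(k)]  # flip each bit once

# Node 0 in a four-dimensional hypercube (cf. the example in the text):
print(hypercube_neighbours(0, 4))   # [1, 2, 4, 8]
# Node 5 (0101) is connected to 4 (0100), 7 (0111), 1 (0001) and 13 (1101).
print(hypercube_neighbours(5, 4))   # [4, 7, 1, 13]
```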

Routing Meshes and Rings Typically, in meshes the so-called dimension-order routing technique is used. That is, routing is performed in one dimension at a time. In a three-dimensional mesh, for example, a message travelling from node (a,b,c) to node (x,y,z) is first moved along the first dimension to node (x,b,c), then along the second dimension to node (x,y,c), and finally in the third dimension to the destination node (x,y,z).

Stars Routing in stars is trivial. If one of the communicating nodes is the centre node, then the path is just the edge connecting them. If not, the message is routed from the source node to the centre node, and from there to the destination node.

Hypercubes A k-dimensional hypercube is nothing else than a k-dimensional mesh with only two nodes in each dimension, and thus the routing algorithm is essentially the same as for meshes, with one difference: the path from node A to node B is calculated by simply computing the exclusive-or X = A XOR B of the binary representations of nodes A and B. If the i-th bit in X is '1', the message is moved to the neighbouring node in the i-th dimension; if the i-th bit is '0', the message is not moved in that dimension. This means that it takes at most log2 N steps for a message to reach its destination (where N is the number of nodes in the hypercube).
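Both routing schemes are short enough to sketch directly; the function names and the example coordinates below are chosen purely for illustration.

```python
# Illustrative sketch: dimension-order routing in a mesh and XOR-based
# routing in a hypercube. Names and example values are made up.

def mesh_route(src, dst):
    """Correct one coordinate at a time, dimension by dimension."""
    path, current = [tuple(src)], list(src)
    for dim in range(len(src)):
        step = 1 if dst[dim] > current[dim] else -1
        while current[dim] != dst[dim]:
            current[dim] += step
            path.append(tuple(current))
    return path

def hypercube_route(a, b):
    """Flip every bit in which A and B differ, lowest dimension first."""
    path, current, diff = [a], a, a ^ b
    bit = 0
    while diff:
        if diff & 1:                 # i-th bit of X is 1: move in dimension i
            current ^= (1 << bit)
            path.append(current)
        diff >>= 1
        bit += 1
    return path

print(mesh_route((0, 0, 2), (2, 1, 0)))   # (a,b,c) -> (x,b,c) -> (x,y,c) -> (x,y,z)
print(hypercube_route(0b0000, 0b1011))    # [0, 1, 3, 11]: at most log2 N steps
```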

Dynamic Topologies Single-Stage Networks Buses and crossbars are the two main representatives of this class. A bus is the simplest way to connect a number of processors with each other: all processors are simply connected to one wire. This makes communication, and especially message routing, very simple. The drawback of this type of network is that the available bandwidth is inversely proportional to the number of connected processors. This means that buses are good only for small networks with a maximum of about 10 processors.

The other extreme in terms of complexity is the crossbar network. A crossbar provides full connectivity, i.e. all processors can communicate with each other simultaneously without reduction of bandwidth. In Figure the connection of n processors with m memory modules (as in a shared memory system) is shown. Crossbars can of course also be used to connect processors with each other. In that case the memory modules are connected directly to the processors (which results in a distributed memory system), and the lines that were connected to the memory modules Mi are now connected to the processors Pi.

To connect n processors to n memory modules, n^2 switches are needed. Consequently, crossbar networks cannot be scaled to arbitrary size. Today's commercially available crossbars can connect up to 256 units. Multi-Stage Networks Multi-stage networks are based on the so-called shuffle-exchange switching element, which is basically a 2 x 2 crossbar. Multiple layers of these elements are connected and form the network. Depending on the way these elements are connected, the following topologies can be distinguished:

Banyan, Baseline, Cube, Delta, Flip, Indirect cube, and Omega.

As an example of a multi-stage network, an 8 x 8 Benes network is shown in Figure
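To give an idea of how a message finds its way through such a network, the following sketch implements destination-tag routing for an omega-style network of 2 x 2 switching elements, where one bit of the destination address selects the upper or lower output at each of the log2(N) stages. This is an illustration for one member of the family listed above, not the routing rule of every multi-stage network.

```python
# Illustrative sketch: destination-tag routing in an omega-style network
# built from 2x2 switching elements. One destination bit is consumed per
# stage, so an N x N network needs log2(N) stages.

def omega_route(dst, n):
    """Switch settings ('up'/'down') chosen at each stage for destination dst.

    The settings depend only on the destination address, which is why this
    kind of routing is called self-routing or destination-tag routing.
    """
    stages = n.bit_length() - 1                     # log2(n) stages for n ports
    settings = []
    for stage in range(stages):
        bit = (dst >> (stages - 1 - stage)) & 1     # most significant bit first
        settings.append("down" if bit else "up")
    return settings

# Output port 5 (binary 101) in an 8 x 8 network is reached in 3 stages.
print(omega_route(5, 8))   # ['down', 'up', 'down']
```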

Summary The networks can be classified as static or dynamic. Static interconnection networks are mainly used in message-passing architectures; the following types are commonly defined:

completely-connected network.

star-connected network.

linear array or ring of processors.

mesh network (in 2D or 3D). Each processor has a direct link to four/six (in 2D/3D) neighbour processors. An extension of this kind of network is the wraparound mesh or torus. Commercial examples are the Intel Paragon XP/S and Cray T3D/E. These examples also cover another class, namely the direct network topology.

tree network of processors. The communication bottleneck likely to occur in large configurations can be alleviated by increasing the number of communication links for processors closer to the root, which results in the fat-tree topology, efficiently used in the TMC CM5 computer. The CM5 could also serve as an example of an indirect network topology.

hypercube network. Classically this is a multidimensional mesh of processors with exactly two processors in each dimension. An example of such a system is the Intel iPSC/860 computer. Some newer projects incorporate the idea of several processors in each node, which results in a fat hypercube, i.e. an indirect network topology. An example is the SGI/Cray Origin2000 computer.

Dynamic interconnection networks implement one of four main alternatives:

bus-based networks - the simplest and most cost-efficient solution when the number of processors is moderate. The main drawbacks are the bottleneck to memory when the number of processors becomes large, and the bus being a single point of failure. To overcome these problems, several parallel buses are sometimes incorporated. The classical example of such a machine is the SGI Power Challenge computer with a packet data bus.

Table 1: Properties of various types of multiprocessor interconnections

Property     Speed   Cost      Reliability   Configurability   Complexity
Bus          low     low       low           high              low
Crossbar     high    high      high          low               high
Multistage   high    moderate  high          moderate          moderate

crossbar switching networks, which employ a grid of switching elements. The network is nonblocking, since the connection of a processor to a memory bank does not block the connection of any other processor to any other memory bank. In spite of their high speed, their use is limited due to the nonlinear complexity (O(p^2), where p is the number of processors) and the cost (cf. Table 1). They are applied mostly in multiprocessor vector computers (like the Cray YMP) and in multiprocessors with multilevel interconnections (e.g. the HP/Convex Exemplar SPP). One outstanding example is the Fujitsu VPP500, which incorporates a 224 x 224 crossbar switch.

multistage interconnection networks form the most advanced pure solution, which lies between the two extremes (Table 1). A typical example is the omega network, which consists of log2(p) stages, where p is the number of inputs and outputs (usually the number of processors and of memory banks). Its complexity is O(p log2(p)), less than that of the crossbar switch (see the sketch at the end of this section). However, in the omega network some memory accesses can be blocked. Although machines with this kind of interconnection offer a virtual global memory programming model and ease of use, they are still not very popular. Examples from the past include the BBN Butterfly and IBM RP-3 computers; at present the IBM RS6K SP incorporates multistage interconnections with the Vulcan switch.

multilevel interconnection networks seem to be a relatively recent development. The idea comes directly from clusters of computers and consists of two or more levels of connections with different aggregated bandwidths. Typical examples are the SGI/Cray Origin2000, the IBM RS6K SP with PowerPC604 SMP nodes, and the HP/Convex Exemplar. This kind of architecture is attracting the most interest at present.
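The complexity figures quoted above can be checked with a quick calculation: a p x p crossbar needs p^2 crosspoints, whereas an omega network built from 2 x 2 elements needs (p/2) * log2(p) of them. A small sketch:

```python
# Illustrative sketch: number of switching elements in a p x p crossbar
# versus an omega network built from 2x2 elements.
import math

def crossbar_switches(p):
    return p * p                                 # O(p^2) crosspoints

def omega_switches(p):
    return (p // 2) * int(math.log2(p))          # (p/2) * log2(p), i.e. O(p log p)

for p in (8, 64, 256, 1024):
    print(p, crossbar_switches(p), omega_switches(p))
# For p = 1024: 1,048,576 crosspoints vs. 5,120 two-by-two switches.
```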
