Cluster Classification
1. CLASSIFICATION OF CLUSTERS
1.1 Classification on the Basis of Architecture
Parallel architectures are defined by how processors, memory, and the interconnect are laid out and how resources are managed. They can be classified as follows:
1. Symmetric multiprocessing (SMP)
2. Massively parallel processing (MPP)
3. Nonuniform memory access (NUMA)
Even with a single processor, memory can have difficulty keeping up with the processor's needs, and when processors become multiple the problem multiplies several times over. Parallel machines can therefore be categorized by how they access memory, the major distinction being between distributed and shared memory. With distributed memory, each processor has its own physical memory (with its own L1, L2, and L3 caches) and its own address space, so if two processors each refer to a variable X, each accesses a copy in its own local memory. With shared memory, by contrast, two processors referring to X access the same memory location. Shared memory itself can be organized in two ways:
1. A shared memory bus between processors works only for a small number of processors. In a multicore processor, however, the cores typically have a shared cache, such as an L2 or L3 cache.
2. Large numbers of processors use a cross-bar that connects multiple processors to multiple memory banks.
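To make the variable X example above concrete, here is a minimal C sketch (an illustration, not from the original text) using POSIX threads: two threads run in one shared address space, so both updates land in the same memory location. On a distributed-memory machine, each process would instead hold its own private copy of X. The atomic increment uses a GCC/Clang builtin.

    #include <pthread.h>
    #include <stdio.h>

    int X = 0;                       /* one variable in one shared address space */

    void *worker(void *arg) {
        (void)arg;
        __sync_fetch_and_add(&X, 1); /* both threads update the SAME location */
        return NULL;
    }

    int main(void) {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, worker, NULL);
        pthread_create(&t2, NULL, worker, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        printf("X = %d\n", X);       /* prints 2: the increments were not private */
        return 0;
    }

Compile with cc -pthread. On a distributed-memory system, the analogous MPI program would print X = 1 in every process, because no process can see another's memory.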
In the early 1960s, Burroughs Corporation introduced a symmetrical multiple-instruction multiple-data multiprocessor with four central processing units (CPUs) and up to sixteen memory modules connected via a cross-bar switch (the first SMP architecture). In the late 1960s, Honeywell delivered the first Multics system, another SMP system, with eight CPUs.
While multiprocessing systems were being developed, technology also
advanced in the ability to shrink the processors and operate at much higher
clock rates. In the 1980s, companies such as Cray Research introduced
multiprocessor systems and UNIX-like operating systems that could take
advantage of them (CX-OS). With the popularity of uniprocessor personal computer (PC) systems such as the IBM PC, the late 1980s saw a decline in
multiprocessing systems. But currently, more than twenty years later,
multiprocessing has returned to these same PC systems through SMP.
4. Storage
Storage services are provided by most cloud providers, which
removes the headache of storing data from the user’s end.
5. Synchronization
Currently, more than 100 companies provide cloud-based services, well beyond the initial concept of computers for rent. Examples include Google, Amazon, and Netflix. The infrastructure for cloud computing can be interesting from a computer science point of view, because it involves distributed file systems, scheduling, virtualization, and mechanisms for ensuring high reliability.
a. Enhanced Performance
A cluster should be a system that gives enhanced performance.
b. Enhanced Availability
A cluster system must be available 24/7/365. The term used for this in clusters is fault tolerance. In simpler terms, it means that the system must be able to bear faults and continue operating properly in the event of the failure of, or one or more malfunctions within, some of its components. To make a system fault tolerant, redundancy is built into it, not only in the nodes but also in storage, power supplies, and so on.
c. Single System Image
A cluster should look like a single complete system that occupies little space, with a comfortable working area; the less space it occupies, the better. Although distributed systems take up a lot of space, shared memory architecture has played a significant role in forming single-system images.
d. Fast Communication
In cluster networking, intercommunication between nodes remains a bottleneck. This has been controlled to a considerable extent with the introduction of high-speed InfiniBand, but InfiniBand is still beyond the range of an economical Beowulf cluster; normally it takes up to 20% of the overall cluster budget. For academic purposes, therefore, people usually go for common and affordable communication networks such as Ethernet. Myrinet is also a plausible choice and sits between Ethernet and InfiniBand.
e. Load Balancing
Load balancing plays an important role in the performance of a cluster. When multiple users submit jobs to a cluster, the load-balancing software intelligently distributes the load to whichever nodes are free. Popular options include the LSF software, which is supported by ANSYS FLUENT; the Portable Batch System (PBS), which is paid software; and the Sun Grid Engine (SGE) and Ganglia, which are free.
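As a toy illustration only (real schedulers such as LSF, PBS, and SGE are far more sophisticated), the following C sketch shows the basic least-loaded idea: each incoming job is assigned to the node currently running the fewest jobs.

    #include <stdio.h>

    #define NODES 4

    /* Toy least-loaded scheduler: each incoming job goes to the node
       that currently has the fewest running jobs. */
    int main(void) {
        int load[NODES] = {0};
        for (int job = 0; job < 10; job++) {
            int best = 0;
            for (int n = 1; n < NODES; n++)
                if (load[n] < load[best])
                    best = n;           /* pick the least-loaded node */
            load[best]++;
            printf("job %d -> node %d\n", job, best);
        }
        return 0;
    }

With four equal nodes, the ten jobs spread out almost evenly; a production scheduler additionally weighs memory, licenses, queue priorities, and job runtimes.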
f. System Security
A key feature of using Linux is that it is highly secure. A system must be secure enough that no intruder can get into it and crash your job. A virus-free environment is also mandatory, and in Linux you do not have to worry about that.
g. Programmability (Optional)
For most clusters, if you are getting a turnkey (ready-made) solution, you rarely need to program anything. Commercial codes such as ANSYS FLUENT and ANSYS CFX generally do not need programming unless you use user-defined functions (in FLUENT, for example) or you are writing your own CFD code or some other number-crunching code. Frankly speaking, engineers are mostly afraid of programming and simply run away, saying, "It's the programmer's job!"
2. Use and Application
It is important to know whether you will actually gain performance by using a cluster or whether you do not need one at all. Budget allocation is obviously a main factor to consider after deciding to buy a cluster; however, a cluster is not a small stationery item that you buy from a shop and start using right away. Thorough research and investigation are always recommended.
3. Environment
A neat, clean, and temperature-controlled environment is necessary for these sophisticated clusters to work at their best. Cooling-load calculations must be done properly to account for heating, ventilation, and air-conditioning (HVAC) issues. Space should be properly allocated so that the racks, shelves, and uninterruptible power supply systems do not become congested.
4. Component Selection for an Integrated System
Usually, FEA and CFD packages are used for defense-sector applications; NASA has used AMD processors, whereas a number of companies are content with the Intel Xeon. AMD processors have a common issue with temperature rise. Chapter 6 compares their performance.
a. Architecture: 32- or 64-Bit
Do you want to use a 32-bit or a 64-bit architecture? When it was first introduced, the Itanium was popular, but its fame faded once Intel and AMD began improving the Pentium series with the x86-64 architecture.
b. Main Memory
The main memory in conventional servers (whether blade- or rack-mounted) is the standard RAM module called a DIMM. Currently, it comes as double data rate (DDR) memory; thus, one could easily guess that there must once have been single data rate (SDR) RAM.
because they come in chassis, so you can place two or three chassis in a rack. The blade design is the costliest of the three, because the chassis unit provides separate power, temperature-control devices, and monitoring. As shown in Figures 5.7 and 5.8, a blade system occupies less space than a rack, requires fewer network cables, and consumes less power. A rack-mounted design contains individual nodes mounted in it, so each node occupies considerable space. A node is usually more compact than a normal CPU tower but a little thicker than an Apple laptop. The typical thickness of cluster components is measured in a standard unit called "U"; the nominal thickness of a node is 1 U, where 1 U corresponds to 1.75 in. A stand-alone configuration is usually referred to as a workstation, or a group of workstations connected together. Because each workstation is a separate PC, stand-alone configurations occupy a lot of space. Figures 5.5-5.8 show three types of cluster fittings. Figure 5.5 shows a desktop cluster offered by Cray; in a single platform you get the benefit of a mini-supercomputer, and its configuration can easily be found on the Web. Figure 5.6 shows individual PCs clustered together and arranged in a rack to occupy minimum space. HP is a leading service provider in the HPC community. Apart from printers, laptops,
1. Task
A task is a logically discrete section of computational work. It is
typically a program or program-like set of instructions that is executed
by a processor. A parallel program consists of multiple tasks running on
multiple processors.
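A minimal C/MPI sketch of this definition (an illustration, not from the text): mpirun launches N copies of one program, and each running copy is a task.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank); /* which task am I? */
        MPI_Comm_size(MPI_COMM_WORLD, &size); /* how many tasks in total? */
        printf("task %d of %d\n", rank, size);
        MPI_Finalize();
        return 0;
    }

Compiled with mpicc and run as mpirun -np 4 ./a.out, this prints four lines, one per task, each executing on its own processor.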
2. Pipelining
Pipelining breaks a task into steps performed by different processor units, with inputs streaming through, much like an assembly line; it is a type of parallel computing.
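The assembly-line behavior can be visualized with a small C sketch (the stage names are hypothetical): at step t, unit s works on item t - s, so after a short fill phase all three units are busy on different items at once.

    #include <stdio.h>

    #define N      6   /* items streaming through the pipeline */
    #define STAGES 3   /* processor units, like assembly-line stations */

    int main(void) {
        const char *unit[STAGES] = {"fetch", "compute", "write"};
        for (int t = 0; t < N + STAGES - 1; t++) {  /* one time step per row */
            printf("step %d:", t);
            for (int s = 0; s < STAGES; s++) {
                int item = t - s;                   /* unit s lags by s steps */
                if (item >= 0 && item < N)
                    printf("  %s(item %d)", unit[s], item);
            }
            printf("\n");
        }
        return 0;
    }

The printed schedule shows that six items finish in eight steps instead of eighteen, which is exactly the speedup pipelining is after.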
3. Shared Memory
As the name indicates, it describes a computer architecture in which
all processors have direct (usually bus-based) access to common physical
memory. This is with regard to hardware. In a programming sense, it
describes a model in which parallel tasks all have the same picture of
memory and can directly address and access the same logical memory
locations regardless of where the physical memory actually exists.
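A minimal C/OpenMP sketch of the programming-model sense (an illustration, not from the text): every thread addresses the same logical array a, wherever the physical memory happens to be.

    #include <stdio.h>

    int main(void) {
        double a[8];                   /* one array, visible to every thread */
        #pragma omp parallel for       /* threads share the same picture of memory */
        for (int i = 0; i < 8; i++)
            a[i] = 2.0 * i;            /* each thread writes its own elements of a */
        printf("a[7] = %.1f\n", a[7]); /* main thread reads what another thread wrote */
        return 0;
    }

Compile with -fopenmp to run the loop across threads; without the flag the pragma is ignored and the loop runs serially, with the same result.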
4. Symmetric Multiprocessor
Symmetric multiprocessing is a hardware architecture in which multiple processors share a single address space and have access to all resources; it is shared memory computing.
5. Distributed Memory
In hardware, this refers to network-based memory access for phys-
ical memory that is not common. As a programming model, tasks can
only logically see local machine memory and must use communica-
tions to access memory on other machines where other tasks are being
executed.
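Continuing the earlier variable X example, a minimal C/MPI sketch (an illustration, not from the text): each process declares the "same" X, but each copy lives in a different local memory, so the values are independent.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        int X = 100 * rank;  /* same name, but a private copy in each task's memory */
        printf("task %d: my X = %d\n", rank, X);
        MPI_Finalize();      /* no task ever saw another task's X */
        return 0;
    }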
6. Communications
Parallel tasks frequently need to exchange data. There are several ways this can be accomplished, such as through a shared memory bus or over a network; however, the actual event of data exchange is commonly referred to as communications, regardless of the method employed.
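A minimal C/MPI sketch of one explicit data exchange (the payload is hypothetical): task 0 sends an integer to task 1. Run with at least two tasks, e.g., mpirun -np 2.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank, x = 0;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        if (rank == 0) {
            x = 42;
            MPI_Send(&x, 1, MPI_INT, 1, 0, MPI_COMM_WORLD); /* the communication event */
        } else if (rank == 1) {
            MPI_Recv(&x, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            printf("task 1 received x = %d from task 0\n", x);
        }
        MPI_Finalize();
        return 0;
    }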
7. Synchronization
Synchronization is the coordination of parallel tasks in real time, often
associated with communications. It is often implemented by establishing a
synchronization point within an application in which a task may not
proceed further until another task(s) reaches the same or logically equiva-
lent point. It usually involves waiting by at least one task, and can therefore
cause a parallel application’s wall clock execution time to increase.
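A minimal C/MPI sketch (the delays are only illustrative): tasks arrive at the synchronization point at different times, and the barrier makes the early ones wait, which is exactly how synchronization inflates wall clock time.

    #include <mpi.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        sleep((unsigned)rank);        /* task r pretends to work for r seconds */
        MPI_Barrier(MPI_COMM_WORLD);  /* no task proceeds until all have arrived */
        printf("task %d passed the barrier\n", rank);
        MPI_Finalize();
        return 0;
    }

With four tasks, everyone passes the barrier together after about three seconds: the fastest tasks simply waited.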
8. Granularity
In parallel computing, granularity is a qualitative measure of the
ratio of computation to communication. Like the CFD mesh, it can
also be categorized as
a. Coarse: relatively large amounts of computational work are done
between communication events
b. Fine: relatively small amounts of computational work are done
between communication events
9. Observed Speedup
Observed speedup of a code that has been parallelized is defined
as the ratio between the wall clock time of serial execution and
the wall clock time of parallel execution. It is one of the simplest
and most widely used indicators of a parallel program’s performance.
A similar definition is used by ANSYS FLUENT and is discussed in
Chapter 6.
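With hypothetical numbers (not from the text): if the serial run takes T_serial = 1,200 s of wall clock time and the parallel run takes T_parallel = 100 s on 16 cores, then

    speedup S = T_serial / T_parallel = 1,200 / 100 = 12,

for a parallel efficiency of S/16 = 75%.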
10. Parallel Overhead
Parallel overhead is the amount of time required to coordinate
parallel tasks, as opposed to doing useful work. Parallel overhead can
include factors such as:
a. Task startup time
b. Synchronizations
c. Data communications
d. Software overhead imposed by parallel languages, libraries, oper-
ating system, etc.
e. Task termination time
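One common way to quantify the overhead listed above (an assumption for illustration, not a definition from the text): with p processors, the total overhead is

    O = p × T_parallel − T_serial.

For example, if T_serial = 1,000 s and p = 8 processors finish in T_parallel = 150 s, the processors consumed 8 × 150 = 1,200 s of CPU time in total, so O = 200 s went to startup, synchronization, communication, and termination rather than useful work.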
11. Massively Parallel
This refers to hardware comprising a given parallel system with many processors. The meaning of "many" keeps increasing, but currently the largest parallel computers are composed of processors numbering in the hundreds of thousands.
12. Embarrassingly Parallel
Embarrassingly parallel computing means solving many similar but independent tasks simultaneously; there is no need for coordination between the tasks.
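A minimal C/MPI sketch (the workload is hypothetical): each task evaluates its own independent case, and between MPI_Init and MPI_Finalize there are no sends, receives, or barriers at all.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        double param  = 0.1 * rank;           /* each task gets its own input case */
        double result = param * param + 1.0;  /* independent computation, no coordination */
        printf("case %d: result = %.2f\n", rank, result);
        MPI_Finalize();
        return 0;
    }

Parameter sweeps and Monte Carlo runs follow this pattern, which is why they scale almost perfectly.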
13. Scalability
This means the ability to demonstrate a proportionate increase
in parallel speedup of a parallel system with the addition of more
resources. A parallel system may be hardware or software. Factors
that contribute to scalability include hardware (particularly memory),