High Performance Computing - Project Report


HIGH PERFORMANCE COMPUTING
CONTENTS

1. Introduction
2. History of Computing
3. Parallel Computing
4. Classification of Computers
5. High Performance Computing
• Architecture
• Symmetric Multiprocessing
6. Computer Clusters
• Cluster Categorizations
• Basics of Cluster Computing
• Description over HPC
• Cluster Components
• Message Passing Interface
• Parallel Virtual Machine
• Cluster Middleware
• Storage
• Cluster Features
7. Grid Computing
• Cycle Stealing
8. Bibliography
INTRODUCTION

HPC can be boiled down to one thing – SPEED. The goal is to
achieve the maximum amount of computation in the minimum amount of time.
The term HPC refers to the use of parallel supercomputers and computer
clusters, i.e. computing systems composed of multiple processors linked
together in a single system with commercially available interconnects.
Today HPC systems have become a basic need wherever work is
required to be done much more quickly and efficiently. Many organizations
and institutions across the world are incorporating this latest trend.
HISTORY OF COMPUTING

The history of computing is longer than the history of computing


hardware and modern computing technology and includes the history of
methods intended for pen and paper or for chalk and slate, with or without
the aid of tables.

Concrete devices:
Computing is intimately tied to the representation of numbers. But long
before abstractions like number arose, there were mathematical concepts to
serve the purposes of civilization. These concepts are implicit in concrete
practices such as:

• one-to-one correspondence, a rule to count how many items, say on a
tally stick, which was eventually abstracted into number;
• comparison to a standard, a method for assuming reproducibility in a
measurement, for example, the number of coins;
• the 3-4-5 right triangle, a device for assuring a right angle, using
ropes with 12 evenly spaced knots, for example.

Numbers:
Eventually, the concept of numbers became concrete and familiar
enough for counting to arise, at times with sing-song mnemonics to teach
sequences to others. All the known languages have words for at least "one"
and "two", and even some animals like the blackbird can distinguish a
surprising number of items.

Advances in the numeral system and mathematical notation eventually


led to the discovery of mathematical operations such as addition,
subtraction, multiplication, division, squaring, square root, and so forth.
Eventually the operations were formalized, and concepts about the
operations became understood well enough to be stated formally, and even
proven. See, for example, Euclid's algorithm for finding the greatest
common divisor of two numbers.

By the High Middle Ages, the positional Hindu-Arabic numeral


system had reached Europe, which allowed for systematic computation of
numbers. During this period, the representation of a calculation on paper
actually allowed calculation of mathematical expressions, and the tabulation
of mathematical functions such as the square root and the common
logarithm (for use in multiplication and division) and the trigonometric
functions. By the time of Isaac Newton's research, paper or vellum was an
important computing resource, and even in our present time, researchers like
Enrico Fermi would cover random scraps of paper with calculation, to
satisfy their curiosity about an equation. Even into the period of
programmable calculators, Richard Feynman would unhesitatingly compute
any steps which overflowed the memory of the calculators, by hand, just to
learn the answer.

Navigation and astronomy:


Starting with known special cases, the calculation of logarithms and
trigonometric functions can be performed by looking up numbers in a
mathematical table, and interpolating between known cases. For small
enough differences, this linear operation was accurate enough for use in
navigation and astronomy in the Age of Exploration. The uses of
interpolation have thrived in the past 500 years: by the twentieth century
Leslie Comrie and W.J. Eckert systematized the use of interpolation in
tables of numbers for punch card calculation.

In our time, even a student can simulate the motion of the planets, an N-
body differential equation, using the concepts of numerical approximation, a
feat which even Isaac Newton could admire, given his struggles with the
motion of the Moon.
Weather prediction:
The numerical solution of differential equations, notably the Navier-
Stokes equations, was an important stimulus to computing, with Lewis Fry
Richardson's numerical approach to solving differential equations. To this
day, some of the most powerful computer systems on Earth are used for
weather forecasts.

Symbolic computations
By the late 1960s, computer systems could perform symbolic
algebraic manipulations well enough to pass college-level calculus courses.
Using programs like Maple, Macsyma (now Maxima) and Mathematica,
including some open source programs like Yacas, it is now possible to
visualize concepts such as modular forms which were only accessible to the
mathematical imagination before this.
PARALLEL COMPUTING

Parallel computing is the simultaneous execution of some


combination of multiple instances of programmed instructions and data on
multiple processors in order to obtain results faster. The idea is based on the
fact that the process of solving a problem usually can be divided into smaller
tasks, which may be carried out simultaneously with some coordination.

Definition:
A parallel computing system is a computer with more than one
processor for parallel processing. In the past, each processor of a
multiprocessing system always came in its own processor packaging, but
recently-introduced multicore processors contain multiple logical processors
in a single package. There are many different kinds of parallel computers.
They are distinguished by the kind of interconnection between processors
(known as "processing elements" or PEs) and memory. Flynn's taxonomy,
one of the most accepted taxonomies of parallel architectures, classifies
parallel (and serial) computers according to: whether all processors execute
the same instructions at the same time (single instruction/multiple data --
SIMD) or whether each processor executes different instructions (multiple
instruction/multiple data -- MIMD).

One major way to classify parallel computers is based on their


memory architectures. Shared memory parallel computers have multiple
processors accessing all available memory as global address space. They can
be further divided into two main classes based on memory access times:
Uniform Memory Access (UMA), in which access times to all parts of
memory are equal, or Non-Uniform Memory Access (NUMA), in which
they are not. Distributed memory parallel computers also have multiple
processors, but each of the processors can only access its own local memory;
no global memory address space exists across them. Parallel computing
systems can also be categorized by the number of processors in them.
Systems with thousands of such processors are known as massively parallel.
There is also a distinction between "large scale" and "small scale"
parallel processors, depending on the size of the system; e.g., a PC-
based parallel system would generally be considered a small-scale system.
Parallel processor machines are also divided into symmetric and
asymmetric multiprocessors, depending on whether all the processors are the
same or not (for instance if only one is capable of running the operating
system code and others are less privileged).

A variety of architectures have been developed for parallel processing.
For example, a ring architecture has processors linked by a ring structure.
Other architectures include hypercubes, fat trees, systolic arrays, and so on.

Theory and practice:


Parallel computers can be modeled as Parallel Random Access
Machines (PRAMs). The PRAM model ignores the cost of interconnection
between the constituent computing units, but is nevertheless very useful in
providing upper bounds on the parallel solvability of many problems. In
reality the interconnection plays a significant role. The processors may
communicate and cooperate in solving a problem or they may run
independently, often under the control of another processor which distributes
work to and collects results from them (a "processor farm").

Processors in a parallel computer may communicate with each other


in a number of ways, including shared (either multiported or multiplexed)
memory, a crossbar, a shared bus or an interconnect network of a myriad of
topologies including star, ring, tree, hypercube, fat hypercube (a hypercube
with more than one processor at a node), an n-dimensional mesh, etc.
Parallel computers based on interconnect network need to employ some kind
of routing to enable passing of messages between nodes that are not directly
connected. The communication medium used for communication between
the processors is likely to be hierarchical in large multiprocessor machines.
Similarly, memory may be either private to the processor, shared between a
number of processors, or globally shared. A systolic array is an example of a
multiprocessor with fixed-function nodes, local-only memory and no
message routing.

Approaches to parallel computers include multiprocessing, parallel
supercomputers, NUMA vs. SMP vs. massively parallel computer systems,
and distributed computing (especially computer clusters and grid computing).
According to Amdahl's law, parallel processing with x processors is less
efficient, from a computational perspective, than a single x-times-faster
processor. However, since power consumption is a super-linear function of
the clock frequency on modern processors, we are reaching the point where,
from an energy-cost perspective, it can be cheaper to run many low-speed
processors in parallel than a single highly clocked processor.
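
As a quick illustration of Amdahl's law (the standard textbook form, stated here for convenience rather than quoted from this report): if a fraction p of a program can be parallelized and the remaining (1 - p) must run serially, the speedup on N processors is bounded by

    S(N) = 1 / ((1 - p) + p/N)

For example, with p = 0.9 the speedup can never exceed 10 no matter how many processors are added, which is why the serial fraction, not the processor count, ultimately limits large parallel systems.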

Parallel programming:
Parallel programming is the design, implementation, and tuning of
parallel computer programs which take advantage of parallel computing
systems. It also refers to the application of parallel programming methods to
existing serial programs (parallelization). Parallel programming focuses on
partitioning the overall problem into separate tasks, allocating tasks to
processors and synchronizing the tasks to get meaningful results. Parallel
programming can only be applied to problems that are inherently
parallelizable, mostly without data dependence. A problem can be
partitioned based on domain decomposition or functional decomposition, or
a combination.

There are two major approaches to parallel programming: implicit


parallelism, where the system (the compiler or some other program)
partitions the problem and allocates tasks to processors automatically (also
called automatic parallelizing compilers); or explicit parallelism, where the
programmer must annotate their program to show how it is to be partitioned.
Many factors and techniques impact the performance of parallel
programming, especially load balancing, which attempts to keep all
processors busy by moving tasks from heavily loaded processors to less
loaded ones.
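
As a hedged illustration of explicit parallelism through annotation (OpenMP is used here only as a familiar example; this report does not prescribe a particular tool, and the array size is an arbitrary assumption), the following C fragment partitions a loop across the available processors and lets the runtime handle the load distribution:

    #include <stdio.h>

    #define N 1000000

    int main(void)
    {
        static double a[N];
        double sum = 0.0;

        /* The annotation marks the loop as parallelizable; the OpenMP
           runtime partitions the iterations among the processors and
           combines the partial sums (a simple domain decomposition). */
        #pragma omp parallel for reduction(+:sum)
        for (int i = 0; i < N; i++) {
            a[i] = i * 0.5;
            sum += a[i];
        }

        printf("sum = %f\n", sum);
        return 0;
    }

Compiled with an OpenMP-aware compiler (e.g. gcc -fopenmp), the same source still runs serially when the annotation is ignored, which is exactly the appeal of the annotation-based, explicitly parallel approach described above.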

Some people consider parallel programming to be synonymous with


concurrent programming. Others draw a distinction between parallel
programming, which uses well-defined and structured patterns of
communications between processes and focuses on parallel execution of
processes to enhance throughput, and concurrent programming, which
typically involves defining new patterns of communication between
processes that may have been made concurrent for reasons other than
performance. In either case, communication between processes is performed
either via shared memory or with message passing, either of which may be
implemented in terms of the other.

Programs which work correctly in a single CPU system may not do so


in a parallel environment. This is because multiple copies of the same
program may interfere with each other, for instance by accessing the same
memory location at the same time. Therefore, careful programming
(synchronization) is required in a parallel system.
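
A minimal sketch of the kind of careful programming referred to here (plain POSIX threads in C; the counter, loop count and thread count are illustrative assumptions, not values from this report). Without the mutex, the two threads can interleave their updates to the shared memory location and lose increments:

    #include <pthread.h>
    #include <stdio.h>

    static long counter = 0;
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    static void *worker(void *arg)
    {
        (void)arg;
        for (int i = 0; i < 100000; i++) {
            pthread_mutex_lock(&lock);    /* serialize access to the shared counter */
            counter++;
            pthread_mutex_unlock(&lock);
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, worker, NULL);
        pthread_create(&t2, NULL, worker, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        printf("counter = %ld\n", counter);   /* 200000 only because of the lock */
        return 0;
    }

The same interference problem appears, on a larger scale, between processes on different nodes of a parallel machine, which is why synchronization and message-passing discipline are central to parallel programming.
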
CLASSIFICATION OF COMPUTERS

The classification of computers can be described as follows:

1. Mainframe Computers.
Mainframes (often colloquially referred to as Big Iron) are computers
used mainly by large organizations for critical applications, typically bulk
data processing such as census, industry and consumer statistics, ERP, and
financial transaction processing.

The term probably originated from the early mainframes, as they were
housed in enormous, room-sized metal boxes or frames. Later the term was
used to distinguish high-end commercial machines from less powerful units
which were often contained in smaller packages.

Today in practice, the term usually refers to computers compatible with


the IBM System/360 line, first introduced in 1965. (IBM System z9 is IBM's
latest incarnation.) Otherwise, systems with similar functionality but not
based on the IBM System/360 are referred to as "servers". However,
"server" and "mainframe" are sometimes used interchangeably.

Some non-System/360-compatible systems derived from or compatible


with older (pre-web) server technology may also be considered mainframes.
These include the Burroughs large systems and the UNIVAC 1100/2200
series systems. Most large-scale computer system architectures were firmly
established in the 1960s and most large computers were based on
architecture established during that era up until the advent of web servers in
the 1990s.

There were several minicomputer operating systems and architectures that


arose in the 1970s and 1980s, but minicomputers are generally not
considered mainframes. (UNIX is generally considered a minicomputer
operating system even though it has scaled up over the years to match
mainframe characteristics in many ways.)
Thus, the defining characteristic of a “mainframe” appears to be
compatibility with the large computer system architectures established in the 1960s.

2. Minicomputers.
Minicomputer (colloquially, mini) is a largely obsolete term for a class of
multi-user computers that lies in the middle range of the computing
spectrum, in between the largest multi-user systems (mainframe computers)
and the smallest single-user systems (microcomputers or personal
computers). Formerly this class formed a distinct group with its own
hardware and operating systems. While the distinction between mainframe
computers and smaller computers remains fairly clear, contemporary
middle-range computers are not well differentiated from personal computers,
being typically just a more powerful but still compatible version of personal
computer. More modern terms for minicomputer-type machines include
midrange systems (IBM parlance), workstations (Sun Microsystems and
general UNIX/Linux parlance), and servers.

3. Microcomputers.
Although there is no rigid definition, a microcomputer (sometimes
shortened to micro) is most often taken to mean a computer with a
microprocessor (µP) as its CPU. Another general characteristic of these
computers is that they occupy physically small amounts of space. Although
the terms are not synonymous, many microcomputers are also personal
computers (in the generic sense) and vice versa.

The microcomputer came after the minicomputer, most notably replacing


the many distinct components that made up the minicomputer's CPU with a
single integrated microprocessor chip. The early microcomputers were
primitive, the earliest models shipping with as little as 256 bytes of RAM,
and no input / output other than lights and switches. However, as
microprocessor design advanced rapidly and memory became less expensive
from the early 1970s onwards, microcomputers in turn grew faster and
cheaper. This resulted in an explosion in their popularity during the late
1970s and early 1980s.
The increasing availability and power of such computers attracted the
attention of more software developers. As time went on and the industry
matured, the market standardized around IBM PC clones running MS-DOS
(and later Windows).

Modern desktop computers, video game consoles, laptop computers,


tablet PCs, and many types of handheld devices, including mobile phones,
may all be considered examples of microcomputers according to the
definition given above.

4. Supercomputers.
A supercomputer is a computer that led the world (or was close to doing
so) in terms of processing capacity, particularly speed of calculation, at the
time of its introduction. The term "Super Computing" was first used by the
New York World newspaper in 1920 to refer to large custom-built tabulators IBM
made for Columbia University.

The term supercomputer itself is rather fluid, and today's supercomputer


tends to become tomorrow's normal computer. CDC's early machines were
simply very fast scalar processors, some ten times the speed of the fastest
machines offered by other companies. In the 1970s most supercomputers
were dedicated to running a vector processor, and many of the newer players
developed their own such processors at a lower price to enter the market.
The early and mid-1980s saw machines with a modest number of vector
processors working in parallel become the standard. Typical numbers of
processors were in the range 4–16. In the later 1980s and 1990s, attention
turned from vector processors to massive parallel processing systems with
thousands of "ordinary" CPUs, some being off the shelf units and others
being custom designs. (This is commonly and humorously referred to as the
attack of the killer micros in the industry.) Today, parallel designs are based
on "off the shelf" server-class microprocessors, such as the PowerPC,
Itanium, or x86-64, and most modern supercomputers are now highly-tuned
computer clusters using commodity processors combined with custom
interconnects.

Supercomputers are used for highly calculation-intensive tasks such as


problems involving quantum mechanical physics, weather forecasting,
climate research (including research into global warming), molecular
modeling (computing the structures and properties of chemical compounds,
biological macromolecules, polymers, and crystals), physical simulations
(such as simulation of airplanes in wind tunnels, simulation of the detonation
of nuclear weapons, and research into nuclear fusion), cryptanalysis, and the
like. Major universities, military agencies and scientific research laboratories
are heavy users.
HIGH PERFORMANCE COMPUTING

Introduction:
The term high performance computing (HPC) refers to the use of
(parallel) supercomputers and computer clusters, that is, computing systems
comprised of multiple (usually mass-produced) processors linked together in
a single system with commercially available interconnects. This is in
contrast to mainframe computers, which are generally monolithic in nature.
While a high level of technical skill is undeniably needed to assemble and
use such systems, they can be created from off-the-shelf components.
Because of their flexibility, power, and relatively low cost, HPC systems
increasingly dominate the world of supercomputing. Usually, computer
systems in or above the teraflop region are counted as HPC computers.

The term is most commonly associated with computing used for


scientific research. A related term, High-performance technical computing
(HPTC), generally refers to the engineering applications of cluster-based
computing (such as computational fluid dynamics and the building and
testing of virtual prototypes). Recently, HPC has come to be applied to
business uses of cluster-based supercomputers, such as data warehouses,
line-of-business (LOB) applications and transaction processing.

Evolving the "HPC" Concept:


The nomenclature surrounding the "HPC" acronym is evolving. The ‘old’
definition of HPC, High Performance Computing, was the natural semantic
evolution of the 'supercomputing' market, referring to the expanded and
diverse range of platforms, from scalable high-end systems to COTS
clusters, blade servers and of course the traditional vector supercomputers,
used to attack the most complex data- and computation-intensive
applications. A key trend currently taking root is the shift in focus
towards productivity – or more precisely, how systems and technology are
applied. This encompasses everything in the HPC ecosystem, from the
development environment, to systems and storage, to the use and
interoperability of applications, to the total user experience – all combined to
address and solve real-world problems.

The more current and evolving definition of HPC refers to High


Productivity Computing, and reflects the purpose and use model of the
myriad of existing and evolving architectures, and the supporting ecosystem
of software, middleware, storage, networking and tools behind the next
generation of applications.

Architecture:
An HPC cluster uses a multiple-computer architecture that features a
parallel computing system consisting of one or more master nodes and one
or more compute nodes interconnected by a private network. All
the nodes in the cluster are commodity systems – PCs, workstations or
servers – running commodity software such as Linux. The master node
acts as the server for the network file system (NFS) and as a gateway to the outside
world. In order to make the master node highly available to the users, high
availability (HA) clustering might be employed.

The sole task of compute nodes is to execute parallel jobs. In most


cases, therefore, the compute nodes do not have any peripherals connected.
All access and control to the compute nodes are provided via remote
connections, such as network and/or serial port through the master node.
Since compute nodes do not need to access the machines outside the cluster,
nor do the machines outside the cluster need to access the compute nodes
directly, compute nodes commonly use private IP addresses.
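
As a hedged sketch of this layout (the subnet, host names and exported path below are illustrative assumptions, not values taken from this report), the master node might export a shared directory over NFS to compute nodes that sit on a private subnet:

    # /etc/exports on the master node: share /home with the private subnet
    /home 192.168.1.0/24(rw,sync,no_subtree_check)

    # /etc/hosts fragment distributed to every node
    192.168.1.1     master
    192.168.1.11    node01
    192.168.1.12    node02

Because the compute nodes use addresses in the private 192.168.1.0/24 range, they are reachable only through the master node, which matches the access model described above.
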
Symmetric Multiprocessing:
Symmetric multiprocessing, or SMP, is a multiprocessor computer
architecture where two or more identical processors are connected to a single
shared main memory. Most common multiprocessor systems today use SMP
architecture.

SMP systems allow any processor to work on any task no matter


where the data for that task are located in memory; with proper operating
system support, SMP systems can easily move tasks between processors to
balance the workload efficiently.

SMP is one of many styles of multiprocessor machine architecture;


others include NUMA (Non-Uniform Memory Access) which dedicates
different memory banks to different processors allowing them to access
memory in parallel. This can dramatically improve memory throughput as
long as the data is localized to specific processes (and thus processors). On
the downside, NUMA makes the cost of moving data from one processor to
another, as in workload balancing, more expensive. The benefits of NUMA
are limited to particular workloads, notably on servers where the data is
often associated strongly with certain tasks or users.

Other systems include asymmetric multiprocessing (ASMP), in which


separate specialized processors are used for specific tasks, and computer
clustered multiprocessing (e.g. Beowulf), in which not all memory is
available to all processors.

The former is not widely used or supported (though the high-powered


3D chipsets in modern video cards could be considered a form of
asymmetric multiprocessing) while the latter is used fairly extensively to
build very large supercomputers. In this discussion a single-processor
system is denoted as a uniprocessor.

Advantages & Disadvantages:


SMP has many uses in science, industry, and business where software
is usually custom programmed for multithreaded processing. However, most
consumer products such as word processors and computer games are written
in such a manner that they cannot gain large benefits from SMP systems. For
games this is usually because writing a program to increase performance on
SMP systems will produce a performance loss on uniprocessor systems,
which were predominant in the home computer market as of 2007. Due to
the nature of the different programming methods, it would generally require
two separate projects to support both uniprocessor and SMP systems with
maximum performance. Programs running on SMP systems do, however,
experience a performance increase even when they have been written for
uniprocessor systems. This is because hardware interrupts that usually
suspend program execution while the kernel handles them can run on an idle
processor instead. The effect in most applications (e.g. games) is not so
much a performance increase as the appearance that the program is running
much more smoothly. In some applications, particularly software compilers
and some distributed computing projects, one will see an improvement by a
factor of (nearly) the number of additional processors.

In situations where more than one program is running at the same


time, an SMP system will have considerably better performance than a uni-
processor, because different programs can run on different CPUs
simultaneously.

Support for SMP must be built into the operating system. Otherwise,
the additional processors remain idle and the system functions as a
uniprocessor system.

In cases where many jobs are being processed in an SMP


environment, administrators often experience a loss of hardware efficiency.
Software programs have been developed to schedule jobs so that the
processor utilization reaches its maximum potential. Good software
packages can achieve this maximum potential by scheduling each CPU
separately, as well as being able to integrate multiple SMP machines and
clusters.

Access to RAM is serialized; this and cache coherency issues cause
performance to lag slightly behind the number of additional processors in the
system.
COMPUTER CLUSTERS

A computer cluster is a group of tightly coupled computers that work


together closely so that in many respects they can be viewed as though they
are a single computer. The components of a cluster are commonly, but not
always, connected to each other through fast local area networks. Clusters
are usually deployed to improve performance and/or availability over that
provided by a single computer, while typically being much more cost-
effective than single computers of comparable speed or availability.

"A cluster is a logical arrangement of independent entities that


collectively provide a service."

• "Logical arrangement" implies a structured organization. Logical


emphasizes that this organization is not necessarily static. Smart
software and/or hardware are typically involved.
• "Independent entities" implies a level of distinction and function
outside of a cluster context that may involve a system or some
fraction of a system (e.g., an operating system).
• "Provide a service" implies the intended purpose of the cluster.
Elements of pre-service preparation (i.e., provisioning), and post
service teardown, may be involved here.

What is a computer cluster?

A cluster is a collection of networked computers enabling one or more
defined resources to be referenced via a single name.

Computer clusters are groups of computers working together to complete


one task or multiple tasks. Clusters can be used in many different ways.
Some examples of how clusters are used are fault tolerance (high
availability), load balancing, and parallel computing. Many of the bigger
computer clusters out there reach supercomputer status.

In other words, a cluster is a group of computers which work together


toward a final goal. Some would argue that a cluster must at least consist of
a message passing interface and a job scheduler. The message passing
interface works to transmit data among the computers (commonly called
nodes or hosts) in the cluster.
The job scheduler is just what it sounds like. It takes job requests
from user input or other means and schedules them to be run on the number
of nodes required in the cluster. It is possible to have a cluster without either
of these components, however. Consider a cluster built for a single purpose.
There would be no need for a job scheduler and data could be shared among
the hosts with simple methods like a CORBA interface.

History:
The history of cluster computing is best captured by a footnote in
Greg Pfister's In Search of Clusters: "Virtually every press release from
DEC mentioning clusters says 'DEC, who invented clusters...'. IBM did not
invent them either. Customers invented clusters, as soon as they could not fit
all their work on one computer, or needed a backup. The date of the first is
unknown, but it would be surprising if it was not in the 1960s, or even late
1950s."

The formal engineering basis of cluster computing as a means of


doing parallel work of any sort was arguably invented by Gene Amdahl of
IBM, who in 1967 published what has come to be regarded as the seminal
paper on parallel processing: Amdahl's Law. Amdahl's Law describes
mathematically the speedup one can expect from parallelizing any given
otherwise serially performed task on a parallel architecture. This article
defined the engineering basis for both multiprocessor computing and cluster
computing, where the primary differentiator is whether or not the
interprocessor communications are supported "inside" the computer (on for
example a customized internal communications bus or network) or "outside"
the computer on a commodity network.

Consequently the history of early computer clusters is more or less


directly tied into the history of early networks, as one of the primary
motivations for the development of a network was to link computing
resources, creating a de facto computer cluster. Packet switching networks
were conceptually invented by the RAND Corporation in 1962. Using the
concept of a packet switched network, the ARPANET project succeeded in
creating in 1969 what was arguably the world's first commodity-network
based computer cluster by linking four different computer centers (each of
which was something of a "cluster" in its own right, but probably not a
commodity cluster). The ARPANET project grew into the Internet -- which
can be thought of as "the mother of all computer clusters" (as the union of
nearly all of the compute resources, including clusters, that happen to be
connected). It also established the paradigm in use by all computer clusters
in the world today -- the use of packet-switched networks to perform
interprocessor communications between processor (sets) located in
otherwise disconnected frames.

The development of customer-built and research clusters proceeded


hand in hand with that of both networks and the Unix operating system from
the early 1970s, as both TCP/IP and the Xerox PARC project created and
formalized protocols for network-based communications. The Hydra
operating system was built for a cluster of DEC PDP-11 minicomputers
called C.mmp at C-MU in 1971. However, it was not until circa 1983 that
the protocols and tools for easily doing remote job distribution and file
sharing were defined (largely within the context of BSD Unix, as
implemented by Sun Microsystems) and hence became generally available
commercially, along with a shared file system.

The first commercial clustering product was ARCnet, developed by


Datapoint in 1977. ARCnet was not a commercial success and clustering per
se did not really take off until DEC released their VAXcluster product in
1984 for the VAX/VMS operating system. The ARCnet and VAXcluster
products not only supported parallel computing, but also shared file systems
and peripheral devices. They were supposed to give you the advantage of
parallel processing, while maintaining data reliability and uniqueness.
VAXcluster, now VMScluster, is still available on OpenVMS systems from
HP running on Alpha and Itanium systems.

Two other noteworthy early commercial clusters were the Tandem


Himalaya (a circa 1994 high-availability product) and the IBM S/390
Parallel Sysplex (also circa 1994, primarily for business use).

No history of commodity computer clusters would be complete without


noting the pivotal role played by the development of Parallel Virtual
Machine (PVM) software in 1989. This open source software based on
TCP/IP communications enabled the instant creation of a virtual
supercomputer -- a high performance compute cluster -- made out of any
TCP/IP connected systems. Free form heterogeneous clusters built on top of
this model rapidly achieved total throughput in FLOPS that greatly exceeded
that available even with the most expensive "big iron" supercomputers.
PVM and the advent of inexpensive networked PCs led, in 1993, to a NASA
project to build supercomputers out of commodity clusters. In 1995 came the
invention of the "Beowulf"-style cluster -- a compute cluster built on top of a
commodity network for the specific purpose of "being a supercomputer"
capable of performing tightly coupled parallel HPC computations. This in
turn spurred the independent development of Grid computing as a named
entity, although Grid-style clustering had been around at least as long as the
Unix operating system and the Arpanet, whether or not it, or the clusters that
used it, were named.

Cluster categorizations:

1. High-availability (HA) cluster.


High-availability clusters (also known as failover clusters) are
implemented primarily for the purpose of improving the availability of
services which the cluster provides. They operate by having redundant
nodes, which are then used to provide service when system components fail.
The most common size for an HA cluster is two nodes, which is the
minimum requirement to provide redundancy. HA cluster implementations
attempt to manage the redundancy inherent in a cluster to eliminate single
points of failure. There are many commercial implementations of High-
Availability clusters for many operating systems. The Linux-HA project is
one commonly used free software HA package for the Linux OSs.

2. Load-balancing cluster.
Load-balancing clusters operate by having all workload come through
one or more load-balancing front ends, which then distribute it to a
collection of back end servers. Although they are primarily implemented for
improved performance, they commonly include high-availability features as
well. Such a cluster of computers is sometimes referred to as a server farm.
There are many commercial load balancers available including Platform LSF
HPC, Sun Grid Engine, Moab Cluster Suite and Maui Cluster Scheduler.
The Linux Virtual Server project provides one commonly used free software
package for the Linux OS.

3. High-performance computing (HPC) clusters.


High-performance computing (HPC) clusters are implemented
primarily to provide increased performance by splitting a computational task
across many different nodes in the cluster, and are most commonly used in
scientific computing. Such clusters commonly run custom programs which
have been designed to exploit the parallelism available on HPC clusters.
HPCs are optimized for workloads which require jobs or processes
happening on the separate cluster computer nodes to communicate actively
during the computation. These include computations where intermediate
results from one node's calculations will affect future calculations on other
nodes.

One of the most popular HPC implementations is a cluster with nodes


running Linux as the OS and free software to implement the parallelism.
This configuration is often referred to as a Beowulf cluster.

Microsoft offers Windows Compute Cluster Server as a high-


performance computing platform to compete with Linux.

Many software programs running on High-performance computing


(HPC) clusters use libraries such as MPI which are specially designed for
writing scientific applications for HPC computers.

Basics of Cluster Computing

Cluster computing refers to technologies that allow multiple


computers, called cluster nodes, to work together with the aim to solve
common computing problems. Generic cluster architecture is shown in
Figure. Each node can be a single or multiprocessor computer, such as a PC,
workstation or SMP server, equipped with its own memory, I/O devices and
operating system. A cluster whose nodes are similar is called homogeneous;
otherwise it is heterogeneous.

The nodes are usually interconnected by local area network (LAN)


based on one of the following technologies: Ethernet, Fast Ethernet, Gigabit
Ethernet, Myrinet, Quadrics Network (QsNet), InfiniBand communication
fabric, Scalable Coherent Interface (SCI), Virtual Interface Architecture
(VIA) or Memory Channel.

The speed of network technology is characterized by a bandwidth and


latency. Bandwidth means how much information can be sent through a
particular network connection and latency is defined as the time it takes for a
networking device to process a data frame. Note that a higher network speed
is usually associated with a higher price of related equipment. To further
improve cluster performance, different network topologies can be
implemented in each particular case. Moreover, channel bonding technology
can be used in the case of Ethernet-type networking to double the
network bandwidth.

To realize this technology, two network interface cards (NICs) should
be installed in each node, and two network switches should be used, one for
each channel, to form two separate virtual networks. The optimal choice of
the network type is dictated by demands on speed and volume of data
exchange between several parts of the application software, running on
different nodes.
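
A minimal sketch of such a channel-bonded configuration on Linux (Debian-style ifupdown syntax with the ifenslave package; the interface names and addresses are illustrative assumptions):

    # /etc/network/interfaces fragment on a compute node
    auto bond0
    iface bond0 inet static
        address 192.168.1.11
        netmask 255.255.255.0
        bond-slaves eth0 eth1    # the two NICs, one per switch / virtual network
        bond-mode balance-rr     # round-robin striping across both links
        bond-miimon 100          # link monitoring interval in milliseconds

With both links active, traffic is striped across the two NICs, which is how the doubling of usable bandwidth mentioned above is obtained in practice.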

Various operating systems, including Linux, Solaris and Windows,


can be used to manage the nodes. However, in order for the clusters to be
able to pool their computing resources, special cluster enabled applications
must be written using clustering libraries or a system level middleware [13]
should be used. The most popular clustering libraries are PVM (Parallel
Virtual Machine) [14] and MPI (Message Passing Interface) [15]; both are
very mature and work well. By using PVM or MPI, programmers can design
applications that can span across an entire cluster's computing resources
rather than being confined to the resources of a single machine. For many
applications, PVM and MPI allow computing problems to be solved at a rate
that scales almost linearly in relation to the number of processors in the
cluster.
The cluster architecture is usually optimized for High Performance
Computing or High Availability Computing. The choice of the architecture
is dictated by the type of an application and available budget. A
combination of both approaches is utilized in some cases, resulting in a
highly reliable system characterized by very high performance. The
principal difference between the two approaches is that in the HPC case
each node in the cluster executes a part of the common job, whereas in the
HA case several nodes perform, or are ready to perform, the same job and
are thus able to substitute for each other in case of failure.

High availability (HA) clusters are used in mission critical


applications to have constant availability of services to end-users through
multiple instances of one or more applications on many computing nodes.
Such systems found their application as Web servers, e-commerce engines
or database servers. HA clusters use redundancy to ensure that a service
remains running, so that even when a server fails or must go offline for
service, the other servers pick up the load. The system optimized for
maximum availability should not have any single point of failure, thus
requiring a specific architecture (Figure).

Two types of HA clusters can be distinguished - shared nothing


architecture and shared disk architecture. In the first case, each computing
node is using dedicated storage, whereas the second type of HA cluster
shares common storage resources, interconnected by Storage Area Network
(SAN). The operation of an HA cluster normally requires special software
that is able to recognize a problem as it occurs and transparently migrate
the job to another node.

HPC clusters are built to improve processing throughput in order to


handle multiple jobs of various sizes and types or to increase performance.
The most common HPC clusters are used to shorten turnaround times on
compute-intensive problems by running the job on multiple nodes at the
same time or when the problem is just too big for a single system. This is
often the case in scientific, design analysis and research computing, where
the HPC cluster is built purely to obtain maximum performance during the
solution of a single, very large problem. Such HPC clusters utilize
parallelized software that breaks down the problem into smaller parts, which
are dispatched across a network of interconnected systems that concurrently
process each small part and then communicate with each other using
message-passing libraries to coordinate and synchronize their results. The
Beowulf-type cluster [17], which will be described in the next section, is an
example of an HPC system. A Beowulf system is a cluster built
primarily out of commodity hardware components, running a free-
software operating system like Linux or FreeBSD, and interconnected by a
private high-speed network. However, some Linux clusters, which are built
for high availability instead of speed, are not Beowulfs.

While Beowulf clusters are extremely powerful, they are not for
everyone.

The primary drawback of Beowulf clusters is that they require


specially designed software in order to take advantage of cluster resources.
This is generally not a problem for those in the scientific and research
communities who are used to writing their own special purpose applications
since they can use PVM or MPI libraries to create cluster-aware
applications. However, many potential users of the cluster technologies
would like to have some kind of performance benefit using standard
applications. Since such applications have not been written with the use of
PVM or MPI libraries, such users simply cannot take advantage of a cluster.
This problem has limited the use of cluster technologies to a small
group of users for years. Recently, a new technology called openMosix
[18] has appeared that allows standard applications to take advantage of
clustering without being rewritten or even recompiled.

OpenMosix is a "patch" to the standard Linux kernel, which adds


clustering abilities and allows any standard Linux process to take advantage
of a cluster's resources. OpenMosix uses adaptive load balancing techniques
and allows processes running on one node in the cluster to migrate
transparently to another node where they can execute faster. Because
OpenMosix is completely transparent to all running programs, the process
that has been migrated does not even know that it is running on another
remote node. This transparency means that no special programming is
required to take advantage of OpenMosix load-balancing technology. In
fact, a default OpenMosix installation will migrate processes to the best
node automatically. This makes OpenMosix a clustering solution that can
provide an immediate benefit for many applications.

A cluster of Linux computers running OpenMosix can be considered


as a large virtual SMP system, with some exceptions. The CPUs on a "real"
SMP system can exchange data very fast, but with OpenMosix, the speed at
which nodes can communicate with one another is determined by the speed
of the network. Besides, OpenMosix does not currently offer support for
allowing multiple cooperating threads to be separated from one another.
Also, like an SMP system, OpenMosix cannot execute a single process on
multiple physical CPUs at the same time. This means that OpenMosix will
not be able to speed up a single process/program, except by migrating it to a
node where it can execute most efficiently. At the same time, OpenMosix
can migrate most standard Linux processes between nodes and, thus, allows
for extremely scalable parallel execution at the process level. Besides, if an
application forks many child processes then OpenMosix will be able to
migrate each one of these processes to an appropriate node in the cluster.
Thus, OpenMosix provides a number of benefits over traditional
multiprocessor systems.

The OpenMosix technology can work in both homogeneous and


heterogeneous environments, thus allowing building clusters, consisting of
tens or even hundreds of nodes, using inexpensive PC hardware as well as a
bunch of high-end multi-processor systems. The use of OpenMosix together
with Intel's Hyper-Threading technology, available with the latest
generation of Intel Xeon processors, allows a further improvement in
performance for threaded applications. Existing MPI/PVM programs
can also benefit from OpenMosix technology.

Description over High Performance Computing (HPC):

High Performance Computing (HPC) uses clusters of inexpensive,


high performance processing blocks to solve difficult computational
problems. Historically, HPC Linux cluster technology was incubated in
scientific computing circles, where complex problems could only be tackled
by individuals possessing domain knowledge and the know-how to build,
debug and manage their own clusters. As a result, they were mainly used to
tackle computationally challenging scientific work in government agencies
and university research labs.
That technology has evolved such that today, the performance,
scalability, flexibility, and reliability benefits of HPC clusters are being
realized in nearly all businesses where rigorous analytics are being applied
to simulation and product modeling. HPC clustering provides these
businesses with a scalable fabric of servers that can be allocated on an as-
needed basis to provide unprecedented computational throughput.
HPC provides enterprises and organizations with a productive, simple
and hardware agnostic HPC system enabling administrators to install,
monitor and manage the cluster as a single system, from a single node - the
Master. Through the Master, thousands of systems can be managed as if
they were a single, consistent, virtual system, dramatically simplifying
deployment and management and significantly improving data center
resource utilization and server performance.
High Performance Computing employs a unique architecture based on
three principles that, combined, deliver unparalleled productivity and lower
TCO.

• The operating environment deployed to the compute nodes is provisioned


"Stateless", directly to memory.
• The compute node operating environment is lightweight, stripped of
unnecessary software, overhead and vulnerabilities.
• A simple operating system extension virtualizes the cluster into a pool of
processors operating as if it were a single virtual machine.

The result is a highly efficient, more reliable and scalable system,


capable of processing more work in less time, while being vastly simpler to
use and maintain. Its powerful unified process space means that end users
can easily and intuitively deploy, manage and run complex applications
from the Master, as if the servers were a single virtual machine. The
compute servers are fully transparent and directly accessible if need be. But
if you only care about the compute capacity presented at the single Master
node, you need never look further than this one machine.
Cluster components:

1. Software components.

Single System Image (SSI):

• A single system image is the illusion, created by software or


hardware, that presents a collection of resources as one, more
powerful resource
• SSI makes the cluster appear like a single machine to the user,
to applications, and to the network.

MOSIX:

• SSI approach, for Linux


• With MOSIX a Linux cluster appears as a single multiprocessor
machine
• Load balancing and job migration are supported
• Comes as a patch for the 2.4 Linux kernel and a set of user
land tools
• No special APIs are used, applications are executed the same
way as on real SMPs (symmetric multiprocessing machines).

Shared memory architectures:

• SMP: System with n processors with access to the same


physical memory
• Distributed shared memory (DSM): the illusion of a system of n
processors (each with its own physical memory) having access
to a global (shared) memory
• In both systems, different threads of execution communicate
via shared memory
DSM:

• Problem of maintaining consistency of the global memory


• Each process may have a copy of a page/segment
• How to maintain a global view?
• Locking mechanisms are required to prevent concurrent write
access
Beowulf Project:

• Project for building cheap Linux-based cluster systems


• A set of Open Source tools for cluster computing
• Message passing libraries are used for node communication
• Implementations of common message passing systems (PVM
and MPI) are part of a Beowulf system
• Opposed to MOSIX, each node in the cluster appears as a
computer with its own OS and hardware resources

Message passing vs. Distributed shared memory:

• Most parallel/distributed applications today are based on


message passing
• DSM makes programming easier but may not scale well with
the numbers of processors due to undesired communication
patterns.
Message Passing Systems:

Message Passing:

• The processors of a parallel system communicate by


exchanging messages.
• Each processor has a mailbox for receiving incoming
messages (Messages don’t get lost).
• Receiving messages can be blocking/non-blocking
synchronous or asynchronous.

APIs:

There are two main APIs in use:

• Message Passing Interface (MPI)
• Parallel Virtual Machine (PVM)

Message Passing Interface:

The Message Passing Interface (MPI) is a language-independent


communications protocol used to program parallel computers. Although
MPI belongs in layers 5 and higher of the OSI Reference Model,
implementations may cover most layers of the reference model, with sockets
and TCP/IP being used in the transport layer. MPI is not sanctioned by any
major standards body; nevertheless, it has become the de facto standard for
communication among processes that model a parallel program running on a
distributed memory system. Actual distributed memory supercomputers such
as computer clusters often run these programs. The principal MPI-1 model
has no shared memory concept, and MPI-2 has only a limited distributed
shared memory concept.
The advantages of MPI over older message passing libraries are
portability (because MPI has been implemented for almost every distributed
memory architecture) and speed (because each implementation is in
principle optimized for the hardware on which it runs). MPI is supported on
shared-memory and NUMA (Non-Uniform Memory Access) architectures
as well, where it often serves not only as important portability architecture,
but also helps achieve high performance in applications that are naturally
owner-computes oriented. However, it has also been criticized for being too
low level and difficult to use. Despite this complaint, it remains a crucial
part of parallel programming, since no effective alternative has come forth to
take its place.
MPI is a specification, not an implementation. MPI has Language
Independent Specifications (LIS) for the function calls and language
bindings. There are two versions of the standard that are currently popular:
version 1.2, which emphasizes message passing and has a static runtime
environment (fixed size of world), and, MPI-2.1, which includes new
features such as scalable file I/O, dynamic process management and
collective communication with two groups of processes. The MPI interface
is meant to provide essential virtual topology, synchronization and
communication functionality between a set of processes (that have been
mapped to nodes/servers/computer instances) in a language-independent
way, with language specific syntax (bindings).
MPI guarantees that there is progress of asynchronous messages
independent of the subsequent calls to MPI made by user processes
(threads). The relative value of overlapping communication and computation,
asynchronous vs. synchronous transfers and low latency vs. low overhead
communication remain important controversies in the MPI user and
implementer communities. MPI also specifies thread safe interfaces, which
have cohesion and coupling strategies that help avoid the manipulation of
unsafe hidden state within the interface.
There has been research over time into implementing MPI directly
into the hardware of the system, for example by means of Processor-in-
memory, where the MPI operations are actually built into the micro circuitry
of the RAM chips in each node. By implication, this type of implementation
would be independent of the language, OS or CPU on the system, but cannot
be readily updated or unloaded. Another approach has been to add hardware
acceleration to one or more parts of the operation. This may include
hardware processing of the MPI queues or the use of RDMA to directly
transfer data between memory and the network interface without needing
CPU or kernel intervention.
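
As a brief, hedged illustration (a minimal MPI program in C following the common send/receive pattern; the message contents are arbitrary), every process calls MPI_Init, learns its rank within the communicator, exchanges messages, and calls MPI_Finalize:

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char *argv[])
    {
        int rank, size;
        MPI_Status status;

        MPI_Init(&argc, &argv);                 /* start the MPI runtime */
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* this process's id */
        MPI_Comm_size(MPI_COMM_WORLD, &size);   /* total number of processes */

        if (rank == 0) {
            int greeting = 42;
            /* rank 0 sends one integer to every other rank */
            for (int dest = 1; dest < size; dest++)
                MPI_Send(&greeting, 1, MPI_INT, dest, 0, MPI_COMM_WORLD);
        } else {
            int value;
            MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
            printf("rank %d of %d received %d\n", rank, size, value);
        }

        MPI_Finalize();
        return 0;
    }

Such a program is typically compiled with an MPI wrapper compiler (for example mpicc) and launched across the cluster nodes with mpirun, which is how message passing between processes on a distributed-memory system is realized in practice.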

Parallel Virtual Machine:

PVM (Parallel Virtual Machine) is a software package that permits a


heterogeneous collection of UNIX and/or Windows computers hooked
together by a network to be used as a single large parallel computer. Thus
large computational problems can be solved more cost effectively by using
the aggregate power and memory of many computers. The software is very
portable. PVM enables users to exploit their existing computer hardware to
solve much larger problems at minimal additional cost. Hundreds of sites
around the world are using PVM to solve important scientific, industrial, and
medical problems in addition to PVM's use as an educational tool to teach
parallel programming. With tens of thousands of users, PVM has become
the de facto standard for distributed computing world-wide. PVM is an
integrated set of software tools and libraries that emulates a general-purpose,
flexible, heterogeneous concurrent computing framework on interconnected
computers of varied architecture. The overall objective of the PVM system
is to enable such a collection of computers to be used cooperatively for
concurrent or parallel computation.
The PVM system is composed of two parts. The first part is a daemon,
called pvmd3 and sometimes abbreviated pvmd that resides on all the
computers making up the virtual machine. (An example of a daemon
program is the mail program that runs in the background and handles all the
incoming and outgoing electronic mail on a computer). The second part of
the system is a library of PVM interface routines. It contains a functionally
complete repertoire of primitives that are needed for cooperation between
tasks of an application. This library contains user-callable routines for
message passing, spawning processes, coordinating tasks, and modifying the
virtual machine.
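
The routines mentioned above are used roughly as follows. This is a hedged sketch only (a minimal master/worker pair written against the pvm3 C library; the executable name, message tag and spawn count are illustrative assumptions):

    #include <pvm3.h>
    #include <stdio.h>

    #define MSGTAG 1

    int main(void)
    {
        int mytid = pvm_mytid();       /* enroll this process in the virtual machine */
        int parent = pvm_parent();     /* tid of the task that spawned us, if any */

        if (parent == PvmNoParent) {
            /* master: spawn one worker (this same binary, assumed to be
               installed as "pvm_hello" in PVM's search path) and wait */
            int child;
            pvm_spawn("pvm_hello", NULL, PvmTaskDefault, "", 1, &child);

            int answer;
            pvm_recv(child, MSGTAG);           /* blocking receive */
            pvm_upkint(&answer, 1, 1);         /* unpack one integer */
            printf("master %x got %d from task %x\n", mytid, answer, child);
        } else {
            /* worker: pack a value and send it back to the master */
            int value = 123;
            pvm_initsend(PvmDataDefault);
            pvm_pkint(&value, 1, 1);
            pvm_send(parent, MSGTAG);
        }

        pvm_exit();                    /* leave the virtual machine */
        return 0;
    }

The daemon (pvmd3) running on each host carries the message between the two tasks, while the calls above come from the PVM interface library described in the text.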

The PVM computing model is based on the notion that an


application consists of several tasks. Each task is responsible for a part of the
application's computational workload. Sometimes an application is
parallelized along its functions; that is, each task performs a different
function, for example, input, problem setup, solution, output, and display.
This process is often called functional parallelism. A more common method
of parallelizing an application is called data parallelism. In this method all
the tasks are the same, but each one only knows and solves a small part of
the data. This is also referred to as the SPMD (single-program multiple-data)
model of computing. PVM supports either or a mixture of these methods.
Depending on their functions, tasks may execute in parallel and may need to
synchronize or exchange data, although this is not always the case.
An exemplary diagram of the PVM computing model is shown in
Figure and an architectural view of the PVM system, highlighting the
heterogeneity of the computing platforms supported by PVM, is shown in
Figure

The principles upon which PVM is based include the following:


• User-configured host pool
• Translucent access to hardware
• Process-based computation
• Explicit message-passing model
• Heterogeneity support
• Multiprocessor support

Differences between PVM and MPI:

• MPI is a standard; PVM is not
• MPI has richer support for collective operations
• MPI supports more modes of sending messages
• PVM can dynamically spawn processes; in MPI-1.x, processes are
created once at startup
• MPI has more support for dedicated hardware
• There are several implementations of the MPI standard
Commercial software:

• Load Leveler - IBM Corp., USA
• LSF (Load Sharing Facility) - Platform Computing, Canada
• NQE (Network Queuing Environment) - Craysoft Corp., USA
• Open Frame - Centre for Development of Advanced Computing, India
• RWPC (Real World Computing Partnership), Japan
• UnixWare (SCO - Santa Cruz Operation), USA
• Solaris MC (Sun Microsystems), USA
• Cluster Tools (a number of free HPC cluster tools from Sun)
Cluster Middleware:

Middleware is generally considered the layer of software sandwiched
between the operating system and applications. Middleware provides
various services required by an application to function correctly.
Middleware has been around since the 1960s. More recently, it has
re-emerged as a means of integrating software applications running in a
heterogeneous environment. There is a large overlap between the
infrastructure that provides a cluster with high-level Single System
Image (SSI) services and what is traditionally viewed as middleware: in a
cluster, middleware can be described as the software that resides above
the kernel and the network and provides services to applications.

Heterogeneity can arise in at least two ways in a cluster environment.
First, as clusters are typically built from commodity workstations, the
hardware platform can become heterogeneous. As the cluster is
incrementally expanded using newer generations of the same computer
product line, or even using hardware components that have a very different
architecture, problems related to these differences are introduced. For
example, a typical problem that must be resolved in heterogeneous hardware
environments is the conversion of numeric values to the correct byte
ordering. A second way that clusters become heterogeneous is the
requirement to support very different applications. Examples of this include
applications that integrate software from different sources, or require access
to data or software outside the cluster. In addition, a requirement to develop
applications rapidly can exacerbate the problems inherent with
heterogeneity. Middleware has the ability to help the application developer
overcome these problems. In addition, middleware also provides
services for the management and administration of a heterogeneous system.
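
As a small illustration of the byte-ordering problem mentioned above, the
sketch below uses the standard htonl()/ntohl() routines to convert a
32-bit value to and from network byte order. This is just one common
convention that middleware can adopt, not a description of any specific
package.

/* byteorder.c - sketch of byte-order conversion between heterogeneous nodes.
 * Senders convert to network (big-endian) order and receivers convert back,
 * so little-endian and big-endian machines interpret the value identically.
 */
#include <stdio.h>
#include <stdint.h>
#include <arpa/inet.h>   /* htonl() / ntohl() */

int main(void)
{
    uint32_t host_value = 0x12345678;          /* value in this node's native order */
    uint32_t wire_value = htonl(host_value);   /* what would travel over the network */
    uint32_t back       = ntohl(wire_value);   /* what the receiving node recovers   */

    printf("host: 0x%08x  wire: 0x%08x  recovered: 0x%08x\n",
           host_value, wire_value, back);
    return 0;
}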

2. Hardware Components:

High-density rack-mounted servers are the most popular configuration for
today's HPC cluster environments. Besides the compute nodes, each rack
could be equipped with network switches, UPS, PDU, and so on. Fig. 1
illustrates a typical HPC cluster configuration. The left of the rack shows
the possible interconnections for the compute nodes. For applications where
communication bandwidth between nodes is critical, low-latency and
high-bandwidth interconnects such as Gigabit Ethernet, Myrinet and
InfiniBand are common choices for connecting the compute nodes.

Several connections on the right of the rack in Fig. 2 represent the
connections for cluster monitoring and management. The serial port and the
BMC provide console redirection, an additional means of monitoring and
managing the compute nodes from the master without relying on network
connectivity or interfering with network activities. A DRAC could be used
for remote management, along with a KVM that accesses the compute nodes
through a non-traditional (IP-based) switch using CAT-5 cables and TCP/IP
networking.
Nodes:

• Computing nodes.
• Master nodes.

The CPUs most frequently used in the nodes are from Intel and AMD: Xeon
and Itanium processors from Intel, and Opteron processors from AMD.

In a cluster, each node can be a single or multiprocessor computer,
such as a PC, workstation or SMP server, equipped with its own memory,
I/O devices and operating system.

The nodes are interconnected by a LAN using one of the following
technologies: Ethernet, Fast Ethernet, Gigabit Ethernet, Myrinet or an
InfiniBand communication fabric.
Gigabit Ethernet:

Gigabit Ethernet is a transmission technology based on the Ethernet frame
format and protocol used in local area networks; it provides a data rate
of 1 billion bits per second (1 gigabit). Gigabit Ethernet is defined in
the IEEE 802.3 standards.

Gigabit Ethernet is carried primarily on optical fiber. Existing
Ethernet LANs with 10 and 100 Mbps cards can feed into a Gigabit Ethernet
backbone. Beyond the 10 and 100 Mbps cards, a newer standard, 10 Gigabit
Ethernet, is also becoming available.

InfiniBand:

InfiniBand is a switched fabric communications link primarily used in
high-performance computing. Its features include quality of service and
failover, and it is designed to be scalable. The InfiniBand architecture
specification defines a connection between processor nodes and high
performance I/O nodes such as storage devices. It is a superset of the Virtual
Interface Architecture. It is an architecture and specification for data flow
between processors and I/O devices that promises high data bandwidth and
almost unlimited expandability in tomorrow's computer systems.
Like Fibre Channel, PCI Express, Serial ATA, and many other
modern interconnects, InfiniBand is a point-to-point bidirectional serial link
intended for the connection of processors with high speed peripherals such
as disks. It supports several signaling rates and, as with PCI Express, links
can be bonded together for additional bandwidth.
It is expected to gradually replace the existing peripheral component
interconnect (PCI) shared-bus approach used in most of today's PCs and
servers. Offering up to 2.5 Gbit/s per link and supporting up to 64,000
addressable devices, it has increased reliability, better sharing between
clustered processors and built-in security.
Myrinet:
Myrinet, ANSI/VITA 26-1998, is a high-speed local area networking
system designed by Myricom to be used as an inter-connect between
multiple machines to form computer clusters. Myrinet has much less
protocol overhead than standards such as Ethernet, and therefore provides
better throughput, less interference, and less latency while using the host
CPU. Although it can be used as a traditional networking system, Myrinet is
often used directly by programs that "know" about it, thereby bypassing a
call into the operating system.

Myrinet physically consists of two fiber optic cables, upstream and
downstream, connected to the host computers with a single connector.
Machines are connected via low-overhead routers and switches, as opposed
to connecting one machine directly to another. Myrinet includes a number of
fault-tolerance features, mostly backed by the switches. These include flow
control, error control, and "heartbeat" monitoring on every link. The newest,
"fourth-generation" Myrinet, called Myri-10G, supports a 10 Gbit/s data rate
and is inter-operable with 10 Gigabit Ethernet on PHY, the physical layer
(cables, connectors, distances, signaling). Myri-10G started shipping at the
end of 2006.
Myrinet characteristics:

• Flow control, error control, and heartbeat continuity
monitoring on every link.
• Low-latency, cut-through switches with monitoring for
high-availability applications.

Storage:
1. DAS:
Direct-attached storage (DAS) refers to a digital storage system
directly attached to a server or workstation, without a storage network
in between. It is a retronym, mainly used to differentiate non-
networked storage from SAN and NAS.

2. SAN:

In computing, a storage area network (SAN) is an architecture to
attach remote computer storage devices such as disk arrays, tape libraries
and optical jukeboxes to servers in such a way that, to the operating
system, the devices appear as locally attached devices. Although cost and
complexity are dropping, as of 2007 SANs were still uncommon outside
larger enterprises.
By contrast to a SAN, network-attached storage (NAS) uses file-
based protocols such as NFS or SMB/CIFS where it is clear that the
storage is remote, and computers request a portion of an abstract file
rather than a disk block.

3. NAS:
Network-attached storage (NAS) is file-level data storage
connected to a computer network, providing data access to
heterogeneous network clients. NAS hardware is similar to a
traditional file server equipped with direct-attached storage;
however, it differs considerably on the software side. The operating
system and other software on the NAS unit provide only the
functionality of data storage, data access and the management of
these functionalities. Use of NAS devices for other purposes (like
scientific computations or running a database engine) is strongly
discouraged. Many vendors also
purposely make it hard to develop or install any third-party software
on their NAS device by using closed source operating systems and
protocol implementations. In other words, NAS devices are server
appliances.
Cluster features:
In a cluster system, it is important to eliminate any single point of failure
in terms of hardware. Beyond this, data integrity and system health
checking are very important. As a long-term investment, a cluster should be
able to accept additional nodes in the future in order to minimize the TCO.

No Single Point of Failure:


The cluster can operate from two independent machines, providing
complete redundancy with no single point of failure (SPOF).
Shared storage can use RAID or even multiple disk arrays to achieve
a high level of storage redundancy. Multiple heartbeat communication
channels between the cluster machines are supported.

Data integrity and manageability:


The cluster works independently of the type of Linux file system and
volume used, including journaling file systems and software RAID
drivers. This ensures file systems are protected with all sorts of
storage software without the need for reconfiguration or data
migration. The use of journaling file systems also enables a fast
recovery time without the need to run through a lengthy fsck check.
Application monitoring:
It is important to monitor application availability, since applications
can fail for various reasons. The cluster offers service monitoring agents
(SMAs) which can execute custom status-check scripts that verify the
availability of a specific application. Monitoring agents are available
for common databases, middleware and services.
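
A custom status check of the kind such an agent might run could look
roughly like the hypothetical sketch below, which simply reports whether a
local service accepts TCP connections on a given port. The port number and
the TCP-probe approach are assumptions for illustration, not the product's
actual agent.

/* svc_check.c - a hypothetical status check: succeed (exit 0) if a local
 * service accepts TCP connections on the given port, fail (exit 1) otherwise.
 * A monitoring agent could run a check like this periodically.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <sys/socket.h>

int main(int argc, char *argv[])
{
    int port = (argc > 1) ? atoi(argv[1]) : 5432;   /* port is illustrative */
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0)
        return 1;

    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_port   = htons(port);
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);  /* check the local node */

    int ok = connect(fd, (struct sockaddr *)&addr, sizeof(addr)) == 0;
    close(fd);

    printf("service on port %d is %s\n", port, ok ? "up" : "down");
    return ok ? 0 : 1;
}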

Direct Linux kernel communications:


This ensures the Linux kernel itself is monitored by the cluster software.
Cluster communication is handled by kernel modules instead of user-space
applications, an approach that is highly reliable and safe against crashes
in user space.
Robust user interface:
The user interface is an interactive, menu-driven interface available both
as a text console and as a graphical Java console, with local or remote
access. This ensures the cluster can be monitored and controlled by a
system administrator anytime, anywhere and over any type of connection.

Commodity hardware:
The cluster runs on x86-based commodity hardware or even PowerPC-based
hardware; no proprietary architecture is needed. This protects the future
investment in your e-business applications and cluster software.

About Computer Clusters and their performance:

• Clusters Evaluation: This is a good paper, but very old (1995).
Nevertheless, it points out the various parameters on which cluster
performance should be analyzed. It also discusses commercially available
clustering software and its evaluation.

• Scalable Cluster Computing with MOSIX for Linux: A great paper
about great software. If you are thinking of using a cluster of
workstations, this is a definite value-add.
• Scalability Limitations of VIA Based Technologies in supporting MPI
• Implementation and Evaluation of MPI on an SMP Cluster
• PVM Guide
• MPICH related articles, tutorials and software
• Gigabit Performance Evaluation
• Cluster Performance for various processors
• Comparison between MPI and PVM
• Performance Evaluation of LAM, MPICH, and MVICH on a Linux
cluster connected by a Gigabit Ethernet network
• Another useful link for parallel computing resources
Clusters often are confused with traditional massively parallel
systems and conventional distributed systems. A cluster is a type of parallel
or distributed computer system that forms--to varying degrees--a single,
unified resource composed of several interconnected computers. Each
interconnected computer has one or more processors, I/O capabilities, an
operating-system kernel, and memory. The key difference between
clustering and traditional distributed computing is that the cluster has a
strong sense of membership. Like team members, the nodes of a cluster
typically are peers, with all nodes agreeing on who the current participants
are. This sense of membership becomes the basis for providing availability,
scalability, and manageability.

However, capabilities beyond membership show up in different
cluster solutions--and have spawned different classes of clustering solutions.
Before exploring different approaches, it's helpful to compare clustering
with traditional distributed computing and scaling solutions such as
symmetric multiprocessing (SMP) and non-uniform memory addressing
(NUMA).

Although there are minor hardware differences, the principal
distinction between clusters and distributed computing is in software. In
hardware, a cluster is more likely to share storage between computers and
use a high bandwidth, low-latency, reliable interconnect. From a software
perspective, the key difference between clusters and distributed computing
is the strong sense of membership that exists in the cluster. Distributed
computing, like client-server computing, relies on pairwise connections,
while clusters are composed of peer nodes. Another key software difference
is that clusters tend to be single administrative domains, which allows
cluster implementations to avoid the security overhead and complexity
required by distributed computing. Because a cluster's goal is to act as a
single, reliable, scalable server, its machines are not usually distributed
physically.

Another important distinction between clusters and distributed
computing is the way remote resources are accessed. In a cluster, it is likely
that these resources will be accessed transparently. In contrast, distributed
computing often has heavyweight, cumbersome, and complicated
interprocessor communications (IPC) capabilities between computers.
Clusters--in particular, full clusters--use single-node IPC paradigms such as
pipes and message queues along with traditional TCP/IP sockets.
Advantages:
Earlier in this chapter we have discussed the reasons why we would
want to put together a high performance cluster, that of providing a
computational platform for all types of parallel and distributed applications.
The class of applications that a cluster can typically cope with would be
considered grand challenge or super-computing applications. GCAs (Grand
Challenge Applications) are fundamental problems in science and
engineering with broad economic and scientific impact. They are generally
considered intractable without the use of state-of-the-art parallel computers.
The scale of their resource requirements, such as processing time, memory,
and communication needs distinguishes GCAs. A typical example of a grand
challenge problem is the simulation of some phenomena that cannot be
measured through experiments. GCAs include massive crystallographic and
microtomographic structural problems, protein dynamics and biocatalysis,
relativistic quantum chemistry of actinides, virtual materials design and
processing, global climate modeling, and discrete event simulation.
Low cost solution and high performance are only a few of the
advantages of utilizing a High Performance Computing Cluster. Other key
benefits that distinguish it from large SMP’s are described below:

Features                     Large SMPs                 HPCC
Scalability                  Fixed                      Unbounded
Availability                 Moderate                   High
Ease of technology refresh   Difficult                  Manageable
Service and support          Expensive                  Affordable
System manageability         Custom; better usability   Standard; moderate usability
Application availability     High                       Moderate
Reusability of components    Low                        High
Disaster recovery ability    Weak                       Strong
Installation                 Non-standard               Standard
Cluster computing research projects:

• Beowulf (CalTech and NASA) - USA
• CCS (Computing Centre Software) - Paderborn, Germany
• DQS (Distributed Queuing System) - Florida State University, US.
• HPVM (High Performance Virtual Machine) - UIUC & now UCSB, USA
• MOSIX - Hebrew University of Jerusalem, Israel
• MPI (MPI Forum, MPICH is one of the popular implementations)
• NOW (Network of Workstations) - Berkeley, USA
• NIMROD - Monash University, Australia
• NetSolve - University of Tennessee, USA
• PVM - Oak Ridge National Lab./UTK/Emory, USA

PARAM Padma:

PARAM Padma is C-DAC's next generation high performance
scalable computing cluster, currently with a peak computing power of One
Teraflop. The hardware environment is powered by the Compute Nodes
based on the state-of-the-art Power4 RISC processors’ technology. These
nodes are connected through a primary high performance System Area
Network, PARAMNet-II, designed and developed by C-DAC and a Gigabit
Ethernet as a backup network.

ONGC Clusters:

ONGC implements two LINUX cluster machines:


• One is a 272-node dual-core computing system, with each node
equivalent to two CPUs. Its master system has 12 dual-CPU nodes and
32 terabytes of SAN storage.
• The second system has 48 compute nodes (i.e. 96 CPUs); its master
system has 4 nodes and 20 terabytes of SAN storage.
GRID COMPUTING

Grid computing is a phrase in distributed computing which can have
several meanings:

• A local computer cluster which is like a "grid" because it is composed
of multiple nodes.
• Offering online computation or storage as a metered commercial
service, known as utility computing, "computing on demand", or
"cloud computing".
• The creation of a "virtual supercomputer" by using spare computing
resources within an organization.
• The creation of a "virtual supercomputer" by using a network of
geographically dispersed computers. Volunteer computing, which
generally focuses on scientific, mathematical, and academic problems,
is the most common application of this technology.

These varying definitions cover the spectrum of "distributed computing",
and sometimes the two terms are used as synonyms. This article focuses on
distributed computing technologies which are not in the traditional dedicated
clusters; otherwise, see computer cluster.

Functionally, one can also speak of several types of grids:


• Computational grids (including CPU-scavenging grids), which
focus primarily on computationally intensive operations.
• Data grids, for the controlled sharing and management of large
amounts of distributed data.
• Equipment grids which have a primary piece of equipment e.g. a
telescope, and where the surrounding Grid is used to control the
equipment remotely and to analyze the data produced.

History:
The term Grid computing originated in the early 1990s as a metaphor
for making computer power as easy to access as an electric power grid, in
Ian Foster and Carl Kesselman's seminal work, "The Grid: Blueprint for a
New Computing Infrastructure".

CPU scavenging and volunteer computing were popularized
beginning in 1997 by distributed.net and later in 1999 by SETI@home to
harness the power of networked PCs worldwide, in order to solve CPU-
intensive research problems.

The ideas of the grid (including those from distributed computing,
object oriented programming, cluster computing, web services and others)
were brought together by Ian Foster, Carl Kesselman and Steve Tuecke,
widely regarded as the "fathers of the grid". They led the effort to create the
Globus Toolkit incorporating not just computation management but also
storage management, security provisioning, data movement, monitoring and
a toolkit for developing additional services based on the same infrastructure
including agreement negotiation, notification mechanisms, trigger services
and information aggregation. While the Globus Toolkit remains the de facto
standard for building grid solutions, a number of other tools have been built
that answer some subset of services needed to create an enterprise or global
grid.
Grids versus conventional supercomputers:
"Distributed" or "grid computing" in general is a special type of
parallel computing which relies on complete computers (with onboard CPU,
storage, power supply, network interface, etc.) connected to the Internet by a
conventional network interface, such as Ethernet. This is in contrast to the
traditional notion of a supercomputer, which has many CPUs connected by a
local high-speed computer bus.

The primary advantage of distributed computing is that each node can
be purchased as commodity hardware, which when combined can produce
similar computing resources to a many-CPU supercomputer, but at lower
cost. This is due to the economies of scale of producing commodity
hardware, compared to the lower efficiency of designing and constructing a
small number of custom supercomputers. The primary performance
disadvantage is that the various CPUs and local storage areas do not have
high-speed connections. This arrangement is thus well-suited to applications
where multiple parallel computations can take place independently, without
the need to communicate intermediate results between CPUs.

The high-end scalability of geographically dispersed grids is generally
favorable, due to the low need for connectivity between nodes relative to the
capacity of the public Internet. Conventional supercomputers also create
physical challenges in supplying sufficient electricity and cooling capacity
in a single location. Both supercomputers and grids can be used to run
multiple parallel computations at the same time, which might be different
simulations for the same project, or computations for completely different
applications. The infrastructure and programming considerations needed to
do this on each type of platform are different, however.

There are also differences in programming and deployment. It can be
costly and difficult to write programs so that they can be run in the
environment of a supercomputer, which may have a custom operating
system, or require the program to address concurrency issues. If a problem
can be adequately parallelized, a "thin" layer of "grid" infrastructure can
cause conventional, standalone programs to run on multiple machines (but
each given a different part of the same problem). This makes it possible to
write and debug programs on a single conventional machine, and eliminates
complications due to multiple instances of the same program running in the
same shared memory and storage space at the same time.
Design considerations and variations:
One feature of distributed grids is that they can be formed from
computing resources belonging to multiple individuals or organizations
(known as multiple administrative domains). This can facilitate commercial
transactions, as in utility computing, or make it easier to assemble volunteer
computing networks.

One disadvantage of this feature is that the computers which are
actually performing the calculations might not be entirely trustworthy. The
designers of the system must thus introduce measures to prevent
malfunctions or malicious participants from producing false, misleading, or
erroneous results, and from using the system as an attack vector. This often
involves assigning work randomly to different nodes (presumably with
different owners) and checking that at least two different nodes report the
same answer for a given work unit. Discrepancies would identify
malfunctioning and malicious nodes.
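
The idea of accepting a work unit only when independently computed results
agree can be sketched as follows. The structure and the exact comparison
rule are hypothetical, since real grid systems define their own validation
protocols.

/* verify.c - illustrative sketch of redundant work-unit verification:
 * the same work unit is sent to two different nodes, and their results
 * are accepted only if they agree.
 */
#include <stdio.h>
#include <string.h>

struct work_result {
    int  unit_id;        /* which work unit this answer belongs to */
    char node[32];       /* which node reported it                 */
    double answer;       /* the reported result                    */
};

/* accept a work unit only when two independently computed answers match */
static int verify(const struct work_result *a, const struct work_result *b)
{
    return a->unit_id == b->unit_id &&
           strcmp(a->node, b->node) != 0 &&   /* must come from different owners */
           a->answer == b->answer;            /* real systems may use a tolerance */
}

int main(void)
{
    struct work_result r1 = { 7, "node-A", 3.14159 };
    struct work_result r2 = { 7, "node-B", 3.14159 };

    if (verify(&r1, &r2))
        printf("work unit %d accepted\n", r1.unit_id);
    else
        printf("work unit %d flagged: reassign to another node\n", r1.unit_id);
    return 0;
}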

Due to the lack of central control over the hardware, there is no way
to guarantee that nodes will not drop out of the network at random times.
Some nodes (like laptops or dialup Internet customers) may also be available
for computation but not network communications for unpredictable periods.
These variations can be accommodated by assigning large work units (thus
reducing the need for continuous network connectivity) and reassigning
work units when a given node fails to report its results as expected.

The impacts of trust and availability on performance and development
difficulty can influence the choice of whether to deploy onto a dedicated
computer cluster, to idle machines internal to the developing organization, or
to an open external network of volunteers or contractors.

In many cases, the participating nodes must trust the central system
not to abuse the access that is being granted, by interfering with the
operation of other programs, mangling stored information, transmitting
private data, or creating new security holes. Other systems employ measures
to reduce the amount of trust "client" nodes must place in the central system
such as placing applications in virtual machines.

Public systems or those crossing administrative domains (including
different departments in the same organization) often result in the need to
run on heterogeneous systems, using different operating systems and
hardware architectures. With many languages, there is a tradeoff between
investment in software development and the number of platforms that can be
supported (and thus the size of the resulting network). Cross-platform
languages can reduce the need to make this tradeoff, though potentially at
the expense of high performance on any given node (due to run-time
interpretation or lack of optimization for the particular platform).

Various middleware projects have created generic infrastructure to
allow various scientific and commercial projects to harness a particular
associated grid, or for the purpose of setting up new grids. BOINC is a
common one for academic projects seeking public volunteers; more are
listed at the end of the article.

Cycle stealing:

Typically, there are three types of owners, who use their workstations
mostly for:

1. Sending and receiving email and preparing documents.
2. Software development - edit, compile, debug and test cycle.
3. Running compute-intensive applications.

• Cluster computing aims to steal spare cycles from (1) and (2) to
provide resources for (3).
• However, this requires overcoming the ownership hurdle - people
are very protective of their workstations.
• Usually requires organisational mandate that computers are to be
used in this way.
• Stealing cycles outside standard work hours (e.g. overnight) is easy,
stealing idle cycles during work hours without impacting interactive
use (both CPU and memory) is much harder.

• Usually a workstation will be owned by an individual, group,
department, or organisation - they are dedicated to the exclusive use
by the owners.
• This brings problems when attempting to form a cluster of
workstations for running distributed applications.
International Grid Projects:

• GARUDA (Indian)
• D-grid (German)
• Malaysia national grid computing
• Singapore national grid computing project
• Thailand national grid computing project
• CERN data grid (Europe)
• PUBLIC FORUMS
o Computing Portals
o Grid Forum
o European Grid Forum
o IEEE TFCC
o GRID’2000

GARUDA:
GARUDA is a collaboration of science researchers and experimenters
on a nationwide grid of computational nodes, mass storage and scientific
instruments that aims to provide the technological advances required to
enable data and compute intensive science for the 21st century. One of
GARUDA’s most important challenges is to strike the right balance between
research and the daunting task of deploying that innovation into some of the
most complex scientific and engineering endeavors being undertaken today.

The Department of Information Technology (DIT), Government of
India has funded the Centre for Development of Advanced Computing (C-
DAC) to deploy the nation-wide computational grid ‘GARUDA’ which will
connect 17 cities across the country in its Proof of Concept (PoC) phase with
an aim to bring “Grid” networked computing to research labs and industry.
GARUDA will accelerate India’s drive to turn its substantial research
investment into tangible economic benefits.
Some Problems:
1. Embarrassingly Parallel:

In the jargon of parallel computing, an embarrassingly parallel
workload (or embarrassingly parallel problem) is one for which no particular
effort is needed to segment the problem into a very large number of parallel
tasks, and there is no essential dependency (or communication) between
those parallel tasks.

In other words, each step can be computed independently from every
other step, thus each step could be made to run on a separate processor to
achieve quicker results.

A very common example of an embarrassingly parallel problem lies
within graphics processing units (GPUs) for things like 3D projection since
each pixel on the screen can be rendered independently from each other
pixel.
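
A minimal sketch of this pixel example, assuming OpenMP is available for
the parallel loop, is shown below. The image size and the shading function
are placeholders; the point is that no communication is needed between
iterations.

/* pixels.c - sketch of an embarrassingly parallel workload: every pixel is
 * computed independently, so the loop parallelizes with a single OpenMP
 * directive (compile with, e.g., gcc -fopenmp pixels.c).
 */
#include <stdio.h>
#include <omp.h>

#define WIDTH  640
#define HEIGHT 480

static unsigned char image[HEIGHT][WIDTH];

/* a stand-in for real per-pixel work (shading, projection, ...) */
static unsigned char shade(int x, int y)
{
    return (unsigned char)((x * x + y * y) % 256);
}

int main(void)
{
    /* each iteration touches only its own pixel: no shared state and no
       communication between iterations, hence "embarrassingly" parallel */
    #pragma omp parallel for collapse(2)
    for (int y = 0; y < HEIGHT; y++)
        for (int x = 0; x < WIDTH; x++)
            image[y][x] = shade(x, y);

    printf("rendered %dx%d pixels with up to %d threads\n",
           WIDTH, HEIGHT, omp_get_max_threads());
    return 0;
}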

Embarrassingly parallel problems are ideally suited to distributed
computing over the Internet (e.g. SETI@home), and are also easy to perform
on server farms which do not have any of the special infrastructure used in a
true supercomputer cluster.

Embarrassingly parallel problems lie at one end of the spectrum of
parallelization, the degree to which a computational problem can be readily
divided amongst processors.

2. Software Lockout:

In multiprocessor computer systems, software lockout is the
issue of performance degradation due to the idle wait times spent by
the CPUs in kernel-level critical sections. Software lockout is the
major cause of scalability degradation in a multiprocessor system,
posing a limit on the maximum useful number of processors. To
mitigate the phenomenon, the kernel must be designed to keep its
critical sections as short as possible, for example by decomposing each
data structure into smaller, independently locked substructures.
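
The same principle can be sketched in user space with POSIX threads:
instead of one global lock serializing every operation, each bucket of a
hypothetical hash table gets its own mutex, so the critical sections stay
short and threads rarely contend.

/* perbucket_lock.c - sketch of shortening critical sections by decomposing
 * a data structure: one mutex per hash bucket instead of one global lock,
 * so threads block each other only when they touch the same bucket.
 * Compile with: cc perbucket_lock.c -lpthread
 */
#include <pthread.h>
#include <stdio.h>

#define NBUCKETS 64

struct bucket {
    pthread_mutex_t lock;   /* protects only this bucket's counter */
    long count;
};

static struct bucket table[NBUCKETS];

static void bucket_add(int key, long delta)
{
    struct bucket *b = &table[key % NBUCKETS];
    pthread_mutex_lock(&b->lock);     /* short, per-bucket critical section */
    b->count += delta;
    pthread_mutex_unlock(&b->lock);
}

int main(void)
{
    for (int i = 0; i < NBUCKETS; i++)
        pthread_mutex_init(&table[i].lock, NULL);

    bucket_add(7, 1);                 /* single-threaded demo call */
    printf("bucket 7 count = %ld\n", table[7 % NBUCKETS].count);
    return 0;
}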
BIBLIOGRAPHY

1. High Performance Computing, by Alex Veidenbaum.
2. High Performance Computing and Networking, by Wolfgang
Gentzsch, Uwe Harms.
3. High Performance Computing and the Art of Parallel Programming,
by Stan Openshaw, Ian Turton.
4. www.wikipedia.org
5. www.howstuffworks.com
