Implementing A Linux Cluster
A project report
submitted in partial fulfilment of
the requirements for the award of the degree of
Bachelor of Technology
in
Computer Science and Engineering
by
Deepak Lukose
Roll No:16
Group No:8
S8 CSE
Abstract
A computer cluster is a group of linked computers, working together closely so that in many respects it can be viewed as though it were a single computer. Clusters are commonly connected through fast local area networks and are usually deployed to improve speed and/or reliability over that provided by a single computer, while typically being much more cost-effective than single computers of comparable speed or reliability.

Clusters built from open source software, particularly based on the GNU/Linux operating system, are increasingly popular. Their success is not hard to explain, as they can cheaply solve an ever-widening range of computing applications. A wealth of open source or free software has emerged to make it easy to set up, administer, and program these clusters. This work aims at an implementation of free and open source clusters for performing scientific computations at a faster pace.
Acknowledgements
I express my sincere thanks to Mr. Vinod Pathari for his constant backing
and support. I would like to extend my gratitude to the entire faculty and staff of the CSE
Department of NITC, who stood by me in all the difficulties I had to face during
the completion of this project. Last but not the least, I thank God Almighty for all the blessings.
Deepak Lukose
Contents

Chapter
1 Introduction
2 Motivation
3 Design
   3.1.1 Mission
   3.1.4 Hardware
4 Cluster Installation
   4.1 OSCAR
   4.2 Packages
5 Parallel Programming
6 Molecular Modelling
   6.1 Algorithm
   6.2 Profiling
7 Benchmarking
8 Conclusion
Bibliography
Chapter 1
Introduction
The first commodity clustering product was ARCnet, developed by Datapoint in 1977. ARCnet was not a commercial success and clustering did not
really take off until DEC released their VAXcluster product in the 1980s for the
VAX/VMS operating system. The ARCnet and VAXcluster products not only
supported parallel computing, but also shared file systems and peripheral devices.
They were supposed to give us the advantage of parallel processing while maintain-
ing data reliability and uniqueness. VAXcluster, now VMScluster, is still available on OpenVMS systems. High-performance computing (HPC) clusters are implemented primarily to provide increased performance by splitting a computational task across many different nodes in the cluster, and are most commonly used in scientific computing. One of the more popular HPC implementations is a cluster with nodes running Linux as the operating system and free software to implement the parallelism. This configuration is often referred
to as a Beowulf cluster. Such clusters commonly run custom programs which have
been designed to exploit the parallelism available on HPC clusters. Many such
programs use libraries such as the Message Passing Interface (MPI), which are specially designed for writing scientific applications. MPI routines can be called from Fortran, C, C++ and Ada programs. MPI's advantage over older message passing libraries is that it is both portable (because MPI has been implemented for almost every distributed memory architecture) and fast (because each implementation is optimized for the hardware on which it runs).

Clusters can be broadly divided into the following categories:
Specialized application clusters - Clusters which are built using specialized applications fall in this category, e.g. Beowulf, distcc, MPICH and others.

Full-blown clusters - This category includes clusters which integrate into the kernel a mechanism for automatic process migration among homogeneous nodes, e.g. Mosix, openMosix, Kerrighed, OpenSSI etc. OpenSSI and Kerrighed are single-system image implementations. DragonFly BSD, a recent fork of FreeBSD 4.8, is being redesigned at its core to enable native clustering capabilities.

Grids - Grids are usually clusters of clusters. The key differences between grids and traditional clusters are that grids connect collections of computers which do not fully trust each other, and hence operate more like a computing utility than like a single computer. In addition, grids typically support more heterogeneous collections of computers than are commonly supported in clusters.
Chapter 2
Motivation
Most of the time, the computer is idle. Start a program like xload or top
that monitors system use, and one will probably find that the processor's load
is not even hitting the 1.0 mark. If one has two or more computers, chances are
that at any given time, at least one of them is doing nothing. Unfortunately, when
we really do need CPU power - during a C++ compile, or encoding Ogg Vorbis
music files - we need a lot of it at once. The idea behind clustering is to spread
these loads among all available computers, using the resources that are free on
other machines.
The basic unit of a cluster is a single computer, also called a “node”. Clusters
can grow in size - they “scale” - by adding more machines. The power of the
cluster as a whole will be based on the speed of the individual computers and on how fast their connections are. In addition, the operating system of the cluster must make the best use of the available hardware in response to changing conditions. This becomes more of a challenge when the composition of the cluster can change (machines joining and leaving the cluster) and the loads cannot be predicted ahead
of time.
Chapter 3
Design
Designing a cluster involves four key tasks: (1) determine the overall mission of the cluster; (2) select a general architecture for the cluster; (3) select the operating system, cluster software, and other system software; and (4) select the hardware.

While each of these tasks, in part, depends on the others, the first step is crucial. If at all possible, the cluster's mission should drive all other design decisions. At the very least, the other design decisions must be made in the context of that mission.
Selecting the hardware should be the final step in the design, but often we
won’t have as much choice as we would like. A number of constraints may force us
to select the hardware early in the design process. The most obvious of these is budget constraints.
Defining what we want to do with the cluster is really the first step in de-
signing it. For many clusters, the mission will be clearly understood in advance.
This is particularly true if the cluster has a single use or a few clearly defined uses.
But it should be noted that clusters have a way of evolving. What may be a
reasonable assessment of needs today may not be tomorrow. Good design should therefore leave room for the cluster to evolve.
3.1.1 Mission
High performance cluster for general purpose computing with special regard to molecular modelling.

Among the available cluster kits, OSCAR was selected because it scales well over Linux distributions. For greater control over how the cluster is configured, one will be happier with OSCAR in the long run. Typically, OSCAR provides better documentation than other cluster kits like Rocks. OSCAR was chosen over a traditional Beowulf setup due to its ease of installation as well as its comprehensive package set, which includes many compilers and application software. One of the main packages used on OSCAR is LAM/MPI, a popular implementation of the MPI parallel programming paradigm. LAM/MPI can be used for running parallel programs across the nodes of the cluster.

3.1.4 Hardware

The nodes must support PXE (Preboot eXecution Environment) for network booting. The head node must have at least 7 GB of hard disk capacity, and all other client nodes need a minimum of 5 GB of hard disk capacity.
Chapter 4
Cluster Installation
One of the more important developments in the short life of high perfor-
mance clusters has been the creation of cluster installation kits such as OSCAR
(Open Source Cluster Application Resources) and Rocks. With software packages
like these, it is possible to install everything one needs and very quickly have a
fully functional cluster. A fully functional cluster will have a number of software packages for administration, programming, and job scheduling.
4.1 OSCAR
A collection of open source cluster software, OSCAR includes everything that one is likely to need for a dedicated high-performance cluster. OSCAR takes a best-in-category approach, selecting the best available software for each type of cluster-related task. One will often have several products to choose from for any given need.

The design goals for OSCAR include using best-of-class software and eliminating the need for expertise in setting up a cluster, since OSCAR takes us completely through the installation. One can therefore have a working cluster before mastering all the skills one will eventually need. In the long run, one will want to master those packages in OSCAR. OSCAR makes it very easy to experiment with packages and dramatically lowers the barrier to getting started.
Unless one customizes the installation, the compute nodes are meant to be dedicated to
the cluster.
4.2 Packages
Most of the packages in OSCAR are also available as standalone packages. The main packages in OSCAR are:

C3: The Cluster Command and Control tool suite provides a command-line administration interface for the cluster. For example, the cexec command executes a command on all the clients. Similarly, we can use commands like cget, ckill, cpush, crm and cshutdown.
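As a brief illustration (the command arguments are ours, not from the report), typical C3 invocations look like:

cexec uptime
cpush input.dat

The first runs uptime on every client node; the second pushes the file input.dat out to all of them. The set of nodes acted upon is taken from the C3 cluster configuration file.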
Switcher: This package allows the user to make changes to the environment of future shells. For example, Switcher allows a user to change between MPICH and LAM/MPI from the switcher command line.
SIS: The System Installation Suite is used to install the operating systems on the
clients.
Monitoring: Ganglia, a real-time monitoring system and execution environment, is among the monitoring tools included.

LAM/MPI and MPICH: These packages provide implementations of the MPI message passing libraries.

PVM: This package provides the Parallel Virtual Machine system, another message passing environment.
With OSCAR, one first installs Linux (but only on the head node) and then
installs OSCAR. The installations of the two are separate. This makes the instal-
lation more involved, but it gives us more control over the configuration of the system and can help avoid potential problems. And because the OSCAR installation is separate from the Linux installation, OSCAR can be reinstalled without redoing the basic Linux installation.
OSCAR uses a system image cloning strategy to distribute the disk image to
the compute nodes. With OSCAR it is best to use the same hardware throughout
the cluster. OSCAR’s thin client model is designed for diskless systems.
Chapter 5
Parallel Programming
It is traditional to begin learning a new programming system with a "Hello World" program, so here also we will start with one. The parallel version is given below:
#include "mpi.h"
#include <stdio.h>
char computerName[MPI_MAX_PROCESSOR_NAME];
MPI_Init(&argc, &argv);
MPI_Comm_size(MPI_COMM_WORLD, &noProcesses);
MPI_Comm_rank(MPI_COMM_WORLD, &processId);
MPI_Get_processor_name(computerName, &nameSize);
12
fprintf(stderr,"Hello from process %d on %s\n", processId, computerName);
MPI_Finalize( );
return 0;
This example introduces five MPI functions, defined through the inclusion
of the header file for the MPI library, mpi.h, and included when the MPI library
is linked to the program. While this example uses C, similar libraries are available for other languages.

Four of these functions, MPI_Init, MPI_Comm_size, MPI_Comm_rank, and MPI_Finalize, will appear in virtually every MPI program one writes.
MPI_Init is used to initialize an MPI session. All MPI programs must have a call to MPI_Init. MPI_Init is called once, typically at the start of a program. One can have lots of other code before this call, or one can even call MPI_Init from a subroutine, but one should call it before any other MPI functions are called. (There is an exception: the function MPI_Initialized can be called before MPI_Init. MPI_Initialized is used to see if MPI_Init has been previously called.) In C, MPI_Init can be called with the addresses of argc and argv, as shown in the example.

MPI_Finalize is called to shut down MPI. MPI_Finalize should be the last MPI call made in a program. It is used to free memory, etc. It is the user's
responsibility to ensure that all pending communications are complete before a
process calls MPI_Finalize. Every process must call MPI_Finalize.

MPI_Comm_size is used to determine the number of processes within a communicator (the communications group for the processes being used). It takes the communicator as the first argument and the address of an integer variable used to return the number of processes. For example, if one is executing a program using five processes and the default communicator, the value returned by MPI_Comm_size will be five, the total number of processes being used. This is the number of processes, but not necessarily the number of machines being used.
In the hello world program, both MPI_Comm_size and MPI_Comm_rank used the default communicator, MPI_COMM_WORLD. This communicator includes all the processes available at startup. Communicators are used to distinguish and group messages. As such, communicators provide a degree of encapsulation. While one can create one's own communicators, the default communicator will probably satisfy most of the needs.
MPI_Comm_rank is used to determine the rank of the current process within the communicator. MPI_Comm_rank takes a communicator as its first argument and the address of an integer variable used to return the value of the rank. Each process is assigned a unique rank within a communicator. Ranks range from 0 to one less than the size returned by MPI_Comm_size. For example, if one is running a set of five processes, the individual processes will have ranks 0, 1, 2, 3 and 4. Together, MPI_Comm_size and MPI_Comm_rank are often used to divide up a problem among processes. Next, each individual process can examine its rank to determine its role in the calculation. For example, the process with rank 0 might work on the first part of the problem; the process with rank 1 will work on the second part of the problem, etc. One can divide up the problem differently also. For example, the process with rank 0 might collect all the results from the other processes for the final report.
MPI_Get_processor_name is used to retrieve the host name of the node on which the individual process is running; in this program it is used to display host names. The first argument is an array to store the name and the second is the address of an integer variable used to return the length of the name.

Each of the C versions of these five functions returns an integer error code. With a few exceptions, the actual error codes are left up to the implementers. Error codes can be translated into meaningful messages using the MPI_Error_string function.
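As an illustrative sketch (ours, not the report's code), a return code can be decoded as follows; the error handler is first switched to MPI_ERRORS_RETURN because by default MPI aborts on error:

#include "mpi.h"
#include <stdio.h>

int main(int argc, char *argv[])
{
    char message[MPI_MAX_ERROR_STRING];
    int messageLength, errorCode, noProcesses;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &noProcesses);

    /* make MPI calls return error codes instead of aborting */
    MPI_Errhandler_set(MPI_COMM_WORLD, MPI_ERRORS_RETURN);

    /* deliberately invalid destination: ranks run only from 0 to size-1 */
    errorCode = MPI_Send(NULL, 0, MPI_INT, noProcesses, 0, MPI_COMM_WORLD);
    if (errorCode != MPI_SUCCESS) {
        MPI_Error_string(errorCode, message, &messageLength);
        fprintf(stderr, "MPI error: %s\n", message);
    }

    MPI_Finalize();
    return 0;
}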
When the program runs, the order in which the processes report depends on the speeds of the individual machines, the loads on the machines, and the speeds of the
communications links. Unless one takes explicit measures to control the order of
execution among processors, one should not make assumptions about the order of
execution.
When running the program, the user specifies the number of processes on
the command line. MPI_Comm_size provides a way to get that information back
into the program. Next time, if one wants to use a different number of processes,
just change the command line and the code will take care of the rest.
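As a quick usage sketch (the file name hello.c is illustrative), under LAM/MPI the program would be compiled and launched roughly as follows, after booting the run-time environment with lamboot:

mpicc hello.c -o hello
mpirun -np 5 hello

Changing the argument of -np changes the number of processes without recompiling the program.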
The next example computes the area under a curve by numerical integration. The reason this area problem is both interesting and commonly used is that it can be easily decomposed into parts that can be shared among the computers in the cluster, while the parallel solution illustrates all the basics one needs to get started writing MPI code: different computers calculate the areas for different rectangles. Basically, MPI_Comm_size and MPI_Comm_rank are used to divide the problem among processors. MPI_Send is used to send the intermediate results back to the process with rank 0, which collects the results with MPI_Recv and prints the final answer. Here is the program:
#include "mpi.h"
#include <stdio.h>
/* problem parameters */
#define numberRects 50
16
#define lowerLimit 2.0
/* MPI variables */
MPI_Status status;
/* problem variables */
int i;
/* MPI setup */
MPI_Init(&argc, &argv);
MPI_Comm_size(MPI_COMM_WORLD, &noProcesses);
MPI_Comm_rank(MPI_COMM_WORLD, &processId);
area = 0.0;
{
17
at = lower + i * width + width / 2.0;
height = f(at);
tag = 0;
total = area;
dest = 0;
};
/* finish */
MPI_Finalize( );
return 0;
}
In this example, we are calculating the area under the curve y = x² between
x=2 and x=5. Since each process only needs to do part of this calculation, we need
to divide the problem among the processes so that each process gets a different
part and all the parts are accounted for. MPI_Comm_size is used to determine the
number of parts the problem will be broken into, noProcesses. That is, we divide
the total range (2 to 5) equally among the processes and adjust the start of the
range for an individual process based on its rank. In the next section of code, each
process calculates the area for its part of the problem. Then we need to collect
and combine all our individual results. One process will act as a collector to which
the remaining processes will send their results. Using the process with rank 0 as
the receiver is the logical choice. The remaining processes act as senders. A fair
amount of MPI code development can be done on a single processor system and then moved to the cluster.

Chapter 6
Molecular Modelling

A molecular dynamics simulation is used to compute the thermal conductivity of a liquid, and the procedure is validated with standard values. A few among
the plausible theoretical models for the thermal conductivity of nanofluids have been selected and studied. Algorithms are made for simulating them, abiding by the underlying physics. The results of
the simulations of the models considered are compared among themselves and with
the existing experimental results, and further investigated to select the most ap-
propriate model which matches best with a practical case of interest (metal oxide
- water system). Parametric studies are conducted to study the variation of ther-
mal conductivity enhancement with temperature, and the optimal dosing levels of the nanoparticles.
6.1 Algorithm
main()
{
    initializePositions();
    initializeVelocities();
    for each time step {
        velocityVerlet(dt);
        instantaneousTemperature();
        if (i % 200 == 0)
            rescaleVelocities();
        if (time > equilibrationTime) {
            jt();
            thermalConductivity();
        }
    }
}
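The velocityVerlet() step advances positions and velocities by one step of the velocity Verlet integrator. The sketch below is ours, with an illustrative harmonic force standing in for the Lennard-Jones interactions that the profile in the next section attributes to the Accelerations and lj functions:

#include <stdio.h>

#define N 64                      /* illustrative particle count */

static double x[N], v[N], a[N];   /* positions, velocities, accelerations */

/* Illustrative force law a = -x; the real code computes the O(N^2)
   pairwise Lennard-Jones interactions here. */
static void accelerations(void)
{
    int i;
    for (i = 0; i < N; i++)
        a[i] = -x[i];
}

/* One step:  x += v dt + a dt^2 / 2;  v += (a_old + a_new) dt / 2 */
static void velocityVerlet(double dt)
{
    int i;
    for (i = 0; i < N; i++) {
        x[i] += v[i] * dt + 0.5 * a[i] * dt * dt;
        v[i] += 0.5 * a[i] * dt;      /* half kick with the old forces */
    }
    accelerations();                  /* forces at the new positions */
    for (i = 0; i < N; i++)
        v[i] += 0.5 * a[i] * dt;      /* half kick with the new forces */
}

int main(void)
{
    int step;
    x[0] = 1.0;                       /* start one particle off-centre */
    accelerations();
    for (step = 0; step < 1000; step++)
        velocityVerlet(0.001);
    printf("x[0] = %f after 1000 steps\n", x[0]);
    return 0;
}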
6.2 Profiling
It is generally said that a typical program will spend over 90% of its execution time in less than 10% of the actual code. This is just a rule of thumb or heuristic, and as such, will be wildly inaccurate or totally irrelevant for some programs. But for many, if not most, programs, it is a reasonable observation. The actual numbers don't matter, since they will change from program to program. It is the idea that is important: for most programs, most of the execution time is spent in a small portion of the code.
If the application spends 95% of its time in 5% of the code, there is little to
be gained by optimizing the other 95% of the code. Even if one could completely
eliminate it, one would only see a 5% improvement. But if one can manage a 10%
improvement in the critical 5% of the code, for example, we will see a 9.5% overall
improvement in the program. Thus, the key to improving the code’s performance
is to identify that crucial 5%. That is the region where one should spend one's optimization effort, taking care to balance the amount of time we spend optimizing code with the amount of improvement we actually get. There is a point where the code is good enough. The goals of profiling are two-fold: to identify the crucial regions of the code and to decide how much optimization is worth doing.
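Stated as a formula (using the numbers above): if a fraction p of the running time lies in the region being optimized, and that region is made a fraction r faster, the overall improvement is

1 − T_new / T_old = p · r = 0.95 × 0.10 = 0.095 = 9.5%

which is Amdahl's law restricted to a single optimized region.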
For serial algorithms, one can often make reasonable estimates on how time
is being spent by simply examining and analyzing the algorithm. The standard technique is an asymptotic analysis of the algorithm's complexity. Since the problem size often provides a bound for algorithmic performance, this works reasonably well for serial programs. With parallel programs, however, communication and synchronization costs lead to a less than perfect overlap among the communicating processes. A processor may be idle while it waits for its next task. In particular, it may be difficult to predict when a processor will be idle and what effect this will have on overall performance. For this reason, direct measurement is often the preferred approach for parallel programs. That is, we directly measure the performance of the program as it runs.
Thus, with parallel programs, the most appropriate strategy is to select the
best algorithm one can and then empirically verify its actual performance.
Timing the unoptimized code with the time command gave (in seconds):

real 378.63
user 378.61
sys 0.00

(Table: function-level execution profile of the unoptimized code, not reproduced.)
From this profile we can find that there is scope for improvement in functions like Accelerations, lj and jt, because they account for more than 99% of the
execution time. It can be noticed that a single call to the lj function takes a very small amount of time to execute, but since it is executed a large number of times, even a slight improvement in the code will heavily affect the execution time. It can be found from the code that the functions jt and Accelerations both have a complexity of O(N²). So the work done by this part of the code can be split among the processes running on the cluster, as sketched below.
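One standard way to split such an O(N²) computation with MPI is to interleave the outer loop iterations across the processes by rank and combine the partial results afterwards. The sketch below is ours; the pair term is a stand-in, not the project's jt or Accelerations code:

#include "mpi.h"

/* Divide the outer loop of an O(N^2) pairwise sum among processes:
   each rank takes every noProcesses-th outer index, and the partial
   sums are then combined with MPI_Allreduce. */
double pairSum(double *q, int n, int processId, int noProcesses)
{
    double partial = 0.0, total = 0.0;
    int i, j;

    for (i = processId; i < n; i += noProcesses)
        for (j = i + 1; j < n; j++)
            partial += q[i] * q[j];   /* stand-in for the pair interaction */

    MPI_Allreduce(&partial, &total, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    return total;
}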
After optimization, the corresponding timings were:

real 155.74
user 134.28
sys 1.70
The comparison between the unoptimized and optimized code is shown below:

(Figure: bar charts comparing measurements for the code before and after optimization.)
With this optimized code we get a speedup of 2.43 on a single node (4 CPUs) of the cluster. The error in the computation, of the order of 10⁻⁷, is due to the presence of floating point rounding differences. On multiprocessor machines the MPI implementation can communicate between local processes by using shared memory instead of message passing. When the program was run on a cluster with two nodes (4 CPUs) the speedup achieved was in the same range.

Chapter 7
Benchmarking
Once the cluster is running, one needs to run a benchmark or two just to see how well it performs. There are three main reasons for running benchmarks. First, a benchmark provides a baseline for the cluster's performance: if we suspect problems with the cluster, we can rerun the benchmark to see if performance is really any different. Second, benchmarks are useful when comparing systems or cluster configurations. They can provide a reasonable basis for selecting between alternatives. Third, benchmarks help in predicting how a configuration will behave: if we run several benchmarks with differently sized clusters, etc., we should be able to make better predictions about larger or reconfigured systems. The LINPACK benchmark is widely used and performance numbers are available for almost all relevant systems.
LINPACK is a software library for performing numerical linear algebra on digital computers. LINPACK makes use of the BLAS (Basic Linear Algebra Subprograms) libraries for performing basic vector and matrix operations. The LINPACK benchmark measures a system's floating point computing power. It measures how fast a computer solves dense
n by n systems of linear equations Ax = b, a common task in engineering. The solution is based on Gaussian elimination with partial pivoting, with (2/3)·n³ + 2·n² floating point operations. The result is reported in millions of floating point operations per second (Mflop/s).
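Written out (this restates the definition above rather than adding anything new), the reported rate is the operation count divided by the measured solution time t in seconds:

Mflop/s = ((2/3) n³ + 2 n²) / (t × 10⁶)

so a result of 1.380e-01 GFLOPS, as reported below, corresponds to 138 Mflop/s.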
The LINPACK benchmark result is used as the performance measure for ranking supercomputers in the TOP500 list of the world's
fastest computers. This performance does not reflect the overall performance of
a given system, as no single number ever can. It does, however, reflect the performance of a dedicated system for solving a dense system of linear equations. Since the problem is very regular, the performance achieved is quite high, and the performance numbers give a good indication of attainable peak performance. When the
High Performance Linpack was run on the cluster it gave a peak performance of
1.380e-01 GFLOPS.
The following figure explains the relationship between the execution time and the problem size.

(Figure: execution time versus problem size.)
Chapter 8
Conclusion
The basic objective of this project was to setup a high performance compu-
tational cluster with special concern to molecular modelling. By using a cluster kit
such as OSCAR, the first phase of the project, setting up a high performance clus-
ter, could be completed. With the help of message passing libraries like LAM/MPI,
the second phase of the project, improving the performance of a molecular mod-
elling problem, was completed. The molecular modelling problem which was considered showed a remarkable speedup of the order of 2.0. This increased efficiency came at the
expense of higher code complexity. Finally, the testing phase was completed using the High Performance Linpack benchmark, which gave a peak performance of 1.380e-01 GFLOPS. Future work includes the formal analysis of the existing code for further optimization.
Bibliography
[1] Joseph D. Sloan, High Performance Linux Clusters with OSCAR, Rocks, OpenMosix, and MPI, O'Reilly & Associates, 2004.
[2] Michael J. Quinn, Parallel Programming in C with MPI and OpenMP, Tata
McGraw-Hill, 2003.
[3] Zoltán Juhász, Péter Kacsuk and Dieter Kranzlmüller, Distributed and Parallel Systems: Cluster and Grid Computing, Springer, 2002.
[4] Stefan Böhringer, Building a diskless Linux Cluster for high performance
computations from a standard Linux distribution, Technical Report, Institut
für Humangenetik, Universitätsklinikum Essen, 2003.