Failover In-Depth

You might also like

Download as pdf or txt
Download as pdf or txt
You are on page 1of 4

Research Journal of Information Technology 2(2): 35-38, 2010

ISSN: 2041-3114
© M axwell Scientific Organization, 2010
Submitted date: April 30, 2010 Accepted date: June 14, 2010 Published date: September 13, 2010

Using Server Clusterization to Establish Fault-Tolerant Internet Connectivity


1
O.O . Adeosu n, 2 E.R. Adagunodo and 3 E.A. Olajubu
1
Department of Computer Science and Engineering, Ladoke Akintola University of Technology,
Ogbom oso, Nigeria
2,3
Departm ent of Computer Science and Engin eering, Obafemi A wolowo University,
Ile-Ife, N igeria

Abstract: This study discusses the issue of providing tolerance to hardware and software faults in Internet
system as well as issues related to clusterization of servers. A replication scheme is presented, and a detailed
dependability analysis of this scheme is performed. The proposed model was designed mainly for fault-tolerant
internet system where many unrelated applications could compete for hardware and software resources, thereby
exhibiting highly varying and dynamic system characteristics. A major feature of the model under consideration
is to attem pt the adaptive execution of red undant compo nents for a req uired level of fault toleran ce.

Key w ords: Clusterization, serve rs, replica tion sch eme , redun dant components, depe ndability ana lysis,
reliability, parallel processing, real-time processing, stochastic modeling

INTRODUCTION techniques intended to cope with the effects of faults and


avert the occurrence of failures or at least to warn a user
Fault-tolerant computing is the art and science of that errors have been introduced into the state of the
building computing systems that continue to operate system (Anderson, 1985). Clients in high-available
satisfactorily in the presence of faults. Fault-tolerance is systems can re-route transparently faulty requests to
achieved by applying a set of analysis and design another application server (or failover). To implement
techniques to create system s with dramatically improved failover, it requires replicating service on multiple nodes,
dependability. Fault tolerance and dependable systems storing distributed checkpoint and synchronizing replicas.
research covers a wide spectrum of applications ranging Using the integ rated approach to the incorporation of
across embedded real-time systems, comm ercial fault tolerance, some o f existing software fault tolerance
transaction system s, transportation systems and approaches will be extended to the treatmen t of both
military/space system s - to nam e a few . The supporting hardw are and software faults (hybrid faults). Two typical
research includes system architecture, design techniques, schemes are taken into account - recovery blocks
coding theory, testing, validation, proof of correctness, (Randell, 1975) and Self-C onfiguring Optimistic
modeling, software reliability, operating systems, parallel Programming (SCOP), an adaptive scheme introduced
processing, and real-time processing. These areas often recen tly by B ondavalli et al. (1993). N-version
involve widely diverse core expertise ranging from formal programming (Avizienis and Chen, 19 77) is also chosen
logic, mathematics of stochastic modeling, graph theory, as a representative of non-adaptive schemes for the sake
hardware design and software engineering. of comp arison. These architectural solutions spec ially
Replication is one of the oldest and most important directed to Internet system are then analyzed with respect
topics in the overall area of distributed systems. Whether to dependability, availability, accessibility and
one replicates data or com putation, the objective is to restartability.
have some group of processes that handle incoming Similar works, in particular, Laprie et al. (1990)
events. If we replicate data, these processes are passive presented a set of hybrid-fault-tolerant architectures and
and opera te only to maintain the stored data, reply to read analyzed and evaluated three of them. Their architectures
requests, and apply updates when we replicate are based on a fixed set of hardware com ponents and not
computation, the usual goal is to provide fault-tolerance. related to the dynamicity of hardware resources availab le
The development of highly dependable computing as well as the efficiency issues. Such architectural
systems usually requires the comb ined utilization of a solutions cannot well match the characteristics of
wide range of technique s, including fault tolerance dependable Internet system in which the resources must

Corresponding Author: O.O. Adeosun, Department of Computer Science and Engineering, Ladoke Akintola University of
Technology, Ogbomoso, Nigeria
35
Res. J. Inform. Technol., 2(2): 35-38, 2010

Fig. 1: Replicated servers model

be competed by many unrelated but concurrent service primary using timeout and its stub automatically re-routes
requests. In such varying environments the architectures a faulty request to an alternative server.
with the fixed requirem ent to hardw are com ponen ts are The weakness of primary/backup schemes is that in
either inefficient or infeasible. settings where both processes could have been active,
only one is actually performing operations. It is true we
Fault-Tolerant Internet Con nectivity Mod el: Guerraoui are gaining fault-tolerance but spending twice as much
and Schiper (19 96) opine that grou p com mun ication money to get this property. An outgrow th of this work
enables encapsulating a set of entities that cooperate to
was the emergence of schemes in which a group of
achieve some common service. A group has a logical
replicas could cooperate, with each process backup the
address, which allows clients ignore the existence of its
others, and each handling some share of the workload.
membe rs. In Fig. 1, a set of replicated servers composes
the group. All replicas must provide access to the same
methods and h ave to maintain the same state. To achieve Com munication mo del: We use an asynchronous
this, there sh ould be strong consistency . This w ill enable communication in our model; even overload server can be
read-o ne-w rite-all replica s. assumed as fault-suspected because there is no w ay to
distinguish between overload and faulty servers. Also, we
From primary to backup replication: In the replicated use an underneath group-communication layer to provide
service, one of the replicated servers (the primary) the needed multicast primitive and also server service.
executes a transaction locally. Many classical approaches The server service manages the replicated servers and
to replication are based on a primary/backup model where detects fault-suspected servers removing them from the
one device or process has unilateral control over one or group. The group communication layer operates in the
more other processes or devices. For example, the presence of message omission faults, processor crashes
primary might perform some computation, streaming a and recov eries as well network partitions and m erges.
log of updates to a backup (standby) proce ss, which can
then take over if the primary fails, that is, it forwards Design approach: As the number of nodes in a
updates to all other group member (backups) using the
distributed computation increases, so does the probability
total order multicast primitive (TOCAST) (Guerraoui and
for failure. If one thinks of a system as a collection of
Schiper, 1996). This primitive ensures that updates are
functionalities that must perform specific tasks, then the
delivered in the same order by all correct processes that
design of a survivable system can be thought of as a
work according to their specification. The termination
property of the TOCA ST assures the distributed system multistage proce ss. It should be noted that, in a malicious
progress despite of failures, as well as its non-blocking environment, each stage has its limitations.
characteristics. Typically, the primary waits for all In traditional fault-tolerance, tolerating faults is
back up an swers and return s respo nse to the clien t. typically achieved utilizing the principle of redundancy.
In order to avoid bottleneck, any replica can be
enabled to play the primary role. Backu p failure is Information redundancy: Usually considers the
transparent to the requester, but faulty primaries require inclusion of add itional information as the basis for fault
achieving failover. In this study, a clien t detects a faulty- recov ery. A typ ical exa mple is an error correction code.

36
Res. J. Inform. Technol., 2(2): 35-38, 2010

Time redundancy: Relies on m ultiple executions skewed Implementation Issues: Tw o OpenSource pro jects were
in time on the sam e node a nd is often used to mask identified: Java-G roups (B an, 1999 ) and JOnAS (Java
omissions. Open Application Server) (Danes et al., 2000). Our
replicated server is been deve loped to match the two
Spatial redundancy: Uses multiple components, each OpenSource. In our model, we changed some classes of
computing a value, and the final value is derived from a the JOnAS to include the TOCAST p rimitive in the
convergence function (e.g., majority voting). The server-side. Replicated servers join the group and use this
resulting N-m odular redundant (N M R) sy stem primitive to setting the distributed checkpoint. W e
implements a k-of-N system, which implies that the implement the distributed check point selecting, at
system functions as long as k or m ore com ponen ts are compiling time, updates to be forwarded during the
service execution. An update is assumed to be a method
fault free. A typical configuration is a triple-redundant
without result (it return s a null value). In the client-side,
redundancy (TM), which is a 2-of-3 system.
we modify the client’s stub to automatically re-route
faulty requests.
Enabling recovery failures and providing failover
service to users: Achieving the proposed Internet fault- CONCLUSION
tolerant service using replicated servers requires treating
client-primary as well primary-backups interaction. The Transactional systems could benefit from group
model handles client-primary interaction switching of the communication to achieve fault tolerance. The systems
client requests to alternative server, when the current are more available for service delivery and multicasting
service is interrupted. The work also handles primary- just updates provides good performance, eliminating
backup interaction implementing distributed checkpoints. additional co mm unica tion rou nds.
Recover from a failed server is ea sier. It just requires re- Also, we expect that replication improves the
routing clients’ requests. application response time, when compared with non-
replicated application servers, by allowing requests to be
Distributed checkpoint implementation: A distributed handled by several nodes rather than one besides
checkpoint contains all local snapshots placed in all the eliminating a single point-of-failure. In addition,
replicated servers. Each snapshot holds information about deployment and redeployment of new and recovered
the last executed method, the client who req uested this servers are necessa ry to maintain the Internet availability.
method and the server who executed this m ethod . This
follows a distributed checkpoint approach, which ACKNOWLEDGEMENT
multicasts a snapshot from a primary to all other servers.
W henever the primary receives a transactional request The authors are thankful to Prof. Adetunde, I.A., the
Dean of Engineering, University of Mines and
(using point-to-point commun ication) from a client, it
Technology, Tarkwa, Ghana, for his advice, valuable
updates its own state and m ulticasts synchronization
comments, sugg estions and for making th is article
messages to the backups using the TOCAST primitive.
publishable.
The primary verifies if the distributed checkpoint was
successfully established (waiting for all backup REFERENCES
confirmation me ssages) and answers the client.
Backups process the synchronization messages and Anderson, T., 1985. Resilient Computing Systems,
autom atically store updates in their own states to establish London, UK, Collins Professional and Technical
the distributed checkpoint and to reflect a single Books.
distributed global state. If a server fails, clients are Avizienis, A. and L. Chen, 1977. On the implementation
guaranteed access to the same data through the backups. of N-version programm ing for software fault
W hen a client-server conn ection is closed, all servers tolerance during program execution. COM PSAC 77,
remove the information about the distributed checkpoint Chicago, IL,pp: 149-155.
for that client. Storing this inform ation w ill enable Ban, B., 1999. JavaGroups User’s Guide, Department of
autom atic failover during a transaction execution. The Computer Science, Cornell University, pp: 73.
non-finished methods will be executed in another server. http://JavaGroups sourceforg e.net/.
Bondavalli, A., F. d i Gian dom enico and J. Xu, 1993 . A
Propagating updates to backups: There are tw o possible cost-effective and flexible scheme for softw are fault
strategies to propagate updates: deferred update and tolerance. J. Comput. Sys. SC. Eng., 8(4): 234-244.
immediate update (W iesmann et al., 2000). In deferred Danes, A., P. Dechamboux, M. Riveill and G. Vandome,
update, transactions are processed locally at one server 2000. Technologie a Base de C omposants EJB
and are forw ard to the backups at the comm it time w hile Experience et Perspectives Avec JOnAS. OCM 2000,
the immediate update synchronizes every transaction Nantes, Mai 2000, pp: 11-13. Retrieved from:
across all servers. http://www .objectweb.org/jonas/.

37
Res. J. Inform. Technol., 2(2): 35-38, 2010

Guerraou i, R. and A. Schip er, 1996. Fault_Tolerance by Randell, B., 1975. System structure for software fault
Replication in Distributed Systems. Departement tolerance. IEEE TSE, Vol. SE-1, No. 2, pp: 220-232.
d’Informatique Ecole Polytechnique Federale de W iesmann, M., F. Pedone, A. Schiper, B. Kemme and
Lausanne, 1996. G. Alonso, 2000. Understanding replication in
Laprie, J.C., J. Arlat, C . Beounes, K. Kanoun and databases and distributed systems. Proceedings of
C. Hourtolle, 1987. Definition and analysis of ICDCS 2000, Taipe i, Taiw an, R.O.C., April 2000,
hardware-and-softw are fault-tolerant architectures. pp: 264-274.
IEEE Comput., 23 (7): 39-51.

38

You might also like