
Grand Large

Coordinated Checkpoint
Versus Message Log For
Fault Tolerant MPI
Aurélien Bouteiller bouteill@lri.fr
joint work with
F. Cappello, G. Krawezik, P. Lemarinier
Cluster & Grid group, Grand Large Project
http://www.lri.fr/~gk/MPICH-V

HPC trend: Clusters are getting larger

High performance computers have more and more nodes (more than 8000 for ASCI Q, more than 5000 for the BigMac cluster; 1/3 of the installations of the Top500 have more than 500 processors).
More components increase fault probability

The ASCI-Q full system MTBF is estimated (analytically) at a few hours (Petrini, LANL); a 5-hour job on 4096 processors has less than a 50% chance of terminating

Many numerical applications use the MPI (Message Passing Interface) library
Need for automatic fault tolerant MPI

Fault tolerant MPI

A classification of fault tolerant message passing environments considering
A) the level in the software stack where fault tolerance is managed and
B) the fault tolerance techniques.
[Classification chart (omitted). Vertical axis: level in the software stack (framework, API, communication library). Horizontal axis: automatic techniques (coordinated-checkpoint based; log based: optimistic, causal, pessimistic) versus non-automatic techniques (enrichment of MPI, modification of MPI routines, user fault treatment, semi-transparent checkpoint, redundancy of tasks). Systems placed in the chart include Cocheck (independent of MPI), Starfish, Clip, Egida, MPI/FT, FT-MPI, optimistic recovery in distributed systems (n faults with coherent checkpoint), Manetho (causal log, n faults), Pruitt 98 (2 faults, sender based), Sender Based Message Logging (1 fault, sender based), MPI-FT (n faults, centralized server or distributed logging), and MPICH-V2 / MPICH-CL (n faults).]
Several protocols provide fault tolerance in MPI applications with N faults and automatic recovery: global checkpointing, pessimistic/causal message log.
Goal: compare fault tolerant protocols within a single MPI implementation.

Outline

Introduction
Coordinated checkpoint vs message log

Comparison framework

Performance

Conclusions and future work

Fault Tolerant protocols
Problem of inconsistent states

Uncoordinated checkpoint: the problem of inconsistent states
The order of message receptions is a nondeterministic event
Messages received but not sent are inconsistent
The domino effect can lead to a rollback to the beginning of the execution in case of fault
Possible loss of the whole execution and unpredictable fault cost

[Space-time diagram (omitted): processes P0, P1, P2 with checkpoints C1, C2, C3 and messages m1, m2, m3 illustrating an inconsistent recovery line]

Fault Tolerant protocols
Global Checkpoint 1/2

Communication Induced Checkpointing
Does not require global synchronization to provide a globally coherent snapshot
Drawbacks studied in: L. Alvisi, E. Elnozahy, S. Rao, S. A. Husain, and A. De Mel. An analysis of communication induced checkpointing. In 29th Symposium on Fault-Tolerant Computing (FTCS-99). IEEE Press, June 1999.
The number of forced checkpoints increases linearly with the number of nodes: does not scale
The unpredictable checkpoint frequency may lead to taking an overestimated number of checkpoints
Detection of a possible inconsistent state induces blocking checkpoints of some processes
Blocking checkpoints have a dramatic overhead on fault free execution

These protocols may not be used in practice

Fault Tolerant protocols
Global Checkpoint 2/2

Coordinated checkpoint
All processes coordinate their checkpoints so that the global system state is coherent (Chandy & Lamport algorithm)
Negligible overhead on fault free execution
Requires global synchronization (may take a long time to perform the checkpoint because of checkpoint server stress)
In the case of a single fault, all processes have to roll back to their checkpoints: high cost of fault recovery
Efficient when fault frequency is low
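
As a rough illustration of the coordination step, here is a minimal sketch of Chandy & Lamport style marker handling, not the actual MPICH-CL code; MARKER_TAG, save_local_state() and log_in_transit() are hypothetical names:

/* Minimal sketch of Chandy & Lamport marker coordination (illustration only,
 * not the MPICH-CL implementation). The initiating process calls
 * start_snapshot() directly when it receives the checkpoint order. */
#include <mpi.h>
#include <stdio.h>
#include <string.h>

#define MARKER_TAG 4242            /* tag reserved for checkpoint markers */
#define MAX_PROCS  1024

static int snapshot_taken = 0;            /* local state already saved?    */
static int markers_received = 0;          /* markers seen so far           */
static int marker_seen[MAX_PROCS];        /* per-channel marker flag       */

void save_local_state(int rank)           /* e.g. ship the process image   */
{                                         /* to the checkpoint server      */
    printf("rank %d: local checkpoint taken\n", rank);
}

void log_in_transit(int rank, int src)    /* record a message that was     */
{                                         /* in flight on channel src->me  */
    printf("rank %d: logged in-transit message from %d\n", rank, src);
}

/* Save the local state and flood a marker on every outgoing channel. */
void start_snapshot(int rank, int size)
{
    char marker = 0;
    snapshot_taken = 1;
    save_local_state(rank);
    for (int p = 0; p < size; p++)
        if (p != rank)
            MPI_Send(&marker, 1, MPI_CHAR, p, MARKER_TAG, MPI_COMM_WORLD);
}

/* Called by the communication daemon for every incoming message. */
void on_incoming(int rank, int size, int src, int tag)
{
    if (tag == MARKER_TAG) {
        marker_seen[src] = 1;
        if (!snapshot_taken)              /* first marker: checkpoint now  */
            start_snapshot(rank, size);
        if (++markers_received == size - 1) {
            /* all channels flushed: the global snapshot line is complete  */
            snapshot_taken = 0;
            markers_received = 0;
            memset(marker_seen, 0, sizeof marker_seen);
        }
    } else if (snapshot_taken && !marker_seen[src]) {
        /* sent before the sender's marker, received after our checkpoint:
           this message belongs to the channel state                       */
        log_in_transit(rank, src);
    }
}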

Fault tolerant protocols
Message Log 1/2

Pessimistic log
All messages received by a process are logged on a reliable media before it can causally influence the rest of the system
Non negligible overhead on network performance in fault free execution
No need to perform global synchronization: does not stress the checkpoint servers
No need to roll back non failed processes: fault recovery overhead is limited
Efficient when fault frequency is high
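
A minimal sketch of the pessimistic rule, under the assumption that payloads stay on the sender (sender-based logging) and only reception events go to a reliable event logger; the helper names below are hypothetical, not the MPICH-V2 code:

/* Sketch of the pessimistic logging rule (illustration only). */
#include <stdio.h>

struct recv_event {
    int src;              /* which sender was delivered                    */
    int clock;            /* in which order (the nondeterministic outcome) */
};

/* Hypothetical synchronous call: returns only once the reliable event
 * logger has stored the event and acknowledged it. */
void log_event_and_wait_ack(const struct recv_event *ev)
{
    printf("event (src=%d, clock=%d) safely logged\n", ev->src, ev->clock);
}

/* Hypothetical delivery to the application buffer. */
void deliver_to_application(const void *buf, int len) { (void)buf; (void)len; }

/* Pessimistic rule: the reception event is logged and acknowledged BEFORE
 * the message is delivered, so nothing this process sends afterwards can
 * causally depend on an unlogged event. This blocking ack is the latency
 * cost discussed later in the measurements. */
void daemon_receive(int src, const void *buf, int len)
{
    static int clock = 0;
    struct recv_event ev = { src, ++clock };
    log_event_and_wait_ack(&ev);
    deliver_to_application(buf, len);
}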

Fault tolerant protocols
Message Log 2/2

Causal log
Designed to improve the fault free performance of pessimistic log
Messages are logged locally and causal dependencies are piggybacked on messages
Non negligible overhead on fault free execution, slightly better than pessimistic log
No global synchronization: does not stress the checkpoint server
Only failed processes are rolled back
A failed process retrieves its state from the processes that depend on it; events on which no process depends do not need to be recovered
Fault recovery overhead is limited, but greater than pessimistic log

Comparison: Related work

Several protocols perform automatic fault tolerance in MPI applications:
- Coordinated checkpoint
- Causal message log
- Pessimistic message log
All of them have been studied theoretically but not compared.

Egida compared log based techniques:
Sriram Rao, Lorenzo Alvisi, Harrick M. Vin: The cost of recovery in message logging protocols. In 17th Symposium on Reliable Distributed Systems (SRDS), pages 10-18, IEEE Press, October 1998.
- Causal log is better for single node faults
- Pessimistic log is better for concurrent faults

No existing comparison of coordinated and message log protocols
No existing comparable implementations of coordinated and message log protocols
High fault recovery overhead of coordinated checkpoint
High overhead of message logging on fault free performance
Suspected: fault frequency implies a tradeoff
Compare coordinated checkpoint and pessimistic logging

Outline

Introduction

Coordinated checkpoint vs Message log

Related work

Comparison framework

Performance

Conclusions and future work

Architectures
We designed MPICH-CL and MPICH-V2 in a shared framework to perform a fair comparison of coordinated checkpoint and pessimistic message log

MPICH-CL
Chandy & Lamport algorithm
Coordinated checkpoint

MPICH-V2
Pessimistic sender-based message log

Communication daemon
[Architecture diagram (omitted): on each node, the MPI process, CSAC and the CL/V2 communication daemon; the daemon sends the checkpoint image to the Ckpt Server under Ckpt Control, sends reception events to the Event Logger and waits for acks (V2 only), and keeps message payload locally (V2 only)]

CL and V2 share the same architecture; the communication daemon includes the protocol specific actions

Generic device: based on MPICH-1.2.5
A new device: ch_v2 or ch_cl
All ch_xx device functions are blocking communication functions built over the TCP layer

[Layering diagram (omitted): MPI_Send (binding) -> MPID_SendControl / MPID_SendChannel (ADI) -> channel interface / Chameleon interface -> V2/CL device interface (_bsend)]

_bsend    - blocking send
_brecv    - blocking receive
_probe    - check for any message available
_from     - get the source of the last message
_Init     - initialize the client
_Finalize - finalize the client
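
As a rough illustration, this interface can be rendered as a handful of blocking C entry points over TCP; the prototypes and the sockets[] table below are hypothetical, not the actual ch_cl/ch_v2 code:

/* Hypothetical rendering of the blocking ch_cl / ch_v2 device interface
 * listed above (the real MPICH prototypes differ). Every call is
 * implemented by the communication daemon over TCP. */
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

int v2cl_bsend(int dest, const void *buf, size_t len);  /* blocking send    */
int v2cl_brecv(void *buf, size_t len);                   /* blocking receive */
int v2cl_probe(void);          /* check whether any message is available    */
int v2cl_from(void);           /* source of the last received message       */
int v2cl_init(int *argc, char ***argv);   /* initialize the client          */
int v2cl_finalize(void);                  /* finalize the client            */

/* Example body for _bsend: a blocking send is a full write on the TCP
 * socket kept open towards the destination (sockets[] is hypothetical). */
extern int sockets[];

int v2cl_bsend(int dest, const void *buf, size_t len)
{
    const char *p = buf;
    while (len > 0) {
        ssize_t n = write(sockets[dest], p, len);
        if (n <= 0)
            return -1;        /* connection failure: reported to the daemon */
        p   += n;
        len -= (size_t)n;
    }
    return 0;
}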

Node Checkpointing

User-level checkpoint: Condor Stand Alone Checkpointing (CSAC)
Clone checkpointing + non blocking checkpoint:
(1) fork
(2) terminate ongoing communications
(3) close sockets
(4) call ckpt_and_exit()
Execution resumes (using CSAC) just after (4): reopen sockets and return
[Diagram (omitted): the checkpoint order reaches libmpichv, which forks a clone checkpointed by CSAC]
The checkpoint image is sent to the reliable CS on the fly
(local storage does not ensure fault tolerance)
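
A minimal sketch of this clone checkpoint sequence; ckpt_and_exit() is the CSAC call named above, finish_pending_comms() and close_all_sockets() are hypothetical helpers, and the split of work between parent and clone is simplified:

/* Sketch of the non-blocking clone checkpoint (illustration only). */
#include <sys/types.h>
#include <unistd.h>

extern void finish_pending_comms(void);  /* (2) terminate ongoing comms    */
extern void close_all_sockets(void);     /* (3) image must not capture     */
                                         /*     open TCP connections       */
extern void ckpt_and_exit(void);         /* (4) CSAC writes the image and  */
                                         /*     terminates the clone       */

void on_checkpoint_order(void)
{
    pid_t pid = fork();                  /* (1) clone the MPI process      */
    if (pid == 0) {
        /* clone: its address space is a frozen copy of the process state */
        finish_pending_comms();
        close_all_sockets();
        ckpt_and_exit();                 /* the image is streamed to the
                                            checkpoint server on the fly   */
    }
    /* parent: keeps computing while the clone is checkpointed; a restarted
       image resumes just after (4), reopens its sockets and returns       */
}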

Checkpoint scheduler policy

In MPICH-V2, the checkpoint scheduler is not required by the pessimistic protocol. It is used to minimize the size of the checkpointed payload using a best effort heuristic.
Policy: permanent individual checkpoint
[Diagram (omitted): checkpoint scheduler, nodes and message payload]

In MPICH-CL, the checkpoint scheduler is a dedicated process used to initiate checkpoints.
Policy: checkpoint every n seconds, where n is a runtime parameter
[Diagram (omitted): checkpoint scheduler synchronizing the nodes and the checkpoint servers]
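
A sketch of the MPICH-CL policy above (checkpoint every n seconds, n given at runtime); num_nodes, send_checkpoint_order() and wait_checkpoint_complete() are hypothetical names:

/* Sketch of the periodic checkpoint scheduler policy (illustration only). */
#include <stdlib.h>
#include <unistd.h>

extern int  num_nodes;
extern void send_checkpoint_order(int node);  /* start the coordinated wave */
extern void wait_checkpoint_complete(void);   /* all images stored on the CS */

void scheduler_loop(int argc, char **argv)
{
    unsigned n = (argc > 1) ? (unsigned)atoi(argv[1]) : 60;   /* period (s) */

    for (;;) {
        sleep(n);
        for (int node = 0; node < num_nodes; node++)
            send_checkpoint_order(node);      /* every node checkpoints in
                                                 the same coordinated wave  */
        wait_checkpoint_complete();
    }
}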

Outline

Introduction

Coordinated checkpoint vs Message log

Comparison framework

Performance

Conclusions and future work

Experimental conditions
Cluster: 32 Athlon 1800+ CPUs, 1 GB memory, IDE disk
+ 16 dual Pentium III, 500 MHz, 512 MB, IDE disk
+ 48-port 100 Mb/s Ethernet switch
Linux 2.4.18, GCC 2.96 (-O3), PGI Fortran <5 (-O3, -tp=athlonxp)
A single reliable node hosts the Checkpoint Server
+ Event Logger (V2 only)
+ Checkpoint Scheduler
+ Dispatcher
[Network diagram (omitted): the reliable node and the compute nodes connected through the switch]

Bandwidth and latency

Latency for a 0 byte MPI message:
MPICH-P4 (77us), MPICH-CL (154us), MPICH-V2 (277us)

Latency is higher in MPICH-CL due to more memory copies compared to P4
Latency is even higher in MPICH-V2 due to the event logging: a receiving process can send a new message only when the reception event has been successfully logged (3 TCP messages for a communication)
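
Latencies of this kind are usually obtained with a 0-byte ping-pong; a minimal MPI version is sketched below (not the exact benchmark used for these measurements, which is not given here):

/* Minimal 0-byte ping-pong between ranks 0 and 1 (run with 2 processes). */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, i, iters = 10000;
    char dummy = 0;
    MPI_Status status;
    double t0, t1;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Barrier(MPI_COMM_WORLD);
    t0 = MPI_Wtime();
    for (i = 0; i < iters; i++) {
        if (rank == 0) {
            MPI_Send(&dummy, 0, MPI_BYTE, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(&dummy, 0, MPI_BYTE, 1, 0, MPI_COMM_WORLD, &status);
        } else if (rank == 1) {
            MPI_Recv(&dummy, 0, MPI_BYTE, 0, 0, MPI_COMM_WORLD, &status);
            MPI_Send(&dummy, 0, MPI_BYTE, 0, 0, MPI_COMM_WORLD);
        }
    }
    t1 = MPI_Wtime();

    if (rank == 0)   /* one-way latency = half the round-trip time */
        printf("0 byte latency: %.1f us\n",
               (t1 - t0) / iters / 2.0 * 1e6);
    MPI_Finalize();
    return 0;
}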

Benchmark applications
Validating our implementations on the NAS BT benchmark, classes A and B, shows performance comparable to the P4 reference implementation.
As expected, MPICH-CL reaches better fault free performance than MPICH-V2.

Checkpoint server performance

[Plot (omitted): checkpoint time (seconds) vs. size of process (MB)]
Time to checkpoint a process according to its size.
Checkpoint time increases linearly with checkpoint size.
Memory swap overhead appears at 512 MB (fork).

[Plot (omitted): time to checkpoint all processes (seconds) vs. number of processes checkpointing simultaneously]
Time to checkpoint all processes concurrently on a single checkpoint server.
A 2nd process does not increase checkpoint time, filling unused bandwidth.
More processes increase checkpoint time linearly.

BT Checkpoint and Restart Performance

Considering the same dataset, the per process image size decreases when the number of processes increases
As a consequence, the time to checkpoint remains constant with an increasing number of processes
Performing a complete asynchronous checkpoint takes as much time as a coordinated checkpoint
The time to restart after a fault decreases with the number of nodes for V2 and does not change for CL

Fault impact on performance

NAS benchmark BT B, 25 nodes (32 MB per process image size)
Average time to perform a checkpoint:
MPICH-CL: 68s
MPICH-V2: 73.9s
Average time to recover from a failure:
MPICH-CL: 65.8s
MPICH-V2: 5.3s

If we consider a 1 GB memory occupation for every process, an extrapolation gives a 2000s checkpoint time for 25 nodes in MPICH-CL. The minimum fault interval ensuring progression of the computation becomes about 1h.
MPICH-V2 can tolerate a high fault rate; MPICH-CL cannot ensure termination of the execution for a high fault rate.
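
A rough sanity check of the 2000s extrapolation, under the assumption that the single 100 Mb/s link to the checkpoint server (about 12.5 MB/s) is the bottleneck: 25 processes x 1 GB = 25 GB of images, and 25 GB / 12.5 MB/s is roughly 2000 s.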

Outline

Introduction

Coordinated checkpoint vs Message log

Comparison framework

Performance

Conclusions and future work

Conclusion

MPICH-CL and MPICH-V2 are two comparable implementations of fault tolerant MPI based on MPICH-1.2.5, one using coordinated checkpoint, the other pessimistic message log
We have compared the overhead of these two techniques according to fault frequency
The recovery overhead is the main factor differentiating performance
We have found a crossover point from which message log becomes better than coordinated checkpoint. On our test application this crossover point appears near 1 fault per 10 minutes. With a 1 GB application, coordinated checkpoint does not ensure progress of the computation for one fault every hour.

Perspectives

Larger scale experiments
Use more nodes and applications with realistic amounts of memory
High performance network experiments (Myrinet, Infiniband)

Comparison with causal log => MPICH-V2C vs augmented MPICH-CL
MPICH-V2C is a causal log implementation, removing the high latency impact induced by pessimistic log
MPICH-CL is being modified to restart non failed nodes from local checkpoints, removing the high restart overhead

MPICH-V2C

MPICH-V2 suffers from high latency (pessimistic protocol)
MPICH-V2C corrects this drawback at the expense of an increase in average message size (causal log protocol)

Pessimistic (V2): P could have sent s1, but has to wait for the event logger's ack of the log of the preceding receptions (r1, r2); s1 is delayed.
Causal (V2C): P sends s1 before the acknowledgement from the EL; in that case the causality information is piggybacked on the message. Once the ACK for r1 and r2 has arrived, P can stop piggybacking (s2 carries nothing to piggyback).
[Timeline diagrams (omitted): process P, event logger EL, receptions r1 and r2, sends s1 and s2]
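
A minimal sketch of this causal rule (also the rule outlined on the Message Log 2/2 slide): events are shipped to the EL asynchronously, and any message sent before the ACK carries the not-yet-acknowledged events. The event logger transport and helper names are hypothetical, not the MPICH-V2C code:

/* Sketch of the causal logging rule (illustration only). */
#define MAX_PENDING 64

struct event { int src, clock; };             /* a reception determinant   */

static struct event pending[MAX_PENDING];     /* logged but not yet ACKed  */
static int n_pending = 0;

extern void send_to_event_logger(const struct event *ev);    /* async     */
extern void transmit(int dest, const void *payload, int len,
                     const struct event *piggyback, int n_piggyback);

/* On every reception: record the event locally and ship it to the EL,
 * but do NOT wait for the acknowledgement. */
void on_receive(int src, int clock)
{
    struct event ev = { src, clock };
    if (n_pending < MAX_PENDING)
        pending[n_pending++] = ev;
    send_to_event_logger(&ev);
}

/* On every send: piggyback whatever the EL has not acknowledged yet, so
 * the causality information the message depends on always travels with it. */
void on_send(int dest, const void *payload, int len)
{
    transmit(dest, payload, len, pending, n_pending);
}

/* When the EL acknowledges events, they can be dropped from the piggyback
 * (this is why s2 in the timeline carries nothing to piggyback). */
void on_ack(int acked_up_to_clock)
{
    int kept = 0;
    for (int i = 0; i < n_pending; i++)
        if (pending[i].clock > acked_up_to_clock)
            pending[kept++] = pending[i];
    n_pending = kept;
}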
