
Grand Large

Coordinated Checkpoint
Versus Message Log For
Fault Tolerant MPI
Aurélien Bouteiller bouteill@lri.fr
joint work with
F. Cappello, G. Krawezik, P. Lemarinier
Cluster & Grid group, Grand Large Project
http://www.lri.fr/~gk/MPICH-V

HPC trend: Clusters are getting larger

High performance computers have more and more nodes (more than 8000 for ASCI Q, more than 5000 for the BigMac cluster; 1/3 of the installations of the Top500 have more than 500 processors).
More components increase fault probability

The ASCI-Q full system MTBF is estimated (analytically) at a few hours (Petrini, LANL); a 5-hour job on 4096 processors has less than a 50% chance of terminating

Many numerical applications use the MPI (Message Passing Interface) library
Need for automatic fault tolerant MPI

Fault tolerant MPI

A classification of fault tolerant message passing environments considering
A) the level in the software stack where fault tolerance is managed and
B) the fault tolerance techniques.
[Classification chart (omitted). Vertical axis: level in the software stack (framework, API, communication library). Horizontal axis: automatic techniques (coordinated-checkpoint based; log based: optimistic, causal, pessimistic) versus non-automatic techniques (enrichment of MPI, modification of MPI routines, user fault treatment, semi-transparent checkpoint, redundancy of tasks). Systems placed in the chart include Cocheck (independent of MPI), Starfish, Clip, Egida, MPI/FT, FT-MPI, optimistic recovery in distributed systems (n faults with coherent checkpoint), Manetho (causal log, n faults), Pruitt 98 (2 faults, sender based), Sender Based Message Logging (1 fault, sender based), MPI-FT (n faults, centralized server or distributed logging), and MPICH-V2 / MPICH-CL (n faults).]
Several protocols provide fault tolerance in MPI applications with N faults and automatic recovery: global checkpointing, pessimistic/causal message log.
Goal: compare fault tolerant protocols within a single MPI implementation.

Outline

Introduction
Coordinated checkpoint vs message log

Comparison framework

Performance

Conclusions and future work

Fault Tolerant protocols
Problem of inconsistent states

Uncoordinated checkpoint: the problem of inconsistent states
The order of message receptions is a nondeterministic event
Messages received but not sent are inconsistent
The domino effect can lead to a rollback to the beginning of the execution in case of fault
Possible loss of the whole execution and unpredictable fault cost

[Space-time diagram (omitted): processes P0, P1, P2 with checkpoints C1, C2, C3 and messages m1, m2, m3 illustrating an inconsistent recovery line]

Fault Tolerant protocols
Global Checkpoint 1/2

Communication Induced Checkpointing
Does not require global synchronization to provide a globally coherent snapshot
Drawbacks studied in: L. Alvisi, E. Elnozahy, S. Rao, S. A. Husain, and A. De Mel. An analysis of communication induced checkpointing. In 29th Symposium on Fault-Tolerant Computing (FTCS-99). IEEE Press, June 1999.
The number of forced checkpoints increases linearly with the number of nodes: does not scale
The unpredictable checkpoint frequency may lead to taking an overestimated number of checkpoints
Detection of a possible inconsistent state induces blocking checkpoints of some processes
Blocking checkpoints have a dramatic overhead on fault free execution

These protocols may not be used in practice

Fault Tolerant protocols
Global Checkpoint 2/2

Coordinated checkpoint
All processes coordinate their checkpoints so that the global system state is coherent (Chandy & Lamport algorithm)
Negligible overhead on fault free execution
Requires global synchronization (may take a long time to perform the checkpoint because of checkpoint server stress)
In the case of a single fault, all processes have to roll back to their checkpoints: high cost of fault recovery
Efficient when fault frequency is low
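
As a rough illustration of the coordination step, here is a minimal sketch of Chandy & Lamport style marker handling, not the actual MPICH-CL code; MARKER_TAG, save_local_state() and log_in_transit() are hypothetical names:

/* Minimal sketch of Chandy & Lamport marker coordination (illustration only,
 * not the MPICH-CL implementation). The initiating process calls
 * start_snapshot() directly when it receives the checkpoint order. */
#include <mpi.h>
#include <stdio.h>
#include <string.h>

#define MARKER_TAG 4242            /* tag reserved for checkpoint markers */
#define MAX_PROCS  1024

static int snapshot_taken = 0;            /* local state already saved?    */
static int markers_received = 0;          /* markers seen so far           */
static int marker_seen[MAX_PROCS];        /* per-channel marker flag       */

void save_local_state(int rank)           /* e.g. ship the process image   */
{                                         /* to the checkpoint server      */
    printf("rank %d: local checkpoint taken\n", rank);
}

void log_in_transit(int rank, int src)    /* record a message that was     */
{                                         /* in flight on channel src->me  */
    printf("rank %d: logged in-transit message from %d\n", rank, src);
}

/* Save the local state and flood a marker on every outgoing channel. */
void start_snapshot(int rank, int size)
{
    char marker = 0;
    snapshot_taken = 1;
    save_local_state(rank);
    for (int p = 0; p < size; p++)
        if (p != rank)
            MPI_Send(&marker, 1, MPI_CHAR, p, MARKER_TAG, MPI_COMM_WORLD);
}

/* Called by the communication daemon for every incoming message. */
void on_incoming(int rank, int size, int src, int tag)
{
    if (tag == MARKER_TAG) {
        marker_seen[src] = 1;
        if (!snapshot_taken)              /* first marker: checkpoint now  */
            start_snapshot(rank, size);
        if (++markers_received == size - 1) {
            /* all channels flushed: the global snapshot line is complete  */
            snapshot_taken = 0;
            markers_received = 0;
            memset(marker_seen, 0, sizeof marker_seen);
        }
    } else if (snapshot_taken && !marker_seen[src]) {
        /* sent before the sender's marker, received after our checkpoint:
           this message belongs to the channel state                       */
        log_in_transit(rank, src);
    }
}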

Fault tolerant protocols
Message Log 1/2

Pessimistic log
All messages received by a process are logged on a reliable media before it can causally influence the rest of the system
Non negligible overhead on network performance in fault free execution
No need to perform global synchronization: does not stress the checkpoint servers
No need to roll back non failed processes: fault recovery overhead is limited
Efficient when fault frequency is high
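
A minimal sketch of the pessimistic rule, under the assumption that payloads stay on the sender (sender-based logging) and only reception events go to a reliable event logger; the helper names below are hypothetical, not the MPICH-V2 code:

/* Sketch of the pessimistic logging rule (illustration only). */
#include <stdio.h>

struct recv_event {
    int src;              /* which sender was delivered                    */
    int clock;            /* in which order (the nondeterministic outcome) */
};

/* Hypothetical synchronous call: returns only once the reliable event
 * logger has stored the event and acknowledged it. */
void log_event_and_wait_ack(const struct recv_event *ev)
{
    printf("event (src=%d, clock=%d) safely logged\n", ev->src, ev->clock);
}

/* Hypothetical delivery to the application buffer. */
void deliver_to_application(const void *buf, int len) { (void)buf; (void)len; }

/* Pessimistic rule: the reception event is logged and acknowledged BEFORE
 * the message is delivered, so nothing this process sends afterwards can
 * causally depend on an unlogged event. This blocking ack is the latency
 * cost discussed later in the measurements. */
void daemon_receive(int src, const void *buf, int len)
{
    static int clock = 0;
    struct recv_event ev = { src, ++clock };
    log_event_and_wait_ack(&ev);
    deliver_to_application(buf, len);
}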

Fault tolerant protocols
Message Log 2/2

Causal log
Designed to improve the fault free performance of pessimistic log
Messages are logged locally and causal dependencies are piggybacked on messages
Non negligible overhead on fault free execution, slightly better than pessimistic log
No global synchronization: does not stress the checkpoint server
Only failed processes are rolled back
A failed process retrieves its state from the processes that depend on it; events on which no process depends do not need to be recovered
Fault recovery overhead is limited, but greater than pessimistic log

Comparison: Related work

Several protocols perform automatic fault tolerance in MPI applications:
- Coordinated checkpoint
- Causal message log
- Pessimistic message log
All of them have been studied theoretically but not compared.

Egida compared log based techniques:
Sriram Rao, Lorenzo Alvisi, Harrick M. Vin: The cost of recovery in message logging protocols. In 17th Symposium on Reliable Distributed Systems (SRDS), pages 10-18, IEEE Press, October 1998.
- Causal log is better for single node faults
- Pessimistic log is better for concurrent faults

No existing comparison of coordinated and message log protocols
No existing comparable implementations of coordinated and message log protocols
High fault recovery overhead of coordinated checkpoint
High overhead of message logging on fault free performance
Suspected: fault frequency implies a tradeoff
Compare coordinated checkpoint and pessimistic logging

Outline

Introduction

Coordinated checkpoint vs Message log

Related work

Comparison framework

Performance

Conclusions and future work

Architectures
We designed MPICH-CL and MPICH-V2 in a shared framework to perform a fair comparison of coordinated checkpoint and pessimistic message log

MPICH-CL
Chandy & Lamport algorithm
Coordinated checkpoint

MPICH-V2
Pessimistic sender-based message log

Communication daemon
[Architecture diagram (omitted): on each node, the MPI process, CSAC and the CL/V2 communication daemon; the daemon sends the checkpoint image to the Ckpt Server under Ckpt Control, sends reception events to the Event Logger and waits for acks (V2 only), and keeps message payload locally (V2 only)]

CL and V2 share the same architecture; the communication daemon includes the protocol specific actions

Generic device: based on MPICH-1.2.5
A new device: ch_v2 or ch_cl
All ch_xx device functions are blocking communication functions built over the TCP layer

[Layering diagram (omitted): MPI_Send (binding) -> MPID_SendControl / MPID_SendChannel (ADI) -> channel interface / Chameleon interface -> V2/CL device interface (_bsend)]

_bsend    - blocking send
_brecv    - blocking receive
_probe    - check for any message available
_from     - get the source of the last message
_Init     - initialize the client
_Finalize - finalize the client
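
As a rough illustration, this interface can be rendered as a handful of blocking C entry points over TCP; the prototypes and the sockets[] table below are hypothetical, not the actual ch_cl/ch_v2 code:

/* Hypothetical rendering of the blocking ch_cl / ch_v2 device interface
 * listed above (the real MPICH prototypes differ). Every call is
 * implemented by the communication daemon over TCP. */
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

int v2cl_bsend(int dest, const void *buf, size_t len);  /* blocking send    */
int v2cl_brecv(void *buf, size_t len);                   /* blocking receive */
int v2cl_probe(void);          /* check whether any message is available    */
int v2cl_from(void);           /* source of the last received message       */
int v2cl_init(int *argc, char ***argv);   /* initialize the client          */
int v2cl_finalize(void);                  /* finalize the client            */

/* Example body for _bsend: a blocking send is a full write on the TCP
 * socket kept open towards the destination (sockets[] is hypothetical). */
extern int sockets[];

int v2cl_bsend(int dest, const void *buf, size_t len)
{
    const char *p = buf;
    while (len > 0) {
        ssize_t n = write(sockets[dest], p, len);
        if (n <= 0)
            return -1;        /* connection failure: reported to the daemon */
        p   += n;
        len -= (size_t)n;
    }
    return 0;
}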

Node Checkpointing

User-level checkpoint: Condor Stand Alone Checkpointing (CSAC)
Clone checkpointing + non blocking checkpoint:
(1) fork
(2) terminate ongoing communications
(3) close sockets
(4) call ckpt_and_exit()
Execution resumes (using CSAC) just after (4): reopen sockets and return
[Diagram (omitted): the checkpoint order reaches libmpichv, which forks a clone checkpointed by CSAC]
The checkpoint image is sent to the reliable CS on the fly
(local storage does not ensure fault tolerance)
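
A minimal sketch of this clone checkpoint sequence; ckpt_and_exit() is the CSAC call named above, finish_pending_comms() and close_all_sockets() are hypothetical helpers, and the split of work between parent and clone is simplified:

/* Sketch of the non-blocking clone checkpoint (illustration only). */
#include <sys/types.h>
#include <unistd.h>

extern void finish_pending_comms(void);  /* (2) terminate ongoing comms    */
extern void close_all_sockets(void);     /* (3) image must not capture     */
                                         /*     open TCP connections       */
extern void ckpt_and_exit(void);         /* (4) CSAC writes the image and  */
                                         /*     terminates the clone       */

void on_checkpoint_order(void)
{
    pid_t pid = fork();                  /* (1) clone the MPI process      */
    if (pid == 0) {
        /* clone: its address space is a frozen copy of the process state */
        finish_pending_comms();
        close_all_sockets();
        ckpt_and_exit();                 /* the image is streamed to the
                                            checkpoint server on the fly   */
    }
    /* parent: keeps computing while the clone is checkpointed; a restarted
       image resumes just after (4), reopens its sockets and returns       */
}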

Checkpoint scheduler policy

In MPICH-V2, the checkpoint scheduler is not required by the pessimistic protocol. It is used to minimize the size of the checkpointed payload using a best effort heuristic.
Policy: permanent individual checkpoint
[Diagram (omitted): checkpoint scheduler, nodes and message payload]

In MPICH-CL, the checkpoint scheduler is a dedicated process used to initiate checkpoints.
Policy: checkpoint every n seconds, where n is a runtime parameter
[Diagram (omitted): checkpoint scheduler synchronizing the nodes and the checkpoint servers]
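
A sketch of the MPICH-CL policy above (checkpoint every n seconds, n given at runtime); num_nodes, send_checkpoint_order() and wait_checkpoint_complete() are hypothetical names:

/* Sketch of the periodic checkpoint scheduler policy (illustration only). */
#include <stdlib.h>
#include <unistd.h>

extern int  num_nodes;
extern void send_checkpoint_order(int node);  /* start the coordinated wave */
extern void wait_checkpoint_complete(void);   /* all images stored on the CS */

void scheduler_loop(int argc, char **argv)
{
    unsigned n = (argc > 1) ? (unsigned)atoi(argv[1]) : 60;   /* period (s) */

    for (;;) {
        sleep(n);
        for (int node = 0; node < num_nodes; node++)
            send_checkpoint_order(node);      /* every node checkpoints in
                                                 the same coordinated wave  */
        wait_checkpoint_complete();
    }
}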

Outline

Introduction

Coordinated checkpoint vs Message log

Comparison framework

Performance

Conclusions and future work

Experimental conditions
Cluster: 32 Athlon 1800+ CPUs, 1 GB memory, IDE disk
+ 16 dual Pentium III, 500 MHz, 512 MB, IDE disk
+ 48-port 100 Mb/s Ethernet switch
Linux 2.4.18, GCC 2.96 (-O3), PGI Fortran <5 (-O3, -tp=athlonxp)
A single reliable node hosts the Checkpoint Server
+ Event Logger (V2 only)
+ Checkpoint Scheduler
+ Dispatcher
[Network diagram (omitted): the reliable node and the compute nodes connected through the switch]

Bandwidth and latency

Latency for a 0 byte MPI message:
MPICH-P4 (77us), MPICH-CL (154us), MPICH-V2 (277us)

Latency is higher in MPICH-CL due to more memory copies compared to P4
Latency is even higher in MPICH-V2 due to the event logging: a receiving process can send a new message only when the reception event has been successfully logged (3 TCP messages for a communication)
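
Latencies of this kind are usually obtained with a 0-byte ping-pong; a minimal MPI version is sketched below (not the exact benchmark used for these measurements, which is not given here):

/* Minimal 0-byte ping-pong between ranks 0 and 1 (run with 2 processes). */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, i, iters = 10000;
    char dummy = 0;
    MPI_Status status;
    double t0, t1;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Barrier(MPI_COMM_WORLD);
    t0 = MPI_Wtime();
    for (i = 0; i < iters; i++) {
        if (rank == 0) {
            MPI_Send(&dummy, 0, MPI_BYTE, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(&dummy, 0, MPI_BYTE, 1, 0, MPI_COMM_WORLD, &status);
        } else if (rank == 1) {
            MPI_Recv(&dummy, 0, MPI_BYTE, 0, 0, MPI_COMM_WORLD, &status);
            MPI_Send(&dummy, 0, MPI_BYTE, 0, 0, MPI_COMM_WORLD);
        }
    }
    t1 = MPI_Wtime();

    if (rank == 0)   /* one-way latency = half the round-trip time */
        printf("0 byte latency: %.1f us\n",
               (t1 - t0) / iters / 2.0 * 1e6);
    MPI_Finalize();
    return 0;
}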

Benchmark applications
Validating our implementations on the NAS BT benchmark, classes A and B, shows performance comparable to the P4 reference implementation.
As expected, MPICH-CL reaches better fault free performance than MPICH-V2.

Checkpoint server performance

[Plot (omitted): checkpoint time (seconds) vs. size of process (MB)]
Time to checkpoint a process according to its size.
Checkpoint time increases linearly with checkpoint size.
Memory swap overhead appears at 512 MB (fork).

[Plot (omitted): time to checkpoint all processes (seconds) vs. number of processes checkpointing simultaneously]
Time to checkpoint all processes concurrently on a single checkpoint server.
A 2nd process does not increase checkpoint time, filling unused bandwidth.
More processes increase checkpoint time linearly.

BT Checkpoint and Restart Performance

Considering the same dataset, the per process image size decreases when the number of processes increases
As a consequence, the time to checkpoint remains constant with an increasing number of processes
Performing a complete asynchronous checkpoint takes as much time as a coordinated checkpoint
The time to restart after a fault decreases with the number of nodes for V2 and does not change for CL

Fault impact on performance

NAS benchmark BT B, 25 nodes (32 MB per process image size)
Average time to perform a checkpoint:
MPICH-CL: 68s
MPICH-V2: 73.9s
Average time to recover from a failure:
MPICH-CL: 65.8s
MPICH-V2: 5.3s

If we consider a 1 GB memory occupation for every process, an extrapolation gives a 2000s checkpoint time for 25 nodes in MPICH-CL. The minimum fault interval ensuring progression of the computation becomes about 1h.
MPICH-V2 can tolerate a high fault rate; MPICH-CL cannot ensure termination of the execution for a high fault rate.
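
A rough sanity check of the 2000s extrapolation, under the assumption that the single 100 Mb/s link to the checkpoint server (about 12.5 MB/s) is the bottleneck: 25 processes x 1 GB = 25 GB of images, and 25 GB / 12.5 MB/s is roughly 2000 s.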

Outline

Introduction

Coordinated checkpoint vs Message log

Comparison framework

Performance

Conclusions and future work

Conclusion

MPICH-CL and MPICH-V2 are two comparable implementations of fault tolerant MPI based on MPICH-1.2.5, one using coordinated checkpoint, the other pessimistic message log
We have compared the overhead of these two techniques according to fault frequency
The recovery overhead is the main factor differentiating performance
We have found a crossover point from which message log becomes better than coordinated checkpoint. On our test application this crossover point appears near 1 fault per 10 minutes. With a 1 GB application, coordinated checkpoint does not ensure progress of the computation for one fault every hour.

Perspectives

Larger scale experiments
Use more nodes and applications with realistic amounts of memory
High performance network experiments (Myrinet, Infiniband)

Comparison with causal log => MPICH-V2C vs augmented MPICH-CL
MPICH-V2C is a causal log implementation, removing the high latency impact induced by pessimistic log
MPICH-CL is being modified to restart non failed nodes from local checkpoints, removing the high restart overhead

MPICH-V2C

MPICH-V2 suffers from high latency (pessimistic protocol)
MPICH-V2C corrects this drawback at the expense of an increase in average message size (causal log protocol)

Pessimistic (V2): P could have sent s1, but has to wait for the event logger's ack of the log of the preceding receptions (r1, r2); s1 is delayed.
Causal (V2C): P sends s1 before the acknowledgement from the EL; in that case the causality information is piggybacked on the message. Once the ACK for r1 and r2 has arrived, P can stop piggybacking (s2 carries nothing to piggyback).
[Timeline diagrams (omitted): process P, event logger EL, receptions r1 and r2, sends s1 and s2]
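
A minimal sketch of this causal rule (also the rule outlined on the Message Log 2/2 slide): events are shipped to the EL asynchronously, and any message sent before the ACK carries the not-yet-acknowledged events. The event logger transport and helper names are hypothetical, not the MPICH-V2C code:

/* Sketch of the causal logging rule (illustration only). */
#define MAX_PENDING 64

struct event { int src, clock; };             /* a reception determinant   */

static struct event pending[MAX_PENDING];     /* logged but not yet ACKed  */
static int n_pending = 0;

extern void send_to_event_logger(const struct event *ev);    /* async     */
extern void transmit(int dest, const void *payload, int len,
                     const struct event *piggyback, int n_piggyback);

/* On every reception: record the event locally and ship it to the EL,
 * but do NOT wait for the acknowledgement. */
void on_receive(int src, int clock)
{
    struct event ev = { src, clock };
    if (n_pending < MAX_PENDING)
        pending[n_pending++] = ev;
    send_to_event_logger(&ev);
}

/* On every send: piggyback whatever the EL has not acknowledged yet, so
 * the causality information the message depends on always travels with it. */
void on_send(int dest, const void *payload, int len)
{
    transmit(dest, payload, len, pending, n_pending);
}

/* When the EL acknowledges events, they can be dropped from the piggyback
 * (this is why s2 in the timeline carries nothing to piggyback). */
void on_ack(int acked_up_to_clock)
{
    int kept = 0;
    for (int i = 0; i < n_pending; i++)
        if (pending[i].clock > acked_up_to_clock)
            pending[kept++] = pending[i];
    n_pending = kept;
}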
