Professional Documents
Culture Documents
Towards A Generic Backup and Recovery Infrastructure For The German D-Grid Initiative
Towards A Generic Backup and Recovery Infrastructure For The German D-Grid Initiative
Towards A Generic Backup and Recovery Infrastructure For The German D-Grid Initiative
1 Introduction
– minimal invasiveness: The backup and recovery service should not in-
fluence already existing operational procedures at the sites of the service
providers.
– usability: Scientists should be empowered to easily back up and recover
their experimental results without intervention of an administrator or calling
a help desk, which is often necessary if a commcercial backup and recovery
solution is used.
– simple software rollout: The software required at the sites of the users
and providers should be easy to install. More precisely, a new version of
the backup and recovery client at the user’s site should be automatically
retrieved without user intervention.
– sustainability: The developed backup and recovery service should be usable
in the long term, i.e. beyond the project duration, and will be offered by the
DFN and the service providers.
– replaceable backup and recovery backend: The dependence on a spe-
cific commercial backup and recovery backend should be avoided, so that in
the future the underlying backup software can be changed.
– independence of the operating system: It must be possible to use the
backup and recovery service from different operating systems. Otherwise,
client for many different operating systems have to be maintained.
– support of multiple interfaces: The backup and recovery service should
be available from inside of the Grid as well as from outside of the Grid.
From the perspective of the user, backup and recovery can be realized in a
pull- or push-based manner. A pull-based backup and recovery approach uses a
predefined schedule to decide when to start a backup/recovery. For example, a
backup can be run daily at 01:00 AM since utilization is normally low at night. A
push-based backup and recovery is initiated by the user on-demand. For example,
a scientist can back up the results of an experiment after the experiment has
finished.
Since a push-based approach can be realized without further installation of
software on the client machines and without bothering an administrator for
installation and updates of the software, the push-based approach was favoured
by the project partners.
2.3 User- vs. Node-based Backup and Recovery
Many commercial backup solutions only support node-based backup and recovery
due to the fact that the file system may differ significantly from node to node.
The node-based backup and recovery strategy has a main disadvantage: a user
who moves between several nodes can only recover data from the current node
and not his/her entire data. To solve this problem, backup and recovery have to
be organized in a user-based manner, which enables cross-node recovery of the
entire user data.
In Figure 1(a), a node-base backup and recovery strategy is shown. User
1 and user 2 cannot recover their data after moving from node 1 to node 2
and from node 2 to node 3, respectively. Supporting a user-based backup and
recovery strategy enables recovery after moving from one node to another, as
shown in Figure 1(b).
Within F&L-Grid, a user-based backup and recovery strategy is realized.
Fig. 2. Example for different backup strategies (dotted squares are files modified since
the last backup, colored squares are backed up files in the current backup iteration).
File permisssions for single users or complete user groups are handled differently
depending on the file systems (e.g. NTFS, Reiser, ZFS) and operating systems
(e.g. Microsoft Windows, Linux, MacOS) used. Consequently, a uniform repre-
sentation of file permissions is challenging.
F&L-Grid uses an intersection of information supported by widespread file
systems (e.g. file/directory name, creation date, last-modified date, file size, . . . )
to describe file permissions. All relevant file permissions are stored in the fat
client database (see Section 2.6).
2.7 Backup
A backup job consists of six consecutive steps, as shown in Figure 4.
3 Implementation Issues
This section gives an overview of the technologies selected for the implementation
of the backup and recovery service, explains how the backup and recovery service
can be accessed, and describes the processing of a backup job in detail.
A user of the backup and recovery service requires the job submission client to
access the service, i.e. to submit a backup or recovery job. For this purpose, the
user browses a website that hosts the job submission client as a Java Web Start
application. Consequently, the user only needs to retrieve the job submission
client manually once. Subsequently, new versions of the job submission client
are retrieved automatically by the Java Web Start technology.
To retrieve the software, the user must authenticate himself/herself against
his/her home organization using Shibboleth [11] (authentication phase), i.e. the
user provides, for example, username and password to authenticate against the
Shibboleth Identity Provider of his/her home organization. Shibboleth is used
since it integrates well into the existing service portfolio of the DFN, which is
also based on Shibboleth.
A backup job is defined by selecting all relevant files/directories using the job
submission client (scan phase). The backup job indicates the selected directo-
ries and files by specifying their filenames, the node name, the user name, the
last-modified timestamp, and the file permissions. This job object can then be
sent to the backup and recovery service inside a Grid service invocation or, for
example, as a serialized Java object, depending on the service interface. The
service removes all files from the list of filenames that have not been changed
since the last backup (intersection phase). This can be done by querying the fat
client database using JDBC. The modified list is returned to the job submission
client that transfers the files to the fat client via the secure copy protocol (SCP)
after initiating a secure shell (SSH) session (copy phase). In our prototype, we
use the Java Secure Channel (JSch) [12] library (that is also used by the very
popular Eclipse IDE) for SSH and SCP implementation. When the data has
been completely transferred, we start a TSM command-line client located on
the fat client to back up the data to a TSM server (backup phase). If the backup
has been successfully completed, the job information is logged in the fat client
database and the data is removed from the fat client cache (cleanup phase). The
file permissions are logged for each file as a plain string. Currently, UNIX file
permissions are supported as well as Windows’ basic NTFS permissions using
the cacls (change access control lists) tool of Windows.
The type hierarchy of backup and recovery jobs is shown in Figure 5. The
common characteristics of a BackupJob and a RecoveryJob are encapsulated
in an interface Job. Both specialized jobs use a FileDescription to define the
relevant files and directories. The JobSerializer is used to generate a Grid
service invocation or a serialized Java object out of the backup and recovery
jobs.
Fig. 5. Type hierarchy of a backup and recovery job.
4 Related Work
To the best of our knowledge, there are no generic backup and recovery solutions
focusing on usability for Grid environments up to now. Consequently, work re-
lated to F&L-Grid reduces to standard technologies for data movement within
Grid environments: GridFTP, RFT, RLS, and OGSA-DAI.
Current service-oriented Grid middleware innately offers data handling func-
tionality. Based on this data handling components, research projects often build
proprietary backup and recovery solutions. Consider, for example, the Globus
Toolkit 4.x [3] that contains components for data movement (GridFTP and Reli-
able File Transfer (RFT)) and data replication (Replica Location Service (RLS)).
GridFTP stems from the widely-used File Transfer Protocol (FTP) and supports
the efficient, secure, and robust transfer of bulk data within Grid environments.
The Globus Toolkit contains a GridFTP server (globus-gridftp-server) and
a GridFTP command-line client (globus-url-copy) to host and retrieve data.
RFT is a web service based interface to GridFTP which additionally permits
scheduling of entire data movement jobs. The replication of data within Grid
environments promises increased efficiency. To manage all replicas, the Globus
Toolkit offers RLS, which is a simple registry for available replicas on different
sites. Using middleware-specific data handling functionality results in two main
problems:
1. the backup and recovery solutions can only hardly be reused,
2. in-depth knowledge of the middleware is required to use the backup and
recovery solution which often overburdens simple users.
OGSA-DAI [13] provides services to access data from different sources like
databases or flat files within the Grid. It also supports methods to query, trans-
form, and deliver data in different ways. OGSA-DAI can be used as a simple
backup service mostly for data replication but not for real backups leveraging
reliable backup and recovery software.
5 Conclusions
In this paper, we have presented a generic infrastructure for backup and recovery
in Grid environments. As part of the German Grid Initiative (D-Grid), the F&L-
Grid project determined the requirements for a backup and recovery service from
which an architecture for backup and recovery was derived. The infrastructure is
easy to use and applicable within arbitrary service-oriented Grid environments
and even for users outside the Grid. Scientists are empowered to easily back up
and recover their experimental data without intervention of an administrator. As
its backend, an arbitrary commercial backup and recovery solution can be used,
which ensures reliability of the backups. The details of the particular backend
used are hidden from the scientist.
In future work, we will support further file permissions including different
inheritance schemes. Furthermore, an accounting and billing component will be
developed that uses information from the fat client database to automatically
bill submitted backup and recovery jobs.
Acknowledgements
This work is financially supported by the German Federal Ministry of Education
and Research (BMBF) (D-Grid Initiative, F&L-Grid Project).
References
[1] I. Foster, C. Kesselman, and S. Tuecke. The Anatomy of the Grid: Enabling
Scalable Virtual Organizations. In International Journal of High Performance
Computing Applications, volume 15, pp. 200–222, 2001.
[2] OASIS: Web Services Resource Framework (WSRF).
http://www.oasis-open.org/committees/tc_home.php?wg_abbrev=wsrf.
[3] D. Petcu. A Comprehensive Development Guide for the Globus Toolkit. In
Distributed Systems Online, volume 9, pp. 4–6, 2008.
[4] M. Romberg. The UNICORE Architecture: Seamless Access to Distributed
Resources. In Proceedings of the 8th IEEE International Symposium on
High Performance Distributed Computing (HPDC), IEEE Computer Society
Press, pp. 287-293 1999.
[5] R. Kettimuthu. GridFTP moves your Data and your News. In International
Science Grid This Week. July 2008,
http://www.isgtw.org/?pid=1001209.
[6] R.K. Madduri, C.S. Hood, and W.E. Allcock. Reliable File Transfer in
Grid Environments. In Proceedings of the 27th Annual IEEE Conference
on Local Computer Networks (LCN), IEEE Computer Society Press, pp.
737–738, 2002
[7] H. Neuroth, M. Kerzel, W. Gentzsch. German Grid Initiative (D-Grid).
Niedersächsische Staats- und Universitätsbibliothek, ISBN 3938616997, 2007.
[8] German Federal Ministry of Education and Research (BMBF).
http://www.bmbf.de/.
[9] German Research Network (DFN). http://www.dfn.de/.
[10] W3C: Web Services Description Language (WSDL) 2.0, June 2006.
http://www.w3.org/TR/wsdl20/.
[11] R.O. Sinnott, J. Jiang, J. Watt, and O. Ajayi. Shibboleth-based Access to and
Usage of Grid Resources. In Proceedings of the 7th IEEE/ACM International
Conference on Grid Computing, IEEE Computer Society Press, 2006, pp.
136–143
[12] JSch – Java Secure Channel. http://www.jcraft.com/jsch/.
[13] M. Antonioletti, N.P. Chue Hong, A.C. Hume, M. Jackson, K. Karasav-
vas, A. Krause, J.M. Schopf, M.P. Atkinson, B. Dobrzelecki, M. Illingworth,
N. McDonnell, M. Parsons, and E. Theocharopoulos. OGSA-DAI 3.0 The
Whats and the Whys. In Proceedings of the UK e-Science All Hands Meeting
2007.
[14] Zentren für Kommunikation und Informationsverarbeitung in Lehre und
Forschung e.V. (ZKI), Arbeitskreis Netzdienste.
http://www.zki.de/ak_nd/.