Principles of Transaction-Oriented Database Recovery
THEO HAERDER
Fachbereich Informatik, University of Kaiserslautern, West Germany
ANDREAS REUTER
IBM Research Laboratory, San Jose, California 95193
Permission to copy without fee all or part of this material is granted provided that the copies are not made or
distributed for direct commercial advantage, the ACM copyright notice and the title of the publication and its
date appear, and notice is given that copying is by permission of the Association for Computing Machinery. To
copy otherwise, or to republish, requires a fee and/or specific permission.
© 1983 ACM 0360-0300/83/1200-0287 $00.75
These situations and dependencies have been investigated thoroughly by Bjork and Davies in their studies of the so-called "spheres of control" [Bjork 1973; Davies 1973, 1978]. They indicate that data being operated by a process must be isolated in some way that lets others know the degree of reliability provided for these data, that is,

• Will the data be changed without notification to others?
• Will others be informed about changes?
• Will the value definitely not change any more?

This ambitious concept was restricted to use in database systems by Eswaran et al. [1976] and given its current name, the "transaction." The transaction basically reflects the idea that the activities of a particular user are isolated from all concurrent activities, but restricts the degree of isolation and the length of a transaction. Typically, a transaction is a short sequence of interactions with the database, using operators such as FIND a record or MODIFY an item, which represents one meaningful activity in the user's environment. The standard example that is generally used to explain the idea is the transfer of money from one account to another. The corresponding transaction program is given in Figure 1.

The concept of a transaction, which includes all database interactions between $BEGIN_TRANSACTION and $COMMIT_TRANSACTION in the above example, requires that all of its actions be executed indivisibly: Either all actions are properly reflected in the database or nothing has happened. No changes are reflected in the database if at any point in time before reaching the $COMMIT_TRANSACTION the user enters the ERROR clause containing the $RESTORE_TRANSACTION. To achieve this kind of indivisibility, a transaction must have four properties:

Atomicity. It must be of the all-or-nothing type described above, and the user must, whatever happens, know which state he or she is in.

Consistency. A transaction reaching its normal end (EOT, end of transaction), thereby committing its results, preserves the consistency of the database. In other words, each successful transaction by definition commits only legal results. This con-
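Figure 1 itself is not reproduced in this excerpt. The following is a minimal sketch, in Python rather than the paper's notation, of what such a funds-transfer transaction program could look like; the transaction brackets are modeled by ordinary control flow, and all names (accounts, transfer, Restore) are invented for illustration only.

    # Illustrative sketch: a toy account database with brackets resembling
    # $BEGIN_TRANSACTION / $COMMIT_TRANSACTION / $RESTORE_TRANSACTION.
    accounts = {"A": 1000, "B": 250}          # the "database"

    class Restore(Exception):
        """Raised to abandon the transaction (the ERROR clause of Figure 1)."""

    def transfer(source, target, amount):
        # $BEGIN_TRANSACTION: remember the old state so it can be restored.
        before = dict(accounts)
        try:
            accounts[source] -= amount
            if accounts[source] < 0:
                raise Restore()               # enter the ERROR clause
            accounts[target] += amount
            # $COMMIT_TRANSACTION: reaching this point makes all changes final.
        except Restore:
            # $RESTORE_TRANSACTION: undo everything; nothing has happened.
            accounts.clear()
            accounts.update(before)

    transfer("A", "B", 300)      # succeeds; both changes survive together
    transfer("B", "A", 10_000)   # violates the rule; the database is unchanged

Either both the debit and the credit survive or neither does, which is exactly the all-or-nothing behavior the text calls atomicity.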
interface. At the bottom, the database consists of some billions of bits stored on disk, which are interpreted by the DBMS into meaningful information on which the user can operate. With each level of abstraction (proceeding from the bottom up), the objects become more complex, allowing more powerful operations and being constrained by a larger number of integrity rules. The uppermost interface supports one of the well-known data models, whether relational, networklike, or hierarchical.

Note that this mapping hierarchy is virtually contained in each DBMS, although for performance reasons it will hardly be reflected in the module structure. We shall briefly sketch the characteristics of each layer, with enough detail to establish our taxonomy. For a more complete description see Haerder and Reuter [1983].

File Management. The lowest layer operates directly on the bit patterns stored on some nonvolatile, direct access device like a disk, drum, or even magnetic bubble memory. This layer copes with the physical characteristics of each storage type and abstracts these characteristics into fixed-length blocks. These blocks can be read, written, and identified by a (relative) block number. This kind of abstraction is usually done by the data management system (DMS) of a normal general-purpose operating system.

Propagation² Control. This level is not usually considered separately in the current database literature, but for reasons that will become clear in the following sections we strictly distinguish between pages and blocks. A page is a fixed-length partition of a linear address space and is mapped into a physical block by the propagation control layer. Therefore a page can be stored in different blocks during its lifetime in the database, depending on the strategy implemented for propagation control.

² This term is introduced in Section 2.4; its meaning is not essential to the understanding of this paragraph.

Access Path Management. This layer implements mapping functions much more complicated than those performed by subordinate layers. It has to maintain all physical object representations in the database (records, fields, etc.), and their related access paths (pointers, hash tables, search trees, etc.) in a potentially unlimited linear virtual address space. This address space, which is divided into fixed-length pages, is provided by the upper interface of the supporting layer. For performance reasons, the partitioning of data into pages is still visible on this level.

Navigational Access Layer. At the top of this layer we find the operations and objects that are typical for a procedural data manipulation language (DML). Occurrences of record types and members of sets are handled by statements like STORE, MODIFY, FIND NEXT, and CONNECT [CODASYL 1978]. At this interface, the user navigates one record at a time through a hierarchy, through a network, or along logical access paths.

Nonprocedural Access Layer. This level provides a nonprocedural interface to the database. With each operation the user can handle sets of results rather than single records. A relational model with high-level query languages like SQL or QUEL is a convenient example of the abstraction achieved by the top layer [Chamberlin 1980; Stonebraker et al. 1976].

On each level, the mapping of higher objects to more elementary ones requires additional data structures, some of which are shown in Table 1.

2.2 The Storage Hierarchy: Implementational Environment

Both the number of redundant data required to support the recovery actions described in Section 1 and the methods of collecting such data are strongly influenced by various properties of the different storage media used by the DBMS. In particular, the dependencies between volatile and permanent storage have a strong impact on algorithms for gathering redundant information and implementing recovery measures [Chen 1978]. As a descriptional framework we shall use a storage hierarchy, as shown in Figure 4. It closely resembles the situation that must be dealt with by most of today's commercial database systems.

[Figure 4. The storage hierarchy: the host computer holds the DBMS code, the log buffer, and the database buffer; the temporary log supports transaction UNDO, global UNDO, and partial REDO; the archive log supports global REDO; below are the physical copy and the archive copy of the database.]

The host computer, where the application programs and DBMS are located, has a main memory, which is usually volatile.³ Hence we assume that the contents of the database buffer, as well as the contents of the output buffers to the log files, are lost whenever the DBMS terminates abnormally. Below the volatile main memory there is a two-level hierarchy of permanent copies of the database. One level contains an on-line version of the database in direct access memory; the other contains an archive copy as a provision against loss of the on-line copy. While both are functionally situated on the same level, the on-line copy is almost always up-to-date, whereas the archive copy can contain an old state of the database. Our main concern here is database recovery, which, like all provisions for fault tolerance, is based upon redundancy.

³ In some real-time applications main memory is supported by a battery backup. It is possible that in the future mainframes will have some stable buffer storage. However, we are not considering these conditions here.

We have mentioned one type of redundancy: the archive copy, kept as a starting point for reconstruction of an up-to-date on-line version of the database (global REDO). This is discussed in more detail in Section 4. To support this, and other recovery actions introduced in Section 1, two types of log files are required:

Temporary Log. The information collected in this file supports crash recovery; that is, it contains information needed to reconstruct the most recent database (DB) buffer. Selective transaction UNDO requires random access to the log records. Therefore we assume that the temporary log is located on disk.

Archive Log. This file supports global REDO after a media failure. It depends on the availability of the archive copy and must contain all changes committed to the database after the state reflected in the archive copy. Since the archive log is always processed in sequential order, we assume that the archive log is written on magnetic tape.

2.3 Different Views of a Database

In Section 2.1, we indicated that the database looks different at each level of abstraction, with each level using different objects and interfaces. But this is not what we mean by "different views of a database" in this section. We have observed that the process of abstraction really begins at Level 3, up to which there is only a more convenient representation of data in external storage. At this level, abstraction is dependent on which pages actually establish the linear address space, that is, which block is read when a certain page is referenced. In the event of a failure, there are different possibilities for retrieving the contents of a page. These possibilities are denoted by different views of the database:

The current database comprises all objects accessible to the DBMS during normal processing. The current contents of all pages can be found on disk, except for those pages that have been recently modified. Their new contents are found in the DB buffer. The mapping hierarchy is completely correct.

The materialized database is the state that the DBMS finds at restart after a crash without having applied any log information. There is no buffer. Hence some page modifications (even of successful transactions) may not be reflected in the on-line copy. It is also possible that a new state of a page has been written to disk, but the control structure that maps pages to blocks has not yet been updated. In this case, a reference to such a page will yield the old value. This view of the database is what the recovery system has to transform into the most recent logically consistent current database.

The physical database is composed of all blocks of the on-line copy containing page images, current or obsolete. Depending on the strategy used on Level 2, there may be different values for one page in the physical database, none of which are necessarily the current contents. This view is not normally used by recovery procedures, but a salvation program would try to exploit all information contained therein.

With these views of a database, we can distinguish three types of update operations, all of which explain the mapping function provided by the propagation control level. First, we have the modification of page contents caused by some higher level module. This operation takes place in the DB buffer and therefore affects only the current database. Second, there is the write operation, transferring a modified page to a block on disk. In general, this affects only the physical database. If the information about the block containing the new page value is stored in volatile memory, the new contents will not be accessible after a crash; that is, it is not yet part of the materialized database. The operation that makes a previously written page image part of the materialized database is called propagation. This operation writes the updated control structures for mapping pages to blocks in a safe, nonvolatile place, so that they are available after a crash.

If pages are always written to the same block (the so-called "update-in-place" operation, which is done in most commercial DBMS), writing implicitly is the equivalent
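As a rough illustration of the three update operations just distinguished, the sketch below models modification, writing, and propagation against a toy page table; the names (blocks, page_table, buffer, modify, write, propagate) are invented for this example and do not correspond to any particular DBMS.

    # Toy model: the materialized database is whatever the *persistent* page table
    # points to; writing a page to a fresh block does not change that view until
    # the page table itself is made durable ("propagation").
    blocks = {0: "old contents of P1"}   # block number -> page contents (nonvolatile)
    page_table = {"P1": 0}               # persistent mapping: page -> block (nonvolatile)
    buffer = {}                          # volatile DB buffer

    def modify(page, new_contents):
        buffer[page] = new_contents      # affects only the current database

    def write(page, block_no):
        blocks[block_no] = buffer[page]  # affects only the physical database

    def propagate(page, block_no):
        page_table[page] = block_no      # only now does the new state become
                                         # part of the materialized database

    modify("P1", "new contents of P1")
    write("P1", 1)                       # after a crash here, restart still reads block 0
    propagate("P1", 1)                   # after a crash here, restart sees the new state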
[Figure: panels (a) and (b). The system buffer holds modified pages A', C', and D'; the materialized database and the current database are shown below. In (b), page tables define the views, the current page table being V' = {A', B, C', D'}.]
mechanism [Lorie 1977]. The mapping of page numbers to block numbers is done by using page tables. These tables have one entry per page containing the block number where the page contents are stored. The shadow pages, accessed via the shadow page table V, preserve the old state of the materialized database. The current version is defined by the current page table V'. Before this state is made stable (propagated), all changed pages are written to their new blocks, and so is the current page table. If this fails, the database will come up in its old state. When all pages have been written related to the new state, ATOMIC propagation takes place by changing one record on disk (which now points to V' rather than V) in a way that cannot be confused by a system crash. Thus the problem of indivisibly propagating a set of pages has been reduced to safely updating one record, which can be done in a simple way. For details, see Lorie [1977].

There are other implementations for ATOMIC propagation. One is based on maintaining two recent versions of a page. For each page access, both versions have to be read into the buffer. This can be done with minimal overhead by storing them in adjacent disk blocks and reading them with chained I/O. The latest version, recognized by a time stamp, is kept in the buffer; the other one is immediately discarded. A modified page replaces the older version on disk. ATOMIC propagation is accomplished by incrementing a special counter that is related to the time stamps in the pages. Details can be found in Reuter [1980]. Another approach to ATOMIC propagation has been introduced under the name "dif-
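The following is a minimal sketch, under simplifying assumptions, of the shadow-page propagation described above: changed pages go to fresh blocks recorded in a current page table V', and ATOMIC propagation is the single switch of a master record from V to V'. The data layout and function names are invented for illustration and do not reproduce Lorie's or System R's actual structures.

    # Shadow-page propagation reduced to its essentials (everything "on disk"
    # is modeled by one dictionary).
    disk = {
        "blocks": {0: "A", 1: "B"},   # block number -> page contents
        "V":  {"PA": 0, "PB": 1},     # shadow page table: old, consistent state
        "V'": {"PA": 0, "PB": 1},     # current page table being built
        "master": "V",                # the one record that is switched atomically
    }

    def write_changed_page(page, new_contents):
        new_block = max(disk["blocks"]) + 1       # never overwrite a shadow block
        disk["blocks"][new_block] = new_contents
        disk["V'"][page] = new_block

    def propagate():
        disk["master"] = "V'"   # the single indivisible update; before it,
                                # a restart uses V and sees the old state

    write_changed_page("PA", "A'")
    write_changed_page("PB", "B'")
    propagate()                 # both changed pages become visible together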
[Figure 7. Log information for logical transition logging: transaction descriptor, record ID, old value (OLDVAL), new value (NEWVAL).]
justified by the comparatively few benefits yielded by logical transition logging on the access path level. Hence logical transition logging on this level can generally be ruled out, but will become more attractive on the next higher level.

Logical Logging on the Record-Oriented Level. At one level higher, it is possible to express the changes performed by the transaction program in a very compact manner by simply recording the update DML statements with their parameters. Even if a nonprocedural query language is being used above this level, its updates will be decomposed into updates of single records or tuples equivalent to the single-record updates of procedural DB languages. Thus logging on this level means that only the INSERT, UPDATE, and DELETE operations, together with their record ids and attribute values, are written to the log. The mapping process discerns which entries are affected, which pages must be modified, etc. Thus recovery is achieved by reexecuting some of the previously processed DML statements. For UNDO recovery, of course, the inverse DML statement must be executed, that is, a DELETE to compensate an INSERT and vice versa, and an UPDATE returned to the original values. These inverse DML statements must be generated automatically as part of the regular logging activity, and for this reason this approach is not viable for network-oriented DBMSs with information-bearing interrecord relations. In such cases, it can be extremely expensive to determine, for example, the inverse for a DELETE. Details can be found in Reuter [1981].

System R is a good example of a system with logical logging on the record-oriented level. All update operations performed on the tuples are represented by one generalized modification operator, which is not explicitly recorded. This operator changes a tuple identified by its tuple identifier (TID) from an old value to a new one, both of which are recorded. Inserting a tuple entails modifying its initial null value to the given value, and deleting a tuple entails the inverse transition. Hence the log contains the information shown in Figure 7.

Logical transition logging obviously requires a materialized database that is consistent up to Level 3; that is, it can only be combined with ATOMIC propagation schemes. Although the amount of log data written is very small, recovery will be more expensive than that in other schemes, because it involves the reprocessing of some DML statements, although this can be done more cheaply than the original processing.

Table 3 is a summation of the properties of all logging techniques that we have described under two considerations: What is the cost of collecting the log data during normal processing? and, How expensive is recovery based on the respective type of log information? Of course, the entries in the table are only very rough qualitative estimations; for more detailed quantitative analysis see Reuter [1982].

Writing log information, no matter what type, is determined by two rules:

• UNDO information must be written to the log file before the corresponding updates are propagated to the materialized database. This has come to be known as the "write ahead log" (WAL) principle [Gray 1978].
• REDO information must be written to the temporary and the archive log file before EOT is acknowledged to the transaction program. Once this is done, the system must be able to ensure the transaction's durability.

We return to different facets of these rules in Section 3.4.
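The two rules can be paraphrased in code. The sketch below (invented function and variable names, not a real DBMS interface) shows only the ordering they impose: UNDO information is forced to the log before the corresponding update may be propagated, and REDO information reaches both log files before EOT is acknowledged.

    temporary_log, archive_log = [], []

    def force_log(log, record):
        log.append(record)      # stands for a synchronous write to stable storage

    def propagate_page(page_id, undo_record):
        # Rule 1 (WAL): UNDO information first, then the materialized database.
        force_log(temporary_log, ("UNDO", page_id, undo_record))
        # ... only now may the page be propagated to the materialized database ...

    def commit(transaction_id, redo_records):
        # Rule 2: REDO information on the temporary *and* the archive log before
        # EOT is acknowledged; afterwards durability can be guaranteed.
        for r in redo_records:
            force_log(temporary_log, ("REDO", transaction_id, r))
            force_log(archive_log, ("REDO", transaction_id, r))
        return "EOT acknowledged to " + transaction_id

    propagate_page("P1", "old value of P1")
    print(commit("T1", ["new value of P1"]))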
[Table 3. Logging techniques compared by the cost of collecting the log data during normal processing and by the cost of recovery based on the respective type of log information.]
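To make the comparison concrete, here is a hedged sketch of what entries of the two main logging families could look like; the field names are invented for illustration and are not taken from Table 3 or from any particular system's log format.

    # Physical (state) logging: before- and after-images of a page.
    physical_entry = {
        "type": "page",
        "page_no": 4711,
        "before_image": b"...old page contents...",
        "after_image": b"...new page contents...",
    }

    # Logical transition logging on the record-oriented level: the DML operation
    # plus its parameters; UNDO re-executes the inverse operation.
    logical_entry = {
        "type": "dml",
        "operation": "UPDATE",          # INSERT / UPDATE / DELETE
        "record_id": ("EMPLOYEE", 42),
        "old_value": {"salary": 3000},
        "new_value": {"salary": 3300},
    }

    inverse = {"INSERT": "DELETE", "DELETE": "INSERT", "UPDATE": "UPDATE"}
    undo_operation = inverse[logical_entry["operation"]]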
[Figure: pages P1, P2, ..., Pk are modified repeatedly over time; the marks indicate which pages are residing in the buffer at the moment of the crash.]
some examples in this section. Checkpoint generation involves three steps [Gray 1978]:

• Write a BEGIN_CHECKPOINT record to the temporary log file.
• Write all checkpoint data to the log file and/or the database.
• Write an END_CHECKPOINT record to the temporary log file.

During restart, the BEGIN-END bracket is a clear indication as to whether a checkpoint was generated completely or interrupted by a system crash. Sometimes checkpointing is considered to be a means for restoring the whole database to some previous state. Our view, however, focuses on transaction recovery. Therefore to us a checkpoint is a technique for optimizing crash recovery rather than a definition of a distinguished state for recovery itself. In order to effectively constrain partial REDO, checkpoints must be generated at well-defined points in time. In the following sections, we shall introduce four separate criteria for determining when to start checkpoint activities.

[Figure 10. Scenario for transaction-oriented checkpoints: transactions T1, T2, T3 over time, with checkpoints c(T1) and c(T2) generated at their respective EOTs, followed by a system crash.]

3.3.2 Transaction-Oriented Checkpoints

As previously explained, a FORCE discipline will avoid partial REDO. All modified pages are propagated before an EOT record is written to the log, which makes the transaction durable. If this record is not found in the log after a crash, the transaction will be considered incomplete and its effects will be undone. Hence the EOT record of each transaction can be interpreted as a BEGIN_CHECKPOINT and END_CHECKPOINT, since it agrees with our definition of a checkpoint in that it limits the scope of REDO. Figure 10 illustrates transaction-oriented checkpoints (TOC).

As can be seen in Figure 10, transaction-oriented checkpoints are implied by a FORCE discipline. The major drawback to this approach can be deduced from Figure 9. Hot spot pages like Pi will be propagated each time they are modified by a transaction even though they remain in the buffer for a long time. The reduction of recovery expenses with the use of transaction-oriented checkpoints is accomplished by imposing some overhead on normal processing. This is discussed in more detail in Section 3.5. The cost factor of unnecessary write operations performed by a FORCE discipline is highly relevant for very large
database buffers. The longer a page remains in the buffer, the higher is the probability of multiple updates to the same page by different transactions. Thus for DBMSs supporting large applications, transaction-oriented checkpointing is not the proper choice.

3.3.3 Transaction-Consistent Checkpoints

The following transaction-consistent checkpoints (TCC) are global in that they save the work of all transactions that have modified the database. The first TCC, when successfully generated, creates a transaction-consistent database. It requires that all update activities on the database be quiescent. In other words, when the checkpoint generation is signaled by the recovery component, all incomplete update transactions are completed and new ones are not admitted. The checkpoint is actually generated when the last update is completed. After the END_CHECKPOINT record has been successfully written, normal operation is resumed. This is illustrated in Figure 11.

[Figure 11. Scenario for transaction-consistent checkpoints: between the checkpoint signal and the generation of checkpoint ci, running update transactions are completed while new transactions are delayed; transactions T1-T4 are shown relative to the previous checkpoint ci-1 and a later system crash.]

Checkpointing connotes propagating all modified buffer pages and writing a record to the log, which notifies the materialized database of a new transaction-consistent state, hence the name "transaction-consistent checkpoint" (TCC). By propagating all modified pages to the database, TCC establishes a point past which partial REDO will not operate. Since all modifications prior to the recent checkpoint are reflected in the database, REDO-log information need only be processed back to the youngest END_CHECKPOINT record found on the log. We shall see later on that the time between two subsequent checkpoints can be adjusted to minimize overall recovery costs.

In Figure 11, T3 must be redone completely, whereas T4 must be rolled back. There is nothing to be done about T1 and T2, since their updates have been propagated by generating ci. Favorable as that may sound, the TCC approach is quite unrealistic for large multiuser DBMSs, with the exception of one special case, which is discussed in Section 3.4. There are two reasons for this:

• Putting the system into a quiescent state until no update transaction is active may cause an intolerable delay for incoming transactions.
• Checkpoint costs will be high in the case of large buffers, where many changed pages will have accumulated. With a buffer of 6 megabytes and a substantial number of updates, propagating the modified pages will take about 10 seconds.

For small applications and single-user systems, TCC certainly is useful.

3.3.4 Action-Consistent Checkpoints

Each transaction is considered a sequence of elementary actions affecting the database. On the record-oriented level, these actions can be seen as DML statements. Action-consistent checkpoints (ACC) can be generated when no update action is being processed. Therefore signaling an ACC means putting the system into quiescence on the action level, which impedes operation here much less than on the transaction level. A scenario is shown in Figure 12. The checkpoint itself is generated in the very same way as was described for the
TCC technique. In the case of ACC, however, the END_CHECKPOINT record indicates an action-consistent⁴ rather than a transaction-consistent database. Obviously such a checkpoint imposes a limit on partial REDO. In contrast to TCC, it does not establish a boundary to global UNDO; however, it is not required by definition to do so. Recovery in the above scenario means global UNDO for T1, T2, and T3. REDO has to be performed for the last action of T5 and for all of T6. The changes of T4 and T7 are part of the materialized database because of checkpointing. So again, REDO-log information prior to the recent checkpoint is irrelevant for crash recovery. This scheme is much more realistic, since it does not cause long delays for incoming transactions. Costs of checkpointing, however, are still high when large buffers are used.

⁴ This means that the materialized database reflects a state produced by complete actions only; that is, it is consistent up to Level 3 at the moment of checkpointing.

[Figure 12. Scenario for action-consistent checkpoints: transactions T1-T7 consist of elementary actions; between the checkpoint signal and the generation of checkpoint ci only the running actions are completed (a short processing delay for actions), and a system crash occurs later.]

3.3.5 Fuzzy Checkpoints

In order to further reduce checkpoint costs, propagation activity at checkpoint time has to be avoided whenever possible. One way to do this is indirect checkpointing. Indirect checkpointing means that information about the buffer occupation is written to the log file rather than the pages themselves. This can be done with two or three write operations, even with very large buffers, and helps to determine which pages containing committed data were actually in the buffer at the moment of a crash. However, if there are hot spot pages, their REDO information will have to be traced back very far on the temporary log. So, although indirect checkpointing does reduce the costs of partial REDO, this does not in general make partial REDO independent of mean time between failure. Note also that this method is only applicable with ¬ATOMIC propagation. In the case of ATOMIC schemes, propagation always takes effect at one well-defined moment, which is a checkpoint; pages that have only been written (not propagated) are lost after a crash. Since this checkpointing method is concerned only with the temporary log, leaving the database as it is, we call it "fuzzy." A description of a particular implementation of indirect, fuzzy checkpoints is given by Gray [1978].

The best of both worlds, low checkpoint costs with fixed limits to partial REDO, is achieved by another fuzzy scheme described by Lindsay et al. [1979]. This scheme combines ACC with indirect checkpointing: At checkpoint time the numbers of all pages (with an update indicator) currently in buffer are written to the log file. If there are no hot spot pages, nothing else
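A hedged sketch of the indirect ("fuzzy") checkpointing just described, with invented names: instead of propagating modified pages, only the buffer-occupation information (page numbers with an update indicator) is forced to the temporary log between the BEGIN_CHECKPOINT and END_CHECKPOINT records.

    temporary_log = []
    buffer = {4711: {"dirty": True}, 815: {"dirty": False}, 42: {"dirty": True}}

    def force(record):
        temporary_log.append(record)     # stands for a forced log write

    def fuzzy_checkpoint():
        force(("BEGIN_CHECKPOINT",))
        # Only buffer-occupation information is logged: two or three writes
        # even for a very large buffer, and no pages are propagated.
        occupation = [(page_no, frame["dirty"]) for page_no, frame in buffer.items()]
        force(("BUFFER_OCCUPATION", occupation))
        force(("END_CHECKPOINT",))

    fuzzy_checkpoint()
    # At restart, the youngest complete BEGIN/END bracket tells the recovery
    # component which pages may have held committed but unpropagated data.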
[Figure 13. Classification scheme for logging and recovery concepts: propagation strategy (ATOMIC / ¬ATOMIC), page replacement (STEAL / ¬STEAL), EOT processing (FORCE / ¬FORCE), and checkpoint scheme (TOC, TCC, ACC, fuzzy), with example systems at the leaves.]
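The classification tree can be read as a four-part key. The sketch below (illustrative names only) shows how a concrete system could be assigned to a node of the tree; the example configuration paraphrases the System R discussion in Section 3.4.3 and is not an official characterization of that system.

    # The four criteria of the taxonomy as simple enumerations.
    PROPAGATION = ("ATOMIC", "NOT_ATOMIC")
    PAGE_REPLACEMENT = ("STEAL", "NOT_STEAL")
    EOT_PROCESSING = ("FORCE", "NOT_FORCE")
    CHECKPOINT_SCHEME = ("TOC", "TCC", "ACC", "FUZZY")

    def classify(propagation, replacement, eot, checkpoint):
        assert propagation in PROPAGATION
        assert replacement in PAGE_REPLACEMENT
        assert eot in EOT_PROCESSING
        assert checkpoint in CHECKPOINT_SCHEME
        return (propagation, replacement, eot, checkpoint)

    # Example (compare Section 3.4.3): a System-R-like configuration.
    system_r_like = classify("ATOMIC", "STEAL", "NOT_FORCE", "ACC")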
[Figure 14. Transaction scenario for illustrating recovery techniques: transactions T1-T4, the checkpoint ci, and the system crash.]

forced at EOT. Note that in ¬ATOMIC schemes EOT processing is interruptable by a crash.

In the scenario given in Figure 15, we need only consider T1 and T2; the rest is irrelevant to the example. According to the scenario, A' has been replaced from the buffer, which triggered an UNDO entry to be written. Pages B' and C' remained in buffer as long as T2 was active. T2 reached its normal end before the crash, and so the following had to be done:

• Write UNDO information for B and C (in case the FORCE fails).
• Propagate B' and C'.
• Write REDO information for B' and C' to the archive log file.
• Discard the UNDO entries for B and C.
• Write an EOT record to the log files and acknowledge EOT to the user.

Of course, there are some obvious optimizations as regards the UNDO data for pages that have not been replaced before EOT, but these are not our concern here. After the crash, the recovery component finds the database and the log files as shown in the scenario. The materialized database is inconsistent owing to ¬ATOMIC propagation, and must be made consistent by applying all UNDO information in reverse chronological order.

3.4.2 Implementation Technique: ¬ATOMIC, ¬STEAL, ¬FORCE, TCC

Applications with high transaction rates require large DB buffers to yield satisfactory performance. With sufficient buffer space, a ¬STEAL approach becomes feasible; that is, the materialized database will never contain updates of incomplete transactions. ¬FORCE is desirable for efficient EOT processing, as discussed previously (Section 3.3.2). The IMS/Fast Path in its "main storage database" version is a system designed with this implementation technique [IMS N.d.; Date 1981]. The ¬STEAL and ¬FORCE principles are generalized to the extent that there are no write operations to the database during normal processing. All updates are recorded to the log, and propagation is delayed until shutdown (or some other very infrequent checkpoint), which makes the system belong to the TCC class. Figure 16 illustrates the implications of this approach.

With ¬STEAL, there is no UNDO information on the temporary log. Accordingly, there are only committed pages in the materialized database. Each successful transaction writes REDO information during EOT processing. Assuming that the crash occurs as indicated in Figure 14, the materialized database is in the initial state, and, compared with the former current database, is old. Everything that has been done since start-up must therefore be applied to the database by processing the entire temporary log in chronological order. This, of course, can be very expensive, and hence the entire environment should be as stable as possible to minimize crashes. The benefits of this approach are extremely high transaction rates and short response times, since physical I/O during normal processing is reduced to a minimum.

The database cache, mentioned in Section 3.3, also tries to exploit the desirable properties of ¬STEAL and ¬FORCE, but, in addition, attempts to provide very fast crash recovery. This is attempted by implementing a checkpointing scheme of the "fuzzy" type.

3.4.3 Implementation Technique: ATOMIC, STEAL, ¬FORCE, ACC

ATOMIC propagation is not yet widely used in commercial database systems. This may result from the fact that indirect page mapping is more complicated and more expensive than the update-in-place technique. However, there is a well-known
example of this type of implementation, based on the shadow-page mechanism in System R. This system uses action-consistent checkpointing for update propagation, and hence comes up with a consistent materialized database after a crash. More specifically, the materialized database will be consistent up to Level 4 of the mapping hierarchy and reflect the state of the most recent checkpoint; everything occurring after the most recent checkpoint will have disappeared. As discussed in Section 3.2, with an action-consistent database one can use logical transition logging based on DML statements, which System R does. Note that in the case of ATOMIC propagation the WAL principle is bound to the propagation, that is, to the checkpoints. In other words, modified pages can be written, but not propagated, without having written an UNDO log. If the modified pages pertain to incomplete transactions, the UNDO information must be on the temporary log before the pages are propagated. The same is true for STEAL: Not only can dirty pages be written; in the case of System R they can also be propagated. Consider the scenario in Figure 17.

[Figure: the state of the database and the log files before the system crash and after restart.]

T1 and T2 were both incomplete at checkpoint. Since their updates (A' and B') have been propagated, UNDO information must be written to the temporary log. In System R, this is done with logical transitions, as described in Section 3.2. EOT processing of T2 and T3 includes writing REDO information to the log, again using logical transitions. When the system crashes, the current database is in the state depicted in Figure 17; at restart the materialized database will reflect the most recent checkpoint state. Crash recovery involves the following actions:

• UNDO the modification of A'. Owing to the STEAL policy in System R, incomplete transactions can span several checkpoints. Global UNDO must be applied to all changes of failed transactions prior to the recent checkpoint.
• REDO the last action of T2 (modification of C') and the whole transaction T3 (modification of C"). Although they are committed, the corresponding page states are not yet reflected in the materialized database.
• Nothing has to be done with D' since this has not yet become part of the materialized database. The same is true of T4. Since it was not present when ci was generated, it has had no effect on the materialized database.

3.5 Evaluation of Logging and Recovery Concepts

Combining all possibilities of propagating, buffer handling, and checkpointing, and
considering the overall properties of each scheme that we have discussed, we can derive the evaluation given in Table 4. Table 4 can be seen as a compact summary of what we have discussed up to this point. Combinations leading to inherent contradictions have been suppressed (e.g., ¬STEAL does not allow for ACC). By referring the information in Table 4 to Figure 13, one can see how existing DBMSs are rated in this qualitative comparison.

[Figure 17. Database and log state before the system crash and after restart; the current page table V' maps to the pages a', b", c", d'.]

[Table 4. Evaluation of logging and recovery techniques based on the introduced taxonomy. Columns: combinations of EOT processing (FORCE / ¬FORCE) and checkpoint type (TOC, TCC, ACC, FUZZY). Rows: materialized DB state after system failure (DC = device consistent, i.e., chaotic; AC = action consistent; TC = transaction consistent), cost of transaction UNDO, cost of partial REDO at restart, cost of global UNDO at restart, overhead during normal processing, and frequency of checkpoints. Evaluation symbols: --, very low; -, low; +, high; ++, very high.]

Some criteria of our taxonomy divide the world of DB recovery into clearly distinct areas:
• ATOMIC propagation achieves an action- or transaction-consistent materialized database in the event of a crash. Physical as well as logical logging techniques are therefore applicable. The benefits of this property are offset by increased overhead during normal processing caused by the redundancy required for indirect page mapping. On the other hand, recovery can be cheap when ATOMIC propagation is combined with TOC schemes.
• ¬ATOMIC propagation generally results in a chaotic materialized database in the event of a crash, which makes physical logging mandatory. There is almost no overhead during normal processing, but without appropriate checkpoint schemes, recovery will be more expensive.
• All transaction-oriented and transaction-consistent schemes cause high checkpoint costs. This problem is emphasized in transaction-oriented schemes by a relatively high checkpoint frequency.

It is, in general, important when deciding which implementation techniques to choose for database recovery to carefully consider whether optimizations of crash recovery put additional burdens on normal processing. If this is the case, it will certainly not pay off, since crash recovery, it is hoped, will be a rare event. Recovery components should be designed with minimal overhead for normal processing, provided that there is a fixed limit to the costs of crash recovery.

This consideration rules out schemes of the ATOMIC, FORCE, TOC type, which can be implemented and look very appealing at first sight. According to the classification, the materialized database will always be in the most recent transaction-consistent state in implementations of these schemes. Incomplete transactions have not affected the materialized database, and successful transactions have propagated indivisibly during EOT processing. However appealing the schemes may be in terms of crash recovery, the overhead during normal processing is too high to justify their use [Haerder and Reuter 1979; Reuter 1980].

There are, of course, other factors influencing the performance of a logging and recovery component: The granule of logging (pages or entries), the frequency of checkpoints (which depends on the transaction load), etc. are important. Logging is also tied to concurrency control in that the granule of logging determines the granule of locking. If page logging is applied, the DBMS must not use smaller granules of locking than pages. However, a detailed discussion of these aspects is beyond the scope of this paper; detailed analyses can be found in Chandy et al. [1975] and Reuter [1982].

4. ARCHIVE RECOVERY

Throughout this paper we have focused on crash recovery, but in general there are two types of DB recovery, as is shown in Figure 18. The first path represents the standard crash recovery, depending on the physical (and the materialized) database as well as on the temporary log. If one of these is lost or corrupted because of hardware or software failure, the second path, archive recovery, must be tried. This presupposes that the components involved have independent failure modes, for example, if temporary and archive logs are kept on different devices. The global scenario for archive recovery is shown in Figure 19; it illustrates that the component "archive copy" actually depends on some dynamically modified subcomponents. These subcomponents create new archive copies and update existing ones. The following is a brief sketch of some problems associated with this.

[Figure 19. Scenario for archive recovery (global REDO): after a failure of the physical database or the temporary log, the archive copy together with the archive log is used to reconstruct a consistent database.]

Creating an archive copy, that is, copying the on-line version of the database, is a very expensive process. If the copy is to be consistent, update operations on the database have to be interrupted for a long time, which is unacceptable in many applications. Archive recovery is likely to be rare, and an archive copy should not be created too frequently, both because of cost and because there is a chance that it will never be used. On the other hand, if the archive copy is very old, recovery starting from such a copy will have to redo too much work and will take too long. There are two methods to cope with this. First, the database can be copied on the fly, that is, without interrupting processing, in parallel with normal
processing. This will create an inconsistent copy, a so-called "fuzzy dump."

The other possibility is to write only the changed pages to an incremental dump, since a new copy will be different from an old one only with respect to these pages. Either type of dump can be used to create a new, more up-to-date copy from the previous one. This is done by a separate off-line process with respect to the database and therefore does not affect DB operation. In the case of DB applications running 24 hours per day, this type of separate process is the only possible way to maintain archive recovery data. As shown in Figure 19, archive recovery in such an environment requires the most recent archive copy, the latest incremental modifications to it (if there are any), and the archive log. When recovering the database itself, there is little additional cost in creating an identical new archive copy in parallel.

There is still another problem hidden in this scenario: Since archive copies are needed very infrequently, they may be susceptible to magnetic decay. For this reason several generations of the archive copy are usually kept. If the most recent one does not work, its predecessor can be tried, and so on. This leads to the consequences illustrated in Figure 20.
[Figure 20. Consequences of multigeneration archive copies: archive copy generations n-2, n-1, and n; the archive log must cover the distance from the oldest generation kept up to the actual (transaction-consistent) state of the DB.]
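A hedged sketch of the incremental-dump idea discussed above, with invented names: only pages that changed since the previous dump are copied, and a separate off-line step merges the increment into an older archive copy to obtain a more recent generation.

    def incremental_dump(online_copy, last_dump):
        """Copy only the pages that differ from the previous dump."""
        return {page: contents
                for page, contents in online_copy.items()
                if last_dump.get(page) != contents}

    def merge_into_archive(old_archive_copy, increment):
        """Off-line process: build a newer archive copy without touching the DB."""
        new_copy = dict(old_archive_copy)
        new_copy.update(increment)
        return new_copy

    archive_gen_1 = {"P1": "a", "P2": "b"}
    online = {"P1": "a", "P2": "b'", "P3": "c"}          # P2 changed, P3 added
    delta = incremental_dump(online, archive_gen_1)       # {'P2': "b'", 'P3': 'c'}
    archive_gen_2 = merge_into_archive(archive_gen_1, delta)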
[Figure 21. Two possibilities for duplicating the archive log: (a) two archive log files plus the temporary log, which must be synchronized during phase 1 of EOT; (b) duplicated temporary logs, from which an independent process copies the data to duplicated archive logs.]
We must anticipate the case of starting archive recovery from the oldest generation, and hence the archive log must span the whole distance back to this point in time. That makes the log susceptible to magnetic decay, as well, but in this case generations will not help; rather we have to duplicate the entire archive log file. Without taking storage costs into account, this has severe impact on normal DB processing, as is shown in Figure 21.

Figure 21a shows the straightforward solution: two archive log files that are kept on different devices. If this scheme is to work, all three log files must be in the same state at any point in time. In other words, writing to these files must be synchronized at each EOT. This adds substantial costs to normal processing and particularly affects transaction response times. The solution in Figure 21b assumes that all log information is written only to the temporary log during normal processing. An independent process that runs asynchronously then copies the REDO data to the archive log. Hence archive recovery finds most of the log entries in the archive log, but the temporary log is required for the most recent information. In such an environment, temporary and archive logs are no longer independent from a recovery perspective, and so we must make the temporary log very reliable by duplicating it. The resulting scenario looks much more complicated than the first one, but in fact the only additional costs are those for temporary log storage, which are usually small. The advantage here is that only two files have to be synchronized during EOT, and moreover, as numerical analysis shows, this environment is more reliable than the first one by a factor of 2.

These arguments do not, of course, exhaust the problem of archive recovery. Applications demanding very high availability and fast recovery from a media failure will use additional measures such as duplexing the whole database and all the hardware (e.g., see TANDEM [N.d.]). This aspect of database recovery does not add anything conceptually to the recovery taxonomy established in this paper.

5. CONCLUSION

We have presented a taxonomy for classifying the implementation techniques for database recovery. It is based on four criteria:

Propagation. We have shown that update propagation should be carefully distinguished from the write operation. The ATOMIC/¬ATOMIC dichotomy defines two different methods of handling low-level updates of the database, and also gives rise to different views of the database, both the materialized and the physical database. This proves to be useful in defining different crash states of a database.

Buffer Handling. We have shown that interfering with buffer replacement can support UNDO recovery. The STEAL/¬STEAL criterion deals with this concept.

EOT Processing. By distinguishing FORCE policies from ¬FORCE policies we can distinguish whether successful transactions will have to be redone after a crash. It can also be shown that this criterion heavily influences the DBMS performance during normal operation.

Checkpointing. Checkpoints have been introduced as a means for limiting the costs of partial REDO during crash recovery. They can be classified with regard to the events triggering checkpoint generation and the amount of data written at a checkpoint. We have shown that each class has some particular performance characteristics.

Some existing DBMSs and implementation concepts have been classified and described according to the taxonomy. Since the criteria are relatively simple, each system can easily be assigned to the appropriate node of the classification tree. This classification is more than an ordering scheme for concepts: Once the parameters of a system are known, it is possible to draw important conclusions as to the behavior and performance of the recovery component.

ACKNOWLEDGMENTS

We would like to thank Jim Gray (TANDEM Computers, Inc.) for his detailed proposals concerning the structure and contents of this paper, and his enlightening discussions of logging and recovery. Thanks are also due to our colleagues Flaviu Cristian, Shel Finkelstein, C. Mohan, Kurt Shoens, and Irv Traiger (IBM Research Laboratory) for their encouraging comments and critical remarks.

REFERENCES

ASTRAHAN, M. M., BLASGEN, M. W., CHAMBERLIN, D. D., GRAY, J. N., KING, W. F., LINDSAY, B. G., LORIE, R., MEHL, J. W., PRICE, T. G., PUTZOLU, F., SELINGER, P. G., SCHKOLNICK, M., SLUTZ, D. R., TRAIGER, I. L., WADE, B. W., AND YOST, R. A. 1981. History and evaluation of System R. Commun. ACM 24, 10 (Oct.), 632-646.
BERNSTEIN, P. A., AND GOODMAN, N. 1981. Concurrency control in distributed database systems. ACM Comput. Surv. 13, 2 (June), 185-221.
BJORK, L. A. 1973. Recovery scenario for a DB/DC system. In Proceedings of the ACM 73 National Conference (Atlanta, Ga., Aug. 27-29). ACM, New York, pp. 142-146.
CHAMBERLIN, D. D. 1980. A summary of user experience with the SQL data sublanguage. In Proceedings of the International Conference on Databases (Aberdeen, Scotland, July), S. M. Deen and P. Hammersley, Eds. Heyden, London, pp. 181-203.
CHANDY, K. M., BROWN, J. C., DISSLEY, C. W., AND UHRIG, W. R. 1975. Analytic models for rollback and recovery strategies in data base systems. IEEE Trans. Softw. Eng. SE-1, 1 (Mar.), 100-110.
CHEN, T. C. 1978. Computer technology and the database user. In Proceedings of the 4th International Conference on Very Large Database Systems (Berlin, Oct.). IEEE, New York, pp. 72-86.
CODASYL 1973. CODASYL DDL Journal of Development June Report. Available from IFIP Administrative Data Processing Group, 40 Paulus Potterstraat, Amsterdam.
CODASYL 1978. CODASYL: Report of the Data Description Language Committee. Inf. Syst. 3, 4, 247-320.
CODD, E. F. 1982. Relational database: A practical foundation for productivity. Commun. ACM 25, 2 (Feb.), 109-117.
DATE, C. J. 1981. An Introduction to Database Systems, 3rd ed. Addison-Wesley, Reading, Mass.
DAVIES, C. T. 1973. Recovery semantics for a DB/DC system. In Proceedings of the ACM 73 National Conference (Atlanta, Ga., Aug. 27-29). ACM, New York, pp. 136-141.
DAVIES, C. T. 1978. Data processing spheres of control. IBM Syst. J. 17, 2, 179-198.
EFFELSBERG, W., HAERDER, T., REUTER, A., AND SCHULZE-BOHL, J. 1981. Performance measurement in database systems--Modeling, interpretation and evaluation. In Informatik Fachberichte 41. Springer-Verlag, Berlin, pp. 279-293 (in German).
ELHARDT, K. 1982. The database cache--Principles of operation. Ph.D. dissertation, Technical University of Munich, Munich, West Germany (in German).
ESWARAN, K. P., GRAY, J. N., LORIE, R. A., AND TRAIGER, I. L. 1976. The notions of consistency and predicate locks in a database system. Commun. ACM 19, 11 (Nov.), 624-633.
GRAY, J. 1978. Notes on data base operating systems. In Lecture Notes on Computer Science, vol. 60, R. Bayer, R. N. Graham, and G. Seegmueller, Eds. Springer-Verlag, New York.
GRAY, J. 1981. The transaction concept: Virtues and limitations. In Proceedings of the 7th International Conference on Very Large Database Systems (Cannes, France, Sept. 9-11). ACM, New York, pp. 144-154.
GRAY, J., LORIE, R., PUTZOLU, F., AND TRAIGER, I. L. 1976. Granularity of locks and degrees of consistency in a large shared data base. In Modeling in Data Base Management Systems. Elsevier North-Holland, New York, pp. 365-394.
GRAY, J., MCJONES, P., BLASGEN, M., LINDSAY, B., LORIE, R., PRICE, T., PUTZOLU, F., AND TRAIGER, I. L. 1981. The recovery manager of the System R database manager. ACM Comput. Surv. 13, 2 (June), 223-242.
HAERDER, T., AND REUTER, A. 1979. Optimization of logging and recovery in a database system. In Database Architecture, G. Bracchi, Ed. Elsevier North-Holland, New York, pp. 151-168.
HAERDER, T., AND REUTER, A. 1983. Concepts for implementing a centralized database management system. In Proceedings of the International Computing Symposium (Invited Paper) (Nuernberg, W. Germany, Apr.), H. J. Schneider, Ed. German Chapter of ACM, B. G. Teubner, Stuttgart, pp. 28-60.
IMS/VS-DB N.d. IMS/VS-DB Primer, IBM World Trade Center, Palo Alto, July 1976.
KOHLER, W. H. 1981. A survey of techniques for synchronization and recovery in decentralized computer systems. ACM Comput. Surv. 13, 2 (June), 149-183.
LAMPSON, B. W., AND STURGIS, H. E. 1979. Crash recovery in a distributed data storage system. XEROX Res. Rep., Palo Alto, Calif. Submitted for publication.
LINDSAY, B. G., SELINGER, P. G., GALTIERI, C., GRAY, J. N., LORIE, R., PRICE, T. G., PUTZOLU, F., TRAIGER, I. L., AND WADE, B. W. 1979. Notes on distributed databases. IBM Res. Rep. RJ 2571, San Jose, Calif.
LORIE, R. A. 1977. Physical integrity in a large segmented database. ACM Trans. Database Syst. 2, 1 (Mar.), 91-104.
REUTER, A. 1980. A fast transaction-oriented logging scheme for UNDO-recovery. IEEE Trans. Softw. Eng. SE-6 (July), 348-356.
REUTER, A. 1981. Recovery in Database Systems. Carl Hanser Verlag, Munich (in German).
REUTER, A. 1982. Performance Analysis of Recovery Techniques. Res. Rep., Computer Science De-