Applies To:: Checkpoint Tuning and Troubleshooting Guide (Doc ID 147468.1)

You might also like

Download as pdf or txt
Download as pdf or txt
You are on page 1of 6

7/6/2020 Document 147468.

1
Copyright (c) 2020, Oracle. All rights reserved. Oracle Confidential.

Checkpoint Tuning and Troubleshooting Guide (Doc ID 147468.1)

APPLIES TO:

Oracle Database - Enterprise Edition - Version 8.0.3.0 and later


Oracle Database Cloud Schema Service - Version N/A and later
Oracle Database Exadata Express Cloud Service - Version N/A and later
Oracle Database Exadata Cloud Machine - Version N/A and later
Oracle Cloud Infrastructure - Database Service - Version N/A and later
Information in this document applies to any platform.

PURPOSE

This bulletin provides the Database Administrator a better understanding of


incremental checkpoint and a description of four initialization parameters used for checkpoint tuning:

- FAST_START_MTTR_TARGET
- LOG_CHECKPOINT_INTERVAL
- LOG_CHECKPOINT_TIMEOUT
- LOG_CHECKPOINTS_TO_ALERT

It also explains how to interpret and handle checkpoint errors: 'Checkpoint not Complete' and 'Cannot Allocate New Log'
reported in the ALERT<sid>.LOG file.

DETAILS

Contents:

1. What is a Checkpoint?
2. Checkpoints and Performance
3. Parameters related to incremental checkpointing
4. Redo logs and Checkpoint
5. Understanding Checkpoint Error messages ("Cannot allocate new log" and "Checkpoint not complete")
6. Oracle Release Information
7. Using Statspack to determine Checkpointing Problems

CHECKPOINT TUNING AND ERROR HANDLING

1. What is a Checkpoint?

A Checkpoint is a database event which synchronizes the modified data blocks in memory with the datafiles on
disk. It offers Oracle the means for ensuring the consistency of data modified by transactions. The mechanism
of writing modified blocks on disk in Oracle is not synchronized with the commit of the corresponding
transactions.

A checkpoint has two purposes: (1) to establish data consistency, and (2) enable faster database recovery.
How is recovery faster? Because all database changes up to the checkpoint have been recorded in the datafiles,
making it unnecessary to apply redo log entries prior to the checkpoint. The checkpoint must ensure that all the
modified buffers in the cache are really written to the corresponding datafiles to avoid the loss of data
which may occur with a crash (instance or disk failure).

https://support.oracle.com/epmos/faces/DocumentDisplay?_adf.ctrl-state=2whn9zwj2_118&id=147468.1 1/6
7/6/2020 Document 147468.1

Oracle writes the dirty buffers to disk only on certain conditions:

- A shadow process must scan more than one-quarter of the db_block_buffer


parameter.
- Every three seconds.
- When a checkpoint is produced.

A checkpoint is realized on five types of events:

- At each switch of the redo log files.


- When the delay for LOG_CHECKPOINT_TIMEOUT is reached.
- When the size in bytes corresponding to :
(LOG_CHECKPOINT_INTERVAL* size of IO OS blocks)
is written on the current redo log file.
- Directly by the ALTER SYSTEM SWITCH LOGFILE command.
- Directly with the ALTER SYSTEM CHECKPOINT command.

During a checkpoint the following occurs:


- The database writer (DBWR) writes all modified database
blocks in the buffer cache back to datafiles,
- Checkpoint process (ckpt) updates the headers of all
the datafiles to indicate when the last checkpoint
occurred (SCN)

2. Checkpoints and Performance

Checkpoints present a tuning dilemma for the Database Administrator. Frequent


checkpoints will enable faster recovery, but can cause performance
degradation. How then should the DBA address this?

Depending on the number of datafiles in a database, a checkpoint can be a


highly resource intensive operation, since all datafile headers are frozen
during the checkpoint. There is a performance trade-off regarding frequency
of checkpoints. More frequent checkpoints enable faster database recovery
after a crash. This is why some customer sites which have a very low
tolerance for unscheduled system downtime will often choose this option.
However, the performance degradation of frequent checkpoints may not justify
this philosophy in many cases. Let's assume the database is up and running 95%
of the time, and unavailable 5% of the time from infrequent instance crashes
or hardware failures requiring database recovery. For most customer sites, it
makes more sense to tune for the 95% case rather than the rare 5% downtime.

This bulletin assumes that performance is your number one priority and so
recommendations are made accordingly. Therefore, your goal is to minimize the frequency
of checkpoints through tuning.

Tuning checkpoints involves four key initialization parameters

- FAST_START_MTTR_TARGET
- LOG_CHECKPOINT_INTERVAL
- LOG_CHECKPOINT_TIMEOUT
- LOG_CHECKPOINTS_TO_ALERT

These parameters are discussed in detail below.

Recommendations are also given for handling "checkpoint not complete" messages
found in the alert log, which indicate a need to tune redo logs and
checkpoints.

https://support.oracle.com/epmos/faces/DocumentDisplay?_adf.ctrl-state=2whn9zwj2_118&id=147468.1 2/6
7/6/2020 Document 147468.1

3. Parameters related to incremental checkpointing

FAST_START_MTTR_TARGET

Since Oracle 9i FAST_START_MTTR_TARGET parameter is the preferred method


of tuning incremental checkpoint target. FAST_START_MTTR_TARGET enables you
to specify the number of seconds the database takes to perform crash recovery
of a single instance. Based on internal statistics, incremental checkpoint
automatically adjusts the checkpoint target to meet the requirement of
FAST_START_MTTR_TARGET.

V$INSTANCE_RECOVERY.ESTIMATED_MTTR shows the current estimated mean time to


recover (MTTR) in seconds. This value is shown even if FAST_START_MTTR_TARGET
is not specified.

V$INSTANCE_RECOVERY.TARGET_MTTR shows the effective MTTR target in seconds


enforced by the system.

V$MTTR_TARGET_ADVICE shows the number of I/Os resulted by the current workload


under the current MTTR setting and the estimated number of I/Os that would be
resulted by the current workload under other MTTR settings. This view helps
the user to assess the trade-off between runtime performance and setting FAST_START_MTTR_TARGET to achieve
better recovery time.

LOG_CHECKPOINT_INTERVAL

LOG_CHECKPOINT_INTERVAL parameter specifies the maximum number of redo blocks


the incremental checkpoint target should lag the current log tail.
If FAST_START_MTTR_TARGET is specified, LOG_CHECKPOINT_INTERVAL should not
be set or set to 0.
On most Unix systems the operating system block size is 512 bytes.
This means that setting LOG_CHECKPOINT_INTERVAL to a value of 10,000 would
mean the incremental checkpoint target should not lag the current log tail
by more than 5,120,000 (5M) bytes. . If the size of your redo log is 20M, you are taking 4
checkpoints for each log.

LOG_CHECKPOINT_INTERVAL influences when a checkpoint occurs, which means


careful attention should be given to the setting of this parameter, keeping it
updated as the size of the redo log files is changed. The checkpoint
frequency is one of the factors which impacts the time required for the
database to recover from an unexpected failure. Longer intervals between
checkpoints mean that if the system crashes, more time will be needed for the
database to recover. Shorter checkpoint intervals mean that the database will
recover more quickly, at the expense of increased resource utilization during
the checkpoint operation.

This parameter also impacts the time required to complete a database recovery
operation during the roll forward phase of recovery. The actual recovery time
is dependent upon this time, and other factors, such as the type of failure
(instance or system crash, media failure, etc.), and the number of archived
redo logs which need to be applied. setting and the estimated number of I/Os that would be
resulted by the current workload under other MTTR settings. This view helps
the user to assess the trade-off between runtime performance and setting
FAST_START_MTTR_TARGET to achieve better recovery time.

https://support.oracle.com/epmos/faces/DocumentDisplay?_adf.ctrl-state=2whn9zwj2_118&id=147468.1 3/6
7/6/2020 Document 147468.1

LOG_CHECKPOINT_TIMEOUT

The LOG_CHECKPOINT_TIMEOUT parameter specifies the maximum number of seconds


the incremental checkpoint target should lag the current log tail.
In another word, it specifies how long a dirty buffer in buffer cache can
remain dirty.
Checkpoint frequency impacts the time required for the
database to recover from an unexpected failure. Longer intervals between
checkpoints mean that more time will be required during database recovery.

Oracle recommends using LOG_CHECKPOINT_INTERVAL to control the checkpoint


interval rather than LOG_CHECKPOINT_TIMEOUT, which will initiate a checkpoint
every "n" seconds, regardless of the transaction frequency. This can cause
unnecessary checkpoints in cases where transaction volumes vary. Unnecessary
checkpoints must be avoided whenever possible for optimal performance.

It is a misconception that setting LOG_CHECKPOINT_TIMEOUT to a given


value will initiate a log switch at that interval, enabling a recovery
window used for a stand-by database configuration. Log switches cause a checkpoint,but a checkpoint does not cause a
log switch. The only way to cause a log switch is manually with
ALTER SYSTEM SWITCH LOGFILE or resizing the redo logs to cause
more frequent switches. This is controlled by operating system
blocks, not a timed interval.
Sizing of the online redo logs is critical for performance and recovery.
See additional sections below on redo logs and checkpoints.

LOG_CHECKPOINTS_TO_ALERT

LOG_CHECKPOINTS_TO_ALERT lets you log your checkpoints to the alert file.

Doing so is useful for determining whether checkpoints are occurring at


the desired frequency.
Prior to Oracle9i this parameter was STATIC.
Oracle generally advises this be set to TRUE as the overhead is
negligible but the information in the alert log may be useful.

See Note:76713.1 to have more detail on How those instance parameters can influence the checkpoint.

4. Redo logs and Checkpoint

A checkpoint occurs at every log switch. If a previous checkpoint is already


in progress, the checkpoint forced by the log switch will override the current
checkpoint.

This necessitates well-sized redo logs to avoid unnecessary checkpoints as a


result of frequent log switches.
The lag between the incremental checkpoint target and the log tail is
also limited by 90% of the smallest online log file size. This makes sure
that in most cases log switch would not need to wait for checkpoint.
Because of this, log file sizes should be configured large enough.
A good rule of thumb is to switch logs at most every twenty minutes.
Having your log files too small can increase checkpoint activity and reduce performance.
https://support.oracle.com/epmos/faces/DocumentDisplay?_adf.ctrl-state=2whn9zwj2_118&id=147468.1 4/6
7/6/2020 Document 147468.1

Oracle recommends the user to set all online log files to be the same size,
and have at least two log groups per thread. The alert log is a valuabletool for
monitoring the rate that log switches occur, and subsequently, checkpoints
occur.

The following is an example of quick log switches


from the alert log:

Fri May 16 17:15:43 1997


Thread 1 advanced to log sequence 1272
Current log# 3 seq# 1272 mem# 0: /prod1/oradata/logs/redologs03.log
Thread 1 advanced to log sequence 1273
Current log# 1 seq# 1273 mem# 0: /prod1/oradata/logs/redologs01.log
Fri May 16 17:17:25 1997
Thread 1 advanced to log sequence 1274
Current log# 2 seq# 1274 mem# 0: /prod1/oradata/logs/redologs02.log
Thread 1 advanced to log sequence 1275
Current log# 3 seq# 1275 mem# 0: /prod1/oradata/logs/redologs03.log
Fri May 16 17:20:51 1997
Thread 1 advanced to log sequence 1276
Current log# 1 seq# 1276 mem# 0: /prod1/oradata/logs/redologs01.log

If redo logs switch every 3 minutes, you will see performance degradation.
This indicates the redo logs are not sized large enough to efficiently handle
the transaction load.

See Note:1038851.6 for further detail on how to estimate an adequate size of the redolog files. See
Note:1035935.6 Example of How To Resize the Online Redo Logfiles

5. Understanding Checkpoint Error messages ( Cannot allocate new log and Checkpoint not complete )

Sometimes, you can see in your alert.log file, the following corresponding
messages:

Thread 1 advanced to log sequence 248


Current log# 2 seq# 248 mem# 0: /prod1/oradata/logs/redologs02.log
Thread 1 cannot allocate new log, sequence 249
Checkpoint not complete

This message indicates that Oracle wants to reuse a redo log file, but
the current checkpoint position is still in that log. In this case, Oracle must
wait until the checkpoint position passes that log. Because the
incremental checkpoint target never lags the current log tail by more than 90%
of the smallest log file size, this situation may be encountered if DBWR writes
too slowly, or if a log switch happens before the log is completely full,
or if log file sizes are too small.
When the database waits on checkpoints,redo generation is stopped until the
log switch is done.

6. Oracle Release Information

In Oracle8i initialization parameter FAST_START_IO_TARGET causes incremental


checkpoint to automatically adjusting its target so that the number of data
blocks needed by recovery would be no more than FAST_START_IO_TARGET.
This parameter has been deprecated since Oracle 9i in favor of parameter FAST_START_MTTR_TARGET.

https://support.oracle.com/epmos/faces/DocumentDisplay?_adf.ctrl-state=2whn9zwj2_118&id=147468.1 5/6
7/6/2020 Document 147468.1

7. Using Statspack to determine Checkpointing problems

Statspack snapshots can be taken every 15 minutes or so, these reports gather useful
information about number of checkpoints started and checkpoints completed and number
of database buffers written during checkpointing for that window of time . It also contains
statistics about redo activity. Gathering and comparing these snapshot reports gives you
a complete idea about checkpointing performance at different periods of time.

Another important thing to watch in statspack report is the following wait events,
they could be a good indication about problems with the redo log throughput and checkpointing:

log file switch (checkpoint incomplete)


log file switch (archiving needed)
log file switch/archive
log file switch (clearing log file)
log file switch completion
log switch/archive
log file sync

In the case when one or more of the above wait events is repeated frequently
with considerable values then you need to take an action like adding More
online redo log files or increasing their sizes and/or modifying checkpointing parameters.

REFERENCES

NOTE:76713.1 - 8i Parameters that Influence Checkpoints


NOTE:1038851.6 - How to Estimate Size of Redo Logs
Didn't find what you are looking for?

https://support.oracle.com/epmos/faces/DocumentDisplay?_adf.ctrl-state=2whn9zwj2_118&id=147468.1 6/6

You might also like