CPU Spikes and Top CPU SQL
Supporting mission-critical production databases, an Oracle database administrator always faces the challenge of being questioned when applications run slow on an otherwise well-optimized and stable environment. Myriad factors could cause this degradation, including CPU spikes on a database server. Above an acceptable range, the spikes impact the response and throughput of all applications that database server serves. With years of experience and the assumption that the database is guilty until proven innocent, our team has developed a systematic approach. We utilize data from the Automatic Workload Repository (AWR) and Active Session History (ASH) to find out which SQL statements are suffering as a result of these spikes and, possibly, which of those could be the root cause of the spikes. This article discusses this topic and illuminates several discoveries using two real-life case studies, unfolding CPU spike mysteries. The approach outlined can be used on both 10g and 11g databases.

Introduction
We closely monitor our production database systems, and CPU spikes are always at the top of the watch list. Although CPU spikes vary in intensity and duration, the impact is obvious. CPU spikes point directly to contention for and starvation of system resources, resulting in reduced database performance. On a database server that is part of a RAC database architecture, prolonged CPU spikes could cause the cluster database to become unresponsive, which may result in instance or node eviction.

We have a very systematic and well-documented approach for building the underlying infrastructure for a database. One of the key rules is to build and deploy the hardware stack such that the peak baseline CPU utilization is never more than 50 percent, with no degradation in performance for the defined application workload. Over the years we have found that this has helped us deliver systems that meet and/or beat the high expectations of our management. From a rudimentary capacity-planning approach this makes sense, but it assumes that the application workload is well-defined and the …

Assumptions
In this analysis, the first assumption is that all external causes outside the database have been eliminated, and that SQL statements cause the CPU spike occurring on the database server. Several known processes running on a database server may contribute to CPU spikes: the database backup process, the OEM agent process, or any other OS daemon process. This can easily be confirmed with system monitoring tools, such as Oracle's OS Watcher, which tracks the top CPU consumer processes at the OS level during the CPU spike time.

Secondly, our approach is built on top of performance data maintained automatically by AWR and ASH. The CPU spike may have happened in the past, or it may be an ongoing issue. Our approach works consistently for both situations, assuming there is enough AWR and ASH data available to cover the CPU spike time, past or present. Both AWR and ASH are part of the Oracle Enterprise Manager Diagnostic Pack, which must be licensed as a separate option. Oracle has a default retention policy for AWR data (seven days), but you can adjust it according to your needs.

Methodology
Our team believes that the performance data in AWR and ASH at CPU spike time are skewed compared to those at non-spike time. Our approach is a symptom-driven, bottom-up technique that focuses on understanding the outstanding variations. The challenge of using AWR data, however, always lies in the fact that it sums up information at the instance level, where symptoms and problems could both be involved. The first step in tackling any CPU spike issue starts with one symptom: the SQL statement with the highest CPU time (defined as the top CPU SQL). Be aware that the identified top CPU SQL may or may not be the real problem. After you identify the top CPU-consuming SQL, review other relevant AWR data looking for problem SQL statements. Any finding from AWR data is confirmed with the corresponding ASH data. Identify the sessions that ran the top CPU SQL, and examine all aspects of session-level information such as wait events, blocking sessions, etc.

When using this approach, follow two basic guidelines:

•• Both AWR and ASH data are time-based. If the top CPU SQL is not new to the database, one should be able to compare and confirm findings from AWR or ASH data across time.

•• Both AWR and ASH data are dimension-based. Figure 1 and Figure 2 show the major dimensions in AWR and ASH data for Oracle 10g, which hold true for 11g. Supporting evidence will be displayed from multiple dimensions in AWR data as well as in ASH data when AWR and ASH data share the same dimensions, such as SQL, wait events, etc. The common dimensions serve not only as the link that allows drill-down for the analysis throughout these two different performance data sets, but also as important cross-check points that should always be compared and contrasted.

To expedite analysis, our team created a series of SQL scripts to extract data directly from the AWR and ASH data dictionary views, instead of generating whole comprehensive AWR or ASH reports.

To summarize the steps involved:
•• For the CPU spike under investigation, first identify the begin snapshot id and end snapshot id as the range of AWR and ASH data at the CPU spike time. The script awr_snapshots.sql takes the instance_number as an input parameter and outputs the list of snapshot ids to choose from.
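The article does not reproduce awr_snapshots.sql itself; a minimal sketch of what such a script might contain, querying the real AWR view DBA_HIST_SNAPSHOT (the query body is our assumption, not the authors' actual script):

```sql
-- Sketch of awr_snapshots.sql: list AWR snapshots for one instance so the
-- begin/end snapshot ids can be chosen.
-- &1 is the instance_number passed on the command line: SQL> @awr_snapshots 1
SELECT snap_id,
       TO_CHAR(begin_interval_time, 'DD-MON-YYYY HH24:MI') AS begin_time,
       TO_CHAR(end_interval_time,   'DD-MON-YYYY HH24:MI') AS end_time
FROM   dba_hist_snapshot
WHERE  instance_number = &1
ORDER  BY snap_id;
```

Match the BEGIN_INTERVAL_TIME/END_INTERVAL_TIME values against the wall-clock time of the CPU spike to pick the snapshot range.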
•• For the AWR data identified above, extract the top CPU SQL by using script
awr_top_cpu_sql.sql. It takes the instance_number, begin_snapshot_id,
end_snapshot_id as input parameters, and displays the list of top 10 SQL
statements with the highest CPU time in the view DBA_HIST_SQLSTAT. To
get a nice rolling window to show the movement of SQL statements over
time, this script pulls from the base AWR snapshot interval and provides
the list for every snapshot existing between begin_snapshot_id and
end_snapshot_id. The goal is to check whether there is a SQL statement standing consistently at the top of the list during the time period of the CPU spike but not showing up as a top CPU-consuming SQL during time periods when there were no CPU spikes.
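A sketch of the shape awr_top_cpu_sql.sql might take, using the real DBA_HIST_SQLSTAT columns; the per-snapshot top-10 ranking logic is our assumption:

```sql
-- Sketch of awr_top_cpu_sql.sql: top 10 SQL by CPU time for every snapshot
-- in the chosen range, giving a rolling window of SQL movement over time.
SELECT snap_id, sql_id, plan_hash_value,
       ROUND(cpu_time_delta / 1e6)     AS cpu_seconds,     -- microseconds to seconds
       ROUND(elapsed_time_delta / 1e6) AS elapsed_seconds,
       executions_delta                AS executions
FROM  (SELECT s.*,
              ROW_NUMBER() OVER (PARTITION BY snap_id
                                 ORDER BY cpu_time_delta DESC) AS rn
       FROM   dba_hist_sqlstat s
       WHERE  instance_number = &instance_number
       AND    snap_id BETWEEN &begin_snapshot_id + 1 AND &end_snapshot_id)
WHERE  rn <= 10
ORDER  BY snap_id, rn;
```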
•• Once the top CPU SQL is identified, retrieve its run-time information from the full AWR data. The script awr_sql_run_history.sql takes the instance_number and sql_id as input parameters and outputs all records found in the view DBA_HIST_SQLSTAT. Consider the impact of a SQL statement on the database, as demonstrated by the values in the columns CPU_TIME_DELTA, ROWS_PROCESSED_DELTA, BUFFER_GETS_DELTA, DISK_READS_DELTA, EXECUTIONS_DELTA and PARSE_CALLS_DELTA. By comparing the values at CPU spike time with those at non-spike time, you should be able to see whether the changes are obvious enough to make this SQL statement a problem. Otherwise, consider it only a symptom.

Figure 1: Dimensions shown in AWR reports
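A sketch of what awr_sql_run_history.sql might look like, pulling the delta columns named above from DBA_HIST_SQLSTAT (the exact script is the authors'; this query body is our assumption):

```sql
-- Sketch of awr_sql_run_history.sql: full AWR run history for one sql_id,
-- so spike-time and non-spike-time windows can be compared side by side.
SELECT snap_id,
       plan_hash_value,
       cpu_time_delta,
       rows_processed_delta,
       buffer_gets_delta,
       disk_reads_delta,
       executions_delta,
       parse_calls_delta
FROM   dba_hist_sqlstat
WHERE  instance_number = &instance_number
AND    sql_id = '&sql_id'
ORDER  BY snap_id;
```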
•• Check other dimensions of AWR data to confirm the findings drawn from the top CPU SQL identified above. The scripts awr_top_events.sql and awr_load_profile.sql take the instance_number, begin_snapshot_id and end_snapshot_id as input parameters, and display the instance-level top five timed events and load profile for review.
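The idea behind awr_top_events.sql can be sketched with per-snapshot wait-event deltas from DBA_HIST_SYSTEM_EVENT; this is our illustration only, and it omits the CPU-time component, which would come from the 'DB CPU' statistic in DBA_HIST_SYS_TIME_MODEL:

```sql
-- Sketch: wait-event time deltas per snapshot, non-idle events only.
-- The extra leading snapshot (&begin_snapshot_id - 1) provides the LAG baseline.
SELECT snap_id, event_name,
       total_waits - LAG(total_waits)
           OVER (PARTITION BY event_name ORDER BY snap_id)        AS waits,
       ROUND((time_waited_micro - LAG(time_waited_micro)
           OVER (PARTITION BY event_name ORDER BY snap_id)) / 1e6) AS time_s
FROM   dba_hist_system_event
WHERE  instance_number = &instance_number
AND    snap_id BETWEEN &begin_snapshot_id - 1 AND &end_snapshot_id
AND    wait_class <> 'Idle'
ORDER  BY time_s DESC NULLS LAST;
```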
•• Further dig out the database sessions that ran the top CPU SQL at the CPU spike time and check their session-level information from multiple dimensions of ASH data. The goal is to use the Oracle Wait Interface to see whether contention is happening for those sessions and, if so, from what points and from where those contention points are coming. The script …

Figure 2: Dimensions shown in ASH data dictionary view

Listing 1: Identify the begin and end snapshot ids for the CPU spike time

SQL> @awr_snapshots 1
Begin snapshot id=176856 End snapshot id=176857 End snapshot time=07-MAR-2010 14:00
Elapsed CPU
Sql Id Plan Hash Value Seconds Seconds Rows Buffer Gets Disk Reads Executions Parses
-----------------------------------------------------------------------------------------------------------------------------
vu7ytmpa3s5ek 5296468315 11,385 11,013 106,230 3,293,690,084 8,620 24 24
...
Begin snapshot id=176857 End snapshot id=176858 End snapshot time=07-MAR-2010 14:30
Elapsed CPU
Sql Id Plan Hash Value Seconds Seconds Rows Buffer Gets Disk Reads Executions Parses
-----------------------------------------------------------------------------------------------------------------------------
vu7ytmpa3s5ek 5296468315 14,261 13,771 45,350 3,884,282,095 4,906 10 10
...
always staying at the top of the list and their process ids as well as positions shuffling up and down, the DBA finally traced these down to certain database sessions and the associated SQL statements currently running.

Then, the real questions started. The SQL statements identified were all legitimately issued from an application server. According to the application support team, these SQLs are not new, and the database runs them on an ongoing basis during the day. So why did the CPU spikes show up all of a sudden at this particular time? The on-call DBA also checked that there was no change in the SQL execution plans, and the wait events of those database sessions were db file sequential read most of the time, indicating the sessions were doing active I/O. Because the database was doing what it was supposed to do, we used our top CPU SQL approach to try to solve this puzzle before asking other teams to help diagnose the actual underlying issue.

As shown in Listing 1, our team identified that the CPU spike time started at 1:30 p.m. and the begin snapshot id in AWR as 176856. As the issue was currently going on, we used DBMS_WORKLOAD_REPOSITORY.CREATE_SNAPSHOT to create a new snapshot, 176860, to record performance data into AWR.

Listing 5: Identify sessions running SQL vu7ytmpa3s5ek at the CPU spike time

Listing 6: Trace down the top CPU SQL session 885 at the CPU spike time
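Listings 5 and 6 themselves are not reproduced in this extract; a query in the spirit of Listing 5, pulling the sessions that ran the top CPU SQL from the ASH history kept in AWR (the view and columns are real dictionary objects; the exact script is our assumption):

```sql
-- Which sessions ran the top CPU SQL during the spike window, and on what
-- did they spend their samples? A NULL event means the session was on CPU.
SELECT session_id,
       session_serial#,
       NVL(event, 'ON CPU')  AS event,
       COUNT(*)              AS samples,
       MAX(blocking_session) AS blocking_session
FROM   dba_hist_active_sess_history
WHERE  instance_number = &instance_number
AND    sql_id = '&sql_id'
AND    snap_id BETWEEN &begin_snapshot_id AND &end_snapshot_id
GROUP  BY session_id, session_serial#, NVL(event, 'ON CPU')
ORDER  BY samples DESC;
```

An on-demand snapshot, as used above to capture the ongoing spike into AWR, is taken with EXEC DBMS_WORKLOAD_REPOSITORY.CREATE_SNAPSHOT;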
The real-time troubleshooting tool was very helpful in revealing what is currently happening. However, it doesn't provide enough information about what happened in the past, which could blur the understanding of the test results. By combining AWR data and ASH data, this approach reveals both historical and real-time answers for troubleshooting issues.

Case Study 2
Another database server was found to have a CPU spike between 1 a.m. and 2 a.m. every morning. There was no monitoring alert because the spike stayed just below the warning line. However, it generated a noticeable pattern in the CPU graph. To take preventive action, our team was assigned to find the root cause of this daily occurrence.

We selected one day's CPU spike time and identified SQL 37sgt41qpy0p8 as the top CPU SQL. As shown in Listing 11, its run-time information actually indicated a different story. The CPU time of this SQL was stable at 10 seconds at non-spike time, while jumping up to 5,000 seconds at CPU spike time. While the Buffer Gets and Disk Reads did show some increases at the CPU spike time, their numbers were not significant compared to other SQL statements running at the same time. The number of Executions, on the contrary, was reduced substantially at the CPU spike time. Taking into consideration that this SQL was being run more than 2,000 times per AWR snapshot window at non-spike time, and that each time it processed one row at a time because the value of Rows was always equal to that of Executions, we concluded that this SQL was just a symptom, showing that it could not run fast enough at CPU spike time.

To get a hint of what else could be happening, we checked the top five timed events in AWR in Listing 12. From the events row cache lock and library cache lock, we observed that there was contention as well as an object lock issue.

We kept digging into the ASH data and identified one session (SID = 833) that was running SQL with SQL ID 37sgt41qpy0p8 at the time of the CPU spike. As shown in Listing 13, we confirmed it was running into row lock and library cache contention, as indicated by the wait events. The blocking session (SID = 837) was also identified, and we traced it down as shown in Listing 14. We could see that this blocking session was causing the problem: it blocked sessions while also being blocked itself. This was confirmed with Listings 15 and 16, in which the wait event and blocking session information for all sessions in the ASH data pointed to the session (SID = 837) as the big blocker.

By checking the application-level information for this session, as shown in Listing 17, we found it came from a DBMS_SCHEDULER job. After reviewing this job's schedule and run log in the views DBA_SCHEDULER_JOBS and DBA_SCHEDULER_JOB_RUN_DETAILS, there was a perfect match between the run time of this job and the CPU spike time on a daily basis. As this was a daily object statistics collection job, it explained both the CPU spike as well as the …
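The run-log check described above can be done with a query of this shape against the real dictionary view DBA_SCHEDULER_JOB_RUN_DETAILS (the seven-day window and ordering are illustrative, not the authors' exact query):

```sql
-- Correlate recent scheduler job runs with the nightly 1 a.m. spike window.
SELECT owner,
       job_name,
       actual_start_date,
       run_duration,
       status
FROM   dba_scheduler_job_run_details
WHERE  actual_start_date > SYSTIMESTAMP - INTERVAL '7' DAY
ORDER  BY actual_start_date;
```

A job whose ACTUAL_START_DATE and RUN_DURATION bracket the spike window every day, as here, is a strong root-cause candidate.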
Begin snapshot id=70212 End snapshot id=70213 End snapshot time=18-APR-2010 01:30
Instance                                             Average   % Total
  Number Event Name                 Waits   Time (s) Wait (ms) Call Time
-----------------------------------------------------------------------------------
2 CPU time 4146 37.2
2 row cache lock 110,290 1277 12 11.4
2 library cache lock 3,999 882 221 7.9
2 direct path read 211,549 817 4 7.3
2 latch: row cache objects 61,304 715 12 6.4
Listing 12: Review the top 5 timed events at the CPU spike time in AWR data
Listing 13: Trace down the top CPU SQL session 833 at the CPU spike time
Listing 14: Trace down blocking session 837 at the CPU spike time
Listing 17: Check the application-level information for the top CPU SQL session 837
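In the spirit of Listing 17, the application-level attributes of the blocking session can be read straight from the ASH samples; the view and columns are real, the specific query is our sketch:

```sql
-- Where did the blocking session come from? PROGRAM/MODULE/ACTION in ASH
-- reveal the originating application (here, a DBMS_SCHEDULER job).
SELECT DISTINCT session_id, program, module, action
FROM   dba_hist_active_sess_history
WHERE  instance_number = &instance_number
AND    session_id = &sid
AND    snap_id BETWEEN &begin_snapshot_id AND &end_snapshot_id;
```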