Download as pdf or txt
Download as pdf or txt
You are on page 1of 7

CPU Spikes and Top CPU SQL

application is optimized. However, there are always going to be problems that


arise from well-detailed systems. Several times when there was absolutely no
code or workload change, application performance degradation usually is a
result of sudden CPU spikes.
With this in mind, as a DBA team we have developed a systematic approach to
troubleshoot database performance. This approach also provides compelling
evidence to show whether there were anomalies going on inside database
during CPU spike time. With Oracle Database10g’s introduction of AWR and
ASH technologies, we have formalized a series of steps to analyze their data
with the aim of nailing down problematic SQL statements that could cause
CPU spikes. (By now, we should all be aware that AWR is a tool to collect
instance-level performance data, while ASH collects session-level performance
By Sai S. Selvaganesan and Lei Zeng data!) By combining these two powerful tools, we have been successful in

S
unfolding CPU spike mysteries. The approach outlined can be used on both
upporting mission-critical production databases, 10g and 11g databases.
an Oracle database administrator always faces the Assumptions
challenge of being questioned when applications run In this analysis, the first assumption is that all external causes outside the
database have been eliminated, and that the SQL statement(s) cause the CPU
slow on an otherwise well-optimized and stable environment. spike occurring on the database server. Several known processes running on a
Myriad factors could cause this degradation including CPU database server may contribute to CPU spikes. Database backup process, OEM
agent process, or any other OS daemon process, etc. This can be easily
spikes on a database server. Above an acceptable range, the confirmed by using system monitoring tools, such as Oracle’s OS watcher that
spikes impact the response and throughput of all applications tracks top CPU consumer processes at the OS level during the CPU spike time.
Secondly, our approach is built on top of performance data maintained
that database server serves. With years of experience and automatically by AWR and ASH. The CPU spike may have happened in the past,
the assumption that the database is guilty until proven or maybe it is an ongoing issue. Our approach works consistently for both
situations, assuming that there is enough AWR and ASH data available to
innocent, our team has developed a systematic approach. We cover the CPU spike time past or present. Both AWR and ASH are part of Oracle
utilize data from Automatic Workload Repository (AWR) and Enterprise Manager Diagnostic Pack that must be licensed as a separate
option. Oracle has default retention policy for AWR data (seven days) but you
Automated Session History (ASH) to find out SQL statements can adjust it according to your needs.
that are suffering as a result of these spikes and possibly Methodology
which of those could be the root cause of CPU spikes. This Our team believes that the performance data in AWR and ASH at the CPU spike
time are skewed compared to those at the non-CPU spike time. Our approach
article discusses this topic and illuminates several discoveries is a symptom-driven, bottom-up technique that focuses on understanding the
using two real-life case studies. outstanding variations. Though, the challenge of using AWR data always lies
in the fact that it sums up information at the instance level where symptoms
and problems could both be involved. The first step to tackle any CPU spike
Introduction
issue starts with one symptom: The SQL statement with the highest CPU time
We closely monitor our production database systems and CPU spikes are always (defined as top CPU SQL). Be aware that the identified top CPU SQL may or
on the top of the watch list. Although the CPU spikes vary depending on intensity may not be the real problem. After you identify the top CPU consuming SQL,
and duration, the impact is obvious. CPU spikes point directly to contention review other relevant AWR data looking for problem SQL statements. Any
and starvation of system resources, resulting in reduced database performance. finding from AWR data is confirmed with the corresponding ASH data. Identify
A database server that is part of RAC database architecture, prolonged CPU the sessions that are top CPU SQL, and examine all aspects of session-level
spikes could cause the cluster database to become unresponsive that may information such as wait events, blocking session, etc.
result in instance or node eviction.
When using this approach, follow two basic guidelines:
We have a very systematic and well-documented approach for building the
underlying infrastructure for a database. One of the key rules is to build and •• Both AWR and ASH data are time-based data. If the top CPU SQL is not new
to the database, one should be able to compare and confirm findings from
deploy the hardware stack such that the peak baseline CPU utilization is never
AWR or ASH data across time.
more than 50 percent with no degradation in performance for the application
workload defined. Over the years we have found out that this has helped us in •• Both AWR and ASH data are dimension-based data. Figure 1 and Figure 2
delivering systems that meet and/or beat the high expectations of our show those major dimensions in AWR and ASH data of Oracle 10g, which
management. From a rudimentary capacity-planning approach this makes hold true for 11g. Supporting evidence will be displayed from multiple
sense, but it assumes that the application workload is well-defined and the
continued on page 6

2nd Qtr 2011 ■ Page 5


CPU Spikes and Top CPU SQL continued from page 5

dimensions in AWR data as well as in ASH data when AWR and ASH data To expedite analysis, our team created a series of SQL scripts to extract data
share the same dimensions such as SQL, wait events, etc. The common directly from AWR and ASH data dictionary views, instead of generating the
dimensions serve not only as the link that allows drill down for the analysis whole comprehensive AWR or ASH reports.
throughout these two different performance data sets, but also as the important
To summarize the steps involved:
cross-check points that should always be compared and contrasted.
•• For the CPU spike under investigation, first identify the begin snapshot
id and end snapshot id as the range of AWR and ASH data at the CPU
spike time. The script awr_snapshots.sql is provided to take the
instance_number as input parameter, and output the list of snapshot
id for us to choose.
•• For the AWR data identified above, extract the top CPU SQL by using script
awr_top_cpu_sql.sql. It takes the instance_number, begin_snapshot_id,
end_snapshot_id as input parameters, and displays the list of top 10 SQL
statements with the highest CPU time in the view DBA_HIST_SQLSTAT. To
get a nice rolling window to show the movement of SQL statements over
time, this script pulls from the base AWR snapshot interval and provides
the list for every snapshot existing between begin_snapshot_id and
end_snapshot_id. The goal is to check whether there is a SQL statement
standing up consistently at the top of the list during the time period of
the CPU spike, but does not show up as a Top CPU consuming SQL during
time periods when there were no CPU spikes.
•• Once the top CPU SQL is identified, retrieve its run-time information out
of the full AWR data. The script awr_sql_run_history.sql is provided
to take the instance_num, sql_id as input parameters, and output all
records found in the view DBA_HIST_SQLSTAT. Consider the impact of a SQL
statement on the database that can be demonstrated by values in columns
of CPU_TIME_DELTA, ROWS_PROCESSED_DELTA, BUFFER_GETS_DELTA,
Figure 1: Dimensions shown in AWR reports DISK_READ_DELTA, EXECUTION_DELTA and PARSE_CALLS_DELTA. By
comparing the values at CPU spike time with those at non- CPU spike
time, you should able to see whether the changes are obvious enough
to make this SQL statement a problem. Otherwise, consider it only as
a symptom.
•• Check other dimensions of AWR data to confirm whatever the findings
have drawn from the top CPU SQL identified above. The scripts
awr_top_events.sql and awr_load_profile.sql are provided to take
the instance_number, begin_snapshot_id, end_snapshot_id as input
parameters, and display instance-level top-five timed events and load
profiles for review.
•• Further dig out database sessions that run the top CPU SQL at the CPU
spike time and check their session level information from multiple
dimensions of ASH data. The goal is to use the Oracle Wait Interface to see
whether contention is happening for those sessions. If yes, then from what
points and from where are those contention points coming. The script

SQL> @awr_snapshots 1

Instance Number Snap Id Snap Time Snap Level


--------------- --------- ------------------------- ----------
...
1 176855 07-MAR-2010 13:00 1
1 176856 07-MAR-2010 13:30 1
1 176857 07-MAR-2010 14:00 1
1 176858 07-MAR-2010 14:30 1
1 176859 07-MAR-2010 15:00 1
1 176860 07-MAR-2010 15:16 1

Listing 1: Identify the begin and end snapshot id for the CPU spike time
Figure 2: Dimensions shown in ASH data dictionary view

Page 6 ■ 2nd Qtr 2011


SQL> @awr_top_cpu_sql 1 176856 176860

Begin snapshot id=176856 End snapshot id=176857 End snapshot time=07-MAR-2010 14:00
Elapsed CPU
Sql Id Plan Hash Value Seconds Seconds Rows Buffer Gets Disk Reads Executions Parses
-----------------------------------------------------------------------------------------------------------------------------
vu7ytmpa3s5ek 5296468315 11,385 11,013 106,230 3,293,690,084 8,620 24 24
...

Begin snapshot id=176857 End snapshot id=176858 End snapshot time=07-MAR-2010 14:30
Elapsed CPU
Sql Id Plan Hash Value Seconds Seconds Rows Buffer Gets Disk Reads Executions Parses
-----------------------------------------------------------------------------------------------------------------------------
vu7ytmpa3s5ek 5296468315 14,261 13,771 45,350 3,884,282,095 4,906 10 10
...

Listing 2: Identify the top CPU SQL as vu7ytmpa3s5ek

SQL> @awr_sql_run_history 1 vu7ytmpa3s5ek

Instance Snap Snap Elapsed CPU


Sql Id Number Id Time Plan Hash Value Seconds Seconds Rows Buffer Gets Disk Reads Executions Parses
--------------- -------- --------- -------------------- --------------- ----------- ----------- -------------- ---------------- ---------------- --------------- --------
...
vu7ytmpa3s5ek 1 176837 07-MAR-2010 04:00 5296468315 1,044 1,030 30,938 461,227,822 1,021 5 5
vu7ytmpa3s5ek 1 176838 07-MAR-2010 04:30 5296468315 696 690 19,985 343,813,929 481 0 0
vu7ytmpa3s5ek 1 176839 07-MAR-2010 05:00 5296468315 266 265 1,882 132,844,756 62 0 0
vu7ytmpa3s5ek 1 176840 07-MAR-2010 05:30 5296468315 44 40 4,756 18,941,601 380 3 3
vu7ytmpa3s5ek 1 176841 07-MAR-2010 06:00 5296468315 60 59 148 29,203,250 77 1 1
vu7ytmpa3s5ek 1 176842 07-MAR-2010 06:30 5296468315 105 104 0 48,946,885 96 1 1
vu7ytmpa3s5ek 1 176843 07-MAR-2010 07:00 5296468315 606 596 5,840 287,736,753 176 1 1
vu7ytmpa3s5ek 1 176844 07-MAR-2010 07:30 5296468315 271 270 142 132,232,291 33 6 6
vu7ytmpa3s5ek 1 176845 07-MAR-2010 08:00 5296468315 757 734 10,152 313,558,597 729 7 7
vu7ytmpa3s5ek 1 176846 07-MAR-2010 08:30 5296468315 1,067 1,061 7,090 443,839,315 221 1 1
vu7ytmpa3s5ek 1 176847 07-MAR-2010 09:00 5296468315 579 574 2,831 248,128,175 48 1 1
vu7ytmpa3s5ek 1 176848 07-MAR-2010 09:30 5296468315 612 609 160 301,446,395 0 0 0
vu7ytmpa3s5ek 1 176849 07-MAR-2010 10:00 5296468315 499 499 130 249,778,972 0 0 0
vu7ytmpa3s5ek 1 176850 07-MAR-2010 10:30 5296468315 610 608 140 300,330,553 0 0 0
vu7ytmpa3s5ek 1 176855 07-MAR-2010 13:00 5296468315 810 746 57,922 320,108,962 4,196 26 26
vu7ytmpa3s5ek 1 176856 07-MAR-2010 13:30 5296468315 3,887 3,768 61,244 1,321,679,994 4,489 14 14
vu7ytmpa3s5ek 1 176857 07-MAR-2010 14:00 5296468315 7,498 7,245 44,986 1,972,010,090 4,131 10 10
vu7ytmpa3s5ek 1 176858 07-MAR-2010 14:30 5296468315 6,763 6,526 364 1,912,272,005 775 0 0
vu7ytmpa3s5ek 1 176859 07-MAR-2010 15:00 5296468315 6,309 6,160 266 2,008,435,124 75 0 0
vu7ytmpa3s5ek 1 176860 07-MAR-2010 15:16 5296468315 4,595 4,512 260 1,670,567,860 13 0 0

Listing 3: Examine the run-time information for SQL vu7ytmpa3s5ek

ash_session_for_sqlid.sql is provided to take the instance_number,


SQL> @awr_top_events 1 176856 176860
begin_snapshot_id, end_snapshot_id, sql_id as input parameters, and
display sessions matched in the view DBA_HIST_ACTIVE_SESS_HISTORY. Begin snapshot id=176856 End snapshot id=176857 End snapshot time=07-MAR-2010 14:00
You can then use scripts ash_session_for_sid.sql and ash_session_ Instance Averge % Total
Number Event Name Waits Time (s) Wait (ms) Call Time
details.sql to derive from session dimension to wait event, blocking -----------------------------------------------------------------------------------
session and application dimensions. 1 CPU time 7751 47.9
1 read by other session 108,844 5679 52 35.1
•• Also, sum up ASH data for all sessions at the CPU spike time to confirm 1 db file sequential read 143,598 1295 9 8
1 db file scattered read 7,182 660 92 4.1
our findings about the top CPU SQL sessions. The scripts ash_blocking_ 1 gc buffer busy 15,508 426 27 2.6
session.sql is provided to take the instance_number, begin_snapshot_id,
Listing 4: Review the top 5 timed events at the CPU spike time in AWR data
end_snapshot_id as input parameters and rollup session count according
to columns EVENTS and BLOCKING_SESSION in the view DBA_HIST_ACTIVE_
SESS_HISTORY, thus easily detecting whether a particular session, or wait
Case Study 1
event, or combination of both, is a hot spot. The scripts ash_top_events.sql
is provided to display the top-10 wait events based on the session count in The monitoring tool alerted us that a particular database server was having
ASH data and we need to confirm with top-five events in AWR data. CPU spikes above 90 percent. We notified the on-call DBA and started using the
UNIX top command at the OS level to check what was consuming significant
Having describing the basics, we are going to elaborate our analytic technique CPU resources. Despite the difficulty of having more than 10 oracle processes
using the following two case studies.
continued on page 8

2nd Qtr 2011 ■ Page 7


CPU Spikes and Top CPU SQL continued from page 7

always staying at the top of the list and their process ids as well as positions database sessions were db file sequential read most of the times,
shuffling up and down, the DBA finally traced these down to certain database indicating sessions were doing active IO. Because the database is doing
sessions and the associated SQL statements currently running. what it is supposed to do, we used our top CPU SQL approach trying to
Then, the real questions started. The SQL statements identified were solve this puzzle before asking other teams to help diagnose the actual
all legitimately issued from an application server. According to the underlying issue.
application support team, these SQLs are not new and the database runs As shown in Listing 1, our team identified that the CPU spike time started at
them on an ongoing basis during the day. So why the CPU did spikes show 1:30 p.m. and the begin snapshot id in AWR as 176856. As the issue was
up all of sudden at this particular time? The on-call DBA also checked that currently going on, we used DBMS_WORKLOAD_REPOSITORY.CREATE_SNAPSHOT
there was no change in the SQL execution plan and the wait events of those to create a new snapshot 176860 to record performance data into AWR.

SQL> @ash_session_for_sqlid 1 176856 176860 vu7ytmpa3s5ek


Session Blocking
Snap Id Sample Time Sid State Sql Id Event Session
--------- ------------------------------ ------ ------------ --------------- ---------------------------------------- --------
...
176857 07-MAR-2010 13:43:49 1067 WAITING vu7ytmpa3s5ek gc cr block busy
176857 07-MAR-2010 13:44:40 1074 WAITING vu7ytmpa3s5ek db file sequential read
176857 07-MAR-2010 13:45:10 840 WAITING vu7ytmpa3s5ek latch: cache buffers chains 1074
176857 07-MAR-2010 13:45:30 1074 WAITING vu7ytmpa3s5ek db file sequential read
176857 07-MAR-2010 13:45:50 852 WAITING vu7ytmpa3s5ek db file sequential read
176857 07-MAR-2010 13:46:50 841 WAITING vu7ytmpa3s5ek db file sequential read
176857 07-MAR-2010 13:47:10 885 WAITING vu7ytmpa3s5ek latch: cache buffers chains
176857 07-MAR-2010 13:47:30 852 WAITING vu7ytmpa3s5ek db file sequential read
176857 07-MAR-2010 13:48:00 852 WAITING vu7ytmpa3s5ek db file sequential read
176857 07-MAR-2010 13:48:50 963 WAITING vu7ytmpa3s5ek db file sequential read
176857 07-MAR-2010 13:40:21 874 WAITING vu7ytmpa3s5ek latch: cache buffers chains 885
176857 07-MAR-2010 13:51:11 897 WAITING vu7ytmpa3s5ek latch: cache buffers chains
176857 07-MAR-2010 13:51:41 852 WAITING vu7ytmpa3s5ek db file sequential read
176857 07-MAR-2010 13:51:51 868 WAITING vu7ytmpa3s5ek latch: cache buffers chains 841
176857 07-MAR-2010 13:51:51 874 WAITING vu7ytmpa3s5ek latch: cache buffers chains 841
176857 07-MAR-2010 13:51:51 885 WAITING vu7ytmpa3s5ek latch: cache buffers chains 841
176857 07-MAR-2010 13:51:51 897 WAITING vu7ytmpa3s5ek latch: cache buffers chains 841
176857 07-MAR-2010 13:51:51 1045 WAITING vu7ytmpa3s5ek latch: cache buffers chains 841
176857 07-MAR-2010 13:52:01 840 WAITING vu7ytmpa3s5ek latch: cache buffers chains 874
176857 07-MAR-2010 13:52:01 841 WAITING vu7ytmpa3s5ek latch: cache buffers chains 874
176857 07-MAR-2010 13:52:01 1039 WAITING vu7ytmpa3s5ek latch: cache buffers chains 874
...

Listing 5: Identify sessions running SQL vu7ytmpa3s5ek at the CPU spike time

SQL> @ash_session_for_sid 1 176856 176860 885


Session Blocking
Snap Id Sample Time Sid State Sql Id Event Session
--------- ------------------------------ ------ ------------ --------------- ---------------------------------------- --------
...
176857 07-MAR-2010 13:37:10 885 WAITING vu7ytmpa3s5ek latch: cache buffers chains
176857 07-MAR-2010 13:41:51 885 WAITING vu7ytmpa3s5ek latch: cache buffers chains 841
176857 07-MAR-2010 13:42:11 885 WAITING vu7ytmpa3s5ek latch: cache buffers chains 841
176857 07-MAR-2010 13:45:42 885 WAITING vu7ytmpa3s5ek latch: cache buffers chains 897
176857 07-MAR-2010 13:46:22 885 WAITING vu7ytmpa3s5ek latch: cache buffers chains
176857 07-MAR-2010 13:47:43 885 WAITING vu7ytmpa3s5ek latch: cache buffers chains
176857 07-MAR-2010 13:48:33 885 WAITING vu7ytmpa3s5ek latch: cache buffers chains
176857 07-MAR-2010 13:49:03 885 WAITING vu7ytmpa3s5ek latch: cache buffers chains
176857 07-MAR-2010 13:49:23 885 WAITING vu7ytmpa3s5ek latch: cache buffers chains 1074
176857 07-MAR-2010 13:49:34 885 WAITING vu7ytmpa3s5ek latch: cache buffers chains
176857 07-MAR-2010 13:51:14 885 WAITING vu7ytmpa3s5ek latch: cache buffers chains
176857 07-MAR-2010 13:51:24 885 WAITING vu7ytmpa3s5ek latch: cache buffers chains
176857 07-MAR-2010 13:51:35 885 WAITING vu7ytmpa3s5ek latch: cache buffers chains
176857 07-MAR-2010 13:56:16 885 WAITING vu7ytmpa3s5ek latch: cache buffers chains 1074
176857 07-MAR-2010 13:57:16 885 WAITING vu7ytmpa3s5ek latch: cache buffers chains
176857 07-MAR-2010 13:57:36 885 WAITING vu7ytmpa3s5ek gc buffer busy
176858 07-MAR-2010 14:01:07 885 WAITING vu7ytmpa3s5ek latch: cache buffers chains 1039
176858 07-MAR-2010 14:01:27 885 WAITING vu7ytmpa3s5ek latch: cache buffers chains 963
176858 07-MAR-2010 14:07:28 885 WAITING vu7ytmpa3s5ek latch: cache buffers chains 974
176858 07-MAR-2010 14:08:29 885 WAITING vu7ytmpa3s5ek latch: cache buffers chains 965
...

Listing 6: Trace down the top CPU SQL session 885 at the CPU spike time

Page 8 ■ 2nd Qtr 2011


a big spur in the number of Executions of this SQL around the time when
SQL> @ash_blocking_session 1 176856 176860
Blocking Waiting Session
CPU spike started, and the huge jump of Buffer Gets incurred thereafter. By
Event Session Count reviewing the history of how this SQL ran in the past (Executions < 5 and
---------------------------------------- -------- ---------------
read by other session 3,471
Buffer Gets < 9 digits per snapshot), we thought this could be the problem.
db file sequential read 642 We confirmed it by checking other SQL statements without revealing any
gc buffer busy 566
db file scattered read 476
similar issues.
latch: cache buffers chains 95
log file sync 1132 63
As shown in Listing 4, we also confirmed with the top-five timed events in
control file sequential read 62 AWR. The event read by other session, db file sequential read and
gc cr block busy 35
SQL*Net more data to client 30
db file scattered read were taken as being in line with the heavy
gc cr grant 2-way 24 “Buffer Gets” observed.
latch: cache buffers chains 1074 18
latch: cache buffers chains 897 18 To confirm, we need to further validate it by ASH data. Listing 5 displays a list
latch: cache buffers chains 974 11
of sessions that we found to be running SQL vu7ytmpa3s5ek at the CPU spike
Listing 7: Check blocking session information for all sessions at the CPU spike time time. Listing 6 shows one top CPU SQL session that we traced down. By looking
at the wait event and blocking session information, we observed another
problem when the multiple sessions ran this SQL simultaneously: contention.
SQL> @ash_top_events 1 176856 176860 The event latch: cache buffers chains and gc buffer busy were taken
Begin snapshot id=176856 End snapshot id=176857 End snapshot time=07-MAR-2010 14:00
as signals of contention that again were in line with the heavy Buffer Gets.
Instance Blocking Waiting Session Listing 7 and 8 show the summarized wait event and blocking session
Number Event Session Count information for all sessions at the CPU spike time, which also pointed us to the
-----------------------------------------------------------------------------------
1 read by other session 1027 same direction.
1 db file sequential read 235
1 gc buffer busy 201 After we identified the top CPU SQL as the problem, we identified which side of
1 db file scattered read 121 application it came from as shown in Listing 9, and its SQL text, execution
1 latch: cache buffers chains 43
1 log file sync 1132 24 plan as shown in Listing 10. With all detailed data in hand to support our
1 gc cr block busy 16 findings, we were confident to move forward to talk to the application team.
1 control file sequential read 14
1 latch: cache buffers chains 1074 12
1 latch: cache buffers chains 897 11
... SQL> @ash_session_details 1 176856 176860 885 vu7ytmpa3s5ek
Listing 8: Review top wait events at the CPU spike time in ASH data
Sid Sql Id Username Module Action Program Service Name
----------------- ------------------------------- --------------------------- ------------------------------ ----------------------------- -------------------------
885 vu7ytmpa3s5ek REPORT_USER JDBC Thin Client JDBC Thin Client adhoc_report
As shown in Listing 2, the team decided to take SQL vu7ytmpa3s5ek as the top
CPU SQL because it stayed consistently at the top of the list. We also decided to Listing 9: Check the application-level information for the top CPU SQL session 885
always work on the biggest problem first, while bearing in mind to come back
to check the SQL statements at the second or third place, because they could
also contribute to the CPU spike. SQL> @awr_sql_report 1 176856 176860 vu7ytmpa3s5ek
As shown in Listing 3, we pulled out all the run-time information for SQL Listing 10: Retrieve the sql text and execution plan for SQL vu7ytmpa3s5ek in AWR data
vu7ytmpa3s5ek in AWR data and the last few lines really popped out. We saw

SQL> @awr_sql_run_history 1 37sgt41qpy0p8

Instance Snap Snap Elapsed CPU


Sql Id Number Id Time Plan Hash Value Seconds Seconds Rows Buffer Gets Disk Reads Executions Parses
--------------- -------- --------- -------------------- --------------- ------------ ------------ --------------- ---------------- -------------- --------------- -------
...
37sgt41qpy0p8 2 70209 17-APR-2010 22:30 0 10 10 2,112 0 0 2,112 0
37sgt41qpy0p8 2 70210 17-APR-2010 23:00 0 10 10 2,070 0 0 2,070 0
37sgt41qpy0p8 2 70211 17-APR-2010 23:30 0 10 10 2,138 0 0 2,138 0
37sgt41qpy0p8 2 70212 18-APR-2010 01:00 0 392 247 2,070 8,111 0 2,070 0
37sgt41qpy0p8 2 70213 18-APR-2010 01:30 0 5,020 4,668 852 35,616 0 852 1
37sgt41qpy0p8 2 70214 18-APR-2010 02:00 0 2,967 2,664 776 21,104 0 776 1
37sgt41qpy0p8 2 70215 18-APR-2010 02:30 0 10 10 1,926 0 0 1,926 0
37sgt41qpy0p8 2 70216 18-APR-2010 03:00 0 10 10 2,096 0 0 2,096 0
37sgt41qpy0p8 2 70217 18-APR-2010 03:30 0 10 10 2,109 0 0 2,109 1
37sgt41qpy0p8 2 70218 18-APR-2010 04:00 0 10 10 2,032 0 0 2,032 0

Listing 11: Examine the run-time information for SQL 37sgt41qpy0p8

continued on page 10

2nd Qtr 2011 ■ Page 9


CPU Spikes and Top CPU SQL continued from page 9

The real-time troubleshooting tool was very helpful in revealing what is a time because the value of Rows was always equal to that of Executions, we
currently happening. However, it doesn’t provide enough information on what concluded that this SQL was just a symptom of showing that it could not run
happened in the past, which could blur the understanding of the test results. fast enough at CPU spike time.
Combining AWR data and ASH data, this approach reveals both historically To get a hint of what else could be happening, we checked the top-five
and realistic answers for troubleshooting issues. timed event in AWR in Listing 12. From the event row cache lock and
library cache lock, we observed that there was contention as well as an
Case Study 2 object lock issue.
Another database server was found out to have CPU spike between 1 a.m. and We kept digging into ASH data and identified one session (SID = 833) that
2 a.m. every morning. There was no monitoring alert because the spike came was running SQL with SQL ID 37sgt41qpy0p8 at the time of the CPU spike.
up just right below the warning line. However, it generated a noticeable As shown in Listing 13, we confirmed it was running into row lock and library
pattern in the CPU graph. To take preventive actions, our team was assigned to cache contention as indicated by the wait event. The blocking session
find out the root cause of this daily occurrence. (SID = 837) was also identified and we traced it down as shown in Listing 14.
Selecting one day’s CPU spike time and identified SQL 37sgt41qpy0p8 as the We could see that this blocking session was causing the problem: It blocked
top CPU SQL. As shown in Listing 11, its run-time information actually sessions while being blocked also. Confirmed with Listing 15 and 16, in which
indicated a different story. The CPU time of this SQL was stable at 10 seconds the wait event and blocking session information for all sessions in ASH data all
at non-CPU spike time, while jumping up to 5,000 seconds at CPU spike time. pointed to the session (SID = 837) as the big blocker.
While the Buffer Gets and Disk Reads did show some increases at the CPU By checking the application-level information for this session as shown in
spike time, their numbers were considered not significant compared to other Listing 17, it came from a DBMS_SCHEDULER job. After reviewing this job’s
SQL statements running at the same time. The number of Executions, on the schedule and the run log in the view DBMS_SCHEDULER_JOBS and DBMS_
contrary, was reduced substantially at the CPU spike time. Taking into SCHEDULER_JOB_RUN_DETAILS, there was a perfect match between the run
consideration that this SQL was being run more than 2,000 times per AWR time of this job and the CPU spike time on daily basis. As this was a Daily
snapshot window at the non-CPU time, and each time it processed one row at object statistics collection job, it explained both the CPU spike as well as the

SQL> @awr_top_events 2 70212 70214

Begin snapshot id=70212 End snapshot id=70213 End snapshot time=18-APR-2010 01:30
Instance Averge % Total
Number Event Name Waits Time (s) Wait (ms) Call Time
-----------------------------------------------------------------------------------
2 CPU time 4146 37.2
2 row cache lock 110,290 1277 12 11.4
2 library cache lock 3,999 882 221 7.9
2 direct path read 211,549 817 4 7.3
2 latch: row cache objects 61,304 715 12 6.4

Listing 12: Review the top 5 timed events at the CPU spike time in AWR data

SQL> @ash_session_for_sid 2 70212 70214 833


Session Blocking
Snap Id Sample Time Sid State Sql Id Event Session
--------- ------------------------------ ------ ------------ --------------- ---------------------------------------- --------
...
70214 18-APR-2010 01:40:16 833 WAITING rjjk8gjw4tmqf library cache lock
70214 18-APR-2010 01:40:26 833 WAITING e89tktqg9kv1s library cache lock
70214 18-APR-2010 01:40:47 833 WAITING ydsgq975nmhkd library cache lock 837
70214 18-APR-2010 01:40:57 833 WAITING jyztj0vtzu45b library cache lock
70214 18-APR-2010 01:41:07 833 WAITING o0464ghfsxzwd library cache lock
70214 18-APR-2010 01:41:17 833 WAITING 069g87c64q13e latch: row cache objects
70214 18-APR-2010 01:42:08 833 WAITING mtjy9mk07rcmk row cache lock
70214 18-APR-2010 01:42:28 833 WAITING 2hwsq1ma3w29m latch: row cache objects
70214 18-APR-2010 01:42:49 833 WAITING 3c73j42t595vn row cache lock
70214 18-APR-2010 01:43:19 833 WAITING 9m05zf9a2t4ht library cache lock
70214 18-APR-2010 01:44:10 833 WAITING 5xj39u95phcx2 library cache lock
70214 18-APR-2010 01:44:30 833 WAITING 8k2b99mtds9fx row cache lock
70214 18-APR-2010 01:45:32 833 WAITING 4s826ffgzzwvv library cache lock
70214 18-APR-2010 01:46:33 833 WAITING 0ybcgnn9jy6jg library cache lock
70214 18-APR-2010 01:46:53 833 WAITING 16qj7vcafhpsy row cache lock
70214 18-APR-2010 01:47:34 833 WAITING 5u0jd8tppsstw library cache lock 837
70214 18-APR-2010 01:49:06 833 WAITING hv59t19bjahqi latch: row cache objects
...

Listing 13: Trace down the top CPU SQL session 833 at the CPU spike time

Page 10 ■ 2nd Qtr 2011


SQL>@ash_session_for_sid 2 70212 70214 837
Session Blocking
Snap Id Sample Time Sid State Sql Id Event Session
--------- ------------------------------ ------ ------------ --------------- ---------------------------------------- --------
...
70214 18-APR-2010 01:40:57 837 WAITING h9fcdzjnksq78 library cache lock 813
70214 18-APR-2010 01:41:07 837 WAITING h9fcdzjnksq78 library cache lock 813
70214 18-APR-2010 01:41:27 837 WAITING h9fcdzjnksq78 library cache lock 813
70214 18-APR-2010 01:41:38 837 WAITING h9fcdzjnksq78 library cache lock 805
70214 18-APR-2010 01:41:48 837 WAITING h9fcdzjnksq78 library cache lock 809
70214 18-APR-2010 01:41:58 837 WAITING h9fcdzjnksq78 library cache lock 1111
70214 18-APR-2010 01:42:08 837 WAITING h9fcdzjnksq78 library cache lock 805
70214 18-APR-2010 01:42:18 837 WAITING h9fcdzjnksq78 library cache lock 852
70214 18-APR-2010 01:42:28 837 WAITING h9fcdzjnksq78 library cache lock 1113
70214 18-APR-2010 01:42:38 837 WAITING h9fcdzjnksq78 library cache lock 557
70214 18-APR-2010 01:42:49 837 WAITING h9fcdzjnksq78 library cache lock 833
70214 18-APR-2010 01:42:59 837 WAITING h9fcdzjnksq78 library cache lock 1113
70214 18-APR-2010 01:43:09 837 WAITING h9fcdzjnksq78 library cache lock 1113
70214 18-APR-2010 01:43:19 837 WAITING h9fcdzjnksq78 library cache lock 851
70214 18-APR-2010 01:43:30 837 WAITING h9fcdzjnksq78 library cache lock 833
70214 18-APR-2010 01:43:40 837 WAITING h9fcdzjnksq78 library cache lock 1111
70214 18-APR-2010 01:43:50 837 WAITING h9fcdzjnksq78 library cache lock 813
70214 18-APR-2010 01:44:00 837 WAITING h9fcdzjnksq78 library cache lock 805
...

Listing 14: Trace down blocking session 837 at the CPU spike time

row cache/library cache contention as the stats collection invalidated cursors


SQL> @ash_blocking_session 2 70212 70214
Blocking Waiting Session
that needed to be reparsed and reloaded into the SQL cache.
Event Session Count
---------------------------------------- -------- ---------------
When the CPU spike happened in the past, our approach has also proven to be
row cache lock 249 efficient, without many other options to explore as when CPU spike issue is
latch: row cache objects 190 actually occurring.
direct path read 143
library cache lock 124
db file scattered read 102 Conclusion
library cache lock 837 99
db file sequential read 39 AWR and ASH are considered to be great tools that Oracle provides to
PX Deq Credit: send blkd 32 troubleshoot the database for the purpose of determining and solving
gc cr multi block request 28
performance issues. Combined with ASH data, AWR data has become more
Listing 15: Check blocking session information for all sessions at the CPU spike time meaningful in determining issues occurring at both instance-wide as well as
at the session level. There are many possibilities of exploring the usage of AWR
and ASH data—our top CPU SQL approach to untangle CPU spike issue is just
SQL> @ash_top_events 2 70212 70214 one good example. Our team continues to learn about performance issues,
...
refining this method as new situations are encountered.
Begin snapshot id=70213 End snapshot id=70214 End snapshot time=18-APR-2010 02:00
Instance Blocking Waiting Session
Number Event Session Count
---------------------------------------------------------------------------------- ■ ■ ■ About the Authors
2 row cache lock 203
2 latch: row cache objects 145 Sai S. Selvaganesan works for Yahoo as a senior engineering
2 library cache lock 124 manager with more than 15 years of experience in oracle databases.
2 db file scattered read 73
2 library cache lock 837 62 He can be reached at ssaisundar@yahoo.com.
2 db file sequential read 36
2 PX Deq Credit: send blkd 32 Lei Zeng works for Yahoo as a senior Oracle DBA. She has 10 years
2 gc cr multi block request 23 experience in Oracle database management and administration. She
C
Listing 16: Review top wait events at the CPU spike time in ASH data
can be reached at leizeng2003@yahoo.com.

SQL> @ash_session_details 2 70212 70214 837 h9fcdzjnksq78

Sid Sql Id Username Module Action Program Service Name


------ --------------- ------------ -------------------- --------------- ----------------------- ----------------
837 h9fcdzjnksq78 DBA DBMS_SCHEDULER DAILY_DB_STATS oracle@db-box (J001) dba_maintenance

Listing 17: Check the application-level information for the top CPU SQL session 837

2nd Qtr 2011 ■ Page 11

You might also like