Exadata PM Process
Procedures
2013 Oracle Corporation. Confidential Oracle Internal and Approved Partners Only.
1 Exadata Database Machine Preventive Maintenance Service
Overview:
The Oracle Exadata Database Machine is an easy to deploy solution for hosting the Oracle
Database that delivers the highest levels of database performance available. The Exadata
Database Machine is a cloud in a box composed of database servers, Oracle Exadata Storage
Servers, an InfiniBand fabric for storage networking and all the other components required to
host an Oracle Database. It delivers outstanding I/O and SQL processing performance for online
transaction processing (OLTP), data warehousing (DW) and consolidation of mixed workloads.
Extreme performance is delivered for all types of database applications by leveraging a
massively parallel grid architecture using Real Application Clusters and Exadata Storage
Servers.
To provide the highest levels of database performance available, Exadata includes memory
components used for cache that require batteries and Electronic Storage Modules (ESMs) to
ensure protection of the data in the event of an unplanned power loss. As the batteries and
ESMs near the thresholds of their life expectancy, their ability to uphold the cache is
reduced. When this occurs, caching is turned off and a performance impact will be seen.
Disabling the data caching is done without any need to reboot the server, and there is no impact
to the quality of the results delivered. The values specified below should give the Customer at
least 90 days to effect replacements before any issues arise.
During the second year of service, Oracle will begin an annual Preventive Maintenance (PM)
service to ensure the integrity and serviceability of the Exadata Database Machine. The PM
service consists of a comparison of the machine against the current Exadata Best Practices,
noting any deviations, and replacement of the battery and ESM components before they reach
their life expectancy. This service will be provided only to Customers with a Premier Support
for Systems or equivalent Hardware support contract that is valid at the time the PM service is due.
The service will be performed on site by the local Oracle Services teams in co-ordination with
the Customer, and may be run on a live system. There is no impact on the performance of the
live system if completed as described.
A Service Request and task for the field to perform the PM service will be automatically opened
by Oracle Global Customer Support based on the installation date and location of the Exadata
unit. The local field will then contact and notify the Customer approximately 90 days in
advance of the scheduled PM completion date and explain the PM process to the Customer if
they are not familiar with it. The local field should co-ordinate and schedule a preparation visit
with the Customer, during which all parts required will be identified. The local field will then
schedule and co-ordinate the parts replacement visit with the Customer.
A general timeline has been established for proactive replacement of batteries and ESMs. The
listed years of service due assume no extraordinary conditions, such as operating at high
temperatures, that may degrade the expected usable life of the battery and ESM components.
These timelines are established to ensure the continued best operating performance of the
Exadata Database Machine.
Product                                     Part       Year 1    2     3     4     5     6     7
Exadata V2                                  Battery      No    Yes    No   Yes    No   Yes    No
                                            ESM          No     No   Yes    No    No   Yes    No
Exadata X2-2, X2-8 & X2-2                   Battery      No    Yes    No   Yes    No   Yes    No
Expansion Rack                              ESM          No     No    No   Yes    No    No    No
Exadata X3-2, X3-8 & X3-2                   Battery      No    Yes    No   Yes    No   Yes    No
Expansion Rack
Notes:
1. Even though all the batteries or ESMs may not be flagged as critical in system output,
Oracle will replace ALL batteries and ESMs in the rack according to the timeline shown
above. Exadata PM SRs that are opened by Oracle on behalf of the customer should
replace ALL batteries and/or ESMs as appropriate for the PM year since install date.
2. The PM Service will be provided during year 6 (and later) only if the system is within 5
years of last ship date or Extended support has been offered.
3. For a definition of terms used in this document, refer to section 7 at the end of the
document.
4. There are no ESMs used in Exadata X3-2, X3-8 and X3-2 Expansion Racks.
5. Where this document refers to an Oracle field service engineer, it also applies to and refers
to Oracle Authorized Service Partner engineers in locations where Oracle does not
provide direct service.
Early replacement of all batteries in the rack is warranted:
if more than 1/4 of the batteries in the machine are below the failed threshold (<600mAh
BBU07 or <674mAh BBU08), or
if more than 1/2 of the batteries in the machine are below the proactive replacement threshold
(<800mAh BBU07 or BBU08).
The TSC x64 owner of the SR will verify these criteria are met based on battery outputs from the
machine, and if so, will open an Exadata PM task to the field to schedule early replacement of
all the batteries in the rack. If neither criterion is met, then the TSC x64 owner will open a
normal field task to do break/fix of the identified batteries in the machine that are below the
proactive replacement within 90 days threshold (<800mAh BBU07 or BBU08) per MOS Note
1329989.1.
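The early-replacement decision above can be sketched as a small helper. This is a hypothetical illustration (the function name and invocation style are not part of any Oracle tool); it assumes BBU07 thresholds (<600mAh failed, <800mAh proactive) — for BBU08 the failed threshold would be 674mAh.

```shell
#!/bin/sh
# Hypothetical sketch of the early-PM criteria described above.
# Arguments: total battery count, then each battery's Full Charge Capacity (mAh).
check_early_pm() {
  total=$1; shift
  failed=0; proactive=0
  for cap in "$@"; do
    if [ "$cap" -lt 600 ]; then failed=$((failed + 1)); fi
    if [ "$cap" -lt 800 ]; then proactive=$((proactive + 1)); fi
  done
  # Early PM if more than 1/4 failed, or more than 1/2 below proactive threshold
  if [ $((failed * 4)) -gt "$total" ] || [ $((proactive * 2)) -gt "$total" ]; then
    echo "early-PM"
  else
    echo "break-fix"
  fi
}
```

For example, 3 failed batteries out of 8 exceeds the 1/4 threshold, so an early PM task covering all batteries in the rack would be opened rather than a break/fix task for the failed units only.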
2 Preparation for PM
a) The TSC engineer assigned to the SR will review the Exachk output and provide
guidance to the customer via the SR, and via the field task if any field-actionable
issues are noted.
b) Explain the need for the minimum Exadata image version of 11.2.2.1.1 required to work
with the replacement batteries, such that the customer must update the image prior to
the scheduled PM activity. Oracle is no longer able to supply batteries that work with
earlier image versions, and the Exadata performance may be reduced until such time
as the image can be updated to perform the scheduled PM activity. If Exadata
software updates for minimum firmware levels are required:
The Customer should patch update the Exadata software image to the latest
available. (See MOS Note 888828.1)
Patch updating can be done at the same time as this service if the customer
chooses to do multiple change activities in the same outage, but is not provided
specifically as a part of this service.
It is the customer's responsibility to upgrade and patch the system(s), either by
themselves or via a separate optional Advanced Customer Services (ACS)
patching service which can be purchased to perform these tasks.
c) Determine a date for the on-site planning visit. PM allows up to 90 days to complete
the service.
The PM process overview and Exachk collection can also be done during the on-site planning
visit if it is preferred by the customer.
Determine the size of the Exadata rack and any additional systems such as add-on
Exadata Storage Servers or upgrade additions that will require PM replacement at the
same time. If there are add-on servers, also note the generation type (V2 vs. X2-2 vs.
X3-2) of the add-on systems as they may have a different time line for PM component
requirements as listed in section 1.
Determine if any site access issues will affect access and scheduling of replacements
Determine with the Customer if this will be a Full Machine Down Replacement or a
Rolling System Update impacting only one server at a time. Refer to Section 2.6 below
for details.
Determine the date that the PM part replacement service task will be completed. The
process allows for up to 90 days to complete the replacements.
The on-site planning visit may also be used to provide the customer with an overview of
the PM service, and to assist them in obtaining and running Exachk if the customer
requires it. (Manual gathering of battery or ESM health may be done instead; refer to
Section 2.4 for details.)
Upon completion of the on-site PM planning visit, the field service team should:
Validate with the TSC engineer any additional parts that Exachk issues may require for
resolution
Provide the TSC engineer the date and time for when the PM part replacement is
scheduled. This allows TSC to update the SR date and set it for proper auto-closure
when the field tasks are closed.
At 2 weeks prior to the scheduled PM part replacement date (or as soon as possible if <2 weeks)
Order the appropriate quantity of replacement parts for Batteries and/or ESM's for each
of the server units in the rack identified, and the part types scheduled for this current
year's service. Refer to Section 2.5 for details and Section 6 for part numbers.
Order replacement cables, CMA's and any other part that is faulted based on the
information gathered above and have them delivered to the Customer site. Refer to
section 6 for part numbers and the internal Oracle Sun System Handbook
(http://support.us.oracle.com/handbook_internal/).
Due to the quantity of parts, expect them to take 2 weeks to arrive on-site.
Acquire a #1 and #2 Phillips screwdriver.
NOTE: Service logistics is notified of scheduled PMs on the same 90 day time
scale as the SR and field tasks being opened. To assist logistics in ordering and
placing parts, do NOT order parts as soon as the task is opened. Waiting until 2
weeks prior to the scheduled PM service date before ordering parts should
provide sufficient time to get parts within the 90 day window for completion of
the PM service.
Visually inspect to identify whether any amber LED's are on. If any are, verify with the customer
whether there is a separate Service Request open that needs to be resolved prior to the PM
service being carried out.
Visually inspect the rear of the Exadata Database Machine for any Cable Management
Arms (CMA's) which may be damaged or bent.
Visually inspect all cables for bend radii which are extremely tight and do not have
adequate slack to allow for proper usage of the CMA's or for FRU removal.
Oracle Exadata V2, X2-2, X2-8, X3-2 & X3-8 Machine Spares Kits
Part Description                     Full Rack              Half Rack             Quarter Rack
InfiniBand (IB) cables               (6x) 3m cables &       (8x) 5m cables        (4x) 5m cables
(in external additional              (10x) 5m cables
parts boxes)
Disk Drive                           Two                    One                   One
Sun Flash cards                      Two                    One                   One
5M InfiniBand (tied inside rack)     Two                    One                   One
Ethernet cables1 (tied inside rack)  one each blue, red,    one each blue, red,   one each blue, red,
                                     black, orange          black, orange         black, orange
Keys (tied inside rack)              2 sets of 2 keys to open the rack doors and side panels
                                     (same for each rack size)
Note: The InfiniBand cables in the external parts boxes are intended to be used for connecting
multiple racks together. The InfiniBand cable in the spare bundle looped inside the rack is
intended for break/fix use. If the system is part of a multi-rack environment, some of the
InfiniBand cables from the spares pool will be used in that implementation. Adding
individual Storage Cells to an Exadata Database Machine may also use some of the InfiniBand
cables from the spares pool.
Note: The Sun Flash cards contained in the Spares Kits that contain an ESM do not require
replacement during PM service. The ESM does not have a shelf-life limit like a battery, as it
does not degrade unless it is being powered and charged like the ESMs in use in the
system.
Reference MOS Note 1323593.1 for more information regarding the spares kit.
If the spare parts pool is not complete, determine what parts are missing.
If the parts cannot be located, then the Customer should be advised that the on-site spares are
part of the quick remediation process purchased by the Customer, and that having them missing
may inhibit that delivery. An effort should be made to locate the missing components, or the
purchase of replacement parts should be considered.
1 Orange Ethernet cables are included only in Exadata V2, X2-2 systems and X2-2 Expansion Racks.
2.4 Preventive Maintenance Checks
The purpose of the PM checks is to both ensure the system is operating and meeting current
Exadata best practices, and to determine the current health of the battery and/or ESM
components in order to determine whether they will last the full 90 day period given to
complete the replacement of them.
See the My Oracle Support (MOS) references at the end of this section for further information
on 'exachk' and the specific checks it performs, as well as how to download the latest version.
Once a Customer has been notified that there is an upcoming PM action, the customer should
download and run the latest version of 'exachk'. The 'exachk' tool takes approximately 1
hour to run on a full rack (shorter on smaller racks), and is based on the current Best Practices
document for Exadata Database Machine. These checks include the validation of the battery
and ESM states for the purposes of determining whether the full 90 day period provided to
replace these is available, or if the current health of those components requires them to be
replaced sooner rather than later within the 90 day period.
The Oracle Exadata Database Machine 'exachk' will collect data regarding key software,
hardware, and firmware versions and configuration best practices specific to the Oracle Exadata
Database Machine.
The output assists customers to periodically review and cross reference current data for key
components of their Oracle Exadata Database Machine against supported version levels and
recommended Oracle RAC and Exadata best practices.
The Oracle Exadata Database Machine 'exachk' may be executed as desired but should be
executed regularly as part of the maintenance program for an Oracle Exadata Database
Machine. The exachk command does not require close proximity to the system and thus may
be run remotely. The customer should upload the data into the SR for review, and/or provide it
to the field engineer for review. If the customer is uncomfortable running the 'exachk' tool
themselves, then the field engineer or TSC should assist them and show them how to use it.
If the Customer has recently run the 'exachk' tool the output files can be easily viewed again
using the '-f' option to point to the associated output files, and output to the terminal or html.
This will summarize the output to show the informational and warning status messages, leaving
the 'pass' messages silent. To include the 'pass' messages use the '-f -o -v' options before
the filename.
See MOS ID 1329170.1 - Master Reference Note for exachk
See MOS ID 1070954.1 - Oracle Database Machine exachk or HealthCheck for more
information and for download instructions.
See MOS ID 757552.1 - Oracle Exadata Best Practices for more information on specific
checks
Verify with the Customer that dcli is acceptable and available to use.
Log into the servers via the internal KVM if one is available. If one is not available, a Laptop
can also be used and provides logging of your work if Customer policy allows for this.
(Remember <ctrl> <ctrl> is the KVM escape)
If the Customer is using a software image earlier than those presented above, and cannot patch
upgrade, then the PM process must be placed on hold and monitored. RAID cards with batteries
which have reached their lower thresholds will disable their caching to preserve the quality of
the data. This will, however, compromise the performance of the system. All RAID cards with
batteries that are above the lower threshold will continue to work at top speed.
The alternative of replacing all the batteries and patching the image later is not supported, and
Oracle is no longer able to supply older batteries as they are no longer manufactured. RAID
cards without the newer firmware cannot recognize the newer batteries, so ALL caching
will be turned off until the firmware is patch updated. That would be a significant performance
hit.
Exadata software versions should be at least 11.2.2.1.1 before beginning the replacement
procedures.
Remaining Capacity
In an Exadata PM service, the purpose of the capacity check is not to determine whether or not
a specific battery needs replaced, rather it is to determine the urgency of the need to perform the
PM service. The SR will be opened with a 90 day window in which to complete the PM
replacements. The customer outage time and urgency for doing so will be influenced by the
current state of the batteries operating in the rack. Use the following information to discuss
with the customer how soon or late within that 90 days they should scheduled the PM
replacements to occur. Regardless of the actual state of the battery capacity, ALL batteries
within the rack should be replaced for a proactively opened PM SR.
The absolute minimum BBU07 charge required to meet the minimum 48 hours hold-up time is
600mAh. When the BBU07 can no longer hold this much charge, the "Remaining Capacity
Low" setting reported by MegaCli64 will change from the normal "No" to "Yes", which
may be an early warning notice to check whether its "Full Charge Capacity" is getting low.
The absolute minimum BBU08 charge required to meet the minimum 48 hours hold-up time is
674mAh. Note, on BBU08 this may be flagged prematurely due to a firmware bug (Sun CR
7018730) that incorrectly sets the value higher at 960mAh based on incorrect operational
assumptions. If this is being flagged due to this bug, ignore the alert if the "Full Charge
Capacity" value is over 800mAh.
# /opt/MegaRAID/MegaCli/MegaCli64 -AdpBbuCmd -a0 | grep Capacity
If the output shows the remaining capacity below 600mAh and the Full Charge Capacity
near 600mAh, there is an urgent need for replacement. Any batteries which indicate
remaining capacity below the threshold, but have a Full Charge Capacity that is much higher,
should be checked to be sure they are not in a learning cycle. When the two values are close,
failure is imminent; the PM service should be scheduled to occur as soon
as possible, as any further delay will put the customer at risk of experiencing a performance hit
soon, if not already.
The guideline for proactive replacement of BBU07 or BBU08 batteries is to schedule
replacement within 90 days if the value is at 800mAh or less. If batteries are within this range,
then schedule the PM service for the customer's next convenient outage within that 90 day
window.
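The per-battery triage above can be sketched as a small helper. This is a hypothetical illustration (not part of MegaCli or any Oracle tool), assuming BBU07 thresholds; for BBU08 the minimum hold-up value would be 674mAh.

```shell
#!/bin/sh
# Hypothetical triage of one battery's Full Charge Capacity (mAh),
# per the BBU07 thresholds described above.
classify_battery() {
  full_charge=$1
  if [ "$full_charge" -lt 600 ]; then
    echo "replace-asap"             # below the minimum 48-hour hold-up charge
  elif [ "$full_charge" -le 800 ]; then
    echo "replace-within-90-days"   # proactive replacement guideline
  else
    echo "healthy"
  fi
}
```

Note this classifies urgency only; for a proactively opened PM SR, all batteries in the rack are replaced regardless of individual readings.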
Learn Cycle
When a new BBU is installed into a server, it will have a depleted charge state. If the charge is
less than 50% of the Designed Capacity, it will be forced into Write-Through (WT) mode and
run a full learn charge until the BBU has sufficient charge to maintain the cache. This may take
24 hours or longer. Checking its status during this period will show it in WT mode.
Note also that learn cycles occur every 30 days from first power-on for DB nodes, and 4
times a year for Storage Cells. For Storage Cells with image 11.2.1.3.1 or later, the learn cycle is
scheduled quarterly to start at 2AM on January 17, April 17, July 17, and October 17.
This time is chosen to minimize the impact on daytime operations, as WT mode reduces the
write performance of the HBA during this period until the BBU cache is re-enabled into Write-
Back (WB) mode.
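When planning PM scheduling around the quarterly Storage Cell learn cycles, the next scheduled month can be computed as follows. This is an illustrative sketch (the function is hypothetical, and day-of-month is ignored for simplicity).

```shell
#!/bin/sh
# Sketch: month of the next quarterly Storage Cell learn cycle
# (2AM on the 17th of January, April, July, October, per the schedule above).
# Argument: current month number (1-12).
next_learn_cycle_month() {
  month=$1
  for m in 1 4 7 10; do
    if [ "$month" -le "$m" ]; then
      echo "$m"; return 0
    fi
  done
  echo 1   # November/December wrap around to January
}
```

This can help avoid scheduling the rolling PM replacement into a window where a learn cycle has already put the HBAs into WT mode.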
Data Collection
Collect all data supporting a replacement of any batteries and attach it to the SR opened. For
this purpose /opt/oracle.SupportTools/sundiag.sh will collect the appropriate
data.
If LED's are lit on the cards, make note of the indicated component, add the part to your order list,
and replace any of them as part of this PM process.
The Flash F20 cards did not originally come with software which properly recorded the
powered-on hours into a non-volatile programmable memory (FRUID PROM) contained on the
card. As such, it should be assumed that all ESMs should be replaced at 3 years from the date
of system installation. The proper monitoring software was made available in August 2010, but
is not expected to have been immediately put into place on the Customer's system(s).
The Flash F20 monitoring software was delivered with ILOM version 3.0.9.19.a, as part of
Image 11.2.1.3.0, and later versions.
Identification between Flash F20 and Flash F20 M2 cards, which have different ESMs, can be
done by reviewing the FRU data, which will report either F20 (Aura 1.0) or F20 M2 (Aura 1.1)
in the FRU description. Use ILOM or ipmitool fru print or other output as described
in MOS Note 1416397.1.
Exadata Rack Type                             Individual RAID HBA Batteries1   Flash F20 Card ESM Kits
Exadata X3-8 Full Rack                        16                               Not Applicable
Exadata X3-2 Full Rack                        22                               Not Applicable
Exadata X3-2 Half Rack                        11                               Not Applicable
Exadata X3-2 Quarter Rack                     5                                Not Applicable
Individual X3-2 Storage Cell                  1                                Not Applicable
Exadata X3-2 Storage Expansion Quarter Rack   4                                Not Applicable
Exadata X3-2 Storage Expansion Half Rack      9                                Not Applicable
1 The BBU08 Battery Kits are no longer being productized as of November 2011. Order single battery FRU's in the
appropriate quantity, 1 per DB node and 1 per Storage Cell. ESM Kits containing multiple ESMs as described are available.
Exadata Rack Type                             Individual RAID HBA Batteries    Flash F20 Card ESM Kits
Exadata X3-2 Storage Expansion Full Rack      18                               Not Applicable
Exadata X2-8 Full Rack                        16                               x2 Half Rack Kits (56 ESMs total)
Exadata V2, X2-2 Full Rack                    22                               x2 Half Rack Kits (56 ESMs total)
Exadata V2, X2-2 Half Rack                    11                               x1 Half Rack Kit (28 ESMs total)
Exadata V2, X2-2 Quarter Rack                 5                                x1 Quarter Rack Kit (12 ESMs total)
Individual V2, X2-2 Storage Cell              1                                x1 Individual Server ESM Kit (4 ESMs total)
Exadata X2-2 Storage Expansion Quarter Rack1  4                                x4 Individual Server ESM Kits (16 ESMs total)
Exadata X2-2 Storage Expansion Half Rack1     9                                x9 Individual Server ESM Kits (36 ESMs total)
Exadata X2-2 Storage Expansion Full Rack1     18                               x18 Individual Server ESM Kits (72 ESMs total)
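The RAID HBA battery quantities in the tables above can be captured in a small ordering helper. This is a hypothetical sketch (the function name and rack-type keys are illustrative, not an Oracle tool); quantities follow the rule of 1 battery per DB node and 1 per Storage Cell.

```shell
#!/bin/sh
# Hypothetical lookup of RAID HBA battery quantities to order per rack type,
# taken from the tables above.
batteries_for_rack() {
  case "$1" in
    x2-8-full|x3-8-full)                   echo 16 ;;
    v2-full|x2-2-full|x3-2-full)           echo 22 ;;
    v2-half|x2-2-half|x3-2-half)           echo 11 ;;
    v2-quarter|x2-2-quarter|x3-2-quarter)  echo 5 ;;
    storage-cell)                          echo 1 ;;
    *) echo "unknown rack type: $1" >&2; return 1 ;;
  esac
}
```

Remember to add the counts for any add-on Storage Cells or expansion racks attached to the base rack, as noted in the planning-visit steps above.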
Section 6 includes a table with all of the part numbers required as part of this PM procedure.
Should you need to order a broken or failed component that is not on this list please refer to the
Oracle Sun System Handbook found on MOS at:
https://support.us.oracle.com/handbook_internal/index.html
1. Full System Downtime - This approach has a short maintenance window, but the entire
system is down for the duration of the component replacement. This method is preferred
if the customer has a scheduled maintenance window or if the risk of a rolling
replacement is unacceptable.
2. Rolling replacement method - This approach, where one server is taken offline at a
time, has a longer maintenance window, but the overall system is up during the entire
activity. The risks associated with a rolling replacement are:
a) For systems with high redundancy, a double disk failure during the maintenance
window may cause data loss bringing down the entire system and requiring a restore
from backup.
1 There are no Storage Expansion rack specific kits. Field engineers are free to order either individual kits as specified, or
combinations of regular rack kits and individual FRU kits, as necessary for the required quantity of batteries and ESMs.
b) For systems with normal redundancy, a single disk failure during the maintenance
window may cause data loss bringing down the entire system and requiring a restore
from backup.
The decision to perform these tasks in a rolling fashion as opposed to a full down Exadata
Database Machine is entirely the Customer's.
Should the Customer decide to perform any other updates or changes during standard down
time, it is acceptable to do so, however it is not recommended.
Rolling System Down (Single Server down at any given time) - This method is similar to the
Individual Server repair mentioned above. The Customer takes the server out of
production, the Field Engineer completes the required repair, and the customer then
returns the server to use by the Exadata Database Machine. Once the server is back up and
running as part of the Exadata Database Machine, the steps are repeated for the next server
in need. This process continues until the last repair has been completed.
NOTE: During the rolling repair process the Exadata Database Machine will be
without certain redundancies while a server is down. Only one server should be
down at any time. If multiple servers are offline, then there may be data loss and
the entire system will go down requiring the system to be restored from backup.
Make sure the server last repaired has completed its re-integration into the
Exadata Database Machine before you begin work on the next Server.
Full Machine Down (All servers down at once) - Should enough servers require attention, the
Customer may wish to take the entire Exadata Database Machine down. This choice may be
made to prevent an extended exposure to a loss of redundancies, or simply because a convenient
service window allows the opportunity. In this scenario the field engineers may work on
multiple servers at once.
When shutting down the servers, be sure to start with the DB nodes. Doing so quiesces the
servers and the applications on them. The Storage Cells should follow once all DB nodes are
down. When starting them up, begin with the Cells, then proceed to the DB nodes.
A Storage Cell, done individually, takes approximately 1 hour depending on work load. The
last step (verification) can take the most time due to waiting for disks to re-sync. A DB Node
can take about 30 minutes. The basic steps are: (more detail below)
Stop the services & shutdown the server
Effect the Replacement of ESM's & RAID card batteries.
Boot the server
Validate all services have restarted.
If a full system down approach is preferred by the Customer, the big saving is in not having to
shut down, start up, and re-synchronize data on each storage server individually. There is also
benefit in 2 engineers working in parallel to change out parts for the full and half racks. There
are some conflicts in space which arise from this method, which is why the times are not truly
halved.
The rolling replacement does not benefit from having 2 engineers working in parallel, since the
bulk of the time is spent awaiting the shutdowns, start ups, and re-syncs to complete. Remember,
you may not work on 2 storage cells at the same time unless performing the full system down
approach, regardless of redundancy level setting.
3 On-Site Part Replacement - Step by Step
Note: This document is intended for use by Oracle Support engineers and Oracle Authorized
Service Partners only. The commands in this section that may need to be completed by the
customer's database administrator (DBA) may be copied for use by the customer.
Linux:
# shutdown -y -h now
Solaris:
# shutdown -y -i 5 -g 0
3. The Field Engineer turns off the PDU breakers in the rack to shutoff power to servers.
4. The Field Engineer replaces the FRU's:
1. Replace Batteries as necessary according to PM service time line (section 3.4)
2. Replace ESM's as necessary according to PM service time line (section 3.3)
5. The Field Engineer turns on the PDU breakers in the rack to power on servers.
6. Power on all servers via the front power button once ILOM has completed booting. Wait 5
to 10 minutes for the Cells to be fully up and ASM online.
7. The Customer and Field Engineer, after restarting all the cells and compute nodes, log
into the first compute node and run:
# dcli -l root -g <cells_group_file> "service celld status"
rsStatus: running
msStatus: running
cellsrvStatus: running
10. The Customer then validates that ASM and DBS are back:
# dcli -l root -g <dbs_group_file> "ps -ef | grep pmon"
# dcli -l root -g <dbs_group_file> "<GI_HOME>/bin/crsctl check crs"
NOTE: Make sure the server has completed its re-integration into the Exadata
Database Machine before you begin work on the next Server. You should never
have 2 servers down at once, not even 2 disk drives, unless the server has failed them
before you arrived!
Storage Cell replacement consists of the following basic actions: (details in Sections 4.1 to 4.3
below)
Take the grid disks offline - Customer (Section 4.1)
Shut down the server - Customer or Field Engineer (Section 4.1)
Part Replacement - Field Engineer (either or both if necessary)
Single RAID card battery (Section 3.4 and Section 3.5)
Four Flash card ESM's (Section 3.3)
Clear F20 fault status if set (Section 3.3)
Boot the server - Customer or Field Engineer (Section 4.2)
Verify all storage is present and all connections are correct - Customer (Section 4.2)
Activate the Grid Disks - Customer (Section 4.3)
Verify that all Grid Disks are back online - Customer (Section 4.3)
Each Database Node takes approximately 30 minutes and consists of the following steps:
Stop the CRS services - Customer (Section 4.4)
Shut down the server - Customer or Field Engineer (Section 4.4)
Part Replacement - Field Engineer
Single RAID card battery (Section 3.4 and Section 3.5)
Boot the server - Customer or Field Engineer (Section 4.5)
Validate all services have restarted - Customer (Section 4.6)
3. Extract the Flash HBA cards from the PCI riser assembly, 1 card at a time.
The F20 Card has the ESM located in the centre of the card, with FMODs on either side of it.
The assembly part number label is located on the front of the card near the card edge connector
between the disk controller and rear FMODs.
1. Extract the Flash card from the PCI riser assembly, 1 card at a time.
2. Remove the two ESM assembly retaining pins on the back of the card.
a) First, remove the center pin from each retaining pin.
b) Next, push the outer section of each retaining pin through the card and remove them.
3. Carefully slide the ESM assembly (the ESM shroud and the ESM) off the card without
disturbing FMOD0 or FMOD3.
4. Using a pair of wire cutters, clip the ESM cable near the ESM end. This will allow
removal of the cable without needing to unscrew clips and remove FMOD0.
5. Disconnect the ESM cable from connector J803 on the card using the remaining tail.
The F20 Card has the ESM located in the centre of the card, with FMODs on either side of it.
The assembly part number label is located on the front of the card near the card edge connector
between the disk controller and rear FMODs.
2. Place the ESM assembly next to the board, then slide it gently onto the card. Carefully
route the cable and plug between FMOD0 and FMOD1 while sliding it on.
3. Install the two retaining pins from the back of the card
a) First, install the outer section of each retaining pin.
b) Next, install the center section of each retaining pin.
4. Connect the ESM plug to J803 on the card, routing the ESM cable around the retainer
clip holding FMOD0 and FMOD1, with the cable laying between the 2 FMODs.
5. Install the card back into the riser assembly in the same slot.
Reverse the previous steps to re-install the PCIe Riser back into the server.
The F20 Card has the ESM located on the rear of the card next to the SAS cable connector. The
assembly part number label is located next to the orange WWN label on the rear side of the
card.
1. Locate the plastic retaining clip for the ESM plastic housing on the rear side of the card.
2. With a small tool such as the tip of a screwdriver, carefully press the clip down while
pushing the housing off the rear end of the PCI card.
Installing the ESM on F20 M2 Card (541-4417)
The F20 Card has the ESM located on the rear of the card next to the SAS cable connector. The
assembly part number label is located next to the orange WWN label on the rear side of the
card.
3. Slide the ESM assembly feet carefully onto the board, one into the slotted hole and the other
onto the end of the PCI card. There should be an audible click when the retaining clip engages
in its slot.
Reverse the previous steps to re-install the PCIe Riser back into the server.
Clearing the ESM Fault Status (Exadata V2, X2-2 and X2-8 with F20 cards)
The ESM power-on monitoring feature for F20 cards in Exadata V2, X2-2 and X2-8 is
implemented in ILOM software and requires manual clearing if faults are present; physical
replacement of the ESM does not clear them automatically. This does not apply to Exadata
X2-2 and X2-8 units with F20 M2 cards, for which ILOM manages the thresholds automatically.
The monitoring feature was added in ILOM 3.0.9.19.a and is contained in Exadata software
image version 11.2.1.3.1 and later.
1. After the system is plugged in and ILOM has booted, log in to ILOM on the Storage Cell
as the root user.
2. For each ESM that was replaced, check whether the fault_state is set to critical.
b) If the fault_state shows OK, the ESM may not yet have reached the power-on-hours
threshold needed to flag it as critical. This can happen because the unit had only come
close to the threshold by the time the PM was performed, or because an interim ILOM
flash update reset the counter to 0. The fault_state can be manually set to critical as
follows:
1. On systems with image 11.2.3.2.1 and earlier with ILOM version v3.0.9.x
through v3.0.16.10.d, enter the fault management shell in ILOM CLI:
faultmgmtsp>
2. Run the following to mark the card failed:
-> etcd -i ereport.chassis.device.esm.eol.warning@/SYS/MB/RISER1/PCIE1/F20CARD
/SYS/MB/RISER1/PCIE4/F20CARD
/SYS/MB/RISER2/PCIE2/F20CARD
/SYS/MB/RISER2/PCIE5/F20CARD
Notes:
ILOM v3.0.9.19.a on Exadata V2 systems with image 11.2.2.2.2 and earlier has a bug that
prevents slot PCIE4 from reporting the presence of the F20 flash card. Skip that slot if
the system has this problem.
ILOM v3.0.9.19.a on Exadata V2 systems and v3.0.9.27.a on Exadata X2-2/X2-8
systems have a bug that programmed the thresholds to 2 years (17200 hours) instead of 3
or 4 years, so the fault status may already have been triggered and cleared.
On Exadata X2-2/X2-8 systems, the threshold may report 3 years (26220 hours) instead
of 4 years (35052 hours) if the system_identifier property in ILOM /SP is not
programmed to the standard Exadata identity string 'Exadata Database Machine X2-2'
(or X2-8), which identifies this card as being in an Exadata rather than a regular
X4270 M2 system. This may be the case on V2 systems upgraded with X2-2 servers if the
identity string was changed to the V2 rack string 'Sun Oracle Database Machine'.
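The per-slot fault check described above can be scripted. A minimal sketch that prints the ILOM `show` command for each of the four F20 slot paths listed in this procedure (whether your ILOM version exposes `fault_state` on these exact targets should be verified first, and slot PCIE4 skipped on affected V2 systems per the note above):

```shell
#!/bin/sh
# Print an ILOM 'show' command for each F20 card slot so the
# fault_state of its ESM can be checked one slot at a time.
# The four slot paths come from this procedure.
f20_fault_checks() {
    for slot in /SYS/MB/RISER1/PCIE1/F20CARD \
                /SYS/MB/RISER1/PCIE4/F20CARD \
                /SYS/MB/RISER2/PCIE2/F20CARD \
                /SYS/MB/RISER2/PCIE5/F20CARD; do
        echo "show $slot fault_state"
    done
}
f20_fault_checks
```

Paste the printed commands into the ILOM CLI session one at a time, or feed them over an ssh session to the service processor.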
Follow the instructions in Section 4.2 and 4.3 to bring the storage cell back into service and to
verify that all components are working as expected.
NOTE: Do not begin to replace components in other Storage Cells until all
components in this one are back online and have re-silvered.
3.4 Battery replacement in Exadata V2, X2-2, X3-2 Database Machine
Compute nodes and all Storage Cells:
Note: This procedure is for the X4170 and X4270 M2 Compute Nodes in the Exadata Database
Machine X2-2, and for any of the Storage Cells. Instructions for battery replacement in the
X4800 of the Exadata Database Machine X2-8 follow in the next section.
If you are using the Rolling Upgrade methodology, the Storage Cells will need to be taken
offline one at a time to perform this procedure. Please follow Section 4.1 for instructions
before beginning the replacement. Database Nodes should follow the steps in Section 4.4. If
using the Full System Down method, proceed with the next step.
For each server in the Exadata Database Machine you should perform the
following steps:
Preparing the Server (Storage Cell or Compute Node) for service
1. Disconnect the SAS cables from the HBA PCI card, making a note of which port each
cable goes into.
4. Extract the RAID HBA card from the PCI riser assembly.
1. Use a No. 1 Phillips screwdriver to remove the 3 retaining screws that secure the battery
to the HBA from the underside of the card only.
Do NOT attempt to remove any screws from the battery on the top side of the HBA.
2. Detach the battery pack including circuit board from the HBA by gently lifting it from
its circuit board connector on the top side of the HBA.
Reverse the previous steps to re-install the new battery on the HBA, and reinstall the PCIe
card and PCIe riser back into the server. Take care to get the SAS cables re-connected to
the same ports they were removed from, as accidentally reversing them may affect disk slot
mappings.
Follow the instructions in Section 4.2 and 4.3 to bring the storage cell back into service and to
verify that all components are working as expected. For Database nodes follow the steps
outlined in sections 4.5 and 4.6.
NOTE: Do not begin to replace components in other Storage Cells until all
components in this one are back online and have re-silvered.
Power Off
Remove CMOD0 from the server, setting it on a flat, antistatic surface with ample space
and light.
Remove the CMOD cover.
1. Lift the REM ejector handle and rotate it to its fully open position.
2. Lift the connector end of the REM and pull the REM away from the retaining clip on the
front support bracket.
3. To remove the battery, use a No. 1 Phillips screwdriver to remove the 3 retaining screws
that mount the battery to the REM.
4. Detach the battery pack including circuit board from the REM by gently lifting it from
its circuit board connector.
Install the new battery and reinstall the REM into the server.
1. Attach the battery pack to the REM by aligning the circuit board connectors and gently
pressing together.
2. Secure the new battery to the underside of the REM using the 3 retaining
screws.
3. Ensure that the REM ejector lever is in the closed position. The lever should be flat with
the REM support bracket.
4. Position the REM so that the battery is facing downward and the connector is aligned
with the connector on the motherboard.
5. Slip the opposite end of the REM under the retaining clips on the front support bracket
and ensure that the notch on the edge of the REM is positioned around the alignment
post on the bracket.
6. Carefully lower and position the connector end of the REM until the REM contacts the
connector on the motherboard, ensuring that the connectors are aligned. To seat the
connector, carefully push the REM downward until it is in a level position.
1. Install the cover on the CMOD and return the CMOD to the CMOD0 slot in the unit.
Replace any broken cables or cable management arms (CMAs) discovered in the earlier visual
inspection. Refer to MOS Note 1444683.1 for handling instructions and training.
Replace any other parts that require replacement according to the appropriate canned action
plans for those parts on MOS.
4 Starting, Stopping and Verifying the Exadata sub-systems
The following sections explain how to shut down, start up and verify the servers within an
Exadata Database Machine.
NOTE: This document is intended for use by Oracle Support engineers and approved service
partners only. The commands in this section that need to be completed by the customer's
database administrator (DBA) may be copied for use by the customer.
As long as the value is large enough to comfortably replace the batteries and ESMs in a
storage cell, there is no need to change it.
...snip...
DATA_CD_09_cel01 ONLINE Yes
DATA_CD_10_cel01 ONLINE Yes
DATA_CD_11_cel01 ONLINE Yes
RECO_CD_00_cel01 ONLINE Yes
RECO_CD_01_cel01 ONLINE Yes
etc....
If one or more disks return asmdeactivationoutcome='No', you should wait for some time
and repeat the query until all disks return asmdeactivationoutcome='Yes'.
NOTE: Taking the storage server offline while one or more disks return a status of
asmdeactivationoutcome='No' will cause Oracle ASM to dismount the affected disk
group, causing the databases to shut down abruptly.
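The wait-and-repeat check above can be scripted against a captured copy of the grid disk attribute listing. A sketch that succeeds only when no disk still reports asmdeactivationoutcome='No' (the cellcli invocation in the comment is an assumption; adjust it to your environment):

```shell
#!/bin/sh
# Return success (exit 0) only when no grid disk line in the captured
# attribute listing still ends with asmdeactivationoutcome 'No'.
all_disks_safe() {
    ! grep -Eq '[[:space:]]No$' "$1"
}

# Polling sketch (cellcli invocation is an assumption):
#   until cellcli -e "LIST GRIDDISK ATTRIBUTES name, asmdeactivationoutcome" > /tmp/gd.txt &&
#         all_disks_safe /tmp/gd.txt; do sleep 60; done
```

Only proceed to inactivate the grid disks once the check passes for every disk on the cell.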
3. Run the CellCLI command to inactivate all grid disks on the cell you wish to power down or
reboot (this could take 10 minutes or longer):
# cellcli
...
CellCLI> ALTER GRIDDISK ALL INACTIVE
GridDisk DATA_CD_00_dmorlx8cel01 successfully altered
GridDisk DATA_CD_01_dmorlx8cel01 successfully altered
GridDisk DATA_CD_02_dmorlx8cel01 successfully altered
GridDisk RECO_CD_00_dmorlx8cel01 successfully altered
GridDisk RECO_CD_01_dmorlx8cel01 successfully altered
GridDisk RECO_CD_02_dmorlx8cel01 successfully altered
...etc...
4. Execute the command below; once the disks are offline and inactive in ASM, the output
should show asmmodestatus='UNUSED' or 'OFFLINE' and asmdeactivationoutcome='Yes' for
all grid disks.
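The attribute listing referenced in step 4 is likely `LIST GRIDDISK ATTRIBUTES name, asmmodestatus, asmdeactivationoutcome` (an assumption; the command itself did not survive in this copy of the document). A sketch that validates a captured listing, accepting only UNUSED or OFFLINE with outcome Yes on every line:

```shell
#!/bin/sh
# Verify a captured grid disk listing: every line must show
# asmmodestatus UNUSED or OFFLINE and asmdeactivationoutcome Yes.
# Capture the listing first, e.g. (command text is an assumption):
#   cellcli -e "LIST GRIDDISK ATTRIBUTES name, asmmodestatus, asmdeactivationoutcome" > gd.txt
griddisks_offline() {
    awk '($2 != "UNUSED" && $2 != "OFFLINE") || $3 != "Yes" { bad = 1 }
         END { exit bad }' "$1"
}
```

A non-zero exit means at least one disk is still in use; wait and re-capture before shutting the cell down.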
You can now shut down the Cell using the following command:
# shutdown -h -y now
Disconnect the power cords before opening the top of the server
[1:0:3:0] disk ATA MARVELL SD88SA02 D20Y /dev/sdp
[2:0:0:0] disk ATA MARVELL SD88SA02 D20Y /dev/sdq
[2:0:1:0] disk ATA MARVELL SD88SA02 D20Y /dev/sdr
[2:0:2:0] disk ATA MARVELL SD88SA02 D20Y /dev/sds
[2:0:3:0] disk ATA MARVELL SD88SA02 D20Y /dev/sdt
[3:0:0:0] disk ATA MARVELL SD88SA02 D20Y /dev/sdu
[3:0:1:0] disk ATA MARVELL SD88SA02 D20Y /dev/sdv
[3:0:2:0] disk ATA MARVELL SD88SA02 D20Y /dev/sdw
[3:0:3:0] disk ATA MARVELL SD88SA02 D20Y /dev/sdx
[4:0:0:0] disk ATA MARVELL SD88SA02 D20Y /dev/sdy
[4:0:1:0] disk ATA MARVELL SD88SA02 D20Y /dev/sdz
[4:0:2:0] disk ATA MARVELL SD88SA02 D20Y /dev/sdaa
[4:0:3:0] disk ATA MARVELL SD88SA02 D20Y /dev/sdab
There should be 16 FMODs found with the MARVELL label and 12 disks found by the search
on the LSI label. If the device counts above are not correct, the server should be re-opened
and the device connections checked to be sure they are secure BEFORE the following CellCLI
commands are issued.
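The device counts can be verified with a quick pipeline over a captured copy of the SCSI device listing (a sketch; the exact vendor labels may differ by firmware revision):

```shell
#!/bin/sh
# Count flash modules (MARVELL label) and spinning disks (LSI label)
# in a captured lsscsi listing; a storage cell should show 16 and 12.
count_devices() {
    echo "$(grep -c 'MARVELL' "$1") $(grep -c 'LSI' "$1")"
}
```

Compare the printed pair against the expected "16 12" before issuing any CellCLI commands.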
Customer Activity:
1. Once the operating system is up, you will need to activate the grid disks.
# cellcli
CellCLI> alter griddisk all active
GridDisk DATA_CD_00_dmorlx8cel01 successfully altered
GridDisk DATA_CD_01_dmorlx8cel01 successfully altered
GridDisk DATA_CD_02_dmorlx8cel01 successfully altered
GridDisk RECO_CD_00_dmorlx8cel01 successfully altered
GridDisk RECO_CD_01_dmorlx8cel01 successfully altered
GridDisk RECO_CD_02_dmorlx8cel01 successfully altered
...etc...
2. Issue the command below and all disks should show 'active':
CellCLI> list griddisk
DATA_CD_00_dmorlx8cel01 active
DATA_CD_01_dmorlx8cel01 active
DATA_CD_02_dmorlx8cel01 active
RECO_CD_00_dmorlx8cel01 active
RECO_CD_01_dmorlx8cel01 active
RECO_CD_02_dmorlx8cel01 active
...etc...
Customer Activity:
Verify all grid disks have been successfully put online using the following command. Wait until
asmmodestatus is ONLINE for all grid disks. The following is an example of the output early in
the activation process.
Notice in the above example that RECO_CD_00_dmorlx8cel01 is still in the 'SYNCING' process.
Oracle ASM synchronization is only complete when ALL grid disks show
asmmodestatus=ONLINE. This process can take some time, depending on how busy the
machine is and on how busy it was while this individual server was down for repair.
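The resilvering wait can be expressed the same way as the earlier checks: accept the cell only when every grid disk line in a captured listing shows asmmodestatus ONLINE (a sketch; any SYNCING line keeps it failing):

```shell
#!/bin/sh
# Exit 0 only when every grid disk line in the captured listing
# reports asmmodestatus ONLINE; a SYNCING line means keep waiting.
all_online() {
    awk '$2 != "ONLINE" { bad = 1 } END { exit bad }' "$1"
}
```

Re-capture and re-check periodically; do not move on to the next Storage Cell until this passes.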
This command shouldn't return any records if there are no CRS services running.
NOTE: Stopping CRS on one node may require modifying CRS services to run on a different
node.
3. You can now shut down the DB node using the following command:
Linux:
# shutdown -y -h now
Solaris:
# shutdown -y -i 5 -g 0
In the output above, the '1' in +ASM1 refers to the DB node number; for example, on DB node
#3 the value would be +ASM3.
2. Validate that instances are running:
# ps -ef |grep pmon
It should return a record for the ASM instance and a record for each database.
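The pmon check can be summarized from a captured `ps -ef` listing using the standard Oracle process-name prefixes, asm_pmon_* for the ASM instance and ora_pmon_* for each database (a sketch; instance names are examples):

```shell
#!/bin/sh
# Summarize pmon processes from a captured 'ps -ef' listing:
# one asm_pmon per ASM instance, one ora_pmon per database.
pmon_summary() {
    echo "asm=$(grep -c 'asm_pmon' "$1") db=$(grep -c 'ora_pmon' "$1")"
}
```

An asm count of 1 and a db count matching the number of databases on the node indicates all instances restarted.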
5 After the Replacements are done
Once the entire PM process is complete the 'exachk' utility should be re-run to verify that all is
well. See MOS ID 1329170.1 - Master Reference Note for exachk and MOS ID 1070954.1 -
Oracle Database Machine exachk or HealthCheck for more information and for download
instructions.
The engineer should complete all required documentation prior to closing the service request
tasks for providing this PM service.
6 Parts List:
NOTE: The RAID Card Battery Kits that were being developed will no longer be made as of November 2011. Order
single battery FRUs in the appropriate quantity, 1 per server (DB nodes and storage cells).
541-4416 BD,PCI Express Flash Board, (AURA 1.1)
371-5014 DOM, SS-FLASH, 32GB/24GB Solid State Flash Memory
Module, D20Y firmware (AURA 1.1)
371-4953 5.5V, 11F, Capacitive Backup Power Module, (ESM)(AURA 1.1)
7061269 DOM, SS-FLASH, 32GB/24GB Solid State Flash Memory
Module, D21Y firmware (AURA 1.1)
NOTE: It is preferred NOT to order individual F20 ESMs for PM; use the ESM Kits listed
in Section 6.1 for bulk PM replacements. Individual ESMs are intended for break/fix use only.
7 Definitions:
Machine The Oracle Exadata Database Machine is also known as the 'machine' and consists
of all the individual servers, switches, cables and the entire software stack that makes up an
Exadata engineered solution.
System see Machine
Server A server is one of the individual servers which are used to build an Exadata Database
Machine. A server may be either a Storage Cell or a Compute Node.
Compute Node The compute nodes are also known as the 'Database node' or the 'DB node'.
These servers may be one of the Sun Fire X4170 (1U), Sun Fire X4170 M2 (1U) or
Sun Server X3-2 (1U), or the Sun Fire X4800, Sun Fire X4800 M2 or Sun Server X2-8 (5U).
Storage Cell The Storage Cells (Cells) are the 2U servers in an Exadata. These servers may be
one of the Sun Fire X4275, Sun Fire X4270 M2 or Sun Server X3-2L.
Flash F20 The Flash F20 card is the PCIe based controller of the Flash Disks (FDOMs).
There are four in every Storage Cell in an Exadata Database Machine but there are none
in the Compute Nodes. Newer Storage Cells have F20 M2 cards.
ESM The Energy Storage Module (ESM) is the power backup for the Flash F20 cards that
allows their cache to be flushed on a power failure. It works in a manner similar to a battery.
Flash F40 The Flash F40 card is the PCIe based controller of the Flash Disks (FDOMs) used
in X3-2 and X3-8 Storage Cells. It uses an on-board capacitor array to provide power
failure protection that does not need regular replacement under PM service.
RAID Card The RAID card is the PCIe based controller of the spinning disks in each of the
Exadata servers. There is one in all Storage Cells and in all Compute Nodes. The
Sun Fire X4800 compute node contains a REM based version of this same card.
BBU07/BBU08 This is the battery used to backup the cache for the RAID cards. All versions
of the RAID card, REM and PCIe, use the same battery. The BBU07 is the older version
of the battery that is no longer available.
InfiniBand Switch The Sun Datacenter InfiniBand Switch 36 is used to build the InfiniBand
fabric within the Exadata Database Machine. There are three of these switches in the
Machine: the spine switch located in Rack Unit 1 and the two 'leaf' switches found in the
center of the Machine. The quarter rack version of the Exadata Database Machine has
only the two leaf switches.
Spares Pool The spares pool consists of the parts contained in the spares kit delivered with
each Exadata Database Machine. If there is more than one Exadata product at a
customer's site, this pool should consist of one kit for each Machine.