Exadata PM Process
Procedures
2013 Oracle Corporation. Confidential Oracle Internal and Approved Partners Only.
1 Exadata Database Machine Preventive Maintenance Service
Overview:
The Oracle Exadata Database Machine is an easy to deploy solution for hosting the Oracle
Database that delivers the highest levels of database performance available. The Exadata
Database Machine is a cloud in a box composed of database servers, Oracle Exadata Storage
Servers, an InfiniBand fabric for storage networking and all the other components required to
host an Oracle Database. It delivers outstanding I/O and SQL processing performance for online
transaction processing (OLTP), data warehousing (DW) and consolidation of mixed workloads.
Extreme performance is delivered for all types of database applications by leveraging a
massively parallel grid architecture using Real Application Clusters and Exadata Storage
Servers.
To provide the highest levels of database performance available, Exadata includes memory
components used for cache that require batteries and Electronic Storage Modules (ESMs) to
ensure protection of the data in the event of an unplanned power loss. As the batteries and
ESMs near the thresholds of their life expectancy, their ability to uphold the cache is
reduced. When this occurs, caching is turned off and a performance impact will be seen.
Disabling the data caching is done without any need to reboot the server, and there is no impact
to the quality of the results delivered. The values specified below should give the Customer at
least 90 days to effect replacements before any issues arise.
During the second year of service, Oracle will begin an annual Preventive Maintenance (PM)
service to ensure the integrity and serviceability of the Exadata Database Machine. The PM
service consists of a comparison of the machine against the current Exadata Best Practices,
noting any deviations, and replacement of the battery and ESM components before they reach
their life expectancy. This service will be provided only to Customers with a Premier Support
for Systems or equivalent Hardware support contract that is valid at the time the PM service is due.
The service will be performed on site by the local Oracle Services teams in co-ordination with
the Customer, and may be run on a live system. There is no impact on the performance of the
live system if completed as described.
A Service Request and task for the field to perform the PM service will be automatically opened
by Oracle Global Customer Support based on the installation date and location of the Exadata
unit. The local field will then contact and notify the Customer approximately 90 days in
advance of the scheduled PM completion date and explain the PM process to the Customer if
they are not familiar with it. The local field should co-ordinate and schedule a preparation visit
with the Customer, during which all parts required will be identified. The local field will then
schedule and co-ordinate the parts replacement visit with the Customer.
A general timeline has been established for proactive replacement of batteries and ESMs. The
listed years of service due assume no extraordinary conditions, such as operating at high
temperatures, that may degrade the expected usable life of the battery and ESM components.
These timelines are established to ensure the continued best operating performance of the
Exadata Database Machine.
Product                                     Part       Year 1    2     3     4     5     6     7
Exadata V2                                  Battery      No    Yes    No   Yes    No   Yes    No
                                            ESM          No     No   Yes    No    No   Yes    No
Exadata X2-2, X2-8 & X2-2                   Battery      No    Yes    No   Yes    No   Yes    No
Expansion Rack                              ESM          No     No    No   Yes    No    No    No
Exadata X3-2, X3-8 & X3-2                   Battery      No    Yes    No   Yes    No   Yes    No
Expansion Rack
Notes:
1. Even though all the batteries or ESMs may not be flagged as critical in system output,
Oracle will replace ALL batteries and ESMs in the rack according to the timeline shown
above. Exadata PM SRs that are opened by Oracle on behalf of the customer should
replace ALL batteries and/or ESMs as appropriate for the PM year since install date.
2. The PM Service will be provided during year 6 (and later) only if the system is within 5
years of last ship date or Extended support has been offered.
3. For a definition of terms used in this document, refer to section 7 at the end of the
document.
4. There are no ESMs used in Exadata X3-2, X3-8 and X3-2 Expansion Racks.
5. Where this document refers to an Oracle field service engineer, it also applies to and refers
to Oracle Authorized Service Partner engineers in locations where Oracle does not
provide direct service.
Early replacement of all batteries in the rack is warranted:
if more than 1/4 of the batteries in the machine are below the failed threshold (<600mAh
BBU07 or <674mAh BBU08), or
if more than 1/2 of the batteries in the machine are below the proactive replacement threshold
(<800mAh BBU07 or BBU08).
The TSC x64 owner of the SR will verify these criteria are met based on battery outputs from the
machine, and if so, will open an Exadata PM task to the field to schedule early replacement of
all the batteries in the rack. If neither criterion is met, then the TSC x64 owner will open a
normal field task to do break/fix of the identified batteries in the machine that are below the
proactive replacement within 90 days threshold (<800mAh BBU07 or BBU08) per MOS Note
1329989.1.
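The early-replacement decision above can be sketched as a small helper. This is a hypothetical illustration (the function name and invocation style are not part of any Oracle tool); it assumes BBU07 thresholds (<600mAh failed, <800mAh proactive) — for BBU08 the failed threshold would be 674mAh.

```shell
#!/bin/sh
# Hypothetical sketch of the early-PM criteria described above.
# Arguments: total battery count, then each battery's Full Charge Capacity (mAh).
check_early_pm() {
  total=$1; shift
  failed=0; proactive=0
  for cap in "$@"; do
    if [ "$cap" -lt 600 ]; then failed=$((failed + 1)); fi
    if [ "$cap" -lt 800 ]; then proactive=$((proactive + 1)); fi
  done
  # Early PM if more than 1/4 failed, or more than 1/2 below proactive threshold
  if [ $((failed * 4)) -gt "$total" ] || [ $((proactive * 2)) -gt "$total" ]; then
    echo "early-PM"
  else
    echo "break-fix"
  fi
}
```

For example, 3 failed batteries out of 8 exceeds the 1/4 threshold, so an early PM task covering all batteries in the rack would be opened rather than a break/fix task for the failed units only.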
2 Preparation for PM
a) The TSC engineer assigned to the SR will review the Exachk output and provide
guidance to the customer via the SR, and via the field task if any field-actionable
issues are noted.
b) Explain the need for the minimum Exadata image version of 11.2.2.1.1 required to work
with the replacement batteries, such that the customer must update the image prior to
the scheduled PM activity. Oracle is no longer able to supply batteries that work with
earlier image versions, and the Exadata performance may be reduced until such time
as the image can be updated to perform the scheduled PM activity. If Exadata
software updates for minimum firmware levels are required:
The Customer should patch update the Exadata software image to the latest
available. (See MOS Note 888828.1)
Patch updating can be done at the same time as this service if the customer
chooses to do multiple change activities in the same outage, but is not provided
specifically as a part of this service.
It is the customer's responsibility to upgrade and patch the system(s), either by
themselves or via a separate optional Advanced Customer Services (ACS)
patching service which can be purchased to perform these tasks.
c) Determine a date for the on-site planning visit. PM allows up to 90 days to complete
the service.
The PM process overview and Exachk collection can also be done during the on-site planning
visit if it is preferred by the customer.
Determine the size of the Exadata rack and any additional systems such as add-on
Exadata Storage Servers or upgrade additions that will require PM replacement at the
same time. If there are add-on servers, also note the generation type (V2 vs. X2-2 vs.
X3-2) of the add-on systems as they may have a different time line for PM component
requirements as listed in section 1.
Determine if any site access issues will affect access and scheduling of replacements
Determine with the Customer if this will be a Full Machine Down Replacement or a
Rolling System Update impacting only one server at a time. Refer to Section 2.6 below
for details.
Determine the date that the PM part replacement service task will be completed. The
process allows for up to 90 days to complete the replacements.
The on-site planning visit may also be used to provide the customer with an overview of
the PM service, and to assist them in obtaining and running Exachk if the customer
requires it. (Manual gathering of battery or ESM health may be done instead; refer to
Section 2.4 for details.)
Upon completion of the on-site PM planning visit, the field service team should:
Validate with the TSC engineer any additional parts that Exachk issues may require for
resolution
Provide the TSC engineer the date and time for when the PM part replacement is
scheduled. This allows TSC to update the SR date and set it for proper auto-closure
when the field tasks are closed.
At 2 weeks prior to the scheduled PM part replacement date (or as soon as possible if <2 weeks)
Order the appropriate quantity of replacement parts for Batteries and/or ESM's for each
of the server units in the rack identified, and the part types scheduled for this current
year's service. Refer to Section 2.5 for details and Section 6 for part numbers.
Order replacement cables, CMA's and any other part that is faulted based on the
information gathered above and have them delivered to the Customer site. Refer to
section 6 for part numbers and the internal Oracle Sun System Handbook
(http://support.us.oracle.com/handbook_internal/).
Due to the quantity of parts, expect them to take 2 weeks to arrive on-site.
Acquire a #1 and #2 Phillips screwdriver.
NOTE: Service logistics is notified of scheduled PMs on the same 90 day time
scale as the SR and field tasks being opened. To assist logistics in ordering and
placing parts, do NOT order parts as soon as the task is opened. Waiting until 2
weeks prior to the scheduled PM service date before ordering parts should
provide sufficient time to get parts within the 90 day window for completion of
the PM service.
Visually inspect to identify whether any amber LED's are on. If any are, verify with the customer
whether there is a separate Service Request open that needs to be resolved prior to the PM
service being carried out.
Visually inspect the rear of the Exadata Database Machine for any Cable Management
Arms (CMA's) which may be damaged or bent.
Visually inspect all cables for bend radii which are extremely tight and do not have
adequate slack to allow for proper usage of the CMA's or for FRU removal.
Oracle Exadata V2, X2-2, X2-8, X3-2 & X3-8 Machine Spares Kits
Part Description                     Full Rack              Half Rack             Quarter Rack
InfiniBand (IB) cables               (6x) 3m cables &       (8x) 5m cables        (4x) 5m cables
(in external additional              (10x) 5m cables
parts boxes)
Disk Drive                           Two                    One                   One
Sun Flash cards                      Two                    One                   One
5M InfiniBand (tied inside rack)     Two                    One                   One
Ethernet cables1 (tied inside rack)  one each blue, red,    one each blue, red,   one each blue, red,
                                     black, orange          black, orange         black, orange
Keys (tied inside rack)              2 sets of 2 keys to open the rack doors and side panels
                                     (same for each rack size)
Note: The InfiniBand cables in the external parts boxes are intended to be used for connecting
multiple racks together. The InfiniBand cable in the spare bundle looped inside the rack is
intended for break/fix use. If the system is part of a multi-rack environment, some of the
InfiniBand cables from the spares pool will be used in that implementation. Adding
individual Storage Cells to an Exadata Database Machine may also use some of the InfiniBand
cables from the spares pool.
Note: The Sun Flash cards contained in the Spares Kits that contain an ESM do not require
replacement during PM service. The ESM does not have a shelf-life limit like a battery, as it
does not degrade unless it is being powered and charged like the ESMs in use in the
system.
Reference MOS Note 1323593.1 for more information regarding the spares kit.
If the spare parts pool is not complete, determine what parts are missing.
If the parts cannot be located, then the Customer should be advised that the on-site spares are
part of the quick remediation process purchased by the Customer, and that having them missing
may inhibit that delivery. An effort should be made to locate the missing components, or the
purchase of replacement parts should be considered.
1 Orange Ethernet cables are included only in Exadata V2, X2-2 systems and X2-2 Expansion Racks.
2.4 Preventive Maintenance Checks
The purpose of the PM checks is to both ensure the system is operating and meeting current
Exadata best practices, and to determine the current health of the battery and/or ESM
components in order to determine whether they will last the full 90 day period given to
complete the replacement of them.
See the My Oracle Support (MOS) references at the end of this section for further information
on 'exachk' and the specific checks it performs, as well as how to download the latest version.
Once a Customer has been notified that there is an upcoming PM action, the customer should
download and run the latest version of 'exachk'. The 'exachk' tool takes approximately 1
hour to run on a full rack (shorter on smaller racks), and is based on the current Best Practices
document for Exadata Database Machine. These checks include the validation of the battery
and ESM states for the purposes of determining whether the full 90 day period provided to
replace these is available, or if the current health of those components requires them to be
replaced sooner rather than later within the 90 day period.
The Oracle Exadata Database Machine 'exachk' will collect data regarding key software,
hardware, and firmware versions and configuration best practices specific to the Oracle Exadata
Database Machine.
The output assists customers to periodically review and cross reference current data for key
components of their Oracle Exadata Database Machine against supported version levels and
recommended Oracle RAC and Exadata best practices.
The Oracle Exadata Database Machine 'exachk' may be executed as desired but should be
executed regularly as part of the maintenance program for an Oracle Exadata Database
Machine. The exachk command does not require close proximity to the system and thus may
be run remotely. The customer should upload the data into the SR for review, and/or provide it
to the field engineer for review. If the customer is uncomfortable running the 'exachk' tool
themselves, then the field engineer or TSC should assist them and show them how to use it.
If the Customer has recently run the 'exachk' tool the output files can be easily viewed again
using the '-f' option to point to the associated output files, and output to the terminal or html.
This will summarize the output to show the informational and warning status messages, leaving
the 'pass' messages silent. To include the 'pass' messages use the '-f -o -v' options before
the filename.
See MOS ID 1329170.1 - Master Reference Note for exachk
See MOS ID 1070954.1 - Oracle Database Machine exachk or HealthCheck for more
information and for download instructions.
See MOS ID 757552.1 - Oracle Exadata Best Practices for more information on specific
checks
Verify with the Customer that dcli is acceptable and available to use.
Log into the servers via the internal KVM if one is available. If one is not available, a Laptop
can also be used and provides logging of your work if Customer policy allows for this.
(Remember <ctrl> <ctrl> is the KVM escape)
If the Customer is using a software image earlier than those presented above, and cannot patch
upgrade, then the PM process must be placed on hold and monitored. RAID cards with batteries
which have reached their lower thresholds will disable their caching to preserve the quality of
the data. This will, however, compromise the performance of the system. All RAID cards with
batteries that are above the lower threshold will continue to work at top speed.
The alternative of replacing all the batteries and patching the image later is not supported, and
Oracle is no longer able to supply older batteries as they are no longer manufactured. RAID
cards without the newer firmware cannot recognize the newer batteries, so ALL caching
will be turned off until the firmware is patch updated. That would be a significant performance
hit.
Exadata software versions should be at least 11.2.2.1.1 before beginning the replacement
procedures.
Remaining Capacity
In an Exadata PM service, the purpose of the capacity check is not to determine whether or not
a specific battery needs replaced, rather it is to determine the urgency of the need to perform the
PM service. The SR will be opened with a 90 day window in which to complete the PM
replacements. The customer outage time and urgency for doing so will be influenced by the
current state of the batteries operating in the rack. Use the following information to discuss
with the customer how soon or late within that 90 days they should scheduled the PM
replacements to occur. Regardless of the actual state of the battery capacity, ALL batteries
within the rack should be replaced for a proactively opened PM SR.
The absolute minimum BBU07 charge required to meet the minimum 48 hours hold-up time is
600mAh. When the BBU07 can no longer hold this much charge, the "Remaining Capacity
Low" setting reported by MegaCli64 will change from the normal "No" to "Yes", which
may be an early warning notice to check whether its "Full Charge Capacity" is getting low.
The absolute minimum BBU08 charge required to meet the minimum 48 hours hold-up time is
674mAh. Note, on BBU08 this may be flagged prematurely due to a firmware bug (Sun CR
7018730) that incorrectly sets the value higher at 960mAh based on incorrect operational
assumptions. If this is being flagged due to this bug, ignore the alert if the "Full Charge
Capacity" value is over 800mAh.
# /opt/MegaRAID/MegaCli/MegaCli64 -AdpBbuCmd -a0 | grep Capacity
If the output shows the remaining capacity below 600mAh and the Full Charge Capacity
near 600mAh, there is an urgent need for replacement. Any batteries which indicate
remaining capacity below the threshold, but have a Full Charge Capacity that is much higher,
should be checked to be sure they are not in a learning cycle. When the two values are close,
failure is imminent; the PM service should be scheduled to occur as soon
as possible, as any further delay will put the customer at risk of experiencing a performance hit
soon, if not already.
The guideline for proactive replacement of BBU07 or BBU08 batteries is to schedule
replacement within 90 days if the value is at 800mAh or less. If batteries are within this range,
then schedule the PM service for the customer's next convenient outage within that 90 day
window.
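The per-battery triage above can be sketched as a small helper. This is a hypothetical illustration (not part of MegaCli or any Oracle tool), assuming BBU07 thresholds; for BBU08 the minimum hold-up value would be 674mAh.

```shell
#!/bin/sh
# Hypothetical triage of one battery's Full Charge Capacity (mAh),
# per the BBU07 thresholds described above.
classify_battery() {
  full_charge=$1
  if [ "$full_charge" -lt 600 ]; then
    echo "replace-asap"             # below the minimum 48-hour hold-up charge
  elif [ "$full_charge" -le 800 ]; then
    echo "replace-within-90-days"   # proactive replacement guideline
  else
    echo "healthy"
  fi
}
```

Note this classifies urgency only; for a proactively opened PM SR, all batteries in the rack are replaced regardless of individual readings.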
Learn Cycle
When a new BBU is installed into a server, it will have a depleted charge state. If the charge is
less than 50% of the Designed Capacity, it will be forced into Write-Through (WT) mode and
run a full learn charge until the BBU has sufficient charge to maintain the cache. This may take
24 hours or longer. Checking its status during this period will show it in WT mode.
Note also that learn cycles occur every 30 days from first power-on for DB nodes, and 4
times a year for Storage Cells. For Storage Cells with image 11.2.1.3.1 or later, the learn cycle is
scheduled quarterly to start at 2AM on January 17, April 17, July 17, and October 17.
This time is chosen to minimize the impact on daytime operations, as WT mode reduces the
write performance of the HBA during this period until the BBU cache is re-enabled into Write-
Back (WB) mode.
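When planning PM scheduling around the quarterly Storage Cell learn cycles, the next scheduled month can be computed as follows. This is an illustrative sketch (the function is hypothetical, and day-of-month is ignored for simplicity).

```shell
#!/bin/sh
# Sketch: month of the next quarterly Storage Cell learn cycle
# (2AM on the 17th of January, April, July, October, per the schedule above).
# Argument: current month number (1-12).
next_learn_cycle_month() {
  month=$1
  for m in 1 4 7 10; do
    if [ "$month" -le "$m" ]; then
      echo "$m"; return 0
    fi
  done
  echo 1   # November/December wrap around to January
}
```

This can help avoid scheduling the rolling PM replacement into a window where a learn cycle has already put the HBAs into WT mode.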
Data Collection
Collect all data supporting a replacement of any batteries and attach it to the SR opened. For
this purpose /opt/oracle.SupportTools/sundiag.sh will collect the appropriate
data.
If LED's are lit on the cards, make note of the indicated component, add the part to your order list,
and replace any of them as part of this PM process.
The Flash F20 cards did not originally come with software which properly recorded the
powered-on hours into a non-volatile programmable memory (FRUID PROM) contained on the
card. As such, it should be assumed that all ESMs should be replaced at 3 years from the date
of system installation. The proper monitoring software was made available in August 2010, but
is not expected to have been immediately put into place on the Customer's system(s).
The Flash F20 monitoring software was delivered with ILOM version 3.0.9.19.a, as part of
Image 11.2.1.3.0, and later versions.
Identification between Flash F20 and Flash F20 M2 cards, which have different ESMs, can be
done by reviewing the FRU data, which will report either F20 (Aura 1.0) or F20 M2 (Aura 1.1)
in the FRU description. Use ILOM or ipmitool fru print or other output as described
in MOS Note 1416397.1.
Exadata Rack Type                             Individual RAID HBA Batteries1   Flash F20 Card ESM Kits
Exadata X3-8 Full Rack                        16                               Not Applicable
Exadata X3-2 Full Rack                        22                               Not Applicable
Exadata X3-2 Half Rack                        11                               Not Applicable
Exadata X3-2 Quarter Rack                     5                                Not Applicable
Individual X3-2 Storage Cell                  1                                Not Applicable
Exadata X3-2 Storage Expansion Quarter Rack   4                                Not Applicable
Exadata X3-2 Storage Expansion Half Rack      9                                Not Applicable
1 The BBU08 Battery Kits are no longer being productized as of November 2011. Order single battery FRU's in the
appropriate quantity, 1 per DB node and 1 per Storage Cell. ESM Kits containing multiple ESMs as described are available.
Exadata Rack Type                             Individual RAID HBA Batteries    Flash F20 Card ESM Kits
Exadata X3-2 Storage Expansion Full Rack      18                               Not Applicable
Exadata X2-8 Full Rack                        16                               x2 Half Rack Kits (56 ESMs total)
Exadata V2, X2-2 Full Rack                    22                               x2 Half Rack Kits (56 ESMs total)
Exadata V2, X2-2 Half Rack                    11                               x1 Half Rack Kit (28 ESMs total)
Exadata V2, X2-2 Quarter Rack                 5                                x1 Quarter Rack Kit (12 ESMs total)
Individual V2, X2-2 Storage Cell              1                                x1 Individual Server ESM Kit (4 ESMs total)
Exadata X2-2 Storage Expansion Quarter Rack1  4                                x4 Individual Server ESM Kits (16 ESMs total)
Exadata X2-2 Storage Expansion Half Rack1     9                                x9 Individual Server ESM Kits (36 ESMs total)
Exadata X2-2 Storage Expansion Full Rack1     18                               x18 Individual Server ESM Kits (72 ESMs total)
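The RAID HBA battery quantities in the tables above can be captured in a small ordering helper. This is a hypothetical sketch (the function name and rack-type keys are illustrative, not an Oracle tool); quantities follow the rule of 1 battery per DB node and 1 per Storage Cell.

```shell
#!/bin/sh
# Hypothetical lookup of RAID HBA battery quantities to order per rack type,
# taken from the tables above.
batteries_for_rack() {
  case "$1" in
    x2-8-full|x3-8-full)                   echo 16 ;;
    v2-full|x2-2-full|x3-2-full)           echo 22 ;;
    v2-half|x2-2-half|x3-2-half)           echo 11 ;;
    v2-quarter|x2-2-quarter|x3-2-quarter)  echo 5 ;;
    storage-cell)                          echo 1 ;;
    *) echo "unknown rack type: $1" >&2; return 1 ;;
  esac
}
```

Remember to add the counts for any add-on Storage Cells or expansion racks attached to the base rack, as noted in the planning-visit steps above.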
Section 6 includes a table with all of the part numbers required as part of this PM procedure.
Should you need to order a broken or failed component that is not on this list please refer to the
Oracle Sun System Handbook found on MOS at:
https://support.us.oracle.com/handbook_internal/index.html
1. Full System Downtime - This approach has a short maintenance window, but the entire
system is down for the duration of the component replacement. This method is preferred
if the customer has a scheduled maintenance window or if the risk of a rolling
replacement is unacceptable.
2. Rolling replacement method - This approach, where one server is taken offline at a
time, has a longer maintenance window, but the overall system is up during the entire
activity. The risks associated with a rolling replacement are:
a) For systems with high redundancy, a double disk failure during the maintenance
window may cause data loss bringing down the entire system and requiring a restore
from backup.
1 There are no Storage Expansion rack specific kits. Field engineers are free to order either individual kits as specified, or
combinations of regular rack kits and individual FRU kits, as necessary for the required quantity of batteries and ESMs.
b) For systems with normal redundancy, a single disk failure during the maintenance
window may cause data loss bringing down the entire system and requiring a restore
from backup.
The decision to perform these tasks in a rolling fashion as opposed to a full down Exadata
Database Machine is entirely the Customer's.
Should the Customer decide to perform any other updates or changes during standard down
time, it is acceptable to do so, however it is not recommended.
Rolling System Down (Single Server down at any given time) - This method is similar to the
Individual Server repair mentioned above. The Customer takes the server out of
production, the Field Engineer completes the required repair, and the customer then
returns the server to use by the Exadata Database Machine. Once the server is back up and
running as part of the Exadata Database Machine, the steps are repeated for the next server
in need. This process continues until the last repair has been completed.
NOTE: During the rolling repair process the Exadata Database Machine will be
without certain redundancies while a server is down. Only one server should be
down at any time. If multiple servers are offline, then there may be data loss and
the entire system will go down requiring the system to be restored from backup.
Make sure the server last repaired has completed its re-integration into the
Exadata Database Machine before you begin work on the next Server.
Full Machine Down (All servers down at once) - Should enough servers require attention, the
Customer may wish to take the entire Exadata Database Machine down. This choice may be
made to prevent an extended exposure to a loss of redundancies, or simply because a convenient
service window allows the opportunity. In this scenario the field engineers may work on
multiple servers at once.
When shutting down the servers, be sure to start with the DB nodes. Doing so quiesces the
servers and the applications on them. The Storage Cells should follow once all DB nodes are
down. When starting them up, begin with the Cells, then proceed to the DB nodes.
A Storage Cell, done individually, takes approximately 1 hour depending on work load. The
last step (verification) can take the most time due to waiting for disks to re-sync. A DB Node
can take about 30 minutes. The basic steps are: (more detail below)
Stop the services & shutdown the server
Effect the Replacement of ESM's & RAID card batteries.
Boot the server
Validate all services have restarted.
If a full system down approach is preferred by the Customer, the big saving is in not having to
shut down, start up, and re-synchronize data on each storage server individually. There is also
benefit in 2 engineers working in parallel to change out parts for the full and half racks. There
are some conflicts in space which arise from this method, which is why the times are not truly
halved.
The rolling replacement does not benefit from having 2 engineers working in parallel, since the
bulk of the time is spent awaiting the shutdowns, start ups, and re-syncs to complete. Remember,
you may not work on 2 storage cells at the same time unless performing the full system down
approach, regardless of redundancy level setting.
3 On-Site Part Replacement - Step by Step
Note: This document is intended for use by Oracle Support engineers and Oracle Authorized
Service Partners only. The commands in this section that may need to be completed by the
customer's database administrator (DBA) may be copied for use by the customer.
Linux:
# shutdown -y -h now
Solaris:
# shutdown -y -i 5 -g 0
3. The Field Engineer turns off the PDU breakers in the rack to shutoff power to servers.
4. The Field Engineer replaces the FRU's:
1. Replace Batteries as necessary according to PM service time line (section 3.4)
2. Replace ESM's as necessary according to PM service time line (section 3.3)
5. The Field Engineer turns on the PDU breakers in the rack to power on servers.
6. Power on all servers via the front power button once ILOM has completed booting. Wait 5
to 10 minutes for the Cells to be fully up and ASM online.
7. The Customer and Field Engineer, after restarting all the cells and compute nodes, log
into the first compute node and run:
# dcli -l root -g <cells_group_file> "service celld status"
rsStatus: running
msStatus: running
cellsrvStatus: running
10. The Customer then validates that ASM and DBS are back:
# dcli -l root -g <dbs_group_file> "ps -ef | grep pmon"
# dcli -l root -g <dbs_group_file> "<GI_HOME>/bin/crsctl check crs"
NOTE: Make sure the server has completed its re-integration into the Exadata
Database Machine before you begin work on the next Server. You should never
have 2 servers down at once, not even 2 disk drives, unless the server has failed them
before you arrived!
Storage Cell replacement consists of the following basic actions: (details in Sections 4.1 to 4.3
below)
Take the grid disks offline - Customer (Section 4.1)
Shut down the server - Customer or Field Engineer (Section 4.1)
Part Replacement - Field Engineer (either or both if necessary)
Single RAID card battery (Section 3.4 and Section 3.5)
Four Flash card ESM's (Section 3.3)
Clear F20 fault status if set (Section 3.3)
Boot the server - Customer or Field Engineer (Section 4.2)
Verify all storage is present and all connections are correct - Customer (Section 4.2)
Activate the Grid Disks - Customer (Section 4.3)
Verify that all Grid Disks are back online - Customer (Section 4.3)
Each Database Node takes approximately 30 minutes and consists of the following steps:
Stop the CRS services - Customer (Section 4.4)
Shut down the server - Customer or Field Engineer (Section 4.4)
Part Replacement - Field Engineer
Single RAID card battery (Section 3.4 and Section 3.5)
Boot the server - Customer or Field Engineer (Section 4.5)
Validate all services have restarted - Customer (Section 4.6)
3. Extract the Flash HBA cards from the PCI riser assembly, 1 card at a time.
The F20 Card has the ESM located in the centre of the card, with FMODs on either side of it.
The assembly part number label is located on the front of the card near the card edge connector
between the disk controller and rear FMODs.
1. Extract the Flash card from the PCI riser assembly, 1 card at a time.
2. Remove the two ESM assembly retaining pins on the back of the card.
a) First, remove the center pin from each retaining pin.
b) Next, push the outer section of each retaining pin through the card and remove them.
3. Carefully slide the ESM assembly (the ESM shroud and the ESM) off the card without
disturbing FMOD0 or FMOD3.
4. Using a pair of wire cutters, clip the ESM cable near the ESM end. This will allow
removal of the cable without needing to unscrew clips and remove FMOD0.
5. Disconnect the ESM cable from connector J803 on the card using the remaining tail.
The F20 Card has the ESM located in the centre of the card, with FMODs on either side of it.
The assembly part number label is located on the front of the card near the card edge connector
between the disk controller and rear FMODs.
2. Place the ESM assembly next to the board, then slide it gently onto the card. Carefully
route the cable and plug between FMOD0 and FMOD1 while sliding it on.
3. Install the two retaining pins from the back of the card
a) First, install the outer section of each retaining pin.
b) Next, install the center section of each retaining pin.
4. Connect the ESM plug to J803 on the card, routing the ESM cable around the retainer
clip holding FMOD0 and FMOD1, with the cable laying between the 2 FMODs.
5. Install the card back into the riser assembly in the same slot.
Reverse the previous steps to re-install the PCIe Riser back into the server.
The F20 Card has the ESM located on the rear of the card next to the SAS cable connector. The
assembly part number label is located next to the orange WWN label on the rear side of the
card.
1. Locate the plastic retaining clip for the ESM plastic housing on the rear side of the card.
2. With a small tool such as the tip of a screwdriver, carefully press the clip down while
pushing the housing off the rear end of the PCI card.
Installing the ESM on F20 M2 Card (541-4417)
The F20 Card has the ESM located on the rear of the card next to the SAS cable connector. The
assembly part number label is located next to the orange WWN label on the rear side of the
card.
3. Slide the ESM assembly feet carefully onto the board, one into the slotted hole and the other
onto the end of the PCI card. There should be an audible click when the retaining clip engages
in its slot.
Reverse the previous steps to re-install the PCIe Riser back into the server.
Clearing the ESM Fault Status (Exadata V2, X2-2 and X2-8 with F20 cards)
The ESM power-on monitoring feature for F20 cards in Exadata V2, X2-2 and X2-8 is
implemented in ILOM software and requires manual clearing if faults are present; physical
replacement of the ESM does not clear them automatically. This does not apply to Exadata
X2-2 and X2-8 units with F20 M2 cards, for which ILOM manages the thresholds automatically.
The monitoring feature was added in ILOM 3.0.9.19.a and is contained in Exadata software
image version 11.2.1.3.1 and later.
1. After the system is plugged in and ILOM has booted, log in to ILOM on the Storage Cell
as the root user.
2. For each ESM that was replaced, check whether the fault_state is set to critical.
b) If the fault_state shows OK, the ESM may not yet have reached the power-on-hours
threshold needed to flag it as critical. This can happen because the unit had only come
close to the threshold by the time the PM was performed, or because an interim ILOM
flash update reset the counter to 0. The fault_state can be manually set to critical as
follows:
1. On systems with image 11.2.3.2.1 and earlier with ILOM version v3.0.9.x
through v3.0.16.10.d, enter the fault management shell in ILOM CLI:
faultmgmtsp>
2. Run the following to mark the card failed:
-> etcd -i ereport.chassis.device.esm.eol.warning@/SYS/MB/RISER1/PCIE1/F20CARD
/SYS/MB/RISER1/PCIE4/F20CARD
/SYS/MB/RISER2/PCIE2/F20CARD
/SYS/MB/RISER2/PCIE5/F20CARD
Notes:
ILOM v3.0.9.19.a on Exadata V2 systems with image 11.2.2.2.2 and earlier has a bug that
prevents slot PCIE4 from reporting the presence of the F20 flash card. Skip that slot if
the system has this problem.
ILOM v3.0.9.19.a on Exadata V2 systems and v3.0.9.27.a on Exadata X2-2/X2-8
systems have a bug that programmed the thresholds to 2 years (17200 hours) instead of 3
or 4 years, so the fault status may already have been triggered and cleared.
On Exadata X2-2/X2-8 systems, the threshold may report 3 years (26220 hours) instead
of 4 years (35052 hours) if the system_identifier property in ILOM /SP is not
programmed to the standard Exadata identity string 'Exadata Database Machine X2-2'
(or X2-8), which identifies this card as being in an Exadata rather than a regular
X4270 M2 system. This may be the case on V2 systems upgraded with X2-2 servers if the
identity string was changed to the V2 rack string 'Sun Oracle Database Machine'.
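The per-slot fault check described above can be scripted. A minimal sketch that prints the ILOM `show` command for each of the four F20 slot paths listed in this procedure (whether your ILOM version exposes `fault_state` on these exact targets should be verified first, and slot PCIE4 skipped on affected V2 systems per the note above):

```shell
#!/bin/sh
# Print an ILOM 'show' command for each F20 card slot so the
# fault_state of its ESM can be checked one slot at a time.
# The four slot paths come from this procedure.
f20_fault_checks() {
    for slot in /SYS/MB/RISER1/PCIE1/F20CARD \
                /SYS/MB/RISER1/PCIE4/F20CARD \
                /SYS/MB/RISER2/PCIE2/F20CARD \
                /SYS/MB/RISER2/PCIE5/F20CARD; do
        echo "show $slot fault_state"
    done
}
f20_fault_checks
```

Paste the printed commands into the ILOM CLI session one at a time, or feed them over an ssh session to the service processor.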
Follow the instructions in Section 4.2 and 4.3 to bring the storage cell back into service and to
verify that all components are working as expected.
NOTE: Do not begin to replace components in other Storage Cells until all
components in this one are back online and have re-silvered.
3.4 Battery replacement in Exadata V2, X2-2, X3-2 Database Machine
Compute nodes and all Storage Cells:
Note: This procedure is for the X4170 and X4270 M2 Compute Nodes in the Exadata Database
Machine X2-2, and for any of the Storage Cells. Instructions for battery replacement in the
X4800 of the Exadata Database Machine X2-8 follow in the next section.
If you are using the Rolling Upgrade methodology, the Storage Cells will need to be taken
offline one at a time to perform this procedure. Please follow Section 4.1 for instructions
before beginning the replacement. Database Nodes should follow the steps in Section 4.4. If
using the Full System Down method, proceed with the next step.
For each server in the Exadata Database Machine you should perform the
following steps:
Preparing the Server (Storage Cell or Compute Node) for service
1. Disconnect the SAS cables from the HBA PCI card, making a note of which port each
cable goes into.
4. Extract the RAID HBA card from the PCI riser assembly.
1. Use a No. 1 Phillips screwdriver to remove the 3 retaining screws that secure the battery
to the HBA from the underside of the card only.
Do NOT attempt to remove any screws from the battery on the top side of the HBA.
2. Detach the battery pack including circuit board from the HBA by gently lifting it from
its circuit board connector on the top side of the HBA.
Reverse the previous steps to re-install the new battery on the HBA, and reinstall the PCIe
card and PCIe riser back into the server. Take care to get the SAS cables re-connected to
the same ports they were removed from, as accidentally reversing them may affect disk slot
mappings.
Follow the instructions in Section 4.2 and 4.3 to bring the storage cell back into service and to
verify that all components are working as expected. For Database nodes follow the steps
outlined in sections 4.5 and 4.6.
NOTE: Do not begin to replace components in other Storage Cells until all
components in this one are back online and have re-silvered.
Power Off
Remove CMOD0 from the server, setting it on a flat, antistatic surface with ample space
and light.
Remove the CMOD cover.
1. Lift the REM ejector handle and rotate it to its fully open position.
2. Lift the connector end of the REM and pull the REM away from the retaining clip on the
front support bracket.
3. To remove the battery, use a No. 1 Phillips screwdriver to remove the 3 retaining screws
that mount the battery to the REM.
4. Detach the battery pack including circuit board from the REM by gently lifting it from
its circuit board connector.
Install the new battery and reinstall the REM into the server.
1. Attach the battery pack to the REM by aligning the circuit board connectors and gently
pressing together.
2. Secure the new battery to the underside of the REM using the 3 retaining
screws.
3. Ensure that the REM ejector lever is in the closed position. The lever should be flat with
the REM support bracket.
4. Position the REM so that the battery is facing downward and the connector is aligned
with the connector on the motherboard.
5. Slip the opposite end of the REM under the retaining clips on the front support bracket
and ensure that the notch on the edge of the REM is positioned around the alignment
post on the bracket.
6. Carefully lower and position the connector end of the REM until the REM contacts the
connector on the motherboard, ensuring that the connectors are aligned. To seat the
connector, carefully push the REM downward until it is in a level position.
1. Install the cover on the CMOD and return the CMOD to the CMOD0 slot in the unit.
Replace any broken cables or cable management arms (CMAs) discovered in the earlier visual
inspection. Refer to MOS Note 1444683.1 for handling instructions and training.
Replace any other parts that require replacement according to the appropriate canned action
plans for those parts on MOS.
4 Starting, Stopping and Verifying the Exadata sub-systems
The following sections explain how to shut down, start up and verify the servers within an
Exadata Database Machine.
NOTE: This document is intended for use by Oracle Support engineers and approved service
partners only. The commands in this section that need to be completed by the customer's
database administrator (DBA) may be copied for use by the customer.
As long as the value is large enough to comfortably replace the batteries and ESMs in a
storage cell, there is no need to change it.
...snip...
DATA_CD_09_cel01 ONLINE Yes
DATA_CD_10_cel01 ONLINE Yes
DATA_CD_11_cel01 ONLINE Yes
RECO_CD_00_cel01 ONLINE Yes
RECO_CD_01_cel01 ONLINE Yes
etc....
If one or more disks return asmdeactivationoutcome='No', you should wait for some time
and repeat the query until all disks return asmdeactivationoutcome='Yes'.
NOTE: Taking the storage server offline while one or more disks return a status of
asmdeactivationoutcome='No' will cause Oracle ASM to dismount the affected disk
group, causing the databases to shut down abruptly.
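The wait-and-repeat check above can be scripted against a captured copy of the grid disk attribute listing. A sketch that succeeds only when no disk still reports asmdeactivationoutcome='No' (the cellcli invocation in the comment is an assumption; adjust it to your environment):

```shell
#!/bin/sh
# Return success (exit 0) only when no grid disk line in the captured
# attribute listing still ends with asmdeactivationoutcome 'No'.
all_disks_safe() {
    ! grep -Eq '[[:space:]]No$' "$1"
}

# Polling sketch (cellcli invocation is an assumption):
#   until cellcli -e "LIST GRIDDISK ATTRIBUTES name, asmdeactivationoutcome" > /tmp/gd.txt &&
#         all_disks_safe /tmp/gd.txt; do sleep 60; done
```

Only proceed to inactivate the grid disks once the check passes for every disk on the cell.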
3. Run the CellCLI command to inactivate all grid disks on the cell you wish to power down or
reboot (this could take 10 minutes or longer):
# cellcli
...
CellCLI> ALTER GRIDDISK ALL INACTIVE
GridDisk DATA_CD_00_dmorlx8cel01 successfully altered
GridDisk DATA_CD_01_dmorlx8cel01 successfully altered
GridDisk DATA_CD_02_dmorlx8cel01 successfully altered
GridDisk RECO_CD_00_dmorlx8cel01 successfully altered
GridDisk RECO_CD_01_dmorlx8cel01 successfully altered
GridDisk RECO_CD_02_dmorlx8cel01 successfully altered
...etc...
4. Execute the command below; once the disks are offline and inactive in ASM, the output
should show asmmodestatus='UNUSED' or 'OFFLINE' and asmdeactivationoutcome='Yes' for
all grid disks.
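The attribute listing referenced in step 4 is likely `LIST GRIDDISK ATTRIBUTES name, asmmodestatus, asmdeactivationoutcome` (an assumption; the command itself did not survive in this copy of the document). A sketch that validates a captured listing, accepting only UNUSED or OFFLINE with outcome Yes on every line:

```shell
#!/bin/sh
# Verify a captured grid disk listing: every line must show
# asmmodestatus UNUSED or OFFLINE and asmdeactivationoutcome Yes.
# Capture the listing first, e.g. (command text is an assumption):
#   cellcli -e "LIST GRIDDISK ATTRIBUTES name, asmmodestatus, asmdeactivationoutcome" > gd.txt
griddisks_offline() {
    awk '($2 != "UNUSED" && $2 != "OFFLINE") || $3 != "Yes" { bad = 1 }
         END { exit bad }' "$1"
}
```

A non-zero exit means at least one disk is still in use; wait and re-capture before shutting the cell down.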
You can now shut down the Cell using the following command:
# shutdown -h -y now
Disconnect the power cords before opening the top of the server
[1:0:3:0] disk ATA MARVELL SD88SA02 D20Y /dev/sdp
[2:0:0:0] disk ATA MARVELL SD88SA02 D20Y /dev/sdq
[2:0:1:0] disk ATA MARVELL SD88SA02 D20Y /dev/sdr
[2:0:2:0] disk ATA MARVELL SD88SA02 D20Y /dev/sds
[2:0:3:0] disk ATA MARVELL SD88SA02 D20Y /dev/sdt
[3:0:0:0] disk ATA MARVELL SD88SA02 D20Y /dev/sdu
[3:0:1:0] disk ATA MARVELL SD88SA02 D20Y /dev/sdv
[3:0:2:0] disk ATA MARVELL SD88SA02 D20Y /dev/sdw
[3:0:3:0] disk ATA MARVELL SD88SA02 D20Y /dev/sdx
[4:0:0:0] disk ATA MARVELL SD88SA02 D20Y /dev/sdy
[4:0:1:0] disk ATA MARVELL SD88SA02 D20Y /dev/sdz
[4:0:2:0] disk ATA MARVELL SD88SA02 D20Y /dev/sdaa
[4:0:3:0] disk ATA MARVELL SD88SA02 D20Y /dev/sdab
There should be 16 FMODs found with the MARVELL label and 12 disks found by the search
on the LSI label. If the device counts above are not correct, the server should be re-opened
and the device connections checked to be sure they are secure BEFORE the following CellCLI
commands are issued.
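The device counts can be verified with a quick pipeline over a captured copy of the SCSI device listing (a sketch; the exact vendor labels may differ by firmware revision):

```shell
#!/bin/sh
# Count flash modules (MARVELL label) and spinning disks (LSI label)
# in a captured lsscsi listing; a storage cell should show 16 and 12.
count_devices() {
    echo "$(grep -c 'MARVELL' "$1") $(grep -c 'LSI' "$1")"
}
```

Compare the printed pair against the expected "16 12" before issuing any CellCLI commands.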
Customer Activity:
1. Once the operating system is up, you will need to activate the grid disks.
# cellcli
CellCLI> alter griddisk all active
GridDisk DATA_CD_00_dmorlx8cel01 successfully altered
GridDisk DATA_CD_01_dmorlx8cel01 successfully altered
GridDisk DATA_CD_02_dmorlx8cel01 successfully altered
GridDisk RECO_CD_00_dmorlx8cel01 successfully altered
GridDisk RECO_CD_01_dmorlx8cel01 successfully altered
GridDisk RECO_CD_02_dmorlx8cel01 successfully altered
...etc...
2. Issue the command below and all disks should show 'active':
CellCLI> list griddisk
DATA_CD_00_dmorlx8cel01 active
DATA_CD_01_dmorlx8cel01 active
DATA_CD_02_dmorlx8cel01 active
RECO_CD_00_dmorlx8cel01 active
RECO_CD_01_dmorlx8cel01 active
RECO_CD_02_dmorlx8cel01 active
...etc...
Customer Activity:
Verify all grid disks have been successfully put online using the following command. Wait until
asmmodestatus is ONLINE for all grid disks. The following is an example of the output early in
the activation process.
Notice in the above example that RECO_CD_00_dmorlx8cel01 is still in the 'SYNCING' process.
Oracle ASM synchronization is only complete when ALL grid disks show
asmmodestatus=ONLINE. This process can take some time, depending on how busy the
machine is and on how busy it was while this individual server was down for repair.
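The resilvering wait can be expressed the same way as the earlier checks: accept the cell only when every grid disk line in a captured listing shows asmmodestatus ONLINE (a sketch; any SYNCING line keeps it failing):

```shell
#!/bin/sh
# Exit 0 only when every grid disk line in the captured listing
# reports asmmodestatus ONLINE; a SYNCING line means keep waiting.
all_online() {
    awk '$2 != "ONLINE" { bad = 1 } END { exit bad }' "$1"
}
```

Re-capture and re-check periodically; do not move on to the next Storage Cell until this passes.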
This command shouldn't return any records if there are no CRS services running.
NOTE: Stopping CRS on one node may require modifying CRS services to run on a different
node.
3. You can now shut down the DB node using the following command:
Linux:
# shutdown -y -h now
Solaris:
# shutdown -y -i 5 -g 0
In the output above, the '1' in +ASM1 refers to the DB node number; for example, on DB node
#3 the value would be +ASM3.
2. Validate that instances are running:
# ps -ef |grep pmon
It should return a record for the ASM instance and a record for each database.
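The pmon check can be summarized from a captured `ps -ef` listing using the standard Oracle process-name prefixes, asm_pmon_* for the ASM instance and ora_pmon_* for each database (a sketch; instance names are examples):

```shell
#!/bin/sh
# Summarize pmon processes from a captured 'ps -ef' listing:
# one asm_pmon per ASM instance, one ora_pmon per database.
pmon_summary() {
    echo "asm=$(grep -c 'asm_pmon' "$1") db=$(grep -c 'ora_pmon' "$1")"
}
```

An asm count of 1 and a db count matching the number of databases on the node indicates all instances restarted.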
5 After the Replacements are done
Once the entire PM process is complete the 'exachk' utility should be re-run to verify that all is
well. See MOS ID 1329170.1 - Master Reference Note for exachk and MOS ID 1070954.1 -
Oracle Database Machine exachk or HealthCheck for more information and for download
instructions.
The engineer should complete all required documentation prior to closing the service request
tasks for providing this PM service.
6 Parts List:
NOTE: The RAID Card Battery Kits that were being developed will no longer be made as of November 2011. Order
single battery FRUs in the appropriate quantity, 1 per server (DB nodes and storage cells).
541-4416 BD,PCI Express Flash Board, (AURA 1.1)
371-5014 DOM, SS-FLASH, 32GB/24GB Solid State Flash Memory
Module, D20Y firmware (AURA 1.1)
371-4953 5.5V, 11F, Capacitive Backup Power Module, (ESM)(AURA 1.1)
7061269 DOM, SS-FLASH, 32GB/24GB Solid State Flash Memory
Module, D21Y firmware (AURA 1.1)
NOTE: It is preferred NOT to order individual F20 ESMs for PM; use the ESM Kits listed
in Section 6.1 for bulk PM replacements. Individual ESMs are intended for break/fix use only.
7 Definitions:
Machine The Oracle Exadata Database Machine is also known as the 'machine' and consists
of all the individual servers, switches, cables and the entire software stack that makes up an
Exadata engineered solution.
System see Machine
Server A server is one of the individual servers which are used to build an Exadata Database
Machine. A server may be either a Storage Cell or a Compute Node.
Compute Node The compute nodes are also known as the 'Database node' or the 'DB node'.
These servers may be one of the Sun Fire X4170 (1U), Sun Fire X4170 M2 (1U) or
Sun Server X3-2 (1U), or the Sun Fire X4800, Sun Fire X4800 M2 or Sun Server X2-8 (5U).
Storage Cell The Storage Cells (Cells) are the 2U servers in an Exadata. These servers may be
one of the Sun Fire X4275, Sun Fire X4270 M2 or Sun Server X3-2L.
Flash F20 The Flash F20 card is the PCIe based controller of the Flash Disks (FDOMs).
There are four in every Storage Cell in an Exadata Database Machine but there are none
in the Compute Nodes. Newer Storage Cells have F20 M2 cards.
ESM The Energy Storage Module (ESM) is the power backup for the Flash F20 cards that
allows their cache to be flushed on a power failure. It works in a manner similar to a battery.
Flash F40 The Flash F40 card is the PCIe based controller of the Flash Disks (FDOMs) used
in X3-2 and X3-8 Storage Cells. It uses an on-board capacitor array to provide power
failure protection that does not need regular replacement under PM service.
RAID Card The RAID card is the PCIe based controller of the spinning disks in each of the
Exadata servers. There is one in all Storage Cells and in all Compute Nodes. The
Sun Fire X4800 compute node contains a REM based version of this same card.
BBU07/BBU08 This is the battery used to backup the cache for the RAID cards. All versions
of the RAID card, REM and PCIe, use the same battery. The BBU07 is the older version
of the battery that is no longer available.
InfiniBand Switch The Sun Datacenter InfiniBand Switch 36 is used to build the InfiniBand
fabric within the Exadata Database Machine. There are three of these switches in the
Machine: the spine switch located in Rack Unit 1 and the two 'leaf' switches found in the
center of the Machine. The quarter rack version of the Exadata Database Machine has
only the two leaf switches.
Spares Pool The spares pool consists of the parts contained in the spares kit delivered with
each Exadata Database Machine. If there is more than one Exadata product at a
customer's site, this pool should consist of one kit for each Machine.