Systems With WriteBack Smart Flash Cache (WBFC) Enabled Running Into Unnecessary Block Repair During Resilvering Could Cause Data Loss

Oracle Database - Enterprise Edition - Version 11.2.0.2 to 11.2.0.4 [Release 11.2]
Oracle Exadata Storage Server Software - Version 11.2.3.2.1 to 11.2.3.2.1 [Release 11.2]
Oracle Database Cloud Schema Service - Version N/A and later
Oracle Database Exadata Cloud Machine - Version N/A and later
Oracle Cloud Infrastructure - Database Service - Version N/A and later
Information in this document applies to any platform.

DESCRIPTION

After a flashdisk fails on an Exadata Storage Server with Write-back Smart Flash
Cache (WBFC) enabled, ASM resilvering can take too long, extending the exposure
to a second flashdisk failure (or a third flashdisk failure if using ASM high
redundancy), which may result in data loss.

This behavior is caused by one of the following issues:

 Bug 17342825 : after failure of flashdisk resilver took longer, overlapping
second failure
 Unpublished Bug 17446237 : kcfis block repair should be suppressed upon
getting 'block to be resilvered'

OCCURRENCE

Systems with the following configuration are exposed to this behavior and should
take immediate action to apply the required fixes:

 Exadata Storage Servers with WBFC enabled
 One of the following database software versions:
o 11.2.0.4.0
o 11.2.0.3.20 (Jul 2013)
o 11.2.0.3.19 (Jun 2013)
o Any 11.2.0.2
o Any 11.2.0.1

SYMPTOMS

With WBFC enabled, upon flashdisk failure, the griddisks cached by the failed
flashdisk will have stale blocks. Exadata Storage Server Software initiates a
resilvering operation in order to resynchronize the stale blocks from the content on
other storage servers. Block repair operations initiated by the database for the stale
blocks should be suppressed while the resilvering operation is in progress.

The expected duration of the resilvering process depends on the number of dirty
blocks stored on the failed flashdisks. When experiencing the behavior described
above, database-initiated block repair interferes with the resilvering operation,
substantially extending the time resilvering takes to complete. If a second flashdisk
failure (or a third flashdisk failure if using ASM high redundancy) occurs on a
different Exadata Storage Server during the extended resynchronization window,
data can be lost.

 Messages in alert.log on the storage cell indicate the failure of the FDOM(s)
and the initiation of resilvering (creation of resilvering tables). These messages
are EXPECTED as part of the initialization of resilvering.

CDHS: Received cd health state change with newState HEALTH_FAIL guid b6e2c8ae-345f-4030-8c59-0ffc7173c87b
CDHS: Do cd health state change FD_07_cel16 from HEALTH_GOOD to newState HEALTH_FAIL
FlashLogcel16_FLASHLOG (978653732, cdisk=FD_07_cel16) is inactive due to inactive flash disk
Warning: turned off caching for FlashCache Part b2d9ef09-0f6c-4e6f-9147-1fe419419a4d (2719004012) located on cdisk FD_07_cel16 due to IO errors or slow/hung device
Wed Oct 09 23:12:58 2013
INFO: Griddisk DATA_CD_08_cel16 (id: 2402492548) no longer cached by flash ID: 731738644
INFO: Griddisk DBFS_DG_CD_07_cel16 (id: 2155917700) no longer cached by flash ID: 731738644
INFO: Griddisk DBFS_DG_CD_08_cel16 (id: 3061245036) no longer cached by flash ID: 731738644
INFO: Griddisk DATA_CD_07_cel16 (id: 2278565236) no longer cached by flash ID: 731738644
INFO: Griddisk DATA_CD_01_cel16 (id: 3224203868) no longer cached by flash ID: 731738644
INFO: 5 resilvering tables were updated because flash disk (ID: 731738644, guid: 185cf112-0b4f-4911-8bdd-f390ce7f94b6) failed
CDHS: Done cd FD_07_cel16 health state change from HEALTH_GOOD to newState HEALTH_FAIL
Wed Oct 09 23:12:59 2013
Drop celldisk FD_07_cel16 (options: force, from memory only, no-erase) - begin
Disabling caching on FlashCache cel16_FLASHCACHE (731738644) cdisk=FD_07_cel16 which had dirty (not synced) data
Wed Oct 09 23:12:59 2013
NOTE: Initiating ASM Instance operation: ASM RESILVER diskgroup on 5 disks
Published 3 grid disk events ASM RESILVER diskgroup on DG DATA to:
ClientHostName = db01.*.com, ClientPID = 30637
Published 2 grid disk events ASM RESILVER diskgroup on DG DBFS_DG to:
ClientHostName = db05.*.com, ClientPID = 21976
Drop celldisk FD_07_cel16 - end
Wed Oct 09 23:14:27 2013

 The alert.log files on any of the ASM instances will be filled with messages
related to block repair activities. This is NOT EXPECTED and is an indication
of block repair operations by RDBMS processes.

Errors in file /u01/app/grid/diag/asm/+asm/+ASM3/trace/+ASM3_r000_*.trc:
ORA-27603: Cell storage I/O error, I/O failed on disk o/192.168.xx.xx/DATA_CD_01_cel06 at offset 156237824000 for data length 1048576
ORA-27626: Exadata error: 223 (Block needs to be resilvered)
SUCCESS: extent 5965 of file 612 group 1 repaired - all online mirror sides found readable, no repair required
NOTE: repairing group 1 file 612 extent 14264
Errors in file /u01/app/grid/diag/asm/+asm/+ASM3/trace/+ASM3_r000_*.trc:
ORA-27603: Cell storage I/O error, I/O failed on disk o/192.168.xx.xx/DATA_CD_07_cel06 at offset 88774541312 for data length 1048576
ORA-27626: Exadata error: 223 (Block needs to be resilvered)

or

SUCCESS: extent 16667 of file 286 group 1 repaired by relocating to a different AU on the same disk or the disk is offline
NOTE: repairing group 1 file 286 extent 11654
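
The ASM alert.log signatures shown above can be counted with a short script. The following is only a minimal sketch in Python, not an Oracle-supplied tool: the function name summarize_block_repair and the sample text are hypothetical, and the patterns are taken from the excerpts in this note. It assumes each alert.log message sits on a single line.

```python
import re

# Patterns copied from the ASM alert.log excerpts in this note.
RESILVER_ERROR = re.compile(
    r"ORA-27626: Exadata error: 223 \(Block needs to be resilvered\)"
)
REPAIR_NOTE = re.compile(r"NOTE: repairing group (\d+) file (\d+) extent (\d+)")


def summarize_block_repair(alert_log_text):
    """Count the unexpected block repair signatures in ASM alert.log text.

    Hypothetical helper for illustration only.
    """
    repairs = [
        tuple(int(n) for n in match.groups())
        for match in REPAIR_NOTE.finditer(alert_log_text)
    ]
    return {
        "resilver_errors": len(RESILVER_ERROR.findall(alert_log_text)),
        "repair_notes": len(repairs),
        "repaired_extents": repairs,  # (group, file, extent) tuples
    }


# Sample lines taken from the excerpts above.
sample = """ORA-27626: Exadata error: 223 (Block needs to be resilvered)
NOTE: repairing group 1 file 612 extent 14264
ORA-27626: Exadata error: 223 (Block needs to be resilvered)
NOTE: repairing group 1 file 286 extent 11654
"""
summary = summarize_block_repair(sample)
print(summary["resilver_errors"], summary["repair_notes"])
```

A non-zero count while a resilvering operation is in progress would match the NOT EXPECTED block repair activity described in this section.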

WORKAROUND

None

PATCHES

Database home version             Action required
--------------------------------  ------------------------------------------------
11.2.0.4                          Install Patch 17492065
                                  Note: this fix is included in merge Patch 17612092
11.2.0.3.21 or later              No action required
11.2.0.3.19 or 11.2.0.3.20        Install Patch 17446237
11.2.0.3.9 through 11.2.0.3.18    No action required
11.2.0.3.0 through 11.2.0.3.8     Update to version that meets minimum requirement
11.2.0.2 BP22                     No action required
11.2.0.2 BP20 or 11.2.0.2 BP21    Install Patch 17342825
11.2.0.2 BP19 or earlier          Update to version that meets minimum requirement
11.2.0.1                          No longer under Error Correction Support
                                  Update to version that meets minimum requirement
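
The version-to-action mapping in the table above can be expressed as a small lookup. This is a sketch in Python under stated assumptions, not an Oracle utility: the function name required_action and its two-argument form (four-part release plus a numeric patch level) are hypothetical. For 11.2.0.3 the patch level is the fifth version digit, for 11.2.0.2 it is the bundle patch (BP) number. Combinations the table does not list raise an error rather than guess.

```python
UPDATE = "Update to version that meets minimum requirement"


def required_action(release, patch_level=None):
    """Return the action from the PATCHES table (hypothetical helper).

    release: four-part database home release, e.g. "11.2.0.3".
    patch_level: fifth digit for 11.2.0.3 (20 for 11.2.0.3.20) or the
                 bundle patch number for 11.2.0.2 (21 for BP21).
    """
    if release == "11.2.0.4":
        return "Install Patch 17492065 (included in merge Patch 17612092)"
    if release == "11.2.0.1":
        return "No longer under Error Correction Support. " + UPDATE
    if release not in ("11.2.0.3", "11.2.0.2"):
        raise ValueError("release not covered by the table: " + release)
    if patch_level is None or patch_level < 0:
        raise ValueError("a non-negative patch level is required for " + release)
    if release == "11.2.0.3":
        if patch_level >= 21:
            return "No action required"
        if patch_level >= 19:
            return "Install Patch 17446237"
        if patch_level >= 9:
            return "No action required"
        return UPDATE
    # release == "11.2.0.2"
    if patch_level == 22:
        return "No action required"
    if patch_level in (20, 21):
        return "Install Patch 17342825"
    if patch_level <= 19:
        return UPDATE
    # The table lists BP22 only, so later bundle patches are not inferred.
    raise ValueError("11.2.0.2 bundle patch not covered by the table")


print(required_action("11.2.0.3", 20))
print(required_action("11.2.0.2", 22))
```

The lookup covers only the database home. The Exadata Storage Server Software minimum stated in this note still has to be checked separately.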

Note the minimum requirements for using WBFC, in addition to the fixes listed above:

 Exadata Storage Software version 11.2.3.2.1 or later
