003 Huawei OceanStor Storage Reliability Technologies
2 Huawei Confidential
Contents
2. Module-Level Reliability
3. System-Level Reliability
4. Solution-Level Reliability
5. O&M Reliability
Storage Reliability Metrics
• MTBF (Mean Time Between Failures): failure rate λ = 1/MTBF, where 1 FIT = 10^-9 failures per hour.
  Return repair rate F(t) = λ × t; annual return repair rate = λ × 8760.
• MTTR (Mean Time To Recover): maintainability metric; repair rate μ = 1/MTTR.
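The relationships above can be sketched in a few lines of Python (the 2,000,000-hour MTBF is an illustrative figure, not a product specification):

```python
def failure_rate(mtbf_hours: float) -> float:
    """Failure rate: lambda = 1/MTBF, in failures per hour."""
    return 1.0 / mtbf_hours

def return_repair_rate(mtbf_hours: float, hours: float) -> float:
    """Expected return repair rate over a period: F(t) = lambda * t."""
    return failure_rate(mtbf_hours) * hours

def annual_return_repair_rate(mtbf_hours: float) -> float:
    """Annual rate uses 8760 hours per year."""
    return return_repair_rate(mtbf_hours, 8760)

def fits(mtbf_hours: float) -> float:
    """FIT = failures per 10^9 device-hours."""
    return failure_rate(mtbf_hours) * 1e9

# Illustrative example: a disk with a 2,000,000-hour MTBF.
mtbf = 2_000_000
print(f"lambda = {failure_rate(mtbf):.2e} /h")   # 5.00e-07 /h
print(f"FIT = {fits(mtbf):.0f}")                 # 500
print(f"annual = {annual_return_repair_rate(mtbf):.4%}")  # 0.4380%
```
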
Overview of Storage System Reliability
• Level 3: Solution level — data reliability, service availability, and O&M reliability. Data protection and disaster recovery (DR) solutions provide system-level data protection and DR: HyperSnap (snapshot), HyperClone (data clone), HyperMetro (active-active), HyperReplication (remote replication), 3DC (geo-redundancy), intelligent prediction, and fast upgrade.
• Level 2: System level — system-level reliability design enables fault self-healing and data integrity protection for a system: reconstruction offloading, dynamic reconstruction, HyperMetro-Inner (high-end models), overload control, bad block/sector scanning, online diagnosis, I/O data protection, quick response to slow I/Os, and isolation of slow disks.
• Level 1: Module level — hardware reliability of components, devices, environment, production, and disks. Lean manufacturing and processing ensure the yield rate.
Reliability of Huawei-developed HSSDs
• Chip reliability: error detecting, error handling, and BIST
• Media reliability: ECC/CRC, RAID, bad block management, and backup power
• Board reliability: key hardware signals, storage components, and board process/material
• Running reliability: soft failure prevention, isolation of faulty components, reliable upgrade, reliable expansion, load balancing algorithm, error correction algorithm, protocol data integrity, data storage redundancy, and running fault detection and self-healing
Module-Level Reliability - Environment
Module-Level Reliability - Disk Reliability
Module-Level Reliability - HSSD Reliability
• Die-level multi-copy & RAID: metadata is protected with multiple copies and user data with RAID across dies.
• Data restoration: LDPC, read retry, and intra-disk XOR enable data restoration using redundancy.
• Wear leveling: periodically moves data blocks so that blocks with less wear can be used again.
• Background inspection: combines read inspection and write inspection and proactively reports bad blocks detected during inspection.
• Bad block isolation: detects, migrates, and isolates bad blocks.
• Online self-healing: restores a disk to its factory settings online.
• Die failure handling: active reporting and capacity reduction.
• Power failure protection: capacitor power is used to flush dirty data when a power failure occurs.
Module-Level Reliability - HSSD RAID
Conventional SSD: silent data corruption (D3) + a single-disk fault → data loss. HSSD: bad block (die) self-recovery + a single-disk fault → no data loss.
[Figure: CKG1 stripes D0–D3 + P across disks 1–5 (inter-disk RAID); disk 4 additionally runs intra-HSSD RAID across its dies with a parity die column.]
• Without RAID: silent data corruption may occur on disks (HDDs may have bad sectors and SSDs may have bad blocks, on which data is unavailable). If such bad sectors or blocks are seldom accessed, the corruption cannot be detected or rectified in time. Once a disk fails, data in these bad sectors or blocks cannot participate in reconstruction, resulting in loss of user data.
• With RAID: HSSDs periodically scan data blocks and restore detected bad blocks using intra-disk RAID. In addition, data in bad blocks can be restored in real time using inter-disk RAID when these blocks are accessed by a host or participate in data reconstruction. In this way, data is not lost.
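The restore-from-parity step that intra-HSSD RAID relies on can be illustrated with a minimal XOR-parity sketch in Python (die contents and block sizes are made up for the example; real HSSDs operate on full flash pages):

```python
from functools import reduce

def xor_blocks(blocks):
    """Byte-wise XOR of equal-length blocks."""
    return bytes(reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), blocks))

def parity(data_blocks):
    """Compute the parity column over the data columns."""
    return xor_blocks(data_blocks)

def restore(surviving_blocks, parity_block):
    """Recover the one missing data block from the survivors plus parity."""
    return xor_blocks(surviving_blocks + [parity_block])

# Three hypothetical dies' data plus one parity column.
dies = [b"\x01\x02", b"\x10\x20", b"\xaa\x55"]
p = parity(dies)

# Die 1 fails: rebuild its data from the other dies and the parity column.
recovered = restore([dies[0], dies[2]], p)
assert recovered == dies[1]
```
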
High Service Availability
E2E Redundancy Design
[Figure: a host connects through front-end interface modules to two controller enclosures (controllers A–D each); the controllers interconnect through inter-enclosure interface modules and reach disk enclosures over a full-mesh back end with RAID-protected disks.]
1. Controller failover within seconds.
2. Services are not interrupted if multiple controllers are faulty (HyperMetro-Inner): three cache copies, continuous mirroring, cross-engine mirroring, and a full-mesh back end.
3. Services are not affected in the case of a software fault: process availability detection, startup of processes within seconds upon a fault, and intermittent isolation of frequently abnormal background tasks.
4. Services are not interrupted if multiple disks are faulty: EC-2/EC-3 for user data.
5. Controllers do not reset if interface modules for inter-controller exchange are faulty: multiple interface modules provide redundancy on high-end storage, and TCP forwarding covers mid-range and entry-level storage with a single interface module.
High Service Availability - Controller Failover
Quick Controller Failover (Within Seconds)
1. The host delivers I/Os to controller A: if all controllers are normal, I/Os are delivered to controller A through FIM 1.
2. Controller A becomes faulty: FIM 1 and controller B detect that controller A is unavailable by means of interrupts.
3. Service switchover: services are quickly (within 1s) switched to controller B, which holds data copies of controller A, by switching the vNode. The FIMs are then instructed to refresh the distribution view.
4. I/O path switchover: FIM 1 returns BUSY for the I/Os that have already been delivered to controller A. The retried and new I/Os delivered by the host go to controller B based on the new view.
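A toy model of the distribution-view switch described above is sketched below (class and method names are hypothetical; the real failover also migrates cache state and is driven by hardware interrupts):

```python
class FrontEndModule:
    """Hypothetical sketch of a FIM's vNode distribution view."""

    def __init__(self, view):
        self.view = view                    # vNode -> owning controller
        self.healthy = set(view.values())   # controllers considered alive

    def deliver(self, vnode, io):
        """Route an I/O; return BUSY if the owner is known to be down."""
        ctrl = self.view[vnode]
        if ctrl not in self.healthy:
            return "BUSY"                   # host retries with the new view
        return f"{ctrl}:{io}"

    def fail_controller(self, ctrl, standby):
        """Switch every vNode owned by the failed controller to its mirror."""
        self.healthy.discard(ctrl)
        for vnode, owner in self.view.items():
            if owner == ctrl:
                self.view[vnode] = standby

fim = FrontEndModule({"v0": "A", "v1": "B"})
assert fim.deliver("v0", "write") == "A:write"
fim.fail_controller("A", "B")       # B holds A's cache copies
assert fim.deliver("v0", "write") == "B:write"
```
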
High Service Availability - Continuous Mirroring
[Figure: cache copy layout across controllers A–D in three states: normal operation, after controller A fails, and after controller D also fails; mirror copies (1*, 2*, 3*, 4*) are redistributed among the surviving controllers.]
Continuous mirroring (ensuring service continuity even when seven out of eight controllers are faulty): if controller A is faulty, controller B selects controller C or D as its cache mirror. If controller D fails at that moment, cache mirroring is implemented between controllers B and C to maintain dual-copy redundancy.
Service continuity: if a controller fails, its mirror controller establishes a mirror relationship with another functional controller within 5 minutes. This design increases service availability by one nine and ensures service continuity in the event that multiple controllers fail successively.
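The re-pairing rule can be sketched as a simple greedy choice (a simplification; the real system also weighs load and engine topology, which this ignores):

```python
def select_mirror(owner, controllers, healthy):
    """Pick a new cache-mirror partner for `owner` from the healthy
    controllers (hypothetical sketch of the re-pairing rule)."""
    candidates = [c for c in controllers if c != owner and c in healthy]
    if not candidates:
        raise RuntimeError("no healthy mirror available: single-copy risk")
    return candidates[0]

controllers = ["A", "B", "C", "D"]
healthy = {"B", "C", "D"}                       # controller A failed
assert select_mirror("B", controllers, healthy) == "C"
healthy.discard("D")                            # controller D fails next
assert select_mirror("B", controllers, healthy) == "C"  # B-C dual copies remain
```
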
High Service Availability - HyperMetro-Inner
[Figure: four-controller enclosures with a shared front end and shared back end; smart disk enclosures connect to two controller enclosures through BIMs.]
• The global cache provides three cache copies across controller enclosures. If two controllers fail simultaneously, at least one cache copy is available. A single controller enclosure can tolerate the simultaneous failure of two controllers with the three-copy technology.
• A smart disk enclosure connects to 8 controllers (in 2 controller enclosures) through BIMs. If a controller enclosure fails, at least one cache copy is available.
High Service Availability - SmartQoS
[Figure: LUNs whose traffic does not reach the lower limit — no suppression; LUNs whose traffic reaches the lower limit — burst prevention; LUNs whose traffic far exceeds the lower limit — traffic suppression.]
Burst Quota
• Token accumulation: if the performance of a LUN, snapshot, LUN group, or host is lower than the upper threshold within a second, a one-second burst duration is accumulated. When service pressure suddenly increases, performance can exceed the upper limit and reach the burst traffic. The accumulated tokens are used by the current objects and last for the configured duration, so the system can respond to burst traffic in time.
Lower Limit Guarantee
• Minimum traffic: each LUN is configured with minimum traffic (IOPS/bandwidth) by default, and this minimum must be ensured when the LUN is overloaded.
• Traffic suppression for high-load LUNs: when the system is overloaded and the traffic of some LUNs does not reach the lower limit, the system performs load rating on all LUNs. It provides a looser traffic condition for medium- and low-load LUNs based on their load status, and suppresses the traffic of high-load LUNs until the system releases sufficient resources for all LUNs to reach the lower limit.
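The token-accumulation behavior can be modeled as a small token bucket (the limits and the one-token-per-second accrual are illustrative; the real SmartQoS policy engine is more elaborate):

```python
class BurstQuota:
    """Hypothetical token-bucket sketch of the burst quota described above:
    each second spent below the upper limit accrues one second of burst
    credit, capped at a configured duration."""

    def __init__(self, upper_limit, burst_limit, max_burst_seconds):
        self.upper = upper_limit        # normal per-second IOPS cap
        self.burst = burst_limit        # cap while bursting
        self.max_tokens = max_burst_seconds
        self.tokens = 0

    def allowed_iops(self, demanded):
        """Decide the IOPS granted for one second of demand."""
        if demanded < self.upper:
            # Under-use: accumulate one second of burst duration.
            self.tokens = min(self.tokens + 1, self.max_tokens)
            return demanded
        if self.tokens > 0:
            # Burst: spend one accumulated second, allow up to burst limit.
            self.tokens -= 1
            return min(demanded, self.burst)
        return self.upper

q = BurstQuota(upper_limit=1000, burst_limit=2000, max_burst_seconds=10)
assert q.allowed_iops(500) == 500      # accrues one token
assert q.allowed_iops(1800) == 1800    # spends it, bursting past the limit
assert q.allowed_iops(1800) == 1000    # no tokens left: capped at upper limit
```
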
Solid Data Reliability
E2E Data Protection
[Figure: host data cached as copies C1–C8 spread across controllers A–D in two controller enclosures, then written to disks with RAID 2.0+ (e.g., 22+2, falling back to 21+2 on a fault) and intra-disk die-level RAID.]
1. Cache data redundancy: two or three copies of cache data ensure no data loss when multiple controllers or a single controller enclosure is faulty.
2. Disk data redundancy: RAID 2.0+ ensures that user data on disks is not lost when multiple disks fail consecutively or simultaneously. Data reconstruction is offloaded to smart disk enclosures, further ensuring data reliability.
3. Intra-disk data redundancy: RAID 4 ensures die-level redundancy within a disk, preventing user data loss in the case of bad blocks or die failure.
4. Redundancy maintained even when RAID disks are insufficient: dynamic reconstruction, in which fewer data disks are involved, maintains redundancy if the number of member disks does not meet RAID requirements.
5. E2E data consistency: E2E PI and parent-child hierarchy verification ensure that data on I/O paths is not damaged.
Solid Data Reliability - Multiple Cache Copies
Data will not be lost if two controllers are faulty: Three copies of cache data are supported. For host data with the
same LBA, the system creates a pair of cache data copies on two controllers and the third copy on another
controller.
Data will not be lost if a controller enclosure is faulty: When the system has two or more controller enclosures,
three copies of cache data are saved on controllers in different controller enclosures. This ensures that cache data
will not be lost in the event that a controller enclosure (containing four controllers) is faulty.
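The placement rule can be sketched as follows (controller names and the pick-first ordering are hypothetical; the real system also balances copy placement by load):

```python
def place_cache_copies(enclosures):
    """Hypothetical placement sketch for three cache copies: a pair of
    copies on two controllers in one enclosure, and the third copy on a
    controller in a different enclosure when one exists."""
    first, second = enclosures[0][0], enclosures[0][1]
    if len(enclosures) > 1:
        third = enclosures[1][0]      # cross-enclosure third copy
    else:
        third = enclosures[0][2]      # single enclosure: a third controller
    return [first, second, third]

two_boxes = [["A", "B", "C", "D"], ["E", "F", "G", "H"]]
assert place_cache_copies(two_boxes) == ["A", "B", "E"]
# Losing the whole first enclosure still leaves the copy on controller E.
```
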
Solid Data Reliability - RAID 2.0+
[Figure: a conventional LUN maps onto fixed RAID groups with a dedicated hot-spare disk, whereas block virtualization-based RAID (RAID 2.0+) builds RAID across chunks drawn from all disks.]
Solid Data Reliability - Fast Reconstruction
• Reconstruction using conventional RAID: during reconstruction, data is read from the other functional disks and reconstructed; the reconstructed data is then written to a hot-spare disk or a new disk. The write performance of that single disk restricts reconstruction, so it takes a long time.
• Reconstruction using RAID 2.0+: RAID 2.0+ supports dozens of member disks. When a disk fails, the other disks participate in reconstruction reads and writes, greatly shortening the reconstruction time. As more disks share the reconstruction load, the load on each disk decreases significantly.
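The effect of spreading reconstruction writes can be seen with a back-of-the-envelope model (disk capacity and throughput figures are illustrative, not measured values):

```python
def rebuild_hours(failed_disk_tb, per_disk_write_mbps, writers):
    """Estimated rebuild time when `writers` disks share the reconstruction
    writes (an illustrative linear model, not vendor-measured figures)."""
    total_mb = failed_disk_tb * 1024 * 1024
    return total_mb / (per_disk_write_mbps * writers) / 3600

# Conventional RAID: one hot-spare disk absorbs every reconstructed write.
conventional = rebuild_hours(8, 100, writers=1)

# RAID 2.0+: dozens of member disks share the reconstruction writes.
raid20 = rebuild_hours(8, 100, writers=24)

# With 24 writers the rebuild time shrinks by the same factor.
assert abs(conventional / raid20 - 24) < 1e-9
```
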
Solid Data Reliability - Reconstruction Offloading
Controller-based Reconstruction
[Figure: the controller (1) initiates a disk read request, (2) reads the member blocks from the disk enclosures, (3) calculates the faulty blocks, and (4) writes the faulty blocks into the hot-spare space.]
1. Reconstruction occupies controller computing resources (CPU): if one or more disks are faulty, all data is computed on the controller, which can overload the controller CPU and adversely affect host I/O processing.
2. Reconstruction occupies massive data write bandwidth: all data on the disks in the RAID group is read to the controller for computing, occupying data bandwidth and affecting the host I/O write bandwidth.
Solid Data Reliability - Dynamic Reconstruction
[Figure: normal RAID 4+2 across disks 1–6; when a member disk fails with insufficient disks remaining, dynamic reconstruction rebuilds the stripe as RAID 3+2 (D0+D2+D3 = P'+Q').]
For a RAID group with M+N members (M data columns and N parity columns):
• Common reconstruction: when a disk is faulty, the system uses an idle CK to replace the faulty one and restores data to the idle CK. If the number of member disks in the disk domain is fewer than M+N, two CKs reside on the same disk, decreasing the RAID redundancy level.
• Dynamic reconstruction: if the number of member disks in the disk domain is fewer than M+N, the system reduces the number of data columns (M) and retains the number of parity columns (N) during reconstruction. This method retains the RAID protection level, ensuring system reliability.
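The column-reduction rule can be expressed directly (a sketch of the M/N bookkeeping only; actual reconstruction also remaps CKs and recomputes parity):

```python
def dynamic_raid_layout(m, n, available_disks):
    """Sketch of the dynamic-reconstruction rule: if fewer than M+N member
    disks remain, shrink the data columns (M) while keeping the parity
    columns (N), preserving the protection level."""
    while m + n > available_disks and m > 1:
        m -= 1
    return m, n

assert dynamic_raid_layout(4, 2, available_disks=6) == (4, 2)   # normal RAID 4+2
assert dynamic_raid_layout(4, 2, available_disks=5) == (3, 2)   # rebuilt as 3+2
```
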
Solid Data Reliability - E2E Data Protection
[Figure: each 512-byte sector carries an 8-byte PI field; PI is verified at multiple protection points along the I/O path, and child metadata stores the data checksum.]
• Data protection at hardware boundaries: data verification is performed at multiple key nodes on I/O paths within a storage system, including front-end chips, the controller software front end, the controller software back end, and back-end chips.
• Multi-level software-based data protection: for every 512 bytes of data, in addition to the 8-byte PI (two bytes of which are CRC bytes), the system extracts the CRC bytes of 16 PI sectors to form a checksum and stores it in the metadata node. If one or more pieces of data (512+8) are skewed, the checksum changes and becomes inconsistent with the one saved in the metadata node. When the system reads the data and detects the inconsistency, it uses RAID redundancy data on other disks to recover the erroneous data, preventing data loss.
• Metadata protection: metadata is organized in a tree structure, and a parent metadata node stores the CRC values of its child nodes, similar to the relationship between data and metadata. If metadata is damaged, it can be verified and restored using the parent and child nodes.
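The checksum-of-CRCs idea can be demonstrated in Python (zlib's CRC-32 stands in for the 16-bit PI guard CRC, and the layout is illustrative, not the on-disk format):

```python
import zlib

SECTOR = 512

def guard_crc(sector: bytes) -> bytes:
    """CRC of one 512-byte sector, standing in for the guard field of the
    8-byte PI (CRC-32 here; the real PI guard is a 16-bit CRC)."""
    return zlib.crc32(sector).to_bytes(4, "big")

def metadata_checksum(sectors) -> int:
    """Combine the guard CRCs of 16 sectors into one value, mirroring the
    checksum the slide says is stored in the metadata node (illustrative)."""
    assert len(sectors) == 16
    return zlib.crc32(b"".join(guard_crc(s) for s in sectors))

sectors = [bytes([i]) * SECTOR for i in range(16)]
stored = metadata_checksum(sectors)          # saved in the metadata node

flipped = bytearray(sectors[3])
flipped[0] ^= 0x01                           # a single-bit silent corruption
sectors[3] = bytes(flipped)

# The recomputed checksum no longer matches the stored one, so the system
# knows to recover the sector from RAID redundancy on other disks.
assert metadata_checksum(sectors) != stored
```
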
Solid Data Reliability - E2E PI Verification
• Service layer: inserts PI when writing. When reading, it calculates the data CRC and compares it with the existing PI; if they are inconsistent, the data is damaged, so the system restores it using RAID and returns a success message to the host.
• Data layer: verifies PI and marks the data as incorrect if verification fails. On reads, it verifies PI; if verification fails, the system considers the data on the disk damaged, uses RAID to restore it, and returns a success message to the upper layer.
• Disk: stores the data together with its PI.
High Disk Fault Tolerance
• Operation management: operation records, environment monitoring, hibernation, and firmware management
• Access optimization: access model optimization, service flow control and balancing, separation of user data and metadata, and wear leveling/anti-wear leveling
• Diagnosis & prediction: health evaluation, bad sector/block scanning and repair, online diagnosis, and sub-health status prediction
• Access configuration: access authentication, parameter configuration, and resource pre-allocation
• Fault tolerance & recovery: quick response to slow I/Os, refined error code handling, deep inspection, and NPF problem processing
• Isolation & offline: isolation of slow disks, fault isolation, data restoration and migration, bit error and intermittent disconnection processing, state maintaining, and secure offlining
Solid Disk Reliability - Wear Leveling and Anti-Wear Leveling
Global wear leveling and anti-wear leveling: SSDs can only withstand a limited number of read and write operations. The system evenly distributes workloads across the SSDs, preventing some disks from failing due to continuous frequent access.
[Figure: wear percentages across SSD0–SSD9 are evened out by wear leveling.]
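Wear leveling's water-filling behavior can be sketched with a greedy loop (the write counts are illustrative, and real wear leveling also relocates existing cold data rather than only routing new writes):

```python
def wear_level(write_counts, incoming_writes):
    """Greedy sketch of global wear leveling: route each unit of new write
    work to the least-worn SSD so wear stays even across the pool."""
    counts = list(write_counts)
    for _ in range(incoming_writes):
        least = counts.index(min(counts))   # pick the least-worn disk
        counts[least] += 1
    return counts

pool = [90, 10, 50, 50]          # very uneven wear across four SSDs
after = wear_level(pool, 160)
assert after == [90, 90, 90, 90]  # new writes filled in the low-wear disks
```
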
Solid Disk Reliability - Disk Health Prediction
[Figure: host I/Os over an IP/FC SAN; data blocks 3, 4, and P1 in a RAID/CKG are migrated to 3', 4', and P1'.]
Technical Highlights
• Disk vulnerabilities can be detected and removed as early as possible to minimize system risks caused by disk failure.
Solid Disk Reliability - Online Diagnosis
Solid Disk Reliability - Quick Response to Slow I/Os
[Figure: a slow I/O on one disk (x) is served from RAID redundancy using blocks a, b, c, and d.]
• Slow I/Os occurring within a short period can be eliminated before disks are isolated.
• I/O requests can be responded to in a timely manner, preventing services from being affected by long retry and recovery policies.
Solid Disk Reliability - Isolation of Slow Disks
Technical Highlights
• When all disks in the system become slow, the response time of a single disk is not much greater than that of the other disks. In this case, no disk is isolated, which avoids unnecessary isolation.
• Slow disks are identified and isolated only after diagnosis and repair.
[Figure: an abnormal disk is isolated from the normal disks.]
[Figure: Dorado V6 storage systems interconnected over IP networks with an optional quorum device; cloud backup to HEC/AWS.]
4. HyperReplication: synchronous and asynchronous data replication across storage systems provides intra-city and remote data protection.
5. CloudBackup (Cloud Backup Solution): public cloud object storage is used as the backup storage to prevent data loss caused by faults on storage devices in the enterprise DC.
Solution-Level Reliability - HyperSnap
Overview
• HyperSnap quickly captures online data and generates snapshots of source data at specified points in time without interrupting system services, preventing data loss caused by viruses or misoperations. These snapshots can be used for backup and testing.
[Figure: a source LUN on OceanStor with snapshot LUNs taken at 08:00, 12:00, 16:00, and 20:00.]
Highlights
• The innovative multi-time-segment cache technology can continuously activate snapshots at intervals of several seconds. Activating snapshots does not block host I/Os, and host services receive quick responses.
• Based on the RAID 2.0+ virtualization architecture, the system flexibly allocates storage space for snapshots, making reserved resource pools unnecessary.
Solution-Level Reliability - HyperClone
Overview
• HyperClone generates a complete physical copy (target LUN) of the production LUN (source LUN) at a point in time. The copy can be used for backup, testing, and data analysis.
Highlights
• A complete, consistent physical copy
• Isolation of source and target LUNs, eliminating mutual performance impact
• Consistency groups, enabling consistent splitting of multiple LUNs
• Incremental synchronization
• Reverse incremental synchronization (from the target LUN to the source LUN)
Solution-Level Reliability - HyperMetro
One Huawei OceanStor storage system is deployed in each of DC A and DC B in active-active mode. The two storage systems concurrently provide read and write services for service hosts in the two DCs. No data is lost in the event of the breakdown of either DC.
[Figure: Oracle RAC/VMware vSphere/FusionSphere clusters span DC A and DC B over FC/IP SANs and a WAN, with real-time data synchronization between the production storage systems.]
HyperMetro Design
• Active-active architecture: active-active LUNs are readable and writable in both DCs, and data is synchronized in real time.
• High reliability design: dual-arbitration mechanism and cross-site protection.
• Flexible scalability design: SmartVirtualization, HyperSnap, and HyperReplication are supported. HyperMetro can be expanded into the Disaster Recovery Data Center Solution (Geo-Redundant Mode).
Solution-Level Reliability - HyperReplication
Overview
• HyperReplication supports both synchronous and asynchronous remote replication between storage systems. It is used in disaster recovery solutions to provide intra-city and remote data protection, preventing data loss caused by disasters and improving business continuity.
[Figure: replication from the production center to the DR center.]
Solution-Level Reliability - 3DC
[Figure: two 3DC topologies. Cascading: the production center replicates synchronously/asynchronously to the intra-city DR center, which replicates asynchronously to remote DR center C. Parallel: the production center replicates to both the intra-city and remote DR centers.]
Cascading mode:
1. The intra-city DR center undertakes remote replication tasks, with minor impact on services in the production center.
2. If the storage system in the production center malfunctions, the intra-city DR center takes over services and keeps a data replication relationship with the remote DR center.
Parallel mode: if the storage system in the production center malfunctions, the intra-city or remote DR center can quickly take over services.
Upgrade Transparent to Hosts
[Figure: a controller enclosure with front-end interconnect I/O modules; each controller (A–D) runs an I/O holding module and software components 1–N, upgraded in two phases.]
• Host unaware of upgrade: each controller has an I/O holding module, which holds current I/Os during component upgrade and restart. After being upgraded, components continue to process the held I/Os, so hosts do not detect any connection interruption or I/O exception.
• Component upgrade: the system upgrade is divided into two phases. The software components (processes) with redundant units are upgraded first. After the software packages are uploaded and the processes are restarted, the second phase is triggered.
• Zero performance loss: each software component restarts within 1s. The front-end interface module returns BUSY for I/Os that fail during the upgrade. The host re-delivers the failed I/Os, and performance is restored to 100% within 2 seconds.
• Short upgrade duration: no host compatibility issue is involved, and host information does not need to be collected for evaluation. With fast restart, the entire storage system can be upgraded within 10 minutes because the controllers do not need to be fully restarted.
O&M Reliability - Intelligent Prediction
• Verifies enterprise data center models for over 600 days.
• Optimizes idle resources to improve resource utilization.
• Evaluates capacity requirements and provides detailed expansion solutions.
Summary
Key Reliability Technologies of OceanStor Dorado V6