
RAID Technology

Objectives

Upon completion of this course, you will be able to understand:

• RAID basics

• RAID levels

• RAID 2.0+

• Replication protection and erasure coding


Agenda

1. RAID Basics
2. RAID Levels
3. RAID 2.0+
4. Replication Protection and Erasure Coding
RAID Basics — What Is RAID?
Redundant Array of Independent Disks (RAID) is an enabling technology that leverages multiple drives
as part of a set that provides data protection against drive failures. In general, RAID implementations
also improve the storage system performance by serving I/Os from multiple disks simultaneously.

Two methods to implement RAID:
• Hardware RAID: a specialized hardware controller is required, usually in the hosts (servers).
• Software RAID: implemented in the OS; no additional hardware is required. Software RAID is widely adopted in enterprise storage.

[Figure: a RAID controller with a BBU combines four physical drives into one logical drive, improving capacity, performance, and reliability.]

A subset of disks within a RAID array can be grouped to form logical associations called logical arrays, also known as a RAID set or a RAID group.

RAID Basics — Striping
Striping is a technique that spreads data across multiple drives so that the drives work in parallel. All drives work simultaneously, allowing more data to be processed in a shorter time and increasing performance compared to reading from and writing to a single disk.

• Strip: in each disk of a RAID set, a predefined number of contiguously addressable disk blocks.
• Stripe: the set of strips that spans all the disks in the RAID set.
• Stripe size: the strip size multiplied by the number of data disks in the RAID set.
• Stripe depth (strip size): the number of blocks in a strip, and the maximum amount of data that can be written to or read from a single disk in the set.
• Stripe width: the number of data strips in a stripe.

In a five-disk striped RAID set with a strip size of 64 KB, the stripe size is 320 KB (64 KB x 5) and the stripe width is 5.
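To make the address math concrete, here is a minimal sketch of how a logical offset maps onto a striped set. The constants and the map_offset() helper are illustrative assumptions for this example, not any product's actual layout.

```python
# A minimal sketch of striping address math for the 5-disk example above.

STRIP_SIZE_KB = 64   # stripe depth: max contiguous data on one disk
N_DISKS = 5          # disks in the striped RAID set

STRIPE_SIZE_KB = STRIP_SIZE_KB * N_DISKS  # 320 KB for this example

def map_offset(logical_kb: int) -> tuple[int, int]:
    """Map a logical offset (KB) to (disk index, offset on that disk in KB)."""
    stripe_index = logical_kb // STRIPE_SIZE_KB        # which stripe
    within_stripe = logical_kb % STRIPE_SIZE_KB
    disk = within_stripe // STRIP_SIZE_KB              # which strip -> which disk
    offset_on_disk = stripe_index * STRIP_SIZE_KB + within_stripe % STRIP_SIZE_KB
    return disk, offset_on_disk

# A 256 KB logical offset lands at the start of disk 4 in stripe 0:
print(map_offset(256))  # (4, 0)
```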

RAID Basics — Mirroring and Parity
Mirroring is a technique whereby the same data is stored on two different disk drives. Read performance is improved because either mirror can serve a read, while write performance is lower than that of a single disk because every write must go to both drives. Space utilization is 50%.

Parity is a redundancy technique that ensures protection of data without maintaining a full set of duplicate data. Additional disk drives are added to hold parity, a mathematical construct that allows re-creation of the missing data.

[Figure: mirroring writes blocks A-E identically to both disks of a mirrored pair. The parity example shows four data disks and one parity disk, where each stripe's parity is computed from its data values, e.g. 3 + 1 + 2 + 3 = 9.]

• Parity improves space utilization; it is higher than with mirroring.
• Parity is recalculated every time there is a change in data.
• For parity RAID, the stripe size calculation does not include the parity strip. For example, in a five-disk (4+1) parity RAID set with a strip size of 64 KB, the stripe size will be 256 KB (64 KB x 4).
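In practice, parity RAID uses bitwise XOR rather than the arithmetic sums in the figure. A minimal sketch, assuming XOR parity over equal-sized strips (the xor_parity() helper is illustrative):

```python
# XOR parity: the parity strip re-creates any single lost data strip.

def xor_parity(strips: list[bytes]) -> bytes:
    """Compute the parity strip as the byte-wise XOR of all given strips."""
    parity = bytearray(len(strips[0]))
    for strip in strips:
        for i, b in enumerate(strip):
            parity[i] ^= b
    return bytes(parity)

data = [b"AAAA", b"BBBB", b"CCCC", b"DDDD"]   # one 4+1 stripe
parity = xor_parity(data)

# If one data strip is lost, XOR of the survivors and the parity
# re-creates it:
lost = 2
survivors = [s for i, s in enumerate(data) if i != lost]
assert xor_parity(survivors + [parity]) == data[lost]
```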

RAID Basics — Hot Spare and Rebuild
Hot spare disks are standby disk drives kept on active standby for use when a disk drive fails. For example, Ah in the figure below is a hot spare disk.
• Global hot spare: a spare drive shared by all RAID groups in the storage array.
• Dedicated hot spare: a spare drive dedicated to one RAID group.

Rebuild: a RAID rebuild is the data reconstruction process that occurs when a hard disk drive needs to be replaced. When a disk fails unexpectedly, the RAID array reconstructs the failed drive's data onto a spare drive, using RAID algorithms and parity data, while the failed drive is replaced. During the rebuild, the performance of some applications or processes may be negatively affected by latency.

[Figure: a stripe of data strips A0, A1, A2 plus a parity strip across three data disks and one parity disk; when a data disk reports an error, its contents are rebuilt onto the hot spare disk Ah and the failed disk is replaced.]
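A minimal sketch of a rebuild onto a hot spare. The layout (three data disks plus one parity disk, two stripes) and all names are illustrative assumptions:

```python
# Rebuild = XOR the surviving strips of each stripe onto the spare.

def xor_strips(strips: list[bytes]) -> bytes:
    """Byte-wise XOR of equal-length strips (same helper as the parity example)."""
    out = bytearray(len(strips[0]))
    for s in strips:
        for i, b in enumerate(s):
            out[i] ^= b
    return bytes(out)

def rebuild_to_spare(surviving_disks: list[list[bytes]]) -> list[bytes]:
    """XOR the survivors stripe by stripe to regenerate the failed disk."""
    return [xor_strips(list(stripe)) for stripe in zip(*surviving_disks)]

d0 = [b"A0A0", b"B0B0"]
d1 = [b"A1A1", b"B1B1"]                                     # this disk fails
d2 = [b"A2A2", b"B2B2"]
p  = [xor_strips([d0[i], d1[i], d2[i]]) for i in range(2)]  # parity disk

assert rebuild_to_spare([d0, d2, p]) == d1  # the spare now holds d1's data
```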

Agenda

1. RAID Basics
2. RAID Levels
3. RAID 2.0+
4. Replication Protection and Erasure Coding

RAID Levels — Overview
Application performance, data availability requirements, and cost determine RAID level selection.
These RAID levels are defined on the basis of striping, mirroring, and parity techniques. Some RAID
levels use a single technique, whereas others use a combination of techniques. The following table
provides a brief summary:

Level Description
RAID 0 Striped set with no fault tolerance
RAID 1 Disk mirroring
RAID 3 Striped set with parallel access and a dedicated parity disk
RAID 4 Striped set with independent disk access and a dedicated parity disk
RAID 5 Striped set with independent disk access and distributed parity
RAID 6 Striped set with independent disk access and dual distributed parity
Nested RAID Combinations of RAID levels, for example, RAID 10, RAID 50, and RAID 60
RAID-TP Striped set with triple parity, tolerating three disk failures in one RAID set

RAID Levels — RAID 0
RAID 0 configuration uses data striping techniques, where data is striped across all the disks within a
RAID set. It utilizes the full storage capacity of a RAID set, without providing data protection and
availability.
[Figure: an incoming data stream A, B, C, D, E, F, G, H, ... is striped across four disks: A, E, I on disk 1; B, F, J on disk 2; C, G, K on disk 3; D, H, L on disk 4.]

Read and write performance of RAID 0 improves as the number of disks in the RAID set increases. RAID 0 is good for applications that require high I/O throughput but have no availability requirements.

RAID Levels — RAID 1
RAID 1 is based on the mirroring technique. In this RAID configuration, data is mirrored to provide
fault tolerance.
[Figure: an incoming data stream A, B, C, D, E, ... is written identically to both disks of a mirrored pair.]

RAID 1 is suitable for applications that require high availability and performance without cost constraints, such as read-intensive OLTP applications and operating systems. Utilization of RAID 1 is only 50%.

RAID Levels — RAID 3 and RAID 4
RAID 3 stripes data for performance and uses parity for fault tolerance. Parity information is stored on a dedicated drive so that the data can be reconstructed if a drive fails in the RAID set. RAID 3 always reads and writes complete stripes of data across all disks because the drives operate in parallel. RAID 3 provides good performance for large sequential data access, such as video streaming.

Similar to RAID 3, RAID 4 stripes data for performance and uses parity for improved fault tolerance, with parity information stored on a dedicated disk. Unlike in RAID 3, data disks in RAID 4 can be accessed independently, so specific data elements can be read or written on a single disk without reading or writing an entire stripe. RAID 4 is rarely used.

[Figure: data A, B, C, D, ... is striped as A0-A2, B0-B2, C0-C2, D0-D2 across three data disks; the XOR of each stripe's strips produces PA, PB, PC, PD on the dedicated parity disk.]
RAID Levels — RAID 5

Similar to RAID 4, RAID 5 stripes data for performance and uses parity for improved fault tolerance.
Unlike RAID 4, in RAID 5, parity is distributed across all disks to overcome the write bottleneck of a
dedicated parity disk.
[Figure: incoming data A0, B0, C0, D0, A1, B1, C1, E1, A2, B2, D2, ... is striped across five disks; the XOR parity strip (P0-P4) rotates across all disks from stripe to stripe.]

RAID 5 is widely applied in storage systems. RAID 5 is good for random, read-intensive I/O applications and preferred for messaging, data mining, medium-performance media serving, and relational database management system (RDBMS) implementations in which database administrators (DBAs) optimize data access.
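A minimal sketch of the parity rotation in the figure, assuming parity walks from the last disk to the first, one disk per stripe (real arrays may rotate differently):

```python
# Distributed parity: no single disk absorbs all parity writes.

N_DISKS = 5

def parity_disk(stripe: int) -> int:
    """Disk index holding the parity strip for a given stripe."""
    return (N_DISKS - 1 - stripe) % N_DISKS

for stripe in range(5):
    print(stripe, "-> parity on disk", parity_disk(stripe))
# 0 -> 4, 1 -> 3, 2 -> 2, 3 -> 1, 4 -> 0, matching P0..P4 in the figure
```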

RAID Levels — RAID 6
RAID 6 works in the same way as RAID 5, except that RAID 6 includes a second parity element to
enable survival if two disk failures occur in a RAID set.

[Figure: incoming data A, B, C, D, E, ... is striped across five disks with two parity strips per stripe (e.g. Ap and Aq), both rotating across the disks from stripe to stripe.]

RAID 6 is widely applied in storage systems, especially in all-flash arrays. Write performance of RAID 6 is lower than that of RAID 5 because two parity strips must be updated for every write. RAID 6 is good for applications that require high availability.

RAID Levels — RAID 10/50/60
Nested RAID combines multiple techniques to obtain both data redundancy and performance:
• RAID 10: RAID 1 + RAID 0, mirroring + striping
• RAID 50/60: RAID 5/6 + RAID 0, parity + striping

[Figure: in RAID 10, data A, B, C, D, ... is striped across mirrored pairs (A/A, B/B, C/C, D/D, ...). In RAID 50, data is striped across two RAID 5 sub-groups, each with its own rotating XOR parity (P00-P03 and P10-P13).]

RAID 10 is good for applications requiring high performance and availability.

RAID Levels — TP
Large-capacity drives result in long data rebuilds, during which the remaining drives are more vulnerable to further disk failures.
RAID-TP (triple parity) can tolerate a simultaneous three-disk failure in one RAID set without service interruption.
NetApp (TEC), HPE Nimble (RAID-3P), and Huawei Dorado support RAID-TP.
RAID-TP provides higher reliability than RAID 6 (EC-2).

[Figure: comparison of the probability of data loss events in 5 years versus disk capacity (TB), showing RAID-TP staying below RAID 6 as capacities grow.]

Video: Dorado RAID-TP

Write Penalty — RAID Impact on Disk Performance
In both mirrored and parity RAID, every write operation translates into more I/O overhead for the disks, which is referred to as a write penalty.

• In a RAID 1 implementation, every write operation must be performed on the two disks configured as a mirrored pair, so the write penalty is 2.
• In RAID 5 (copy on write, COW), a small write manifests as four I/O operations: read the old data and the old parity, recalculate the new parity, then write the new data and the new parity. So the write penalty is 4, as shown in the figure below.

Ep new = Ep old - E4 old + E4 new (for XOR parity: Ep new = Ep old XOR E4 old XOR E4 new)

[Figure: RAID 5 write penalty. Updating E4 requires reading E4 old and Ep old, computing Ep new, and writing E4 new and Ep new into a stripe set (A1-E4) with rotating parity.]

Consider an application that generates 6000 IOPS, with 60% of them being reads. The disk load in RAID 5 is calculated as follows:

RAID 5 disk load (reads + writes) = 0.6 x 6000 + 4 x (0.4 x 6000)   [the write penalty for RAID 5 is 4]
                                  = 3600 + 4 x 2400
                                  = 3600 + 9600
                                  = 13,200 IOPS

In this mode, the write penalty of RAID-TP is up to 8, which is not acceptable and needs to be optimized.
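A minimal sketch of both calculations above; the function names are illustrative, not any product's API:

```python
# RAID 5 small-write penalty and the resulting back-end disk load.

def raid5_new_parity(old_data: int, new_data: int, old_parity: int) -> int:
    """Read-modify-write parity update: 2 reads + 2 writes = penalty of 4.
    For XOR parity: new parity = old parity XOR old data XOR new data."""
    return old_parity ^ old_data ^ new_data

def disk_load(iops: float, read_ratio: float, write_penalty: int) -> float:
    """Back-end disk IOPS = reads + penalty x writes."""
    return read_ratio * iops + write_penalty * (1 - read_ratio) * iops

print(disk_load(6000, 0.6, 4))   # 13200.0, matching the slide's example
```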
Optimization of RAID Performance in Storage
[Figure: redirect on write (ROW). Instead of updating strips in place, new writes A0, B0, C0, and D0 are aggregated and written out as a new full stripe A B C D P.]

• Full stripe write: when the data (A0, B0, C0, and D0) to be written to a RAID group forms a full stripe, there are no additional read operations for old data and old parity, and the new parity is calculated directly from the new stripe. So a full stripe write only generates an additional write for parity data.
• The write cache can aggregate small I/Os into full stripe writes to reduce the write penalty, especially for sequential I/Os. For example, if RAID 5 (4D+1P) has a stripe size of 4 x 16 KB, a full stripe write issues only 5 disk writes (4 for new data and 1 for new parity calculated from the new data).

Huawei Dorado adopts ROW + full stripe write so that every write operation carries the least possible write penalty, making RAID-TP practical.

Dorado RAID      Write Penalty per Write Operation
RAID 5 (N+1)     (N+1)/N
RAID 6 (N+2)     (N+2)/N
RAID-TP (N+3)    (N+3)/N
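A quick sketch checking the table's per-write amplification, assuming 4 data strips per stripe:

```python
# Full-stripe write: N data strips + k parity strips cost N + k disk
# writes for N logical writes, i.e. (N + k) / N per write.

def full_stripe_penalty(n_data: int, n_parity: int) -> float:
    return (n_data + n_parity) / n_data

for name, k in [("RAID 5", 1), ("RAID 6", 2), ("RAID-TP", 3)]:
    print(name, full_stripe_penalty(4, k))   # 1.25, 1.5, 1.75 for 4D stripes
```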

RAID Level Comparison
Item               | RAID 0                    | RAID 1                    | RAID 10                                   | RAID 5                                      | RAID 6                                      | RAID-TP
Protection         | No protection             | Mirroring protection      | Mirroring protection                      | Parity protection, 1-disk failure tolerated | Parity protection, 2-disk failure tolerated | Erasure coding protection, 3-disk failure tolerated
Utilization        | 100%                      | 50%                       | 50%                                       | (n-1)/n x 100%                              | (n-2)/n x 100%                              | (n-3)/n x 100%
Min. disks required| 2                         | 2                         | 4                                         | 3                                           | 4                                           | 6
Read performance   | Better than a single disk | Good                      | Good for both random and sequential reads | Good for random and sequential reads        | Good for random and sequential reads        | Good for random and sequential reads
Write performance  | Good                      | Slower than a single disk | Good                                      | Fair for random and sequential writes       | Poor in writes                              | Poor to fair for random writes, fair for sequential writes
Write penalty      | 1                         | 2                         | 2                                         | 4                                           | 6                                           | 8 (without optimization)

Agenda

1. RAID Basics
2. RAID Levels
3. RAID 2.0+
4. Replication Protection and Erasure Coding

RAID 2.0+ Basics — RAID Technology Evolution

[Figure: evolution from traditional RAID (dedicated hot spare disk per RAID group), to LUN virtualization (LUNs striped across RAID groups), to block virtualization (distributed hot spare space across all drives).]

Examples:
• LUN virtualization: HDS G series (LDEV striping), NetApp FAS
• Block virtualization: OceanStor V5/Dorado (RAID 2.0), HPE 3PAR (Fast RAID), IBM Storwize (Distributed RAID), Dell EMC Unity AFA (dynamic pool), NetApp E series (DDP)

Item            | Traditional RAID                                                                    | LUN Virtualization                                                                  | Block Virtualization
Hot spare disks | Dedicated hot spare disks for RAID groups and storage arrays; manual configuration | Dedicated hot spare disks for RAID groups and storage arrays; manual configuration | Distributed hot spare space; automatic configuration
Rebuild         | Multiple-to-one rebuild; write performance of the hot spare disk is the bottleneck | Multiple-to-one rebuild; write performance of the hot spare disk is the bottleneck | Multiple-to-multiple rebuild; data is written to hot spare space on multiple disks in parallel. Faster rebuild.
Load balancing  | No balancing between RAID groups                                                    | Load balancing between RAID groups                                                 | Load balancing between RAID groups
RAID 2.0+ Basics — LUN Virtualization
[Figure: LUN virtualization. LUNs are composed of cells striped across RAID groups built from physical drives, while dedicated hot spare disks are still kept aside.]

Dedicated hot spare disks are a rebuild bottleneck.

RAID 2.0+ Basics — Block Virtualization
[Figure: block virtualization. Physical drives are divided into logical blocks; RAID groups are formed from blocks on different drives, and LUNs are composed of cells from those groups. Hot spare space is distributed across all drives.]

Distributed hot spare space replaces dedicated hot spare disks, eliminating the rebuild bottleneck.

RAID 2.0+ Basics — Logical Objects
• Disk domain (DD): a combination of multiple disks, used for resource and failure isolation. In the DD, physical drives are divided into chunks (CKs) with a fixed size of 64 MB.
• Storage pool: a storage resource container for applications, which allocates CKs to CKGs according to a RAID policy. A storage pool can consist of one or more tiers (SSD, SAS, and NL-SAS).
• Chunk (CK): the logical unit into which physical drives in a DD are divided, with a size of 64 MB.
• Chunk group (CKG): a logical RAID group made up of CKs from different drives in the storage pool.
• Extent: a logical unit divided from a CKG, with a fixed size of 4 MB by default (value range: 512 KB to 4 MB). In thick LUNs, the extent is the unit for space allocation, reclamation, and migration.
• Grain: in thin LUNs, extents are divided into grains (64 KB) for fine-grained space management.
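A minimal sketch of block virtualization under stated assumptions: the data structures, carve_chunks(), and allocate_ckg() are illustrative models, not product APIs; only the 64 MB chunk size comes from the slide.

```python
# Carve drives into fixed-size CKs, then build a CKG from CKs on
# different drives (the RAID policy decides data vs. parity roles).

import random

CK_SIZE_MB = 64

def carve_chunks(disk_sizes_gb: list[int]) -> dict[int, list[int]]:
    """Return {disk_id: [free chunk ids]} for every drive in the disk domain."""
    return {d: list(range(size * 1024 // CK_SIZE_MB))
            for d, size in enumerate(disk_sizes_gb)}

def allocate_ckg(free: dict[int, list[int]], width: int) -> list[tuple[int, int]]:
    """Pick one free CK from each of `width` different drives."""
    disks = random.sample([d for d, cks in free.items() if cks], width)
    return [(d, free[d].pop()) for d in disks]

free_cks = carve_chunks([960] * 12)        # 12 drives, 960 GB each
ckg = allocate_ckg(free_cks, width=9)      # e.g. an 8+1 RAID 5 CKG
print(ckg)                                 # [(disk, chunk), ...] on 9 drives
```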

RAID 2.0+ Basics — Working Principles
[Figure: RAID 2.0+ working principles, from disk domain to storage pool to LUN & file system to protocol. Disks in the disk domain (SSD, SAS, and NL-SAS disk groups forming tiers 0-2) are divided into CKs; CKs from different drives form CKGs (RAID groups); CKGs are divided into extents, which compose thick LUNs directly or are further divided into grains for thin LUNs and file systems. LUNs are exported over iSCSI/FC/FCoE/IB, and file systems over NFS & CIFS.]

RAID 2.0+ Highlights — Automatic Load Balancing

Benefits of automatic load balancing:


• Improved performance, especially of an individual logical unit, due to the increase in the number of drives that constitute an array group.
• Superior workload distribution: if one array group carries a higher workload than another, combining the array groups spreads the workload and reduces the load concentrated on any single array group.

RAID 2.0+ Highlights — Fast Rebuild
RAID 2.0+ shortens a 1 TB rebuild from 10 hours to 0.5 hour:
• Traditional RAID reconstructs at about 30 MB/s, so rebuilding 1 TB of data takes 10 hours (multiple-to-one rebuild onto a single hot spare disk).
• Tests by Huawei indicate that RAID 2.0+ shortens 1 TB data reconstruction to 0.5 hour (multiple-to-multiple rebuild into distributed hot spare space).

Agenda

1. RAID Basics
2. RAID Levels
3. RAID 2.0+
4. Replication Protection and Erasure Coding

Working Principles of Replication Protection
[Figure: application data is divided into fixed-size fragments; each fragment (e.g. source data fragments 1 and 2) and its replicas (replica 1, replica 2) are written to disks on different nodes (nodes 1-6).]

• Application data is divided into fragments with a fixed size.
• For each fragment, the system generates additional replicas (1 to 2 replicas).
• The copies are written to different disks on different nodes.

Replication protection is widely adopted in distributed storage systems, such as Ceph, HDFS, OceanStor Pacific, and VMware vSAN. Replication protection provides high data durability but low space utilization (2-copy: 50%; 3-copy: 33%).
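A minimal sketch of fragment-and-replicate placement. The hash-based node selection is an illustrative assumption; real systems such as Ceph use more elaborate placement (e.g. CRUSH).

```python
# Split data into fixed-size fragments, then write each fragment to
# `copies` distinct nodes.

import hashlib

NODES = [f"node{i}" for i in range(1, 7)]
FRAGMENT_SIZE = 4              # bytes, tiny for the demo

def place_replicas(fragment_id: int, copies: int = 3) -> list[str]:
    """Choose `copies` distinct nodes for one fragment."""
    h = int(hashlib.sha256(str(fragment_id).encode()).hexdigest(), 16)
    start = h % len(NODES)
    return [NODES[(start + i) % len(NODES)] for i in range(copies)]

data = b"application data to protect"
fragments = [data[i:i + FRAGMENT_SIZE] for i in range(0, len(data), FRAGMENT_SIZE)]
for fid, frag in enumerate(fragments):
    print(fid, frag, "->", place_replicas(fid))
```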

Erasure Coding — Overview

[Figure: a file is encoded into N data fragments plus M parity fragments; decoding any N surviving fragments rebuilds the file.]

Erasure coding (EC) is a method of data protection in which data is broken into fragments (N), expanded and encoded with redundant data pieces (M), and stored across a set of different locations.
If failures happen and the number of remaining available fragments is not less than N, the system can rebuild the full data.
EC protection is applied in distributed storage, such as OceanStor Pacific, vSAN (all-flash), and Ceph (all-flash).

Video: Erasure Coding

Erasure Coding — Working Principles
1. The source file is divided into chunks with a fixed size.
2. Each chunk is divided into N strips with a fixed size (such as 128 KB).
3. The system generates M redundant strips from the N original data strips using a coding algorithm.
4. The N+M strips are written to disks on N+M nodes.
5. If no more than M nodes fail, the original data can be rebuilt with a decoding algorithm, as sketched below.

[Figure: a file is split into chunks; each chunk is split into N source data strips, M redundant strips are generated, and the N+M strips are distributed across disks on nodes 1-6.]

Typically, N >= M, so the space utilization of EC, N/(N+M), is higher than that of replication protection, which saves costs.
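A minimal sketch of N+M encoding and decoding with M = 2, using arithmetic over a small prime field to keep the algebra readable. Production systems use Reed-Solomon codes over GF(2^8); every name here is an illustrative assumption.

```python
# Toy N+2 erasure code: P0 = sum(d_i), P1 = sum(i * d_i), both mod a prime.
# Any two erased data strips can be recovered by solving a 2x2 system.

P = 257                                   # prime larger than any byte value

def encode(data: list[int]) -> list[int]:
    """Append M=2 parity strips to N data strips."""
    p0 = sum(data) % P
    p1 = sum(i * d for i, d in enumerate(data)) % P
    return data + [p0, p1]

def decode(strips: list[int | None]) -> list[int]:
    """Recover up to two erased data strips (marked None)."""
    n = len(strips) - 2
    known = [i for i in range(n) if strips[i] is not None]
    missing = [i for i in range(n) if strips[i] is None]
    s0 = (strips[n] - sum(strips[i] for i in known)) % P
    s1 = (strips[n + 1] - sum(i * strips[i] for i in known)) % P
    if len(missing) == 1:
        strips[missing[0]] = s0
    elif len(missing) == 2:
        a, b = missing                    # x_a + x_b = s0; a*x_a + b*x_b = s1
        x_b = (s1 - a * s0) * pow(b - a, -1, P) % P
        strips[a], strips[b] = (s0 - x_b) % P, x_b
    return strips[:n]

coded = encode([10, 20, 30, 40])          # N=4 data strips, M=2 parity
coded[1] = coded[3] = None                # two nodes fail
assert decode(coded) == [10, 20, 30, 40]  # full data rebuilt
```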

Comparison Between EC and Replication
Item                      | Replication (N-Copy)                          | EC (N+M)
Protection                | Failure of N-1 nodes tolerated                | Failure of M nodes tolerated
Utilization               | 1/N                                           | N/(N+M)
Computing                 | Simple, less compute-intensive                | Complex, compute-intensive
Operation network traffic | Read: no extra read traffic; Write: (N-1)*X write traffic | Read: (N-1)/N*X to ~1X read traffic; Write: (N+M-1)/N*X write traffic
Rebuild traffic           | ~1X rebuild traffic, fast rebuild             | (N+M-1)*X rebuild traffic, more complex and slower than replication
Application scenarios     | Production systems (3-copy), high performance | Cold data storage, all-flash nodes
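A quick sketch of the table's utilization and write-traffic formulas, comparing 3-copy replication with an illustrative 4+2 EC scheme:

```python
# Space utilization and write amplification per X of application writes.

def replication(n_copies: int, x: float = 1.0) -> tuple[float, float]:
    """(utilization, extra write traffic) for N-copy replication."""
    return 1 / n_copies, (n_copies - 1) * x

def erasure_coding(n: int, m: int, x: float = 1.0) -> tuple[float, float]:
    """(utilization, extra write traffic) for N+M EC."""
    return n / (n + m), (n + m - 1) / n * x

print(replication(3))        # (0.333..., 2.0)  -> 33% utilization
print(erasure_coding(4, 2))  # (0.666..., 1.25) -> 67% utilization
```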

Thank you.

Bring digital to every person, home, and organization for a fully connected, intelligent world.

Copyright © 2021 Huawei Technologies Co., Ltd. All Rights Reserved.

The information in this document may contain predictive statements including, without limitation, statements regarding the future financial and operating results, future product portfolio, new technology, etc. There are a number of factors that could cause actual results and developments to differ materially from those expressed or implied in the predictive statements. Therefore, such information is provided for reference purpose only and constitutes neither an offer nor an acceptance. Huawei may change the information at any time without notice.
