RAIDn White Paper 10
WHITE PAPER
CONTENTS
INTRODUCTION
Current RAID Methods for Protecting Against Multiple Drive Failures are Wasteful
INTRODUCTION
There is no issue more critical to large data storage centers than the preservation
and integrity of their data. A small cluster of physical or electrical malfunctions
can, in the blink of an eye, cause more economic damage than a devastating
warehouse fire. Yet data storage redundancy techniques, a key component of the
necessary defense against such disasters, are still based on two simple
algorithms developed years ago: mirroring and parity summing. Their
shortcomings are only slightly masked by combining them with each other, and
with data striping, in various compound patterns.
The ideal technique would adjust to the amount of disk-loss insurance the user
needed, instead of forcing a pattern of protection (and vulnerability) on the user
based on the capabilities of the two algorithms mentioned above. InoStor
Corporation has developed and perfected a matrix array of patented[1] techniques
that take protection against multiple drive failures far beyond conventional RAID
capabilities. In the theme of taking RAID to the “nth power,” InoStor has applied
the name RAIDn to all of the matrix variants underlying this revolutionary
technology.
The standard RAID (Redundant Array of Inexpensive Disks) techniques used for
redundancy are RAID-1 (mirroring) and RAID-5 (parity summing). Also important
is RAID-0 (striping), which gives speed but not redundancy. Used alone, RAID-1
gives (n-1)-disk redundancy and only one disk's worth of data (usually n=2), while
RAID-5 gives one-disk redundancy and (n-1) disks' worth of data.
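The parity-summing idea can be sketched in a few lines of Python (a generic RAID-5-style illustration, not InoStor's patented algorithm; the helper name is ours): with n disks, one holding the XOR of the others, any single lost disk can be rebuilt from the survivors.

```python
# Generic RAID-5-style parity summing (illustration only).
# With n disks, any single lost disk is the XOR of all the others.

def xor_blocks(blocks):
    """XOR equal-length byte strings together."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, b in enumerate(block):
            out[i] ^= b
    return bytes(out)

# Four data disks plus one parity disk (n = 5, one-disk redundancy).
data = [b"AAAA", b"BBBB", b"CCCC", b"DDDD"]
parity = xor_blocks(data)
disks = data + [parity]

# Lose any one disk; the XOR of the survivors reconstructs it.
lost = 2
survivors = [d for i, d in enumerate(disks) if i != lost]
rebuilt = xor_blocks(survivors)
assert rebuilt == disks[lost]
```

Losing two disks defeats this scheme, which is exactly the limitation the multi-loss patterns below address.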
Let’s analyze some typical combinations, say in 9- and 10-drive disk packs, for
conventional RAID arrays. The relevant factors to be considered are:
♦ Safe loss count or minimum redundancy (maximum number of disks lost that
can ALWAYS be recovered)
♦ Maximum loss count or maximum redundancy (maximum number of disks
lost that can EVER be recovered)
♦ Data capacity
♦ Ideal read speed, and
♦ Ideal write speed (the last three compared with a single disk).
Then we’ll compare them to corresponding RAIDn designs. All of the packs
considered will be assumed to have pattern designs that maximize read and
write speed for large files and parallel data flows to/from disks. The "ideal"
speeds will be based on raw data movement only, ignoring buffering and
computational burdens.

[1] US Patent number 6,557,123
(A) RAID 0+1 Array: Two striped arrays of five disks each mirror each other. A
striped array (RAID-0) is lost if even one of its disks is lost, so safe loss count = 1
and maximum loss count = 5 (depending on whether the lost disks all fall on the
same side of the mirror). Data capacity = 5, read speed = 10 (assuming an
operating system capable of alternating reads between the two sides of the
mirror to achieve full parallelism; the usual maximum is 5), and write speed = 5.
(B) RAID 5+1 Array: Two RAID-5 arrays of five disks each mirror each other.
Safe loss count is 3 (when one side has lost no more than one disk, the other
perhaps more, we can still recover), maximum loss count is 6 (one entire side,
and one disk from the other side). Data capacity is 4 (equals that of one RAID-5
array), read speed = 10 but usual maximum is 5 (see (A) above), and write speed
= 4 (using full parallelism, but with parity and mirror burdens). Similar results
arise from a 1+5 array (a RAID-5 made of mirrored pairs).
(C) RAID 5+5 Array: Three independent RAID-5 arrays of three disks each,
which are combined to create a larger RAID-5 configuration. (Note that RAID-5
has nothing to do with the count of disks being five.) Thus one entire array of
three can be lost, plus one disk of each of the other two. This implies safe loss
count = 3 (it can't stand a 0 - 2 - 2 loss pattern) and max loss count = 5. Data
capacity is 4 (of 9), read speed is 9 (using nested striping) and write speed is 4.
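The loss-pattern claims for case (C) can be verified exhaustively. The sketch below (our own, with hypothetical helper names) models the 3 x 3 layered RAID-5: an inner array of three survives losing at most one disk, and the outer RAID-5 survives at most one dead inner array.

```python
# Exhaustive check of case (C): a 3 x 3 layered RAID-5 on disks 0..8,
# where disk d belongs to inner array d // 3.
from itertools import combinations

def survives(lost):
    """True if the layered array can recover from losing these disks."""
    dead_inner = sum(1 for a in range(3)
                     if sum(1 for d in lost if d // 3 == a) > 1)
    return dead_inner <= 1          # outer RAID-5 tolerates one dead array

def safe_loss_count():
    """Largest k such that EVERY k-disk loss pattern is recoverable."""
    k = 0
    while all(survives(c) for c in combinations(range(9), k + 1)):
        k += 1
    return k

print(safe_loss_count())  # 3: any 3 losses survive, but a 0-2-2
                          # pattern of 4 losses does not
```

The maximum loss count of 5 corresponds to losing one whole inner array plus one disk from each of the other two, e.g. disks (0, 1, 2, 3, 6).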
Now, a RAIDn pattern is given the matrix designation n.m, where n>m>0 and “n”
is the total disk count while “m” is the number of disks that can be lost. ANY
pattern of m disks can be lost AND the amount of data stored is equal to n-m,
which is the theoretical maximum when m disks are lost. The read speed is n and
the write speed is n-m, which is also a theoretical maximum.
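These figures of merit follow directly from the definition above; a minimal sketch (the function name is ours, not InoStor's):

```python
# Ideal figures of merit for a RAIDn n.m pattern, per the definition above.

def raidn_metrics(n, m):
    assert n > m > 0
    return {
        "safe_loss_count": m,      # ANY m disks can be lost
        "max_loss_count": m,
        "data_capacity": n - m,    # the theoretical maximum
        "read_speed": n,           # relative to a single disk
        "write_speed": n - m,      # also a theoretical maximum
    }

print(raidn_metrics(10, 2))
# A 10.2 pattern stores 8 disks of data at write speed 8 -- double the
# capacity and write speed of the 5+1 array in case (B).
```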
It is important to note that each n.m pattern (with the exception of n.1 and n.(n-1)
for which a general formula exists) must be discovered and proved on its own.
Table 1 presents a comparison of RAIDn with standard RAID patterns.
Just for starters, all the m=2 patterns have been tested up to n=21, and they
DOUBLE the capacity and write speed of 5+1, the standard alternative offering
safe loss count>=2. Safe loss count=2 is important because of the common
operator error of replacing the wrong disk.
Table 1. Comparison of RAIDn with standard RAID patterns

Case              A: 0+1   10.2   10.5   B: 5+1   10.3   10.6   C: 5+5   9.3   9.5
Safe Loss Count      1       2      5       3       3      6       3      3     5
Max Loss Count       5       2      5       6       3      6       5      3     5
Data Capacity        5       8      5       4       7      4       4      6     4
Read Speed           5*     10     10       5*     10     10       9      9     9
Write Speed          5       8      5       4       7      4       4      6     4

*Note: Read speeds of 10 are possible but only with specialized operating system support
(Novell).
Current RAID Methods for Protecting Against Multiple Drive Failures are
Wasteful
Some RAID algorithms do guard against multiple disk failures, but these are not
currently implemented for Linux. However, Linux Software RAID can guard
against multiple disk failures by layering an array on top of an array. For
example, nine disks can be used to create three RAID-5 arrays, which can in
turn be hooked together into a single RAID-5 array on top. This kind of
configuration will guard against a three-disk failure. Note, however, that a large
amount of disk space is "wasted" on the redundancy information.
In general, an m x n array will use m+n-1 disks for parity. The least amount of
space is "wasted" when m=n.
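That arithmetic can be checked in a few lines (a sketch; the function and variable names are ours):

```python
# Parity overhead of an m x n layered RAID-5: each of the m inner arrays
# uses one parity disk, and the outer array sacrifices one whole inner
# array of n disks, so m + n - 1 of the m*n disks carry redundancy.

def parity_disks(m, n):
    return m + n - 1

total = 36
options = [(parity_disks(m, total // m), m, total // m)
           for m in range(2, total) if total % m == 0 and total // m >= 2]
print(min(options))  # (11, 6, 6): waste is least when m = n
```

For the nine-disk example above, parity_disks(3, 3) gives 5, matching the data capacity of 4 (of 9) in case (C).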
Old-style RAID 0, RAID 1, RAID 4, RAID 5, and RAID 0+1 (identical mirrored
RAID 0s) will be included as part of the RAIDn code. All RAIDn insurance levels
(2 <= m <= n-2) have been tested for n <= 21. The m=1 insurance level is
equivalent to RAID 5, the m=0 level to RAID 0, and the m=n-1 level to RAID 1.
REAL SPEED
RAIDn uses an extremely fast computational algorithm. In this it differs from the
highly complex, encryption-type multiple-redundancy algorithms already known,
such as the Mariani and Reed-Solomon algorithms.
A test version was coded to run as a Linux application on a PPro 200, using
subdirectories of a single disk to model RAID disks, so that the Linux time
command would distinguish between user time (here the calculations related to
RAIDn), system time (the standard disk access and buffering burden), and
elapsed time (which includes hardware delays). Even for the large 18.3 case, the
algorithm time was less than the system disk access time, and much less than
the hardware-driven elapsed time, for RAID-constructing writes, and for RAID-
decoding reads from a damaged array. It was only a little greater than system
time for the RAID reconstruction call.
In the simpler 6.3 and 5.2 cases, also tested, the algorithm time was about half
the system time for reconstruction, and much less than half for the other RAIDn
algorithms. This was without any assembly coding or hardware assist, both of
which are applicable to these algorithms and would create an even greater
performance boost. This means that in real deployments, RAIDn will truly
approach the ideal read and write speeds described in the section above.
Table 2.
Timings (nsec/byte) in application RAIDn test on PPro 200
Note that the above results were achieved without any detailed optimization, and
are crude Linux "time" outputs that only approximate the actual burdens.
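The measurement approach described above, separating user time (algorithm work) from system time (disk access and buffering) and elapsed time, can be sketched with Python's os.times(), which exposes the same split the Linux time command reports. The workload function here is a hypothetical stand-in, not InoStor's code:

```python
# Splitting user, system, and wall-clock time for a workload, in the
# spirit of the Linux `time` command used in the test above.
import os
import time

def timed(fn, *args):
    t0, w0 = os.times(), time.monotonic()
    fn(*args)
    t1, w1 = os.times(), time.monotonic()
    return {"user": t1.user - t0.user,        # CPU spent in calculation
            "system": t1.system - t0.system,  # CPU spent in the kernel
            "elapsed": w1 - w0}               # wall clock, incl. hardware

def xor_encode(n):
    """Hypothetical user-time-heavy stand-in for a RAIDn encode pass."""
    acc = 0
    for i in range(n):
        acc ^= i
    return acc

print(timed(xor_encode, 1_000_000))
```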
RAIDn ARCHITECTURE
After loading RAIDn with the standard insmod utilities, and waiting for initialization
to complete, it is possible to read and write data directly, up to the known RAIDn
device size, using dd and similar utilities. It is also possible to create, mount, and
use a file system of the known RAIDn device size.
Information for defining all the current RAIDn devices, or for initializing or
converting RAIDn devices, is stored in at least three places: (A) shared kernel
memory; (B) superblocks or other non-volatile storage; (C) a standard file format
analogous to the Linux "raidtab". These may be updated at different frequencies,
but the updates must suffice to recover the current data storage state of a system
that suddenly loses power.
Each parity species supported is capable of operation with all disks or at any disk
loss level from which its algorithm is capable of recovering. All reads and writes
will (after synchronization) leave all operating disks in the correct data and parity
states, able to read correct data by decoding if necessary.
If the disk loss level is exceeded, an error state is returned upon reading or
writing.
During reconstruction or conversion, all reads and writes and file system
operations continue to work uninterrupted, with the RAIDn code in charge of
whichever side of the conversion manages the actual IO.
Reconstruction, conversion, and the initialization that takes place after a RAIDn
instance has been specified and before it becomes usable, are background
processes with progress information available analogously to Linux RAID's
"/proc/mdstat" procfile.
All of the above are available to old-style Linux RAIDs, including those that
existed before the installation of RAIDn at a particular site.
The state data (above) includes device state for each underlying device, which is
checked at each RAIDn request initialization and each underlying device request
error. In a timely fashion, using atomic operations where necessary, this device
state is used both to recover from transient errors and to enforce degraded state
as necessary until reconstruction or recovery can be completed.
A software path exists whereby a running software Linux RAID of the old style
(RAID 0, RAID 1, RAID 4, RAID 5, or RAID 0+1) is transferred to the control of
the RAIDn module without noticeable interruption of service. Not only are the data
formats completely consistent, but a syncing operation is enforced which is
crudely the equivalent of a "raidstop" followed by a "raidstart", after which device
calls which previously went to the Linux RAID device are instead now sent to the
RAIDn device of the corresponding old style species.
The compatibility transfer is also capable of going in the other direction (from
RAIDn driver to Linux RAID driver).
Here is the variant of that stack when either software RAIDn or hardware RAIDn
(or both) is added:

    File System
         |
    Raw Block Device I/O Calls
         |
    RAIDn Software Module
         |
    Block Device Driver
         |
    SCSI-like Hardware Interface
         |
    RAIDn Hardware Module
         |
    Disk-like Hardware
♦ Software and hardware RAIDn are compatible with one another; they can be
used alone or together.
♦ RAIDn can be nested. Its bottom boundary fits its top boundary, in either the
software or hardware implementations.
♦ As long as the information transmitted by the "Raw Block Device I/O Calls"
API and the "SCSI-like Hardware Interface" API can be made functionally
equivalent, the RAIDn software and hardware designs can be made
functionally equivalent too.
♦ There is nothing in the diagram that cannot apply to any other form of
standard RAID. This means standard RAID (RAID linear, RAID 0, RAID 1,
RAID 4, RAID 5, and nested combinations) can be implemented as a subset
of RAIDn.
♦ All reference to operating system kernel and buffering is omitted. This is
deliberate, to show the independence of the design from operating system
considerations.
The extremely generic character and simplicity of this design are not an illusion.
Our experience of coding RAIDn has rapidly converged on a central code block
that exhibits just such simplicity, despite a very wide variety of size and
insurance implementations with great variations in algorithmic character. In
addition, there is code that deals with specific operating-system requirements,
independent of the RAIDn algorithm. We began with Linux, and expect other
operating systems to be similar. Moreover, hardware/firmware designs will be
simpler still.
The diagrams below show the minimum data-flow structure of RAIDn operations.
Steps marked with an asterisk (*) are not included in every such operation.
"Master" refers to action of the external master on the RAIDn module, and
"slave" to action of the RAIDn module on the external slave, so a "write" always
moves data toward the slaves, and a "read" back toward the master. The RAIDn
"write" proceeds as follows (processes between separating lines are
accomplished in parallel):

    Master -> Module: Write Request    |  Module -> Slaves: Read Requests*
    Master -> Module: Write Data       |  Slaves -> Module: Read Data*
    ------------------------------------------------------------------
    Module: Encode
    ------------------------------------------------------------------
    Module -> Slaves: Write Requests
    Module -> Slaves: Write Data
The separating lines are synchronization points: operations between them can
run in any order or in parallel, with the caveat that those vertically below must run
after the operation vertically above for the same device.
Thus, in the RAIDn write, the acceptance of write data from the master can go on
in parallel with reads (if necessary) from any of the slaves. For example, slave
disk 1 may have accepted its read request and provided its data before slave
disk 0 even accepts its read request.
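That staged, parallel structure can be sketched for a simple parity-summed small write (a generic read-modify-write illustration, not RAIDn's actual encode; all names are ours): the old data and parity are read from the slaves in parallel and in any order, the encode runs at the synchronization point, and the writes follow.

```python
# Staged RAIDn-style write: parallel slave reads, then encode, then writes.
from concurrent.futures import ThreadPoolExecutor

def raid_write(slaves, new_data, data_idx, parity_idx):
    """Small-write update on a parity-summed array (illustrative only)."""
    with ThreadPoolExecutor() as pool:
        # Strips 1-2: read old data and old parity from the slaves; these
        # may complete in either order before the synchronization point.
        old_data, old_parity = pool.map(lambda i: slaves[i],
                                        (data_idx, parity_idx))
    # Strip 3: encode -- new parity = old parity XOR old data XOR new data.
    new_parity = bytes(p ^ a ^ b for p, a, b in
                       zip(old_parity, old_data, new_data))
    # Strips 4-5: issue the write requests and data to the slaves.
    slaves[data_idx], slaves[parity_idx] = new_data, new_parity

disks = [b"\x01", b"\x02", b"\x03"]   # parity disk is index 2: 1 ^ 2 = 3
raid_write(disks, b"\x07", 0, 2)
print(disks)  # data block replaced by 7, parity updated to 7 ^ 2 = 5
```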
The RAIDn "read" has the corresponding structure, again identifying each action
by its generating source:

    Master -> Module: Read Request
    ------------------------------------------------------------------
    Module -> Slaves: Read Requests
    Slaves -> Module: Read Data
    ------------------------------------------------------------------
    Module: Decode*
    ------------------------------------------------------------------
    Module -> Master: Read Data
The data flow shown can be subjected to any applicable buffering schemes. If
accesses to the same area tend to be repetitious, buffering may eliminate the
bottom two strips on the RAIDn “write” and the middle two strips on the RAIDn
“read” in cases where a stripe set is in cache. On the other hand, if accesses
tend to be short and randomly distributed, the slave “reads” and “writes” may be
passed out to another buffering module that does its best to load balance the
seeking process for each slave.