
RAIDn:

Optimal Multiple Disk Loss Insurance

WHITE PAPER

By

Lawrence J. Dickson, Ph.D.

May 12, 2003

Copyright InoStor Corporation 2003


TABLE OF CONTENTS

INTRODUCTION

COST BENEFIT PATTERNS
  Current RAID Methods for Protecting Against Multiple Drive Failures are Wasteful
  RAIDn is a Proven Alternative

REAL SPEED

RAIDn ARCHITECTURE
  BASIC BLOCK DEVICE
  MULTIPLE RAIDn DEVICES MOUNTING LIST
  PARITY GENERATION AND PARITY-BASED DECODING
  SHORTENING AND LENGTHENING
  INITIALIZATION, RECONSTRUCTION AND CONVERSION
  UNDERLYING DEVICE REQUEST COMPLETION AND STATE SENSING
  COMPATIBILITY RAID TRANSFER

RAIDn IMPLEMENTATION – CODE & HARDWARE STRUCTURE

RAIDn DATA FLOW STRUCTURES


INTRODUCTION

There is no issue more critical to large data storage centers than the preservation
and integrity of their data. A small cluster of physical or electrical malfunctions
can, in the blink of an eye, cause more economic damage than a devastating
warehouse fire. Yet data storage redundancy techniques, a key component in the
necessary defense against such disasters, still are based on two simple
algorithms developed years ago - mirroring and parity summing. Their
shortcomings are only slightly masked by combining them with each other and
with data striping in various compound patterns.

The ideal technique would adjust to the amount of disk-loss insurance the user
needed, instead of forcing a pattern of protection (and vulnerability) on the user
based on the capabilities of the two algorithms mentioned above. InoStor
Corporation has developed and perfected a matrix array of patented[1] techniques
that take protection against multiple drive failures far beyond conventional RAID
capabilities. In the theme of taking RAID to the “nth power,” InoStor has applied
the name RAIDn to all of the matrix variants underlying this revolutionary
technology.

COST BENEFIT PATTERNS

The standard RAID (Redundant Array of Inexpensive Disks) techniques used for
redundancy are RAID-1 (mirroring) and RAID-5 (parity summing). Also important
is RAID-0 (striping), which gives speed, not redundancy. Used alone, RAID-1
gives (n-1)-disk redundancy and only one disk's worth of data (usually n=2), while
RAID-5 gives one-disk redundancy and (n-1) disks' worth of data.

Let’s analyze some typical combinations, say in 9- and 10-drive disk packs, for
conventional RAID arrays. The relevant factors to be considered are:

♦ Safe loss count or minimum redundancy (maximum number of disks lost that
can ALWAYS be recovered)
♦ Maximum loss count or maximum redundancy (maximum number of disks
lost that can EVER be recovered)
♦ Data capacity
♦ Ideal read speed, and
♦ Ideal write speed (the last three compared with a single disk).

Then we’ll compare them to corresponding RAIDn designs. All of the packs
considered will be assumed to have pattern designs that maximize read and
write speed for large files and parallel data flows to/from disks. The "ideal"
speeds will be based on raw data movement only, ignoring buffering and
computational burdens.

[1] US Patent number 6,557,123

(A) RAID 0+1 Array: Two striped arrays of five disks each mirror each other. A
striped array (RAID-0) is lost if even one of its disks is lost, so safe loss count = 1
and maximum loss count = 5 (depending on whether the lost disks all fall on the
same side of the mirror). Data capacity = 5, read speed = 10 (assuming an
operating system capable of alternating reads between the two sides of the mirror
to achieve full parallelism; the usual maximum is 5), and write speed = 5.

(B) RAID 5+1 Array: Two RAID-5 arrays of five disks each mirror each other.
Safe loss count is 3 (as long as one side has lost no more than one disk, even if
the other has lost more, we can still recover); maximum loss count is 6 (one
entire side, plus one disk from the other side). Data capacity is 4 (equal to that of
one RAID-5 array), read speed = 10 but the usual maximum is 5 (see (A) above),
and write speed = 4 (using full parallelism, but with parity and mirror burdens).
Similar results arise from a 1+5 array (a RAID-5 made of mirrored pairs).

(C) RAID 5+5 Array: Three independent RAID-5 arrays of three disks each,
which are combined to create a larger RAID-5 configuration. (Note that RAID-5
has nothing to do with the count of disks being five.) Thus one entire array of
three can be lost, plus one disk from each of the other two. This implies safe loss
count = 3 (it cannot survive a 0-2-2 loss pattern, so four losses are not always
recoverable) and max loss count = 5. Data capacity is 4 (of 9), read speed is 9
(using nested striping), and write speed is 4.

Now, a RAIDn pattern is given the matrix designation n.m, where n>m>0 and “n”
is the total disk count while “m” is the number of disks that can be lost. ANY
pattern of m disks can be lost AND the amount of data stored is equal to n-m,
which is the theoretical maximum when m disks are lost. The read speed is n and
the write speed is n-m, which is also a theoretical maximum.
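
To make these figures of merit concrete, here is a minimal sketch (in Python, for
illustration only; it encodes just the arithmetic stated above, not the patented
encodings themselves) that tabulates the ideal values for a hypothetical n.m
pattern:

    # Ideal figures of merit for an n.m RAIDn pattern, as defined in the text.
    # This is only the bookkeeping arithmetic; the actual parity matrices behind
    # each pattern are not reproduced here.

    def raidn_metrics(n: int, m: int) -> dict:
        """Return the ideal figures quoted in the text for an n.m pattern."""
        assert n > m > 0, "a RAIDn pattern requires n > m > 0"
        return {
            "safe_loss_count": m,    # ANY m disks can be lost
            "max_loss_count": m,     # no luckier loss pattern extends beyond m
            "data_capacity": n - m,  # theoretical maximum with m-disk insurance
            "read_speed": n,         # all n spindles stream in parallel
            "write_speed": n - m,    # the m redundancy chunks absorb the rest
        }

    for n, m in [(10, 2), (10, 5), (10, 3), (10, 6), (9, 3), (9, 5)]:
        print(f"{n}.{m}:", raidn_metrics(n, m))

Running this for the six RAIDn patterns of Table 1 reproduces the RAIDn columns
shown there.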

It is important to note that each n.m pattern (with the exception of n.1 and n.(n-1)
for which a general formula exists) must be discovered and proved on its own.
Table 1 presents a comparison of RAIDn with standard RAID patterns.

Just for starters, all the m=2 patterns have been tested up to n=21, and they
DOUBLE the capacity and write speed of 5+1, the standard alternative offering
safe loss count>=2. Safe loss count=2 is important because of the common
operator error of replacing the wrong disk.


Table 1. Comparison of RAIDn with standard RAID patterns

Case              A      RAIDn  RAIDn   B      RAIDn  RAIDn   C      RAIDn  RAIDn
                  0+1    10.2   10.5    5+1    10.3   10.6    5+5    9.3    9.5
Safe Loss Count   1      2      5       3      3      6       3      3      5
Max Loss Count    5      2      5       6      3      6       5      3      5
Data Capacity     5      8      5       4      7      4       4      6      4
Read Speed        5*     10     10      5*     10     10      9      9      9
Write Speed       5      8      5       4      7      4       4      6      4

*Note: Read speeds of 10 are possible but only with specialized operating system support
(Novell).

Current RAID Methods for Protecting Against Multiple Drive Failures are
Wasteful

Some of the RAID algorithms do guard against multiple disk failures, but these
are not currently implemented for Linux. However, the Linux Software RAID can
guard against multiple disk failures by layering an array on top of an array. For
example, nine disks can be used to create three raid-5 arrays. Then these three
arrays can in turn be hooked together into a single RAID-5 array on top. In fact,
this kind of configuration will guard against a three-disk failure. Note that a
large amount of disk space is "wasted" on the redundancy information.

For an n x n RAID-5 array,


n=3: 5 out of 9 disks are used for parity (= 55%)
n=4: 7 out of 16 disks
n=5: 9 out of 25 disks
...
n=9: 17 out of 81 disks (= ~20%)

In general, an m x n array will use m+n-1 disks for parity. The least amount of
space is "wasted" when m=n.

RAIDn is a Proven Alternative

Old style RAID 0, RAID 1, RAID 4, RAID 5, and RAID 0+1 (identical mirrored
RAID 0's) will be included as part of the RAIDn code. All RAIDn insurance levels
(2<=m<=n-2) have been tested for n<=21. The m=1 insurance level is equivalent
to RAID 5, the m=0 level to RAID 0, and the m=n-1 level to RAID 1.

Reading, writing, and converting RAIDs, including degraded RAIDs of one
species to clean RAIDs of that or another RAID level, can be intermixed
robustly.

REAL SPEED

RAIDn uses an extremely fast computational algorithm. In this, it differs from the
highly complex, encryption-type multiple-redundancy algorithms already known,
such as the Mariani and Reed-Solomon algorithms.

A test version was coded to run as a Linux application on a PPro 200, using
subdirectories of a single disk to model RAID disks, so that the Linux time
command would distinguish between user time (here the calculations related to
RAIDn), system time (the standard disk access and buffering burden), and
elapsed time (which includes hardware delays). Even for the large 18.3 case, the
algorithm time was less than the system disk access time, and much less than
the hardware-driven elapsed time, for RAID-constructing writes, and for RAID-
decoding reads from a damaged array. It was only a little greater than system
time for the RAID reconstruction call.

In the simpler 6.3 and 5.2 cases, also tested, it was about half the system time
for reconstruct, and much less than half for the other RAIDn algorithms. This was
without any assembly coding or hardware assist - both of which are applicable to
these algorithms to create an even greater performance boost. This means that
in real developments, RAIDn will truly approach the ideal read and write speeds
described in the section above.

Table 2. Timings (nsec/byte) in application RAIDn test on PPro 200

RAIDn Type          18.3             6.3              5.2
Timing Type      A    S    E      A    S    E      A    S    E
Construct       43  105  605     17  100  354     12   87  148
Read Damaged    46   74  351     10   50   91     15   82  119
Reconstruct     52   32   86     24   47  110     21   45   69

A = algorithm (user), S = system (disk), E = elapsed (incl. hardware delay)

Note that the above results were achieved without any detailed optimization, and
are crude Linux "time" outputs that only approximate the actual burdens.


RAIDn ARCHITECTURE

RAIDn is initially implemented in the Linux operating system. This provides a
roadmap for porting the RAIDn libraries to other operating systems.

BASIC BLOCK DEVICE

RAIDn is a loadable Linux kernel module that functions as a (disk-like) block
device driver and in turn accesses one or more pre-existing (disk-like) block
device drivers, called "underlying" devices.

After loading RAIDn with the standard insmod utilities, and waiting for initialization
to complete, it is possible to read and write data directly, up to the known RAIDn
device size, using dd and similar utilities. It is also possible to create, mount, and
use a file system spanning the known RAIDn device size.
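
As a usage illustration only: the sketch below assumes a hypothetical device
node /dev/raidn0 exists once the module has finished initializing, and streams it
once the way a raw dd read would. The node name and chunk size are
assumptions for this example, not part of the RAIDn specification.

    # Minimal raw-read sketch against a hypothetical RAIDn device node.
    # "/dev/raidn0" and the 64 KiB chunk size are illustrative assumptions.

    DEVICE = "/dev/raidn0"
    CHUNK = 64 * 1024

    def read_whole_device(path: str = DEVICE) -> int:
        """Stream the device once, like 'dd if=/dev/raidn0 of=/dev/null'."""
        total = 0
        with open(path, "rb", buffering=0) as dev:
            while block := dev.read(CHUNK):
                total += len(block)
        return total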

MULTIPLE RAIDn DEVICES MOUNTING LIST

A number of instances of RAIDn devices can coexist, using different underlying
devices, and belonging to parity calculation species that are the same or
different.

Information for defining all the current RAIDn devices, or for initializing or
converting RAIDn devices, is stored in at least three places: (A) shared kernel
memory; (B) superblocks or other non-volatile storage; (C) a standard file format
analogous to the Linux "raidtab". The frequencies of updating these may differ,
but each must suffice to recover the current data storage state of a system that
suddenly loses power.
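
The sketch below is a hypothetical illustration of the kind of record such a
"raidtab"-analogous file might carry for one RAIDn instance; the field names are
invented for this example and are not taken from the InoStor implementation.

    # Hypothetical per-device configuration record (illustrative field names only).

    from dataclasses import dataclass, field

    @dataclass
    class RaidnDeviceRecord:
        device: str                  # e.g. "/dev/raidn0" (hypothetical node name)
        species: tuple               # parity calculation species as (n, m)
        underlying: list = field(default_factory=list)   # underlying block devices
        state: str = "clean"         # "clean", "degraded", "reconstructing", ...

    example = RaidnDeviceRecord(
        "/dev/raidn0", (5, 2),
        ["/dev/sdb", "/dev/sdc", "/dev/sdd", "/dev/sde", "/dev/sdf"])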

PARITY GENERATION AND PARITY-BASED DECODING

The RAIDn module(s) will support a number of standard-format Linux RAID
levels, including RAID 0, RAID 1, RAID 4, RAID 0+1, and the current four parity
algorithms of RAID 5. Also supported are a number of the patented RAIDn
algorithms of RAID 5. Also supported are a number of the patented RAIDn
algorithms including all disk insurance levels (between 2 and n-2 inclusive) for
any disk count up to and including nine. Other algorithms for up to 31 disks can
easily be added if needed.

Each parity species supported is capable of operation with all disks or at any disk
loss level from which its algorithm is capable of recovering. All reads and writes
will (after synchronization) leave all operating disks in the correct data and parity
states, able to read correct data by decoding if necessary.

If the disk loss level is exceeded, an error state is returned upon reading or
writing.
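
For the single-insurance case only (m = 1, which the paper equates to RAID 5),
the parity relationship can be sketched directly; the multiple-loss (m >= 2)
encodings covered by the patent are not reproduced here. A minimal sketch,
assuming equal-sized chunks:

    # XOR parity for the m = 1 (RAID 5 equivalent) case: the parity chunk is the
    # XOR of the data chunks, and any one missing chunk is the XOR of the rest.

    from functools import reduce

    def xor_chunks(chunks):
        """XOR a list of equal-length byte strings together."""
        return reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), chunks)

    def encode_parity(data_chunks):
        """Generate the parity chunk for one stripe."""
        return xor_chunks(data_chunks)

    def recover_missing(surviving_chunks):
        """Rebuild the single missing chunk (data or parity) from the survivors."""
        return xor_chunks(surviving_chunks)

    # Example: a 4-data + 1-parity stripe that loses data chunk 2.
    data = [bytes([i]) * 8 for i in range(4)]
    parity = encode_parity(data)
    assert recover_missing([data[0], data[1], data[3], parity]) == data[2]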


SHORTENING AND LENGTHENING


Any instance of RAIDn may be shortened by a command that is atomic with
respect to device and file IO over that RAIDn device. Software moves a suitably
sized file system to the lower sector numbers and limits its growth beyond a
given boundary in such a way that race conditions do not arise.

Any instance of RAIDn may be lengthened (within limits imposed by the
space available on underlying devices) by a command that is atomic with
respect to device and file IO over that RAIDn device.

Software makes additional space available to a file system after such a
lengthening operation, and does it without race conditions or interruption of
service.

INITIALIZATION, RECONSTRUCTION AND CONVERSION

Any working RAIDn of any species may be converted to a RAIDn of any
other available species, given the availability of sufficient underlying resources of
sufficient length (which may include any or all the underlying devices of the
RAIDn being converted away from). This capability includes a self-conversion
from a RAIDn in a degraded state to the same species of RAIDn in a recovered
state (thus reconstruction is a special case of conversion). A reconstruction may
be aborted at any time, reverting by an instant atomic state change to the
previous degraded state.

During reconstruction or conversion, all reads and writes and file system
operations continue to work uninterrupted, with the RAIDn code in charge of
whichever side of the conversion manages the actual IO.

Reconstruction, conversion, and the initialization that takes place after a RAIDn
instance has been specified and before it becomes usable, are background
processes with progress information available analogously to Linux RAID's
"/proc/mdstat" procfile.

All of the above are available to old-style Linux RAIDs, including those that pre-
existed the installation of RAIDn at a particular site.

UNDERLYING DEVICE REQUEST COMPLETION AND STATE SENSING

The state data (above) includes device state for each underlying device, which is
checked at each RAIDn request initialization and each underlying device request
error. In a timely fashion, using atomic operations where necessary, this device
state is used both to recover from transient errors and to enforce degraded state
as necessary until reconstruction or recovery can be completed.


It is possible for state to change in the middle of a RAIDn request-level IO
operation. In this case, the operation is retried until either convergence or failure.
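
A hedged sketch of that retry rule (the predicates are placeholders, not the
module's actual interfaces): the request is reissued whenever the device-state
snapshot changes under it, and an error is reported once the state indicates the
loss level has been exceeded.

    def issue_with_retry(request, snapshot_state, is_failed) -> bool:
        """Reissue a RAIDn-level request until its state converges or fails."""
        while True:
            state = snapshot_state()
            if is_failed(state):
                return False             # e.g. disk-loss level exceeded
            done = request(state)
            if done and state == snapshot_state():
                return True              # converged: state stable across the attempt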

COMPATIBILITY RAID TRANSFER

A software path exists whereby a running software Linux RAID of the old style
(RAID 0, RAID 1, RAID 4, RAID 5, or RAID 0+1) is transferred to the control of
the RAIDn module without noticeable interruption of service. Not only are the data
formats completely consistent, but a syncing operation is enforced which is
crudely the equivalent of a "raidstop" followed by a "raidstart", after which device
calls which previously went to the Linux RAID device are instead now sent to the
RAIDn device of the corresponding old style species.

The compatibility transfer is also capable of going in the other direction (from
RAIDn driver to Linux RAID driver).

RAIDn IMPLEMENTATION – CODE & HARDWARE STRUCTURE

Here, in greatly simplified form, is the conventional arrangement of functional
layers between a file system and disk:

    File System                    (module)
    Raw Block Device I/O Calls     (interface)
    Block Device Driver (BDD)      (module)
    SCSI-like Hardware Interface   (interface)
    Disk-like Hardware             (module)

Figure 1. Standard Block I/O layers

Here is the variant of that when either software RAIDn or hardware RAIDn is
added:


    File System                    (module)
    Raw Block Device I/O Calls     (interface)
    RAIDn Software                 (module)
    Block Device Driver            (module)
    SCSI-like Hardware Interface   (interface)
    RAIDn Hardware                 (module)
    Disk-like Hardware             (module)

Figure 2. Standard RAIDn Layers

Several things are clear from this diagram:

♦ Software and hardware RAIDn are not incompatible with one another - they
can be used alone or together.
♦ RAIDn can be nested. Its bottom boundary fits its top boundary, in either the
software or hardware implementations.
♦ As long as the information transmitted by the "Raw Block Device I/O Calls"
API and the "SCSI-like Hardware Interface" API can be made functionally
equivalent, the RAIDn software and hardware designs can be made
functionally equivalent too.
♦ There is nothing in the diagram that cannot apply to any other form of
standard RAID. This means standard RAID (RAID linear, RAID 0, RAID 1,
RAID 4, RAID 5, and nested combinations) can be implemented as a subset
of RAIDn.
♦ All reference to operating system kernel and buffering is omitted. This is
deliberate, to show the independence of the design from operating system
considerations.


The generic character and simplicity of this design are not an illusion. Our
coding of RAIDn has rapidly converged on a central code block that exhibits just
such simplicity, despite a very wide variety of size and insurance
implementations, with great variations in algorithmic character. In addition, there
is code that deals with specific operating system requirements, independent of
the RAIDn algorithm. We began this with Linux, and expect other operating
systems to be similar. Moreover, hardware/firmware designs will be simpler.

RAIDn DATA FLOW STRUCTURES

The diagrams below show the minimum data flow structure of RAIDn operations.
Steps marked with an asterisk (*) are not included in all such operations.
"Master" refers to action of the external master on the RAIDn module, and "slave"
to action of the RAIDn module on the external slave, so "write" always moves
data to the right, and "read" to the left. In the illustration below, requests and
data transfers appear as separately labeled steps:

    Master                 RAIDn Module (+)           Slave
    --------------------------------------------------------------
    Write Request    -->
    --------------------------------------------------------------
                           Read Requests*       -->
    Write Data       -->
                                                <--    Read Data*
    --------------------------------------------------------------
                           Encode
    --------------------------------------------------------------
                           Write Requests       -->
    --------------------------------------------------------------
                           Write Data           -->

    (+) Processes in each box are accomplished in parallel

Figure 3. Data Flow for RAIDn Write

The separating lines are synchronization points: operations between them can
run in any order or in parallel, with the caveat that those vertically below must run
after the operation vertically above for the same device.


Thus, in the RAIDn write, the acceptance of write data from the master can go on
in parallel with reads (if necessary) from any of the slaves. For example, slave
disk 1 may have accepted its read request and provided its data before slave
disk 0 even accepts its read request.

Another way of explaining the data flow for RAIDn "writes" is to identify the
actions by the generating source:

    Master              RAIDn Process          Slave

    Write Request
    Write Data          Read Requests*
                                               Read Data*
                        Encode
                        Write Requests
                        Write Data
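
The same write flow can be sketched in code (an assumed structure for
illustration; the callables read_slave, write_slave, and encode stand in for the real
module internals): the optional reads of old data proceed in parallel with
acceptance of the master's write data, and the encoded chunks are then written
out in parallel.

    from concurrent.futures import ThreadPoolExecutor

    def raidn_write(master_data, slaves, needs_old_data,
                    read_slave, write_slave, encode):
        """Illustrative RAIDn write: optional old-data reads, encode, slave writes."""
        with ThreadPoolExecutor() as pool:
            # Strip: accept master data while reading old data where needed (*)
            pending = {i: pool.submit(read_slave, s)
                       for i, s in enumerate(slaves) if needs_old_data(i)}
            old_data = {i: f.result() for i, f in pending.items()}

            # Strip: encode produces one chunk (data or redundancy) per slave
            chunks = encode(master_data, old_data)

            # Strips: issue the slave write requests and data in parallel
            list(pool.map(lambda item: write_slave(item[0], item[1]), chunks.items()))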

A similar diagram illustrates the actions involved in a RAIDn “read” operation:

    Master                 RAIDn Processes            Slave
                           (Accomplished in Parallel)
    --------------------------------------------------------------
    Read Request     -->
    --------------------------------------------------------------
                           Read Requests        -->
    --------------------------------------------------------------
                                                <--    Read Data
    --------------------------------------------------------------
                           Decode*
    --------------------------------------------------------------
    Read Data        <--

Figure 4. Data Flow for RAIDn Read

Again, this can also be illustrated by identifying the actions by their generating source:

    Master              RAIDn Process          Slave

    Read Request
                        Read Requests
                                               Read Data
                        Decode*
                        Read Data
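
And a matching sketch for the read flow (again with placeholder callables):
chunks are requested from the surviving slaves in parallel, and the decode step is
only exercised when one or more slaves are missing.

    from concurrent.futures import ThreadPoolExecutor

    def raidn_read(slaves, missing, read_slave, decode, assemble):
        """Illustrative RAIDn read: parallel slave reads, decode only if degraded."""
        with ThreadPoolExecutor() as pool:
            chunks = dict(zip(slaves, pool.map(read_slave, slaves)))
        if missing:
            return decode(chunks)        # parity-based reconstruction of lost chunks
        return assemble(chunks)          # straight assembly of the data chunks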


The data flow shown can be subjected to any applicable buffering schemes. If
accesses to the same area tend to be repetitious, buffering may eliminate the
bottom two strips on the RAIDn “write” and the middle two strips on the RAIDn
“read” in cases where a stripe set is in cache. On the other hand, if accesses
tend to be short and randomly distributed, the slave “reads” and “writes” may be
passed out to another buffering module that does its best to load balance the
seeking process for each slave.
