
Disk configurations


While Postilion International does not count disk setup as a core competency (we should defer to EMC, Stratus, et al. for that), it is inevitable that our clients will rely on us for guidance and other input. This page is intended for sharing information about disk setups, so that we can be better informed, especially as we move further into the arena of high-volume, mission-critical systems for more prominent clients.

See also: High Availability, RAID

 Ideal configuration scenarios


 Disk types
 EMC Clariion SAN
o AX-range (AX100/AX150/...)
o CX-range (CX300/CX500/CX700/...)
 Stratus ftServer
o internal disks
o ftScalableStorage (ftSS)
 Case studies
 AEME: AmEx Middle East
o Status
o What we needed to drive
o What we had to work with
o What we decided
o Experience gained
 PoinSys, Sweden / Point International
o Status
o What we needed to drive
o What we had to work with
o What we decided
o Experience gained
 GE EuroHub / UK
o Status
o What they needed to drive
o What they had to work with
o What was implemented
o Experience gained
 Retail Decisions, ReD
o Experience gained
 See also ...

Ideal configuration scenarios


Corporate Communications has produced a "Sizing Guide for Stratus ftServer Systems" that also contains
general information on sizing and preferred disk setups. Some of this page is taken from there.

Ideally:

 RAID
o Use RAID 10 (aka RAID 1+0) for Database data files
(Realtime/PostCard/PostOffice data) – speed, redundancy and resilience
o Use RAID 1 for Database Log files, DoubleTake buffers, and OS+Applications (and potentially the temp_db: see below) – speed and redundancy
   RAID 10 can of course be used for Log files too, if the customer is willing to pay for the increased number of drives
o do not rule out RAID 5:
   RAID 5 is typically slower than RAID 10, especially if software RAID is used; however, some hardware RAID implementations give very good RAID 5 performance (potentially even better than RAID 10)
   see the Internal Postilion Benchmarks, e.g. ftScalableStorage (ftSS)
   how do the rebuild/recovery times after failure compare for RAID 5 vs RAID 10?
      RAID 10 gives consistent times irrespective of the number of drives in the volume
      RAID 5 rebuild is perhaps slower and more intensive?
   RAID 50 (5+0) is now available on some arrays, and also has good performance.
 See Werner's tuning documents.

 Allocate a single volume for a single usage, do not mix usages as different usages load
the disks differently (see the layout sketch at the end of this section), e.g.
o each data file (Realtime/PostCard/PostOffice) should have its own volume
o each log file (Realtime/PostCard/PostOffice) should have its own volume
o if you suspect that the Office database will have to do some heavy sorting, consider moving the temp_db to its own RAID 1 volume
o the Double-Take buffer should have its own volume, one that does not contain files that are part of a DT replication set
 don't partition a volume, i.e. a spindle should not be shared across partitions (because
you might violate the 1-volume-per-usage rule above);
 SAN:
o just because SANs are fast does not mean it's a good idea to share a volume between servers (see the ReD case study below)
o make sure that the route to the disk is not shared, e.g. sharing disk controllers can reduce throughput, and also the consistency of the throughput, even if the spindles aren't shared.
o note that the disk utilisation profile (volume, and proportion of reads to writes, etc.) varies considerably between Realtime, PostCard, and Office; hence being able to configure different cache profiles etc. for each of these DBs would be beneficial.
   does anyone have rules of thumb for this?
   perhaps 75% write cache (25% read) for the Realtime volume, and 75% read cache (25% write) for the PostCard and Office volumes?
   perhaps 75% write for the temp_db volume?

Glossary used here:

 spindle: a single physical disk drive unit;


 volume or logical disk: a grouping of disks forming a single addressable array (so 2 spindles mirroring each other in a RAID 1 array is a single volume – "logical disk" in Stratus parlance)
 partition: a drive as seen by Windows: C:\, D:\, etc. There may be several partitions per volume (though we prefer not; see above).
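
To make the volume recommendations above concrete, the sketch below records a per-usage layout and checks that no two usages share a volume. It is a minimal illustration only: the drive letters and the RAID level per usage are hypothetical examples, not a prescribed configuration.

```python
# Hypothetical volume layout illustrating the one-volume-per-usage rule.
# Drive letters and RAID levels are examples only, not a prescription.
REQUIRED_USAGES = [
    "Realtime data file", "PostCard data file", "Office data file",
    "Realtime log file",  "PostCard log file",  "Office log file",
    "temp_db", "Double-Take buffer",
]

VOLUME_LAYOUT = {            # usage -> (volume, RAID level)
    "Realtime data file": ("E:", "RAID 10"),
    "PostCard data file": ("F:", "RAID 10"),
    "Office data file":   ("G:", "RAID 10"),
    "Realtime log file":  ("H:", "RAID 1"),
    "PostCard log file":  ("I:", "RAID 1"),
    "temp_db":            ("J:", "RAID 1"),
    "Office log file":    ("K:", "RAID 1"),
    "Double-Take buffer": ("L:", "RAID 1"),
}

def check_layout(layout, required):
    """Every usage gets a volume, and no two usages share the same volume."""
    missing = [u for u in required if u not in layout]
    volumes = [vol for vol, _ in layout.values()]
    shared = sorted({v for v in volumes if volumes.count(v) > 1})
    if missing or shared:
        raise ValueError(f"missing usages: {missing}; shared volumes: {shared}")

check_layout(VOLUME_LAYOUT, REQUIRED_USAGES)
```

Whether temp_db really warrants its own volume depends on the expected Office sorting load, as noted above.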

Disk types
EMC Clariion SAN
AX-range (AX100/AX150/...)
 this is the low-budget entry-level SAN.
 it is capable of RAID 5 and "RAID 1/0"
o RAID 5 is the default setup; it is optimised for this (EMC has its own RAID5
optimisations, which improve performance further)
o "RAID 1/0" means it can support RAID1 and RAID 1+0 (... though this may be
0+1?)
 There are typically two 2Gbps Fibre Channels, with a SAN disk controller/processor per
channel:
o if the disks are configured as a single RAID volume, then only one channel will be used (the other is a redundant backup); however, the channels significantly outperform the disk spindles, so this is not a bottleneck (see the rough arithmetic after this list);
o if the disks are configured as two (or more) RAID volumes, then both channels will be active (one per volume in a two-volume system) if everything is operating normally;
 additional disks can be configured as hot-standbys, which the AX will create and bring
into service automatically should a RAIDed drive fail.
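
As a rough sanity check on the claim that the channels outperform the spindles, the arithmetic below compares a 2 Gbps Fibre Channel link with an assumed per-spindle sustained throughput; both figures are ballpark assumptions for illustration, not measurements of a particular AX unit.

```python
# Back-of-the-envelope comparison of one 2 Gbps Fibre Channel link against
# spindle throughput. All figures are rough assumptions, not benchmark results.
FC_LINK_GBPS = 2.0
FC_PAYLOAD_EFFICIENCY = 0.8    # 8b/10b encoding leaves roughly 80% for payload
SPINDLE_MBPS = 60.0            # assumed sustained sequential throughput per spindle

channel_mbps = FC_LINK_GBPS * 1000 / 8 * FC_PAYLOAD_EFFICIENCY   # ~200 MB/s
spindles_per_channel = channel_mbps / SPINDLE_MBPS

print(f"one channel: ~{channel_mbps:.0f} MB/s")
print(f"spindles needed to saturate it: ~{spindles_per_channel:.1f}")
```

In other words, it takes several spindles streaming sequentially to saturate one channel, and random I/O (the common case for these databases) falls well short of that, which is why the spindles rather than the channels tend to be the limit.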

CX-range (CX300/CX500/CX700/...)
 This is a high-end SAN.
 Can also be equipped with SRDF, EMC's replication technology that copies whole
volumes to a remote site over a high-bandwidth connection (think hardware DoubleTake
on steroids)
o Abbey (major UK bank, but not a Postilion user) uses this to replicate whole
machines (i.e. the Primary & DR machines run as diskless servers booting
from the SAN) – it works, but makes Stratus nervous.

Stratus ftServer
internal disks
 These are set up in RAID1 pairs.
o NB: the mirroring is done in software, i.e. takes CPU cycles from the server.
o the software is called RDR (Rapid Disk Resync) and is also used to remirror
disks efficiently after an outage (planned or unplanned)
 at installation, the system disk (C:) gets a 16GB partition by default
o this is what Stratus has determined is sufficient for most clients;
o the rest of the volume is unformatted, but the intention is that it is reserved
for ActiveUpgrade
o if our clients are not planning on using ActiveUpgrade, then the space can be
used (hopefully the systems are defended-in-depth to reduce the number and
frequency of OS hotfixes, and hopefully there is a DR server for use during
planned outages if need be)
 ActiveUpgrade
o is a process by which both the system (CPU, memory, etc) and the system
disk's RAID1 mirror can be simplexed (brought out of fault-tolerant/redundant
mode) for upgrades;
o this leaves one half of the system running the production functionality as
normal;
o the other half runs just the OS on the system disk, but is totally isolated from
any data disks;
o software patches can be applied and tested
   if approved, the system commits the changes, brings the new software into production, and re-duplexes itself into fault-tolerant mode;
   if aborted, the changes are abandoned and the system re-duplexes itself;
o currently this works for Microsoft hotfixes; in the future it will work for Stratus upgrades too
o it is of little use for Postilion upgrades since most need access to data that is
on the isolated disks, and you cannot store data on the system disk because
any changes will be lost when the system re-duplexes.

ftScalableStorage (ftSS)
 Stratus ftScalableStorage aka ftSS aka SftSS aka ftStorage aka ...
 Stratus's own high performance external disk enclosures
o currently (April 2007) available in non-SAN version, soon to have a SAN
version too
o see the Internal Postilion Benchmarks performed on the non-SAN version
 from email with Stratus
o you should be able to use PerfMon for each volume in an array (a monitoring sketch follows this list). You will not see physical disks, as these are hidden by the RAID controller and the volume appears as a single spindle
o if you enable Windows' write caching (Computer Management --> Device Manager --> Disk Drives --> (Disk) --> Policies), the setting is acknowledged but ignored by the drive... it does its own caching.
o storage cache cannot be tuned for read or write bias: writes get priority unless the drive is not experiencing many writes, at which point it shifts its cache to a read bias
   only the read-ahead cache can be tuned, and then on a volume-by-volume basis
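
A minimal monitoring sketch to go with the PerfMon point above, assuming the standard Windows typeperf utility and a hypothetical array volume mounted as D:; the counters listed are the usual LogicalDisk counters, and the sampling settings are examples only.

```python
# Sample per-volume disk counters via Windows' typeperf utility.
# The volume letter (D:) and the sampling settings are examples only.
import subprocess

COUNTERS = [
    r"\LogicalDisk(D:)\Disk Reads/sec",
    r"\LogicalDisk(D:)\Disk Writes/sec",
    r"\LogicalDisk(D:)\Avg. Disk sec/Read",
    r"\LogicalDisk(D:)\Avg. Disk sec/Write",
    r"\LogicalDisk(D:)\Current Disk Queue Length",
]

# Take 60 samples at 1-second intervals and write them to a CSV for later analysis.
subprocess.run(
    ["typeperf", *COUNTERS, "-si", "1", "-sc", "60", "-o", "ftss_volume_d.csv"],
    check=True,
)
```

Because the controller hides the physical spindles, these figures describe the volume as a whole; per-volume latency and queue length are usually the most useful indicators of an overloaded volume.
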
RAID

RAID is an acronym for Redundant Array of Inexpensive Disks. Depending on the configuration it can
provide a performance boost, fault tolerance, or both, and to varying degrees.

RAID-0, Disk Spanning


RAID-0 is not fault tolerant; there are no redundant drives in a RAID-0 array. Data is striped across two or more physical disks to make one (larger) logical disk. RAID-0, or spanning (see Just a Bunch Of Disks), is included with the other "RAID" levels by convention, and because the span is a basic building block for other RAID levels. Spanning allows a file or volume to be larger than a single disk, which simplifies file service in some operating systems or applications requiring very large capacity.

Spans are often described as being faster than single drives. When a data request bridges drives, the drives can effectively be accessed in parallel, and this increase in peak transfer rate can be demonstrated with favorable benchmarks. Transfer rate is not the only factor in overall performance, however: spans normally have an increased access delay, because transactions that bridge drives must wait for both drives to complete, which statistically takes longer than for a single drive. Normal system operation is a mix of varying demands. Performance in a single-user system reading and writing large files is dominated by transfer rate (MB/sec); performance in multi-user, multi-tasking systems is dominated by access times (ms/seek). Designers should carefully test their assumptions before committing to a span for performance reasons.

Spanning can be applied to any number of disks, but is typically applied to a group of two or four. The probability of a drive failure increases as more drives are added: a span of two drives has twice the chance of failure of a single drive, and a span of four drives has four times the likelihood of failure. Large spans are rarely used because of this increased chance of failure. Mirroring (RAID-10) or parity (RAID-5) is added to a span for fault tolerance. A span, or any grouping of RAID drives, may also be called an array or logical drive.
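
The "twice/four times the chance of failure" figures are the usual first-order approximation; the short sketch below shows the underlying arithmetic, with a per-drive failure probability that is assumed purely for illustration.

```python
# Probability that a span (RAID-0) loses data, i.e. that at least one member drive fails.
# p is an assumed per-drive failure probability over some fixed period, for illustration only.
def span_failure_probability(p, n_drives):
    return 1 - (1 - p) ** n_drives

p = 0.02  # assume a 2% chance that a single drive fails during the period
for n in (1, 2, 4, 8):
    print(f"{n} drive(s): {span_failure_probability(p, n):.4f}")

# For small p this is approximately n * p, which is where the "twice / four times
# the chance of failure" rule of thumb in the text comes from.
```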

RAID-1, Disk Mirroring


RAID-1 always uses disks in pairs. A complete copy of a working drive is constantly "mirrored" to a second drive; if either drive fails, the system continues to work from the other. Mirrored systems require at least two drives; higher RAID levels require more than two. This level of fault tolerance is relatively efficient, as the system or controller does not have to generate or check parity, and the array can run as fast in the failed state as in the healthy state. The extra overhead associated with writing data to both drives can be offset at moderate demand levels by caching. For disk volumes up to the maximum size of a single readily available disk, RAID-1 is the fastest, most economical, least complex and most reliable choice for drive fault tolerance. It is found extensively in industrial, manufacturing and communications systems where performance and minimum downtime are critical. As drives increase in size and decrease in cost, more server and desktop systems are implemented with RAID-1.

RAID-5, Disk Striping with Parity


When a large file or volume (bigger than a single disk) is required, multiple physical drives are combined to create a larger logical drive. As drives are added, the chance of a drive failure increases; a five-drive array has five times the chance of failure of a single drive. By adding parity, missing data can be reconstructed from the remaining data when a drive fails. Parity can be applied to any number of drives, but is usually applied to a group of three or five. Conceptually, RAID-5 adds parity to a RAID-0 span; either level may be referred to as a span, array or logical drive. Because of parity, N+1 disks are required to store N disks' worth of data: a two-drive span becomes a three-drive array with parity, and a four-drive span becomes a five-drive array with parity.

Parity can be implemented in several ways. RAID-5 distributes the parity evenly among the drives and computes parity over multiple-sector data stripes, which is the most efficient method for most applications. Other RAID levels (2, 3, 4) distribute the parity differently. Generally, arrays cannot be changed from one level to another, or moved to a different controller, without rebuilding the array. There is a casual but incorrect association between RAID-5 and five-drive arrays: the RAID level describes a methodology, not the number of drives.

RAID-5 is a cost-effective way to add fault tolerance to a span of drives, since only one additional drive is required. But parity adds significant overhead, and the performance penalty in the failed state is severe: what previously required a single read can now require four reads (in a five-drive array) to reconstruct the data. While computation is a factor, the greater delay generally comes from drive latency, waiting on the additional drive(s) to respond with data or parity.
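
To make the parity mechanism concrete, here is a toy sketch of how a missing block is rebuilt by XOR-ing the surviving drives, which is also why a degraded read in a five-drive array needs four reads; the block contents are made up for illustration.

```python
# Toy illustration of RAID-5 parity reconstruction: the block on a failed drive is
# the XOR of the corresponding blocks on the surviving drives. Data values are made up.
from functools import reduce

def xor_blocks(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

# Four data drives plus one parity block (a five-drive array).
data_blocks = [b"\x11\x22\x33\x44", b"\xAA\xBB\xCC\xDD", b"\x01\x02\x03\x04", b"\x0F\x0E\x0D\x0C"]
parity = reduce(xor_blocks, data_blocks)   # stored alongside the data (RAID-5 rotates it across drives)

# Drive 2 fails: rebuild its block from the three surviving data blocks plus parity,
# i.e. four reads where a healthy array would have needed one.
surviving = data_blocks[:2] + data_blocks[3:] + [parity]
assert reduce(xor_blocks, surviving) == data_blocks[2]
```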

RAID-10, a Mirrored Span


Where RAID-5 optimizes for cost, by adding only one drive, RAID-10 (aka RAID 1+0) optimizes for performance by adding more drives. Each drive in the span is duplicated (mirrored), eliminating the overhead and delay associated with parity. A RAID-10 span can operate as fast in the failed state as in the non-failed state: instead of all drives needing to be read to reconstruct data, data is simply read from the mirrored drive, and instead of multiple read/write operations to update parity, data is simply written to the mirrored drive. A two-drive span requires four drives with RAID-10; a four-drive span requires eight. While the larger number of drives (compared to RAID-5) increases the chance of a failure, there is a substantial increase in performance.

Performance Under Non-Faulted Conditions

All fault-tolerant levels take a performance hit on writes because, for redundancy, data must be written to more than one drive. At moderate levels of system activity, the impact of the additional write(s) can be absorbed by caching. Small reads from a span (RAID-0, 5, 10) and all reads from a mirror (RAID-1) access a single drive; when only a single drive is accessed, the speed of the array is the same as for a single-drive system.

The size of the data stripe determines how data is interleaved on a span. The smaller the stripe, the more often a data request will bridge drives; the larger the stripe, the more often it will be serviced from a single drive. When a transfer bridges more than one drive, several factors come into play. While bridging can increase peak transfer speed (by overlapping access to multiple drives), it also increases access delay, because the entire transaction is not completed until all drives are finished. Actual system operation determines whether transfer speed or access speed dominates. Performance in a single-user system reading and writing large files (only) is typically dominated by transfer rate (MB/sec); performance in multi-user, multi-threaded or multi-tasking systems is typically dominated by access times (ms/seek). Complex real-time systems generally spend more time waiting for the data to become available (access) than on the actual data transfer, and RAID-1 is usually the implementation of choice for such systems. A designer should carefully evaluate and test performance assumptions to avoid surprises.

A RAID controller typically overlaps transfer commands to multiple drives. If the drives co-operate, the data transfer speed is effectively increased, but different brands or models of drive do not always behave the same. For drives on a SCSI bus to overlap data transfers, they must disconnect and reconnect, sharing the bus. This takes advantage of the fact that the bus transfer rate (the number advertised) is higher than the rate at which data moves under the heads (a number not advertised). Because a disconnect-reconnect sequence takes time (overhead), a drive that does not do this will test faster in a single-drive system, and some manufacturers take advantage of this to post faster times on benchmarks. If drives do not disconnect well, then a multiple-channel controller is required to take full advantage of a multi-drive system.

RAID levels that use parity (RAID-5) require complex handling for some write conditions. When a transfer is much greater than the stripe size, the controller can compute parity directly from the data in cache, while it is on its way to the drive; data and parity are then written directly to the drives, ideally by overlapping transfers. When the write operation is "smaller", complementary data must be read from the other drive(s) in order to generate parity. A small write thus requires one or more read operations before the write can take place. "Small" does not necessarily mean just a few bytes, or infrequent: any time the host writes a block that does not satisfy parity internally, this function must be performed. Record updates, voice, mail and communications-type transfers are affected drastically. Benchmarking with large file writes is an ineffective measure of this performance impact.

When more than one drive must be accessed to complete a transfer, the average access time is slower than that of a single drive. Each drive still averages the same as if it were a single drive, but now every drive must complete its transfer before the overall transfer is complete; every transfer is dominated by the last drive to respond. Average delay therefore increases as drives are added to the array. Disk caching can overcome read delays for frequently accessed data, and at moderate levels of system activity it also compensates for some of the additional write delays.

Overall performance is strongly affected by the choice of drives. Drive performance varies with age, temperature and the stability of system power; these factors affect the number of recoverable errors that are encountered. All drives encounter some number of "soft" errors, and these errors tend to increase in frequency until at last the drive "fails" with a hard error. Error handling can greatly decrease the effective speed of a drive, while the performance of the controller is constant relative to drive variations. Different caching strategies can optimize a specific drive for a specific data requirement, for example optimizing the stripe size for a specific data transfer (block size) in a test with a specific brand of drive. Later, if the brand or block size is changed (or if the test did not really represent actual conditions), sub-optimal performance results. Most marketing benchmarks emphasize the highest number obtainable under a selected condition, the implication being that one high number implies all high numbers. A single performance number is rarely indicative of the full range of modern system demands.
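
As a compact recap of the trade-offs discussed in this appendix, the sketch below tabulates drive counts, usable capacity and the textbook per-small-write I/O cost for each level; the numbers are idealized textbook values, and a particular controller or cache may behave differently.

```python
# Textbook trade-offs for n data drives of capacity c (GB). Idealized values only;
# real controllers and caches will differ. RAID-1 is a single mirrored pair, so it
# only covers one drive's worth of capacity regardless of n.
def raid_summary(n, c):
    return {
        "RAID-0 (span)":   {"drives": n,     "usable_gb": n * c, "ios_per_small_write": 1},
        "RAID-1 (mirror)": {"drives": 2,     "usable_gb": c,     "ios_per_small_write": 2},
        "RAID-5":          {"drives": n + 1, "usable_gb": n * c, "ios_per_small_write": 4},
        "RAID-10":         {"drives": 2 * n, "usable_gb": n * c, "ios_per_small_write": 2},
    }

for level, row in raid_summary(n=4, c=300).items():
    print(f"{level:16s} drives={row['drives']:2d}  usable={row['usable_gb']:5d} GB  "
          f"I/Os per small write={row['ios_per_small_write']}")
```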
