
Intelligent People. Uncommon Ideas.

Yottabytes and Beyond


Demystifying Storage and
Building large Storage Networks
Part I

by Bhavin Turakhia, CEO, Directi


bhavin.t@directi.com

(shared under Creative Commons Attribution Share-alike License incorporated herein by reference)
(http://creativecommons.org/licenses/by-sa/3.0/)
Why is storage important?

• Web 2.0 applications are an extension of your Desktop


• SaaS is here and growing
• Broadband is a reality
• Storage costs are dropping
• Everyone expects near-unlimited storage online – YouTube, Flickr, Facebook et al are storing your life online*
• (... and yes, let’s not forget your personal BitTorrent collection)

* it would take 1400 TB to store your entire life in video. 5700 TB if you want to know
what was happening around you. Another 73 TB for the audio files of everything you
heard (MP3 quality). That’s about 6000 TB for a copy of your life
Agenda

• Hard disks
 SATA, SAS, FC, Solid state
• RAID
• DAS
• SAN
“Large scale storage requires
careful planning”
Choosing your Hard Disk
(SATA, FC, SAS, SCSI, Solid state)
Introduction to Hard Drives

• Basic physical storage unit (aka Physical block device)


• Variables to consider when selecting a drive
 Type (SAS, SATA, FC)
 RPM
 Capacity
 MTBF (Mean Time between Failures)
 Life Expectancy
Hard Disk types

Attribute | SATA (Serial ATA) | SAS (Serial Attached SCSI) | FC (Fibre Channel)
Typical use | Low-cost, high-volume, low-speed, large-storage environments; CDP / backups | Replacement for SCSI; high performance transaction oriented applications with high IOPS requirement | High performance transaction oriented applications with high IOPS requirement
Performance | Average; typically 7200 RPM | Good (similar to FC); 10k / 15k RPM | Good (similar to SAS); 10k / 15k RPM
Typical drive capacities | 250 GB, 500 GB, 750 GB, 1 TB | 73 GB, 146 GB, 300 GB, 400 GB | 73 GB, 146 GB, 300 GB, 400 GB
Price per GB (based on max drive capacity retail web price) | $0.33 | $2 | $3
Misc | - | Backward compatible with SATA; allows mixing SATA drives on same backplane | -
Hard Disk Conclusions

• For high-IOPS database applications with low storage requirements – you have a choice between FC and SAS
• SAS currently seems like the better option
• Future SAS standards promise to be faster than FC (though it is likely they may remain neck and neck)
• For high-storage requirements (video servers, file servers, photo storage, archival, mail servers, backup servers) SATA is the way to go
• You may combine SAS and SATA to reduce average cost and achieve your goals – especially since the backplanes are cross-compatible
• Read the spec sheets of the hard drives you plan to use to determine specifics
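
To make the "combine SAS and SATA to reduce average cost" point concrete, here is a minimal sketch of a blended cost calculation. The per-GB prices are the 2007 retail figures quoted in the table above; the helper name and the 1 TB / 9 TB split are assumptions chosen only for illustration.

```python
# Per-GB prices taken from the Hard Disk types table (2007 retail figures).
PRICE_PER_GB = {"SATA": 0.33, "SAS": 2.00, "FC": 3.00}


def blended_cost_per_gb(capacity_gb_by_type):
    """Return the average $/GB for a mix of drive types.

    capacity_gb_by_type maps a drive type ("SATA", "SAS", "FC")
    to the number of GB bought of that type.
    """
    total_gb = sum(capacity_gb_by_type.values())
    if total_gb <= 0:
        raise ValueError("total capacity must be positive")
    total_cost = sum(
        PRICE_PER_GB[drive_type] * gb
        for drive_type, gb in capacity_gb_by_type.items()
    )
    return total_cost / total_gb


# Hypothetical mix: 1 TB of SAS for the database, 9 TB of SATA for bulk files.
mix = {"SAS": 1000, "SATA": 9000}
print(round(blended_cost_per_gb(mix), 3))
```

With this hypothetical mix the blended price works out to roughly $0.50 per GB, against $2 per GB for an all-SAS build.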
Solid State Drives

• Uses solid state memory to store persistent data


• Eliminates mechanical parts
• Useful for creating efficient in-between caches or storing
small to mid-sized high performance databases
Solid State Drives
Advantages
• Faster startup – no spinning
• Significantly faster on random IO (from 250x to 1000x+)
• Extremely low latency (25x to 200x better)
• No noise
• Lower power consumption
• Less heat production

Disadvantages
• Significantly more expensive ($10-30/GB for Flash based, $100-200/GB for DDR RAM based)
• Slightly slower on large sequential reads
• Slower random write speeds in case of Flash based storage

• References
 Intro - http://en.wikipedia.org/wiki/Solid_state_disk
 RAM vs Flash based - http://www.storagesearch.com/ssd-ram-v-flash.html
 SSD based SAN!!!  - http://www.superssd.com/
RAID Primer
(0, 1, 2, 3, 4, 5, 6, TP, 0+1, 10, 50, 60)
Introduction to RAID

• Allows multiple disks to appear as a single contiguous physical block device
• Provides redundancy / high availability
• A RAID group appears as a single physical block device

[Diagram: disks HD1 and HD2 combined into a single RAID group]
Comparison of Single RAID Levels

Attribute | RAID 0 | RAID 1 | RAID 5 | RAID 6
Description | Striping | Mirroring | Striping with parity | Striping with dual parity
Minimum disks | 2 | 2 | 3 | 4
Maximum disks | Controller dependent | 2 | Controller dependent | Controller dependent
Array capacity | No. of drives x drive capacity | Drive capacity | (No. of drives - 1) x drive capacity | (No. of drives - 2) x drive capacity
Storage efficiency | 100% | 50% | (No. of drives - 1) / No. of drives | (No. of drives - 2) / No. of drives
Fault tolerance | None | 1 drive failure | 1 drive failure | 2 drive failures
High availability | None | Good | Good | Very good
Degradation during rebuild | NA | Slight degradation; rebuilds very fast | High degradation; slow rebuild (due to write penalty of parity) | Very high degradation; very slow rebuild (due to write penalty of dual parity)
Random read performance | Very good | Good | Very good | Very good
Random write performance | Very good | Good (slightly worse than single drive) | Fair (parity overhead) | Poor (dual parity overhead)
Sequential read performance | Very good | Fair | Good | Good
Sequential write performance | Very good | Good | Fair | Fair
Cost | Lowest | High | Moderate | Moderate+
Use case | Non-critical data; high speed requirements; data backed up elsewhere | Typically used as RAID 10 in OLTP / OLAP applications | Non-write intensive OLTP applications / file servers etc | Non-write intensive OLTP applications / file servers etc
Misc | - | - | Parity can considerably slow down system | Not supported on all RAID cards
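
The array capacity and storage efficiency formulas from the comparison above can be sketched as a small helper. The function names are hypothetical, and the minimum and maximum disk counts are the ones listed in the comparison; real controllers may impose other limits.

```python
# Minimum disk counts per RAID level, as listed in the comparison above.
MIN_DISKS = {0: 2, 1: 2, 5: 3, 6: 4}


def usable_capacity_gb(level, num_drives, drive_gb):
    """Usable capacity in GB for RAID 0, 1, 5 or 6 with equal-size drives."""
    if level not in MIN_DISKS:
        raise ValueError("only RAID 0, 1, 5 and 6 are modelled here")
    if num_drives < MIN_DISKS[level]:
        raise ValueError("too few drives for this RAID level")
    if level == 1 and num_drives != 2:
        raise ValueError("RAID 1 is modelled as a two-drive mirror")
    if level == 0:
        return num_drives * drive_gb          # pure striping, no redundancy
    if level == 1:
        return drive_gb                       # mirror holds one drive's worth
    if level == 5:
        return (num_drives - 1) * drive_gb    # one drive's worth of parity
    return (num_drives - 2) * drive_gb        # RAID 6: two drives' worth


def storage_efficiency(level, num_drives):
    """Fraction of raw capacity that is usable."""
    return usable_capacity_gb(level, num_drives, 1) / num_drives


# Example: eight 500 GB drives.
for level in (0, 5, 6):
    print(level, usable_capacity_gb(level, 8, 500), storage_efficiency(level, 8))
```

The example shows how the parity overhead shrinks as a share of the array when more drives are added, which is why RAID 5 and RAID 6 are listed as cheaper than mirroring.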
Understanding the Parity Penalty

• RAID 5 and RAID 6 store parity information against data for rebuild
• Single parity can be calculated using a simple XOR
• E.g. “abcdefghijkl” on a 4 disk RAID 5 array

Disk 1 | Disk 2 | Disk 3 | Disk 4
A (01000001) | B (01000010) | C (01000011) | {P – 01000000}
Parity {P} | D | E | F
G | Parity {P} | H | I
J | K | Parity {P} | L

• If Disk 2 fails then the data “B” can be recalculated as (01000001 XOR 01000011 XOR 01000000) => 01000010 => B
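
The XOR arithmetic in this example can be checked with a few lines of Python. This is only an illustrative sketch: the `parity` helper is a hypothetical name, and one-byte blocks stand in for real stripe units.

```python
from functools import reduce


def parity(blocks):
    """XOR a list of equal-length byte blocks together, column by column."""
    return bytes(
        reduce(lambda x, y: x ^ y, column)
        for column in zip(*blocks)
    )


# First stripe of the example: A, B and C are data, P is their parity.
a, b, c = b"A", b"B", b"C"
p = parity([a, b, c])
print(format(p[0], "08b"))   # 01000000

# Disk 2 fails: XOR the surviving blocks and the parity to get "B" back.
recovered = parity([a, c, p])
print(recovered)             # b'B'
```

The same function serves for both computing parity and rebuilding a lost block, because XOR is its own inverse.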
Understanding the Parity Penalty

• Steps to change “B” to “X” on Disk 2

Disk 1 | Disk 2 | Disk 3 | Disk 4
A (01000001) | B -> X (01000010) -> (01011000) | C (01000011) | {P – 01000000}

• Read A and C (or, alternatively, the old ‘B’ and the old {P})
• Recalculate {P} as ‘A’ XOR ‘X’ XOR ‘C’ (or as old {P} XOR ‘B’ XOR ‘X’)
• Write ‘X’ and {P}
• A single update required 2 reads and 2 writes
• Random writes in RAID 5 and RAID 6 are very expensive
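
A common alternative to re-reading the rest of the stripe is a read-modify-write: read only the old data block and the old parity, then fold the change into the parity. The sketch below uses hypothetical helper names and the one-byte values from the example; it shows that this gives the same parity as recomputing it from A, X and C.

```python
def xor_bytes(x, y):
    """XOR two equal-length byte strings."""
    return bytes(i ^ j for i, j in zip(x, y))


def updated_parity(old_data, old_parity, new_data):
    """New parity for a single-block update: old parity XOR old data XOR new data."""
    return xor_bytes(xor_bytes(old_parity, old_data), new_data)


old_b, new_x = b"B", b"X"
old_p = bytes([0b01000000])          # parity of A, B, C from the example

# Two reads (old B, old P), then two writes (X, new P).
new_p = updated_parity(old_b, old_p, new_x)
print(format(new_p[0], "08b"))       # 01011010

# Cross-check against recomputing parity from the full stripe A, X, C.
full_stripe_p = bytes([ord("A") ^ ord("X") ^ ord("C")])
assert new_p == full_stripe_p
```

Either way, one logical write turns into several physical I/Os, which is the parity penalty this section describes.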
Understanding the Parity Penalty

• Rebuilding in RAID 5 and RAID 6 is expensive
• The cost increases with the number of disks
• As if this isn’t enough, there is an additional penalty
• All the writes after the computation (i.e. parity and the changed block) must be simultaneous (involving a two-phase commit operation)
 The impact can be marginally reduced through write-back caching
Comparison of Nested RAID Levels

Attribute | RAID 10 | RAID 50
Description | Mirroring then striping | Striping with parity then striping without parity
Minimum disks | Even number, at least 4 | At least 6
Maximum disks | Controller dependent | Controller dependent
Array capacity | (Size of drive) x (No. of drives) / 2 | (Size of drive) x (No. of drives in each RAID 5 set - 1) x (No. of RAID 5 sets)
Storage efficiency | 50% | (No. of drives in each RAID 5 set - 1) / (No. of drives in each RAID 5 set)
Fault tolerance | Multiple drive failures as long as 2 drives from the same RAID 1 set do not fail | Multiple drive failures as long as 2 drives from the same RAID 5 set do not fail
High availability | Excellent | Excellent
Degradation during rebuild | Minor | Moderate degradation; slow rebuild (due to write penalty of parity)
Read performance | Very good | Very good
Write performance | Very good | Good
Use case | OLTP / OLAP applications | Medium-write intensive OLTP / OLAP applications
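
The nested array capacity formulas can be sketched the same way as the single-level ones. The function names are hypothetical, and the minimums used here (4 drives for RAID 10, two RAID 5 sets of at least 3 drives each for RAID 50) are the usual ones and are stated as an assumption.

```python
def raid10_capacity_gb(num_drives, drive_gb):
    """RAID 10: half the raw capacity, since every drive is mirrored."""
    if num_drives < 4 or num_drives % 2 != 0:
        raise ValueError("RAID 10 needs an even number of drives, at least 4")
    return drive_gb * num_drives // 2


def raid50_capacity_gb(drives_per_set, num_sets, drive_gb):
    """RAID 50: each RAID 5 set loses one drive's worth of space to parity."""
    if drives_per_set < 3 or num_sets < 2:
        raise ValueError("RAID 50 needs at least 2 sets of at least 3 drives")
    return drive_gb * (drives_per_set - 1) * num_sets


def raid50_efficiency(drives_per_set):
    """Usable fraction of raw capacity for RAID 50."""
    return (drives_per_set - 1) / drives_per_set


# Example: twelve 300 GB drives arranged either way.
print(raid10_capacity_gb(12, 300))        # 1800
print(raid50_capacity_gb(4, 3, 300))      # 2700
print(raid50_efficiency(4))               # 0.75
```

For the same twelve drives, RAID 50 yields more usable space while RAID 10 keeps the better write performance noted in the comparison.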
Nested RAID Misc Notes

• RAID 10 is faster and better than RAID 0+1 for the same
cost
• RAID 60 is similar to RAID 50 except that the striped sets
with parity contain dual parity
• Ideally RAID 10 and RAID 50 will be the only nested RAID
levels you will use
RAID Considerations

• Select your stripe size by empirical testing
 A smaller stripe size increases transfer performance and decreases positioning performance, and vice versa
 Ideal stripe sizes depend on your application, the typical amount of data fetched per read, sequential vs random reads etc
• Try and select hard drives from separate production
batches
• Maintain sufficient Spares in a large array (typically 1 per
10-15 disks is sufficient)
• Use Global spares across RAID groups if your controller
supports it
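
To see why stripe size trades transfer performance against positioning performance, it helps to count how many disks a single request touches. The sketch below assumes a plain RAID 0 style layout in which "stripe size" means the chunk written to one disk before moving to the next; the function name and the sizes are hypothetical.

```python
def disks_touched(offset_kb, length_kb, stripe_unit_kb, num_disks):
    """Return the set of disk indexes that a request touches.

    Assumes consecutive stripe units are laid out round-robin across disks.
    """
    if length_kb <= 0 or stripe_unit_kb <= 0 or num_disks <= 0:
        raise ValueError("sizes and disk count must be positive")
    first_unit = offset_kb // stripe_unit_kb
    last_unit = (offset_kb + length_kb - 1) // stripe_unit_kb
    return {unit % num_disks for unit in range(first_unit, last_unit + 1)}


# A 64 KB read on a 4-disk array with two different stripe unit sizes.
print(sorted(disks_touched(0, 64, 16, 4)))    # [0, 1, 2, 3]
print(sorted(disks_touched(0, 64, 128, 4)))   # [0]
```

With the small stripe unit all four disks transfer in parallel, but all four must also seek; with the large one a single disk serves the request and the others stay free for other requests. Which is better depends on the workload, hence the advice to test empirically.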
RAID Considerations

• Use hardware RAID unless performance is not a consideration
 Nested RAID levels and parity based RAID in particular consume more CPU cycles and increase rebuild time if implemented in software
• General rule about controller cache – the larger the better
• Ensure the controller has battery backup to retain its cache in case of power failure
• For internal RAID controller cards use faster PCI buses (PCI-X)
The Fun starts –
Let’s build our
storage system
Passive Disk Enclosure
based Direct Attached
Storage (PDE based DAS)
Passive Disk Enclosure based DAS

• DAS – Direct Attached storage


• RAID controller inside host machine
• External chassis is simply a JBOD (Just a Bunch Of Disks)
 (or what I’d like to call Passive Disk Enclosure or PDE)
• PDE enables stringing a larger number of drives together as compared to an internal RAID array
• E.g. Dell PowerVault MD1000
Passive Disk Enclosure based DAS

• Passive Disk Enclosure can consist of SAS, SATA or FC drives
• Passive Disk Enclosure to RAID Controller connectivity can
be SAS, FC, SCSI (possibly different from the backplane)
• Multiple PDEs can be daisy chained if they support it
• RAID card is a single point of failure
• Only one host machine supported
• Array of disks can be divided into multiple RAID groups
Passive Disk Enclosure based DAS

• Array of disks can be divided into multiple heterogeneous RAID groups
• Size and type of a RAID group depends on the RAID card
• PDE may have multiple paths to the system with the possibility of multiplexing for increased speed
• Global spares can be defined on the RAID card
• Maximum storage size = maximum number of PDEs that can be daisy chained x number of drives per PDE x size of drives
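
The maximum storage size works out as enclosures times drives per enclosure times drive size. A minimal sketch, with a hypothetical function name and made-up figures that are not the specifications of any particular enclosure:

```python
def max_raw_capacity_gb(max_enclosures, drives_per_enclosure, drive_gb, spares=0):
    """Raw capacity of a daisy chain of PDEs, before RAID overhead.

    spares is the number of drives set aside as hot spares (assumed
    to hold no data until a failure occurs).
    """
    total_drives = max_enclosures * drives_per_enclosure
    if not 0 <= spares <= total_drives:
        raise ValueError("spares must be between 0 and the total drive count")
    return (total_drives - spares) * drive_gb


# Hypothetical chain: 3 enclosures of 15 bays each, 750 GB SATA drives.
print(max_raw_capacity_gb(3, 15, 750))             # 33750
print(max_raw_capacity_gb(3, 15, 750, spares=3))   # 31500
```

Usable capacity will be lower again once the RAID level of each group is taken into account.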
Passive Disk Enclosure based DAS

• Performance Considerations
 Drives
 RAID configuration
 PDE Interconnect
 PDE to RAID Card connect
 RAID card config (cache etc)
 PCI bus
Active Disk Enclosure based
Direct Attached Storage
(ADE based DAS)
Active Disk Enclosure based DAS

• ADE difference -> RAID card is not in the host machine but in the enclosure
• Host machine has a SAS/FC Host Bus Adaptor (HBA) depending on ADE to host connectivity support
 Some ADEs may support multiple connection protocols
• ADE may support SAS/FC/SATA drives
• ADE can support daisy-chaining PDEs
• E.g. of ADE – Dell MD3000, Infortrend EonStor devices, Nexsan SATABeast and SATABoy etc
Active Disk Enclosure based DAS

• ADE may support dual RAID controllers
• RAID controllers can be used as Active-Active (in case of multiple RAID groups) – otherwise as Active-Passive
• RAID controller to HBA connectivity can be multiplexed - if supported - for higher throughput
• ADEs are wrongly but commonly referred to as a SAN (“SAN device” would still be alright)
Partitioning and Mounting
Logical Volumes

• A RAID Group is a physical unit of storage
• At the operating system level a Logical Group (a Volume Group in LVM terms) can be created out of multiple RAID Groups
• Each Logical Group can be further divided into Logical Volumes
• Each Logical Volume represents a mountable block device
• In Linux this is done using LVM
• In LVM, Logical Volumes are resizable
SAN (Storage Area Network)
SAN

• Multiple host machines connected to an ADE through a SAN switch
• SAN refers to the interconnect + switch + ADE + PDE
• Switch and HBA can be SAS / FC depending on interconnect type supported by ADE
• ADE would support creation of volumes
• These can be mounted onto a client and further subdivided
SAN

• Care must be taken to mount each Logical Volume onto a single client (unless you are running a Clustered File System)
• This can be achieved by host masking supported by the ADE and/or the switch
• Without careful host masking and mounting, data corruption can take place
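
The rule behind host masking can be modelled in a few lines. This is a toy sketch only: `MaskingTable` and its methods are hypothetical names and do not correspond to any vendor's management interface. It simply enforces "one volume, one host" unless a clustered file system is declared.

```python
class MaskingTable:
    """Toy model of host masking: which hosts may see which volume."""

    def __init__(self):
        self._hosts = {}  # volume name -> set of host names allowed to see it

    def allow(self, volume, host, clustered_fs=False):
        """Present a volume to a host.

        A second host is refused unless the caller states that the
        volume carries a clustered file system.
        """
        hosts = self._hosts.setdefault(volume, set())
        if hosts and host not in hosts and not clustered_fs:
            raise ValueError(
                f"{volume} is already presented to {sorted(hosts)}"
            )
        hosts.add(host)

    def can_see(self, volume, host):
        """Return True if the host is allowed to see the volume."""
        return host in self._hosts.get(volume, set())


table = MaskingTable()
table.allow("vol1", "web1")
print(table.can_see("vol1", "web1"))   # True
print(table.can_see("vol1", "db1"))    # False

try:
    table.allow("vol1", "db1")         # second host, no clustered FS
except ValueError as err:
    print("refused:", err)
```

Real arrays and switches implement this with LUN masking or zoning; the point of the sketch is only that the check must happen before a second host mounts the volume.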
SAN

• Complex SAN configs include multiple hosts and multiple ADEs connected to active-active switches with multiplexed connections
• Client hosts can run heterogeneous operating systems
• (Funnily, ADE to PDE paths are sometimes not multiplexed)
SAN

• While this looks complex – just think of it as removing hard disks from the machine and hosting them outside in separate enclosures
• Each machine mounts an independent partition from the SAN
SAN

• Performance Considerations
 All variables we covered before
 Switch config
 Ensure that the switch / HBA / interconnect does not become the bottleneck and that full hard disk throughput can be utilized
Throughput Calculations

• Hard disk performance – Type, RPM etc


• Data distribution and Type of Data access
• RAID performance, number of drives, RAID type
• RAID card performance – cache, active-active config etc
• ADE to switch connection speed
• Switch to HBA connection speed
• HBA to PCI bus speed
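
One rough way to use this list is to estimate the throughput of each stage and take the minimum, since the slowest stage caps the whole path. The sketch below is illustrative only: the function name is hypothetical and every figure is a made-up placeholder, not a measurement or a specification.

```python
def bottleneck(stage_mb_per_s):
    """Return (stage name, MB/s) of the slowest stage in the data path."""
    if not stage_mb_per_s:
        raise ValueError("at least one stage is required")
    name = min(stage_mb_per_s, key=stage_mb_per_s.get)
    return name, stage_mb_per_s[name]


# Hypothetical path from disks to host; all numbers are placeholders.
path = {
    "drives (12 x 70 MB/s)": 12 * 70,
    "RAID controller": 800,
    "ADE to switch link": 400,
    "switch to HBA link": 400,
    "HBA to PCI bus": 1000,
}

stage, speed = bottleneck(path)
print(stage, speed)   # ADE to switch link 400
```

In this made-up case the drives could deliver more than twice what the links can carry, so adding disks would not help until the interconnect is upgraded or multiplexed.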
That’s all Folks
“Let’s go build out our Yottabyte
arrays and fill ‘em up”
[Considerably exaggerated hyperbole given that the combined space of all computers in the world today (2007) doesn’t add up to 1 Yottabyte (2 ^ 80 bytes). In fact the entire world’s storage is projected to hit 988 exabytes (1 exabyte = 2 ^ 60 bytes) by 2010]

[6th Sep 2007 - http://www.networkworld.com/newsletters/stor/2007/0903stor2.html – Nanotech breakthrough could put entire YouTube contents on an iPod-size device]
Part II sneak preview

• Complex SAN configurations


• iSCSI
• NAS
• Clustered Storage
• GFS
• Backups
• Storage Monitoring
• Storage Benchmarking
• Some Commercial storage vendors
Shameless HR Propaganda Slide


• Directi builds cool Web products
• Deployed on distributed architecture
• Using terabytes of storage
• Used by millions of users
• Generating billions of pageviews and transactions
• Spanning every possible software engineering technology
http://careers.directi.com | http://wiki.directi.com | http://cosmos.directi.com

Personal Blog: http://bhavin.directi.com


Mail: bhavin.t@directi.com
