03 Cse470 Unit 3

Fundamentals of High Availability (HA)

* Introduction to High Availability
* Availability levels and the importance of availability
* How High Availability can be achieved
* High Availability components explained
* Types of HA solutions
* High Availability selection criteria
* High Availability for virtual machines

Innovation Centre for Education

Introduction
After completing this unit, you should be able to:
* Explain the importance of availability in IT infrastructure
* Define High Availability
* Define the industry terminology used in the field of High Availability

Note: This chapter covers definitions, terminology and fundamentals. Knowing the fundamentals is essential for understanding the intricacies and meeting the objectives. By the end of this chapter the student should understand the various components, how they relate to each other, and what objectives they meet.

Definition of High Availability
Service around the clock: 24 hours x 7 days a week x 365 days a year.
* High Availability (HA) is a term used for computer services or systems. It refers to the ability of a system or an infrastructure to run a service with minimal downtime.
* High Availability provides maximum uptime while minimizing the risks associated with service outages.
* High Availability is not a product but a design principle that applies to all systems to provide increased availability.
* Availability is normally quantified by the number of 9's of uptime.
  - This is also called the "class of nines".
  - [Table: availability classes by number of nines, with the corresponding downtime and typical usage (Source: IDC)]
* A highly available system is normally quantified by an availability of 99.999% (five 9's) or more.
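The "class of nines" can be turned into a yearly downtime budget with simple arithmetic. The sketch below is illustrative only: the function names are made up for this example, and it assumes a 365-day year to match the "24 x 7 x 365" definition above.

```python
# Illustrative sketch: convert a "class of nines" into allowed downtime per year.
# Function names are hypothetical; a 365-day (non-leap) year is assumed.

SECONDS_PER_YEAR = 365 * 24 * 60 * 60


def availability_percent(nines: int) -> float:
    """Availability for a given number of nines, e.g. 3 -> 99.9 (percent)."""
    return 100 - 100 / 10 ** nines


def downtime_seconds_per_year(nines: int) -> float:
    """Maximum downtime per year, in seconds, for a given number of nines."""
    return SECONDS_PER_YEAR / 10 ** nines


if __name__ == "__main__":
    for nines in range(1, 7):
        minutes = downtime_seconds_per_year(nines) / 60
        print(f"{nines} nines ({availability_percent(nines):.4f}%): "
              f"{minutes:.2f} minutes of downtime per year")
```

Each additional nine divides the permitted downtime by ten, which is why moving from three to five nines is a design decision rather than a small tuning step.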
Terminology
* System: Represents a set of components such as CPU, disk and memory. Here a system generally means a server, or a group of servers, serving one application. Systems are also referred to as "nodes".
* Uptime: A measurement of the time for which the system is up and running.
* Availability: The proportion of time for which the system is production capable.
  Note: Uptime and availability are not really synonymous. Consider the failure of a network that connects the users: the system is up, but production is unavailable.
* High Availability: A system implementation that ensures a certain absolute degree of operational continuity for a given period.
* Redundancy: Redundant or duplicate components in the system that take care of failures. The service is moved to the duplicate component, so there is no single point of failure.
* Fault Tolerance: The ability of a system to continue in service when a component (software or hardware) fails. Fault tolerance, or reliability, is attained using multiple redundant components such as CPU, memory and disks.
* RPO: Recovery point objective. The amount of data loss that is deemed acceptable in the event of a disaster or failover.
* RTO: Recovery time objective. The amount of time it takes from the initial disaster declaration to having critical business processes available to users.
* Private interconnect / Heart Beat: Private or dedicated connectivity between nodes in a cluster, used for inter-node communication and for probing alive status.
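The heartbeat idea in the last definition can be sketched as a small in-memory monitor. This is a simplified illustration, not how any particular cluster product works: the class name, the timeout value and the node names are assumptions, and timestamps are passed in explicitly so the behaviour is easy to follow. Real cluster software sends heartbeats over the private interconnect and combines them with quorum logic.

```python
# Illustrative sketch of heartbeat-based liveness detection.
# HeartbeatMonitor, the timeout and the node names are hypothetical.
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class HeartbeatMonitor:
    timeout_seconds: float
    last_seen: Dict[str, float] = field(default_factory=dict)

    def record_heartbeat(self, node: str, now: float) -> None:
        """Remember the time at which a heartbeat arrived from a node."""
        self.last_seen[node] = now

    def failed_nodes(self, now: float) -> List[str]:
        """Return nodes whose last heartbeat is older than the timeout."""
        return sorted(
            node for node, seen in self.last_seen.items()
            if now - seen > self.timeout_seconds
        )


if __name__ == "__main__":
    monitor = HeartbeatMonitor(timeout_seconds=3.0)
    monitor.record_heartbeat("node1", now=0.0)
    monitor.record_heartbeat("node2", now=0.0)
    monitor.record_heartbeat("node1", now=2.0)  # node2 stays silent
    print(monitor.failed_nodes(now=4.0))
```

In this run only node2 has been silent for longer than the timeout, so it is the only node reported as failed. A cluster would treat that as the trigger to move its services to a surviving node.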
* Introducing a highly reliable or costly technology does not mean that the system is always available.
* Availability has many dependencies, and it is not completely under the control of the hardware components.
* The figure (Source: IDC) represents a study result that classifies the failures in a system.
* Failures are caused not just by hardware or software faults but also by many external components.

Availability levels and High Availability
* Based on the approach taken to restore an application to its users, there are defined levels (Source: IDC). These levels define the availability standards.
* [Table: availability levels (Source: IDC)]
* Level #3 is rated at five 9's availability and requires no or minimal user interruption. The workload or application is automatically moved to failover or backup systems.

How HA can be achieved
High Availability can be achieved in multiple ways. The traditional method is to reduce the chance of unavailability by using fault tolerant systems. The general approaches are:
* Single system, fault tolerant
* HA clustering

Single system, fault tolerant
Fault tolerance is the typical approach followed in a system to minimize failures or outages. It indicates the system's ability to remain online and in service even when some components have active failures. This is generally achieved with redundant components.
The key areas of fault tolerance include:
1. Redundant components
   a. Power supply
   b. CPU and memory
   c. Disk
   d. I/O cards: Fibre Channel, network cards
2. Monitoring, alerting and fault isolation
3. Hot swap or hot plugging
* Fault tolerant systems can be simple from an infrastructure and management perspective; however, the additional configuration brings complexity and higher cost. Even with this fault tolerance, a single system still faces challenges with planned downtime.

High Availability Cluster
* High availability is closely associated with fault tolerance. A fully fault tolerant system has no service interruptions, but at a very high cost, whereas high availability provides minimal service interruption using multiple systems or nodes at a lower cost.
* High availability can be introduced at any of the layers of the system, based on criticality. The commonly used approach is at the operating system layer.
* The figure shows an HA configuration with two nodes in two scenarios. In case #1, the normal condition, Node #1 is active and Node #2 is on standby.

* A cluster is two or more systems (either operating system or application) that form a shared pool of resources and provide backup in the event of an outage.
* Clustering combines hardware and software features to restore services quickly, at a lower cost than fault tolerant systems.
* Clustering provides excellent monitoring of resources, proactive alerting, and suitable actions according to the state of the hardware and software.
Major components:
* Cluster-aware application
* Servers / nodes
* Shared storage
* Heart Beat connections and quorum disks
* Redundant network connectivity

* Active/active: Traffic is shared across the nodes, all of which serve the users. When any node fails, its traffic is moved to the remaining nodes.
* Active/passive: One of the nodes is a passive, redundant node that comes online when there are issues with the active node. This configuration typically requires the most extra hardware.
* N+1: A single extra node is brought online to take over the role of the node that has failed. With a heterogeneous software configuration on each primary node, the extra node must be universally capable of assuming any of the roles of the primary nodes it is responsible for. This normally refers to clusters that have multiple services running simultaneously.
* N-to-1: The failover standby node becomes the active one temporarily, until the primary one is restored. Failback is initiated when the primary is back online.

HA clustering advantages
* Planned outages: reduced downtime when scaling or upgrading the system.
* Load balancing: improved performance for applications that support an active/active configuration, by spreading the load across multiple nodes.

* High availability offers a variety of options for data and application resiliency. The key is to understand the business requirements, so that the right solution is put in place to achieve the goals:
  - Budget and outage revenue impact
  - Uptime requirements and outage coverage
  - RPO and RTO
  - Replication types and distance between sites
  - Number of nodes or systems, and performance requirements

* The network layer covers the networking areas of a system, including the TCP/IP network, the FC network and any other modes of connectivity.
* Redundant components are the key factor in network availability.
* [Figure: redundant network connectivity to SAN storage and tape library]

* Storage is one of the major components and holds all critical data, so it is always built with high availability.
* Storage nodes are generally referred to as controllers, and they are redundant, like many other components in a storage system.
* RAID (Redundant Array of Independent/Inexpensive Disks) protects data against disk failures.
* [Figure: storage system and usable capacity]

RAID 0 - Blocks striped. No mirror. No parity.
* Minimum 2 disks.
* Excellent performance (as blocks are striped).
* No redundancy (no mirror, no parity).
* Do not use this for any critical system.

RAID 1 - Blocks mirrored. No stripe. No parity.
* Minimum 2 disks.
* Good performance (no striping, no parity).
* Excellent redundancy (as blocks are mirrored).

RAID 01 - Blocks striped (and blocks mirrored)
[Figure: Group 1 and Group 2]
* RAID 01 is also called RAID 0+1.
* It is also called a "mirror of stripes".
* It requires a minimum of 3 disks, but in most cases it is implemented with a minimum of 4 disks.
* To understand this better, create two groups. For example, with a total of 6 disks, create two groups of 3 disks each, so that Group 1 has 3 disks and Group 2 has 3 disks.
* Within a group, the data is striped. In Group 1, which contains three disks, the 1st block is written to the 1st disk, the 2nd block to the 2nd disk, and the 3rd block to the 3rd disk. So block A is written to Disk 1, block B to Disk 2 and block C to Disk 3.
* Across the groups, the data is mirrored. Group 1 and Group 2 look exactly the same: Disk 1 is mirrored to Disk 4, Disk 2 to Disk 5, and Disk 3 to Disk 6.
* This is why it is called a "mirror of stripes": the disks within the groups are striped, but the groups are mirrored.

RAID 10 - Blocks mirrored (and blocks striped)
[Figure: Group 1, Group 2 and Group 3]
* Minimum 4 disks.
* This is also called a "stripe of mirrors".
* Excellent redundancy (as blocks are mirrored).
* Excellent performance (as blocks are striped).
* If you can afford the cost, this is the best option for any mission-critical application (especially databases).

RAID 10 compared with RAID 01
* Performance on RAID 10 and RAID 01 is the same.
* The storage capacity is the same.
* The main difference is the fault tolerance level. On most RAID controller implementations, RAID 01 is less fault tolerant. RAID 01 has only two RAID 0 groups, so if two drives (one in each group) fail, the entire RAID 01 fails. In the RAID 01 example above, if Disk 1 and Disk 4 fail, both groups are down, so the whole RAID 01 fails.
* RAID 10 is more fault tolerant. RAID 10 has many groups (each group is only two disks), so even if three disks fail (one in each group), the RAID 10 is still functional. In the RAID 10 example above, even if Disk 1, Disk 3 and Disk 5 fail, the RAID 10 is still functional.
* So, given a choice between RAID 10 and RAID 01, always choose RAID 10.

RAID 2 - Bits striped (and ECC disks)
[Figure: data disks and ECC disks]
* This uses bit-level striping, i.e. instead of striping the blocks across the disks, it stripes the bits across the disks.
* In the diagram, b1, b2, b3 are bits; E1, E2, E3 are error correction codes.
* You need two groups of disks. One group is used to write the data, and the other group is used to write the error correction codes.
* This uses the Hamming error correction code (ECC), and stores this information on the redundancy disks.
* When data is written to the disks, the ECC code for the data is calculated on the fly; the data bits are striped to the data disks and the ECC code is written to the redundancy disks.
* When data is read from the disks, the corresponding ECC code is also read from the redundancy disks and used to check whether the data is consistent. If required, appropriate corrections are made on the fly.
* This uses a lot of disks and can be configured in different disk configurations. Some valid configurations are: 1) 10 disks for data and 4 disks for ECC; 2) 4 disks for data and 3 disks for ECC.
* This is not used anymore. It is expensive, implementing it in a RAID controller is complex, and ECC at this level is redundant nowadays because hard disks can do this themselves.

RAID 3 - Bytes striped (and dedicated parity disk)
[Figure: data disks and a parity disk, Disk 1 to Disk 4]
* This uses byte-level striping, i.e. instead of striping the blocks across the disks, it stripes the bytes across the disks.
* In the diagram, B1, B2, B3 are bytes; p1, p2, p3 are parities.
* Uses multiple data disks, and a dedicated disk to store parity.
* The disks have to spin in sync to get to the data.
* Sequential reads and writes have good performance.
* Random reads and writes have the worst performance.
* This is not commonly used.

RAID 4 - Blocks striped (and dedicated parity disk)
[Figure: data disks and a parity disk]
* This uses block-level striping.
* In the diagram, B1, B2, B3 are blocks; p1, p2, p3 are parities.
* Uses multiple data disks, and a dedicated disk to store parity.
* Minimum of 3 disks (2 disks for data and 1 for parity).
* Good random reads, as the data blocks are striped.
* Bad random writes, as every write also has to write to the single parity disk.
* It is somewhat similar to RAID 3 and RAID 5, but a little different.
* It is like RAID 3 in having a dedicated parity disk, but it stripes blocks.
* It is like RAID 5 in striping the blocks across the data disks, but it has only one parity disk.
* This is not commonly used.

RAID 5 - Blocks striped. Distributed parity.
* Minimum 3 disks.
* Good performance (as blocks are striped).
* Good redundancy (distributed parity).
* The most cost-effective option providing both performance and redundancy. Use this for databases that are heavily read oriented. Write operations will be slow.

RAID 6 - Blocks striped. Dual parity.
* Just like RAID 5, this does block-level striping. However, it uses dual parity.
* Can handle two disk failures.
* In the diagram, A, B, C are blocks; p1, p2, p3 are parities.
* This creates two parity blocks for each stripe of data blocks.
* This RAID configuration is complex to implement in a RAID controller, as it has to calculate two parity values for each stripe.

[Figures 1 to 3: storage connectivity options for HA clusters]
* Storage plays an important role in any HA clustering solution.
* There are many ways to ensure data availability, using the connectivity methods between the shared storage and the nodes of a cluster.
* Figures #1 and #2 represent HA systems using inter-site storage links and replication, while Figure #3 represents storage at two different sites under a H/W or S/W storage virtualization layer.

* Virtualization technologies enable consolidation of multiple applications onto one or more physical servers.
* Virtualization is a way of creating an abstraction layer that can simulate the functionality of a hardware resource.
  Thus, compute virtualization simply means simulating computer hardware, or providing a virtual machine.
* A few virtualization product examples are:
  - [Figure: vendor product logos]
  - Sun Solaris Zones / LDOMs

* HA continuously monitors all virtualized servers in a resource pool and detects physical server and operating system failures.
* To monitor physical servers, an agent on each server maintains a heartbeat with the other servers in the resource pool, so that a loss of heartbeat automatically initiates the restart of all affected virtual machines on other servers in the resource pool.

* High availability across the hypervisors of resource pools is also possible in virtual environments using specific HA features, which help VMs migrate transparently across different virtualization hosts.
* Migration is performed either automatically or with manual intervention.
* HA features keep monitoring the resource utilization of these virtual machines and, based on the results, generate a plan and place the VMs on any of the available hypervisor hosts for the migration.

1. Define availability.
2. Define MTBF and MTTR, and express availability in these terms.
3. What is meant by the private interconnect or Heart Beat of an HA cluster?
4. List the different components that affect availability.
5. What are the different availability levels?
6. What are the different areas of a fault tolerant system?
7. Differentiate fault tolerant and HA cluster solutions.
8. What is the importance of monitoring and alerting in any HA or fault tolerant solution?
9. Define HA clustering.
10. List the various components of an HA cluster.
11. List the advantages of an HA cluster solution.
12. Define quorum disks and their importance.
13. List the different modes of HA clustering based on node service status.
14. List the major factors in HA selection criteria.
15. Explain the HA internals of data storage.
16. Differentiate RAID 1 and RAID 5.
17. Differentiate RAID 5 and RAID 6.
18. Explain virtualization.
19. List examples of virtualization products.
20. Explain the HA features available at the virtualization layer.
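As a starting point for the question on MTBF and MTTR, the commonly used relationship is availability = MTBF / (MTBF + MTTR), where MTBF is the mean time between failures and MTTR is the mean time to repair. The sketch below only illustrates that arithmetic; the function name and the sample figures are assumptions and do not come from the unit.

```python
# Illustrative sketch: steady-state availability from MTBF and MTTR.
# The function name and the example numbers are hypothetical.


def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Fraction of time the system is available, given MTBF and MTTR in hours."""
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTBF must be positive and MTTR must not be negative")
    return mtbf_hours / (mtbf_hours + mttr_hours)


if __name__ == "__main__":
    # A system that runs 999 hours between failures and takes 1 hour to repair.
    print(f"{availability(999, 1) * 100:.3f}% available")
```

The formula shows two ways to add nines: make failures rarer (raise MTBF, for example with redundant components) or make recovery faster (lower MTTR, for example with automatic failover in a cluster).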
