High Availability in Storage Systems
Introduction
In today's connected world, information and communication have become fundamental to every sphere of life. For individuals and businesses alike, data has become the lifeblood of daily existence. The large-scale panic caused by Twitter blackouts is proof of this. For businesses, even brief downtime can result in substantial losses. Long-term downtime resulting from human or natural disasters can cripple a business and bring it to its knees. According to Dun & Bradstreet, 59% of Fortune 500 companies experience a minimum of 1.6 hours of downtime per week, which translates into $46 million per year. According to Network Computing, the Meta Group and Contingency Planning Research, the typical hourly cost of downtime varies from roughly $90,000 for media firms to about $6.5 million for brokerage services. The financial impact of downtime therefore varies widely with the nature and size of the business, and often it cannot be predicted accurately. While some impacts of downtime are obvious, such as lost revenue and productivity, there can be several intangible impacts, such as damage to brand image, that have less obvious and far-reaching effects on the business.
Availability is the ratio of mean time between failures (MTBF) to the sum of MTBF and mean time to repair (MTTR). Availability therefore indicates the percentage of time the system is available throughout its useful life. As mentioned earlier, one of the primary goals of disaster recovery (DR) strategies is to minimize the recovery time objective (RTO), that is, the downtime. Since MTTR is a measure of downtime and must meet the RTO, a comprehensive disaster recovery strategy must also include strategies to increase availability. Thus, while DR strategies are not, strictly speaking, availability strategies, they do meet availability requirements to an extent.
[Figure: timeline of alternating uptime and downtime periods]
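The MTBF/MTTR ratio described above can be illustrated with a minimal Python sketch. The function name and the sample figures (10,000 hours between failures, 10 hours to repair) are assumptions made for illustration only and do not come from any measured system.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Return availability as a fraction: MTBF / (MTBF + MTTR)."""
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTBF must be positive and MTTR must be non-negative")
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Hypothetical system: fails on average every 10,000 hours, takes 10 hours to repair.
a = availability(mtbf_hours=10_000, mttr_hours=10)
print(f"Availability: {a:.4%}")

# Halving the repair time raises availability without touching the failure rate.
b = availability(mtbf_hours=10_000, mttr_hours=5)
print(f"Availability with faster repair: {b:.4%}")
```

The second call shows why reducing MTTR, which is what an RTO-driven recovery strategy aims at, also improves availability.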
Classes of Availability
Availability is often expressed as a percentage. An availability of about 90 to 95% is sufficient for many applications. However, for extremely critical business data, such levels of availability are simply not enough. As mentioned before, for brokerage services and businesses offering online services, downtime of more than a few minutes a year could have a significant impact on operations. For example, 99.9% availability means about 9 hours of downtime per year. The financial and other impacts of such downtime could spell trouble for the business. Truly highly available solutions typically have an availability of 99.999% (five nines) or 99.9999% (six nines). Such solutions have downtime ranging from roughly half a minute (six nines) to about five minutes (five nines) per year. There are different classes of data protection mechanisms based on the availability they provide. Figure 2 shows a pyramid of various data protection strategies. As one goes up the hierarchy, the downtime decreases and hence the availability increases. The top two levels of the pyramid constitute strategies that represent true high availability (five nines and six nines).
Figure 2: Classes of Data Protection
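The relationship between an availability percentage and the yearly downtime it allows is simple arithmetic, sketched below in Python. The function name is hypothetical, and the calculation assumes a 365-day year of 8,760 hours.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours, ignoring leap years


def annual_downtime_hours(availability_percent: float) -> float:
    """Return the downtime per year, in hours, implied by an availability percentage."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be between 0 and 100 percent")
    return (1 - availability_percent / 100) * HOURS_PER_YEAR


# Print the yearly downtime for a range of availability classes.
for pct in (90.0, 95.0, 99.0, 99.9, 99.99, 99.999, 99.9999):
    hours = annual_downtime_hours(pct)
    print(f"{pct}% -> {hours:.3f} hours ({hours * 60:.2f} minutes) per year")
```

For 99.9% this gives 8.76 hours per year, which matches the figure of about 9 hours quoted above.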
[Figure: Controller A and Controller B connected to an IP network]
To carry out this clustered RAID communication and maintain cache coherency, the two controllers need a set of (preferably dedicated) communication channels. A combination of more than one communication channel, such as a SAS fabric and Ethernet connections, could be employed here to minimize the performance impact and to provide redundancy in this communication layer as well. As with all dual redundant intelligent clusters, the loss of inter-node communication could result in the two controllers losing cache coherency. Further, once the communication is lost, each controller could try to take over the operation of its peer controller, resulting in a split brain scenario. To handle this split brain scenario, the two controllers also need to maintain a quorum using dedicated areas of the shared drive array to avoid conflicts and data corruption. The key advantage of such a dual controller setup is that it is almost fully redundant, with hot-swappable components. However, although the controllers are redundant, the mid-plane connecting the controllers to the drive back-plane is still shared, making it a single point of failure.
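The quorum idea can be sketched in Python as follows. This is a simplified model under stated assumptions, not the actual controller firmware: the `QuorumRegion` class, the `on_peer_link_lost` function and the controller names are all hypothetical, and a thread lock stands in for whatever atomic update the shared drive array really provides.

```python
import threading
from typing import Optional


class QuorumRegion:
    """Stands in for a dedicated quorum area on the shared drive array (hypothetical)."""

    def __init__(self) -> None:
        self._lock = threading.Lock()  # models an atomic update of the on-disk area
        self._owner: Optional[str] = None

    def try_claim(self, controller_id: str) -> bool:
        """Record controller_id as owner if the region is unclaimed; report if it is the owner."""
        with self._lock:
            if self._owner is None:
                self._owner = controller_id
            return self._owner == controller_id

    def owner(self) -> Optional[str]:
        with self._lock:
            return self._owner


def on_peer_link_lost(controller_id: str, quorum: QuorumRegion) -> str:
    """Decide what a controller does when it can no longer reach its peer."""
    if quorum.try_claim(controller_id):
        return "take_over"   # this controller won the quorum and serves all I/O
    return "stand_down"      # the peer won; stop serving I/O to avoid corruption


quorum = QuorumRegion()
print("Controller A:", on_peer_link_lost("A", quorum))
print("Controller B:", on_peer_link_lost("B", quorum))
```

Because both controllers see the same drives, only one claim can succeed, so at most one controller takes over even when neither can reach the other.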
[Figure: IP-SAN with a network switch]
However, SBB-based dual controller units have a lower disk count, making them more power efficient and a greener solution with a smaller data center footprint. HA Clusters also encounter the split brain syndrome associated with dual controller nodes. However, unlike dual controller nodes, this problem cannot be addressed using a quorum disk, as the two units do not share the drive array. One way to address this problem is to have a client-side device specific module (DSM) that performs the quorum action on a split brain. The DSM sits in the IO path and decides which path each IO is sent on. In addition, it keeps track of the status of the HA Cluster, including whether the two nodes are synchronized, and permits a failover from one system to the other only when the two nodes are completely synchronized. The drawback of a client-side DSM is that the HA Cluster becomes dependent on the client. Also, if the clients are themselves clustered, each client needs a distributed DSM that communicates with the others.

[Figure: two iTX Storage nodes]

A client-agnostic HA Cluster can be created if we understand the reasons for a split brain scenario in an HA Cluster. Typically, a split brain scenario that causes data corruption occurs when the network path between the storage nodes has failed, severing the communication between the two nodes, while the client itself can still access both nodes. In this scenario, both storage nodes will try to take over ownership of the cluster, and unless the client has some way of knowing the right owner, IOs could be sent to the wrong storage node, causing data corruption. Thus, the cause of a split brain scenario is an HA Cluster in which the storage nodes have lost contact with each other while the connections from the client to both storage nodes are alive. As we can see in Figure 6, such a setup is not a true high availability solution, as it does not provide path failover capability. It follows that true HA setups are not prone to the split brain syndrome. Figure 7 shows one network configuration that supports client-agnostic HA Cluster configurations.

Figure 7: Client agnostic HA Configuration
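The path decision made by a client-side DSM, as described above, can be sketched in Python. This is a minimal illustration and not the API of any real multipath driver: `NodeStatus`, `choose_path` and the node names are hypothetical, and the synchronization flag is assumed to be tracked elsewhere.

```python
from dataclasses import dataclass


@dataclass
class NodeStatus:
    name: str
    reachable: bool  # can the client currently reach this storage node?


def choose_path(active: NodeStatus, standby: NodeStatus, nodes_in_sync: bool) -> str:
    """Return the name of the node that should receive IO.

    Failover to the standby node is permitted only when the two nodes
    were completely synchronized; otherwise no safe path exists.
    """
    if active.reachable:
        return active.name
    if standby.reachable and nodes_in_sync:
        return standby.name
    raise RuntimeError("no safe path: standby is unreachable or not synchronized")


# Normal operation: IO goes to the active node.
print(choose_path(NodeStatus("node1", True), NodeStatus("node2", True), nodes_in_sync=True))

# Active node lost while the pair was in sync: failover is allowed.
print(choose_path(NodeStatus("node1", False), NodeStatus("node2", True), nodes_in_sync=True))
```

Refusing to fail over when the nodes are not synchronized trades availability for data integrity, which is the behavior the text attributes to the DSM.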
Summary
Thus, true storage high availability can be ensured only when redundancy is built into every component of a storage sub-system. Dual redundant controller setups and HA Cluster setups are two approaches that deliver best-in-class availability. Each has its own advantages and drawbacks, but a combination of these approaches, together with application server and network path redundancy, delivers a truly highly available data center setup.
For more information about High Availability in Storage Systems, visit http://www.amiindia.co.in or write to sales@amiindia.co.in