Analyzing Reliability in The Data Center Outline

Analyzing Reliability in the Data Center Outline
Course Description:
The growing reliance on information systems that operate 24 hours per day, 7 days per week, has
spawned a rapidly growing and developing industry that supplies products and services on-
demand. The need for these types of information services now reaches into every business office
in the world. Unfortunately, events of all kinds including hardware failure, human error,
environmental changes, structural failure and external events, can lead to the possibility of
unanticipated systems downtime.
Modern data centers do not tolerate planned downtime and strive for no outages in a 10-year
mission. Data center operations staffs are faced with the dilemma of either downtime as a result of
insufficient physical infrastructure, or incurring extensive costs by designing in more redundancies
than is necessary. Targeted reliability solutions allow businesses to meet individual requirements of
the data center, while minimizing the total cost of ownership.
In fact, very high reliability is difficult to attain and redundant hardware is only part of the answer.
This course will demonstrate some important performance success factors and overviews best
practices for analyzing and optimizing reliability.
Course Outline:
Learning Objectives
At the completion of this course, you will be able to:
 Define key terms associated with analyzing reliability risks
 Identify some common cause failures in the data center
 Describe the benefits of conducting a Probabilistic Risk Assessment (PRA)
 Recognize the reliability advantages of utilizing scalable, modular architecture in the data center
Agenda
 Introduction
 Analyzing risk
 Redundancy
 Common cause failures
 Probabilistic Risk Assessment (PRA)
 Case study example
 Summary
©2013 Schneider Electric. All rights reserved. All trademarks provided are the property of their respective owners.
Course Content or Material:
1) Introduction
a) Information systems need to operate 24 / 7
b) Growing and developing industry supplies products and services on-demand
c) Modern data centers do not tolerate planned downtime and strive for no outages in a 10-
year mission
d) Data center operations staffs are faced with
e) The dilemma of downtime as a result of insufficient physical infrastructure
f) Incurring extensive costs by factoring in unnecessary redundancies
g) Targeted reliability solutions allow businesses to meet individual requirements of the data
center, while minimizing the total cost of ownership
2) Reliability
a) Understanding how to best define downtime risk
i) Is important to optimizing its reliability
ii) Decreases total cost of ownership
iii) Increases agility
b) Reliability metrics statistically analyze the likelihood of a failure occurring
3) Redundancy
a) While redundancy can increase reliability, there are significant costs and potentially
serious drawbacks
b) A redundant system has more components
c) In general, systems with more components will experience more failures
4) Discussing Best Practices

a) The design, manufacture, operation, maintenance and repair of equipment
b) The gathering of data, and the review and publication of component benchmarking results
c) Consistent deployment of the language of reliability, both definitions and assumptions
d) A philosophy addressing the constant pursuit of root causes, common cause failures and
relevant data
5) Modularity and Component Count

a) Reliability can be increased through standardization
b) Modularity is a powerful concept
c) Modularizing a system can increase the number of internal components
d) Reliability analysis of modular systems must consider
i) Component design
ii) Function
iii) Dependencies
6) Benefits and Drawbacks
a) Data center testing and maintenance practices often have a significant impact on systems
reliability
b) Testing and diagnosis can improve reliability, but may also degrade it
7) UPS: Historical Perspective

a) In most UPSs, utility AC power is rectified to DC
b) The inverter synthesizes an AC voltage free from the effects of spikes, sags, harmonics,
and brief utility outages
8) Assessing Reliability
a) Product support engineers
i) Track the products' performance in actual use
ii) Identify and implement changes necessary to correct deficiencies or defects
iii) Benefit from a road map identifying components most likely to fail
b) Deviations from the predictions of the road map would identify new areas for more
intensive investigation and possible remedial action
9) The Correct Course of Action: PRA

a) The process of building the logical model results in a comprehensive review
b) The mathematical nature of the calculation limits the logical fallacies that tend to dominate
qualitative evaluation of reliability
c) The implication is that if N components are required for success, there is one, two, twice
as many, or even twice plus one as many units available
d) Not all redundancy makes the same contribution to reliability
10) Reliability Assessment Case Study

a) The mathematical models that resulted from the analysis were used to answer some key
questions
b) The scalable, modular system utilizes redundancy in nearly all components as a means of
achieving high reliability
c) MTechnology, Inc's analysis showed that
d) There are both costs and benefits to redundancy
e) Some sub-systems benefit less from redundancy than others
f) Complex mathematical formulas were utilized to calculate the case study failure rates and
common cause failures
11) Case study goals

a) To identify potential sources of failure
b) To evaluate the potential for further improvement in the scalable, modular power system’s
reliability and availability
12) Target of Case Study Analysis

a) Subjects of analysis
i) 14 - 40 kW Scalable, modular, rack based power system with PDU and static bypass
ii) 500 kW central UPS
b) Tools utilized
i) Probabilistic Risk Assessment (PRA)
ii) Fault tree
iii) Event tree analysis
iv) Bayesian updating
13) Reliability Assessment Case Study

a) All actions have both beneficial and negative affects on reliability
b) It helps to support the uptime of the servers but also can represent a point of failure
14) Comparing Modularity to the Central UPS

a) The scalable, modular system loses power to all loads only when
i) The main entrance bus fails
ii) The transfer switch fails to open
b) The probability of all 14 scalable, modular units failing simultaneously due to internal
failures is extremely low
c) PDU failure will cause partial load drop
d) Only one circuit breaker after the transfer switch will cause all critical loads to fail
15) Central UPS data

a) Battery failure is a significant contributor of failure in central UPS
b) The central UPS can fail internally, and bypass can fail, causing all loads to fail
c) PDU failure will cause partial load drop
16) Reliability Assessment Case Study: The Findings

a) The calculated reliability of the scalable, modular power system is comparable to data
published by vendors of large, central UPSs
b) The scalable, modular power system is significantly less likely to suffer a complete system
failure
c) The redundancy provided in the scalable, modular power system definitely improves the
product's reliability
17) Conclusions
a) Overall, the scalable, modular architecture had a system failure rate was approximately
40% lower than that of the central UPS system
b) Failure is defined as the loss of power for all critical loads
c) Discounting battery failures, the scalable, modular failure rate is still approximately 18%
less than that of a comparable central UPS architecture
d) If failure is defined to include dropping of any single load due to a branch circuit failure, but
not UPS failure, the scalable, modular architecture is 6% less likely to fail
e) Scalable, modular power system architecture proved more reliable than the single module
UPS with a single battery string
f) The redundant subsystems within the scalable, modular power system successfully
reduced the probability of UPS failure
g) The performance of the ATS is often the limiting factor in achieving higher reliability
18) Summary
a) Understanding how to best define downtime risk is important to optimizing its reliability,
while decreasing TCO and increasing agility
b) While redundancy can in principle increase reliability, there are significant costs and
potentially serious drawbacks
c) Data center professionals need to understand which processes are most critical, and
target reliability accordingly
d) PRA is a powerful tool when applied carefully
Course Assessment: Test Your Knowledge
Course Feedback: We Value Your Opinion

Analyzing Reliability in The Data Center Outline

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Analyzing Reliability in The Data Center Outline

Uploaded by

Copyright:

Available Formats

Analyzing Reliability in the Data Center Outline

4) Discussing Best Practices

5) Modularity and Component Count

7) UPS: Historical Perspective

9) The Correct Course of Action: PRA

10) Reliability Assessment Case Study

11) Case study goals

12) Target of Case Study Analysis

13) Reliability Assessment Case Study

14) Comparing Modularity to the Central UPS

15) Central UPS data

16) Reliability Assessment Case Study: The Findings

Course Assessment: Test Your Knowledge

Course Feedback: We Value Your Opinion

You might also like