Data Center Standards

Data Center Networks & Cloud Computing Security


Lecture 3–4 Pavel Moravec
Service Level Agreement

Building a Data Center is just a Start …
What is a Service Level Agreement (SLA)?
An official commitment between the service provider
and a client
Can be a formal, legally binding contract or an informal one
Originally used by fixed-line telco operators since the 1980s
Commonly includes several components, from a definition of the services
to the termination of the agreement
Definition of type of service to be provided
The service's desired performance level (+ reliability and responsiveness)
Monitoring process and service level reporting
Steps for reporting issues with the service
Response and issue resolution time-frame
Repercussions for the service provider not meeting its commitments, especially financial ones
Where and how does an SLA apply?
Where can we find SLAs?
Backbone Internet Providers
Web services
e.g. the availability of a REST API to customers
Data Centers (shared, on-premise, and outsourced)
SLAs for shared cloud computing resources
Example SLA (one of the Czech/Italian providers; a downtime conversion sketch follows below)
100% uptime for power and cooling
99.95% Internet connectivity
99.95% physical node availability for virtual infrastructure servers
99.8% access to provided physical nodes
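To make these availability percentages concrete, here is a minimal Python sketch (my own illustration, assuming the ~8766-hour year used later in the BICSI Annex B slide) that converts an SLA uptime figure into the downtime it permits:

```python
# Minimal sketch: convert an SLA availability percentage into the
# maximum downtime it permits per year and per month.
# The percentages are the example SLA values from the slide above.

HOURS_PER_YEAR = 8766  # 365.25 days * 24 h

def allowed_downtime_hours(availability_percent: float) -> float:
    """Annual downtime (hours) permitted by a given SLA percentage."""
    return HOURS_PER_YEAR * (1 - availability_percent / 100)

for sla in (100.0, 99.95, 99.8):
    yearly_h = allowed_downtime_hours(sla)
    monthly_min = yearly_h * 60 / 12
    print(f"{sla}% uptime -> {yearly_h:.2f} h/year ({monthly_min:.1f} min/month)")

# 99.95% allows ~4.4 h/year (~21.9 min/month); 99.8% allows ~17.5 h/year.
```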
What does an SLA not cover?
“Higher power” aka “act of God” aka “Force Majeure”
wars, terrorism, strikes, traffic accidents,
sometimes also natural disasters (see previous lecture)
Extraordinary interventions to be carried out urgently
to avoid hazards to safety/stability/confidentiality/integrity
typically announced to customers in advance (e.g. 48 h before execution when possible, or ASAP)
Unavailability or blocking of the infrastructure due to
Customer actions (shutdown of servers, abuse, misconfiguration)
3rd party OS or applications used
non-fulfillment or breach of contract by the customer
Internet or connectivity problems caused by customer or 3rd parties
Planned maintenance (in normal amounts)
Data Center Standards

A Data Center must follow some …
Guidelines/Best practices
ANSI/BICSI 002, Data Center Design and Implementation Best Practices
(USA → International)

Standards
TIA 942 (USA)
ISO/IEC 24764 → ISO/IEC 11801-5 (Worldwide)
EN 50600 series (work in progress) + EN 50173-5 (EU)

Certification requirements
Uptime Institute Tier certification (Worldwide)
Building Industry Consulting Service International 002
DC Design and Implementation Best Practices (1)
Site selection – hazards, environments, access, regulations
Space planning – capacity, power, cooling, supporting spaces, IT equipment placement, network
Architectural – design concepts, access paths, planning details, construction components
Structural – general, specifics
Mechanical – classes, cooling conditions, thermal, mechanical, …
Electrical systems – utility service, distribution, mechanical, UPS, standby and emergency, automation & control, lighting, protection, …
Fire Protection – walls, floors, ceilings, aisle containment, extinguishers, protection, detection, …
DC Management and Building Systems – building automation systems, electronic safety and security systems
Building Industry Consulting Service International 002
DC Design and Implementation Best Practices (2)
Security – physical security plan, risks & threats, regulatory & insurance, DC security plan, crime prevention, access control, alarms, barriers, lighting, surveillance, guards, disaster recovery, building site considerations, building shell, DC security
Telecommunications, Cabling, Infrastructure, Pathways, Spaces – C0–C4 cabling classes, topologies, spaces, pathways, access providers, backbone & horizontal cabling, installation, testing, racks

Building Industry Consulting Service International 002
DC Design and Implementation Best Practices (3)
Information Technology – disaster recovery, computer room layout, communication, operations center, network infrastructure reliability, security
Commissioning (+ testing)
Maintenance (of all systems)
Annexes (informative) – Design Process; Reliability & Availability; Alignment; Outsourcing; Multi-DC architecture; energy efficiency

BICSI 002 – Annex B – Operational Requirements
Operational Level | Annual Planned Downtime (*) | Description
0 | > 400 h | Operational less than 24 hours a day and less than 7 days a week. Scheduled maintenance “down” time available during working hours and off hours.
1 | 100–400 h | As above.
2 | 50–99 h | Operational up to 24 hours a day, up to 7 days a week, and up to 50 weeks per year. Scheduled maintenance “down” time as above.
3 | 0–49 h | Functions are operational 24 hours a day, 7 days a week, for 50 weeks or more. No scheduled maintenance “down” time is available during working hours.
4 | 0 h | Functions are operational 24 hours a day, 7 days a week, for 52 weeks each year. No scheduled maintenance “down” time is available.
(*) out of ~8766 hours per year
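As a small illustration, here is a sketch that maps an annual planned-downtime budget onto these operational levels; the band edges come straight from the table above:

```python
# Minimal sketch: classify an annual planned-downtime budget (hours)
# into the BICSI 002 Annex B operational levels tabulated above.

def operational_level(planned_downtime_h: float) -> int:
    """BICSI Annex B operational level for a planned downtime budget."""
    if planned_downtime_h > 400:
        return 0
    if planned_downtime_h >= 100:
        return 1
    if planned_downtime_h >= 50:
        return 2
    if planned_downtime_h > 0:
        return 3
    return 4  # no scheduled maintenance "down" time at all

print(operational_level(120))  # -> 1
print(operational_level(0))    # -> 4
```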
BICSI 002 – Annex B – Downtime Impact
Classification (Impact) | Description
Isolated (Sub-Local) | Local in scope, affecting only a single function or operation, resulting in a minor disruption or delay in achieving non-critical organizational objectives
Minor (Local) | Local in scope, affecting only a single site, or resulting in a minor disruption or delay in achieving key organizational objectives
Major (Regional) | Regional in scope, affecting a portion of the enterprise or resulting in a moderate disruption or delay in achieving key organizational objectives
Severe (Multiregional) | Multiregional in scope, affecting a major portion of the enterprise or resulting in a major disruption or delay in achieving key organizational objectives
Catastrophic (Enterprise) | Affecting the quality of service delivery across the entire enterprise, or resulting in a significant disruption or delay in achieving key organizational objectives
BICSI 002 – Annex B – Data Centre Class

Facility Availability Classes


F0/F1 – Single path (maps to Tier-1, Rated-1, Availability Class-1)
F2 – Single Path + redundant components (maps to T-2, R-2, AC-2)
F3 – Concurrently maintainable & operable (maps to T-3, R-3, AC-3)
F4 – Fault Tolerant (maps to T-4, R-4, AC-4)
Other classes:
Cable Plant: Cx, Network Infrastructure: Nx,
Data Processing and Storage: Sx, Applications: Ax
BICSI 002 – Annex B – Availability Requirements
Allowable Annual Downtime (minutes) | Allowable Availability (uptime “nines” – see next lecture)
> 5000 | < 99%
500–5000 | 99% to 99.9%
50–500 | 99.9% to 99.99%
5–50 | 99.99% to 99.999%
0.5–5 | 99.999% to 99.9999%
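The table is a one-line conversion; as a worked sketch (using the ~8766-hour year from the operational-requirements slide, so exact figures land slightly above the rounded decade bands shown):

```latex
% Annual downtime in minutes permitted by availability A (a fraction):
\[
  D_{\mathrm{min}} = (1 - A)\cdot 8766 \cdot 60,
  \qquad
  A = 99.99\% \;\Rightarrow\; D_{\mathrm{min}} = 0.0001 \cdot 525960 \approx 52.6\ \text{min}
\]
```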
TIA-942 – Telecommunications Infrastructure
Standard for Data Centers (1)
Specifications for DC telecommunications pathways & spaces
Recommendations on media & distance restrictions for structured
cabling system and applications over it (2005)
Telecommunication spaces and topologies
Cabling, pathways, redundancy, administration, access provider considerations, site selection
Components known from TIA-568
Informative annexes: design information, equipment plans, dataspace, tiers, examples, references
Addendum 1 (2008) – usage of 75 Ω coaxial cable
Addendum 2 (2010) – additional guidelines for DCs – lighting in 3 tiers,
recommendation from CAT-6/6A to CAT-6A only
(minimum required category is Cat-6)
TIA-942 – Telecommunications Infrastructure
Standard for Data Centers (2)
TIA-942-A (2012)
harmonization with TIA-568C
left some limitations to other standards (removed from here)
removed 100m limitation for optical fibers
multi-mode cable possible for horizontal & backbone cabling
use of LC & MPO connectors for optical fibers
Introduced Intermediate Distribution Area (IDA)
Zone Distribution Area (ZDA) can contain only passive components
energy efficiency recommendations, harmonized with IEC 24764
TIA-942-A Addendum 1 (2013) – mainly data center fabric topologies
examples, new switch topologies
Fat tree, full mesh, inter-connected meshes
Centralized switch, virtual switch
TIA-942 – Telecommunications Infrastructure
Standard for Data Centers (3)
TIA-942 Revision B (2017)
Added Cat-8 cabling, recommended cabling Cat-6A or higher
Maximum EDA cable length reduced from 10 m to 7 m
At least 1200 mm deep cabinets; considerations for cabinet widths of 24”+ (600 mm+)
Pre-terminated cabling
Labeling, cable routing, adding/removing cords, …
MPO-16 and MPO-32 connectors for 200G and 400G
Wideband multimode fiber (WBMMF) cable added
ANSI/TIA-568-C.4 coaxial cables and F connectors may be used
Normative references to other standards, including revised references to
temperature and humidity guidelines
Modifications for use outside of the US, optical cable quality requirements
TIA-942 – Ratings of Data Centres (1)
Rated-1: Basic Site Infrastructure
Single capacity components and a single, non-redundant distribution path
serving the computer equipment.
Limited protection against physical events
May not even have a raised floor
Susceptible to disruption from planned & unplanned activities
28.8 hours of annual downtime permissible
1 entrance pathway from access provider to facility, single pathway for all
cabling

TIA-942 – Ratings of Data Centres (2)
Rated-2: Redundant Capacity Component Site Infrastructure
Redundant capacity components and a single, non-redundant distribution
path serving the computer equipment.
Improved protection against physical events
Has a raised floor
Slightly less susceptible to disruptions
22.0 hours of annual downtime permissible
Requirements of Rated-1 must be observed; in addition:
2 entrance pathways from access provider to facility exist
Routers & switches have redundant power supplies & processors
Vulnerability of service entering building is addressed
N+1 redundant UPS modules, single generator is sufficient
TIA-942 – Ratings of Data Centres (3)
Rated-3: Concurrently Maintainable Site Infrastructure
Redundant capacity components and multiple independent distribution
paths serving the computer equipment (power, data, cooling). N+1 rule
for everything.
Typically, a single distribution path serves the computer equipment at any time.
Protection against most physical events
The site is concurrently maintainable – each and every capacity component, including elements that are part of the distribution path, can be removed/replaced/serviced on a planned basis without disrupting the ICT capabilities provided to the end user.
1.6 hours of annual downtime
TIA-942 – Ratings of Data Centres (3)
Rated-3: Concurrently Maintainable Site Infrastructure (contd.)
Requirements of Rated-2 must be observed; in addition:
requires at least 2 access providers + a secondary entrance room
backbone pathways have to be redundant
multiple routers and switches must be included for redundancy
Vulnerability of a single access provider is addressed

TIA-942 – Ratings of Data Centres (4)
Rated-4: Fault Tolerant Site Infrastructure
Redundant capacity components and multiple independent distribution
paths serving the computer equipment.
All redundant capacity components and independent distribution paths
are active at the same time. 2(N+1) for all components
Protection against almost all physical events.
The data center allows concurrent maintainability and one fault anywhere
in the installation without causing downtime.
All computer hardware must have dual power inputs
Can sustain at least one worst-case, unplanned failure or event with no
critical load impact
0.4 hours (24 minutes) of annual downtime
TIA-942 – Ratings of Data Centres (4)
Rated-4: Fault Tolerant Site Infrastructure (contd.)
Requirements of Rated-3 must be observed; in addition:
requires redundant backbone cabling, which should be in conduit or have
interlocking armor, optional secondary distribution area
optionally, horizontal cabling is also redundant
Addresses any vulnerability of the cabling infrastructure

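To make the redundancy notation and the downtime figures concrete, here is a minimal sketch (my own illustration, not part of TIA-942) of component counts under each scheme and the availability implied by each rating's permitted annual downtime:

```python
# Minimal sketch: component counts under the redundancy schemes named
# above, plus the availability implied by each TIA rating's permitted
# annual downtime. The downtime figures are the ones quoted on these slides.

HOURS_PER_YEAR = 8766

def components(n_needed: int, scheme: str) -> int:
    """Units to install for N required units under a redundancy scheme."""
    return {"N": n_needed,
            "N+1": n_needed + 1,
            "2(N+1)": 2 * (n_needed + 1)}[scheme]

print(components(4, "N+1"))     # Rated-3 style: 5 units installed
print(components(4, "2(N+1)"))  # Rated-4 style: 10 units installed

for rated, downtime_h in ((1, 28.8), (2, 22.0), (3, 1.6), (4, 0.4)):
    availability = 100 * (1 - downtime_h / HOURS_PER_YEAR)
    print(f"Rated-{rated}: {downtime_h} h/year -> {availability:.3f}% available")
# Rated-1 ~99.671%, Rated-2 ~99.749%, Rated-3 ~99.982%, Rated-4 ~99.995%
```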
ISO/IEC 11801-5 – Generic Cabling for Customer Premises
Part 5: Data centers (1)
Latest revision ISO/IEC 11801-5:2017
Balanced & optical fibre cabling specifications, normative parts:
Structure of the generic cabling system
Channel performance requirements
Link performance requirements
Reference implementations
Cable requirements
Connecting hardware requirements
Requirements for cords and jumpers
Annex A – Combination of balanced cabling links

ISO/IEC 11801-5 – Generic Cabling for Customer Premises
Part 5: Data centers (2)
Informative Annexes (optional):
Usage of high density connecting hardware within optical fibre cabling
Examples of structures in accordance with ISO/IEC 11801-5
Data center minimum configuration
End of Row concept
Middle of Row concept
Top of Rack concept
End of Row and Middle of Row concept with redundancy
Top of Rack concept with redundancy
End of Row and Middle of Row concept with full redundancy
Top of Rack concept with (full) redundancy
Examples of networking fabric architectures: fat-tree, full-mesh,
interconnected meshes, centralized switch, virtual switch
ISO/IEC 11801-5 – Cabling
Cable classes
Twisted pair (100 Ω impedance)
Class EA: link/channel up to 500 MHz using Cat-6A cable/connectors
Class F: link/channel up to 600 MHz using Cat-7 cable/connectors
Class FA: link/channel up to 1000 MHz using Category 7A cable/connectors
Class I/II: link/channel up to 1600/2000 MHz using Category 8.1/8.2 cable/connectors
2-4 mated connectors per copper channel, RJ-45 or TERA connector
Optical fiber interconnect using multi-mode fibre
OM3: multimode fiber 50 µm, min. modal bandwidth of 2000 MHz·km at 850 nm
OM4: multimode fiber 50 µm, min. modal bandwidth of 4700 MHz·km at 850 nm
OS1/OS2: single-mode fiber, 1 dB/km / 0.4 dB/km attenuation
duplex LC (2 fibers) or MPO (3+ fibers) connector
Channel length is determined by media choice (see the sketch below)
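As an example of "channel length is determined by media choice", a hedged sketch using typical 10GBASE-SR reaches; the reach values are IEEE 802.3 figures, since ISO/IEC 11801-5 itself defers supported lengths to the application standards:

```python
# Hedged sketch: typical maximum channel lengths for 10GBASE-SR over
# the multimode fiber types named above (values from IEEE 802.3;
# ISO/IEC 11801-5 defers channel length to the application standards).

REACH_10GBASE_SR_M = {
    "OM3": 300,  # 2000 MHz*km modal bandwidth at 850 nm
    "OM4": 400,  # 4700 MHz*km modal bandwidth at 850 nm
}

def max_channel_m(media: str) -> int:
    """Typical 10GBASE-SR reach (meters) for a given multimode grade."""
    return REACH_10GBASE_SR_M[media]

print(max_channel_m("OM4"))  # 400 m
```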
ISO/IEC 11801-5 – Data Centre Topologies

[Figures: fat tree without port extenders, fat tree with port extenders, standard 3-tiered architecture, full mesh, interconnected meshes]
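To contrast these fabric architectures, here is a minimal sketch (my own illustration; the counts are standard graph arithmetic, not from the standard) of how inter-switch link counts scale:

```python
# Minimal sketch: inter-switch link counts for two fabrics named above,
# showing why a full mesh stops scaling where a fat tree does not.

def full_mesh_links(n_switches: int) -> int:
    """Full mesh: every switch pair directly connected, n*(n-1)/2 links."""
    return n_switches * (n_switches - 1) // 2

def fat_tree_links(k: int) -> int:
    """k-ary fat tree (k even): k^3/4 edge-agg plus k^3/4 agg-core links."""
    return k ** 3 // 2

print(full_mesh_links(16))  # 120 links to mesh just 16 switches
print(fat_tree_links(8))    # 256 links; an 8-ary fat tree serves 128 hosts
```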
EN 50173-5 – IT Generic cabling systems
Part 5: Data centres
Structure of the generic cabling system in data centres
Channel performance in data centres
Reference implementations in data centres
Cable requirements in data centres
Connecting hardware requirements in data centres
Requirements for cords and jumpers in data centres

EN 50600 series – IT
Data centre facilities and infrastructures
EN 50600-1 – General concepts
EN 50600-2-1 – Building construction
EN 50600-2-2 – Power distribution
EN 50600-2-3 – Environmental control
EN 50600-2-4 – Telecommunications cabling infrastructure
EN 50600-2-5 – Security systems
EN 50600-3-1 – Management and operational information
EN 50600-4-1 – Overview of and general requirements for key
performance indicators
EN 50600-4-2 – Power Usage Effectiveness
EN 50600-4-3 – Renewable Energy Factor
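For reference, the KPIs named in parts 4-2 and 4-3 are annual energy ratios; a sketch of the usual definitions (my summary, not text quoted from the standard):

```latex
% Power Usage Effectiveness (EN 50600-4-2) and
% Renewable Energy Factor (EN 50600-4-3):
\[
  \mathrm{PUE} = \frac{E_{\text{total facility}}}{E_{\text{IT equipment}}} \ge 1,
  \qquad
  \mathrm{REF} = \frac{E_{\text{renewable}}}{E_{\text{total facility}}} \in [0,1]
\]
% An ideal facility approaches PUE = 1: all energy reaches the IT load.
```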
EN 50600-2-5 Security Systems
Physical security – general, risk assessment
Designation of data centre spaces – Protection Classes
Protection Class against unauthorized access
Protection Class against fire events igniting within data centre spaces
Protection Class against environmental events (other than fire) within
data centre spaces
Protection Class against environmental events outside the data centre
spaces
Systems to prevent unauthorized access
Informative Annex – Pressure relief: Additional information

EN 50600 – Availability classes

EN 50600 – Protection classes

Uptime Institute Tier Standard
Tier Standard: Topology – 14 pages (version 01/2018)
Tier Standard: Operational Sustainability – 16 pages (2014)

Tier Requirements for Power (1 page) – clarification on reliability


Accredited Tier Designer Technical Paper Series (2017) containing
supplemental explanations and clarifications
Engine-Generator Ratings (5 pages) - requirement and use of an engine-
generator solution for on-site power.
Makeup Water (5 pages) – minimum requirements for evaporative cooling systems
Continuous Cooling (6 pages) – only required by Tier IV, but
recommended for densities higher than 4 kW/rack, examples on providing
thermal stability to the IT environment during cooling interruption
Tier Standard: Topology
Defines 4 basic Tiers – “Tier I” to “Tier IV”
there is no Tier 0, but there are requirements even for Tier I
The standard also considers:
Engine-Generator Systems
Ambient Temperature Design Points
Communications
Makeup Water
Utility Services
Defines Tier Functionality Progression
Forbids Fractional or Incremental Tier Classification
all components must be of the given Tier (otherwise the lowest component Tier determines the overall rating)
Not the same as the TIA “Rated” tiers; e.g. a raised floor is not a requirement
Tier I – Basic Site Infrastructure
Non-redundant capacity components and a single, non-redundant
distribution path serving the critical environment
Tier I infrastructure includes:
a dedicated space for IT Systems
a UPS to filter power spikes, sags, and momentary power outages
dedicated cooling equipment + on-site power production to protect IT
functions from extended power outages
12 hours of on-site fuel storage for on-site power production.
There is sufficient capacity to meet the needs of the site
Planned work will require shutting down most or all of the site
infrastructure affecting critical environment, systems & end users
Tier I – Operational Impacts
The site is susceptible to disruption from planned & unplanned
activities
operation errors of site infrastructure components will cause a data
center disruption
Unplanned outage or failure of any capacity system, capacity
component, or distribution element will impact the critical environment
The site infrastructure must be completely shut down on an annual
basis to safely perform necessary preventive maintenance and repair
work
urgent situations may require more frequent shutdowns
Failure to regularly perform maintenance significantly increases the
risk of unplanned disruption as well as the severity of the
consequential failure
Tier II – Redundant Site Infrastructure CC (1)
A Tier II data center has redundant capacity components (CC) and a
single, non-redundant distribution path serving the critical env.
The redundant components are
extra on-site power production (e.g., engine generator, fuel cell)
12 hours of on-site fuel storage for ‘N’ capacity
UPS modules and energy storage
chillers, heat rejection equipment, pumps, cooling units & fuel tanks
Redundant capacity components can be removed from service on a
planned basis without causing any of the critical environment to be
shut down

Tier II – Redundant Site Infrastructure CC (2)
Removing distribution paths from service for maintenance or other
activity requires shutdown of critical environment
There is sufficient permanently installed capacity to meet the needs of
the site when redundant components are removed from service for any
reason

Tier II – Operational Impacts
The site is susceptible to disruption from planned activities &
unplanned events
operation errors of site infrastructure components may cause a data
center disruption
Unplanned capacity component failure may impact critical
environment. Unplanned outage or failure of any capacity system or
distribution element will impact the critical environment
The site infrastructure must be completely shut down on an annual
basis to safely perform preventive maintenance and repair work
urgent situations may require more frequent shutdowns
Failure to regularly perform maintenance significantly increases the
risk of unplanned disruption as well as the severity of the
consequential failure
Tier III – Concurrently Maintainable Site Infr.
A Concurrently Maintainable DC has redundant CCs and multiple
independent distribution paths serving the critical environment
For the electrical power backbone and mechanical distribution path, only one distribution path is required to serve the critical environment at any time
The electrical power backbone is defined as the electrical power
distribution path from the output of the on-site power production system
to the input of the IT UPS and the power distribution path that serves the
critical mechanical equipment
12 hours of on-site fuel storage for ‘N’ capacity
The mechanical distribution path is the distribution path for moving heat
from the critical space to the outdoor environment, e.g. chilled/condenser
water piping, refrigerant piping, etc.
Tier III – Concurrently Maintainable Site Infr.
All IT equipment is dual powered and installed properly to be
compatible with the topology of the site’s architecture. Transfer
devices, such as point-of-use switches, must be incorporated for
critical environment that does not meet this requirement
Tier III – Performance Confirmation Tests
Each and every capacity component and element in the distribution
paths can be removed from service on a planned basis without
impacting any of the critical environment

There is sufficient permanently installed capacity to meet the needs of the site when redundant components and distribution paths are removed from service for any reason
Tier III – Operational Impacts
The site is susceptible to disruption from unplanned activities
operation errors of site infrastructure components may cause a data center disruption
Unplanned outage or failure of any capacity system may impact the
critical environment
Unplanned outage or failure of a capacity component or distribution
element may impact the critical environment
Planned site infrastructure maintenance can be performed by
using the redundant capacity components & distribution paths to
safely work on the remaining equipment
During maintenance activities, the risk of disruption may be elevated
however, this does not defeat the Tier rating achieved in normal operations
Tier IV – Fault Tolerant Site Infrastructure (1)
A Fault Tolerant data center has multiple, independent, physically
isolated systems that provide redundant capacity components and
multiple, independent, diverse, active distribution paths simultaneously
serving the critical environment
the redundant capacity components and diverse distribution paths shall
be configured such that ‘N’ capacity is providing power and cooling to the
critical environment after any infrastructure failure
12 hours of on-site fuel storage for ‘N’ capacity
all IT equipment is dual powered with a Fault Tolerant power design
internal to the unit and installed properly to be compatible with the
topology of the site’s architecture. Transfer devices, such as point-of-use
switches, must be incorporated for critical environment that does not
meet this specification
Tier IV – Fault Tolerant Site Infrastructure (2)
complementary systems and distribution paths must be physically
isolated from one another (compartmentalized) to prevent any single
event from simultaneously impacting both systems or distribution paths.

Continuous Cooling is required
provides a stable environment for all critical spaces within the ASHRAE
maximum temperature change for IT equipment as defined in Thermal
Guidelines for Data Processing Environments, Third Edition.
Continuous Cooling duration should be such that it provides cooling until
the mechanical system is providing rated cooling at the extreme ambient
conditions

Tier IV – Performance Confirmation Tests
A single failure of any capacity system, capacity component, or
distribution element will not impact the critical environment
The infrastructure controls system demonstrates autonomous
response to a failure while sustaining the critical environment
Each and every capacity component & element in the distribution paths
can be removed from service on a planned basis without impacting any
of the critical environment
there is sufficient capacity to meet the needs of the site when redundant
components or distribution paths are removed from service for any
reason
Any potential fault must be capable of being detected, isolated, and
contained while maintaining N capacity to the critical load
Tier IV – Operational Impacts
The site is not susceptible to disruption from a single unplanned
event, and it is not susceptible to disruption from any planned work
activities.
The site infrastructure maintenance can be performed by using the
redundant CCs and distribution paths to safely work on the remaining
equipment.
During maintenance activities where redundant capacity components or a distribution path are shut down, the critical environment is exposed to an increased risk of disruption in the event a failure occurs on the remaining path
however, this does not defeat the Tier rating achieved in normal operations
Operation of the fire alarm, fire suppression, or the emergency power off (EPO) feature may cause a data center disruption
Tiers Overview

Tier Standard: Operational Sustainability (1)
Lists behaviors, risks, and their mitigations beyond Tier topology that impact the ability of a DC to meet its uptime objectives over the long term
Management methodologies and concepts
3 elements of Operational sustainability:
Management & Operations
Staffing, Qualifications, Organization, Staff & Vendor Training
Preventive, Deferred and Predictive Maintenance Program, Housekeeping Standards, Maintenance Management System, Vendor Support, Life-cycle Planning, Failure Analysis Program
Site Policies, Financial Proc., Reference Library, Capacity & Load
Management, Operating Set Points, Rotating Redundant Equipment

Tier Standard: Operational Sustainability (2)
3 elements of Operational sustainability (contd.):
Building Characteristics
Commissioning, Purpose-Built, Support & Specialty Spaces, Security and Access, Setbacks
Infrastructure Category
Site Location
Natural and Man-Made Disasters Risks

Uptime Data Center Operational Excellence
Verifies that practices and procedures are in place to
avoid preventable errors (~ 73% of failures are human errors)
maintain IT functionality
support effective site operation
The Certification process ensures operations are in alignment with
organization's business objectives, availability expectations, and
mission imperatives.

Three levels
Bronze (expires 1 year after being awarded)
Silver (expires 2 years after being awarded)
Gold (expires 3 years after being awarded)
Operational Sustainability Examples

Tier Requirements for Power Summary
Tier Certified data centers can combine reliable on-site power and more economical utility power for cost-efficiency with autonomous fail-over
On-site power generation is the only truly reliable source – it is completely under the organization's control
Utility-Provided power is the most economical source of power when compared to the
operational and maintenance costs of on-site generated power
Uptime’s Tier Standard has no specific requirements for the number or type of
utility-provided power sources
Disruption to utility power is not a failure; it is an anticipated operational condition for which the DC must be prepared
A Tier III or IV on-site generation system, along with its power paths and other supporting
elements, shall meet the Concurrently Maintainable and/or Fault Tolerant performance
confirmation tests when powered by on-site power generation

Engine-Generator Ratings
The Power Summary requirements above hold
Engine generators for Tier III and IV sites shall not have a limitation on consecutive hours of operation when loaded to ‘N’ demand
Engine generators that have a limit on consecutive hours of operation at ‘N’ demand are appropriate only for Tier I or II

Operation may be for an extended period – weeks or even months – during extended outages due to local utility loss, or in case of catastrophic malfunction of the UPS system, as well as during its replacement or heavy maintenance
