Stephen W. Duda

When N+1 Just Isn’t Enough


Two recent columns of mine [1, 2] have dealt with an N+1 level of redundancy for critical facilities, data centers, and the like. Sometime after the first of these two was published, I received a telephone call from an ASHRAE friend of mine, James M. Calm, a self-employed engineering consultant and ASHRAE Fellow working in Great Falls, Va. In a cordial but pointed discussion, he challenged me to revisit the topic of N+1 redundancy lest I leave our readers with the impression that N+1 is sufficient in even the most critical of facilities. So, inspired by Jim, this column explores those facilities where N+1 just isn’t good enough.

As a brief review, the term N+1 refers to a level of equipment or system redundancy in which one additional component of each type is provided (shorthand for Need-plus-One). If two chillers are required for the load, three are provided; if three boilers are necessary, four are provided, and so forth.

The building that was the subject of my May 2017 column is a large data center occupied by an agency of the U.S. government and is considered highly critical for national defense. So critical, in fact, that they have “N+1 Buildings”; in other words, they maintain two distinct parallel data center facilities in two locations geographically remote from each other, such that either surviving building can carry the full function after a loss of the other due to fire, terrorist attack, tornado, earthquake, or other disaster.

Stephen W. Duda, P.E., is senior mechanical engineer at Ross & Baruzzini, Inc., in St. Louis. He is a member and research subcommittee chair of TC 9.1, Large Building Air-Conditioning Systems.

Beyond N+1

Having an entirely redundant facility may be appropriate for exceptionally critical functions, but there are many privately owned or publicly traded companies for whom an entirely redundant facility is not practical. For those whose business case justifies redundancy beyond N+1 in a single facility, N+2, 2N, 2N+1, and 2(N+1) are other options. These higher levels of redundancy may also require some degree of duplicate pathways such as parallel piping, electrical, and control systems.

The obvious benefit of N+1 is the fact that a single equipment failure does not disrupt normal operation of the facility. In a given facility, if four chilled water pumps are needed to meet the load and five are available, then a single failure does not immediately hamper facility operation. However, consider the case where a chilled water pump is intentionally removed from service for maintenance purposes, perhaps a bearing replacement or rebuild of the seals requiring significant pump disassembly and which cannot be reassembled in haste. If a second chilled water pump then fails unexpectedly, the ability to meet load is in jeopardy. This leads some business owners to request N+2 redundancy in critical facilities.

Some readers may read the previous paragraph and think the maintenance staff would not undertake a chilled water pump overhaul during peak summer months, instead waiting for winter when fewer chilled water pumps will be needed. However, data centers often have a round-the-clock and year-round cooling load of nearly constant magnitude, with chilled water pumps needed year-round for water-side economizer if
not for chiller service, meaning there is no suitable off-peak time for scheduled or routine maintenance. N+2 allows the owner the opportunity to have one component out of service intentionally for scheduled or routine maintenance while also allowing for the possibility of an unplanned failure.

Even more redundancy can be achieved with a 2N approach, with every component having an equivalent backup. For smaller values of N, there would not appear to be a significant difference between N+2 and 2N (in fact, the values are identical for N=2). As N becomes much larger, the difference is significant and requires some engineering judgement. As an example, consider a very large data center with 30 computer-room air-handling units (CRAHs). N+1 would yield 31 CRAHs but 2N would yield 60. Does your data center really require 60 CRAHs when 30 would cover the load? Are CRAHs really that failure-prone? Most likely, no. On the other hand, are 31 CRAHs enough when 30 cover the load? Again, most likely no. Some value in between, perhaps N+5 or 1.2N, would give the owner adequate flexibility to ride out a few equipment failures or planned overhauls where the value of N is larger. This needs to be discussed with the owner and/or end user in the light of their tolerance for failure, business model, and potential cost avoidance.

2N is sometimes called for in central plants serving critical facilities, such as two identical chiller plants each with N chillers, N chilled water pumps, N condenser water pumps, N cooling towers, and so forth. True 2N means that each of these central plants would be configured for the full peak load and could operate completely independent of the other, with separate piping systems, separate sources of power, separate makeup water, etc. Service could be shifted between the plants to allow even wear while allowing scheduled or even unplanned maintenance to occur in one while the other carries the load.

Even 2N chiller plants may not satisfy the most critical of facilities, if the operator needs the flexibility to perform significant maintenance (requiring equipment disassembly) on one plant when suddenly a failure occurs in the other. Having one additional piece of each type of equipment available in each of the two plants would constitute 2(N+1), while having a total of one additional piece of each type of equipment available on a swing basis to either plant would constitute 2N+1. Either setup would allow scheduled maintenance to occur in one central plant while the other central plant carries the load, and still allow for a failure in the operating plant.

Two completely independent central plants offer a better chance of riding through a major calamity in one plant (e.g., fire, explosion, flood, roof collapse). Even that seemingly extreme level of redundancy is not calamity-proof, however; a major tornado or earthquake severe enough to destroy one central plant is likely to impact the other central plant as well, if located on the same property. Hence, the agency critical to national defense mentioned in my opening paragraphs constructed “N+1 Buildings” in two geographically remote locations. For a more detailed system of classifying levels of redundancy and their associated levels of reliability (sometimes called tier levels), see ANSI/TIA Standard 942-A [3].

How does one decide? One method is the “single point of failure” analysis or network analysis, to identify where single points of failure might exist. At the same time, one would also identify points that have single redundant capability but whose components or systems need periodic maintenance, or have a short mean time to failure, either of which would take them off-line for some period of time, leaving the system with a temporary single point of failure. The final selection might in some sections be N+1, other places 2N, others N+2, and so forth, for different components and connections, depending upon their vulnerability, failure rate, and need for periodic downtime for maintenance. In some cases, N+2 can be reduced back to N+1 by stocking critical replacement parts on-site.

Dual-Path Piping, Controls, and Power Systems

For a very critical facility, it may not be sufficient to have redundant pieces of mechanical equipment if they all feed into a common piping loop. Even a 2N chiller plant will not protect a customer from a potential service outage if each chiller plant feeds into a single chilled water supply and chilled water return header. Although we tend to think of only equipment with moving parts as being subject to a failure, even a static piping system can require a shutdown to repair a leak or rupture, or for testing. While runs of pipe rarely fail, valves do, so the chilled water pathway to each CRAH should be examined for single points of failure. Perhaps the likelihood of a piping system failure seems lower than the likelihood of
a pump or fan failure, but nonetheless a truly critical facility may require redundant piping, power, and controls as well.

For an example of a dual-path piping system serving a data center, see Figure 1 [4]. In the diagram, one can see dual supply and return chilled water piping connections into a bi-directional 360-degree loop, and the loop itself is compartmentalized into several sections with shutoff valves allowing for a segment to be isolated if repairs are needed. In this example, a loss of a single segment of loop piping between service valves is survivable if there is sufficient redundancy in the quantity of connected decentralized equipment components. A design with no single point of failure can be ensured by making the number of redundant CRAHs at least equal to the largest number of CRAHs connected to a single section of the piping loop, so that if that segment failed and was isolated, sufficient additional CRAHs would remain online to achieve full capacity.

FIGURE 1. Example of chilled-water distribution piping system. (Diagram labels: Redundant Supply Loop Connections From Chiller Plant; Redundant Return Loop Connections to Chiller Plant; Supply Loop Header; Return Loop; Header Sectional Valves; Valved Branch Connections to Decentralized Air/Water-Cooled Equipment (Typical); Valved Mains for Cross Connection or External Emergency Connection.)

Another sometimes-overlooked feature of a truly-redundant HVAC system is makeup water for cooling towers, if water-cooled equipment is used. When considering both condenser water evaporation and blowdown, open-cell cooling towers cannot operate for extended periods of time without makeup water. Having only one connection to domestic water will not be sufficient in critical facilities. Either two separate sources of domestic water supply to the facility, or one source plus significant onsite domestic water storage tanks, will be necessary. And in either case, two separate and remote connections of makeup water to the condenser water system must be provided.

For true redundancy, one or more backup sources of electrical power are also required for the critical HVAC equipment. This brings up a somewhat philosophical question: Does a single onsite engine generator constitute N+1? If standard utility-provided electricity is “N,” then it is possible to argue that a single generator by itself constitutes the “+1” of electric power. But most truly critical facilities that this author has been exposed to have redundant utility power sources and redundant generators. These facilities also frequently feature UPS (uninterruptible power supply), i.e., batteries; and some redundant electrical pathways with dual-ended substations, automatic power transfer switches, and a “checkerboard” arrangement of equipment power feeds. Since this is a mechanical engineering magazine, detailed discussions of critical or backup power are best left to our friends in the electrical engineering field [5].

Control reliability and redundancy is often missed or ignored, yet controls failures may be more likely to bring down a data center than are mechanical failures. Good practice for mission-critical facilities is to ensure that the control system has the following provisions:

1. Emergency power supply, which may seem obvious, but some facilities miss it.
2. Alarms on critical points that alert personnel, who can then act with manual overrides such as placing motor controller switches in the “hand” position.
3. Stocked spare control boards or devices that can be installed quickly in the event of failure.
4. Secondary backup controllers that, upon a controller failure, take over control of the devices and restore functionality. This is accomplished by programming the building automation system with event sequences triggered by alarms or flags noting the failure.
5. Redundant sensors, which may be very inexpensive insurance for critical applications against an unanticipated defeat of all the expensive redundant equipment being purchased.

For an in-depth exploration of data center controls reliability, Stein/Gill [6] is recommended further reading.
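
The redundancy arithmetic used in this column, and the loop-sectioning rule described with Figure 1, can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions: every unit has equal capacity, each CRAH is fed from exactly one loop segment, and only one segment is isolated at a time. The function names and the segment layouts are hypothetical and are not taken from Figure 1 or from any standard.

```python
def installed_units(n: int, scheme: str) -> int:
    """Total units installed when n units are needed to carry the load."""
    totals = {
        "N": n,
        "N+1": n + 1,
        "N+2": n + 2,
        "2N": 2 * n,
        "2N+1": 2 * n + 1,      # two full plants plus one swing unit
        "2(N+1)": 2 * (n + 1),  # one spare unit in each of the two plants
    }
    return totals[scheme]


def survives_any_segment_loss(crahs_per_segment: list[int], needed: int) -> bool:
    """True if isolating any one loop segment still leaves `needed` CRAHs online."""
    total = sum(crahs_per_segment)
    worst_loss = max(crahs_per_segment)  # the most heavily loaded segment
    return total - worst_loss >= needed


if __name__ == "__main__":
    # The 30-CRAH example from the text: N+1 yields 31 units, 2N yields 60.
    for scheme in ("N+1", "N+2", "2N", "2N+1", "2(N+1)"):
        print(scheme, installed_units(30, scheme))

    # Hypothetical loop layouts, both with 30 CRAHs needed.
    # 36 CRAHs (1.2N) in six segments of six: any one segment can be isolated.
    print(survives_any_segment_loss([6, 6, 6, 6, 6, 6], needed=30))
    # 35 CRAHs (N+5) with up to six per segment: losing a six-unit segment leaves 29.
    print(survives_any_segment_loss([6, 6, 6, 6, 6, 5], needed=30))
```

In the second hypothetical layout, five redundant CRAHs are not enough because the largest segment carries six, which is the condition the loop-sectioning rule is meant to catch.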

Operating a 2N Plant

An interesting corollary topic arises for a plant operator furnished with a 2N central chiller plant. Assuming there are no equipment failures and all components are therefore available for service, what is the most energy-efficient way to operate that plant? Imagine a four-chiller chilled water plant where any two chillers can carry the peak load. Should the operator run two chillers at 100% capacity each? Three chillers at two-thirds capacity each? All four chillers at half capacity each? (The energy issue aside, in a mission-critical facility it is generally not a good idea to operate two chillers at 100% load because if one fails suddenly, there is a time delay before another can stage online and pick up the load; it is generally safer to run three at 66.7% so losing one machine poses a smaller risk.)

Actual loads in data centers are seldom as high as design loads due to very conservative Watts-per-rack values that often do not materialize. This is one reason why VFDs are so cost effective in data centers. When there are redundant CRAHs, VFDs can make the plant more energy efficient. For instance, if 30 CRAHs are needed and 35 are running at part speed, the energy use will be about 25% lower than if 30 CRAHs were running at full speed. Another example: If two of three cooling towers are needed, running three cooling towers at part speed reduces tower fan energy by more than 50%. Even dual-path piping saves energy. If piping is sized to handle the design flow all in one direction but normally has dual paths, the water flow split reduces pumping energy.

Proper operation for lowest energy use is a complicated question, one that is far too detailed for this column. Adding to the complexity is that resultant pumping energy, cooling tower fan energy, the use of constant-speed versus variable-speed chillers and pumps versus a hybrid of those, all come into play. Fortunately, one of my Engineer’s Notebook colleagues has written extensively on that topic, and his work is recommended reading for those exploring that issue [7].

Another approach is that the plant, if properly instrumented and metered, can be operated according to ASHRAE Guideline 22 [8]. This Guideline provides, among
other things, a type of trial-and-error path to determining the best combination of equipment to operate for best energy performance. Operating two chillers at 100% but wondering what would happen to energy consumption if you ran three chillers at two-thirds capacity each? Try it and see what happens. Over time, a properly instrumented chiller plant in conjunction with digital controls and computer server with sufficient memory can learn the best combination of equipment to operate under an expansive array of load conditions, weather conditions, time-of-day bracketed or tiered electrical utility rate structures, and so forth.

Conclusions

In case my May 2017 column left readers thinking that N+1 is sufficient in even the most critical facilities, this column is intended to correct that impression. Where the need is justified, greater redundancy may be appropriate. N+2, 2N, 2N+1, and 2(N+1) levels of redundancy must be considered for facilities critical to national defense, disaster recovery or emergency response, higher-level hospitals or trauma centers, or private-sector data centers handling millions of dollars in transactions every few minutes. And it is not just pieces of equipment that require redundant backup; even piping paths, controls, and often-overlooked things like dual sources of HVAC makeup water may be required.

References

1. Duda, S. 2017. “Waterside economizer retrofit for data center.” ASHRAE Journal (5).
2. Duda, S. 2018. “N+1 HVAC for IT closets and server rooms.” ASHRAE Journal (5).
3. ANSI/TIA Standard 942-A, Telecommunications Infrastructure Standard for Data Centers. 2014. Telecommunications Industry Association, Arlington, VA.
4. ASHRAE. 2012. Datacom Equipment Power Trends and Cooling Applications. 2nd ed.
5. As one example, see Lippold, B. and C. Bruck. 2018. “Acceptance/Maintenance Testing for Data Center Electrical Reliability.” Mission Critical (5). BNP Media, Inc., Troy, Mich.
6. Stein, J., B. Gill. 2018. “Data center controls reliability.” ASHRAE Journal (10).
7. Taylor, S. 2012. “Optimizing design & control of chilled water plants, Part 5.” ASHRAE Journal (6).
8. ASHRAE Guideline 22-2012, Instrumentation for Monitoring Central Chilled-Water Plant Efficiency.

