PURDUE UNIVERSITY
GRADUATE SCHOOL
Thesis/Dissertation Acceptance
Entitled
DATA CENTER RECOVERY BEST PRACTICES: BEFORE, DURING, AND AFTER DISASTER
RECOVERY EXECUTION
For the degree of Master of Science
To the best of my knowledge and as understood by the student in the Research Integrity and
Copyright Disclaimer (Graduate School Form 20), this thesis/dissertation adheres to the provisions of
Purdue University's "Policy on Integrity in Research" and the use of copyrighted material.
Approved by Major Professor(s): J. Eric Dietz
PURDUE UNIVERSITY
GRADUATE SCHOOL
Title of Thesis/Dissertation:
DATA CENTER RECOVERY BEST PRACTICES: BEFORE, DURING, AND AFTER DISASTER
RECOVERY EXECUTION
I certify that in the preparation of this thesis, I have observed the provisions of Purdue University
Executive Memorandum No. C-22, September 6, 1991, Policy on Integrity in Research.*
Further, I certify that this work is free of plagiarism and all materials appearing in this
thesis/dissertation have been properly quoted and attributed.
I certify that all copyrighted material incorporated into this thesis/dissertation is in compliance with the
United States’ copyright law and that I have received written permission from the copyright owners for
my use of their work, which is beyond the scope of the law. I agree to indemnify and save harmless
Purdue University from any and all claims that may be asserted or that may arise from any copyright
violation.
04/04/2011
Date (month/day/year)
*Located at http://www.purdue.edu/policies/pages/teach_res_outreach/c_22.html
DATA CENTER RECOVERY BEST PRACTICES: BEFORE, DURING,
AND AFTER DISASTER RECOVERY EXECUTION
A Thesis
Submitted to the Faculty
of
Purdue University
by
Heather M. Brotherton
In Partial Fulfillment of the
Requirements for the Degree
of
Master of Science
May 2011
Purdue University
TABLE OF CONTENTS
Page
LIST OF TABLES ........................................................................................... iv
LIST OF FIGURES.......................................................................................... v
LIST OF ABBREVIATIONS............................................................................ vi
ABSTRACT ....................................................................................................vii
CHAPTER 1. INTRODUCTION....................................................................... 1
1.1. Statement of purpose ......................................................................... 1
1.2. Research Question ............................................................................. 1
1.3. Scope.................................................................................................. 2
1.4. Significance......................................................................................... 2
1.5. Assumptions ....................................................................................... 3
1.6. Limitations........................................................................................... 3
1.7. Delimitations ....................................................................................... 4
1.8. Summary............................................................................................. 4
CHAPTER 2. LITERATURE REVIEW............................................................. 6
2.1. Critical cyberinfrastructure vulnerability .............................................. 6
2.2. Barriers to cyberinfrastructure resiliency ............................................ 8
2.3. Mutual aid ........................................................................................... 9
2.3.1. Mutual Aid Association ............................................................. 11
2.4. Training ............................................................................................. 12
2.5. Testing .............................................................................................. 13
2.6. Summary........................................................................................... 14
CHAPTER 3. FRAMEWORK AND METHODOLOGY .................................. 15
3.1. Framework ........................................................................................ 15
3.2. Researcher Bias ............................................................................... 16
3.3. Methodology ..................................................................................... 16
3.4. Data Collection.................................................................................. 17
3.5. Authorizations ................................................................................... 17
3.6. Analysis............................................................................................. 18
3.6.1. Triangulation............................................................................. 18
3.7. Summary........................................................................................... 19
CHAPTER 4. CASE STUDIES...................................................................... 20
4.1. Commerzbank................................................................................... 20
4.1.1. Background.................................................................................... 20
4.1.2. World Trade Center Attacks ..................................................... 21
4.1.3. Conclusion................................................................................ 28
4.2. FirstEnergy........................................................................................ 29
4.2.1. Background .............................................................................. 29
4.2.2. Northeast Blackout of 2003 ...................................................... 29
4.2.3. Conclusion................................................................................ 38
4.3. Tulane ............................................................................................... 39
4.3.1. Background .............................................................................. 39
4.3.2. Hurricane Katrina...................................................................... 40
4.3.3. Conclusion................................................................................ 48
4.4. Commonwealth of Virginia ................................................................ 49
4.4.1. Background .............................................................................. 49
4.4.2. August 2010 outage ................................................................. 51
4.4.3. Conclusion................................................................................ 65
CHAPTER 5. ANALYSIS............................................................................... 67
5.1. Best Practice Triangulation ............................................................... 67
5.1.1. Before-Planning........................................................................ 67
5.1.2. During-Plan execution .............................................................. 73
5.1.3. After-Plan improvement............................................................ 78
CHAPTER 6. CONCLUSION ........................................................................ 86
CHAPTER 7. FUTURE RESEARCH............................................................. 89
BIBLIOGRAPHY............................................................................................ 91
APPENDICES
Appendix A............................................................................................. 103
Appendix B............................................................................................. 104
VITA ............................................................................................................ 118
PUBLICATION
Disaster recovery and business continuity planning:
Business justification.............................................................................. 120
LIST OF TABLES
Table Page
Table 5.1 Tolerance and objectives .................................................................... 68
Table 5.2 Aid relationship utilized during recovery .............................................. 78
LIST OF FIGURES
Figure Page
Figure 5.1 Adherence to established procedures................................................ 74
Figure 5.2 Sample IT incident command structure.............................................. 77
Figure 5.3 Reported average downtime revenue losses in billions ..................... 80
Figure 5.4 Reported critical application and data classifications ......................... 81
Figure 5.5 Components of a resilient system ...................................................... 85
LIST OF ABBREVIATIONS
ABSTRACT
This qualitative multiple case study analysis reviews well documented past
procedures, chain of command structure, recovery time and cost, and mutual aid
CHAPTER 1. INTRODUCTION
measures due to the high cost of remote failover systems and training.
systems resiliency.
1.2. Research Question
What are best practices in planning, during, and after disaster recovery
execution?
1.3. Scope
recovery time, and business impact. Practical tools and resources to assist best
1.4. Significance
systems has created vulnerabilities that have not been uniformly addressed.
widespread, severe negative impact on the public. While most large corporations
have remote failover locations, there are many organizations important to critical
functions that do not have the resources to develop and implement business
recovery guidance, developed through the findings of this research, may help
ensure the stability of cyberinfrastructure and, by extension, the safety and
well-being of all.
1.5. Assumptions
phenomenon of interest.
• Existing publicly available documents are the best source of the actions
1.6. Limitations
Limitations include:
due to:
of the incident
Therefore, this research will not address topics that cannot be examined
1.7. Delimitations
Delimitations include:
this research will not attempt to add to planning, but will focus on the
research study.
• Information systems failures that are not well documented will not be
addressed.
1.8. Summary
adverse incidents with minimal disruption. The scope of the project is defined in
cyberinfrastructure resiliency.
vulnerabilities and threats are discussed. The barriers to systems resiliency and
agreements.
operations of the economy and government. They include, but are not limited to,
However, despite this directive, in 2003 the Northeast portion of the United
States suffered an extended widespread power outage due in large part to failure
of the computer system (U.S.-Canada Power System Outage Task Force, 2004).
2003). Findings published by the New York Independent System Operator state
"the root cause of the blackout was the failure to adhere to the existing reliability
rules" (New York Independent System Operator, 2005, p. 4). "ICF Consulting
estimated the total economic cost of the August 2003 blackout to be between $7
and $10 billion" (Electricity Consumers Resource Council (ELCON), 2004, p. 1).
critical resources such as power and water from cyber attack (Scherr & Bartz,
intellectual property alone from 2008 to 2009 were approximately one trillion
(ANSI), 2010).
patch known vulnerabilities (Homeland Security, 2009). Each patch or fix applied
increased the usefulness of computers, but this has also increased vulnerability.
Information systems are highly complex; even information technology experts are
backing to push policy change and supply resources there is little chance for
failover testing can render an otherwise solid continuity plan useless. In some
cases, companies have disaster recovery plans but are reluctant to test live
systems. Such tests can be scheduled during low-traffic periods when the staff can
be prepared to quickly recover any outage. These tests serve to identify system and
failover plan weaknesses and make the staff more comfortable with the failover and
recovery process.
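To make this concrete, a hedged sketch in Python: a script could pick the quietest hour from recent request-rate history before announcing a live failover drill. The hourly counts and the single-quiet-hour heuristic below are invented for illustration and are not drawn from the sources reviewed here.

```python
# Hypothetical sketch: choose a low-traffic hour for a live failover drill.
# The hourly request counts are invented sample data.
hourly_requests = dict(enumerate(
    [120, 95, 80, 70, 65, 90, 300, 900, 1500, 1700, 1650, 1600,
     1550, 1600, 1580, 1500, 1400, 1200, 900, 600, 400, 300, 200, 150]))

def quietest_hour(history):
    """Return the hour of day with the lowest observed request volume."""
    return min(history, key=history.get)

drill_hour = quietest_hour(hourly_requests)
print(f"Schedule the live failover test around {drill_hour:02d}:00, "
      f"when traffic averages {hourly_requests[drill_hour]} requests/hour.")
```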
idea that some disasters cannot be planned for because they are too large.
provides a framework for managing incidents of any size and complexity. (FEMA)
Information and training for NIMS are freely available on the Federal Emergency
Management Agency (FEMA) website. The use of this framework is highly
recommended because it is widely used and
provides a framework for integrating outside organizations into the command and
The September 2010 San Bruno gas pipeline explosion is a good example
activated 42 fire agencies, 200 law enforcement officers. (Jackson, 2011) “85
(Jackson, 2011) The resources required for this incident were far beyond feasible
maintainability for the city’s budget. The California Mutual Aid System along with
an Emergency Operations plan ensured the city was able to quickly and
The possibility that the utilization of IT mutual aid agreements will allow
exploring (Swanson, Bowen, Wohl Phillips, Gallup, & Lynes, 2010). Collocation
Staffing is a key resource that could be negotiated for through mutual aid
staff will be available should a catastrophic event occur. Some catastrophes may
make staff unavailable due to personal impact and additional staff may be
staff. The end result may be cost savings. Fewer contractors and consultants
can be shared between partner organizations. This may not only save costs of
developing and providing training, but will provide a "common language" for the
downtime.
Mutual Aid agreements are common for police, fire departments, and
utilities. Associations have been formed to fill the gaps in situations where an
These relationships have been used to the benefit of society at large allowing
during the Blue Cascades exercise (2004, p. 4). The Blue Cascades II exercise
The FEMA website has links to a few mutual aid associations such as
EMAC is designed to assist states, but this model may work for non-profit,
technology may be warranted due to the special skills, equipment, and resources
2.4. Training
Human error is often cited as the primary cause of systems failure (U.S.-Canada
Power System Outage Task Force, 2004). In many cases, the incident is
initiated by another type of failure (software, hardware, fire, etc), but the
Task Force, 2004). Automation of "easy tasks" leaves "complex, rare tasks" to
the human operator. (Patterson, et al., 2002, p. 3) Humans "are not good at
2002, p. 3) "Humans are furious pattern matchers" but "poor at solving problems
from first principles, and can only do so for so long before" tiring (Patterson, et
al., 2002, p. 3). Automation "prevents …building mental production rules and
this are that technologists are not efficient at solving problems without
and allows the technologist to quickly and more accurately respond to incidents.
2.5. Testing
emergency tests them, and latent errors in emergency systems can render them
are not the focus of the testing discussed here. Large-scale recovery and
of a large-scale disaster. Disasters have not only been historically costly, but
business continuity and disaster recovery testing are too high to risk.
2.6. Summary
the cost of maintaining remote failover. Training and testing are key factors in
from the analysis. Qualitative methods will be applied to facilitate the exploration
of this topic. This chapter details the research methodology employed as well as
3.1. Framework
a theoretical point of view, but how does execution play out in real life, high
beliefs that may encroach upon the findings of this research. Preparedness, in
incident mitigation, quicker recovery time, and reduced personal stress during
planning and the ingenuity of the incident responders is the key to success. I
believe that an all hazards approach, established chain of command, and well-
3.3. Methodology
processes. Primarily due to the rare occurrence of this type of event, it is highly
interest. Quantitative methods are impractical because, while the cases used will
measures is questionable due to the high stress nature of the recovery situations
Lab research was also considered and while this would produce high
Therefore, external validity would be low and would likely result in unrealistic
findings.
included:
• Documented resolution
Phenomenon related documents, artifacts, and archival records were used rather
than interviewing, which also reduces the possible impact of researcher bias.
Multiple cases were included in the case study. This method of data collection
sector. The area of interest is high impact cyberinfrastructure; the findings using
information systems.
3.5. Authorizations
3.6. Analysis
those resulting in positive and negative results, were identified. Factors explored
include:
3.6.1. Triangulation
The purpose of including more than one case study is to collate the
Generalizable practices from other disciplines will also be used to reinforce the
3.7. Summary
used in this research. Rationales for the methods employed were also discussed.
Findings and sources used for the case study are included in following chapters.
4.1. Commerzbank
4.1.1. Background
United States as well, including a 1992 flood in Chicago and the 1993 World
“only 300 feet from the World Trade Center towers.”(Editorial Staff of
SearchStorage.com, 2002)
On September 11, 2001, the World Trade Center suffered the largest terrorist
attack in United States history. Nearly 3,000 people died that day as a result of the
attacks. (Schwartz, Li, Berenson, & Williams, 2002) The impact to the economy
of the city of New York alone was $83 billion. (Barovik, Bland, Nugent, Van Dyk,
& Winters, 2001) Site cleanup took over eight months. (Comptroller of the City of
New York, 2002) Not all businesses were able to recover from the devastation
inflicted by the attacks. (Scalet, S. D., 2002) The overall economic impacts
continue today and the daily life of each resident of the United States has been
affected.
4.1.2.1. Ramifications
Commerzbank was so near the World Trade Center impact sites that the
The interior of the building that housed Commerzbank was covered in debris and
glass, creating an unsafe environment and choking building equipment. The data
Most of the local data center disks failed, causing failover to Commerzbank's
tolerant system with remote failover that allowed them to remain operational
4.1.2.2. Response
communications with “Federal Reserve and the New York Clearing House” that
were lost after the first collision. (Availability Digest, 2009) It became apparent
that the World Trade Center was under attack when the second jet hit,
and Why, Part 2: Organizations, 2010) When the building lost power,
Commerzbank's backup power generator took over, but the HVAC system failed
Rye, New York can be operated by 10 staff members and 16 reported to the
the primary data center, and in the days that followed, EMC, Commerzbank's storage
vendor, worked around the clock to restore data that was backed up to tape
fuel storage tank, cooling tower, UPS, batteries, and fire suppression
Commerzbank was in the midst of virtualizing storage, and had finished the
majority of the conversion before the attacks. (Mears, Connor, & Martin, 2002) The
provided the capability to meet the zero downtime requirement set forth by the
"everything" to the remote site. (Parris, Who Survives Disasters and Why, Part 2:
Organizations, 2010) The remote site, located 30 miles from the World Trade
The primary site at the World Trade Center maintained local duplicate drives and
Commerzbank used:
2002) The facilities were physically connected via “Fibre Channel SAN” providing
a storage transfer rate of almost 1TB per second. (Parris, Who Survives
Disasters and Why, Part 2: Organizations, 2010) The remote site maintained
servers that “were members of the cluster” at the World Trade Center site. These
servers continued to serve using replicated “remote disks to the main site” after
the storage there failed. (Parris, Who Survives Disasters and Why, Part 2:
meant help was available" around the clock. (Parris, Who Survives Disasters and
with EMC and Compaq, later to become part of Hewlett-Packard (HP), ensured
they were on hand to assist with any services or equipment required to recover.
disaster recovery part of the business continuity plan worked. All critical data was
available, but it still took nearly four hours to resume normal business
operations. (Mears, Connor, & Martin, 2002) Therefore, they had failed to meet the
and required way too much human intervention.” Rye’s backup servers were not
proprietary operating systems. The virtualized Linux servers use “SUSE Linux
and the support model of the open source community” rather than the HP
residing "on the server itself—the disk, network interface card and storage
interface—give that server a fixed identity"; this also caused delays as the servers
2006) The new "system is designed for SAN connectivity and boot"; any
BladeFrame server can assume any identity at any time. That’s what we were
requirements for the data center have also decreased due to the virtualized
servers. The overall physical complexity has decreased as well, 140 servers
has reduced hardware trouble-shooting time. Configuring new servers now takes
The primary site and the backup site contain servers that are members of
synchronous replication. The Rye site is now an active part of daily processing
We live every day in the recovery portion of the DR mode. Having the
assets active takes the mystery out of continuity. We’re not praying that it
works, not planning that it works—we know it works because it’s an active
part of the process. (Egenera, 2006)
4.1.2.5. Discussion
testing and every staff member knew what to do. The failover processes were
concern for heroics to save the business. Post incident review showed some
company identified the problem, found a suitable solution, and implemented the
solution.
severely impacted New York on a larger scale, having only two clusters, both
located in New York, may not provide the seamless zero downtime the company
requires. This global company has the resources to commit to this more
comprehensive configuration. They also have facilities around the world to take
advantage of for co-location. The floor space use was reduced by 60% through
2006)
In this case, like that of Katrina, the disaster destroyed the hardware at the
site. There was little that preparedness could do to save the equipment.
However, unlike Katrina the recovery plan worked. Commerzbank had many
advantages in this case; New York’s infrastructure did not suffer the damage
New Orleans suffered. Commerzbank did not have to shoulder the burden of
rebuilding a city, only their primary location. Also, Commerzbank had the
complacent. Disasters of various scales happen on a daily basis; most are not
terribly severe and impact a small number of people. Failure to plan for a large-
scale, severe-impact event will increase the financial burden and stress of
incidents that do occur. If possible, defray the costs of maintaining hot sites by
planning, walk through as many scenarios as imaginable; this will help ensure
4.1.3. Conclusion
Commerzbank survived 9/11 with relative ease while many others suffered
unrecoverable losses. Many did not recover due to failure to plan and prepare for
understood the bank’s vulnerabilities and tolerances and made the investments
necessary to mitigate them. Past experience had taught the company how to
survive and high-level management and staff were trained to manage incidents.
This vigilance paid off in reduced downtime and minimized financial impact to the
company.
4.2. FirstEnergy
4.2.1. Background
has remained highly profitable despite a history of poor practices that put the
public at risk. One of the most notable resulted in a $5.45 million fine issued by
the Nuclear Regulatory Commission (NRC). This fine regarded “reactor pressure
vessel head degradation”. FirstEnergy was notified of the problem in 2002 by the
NRC. (Merschoff, 2005) The plant was operated for nearly two years after the
company was aware the equipment was unsafe to operate. (Merschoff, 2005)
FirstEnergy employees supplied the NRC with misinformation and at least two
(Minkel, 2008) News reports claimed this blackout was primarily due to a software
bug that stalled the utility’s control room alarm system for over an hour. The
operators were deprived of the alerts that would have caused them to take the
necessary actions to mitigate the grid shutdown/failures. The primary energy grid
monitoring server failed shortly after the failure of the alarm system, the backup
server took over and failed after a short period. The failure of the backup server
time to a crawl, which further delayed operators’ actions due to a refresh rate of
The operators’ actions were slowed while they waited for information and service
4.2.2.1. Ramifications
4.2.2.1.1. General
taking down over 263 plants (Associated Press, 2003), resulting in eight states and
parts of Canada being without power. (Barron J., 2003) This blackout affected
(Northeast Blackout of 2003) The estimated cost of this blackout was $7-10
4.2.2.1.2. FirstEnergy
values fell as investors were cautioned about the possibility of fines and
There were no fines assessed because at that time no regulatory entity had the
stockholders sued for losses due to negligence, and the company settled in July
of 2004, agreeing to pay $89.9 million to stockholders. (The New York Times
Company, 2004)
4.2.2.2. Response
4.2.2.2.1. MISO
overseeing power flow across the upper Midwest located in Carmel, Indiana.
(Associated Press, 2003) (Midwest ISO) The MISO state estimator tool
malfunctioned due to a power line break at 14:20 Eastern Daylight Time (EDT).
(U.S.-Canada Power System Outage Task Force) This was one of the two tools
MISO used, both of which were under development, to assess electric system
state and determine best course of action. (U.S.-Canada Power System Outage
Task Force) The state estimator (SE) mathematically processes raw data and
presents it in the electrical system model format. This information is then fed
into the real time contingency analysis (RTCA) tool to “evaluate the reliability of
the power system”. (U.S.-Canada Power System Outage Task Force, p. 48)
At 14:15 the SE tool produced a solution with a high degree of error. The
operator turned off the automated process that runs the SE every five minutes,
traced the error to an unlinked line, and manually corrected the linkage. The SE was
manually run and completed at 13:07. The operator left for lunch, forgetting to
re-enable the automated tool processing. This was discovered and re-enabled at
about 14:40. The previous linkage problem recurred and the tools failed to produce
reliable results. The tool was not successfully run again until "16:04 about two
minutes before the start of the cascade." (U.S.-Canada Power System Outage
Task Force, p. 48)
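One safeguard against this failure mode is a periodic job that warns when its automatic trigger has been left disabled for too long. The sketch below is a hypothetical Python illustration of that idea; it is not the MISO state estimator code, and the fifteen-minute warning threshold is an assumption.

```python
from datetime import datetime, timedelta

RUN_INTERVAL = timedelta(minutes=5)    # the SE is meant to run every five minutes
MAX_DISABLED = timedelta(minutes=15)   # assumed limit before someone is warned

class PeriodicJob:
    def __init__(self):
        self.auto_enabled = True
        self.disabled_since = None

    def disable_auto(self):
        """Operator turns off the automatic trigger, e.g. to troubleshoot."""
        self.auto_enabled = False
        self.disabled_since = datetime.now()

    def enable_auto(self):
        self.auto_enabled = True
        self.disabled_since = None

    def check(self):
        """Warn if the automatic trigger has been left off for too long."""
        if not self.auto_enabled and datetime.now() - self.disabled_since > MAX_DISABLED:
            print("WARNING: automatic runs still disabled; re-enable or acknowledge.")

job = PeriodicJob()
job.disable_auto()
job.disabled_since -= timedelta(hours=2)   # simulate the trigger being forgotten
job.check()                                # prints the warning
```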
4.2.2.2.2. FE
alarm function failed at 14:14 and began a cascading series of application and
server failures, by 14:54 all functionality on the primary and backup servers
failed. (U.S.-Canada Power System Outage Task Force) FE’s IT staff were
unaware of any problems until 14:20, when their monitoring system paged them
the primary control system server failed and the backup server took over
processing. The FE IT engineer was then paged by the monitoring system. (U.S.-
Outage Task Force) IT staff did not notify the operators of the problems nor did
they verify that functionality was restored with the EMS system operators. (U.S.-
Canada Power System Outage Task Force) The alarm system remained non-
functional. IT staff were notified of the alarm problem at 15:42 and they
discussed the “cold reboot” recommended during a support call with General
Electric (GE). The operators advised them not to perform the reboot because the
Task Force) Reboot attempts were made at 15:46 and 15:59 to correct the EMS
An American Electric Power (AEP) operator, who was still receiving good
information from FE’s EMS, called FE operators to report a line trip at 14:32.
Shortly thereafter operators from MISO, AEP, PJM Interconnection (PJM), and
Power System Outage Task Force) FE operators became aware that the EMS
systems had failed at 14:36, when an operator reporting for the next shift
reported the problem to the main control room. (U.S.-Canada Power System
Outage Task Force) The “links to remote sites were down as well.” (U.S.-Canada
Power System Outage Task Force, p. 54) The EMS failure resulted in the
contingency analysis after becoming aware that there were problems with the
EMS system. (U.S.-Canada Power System Outage Task Force) At 15:46 it was
too late for the operators to take action to prevent the blackout. (U.S.-Canada
FirstEnergy did have mitigation in place. There were several server nodes
that can host all functions with one server on “hot-standby” for backup with
established relationship with the EMS vendor GE, which provided support to the
IT staff when a new problem that the IT staff was not experienced with occurred.
There were also established mutual aid relationships with other utility operators.
The operators have the ability to monitor affiliated electric systems and request
tactic for electric companies. The purpose of the policy is to avoid lines that will
require immediate repair for safety reasons and will increase stress on the
to protect the reliable functioning of the electric system and its monitoring tools.
4.2.2.4.1. Regulatory
voluntary; they can now "impose fines of up to a million dollars a day". (Minkel,
2008) The Energy Policy Act of 2005 provided FERC authority to set and enforce
standards. (Minkel, 2008) FERC has also created a prototype real-time monitoring
York. (Minkel, 2008) More testing and infrastructure upgrades are required before
4.2.2.4.2. FirstEnergy
locations to provide resiliency. (Jesdanun, 2004) The new system has improved
alarm, diagnosis, and contingency analysis capabilities. (NASA, 2008) More visual
status information and cues are now provided. (NASA, 2008) FirstEnergy created
repair and maintenance downtimes between their operations and IT staffs” and
4.2.2.5. Discussion
electrical systems operators were “unaware” of the problem for over an hour, as
Task Force) However, there were repeated warnings from communications with
operators from various locations to indicate there was a problem with the EMS.
The operators were aware that there was a problem at 14:36, which provided the
operators and IT staff indicated that the operators were aware that the electrical
system state required action. Operators’ actions may have been hampered from
14:54 to 15:59 by EMS screen refresh rates of up to “59 seconds per screen.”
FE’s IT staff failed to notify the operators at 14:20, when they became
aware of EMS system failures. This could have provided the EMS operators with
16 minutes more to determine and execute the correct course of action. Also, the
FE EMS system was not configured to produce alerts when it fails, which is a
standard EMS feature. This would have provided another six minutes to the
operators. Given the many other warnings they received, it is hard to make a case that the
outcome hinged on a few minutes' notice. It is possible that operators were too dependent upon the
automated systems and overconfident that the situation would correct itself.
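The missing capability, an EMS that alerts when its own alarm function dies, amounts to a heartbeat watchdog. A minimal sketch follows, with invented names and thresholds rather than anything taken from FE's or GE's software.

```python
import time

class AlarmHeartbeat:
    """Watchdog that notices when the alarm subsystem stops reporting in."""

    def __init__(self, max_silence_seconds=60):
        self.max_silence = max_silence_seconds
        self.last_beat = time.time()

    def beat(self):
        """Called by the alarm process on every successful processing cycle."""
        self.last_beat = time.time()

    def check(self):
        """Called independently (e.g. by a cron job); pages operators if stale."""
        silent_for = time.time() - self.last_beat
        if silent_for > self.max_silence:
            print(f"PAGE OPERATORS: alarm subsystem silent for {silent_for:.0f}s")

watchdog = AlarmHeartbeat(max_silence_seconds=60)
watchdog.last_beat -= 300    # simulate five minutes without a heartbeat
watchdog.check()             # a check like this would have flagged a silent alarm failure
```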
The EMS system was “brought into service in 1995” and it had been
decided to replace the aging system "well before August 14th". (U.S.-Canada
Power System Outage Task Force, pp. 55-56) The NERC found FE in violation
Force) It was later determined that the software had a programming error that
vice president, Joseph Bucciero, “the software bug surfaced because of the
FirstEnergy power lines had already short-circuited." (Jesdanun, 2004) The three
lines were lost because FE failed to perform tree trimming according to internal
policy. The lines sagged, which occurs on hot days, and touched trees. (NASA,
2008)
4.2.3. Conclusion
This outage serves as an example that many small, mostly human errors
can result in disaster. A more resilient system requiring less human interaction to
perform emergency tasks could have prevented this outage. Poor communication
between IT and Operations staff was a large factor as was the operators’ failure
to heed the warning of other operators. The FirstEnergy operators were provided
with information outside of their EMS to understand that the EMS was likely
failure to be proactive. They did not trim trees, they did not replace their old EMS
system, they did not communicate appropriately with other energy operators, and
they did not train the employees how to act in a crisis situation when the EMS
could not be relied upon. There were contributing factors outside of FirstEnergy,
but if any one of the factors contributed by FirstEnergy were removed the wide
4.3. Tulane
4.3.1. Background
Business. (Gerace, Jean, & Krob) The University was established in 1834 as a
2008). A post Civil War endowment from Paul Tulane transformed the financially
struggling public university into the private university that survives today. (Alumni
focus and its contributions have shaped the city of New Orleans over the
New Orleans. (Alumni Affairs, Tulane University, 2008) Tulane is currently New
Since the university was established Tulane has weathered the Civil War
and many hurricanes. Tulane has adapted to the New Orleans hurricane prone
environment. Tulane has integrated buildings that can “withstand hurricane force
winds” into the campus landscape. (Alumni Affairs, Tulane University, 2008) Only
Katrina and the Civil War have prevented Tulane from offering instruction.
Two days before the beginning of Tulane's 2005 fall semester, Hurricane
Katrina devastated New Orleans. (Blackboard Inc., 2008) This was "the worst
natural disaster in the history of the U.S." (Cowen, 2005) The real damage to New
Orleans began hours after Katrina passed as the levee succumbed to the
4.3.2.1. Ramifications
However, Tulane’s data center is the focus of this case study therefore direct
impact on Tulane and the cascading effects will be discussed. The hurricane’s
property damages alone were in excess of $400 million. (Alumni Affairs, Tulane
University, 2008) Over a week after Katrina, “eighty percent of Tulane’s campus
The New Orleans campus was closed for the fall semester of 2005.
(Cowen, S., Messages for Students, 2005) Students were displaced and attended
students were asked to pay fees at the hosting University, Tulane promised to
address tuition issues as soon as they gained access to their student records.
Hurricane Katrina, they had no access to “computer records of any kind”. (Alumni
Affairs, Tulane University, 2008, p. 65) Tulane’s bank was not operational and
the administration did not know what funds were in the inaccessible account.
Look Back at a Disaster Plan: What Went Wrong and Right, 2005)
service critical equipment and retrieve important servers” which saved several
experiments. (Grose, Lord, & Shallcross, 2005) Over 150 research projects
suffered damage. (Alumni Affairs, Tulane University, 2008) Medical teams were
Hospital was closed for six months, but “was the first hospital to reopen in
Tulane reopened in January 2006 for the spring semester. The school lost
$125 million due to being closed for the fall semester of the 2005-2006 school
year. (Alumni Affairs, Tulane University, 2008) Prior to reopening, Tulane had to
streamline its academic programs. This made funding available for the daunting
task of rebuilding Tulane and New Orleans. New Orleans had no infrastructure to
support Tulane. Tulane provided housing, utilities, and schools to support Tulane
students and staff. (Alumni Affairs, Tulane University, 2008) Despite Tulane’s
amazing recovery, loss of tuition income and disaster related financial losses
4.3.2.2. Response
On Monday, August 29, 2005, Tulane University was flooded after the levees
have a few days' warning prior to the hurricane. On August 25th, Tulane's IT staff
initiated online data backups according to the data center disaster recovery plan.
(Lawson, 2005) August 28th, Tulane brought its information systems down.
(Lawson, 2005) Backup generators and supplies were placed into campus
buildings. (Krane, Kahn, Markert, Whelton, Traber, & Taylor, 2007) On the 30th
systems failed “with loss of e-mail systems and both cell and landline phones.
Center command post along with other essential staff during the Hurricane.
power to the Reily building. (Alumni Affairs, Tulane University, 2008) Thursday,
the staff was rescued by helicopter from the now flooded Tulane after several
Tulane’s top recovery priority was paying its employees. (Anthes, 2008)
This effort was complicated because payroll employees failed to take the payroll
printers and supplies as specified in the disaster plan. (Lawson, A Look Back at a
Disaster Plan: What Went Wrong and Right, 2005) Police escorted Tulane IT staff
to retrieve Tulane’s backup data and computers from their 14th floor offsite
completed “two days late” according to Tulane CIO John Lawson. (Lawson, A
Look Back at a Disaster Plan: What Went Wrong and Right, 2005) As of September
invited Tulane to resume operations at Baylor. However, this process did not go
(Lawson, A Look Back at a Disaster Plan: What Went Wrong and Right, 2005) This
was quickly corrected and Tulane used the redirected emergency site to
updates via its Web site.” (Schaffhauser, 2005) School of Medicine classes
updated in the days before Katrina hit. (Testa, 2006) The database records along
with the Baylor registration website and newly created paper files allowed Baylor
and Tulane to gather the information needed to resume classes. (Testa, 2006)
This resumption was particularly vital for seniors. Unfortunately, not all of the
College students of New Orleans were so lucky. About 100,000 were displaced,
Email “was the first system to be brought back online”. (McLennan, 2006)
Blackboard provided systems to allow Tulane and other affected Gulf Coast
Tulane’s own Blackboard system was quickly restored to allow retrieval of course
(Anthes, 2008) The staff was trained and comfortable enacting the disaster plan.
They knew the backups could be completed in 36 hours. (Lawson, A Look Back
at a Disaster Plan: What Went Wrong and Right, 2005) Offsite backups were
maintained on the 14th floor of a building in New Orleans. (Anthes, 2008) Tulane
What Went Wrong and Right, 2005) The remote-hosted emergency website for
Today the university has a disaster recovery plan including offsite backup
servers for websites, e-mail and other critical systems, which is updated yearly.
(Anthes, 2008) There are also documented protocols for recovery from a
disaster, which were missing during the recovery from Katrina. (Anthes, 2008)
The recovery plan has also been amended to cover more than hurricanes and IT
staff now participates in preparedness planning. (Anthes, 2008) (Gerace, Jean, &
Krob, 2007)
As of 2008, Tulane had a contract with SunGard mobile data center for
emergencies. (Anthes, 2008) Katrina's effect on the New Orleans backup data
center made it clear that they needed to maintain backups at a more distant
location; as a result, "backups are taken to Baton Rouge 3 times a week". (Anthes,
2008) Employees have been provided with USB storage devices to prepare
personal backups for emergencies. (Anthes, 2008) An alternate recovery site has
center at Tulane. (Lord, 2008) “Energy efficient systems were installed in the
simple e-mail” and emergency updates to the website can be published directly
the media to track potentially disastrous hurricanes, Tulane has enlisted a private
notebook computers” which can facilitate continuity during a disaster and the
university now has online classes. (Gerace, Jean, & Krob, 2007) (Lord, 2008)
4.3.2.5. Discussion
many things they did right and in the end they recovered. It is debatable if the
plan for offsite disaster recovery would have been worth the investment in
dollars. Itemized financial reports for Tulane were not available for review. It is
clear that the absence of an offsite recovery contract was a deliberate financial
decision. (Lawson, A Look Back at a Disaster Plan: What Went Wrong and Right,
2005)
prone area especially considering that the destruction of the levee was a known
risk. (Kantor, 2005) This decision also created additional stress for Tulane’s staff
and students. Tulane did an excellent job of recovering payroll to ensure their
staff was not without desperately needed financial resources. The medical
students were also well cared for thanks to the help of outside partnerships. The
continued medical program would not have been possible had there not been an
Unfortunately, the loss of Tulane’s data center made for a difficult fall 2005
semester for most students. They not only had to relocate, but were without
financial or academic records from Tulane. For those students the approximately
$300,000 per year expenditure would have provided some peace of mind.
(Lawson, A Look Back at a Disaster Plan: What Went Wrong and Right, 2005) As a
result of this as well as other adverse conditions at Tulane, many students did
not return. In 2008, enrollment at Tulane was down by 5,300 students from its
pre-Katrina numbers. (Lord, 2008) This resulted in financial distress for Tulane
and the closing of its engineering school and consolidation of other programs and
Katrina forced the University to close. This ensured not only the survival of
Tulane, but the revival of New Orleans as well. The medical students and
hospital provided much needed health care for New Orleans residents. Students
(Brown, 2008) The damage caused to Tulane and New Orleans was beyond the
recover from a complete loss of IT and infrastructure were proven to be the most
valuable in this case. No one institution was capable of recovering New Orleans,
4.3.3. Conclusion
Tulane has learned from Katrina how to protect the data that is the
lifeblood of the university. The aftermath of Katrina has also made clear that the
students are Tulane’s customers and they cannot survive without them. Further
providing for the communities they are a part of in times of disaster, this was true
provider and educator, Tulane has persevered and shored up its weaknesses and
4.4. Commonwealth of Virginia
4.4.1. Background
state agency charged to ensure the state’s information technology needs and the
This was to be the flagship partnership to show that the Public sector
However, "(d)elays, cost increases and poor service have dogged the state's
largest-ever outsourcing contract, the first of its kind in the country". (Schapiro &
Bacque, Agencies' computers still being restored, 2010) Virginia had entered the
contract with the expectation that the contract would provide modernized
services for the "same cost as maintaining their legacy services." (Stewart,
2006) At this point, the state no longer expects to see any cost savings under the
original contract period, but hopes that savings will be realized under an extended
Since the beginning of the contract with Northrop Grumman, the state of
Virginia has suffered two major outages. In addition, the state paid an additional
July 2009, is significantly behind schedule. There have been ongoing issues with
Until the latter part of March 2010, VITA could make changes to the
contract with Northrop Grumman without consulting with the General Assembly.
Investment Board (ITIB) is charged with oversight of VITA, but could not provide
full-time oversight. The members of ITIB attend meetings irregularly and lack the
eliminate the ITIB. VITA and the State CIO now report to the Office of the
Governor. The new structure became effective March 16, 2010 after passing
The state has been plagued with a litany of service failures throughout the
contract with Northrop Grumman. In 2009, prison phone service failed and was
Service was restored six and a half hours later following an escalation request
2009) Another service failure, noted in the JLARC 2009 report, left the Virginia
State Police without internet access for three days. (Kumar & Helderman,
2009) On June 20, 2007, the state of Virginia suffered a widespread outage. (VITA,
(News Report, 2010) Thirteen percent of the state's file servers were unavailable
during the outage. (Schapiro & Bacque, Agencies' computers still being restored,
state's data center near Richmond, which caused 228 storage servers to go
government business, 2010) The hardware that failed was one of the SAN's two
Technology, the outage was "unprecedented" based on the "uptime data" on the
EMC SAN hardware that caused the widespread failure. (News Report, 2010)
"Officials also said a failover wasn't triggered because too few servers were
4.4.2.1. Ramifications
of Motor Vehicles (DMV) was the most visibly impacted agency. Drivers were
not able to renew licenses at the DMV offices during the outage, forcing the DMV
to open on Sunday and work through Labor Day to clear the backlog of expired
2010) Some drivers were ticketed for expired licenses before law enforcement
2010) According to Virginia State Police, while "they will not cite drivers whose
licenses expired during the blackout", unfortunately those that received tickets
must "go through the court system" to request relief. (Kravitz & Kumar, Virginia
addition, drivers who renewed licenses the day of the blackout will need to visit
the DMV again because the data and pictures from the transactions that day
were lost. (Schapiro & Bacque, Northrop Grumman regrets computer outage,
2010) This also increases the likelihood that some of the licenses and IDs
The DMV was not the only agency negatively impacted by the SAN
recipients will receive benefit checks up to two days late. Employees at this
agency also worked overtime to reduce and eliminate delays where possible.
(Schapiro & Bacque, Agencies' computers still being restored, 2010) Internet
services used by citizens to make child support and tax payments were
outage, 2010) "At the state Department of Taxation, taxpayers could not file
(Schapiro & Bacque, Agencies' computers still being restored, 2010) Three days
after the outage began, "(f)our agencies continue(d) to have 'operational issues'";
these agencies included the departments of Taxation and Motor Vehicles. Many
other agencies continued to suffer negative effects from the outage. (Schapiro &
4.4.2.2. Response
message. (Wikan, 2010) The cause of the error message was determined to be
that "one of the two memory boards on the machine needed replacement." (Wikan,
2010) "A few hours later, a technician replaced the board." (Wikan, 2010) Shortly
after the board was replaced, the storage area network (SAN) failed. It was later
discovered that the wrong board might have been replaced. (Wikan, 2010) "VITA
and Northrop Grumman activated the rapid response team and began work with
Work continued through the night to restore services but failed to
restore data access to affected servers. (Wikan, 2010) (VITA) Thursday, the SAN
The storage provider, EMC, determined that the best course of action is to
perform an extensive maintenance and repair process. VITA and Northrop
Grumman, in consultation, have determined this is the best way to
proceed. (VITA)
The 24 affected agencies were notified prior to the SAN shutdown to allow them
to take appropriate action. (VITA) SAN service was restored at "2:30 a.m. Aug.
27." (Wikan, 2010) Over half of the attached servers were operational Friday
morning. (VITA) VITA began working with the operational customers to confirm
VITA continued data restorations over the weekend; the DMV restore took "about
affected agencies were up and running". (VITA) However, three key agencies still
measures in place. Not only was there a "fault-tolerant" SAN, but also there
were magnetic tape backups and the staff had just performed recovery exercise
testing. The established relationship with the hardware vendor EMC brought
additional expertise to resolve this SAN outage. VITA also has two data centers,
Examination of the documents available on the VITA web site would imply
that every recommended best practice is being implemented and executed. The
SAN hardware used is best in class and has excellent reliability. VITA also had a
rapid response team whose mission was to reach incident resolution rapidly.
(Nixon, 2010) Yet, an outage in one system had serious negative impact on
several agencies and more importantly the citizens of Virginia for more than one
week.
This incident is one of many that the State of Virginia has suffered since
the beginning of the contract with Northrop Grumman. Professing use of industry
standards and best practices does not result in a reliable, stable
cyberinfrastructure. In this case, there was still a single point of failure that resulted in
avoid outages such as that which occurred in late August. Virginia made the
been ordered. Agilisys Inc. was chosen to conduct a 10-12 week audit beginning
4.4.2.5. Discussion
exactly what happened and extrapolate what should have been done. VITA
2010) The exercise involved restoring service after losing a data center.
Provided that the exercise was adequately rigorous, performing a restore for an
to be tape restoration and data validation. (Wikan, 2010) More emphasis should
Incidents resulting in partial data loss or corruption are far more likely than loss of
an entire data center. Activities that improve restoration time for data recovery
and technology enhancements that might improve recovery time. In this case,
the data recovery process from tape left the DMV unable to issue or update
driver’s licenses or IDs for a week. A data restoration exercise might have
revealed this weakness and another solution might have been put in place to
daily backups for agencies like the DMV, the Department of Taxation, and Child
Support. Loss of payment records for the latter two agencies would cause major
inconveniences and bad press. Loss of four days' identification data for licenses
and IDs is inexcusable. The root of this decision likely lies in the bottom line,
storage. This is advisable for all high-availability databases and might have
avoided the data loss and corruption that occurred. One possible mechanism is
confirmation that the data was written to the SAN. The local copy would then be
held until the backup copy is confirmed as processed. This would entail
local daily backups for daily transactions is also an advisable practice to avoid
loss of records.
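As a rough sketch of the confirmation mechanism described above (hypothetical, and not VITA's actual implementation), a writer could keep a local copy of each record queued until the SAN acknowledges the write:

```python
import queue

class ConfirmedWriter:
    """Hold a local copy of each record until the SAN confirms the write.

    Hypothetical illustration: san_write() stands in for whatever call the real
    storage layer exposes; here it simply reports the acknowledgement we feed it.
    """

    def __init__(self):
        self.pending = queue.Queue()   # local copies awaiting confirmation

    def san_write(self, record, acknowledged):
        # Placeholder for the real storage call; True means the SAN confirmed.
        return acknowledged

    def write(self, record, acknowledged=True):
        self.pending.put(record)                   # keep the local copy first
        if self.san_write(record, acknowledged):   # release it only on confirmation
            self.pending.get()
        # unconfirmed records stay queued for retry or a local backup run

writer = ConfirmedWriter()
writer.write({"license_id": "A123"}, acknowledged=True)
writer.write({"license_id": "B456"}, acknowledged=False)   # simulated lost write
print("records still held locally:", writer.pending.qsize())  # -> 1
```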
(BCV), essentially a regularly scheduled copy between the two SANs. This
creates mirrored storage systems, with hot scheduled copies occurring every minute
for example, using technology such as Oracle Data Guard or SQL Server
mirroring and log shipping. Most database engines have a way to replicate
physical hardware in order to eliminate data loss. The use of both options would
provide layered protection.
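To make the database-level option concrete, the sketch below simulates the log-shipping idea with plain files: transaction-log segments are copied to a standby location and replayed in order. It is a conceptual Python illustration only, not Oracle Data Guard or SQL Server syntax, and all paths and records are invented.

```python
import json, pathlib, shutil, tempfile

BASE = pathlib.Path(tempfile.mkdtemp())   # scratch area for the example
PRIMARY_LOGS = BASE / "primary_logs"      # hypothetical locations
STANDBY_LOGS = BASE / "standby_logs"
STANDBY_DB = {}                           # stand-in for the standby database

def write_transaction(seq, record):
    """Primary side: append each transaction to a numbered log segment."""
    PRIMARY_LOGS.mkdir(exist_ok=True)
    (PRIMARY_LOGS / f"{seq:06d}.log").write_text(json.dumps(record))

def ship_and_apply():
    """Standby side: copy any new log segments and replay them in order."""
    STANDBY_LOGS.mkdir(exist_ok=True)
    for seg in sorted(PRIMARY_LOGS.glob("*.log")):
        dest = STANDBY_LOGS / seg.name
        if not dest.exists():
            shutil.copy(seg, dest)                            # "ship" the log
            STANDBY_DB.update(json.loads(dest.read_text()))   # "restore" it

write_transaction(1, {"license": "A123", "status": "renewed"})
write_transaction(2, {"license": "B456", "status": "issued"})
ship_and_apply()
print(STANDBY_DB)   # the standby now reflects both transactions
```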
The fact that "too few servers were involved" to trigger failover is baffling.
(News Report, 2010) Any fault with the potential to incur the impact experienced
by this outage should initiate a failover. The IT staff should have initiated a
manual failover prior to making the SAN repair for the initial hardware failure.
This suggestion assumes that the failover would have eliminated the
dependence on the faulty SAN. In addition, if the SAN was still operating, why did
the technician perform the repair during business hours? The technician should
have created a cold backup to tape prior to doing the off-hours repair. The
technician should have been aware the backup had not occurred for four days
and understood the potential data loss that could result. (Availability Digest,
2010)
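The precautions suggested above reduce to a short pre-maintenance checklist. The sketch below is a hypothetical illustration; the one-day backup-age threshold and the function names are assumptions, not VITA procedure.

```python
from datetime import datetime, timedelta

MAX_BACKUP_AGE = timedelta(days=1)   # assumed policy: a backup less than a day old

def safe_to_repair(last_backup, failover_ready, business_hours):
    """Return (ok, reasons): refuse repair until the basic precautions are met."""
    reasons = []
    if datetime.now() - last_backup > MAX_BACKUP_AGE:
        reasons.append("take a cold backup first; last backup is too old")
    if not failover_ready:
        reasons.append("fail over to the standby SAN before touching hardware")
    if business_hours:
        reasons.append("defer non-urgent repair to an off-hours window")
    return (not reasons, reasons)

ok, reasons = safe_to_repair(
    last_backup=datetime.now() - timedelta(days=4),   # mirrors the four-day-old backup
    failover_ready=False,
    business_hours=True,
)
print("proceed" if ok else "hold:", reasons)
```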
VITA's staff may need additional training to help them identify situations
that call for a manual backup as well as situations that can wait for after-
hours repair. It is likely that required change management processes were not
principles, the SAN repair would have been subject to a change management
the problem, the proposed fix, and the steps to be taken. Affected customers,
process owners, and the change management board (or equivalent) should have
been notified. Either there was no change request, no one reviewed the change
request, the request was not understood, or the proposed steps were not
executed.
Monitoring tools may also have played a role in this outage. The IT staff
either ignored alerts, did not understand them, or had the monitoring tools
incorrectly configured. Monitoring alerts should have notified the staff of the
problem, identified which SAN controller was having the problem, and alerted
staff of failed write attempts to the networked storage. Additional training could
have ensured properly implemented monitoring tools and the IT staff's ability to
respond to the alerts appropriately.
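A minimal sketch of such a check, assuming hypothetical status feeds for the SAN controllers and a count of failed writes (none of these names come from VITA's or Northrop Grumman's tooling):

```python
def check_san(controllers, failed_writes, max_failed_writes=0):
    """Return alert messages naming the failing controller and any failed writes."""
    alerts = []
    for name, status in controllers.items():
        if status != "ok":
            alerts.append(f"SAN controller {name} reports '{status}'")
    if failed_writes > max_failed_writes:
        alerts.append(f"{failed_writes} write attempts to networked storage failed")
    return alerts

# Simulated inputs standing in for a real monitoring feed.
alerts = check_san({"controller-a": "memory board fault", "controller-b": "ok"},
                   failed_writes=12)
for a in alerts:
    print("ALERT:", a)   # a real system would page on-call staff here
```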
important to balance the cost savings with the risk being taken; the savings may
not justify the risk for many governmental organizations. Strategic long term
requirement. Perhaps the DMV data is not a place to cut corners. Distributed
servers were implemented to avoid single points of failure, and while even a
distributed system is not free from failure, the possibility of widespread failure
may claim the SAN outage is unprecedented, but they do not claim this is the first
quality service due to poor strategic planning. Northrop Grumman has enough
services for the state of Virginia that allow a single hardware failure to cause
Ultimately, the outage was a result of human error. Human error will occur,
but sound processes should keep human error from escalating into a fiasco. Training would help with human error.
Training is an ongoing process that must be maintained along with the process of
constant improvement. Northrop Grumman, EMC, and VITA share the blame for
of uncertainty.
that specializes in IT should result in lower cost due to bulk discounts, enhanced
services, and access to high-quality IT staff. The total costs of outsourcing IT
should go down over time due to falling hardware costs. (Lee) These expected
customer, in this case the citizens of Virginia. The quality of the partnership
should be reviewed using the dimensions of fitness for use and reliability. (Lee &
Kim, 1999) The events of the last few years have shown that the service that
Poorly strategized and executed services have not only cost Virginians,
but have been a source of inconvenience and delay. Some Virginians had to go
to court to combat expired-license tickets; those who cannot find the time to do
this may also face increases in insurance premiums. These issues seem small
DMV just prior to the outage. These licenses are legal and nearly untraceable
and could fetch high prices on the black market. Also, consider the safety of
those working in prisons without phone service for hours. The phone outage
The 10-year contract with Northrop Grumman has left little possibility to
exit the contract and request new outsourcing bids. Virginia recently reviewed
the partnership and it was decided that it was too costly to exit the contract.
Northrop Grumman argued that Virginia did not provide them with adequate
access to information that would have allowed them to create a realistic refresh
schedule and budget. Virginia denied this, but agreed to extend the project
timeline and paid an additional $236 million to cover the hardware refresh.
(Schapiro & Bacque, Agencies' computers still being restored, 2010) This was
done in part for political reasons. Northrop Grumman agreed to move their
headquarters to Virginia. (Squires, 2010) Virginia hopes to create new jobs and
get better service. Meanwhile, Northrop Grumman will pay out approximately
infancy. There are few who understand both IT and law well enough to write or
defend the contract properly. A less lengthy agreement may have been best for
embarking on the hardware refresh. Perhaps the hardware refresh should have
least, an exit clause that would allow Virginia to exit the contract without risking
the waste of millions of public funds would be advisable. Public safety and
security are too important to place in the hands of a single provider without any
appears to have too much wiggle room to make Northrop Grumman accountable
for failures.
actions to see that this never happens again. Further, that Northrop Grumman is
held accountable in a manner that motivates them to stop ignoring issues raised
provide high quality services. Northrop Grumman is responsible for their vendors,
services and well-trained staff. Taxpayers should no longer pay for the
The best protection for Virginians may lie in contract law. Future
outsourcing contracts should not favor the vendor and exploit the state. Referring
situation cannot be a true partnership because business motives are not shared.
(Lee) The outsourcing contract should have clearly defined service level
agreements and failure to meet these expectations should result in equally clear
penalties. These penalties should have enough financial impact to ensure the vendor does not determine that paying the penalty fees makes better financial sense than providing the contracted services. The contract between Virginia and
Northrop Grumman has exit penalties that are too expensive to be a feasible option. Without contracted services that provide value to the citizens of Virginia, the contract cannot be considered a success. It appears that a vendor contract provided at least the basis for the outsourcing contract. The use of vendor
contracts "even as a starting point" is highly inadvisable because the contract will
favor the vendor. (Lee, p. 13) This problem is illustrated in the case of the Virginia contract.
Once the contract is in effect, it must be strictly managed by an auditing team charged with conducting ongoing service reviews of the vendor. This oversight will unfortunately require additional expense, but auditing activities will ensure that the outsourcing organization realizes the expected value of the contract.
4.4.3. Conclusion
Virginia's August 2010 outage provides a case study illustrating the risks of outsourcing critical IT services. The contract has not provided effective recourse to enforce its terms. VITA also failed to manage the contract closely enough to ensure the vendor delivers quality services that meet business requirements. This means investing in auditing to ensure that the vendor is taking appropriate
CHAPTER 5. ANALYSIS
various mitigation techniques. Tulane's investment in backup tapes paid off, but the investment in an offsite data center did not. The factor that contributed most
vendors. This type of relationship has proven very useful in sectors such as
5.1.1. Before-Planning
Before-phase planning begins with determining the maximum tolerable period of disruption (MTPOD), recovery time objectives (RTO), and recovery point objectives (RPO). MTPOD relates to how long the organization can be "down" before its viability is damaged. The case studies provide an array of tolerances, as shown in Table 5.1, which inform the choice of mitigation techniques to implement.
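As an illustration only (this sketch is not drawn from Table 5.1; the organization, hour values, and candidate techniques below are hypothetical), the three objectives can be treated as screening constraints that any candidate mitigation technique must satisfy:

# Illustrative Python sketch: screening candidate mitigation techniques
# against an organization's MTPOD, RTO, and RPO. All values are hypothetical.
from dataclasses import dataclass

@dataclass
class Objectives:
    mtpod_hours: float  # maximum tolerable period of disruption
    rto_hours: float    # target time to restore service
    rpo_hours: float    # tolerable window of data loss

@dataclass
class Technique:
    name: str
    recovery_hours: float   # expected time to restore service
    data_loss_hours: float  # expected window of lost data

def acceptable(t: Technique, o: Objectives) -> bool:
    # A technique qualifies only if it restores service within the RTO
    # (and therefore within the MTPOD) and loses no more data than the RPO allows.
    return (t.recovery_hours <= min(o.rto_hours, o.mtpod_hours)
            and t.data_loss_hours <= o.rpo_hours)

objectives = Objectives(mtpod_hours=24, rto_hours=8, rpo_hours=1)  # hypothetical
candidates = [
    Technique("nightly tape backup restored at the primary site", 48, 24),
    Technique("hot site with near-real-time replication", 2, 0.25),
]
for c in candidates:
    print(c.name, "meets objectives" if acceptable(c, objectives) else "fails objectives")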
Table 5.1 above reflects estimated MTPOD, RTO, and RPO for each
organization based on artifacts included in each case study. These estimates are discussed below. Commerzbank's MTPOD is estimated at less than one week. One week was chosen as the point at which the viability of the bank would be damaged; at that point customers would switch to a competitor. Any outage would be costly for Commerzbank, but a weeklong outage would damage the bank's reputation and viability. Customers may tolerate brief outages, but when an outage impacts their ability to be profitable, they must look elsewhere. Commerzbank operates in a highly competitive sector and would therefore have difficulty recovering from
customer loss.
For a utility such as FirstEnergy, any outage will immediately inconvenience the customer base. Also, outages result in lost revenue because electricity cannot be stored for later use. Extended outages strain other providers and potentially result in cascading critical service outages. There are now mandatory guidelines as well, and failure to meet these guidelines can result in fines.
FirstEnergy is investor owned; therefore, outages would reduce the value of company shares. Investors sued FirstEnergy for lost revenue in the past and could potentially do so again. All of these factors were included in the ½ hour estimate of MTPOD for
FirstEnergy.
Tulane University has weathered hurricane seasons for more than a century. The case study artifacts revealed a repeating theme in this case: hurricanes had become routine. The general thought was to send everyone away for a few days, return and clean up when the storm passes, and get back to business as usual. This reveals that outages of "a few days" had no real impact on the organization. However, a longer outage would damage operations, most notably the ability to provide the university's primary service, education. The university hospital was not included in this estimate, only the university itself. Including the hospital would reduce the MTPOD to hours or less due to possible loss of life.
Loss of life will not necessarily result in irrevocable viability damage to the
organization, but must be avoided at all costs and therefore would be heavily
weighted.
Virginia's estimate is more complex. Some services, such as 911, are critical infrastructure and cannot be down without compromising public safety. Other services may suffer very little during an extended outage. Obviously, prison guards should never be without phone service. However, do any of these factors really damage the viability of the state? It would be very hard to argue that they do. This estimate comes down to cost and public impact. Public impact was weighted most heavily. Also, the state's IT was outsourced; therefore, impact to the viability of Northrop Grumman must be included. To date there has been little impact on Northrop
Grumman, but possible contractual changes made after the conclusion of the
third party investigation may have greater impact. Recovery time objectives were
data including transactional data. FirstEnergy stands out in this group with a not
applicable (N/A) rating on Table 5.1. This is based on the assumption that for FirstEnergy, historical data is important to prediction and future planning as well as tool development, but loss of this data would have little operational impact, as other data sources could be used, as noted above. Commerzbank's tolerances and objectives make it apparent that they must avoid virtually all
downtime. Expenditures in IT to ensure this are warranted and practical for their
organization. They can afford to make the necessary investments, and downtime
is far too costly. The case study artifacts reveal that Commerzbank is actively
Tulane is a good example of an organization with all the right pieces that
failed due to poor placement. Tulane had tape-based backup and recovery, which were appropriate for their budget and MTPOD. The backup data center was new and not fully complete, but location was the problem. It was near enough to be affected by the same storm. They were lucky the building's upper floors, where the tapes were located, were not flooded, allowing retrieval of the backup tapes. This site at the time would
have been a warm site at best; strategic placement would have made this site a
major asset.
An emergency operations center (EOC) and backup data center could minimally have provided
planning should include a backup data center that allows virtual operations where
very small with very few employees. Payroll and billing functions would be very simple and probably paper-based. Even in these circumstances, multiple copies of critical records should be maintained to avoid lost revenue or liability issues. Organizations in this category are not
Staff training levels are more apparent in some of the cases than others.
For example, FirstEnergy staff was inadequately trained and there was poor communication. Little direct evidence of Commerzbank's quality of staff training was available. However, the fact that
employees began assembling at the backup site, in the midst of the chaos, speaks to the quality of their training.
Again, these are the two most extreme examples, but training is the
difference between staff that fail to perform and those that coolly navigate to safety from just a few hundred feet away from the largest terrorist attack in U.S. history. The stress levels between the two staffs during the first phases are hardly comparable. The well-trained staff under extreme stress performs very well with very little warning. The poorly trained staff
failed to act despite many warnings and hours to act. FirstEnergy staff
correlation with the success of continuity and recovery efforts. FirstEnergy and Virginia stand out; both organizations could have completely averted disaster had the staff followed
procedures.
Forty minutes before the outage, the operators knew the monitoring equipment
was not working and still failed to take corrective action. Established internal procedures were not followed. FirstEnergy's IT support staff was aware of the problems with the EMS, but did not alert the operators to the issue. This communication was not required at the time of the studied incident, but was later addressed. However, the primary cause of the outage was failure to follow procedure. As a result, some areas were without power for up to a week.
In the Virginia case, it appears that ITIL standard practices were not followed. Artifacts indicate the use of ITIL for this organization; therefore, ITIL adherence was used as the basis for evaluation.
ITIL specifies standards for communication during incidents and also focuses on
review at the completion of this study. After the independent review is complete,
However, it is not disputable that a minor hardware problem, which was not itself an outage, was handled inappropriately. This resulted in a weeklong outage for many state agencies. Tulane, by contrast, had a plan and a staff that was trained and comfortable with the procedures. They also had
the luxury of knowing days ahead that the hurricane was coming. The execution
went according to plan for the most part. There were critical parts of the plan left
unexecuted; the payroll printer and related materials were not taken to safety.
This failure further complicated the task of issuing payroll and likely added to the cost of recovery. Commerzbank's preparation prevented potentially billions in lost revenue. The transactions system never went down during the
events of 9/11. Despite the loss of primary facilities and unforeseen technical issues, the bank was fully operational within hours. Commerzbank serves as a model for business continuity and disaster recovery planning in the financial sector.
All of the organizations included in the study had well-established chain-of-command communication structures. Some were more effective than others for a
disruptions due to the magnitude of the disasters and the resulting damage to
difficulties. The impact of Katrina was so severe that the impact to the
infrastructure of New Orleans was prolonged and the duration of the disaster
Tulane's critical staff members now carry cell phones from more than one carrier. Tulane also maintains a computer incident response plan, which follows many principles from the National Incident Management System. (Tulane University, 2009) This plan defines roles, incident phases, and incident levels, which delineate which roles are activated. (Tulane University, 2009)
contacts listed are cumulative; for example, if a level 3 incident were to occur, the
staff, and process owners would be contacted. Each role activated would have a
responsibilities checklist to be used for specific level incidents. NASCIO has a
communicate as the voluntary industry standards of the time dictated. There was
no apparent deviation from the chain of command in the case of the Virginia
outage. Though it would be safe to speculate that the independent review will
was apparent in all of the case studies. Each continuity or recovery effort was
additional resources was integral to recovery success and reduced the duration
of the outage in most of the cases. The assistance Baylor provided Tulane was
vital to the future viability of Tulane. The relationships utilized are represented in
[Table of aid relationships utilized in the case studies; one example pairing is Commerzbank – EMC.]
mandatory third party incident review to determine what steps were necessary to
prevent future incidents. Commerzbank and Tulane were unhappy with the response and recovery provisions in place at the time of the incident, and have since made improvements.
5.1.3.1.1. Downtime
There is no simple way to determine the cost of downtime for an organization, nor is there a simple way to determine the cost of recovery. These figures vary based on the sector and other organizational factors. Organizations that have experienced disaster recovery events have not made the financial ramifications available to the public, including those in this study. Further, most literature and tools available to aid in determining these costs and return on investment (ROI) are provided by commercial entities that are attempting to sell related products and services, which gives the figures questionable validity.
For the purpose of this study, a combination of recent studies is used to estimate these costs. One 2010 study claims "the average North American organization loses over $150,000 a year" to downtime. Another source reports a median downtime cost of $3,000 per day for small businesses and $23,000 per day for medium-size businesses. Based on these figures it would not be
cost effective for small organizations to invest in fully redundant systems. However, the losses are still substantial, and investment in daily data backup would be warranted despite the increased costs. Some sectors, such as utility, financial, and parts of the public sector, have regulatory standards that must be met, and downtime could result in fines as well as lost revenue.
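As a rough illustration of how quickly these medians accumulate, the figures cited above can be annualized; the number of downtime days per year used below is a hypothetical assumption, not a value from the cited studies:

# Rough annualized downtime loss using the medians cited above
# ($3,000/day for small and $23,000/day for medium-size businesses).
# The assumed five downtime days per year is a hypothetical input.
median_daily_cost = {"small": 3_000, "medium": 23_000}
assumed_downtime_days_per_year = 5  # hypothetical assumption

for size, daily_cost in median_daily_cost.items():
    annual_loss = daily_cost * assumed_downtime_days_per_year
    print(f"{size} business: about ${annual_loss:,} lost per year")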
As a result, recovery times have increased by 1.5 hours. (Dines, 2011) The average application and data classifications reported are shown in Figure 5.4 below. As organizations classify more applications and data as critical, the cost of resiliency rises. As tolerance for
downtime decreases, the cost of resiliency also rises. Economic realities dictate
that most organizations cannot maintain redundancy of all applications and data.
There are many other ways to compute how much to invest in IT business continuity; most are far more complex. A 2010 Forrester study found that respondents devote roughly six percent of IT operating and capital budgets to BC/DR. (Balaouras, 2010) It is important to note that many functions fall under the umbrella of IT operational resiliency; organizations must balance this spending against risk to ensure profitability. These equations are outside of the scope of this qualitative study. However, this study will use a 2010 Forrester market study for
and failed over to an alternate site in the past five years”, this yields a 4.8 percent
(Dines, 2011) The average cost of downtime per hour was $145,000 and
Multiplying the average cost per hour by the average recovery time yields a disaster cost of $536,500 over a five-year time period. Multiplying this by the risk probability of 4.8 percent yields $25,752. These figures provide a range of $25,752 to $536,500. The average of the two is $281,126; this figure, based on the Forrester data, serves as the estimated yearly investment. The overall 5-year budget would be $1,405,630. The first year would likely be dedicated to reviewing organizational needs and looking for cost-effective ways
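The estimate above can be reproduced step by step. The sketch below simply restates the Forrester-based figures already given in the text ($145,000 per hour, a 4.8 percent risk probability, and a $536,500 per-disaster cost, which together imply roughly 3.7 hours of recovery time); it is a worked example, not an additional data source:

# Worked reproduction of the Forrester-based budget estimate in the text.
# All dollar figures appear in the passage above; the ~3.7-hour recovery
# time is implied by $536,500 / $145,000 per hour.
cost_per_hour = 145_000    # average cost of downtime per hour
disaster_cost = 536_500    # cost of a single disaster
risk_probability = 0.048   # 4.8 percent probability of declaring a disaster
years = 5

expected_loss = disaster_cost * risk_probability          # $25,752
yearly_investment = (expected_loss + disaster_cost) / 2   # $281,126
five_year_budget = yearly_investment * years              # $1,405,630
implied_recovery_hours = disaster_cost / cost_per_hour    # about 3.7 hours

print(f"expected loss: ${expected_loss:,.0f}")
print(f"yearly investment: ${yearly_investment:,.0f}")
print(f"five-year budget: ${five_year_budget:,.0f}")
print(f"implied recovery time: {implied_recovery_hours:.1f} hours")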
5.1.3.2. Findings
When redundant systems are integrated into daily processing, there is less need for human intervention, thus requiring less manpower to recover. This also allows response and recovery to begin immediately. In life-threatening situations, staff can focus on evacuation. Possible liability issues related to both staff and external stakeholders can be reduced by removing the question of due diligence. Another advantage is that testing is far less disruptive because the
Integrating redundant systems into daily processing does not mean that all systems must be fully duplicated. Tiering applications and data can reduce costs. For example, in the case of FirstEnergy, access to past data is not business critical, so investment in recovery of that data can be reduced. Lower-cost, tape-based storage and recovery methods are fine. However, availability of real-time operations applications and data is critical to the mission and worth the money
spent. A raw order of magnitude estimate for this would place such a system in
Gartner reported the cost of a tier IV data center to be about $3,450 per square foot, or $34.5 million for a 10,000-square-foot data center. (Cappuccio, 2010) A tier IV data center, according to Gartner, would provide less than a half hour of downtime a year. (Cappuccio, 2010) However, the risk and
outage costs are high enough to justify such an investment. The losses of the
avoid outages.
process. The Virginia case also provides an example of the hazards of risk transference; outsourced services must be closely monitored. The outsourcing party must ensure the power to enforce meaningful penalties as well as regular failover testing and plan updates. Disaster recovery and business continuity
CHAPTER 6. CONCLUSION
The research question this study endeavored to answer is “What are best
practices before, during, and after disaster recovery execution?” The multiple
case study best practice analysis indicates that disaster recovery is one part of a larger, iterative business continuity process. This process is broken down into three distinct phases: before, during, and after disaster recovery execution. Strategic planning occurs during the before phase. This planning includes determining the MTPOD, RTO, and RPO to help select appropriate mitigation techniques. Training and testing occur during the before phase as well. Best practice during the disaster recovery execution phase involves following established procedures; communication and utilization of aid relationships are important elements of the during phase. In the after (post-recovery) phase, best practice involves reviewing the situation and response to identify areas that need improvement. The after phase will not only help plan future mitigation but also identify supporting government policy needs in critical infrastructure sectors. The iterative cycle then begins again in the before phase.
The purpose of this research was "to bridge the gap of unmet cyberinfrastructure resiliency needs." An assumption was made that the high cost of implementation was the most significant barrier. While this may be true,
surprisingly, the two most avoidable disasters did not occur because of any direct
lack of funds. Virginia had already allocated the funds, and the company it outsourced to is a multibillion-dollar corporation. FirstEnergy is a large utility company and was a Fortune 500 company prior to the 2003 Northeast Blackout. In
these two cases it is arguable that lack of management oversight and urgency
was the motivation for any lack of funds allocated. Both organizations understood
and implemented backup equipment, but failed to ensure that all mitigation
measures were followed. Commerzbank and Tulane are veterans in dealing with disasters. Some disasters are caused by nature, others by human error; both are illustrated in this study. Required preparations are easy to put off. Car insurance, home insurance, and retirement savings are all forgone by most unless it is required by law. The return on investment in preparedness is difficult to grasp. Add to this the tendency to disregard possible calamity, and the result is chronic underinvestment in preparedness.
This study has revealed areas in need of further research. Two are well-known issues with ongoing research. These areas are educating a workforce skilled in security and resiliency and eliminating dependencies upon third parties for power. The areas of hydrogen fuel cells and solar power continue to leap forward and may provide the power independence needed. A third area is quantifying the probability of disaster and the cost of such events. Tables based on industry, size, and location would allow organizations to assess these costs in a meaningful manner.
The current, mostly unregulated, IT climate values fast over safe. Companies must move fast to push out the next new product. Little time is spent focused on ensuring security and resiliency. This will likely continue until minimum regulations are in place. These policies and the ability to enforce them will be very helpful to organizations that want to be secure and resilient, but are struggling with vendors. Related to this is IT contract law; this field desperately needs further development.
BIBLIOGRAPHY
Associated Press. (06 20-1). FirstEnergy to pay $28M fine, saying workers hid
damage. Retrieved 11 5-1 from USA Today:
http://www.usatoday.com/news/nation/2006-01-20-nuke-plant-fine_x.htm
Associated Press. (03 19-11). Investigators pin origin of Aug 2003 blackout on
FirstEnergy failures . Retrieved 11 6-1 from Windcor Power Systems.
Availability Digest. (2010 10). The State of Virginia – Down for Days. Retrieved
2010 8-11 from www.availabilitydigest.com:
http://www.availabilitydigest.com/public_articles/0510/virginia.pdf
Balaouras, S. (2010 2-9). Business Continuity And Disaster Recovery Are Top IT
Priorities For 2010 And 2011 Six Percent Of IT Operating And Capital
Budgets Goes To BC/DR. Retrieved 2011 7-2 from Forrester.com:
http://www.forrester.com/rb/Research/business_continuity_and_disaster_r
ecovery_are_top/q/id/57818/t/2
Barovik, H., Bland, E., Nugent, B., Van Dyk, D., & Winters, R. (2001 26-11). For
The Record Nov. 26, 2001. Retrieved 11 13-1 from Time:
http://www.time.com/time/magazine/article/0,9171,1001334,00.html
Barron, J. (2003 15-8). Power Surge Blacks Out Northeast. Retrieved 2009 2-11
from New York Times:
http://www.nytimes.com/2003/08/15/nyregion/15POWE.html
Blackboard Inc. (2008 24-10). Blackboard & Tulane University. Retrieved 10 27-
12 from Blackboard:
http://www.blackboard.com/CMSPages/GetFile.aspx?guid=39a0b112-
221d-4d04-be80-f2024d16943a
Brown, K. (2008 1-2). House No. 3 Rises for URBANbuild. Retrieved 2011 2-1
from Tulane University New Wave:
http://tulane.edu/news/newwave/020108_urbanbuild.cfm
Cappuccio, D. (2010 17-3). Extend the Life of Your Data Center, While Lowering
Costs. Retrieved 2011 28-1 from Gartner:
http://www.gartner.com/it/content/1304100/1304113/march_18_extend_lif
e_of_data_center_dcappuccio.pdf
Caralli, R., Allen, J., Curtis, P., White, D., & Young, L. (2010 5). CERT®
Resilience Management Model, Version 1.0 Process Areas, Generic
Goals and Practices, and Glossary. Hanscom AFB, MA.
Comptroller of the city of New York. (02 04-9). One Year Later, The Fiscal Impact
of 9/11 on New York City. Retrieved 11 13-1 from The New York City
Comptroller's Office:
http://www.comptroller.nyc.gov/bureaus/bud/reports/impact-9-11-year-
later.pdf
Cowen. (n.d.). Letter to students. Retrieved 2010 27-12 from Tulane University:
http://renewal.tulane.edu/students_undergraduate_cowen2.shtml
Cowen, S. (05 2-9). Messages for Students . Retrieved 2010 27-12 from Tulane
University : http://www.tulane.edu/studentmessages/september.html
Egenera. (2006). Case Study: Commerzbank North America. Retrieved 2011 3-1
from Egenera: www.egenera.com/1157984790/Link.htm
EMAC. (n.d.). The History of Mutual Aid and EMAC. Retrieved 2011 20-2 from
EMAC: http://www.emacweb.org/?321
FEMA. (n.d.). Incident Command System (ICS). Retrieved 2011 20-2 from
FEMA:
http://www.fema.gov/emergency/nims/IncidentCommandSystem.shtm
Forrester, E. C., Buteau, B. L., & Shrum, S. (2009). Service Continuity: A Project
Management Process Area at Maturity Level 3. In E. C. Forrester, B. L.
Buteau, & S. Shrum, CMMI® for Services: Guidelines for Superior Service
(pp. 507-523). Boston, MA: Addison-Wesley Professional.
From Reuters and Bloomberg News. (03 19-8). FirstEnergy Shares Fall After
Blackout. Retrieved 11 6-1 from Los Angeles Times:
http://articles.latimes.com/2003/aug/19/business/fi-wrap19.1
Gerace, T., Jean, R., & Krob, A. (2007). Decentralized and centralized it support
at Tulane University: a case study from a hybrid model. In Proceedings of
the 35th annual ACM SIGUCCS fall conference (SIGUCCS '07). New
York: ACM.
Grose, T., Lord, M., & Shallcross, L. (2005 11). Down, but not out. Retrieved
2010 28-12 from ASEE PRISM: http://www.prism-
magazine.org/nov05/feature_katrina.cfm
Gulf Coast Presidents. (2005). Gulf Coast Presidents Express Thanks, Urge
Continued Assistance . Retrieved 10 27-12 from Tulane University:
http://www.tulane.edu/ace.htm
Jesdanun, A. (04 12-2). Software Bug Blamed For Blackout Alarm Failure.
Retrieved 11 6-1 from CRN:
http://www.crn.com/news/security/18840497/software-bug-blamed-for-
blackout-alarm-failure.htm?itc=refresh
Krane, N. K., Kahn, M. J., Markert, R. J., Whelton, P. K., Traber, P. G., & Taylor,
I. L. (2007 8). Surviving Hurricane Katrina: Reconstructing the Educational
Enterprise of Tulane University School of Medicine. Retrieved 10 17-12
from Academic Medicine:
http://journals.lww.com/academicmedicine/Fulltext/2007/08000/Surviving_
Hurricane_Katrina__Reconstructing_the.4.aspx
Kravitz, D., & Kumar, A. (2010 31-8). Virginia DMV licensing services will be
stalled until at least Wednesday. Retrieved 2010 6-11 from
Washingtonpost.com: http://www.washingtonpost.com/wp-
dyn/content/article/2010/08/30/AR2010083004877.html
Lawson, J. (05 9-12). A Look Back at a Disaster Plan: What Went Wrong and
Right. Retrieved 10 28-12 from The Chronicle of Higher Education:
http://chronicle.com/article/A-Look-Back-at-a-Disaster/10664
Lewis, B. (n.d.). Massive Computer Outage Halts Some Va. Agencies. Retrieved
2010 5-11 from HamptonRoads.com:
http://hamptonroads.com/print/566771
McIntyre, D. A. (2009 2-9). Gmail's outage raises new concern about the Net's
vulnerability. Retrieved 2009 25-11 from Newsweek:
http://www.newsweek.com/id/214760
Mears, J., Connor, D., & Martin, M. (02 2-9). What has changed. Retrieved 11 4-
1 from Network World.
Midwest ISO. (n.d.). About Us. Retrieved 2011 28-3 from Midwest ISO:
http://www.midwestmarket.org/page/About%20Us
Minkel, J. (08 13-8). The 2003 Northeast Blackout--Five Years Later. Retrieved
11 6-1 from Scientific American:
http://www.scientificamerican.com/article.cfm?id=2003-blackout-five-
years-later
NASA. (2008 3). Powerless. Retrieved 2011 6-1 from Process Based Mission
Assurance NASA Safety Center:
http://pbma.nasa.gov/docs/public/pbma/images/msm/PowerShutdown_sfc
s.pdf
New York Independent System Operator. (2005 2). ISO. Retrieved 2010 17-3
from
http://www.nyiso.com/public/webdocs/newsroom/press_releases/2005/bla
ckout_rpt_final.pdf
News Report. (2010 1-9). Northrop Grumman Vows to Find Cause of Virginia
Server Meltdown as Fix Nears. Retrieved 2010 6-11 from Government
Technology: http://www.govtech.com/policy-management/102482209.html
Outsource IT Needs LLC. (n.d.). How Much Should You Spend on Disaster
Recovery? Calculating the Value of Business Continuity. Retrieved 2011
7-2 from Outsource IT Needs, LLC:
http://outsourceitneeds.com/DisasterRecovery.pdf
Patterson, D., Brown, A., Broadwell, P., Candea, G., Chen, M., Cutler, J., et al.
(2002). Recovery Oriented Computing (ROC): Motivation, Definition,
Techniques, and Case Studies. Computer Science Technical Report,
Computer Science Division, University of California at Berkeley, Computer
Science Department, Mills College and Stanford University; IBM
Research, Berkeley.
Petersen, R. (2009 9). Protecting Cyber Assets. Retrieved 2010 15-6 from
EDUCAUSE Review:
http://www.educause.edu/EDUCAUSE%2BReview/EDUCAUSEReviewMa
gazineVolume44/ProtectingCyberAssets/178440
Schapiro, J., & Bacque, P. (2010 28-08). Agencies' computers still being
restored. Retrieved 2010 5-11 from Richmond Times-Dispatch:
http://www2.timesdispatch.com/member-center/share-this/print/ar/476845/
Schapiro, J., & Bacque, P. (2010 3-9). Northrop Grumman regrets computer
outage. From Richmond Times-Dispatch:
http://www2.timesdispatch.com/news/state-news/2010/sep/03/vita03-ar-
485147/
Schapiro, J., & Bacque, P. (2010 2-9). Update: McDonnell lays out concerns to
Northrop Grumman. Retrieved 2010 8-11 from Richmond Times-Dispatch:
http://www2.timesdispatch.com/news/2010/sep/02/10/vita02-ar-483821/
Scherr, I., & Bartz, D. (2010 3-2). U.S. unveils cybersecurity safeguard plan.
Retrieved 2010 30-6 from Reuters:
http://www.reuters.com/article/idUSTRE62135H20100302
Scherr, I., & Bartz, D. (2010 2-3). U.S. unveils cybersecurity safeguard plan.
Retrieved 2010 13-4 from Reuters:
http://www.reuters.com/article/idUSTRE62135H20100302
Schwartz, S., Li, W., Berenson, L., & Williams, R. (2002 11-9). Deaths in World
Trade Center Terrorist Attacks --- New York City, 2001. Retrieved 11 13-1
from CDC: http://www.cdc.gov/mmwr/preview/mmwrhtml/mm51spa6.htm
Squires, P. (2010 2-9). Northrop Grumman to pay for cost of independent review.
Retrieved 2010 8-11 from virginiabusiness.com:
http://www.virginiabusiness.com/index.php/news/article/northrop-
grumman-to-pay-for-cost-of-independent-review/
Stewart, L. (2006 10-10). VITA Update to JLARC. Retrieved 2010 5-11 from
www.vita.virginia.gov: jlarc.state.va.us/meetings/October06/VITA.pdf
Swanson, A., Bowen, P., Wohl Phillips, A., Gallup, D., & Lynes, D. (2010 5).
Contingency Planning Guide for Federal Information Systems. NIST
Special Publication 800-34, Revision 1 . Gaithersburg, MD.
Swanson, M., Wohl, A., Pope, L., Grance, T., Hash, J., & Thomas, R. (2002
June). Contingency Planning Guide for Information Technology Systems
Recommendations of the National Institute of Standards and Technology
NIST Special Publication 800-34. Retrieved 2010 27-5 from Computer
Security Division Computer Resource Center, National Institute of
Standards and Technology: http://csrc.nist.gov/publications/nistpubs/800-
34/sp800-34.pdf
Testa, B. (2006 8). In Katrina’s Wake: Intensive Care for an Institution. Retrieved
2010 17-12 from Workforce Management:
http://www.workforce.com/section/recruiting-staffing/archive/feature-
katrinas-wake-intensive-care-institution/244929.html
The New York Times Company. (04 29-7). FirstEnergy settles suits related to
blackout. Retrieved 11 13-1 from NYTimes.com.
Thibodeau, P., & Mearian, L. (2005 9-12). After Katrina, users start to weigh
long-term IT issues. Retrieved 12 2010-15 from Computerworld:
http://www.computerworld.com/s/article/104542/After_Katrina_users_start
_to_weigh_long_term_IT_issues
Tulane University. (09 3). Tulane University Computer Incident Response Plan
Part of Technology Services Disaster Recovery Plan. Retrieved 2011 20-2
from Information Security @ Tulane:
http://security.tulane.edu/TulaneComputerIncidentResponsePlan.pdf
U.S.-Canada Power System Outage Task Force. (2004 April). Final Report on
the August 14, 2003 Blackout in the United States and Canada: Causes
and Recommendations. From https://reports.energy.gov
VITA. (2007 1-7). Network News Volume 2, Number 7 From the CIO. Retrieved
2010 6-11 from www.vita.virginia.gov:
http://www.vita.virginia.gov/communications/publications/networknews/def
ault.aspx?id=3594
VITA. (2010 1-6). Network News Volume 5, Number 6 . Retrieved 2010 27-11
from www.vita.virginia.gov:
http://www.vita.virginia.gov/communications/publications/networknews/def
ault.aspx?id=12080
NASCIO IT Disaster Recovery and Business Continuity Tool-kit: Planning for the
Next Disaster
http://www.nascio.org/publications/documents/NASCIO-DRToolKit.pdf
This detailed 259-page document covers resiliency management from a cross-disciplinary perspective. It includes best practices and CMMI-based generic goals and objectives to guide the process of planning and implementing operational resiliency.
These free online courses provide testing and certificates of subject proficiency. They cover a variety of topics such as emergency management, workplace violence, and preparedness.
Without the flow of electronic information, government comes to a standstill. When a state's data systems and communication networks are damaged and its processes disrupted, the problem can be serious and the impact far-reaching. The consequences can be much more than an inconvenience. Serious disruptions to a state's IT systems may lead to public distrust, chaos and fear. It can mean a loss of vital digital records and legal documents. A loss of productivity and accountability. And a loss too of revenue and commerce.

Disasters that shut down a state's mission critical applications for any length of time could have devastating direct and indirect costs to the state and its economy that make considering a disaster recovery and business continuity plan essential. State Chief Information Officers (CIOs) have an obligation to ensure that state IT services continue in the state of an emergency. The good news is that there are simple steps that CIOs can follow to prepare for Before, During and After an IT crisis strikes. Is your state ready?

Disaster Recovery Planning 101

Disaster recovery and business continuity planning provides a framework of interim measures to recover IT services following an emergency or system disruption. Interim measures may include the relocation of IT systems and operations to an alternate site, the recovery of IT functions using alternate equipment, or execution of agreements with an outsourced entity.

IT systems are vulnerable to a variety of disruptions, ranging from minor short-term power outages to more-severe disruptions involving equipment destruction from a variety of sources such as natural disasters or terrorist actions. While many vulnerabilities may be minimized or eliminated through technical, management, or operational solutions as part of the state's overall risk management effort, it is virtually impossible to completely eliminate all risks.

In many cases, critical resources may reside outside the organization's control (such as electric power or telecommunications), and the organization may be unable to ensure their availability. Thus effective disaster recovery planning, execution, and testing are essential to mitigate the risk of system and service unavailability. Accordingly, in order for disaster recovery planning to be successful, the state CIO's office must ensure the following:

1. Critical staff must understand the IT disaster recovery and business continuity planning process and its place within the overall Continuity of Operations Plan and Business Continuity Plan process.
2. Develop or re-examine disaster recovery policy and planning processes including preliminary planning, business impact analysis, alternate site selection, and recovery strategies.
3. Develop or re-examine IT disaster recovery planning policies and plans with emphasis on maintenance, training, and exercising the contingency plan.

NASCIO represents state chief information officers and information technology executives and managers from state governments across the United States. For more information visit www.nascio.org. Copyright © 2007 NASCIO. All rights reserved. 201 East Main Street, Suite 1405, Lexington, KY 40507. Phone: (859) 514-9153. Fax: (859) 514-9166. Email: NASCIO@AMRms.com
! CIOs need a Disaster Recovery and Business Continuity (DRBC) plan including: (1) Focus on capabilities that are needed in any crisis situation; (2) Identifying functional requirements; (3) Planning based on the degrees of a crisis from minor disruption of services to extreme catastrophic incidents; (4) Establish service level requirements for business continuity; (5) Revise and update the plan; have critical partners review the plan; and (6) Have hard and digital copies of the plan stored in several locations for security.

! CIOs should ask and answer the following questions: (1) What are the top business functions and essential services the state enterprise can not function without? Tier business functions and essential services into recovery categories based on level of importance and allowable downtime. (2) How can the operation's facilities, vital records, equipment, and other critical assets be protected? (3) How can disruption to an agency's or department's operations be reduced?

! CIOs should create a business resumption strategy: Such strategies lay out the interim procedures to follow in a disaster until normal business operations can be resumed. Plans should be organized by procedures to follow during the first 12, 24, and 48 hours of a disruption. (Utilize technologies such as GIS for plotting available assets, outages, etc.)

! CIOs should conduct strategic assessments and inventory of physical assets, e.g. computing and telecom resources, identify alternate sites and computing facilities. Also conduct strategic assessments of essential employees to determine the staff that would be called upon in the event of a disaster and be sure to include pertinent contact information.

! CIOs should conduct contingency planning in case of lost personnel: This could involve cross-training of essential personnel that can be lent out to other agencies in case of loss of service or disaster; also, mutual aid agreements with other public/private entities such as state universities for "skilled volunteers." (Make sure contractors and volunteers have approved access to facilities during a crisis.)

! Build cross-boundary relationships with emergency agencies: CIOs should introduce themselves and build relationships with state-wide, agency and local emergency management personnel – you don't want the day of the disaster to be the first time you meet your emergency management counterparts. Communicate before the crisis. Also consider forging multi-state relationships with your CIO counterparts to prepare for multi-state incidents. Consider developing a cross-boundary DR/BC plan or strategy, as many agencies and jurisdictions have their own plans.
! Intergovernmental communications and coordination plan: Develop a plan to communicate and coordinate efforts with state, local and federal government officials. Systems critical for other state, local and federal programs and services may need to be temporarily shut down during an event to safeguard the state's IT enterprise. Local jurisdictions are the point-of-service for many state transactions, including benefits distribution and child support payments, and alternate channels of service delivery may need to be identified and temporarily established. Make sure jurisdictional authority is clearly established and articulated to avoid internal conflicts during a crisis.

! Establish a crisis communications protocol: A crisis communications protocol should be part of a state's IT DR/BC plan; Designate a primary media spokesperson with additional single point-of-contact communications officers as back-ups. Articulate who can speak to whom under different conditions, as well as who should not speak with the press. In a time of crisis, go public immediately, but only with what you know; provide updates frequently and regularly.

! Testing: CIOs should conduct periodic training exercises and drills to test DR/BC plans. These drills should be pre-scheduled and conducted on a regular basis and should include both desk-top and field exercises. Conduct a gap analysis following each exercise.

! A CIO's approach to a DR/BC plan will be unique to his or her financial and organizational situation and the availability of trained personnel. This still leaves the question as to who writes the plans. If a CIO chooses from one of the many consultants that provide Continuity of Operations planning, he or she should make sure that staff maintains a close degree of involvement and, when completed, that the consultant(s) provide general awareness training of the plan. If CIOs choose to conduct planning in-house, have an experienced and certified business continuity planner review it for any potential gaps or inconsistencies.
(2) Top Steps States Need to Take to Solidify Public/Private Partnerships Ahead of Crises (Pre-disaster agreements with the private sector and other organizations.)

! Utilize preexisting business partnerships: Keep the dialogue open with state business partners; periodically call them all in for briefings on the state's disaster recovery and business continuity (DR/BC) plans.

! Set up "Emergency Standby Services and Hardware Contracts:" Have contracts in place for products and services that may be needed in the event of a declared emergency. Develop a contract template so a contract can be developed with one to two hours work time.

! Be sure essential IT procurement staff are part of the DR/BC plan and are aware of their roles in executing pre-positioned contracts in the event of a disaster; also be sure to include pertinent contact information.

! CIOs should develop "Emergency Purchasing Guidelines" for agencies and have emergency response legislation in place.
(3) How do you Make the Business Case on the Need for Redundancy? (Especially to the state legislature, the state executive branch and budget officials.)

Risk assessment of types of disasters that could lead to the need for business continuity planning:

! Geological hazards – Earthquakes, Tsunamis, Volcanic eruptions, Landslides/mudslides/subsidence;

! Meteorological hazards – Floods/flash floods, tidal surges, Drought, Fires (forest, range, urban), Snow, ice, hail, sleet, avalanche, Windstorm, tropical cyclone, hurricane, tornado, dust/sand storms, Extreme temperatures (heat, cold), Lightning strikes;

! Biological hazards – Diseases that impact humans and animals (plague, smallpox, Anthrax, West Nile Virus, Bird flu);

! Human-caused events – Accidental: Hazardous material (chemical, radiological, biological) spill or release; Explosion/fire; Transportation accident; Building/structure collapse; Energy/power/utility failure; Fuel/resource shortage; Air/water pollution, contamination; Water control structure/dam/levee failure; Financial issues: economic depression, inflation, financial system collapse; Communications systems interruptions;

! Intentional – Terrorism (conventional, chemical, radiological, biological, cyber); Sabotage; Civil disturbance, public unrest, mass hysteria, riot; Enemy attack, war; Insurrection; Strike; Misinformation; Crime; Arson; Electromagnetic pulse.

! For federally declared states of emergency the financial aspect has been somewhat lessened by the potential of acquiring funding grants from state or federal organizations such as FEMA. Additional funding for state cybersecurity preparedness efforts is available to states through the U.S. Department of Homeland Security's State Homeland Security Grants Program.

! Establish metrics for costs of not having redundancy: How much will it cost the state if certain critical business functions go down – e.g. ERP issues on the payment side; citizen service issues (what it would do to the DMV for license renewals); impacts on eligibility verifications for social services, etc. How long can you afford to be down? How much is this costing you? How long can you be without a core business function?
! Protect current systems: Controlled access; uninterruptible power supply (UPS); back-up generators with standby contracts for diesel fuel (use priority and back-up fuel suppliers that also have back-up generators to operate their pumps in the event of a widely spread power outage).

! Strategic location: Locate critical facilities away from sites that are vulnerable to natural and man-made disasters.

! Interactive voice response (IVR) systems that are accessing back-end databases: (There may be no operators for backup that can connect patrons to services.) Seek diversity of inbound communications.

! Self-healing communications systems that automatically re-route communications or use alternate media.

! Self-healing primary point of presence facilities that automatically restore service.

! Approach enterprise backup as a shared service: Other agencies may have the capability for excess redundancy.

! Provide secure remote access to state IT systems for essential employees (access may be tiered based on critical need.)

! Hot Sites: A disaster recovery facility that mirrors an agency's applications databases in real-time. Operational recovery is provided within minutes of a disaster. These can be provided at remote locations or outsourced to one or multiple contractors.
! Decision making: Prepare yourself for making decisions in an environment of uncertainty. During a crisis you may not have all the information necessary, however, you will be required to make immediate decisions.

! Execute DR/BC Plan: Retrieve copies of the plan from secure locations. Begin systematic execution of plan provisions, including procedures to follow during the first 12, 24, and 48 hours of the disruption.

! Implement your emergency employee communications plan: Inform your internal audiences – IT staff and other government offices – at the same time you inform the press. Prepare announcements to employees to transition them to alternate sites or implement telecommuting or other emergency procedures. Employees can maintain communication with the central IT office utilizing Phone exchange cards, provided to employees with two numbers: (1) First number employees use to call in and leave their contact information; (2) Second number is where the employees call in every morning for a standing all employee conference call for updates on the emergency situation.
! Back-up communications: In the event wireless, radio and Internet communications are inaccessible, Government Emergency Telecommunications Service (GETS) cards can be utilized for emergency wireline communications. GETS is a Federal program that prioritizes calls over wireline networks and utilizes both the universal GETS access number and a Personal Identification Number (PIN) for priority access.

! Leverage technology/Think outside the box: In a disaster situation the state's GIS systems can be utilized to monitor power outages and system availability. For emergency communications, the "State Portal" can be converted to an emergency management portal. Also, Web 2.0 technologies such as Weblogs, Wikis and RSS feeds can be utilized for emergency communications.
! Preliminary damage and loss assessment: Conduct a post-event inventory and assess the loss of physical and non-physical assets. Include both tangible losses (e.g. a building or infrastructure) and intangible losses (e.g. financial and economic losses due to service disruption). Be sure to include a damage and loss assessment of hard copy and digital records. Prepare a tiered strategy for recovery of lost assets.

! Employee transition: Once agencies have recovered their data, CIOs need to find interim space for displaced employees, either at the hot site or another location. Coordinate announcements to employees to transition them to an alternate site or implement telecommuting procedures until normal operations are reestablished.

! Contractual performance: Review the performance of strategic contracts and modify contract agreements as necessary.

! Lessons learned: Evaluate the effectiveness of the DR/BC plan and how people responded. Examine all aspects of the recovery effort and conduct a gap analysis to identify deficiencies in the plan execution. Update the plan based on the analysis. What went right (duplicate); what went wrong (tag and avoid in the future). Correct problems so they don't happen again.
continuity. DRII established its goals to: Promote a base of common knowledge for the business continuity planning/disaster recovery industry through education, assistance, and publication of the standard resource base; Certify qualified individuals in the discipline; and Promote the credibility and professionalism of certified individuals: <http://www.drii.org/>

The National Association of State Procurement Officials (NASPO) has completed work on disaster recovery as it relates to procurement: <http://www.naspo.org/>

AFTER THE DISASTER
Hurricane Katrina not only impacted more than 90,000 square miles and almost 10 million residents of the Gulf Coast but also affected how governments will manage such disasters in the future. A collection of articles opens the dialogue about disaster response in a new book, "On Risk and Disaster: Lessons from Hurricane Katrina." The book, edited by Ronald J. Daniels, Donald F. Kettl (a Governing contributor) and Howard Kunreuther, warns of the inevitability of another disaster and the need to be prepared to act. It addresses the public and private roles in assessing, managing and dealing with disasters and suggests strategies for moving ahead in rebuilding the Gulf Coast. To see a table of contents and sample text, visit <http://www.upenn.edu/pennpress/book/14002.html> Published by the University of Pennsylvania Press, the book sells for $27.50.

"Disaster Recovery, How to protect your technology in the event of a disaster," Bob Xavier, November 27, 2001: <http://www.techsoup.org/howto/articles/techplan/page2686.cfm>
VITA
EDUCATION
Candidate for M.S. in Computer Information Technology at Purdue University,
May 2011 G.P.A. 3.8/4.00
Honors B.A. in Communication, Public Relations at Purdue University, December
1998 G.P.A. 3.57/4.00
PUBLICATIONS
REFERENCES