PURDUE UNIVERSITY
GRADUATE SCHOOL
Thesis/Dissertation Acceptance
Entitled
DATA CENTER RECOVERY BEST PRACTICES: BEFORE, DURING, AND AFTER DISASTER
RECOVERY EXECUTION
For the degree of Master of Science
To the best of my knowledge and as understood by the student in the Research Integrity and
Copyright Disclaimer (Graduate School Form 20), this thesis/dissertation adheres to the provisions of
Purdue University's "Policy on Integrity in Research" and the use of copyrighted material.
Approved by Major Professor(s): J. Eric Dietz
PURDUE UNIVERSITY
GRADUATE SCHOOL
Title of Thesis/Dissertation:
DATA CENTER RECOVERY BEST PRACTICES: BEFORE, DURING, AND AFTER DISASTER
RECOVERY EXECUTION
I certify that in the preparation of this thesis, I have observed the provisions of Purdue University
Executive Memorandum No. C-22, September 6, 1991, Policy on Integrity in Research.*
Further, I certify that this work is free of plagiarism and all materials appearing in this
thesis/dissertation have been properly quoted and attributed.
I certify that all copyrighted material incorporated into this thesis/dissertation is in compliance with the
United States’ copyright law and that I have received written permission from the copyright owners for
my use of their work, which is beyond the scope of the law. I agree to indemnify and save harmless
Purdue University from any and all claims that may be asserted or that may arise from any copyright
violation.
04/04/2011
Date (month/day/year)
*Located at http://www.purdue.edu/policies/pages/teach_res_outreach/c_22.html
DATA CENTER RECOVERY BEST PRACTICES: BEFORE, DURING,
AND AFTER DISASTER RECOVERY EXECUTION
A Thesis
Submitted to the Faculty
of
Purdue University
by
Heather M. Brotherton
In Partial Fulfillment of the
Requirements for the Degree
of
Master of Science
May 2011
Purdue University
TABLE OF CONTENTS
Page
LIST OF TABLES ........................................................................................... iv
LIST OF FIGURES.......................................................................................... v
LIST OF ABBREVIATIONS............................................................................ vi
ABSTRACT ....................................................................................................vii
CHAPTER 1. INTRODUCTION....................................................................... 1
1.1. Statement of purpose ......................................................................... 1
1.2. Research Question ............................................................................. 1
1.3. Scope.................................................................................................. 2
1.4. Significance......................................................................................... 2
1.5. Assumptions ....................................................................................... 3
1.6. Limitations........................................................................................... 3
1.7. Delimitations ....................................................................................... 4
1.8. Summary............................................................................................. 4
CHAPTER 2. LITERATURE REVIEW............................................................. 6
2.1. Critical cyberinfrastructure vulnerability .............................................. 6
2.2. Barriers to cyberinfrastructure resiliency ............................................ 8
2.3. Mutual aid ........................................................................................... 9
2.3.1. Mutual Aid Association ............................................................. 11
2.4. Training ............................................................................................. 12
2.5. Testing .............................................................................................. 13
2.6. Summary........................................................................................... 14
CHAPTER 3. FRAMEWORK AND METHODOLOGY .................................. 15
3.1. Framework ........................................................................................ 15
3.2. Researcher Bias ............................................................................... 16
3.3. Methodology ..................................................................................... 16
3.4. Data Collection.................................................................................. 17
3.5. Authorizations ................................................................................... 17
3.6. Analysis............................................................................................. 18
3.6.1. Triangulation............................................................................. 18
3.7. Summary........................................................................................... 19
CHAPTER 4. CASE STUDIES...................................................................... 20
4.1. Commerzbank................................................................................... 20
4.1.1. Background.................................................................................... 20
4.1.2. World Trade Center Attacks ..................................................... 21
4.1.3. Conclusion................................................................................ 28
4.2. FirstEnergy........................................................................................ 29
4.2.1. Background .............................................................................. 29
4.2.2. Northeast Blackout of 2003 ...................................................... 29
4.2.3. Conclusion................................................................................ 38
4.3. Tulane ............................................................................................... 39
4.3.1. Background .............................................................................. 39
4.3.2. Hurricane Katrina...................................................................... 40
4.3.3. Conclusion................................................................................ 48
4.4. Commonwealth of Virginia ................................................................ 49
4.4.1. Background .............................................................................. 49
4.4.2. August 2010 outage ................................................................. 51
4.4.3. Conclusion................................................................................ 65
CHAPTER 5. ANALYSIS............................................................................... 67
5.1. Best Practice Triangulation ............................................................... 67
5.1.1. Before-Planning........................................................................ 67
5.1.2. During-Plan execution .............................................................. 73
5.1.3. After-Plan improvement............................................................ 78
CHAPTER 6. CONCLUSION ........................................................................ 86
CHAPTER 7. FUTURE RESEARCH............................................................. 89
BIBLIOGRAPHY............................................................................................ 91
APPENDICES
Appendix A............................................................................................. 103
Appendix B............................................................................................. 104
VITA ............................................................................................................ 118
PUBLICATION
Disaster recovery and business continuity planning:
Business justification.............................................................................. 120
LIST OF TABLES
Table Page
Table 5.1 Tolerance and objectives .................................................................... 68
Table 5.2 Aid relationship utilized during recovery .............................................. 78
LIST OF FIGURES
Figure Page
Figure 5.1 Adherence to established procedures................................................ 74
Figure 5.2 Sample IT incident command structure.............................................. 77
Figure 5.3 Reported average downtime revenue losses in billions ..................... 80
Figure 5.4 Reported critical application and data classifications ......................... 81
Figure 5.5 Components of a resilient system ...................................................... 85
LIST OF ABBREVIATIONS
ABSTRACT
This qualitative multiple case study analysis reviews well documented past
procedures, chain of command structure, recovery time and cost, and mutual aid
CHAPTER 1. INTRODUCTION
measures due to the high cost of remote failover systems and training.
systems resiliency.
1.2. Research Question
What are best practices in planning, during, and after disaster recovery
execution?
1.3. Scope
recovery time, and business impact. Practical tools and resources to assist best
1.4. Significance
systems has created vulnerabilities that have not been uniformly addressed.
widespread, severe negative impact on the public. While most large corporations
have remote failover locations, there are many organizations important to critical
functions that do not have the resources to develop and implement business
recovery guidance, developed through the findings of this research, may help
ensure the stability of cyberinfrastructure and, by extension, the safety and
well-being of all.
1.5. Assumptions
phenomenon of interest.
• Existing publicly available documents are the best source of the actions
1.6. Limitations
Limitations include:
due to:
of the incident
Therefore, this research will not address topics that cannot be examined
1.7. Delimitations
Delimitations include:
this research will not attempt to add to planning, but will focus on the
research study.
• Information systems failures that are not well documented will not be
addressed.
1.8. Summary
adverse incidents with minimal disruption. The scope of the project is defined in
cyberinfrastructure resiliency.
vulnerabilities and threats are discussed. The barriers to systems resiliency and
agreements.
operations of the economy and government. They include, but are not limited to,
However, despite this directive, in 2003 the Northeast portion of the United
States suffered an extended widespread power outage due in large part to failure
of the computer system (U.S.-Canada Power System Outage Task Force, 2004).
2003). Findings published by the New York Independent System Operator state
"the root cause of the blackout was the failure to adhere to the existing reliability
rules" (New York Independent System Operator, 2005, p. 4). "ICF Consulting
estimated the total economic cost of the August 2003 blackout to be between $7
and $10 billion" (Electricity Consumers Resource Council (ELCON), 2004, p. 1).
critical resources such as power and water from cyber attack (Scherr & Bartz,
intellectual property alone from 2008 to 2009 were approximately one trillion
(ANSI), 2010).
patch known vulnerabilities (Homeland Security, 2009). Each patch or fix applied
increased the usefulness of computers, but this has also increased vulnerability.
Information systems are highly complex; even information technology experts are
backing to push policy change and supply resources there is little chance for
failover testing can render an otherwise solid continuity plan useless. In some
cases, companies have disaster recovery plans but are reluctant to test live
systems. Such tests can be scheduled during low-traffic periods when the staff can
be prepared to quickly recover any outage. These tests serve to identify system and
failover plan weaknesses and make the staff more comfortable with the failover and
recovery process.
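To make this concrete, a hedged sketch in Python: a script could pick the quietest hour from recent request-rate history before announcing a live failover drill. The hourly counts and the single-quiet-hour heuristic below are invented for illustration and are not drawn from the sources reviewed here.

```python
# Hypothetical sketch: choose a low-traffic hour for a live failover drill.
# The hourly request counts are invented sample data.
hourly_requests = dict(enumerate(
    [120, 95, 80, 70, 65, 90, 300, 900, 1500, 1700, 1650, 1600,
     1550, 1600, 1580, 1500, 1400, 1200, 900, 600, 400, 300, 200, 150]))

def quietest_hour(history):
    """Return the hour of day with the lowest observed request volume."""
    return min(history, key=history.get)

drill_hour = quietest_hour(hourly_requests)
print(f"Schedule the live failover test around {drill_hour:02d}:00, "
      f"when traffic averages {hourly_requests[drill_hour]} requests/hour.")
```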
idea that some disasters cannot be planned for because they are too large.
provides a framework for managing incidents of any size and complexity. (FEMA)
Information and training for NIMS are freely available on the Federal Emergency
Management Agency (FEMA) website. The use of this framework is highly
recommended because it is widely used and
provides a framework for integrating outside organizations into the command and
The September 2010 San Bruno gas pipeline explosion is a good example
activated 42 fire agencies, 200 law enforcement officers. (Jackson, 2011) “85
(Jackson, 2011) The resources required for this incident were far beyond feasible
maintainability for the city’s budget. The California Mutual Aid System along with
an Emergency Operations plan ensured the city was able to quickly and
The possibility that the utilization of IT mutual aid agreements will allow
exploring (Swanson, Bowen, Wohl Phillips, Gallup, & Lynes, 2010). Collocation
Staffing is a key resource that could be negotiated for through mutual aid
staff will be available should a catastrophic event occur. Some catastrophes may
make staff unavailable due to personal impact and additional staff may be
staff. The end result may be cost savings. Fewer contractors and consultants
can be shared between partner organizations. This may not only save costs of
developing and providing training, but will provide a "common language" for the
downtime.
Mutual Aid agreements are common for police, fire departments, and
utilities. Associations have been formed to fill the gaps in situations where an
These relationships have been used to the benefit of society at large allowing
during the Blue Cascades exercise (2004, p. 4). The Blue Cascades II exercise
The FEMA website has links to a few mutual aid associations such as
EMAC is designed to assist states, but this model may work for non-profit,
technology may be warranted due to the special skills, equipment, and resources
2.4. Training
Human error is often cited as the primary cause of systems failure (U.S.-Canada
Power System Outage Task Force, 2004). In many cases, the incident is
initiated by another type of failure (software, hardware, fire, etc), but the
Task Force, 2004). Automation of "easy tasks" leaves "complex, rare tasks" to
the human operator. (Patterson, et al., 2002, p. 3) Humans "are not good at
2002, p. 3) "Humans are furious pattern matchers" but "poor at solving problems
from first principles, and can only do so for so long before" tiring (Patterson, et
al., 2002, p. 3). Automation "prevents …building mental production rules and
this are that technologists are not efficient at solving problems without
and allows the technologist to quickly and more accurately respond to incidents.
2.5. Testing
emergency tests them, and latent errors in emergency systems can render them
are not the focus of the testing discussed here. Large-scale recovery and
of a large-scale disaster. Disasters have not only been historically costly, but
business continuity and disaster recovery testing are too high to risk.
2.6. Summary
the cost of maintaining remote failover. Training and testing are key factors in
from the analysis. Qualitative methods will be applied to facilitate the exploration
of this topic. This chapter details the research methodology employed as well as
3.1. Framework
a theoretical point of view, but how does execution play out in real life, high
beliefs that may encroach upon the findings of this research. Preparedness, in
incident mitigation, quicker recovery time, and reduced personal stress during
planning and the ingenuity of the incident responders is the key to success. I
believe that an all hazards approach, established chain of command, and well-
3.3. Methodology
processes. Primarily due to the rare occurrence of this type of event, it is highly
interest. Quantitative methods are impractical because, while the cases used will
measures is questionable due to the high stress nature of the recovery situations
Lab research was also considered and while this would produce high
Therefore, external validity would be low and would likely result in unrealistic
findings.
included:
• Documented resolution
Phenomenon related documents, artifacts, and archival records were used rather
than interviewing, which also reduces the possible impact of researcher bias.
Multiple cases were included in the case study. This method of data collection
sector. The area of interest is high impact cyberinfrastructure; the findings using
information systems.
3.5. Authorizations
3.6. Analysis
those resulting in positive and negative results, were identified. Factors explored
include:
3.6.1. Triangulation
The purpose of including more than one case study is to collate the
Generalizable practices from other disciplines will also be used to reinforce the
3.7. Summary
used in this research. Rationales for the methods employed were also discussed.
Findings and sources used for the case study are included in following chapters.
4.1. Commerzbank
4.1.1. Background
United States as well, including a 1992 flood in Chicago and the 1993 World
“only 300 feet from the World Trade Center towers.”(Editorial Staff of
SearchStorage.com, 2002)
On September 11, 2001, the World Trade Center suffered the largest terrorist
attack in United States history. Nearly 3,000 people died that day as a result of the
attacks. (Schwartz, Li, Berenson, & Williams, 2002) The impact to the economy
of the city of New York alone was $83 billion. (Barovik, Bland, Nugent, Van Dyk,
& Winters, 2001) Site cleanup took over eight months. (Comptroller of the City of
New York, 2002) Not all businesses were able to recover from the devastation
inflicted by the attacks. (Scalet, S. D., 2002) The overall economic impacts
continue today and the daily life of each resident of the United States has been
affected.
4.1.2.1. Ramifications
Commerzbank was so near the World Trade Center impact sites that the
The interior of the building that housed Commerzbank was covered in debris and
glass, creating an unsafe environment and choking building equipment. The data
Most of the local data center disks failed, causing failover to Commerzbank's
tolerant system with remote failover that allowed them to remain operational
4.1.2.2. Response
communications with “Federal Reserve and the New York Clearing House” that
were lost after the first collision. (Availability Digest, 2009) It became apparent
that the World Trade Center was under attack when the second jet hit,
and Why, Part 2: Organizations, 2010) When the building lost power,
Commerzbank's backup power generator took over, but the HVAC system failed
Rye, New York can be operated by 10 staff members and 16 reported to the
the primary data center, and in the days that followed, EMC, Commerzbank's storage
vendor, worked around the clock to restore data that was backed up to tape
fuel storage tank, cooling tower, UPS, batteries, and fire suppression
Commerzbank was in the midst of virtualizing storage, and had finished the
majority of the conversion before the attacks. (Mears, Connor, & Martin, 2002) The
provided the capability to meet the zero downtime requirement set forth by the
"everything" to the remote site. (Parris, Who Survives Disasters and Why, Part 2:
Organizations, 2010) The remote site, located 30 miles from the World Trade
The primary site at the World Trade Center maintained local duplicate drives and
Commerzbank used:
2002) The facilities were physically connected via “Fibre Channel SAN” providing
a storage transfer rate of almost 1TB per second. (Parris, Who Survives
Disasters and Why, Part 2: Organizations, 2010) The remote site maintained
servers that “were members of the cluster” at the World Trade Center site. These
servers continued to serve using replicated “remote disks to the main site” after
the storage there failed. (Parris, Who Survives Disasters and Why, Part 2:
meant help was available" around the clock. (Parris, Who Survives Disasters and
with EMC and Compaq, later to become part of Hewlett-Packard (HP), ensured
they were on hand to assist with any services or equipment required to recover.
disaster recovery part of the business continuity plan worked. All critical data was
available, but it still took nearly four hours to resume normal business
operations. (Mears, Connor, & Martin, 2002) Therefore, they had failed to meet the
and required way too much human intervention.” Rye’s backup servers were not
proprietary operating systems. The virtualized Linux servers use “SUSE Linux
and the support model of the open source community” rather than the HP
residing "on the server itself—the disk, network interface card and storage
interface—give that server a fixed identity"; this also caused delays as the servers
2006) The new "system is designed for SAN connectivity and boot"; any
BladeFrame server can assume any identity at any time. That’s what we were
requirements for the data center have also decreased due to the virtualized
servers. The overall physical complexity has decreased as well, 140 servers
has reduced hardware trouble-shooting time. Configuring new servers now takes
The primary site and the backup site contain servers that are members of
synchronous replication. The Rye site is now an active part of daily processing
We live every day in the recovery portion of the DR mode. Having the
assets active takes the mystery out of continuity. We’re not praying that it
works, not planning that it works—we know it works because it’s an active
part of the process. (Egenera, 2006)
4.1.2.5. Discussion
testing and every staff member knew what to do. The failover processes were
concern for heroics to save the business. Post incident review showed some
company identified the problem, found a suitable solution, and implemented the
solution.
severely impacted New York on a larger scale, having only two clusters, both
located in New York, may not provide the seamless zero downtime the company
requires. This global company has the resources to commit to this more
comprehensive configuration. They also have facilities around the world to take
advantage of for co-location. The floor space use was reduced by 60% through
2006)
In this case, like that of Katrina, the disaster destroyed the hardware at the
site. There was little that preparedness could do to save the equipment.
However, unlike Katrina the recovery plan worked. Commerzbank had many
advantages in this case; New York’s infrastructure did not suffer the damage
New Orleans suffered. Commerzbank did not have to shoulder the burden of
rebuilding a city, only their primary location. Also, Commerzbank had the
complacent. Disasters of various scales happen on a daily basis; most are not
terribly severe and impact a small number of people. Failure to plan for a large-
scale, severe-impact event will increase the financial burden and stress of
incidents that do occur. If possible, defray the costs of maintaining hot sites by
planning, walk through as many scenarios as imaginable; this will help ensure
4.1.3. Conclusion
Commerzbank survived 9/11 with relative ease while many others suffered
unrecoverable losses. Many did not recover due to failure to plan and prepare for
understood the bank’s vulnerabilities and tolerances and made the investments
necessary to mitigate them. Past experience had taught the company how to
survive and high-level management and staff were trained to manage incidents.
This vigilance paid off in reduced downtime and minimized financial impact to the
company.
4.2. FirstEnergy
4.2.1. Background
has remained highly profitable despite a history of poor practices that put the
public at risk. One of the most notable resulted in a $5.45 million fine issued by
the Nuclear Regulatory Commission (NRC). This fine regarded “reactor pressure
vessel head degradation”. FirstEnergy was notified of the problem in 2002 by the
NRC. (Merschoff, 2005) The plant was operated for nearly two years after the
company was aware the equipment was unsafe to operate. (Merschoff, 2005)
FirstEnergy employees supplied the NRC with misinformation and at least two
(Minkel, 2008) News reports claimed this blackout was primarily due to a software
bug that stalled the utility’s control room alarm system for over an hour. The
operators were deprived of the alerts that would have caused them to take the
necessary actions to mitigate the grid shutdown/failures. The primary energy grid
monitoring server failed shortly after the failure of the alarm system, the backup
server took over and failed after a short period. The failure of the backup server
time to a crawl, which further delayed operators’ actions due to a refresh rate of
The operators’ actions were slowed while they waited for information and service
4.2.2.1. Ramifications
4.2.2.1.1. General
taking down over 263 plants (Associated Press, 2003), resulting in eight states and
parts of Canada being without power. (Barron J., 2003) This blackout affected
(Northeast Blackout of 2003) The estimated cost of this blackout was $7-10
4.2.2.1.2. FirstEnergy
values fell as investors were cautioned about the possibility of fines and
There were no fines assessed because at that time no regulatory entity had the
stockholders sued for losses due to negligence, and the company settled in July
of 2004, agreeing to pay $89.9 million to stockholders. (The New York Times
Company, 2004)
4.2.2.2. Response
4.2.2.2.1. MISO
overseeing power flow across the upper Midwest located in Carmel, Indiana.
(Associated Press, 2003) (Midwest ISO) The MISO state estimator tool
malfunctioned due to a power line break at 14:20 Eastern Daylight Time (EDT).
(U.S.-Canada Power System Outage Task Force) This was one of the two tools
MISO used, both of which were under development, to assess electric system
state and determine best course of action. (U.S.-Canada Power System Outage
Task Force) The state estimator (SE) mathematically processes raw data and
presents it in the electrical system model format. This information is then fed
into the real time contingency analysis (RTCA) tool to “evaluate the reliability of
the power system”. (U.S.-Canada Power System Outage Task Force, p. 48)
At 14:15 the SE tool produced a solution with a high degree of error. The
operator turned off the automated process that runs the SE every five minutes,
traced the error to an unlinked line, and manually corrected the linkage. The SE was
manually run and completed at 13:07. The operator left for lunch, forgetting to
re-enable the automated tool processing. This was discovered and re-enabled at
about 14:40. The previous linkage problem recurred and the tools failed to produce
reliable results. The tool was not successfully run again until "16:04 about two
minutes before the start of the cascade." (U.S.-Canada Power System Outage
Task Force, p. 48)
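One safeguard against this failure mode is a periodic job that warns when its automatic trigger has been left disabled for too long. The sketch below is a hypothetical Python illustration of that idea; it is not the MISO state estimator code, and the fifteen-minute warning threshold is an assumption.

```python
from datetime import datetime, timedelta

RUN_INTERVAL = timedelta(minutes=5)    # the SE is meant to run every five minutes
MAX_DISABLED = timedelta(minutes=15)   # assumed limit before someone is warned

class PeriodicJob:
    def __init__(self):
        self.auto_enabled = True
        self.disabled_since = None

    def disable_auto(self):
        """Operator turns off the automatic trigger, e.g. to troubleshoot."""
        self.auto_enabled = False
        self.disabled_since = datetime.now()

    def enable_auto(self):
        self.auto_enabled = True
        self.disabled_since = None

    def check(self):
        """Warn if the automatic trigger has been left off for too long."""
        if not self.auto_enabled and datetime.now() - self.disabled_since > MAX_DISABLED:
            print("WARNING: automatic runs still disabled; re-enable or acknowledge.")

job = PeriodicJob()
job.disable_auto()
job.disabled_since -= timedelta(hours=2)   # simulate the trigger being forgotten
job.check()                                # prints the warning
```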
4.2.2.2.2. FE
alarm function failed at 14:14 and began a cascading series of application and
server failures, by 14:54 all functionality on the primary and backup servers
failed. (U.S.-Canada Power System Outage Task Force) FE’s IT staff were
unaware of any problems until 14:20, when their monitoring system paged them
the primary control system server failed and the backup server took over
processing. The FE IT engineer was then paged by the monitoring system. (U.S.-
Outage Task Force) IT staff did not notify the operators of the problems nor did
they verify that functionality was restored with the EMS system operators. (U.S.-
Canada Power System Outage Task Force) The alarm system remained non-
functional. IT staff were notified of the alarm problem at 15:42 and they
discussed the “cold reboot” recommended during a support call with General
Electric (GE). The operators advised them not to perform the reboot because the
Task Force) Reboot attempts were made at 15:46 and 15:59 to correct the EMS
An American Electric Power (AEP) operator, who was still receiving good
information from FE’s EMS, called FE operators to report a line trip at 14:32.
Shortly thereafter operators from MISO, AEP, PJM Interconnection (PJM), and
Power System Outage Task Force) FE operators became aware that the EMS
systems had failed at 14:36, when an operator reporting for the next shift
reported the problem to the main control room. (U.S.-Canada Power System
Outage Task Force) The “links to remote sites were down as well.” (U.S.-Canada
Power System Outage Task Force, p. 54) The EMS failure resulted in the
contingency analysis after becoming aware that there were problems with the
EMS system. (U.S.-Canada Power System Outage Task Force) At 15:46 it was
too late for the operators to take action to prevent the blackout. (U.S.-Canada
FirstEnergy did have mitigation in place. There were several server nodes
that can host all functions with one server on “hot-standby” for backup with
established relationship with the EMS vendor GE, which provided support to the
IT staff when a new problem that the IT staff was not experienced with occurred.
There were also established mutual aid relationships with other utility operators.
The operators have the ability to monitor affiliated electric systems and request
tactic for electric companies. The purpose of the policy is to avoid lines that will
require immediate repair for safety reasons and will increase stress on the
to protect the reliable functioning of the electric system and its monitoring tools.
4.2.2.4.1. Regulatory
voluntary; they can now "impose fines of up to a million dollars a day". (Minkel,
2008) The Energy Policy Act of 2005 provided FERC authority to set and enforce
standards. (Minkel, 2008) FERC has also created a prototype real-time monitoring
York. (Minkel, 2008) More testing and infrastructure upgrades are required before
4.2.2.4.2. FirstEnergy
locations to provide resiliency. (Jesdanun, 2004) The new system has improved
alarm, diagnosis, and contingency analysis capabilities. (NASA, 2008) More visual
status information and cues are now provided. (NASA, 2008) FirstEnergy created
repair and maintenance downtimes between their operations and IT staffs” and
4.2.2.5. Discussion
electrical systems operators were “unaware” of the problem for over an hour, as
Task Force) However, there were repeated warnings from communications with
operators from various locations to indicate there was a problem with the EMS.
The operators were aware that there was a problem at 14:36, which provided the
operators and IT staff indicated that the operators were aware that the electrical
system state required action. Operators’ actions may have been hampered from
14:54 to 15:59 by EMS screen refresh rates of up to “59 seconds per screen.”
FE’s IT staff failed to notify the operators at 14:20, when they became
aware of EMS system failures. This could have provided the EMS operators with
16 minutes more to determine and execute the correct course of action. Also, the
FE EMS system was not configured to produce alerts when it fails, which is a
standard EMS feature. This would have provided another six minutes to the
operators. Given the many other warnings they received, it is hard to make a case that the
outcome hinged on a few minutes' notice. It is possible that operators were too dependent upon the
automated systems and overconfident that the situation would correct itself.
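The missing capability, an EMS that alerts when its own alarm function dies, amounts to a heartbeat watchdog. A minimal sketch follows, with invented names and thresholds rather than anything taken from FE's or GE's software.

```python
import time

class AlarmHeartbeat:
    """Watchdog that notices when the alarm subsystem stops reporting in."""

    def __init__(self, max_silence_seconds=60):
        self.max_silence = max_silence_seconds
        self.last_beat = time.time()

    def beat(self):
        """Called by the alarm process on every successful processing cycle."""
        self.last_beat = time.time()

    def check(self):
        """Called independently (e.g. by a cron job); pages operators if stale."""
        silent_for = time.time() - self.last_beat
        if silent_for > self.max_silence:
            print(f"PAGE OPERATORS: alarm subsystem silent for {silent_for:.0f}s")

watchdog = AlarmHeartbeat(max_silence_seconds=60)
watchdog.last_beat -= 300    # simulate five minutes without a heartbeat
watchdog.check()             # a check like this would have flagged a silent alarm failure
```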
The EMS system was “brought into service in 1995” and it had been
decided to replace the aging system "well before August 14th". (U.S.-Canada
Power System Outage Task Force, pp. 55-56) The NERC found FE in violation
Force) It was later determined that the software had a programming error that
vice president, Joseph Bucciero, “the software bug surfaced because of the
FirstEnergy power lines had already short-circuited." (Jesdanun, 2004) The three
lines were lost because FE failed to perform tree trimming according to internal
policy. The lines sagged, which occurs on hot days, and touched trees. (NASA,
2008)
4.2.3. Conclusion
This outage serves as an example that many small, mostly human errors
can result in disaster. A more resilient system requiring less human interaction to
perform emergency tasks could have prevented this outage. Poor communication
between IT and Operations staff was a large factor as was the operators’ failure
to heed the warning of other operators. The FirstEnergy operators were provided
with information outside of their EMS to understand that the EMS was likely
failure to be proactive. They did not trim trees, they did not replace their old EMS
system, they did not communicate appropriately with other energy operators, and
they did not train the employees how to act in a crisis situation when the EMS
could not be relied upon. There were contributing factors outside of FirstEnergy,
but if any one of the factors contributed by FirstEnergy were removed the wide
4.3. Tulane
4.3.1. Background
Business. (Gerace, Jean, & Krob) The University was established in 1834 as a
2008). A post Civil War endowment from Paul Tulane transformed the financially
struggling public university into the private university that survives today. (Alumni
focus and its contributions have shaped the city of New Orleans over the
New Orleans. (Alumni Affairs, Tulane University, 2008) Tulane is currently New
Since the university was established Tulane has weathered the Civil War
and many hurricanes. Tulane has adapted to the New Orleans hurricane prone
environment. Tulane has integrated buildings that can “withstand hurricane force
winds” into the campus landscape. (Alumni Affairs, Tulane University, 2008) Only
Katrina and the Civil War have prevented Tulane from offering instruction.
Two days before the beginning of Tulane's 2005 fall semester, Hurricane
Katrina devastated New Orleans. (Blackboard Inc., 2008) This was "the worst
natural disaster in the history of the U.S." (Cowen, 2005) The real damage to New
Orleans began hours after Katrina passed as the levee succumbed to the
4.3.2.1. Ramifications
However, Tulane’s data center is the focus of this case study therefore direct
impact on Tulane and the cascading effects will be discussed. The hurricane’s
property damages alone were in excess of $400 million. (Alumni Affairs, Tulane
University, 2008) Over a week after Katrina, “eighty percent of Tulane’s campus
The New Orleans campus was closed for the fall semester of 2005.
(Cowen, S., Messages for Students, 2005) Students were displaced and attended
students were asked to pay fees at the hosting University, Tulane promised to
address tuition issues as soon as they gained access to their student records.
Hurricane Katrina, they had no access to “computer records of any kind”. (Alumni
Affairs, Tulane University, 2008, p. 65) Tulane’s bank was not operational and
the administration did not know what funds were in the inaccessible account.
Look Back at a Disaster Plan: What Went Wrong and Right, 2005)
service critical equipment and retrieve important servers” which saved several
experiments. (Grose, Lord, & Shallcross, 2005) Over 150 research projects
suffered damage. (Alumni Affairs, Tulane University, 2008) Medical teams were
Hospital was closed for six months, but “was the first hospital to reopen in
Tulane reopened in January 2006 for the spring semester. The school lost
$125 million due to being closed for the fall semester of the 2005-2006 school
year. (Alumni Affairs, Tulane University, 2008) Prior to reopening, Tulane had to
streamline its academic programs. This made funding available for the daunting
task of rebuilding Tulane and New Orleans. New Orleans had no infrastructure to
support Tulane. Tulane provided housing, utilities, and schools to support Tulane
students and staff. (Alumni Affairs, Tulane University, 2008) Despite Tulane’s
amazing recovery, loss of tuition income and disaster related financial losses
4.3.2.2. Response
On Monday, August 29, 2005, Tulane University was flooded after the levees
have a few days' warning prior to the hurricane. On August 25th, Tulane's IT staff
initiated online data backups according to the data center disaster recovery plan.
(Lawson, 2005) August 28th, Tulane brought its information systems down.
(Lawson, 2005) Backup generators and supplies were placed into campus
buildings. (Krane, Kahn, Markert, Whelton, Traber, & Taylor, 2007) On the 30th
systems failed “with loss of e-mail systems and both cell and landline phones.
Center command post along with other essential staff during the Hurricane.
power to the Reily building. (Alumni Affairs, Tulane University, 2008) Thursday,
the staff was rescued by helicopter from the now flooded Tulane after several
Tulane’s top recovery priority was paying its employees. (Anthes, 2008)
This effort was complicated because payroll employees failed to take the payroll
printers and supplies as specified in the disaster plan. (Lawson, A Look Back at a
Disaster Plan: What Went Wrong and Right, 2005) Police escorted Tulane IT staff
to retrieve Tulane’s backup data and computers from their 14th floor offsite
completed “two days late” according to Tulane CIO John Lawson. (Lawson, A
Look Back at a Disaster Plan: What Went Wrong and Right, 2005) As of September
invited Tulane to resume operations at Baylor. However, this process did not go
(Lawson, A Look Back at a Disaster Plan: What Went Wrong and Right, 2005) This
was quickly corrected and Tulane used the redirected emergency site to
updates via its Web site.” (Schaffhauser, 2005) School of Medicine classes
updated in the days before Katrina hit. (Testa, 2006) The database records along
with the Baylor registration website and newly created paper files allowed Baylor
and Tulane to gather the information needed to resume classes. (Testa, 2006)
This resumption was particularly vital for seniors. Unfortunately, not all of the
College students of New Orleans were so lucky. About 100,000 were displaced,
Email “was the first system to be brought back online”. (McLennan, 2006)
Blackboard provided systems to allow Tulane and other affected Gulf Coast
Tulane’s own Blackboard system was quickly restored to allow retrieval of course
(Anthes, 2008) The staff was trained and comfortable enacting the disaster plan.
They knew the backups could be completed in 36 hours. (Lawson, A Look Back
at a Disaster Plan: What Went Wrong and Right, 2005) Offsite backups were
maintained on the 14th floor of a building in New Orleans. (Anthes, 2008) Tulane
What Went Wrong and Right, 2005) The remote-hosted emergency website for
Today the university has a disaster recovery plan including offsite backup
servers for websites, e-mail and other critical systems, which is updated yearly.
(Anthes, 2008) There are also documented protocols for recovery from a
disaster, which were missing during the recovery from Katrina. (Anthes, 2008)
The recovery plan has also been amended to cover more than hurricanes and IT
staff now participates in preparedness planning. (Anthes, 2008) (Gerace, Jean, &
Krob, 2007)
As of 2008, Tulane had a contract with SunGard mobile data center for
emergencies. (Anthes, 2008) Katrina's effect on the New Orleans backup data
center made it clear that they needed to maintain backups at a more distant
location; as a result, "backups are taken to Baton Rouge 3 times a week". (Anthes,
2008) Employees have been provided with USB storage devices to prepare
personal backups for emergencies. (Anthes, 2008) An alternate recovery site has
center at Tulane. (Lord, 2008) “Energy efficient systems were installed in the
simple e-mail” and emergency updates to the website can be published directly
the media to track potentially disastrous hurricanes, Tulane has enlisted a private
notebook computers” which can facilitate continuity during a disaster and the
university now has online classes. (Gerace, Jean, & Krob, 2007) (Lord, 2008)
4.3.2.5. Discussion
many things they did right and in the end they recovered. It is debatable if the
plan for offsite disaster recovery would have been worth the investment in
dollars. Itemized financial reports for Tulane were not available for review. It is
clear that the absence of an offsite recovery contract was a deliberate financial
decision. (Lawson, A Look Back at a Disaster Plan: What Went Wrong and Right,
2005)
prone area especially considering that the destruction of the levee was a known
risk. (Kantor, 2005) This decision also created additional stress for Tulane’s staff
and students. Tulane did an excellent job of recovering payroll to ensure their
staff was not without desperately needed financial resources. The medical
students were also well cared for thanks to the help of outside partnerships. The
continued medical program would not have been possible had there not been an
Unfortunately, the loss of Tulane’s data center made for a difficult fall 2005
semester for most students. They not only had to relocate, but were without
financial or academic records from Tulane. For those students the approximately
$300,000 per year expenditure would have provided some peace of mind.
(Lawson, A Look Back at a Disaster Plan: What Went Wrong and Right, 2005) As a
result of this as well as other adverse conditions at Tulane, many students did
not return. In 2008, enrollment at Tulane was down by 5,300 students from its
pre-Katrina numbers. (Lord, 2008) This resulted in financial distress for Tulane
and the closing of its engineering school and consolidation of other programs and
Katrina forced the University to close. This ensured not only the survival of
Tulane, but the revival of New Orleans as well. The medical students and
hospital provided much needed health care for New Orleans residents. Students
(Brown, 2008) The damage caused to Tulane and New Orleans was beyond the
recover from a complete loss of IT and infrastructure were proven to be the most
valuable in this case. No one institution was capable of recovering New Orleans,
4.3.3. Conclusion
Tulane has learned from Katrina how to protect the data that is the
lifeblood of the university. The aftermath of Katrina has also made clear that the
students are Tulane’s customers and they cannot survive without them. Further
providing for the communities they are a part of in times of disaster, this was true
provider and educator, Tulane has persevered and shored up its weaknesses and
4.4. Commonwealth of Virginia
4.4.1. Background
state agency charged to ensure the state’s information technology needs and the
This was to be the flagship partnership to show that the Public sector
However, "(d)elays, cost increases and poor service have dogged the state's
largest-ever outsourcing contract, the first of its kind in the country". (Schapiro &
Bacque, Agencies' computers still being restored, 2010) Virginia had entered the
contract with the expectation that the contract would provide modernized
services for the "same cost as maintaining their legacy services." (Stewart,
2006) At this point, the state no longer expects to see any cost savings under the
original contract period, but hopes that savings will be realized under an extended
Since the beginning of the contract with Northrop Grumman, the state of
Virginia has suffered two major outages. In addition, the state paid an additional
July 2009, is significantly behind schedule. There have been ongoing issues with
Until the latter part of March 2010, VITA could make changes to the
contract with Northrop Grumman without consulting with the General Assembly.
Investment Board (ITIB) is charged with oversight of VITA, but could not provide
full-time oversight. The members of ITIB attend meetings irregularly and lack the
eliminate the ITIB. VITA and the State CIO now report to the Office of the
Governor. The new structure became effective March 16, 2010 after passing
The state has been plagued with a litany of service failures throughout the
contract with Northrop Grumman. In 2009, prison phone service failed and was
Service was restored six and a half hours later following an escalation request
2009) Another service failure, noted in the JLARC 2009 report, left the Virginia
State Police without internet access for three days. (Kumar & Helderman,
2009) On June 20, 2007, the state of Virginia suffered a widespread outage. (VITA,
(News Report, 2010) Thirteen percent of the state's file servers were unavailable
during the outage. (Schapiro & Bacque, Agencies' computers still being restored,
state's data center near Richmond, which caused 228 storage servers to go
government business, 2010) The hardware that failed was one of the SAN's two
Technology, the outage was "unprecedented" based on the "uptime data" on the
EMC SAN hardware that caused the widespread failure. (News Report, 2010)
"Officials also said a failover wasn't triggered because too few servers were
4.4.2.1. Ramifications
of Motor Vehicles (DMV) was the most visibly impacted agency. Drivers were
not able to renew licenses at the DMV offices during the outage, forcing the DMV
to open on Sunday and work through Labor Day to clear the backlog of expired
2010) Some drivers were ticketed for expired licenses before law enforcement
2010) According to Virginia State Police, while "they will not cite drivers whose
licenses expired during the blackout", unfortunately those that received tickets
must "go through the court system" to request relief. (Kravitz & Kumar, Virginia
addition, drivers who renewed licenses the day of the blackout will need to visit
the DMV again because the data and pictures from the transactions that day
were lost. (Schapiro & Bacque, Northrop Grumman regrets computer outage,
2010) This also increases the likelihood that some of the licenses and IDs
The DMV was not the only agency negatively impacted by the SAN
recipients will receive benefit checks up to two days late. Employees at this
agency also worked overtime to reduce and eliminate delays where possible.
(Schapiro & Bacque, Agencies' computers still being restored, 2010) Internet
services used by citizens to make child support and tax payments were
outage, 2010) "At the state Department of Taxation, taxpayers could not file
(Schapiro & Bacque, Agencies' computers still being restored, 2010) Three days
after the outage began, "(f)our agencies continue(d) to have 'operational issues'";
these agencies included the departments of Taxation and Motor Vehicles. Many
other agencies continued to suffer negative effects from the outage. (Schapiro &
4.4.2.2. Response
message. (Wikan, 2010) The cause of the error message was determined to be
that "one of the two memory boards on the machine needed replacement." (Wikan,
2010) "A few hours later, a technician replaced the board." (Wikan, 2010) Shortly
after the board was replaced, the storage area network (SAN) failed. It was later
discovered that the wrong board might have been replaced. (Wikan, 2010) "VITA
and Northrop Grumman activated the rapid response team and began work with
Work continued through the night to restore services but failed to
restore data access to affected servers. (Wikan, 2010) (VITA) Thursday, the SAN
The storage provider, EMC, determined that the best course of action is to
perform an extensive maintenance and repair process. VITA and Northrop
Grumman, in consultation, have determined this is the best way to
proceed. (VITA)
The 24 affected agencies were notified prior to the SAN shutdown to allow them
to take appropriate action. (VITA) SAN service was restored at "2:30 a.m. Aug.
27." (Wikan, 2010) Over half of the attached servers were operational Friday
morning. (VITA) VITA began working with the operational customers to confirm
VITA continued data restorations over the weekend; the DMV restore took "about
affected agencies were up and running". (VITA) However, three key agencies still
measures in place. Not only was there a "fault-tolerant" SAN, but also there
were magnetic tape backups and the staff had just performed recovery exercise
testing. The established relationship with the hardware vendor EMC brought
additional expertise to resolve this SAN outage. VITA also has two data centers,
Examination of the documents available on the VITA web site would imply
that every recommended best practice is being implemented and executed. The
SAN hardware used is best in class and has excellent reliability. VITA also had a
rapid response team whose mission was to reach incident resolution rapidly.
(Nixon, 2010) Yet, an outage in one system had serious negative impact on
several agencies and more importantly the citizens of Virginia for more than one
week.
This incident is one of many that the State of Virginia has suffered since
the beginning of the contract with Northrop Grumman. Professing use of industry
standards and best practices does not result in a reliable, stable
cyberinfrastructure. In this case, there was still a single point of failure that resulted in
avoid outages such as that which occurred in late August. Virginia made the
been ordered. Agilisys Inc. was chosen to conduct a 10-12 week audit beginning
4.4.2.5. Discussion
exactly what happened and extrapolate what should have been done. VITA
2010) The exercise involved restoring service after losing a data center.
Provided that the exercise was adequately rigorous, performing a restore for an
to be tape restoration and data validation. (Wikan, 2010) More emphasis should
Incidents resulting in partial data loss or corruption are far more likely than loss of
an entire data center. Activities that improve restoration time for data recovery
and technology enhancements that might improve recovery time. In this case,
the data recovery process from tape left the DMV unable to issue or update
driver’s licenses or IDs for a week. A data restoration exercise might have
revealed this weakness and another solution might have been put in place to
daily backups for agencies like the DMV, the Department of Taxation, and Child
Support. Loss of payment records for the latter two agencies would cause major
inconveniences and bad press. Loss of four days' identification data for licenses
and IDs is inexcusable. The root of this decision likely lies in the bottom line,
storage. This is advisable for all high-availability databases and might have
avoided the data loss and corruption that occurred. One possible mechanism is
confirmation that the data was written to the SAN. The local copy would then be
held until the backup copy is confirmed as processed. This would entail
local daily backups for daily transactions is also an advisable practice to avoid
loss of records.
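As a rough sketch of the confirmation mechanism described above (hypothetical, and not VITA's actual implementation), a writer could keep a local copy of each record queued until the SAN acknowledges the write:

```python
import queue

class ConfirmedWriter:
    """Hold a local copy of each record until the SAN confirms the write.

    Hypothetical illustration: san_write() stands in for whatever call the real
    storage layer exposes; here it simply reports the acknowledgement we feed it.
    """

    def __init__(self):
        self.pending = queue.Queue()   # local copies awaiting confirmation

    def san_write(self, record, acknowledged):
        # Placeholder for the real storage call; True means the SAN confirmed.
        return acknowledged

    def write(self, record, acknowledged=True):
        self.pending.put(record)                   # keep the local copy first
        if self.san_write(record, acknowledged):   # release it only on confirmation
            self.pending.get()
        # unconfirmed records stay queued for retry or a local backup run

writer = ConfirmedWriter()
writer.write({"license_id": "A123"}, acknowledged=True)
writer.write({"license_id": "B456"}, acknowledged=False)   # simulated lost write
print("records still held locally:", writer.pending.qsize())  # -> 1
```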
(BCV), essentially a regularly scheduled copy between the two SANs. This
creates mirrored storage systems, with hot scheduled copies occurring every minute
for example, using technology such as Oracle Data Guard or SQL Server
mirroring and log shipping. Most database engines have a way to replicate
physical hardware in order to eliminate data loss. The use of both options would
provide layered protection.
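To make the database-level option concrete, the sketch below simulates the log-shipping idea with plain files: transaction-log segments are copied to a standby location and replayed in order. It is a conceptual Python illustration only, not Oracle Data Guard or SQL Server syntax, and all paths and records are invented.

```python
import json, pathlib, shutil, tempfile

BASE = pathlib.Path(tempfile.mkdtemp())   # scratch area for the example
PRIMARY_LOGS = BASE / "primary_logs"      # hypothetical locations
STANDBY_LOGS = BASE / "standby_logs"
STANDBY_DB = {}                           # stand-in for the standby database

def write_transaction(seq, record):
    """Primary side: append each transaction to a numbered log segment."""
    PRIMARY_LOGS.mkdir(exist_ok=True)
    (PRIMARY_LOGS / f"{seq:06d}.log").write_text(json.dumps(record))

def ship_and_apply():
    """Standby side: copy any new log segments and replay them in order."""
    STANDBY_LOGS.mkdir(exist_ok=True)
    for seg in sorted(PRIMARY_LOGS.glob("*.log")):
        dest = STANDBY_LOGS / seg.name
        if not dest.exists():
            shutil.copy(seg, dest)                            # "ship" the log
            STANDBY_DB.update(json.loads(dest.read_text()))   # "restore" it

write_transaction(1, {"license": "A123", "status": "renewed"})
write_transaction(2, {"license": "B456", "status": "issued"})
ship_and_apply()
print(STANDBY_DB)   # the standby now reflects both transactions
```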
The fact that "too few servers were involved" to trigger failover is baffling.
(News Report, 2010) Any fault with the potential to incur the impact experienced
by this outage should initiate a failover. The IT staff should have initiated a
manual failover prior to making the SAN repair for the initial hardware failure.
This suggestion assumes that the failover would have eliminated the
dependence on the faulty SAN. In addition, if the SAN was still operating, why did
the technician perform the repair during business hours? The technician should
have created a cold backup to tape prior to doing the off-hours repair. The
technician should have been aware the backup had not occurred for four days
and understood the potential data loss that could result. (Availability Digest,
2010)
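The precautions suggested above reduce to a short pre-maintenance checklist. The sketch below is a hypothetical illustration; the one-day backup-age threshold and the function names are assumptions, not VITA procedure.

```python
from datetime import datetime, timedelta

MAX_BACKUP_AGE = timedelta(days=1)   # assumed policy: a backup less than a day old

def safe_to_repair(last_backup, failover_ready, business_hours):
    """Return (ok, reasons): refuse repair until the basic precautions are met."""
    reasons = []
    if datetime.now() - last_backup > MAX_BACKUP_AGE:
        reasons.append("take a cold backup first; last backup is too old")
    if not failover_ready:
        reasons.append("fail over to the standby SAN before touching hardware")
    if business_hours:
        reasons.append("defer non-urgent repair to an off-hours window")
    return (not reasons, reasons)

ok, reasons = safe_to_repair(
    last_backup=datetime.now() - timedelta(days=4),   # mirrors the four-day-old backup
    failover_ready=False,
    business_hours=True,
)
print("proceed" if ok else "hold:", reasons)
```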
VITA's staff may need additional training to help them identify situations
that call for a manual backup as well as situations that can wait for after-
hours repair. It is likely that required change management processes were not
principles, the SAN repair would have been subject to a change management
the problem, the proposed fix, and the steps to be taken. Affected customers,
process owners, and the change management board (or equivalent) should have
been notified. Either there was no change request, no one reviewed the change
request, the request was not understood, or the proposed steps were not
executed.
Monitoring tools may also have played a role in this outage. The IT staff
either ignored alerts, did not understand them, or had the monitoring tools
incorrectly configured. Monitoring alerts should have notified the staff of the
problem, identified which SAN controller was having the problem, and alerted
staff of failed write attempts to the networked storage. Additional training could
have ensured properly implemented monitoring tools and the IT staff's ability to
respond to the alerts appropriately.
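A minimal sketch of such a check, assuming hypothetical status feeds for the SAN controllers and a count of failed writes (none of these names come from VITA's or Northrop Grumman's tooling):

```python
def check_san(controllers, failed_writes, max_failed_writes=0):
    """Return alert messages naming the failing controller and any failed writes."""
    alerts = []
    for name, status in controllers.items():
        if status != "ok":
            alerts.append(f"SAN controller {name} reports '{status}'")
    if failed_writes > max_failed_writes:
        alerts.append(f"{failed_writes} write attempts to networked storage failed")
    return alerts

# Simulated inputs standing in for a real monitoring feed.
alerts = check_san({"controller-a": "memory board fault", "controller-b": "ok"},
                   failed_writes=12)
for a in alerts:
    print("ALERT:", a)   # a real system would page on-call staff here
```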
important to balance the cost savings with the risk being taken; the savings may
not justify the risk for many governmental organizations. Strategic long term
requirement. Perhaps the DMV data is not a place to cut corners. Distributed
servers were implemented to avoid single points of failure, and while even a
distributed system is not free from failure, the possibility of widespread failure
may claim the SAN outage is unprecedented, but they do not claim this is the first
quality service due to poor strategic planning. Northrop Grumman has enough
services for the state of Virginia that allow a single hardware failure to cause
Ultimately, the outage was a result of human error. Human error will occur,
but sound processes should keep human error from escalating into a fiasco. Training would help with human error.
Training is an ongoing process that must be maintained along with the process of
constant improvement. Northrop Grumman, EMC, and VITA share the blame for
of uncertainty.
that specializes in IT should result in lower cost due to bulk discounts, enhanced
services, and access to high-quality IT staff. The total costs of outsourcing IT
should go down over time due to falling hardware costs. (Lee) These expected
customer, in this case the citizens of Virginia. The quality of the partnership
should be reviewed using the dimensions of fitness for use and reliability. (Lee &
Kim, 1999) The events of the last few years have shown that the service that
Poorly strategized and executed services have not only cost Virginians,
but have been a source of inconvenience and delay. Some Virginians had to go
to court to combat expired-license tickets; those who cannot find the time to do
this may also face increases in insurance premiums. These issues seem small
DMV just prior to the outage. These licenses are legal and nearly untraceable
and could fetch high prices on the black market. Also, consider the safety of
those working in prisons without phone service for hours. The phone outage
The 10-year contract with Northrop Grumman has left little possibility to
exit the contract and request new outsourcing bids. Virginia recently reviewed
the partnership and it was decided that it was too costly to exit the contract.
Northrop Grumman argued that Virginia did not provide them with adequate
access to information that would have allowed them to create a realistic refresh
schedule and budget. Virginia denied this, but agreed to extend the project
timeline and paid an additional $236 million to cover the hardware refresh.
(Schapiro & Bacque, Agencies' computers still being restored, 2010) This was
done in part for political reasons. Northrop Grumman agreed to move their
headquarters to Virginia. (Squires, 2010) Virginia hopes to create new jobs and
get better service. Meanwhile, Northrop Grumman will pay out approximately
infancy. There are few who understand both IT and law well enough to write or
defend the contract properly. A less lengthy agreement may have been best for
embarking on the hardware refresh. Perhaps the hardware refresh should have
least, an exit clause that would allow Virginia to exit the contract without risking
the waste of millions of public funds would be advisable. Public safety and
security are too important to place in the hands of a single provider without any
appears to have too much wiggle room to make Northrop Grumman accountable
for failures.
actions to see that this never happens again. Further, that Northrop Grumman is
held accountable in a manner that motivates them to stop ignoring issues raised
provide high quality services. Northrop Grumman is responsible for their vendors,
services and well-trained staff. Taxpayers should no longer pay for the
The best protection for Virginians may lie in contract law. Future
outsourcing contracts should not favor the vendor and exploit the state. Referring
situation cannot be a true partnership because business motives are not shared.
(Lee) The outsourcing contract should have clearly defined service level
agreements and failure to meet these expectations should result in equally clear
penalties. These penalties should have enough financial impact to ensure the vendor does not determine that paying the penalty fees makes better financial sense than providing the contracted services. The contract between Virginia and
Northrop Grumman has exit penalties that are too expensive to be a feasible option. Without contracted services that provide value to the citizens of Virginia, the contract cannot be considered a success. It appears that a vendor contract provided at least the basis for the outsourcing contract. The use of vendor
contracts "even as a starting point" is highly inadvisable because the contract will
favor the vendor. (Lee, p. 13) This problem is illustrated in the case of the Virginia contract.
Once the contract is in effect, it must be strictly managed by an auditing team charged with conducting ongoing service reviews of the vendor. This oversight will unfortunately require additional expense, but auditing activities will ensure that the outsourcing organization realizes the expected value of the contract.
4.4.3. Conclusion
Virginia's August 2010 outage provides a case study illustrating the risks of outsourcing critical IT services. The contract has not provided effective recourse to enforce its terms. VITA also failed to manage the contract closely enough to ensure the vendor delivers quality services that meet business requirements. This means investing in auditing to ensure that the vendor is taking appropriate
CHAPTER 5. ANALYSIS
various mitigation techniques. Tulane's investment in backup tapes paid off, but the investment in an offsite data center did not. The factor that contributed most
vendors. This type of relationship has proven very useful in sectors such as
5.1.1. Before-Planning
Before-phase planning begins with determining the maximum tolerable period of disruption (MTPOD), recovery time objectives (RTO), and recovery point objectives (RPO). MTPOD relates to how long the organization can be "down" before its viability is damaged. The case studies provide an array of tolerances, as shown in Table 5.1, which inform the choice of mitigation techniques to implement.
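As an illustration only (this sketch is not drawn from Table 5.1; the organization, hour values, and candidate techniques below are hypothetical), the three objectives can be treated as screening constraints that any candidate mitigation technique must satisfy:

# Illustrative Python sketch: screening candidate mitigation techniques
# against an organization's MTPOD, RTO, and RPO. All values are hypothetical.
from dataclasses import dataclass

@dataclass
class Objectives:
    mtpod_hours: float  # maximum tolerable period of disruption
    rto_hours: float    # target time to restore service
    rpo_hours: float    # tolerable window of data loss

@dataclass
class Technique:
    name: str
    recovery_hours: float   # expected time to restore service
    data_loss_hours: float  # expected window of lost data

def acceptable(t: Technique, o: Objectives) -> bool:
    # A technique qualifies only if it restores service within the RTO
    # (and therefore within the MTPOD) and loses no more data than the RPO allows.
    return (t.recovery_hours <= min(o.rto_hours, o.mtpod_hours)
            and t.data_loss_hours <= o.rpo_hours)

objectives = Objectives(mtpod_hours=24, rto_hours=8, rpo_hours=1)  # hypothetical
candidates = [
    Technique("nightly tape backup restored at the primary site", 48, 24),
    Technique("hot site with near-real-time replication", 2, 0.25),
]
for c in candidates:
    print(c.name, "meets objectives" if acceptable(c, objectives) else "fails objectives")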
Table 5.1 above reflects estimated MTPOD, RTO, and RPO for each
organization based on artifacts included in each case study. These estimates are discussed below. Commerzbank's MTPOD is estimated at less than one week. One week was chosen as the point at which the viability of the bank would be damaged; at that point customers would switch to a competitor. Any outage would be costly for Commerzbank, but a weeklong outage would damage the bank's reputation and viability. Customers may tolerate brief outages, but when an outage impacts their ability to be profitable, they must look elsewhere. Commerzbank operates in a highly competitive sector and would therefore have difficulty recovering from
customer loss.
For a utility such as FirstEnergy, any outage will immediately inconvenience the customer base. Also, outages result in lost revenue because electricity cannot be stored for later use. Extended outages strain other providers and potentially result in cascading critical service outages. There are now mandatory guidelines as well, and failure to meet these guidelines can result in fines.
FirstEnergy is investor owned; therefore, outages would reduce the value of company shares. Investors sued FirstEnergy for lost revenue in the past and could potentially do so again. All of these factors were included in the ½ hour estimate of MTPOD for
FirstEnergy.
Tulane University has weathered hurricane seasons for more than a century. The case study artifacts revealed a repeating theme in this case: hurricanes had become routine. The general thought was to send everyone away for a few days, return and clean up when the storm passes, and get back to business as usual. This reveals that outages of "a few days" had no real impact on the organization. However, a longer outage would damage operations, most notably the ability to provide the university's primary service, education. The university hospital was not included in this estimate, only the university itself. Including the hospital would reduce the MTPOD to hours or less due to possible loss of life.
Loss of life will not necessarily result in irrevocable viability damage to the
organization, but must be avoided at all costs and therefore would be heavily
weighted.
Virginia's estimate is more complex. Some services, such as 911, are critical infrastructure and cannot be down without compromising public safety. Other services may suffer very little during an extended outage. Obviously, prison guards should never be without phone service. However, do any of these factors really damage the viability of the state? It would be very hard to argue that they do. This estimate comes down to cost and public impact. Public impact was weighted most heavily. Also, the state's IT was outsourced; therefore, impact to the viability of Northrop Grumman must be included. To date there has been little impact on Northrop
Grumman, but possible contractual changes made after the conclusion of the
third party investigation may have greater impact. Recovery time objectives were
data including transactional data. FirstEnergy stands out in this group with a not
applicable (N/A) rating on Table 5.1. This is based on the assumption that for FirstEnergy, historical data is important to prediction and future planning as well as tool development, but loss of this data would have little operational impact, as other data sources could be used, as noted above. Commerzbank's tolerances and objectives make it apparent that they must avoid virtually all
downtime. Expenditures in IT to ensure this are warranted and practical for their
organization. They can afford to make the necessary investments, and downtime
is far too costly. The case study artifacts reveal that Commerzbank is actively
Tulane is a good example of an organization with all the right pieces that
failed due to poor placement. Tulane had tape-based backup and recovery, which were appropriate for their budget and MTPOD. The backup data center was new and not fully complete, but location was the problem. It was near enough to be affected by the same storm. They were lucky the building's upper floors, where the tapes were located, were not flooded, allowing retrieval of the backup tapes. This site at the time would
have been a warm site at best; strategic placement would have made this site a
major asset.
An emergency operations center (EOC) and backup data center could minimally have provided
planning should include a backup data center that allows virtual operations where
very small with very few employees. Payroll and billing functions would be very simple and probably paper-based. Even in these circumstances, multiple copies of critical records should be maintained to avoid lost revenue or liability issues. Organizations in this category are not
Staff training levels are more apparent in some of the cases than others.
For example, FirstEnergy staff was inadequately trained and there was poor communication. Little direct evidence of Commerzbank's quality of staff training was available. However, the fact that
employees began assembling at the backup site, in the midst of the chaos, speaks to the quality of their training.
Again, these are the two most extreme examples, but training is the
difference between staff that fail to perform and those that coolly navigate to safety from just a few hundred feet away from the largest terrorist attack in U.S. history. The stress levels between the two staffs during the first phases are hardly comparable. The well-trained staff under extreme stress performs very well with very little warning. The poorly trained staff
failed to act despite many warnings and hours to act. FirstEnergy staff
correlation with the success of continuity and recovery efforts. FirstEnergy and Virginia stand out; both organizations could have completely averted disaster had the staff followed
procedures.
Forty minutes before the outage, the operators knew the monitoring equipment
was not working and still failed to take corrective action. Established internal procedures were not followed. FirstEnergy's IT support staff was aware of the problems with the EMS, but did not alert the operators to the issue. This communication was not required at the time of the studied incident, but was later addressed. However, the primary cause of the outage was failure to follow procedure. As a result, some areas were without power for up to a week.
In the Virginia case, it appears that ITIL standard practices were not followed. Artifacts indicate the use of ITIL for this organization; therefore, ITIL adherence was used as the basis for evaluation.
ITIL specifies standards for communication during incidents and also focuses on
review at the completion of this study. After the independent review is complete,
However, it is not disputable that a minor hardware problem, which was not itself an outage, was handled inappropriately. This resulted in a weeklong outage for many state agencies. Tulane, by contrast, had a plan and a staff that was trained and comfortable with the procedures. They also had
the luxury of knowing days ahead that the hurricane was coming. The execution
went according to plan for the most part. There were critical parts of the plan left
unexecuted; the payroll printer and related materials were not taken to safety.
This failure further complicated the task of issuing payroll and likely added to the cost of recovery. Commerzbank's preparation prevented potentially billions in lost revenue. The transactions system never went down during the
events of 9/11. Despite the loss of primary facilities and unforeseen technical issues, the bank was fully operational within hours. Commerzbank serves as a model for business continuity and disaster recovery planning in the financial sector.
All of the organizations included in the study had well-established chain-of-command communication structures. Some were more effective than others for a
disruptions due to the magnitude of the disasters and the resulting damage to
difficulties. The impact of Katrina was so severe that the impact to the
infrastructure of New Orleans was prolonged and the duration of the disaster
Tulane's critical staff members now carry cell phones from more than one carrier. Tulane also maintains a computer incident response plan, which follows many principles from the National Incident Management System. (Tulane University, 2009) This plan defines roles, incident phases, and incident levels, which delineate which roles are activated. (Tulane University, 2009)
contacts listed are cumulative; for example, if a level 3 incident were to occur, the
staff, and process owners would be contacted. Each role activated would have a
responsibilities checklist to be used for specific level incidents. NASCIO has a
communicate as the voluntary industry standards of the time dictated. There was
no apparent deviation from the chain of command in the case of the Virginia
outage. Though it would be safe to speculate that the independent review will
was apparent in all of the case studies. Each continuity or recovery effort was
additional resources was integral to recovery success and reduced the duration
of the outage in most of the cases. The assistance Baylor provided Tulane was
vital to the future viability of Tulane. The relationships utilized are represented in
[Table of aid relationships utilized in the case studies; one example pairing is Commerzbank – EMC.]
mandatory third party incident review to determine what steps were necessary to
prevent future incidents. Commerzbank and Tulane were unhappy with the response and recovery provisions in place at the time of the incident, and have since made improvements.
5.1.3.1.1. Downtime
There is no simple way to determine the cost of downtime for an organization, nor is there a simple way to determine the cost of recovery. These figures vary based on the sector and other organizational factors. Organizations that have experienced disaster recovery events have not made the financial ramifications available to the public, including those in this study. Further, most literature and tools available to aid in determining these costs and return on investment (ROI) are provided by commercial entities that are attempting to sell related products and services, which gives the figures questionable validity.
For the purpose of this study, a combination of recent studies is used to estimate these costs. One 2010 study claims "the average North American organization loses over $150,000 a year" to downtime. Another source reports a median downtime cost of $3,000 per day for small businesses and $23,000 per day for medium-size businesses. Based on these figures it would not be
cost effective for small organizations to invest in fully redundant systems. However, the losses are still substantial, and investment in daily data backup would be warranted despite the increased costs. Some sectors, such as utility, financial, and parts of the public sector, have regulatory standards that must be met, and downtime could result in fines as well as lost revenue.
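As a rough illustration of how quickly these medians accumulate, the figures cited above can be annualized; the number of downtime days per year used below is a hypothetical assumption, not a value from the cited studies:

# Rough annualized downtime loss using the medians cited above
# ($3,000/day for small and $23,000/day for medium-size businesses).
# The assumed five downtime days per year is a hypothetical input.
median_daily_cost = {"small": 3_000, "medium": 23_000}
assumed_downtime_days_per_year = 5  # hypothetical assumption

for size, daily_cost in median_daily_cost.items():
    annual_loss = daily_cost * assumed_downtime_days_per_year
    print(f"{size} business: about ${annual_loss:,} lost per year")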
As a result, recovery times have increased by 1.5 hours. (Dines, 2011) The average application and data classifications reported are shown in Figure 5.4 below. As organizations classify more applications and data as critical, the cost of resiliency rises. As tolerance for
downtime decreases, the cost of resiliency also rises. Economic realities dictate
that most organizations cannot maintain redundancy of all applications and data.
There are many other ways to compute how much to invest in IT business continuity; most are far more complex. A 2010 Forrester study found that respondents devote roughly six percent of IT operating and capital budgets to BC/DR. (Balaouras, 2010) It is important to note that many functions fall under the umbrella of IT operational resiliency; organizations must balance this spending against risk to ensure profitability. These equations are outside of the scope of this qualitative study. However, this study will use a 2010 Forrester market study for
and failed over to an alternate site in the past five years”, this yields a 4.8 percent
(Dines, 2011) The average cost of downtime per hour was $145,000 and
Multiplying the average cost per hour by the average recovery time yields a disaster cost of $536,500 over a five-year time period. Multiplying this by the risk probability of 4.8 percent yields $25,752. These figures provide a range of $25,752 to $536,500. The average of the two is $281,126; this figure, based on the Forrester data, serves as the estimated yearly investment. The overall 5-year budget would be $1,405,630. The first year would likely be dedicated to reviewing organizational needs and looking for cost-effective ways
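The estimate above can be reproduced step by step. The sketch below simply restates the Forrester-based figures already given in the text ($145,000 per hour, a 4.8 percent risk probability, and a $536,500 per-disaster cost, which together imply roughly 3.7 hours of recovery time); it is a worked example, not an additional data source:

# Worked reproduction of the Forrester-based budget estimate in the text.
# All dollar figures appear in the passage above; the ~3.7-hour recovery
# time is implied by $536,500 / $145,000 per hour.
cost_per_hour = 145_000    # average cost of downtime per hour
disaster_cost = 536_500    # cost of a single disaster
risk_probability = 0.048   # 4.8 percent probability of declaring a disaster
years = 5

expected_loss = disaster_cost * risk_probability          # $25,752
yearly_investment = (expected_loss + disaster_cost) / 2   # $281,126
five_year_budget = yearly_investment * years              # $1,405,630
implied_recovery_hours = disaster_cost / cost_per_hour    # about 3.7 hours

print(f"expected loss: ${expected_loss:,.0f}")
print(f"yearly investment: ${yearly_investment:,.0f}")
print(f"five-year budget: ${five_year_budget:,.0f}")
print(f"implied recovery time: {implied_recovery_hours:.1f} hours")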
5.1.3.2. Findings
When redundant systems are integrated into daily processing, there is less need for human intervention, thus requiring less manpower to recover. This also allows response and recovery to begin immediately. In life-threatening situations, staff can focus on evacuation. Possible liability issues related to both staff and external stakeholders can be reduced by removing the question of due diligence. Another advantage is that testing is far less disruptive because the
Integrating redundant systems into daily processing does not mean that all systems must be fully duplicated. Tiering applications and data can reduce costs. For example, in the case of FirstEnergy, access to past data is not business critical, so investment in recovery of that data can be reduced. Lower-cost, tape-based storage and recovery methods are fine. However, availability of real-time operations applications and data is critical to the mission and worth the money
spent. A raw order of magnitude estimate for this would place such a system in
Gartner reported the cost of a tier IV data center to be about $3,450 per square foot, or $34.5 million for a 10,000-square-foot data center. (Cappuccio, 2010) A tier IV data center, according to Gartner, would provide less than a half hour of downtime a year. (Cappuccio, 2010) However, the risk and
outage costs are high enough to justify such an investment. The losses of the
avoid outages.
process. The Virginia case also provides an example of the hazards of risk transference; outsourced services must be closely monitored. The outsourcing party must ensure the power to enforce meaningful penalties as well as regular failover testing and plan updates. Disaster recovery and business continuity
CHAPTER 6. CONCLUSION
The research question this study endeavored to answer is “What are best
practices before, during, and after disaster recovery execution?” The multiple
case study best practice analysis indicates that disaster recovery is one part of a larger, iterative business continuity process. This process is broken down into three distinct phases: before, during, and after disaster recovery execution. Strategic planning occurs during the before phase. This planning includes determining the MTPOD, RTO, and RPO to help select appropriate mitigation techniques. Training and testing occur during the before phase as well. Best practice during the disaster recovery execution phase involves following established procedures; communication and utilization of aid relationships are important elements of the during phase. In the after (post-recovery) phase, best practice involves reviewing the situation and response to identify areas that need improvement. The after phase will not only help plan future mitigation but also identify supporting government policy needs in critical infrastructure sectors. The iterative cycle then begins again in the before phase.
The purpose of this research was "to bridge the gap of unmet cyberinfrastructure resiliency needs." An assumption was made that the high cost of implementation was the most significant barrier. While this may be true,
surprisingly, the two most avoidable disasters did not occur because of any direct
lack of funds. Virginia had already allocated the funds, and the company it outsourced to is a multibillion-dollar corporation. FirstEnergy is a large utility company and was a Fortune 500 company prior to the 2003 Northeast Blackout. In
these two cases it is arguable that lack of management oversight and urgency
was the motivation for any lack of funds allocated. Both organizations understood
and implemented backup equipment, but failed to ensure that all mitigation
measures were followed. Commerzbank and Tulane are veterans in dealing with disasters. Some disasters are caused by nature, others by human error; both are illustrated in this study. Required preparations are easy to put off. Car insurance, home insurance, and retirement savings are all forgone by most unless it is required by law. The return on investment in preparedness is difficult to grasp. Add to this the tendency to disregard possible calamity, and the result is chronic underinvestment in preparedness.
This study has revealed areas in need of further research. Two are well-known issues with ongoing research. These areas are educating a workforce skilled in security and resiliency and eliminating dependencies upon third parties for power. The areas of hydrogen fuel cells and solar power continue to leap forward and may provide the power independence needed. A third area is quantifying the probability of disaster and the cost of such events. Tables based on industry, size, and location would allow organizations to assess these costs in a meaningful manner.
The current, mostly unregulated, IT climate values fast over safe. Companies must move fast to push out the next new product. Little time is spent focused on ensuring security and resiliency. This will likely continue until minimum regulations are in place. These policies and the ability to enforce them will be very helpful to organizations that want to be secure and resilient, but are struggling with vendors. Related to this is IT contract law; this field desperately needs further development.
BIBLIOGRAPHY
Associated Press. (06 20-1). FirstEnergy to pay $28M fine, saying workers hid
damage. Retrieved 11 5-1 from USA Today:
http://www.usatoday.com/news/nation/2006-01-20-nuke-plant-fine_x.htm
Associated Press. (03 19-11). Investigators pin origin of Aug 2003 blackout on
FirstEnergy failures . Retrieved 11 6-1 from Windcor Power Systems.
Availability Digest. (2010 10). The State of Virginia – Down for Days. Retrieved
2010 8-11 from www.availabilitydigest.com:
http://www.availabilitydigest.com/public_articles/0510/virginia.pdf
Balaouras, S. (2010 2-9). Business Continuity And Disaster Recovery Are Top IT
Priorities For 2010 And 2011 Six Percent Of IT Operating And Capital
Budgets Goes To BC/DR. Retrieved 2011 7-2 from Forrester.com:
http://www.forrester.com/rb/Research/business_continuity_and_disaster_r
ecovery_are_top/q/id/57818/t/2
Barovik, H., Bland, E., Nugent, B., Van Dyk, D., & Winters, R. (2001 26-11). For
The Record Nov. 26, 2001. Retrieved 11 13-1 from Time:
http://www.time.com/time/magazine/article/0,9171,1001334,00.html
Barron, J. (2003 15-8). Power Surge Blacks Out Northeast. Retrieved 2009 2-11
from New York Times:
http://www.nytimes.com/2003/08/15/nyregion/15POWE.html
Blackboard Inc. (2008 24-10). Blackboard & Tulane University. Retrieved 10 27-
12 from Blackboard:
http://www.blackboard.com/CMSPages/GetFile.aspx?guid=39a0b112-
221d-4d04-be80-f2024d16943a
Brown, K. (2008 1-2). House No. 3 Rises for URBANbuild. Retrieved 2011 2-1
from Tulane University New Wave:
http://tulane.edu/news/newwave/020108_urbanbuild.cfm
Cappuccio, D. (2010 17-3). Extend the Life of Your Data Center, While Lowering
Costs. Retrieved 2011 28-1 from Gartner:
http://www.gartner.com/it/content/1304100/1304113/march_18_extend_lif
e_of_data_center_dcappuccio.pdf
Caralli, R., Allen, J., Curtis, P., White, D., & Young, L. (2010 5). CERT®
Resilience Management Model, Version 1.0 Process Areas, Generic
Goals and Practices, and Glossary. Hanscom AFB, MA.
Comptroller of the city of New York. (02 04-9). One Year Later, The Fiscal Impact
of 9/11 on New York City. Retrieved 11 13-1 from The New York City
Comptroller's Office:
http://www.comptroller.nyc.gov/bureaus/bud/reports/impact-9-11-year-
later.pdf
Cowen. (n.d.). Letter to students. Retrieved 2010 27-12 from Tulane University:
http://renewal.tulane.edu/students_undergraduate_cowen2.shtml
Cowen, S. (05 2-9). Messages for Students . Retrieved 2010 27-12 from Tulane
University : http://www.tulane.edu/studentmessages/september.html
Egenera. (2006). Case Study: Commerzbank North America. Retrieved 2011 3-1
from Egenera: www.egenera.com/1157984790/Link.htm
EMAC. (n.d.). The History of Mutual Aid and EMAC. Retrieved 2011 20-2 from
EMAC: http://www.emacweb.org/?321
FEMA. (n.d.). Incident Command System (ICS). Retrieved 2011 20-2 from
FEMA:
http://www.fema.gov/emergency/nims/IncidentCommandSystem.shtm
Forrester, E. C., Buteau, B. L., & Shrum, S. (2009). Service Continuity: A Project
Management Process Area at Maturity Level 3. In E. C. Forrester, B. L.
Buteau, & S. Shrum, CMMI® for Services: Guidelines for Superior Service
(pp. 507-523). Boston, MA: Addison-Wesley Professional.
From Reuters and Bloomberg News. (03 19-8). FirstEnergy Shares Fall After
Blackout. Retrieved 11 6-1 from Los Angeles Times:
http://articles.latimes.com/2003/aug/19/business/fi-wrap19.1
Gerace, T., Jean, R., & Krob, A. (2007). Decentralized and centralized it support
at Tulane University: a case study from a hybrid model. In Proceedings of
the 35th annual ACM SIGUCCS fall conference (SIGUCCS '07). New
York: ACM.
Grose, T., Lord, M., & Shallcross, L. (2005 11). Down, but not out. Retrieved
2010 28-12 from ASEE PRISM: http://www.prism-
magazine.org/nov05/feature_katrina.cfm
Gulf Coast Presidents. (2005). Gulf Coast Presidents Express Thanks, Urge
Continued Assistance . Retrieved 10 27-12 from Tulane University:
http://www.tulane.edu/ace.htm
Jesdanun, A. (04 12-2). Software Bug Blamed For Blackout Alarm Failure.
Retrieved 11 6-1 from CRN:
http://www.crn.com/news/security/18840497/software-bug-blamed-for-
blackout-alarm-failure.htm?itc=refresh
Krane, N. K., Kahn, M. J., Markert, R. J., Whelton, P. K., Traber, P. G., & Taylor,
I. L. (2007 8). Surviving Hurricane Katrina: Reconstructing the Educational
Enterprise of Tulane University School of Medicine. Retrieved 10 17-12
from Academic Medicine:
http://journals.lww.com/academicmedicine/Fulltext/2007/08000/Surviving_
Hurricane_Katrina__Reconstructing_the.4.aspx
Kravitz, D., & Kumar, A. (2010 31-8). Virginia DMV licensing services will be
stalled until at least Wednesday. Retrieved 2010 6-11 from
Washingtonpost.com: http://www.washingtonpost.com/wp-
dyn/content/article/2010/08/30/AR2010083004877.html
Lawson, J. (05 9-12). A Look Back at a Disaster Plan: What Went Wrong and
Right. Retrieved 10 28-12 from The Chronicle of Higher Education:
http://chronicle.com/article/A-Look-Back-at-a-Disaster/10664
Lewis, B. (n.d.). Massive Computer Outage Halts Some Va. Agencies. Retrieved
2010 5-11 from HamptonRoads.com:
http://hamptonroads.com/print/566771
McIntyre, D. A. (2009 2-9). Gmail's outage raises new concern about the Net's
vulnerability. Retrieved 2009 25-11 from Newsweek:
http://www.newsweek.com/id/214760
Mears, J., Connor, D., & Martin, M. (02 2-9). What has changed. Retrieved 11 4-
1 from Network World.
Midwest ISO. (n.d.). About Us. Retrieved 2011 28-3 from Midwest ISO:
http://www.midwestmarket.org/page/About%20Us
Minkel, J. (08 13-8). The 2003 Northeast Blackout--Five Years Later. Retrieved
11 6-1 from Scientific American:
http://www.scientificamerican.com/article.cfm?id=2003-blackout-five-
years-later
NASA. (2008 3). Powerless. Retrieved 2011 6-1 from Process Based Mission
Assurance NASA Safety Center:
http://pbma.nasa.gov/docs/public/pbma/images/msm/PowerShutdown_sfc
s.pdf
New York Independent System Operator. (2005 2). ISO. Retrieved 2010 17-3
from
http://www.nyiso.com/public/webdocs/newsroom/press_releases/2005/bla
ckout_rpt_final.pdf
News Report. (2010 1-9). Northrop Grumman Vows to Find Cause of Virginia
Server Meltdown as Fix Nears. Retrieved 2010 6-11 from Government
Technology: http://www.govtech.com/policy-management/102482209.html
Outsource IT Needs LLC. (n.d.). How Much Should You Spend on Disaster
Recovery? Calculating the Value of Business Continuity. Retrieved 2011
7-2 from Outsource IT Needs, LLC:
http://outsourceitneeds.com/DisasterRecovery.pdf
Patterson, D., Brown, A., Broadwell, P., Candea, G., Chen, M., Cutler, J., et al.
(2002). Recovery Oriented Computing (ROC): Motivation, Definition,
Techniques, and Case Studies. Computer Science Technical Report,
Computer Science Division, University of California at Berkeley, Computer
Science Department, Mills College and Stanford University; IBM
Research, Berkeley.
Petersen, R. (2009 9). Protecting Cyber Assets. Retrieved 2010 15-6 from
EDUCAUSE Review:
http://www.educause.edu/EDUCAUSE%2BReview/EDUCAUSEReviewMa
gazineVolume44/ProtectingCyberAssets/178440
Schapiro, J., & Bacque, P. (2010 28-08). Agencies' computers still being
restored. Retrieved 2010 5-11 from Richmond Times-Dispatch:
http://www2.timesdispatch.com/member-center/share-this/print/ar/476845/
Schapiro, J., & Bacque, P. (2010 3-9). Northrop Grumman regrets computer
outage. From Richmond Times-Dispatch:
http://www2.timesdispatch.com/news/state-news/2010/sep/03/vita03-ar-
485147/
Schapiro, J., & Bacque, P. (2010 2-9). Update: McDonnell lays out concerns to
Northrop Grumman. Retrieved 2010 8-11 from Richmond Times-Dispatch:
http://www2.timesdispatch.com/news/2010/sep/02/10/vita02-ar-483821/
Scherr, I., & Bartz, D. (2010 3-2). U.S. unveils cybersecurity safeguard plan.
Retrieved 2010 30-6 from Reuters:
http://www.reuters.com/article/idUSTRE62135H20100302
Scherr, I., & Bartz, D. (2010 2-3). U.S. unveils cybersecurity safeguard plan.
Retrieved 2010 13-4 from Reuters:
http://www.reuters.com/article/idUSTRE62135H20100302
Schwartz, S., Li, W., Berenson, L., & Williams, R. (2002 11-9). Deaths in World
Trade Center Terrorist Attacks --- New York City, 2001. Retrieved 11 13-1
from CDC: http://www.cdc.gov/mmwr/preview/mmwrhtml/mm51spa6.htm
Squires, P. (2010 2-9). Northrop Grumman to pay for cost of independent review.
Retrieved 2010 8-11 from virginiabusiness.com:
http://www.virginiabusiness.com/index.php/news/article/northrop-
grumman-to-pay-for-cost-of-independent-review/
Stewart, L. (2006 10-10). VITA Update to JLARC. Retrieved 2010 5-11 from
www.vita.virginia.gov: jlarc.state.va.us/meetings/October06/VITA.pdf
Swanson, A., Bowen, P., Wohl Phillips, A., Gallup, D., & Lynes, D. (2010 5).
Contingency Planning Guide for Federal Information Systems. NIST
Special Publication 800-34, Revision 1 . Gaithersburg, MD.
Swanson, M., Wohl, A., Pope, L., Grance, T., Hash, J., & Thomas, R. (2002
June). Contingency Planning Guide for Information Technology Systems
Recommendations of the National Institute of Standards and Technology
NIST Special Publication 800-34. Retrieved 2010 27-5 from Computer
Security Division Computer Resource Center, National Institute of
Standards and Technology: http://csrc.nist.gov/publications/nistpubs/800-
34/sp800-34.pdf
Testa, B. (2006 8). In Katrina’s Wake: Intensive Care for an Institution. Retrieved
2010 17-12 from Workforce Management:
http://www.workforce.com/section/recruiting-staffing/archive/feature-
katrinas-wake-intensive-care-institution/244929.html
The New York Times Company. (04 29-7). FirstEnergy settles suits related to
blackout. Retrieved 11 13-1 from NYTimes.com.
Thibodeau, P., & Mearian, L. (2005 9-12). After Katrina, users start to weigh
long-term IT issues. Retrieved 12 2010-15 from Computerworld:
http://www.computerworld.com/s/article/104542/After_Katrina_users_start
_to_weigh_long_term_IT_issues
Tulane University. (09 3). Tulane University Computer Incident Response Plan
Part of Technology Services Disaster Recovery Plan. Retrieved 2011 20-2
from Information Security @ Tulane:
http://security.tulane.edu/TulaneComputerIncidentResponsePlan.pdf
U.S.-Canada Power System Outage Task Force. (2004 April). Final Report on
the August 14, 2003 Blackout in the United States and Canada: Causes
and Recommendations. From https://reports.energy.gov
VITA. (2007 1-7). Network News Volume 2, Number 7 From the CIO. Retrieved
2010 6-11 from www.vita.virginia.gov:
http://www.vita.virginia.gov/communications/publications/networknews/def
ault.aspx?id=3594
VITA. (2010 1-6). Network News Volume 5, Number 6 . Retrieved 2010 27-11
from www.vita.virginia.gov:
http://www.vita.virginia.gov/communications/publications/networknews/def
ault.aspx?id=12080
NASCIO IT Disaster Recovery and Business Continuity Tool-kit: Planning for the
Next Disaster
http://www.nascio.org/publications/documents/NASCIO-DRToolKit.pdf
This detailed 259-page document covers resiliency management from a cross-disciplinary perspective. It includes best practices and CMMI-based generic goals and objectives to guide the process of planning and implementing operational resiliency.
These free online courses provide testing and certificates of subject proficiency. They cover a variety of topics such as emergency management, workplace violence, and preparedness.
Without the flow of electronic information, government comes to a standstill. When a state's data systems and communication networks are damaged and its processes disrupted, the problem can be serious and the impact far-reaching. The consequences can be much more than an inconvenience. Serious disruptions to a state's IT systems may lead to public distrust, chaos and fear. It can mean a loss of vital digital records and legal documents. A loss of productivity and accountability. And a loss too of revenue and commerce.

Disasters that shut down a state's mission critical applications for any length of time could have devastating direct and indirect costs to the state and its economy that make considering a disaster recovery and business continuity plan essential. State Chief Information Officers (CIOs) have an obligation to ensure that state IT services continue in the state of an emergency. The good news is that there are simple steps that CIOs can follow to prepare for Before, During and After an IT crisis strikes. Is your state ready?

Disaster Recovery Planning 101

Disaster recovery and business continuity planning provides a framework of interim measures to recover IT services following an emergency or system disruption. Interim measures may include the relocation of IT systems and operations to an alternate site, the recovery of IT functions using alternate equipment, or execution of agreements with an outsourced entity.

IT systems are vulnerable to a variety of disruptions, ranging from minor short-term power outages to more-severe disruptions involving equipment destruction from a variety of sources such as natural disasters or terrorist actions. While many vulnerabilities may be minimized or eliminated through technical, management, or operational solutions as part of the state's overall risk management effort, it is virtually impossible to completely eliminate all risks.

In many cases, critical resources may reside outside the organization's control (such as electric power or telecommunications), and the organization may be unable to ensure their availability. Thus effective disaster recovery planning, execution, and testing are essential to mitigate the risk of system and service unavailability. Accordingly, in order for disaster recovery planning to be successful, the state CIO's office must ensure the following:

1. Critical staff must understand the IT disaster recovery and business continuity planning process and its place within the overall Continuity of Operations Plan and Business Continuity Plan process.
2. Develop or re-examine disaster recovery policy and planning processes including preliminary planning, business impact analysis, alternate site selection, and recovery strategies.
3. Develop or re-examine IT disaster recovery planning policies and plans with emphasis on maintenance, training, and exercising the contingency plan.

NASCIO represents state chief information officers and information technology executives and managers from state governments across the United States. For more information visit www.nascio.org. Copyright © 2007 NASCIO. All rights reserved. 201 East Main Street, Suite 1405, Lexington, KY 40507. Phone: (859) 514-9153. Fax: (859) 514-9166. Email: NASCIO@AMRms.com
! CIOs need a Disaster Recovery and Business Continuity (DRBC) plan including: (1) Focus on capabilities that are needed in any crisis situation; (2) Identifying functional requirements; (3) Planning based on the degrees of a crisis from minor disruption of services to extreme catastrophic incidents; (4) Establish service level requirements for business continuity; (5) Revise and update the plan; have critical partners review the plan; and (6) Have hard and digital copies of the plan stored in several locations for security.

! CIOs should ask and answer the following questions: (1) What are the top business functions and essential services the state enterprise can not function without? Tier business functions and essential services into recovery categories based on level of importance and allowable downtime. (2) How can the operation's facilities, vital records, equipment, and other critical assets be protected? (3) How can disruption to an agency's or department's operations be reduced?

! CIOs should create a business resumption strategy: Such strategies lay out the interim procedures to follow in a disaster until normal business operations can be resumed. Plans should be organized by procedures to follow during the first 12, 24, and 48 hours of a disruption. (Utilize technologies such as GIS for plotting available assets, outages, etc.)

! CIOs should conduct strategic assessments and inventory of physical assets, e.g. computing and telecom resources, identify alternate sites and computing facilities. Also conduct strategic assessments of essential employees to determine the staff that would be called upon in the event of a disaster and be sure to include pertinent contact information.

! CIOs should conduct contingency planning in case of lost personnel: This could involve cross-training of essential personnel that can be lent out to other agencies in case of loss of service or disaster; also, mutual aid agreements with other public/private entities such as state universities for "skilled volunteers." (Make sure contractors and volunteers have approved access to facilities during a crisis.)

! Build cross-boundary relationships with emergency agencies: CIOs should introduce themselves and build relationships with state-wide, agency and local emergency management personnel – you don't want the day of the disaster to be the first time you meet your emergency management counterparts. Communicate before the crisis. Also consider forging multi-state relationships with your CIO counterparts to prepare for multi-state incidents. Consider developing a cross-boundary DR/BC plan or strategy, as many agencies and jurisdictions have their own plans.
! Intergovernmental communications and coordination plan: Develop a plan to communicate and coordinate efforts with state, local and federal government officials. Systems critical for other state, local and federal programs and services may need to be temporarily shut down during an event to safeguard the state's IT enterprise. Local jurisdictions are the point-of-service for many state transactions, including benefits distribution and child support payments, and alternate channels of service delivery may need to be identified and temporarily established. Make sure jurisdictional authority is clearly established and articulated to avoid internal conflicts during a crisis.

! Establish a crisis communications protocol: A crisis communications protocol should be part of a state's IT DR/BC plan; Designate a primary media spokesperson with additional single point-of-contact communications officers as back-ups. Articulate who can speak to whom under different conditions, as well as who should not speak with the press. In a time of crisis, go public immediately, but only with what you know; provide updates frequently and regularly.

! Testing: CIOs should conduct periodic training exercises and drills to test DR/BC plans. These drills should be pre-scheduled and conducted on a regular basis and should include both desk-top and field exercises. Conduct a gap analysis following each exercise.

! A CIO's approach to a DR/BC plan will be unique to his or her financial and organizational situation and the availability of trained personnel. This still leaves the question as to who writes the plans. If a CIO chooses from one of the many consultants that provide Continuity of Operations planning, he or she should make sure that staff maintains a close degree of involvement and, when completed, that the consultant(s) provide general awareness training of the plan. If CIOs choose to conduct planning in-house, have an experienced and certified business continuity planner review it for any potential gaps or inconsistencies.
(2) Top Steps States Need to Take to Solidify Public/Private Partnerships Ahead of Crises (Pre-disaster agreements with the private sector and other organizations.)

! Utilize preexisting business partnerships: Keep the dialogue open with state business partners; periodically call them all in for briefings on the state's disaster recovery and business continuity (DR/BC) plans.

! Set up "Emergency Standby Services and Hardware Contracts:" Have contracts in place for products and services that may be needed in the event of a declared emergency. Develop a contract template so a contract can be developed with one to two hours work time.

! Be sure essential IT procurement staff are part of the DR/BC plan and are aware of their roles in executing pre-positioned contracts in the event of a disaster; also be sure to include pertinent contact information.

! CIOs should develop "Emergency Purchasing Guidelines" for agencies and have emergency response legislation in place.
(3) How do you Make the Business Case on the Need for Redundancy? (Especially to the state legislature, the state executive branch and budget officials.)

Risk assessment of types of disasters that could lead to the need for business continuity planning:

! Geological hazards – Earthquakes, Tsunamis, Volcanic eruptions, Landslides/mudslides/subsidence;

! Meteorological hazards – Floods/flash floods, tidal surges, Drought, Fires (forest, range, urban), Snow, ice, hail, sleet, avalanche, Windstorm, tropical cyclone, hurricane, tornado, dust/sand storms, Extreme temperatures (heat, cold), Lightning strikes;

! Biological hazards – Diseases that impact humans and animals (plague, smallpox, Anthrax, West Nile Virus, Bird flu);

! Human-caused events – Accidental: Hazardous material (chemical, radiological, biological) spill or release; Explosion/fire; Transportation accident; Building/structure collapse; Energy/power/utility failure; Fuel/resource shortage; Air/water pollution, contamination; Water control structure/dam/levee failure; Financial issues: economic depression, inflation, financial system collapse; Communications systems interruptions;

! Intentional – Terrorism (conventional, chemical, radiological, biological, cyber); Sabotage; Civil disturbance, public unrest, mass hysteria, riot; Enemy attack, war; Insurrection; Strike; Misinformation; Crime; Arson; Electromagnetic pulse.

! For federally declared states of emergency the financial aspect has been somewhat lessened by the potential of acquiring funding grants from state or federal organizations such as FEMA. Additional funding for state cybersecurity preparedness efforts is available to states through the U.S. Department of Homeland Security's State Homeland Security Grants Program.

! Establish metrics for costs of not having redundancy: How much will it cost the state if certain critical business functions go down – e.g. ERP issues on the payment side; citizen service issues (what it would do to the DMV for license renewals); impacts on eligibility verifications for social services, etc. How long can you afford to be down? How much is this costing you? How long can you be without a core business function?
! Protect current systems: Controlled access; uninterruptible power supply (UPS); back-up generators with standby contracts for diesel fuel (use priority and back-up fuel suppliers that also have back-up generators to operate their pumps in the event of a widely spread power outage).

! Strategic location: Locate critical facilities away from sites that are vulnerable to natural and man-made disasters.

! Interactive voice response (IVR) systems that are accessing back-end databases: (There may be no operators for backup that can connect patrons to services.) Seek diversity of inbound communications.

! Self-healing communications systems that automatically re-route communications or use alternate media.

! Self-healing primary point of presence facilities that automatically restore service.

! Approach enterprise backup as a shared service: Other agencies may have the capability for excess redundancy.

! Provide secure remote access to state IT systems for essential employees (access may be tiered based on critical need.)

! Hot Sites: A disaster recovery facility that mirrors an agency's applications databases in real-time. Operational recovery is provided within minutes of a disaster. These can be provided at remote locations or outsourced to one or multiple contractors.
! Decision making: Prepare yourself for making decisions in an environment of uncertainty. During a crisis you may not have all the information necessary, however, you will be required to make immediate decisions.

! Execute DR/BC Plan: Retrieve copies of the plan from secure locations. Begin systematic execution of plan provisions, including procedures to follow during the first 12, 24, and 48 hours of the disruption.

! Implement your emergency employee communications plan: Inform your internal audiences – IT staff and other government offices – at the same time you inform the press. Prepare announcements to employees to transition them to alternate sites or implement telecommuting or other emergency procedures. Employees can maintain communication with the central IT office utilizing Phone exchange cards, provided to employees with two numbers: (1) First number employees use to call in and leave their contact information; (2) Second number is where the employees call in every morning for a standing all employee conference call for updates on the emergency situation.
! Back-up communications: In the event wireless, radio and Internet communications are inaccessible, Government Emergency Telecommunications Service (GETS) cards can be utilized for emergency wireline communications. GETS is a Federal program that prioritizes calls over wireline networks and utilizes both the universal GETS access number and a Personal Identification Number (PIN) for priority access.

! Leverage technology/Think outside the box: In a disaster situation the state's GIS systems can be utilized to monitor power outages and system availability. For emergency communications, the "State Portal" can be converted to an emergency management portal. Also, Web 2.0 technologies such as Weblogs, Wikis and RSS feeds can be utilized for emergency communications.
! Preliminary damage and loss assessment: Conduct a post-event inventory and assess the loss of physical and non-physical assets. Include both tangible losses (e.g. a building or infrastructure) and intangible losses (e.g. financial and economic losses due to service disruption). Be sure to include a damage and loss assessment of hard copy and digital records. Prepare a tiered strategy for recovery of lost assets.

! Employee transition: Once agencies have recovered their data, CIOs need to find interim space for displaced employees, either at the hot site or another location. Coordinate announcements to employees to transition them to an alternate site or implement telecommuting procedures until normal operations are reestablished.

! Contractual performance: Review the performance of strategic contracts and modify contract agreements as necessary.

! Lessons learned: Evaluate the effectiveness of the DR/BC plan and how people responded. Examine all aspects of the recovery effort and conduct a gap analysis to identify deficiencies in the plan execution. Update the plan based on the analysis. What went right (duplicate); what went wrong (tag and avoid in the future). Correct problems so they don't happen again.
continuity. DRII established its goals to: Promote a base of common knowledge for the business continuity planning/disaster recovery industry through education, assistance, and publication of the standard resource base; Certify qualified individuals in the discipline; and Promote the credibility and professionalism of certified individuals: <http://www.drii.org/>

The National Association of State Procurement Officials (NASPO) has completed work on disaster recovery as it relates to procurement: <http://www.naspo.org/>

AFTER THE DISASTER
Hurricane Katrina not only impacted more than 90,000 square miles and almost 10 million residents of the Gulf Coast but also affected how governments will manage such disasters in the future. A collection of articles opens the dialogue about disaster response in a new book, "On Risk and Disaster: Lessons from Hurricane Katrina." The book, edited by Ronald J. Daniels, Donald F. Kettl (a Governing contributor) and Howard Kunreuther, warns of the inevitability of another disaster and the need to be prepared to act. It addresses the public and private roles in assessing, managing and dealing with disasters and suggests strategies for moving ahead in rebuilding the Gulf Coast. To see a table of contents and sample text, visit <http://www.upenn.edu/pennpress/book/14002.html> Published by the University of Pennsylvania Press, the book sells for $27.50.

"Disaster Recovery, How to protect your technology in the event of a disaster," Bob Xavier, November 27, 2001: <http://www.techsoup.org/howto/articles/techplan/page2686.cfm>
VITA
EDUCATION
Candidate for M.S. in Computer Information Technology at Purdue University,
May 2011 G.P.A. 3.8/4.00
Honors B.A. in Communication, Public Relations at Purdue University, December
1998 G.P.A. 3.57/4.00
PUBLICATIONS
REFERENCES