

From the SelectedWorks of Heather M. Brotherton

May 2011

Data center recovery best practices: Before, during, and after disaster recovery execution


Available at: http://works.bepress.com/heatherbrotherton/4


Graduate School ETD Form 9
(Revised 12/07)

PURDUE UNIVERSITY
GRADUATE SCHOOL
Thesis/Dissertation Acceptance

This is to certify that the thesis/dissertation prepared

By Heather McCall Brotherton

Entitled
DATA CENTER RECOVERY BEST PRACTICES: BEFORE, DURING, AND AFTER DISASTER
RECOVERY EXECUTION

For the degree of Master of Science

Is approved by the final examining committee:


J. Eric Dietz, Chair
Gary Bertoline
W. Gerry McCartney
Jeffrey Sprankle

To the best of my knowledge and as understood by the student in the Research Integrity and
Copyright Disclaimer (Graduate School Form 20), this thesis/dissertation adheres to the provisions of
Purdue University’s “Policy on Integrity in Research” and the use of copyrighted material.

Approved by Major Professor(s): J. Eric Dietz

Approved by: Jeffrey L. Brewer, Head of the Graduate Program

Date: 04/04/2011
Graduate School Form 20
(Revised 9/10)

PURDUE UNIVERSITY
GRADUATE SCHOOL

Research Integrity and Copyright Disclaimer

Title of Thesis/Dissertation:
DATA CENTER RECOVERY BEST PRACTICES: BEFORE, DURING, AND AFTER DISASTER
RECOVERY EXECUTION

For the degree of Master of Science

I certify that in the preparation of this thesis, I have observed the provisions of Purdue University
Executive Memorandum No. C-22, September 6, 1991, Policy on Integrity in Research.*

Further, I certify that this work is free of plagiarism and all materials appearing in this
thesis/dissertation have been properly quoted and attributed.

I certify that all copyrighted material incorporated into this thesis/dissertation is in compliance with the
United States’ copyright law and that I have received written permission from the copyright owners for
my use of their work, which is beyond the scope of the law. I agree to indemnify and save harmless
Purdue University from any and all claims that may be asserted or that may arise from any copyright
violation.

Heather McCall Brotherton


______________________________________
Printed Name and Signature of Candidate

04/04/2011
______________________________________
Date (month/day/year)

*Located at http://www.purdue.edu/policies/pages/teach_res_outreach/c_22.html
DATA CENTER RECOVERY BEST PRACTICES: BEFORE, DURING,

AND AFTER DISASTER RECOVERY EXECUTION

A Thesis

Submitted to the Faculty

of

Purdue University

by

Heather M. Brotherton

In Partial Fulfillment of the

Requirements for the Degree

of

Master of Science

May 2011

Purdue University

West Lafayette, Indiana



TABLE OF CONTENTS

Page
LIST OF TABLES ........................................................................................... iv  
LIST OF FIGURES.......................................................................................... v  
LIST OF ABBREVIATIONS............................................................................ vi  
ABSTRACT ....................................................................................................vii  
CHAPTER 1. INTRODUCTION....................................................................... 1  
1.1. Statement of purpose ......................................................................... 1  
1.2. Research Question ............................................................................. 1  
1.3. Scope.................................................................................................. 2  
1.4. Significance......................................................................................... 2  
1.5. Assumptions ....................................................................................... 3  
1.6. Limitations........................................................................................... 3  
1.7. Delimitations ....................................................................................... 4  
1.8. Summary............................................................................................. 4  
CHAPTER 2. LITERATURE REVIEW............................................................. 6  
2.1. Critical cyberinfrastructure vulnerability .............................................. 6  
2.2. Barriers to cyberinfrastructure resiliency ............................................ 8  
2.3. Mutual aid ........................................................................................... 9  
2.3.1. Mutual Aid Association ............................................................. 11  
2.4. Training ............................................................................................. 12  
2.5. Testing .............................................................................................. 13  
2.6. Summary........................................................................................... 14  
CHAPTER 3. FRAMEWORK AND METHODOLOGY .................................. 15  
3.1. Framework ........................................................................................ 15  
3.2. Researcher Bias ............................................................................... 16  
3.3. Methodology ..................................................................................... 16  
3.4. Data Collection.................................................................................. 17  
3.5. Authorizations ................................................................................... 17  
3.6. Analysis............................................................................................. 18  
3.6.1. Triangulation............................................................................. 18
3.7. Summary........................................................................................... 19  
CHAPTER 4. CASE STUDIES...................................................................... 20  
4.1. Commerzbank................................................................................... 20  
4.1.1. Background.................................................................................... 20  
4.1.2. World Trade Center Attacks ..................................................... 21  
4.1.3. Conclusion................................................................................ 28  
4.2. FirstEnergy........................................................................................ 29  
4.2.1. Background .............................................................................. 29  
4.2.2. Northeast Blackout of 2003 ...................................................... 29  
4.2.3. Conclusion................................................................................ 38  
4.3. Tulane ............................................................................................... 39  
4.3.1. Background .............................................................................. 39  
4.3.2. Hurricane Katrina...................................................................... 40  
4.3.3. Conclusion................................................................................ 48  
4.4. Commonwealth of Virginia ................................................................ 49  
4.4.1. Background .............................................................................. 49  
4.4.2. August 2010 outage ................................................................. 51  
4.4.3. Conclusion................................................................................ 65  
CHAPTER 5. ANALYSIS............................................................................... 67  
5.1. Best Practice Triangulation ............................................................... 67  
5.1.1. Before-Planning........................................................................ 67  
5.1.2. During-Plan execution .............................................................. 73  
5.1.3. After-Plan improvement............................................................ 78  
CHAPTER 6. CONCLUSION ........................................................................ 86  
CHAPTER 7. FUTURE RESEARCH............................................................. 89  
BIBLIOGRAPHY............................................................................................ 91
APPENDICES  
Appendix A............................................................................................. 103  
Appendix B............................................................................................. 104
VITA ............................................................................................................ 118
PUBLICATION  
Disaster recovery and business continuity planning:
Business justification.............................................................................. 120  

LIST OF TABLES

Table Page
Table 5.1 Tolerance and objectives .................................................................... 68
Table 5.2 Aid relationship utilized during recovery .............................................. 78  

LIST OF FIGURES

Figure Page
Figure 5.1 Adherence to established procedures................................................ 74
Figure 5.2 Sample IT incident command structure.............................................. 77
Figure 5.3 Reported average downtime revenue losses in billions ..................... 80
Figure 5.4 Reported critical application and data classifications ......................... 81
Figure 5.5 Components of a resilient system ...................................................... 85

LIST OF ABBREVIATIONS

CIO Chief Information Officer


DMV Department of Motor Vehicles
DR disaster recovery
EMAC Emergency Management Assistance Compact
EMS Emergency Management System
EOC emergency operations center
FE FirstEnergy
FEMA Federal Emergency Management Agency
FERC Federal Energy Regulatory Commission
HVAC Heating, Ventilating, and Air Conditioning
IT information technology
ITIL Information Technology Infrastructure Library
MOU Memorandum of Understanding
MTPOD Maximum Tolerable Period of Disruption
NIMS National Incident Management System
NRC Nuclear Regulatory Commission
ROI Return On Investment
RPO Recovery Point Objectives
RTO Recovery Time Objectives
SCADA Supervisory Control and Data Acquisition
SAN Storage Area Network
VITA Virginia Information Technologies Agency

ABSTRACT

Brotherton, Heather M. M.S., Purdue University, May 2011. Data center


recovery best practices: before, during, and after disaster recovery execution.
Major Professor: J. Eric Dietz.

This qualitative multiple case study analysis reviews well documented past

information technology disasters with a goal of identifying practical before, during,

and after disaster recovery best practices. The topic of cyberinfrastructure

resiliency is explored including barriers to cyberinfrastructure resiliency. Factors

explored include: adherence to established procedures, staff training in recovery

procedures, chain of command structure, recovery time and cost, and mutual aid

relationships. Helpful tools and resources are included to assist planners.



CHAPTER 1. INTRODUCTION

1.1. Statement of purpose

The purpose of this research is to attempt to bridge the gap of unmet

needs in the area of cyberinfrastructure business continuity and disaster

recovery. Information systems are complex and vital to modern infrastructure.

Loss of computer information system availability can financially cripple

companies and potentially cause basic necessities such as clean water to be

unavailable. In many cases, organizations fail to implement business continuity

measures due to the high cost of remote failover systems and training.

Cyberinfrastructure resiliency is dependent upon creating practical, attainable

implementations. Through this research, the effectiveness of various business

continuity and disaster recovery practices will be explored to increase information

systems resiliency.

1.2. Research Question

What are the best practices in planning for, during, and after disaster recovery

execution?

1.3. Scope

The scope of the research is identification of best practices for business

continuity and disaster recovery. Factors affecting the success of

cyberinfrastructure incident recovery will be identified through case study

analysis. Success will be determined by reviewing factors such as practicality,

recovery time, and business impact. Practical tools and resources to assist best

practice implementation and execution will also be identified.

1.4. Significance

Aside from IT professionals, very few think about the impacts of

information system failure. Growing dependence upon computer information

systems has created vulnerabilities that have not been uniformly addressed.

Information systems are the ubiquitous controllers of critical infrastructure. Many

business processes and services depend upon computer information systems

resulting in myriad factors to consider in data center contingency planning.

These systems experience failures on a regular basis, but most failures

are unnoticed due to carefully crafted redundant mechanisms that seamlessly

continue processing. However, massive failures have occurred that resulted in

widespread, severe negative impact on the public. While most large corporations

have remote failover locations, there are many organizations important to critical

functions that do not have the resources to develop and implement business

continuity and disaster recovery plans. Practical, understandable planning and

recovery guidance, developed through the findings of this research, may help

ensure the stability of cyberinfrastructure and, by extension, the safety and well-being of all.

1.5. Assumptions

Assumptions for this study include:

• Examination of the experiences of organizations that have sustained

catastrophic information systems failures will yield information that will

contribute to the disaster recovery best practices body of knowledge.

• The use of qualitative case study analysis is appropriate to study the

phenomenon of interest.

• Existing publicly available documents are the best source of the actions

and policies in place at the time of the incident.

1.6. Limitations

Limitations include:

• Contact with primary actors from cyber infrastructure failures is infeasible

due to:

o Difficulty identifying actors

o Limitations on what may be discussed due to risk of liability

o Degraded memory of actual events and policies active at the time

of the incident

• Highly detailed information will not be available in the documentation.

Therefore, this research will not address topics that cannot be examined

based on the detail of the available documentation.



• Observation of large-scale cyber infrastructure failure is not feasible due to

inherent unpredictability; observation of other information systems failures

and recovery will lack external validity.

1.7. Delimitations

Delimitations include:

• Many sources to assist in business continuity and disaster planning exist;

this research will not attempt to add to planning, but will focus on the

successes and failures of the planning and methods employed before,

during, and after information system recovery execution.

• Possible causal relationships will not be examined in this exploratory

research study.

• Information systems failures that are not well documented will not be

addressed.

• The number of case studies will be limited to ensure in-depth coverage of

recovery methods employed.

• Realistic simulation of catastrophic failures is neither ethical nor feasible

and will not be attempted for the purpose of study.

1.8. Summary

This chapter is an introduction to the disaster recovery best practices

research project. The purpose of the research is to address unmet needs in

cyberinfrastructure resiliency. Cyberinfrastructure resiliency is defined as the

ability of an infrastructure level information system to tolerate and recover from



adverse incidents with minimal disruption. The scope of the project is defined in

this chapter as well as the significance, assumptions, limitations, and

delimitations. The following chapter will review literature on topics related to

cyberinfrastructure resiliency.

CHAPTER 2. LITERATURE REVIEW

This chapter provides an overview of the importance of systems resiliency

and introduces the concept of mutual aid. Computer information system

vulnerabilities and threats are discussed. The barriers to systems resiliency and

the challenges associated with removing these barriers to implement resiliency

are highlighted. Potential uses of mutual aid agreements as a pragmatic, cost-

effective risk mitigation alternative resiliency tool are discussed. Literature

related to systems resiliency is reviewed to provide a background of the

problems and to support the exploration of the employment of mutual aid

agreements.

2.1. Critical cyberinfrastructure vulnerability

The Clinton, Bush, and Obama administrations have recognized society’s

dependency on cyberinfrastructure in presidential communications. Presidential

Decision Directive 63 declared "cyber-based systems essential to the minimum

operations of the economy and government. They include, but are not limited to,

telecommunications, energy, banking and finance, transportation, water systems

and emergency services, both governmental and private." (Clinton

Administration, 1998, p. 1). This communication set forth policy to implement

cyberinfrastructure protections by 2000 (Clinton Administration, 1998).



However, despite this directive, in 2003 the Northeast portion of the United

States suffered an extended widespread power outage due in large part to failure

of the computer system (U.S.-Canada Power System Outage Task Force, 2004).

Transportation, communication, and water were unavailable leaving many

stranded in subways and trapped in elevators. In some cases, people were

unable to make non-cash purchases for essentials such as flashlights (Barron,

2003). Findings published by the New York Independent System Operator state

"the root cause of the blackout was the failure to adhere to the existing reliability

rules" (New York Independent System Operator, 2005, p. 4). "ICF Consulting

estimated the total economic cost of the August 2003 blackout to be between $7

and $10 billion" (Electricity Consumers Resource Council (ELCON), 2004, p. 1).

In more recent history, Google announced a directed attack from China

(Scherr & Bartz, 2010). This announcement was shortly followed by an

announcement from the Obama administration regarding initiatives to protect

critical resources such as power and water from cyber attack (Scherr & Bartz,

2010). No initiatives to date have resulted in substantial hardening of

cyberinfrastructure; in fact, the problem appears to be growing. Losses of

intellectual property alone from 2008 to 2009 were approximately one trillion

dollars (Internet Security Alliance (ISA)/American National Standards Institute

(ANSI), 2010).

2.2. Barriers to cyberinfrastructure resiliency

Computer information systems are inherently difficult to protect. They

remain in a state of constant flux due to technological advances and updates to

patch known vulnerabilities (Homeland Security, 2009). Each patch or fix applied

runs the risk of causing an undocumented conflict due to customization as well

as creating a new vulnerability. Constant connection to the Internet has

increased the usefulness of computers, but this has also increased vulnerability.

Information systems are highly complex; even information technology experts are

segmented. Upper-level managers tend to be “digital immigrants,” resulting in

increased difficulty in convincing them to fund cybersecurity projects. (Internet

Security Alliance (ISA)/American National Standards Institute (ANSI), 2010, p.

12) This disconnect is the doom of continuity planning; without high-level

backing to push policy change and supply resources there is little chance for

success (Petersen, 2009).

Funding alone will not make a resilient cyberinfrastructure; collaboration

among departments is necessary to create and maintain a plan that addresses

the business requirements of an organization (Caralli, Allen, Curtis, White, &

Young, 2010). There must also be organizational understanding and

commitment to the practices that contribute to the documentation required to

have an up-to-date continuity plan. These cultural changes require strong

actively committed leadership to enact.



Leadership lacking the fundamental understanding of the importance of

failover testing can render an otherwise solid continuity plan useless. In some

cases, companies have disaster recovery plans, but are reluctant to test live

systems due to the possibility of service interruptions (Balaouras, 2008). This

short sightedness can lead to disastrous costly consequences. Planned testing

can be scheduled during low traffic periods when the staff can be prepared to

quickly recover any outage. These tests serve to identify system and failover

plan weaknesses and make the staff more comfortable with the failover and

recovery process.
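To make the idea of planned failover testing concrete, the following is a minimal sketch, in Python, of how a drill might be timed against a recovery time objective (RTO); the four-hour RTO and the promote_standby and check_health helpers are hypothetical placeholders standing in for an organization's own tooling, not part of any plan discussed in this study.

    # failover_drill.py - hedged sketch of a timed failover exercise.
    # The RTO value and the two helper functions are assumptions made for
    # illustration; a real drill would call the organization's own tooling.
    import time
    from datetime import timedelta

    RTO = timedelta(hours=4)  # assumed recovery time objective

    def promote_standby(site: str) -> None:
        """Placeholder: redirect traffic and promote replicas at the standby site."""
        print(f"promoting {site} to primary")

    def check_health(site: str) -> bool:
        """Placeholder: poll the critical applications at the given site."""
        return True  # stubbed so the sketch runs end to end

    def run_drill(standby_site: str = "standby") -> timedelta:
        start = time.monotonic()
        promote_standby(standby_site)
        while not check_health(standby_site):  # wait for services to come up
            time.sleep(30)
        elapsed = timedelta(seconds=time.monotonic() - start)
        print(f"failover completed in {elapsed}; RTO met: {elapsed <= RTO}")
        return elapsed

    if __name__ == "__main__":
        run_drill()

Recording the elapsed time from each scheduled drill also gives planners evidence of whether the stated recovery objectives remain realistic.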

A common and somewhat illogical barrier to planning for resiliency is the

idea that some disasters cannot be planned for because they are too large.

(Schaffhauser, 2005) The National Incident Management System (NIMS)

provides a framework for managing incidents of any size and complexity. (FEMA)

Information and training for NIMS is freely available on the Federal Emergency

Management Agency training website. The site address is listed in Appendix A.

The use of this framework is highly recommended because it is widely used and

provides a structure for integrating outside organizations into the command and

incident response structure.

2.3. Mutual aid

Mutual aid agreements have evolved over human history as a means to

pool resources to solve a common problem. The redundant resources required

to maintain systems continuity may not be economically feasible for many



organizations. Rather than forgoing implementing remote failover locations, it

may be advisable to pool resources by forging reciprocal agreements.

The September 2010 San Bruno gas pipeline explosion is a good example

of the advantages of an existing mutual aid compact. San Bruno’s disaster activated
42 fire agencies and 200 law enforcement officers. (Jackson, 2011) “85 pieces of
fire-fighting apparatus” were also provided for on-site response. (Jackson, 2011)
The resources required for this incident were far beyond what the city’s budget
could feasibly maintain. The California Mutual Aid System along with

an Emergency Operations plan ensured the city was able to quickly and

effectively respond to this unforeseen explosion. (Jackson, 2011)

The possibility that IT mutual aid agreements will allow organizations to make better use of available resources is worth

exploring (Swanson, Bowen, Wohl Phillips, Gallup, & Lynes, 2010). Collocation

of critical services provides systems redundancy without the need to build a

dedicated recovery data center. Reciprocal relationships are generally defined

by a memorandum of understanding (Swanson, Bowen, Wohl Phillips, Gallup, &

Lynes, 2010). Memorandums of understanding, often referred to as MOUs,

define protocol, costs, resources available, and compatibility requirements. It

may be desirable to include nondisclosure agreements in the MOU.

Staffing is a key resource that could be negotiated for through mutual aid

agreements. Sharing staffing increases the likelihood that adequately trained

staff will be available should a catastrophic event occur. Some catastrophes may

make staff unavailable due to personal impact and additional staff may be

required to maintain or recover operations to prevent or reduce business impact

(Schellenger, 2010). Partnering with another organization to pool staffing

resources can ensure efficient contingency operations through cross-trained

staff. The end result may be cost savings. Fewer contractors and consultants

would be necessary and business impact could be minimized as a result of extra

staff that is familiar with the computer system.

Another possible advantage of mutual aid agreements is the ability to

share training expenses. General conceptual information and in-house training

can be shared between partner organizations. This may not only save costs of

developing and providing training, but will provide a "common language" for the

partnered organizations (FEMA, 2006, p. 4). Ideally, employees on incident
management teams would receive additional specialized training alongside their
counterparts to ensure good communication between the teams. The ability to

communicate efficiently and effectively will also contribute to the reduction of

downtime.

2.3.1. Mutual Aid Association

Mutual Aid agreements are common for police, fire departments, and

utilities. Associations have been formed to fill the gaps in situations where an

organization lacks the necessary resources to respond adequately to an incident.

These relationships have been used to the benefit of society at large allowing

seamless performance of incident response duties. This is possible due to

predetermined procedures and protocols that exist in mutual aid agreements.

Organizations generally hold regular training with reciprocal partners. According



to Hardenbrook, utilities "showed the most advanced levels of cooperation"

during the Blue Cascades exercise (2004, p. 4). The Blue Cascades II exercise

focused on information technology dependencies.

The FEMA website has links to a few mutual aid associations such as

Emergency Management Assistance Compact (EMAC). EMAC emerged in 1949

in response to concerns about nuclear attack. (EMAC) In 1996 the U.S. Congress

recognized EMAC as a national disaster compact through Public Law. (EMAC)

EMAC is designed to assist states, but this model may work for non-profit,

educational, and business organizations. Creation of a similar association for information

technology may be warranted due to the special skills, equipment, and resources

required for response to a large-scale event.

2.4. Training

Training is a key factor in business continuity and disaster recovery.

Human error is often cited as the primary cause of systems failure (U.S.-Canada

Power System Outage Task Force, 2004). In many cases, the incident is

initiated by another type of failure (software, hardware, fire, etc), but the

complicating factor becomes human error (U.S.-Canada Power System Outage

Task Force, 2004). Automation of "easy tasks" leaves "complex, rare tasks" to

the human operator. (Patterson, et al., 2002, p. 3) Humans "are not good at

solving problems from first principles…especially under stress" (Patterson, et al.,

2002, p. 3) "Humans are furious pattern matchers" but "poor at solving problems

from first principles, and can only do so for so long before" tiring (Patterson, et

al., 2002, p. 3). Automation "prevents …building mental production rules and

models for troubleshooting" (Patterson, et al., 2002, p. 4). The implications of

this are that technologists are not efficient at solving problems without

experience. Training provides the opportunity to build "mental production rules"

and allows the technologist to quickly and more accurately respond to incidents.

2.5. Testing

Surveyed literature reinforces the importance of testing and

experimentation. Testing provides the opportunity to assess the effectiveness of

business continuity procedures, equipment, and configuration. Part of the

reasoning for testing is that “emergency systems are often flawed…only an

emergency tests them, and latent errors in emergency systems can render them

useless." (Patterson, et al., 2002) Incident response procedures vary in

complexity. Some procedures are employed on a regular basis; these situations

are not the focus of the testing discussed here. Large-scale recovery and

continuity procedures are rarely employed by an organization; however, the

effectiveness of these plans is decisive in the organization's survival in the event

of a large-scale disaster. Disasters have not only been historically costly, but

have resulted in permanent closures (Scalet, 2002). The costs of neglecting

business continuity and disaster recovery testing are too high to risk.

2.6. Summary

Critical resource and service dependencies upon information systems

have created the necessity to protect the underlying cyberinfrastructure. Barriers

to the resilience of complex and often fragile systems must be removed.

Leadership must be educated on the requirements of systems resiliency.

Practices that support maintained system and business process documentation

must be integrated into the organizational culture. The cost of redundant

cyberinfrastructure renders implementing resiliency out of reach for many

organizations. The cultivation of reciprocal relationships is one option to reduce

the cost of maintaining remote failover. Training and testing are key factors in

implementing effective business continuity and disaster recovery procedures.



CHAPTER 3. FRAMEWORK AND METHODOLOGY

The purpose of this research is to examine data center recovery planning,

execution, and post-execution activities to identify best practices that emerge

from the analysis. Qualitative methods will be applied to facilitate the exploration

of this topic. This chapter details the research methodology employed as well as

data collection and analysis methodologies.

3.1. Framework

Information technology business continuity and disaster recovery planning has
become a popular topic due to increased information system interdependency.
Organizations cannot afford downtime, primarily for financial reasons.

Methodologies have emerged to guide organizations through planning,

implementation, and maintenance lifecycle phases. Execution is addressed from

a theoretical point of view, but how does execution play out in real life, high

impact situations? Execution of cyberinfrastructure disaster recovery procedures

and protocols remains virtually unexamined. Research of documented, high

impact cyberinfrastructure recovery processes may uncover valuable information

that may enrich understanding of best practices. Best practices revealed or

reinforced through this research will be documented for future use.



3.2. Researcher Bias

I present here my personal bias on this topic, to inform the reader of

beliefs that may encroach upon the findings of this research. Preparedness, in

my mind, enables us to deal more effectively with adverse conditions. I whole-

heartedly believe that documentation and practice exercises contribute to

incident mitigation, quicker recovery time, and reduced personal stress during

emergency. I acknowledge that not every contingency can be included in

planning and the ingenuity of the incident responders is the key to success. I

believe that an all hazards approach, established chain of command, and well-

trained staff enable a more coordinated and efficient recovery process.

3.3. Methodology

Collective case study will be utilized in this qualitative phenomenological

study. This method is used because creating reliably accurate quantitative

measures is not feasible in the study of high impact cyberinfrastructure recovery

processes. Because this type of event occurs rarely, it is highly unlikely that the
researcher will be presented with the opportunity to observe the actual phenomenon of

interest. Quantitative methods are impractical because, while the cases used will

have some timeline and procedural documentation, the accuracy of these

measures is questionable due to the high stress nature of the recovery situations

and lack of highly detailed procedural information.



Lab research was also considered, and while this would produce high

internal validity, it is not feasible to realistically simulate a true disaster situation.

Therefore, external validity would be low and would likely result in unrealistic

findings.

3.4. Data Collection

Purposeful sampling methods were employed. The criteria for selection

included:

• High impact cyberinfrastructure incident

• Documented resolution

Phenomenon-related documents, artifacts, and archival records were used rather

than interviewing, which also reduces the possible impact of researcher bias.

Multiple cases were included in the case study. This method of data collection

may not produce findings generally applicable to information systems in every

sector. The area of interest is high impact cyberinfrastructure; the findings using

this methodology are expected to be highly generalizable to critical infrastructure

information systems.

3.5. Authorizations

Authorization for this research was granted by Purdue University College

of Technology and Purdue Institute of Homeland Security. The advisory

committee of the researcher approved this research to add to the body of



knowledge related to information systems business continuity and disaster

recovery. IRB approval was obtained for all written communication.

3.6. Analysis

Cross-case analysis was used to create a multidimensional profile of

disaster recovery processes and protocols. Recurring themes or practices, both

those with positive results and those with negative results, were identified. Factors explored

include:

• Adherence to established procedures

• Staff training in recovery procedures

• Chain of command structure

• Recovery time and cost

• Mutual aid relationships

3.6.1. Triangulation

The purpose of including more than one case study is to collate the

commonalities. The identification of common problems and successes

contributes to the understanding of best practices for disaster recovery.

Generalizable practices from other disciplines will also be used to reinforce the

identified and recommended best practices.



3.7. Summary

This chapter details the methodology, sampling, and analysis techniques

used in this research. Rationales for the methods employed were also discussed.

Findings and sources used for the case studies are included in the following chapters.

CHAPTER 4. CASE STUDIES

4.1. Commerzbank

4.1.1. Background

Commerzbank, established in 1870, is the second largest bank in
Germany.(Availability Digest, 2009) In 2001, Commerzbank was the 16th largest in

the world.(Editorial Staff of SearchStorage.com, 2002) The bank has overcome

many adversities since its establishment such as World War I and

socialism.(Availability Digest, 2009) The bank has survived calamities in the

United States as well, including a 1992 flood in Chicago and the 1993 World

Trade Center bombing.(Parris, Who Survives Disasters and Why, Part 2:

Organizations, 2010) Commerzbank’s New York offices are “located on floors 31

to 34 at the World Financial Center”.(Hewlett-Packard, 2002) This location is

“only 300 feet from the World Trade Center towers.”(Editorial Staff of

SearchStorage.com, 2002)

4.1.2. World Trade Center Attacks

On September 11, 2001, the World Trade Center suffered the largest terrorist

attack in United States history. Nearly 3,000 people died that day as a result of the

attacks. (Schwartz, Li, Berenson, & Williams, 2002) The impact to the economy

of the city of New York alone was $83 billion. (Barovik, Bland, Nugent, Van Dyk,

& Winters, 2001) Site clean up took over eight months. (Comptroller of the city of

New York, 02) Not all businesses were able to recover from the devastation

inflicted by the attacks. (Scalet S. D., 2002) The overall economic impacts

continue today, and the daily lives of each resident of the United States have been

affected, if only indirectly.

4.1.2.1. Ramifications

Commerzbank was so near the World Trade Center impact sites that the

debris caused the windows to shatter.(Editorial Staff of SearchStorage.com, 2002)

The interior of the building that housed Commerzbank was covered in debris and

glass creating an unsafe environment and choking building equipment. The data

center air conditioning failed leading to high temperatures, which had a

cascading effect on the data center computers.(Hewlett-Packard, 2002, p. 2)

Most of the local data center disks failed, causing failover to Commerzbank’s

remote site.(Hewlett-Packard, 2002, p. 2) Commerzbank had a redundant, fault

tolerant system with remote failover that allowed them to remain operational

throughout the event.(Hewlett-Packard, 2002, p. 2) They lost equipment at that

site but their ability to do business remained intact.

4.1.2.2. Response

Initially, links were directed to the Rye backup site to restore

communications with “Federal Reserve and the New York Clearing House” that

were lost after the first collision.(Availability Digest, 2009) It became apparent

that the World Trade Center was under attack when the second jet hit, and

Commerzbank initiated an immediate evacuation.(Parris, Who Survives Disasters

and Why, Part 2: Organizations, 2010) When the building lost power

Commerzbank’s backup power generator took over, but the HVAC system failed

due to the debris, causing that site’s data center to shut down.(Hewlett-Packard,

2002, p. 2) Automated failover processes continued as employees traveled to the

recovery site.(Editorial Staff of SearchStorage.com, 2002) The recovery site at

Rye, New York can be operated by 10 staff members and 16 reported to the

backup site on September 11th.(Hewlett-Packard, 2002, p. 2) This site served as

the primary data center, and in the days that followed EMC, Commerzbank’s storage

vendor, worked around the clock to restore data that was backed up to tape

rather than replicated.(Editorial Staff of SearchStorage.com, 2002) EMC added

“multiple terabytes” of storage to augment the backup site capacity during

the following 36 hours, allowing restoration of “mission-critical” data as well as

creation of new backups.(Editorial Staff of SearchStorage.com, 2002)



4.1.2.3. Mitigation in place

Commerzbank’s “primary site was well-protected, with its own generator,

fuel storage tank, cooling tower, UPS, batteries, and fire suppression

system.”(Parris, Who Survives Disasters and Why, Part 2: Organizations, 2010)

Commerzbank was in the midst of virtualizing storage, and had finished the

majority of the conversion before the attacks. (Mears, Connor, & Martin, 02) The

IT staff at Commerzbank designed and maintained a business continuity plan

that included regular testing and a call tree.(Hewlett-Packard, 2002, p. 3) This

provided the capability to meet the zero downtime requirement set forth by the

business.(Hewlett-Packard, 2002) To reach this goal Commerzbank shadowed

“everything” to the remote site. (Parris, Who Survives Disasters and Why, Part 2:

Organizations, 2010) The remote site, located 30 miles from the World Trade

Center site at Rye, was the cornerstone of that plan.(Hewlett-Packard, 2002)

Boensch describes the activities of Commerzbank’s Disaster Recovery


(DR) site in non-disaster mode. “Our DR site is really dual purpose. The
AlphaServer GS160 system is a standby production site in case of a
disaster. But on a regular day-to-day basis, it’s up and running as a test
and development system. Actually, the only things that are redundant in
an active/active configuration are the StorageWorks data disks — they are
truly dedicated both locally and remotely. We also use the site for
training.”(Hewlett-Packard, 2002, p. 4)

The primary site at the World Trade Center maintained local duplicate drives and

“extra CPUs”(Hewlett-Packard, 2002, p. 2) There was also a “disaster-tolerant

cluster” in the active/active data configuration described above to provide failover



capacity in seconds.(Parris, Using OpenVMS Clusters for Disaster Tolerance)

Commerzbank used:

EMC's Symmetrix Remote Data Facility (SRDF) hardware and software to


safeguard its customer transactions, financial databases, e-mail and other
crucial applications. SRDF replicates primary data to one or more sites,
making copies remotely available almost instantly.(Editorial Staff of
SearchStorage.com, 2002)

This system provided “a standard, immediately functional environment for critical

decision-support and transactional data.”(Editorial Staff of SearchStorage.com,

2002) The facilities were physically connected via “Fibre Channel SAN” providing

a storage transfer rate of almost 1TB per second. (Parris, Who Survives

Disasters and Why, Part 2: Organizations, 2010) The remote site maintained

servers that “were members of the cluster” at the World Trade Center site. These

servers continued to serve using replicated “remote disks to the main site” after

the storage there failed.(Parris, Who Survives Disasters and Why, Part 2:

Organizations, 2010) Commerzbank’s “Follow-the-sun personnel staffing model

meant help was available” around the clock.(Parris, Who Survives Disasters and

Why, Part 2: Organizations, 2010) Previously established vendor relationships

with EMC and Compaq, later to become part of Hewlett-Packard (HP), ensured

they were on hand to assist with any services or equipment required to recover.
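As a rough conceptual illustration of the replication arrangement described above, the following Python sketch models synchronous mirroring, in which a write is copied to the remote site before it is treated as complete; it is a toy model built on assumptions, not EMC SRDF or OpenVMS clustering, and the site names are labels only.

    # Toy model of synchronous remote replication: a write is applied to both
    # the local and the remote copy before it is acknowledged, so the surviving
    # site already holds current data when a failover occurs.
    class Site:
        def __init__(self, name: str):
            self.name = name
            self.blocks: dict[int, bytes] = {}  # block number -> data

        def write(self, block: int, data: bytes) -> None:
            self.blocks[block] = data

    class SynchronousMirror:
        def __init__(self, primary: Site, remote: Site):
            self.primary, self.remote = primary, remote

        def write(self, block: int, data: bytes) -> None:
            self.primary.write(block, data)  # local copy
            self.remote.write(block, data)   # remote copy before acknowledging
            # only at this point would the application see the write as complete

    if __name__ == "__main__":
        primary, backup = Site("primary"), Site("backup")
        mirror = SynchronousMirror(primary, backup)
        mirror.write(42, b"transaction record")
        # if the primary site is lost, the backup copy is already identical
        assert backup.blocks == primary.blocks

The design trade-off is that each write waits for the remote copy, which is what makes near-zero data loss possible when the primary site is destroyed.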

4.1.2.4. Corrective actions

Commerzbank’s corporate vice president, Rich Arenaro, felt that the

disaster recovery part of the business continuity plan worked. All critical data was

available, but it still took nearly four hours to resume normal business

operations.(Mears, Connor, & Martin, 02) Therefore, they had failed to meet the

zero downtime business requirements. The servers were “somewhat inflexible

and required way too much human intervention.” Rye’s backup servers were not

identical to those at the primary site, causing application compatibility problems

with the operating systems. (Egenera, 2006)

"Our strategy had been based on a false one-to-one ratio of technology,


meaning if I buy a server here and one for Rye, I'm protected," Arenaro
says. "The reality is when you are faced with that situation, having
hardware really is the least of your worries. It's really having your data and
your systems available and ready to use."(Mears, Connor, & Martin, 02)

Commerzbank corrected this by virtualizing their servers and eliminating

proprietary operating systems. The virtualized Linux servers use “SUSE Linux

and the support model of the open source community” rather than the HP

operating system.(Egenera, 2006) Another problem was that the hardware

residing “on the server itself—the disk, network interface card and storage

interface—give that server a fixed identity”; this also caused delays as the servers

were manually reassigned.(Egenera, 2006)

The virtualized environment provides a pool of servers with shared

storage and networking hardware to “run any application on demand”. (Egenera,



2006) The new “system is designed for SAN connectivity and boot; any

BladeFrame server can assume any identity at any time. That’s what we were

missing and what we grappled with on 9/11.”(Egenera, 2006) The cooling

requirements for the data center have also decreased due to the virtualized

servers. The overall physical complexity has decreased as well; 140 servers

were consolidated into 48 blades. (Egenera, 2006) The virtualized configuration

has reduced hardware trouble-shooting time. Configuring new servers now takes

less than an hour; it previously took up to 16 hours. (Egenera, 2006)
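The idea that any pooled server can assume any identity when its boot volume and configuration live on shared storage can be sketched as follows; this hedged Python model only illustrates the concept described above, and the blade and role names are invented for the example.

    # Sketch of identity reassignment in a virtualized server pool: the
    # "identity" (boot volume, address, role) lives on shared storage, so any
    # idle blade can pick it up after a failure. All names are illustrative.
    from dataclasses import dataclass

    @dataclass
    class Identity:
        role: str      # e.g. a critical application service
        boot_lun: str  # shared-storage volume holding the OS and application
        address: str

    class BladePool:
        def __init__(self, blade_names: list[str]):
            self.free = set(blade_names)
            self.assigned: dict[str, Identity] = {}  # blade -> identity

        def assign(self, identity: Identity) -> str:
            blade = self.free.pop()          # any idle blade will do
            self.assigned[blade] = identity  # the blade boots from the shared volume
            return blade

        def fail(self, blade: str) -> str:
            identity = self.assigned.pop(blade)  # release the failed blade
            return self.assign(identity)         # another blade takes over the identity

    if __name__ == "__main__":
        pool = BladePool(["blade01", "blade02", "blade03"])
        first = pool.assign(Identity("core-banking-db", "lun-17", "10.0.0.5"))
        replacement = pool.fail(first)
        print(f"{first} failed; {replacement} now runs core-banking-db")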

The primary site and the backup site contain servers that are members of

active/active clusters. Applications as well as data are stored on a SAN allowing

any services to be switched seamlessly between locations using bi-directional

synchronous replication. The Rye site is now an active part of daily processing

and handles 40% of the processing load. (Egenera, 2006)

We live every day in the recovery portion of the DR mode. Having the
assets active takes the mystery out of continuity. We’re not praying that it
works, not planning that it works—we know it works because it’s an active
part of the process.(Egenera, 2006)

4.1.2.5. Discussion

This case study provides an example of disaster recovery done correctly.

The IT department was involved in contingency planning and performed regular

testing, and every staff member knew what to do. The failover processes were

sufficiently automated to allow the evacuation process to focus on safety without



concern for heroics to save the business. Post-incident review showed some

weakness in the technical contingency plan. The plan’s focus needed to be

shifted from recovery to continuity to meet Commerzbank’s business needs. The

company identified the problem, found a suitable solution, and implemented the

solution.

The remaining weakness, based on the information available, is that there

is no mention of a third cluster outside of New York. If an incident occurred that

severely impacted New York on a larger scale, having only two clusters both

located in New York may not provide the seamless zero downtime the company

requires. This global company has the resources to commit to this more

comprehensive configuration. They also have facilities around the world to take

advantage of for co-location. The floor space use was reduced by 60% through

server virtualization; this extra space should be taken advantage of to host

remote clusters between Commerzbank locations to ensure continuity.(Egenera,

2006)

In this case, like that of Katrina, the disaster destroyed the hardware at the

site. There was little that preparedness could do to save the equipment.

However, unlike Katrina the recovery plan worked. Commerzbank had many

advantages in this case; New York’s infrastructure did not suffer the damage

New Orleans suffered. Commerzbank did not have to shoulder the burden of

rebuilding a city, only their primary location. Also, Commerzbank had the

resources necessary to provide for their uptime requirements.



The lesson that can be learned from Commerzbank is not to be

complacent. Disasters happen of various scales on a daily basis, most are not

terribly severe and impact a small number of people. Failure to plan for a large-

scale severe impact event will increase the financial burden and stress of

incidents that do occur. If possible, defray the costs of maintaining hot sites by

integrating them into daily processing as Commerzbank has done. During

planning, walk through as many scenarios as imaginable; this will help ensure

that all details are covered.

4.1.3. Conclusion

Commerzbank survived 9/11 with relative ease while many others suffered

unrecoverable losses. Many did not recover due to failure to plan and prepare for

the possibility of massive hardware and personnel losses. Commerzbank

understood the bank’s vulnerabilities and tolerances and made the investments

necessary to mitigate them. Past experience had taught the company how to

survive and high-level management and staff were trained to manage incidents.

This vigilance paid off in reduced downtime and minimized financial impact to the

company.

4.2. FirstEnergy

4.2.1. Background

FirstEnergy (FE), founded in 1997 and located in Akron, Ohio, is ranked

179th in the 2010 list of Fortune 500 companies.(FirstEnergy, 08)(FirstEnergy,

09)(Fortune, 10) This unregulated utility supplies electricity to “Illinois, Maryland,

Michigan, New Jersey, Ohio, and Pennsylvania”.(FirstEnergy, 09) FirstEnergy

has remained highly profitable despite a history of poor practices that put the

public at risk. One of the most notable resulted in a $5.45 million fine issued by

the Nuclear Regulatory Commission (NRC). This fine regarded “reactor pressure

vessel head degradation”. FirstEnergy was notified of the problem in 2002 by the

NRC. (Merschoff, 05) The plant was operated for nearly two years after the

company was aware the equipment was unsafe to operate. (Merschoff, 05)

FirstEnergy employees supplied the NRC with misinformation and at least two

employees were indicted. (Associated Press, 06)

4.2.2. Northeast Blackout of 2003

In 2003, the Northeast region suffered a blackout, the largest in US

history, causing several Northeast US cities and parts of Canada to be without power.

(Minkel, 08) News reports claimed this blackout was primarily due to a software

bug that stalled the utility’s control room alarm system for over an hour. The

operators were deprived of the alerts that would have caused them to take the

necessary actions to mitigate the grid shutdown/failures. The primary energy grid

monitoring server failed shortly after the failure of the alarm system; the backup

server took over and failed after a short period. The failure of the backup server

overloaded the remaining server’s processing ability, bringing computer response

time to a crawl, which further delayed operators’ actions due to a refresh rate of

up to 59 seconds per screen. (U.S.-Canada Power System Outage Task Force)

The operators’ actions were slowed while they waited for information and service

requests from the server to load.
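One lesson drawn later in this case is that a stalled alarm function should not fail silently. A minimal hedged sketch of an independent watchdog is shown below; the timeout value and the paging function are assumptions made for illustration and do not describe FirstEnergy's actual system.

    # Hypothetical watchdog sketch: an independent check that the alarm
    # subsystem is still emitting heartbeats, paging both the control room and
    # IT staff if it goes silent so a stalled alarm process is noticed quickly.
    import time

    HEARTBEAT_TIMEOUT = 60.0  # assumed: seconds of silence before paging

    def page(group: str, message: str) -> None:
        """Placeholder for the real paging or notification system."""
        print(f"PAGE {group}: {message}")

    def check_alarm_subsystem(last_heartbeat: float, now: float) -> bool:
        """Return True if the alarm subsystem appears healthy."""
        if now - last_heartbeat > HEARTBEAT_TIMEOUT:
            page("control-room operators", "alarm processing has stalled; monitor manually")
            page("IT staff", "alarm processing has stalled; begin diagnosis")
            return False
        return True

    if __name__ == "__main__":
        # Simulated check: the last heartbeat arrived 75 seconds ago.
        now = time.time()
        healthy = check_alarm_subsystem(last_heartbeat=now - 75, now=now)
        print("healthy" if healthy else "stalled")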

4.2.2.1. Ramifications

4.2.2.1.1. General

In a matter of minutes the blackout cascaded through the power grid

taking down over 263 plants. (Associated press, 03) This resulted in eight states and

parts of Canada being without power. (Barron J., 2003) This blackout affected

water supply, transportation, and communication. One hospital was completely

without power (Barron, 2003) and governmental systems to detect border

crossings, port landings, and unauthorized access to vulnerable sites failed.

(Northeast Blackout of 2003) The estimated cost of this blackout was $7-10

billion. (Electricity Consumers Resource Council (ELCON), 2004)



4.2.2.1.2. FirstEnergy

Immediately following the outage, FirstEnergy’s public stock offering

values fell as investors were cautioned, citing the possibility of fines and

lawsuits. (From Reuters and Bloomberg News, 03) A US-Canadian taskforce

assigned to investigate “found four violations of industry reliability standards by

FirstEnergy”. (Associated press, 03)

The FirstEnergy violations included not reacting to a power line failure


within 30 minutes as required by the North American Electricity Reliability
Council, not notifying nearby systems of the problems, failing to analyze
what was going on and inadequate operator training. (Associated press,
03)

There were no fines assessed because at that time no regulatory entity had the

authority to impose fines. (Associated Press, 06) However, FirstEnergy

stockholders sued for losses due to negligence, and the company settled in July

of 2004, agreeing to pay $89.9 million to stockholders. (The New York Times

Company, 04)

4.2.2.2. Response

4.2.2.2.1. MISO

The Midwest Independent System Operator (MISO), located in Carmel, Indiana, is

the group responsible for overseeing power flow across the upper Midwest.

(Associated press, 03) (Midwest ISO) The MISO state estimator tool

malfunctioned due to a power line break at 12:15 Eastern Daylight Time (EDT).

(U.S.-Canada Power System Outage Task Force) This was one of the two tools

MISO used, both of which were under development, to assess electric system

state and determine best course of action. (U.S.-Canada Power System Outage

Task Force) The state estimator (SE) mathematically processes raw data and

presents it in the electrical system model format. This information is then fed

into the real-time contingency analysis (RTCA) tool to “evaluate the reliability of

the power system”. (U.S.-Canada Power System Outage Task Force, p. 48)

At 12:15 the SE tool produced a solution with a high degree of error. The

operator turned off the automated process that runs the SE every five minutes to

perform troubleshooting. Troubleshooting identified the cause of the problem as

an unlinked line and manually corrected the linkage. The SE was manually run

at 13:00, producing a valid solution. (U.S.-Canada Power System Outage

Task Force) The real-time contingency analysis (RTCA) tool successfully

completed at 13:07. The operator left for lunch, forgetting to re-enable the

automated tool processing. This was discovered and re-enabled at about 14:40.

The previous linkage problem recurred and the tools failed to produce reliable

results. The tool was not successfully run again until “16:04 about two minutes

before the start of the cascade.” (U.S.-Canada Power System Outage Task

Force, p. 48)
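The consequence of leaving the automatic five-minute runs disabled can be illustrated with a small hedged sketch: a scheduler that, when paused for troubleshooting, alerts the operator once the pause outlives a grace period. The grace period and the alert wording are assumptions for the example, not features of MISO's actual tools.

    # Sketch of a periodic analysis scheduler with a guard: if automatic runs
    # are paused (for example, during troubleshooting) longer than a grace
    # period, the scheduler nags the operator instead of staying silently off.
    import time

    PAUSE_GRACE = 15 * 60  # assumed: longest acceptable unattended pause, in seconds

    class EstimatorScheduler:
        def __init__(self) -> None:
            self.paused_since: float | None = None

        def pause(self) -> None:
            self.paused_since = time.time()

        def resume(self) -> None:
            self.paused_since = None

        def tick(self, now: float) -> str:
            if self.paused_since is None:
                return "run state estimator"  # normal five-minute run
            if now - self.paused_since > PAUSE_GRACE:
                return "alert operator: automatic runs are still disabled"
            return "paused for troubleshooting"

    if __name__ == "__main__":
        sched = EstimatorScheduler()
        sched.pause()  # operator begins troubleshooting
        print(sched.tick(now=time.time() + 20 * 60))  # the pause has been forgotten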

4.2.2.2.2. FE

The Supervisory Control and Data Acquisition (SCADA) system monitoring

alarm function failed at 14:14 and began a cascading series of application and

server failures; by 14:54 all functionality on the primary and backup servers

failed. (U.S.-Canada Power System Outage Task Force) FE’s IT staff were

unaware of any problems until 14:20, when their monitoring system paged them

because the Emergency Management System (EMS) consoles failed. At 14:41

the primary control system server failed and the backup server took over

processing. The FE IT engineer was then paged by the monitoring system. (U.S.-

Canada Power System Outage Task Force)

A “warm reboot” was performed at 15:08. (U.S.-Canada Power System

Outage Task Force) IT staff did not notify the operators of the problems, nor did

they verify with the EMS system operators that functionality was restored. (U.S.-

Canada Power System Outage Task Force) The alarm system remained non-

functional. IT staff were notified of the alarm problem at 15:42 and they

discussed the “cold reboot” recommended during a support call with General

Electric (GE). The operators advised them not to perform the reboot because the

power system was in an unstable state. (U.S.-Canada Power System Outage

Task Force) Reboot attempts were made at 15:46 and 15:59 to correct the EMS

failures. (U.S.-Canada Power System Outage Task Force)

An American Electric Power (AEP) operator, who was still receiving good

information from FE’s EMS, called FE operators to report a line trip at 14:32.

Shortly thereafter operators from MISO, AEP, PJM Interconnection (PJM), and

other FE locations called to provide system status information. (U.S.-Canada

Power System Outage Task Force) FE operators became aware that the EMS

systems had failed at 14:36, when an operator reporting for the next shift

reported the problem to the main control room. (U.S.-Canada Power System

Outage Task Force) The “links to remote sites were down as well.” (U.S.-Canada

Power System Outage Task Force, p. 54) The EMS failure resulted in the

Automatic Generation Control (AGC), which works with affiliated systems to

automatically adjust to meet load, being unavailable from 14:54 to 15:08. (U.S.-

Canada Power System Outage Task Force) FE operators failed to perform

contingency analysis after becoming aware that there were problems with the

EMS system. (U.S.-Canada Power System Outage Task Force) At 15:46 it was

too late for the operators to take action to prevent the blackout. (U.S.-Canada

Power System Outage Task Force)

4.2.2.3. Mitigation in place

FirstEnergy did have mitigation in place. There were several server nodes

that could host all functions, with one server on “hot standby” for backup with

automatic failover. (U.S.-Canada Power System Outage Task Force) FE had an

established relationship with the EMS vendor GE, which provided support to the

IT staff when a new problem occurred that the IT staff was not experienced with.

There were also established mutual aid relationships with other utility operators.

The operators have the ability to monitor affiliated electric systems and request

support. There were also established communication procedures that dictated

that the operators make calls under specific conditions.



FirstEnergy also had a tree trimming policy that is a standard mitigation

tactic for electric companies. The purpose of the policy is to avoid line failures that would require immediate repair for safety reasons and would increase stress on the electric system. This is a non-technical mitigation measure that is very important

to protect the reliable functioning of the electric system and its monitoring tools.

4.2.2.4. Corrective actions

4.2.2.4.1. Regulatory

Federal Energy Regulatory Commission (FERC) regulations are no longer voluntary; FERC can now "impose fines of up to a million dollars a day." (Minkel, 08) The Energy Policy Act of 2005 provided FERC authority to set and enforce

standards. (Minkel, 08) FERC has also created a prototype real-time monitoring

system for the nation’s electric grid. (Minkel, 08)

Future smart or supergrid systems are also under development. According

to Arshad Mansoor, Electric Power Research Institute’s power delivery and

utilization vice president, these systems would provide more resiliency by

"monitoring and repairing itself." (Minkel, 08) Project Hydra, scheduled to be in service in downtown Manhattan in 2010, is a joint supergrid venture between the Department of Homeland Security and Consolidated Edison Company of New York. (Minkel, 08) More testing and infrastructure upgrades are required before

this promising technology could be implemented on a large scale. (Minkel, 08)



4.2.2.4.2. FirstEnergy

FirstEnergy implemented a new EMS system that was installed at two

locations to provide resiliency. (Jesdanun, 04) The new system has improved

alarm, diagnosis, and contingency analysis capabilities. (NASA, 2008) The new system also provides more visual status information and cues. (NASA, 2008) FirstEnergy created an operator certification program and an emergency response plan, and updated its protocols. Communication requirements were established for "computer system

repair and maintenance downtimes between their operations and IT staffs” and

“tree trimming procedures and compliance were tightened.” (NASA, 2008)

4.2.2.5. Discussion

The primary cause of this outage appears to be human error. The

electrical systems operators were “unaware” of the problem for over an hour, as

the electrical system began to degrade. (U.S.-Canada Power System Outage

Task Force) However, there were repeated warnings from communications with

operators from various locations to indicate there was a problem with the EMS.

The operators were aware that there was a problem at 14:36, which provided the

operators more than an hour to take action. The discussion between FE

operators and IT staff indicated that the operators were aware that the electrical

system state required action. Operators’ actions may have been hampered from

14:54 to 15:59 by EMS screen refresh rates of up to “59 seconds per screen.”

(U.S.-Canada Power System Outage Task Force, p. 54)



FE’s IT staff failed to notify the operators at 14:20, when they became

aware of EMS system failures. This could have provided the EMS operators with

16 minutes more to determine and execute the correct course of action. Also, the FE EMS was not configured to produce alerts when it failed, which is a standard EMS feature. This would have provided another six minutes for the

operators to perform manual actions. Based on the operators’ failure to act on

the many other warnings they received, it is hard to make a case that the

operators would have acted in a timely manner even with an additional 22 minutes' notice. It is possible that the operators were too dependent upon the

automated systems and overconfident that the situation would correct itself. The

North American Electric Reliability Council (NERC) found FE in violation for

failure to use “state estimation/contingency analysis tools”. (U.S.-Canada Power

System Outage Task Force, p. 22)

The EMS system was “brought into service in 1995” and it had been

decided to replace the aging system "well before August 14th." (U.S.-Canada

Power System Outage Task Force, pp. 55-56) The NERC found FE in violation

for insufficient monitoring equipment. (U.S.-Canada Power System Outage Task

Force) It was later determined that the software had a programming error that

contributed to the alarm failure. According to Kema transmission services senior vice president Joseph Bucciero, "the software bug surfaced because of the number of unusual events occurring simultaneously - by that time, three

FirstEnergy power lines had already short-circuited.” (Jesdanun, 04) The three

lines were lost because FE failed to perform tree trimming according to internal

policy. The lines sagged, which occurs on hot days, and touched trees. (NASA,

2008)

4.2.3. Conclusion

This outage serves as an example that many small errors, mostly human,

can result in disaster. A more resilient system requiring less human interaction to

perform emergency tasks could have prevented this outage. Poor communication

between IT and Operations staff was a large factor, as was the operators' failure to heed the warnings of other operators. The FirstEnergy operators were provided

with information outside of their EMS to understand that the EMS was likely

providing unreliable information. The largest contributing factor was FirstEnergy's failure to be proactive. They did not trim trees, they did not replace their old EMS

system, they did not communicate appropriately with other energy operators, and

they did not train the employees how to act in a crisis situation when the EMS

could not be relied upon. There were contributing factors outside of FirstEnergy,

but if any one of the factors contributed by FirstEnergy had been removed, the widespread outage might not have occurred.



4.3. Tulane

4.3.1. Background

Tulane University is a private institution located in New Orleans, Louisiana

with an extension located in Houston, Texas, for the Freeman School of Business. (Gerace, Jean, & Krob) The University was established in 1834 as a

tropical disease research medical school (Alumni Affairs, Tulane University,

2008). A post-Civil War endowment from Paul Tulane transformed the financially

struggling public university into the private university that survives today. (Alumni

Affairs, Tulane University, 2008) Tulane maintains a community-service-oriented

focus and its contributions have shaped the city of New Orleans over the

decades. In 1894 Tulane’s College of Technology brought electricity to the city of

New Orleans. (Alumni Affairs, Tulane University, 2008) Tulane is currently New

Orleans’s largest employer. (Tulane University)

Since its establishment, Tulane has weathered the Civil War and many hurricanes. Tulane has adapted to New Orleans's hurricane-prone environment, integrating buildings that can "withstand hurricane force

winds” into the campus landscape. (Alumni Affairs, Tulane University, 2008) Only

Katrina and the Civil War have prevented Tulane from offering instruction.

(Tulane University, 2009)



4.3.2. Hurricane Katrina

Two days before the beginning of Tulane's 2005 fall semester, Hurricane

Katrina devastated New Orleans. (Blackboard Inc., 2008) This was “the worst

natural disaster in the history of the U.S.” (Cowen, 05) The real damage to New

Orleans began hours after Katrina passed, as the levees succumbed to the damage they suffered during the storm.

4.3.2.1. Ramifications

The ramifications of this disaster reach far beyond Tulane’s campus.

However, Tulane’s data center is the focus of this case study therefore direct

impact on Tulane and the cascading effects will be discussed. The hurricane’s

property damages alone were in excess of $400 million. (Alumni Affairs, Tulane

University, 2008) Over a week after Katrina, “eighty percent of Tulane’s campus

was underwater.” (Alumni Affairs, Tulane University, p. 66)

The New Orleans campus was closed for the fall semester of 2005.

(Cowen S. , Messages for Students , 05) Students were displaced and attended

other colleges as “visiting” students. (Gulf Coast Presidents, 2005) Some

students were asked to pay fees at the hosting university; Tulane promised to address tuition issues as soon as it regained access to student records.

(Cowen S. , Student Messages, 05)



As university administration began planning for Tulane’s recovery from

Hurricane Katrina, they had no access to “computer records of any kind”. (Alumni

Affairs, Tulane University, 2008, p. 65) Tulane’s bank was not operational and

the administration did not know what funds were in the inaccessible account.

(Alumni Affairs, Tulane University, 2008) Accounts receivable servers were

unrecoverable because they operated independent of central IT. (Lawson, A

Look Back at a Disaster Plan: What Went Wrong and Right, 05)

Research at Tulane University suffered as well. Specimens from long-running studies were destroyed. Engineering faculty returned to campus "to service critical equipment and retrieve important servers," which saved several

experiments. (Grose, Lord, & Shallcross, 2005) Over 150 research projects

suffered damage. (Alumni Affairs, Tulane University, 2008) Medical teams were

forced to destroy dangerous germ specimens used in research to avoid possible

outbreaks caused by inadvertent release of the germs. In addition, Tulane’s

Hospital was closed for six months, but “was the first hospital to reopen in

downtown New Orleans”. (Oversight and Investigations Subcommittee of the

House Committee on Energy and Commerce, p. 1)

Tulane reopened in January 2006 for the spring semester. The school lost

$125 million due to being closed for the fall semester of the 2005-2006 school year. (Alumni Affairs, Tulane University, 2008) Prior to reopening, Tulane had to

streamline its academic programs. This made funding available for the daunting

task of rebuilding Tulane and New Orleans. New Orleans had no infrastructure to

support Tulane. Tulane provided housing, utilities, and schools to support Tulane

students and staff. (Alumni Affairs, Tulane University, 2008) Despite Tulane’s

amazing recovery, loss of tuition income and disaster-related financial losses

forced staff reductions and furloughs. (Lord, 2008)

4.3.2.2. Response

On Monday, August 29, 2005, Tulane University was flooded after the levees

damaged by Hurricane Katrina broke. (Searle, 2007) Tulane was fortunate to

have a few days' warning prior to the hurricane. On August 25th, Tulane's IT staff

initiated online data backups according to the data center disaster recovery plan.

(Lawson, 2005) On August 28th, Tulane brought its information systems down.

(Lawson, 2005) Backup generators and supplies were placed into campus

buildings. (Krane, Kahn, Markert, Whelton, Traber, & Taylor, 2007) On the 30th, generators began to fail as Tulane's campus flooded; as a result, communication

systems failed “with loss of e-mail systems and both cell and landline phones.

Text messaging remains functional and becomes the main source of

communication.” (Krane, Kahn, Markert, Whelton, Traber, & Taylor, 2007)

Senior administration staff sheltered in the Reily Student Recreation

Center command post along with other essential staff during the Hurricane.

Wednesday, August 31 Tulane’s “Electrical Superintendant Bob Voltz” shut off

power to the Reily building. (Alumni Affairs, Tulane University, 2008) Thursday,

the staff was rescued by helicopter from the now-flooded campus after several unsuccessful rescue attempts. (Alumni Affairs, Tulane University, 2008)

Tulane’s top recovery priority was paying its employees. (Anthes, 2008)

This effort was complicated because payroll employees failed to take the payroll

printers and supplies as specified in the disaster plan. (Lawson, A Look Back at a

Disaster Plan: What Went Wrong and Right, 05) Police escorted Tulane IT staff

to retrieve Tulane’s backup data and computers from their 14th floor offsite

datacenter in New Orleans. (Alumni Affairs, Tulane University, 2008) Tulane’s

recovered backup tapes were processed at SunGard in Philadelphia. (Anthes,

2008) SunGard’s willingness to take Tulane as a customer allowed payroll to be

completed “two days late” according to Tulane CIO John Lawson. (Lawson, A

Look Back at a Disaster Plan: What Went Wrong and Right, 05) As of September

3rd, 2005, Tulane still listed restoration of communications and IT systems as an

urgent issue. (Cowen S. , Student Messages, 05) Tulane’s President Scott

Cowen held live chats in September 2005 to address community concerns.

(Cowen S. , Student Messages, 2005)

Baylor University in Houston hosted Tulane’s redirected website and

invited Tulane to resume operations at Baylor. However, this process did not go

as smoothly as planned because the IP address assigned was not static.

(Lawson, A Look Back at a Disaster Plan: What Went Wrong and Right, 05) This

was quickly corrected and Tulane used the redirected emergency site to

communicate with stakeholders, providing "a continuous and unbroken chain of



updates via its Web site.” (Schaffhauser, 2005) School of Medicine classes

resumed in three weeks even though, at the outset:

none of the necessary infrastructure that maintains the functions of any


medical school was available to Tulane's SOM. Information technology
support, network communication servers, the University's payroll system,
and e-mail were down, and student, resident, and faculty registration
systems were not operational. Student and resident rosters did not exist,
nor were there any methods to confirm credentials or grades. (Krane,
Kahn, Markert, Whelton, Traber, & Taylor, 2007)

Clinical students were able to resume because the Association of American

Medical Colleges maintains a database on medical students, which had been

updated in the days before Katrina hit. (Testa, 2006) The database records, along with the Baylor registration website and newly created paper files, allowed Baylor

and Tulane to gather the information needed to resume classes. (Testa, 2006)

This resumption was particularly vital for seniors. Unfortunately, not all of the

college students of New Orleans were so lucky. About 100,000 were displaced,

many with no academic or financial records. (DeCay, 2007)

Email “was the first system to be brought back online”. (McLennan, 2006)

Blackboard provided systems to allow Tulane and other affected Gulf Coast

universities to establish online courses. (McLennan, 2006) This system was

utilized by Tulane to provide a six-week “mini fall semester”. (McLennan, 2006)

Tulane’s own Blackboard system was quickly restored to allow retrieval of course

material. (McLennan, 2006) There was no help desk to assist students or

instructors during the “mini fall semester”. (McLennan, 2006)



4.3.2.3. Mitigation in place

Tulane’s IT had plans that covered “how to prepare for a hurricane.”

(Anthes, 2008) The staff was trained and comfortable enacting the disaster plan.

They knew the backups could be completed in 36 hours. (Lawson, A Look Back

at a Disaster Plan: What Went Wrong and Right, 05) Offsite backups were

maintained on the 14th floor of a building in New Orleans. (Anthes, 2008) Tulane

also maintained a website for emergency information and phone contacts.

(McLennan, 2006) In the case of a Category 4 or higher hurricane, the data center would be shut down and evacuated. (Lawson, A Look Back at a Disaster Plan: What Went Wrong and Right, 05) The remotely hosted emergency website for

Tulane would be activated prior to shutdown. (Lawson, A Look Back at a Disaster

Plan: What Went Wrong and Right, 05)

4.3.2.4. Corrective actions

Today the university has a disaster recovery plan including offsite backup

servers for websites, e-mail and other critical systems, which is updated yearly.

(Anthes, 2008) There are also documented protocols for recovery from a

disaster, which were missing during the recovery from Katrina. (Anthes, 2008) The recovery plan has also been amended to cover more than hurricanes, and IT

staff now participates in preparedness planning. (Anthes, 2008) (Gerace, Jean, &

Krob, 2007)

As of 2008, Tulane had a contract with SunGard for a mobile data center for emergencies. (Anthes, 2008) Katrina's effect on the New Orleans backup data center made it clear that Tulane needed to maintain backups at a more distant location; as a result, "backups are taken to Baton Rouge 3 times a week." (Anthes,

2008) Employees have been provided with USB storage devices to prepare

personal backups for emergencies. (Anthes, 2008) An alternate recovery site has

been established in Philadelphia and there is now a hardened onsite command

center at Tulane. (Lord, 2008) “Energy efficient systems were installed in the

down town campus," which can be operated longer using emergency generators. (Alumni Affairs, Tulane University, 2008)

Tulane also maintains a “digital ham-radio network that can transmit

simple e-mail” and emergency updates to the website can be published directly

by the university’s public relations (Lord, 2008) “So as not to be dependent on

the media to track potentially disastrous hurricanes, Tulane has enlisted a private

forecaster to supply e-mail updates." (Lord, 2008) "Students are required to have notebook computers," which can facilitate continuity during a disaster, and the

university now has online classes. (Gerace, Jean, & Krob, 2007) (Lord, 2008)

4.3.2.5. Discussion

Tulane’s situation is an extreme, but not unique example. There were

many things they did right, and in the end they recovered. It is debatable whether a plan for offsite disaster recovery would have been worth the investment in

dollars. Itemized financial reports for Tulane were not available for review. It is

clear that the absence of an offsite recovery contract was a deliberate financial

decision. (Lawson, A Look Back at a Disaster Plan: What Went Wrong and Right,

05)

In retrospect, this was probably a poor financial gamble in a hurricane-prone area, especially considering that the destruction of the levees was a known

risk. (Kantor, 2005) This decision also created additional stress for Tulane’s staff

and students. Tulane did an excellent job of recovering payroll to ensure their

staff was not without desperately needed financial resources. The medical

students were also well cared for thanks to the help of outside partnerships. The

continued medical program would not have been possible had there not been an

existing, if informal, mutual aid relationship with Baylor.

Unfortunately, the loss of Tulane’s data center made for a difficult fall 2005

semester for most students. They not only had to relocate, but were without

financial or academic records from Tulane. For those students, the approximately

$300,000 per year expenditure would have provided some peace of mind.

(Lawson, A Look Back at a Disaster Plan: What Went Wrong and Right, 05) As a

result of this, as well as other adverse conditions at Tulane, many students did

not return. In 2008, enrollment at Tulane was down by 5,300 students from its

pre-Katrina numbers. (Lord, 2008) This resulted in financial distress for Tulane

and the closing of its engineering school and consolidation of other programs and

colleges. (Lord, 2008)



Nonetheless, Tulane made herculean efforts to reopen one semester after

Katrina forced the University to close. This ensured not only the survival of

Tulane, but the revival of New Orleans as well. The medical students and

hospital provided much-needed health care for New Orleans residents. Architecture students are designing and building affordable, energy-efficient homes. (Brown, 2008) The damage caused to Tulane and New Orleans was beyond what prevention or infrastructure protection measures could have addressed. Efforts to preserve the ability to

recover from a complete loss of IT and infrastructure proved to be the most

valuable in this case. No one institution was capable of recovering New Orleans,

but Tulane has kept it alive.

4.3.3. Conclusion

Tulane has learned from Katrina how to protect the data that is the

lifeblood of the university. The aftermath of Katrina has also made clear that the

students are Tulane’s customers and they cannot survive without them. Further

that New Orleans is dependent upon Tulane. Universities have a history of

providing for the communities they are a part of in times of disaster, this was true

in the aftermath of 9/11 and of Katrina. As the largest employer, a medical

provider and educator Tulane has persevered and shored up its weakness and

has become more independent.



4.4. Commonwealth of Virginia

4.4.1. Background

The state of Virginia outsourced its information technology to Northrop

Grumman in 2005. (Schapiro & Bacque, Agencies' computers still being

restored, 2010) The contract was to span 10 years at a cost of $2.4 billion, making it the largest single-vendor contract in Virginia's history. (Lewis) The Virginia Information Technologies Agency (VITA) was established in 2003 and is the state agency charged with ensuring that the state's information technology needs and the

terms of the contract with Northrop Grumman are met. (Lewis)

This was to be the flagship partnership to show that the public sector

could benefit through private outsourcing of information technology.

However,"(d)elays, cost increases and poor service have dogged the state's

largest-ever outsourcing contract, the first of its kind in the country”. (Schapiro &

Bacque, Agencies' computers still being restored, 2010) Virginia had entered the contract with the expectation that it would provide modernized

services for the "same cost as maintaining their legacy services.” (Stewart,

2006) At this point, the state no longer expects to see any cost savings under the original contract period, but hopes that savings will be realized under an extended

contract. (Joint Legislative Audit and Review Commission, 2009)



Since the beginning of the contract with Northrop Grumman, the state of

Virginia has suffered two major outages. In addition, the state paid an additional

$236 million to cover a hardware refresh. (Schapiro & Bacque, Agencies'

computers still being restored, 2010) This process, scheduled to be completed in

July 2009, is significantly behind schedule. There have been ongoing issues with

Northrop Grumman's poor performance in several areas. (Joint Legislative Audit

and Review Commission, 2009) Included in these issues are inadequate disaster recovery and unreliable backup completion. As recently as October 2009, "lack

of network redundancy" was recognized as a "major flaw" in the system. (Joint

Legislative Audit and Review Commission, 2009, p. 105)

4.4.1.1. Oversight issues

Until late March 2010, VITA could make changes to the

contract with Northrop Grumman without consulting with the General Assembly.

(Joint Legislative Audit and Review Commission, 2009) This limited the

governor’s ability to oversee the state IT services. The Information Technology

Investment Board (ITIB) was charged with oversight of VITA, but could not provide full-time oversight. The members of the ITIB attended meetings irregularly and lacked the technical knowledge required to provide adequate governance. (Joint Legislative

Audit and Review Commission, 2009) VITA's oversight was restructured to

eliminate the ITIB. VITA and the State CIO now report to the Office of the

Secretary of Technology. This new structure creates direct oversight by the



Governor. The new structure became effective March 16, 2010, after passing through the House and Senate via an emergency clause. (VITA)

4.4.1.2. Notable service failures

The state has been plagued with a litany of service failures throughout the

contract with Northrop Grumman. In 2009, prison phone service failed and was

prioritized according to the number of employees affected. The technicians were

given 18 hours to resolve the issue according to the assigned prioritization.

Service was restored six and a half hours later following an escalation request

initiated by the prison. (Joint Legislative Audit and Review Commission,

2009) Another service failure, noted in the JLARC 2009 report, left the Virginia

State Police without internet access for three days. (Kumar & Helderman,

2009) On June 20, 2007, the state of Virginia suffered a widespread outage. (VITA, 2007) The outage was caused by several "near simultaneous hardware failures"

in a legacy server scheduled for refresh. (The Virginia Information Technology

Infrastructure Partnership) This failure occurred after the annual disaster

recovery test, which was held in April.

4.4.2. August 2010 outage

On Wednesday, Aug. 25, an outage occurred impacting 27 state agencies.

(News Report, 2010) Thirteen percent of the state's file servers were unavailable

during the outage. (Schapiro & Bacque, Agencies' computers still being restored,

2010) "The computer troubles were traced to a hardware malfunction at the

state's data center near Richmond, which caused 228 storage servers to go

offline." (Kravitz, Statewide computer meltdown in Virginia disrupts DMV, other

government business, 2010) The hardware that failed was one of the SAN's two

memory cards. (Lewis) According to Jim Duffey, Virginia Secretary of

Technology, the outage was "unprecedented" based on the "uptime data" on the

EMC SAN hardware that caused the widespread failure. (News Report, 2010)

"Officials also said a failover wasn't triggered because too few servers were

involved." (News Report, 2010) "Workers restored at least 75 percent of the

servers overnight." (Kravitz, Statewide computer meltdown in Virginia disrupts

DMV, other government business, 2010)

4.4.2.1. Ramifications

The SAN failure negatively impacted “483 of Virginia's servers." (Schapiro

& Bacque, Agencies' computers still being restored, 2010) Virginia's Department

of Motor Vehicles (DMV) was the most visibly impacted agency. Drivers were

not able to renew licenses at the DMV offices during the outage, forcing the DMV

to open on Sunday and work through Labor Day to clear the backlog of expired

licenses. (Schapiro & Bacque, Northrop Grumman regrets computer outage,

2010) Some drivers were ticketed for expired licenses before law enforcement

agencies were requested to stop issuing tickets to affected drivers. (Charette,



2010) According to Virginia State Police, while "they will not cite drivers whose

licenses expired during the blackout", unfortunately those that received tickets

must "go through the court system" to request relief. (Kravitz & Kumar, Virginia

DMV licensing services will be stalled until at least Wednesday, 2010) In

addition, drivers who renewed licenses the day of the blackout will need to visit

the DMV again because the data and pictures from the transactions that day

were lost. (Schapiro & Bacque, Northrop Grumman regrets computer outage,

2010) This also increases the likelihood that some of the licenses and IDs

issued that day could be illicitly sold.

The DMV was not the only agency negatively impacted by the SAN

outage. According to the Department of Social Services, about 400 welfare recipients would receive benefit checks up to two days late. Employees at this

agency also worked overtime to reduce and eliminate delays where possible.

(Schapiro & Bacque, Agencies' computers still being restored, 2010) Internet

services used by citizens to make child support and tax payments were

unavailable as well. (Schapiro & Bacque, Northrop Grumman regrets computer

outage, 2010) "At the state Department of Taxation, taxpayers could not file

returns, make payments or register a business through the agency's website."

(Schapiro & Bacque, Agencies' computers still being restored, 2010) Three days after the outage began, "(f)our agencies continue(d) to have 'operational issues'"; these agencies included the departments of Taxation and Motor Vehicles. Many

other agencies continued to suffer negative effects from the outage. (Schapiro &

Bacque, Agencies' computers still being restored, 2010)

4.4.2.2. Response

At approximately noon on Aug. 25, a data storage unit sent an error

message. (Wikan, 2010) The cause of the error message was determined to be that "one of the two memory boards on the machine needed replacement." (Wikan, 2010) "A few hours later, a technician replaced the board." (Wikan, 2010) Shortly after the board was replaced, the storage area network (SAN) failed. It was later discovered that the wrong board might have been replaced. (Wikan, 2010) "VITA

and Northrop Grumman activated the rapid response team and began work with

the appropriate vendors to restore service." (VITA)

Work continued through the night to restore services, but staff were unable to restore data access to affected servers. (Wikan, 2010) (VITA) On Thursday, the SAN was shut down overnight to replace all components. (Wikan, 2010)

The storage provider, EMC, determined that the best course of action is to
perform an extensive maintenance and repair process. VITA and Northrop
Grumman, in consultation, have determined this is the best way to
proceed. (VITA)

The 24 affected agencies were notified prior to the SAN shutdown to allow them

to take appropriate action. (VITA) SAN service was restored at "2:30 a.m. Aug.

27." (Wikan, 2010) Over half of the attached servers were operational Friday

morning. (VITA) VITA began working with the operational customers to confirm

service availability and perform data restoration. (VITA, 2010)

Unfortunately, the DMV remained unable to “process driver’s licenses at

its customer centers. Some other agencies continue(d) to be impacted." (VITA)

VITA continued data restorations over the weekend; the DMV restore took "about 18 hours." (Wikan, 2010) As of Monday, August 30th, "Twenty-four of the 27

affected agencies were up and running". (VITA) However, three key agencies still

suffered service outages Monday through Wednesday: "the Department of Motor

Vehicles, Department of Taxation and the State Board of Elections." (VITA)

4.4.2.3. Mitigation in Place

The recovery scenario in this SAN outage was facilitated by mitigation

measures in place. Not only was there a "fault-tolerant" SAN, but also there

were magnetic tape backups and the staff had just performed recovery exercise

testing. The established relationship with the hardware vendor EMC brought

additional expertise to resolve this SAN outage. VITA also has two data centers: a primary and a failover.

Examination of the documents available on the VITA web site would imply

that every recommended best practice is being implemented and executed. The

SAN hardware used is best in class and has excellent reliability. VITA also had a

rapid response team whose mission was to reach incident resolution rapidly.

(Nixon, 2010) Yet, an outage in one system had a serious negative impact on several agencies and, more importantly, on the citizens of Virginia for more than one

week.

This incident is one of many that the State of Virginia has suffered since

the beginning of the contract with Northrop Grumman. Professing use of industry

standards and best practices does not result in a reliable, stable cyber

infrastructure. In this case, there was still a single point of failure that resulted in

more than a minor inconvenience for Virginia. This implies a serious lack of

foresight in planning for recovery from this "unprecedented" outage.

IT professionals experience one-in-a-million or one-in-a-billion faults many times

throughout their careers. Disasters that are disregarded as near impossible do

occur. The state invested in a long-term partnership with Northrop Grumman to

avoid outages such as that which occurred in late August. Virginia made the

necessary financial investments that should have resulted in a more stable,

modern infrastructure managed by experienced IT staff.

4.4.2.4. Corrective Actions

Corrective actions were largely unknown at the time of this report; an independent review has

been ordered. Agilisys Inc. was chosen to conduct a 10-12 week audit beginning

November 1, 2010. (VITA, 2010)



4.4.2.5. Discussion

From a technical perspective, it is difficult to do more than speculate about

exactly what happened and extrapolate what should have been done. VITA

participated in disaster recovery exercises in the second quarter of 2010. (VITA,

2010) The exercise involved restoring service after losing a data center.

Provided that the exercise was adequately rigorous, performing a restore for an

outage affecting 13 percent of servers should have been relatively easy.

The major complicating factor resulting in delayed recovery was reported

to be tape restoration and data validation. (Wikan, 2010) More emphasis should

be placed on data restoration and validation activities in future exercises.

Incidents resulting in partial data loss or corruption are far more likely than loss of

an entire data center. Activities that improve restoration time for data recovery

are valuable to avoid serious negative business impacts from a relatively

common incident. Practicing these restorations will provide insight into process

and technology enhancements that might improve recovery time. In this case,

the data recovery process from tape left the DMV unable to issue or update

driver’s licenses or IDs for a week. A data restoration exercise might have

revealed this weakness and another solution might have been put in place to

mitigate the recovery time issue.

The Northrop Grumman decision to allow days between backups is highly

questionable. (Availability Digest, 2010) It is difficult to justify anything less than

daily backups for agencies like the DMV, the Department of Taxation, and Child

Support. Loss of payment records for the latter two agencies would cause major

inconveniences and bad press. Loss of four days' identification data for licenses and IDs is inexcusable. The root of this decision likely lies in the bottom line: Northrop Grumman is trying to make a profit and failed to implement sufficient redundancy for the customer's business needs. It is difficult to imagine a situation

where allowing days between backups is anything other than negligence.
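To illustrate the kind of safeguard this implies, the following is a minimal sketch, in Python, of an automated check that flags servers whose last successful backup is older than the agreed recovery point objective. The server names, dates, and 24-hour threshold are assumptions for demonstration, not details of the actual VITA or Northrop Grumman tooling:

# Hypothetical sketch: flag servers whose last successful backup is older than
# the agreed recovery point objective (RPO). Names, dates, and the threshold
# are illustrative assumptions, not the state's actual backup catalog.
from datetime import datetime, timedelta

RPO = timedelta(hours=24)  # assume a 24-hour RPO for agencies such as the DMV

# In practice these timestamps would come from the backup catalog.
last_successful_backup = {
    "dmv-licensing-db": datetime(2010, 8, 21, 23, 0),
    "taxation-payments-db": datetime(2010, 8, 24, 23, 0),
}

def stale_backups(catalog, now):
    """Return servers whose most recent backup violates the RPO."""
    return {name: now - ts for name, ts in catalog.items() if now - ts > RPO}

if __name__ == "__main__":
    now = datetime(2010, 8, 25, 12, 0)  # roughly when the repair was attempted
    for server, age in stale_backups(last_successful_backup, now).items():
        print(f"WARNING: {server} last backed up {age} ago; exceeds RPO of {RPO}")

A check of this kind, run before any maintenance window is approved, makes a multi-day gap in backups visible before work begins rather than after data is lost.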

Additional resiliency mechanisms should be built into the databases and

storage. This is advisable for all high availability databases and might have

avoided the data loss and corruption that occurred. One possible mechanism is

local auditing copies maintained onsite in intermediate storage until there is

confirmation that the data was written to the SAN. The local copy would then be

held until the backup copy is confirmed as processed. This would entail

maintaining onsite transaction records and data for up to 48 hours. Maintaining

local daily backups for daily transactions is also an advisable practice to avoid

loss of records.
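A minimal sketch of this staging idea follows, written in Python for illustration; the class, the confirmation calls, and the 48-hour retention window are assumptions rather than a description of any product the state used:

# Hypothetical sketch of a local auditing copy: records are retained in onsite
# intermediate storage until both the SAN write and the corresponding backup
# are confirmed, or until the retention window expires.
import time
from dataclasses import dataclass, field

RETENTION_SECONDS = 48 * 3600  # hold local copies for up to 48 hours

@dataclass
class StagedRecord:
    payload: bytes
    created: float = field(default_factory=time.time)
    san_confirmed: bool = False
    backup_confirmed: bool = False

class LocalAuditStore:
    def __init__(self):
        self._records = {}

    def stage(self, record_id, payload):
        self._records[record_id] = StagedRecord(payload)

    def confirm_san_write(self, record_id):
        self._records[record_id].san_confirmed = True

    def confirm_backup(self, record_id):
        self._records[record_id].backup_confirmed = True

    def purge(self):
        """Keep records that are not yet fully confirmed and not yet expired."""
        now = time.time()
        self._records = {
            rid: rec for rid, rec in self._records.items()
            if not (rec.san_confirmed and rec.backup_confirmed)
            and now - rec.created <= RETENTION_SECONDS
        }

Had records been staged this way, transactions written after the last backup might have been replayable from the local copies rather than lost.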

Another option is to maintain clone Business Continuance Volumes

(BCV), essentially a regularly scheduled copy between the two SANs. This

creates mirrored storage systems, with hot scheduled copies occurring every minute, for example, using technology such as Oracle Data Guard or SQL Server mirroring and log shipping. Most database engines have a way to replicate themselves in a near real-time state; the replicated copy is stored on separate physical hardware in order to eliminate data loss. The use of both options

presented would significantly reduce the possibility of data loss.
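As a sketch of how the health of such near real-time copies could be watched, the short Python example below alerts when a standby's reported lag exceeds a threshold. The lag readings and the one-minute threshold are illustrative assumptions; real values would come from the database engine's own replication status views rather than a hard-coded dictionary:

# Hypothetical sketch: alert when a standby copy lags its primary by more than
# an allowed number of seconds. Readings and threshold are illustrative.
MAX_LAG_SECONDS = 60  # copies are assumed to be scheduled roughly every minute

def replicas_behind(lag_by_replica, max_lag=MAX_LAG_SECONDS):
    """Return the replicas whose reported lag exceeds the threshold."""
    return [name for name, lag in lag_by_replica.items() if lag > max_lag]

if __name__ == "__main__":
    # Example lag readings, in seconds, as might be polled from each standby.
    readings = {"dmv-db-standby": 12.0, "taxation-db-standby": 540.0}
    for name in replicas_behind(readings):
        print(f"ALERT: {name} is lagging its primary by more than {MAX_LAG_SECONDS} seconds")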

The fact that "too few servers were involved" to trigger failover is baffling.

(News Report, 2010) Any fault with the potential to incur the impact experienced

by this outage should initiate a failover. The IT staff should have initiated a

manual failover prior to making the SAN repair for the initial hardware failure.

This suggestion assumes that the failover would have eliminated the

dependence on the faulty SAN. In addition, if the SAN was still operating, why did

the technician perform the repair during business hours? The technician should

have created a cold backup to tape prior to doing the off-hours repair. The

technician should have been aware the backup had not occurred for four days

and understood the potential data loss that could result. (Availability Digest,

2010)

VITA’s staff may need additional training to help them identify situations

where initiating a failover is appropriate. Training may be required to identify

when to perform a manual backup as well as situations that can wait for after-hours repair. It is likely that required change management processes were not followed. The VITA webpage professes a commitment to the principles of the Information Technology Infrastructure Library (ITIL). (VITA) Following ITIL principles, the SAN repair would have been subject to a change management

process. An emergency change request should have been submitted explaining

the problem, the proposed fix, and the steps to be taken. Affected customers,

process owners, and the change management board (or equivalent) should have

been notified. Either there was no change request, no one reviewed the change request, the request was not understood, or the proposed steps were not

executed.
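As a simple sketch of what such an emergency change record could require before approval, the Python example below defines a request whose key fields must be complete; the field names and rules are assumptions chosen for illustration, not text from VITA's or ITIL's own documentation:

# Hypothetical sketch of an emergency change request. Field names and the
# approval rule are illustrative; the point is that the plan, rollback, backup
# verification, and notifications must exist before work starts.
from dataclasses import dataclass, field

@dataclass
class EmergencyChangeRequest:
    summary: str                 # the problem being fixed
    proposed_fix: str            # e.g., replace a failed SAN memory board
    steps: list = field(default_factory=list)             # ordered work plan
    rollback_plan: str = ""      # how to back out if the fix goes wrong
    pre_change_backup_verified: bool = False
    notified_parties: list = field(default_factory=list)  # customers, owners, CAB

    def blocking_issues(self):
        """Return the reasons this request is not yet approvable (empty means ready)."""
        issues = []
        if not self.steps:
            issues.append("no step-by-step plan")
        if not self.rollback_plan:
            issues.append("no rollback plan")
        if not self.pre_change_backup_verified:
            issues.append("current backup not verified")
        if not self.notified_parties:
            issues.append("affected parties not notified")
        return issues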

Monitoring tools may also have played a role in this outage. The IT staff

either ignored alerts, did not understand them, or had the monitoring tools

incorrectly configured. Monitoring alerts should have notified the staff of the

problem, identified which SAN controller was having the problem, and alerted

staff of failed write attempts to the networked storage. Additional training could

have ensured properly implemented monitoring tools and the IT staff’s ability to

understand the alerts.
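A minimal sketch of the kind of routing rule implied here follows; the alert fields, types, and actions are invented for illustration and do not correspond to any specific monitoring product:

# Hypothetical sketch: route storage alerts so that controller faults and failed
# writes page a person instead of sitting unnoticed in a log.
CRITICAL_TYPES = {"controller_fault", "write_failure"}

def route_alert(alert):
    """Return the action for an alert: 'page', 'ticket', or 'log'."""
    if alert.get("type") in CRITICAL_TYPES:
        return "page"    # immediate human attention; SAN-wide impact is possible
    if alert.get("severity", "info") in {"warning", "error"}:
        return "ticket"  # needs follow-up within business hours
    return "log"         # informational only

if __name__ == "__main__":
    samples = [
        {"type": "controller_fault", "component": "memory board A"},
        {"type": "write_failure", "target": "affected storage volume"},
        {"type": "temperature", "severity": "warning"},
    ]
    for alert in samples:
        print(route_alert(alert), alert)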

This outage also displays the weakness of consolidated centralized

services. The financial motivation to move to centralized services is strong. It is

important to balance the cost savings with the risk being taken; the savings may

not justify the risk for many governmental organizations. Strategic long-term

planning focused on the business needs rather than cost savings is a

requirement. Perhaps the DMV data is not a place to cut corners. Distributed

servers were implemented to avoid single points of failure, and while even a

distributed system is not free from failure, the possibility of widespread failure

and data loss is reduced.

Strategic planners would do well to be skeptical of vendor claims. EMC

may claim the SAN outage is unprecedented, but they do not claim this is the first

outage. It is foolish to depend heavily on a single piece of hardware for any

business-critical service. Northrop Grumman has failed to deliver a reliable,

quality service due to poor strategic planning. Northrop Grumman has enough

experience to deliver quality service. However, they have chosen to design IT

services for the state of Virginia that allow a single hardware failure to cause

outages for several agencies spanning days.

There are many areas for improvement revealed by this outage.

Ultimately, the outage was a result of human error. Human error will occur,

however, there need to be as many safeguards as possible in place to keep human error from escalating into a fiasco. Training would help reduce human error.

Training is an ongoing process that must be maintained along with the process of

constant improvement. Northrop Grumman, EMC, and VITA share the blame for

poorly automating redundancies and backups. A single technician dealing with

an "unprecedented" outage will always be likely to make a mistake in a moment

of uncertainty.

The partnership between VITA and Northrop Grumman was established as

part of a risk transference plan. Outsourcing expensive IT services to a company

that specializes in IT should result in lower cost due to bulk discounts, enhanced

services, and access to high-quality IT staff. The total costs of outsourcing IT

should go down over time due to falling hardware costs. (Lee) These expected

financial benefits are nonexistent in the Virginia-Northrop Grumman contract.

The effectiveness of this partnership should be reviewed in terms of value to the



customer, in this case the citizens of Virginia. The quality of the partnership

should be reviewed using the dimensions of fitness for use and reliability. (Lee &

Kim, 1999) The events of the last few years have shown that the service that

Northrop Grumman provides is of questionable fitness for use or reliability.

Poorly strategized and executed services have not only cost Virginians,

but have been a source of inconvenience and delay. Some Virginians had to go

to court to combat expired license tickets; those who cannot find the time to do

this may also face increases in insurance premiums. These issues seem small

in comparison to the compromised integrity of the licenses issued by the Virginia

DMV just prior to the outage. These licenses are legal and nearly untraceable

and could fetch high prices on the black market. Also, consider the safety of

those working in prisons without phone service for hours. The phone outage

described was not an isolated event.

The 10-year contract with Northrop Grumman has left little possibility of exiting the contract and requesting new outsourcing bids. Virginia recently reviewed

the partnership and it was decided that it was too costly to exit the contract.

Northrop Grumman argued that Virginia did not provide them with adequate

access to information that would have allowed them to create a realistic refresh

schedule and budget. Virginia denied this, but agreed to extend the project

timeline and paid an additional $236 million to cover the hardware refresh.

(Schapiro & Bacque, Agencies' computers still being restored, 2010) This was

done in part for political reasons. Northrop Grumman agreed to move its headquarters to Virginia. (Squires, 2010) Virginia hopes to create new jobs and

get better service. Meanwhile, Northrop Grumman will pay out approximately

$350,000 in fees due to the August 2010 outage.

The lesson to be learned from this partnership contract is that it is unwise

to commit to a lengthy contract. (Lee) Contract law as it relates to IT is in its

infancy. There are few who understand both IT and law well enough to write or

defend the contract properly. A less lengthy agreement may have been best for

embarking on the hardware refresh. Perhaps the hardware refresh should have

been negotiated as a separate contract from the services outsourcing. At the

least, an exit clause that would allow Virginia to exit the contract without risking

the waste of millions of public funds would be advisable. Public safety and

security are too important to place in the hands of a single provider without any

recourse to correct serious issues. The contract with Northrop Grumman

appears to have too much wiggle room to hold Northrop Grumman accountable

for failures.

For Virginians, the important concern is the implementation of corrective actions to see that this never happens again. A further concern is that Northrop Grumman be held accountable in a manner that motivates it to stop ignoring issues raised by those charged with oversight. Northrop Grumman has a responsibility to provide high-quality services. Northrop Grumman is responsible for its vendors, employees, configurations, and processes. It must deliver resilient IT

services and well-trained staff. Taxpayers should no longer pay for the

negligence of the outsourced contractor. This partnership was intended to

transfer risks of IT services to Northrop Grumman, but Virginia keeps paying

without realizing the expected benefits of partnership.

The best protection for Virginians may lie in contract law. Future

outsourcing contracts should not favor the vendor and exploit the state. Referring

to the outsourcing as a partnership may have been a good political move.

However, it is important to remember that the relationship in an outsourcing

situation cannot be a true partnership because business motives are not shared.

(Lee) The outsourcing contract should have clearly defined service level

agreements, and failure to meet these expectations should result in equally clear

penalties. These penalties should have enough financial impact to ensure the

vendor does not determine that paying the penalty fees makes better financial

sense than providing the contracted services. The contract between Virginia and

Northrop Grumman has exit penalties that are too expensive to be a feasible

option to exercise. (Joint Legislative Audit and Review Commission) Virginia is

effectively trapped in a bad contract with no recourse.

Future outsourcing contracts must ensure that if Virginia is not receiving

contracted services that provide value to the citizens of Virginia, the contract can be cancelled, allowing Virginia to seek satisfactory services. (Lee) These

outsourcing contract improvements can only be achieved through requirements

identification, contract negotiations, and rigorous contract review prior to contract

finalization. The contract review must be performed by an experienced IT



contract lawyer. It is very probable that Northrop Grumman standard contracts

provided at least the basis for the outsourcing contract. The use of vendor

contracts "even as a starting point" is highly inadvisable because the contract will

favor the vendor. (Lee, p. 13) This problem is illustrated in the case of the

contract between Virginia and Northrop Grumman.

After the contract is in effect, the contract must be strictly managed by the

outsourcing organization. This may require the establishment of an internal IT

auditing team charged with conducting ongoing service reviews of the vendor.

The team should be composed of experienced IT service auditors. This will

unfortunately require additional expense, but auditing activities will ensure that

the outsourcing organization will realize the expected value of the contract.

Therefore, the expenses of maintaining an auditing team should be included in

the outsourcing project costs.

4.4.3. Conclusion

Virginia's August 2010 outage provides a case study to illustrate the risks

of outsourcing. Virginia chose an experienced government contractor and made

appropriate investments. However, the state failed to negotiate a contract that

provided effective recourse to enforce the contract terms. VITA also failed to

complete a manual that would have provided additional leverage to enforce

contract terms. (Joint Legislative Audit and Review Commission) In order to

mitigate outsourcing risks, a strong, well-defined, and well-managed contract is



necessary. An experienced IT contract lawyer is recommended to negotiate and

manage the outsourcing contract. The outsourcing organization must fulfill

contractual obligations to effectively employ mechanisms to enforce vendor

contract terms. Vigilance on the part of the outsourcing organization is required

to ensure the vendor delivers quality services that meet business requirements.

This means investing in auditing to ensure that the vendor is taking appropriate

action to provide contracted services.



CHAPTER 5. ANALYSIS

5.1. Best Practice Triangulation

Each of these case studies highlighted strengths and weaknesses of

various mitigation techniques. Tulane's investment in backup tapes paid off, but

the investment in an offsite data center did not. The factor that contributed most

to Tulane’s recoverability was the aid provided by other Universities and

vendors. This type of relationship has proven very useful in sectors such as

education and utilities. (Hardenbrook, 2004) Many of these types of

organizations work cooperatively on a daily basis to pool resources.

5.1.1. Before-Planning

Effective planning must begin with the organization's business requirements and with establishing the maximum tolerable period of disruption (MTPOD), recovery

time objectives (RTO), and recovery point objectives (RPO). MTPOD relates to

how long the business can be "down" before damaging the organization's

viability. The case studies provide an array of tolerances as shown in Table 5.1

Tolerance and objectives. Using established tolerances and objectives based on

organizational characteristics would provide direction in terms of what mitigation

techniques to implement.

Table 5.1 Tolerance and objectives


Organization MTPOD RTO RPO
Commerzbank More than 1 week Less than 1 hour Last transaction
FirstEnergy Less than ½ hour Less than ¼ hour N/A
Tulane Less than 1 month Less than 1 week Previous business day
Virginia More than 1 day Less than 1 hour 1 hour

Table 5.1 above reflects estimated MTPOD, RTO, and RPO for each

organization based on artifacts included in each case study. These estimates are

open to debate; for example, Commerzbank's estimated MTPOD is listed as more

than one week. One week was chosen as the point at which the viability of

Commerzbank would be threatened. This tolerance was determined based on

looking only at the Commerzbank American division and determining at what

point customers would switch to a competitor. Any outage would be costly for

Commerzbank, but a weeklong outage would damage the bank's reputation and

cause attrition among customers. Customers tend to be tolerant of short outages,

but when the outage impacts their ability to be profitable, they must look

elsewhere. Commerzbank America has a relatively small customer base in a

highly competitive sector and would therefore have difficulty recovering from

customer loss.

FirstEnergy provides electricity, a critical infrastructure resource; any

outage will immediately inconvenience the customer base. Also, outages result in

lost revenue because electricity cannot be stored for later use. Extended outages

strain other providers and potentially result in cascading critical service outages.

There are now mandatory guidelines as well, and failure to meet these guidelines carries strict, enforceable fines. In addition, electrical outages tend to be highly publicized and investigated, damaging the company's reputation. FirstEnergy is investor-owned; therefore, outages would reduce the value of company shares.

Investors sued FirstEnergy for lost revenue in the past and could potentially do

so again. All of these factors were included in the ½ hour estimate of MTPOD for

FirstEnergy.

Tulane University had sustained hurricane seasons for more than a century without experiencing irrevocable damage to the University's viability. Review of case study artifacts revealed a repeating theme in this case: hurricanes had become routine. The general thought was to send everyone away for a few days, then return and clean up when the storm passed, and get back to business as usual. This reveals

that outages of “a few days” had no real impact on the organization. However, a

one-month or more outage impacts the university’s ability to maintain semester

operations, most notably the ability to provide its primary service, education.

Tulane IT is vital to education and research missions. Without these two

activities, university income is critically impacted. In determining this MTPOD, the

university hospital was not included, only the university itself. Including the

hospital would reduce the MTPOD to hours or less due to possible loss of life.

Loss of life will not necessarily result in irrevocable viability damage to the

organization, but must be avoided at all costs and therefore would be heavily

weighted.

Determining the MTPOD for the Commonwealth of Virginia is more

complex. Some services such as 911 services are critical infrastructure and

cannot be down without compromising public safety. Other services may suffer

very little during an extended outage. Obviously, prison guards should never be

without phone services. However, it would be very hard to argue that any of these factors really damages the viability of the state. This estimate

comes down to cost and public impact. Public impact was weighted most heavily.

Also, the state’s IT was outsourced therefore impact to the viability of Northrop

Grumman must be included. To date there has been little impact on Northrop

Grumman, but possible contractual changes made after the conclusion of the

third-party investigation may have greater impact. Recovery time objectives were

based on reducing impact to the organization’s ability to maintain functionality.

Recovery point objectives (RPO) were based on organizational tolerance to lost

data including transactional data. FirstEnergy stands out in this group with a not

applicable (N/A) rating on Table 5.1. This is based on the assumption that for

operational purposes, historical data is not critical to maintaining ongoing

services. Real-time data is critical to FirstEnergy operators. Past data is

important for prediction and future planning as well as tools development, but loss

of this data would have little operational impact as other data sources could be

utilized for the purposes mentioned.

The MTPOD, RTO, and RPO provide planning direction as mentioned

above. Commerzbank's tolerances and objectives make it apparent that they must

employ business continuity measures to ensure as close as possible to zero

downtime. Expenditures in IT to ensure this are warranted and practical for their

organization. They can afford to make the necessary investments and downtime

is far too costly. The case study artifacts reveal that Commerzbank is actively

working on business continuity and continually works to improve the IT

infrastructure. Organizations in this category would be advised to avoid delays of

tape-based restores and to maintain two hot sites in an active/active cluster

configuration. It is important to note that one hot site must be significantly

geographically distant. The location should be in another part of the country or in

another country when possible.
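To make this planning direction concrete, the short Python sketch below maps an organization's tolerances to a coarse mitigation tier. The thresholds are illustrative assumptions used to demonstrate the logic of triangulating from Table 5.1, not prescriptive values:

# Hypothetical sketch: translate MTPOD and RTO into a coarse mitigation
# recommendation. Thresholds are illustrative assumptions, not prescriptions.
from datetime import timedelta

def recommend_mitigation(mtpod, rto):
    """Map tolerance and recovery objectives to a coarse mitigation tier."""
    if rto <= timedelta(hours=1):
        # Near-zero downtime tolerance: avoid tape-restore delays entirely.
        return "two geographically distant hot sites in an active/active cluster"
    if mtpod <= timedelta(days=7):
        return "warm site with replicated data and regularly exercised failover"
    return "offsite tape backups with a documented, practiced restore procedure"

if __name__ == "__main__":
    # Rough figures in the spirit of Table 5.1 (estimates, open to debate).
    cases = {
        "Commerzbank": (timedelta(weeks=1), timedelta(hours=1)),
        "FirstEnergy": (timedelta(minutes=30), timedelta(minutes=15)),
        "Tulane": (timedelta(days=30), timedelta(days=7)),
        "Virginia": (timedelta(days=1), timedelta(hours=1)),
    }
    for organization, (mtpod, rto) in cases.items():
        print(organization, "->", recommend_mitigation(mtpod, rto))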

Tulane is a good example of an organization with all the right pieces that

failed due to poor placement. Tulane had tape-based backup and recovery, which were appropriate for its budget and MTPOD. The backup data center was new and not fully complete, but location was the problem. It was near enough to be affected by incidents that affected the University, rendering it practically useless. Tulane was lucky the building's upper floors, where the tapes were located, were not flooded, allowing retrieval of the backup tapes. This site at the time would

have been a warm site at best; strategic placement would have made this site a

major asset.

Katrina destroyed the infrastructure of New Orleans and Tulane; it is hard

to imagine how on-campus classes could have resumed. A functional emergency

operations center (EOC) and backup data center could minimally have provided

student and employee records and possibly online coursework. Contingency

planning should include a backup data center that allows virtual operations where

possible. Virtual operations are useful in a variety of situations such as primary

site destruction, pandemics, inclement weather, and transportation interruptions.

For organizations that can continue operations without the use of

information technology services, investing in IT-based mitigation may not be appropriate. An organization should be very cautious in ensuring that it truly has no IT

dependencies. Performing an exercise to walk through a mock year would help

to identify dependencies. Organizations that fall in this category are likely to be

very small with very few employees. Payroll and billing functions would be very

simple and probably paper-based. Even in these circumstances, multiple copies maintained at different locations would be advisable to prevent lost records, lost revenue, or liability issues. Organizations in this category are not

representative of the average.

5.1.1.1. Staff training in recovery procedures

Staff training levels are more apparent in some of the cases than others.

For example, FirstEnergy staff was inadequately trained and there was poor

communication between operations and IT staff. There is inherent bias toward

documentation available, providing abundant evidence of bad training versus the

lack of evidence for good training. Therefore, less documented evidence of

Commerzbank’s quality of staff training was available. However, the fact that
73

employees began assembling at the backup site, in the midst of the chaos,

transportation, and communication problems of 9/11, is a testament to

Commerzbank’s preparedness training.

Again, these are the two most extreme examples, but training is the

difference between staff that fail to perform and those that coolly navigate themselves to safety from just a few hundred feet away from the largest terrorist attack in U.S. history. The stress levels between the two staffs during the first phases are not comparable. In these cases, the well-trained staff performed very well under conditions of extreme stress with very little warning. The poorly trained staff

failed to act despite many warnings and hours to act. FirstEnergy staff

disregarded these warnings without attempting to verify the current situation.

5.1.2. During-Plan execution

5.1.2.1. Adherence to established procedures

Staff adherence to established procedures appeared to correlate strongly with the success of continuity and recovery efforts. FirstEnergy and the Commonwealth of Virginia suffered comparatively minor initial incidents; these organizations could have completely averted disaster had staff followed procedure and taken appropriate action. Figure 5.1 places each organization's adherence to procedure on a continuum relative to the others. At the extremes of the continuum are Commerzbank and FirstEnergy: Commerzbank appears to have executed its plan flawlessly despite encountering unexpected technical difficulties, while FirstEnergy failed to adhere to many industry standards and procedures.

Figure 5.1 Adherence to established procedures

The independent review of the 2003 Northeast Blackout found FirstEnergy primarily responsible for the blackout. The operators failed to respond appropriately to calls from partner operators alerting them to detected problems. Forty minutes before the outage, the operators knew the monitoring equipment was not working and still failed to take corrective action. Established internal procedures were inadequate to maintain reliable operations. FirstEnergy IT staff was aware of the problems with the EMS but did not alert the operators to the issue. This communication was not required at the time of the studied incident, though the gap was later addressed. However, the primary cause of the outage was failure to follow procedure. As a result, some areas were without power for up to a week, and FirstEnergy's board was sued for financial losses caused by negligence.

Based on the Commonwealth of Virginia case study artifacts, it is apparent that ITIL standard practices were not followed. The artifacts indicate that the organization used ITIL; therefore, ITIL adherence was used as the basis for its placement on the continuum in Figure 5.1. ITIL specifies standards for communication during incidents and also addresses continuity of operations. The Commonwealth of Virginia outage was still under review at the completion of this study; once the independent review is complete, adherence to established procedure may be determined more accurately. It is not in dispute, however, that a minor hardware problem that was not itself an outage was acted upon inappropriately, resulting in a weeklong outage for some agencies and millions of dollars in reported losses.

Tulane University had well-established pre-incident disaster procedures and a staff that was trained in and comfortable with those procedures. It also had the luxury of knowing days ahead that the hurricane was coming, and execution went according to plan for the most part. Nevertheless, critical parts of the plan were left unexecuted: the payroll printer and related materials were not taken to safety. This failure further complicated the task of issuing payroll and likely added cost during the recovery execution process.

Commerzbank's adherence to procedure saved the company millions, if not billions, in lost revenue. The transaction system never went down during the events of 9/11; despite the loss of primary facilities and unforeseen technical issues, the bank was fully operational within hours. Commerzbank serves as a model for the financial sector in business continuity and disaster recovery planning. Other peer institutions never recovered from 9/11.


76

5.1.2.2. Chain of command structure

All of the organizations included in the study had well-established chain-of-command communication structures, though some proved more effective than others. Both Tulane and Commerzbank experienced communication disruptions due to the magnitude of the disasters and the resulting damage to infrastructure. Commerzbank had designated call trees and an alternate location to maintain the chain of command despite communication and transportation difficulties. Katrina was so severe that the damage to the infrastructure of New Orleans was prolonged and the duration of the disaster itself was longer. Both organizations struggled with the limitations of communication providers and overloaded cell towers.

Tulane's critical staff members now carry cell phones from more than one provider and maintain local and non-local numbers to avoid future communication disruptions. Tulane has also developed a computer security incident response plan that follows many principles from the National Incident Management System (NIMS) (Tulane University, 2009). This plan defines roles, incident phases, and incident levels, which delineate which roles are activated (Tulane University, 2009).

This plan could be adapted using NIMS to provide an incident command structure for managing cyberinfrastructure incidents, as in Figure 5.2 below. The contacts listed are cumulative; for example, if a level 3 incident were to occur, the Chief Information Officer (CIO), infrastructure director, required infrastructure staff, and process owners would all be contacted. Each activated role would have a responsibilities checklist for the specific incident level. NASCIO has a toolkit that could be used as a template; information on where to find the NASCIO toolkit is available in Appendix A.

Figure 5.2 Sample IT incident command structure
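A minimal sketch of the cumulative-contact rule is shown below in Python. The mapping of roles to incident levels is an assumption used for illustration; an organization adapting this approach would substitute its own roles and attach its own responsibilities checklists.

# Cumulative escalation: a level-N incident notifies every role at or below level N.
# The role-to-level mapping here is illustrative only, not the Tulane plan's actual mapping.
ESCALATION = {
    1: ["on-call operations staff"],
    2: ["process owners"],
    3: ["required infrastructure staff", "infrastructure director", "chief information officer"],
}

def contacts_for(level):
    # Return the cumulative contact list for an incident of the given level.
    roles = []
    for lvl in sorted(ESCALATION):
        if lvl <= level:
            roles.extend(ESCALATION[lvl])
    return roles

print(contacts_for(3))
# ['on-call operations staff', 'process owners', 'required infrastructure staff',
#  'infrastructure director', 'chief information officer']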

Neither FirstEnergy nor the Commonwealth of Virginia experienced disruptions in its chain of command. FirstEnergy staff disregarded communications with the Midwest coordinating operator and failed to communicate as the voluntary industry standards of the time dictated. There was no apparent deviation from the chain of command in the case of the Virginia outage, though it is reasonable to expect that the independent review will identify failures to follow some communication protocols.


78

5.1.2.3. Mutual aid relationships

The role of previously established relationships with vendors and partners was apparent in all of the case studies; each continuity or recovery effort was assisted through external relationships. The use of these relationships to provide additional resources was integral to recovery success and reduced the duration of the outage in most of the cases. The assistance Baylor provided to Tulane was vital to Tulane's future viability. The relationships utilized are represented in Table 5.2 below.

Table 5.2 Aid relationships utilized during recovery


Organization    Aid Provider
Commerzbank     EMC
FirstEnergy     GE, MISO and affiliated ISOs
Tulane          Baylor and Blackboard
Virginia        EMC and affected agencies

5.1.3. After-Plan improvement

All of the studied organizations employed some form of post-incident evaluation to improve future response and resiliency. FirstEnergy and Virginia underwent mandatory third-party incident reviews to determine what steps were necessary to prevent future incidents. Commerzbank and Tulane were unhappy with the response and recovery provisions in place at the time of their incidents and have made changes to increase resiliency.


79

5.1.3.1. Recovery time and cost

5.1.3.1.1. Downtime

There is no single way to determine the cost of downtime for every organization, nor is there a simple way to determine the cost of recovery; these figures vary by sector and other organizational factors. Organizations that have experienced disaster recovery events, including those in this study, have not made the financial ramifications available to the public. Furthermore, most of the literature and tools available to aid in determining these costs and the return on investment (ROI) are provided by commercial entities attempting to sell disaster recovery or business continuity solutions and are therefore of questionable validity.

For the purposes of this study, a combination of recent studies is used for illustration. A study commissioned by CA Technologies in 2010 claims that "the average North American organization loses over $150,000 a year through IT downtime" (CA Technologies, 2010). A 2011 Symantec survey reports a median downtime cost of $3,000 per day for small businesses and $23,000 per day for medium-sized businesses. Based on these figures, it would not be financially feasible for these types of organizations to invest in high-availability systems. However, the losses are still substantial, and investment in daily data backups maintained offsite would be both advisable and affordable. Investing in high availability for critical infrastructure information systems is more likely to be a good investment, as illustrated in Figure 5.3 below: increased downtime translates into increased costs. Some sectors, such as utilities, finance, and parts of the public sector, have regulatory standards that must be met, and downtime could result in fines as well as lost revenue.

Figure 5.3 Reported average downtime revenue losses in billions (CA Technologies, 2010, p. 5)
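To make the feasibility argument concrete, the following minimal Python sketch applies the survey's median daily cost figures to an assumed amount of downtime per year; the five-day figure and the closing comparison are illustrative assumptions, not findings from the cited studies.

# Back-of-the-envelope downtime loss check using the survey figures cited above.
MEDIAN_DAILY_COST = {"small": 3_000, "medium": 23_000}  # Symantec 2011 survey medians

def annual_loss(size, expected_down_days_per_year):
    return MEDIAN_DAILY_COST[size] * expected_down_days_per_year

for size in ("small", "medium"):
    loss = annual_loss(size, expected_down_days_per_year=5)  # assumed 5 days/year
    print(f"{size}: expected annual downtime loss of about ${loss:,}")
# A high-availability build-out costing hundreds of thousands of dollars a year would
# exceed these losses, while offsite daily backups would not.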

5.1.3.1.2. Resiliency Investment

As with most IT projects, determining time and cost benefits is difficult because the goal is a moving target. According to a 2010 study conducted by Forrester Research, "more and more applications are considered critical," and as a result recovery times have increased by 1.5 hours (Dines, 2011). The average application and data classifications reported are shown in Figure 5.4 below. As organizations become more dependent upon information systems and classify more applications and data as critical, the cost of resiliency rises; as tolerance for downtime decreases, the cost of resiliency rises further. Economic realities dictate that most organizations cannot maintain redundancy for all applications and data.

Figure 5.4 Reported critical applications and data classifications

There is currently no accepted standard for how much to invest in resiliency. One rule of thumb for disaster recovery investment is to earmark one week's worth of yearly revenue for mitigation (Outsource IT Needs LLC). There are many other ways to compute how much to invest in IT business continuity; most are far more complex. A 2010 Forrester study found that respondents reported committing six percent of the IT operating budget to resiliency investments (Balaouras, 2010). When creating a resiliency budget, it is important to note that many functions fall under the umbrella of IT operational resiliency, such as "security management, business continuity management, and IT operations management" (Caralli, Allen, Curtis, White, & Young, 2010).
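As a rough illustration of these heuristics, the short Python sketch below applies the one-week-of-revenue rule of thumb and the six percent Forrester figure to hypothetical revenue and IT budget numbers; both input values are assumptions used only for illustration.

# Two baseline budgeting heuristics mentioned above, applied to hypothetical figures.
annual_revenue = 50_000_000       # assumed organizational revenue
it_operating_budget = 2_000_000   # assumed annual IT operating budget

one_week_of_revenue = annual_revenue / 52              # "one week of yearly revenue" rule of thumb
six_percent_of_it_budget = it_operating_budget * 0.06  # Forrester-reported average commitment

print(f"One-week-of-revenue rule: ${one_week_of_revenue:,.0f}")
print(f"Six percent of IT budget: ${six_percent_of_it_budget:,.0f}")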

Another factor related to resiliency investment is the probability of a disaster. The fields of insurance and economics have complex equations for determining risk in order to ensure profitability; those equations are outside the scope of this qualitative study. Instead, this study uses a 2010 Forrester market study as anecdotal evidence to provide a simplified method for calculating a spending baseline. Forrester reports that "24 percent of respondents have declared a disaster and failed over to an alternate site in the past five years," which yields a 4.8 percent probability of experiencing a disaster requiring failover to a remote site in a given year (Dines, 2011). The average cost of downtime per hour was $145,000, and the average recovery time was reported to be 18.5 hours (Dines, 2011).

Multiplying the average cost per hour by the average recovery time yields an average recovery cost of $2,682,500. Spreading the cost of a major disaster over a five-year period yields an annualized disaster cost of $536,500, and multiplying this by the 4.8 percent annual risk probability yields $25,752. These figures provide a range for disaster recovery investment: a minimum of $25,752 and a maximum of $536,500 per year. The average of the two is $281,126; this figure, based on the Forrester study, represents a practical annual investment in disaster recovery.
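The arithmetic above can be reproduced directly. The following Python sketch uses only the Forrester figures quoted in this section and mirrors the calculation step by step.

# Reproduces the spending-baseline arithmetic above using the cited Forrester figures.
cost_per_hour = 145_000        # average downtime cost per hour (Dines, 2011)
recovery_hours = 18.5          # average reported recovery time (Dines, 2011)
declared_in_5_years = 0.24     # share of respondents declaring a disaster in five years

recovery_cost = cost_per_hour * recovery_hours        # $2,682,500 average recovery cost
annualized_cost = recovery_cost / 5                   # $536,500 (maximum of the range)
annual_probability = declared_in_5_years / 5          # 4.8 percent per year
expected_loss = annualized_cost * annual_probability  # $25,752 (minimum of the range)
baseline = (annualized_cost + expected_loss) / 2      # $281,126 practical annual investment

print(f"Range: ${expected_loss:,.0f} - ${annualized_cost:,.0f}; baseline ${baseline:,.0f}")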

An average annual budget of $281,126 is, relatively speaking, not a large investment; careful long-term planning along with integrated, iterative implementation will allow this modest yearly investment to yield substantial results over time. A five-year resiliency implementation plan would allow long-term planning to be realized through a series of short-term goals, with an overall five-year budget of $1,405,630. The first year would likely be dedicated to reviewing organizational needs and looking for cost-effective ways to implement the resiliency plan. The following years could focus on modular implementation and on integrating resiliency into new projects.

5.1.3.2. Findings

One best practice identified through this case study is to integrate redundant systems into daily processing functions. Commerzbank is a good example of this configuration, which it instituted after 9/11 as a result of evaluating the weaknesses of its recovery efforts. One advantage is that less human intervention is needed, so recovery requires less manpower and response and recovery can begin immediately; in life-threatening situations, staff can focus on evacuation. Possible liability issues related to both staff and external stakeholders can also be reduced by removing any question of due diligence. Another advantage is that testing is far less disruptive because the recovery systems are already processing part of the load.
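The sketch below is a purely conceptual Python illustration of the active/active idea, not a depiction of any real clustering product or of Commerzbank's environment: both sites receive production work, health is probed continuously, and "failing over" simply means no longer routing work to an unhealthy site. The site names and the simulated health check are assumptions.

import random

SITES = ["site-a", "site-b"]

def healthy(site):
    # Stand-in for a real health probe (heartbeat, cluster quorum, replication lag, etc.).
    return site != "site-b" or random.random() > 0.1  # pretend site-b occasionally fails

def route(transaction):
    candidates = [s for s in SITES if healthy(s)]
    if not candidates:
        raise RuntimeError("no healthy site available")
    site = random.choice(candidates)  # both sites share load in normal operation
    return f"{transaction} -> {site}"

for txn in ("payment-1", "payment-2", "payment-3"):
    print(route(txn))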

Integrating redundant systems into daily processing does not mean that all systems must be redundant; careful planning and classification of applications and data can reduce costs. For example, in the case of FirstEnergy, access to historical data is not business critical, so investment in recovering that data can be reduced and lower-cost, tape-based storage and recovery methods are adequate. However, the availability of real-time operations applications and data is critical to FirstEnergy's mission, and investments to support these critical functions are money well spent. A rough order-of-magnitude estimate would place such a system in the hundreds of thousands of dollars, potentially reaching into the millions.
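One minimal way to express such a classification is a lookup from tolerable downtime to recovery strategy, as in the Python sketch below. The tier thresholds and strategy names are illustrative assumptions, not classifications drawn from the FirstEnergy case artifacts.

# Illustrative mapping of data/application classes to recovery strategies.
RECOVERY_TIERS = [
    # (maximum tolerable downtime in hours, strategy)
    (0.5, "active/active replication across hot sites"),
    (24,  "warm standby with replicated storage"),
    (168, "offsite tape or disk backup, restore on demand"),
]

def strategy_for(max_downtime_hours):
    for threshold, strategy in RECOVERY_TIERS:
        if max_downtime_hours <= threshold:
            return strategy
    return RECOVERY_TIERS[-1][1]

print(strategy_for(0.1))  # real-time operations data -> active/active replication
print(strategy_for(72))   # historical archive data -> tape or disk restore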

Gartner reported the cost of a tier IV data center to be about $3,450 per square foot, or $34.5 million for a 10,000-square-foot data center (Cappuccio, 2010). According to Gartner, a tier IV data center would experience less than half an hour of downtime a year (Cappuccio, 2010). For FirstEnergy, the risk and outage costs are high enough to justify such an investment: the losses from the 2003 blackout were widespread and totaled in the billions. FirstEnergy's investments should therefore be scaled to support business continuity and avoid outages.
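The downtime figure can be sanity-checked by converting availability targets into allowed downtime per year, as in the Python sketch below. The tier-to-availability mapping uses the commonly quoted Uptime Institute percentages and is included here as an assumption for illustration.

# Converts an availability target into allowed downtime per year.
MINUTES_PER_YEAR = 365 * 24 * 60

def annual_downtime_minutes(availability):
    return (1 - availability) * MINUTES_PER_YEAR

for tier, availability in (("Tier II", 0.99741), ("Tier III", 0.99982), ("Tier IV", 0.99995)):
    print(f"{tier}: ~{annual_downtime_minutes(availability):.0f} minutes of downtime per year")
# Tier IV at 99.995 percent works out to roughly 26 minutes a year, consistent with the
# "less than half an hour" figure cited above.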

The August 2010 Virginia outage is an example of an organization that made the "right" technical decisions but failed at the organizational and implementation level. A one-time implementation project will not ensure cyberinfrastructure resiliency; resiliency is an ongoing, continuous-improvement lifecycle process. The Virginia case also illustrates the hazards of transferring risk through outsourcing: outsourced IT must be carefully managed and monitored, and the outsourcing party must retain the power to enforce meaningful penalties for contractual failures.

Figure 5.5 displays the high-level conceptual relationships among the components of a resilient system.

Figure 5.5 Components of a resilient system

Each triangle in the figure can be further broken down to reflect component relationships; for example, training would fall within failover testing and plan updates. Disaster recovery and business continuity plans would include business impact analysis, categorization of data and applications, and recovery sequence. These relationships hold whether IT is maintained internally or externally, and organizations must be vigilant to ensure that each component is rigorously maintained.


86

CHAPTER 6. CONCLUSION

The research question this study endeavored to answer is "What are best practices before, during, and after disaster recovery execution?" The multiple case study best practice analysis indicates that, in successful organizations, disaster recovery is one part of an iterative business continuity process. This process breaks down into three distinct phases: before, during, and after disaster recovery execution. Strategic planning occurs during the before phase; it includes determining the MTPOD, RTO, and RPO to help set the appropriate level of investment. Training and rigorous testing occur in the before phase as well. Best practice during the disaster recovery execution phase includes effective management, organization, and execution of business continuity and disaster recovery plans; adherence to policy and the chain of command and the utilization of aid relationships are important elements of this phase. In the after, or post-recovery, phase, best practice involves reviewing the situation and the response to identify areas that need improvement. The after phase not only helps plan future mitigation but also identifies supporting government policy needs in critical infrastructure sectors. The iterative cycle then begins again in the before phase as improvements are implemented.


87

The purpose of this research was "to bridge the gap of unmet" cyberinfrastructure resiliency needs. An assumption was made that the high cost of implementation was the most significant barrier. While this may be true, surprisingly, the two most avoidable disasters were not caused by any direct lack of funds. Virginia had already allocated the funds; the company it outsourced to failed to meet contracted requirements. FirstEnergy is a Fortune 500 company and was one prior to the 2003 Northeast Blackout, so it is unlikely that a lack of available funds was a contributing factor. In these two cases it is arguable that a lack of management oversight and urgency, rather than funding, was the underlying problem. Both organizations understood and implemented backup equipment but failed to ensure that all mitigation measures were followed. Commerzbank and Tulane are veterans in dealing with adversity; the weakness revealed in each was a failure to fully understand the complexities of the recovery process in a truly catastrophic loss.

The real barrier appears to be an inability to admit that large-scale disasters do and will happen. Unfortunately, this cognitive avoidance is in our nature as humans. Some large-scale disasters are caused by catastrophes and others by human error; both are illustrated in this study. Required preparations must be strictly enforced or mandated by the government to induce compliance: car insurance, home insurance, and retirement savings are all forgone by most people unless they are mandatory. The extensive complexity of data centers and information systems is difficult to grasp. Add to this the tendency to disregard possible calamity, and the end result is a crumbling, tenuous cyberinfrastructure. Regulation may be far off due to the specialized workforce required to audit information systems.


89

CHAPTER 7. FUTURE RESEARCH

This study has revealed areas in need of further research. Two are well-known issues with ongoing research: educating a workforce capable of managing critical information systems, and increasing resiliency by eliminating dependence on third parties for power. Hydrogen fuel cells and solar power continue to advance and may provide the power required to create grid-independent data centers; sustainability and reductions in power consumption are also required to build such facilities.

Another area that requires further research is providing understandable business continuity investment guidance based on the probability of a major disaster and the cost of such events. Tables based on industry, size, and location would be particularly useful in determining appropriate spending. Current spending appears to be based on confidence, fear, or sales pitches; a rational, fact-based method would allow this information to be presented to a CFO in a meaningful manner.
a meaningful manner.

Lastly, the area of cyberinfrastructure policy needs to be investigated. The current, mostly unregulated IT climate values fast over safe: companies must move quickly to push out the next new product, and little time is spent on ensuring security and resiliency. This will likely continue until minimum regulations are in place. Such policies, and the ability to enforce them, would be very helpful to organizations that want to be secure and resilient but are struggling with vendors. Related to this is IT contract law, a field that desperately needs to be brought to maturity to protect organizational investments.


BIBLIOGRAPHY

Alumni Affairs, Tulane University. (2008 29-9). Tulane University: Renaissance.


Retrieved 10 27-12 from issue:
http://issue.com/thebooksmithgroup/docs/tulane

Anthes, G. (2008 31-3). Disaster survivor: Tulane's people priority. Retrieved 10


17-12 from Computerworld:
http://www.computerworld.com/s/article/print/314109/Tulane_University

Associated Press. (06 20-1). FirstEnergy to pay $28M fine, saying workers hid
damage. Retrieved 11 5-1 from USA Today:
http://www.usatoday.com/news/nation/2006-01-20-nuke-plant-fine_x.htm

Associated Press. (03 19-11). Investigators pin origin of Aug 2003 blackout on
FirstEnergy failures . Retrieved 11 6-1 from Windcor Power Systems.

Availability Digest. (2009 7). Commerzbank Survives 9/11 with OpenVMS


Clusters. Retrieved 11 3-1 from Availability Digest:
http://www.availabilitydigest.com/public_articles/0407/commerzbank.pdf

Availability Digest. (2010 10). The State of Virginia – Down for Days. Retrieved
2010 8-11 from www.availabilitydigest.com:
http://www.availabilitydigest.com/public_articles/0510/virginia.pdf

Balaouras, S. (2010 2-9). Business Continuity And Disaster Recovery Are Top IT
Priorities For 2010 And 2011 Six Percent Of IT Operating And Capital
Budgets Goes To BC/DR. Retrieved 2011 7-2 from Forrester.com:
http://www.forrester.com/rb/Research/business_continuity_and_disaster_r
ecovery_are_top/q/id/57818/t/2

Balaouras, S. (2008 Winter). The State of DR Preparedness. Retrieved 6 29-6


from Disaster Recovery Journal:
http://www.drj.com/index.php?Itemid=10&id=794&option=com_content&ta
sk=view

Barovik, H., Bland, E., Nugent, B., Van Dyk, D., & Winters, R. (2001 26-11). For
The Record Nov. 26, 2001. Retrieved 11 13-1 from Time:
http://www.time.com/time/magazine/article/0,9171,1001334,00.html
92

Barron, J. (2003 15-8). Power Surge Blacks Out Northeast. Retrieved 2009 2-11
from New York Times:
http://www.nytimes.com/2003/08/15/nyregion/15POWE.html

Blackboard Inc. (2008 24-10). Blackboard & Tulane University. Retrieved 10 27-
12 from Blackboard:
http://www.blackboard.com/CMSPages/GetFile.aspx?guid=39a0b112-
221d-4d04-be80-f2024d16943a

Brown, K. (2008 1-2). House No. 3 Rises for URBANbuild. Retrieved 2011 2-1
from Tulane University New Wave:
http://tulane.edu/news/newwave/020108_urbanbuild.cfm

CA Technologies. (2010 11). The Avoidable Cost of Downtime. Retrieved 2011


28-1 from Arcserve:
http://arcserve.com/us/~/media/Files/SupportingPieces/ARCserve/avoidab
le-cost-of-downtime-summary.pdf

Cappuccio, D. (2010 17-3). Extend the Life of Your Data Center, While Lowering
Costs. Retrieved 2011 28-1 from Gartner:
http://www.gartner.com/it/content/1304100/1304113/march_18_extend_lif
e_of_data_center_dcappuccio.pdf

Caralli, R., Allen, J., Curtis, P., White, D., & Young, L. (2010 5). CERT®
Resilience Management Model, Version 1.0 Process Areas, Generic
Goals and Practices, and Glossary. Hanscom AFB, MA.

Charette, R. (2010 31-8). Virginia's Continuing IT Outage Creates Political


Fireworks. Retrieved 2010 6-11 from IEEE Spectrum:
http://spectrum.ieee.org/riskfactor/computing/it/virginias-continuing-it-
outage-creates-political-fireworks

Clinton Administration. (1998 22-5). The Clinton Administration's Policy on


Critical Infrastructure Protection: Presidential Decision Directive 63.
Retrieved 2010 2-5 from Computer Security Resource Center National
Institute of Standards and Technology Federal Requirements:
http://csrc.nist.gov/drivers/documents/paper598.pdf

Collett, S. (2007 4-12). Five Steps to Evaluating Business Continuity Services.


Retrieved 2009 9-11 from CSOonline.com:
http://www.csoonline.com/article/221306/Five_Steps_to_Evaluating_Busin
ess_Continuity_Services
93

Comptroller of the city of New York. (02 04-9). One Year Later, The Fiscal Impact
of 9/11 on New York City. Retrieved 11 13-1 from The New York City
Comptroller's Office:
http://www.comptroller.nyc.gov/bureaus/bud/reports/impact-9-11-year-
later.pdf

Cowen. (n.d.). Letter to students. Retrieved 2010 27-12 from Tulane University:
http://renewal.tulane.edu/students_undergraduate_cowen2.shtml

Cowen, S. (05 8-12). Messages for Students . Retrieved 10 27-12 from


Tulane.edu: http://www.tulane.edu/students.html

Cowen, S. (05 2-9). Messages for Students . Retrieved 2010 27-12 from Tulane
University : http://www.tulane.edu/studentmessages/september.html

Cowen, S. (05 3-9). Student Messages. Retrieved 10 27-12 from Tulane


University: http://www.tulane.edu/studentmessages/september.html

Cowen, S. (05 8-9). Student Messages. Retrieved 10 27-12 from Tulane


University: http://www.tulane.edu/studentmessages/september.html

Cowen, S. (2005 14-9). Student Messages. Retrieved 10 27-12 from Tulane


University: http://www.tulane.edu/studentmessages/september.html

DeCay, J. (2007 3-5). Advising Students After An Extreme Crisis: Assisting


Katrina Survivors. Retrieved 10 17-12 from Dallas County Community
College District:
http://www.dcccd.edu/sitecollectiondocuments/dcccd/docs/departments/do
/eduaff/transfer/conference/conference_cvc.pdf

Denial-of-service attack. (n.d.). Retrieved 2009 2-11 from Wikipedia:


http://en.wikipedia.org/wiki/Denial-of-service_attack

Dines, R. (2011). Market Study The State of Disaster Recovery Preparedness.


(R. Arnold, Ed.) Disaster Recovery Journal, 24(1), 12-22.

Editorial Staff of SearchStorage.com. (2002 6-3). Bank avoids data disaster on


Sept. 11. Retrieved 11 3-1 from SearchStorage.com:
http://searchstorage.techtarget.com/tip/0,289483,sid5_gci808783,00.html

Egenera. (2006). Case Study: Commerzbank North America. Retrieved 2011 3-1
from Egenera: www.egenera.com/1157984790/Link.htm
94

Electricity Consumers Resource Council (ELCON) . (2004 09-02). The Economic


Impacts of the August 2003 Blackout . Retrieved 2009 02-11 from
ELCON:
http://www.elcon.org/Documents/EconomicImpactsOfAugust2003Blackout
.pdf

EMAC. (n.d.). The History of Mutual Aid and EMAC. Retrieved 2011 20-2 from
EMAC: http://www.emacweb.org/?321

FEMA. (n.d.). Incident Command System (ICS). Retrieved 2011 20-2 from
FEMA:
http://www.fema.gov/emergency/nims/IncidentCommandSystem.shtm

FEMA. (2006 30-11). Private Sector NIMS Implementation Activities. From


http://www.fema.gov/pdf/emergency/nims/ps_fs.pdf

FirstEnergy. (08 27-2). Company history. From FirstEnergy:


http://www.firstenergycorp.com/corporate/Corporate_Profile/Company_His
tory.html

FirstEnergy. (09 27-2). Corporate profile. Retrieved 11 5-1 from FirstEnergy:


http://www.firstenergycorp.com/corporate/Corporate_Profile/index.html

Forrester, E. C., Buteau, B. L., & Shrum, S. (2009). Service Continuity: A Project
Management Process Area at Maturity Level 3. In E. C. Forrester, B. L.
Buteau, & S. Shrum, CMMI® for Services: Guidelines for Superior Service
(pp. 507-523). Boston, MA: Addison-Wesley Professional.

Fortune. (10 3-5). Fortune 500. Retrieved 11 5-1 from CNNMoney.com:


http://www.firstenergycorp.com/corporate/Corporate_Profile/Company_His
tory.html

From Reuters and Bloomberg News. (03 19-8). FirstEnergy Shares Fall After
Blackout. Retrieved 11 6-1 from Los Angeles Times:
http://articles.latimes.com/2003/aug/19/business/fi-wrap19.1

Gerace, T., Jean, R., & Krob, A. (2007). Decentralized and centralized it support
at Tulane University: a case study from a hybrid model. In Proceedings of
the 35th annual ACM SIGUCCS fall conference (SIGUCCS '07). New
York: ACM.

Grose, T., Lord, M., & Shallcross, L. (2005 11). Down, but not out. Retrieved
2010 28-12 from ASEE PRISM: http://www.prism-
magazine.org/nov05/feature_katrina.cfm
95

Gulf Coast Presidents. (2005). Gulf Coast Presidents Express Thanks, Urge
Continued Assistance . Retrieved 10 27-12 from Tulane University:
http://www.tulane.edu/ace.htm

Hardenbrook, B. (2004 8-9). Infrastructure Interdependencies Tabletop Exercise


BLUE CASCADES II. Seattle, WA.

Hewlett-Packard. (2002 7). hp AlphaServer technology helps Commerzbank


tolerate disaster on September 11. Retrieved 11 3-1 from hp.com:
http://h71000.www7.hp.com/openvms/brochures/commerzbank/commerzb
ank.pdf?jumpid=reg_R1002_USEN

Homeland Security. (2009 8). Information Technology Sector Baseline Risk


Assesment. Retrieved 2010 17-5 from Homeland Security:
http://www.dhs.gov/xlibrary/assets/nipp_it_baseline_risk_assessment.pdf

Homeland Security. (2009 August). Information Technology Sector Baseline Risk


Assessment. Retrieved 2010 17-5 from Homeland Security:
http://www.dhs.gov/xlibrary/assets/nipp_it_baseline_risk_assessment.pdf

Internet Security Alliance (ISA)/American National Standards Institute (ANSI).


(2010). The Financial Management of Cyber Risk An Implementation
Framework for CFOs. USA: Internet Security Alliance (ISA)/American
National Standards Institute (ANSI).

Jackson, C. (2011 2). California’s Mutual Aid System Provides Invaluable


Support During San Bruno Disaster. Retrieved 2011 20-2 from Western
City: http://www.westerncity.com/Western-City/February-2011/California-
rsquos-Mutual-Aid-System-Provides-Invaluable-Support-During-San-
Bruno-Disaster/

Jesdanun, A. (04 12-2). Software Bug Blamed For Blackout Alarm Failure.
Retrieved 11 6-1 from CRN:
http://www.crn.com/news/security/18840497/software-bug-blamed-for-
blackout-alarm-failure.htm?itc=refresh

Joint Legislative Audit and Review Commission. (2009 2009-13). Review of


Information Technology Services in Virginia. Retrieved 2010 05-11 from
http://jlarc.state.va.us/: jlarc.state.va.us/meetings/October09/VITA.pdf

Kantor, A. (2005 8-9). Technology succeeds, system fails in New Orleans.


Retrieved 11 2-1 from USA Today:
http://www.usatoday.com/tech/columnist/andrewkantor/2005-09-08-
katrina-tech_x.htm
96

Krane, N. K., Kahn, M. J., Markert, R. J., Whelton, P. K., Traber, P. G., & Taylor,
I. L. (2007 8). Surviving Hurricane Katrina: Reconstructing the Educational
Enterprise of Tulane University School of Medicine. Retrieved 10 17-12
from Academic Medicine:
http://journals.lww.com/academicmedicine/Fulltext/2007/08000/Surviving_
Hurricane_Katrina__Reconstructing_the.4.aspx

Kravitz, D. (2010 28-8). Statewide computer meltdown in Virginia disrupts DMV,


other government business. Retrieved 2010 6-11 from The Washington
Post : http://www.washingtonpost.com/wp-
dyn/content/article/2010/08/27/AR2010082705046.html

Kravitz, D., & Kumar, A. (2010 31-8). Virginia DMV licensing services will be
stalled until at least Wednesday. Retrieved 2010 6-11 from
Washingtonpost.com: http://www.washingtonpost.com/wp-
dyn/content/article/2010/08/30/AR2010083004877.html

Kumar, A., & Helderman, R. (2009 14-10). Outsourced $2 Billion Computer


Upgrade Disrupts Va. Services. Retrieved 2010 6-11 from
Washingtonpost.com : http://www.washingtonpost.com/wp-
dyn/content/article/2009/10/13/AR2009101303044.html

Lawson, J. (05 9-12). A Look Back at a Disaster Plan: What Went Wrong and
Right. Retrieved 10 28-12 from The Chronicle of Higher Education:
http://chronicle.com/article/A-Look-Back-at-a-Disaster/10664

Lawson, J. (2005 9-12). Katrina and Tulane: a Timeline. Retrieved 12 2010-17


from The Chronicle of Higher Education :
http://chronicle.com/article/KatrinaTulane-a-Timeline/21840

Lee. (1996). IT outsourcing contracts: practical issues for management. Industrial


Management & Data Systems , 96 (1), 15 - 20.

Lee, J., & Kim, Y. (1999). Effect of Partnership Quality on IS Outsourcing


Success: Conceptual Framework and Empirical Validation. Journal of
Management Information Systems , 15 (4), 29-61.

Lewis, B. (n.d.). Massive Computer Outage Halts Some Va. Agencies. Retrieved
2010 5-11 from HamptonRoads.com:
http://hamptonroads.com/print/566771

Lord, M. (2008 11). WHEN DISASTER STRIKES Recovering from Katrina’s


damage, two New Orleans engineering schools make emergency
preparation a priority. Retrieved 10 28-12 from ASEE PRISM:
http://www.prism-magazine.org/nov08/feature_03.cfm#top
97

Massachusetts Institute of Technology Information Security Office . (1995). MIT


BUSINESS CONTINUITY PLAN. Retrieved 2010 17-5 from Information
Services and Technology : http://web.mit.edu/security/www/pubplan.htm

McIntyre, D. A. (2009 2-9). Gmail's outage raises new concern about the Net's
vulnerability. Retrieved 2009 25-11 from Newsweek:
http://www.newsweek.com/id/214760

McLennan, K. (2006). Selected Distance Education Disaster Planning Lessons


Learned From Hurricane Katrina . Retrieved 10 28-12 from Online Journal
of Distance Learning Administration:
http://www.westga.edu/~distance/ojdla/winter94/mclennan94.htm

Mears, J., Connor, D., & Martin, M. (02 2-9). What has changed. Retrieved 11 4-
1 from Network World.

Merschoff, E. (05 21-4). EA-05-071 - Davis-Besse (FirstEnergy Nuclear


Operating Company). Retrieved 11 5-1 from USNRC:
http://www.nrc.gov/reading-rm/doc-
collections/enforcement/actions/reactors/ea05071.html

Michigan State University Disaster Recovery Planning . (n.d.). Planning Guide.


Retrieved 2010 17-5 from Michigan State University Disaster Recovery
Planning : http://www.drp.msu.edu/Documentation/StepbyStepGuide.htm

Midwest ISO. (n.d.). About Us. Retrieved 2011 28-3 from Midwest ISO:
http://www.midwestmarket.org/page/About%20Us

Minkel, J. (08 13-8). The 2003 Northeast Blackout--Five Years Later. Retrieved
11 6-1 from Scientific American:
http://www.scientificamerican.com/article.cfm?id=2003-blackout-five-
years-later

NASA. (2008 3). Powerless. Retrieved 2011 6-1 from Process Based Mission
Assurance NASA Safety Center:
http://pbma.nasa.gov/docs/public/pbma/images/msm/PowerShutdown_sfc
s.pdf

New York Independent System Operator. (2005 2). ISO. Retrieved 2010 17-3
from
http://www.nyiso.com/public/webdocs/newsroom/press_releases/2005/bla
ckout_rpt_final.pdf

News Report. (2010 1-9). Northrop Grumman Vows to Find Cause of Virginia
Server Meltdown as Fix Nears. Retrieved 2010 6-11 from Government
Technology: http://www.govtech.com/policy-management/102482209.html
98

News Report. (2010 30-8). Work Continues on 'Unprecedented' Computer


Outage in Virginia . Retrieved 2010 6-11 from Government Technology:
http://www.govtech.com/security/102485974.html

Nixon, S. (2010 13-11). VITA Briefing. Retrieved 2010 7-11 from


www.vita.virginia.gov:
http://www.vita.virginia.gov/uploadedFiles/091310_JLARC_Final.pdf

Outsource IT Needs LLC. (n.d.). How Much Should You Spend on Disaster
Recovery? Calculating the Value of Business Continuity. Retrieved 2011
7-2 from Outsource IT Needs, LLC:
http://outsourceitneeds.com/DisasterRecovery.pdf

Oversight and Investigations Subcommittee of the House Committee on Energy


and Commerce. (2007 1-8). Testimony of M.L. Lagarde, III . Retrieved
2010 27-12 from Committee on Energy and Commerce:
http://energycommerce.house.gov/images/stories/Documents/Hearings/P
DF/110-oi-hrg.080107.Lagarde-Testimony.pdf

Parris, K. (n.d.). Using OpenVMS Clusters for Disaster Tolerance. Retrieved 11


3-1 from hp.com:
http://h71000.www7.hp.com/openvms/journal/v1/disastertol.pdf?jumpid=re
g_R1002_USEN

Parris, K. (2010). Who Survives Disasters and Why, Part 2: Organizations.


Retrieved 11 3-1 from www2.openvms.org/kparris/:
http://www2.openvms.org/kparris/Bootcamp_2010_Disasters_Part2_Orga
nizations.pdf

Patterson, D., Brown, A., Broadwell, P., Candea, G., Chen, M., Cutler, J., et al.
(2002). Recovery Oriented Computing (ROC): Motivation, Definition,
Techniques, and Case Studies. Computer Science Technical Report,
Computer Science Division, University of California at Berkeley, Computer
Science Department, Mills College and Stanford University; IBM
Research, Berkeley.

Petersen, R. (2009 9). Protecting Cyber Assets. Retrieved 2010 15-6 from
EDUCAUSE Review:
http://www.educause.edu/EDUCAUSE%2BReview/EDUCAUSEReviewMa
gazineVolume44/ProtectingCyberAssets/178440
99

Scalet, S. D. (2002 1-9). IT Executives From Three Wall Street Companies -


Lehman Brothers, Merrill Lynch and American Express - Look Back on
9/11 and Take Stock of Where They Are Now. Retrieved 2009 9-11 from
CIO:
http://www.cio.com/article/31295/IT_Executives_From_Three_Wall_Street
_Companies_Lehman_Brothers_Merrill_Lynch_and_American_Express_L
ook_Back_on_9_11_and_Take_Stock_of_Where_They_Are_Now?page=
3&taxonomyId=1419

Scalet, S. D. (2002 01-09). IT Executives From Three Wall Street Companies-


Lehman Brothers, Merrill Lynch and American Express-Look Back on 9/11
and Take Stock of Where They Are Now . Retrieved 2009 09-11 from CIO:
http://www.cio.com/article/31295/IT_Executives_From_Three_Wall_Street
_Companies_Lehman_Brothers_Merrill Lynch_and
_American_Express_Look_Back_on_9_11_and
_Take_Stock_of_Where_They _Are_Now?page=3&taxonomyId=1419

Schaffhauser, D. (2005 21-10). Disaster Recovery: The Time Is Now. Retrieved


2010 17-12 from Campus Technology:
http://campustechnology.com/articles/2005/10/disaster-recovery-the-time-
is-now.aspx

Schapiro, J., & Bacque, P. (2010 28-08). Agencies' computers still being
restored. Retrieved 2010 5-11 from Richmond Times-Dispatch:
http://www2.timesdispatch.com/member-center/share-this/print/ar/476845/

Schapiro, J., & Bacque, P. (2010 3-9). Northrop Grumman regrets computer
outage. From Richmond Times-Dispatch:
http://www2.timesdispatch.com/news/state-news/2010/sep/03/vita03-ar-
485147/

Schapiro, J., & Bacque, P. (2010 2-9). Update: McDonnell lays out concerns to
Northrop Grumman. Retrieved 2010 8-11 from Richmond Times-Dispatch:
http://www2.timesdispatch.com/news/2010/sep/02/10/vita02-ar-483821/

Schellenger, D. (2010). Dealing with the Personal Dimension of BC/DR. Disaster


Recovery Journal , 23 (2).

Scherr, I., & Bartz, D. (2010 3-2). U.S. unveils cybersecurity safeguard plan.
Retrieved 2010 30-6 from Reuters:
http://www.reuters.com/article/idUSTRE62135H20100302

Scherr, I., & Bartz, D. (2010 2-3). U.S. unveils cybersecurity safeguard plan.
Retrieved 2010 13-4 from Reuters:
http://www.reuters.com/article/idUSTRE62135H20100302
100

Schwartz, S., Li, W., Berenson, L., & Williams, R. (2002 11-9). Deaths in World
Trade Center Terrorist Attacks --- New York City, 2001. Retrieved 11 13-1
from CDC: http://www.cdc.gov/mmwr/preview/mmwrhtml/mm51spa6.htm

Searle, N. (2007). Baylor College of Medicine's Support of Tulane University


School of Medicine Following Hurricane Katrina. Retrieved 2010 17-12
from Academic Medicine:
http://journals.lww.com/academicmedicine/Fulltext/2007/08000/Surviving_
Hurricane_Katrina__Reconstructing_the.4.aspx

Slater, D. (2009 28-10). Business Continuity and Disaster Recovery Planning:


The Basics. Retrieved 2009 9-11 from CSOonline.com:
http://www.csoonline.com/article/204450/Business_Continuity_and_Disast
er_Recovery_Planning_The_Basics

Squires, P. (2010 2-9). Northrop Grumman to pay for cost of independent review.
Retrieved 2010 8-11 from virginiabusiness.com:
http://www.virginiabusiness.com/index.php/news/article/northrop-
grumman-to-pay-for-cost-of-independent-review/

Stewart, L. (2006 10-10). VITA Update to JLARC. Retrieved 2010 5-11 from
www.vita.virginia.gov: jlarc.state.va.us/meetings/October06/VITA.pdf

Swanson, A., Bowen, P., Wohl Phillips, A., Gallup, D., & Lynes, D. (2010 5).
Contingency Planning Guide for Federal Information Systems. NIST
Special Publication 800-34, Revision 1 . Gaithersburg, MD.

Swanson, M., Wohl, A., Pope, L., Grance, T., Hash, J., & Thomas, R. (2002
June). Contingency Planning Guide for Information Technology Systems
Recommendations of the National Institute of Standards and Technology
NIST Special Publication 800-34. Retrieved 2010 27-5 from Computer
Security Division Computer Resource Center National National Institute of
Standards and Technology: http://csrc.nist.gov/publications/nistpubs/800-
34/sp800-34.pdf

Testa, B. (2006 8). In Katrina’s Wake: Intensive Care for an Institution. Retrieved
2010 17-12 from Workforce Management:
http://www.workforce.com/section/recruiting-staffing/archive/feature-
katrinas-wake-intensive-care-institution/244929.html

The Clinton Administration’s Policy on Critical Infrastructure Protection:


Presidential Decision Directive 63. (1998 22-5). Retrieved 2010 02-05
from Computer Security Resource Center National Institute of Standards
and Technology Federal Requirements:
http://csrc.nist.gov/drivers/documents/paper598.pdf
101

The New York Times Company. (04 29-7). FirstEnergy settles suits related to
blackout. Retrieved 11 13-1 from NYTimes.com: NYTimes.com

The Virginia Information Technology Infrastructure Partnership. (n.d.). The


Virginia Information Technology Infrastructure Partnership ANNUAL
REPORT Improving Technology and Wiring Virginia for the 21st Century
July 1, 2006, through June 30, 2007. Retrieved 2010 6-11 from
www.vita.virginia.gov:
http://www.vita.virginia.gov/uploadedFiles/IT_Partnership/ITP2007Annual
Report.pdf

Thibodeau, P., & Mearian, L. (2005 9-12). After Katrina, users start to weigh
long-term IT issues. Retrieved 12 2010-15 from Computerworld:
http://www.computerworld.com/s/article/104542/After_Katrina_users_start
_to_weigh_long_term_IT_issues

Tulane University. (n.d.). About Tulane. Retrieved 10 29-12 from Tulane


University: http://tulane.edu/about/

Tulane University. (2009 2009-2). Ellen DeGeneres to Headline 'Katrina Class'


Commencement. Retrieved 1 2010-2 from Tulane Admission:
http://admission.tulane.edu/livecontent/news/34-ellen-degeneres-to-
headline-katrina-class.html

Tulane University. (09 3). Tulane University Computer Incident Response Plan
Part of Technology Services Disaster Recovery Plan. Retrieved 2011 20-2
from Information Security @ Tulane:
http://security.tulane.edu/TulaneComputerIncidentResponsePlan.pdf

U.S. Department of Transportation. (n.d.). iFlorida Model Deployment Final


Evaluation Report. Retrieved 2009 24-10 from
http://ntl.bts.gov/lib/31000/31000/31051/14480.htm

U.S.-Canada Power System Outage Task Force. (2004 April). Final Report on
the August 14, 2003 Blackout in the United State and Canada: Causes
and Recommendations. From https://reports.energy.gov

Virginia Community College. (1998 25-3). Virginia Community College Utility


Data Center Contingency Management/Disaster Recovery Plan. Retrieved
2009 9-11 from Virginia Community College:
http://helpnet.vccs.edu/NOC/Mainframe/drplan.htm

VITA. (n.d.). Information Technology Infrastructure Library (ITIL). Retrieved 2010


8-11 from www.vita.virginia.gov:
http://www.vita.virginia.gov/library/default.aspx?id=545
102

VITA. (n.d.). Information Technology Investment Board (ITIB). Retrieved 2010 6-


3 from www.vita.virginia.gov: http://www.vita.virginia.gov/ITIB/

VITA. (2010 1-11). Network News. Retrieved 11 13-1 from Vita:


http://www.vita.virginia.gov/communications/publications/networknews/def
ault.aspx?id=12906

VITA. (2007 1-7). Network News Volume 2, Number 7 From the CIO. Retrieved
2010 6-11 from www.vita.virginia.gov:
http://www.vita.virginia.gov/communications/publications/networknews/def
ault.aspx?id=3594

VITA. (2010 1-6). Network News Volume 5, Number 6 . Retrieved 2010 27-11
from www.vita.virginia.gov:
http://www.vita.virginia.gov/communications/publications/networknews/def
ault.aspx?id=12080

Wikan, D. (2010 13-9). Northrop Grumman to pay for computer outage


investigation. Retrieved 2010 7-11 from www.wvec.com:
http://www.wvec.com/news/local/Northrop-Grumman-to-pay-for-computer-
outage-investigation-102796459.html
APPENDICES
103

Appendix A. Recommended Resources

NASCIO IT Disaster Recovery and Business Continuity Tool-kit: Planning for the
Next Disaster
http://www.nascio.org/publications/documents/NASCIO-DRToolKit.pdf

This is an easy-to-follow, workbook-style, 14-page document covering before, during, and after best practices.

Carnegie Mellon Computer Emergency Response Team Resilience Management Model
http://www.sei.cmu.edu/library/abstracts/reports/10tr012.cfm

This detailed 259-page document covers resiliency management from a cross-disciplinary perspective. It includes best practices and CMMI-based generic goals and objectives to guide the process of planning and implementing operational resiliency.

FEMA's Emergency Management Institute
http://www.training.fema.gov/IS/

Free online courses that provide testing and certificates of subject proficiency. They cover a variety of topics such as emergency management, workplace violence, and preparedness.
104

Appendix B. NASCIO IT Disaster Recovery and Business Continuity Tool-kit:


Planning for the Next Disaster

NASCIO: Representing Chief Information Officers of the States

NASCIO Staff Contact:


Drew Leatherby,
IT Disaster Recovery and Business Continuity Issues Coordinator
dleatherby@AMRms.com

Tool-kit: Planning for the Next Disaster

Without the flow of electronic information, government comes to a standstill. When a state's data systems and communication networks are damaged and its processes disrupted, the problem can be serious and the impact far-reaching. The consequences can be much more than an inconvenience. Serious disruptions to a state's IT systems may lead to public distrust, chaos and fear. It can mean a loss of vital digital records and legal documents. A loss of productivity and accountability. And a loss too of revenue and commerce.

Disasters that shut down a state's mission critical applications for any length of time could have devastating direct and indirect costs to the state and its economy that make considering a disaster recovery and business continuity plan essential. State Chief Information Officers (CIOs) have an obligation to ensure that state IT services continue in the state of an emergency. The good news is that there are simple steps that CIOs can follow to prepare for Before, During and After an IT crisis strikes. Is your state ready?

Disaster Recovery Planning 101

Disaster recovery and business continuity planning provides a framework of interim measures to recover IT services following an emergency or system disruption. Interim measures may include the relocation of IT systems and operations to an alternate site, the recovery of IT functions using alternate equipment, or execution of agreements with an outsourced entity.

IT systems are vulnerable to a variety of disruptions, ranging from minor short-term power outages to more-severe disruptions involving equipment destruction from a variety of sources such as natural disasters or terrorist actions. While many vulnerabilities may be minimized or eliminated through technical, management, or operational solutions as part of the state's overall risk management effort, it is virtually impossible to completely eliminate all risks.

In many cases, critical resources may reside outside the organization's control (such as electric power or telecommunications), and the organization may be unable to ensure their availability. Thus effective disaster recovery planning, execution, and testing are essential to mitigate the risk of system and service unavailability. Accordingly, in order for disaster recovery planning to be successful, the state CIO's office must ensure the following:

1. Critical staff must understand the IT disaster recovery and business continuity planning process and its place within the overall Continuity of Operations Plan and Business Continuity Plan process.
2. Develop or re-examine disaster recovery policy and planning processes including preliminary planning, business impact analysis, alternate site selection, and recovery strategies.
3. Develop or re-examine IT disaster recovery planning policies and plans with emphasis on maintenance, training, and exercising the contingency plan.

NASCIO represents state chief information officers and information technology executives and managers from state governments across the United States. For more information visit www.nascio.org.

Copyright © 2007 NASCIO. All rights reserved. 201 East Main Street, Suite 1405, Lexington, KY 40507. Phone: (859) 514-9153. Fax: (859) 514-9166. Email: NASCIO@AMRms.com
How to Use the Tool-kit

This tool-kit represents an updated and expanded version of business continuity and disaster preparedness checklists utilized for a brainstorming exercise at the "CIO-CLC Business Continuity/Disaster Recovery Forum" at NASCIO's 2006 Midyear Conference. This expanded tool-kit evolved from the work of NASCIO's Disaster Recovery Working Group, www.NASCIO.org/Committees/DisasterRecovery, along with NASCIO's DVD on disaster recovery, "Government at Risk: Protecting Your IT Infrastructure" (view the video or place an order at www.NASCIO.org/Committees/DisasterRecovery/DRVideo.cfm). These checklists and accompanying group brainstorming worksheets will serve as a resource for state CIOs and other state leaders to not only better position themselves to cope with an IT crisis, but also to help make the business case for disaster recovery and business continuity activities in their states.

The tool-kit is comprised of six checklists in three categories that address specific contingency planning recommendations to follow Before, During and After a disruption or crisis situation occurs. The Planning Phase, Before the disaster, describes the process of preparing plans and procedures and testing those plans to prepare for a possible network failure. The Execution Phase, During the disaster, describes a coordinated strategy involving system reconstitution and outlines actions that can be taken to return the IT environment to normal operating conditions. The Final Phase, After the disaster, describes the transitions and gap analysis that takes place after the disaster has been mitigated. The tool-kit also provides an accompanying group activity worksheet, "Thinking Sideways," to assist in disaster recovery planning sessions with critical staff.

IT Disaster Recovery and Business Continuity Checklists

Before the Crisis
(1) Strategic and Business Planning Responsibilities (Building relationships; What is the CIO's role on an ongoing basis? Role of enterprise policies?)
(2) Top Steps States Need to Take to Solidify Public/Private Partnerships Ahead of Crises (Pre-disaster agreements with the private sector and other organizations.)
(3) How do you Make the Business Case on the Need for Redundancy? (Especially to the state legislature, the state executive branch and budget officials.)
(4) General IT Infrastructure and Services (Types of redundancy; protecting systems.)

During the Crisis
(5) Tactical Role of CIOs for Recovery During a Disaster (Working with state and local agencies and first responders; critical staff assignments; tactical use of technology, e.g. GIS.)

After the Crisis
(6) Tactical Role of CIOs for Recovery After a Disaster Occurs (Working with state and local agencies, and critical staff to resume day-to-day operations, and perform gap analysis of the plan's effectiveness.)
Before the Crisis

(1) Strategic and Business Planning Responsibilities (Building relationships; What is the CIO's role on an ongoing basis? Role of enterprise policies?)

• CIOs need a Disaster Recovery and Business Continuity (DRBC) plan including: (1) Focus on capabilities that are needed in any crisis situation; (2) Identifying functional requirements; (3) Planning based on the degrees of a crisis from minor disruption of services to extreme catastrophic incidents; (4) Establish service level requirements for business continuity; (5) Revise and update the plan; have critical partners review the plan; and (6) Have hard and digital copies of the plan stored in several locations for security. Notes:

• CIOs should ask and answer the following questions: (1) What are the top business functions and essential services the state enterprise can not function without? Tier business functions and essential services into recovery categories based on level of importance and allowable downtime. (2) How can the operation's facilities, vital records, equipment, and other critical assets be protected? (3) How can disruption to an agency's or department's operations be reduced? Notes:

• CIOs should create a business resumption strategy: Such strategies lay out the interim procedures to follow in a disaster until normal business operations can be resumed. Plans should be organized by procedures to follow during the first 12, 24, and 48 hours of a disruption. (Utilize technologies such as GIS for plotting available assets, outages, etc.) Notes:

• CIOs should conduct strategic assessments and inventory of physical assets, e.g. computing and telecom resources, identify alternate sites and computing facilities. Also conduct strategic assessments of essential employees to determine the staff that would be called upon in the event of a disaster and be sure to include pertinent contact information. Notes:

• CIOs should conduct contingency planning in case of lost personnel: This could involve cross-training of essential personnel that can be lent out to other agencies in case of loss of service or disaster; also, mutual aid agreements with other public/private entities such as state universities for "skilled volunteers." (Make sure contractors and volunteers have approved access to facilities during a crisis). Notes:

• Build cross-boundary relationships with emergency agencies: CIOs should introduce themselves and build relationships with state-wide, agency and local emergency management personnel – you don't want the day of the disaster to be the first time you meet your emergency management counterparts. Communicate before the crisis. Also consider forging multi-state relationships with your CIO counterparts to prepare for multi-state incidents. Consider developing a cross-boundary DR/BC plan or strategy, as many agencies and jurisdictions have their own plans. Notes:
107

NASCIO: Representing Chief Information Officers of the States

• Intergovernmental communications and coordination plan: Develop a plan to communicate and coordinate efforts with state, local and federal government officials. Systems critical for other state, local and federal programs and services may need to be temporarily shut down during an event to safeguard the state's IT enterprise. Local jurisdictions are the point-of-service for many state transactions, including benefits distribution and child support payments, and alternate channels of service delivery may need to be identified and temporarily established. Make sure jurisdictional authority is clearly established and articulated to avoid internal conflicts during a crisis.

• Establish a crisis communications protocol: A crisis communications protocol should be part of a state's IT DR/BC plan. Designate a primary media spokesperson, with additional single point-of-contact communications officers as back-ups. Articulate who can speak to whom under different conditions, as well as who should not speak with the press. In a time of crisis, go public immediately, but only with what you know; provide updates frequently and regularly.

• Testing: CIOs should conduct periodic training exercises and drills to test DR/BC plans. These drills should be pre-scheduled and conducted on a regular basis and should include both desk-top and field exercises. Conduct a gap analysis following each exercise.

• A CIO's approach to a DR/BC plan will be unique to his or her financial and organizational situation and the availability of trained personnel. This still leaves the question as to who writes the plans. If a CIO chooses from one of the many consultants that provide Continuity of Operations planning, he or she should make sure that staff maintains a close degree of involvement and, when completed, that the consultant(s) provide general awareness training of the plan. If CIOs choose to conduct planning in-house, have an experienced and certified business continuity planner review it for any potential gaps or inconsistencies.

• Communicate to rank and file employees that there is a plan, the why and how of the plan, and their roles during a potential disruption of service or disaster. Identify members of a possible crisis management team. Have in place their roles, actions to be taken, and possible scenarios. Have a list of their office, home, and cell or mobile phone numbers (a simple roster sketch follows this checklist).
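
To make the contact-list item above actionable, the crisis management team roster can be kept as structured data and printed as a one-page call sheet. The following minimal Python sketch is illustrative only; the names, roles, and phone numbers are hypothetical placeholders and are not part of the NASCIO tool-kit.

    from dataclasses import dataclass

    @dataclass
    class TeamMember:
        """One member of the crisis management team and how to reach them."""
        name: str
        role: str            # action this member owns during a disruption
        office_phone: str
        home_phone: str
        mobile_phone: str

    # Hypothetical roster; in practice it is reviewed and re-issued regularly.
    CRISIS_TEAM = [
        TeamMember("A. Smith", "Primary media spokesperson",
                   "555-0100", "555-0101", "555-0102"),
        TeamMember("B. Jones", "Back-up communications officer",
                   "555-0110", "555-0111", "555-0112"),
    ]

    def call_sheet(team):
        """Render a one-page call sheet for distribution to employees."""
        return "\n".join(
            f"{m.name:<10} {m.role:<32} office {m.office_phone}  "
            f"home {m.home_phone}  mobile {m.mobile_phone}"
            for m in team
        )

    if __name__ == "__main__":
        print(call_sheet(CRISIS_TEAM))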


(2) Top Steps States Need to Take to Solidify Public/Private Partnerships Ahead of Crises (Pre-disaster agreements with the private sector and other organizations.)

• Utilize preexisting business partnerships: Keep the dialogue open with state business partners; periodically call them all in for briefings on the state's disaster recovery and business continuity (DR/BC) plans.

• Set up "Emergency Standby Services and Hardware Contracts": Have contracts in place for products and services that may be needed in the event of a declared emergency. Develop a contract template so a contract can be drafted with one to two hours of work time (a fill-in template sketch follows this checklist).

• Outsourced back-up sites may be time-limited; a back-up for the outsourced back-up may therefore be necessary to maintain continuity (a leap-frog arrangement).

• Place advertisements in the state's "Contract Reporter" every quarter; continuous recruitment is a good business practice.

• Be sure essential IT procurement staff are part of the DR/BC plan and are aware of their roles in executing pre-positioned contracts in the event of a disaster; also be sure to include pertinent contact information.

• CIOs should develop "Emergency Purchasing Guidelines" for agencies and have emergency response legislation in place.

• Think outside the box: CIOs can partner with anyone, e.g. universities, local government, lottery corporations, local companies and leased facilities with redundant capabilities.
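
The one-to-two-hour turnaround in the standby-contract item is easier to meet when the contract exists as a pre-approved fill-in template. The minimal Python sketch below uses the standard library's string.Template; the field names and wording are hypothetical placeholders, not actual state or NASCIO contract language.

    from string import Template
    from datetime import date

    # Hypothetical boilerplate; real emergency contract language comes from counsel.
    EMERGENCY_CONTRACT = Template(
        "EMERGENCY STANDBY SERVICES AND HARDWARE CONTRACT\n"
        "Date issued: $issued\n"
        "Vendor: $vendor\n"
        "Goods/services to be provided: $scope\n"
        "Delivery required within: $delivery_hours hours of activation\n"
        "Not-to-exceed amount: $$${ceiling}\n"
        "Authorized by: $authorizing_official\n"
    )

    def draft_contract(vendor, scope, delivery_hours, ceiling, authorizing_official):
        """Fill the pre-approved template so a draft exists within the hour."""
        return EMERGENCY_CONTRACT.substitute(
            issued=date.today().isoformat(),
            vendor=vendor,
            scope=scope,
            delivery_hours=delivery_hours,
            ceiling=f"{ceiling:,.2f}",
            authorizing_official=authorizing_official,
        )

    if __name__ == "__main__":
        print(draft_contract("Example Hardware Co.", "50 rack servers and diesel fuel",
                             48, 250000, "State CIO"))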


(3) How do you Make the Business Case on the Need for Redundancy? (Especially to the state legislature, the state executive branch and budget officials.)

Risk assessment of types of disasters that could lead to the need for business continuity planning:

• Geological hazards – Earthquakes, Tsunamis, Volcanic eruptions, Landslides/mudslides/subsidence;
• Meteorological hazards – Floods/flash floods, tidal surges, Drought, Fires (forest, range, urban), Snow, ice, hail, sleet, avalanche, Windstorm, tropical cyclone, hurricane, tornado, dust/sand storms, Extreme temperatures (heat, cold), Lightning strikes;
• Biological hazards – Diseases that impact humans and animals (plague, smallpox, Anthrax, West Nile Virus, Bird flu);
• Human-caused events – Accidental: Hazardous material (chemical, radiological, biological) spill or release; Explosion/fire; Transportation accident; Building/structure collapse; Energy/power/utility failure; Fuel/resource shortage; Air/water pollution, contamination; Water control structure/dam/levee failure; Financial issues: economic depression, inflation, financial system collapse; Communications systems interruptions;
• Intentional – Terrorism (conventional, chemical, radiological, biological, cyber); Sabotage; Civil disturbance, public unrest, mass hysteria, riot; Enemy attack, war; Insurrection; Strike; Misinformation; Crime; Arson; Electromagnetic pulse.

• Education and awareness: Craft an education and awareness program for IT staff, lawmakers and budget officials to ensure all parties are on the same page with regards to your DR/BC plan and the need for such a plan. Prepare key talking points that outline the rationale for DR/BC planning. Utilize outside resources such as this tool-kit and NASCIO's DVD on disaster recovery, "Government at Risk: Protecting Your IT Infrastructure," to help make the business case for disaster recovery and business continuity activities in your state.

• For federally declared states of emergency, the financial aspect has been somewhat lessened by the potential of acquiring funding grants from state or federal organizations such as FEMA. Additional funding for state cybersecurity preparedness efforts is available to states through the U.S. Department of Homeland Security's State Homeland Security Grants Program.

• Establish metrics for costs of not having redundancy: How much will it cost the state if certain critical business functions go down – e.g. ERP issues on the payment side; citizen service issues (what it would do to the DMV for license renewals); impacts on eligibility verifications for social services, etc. How long can you afford to be down? How much is this costing you? How long can you be without a core business function? (A worked cost example follows this checklist.)

• Up-front savings: States obtain greater leverage for fair pricing and priority service in the event of an emergency before the emergency occurs, rather than after the emergency has occurred.

• Consider channels of delivery: Child support payments channeled through a broker agency.


• Consider cycles of delivery: The most important periods of delivery, e.g. the last week or couple of days of the month, may be the most critical back-up period.

• Realize that as the adoption rate for electronic business processes and online services grows, employees with knowledge of business rules and paper processes will retire and will no longer be around for manual backup.
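
The "metrics for costs of not having redundancy" item earlier in this checklist comes down to simple arithmetic: estimated outage cost is roughly (missed transactions per hour times the value or penalty per transaction, plus idle staff times their loaded hourly rate), multiplied by the hours of downtime. A minimal Python sketch with entirely hypothetical figures:

    def outage_cost(trans_per_hour, cost_per_missed_transaction,
                    idle_staff, loaded_hourly_rate, hours_down):
        """Rough cost of an outage for one business function (all inputs are estimates)."""
        lost_service = trans_per_hour * cost_per_missed_transaction * hours_down
        idle_labor = idle_staff * loaded_hourly_rate * hours_down
        return lost_service + idle_labor

    if __name__ == "__main__":
        # Hypothetical example: license renewals offline for 24 hours.
        cost = outage_cost(trans_per_hour=400,
                           cost_per_missed_transaction=12.50,
                           idle_staff=150,
                           loaded_hourly_rate=38.00,
                           hours_down=24)
        print(f"Estimated 24-hour outage cost: ${cost:,.2f}")
        # 400 * 12.50 * 24 = 120,000 in lost service value
        # 150 * 38.00 * 24 = 136,800 in idle labor
        # Total of roughly $256,800, to be weighed against the cost of redundancy.

Comparing such an estimate against the annual cost of the proposed redundancy is one concrete way to frame the request to budget officials.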


(4) General IT Infrastructure and Services (Types of redundancy; protecting systems.)

• CIOs need to ensure that information is regularly backed up. Agencies need to store their back-up data securely off site in a location that is accessible but not too near the facility in question. Such locations should be equipped with hardware, software and agency data, ready for use in an emergency. (Restore functions should be tested on a regular basis; a small verification sketch follows this checklist.) These "hot sites" can be owned and operated by an agency or outsourced.

• Protect current systems: Controlled access; uninterruptible power supply (UPS); back-up generators with standby contracts for diesel fuel (use priority and back-up fuel suppliers that also have back-up generators to operate their pumps in the event of a widespread power outage).

• Approach enterprise backup as a shared service: Other agencies may have the capability for excess redundancy.

• Strategic location: Locate critical facilities away from sites that are vulnerable to natural and man-made disasters.

• Provide secure remote access to state IT systems for essential employees (access may be tiered based on critical need).

• Interactive voice response (IVR) systems that are accessing back-end databases: There may be no operators for backup that can connect patrons to services. Seek diversity of inbound communications.

• Mobile communication centers can be utilized in the event that traditional telecommunications systems are down.

• Self-healing primary point of presence facilities that automatically restore service.

• Hot sites: A disaster recovery facility that mirrors an agency's applications and databases in real-time. Operational recovery is provided within minutes of a disaster. These can be provided at remote locations or outsourced to one or multiple contractors.

• Self-healing communications systems that automatically re-route communications or use alternate media.
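
The first item in this checklist stresses that restore functions, not just backup jobs, be tested on a regular basis. One simple spot check is to restore a sample file from the off-site set and compare its checksum against the production original. The minimal Python sketch below is illustrative; the file paths are hypothetical.

    import hashlib
    from pathlib import Path

    def sha256(path: Path) -> str:
        """Checksum a file in chunks so large backups do not exhaust memory."""
        digest = hashlib.sha256()
        with path.open("rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                digest.update(chunk)
        return digest.hexdigest()

    def verify_restore(original: Path, restored: Path) -> bool:
        """True if the restored copy is byte-for-byte identical to the original."""
        return sha256(original) == sha256(restored)

    if __name__ == "__main__":
        # Hypothetical paths: a production file and its copy restored from the off-site set.
        ok = verify_restore(Path("/data/claims.db"), Path("/restore_test/claims.db"))
        print("restore verified" if ok
              else "RESTORE FAILED - investigate before relying on this backup")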


During the Crisis


(5) Tactical Role of CIOs for Recovery During a Disaster (Working with state and local agencies and first
responders; critical staff assignments; tactical use of technology, e.g. GIS.)

• Decision making: Prepare yourself for making decisions in an environment of uncertainty. During a crisis you may not have all the information necessary; however, you will be required to make immediate decisions.

• Execute DR/BC Plan: Retrieve copies of the plan from secure locations. Begin systematic execution of plan provisions, including procedures to follow during the first 12, 24, and 48 hours of the disruption.

• Shut down non-essential services to free up resources for other critical services. Identify critical business applications and essential services and tier them into recovery categories based on level of importance and allowable downtime, e.g. tier III applications are shut down first. Be sure to classify critical services for internal customers vs. external customers. (A small inventory sketch follows this checklist.)

• Communicate, communicate, communicate: Engage your primary media spokesperson immediately and have additional communications officers on stand-by if needed. Immediately get the word to the press; let the media – and therefore the public – know that you are dealing with the situation.

• Implement your emergency employee communications plan: Inform your internal audiences – IT staff and other government offices – at the same time you inform the press. Prepare announcements to employees to transition them to alternate sites or implement telecommuting or other emergency procedures. Employees can maintain communication with the central IT office using phone exchange cards that list two numbers: (1) the first number employees use to call in and leave their contact information; (2) the second number employees call every morning for a standing all-employee conference call for updates on the emergency situation.

• Intergovernmental communications and coordination plan: Communicate and coordinate efforts with state, local and federal government officials. Systems critical for other state, local and federal programs and services may need to be temporarily shut down during an event to safeguard the state's IT enterprise. Local jurisdictions are the point-of-contact for many state transactions, including vehicle and voter registration, and alternate channels of service delivery may need to be identified and temporarily established. Make sure jurisdictional authority is clearly established and articulated to avoid internal conflicts during a crisis.


• Back-up communications: In the event wireless, radio and Internet communications are inaccessible, Government Emergency Telecommunications Service (GETS) cards can be utilized for emergency wireline communications. GETS is a federal program that prioritizes calls over wireline networks and utilizes both the universal GETS access number and a Personal Identification Number (PIN) for priority access.

• CIOs must be effectively engaged with the On-Scene Coordinator (OSC) and the Incident Command System (ICS) – the federal framework for managing disaster response that outlines common processes, roles, functions, terms, responsibilities, etc. ICS supports the FEMA National Incident Management System (NIMS) approach; states must understand both NIMS and the ICS.

• Leverage technology/think outside the box: In a disaster situation the state's GIS systems can be utilized to monitor power outages and system availability. For emergency communications, the "State Portal" can be converted to an emergency management portal. Also, Web 2.0 technologies such as weblogs, wikis and RSS feeds can be utilized for emergency communications.

• Execute "Emergency Standby Services and Hardware Contracts": If necessary, execute pre-placed contracts for products and services needed during the crisis. The Governor may also have to temporarily suspend some of the state's procurement laws and execute "Emergency Purchasing Guidelines" for agencies.
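
The "shut down non-essential services" item earlier in this section assumes an inventory of applications already tiered by importance and allowable downtime. The minimal Python sketch below shows one way such an inventory can yield the shutdown order directly; the service names, tiers, and downtime figures are hypothetical.

    from dataclasses import dataclass

    @dataclass
    class Service:
        name: str
        tier: int                    # 1 = most critical, 3 = shut down first
        allowable_downtime_hrs: int
        audience: str                # "internal" or "external"

    # Hypothetical inventory; a real one comes from the business impact analysis.
    INVENTORY = [
        Service("Benefits eligibility verification", 1, 4, "external"),
        Service("Agency intranet wiki",              3, 72, "internal"),
        Service("Payroll/ERP payment run",           1, 8, "internal"),
        Service("Public records search portal",      2, 24, "external"),
    ]

    def shutdown_order(services):
        """Least critical first: highest tier number, then longest allowable downtime."""
        return sorted(services, key=lambda s: (-s.tier, -s.allowable_downtime_hrs))

    if __name__ == "__main__":
        for s in shutdown_order(INVENTORY):
            print(f"tier {s.tier}: {s.name} ({s.audience}, {s.allowable_downtime_hrs}h allowable)")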


After the Crisis


(6) Tactical Role of CIOs for Recovery After a Disaster Occurs (Working with state and local agencies,
and critical staff to resume day-to-day operations, and perform gap analysis of the plan’s effectiveness.)

• Preliminary damage and loss assessment: Conduct a post-event inventory and assess the loss of physical and non-physical assets. Include both tangible losses (e.g. a building or infrastructure) and intangible losses (e.g. financial and economic losses due to service disruption). Be sure to include a damage and loss assessment of hard copy and digital records. Prepare a tiered strategy for recovery of lost assets. (A simple tally sketch follows this checklist.)

• Employee transition: Once agencies have recovered their data, CIOs need to find interim space for displaced employees, either at the hot site or another location. Coordinate announcements to employees to transition them to an alternate site or implement telecommuting procedures until normal operations are reestablished.

• Budgetary concerns: Following a disaster and resumption of IT services, there may be a need for emergency capital expenditures to aid in the recovery process. Be prepared to work with the state budget officer and/or the state's legislative budget committees.

• Contractual performance: Review the performance of strategic contracts and modify contract agreements as necessary.

• Lessons learned: Evaluate the effectiveness of the DR/BC plan and how people responded. Examine all aspects of the recovery effort and conduct a gap analysis to identify deficiencies in the plan's execution. Update the plan based on the analysis: what went right (duplicate it); what went wrong (tag it and avoid it in the future). Correct problems so they don't happen again.
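
The preliminary damage and loss assessment above separates tangible from intangible losses and feeds a tiered strategy for recovering lost assets. The minimal Python sketch below illustrates such a tally; the asset names and dollar figures are hypothetical.

    # Hypothetical post-event loss register; amounts come from the field inventory.
    LOSSES = [
        {"asset": "Data center UPS room",        "kind": "tangible",   "estimate": 180_000},
        {"asset": "Paper case files (basement)", "kind": "tangible",   "estimate": 45_000},
        {"asset": "3 days of online tax filing", "kind": "intangible", "estimate": 310_000},
    ]

    def summarize(losses):
        """Totals by kind, plus a recovery queue ordered by estimated loss."""
        totals = {}
        for item in losses:
            totals[item["kind"]] = totals.get(item["kind"], 0) + item["estimate"]
        queue = sorted(losses, key=lambda item: item["estimate"], reverse=True)
        return totals, queue

    if __name__ == "__main__":
        totals, queue = summarize(LOSSES)
        for kind, amount in totals.items():
            print(f"{kind:<10} ${amount:,}")
        print("Recovery priority:", [item["asset"] for item in queue])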


Appendix 1. Thinking Sideways

Instructions: Use this worksheet in conjunction with each checklist as a group brainstorming tool.

A. Conduct a gap analysis on Checklist ___. Focus on what's missing and include key policy issues unique to state governments, best practices and innovative ideas that can be shared across jurisdictions:

B. Describe how states and the private sector can work together to tackle these issues through the transference of knowledge and experience:

C. How can CIOs use this information to secure funding and other resources for business continuity?


Appendix 2. Additional Resources

Federal Government Resources

The Federal Emergency Management Agency's (FEMA's) National Incident Management System (NIMS) – NIMS was developed so responders from different jurisdictions and disciplines can work together better to respond to natural disasters and emergencies, including acts of terrorism. NIMS' benefits include a unified approach to incident management; standard command and management structures; and emphasis on preparedness, mutual aid and resource management: <http://www.fema.gov/emergency/nims/index.shtm>

FEMA's Emergency Management Institute – A federal resource for emergency management education and training. <http://training.fema.gov/>

GAO Report, Information Sharing: DHS Should Take Steps to Encourage More Widespread Use of Its Program to Protect and Share Critical Infrastructure Information. GAO-06-383, April 17, 2006: <http://www.gao.gov/cgi-bin/getrpt?GAO-06-383>

GAO Report, Continuity of Operations: Agency Plans Have Improved, but Better Oversight Could Assist Agencies in Preparing for Emergencies. GAO-05-577, April 28, 2005: <http://www.gao.gov/docdblite/summary.php?rptno=GAO-05-577&accno=A22839>

U.S. Department of Homeland Security (DHS), Safe America Foundation – <http://www.safeamerica.org/sp_cybersafety.htm>

National Institute of Standards and Technology (NIST) – Special Publication 800-34, Contingency Planning Guide for Information Technology: Recommendations of the National Institute of Standards and Technology: <http://csrc.nist.gov/publications/nistpubs/>

State Government Resources

Pennsylvania's Pandemic Preparation Website: <http://www.pandemicflu.state.pa.us/pandemicflu/site/default.asp>. Also see Government Technology's article regarding Pennsylvania's new Website: <http://www.govtech.net/news/news.php?id=99469>

New York State's Office of General Services (OGS) emergency contracts, prepared through the new National Association of State Procurement Officials (NASPO) Cooperative Purchasing Hazardous Incident Response Equipment (HIRE) program, are available at: <http://www.ogs.state.ny.us/purchase/spg/awards/3823219745CAN.HTM>. New York is the lead state for this multi-state cooperative.

Washington State, Department of Information Technology, Tech News, Enterprise Business Continuity: Making Sure Agencies are Prepared, December 2005: <http://www.dis.wa.gov/technews/2005_12/20051203.aspx>

National Organization, Academia and Consortium Resources

Business Continuity Institute (BCI) – BCI was established in 1994 to enable members to obtain guidance and support from fellow business continuity practitioners. The BCI has over 2600 members in 50+ countries. The wider role of the BCI is to promote the highest standards of professional competence and commercial ethics in the provision and maintenance of business continuity planning and services: <http://www.thebci.org/>

Disaster Recovery Institute (DRI) – DRI International (DRII) was first formed in 1988 as the Disaster Recovery Institute in St. Louis, MO. A group of professionals from the industry and from Washington University in St. Louis forecast the need for comprehensive education in business continuity. DRII established its goals to: Promote a base of common knowledge for the business continuity planning/disaster recovery industry through education, assistance, and publication of the standard resource base; Certify qualified individuals in the discipline; and Promote the credibility and professionalism of certified individuals: <http://www.drii.org/>

The National Association of State Procurement Officials (NASPO) has completed work on disaster recovery as it relates to procurement: <http://www.naspo.org/>

U.S. Computer Emergency Readiness Team (U.S. CERT)/Coordination Center – Survivable Systems Analysis Method: <http://www.cert.org/archive/html/analysis-method.html>

The Council of State Archivists (CoSA) – CoSA is a national organization comprising the individuals who serve as directors of the principal archival agencies in each state and territorial government. CoSA's Framework for Emergency Preparedness in State Archives and Records Management Programs is available at: <http://www.statearchivists.org/prepare/framework/assessment.htm>

AFTER THE DISASTER – Hurricane Katrina not only impacted more than 90,000 square miles and almost 10 million residents of the Gulf Coast but also affected how governments will manage such disasters in the future. A collection of articles opens the dialogue about disaster response in a new book, "On Risk and Disaster: Lessons from Hurricane Katrina." The book, edited by Ronald J. Daniels, Donald F. Kettl (a Governing contributor) and Howard Kunreuther, warns of the inevitability of another disaster and the need to be prepared to act. It addresses the public and private roles in assessing, managing and dealing with disasters and suggests strategies for moving ahead in rebuilding the Gulf Coast. To see a table of contents and sample text, visit <http://www.upenn.edu/pennpress/book/14002.html>. Published by the University of Pennsylvania Press, the book sells for $27.50.

Articles and Reports

"Cleaning Up After Katrina," CIO Magazine, March 15, 2006: <http://www.cio.com/archive/031506/view_oreck.html?CID=19049>

Continuity of Operations Planning: Survival for Government, Continuity Central: <http://www.continuitycentral.com/feature0200.htm>

Disaster and Recovery, GovExec.com: <http://www.govexec.com/features/1201/1201managetech.htm>

"Disaster Recovery, How to protect your technology in the event of a disaster," Bob Xavier, November 27, 2001: <http://www.techsoup.org/howto/articles/techplan/page2686.cfm>


VITA

EDUCATION
Candidate for M.S. in Computer Information Technology at Purdue University, May 2011; G.P.A. 3.8/4.00
Honors B.A. in Communication, Public Relations at Purdue University, December 1998; G.P.A. 3.57/4.00

PUBLICATIONS

"Disaster Recovery and Business Continuity Planning: Business Justification," H. M. Brotherton, Journal of Emergency Management, 67-60, 2010. DOI: 10.5055/jem.2010.0019. http://pnpcsw.pnpco.com/cadmus/testvol.asp?year=2010&journal=jem
EMPLOYMENT
ITaP Web and Applications Administration, Graduate Assistant
May 2009-Present
• Met with customers to define project requirements and create an articulate design rationale that best meets those requirements
• Project planning and management
• Developed and updated documentation of administration policies and
procedures
• Granted and implemented development and deployment access to web developers
• Migrated and created websites in our Apache, IIS, and ColdFusion
environments
• Customized application portals using XML, JavaScript, HTML, and CSS
• Created Tivoli Storage Manager nodes
• Assisted with SharePoint training development
• Built Tomcat web server
• Researched BMC Remedy BI development and implementation
• Designed, developed, and implemented the Applications Administration website and forms using XHTML, JavaScript, CSS and PHP

Social Security Administration, Social Insurance Specialist


April 1999- January 2009
• Served as Site LAN Coordinator for my office. Duties included: verifying
systems updates, reporting systems problems, changing daily backup
tapes, resolving systems issues by making necessary changes on site.
• Prepared and delivered presentations to special interest groups
• Employed creativity and problem solving to deal effectively with situations
of competing or conflicting priorities
• Exercised professionalism and discretion in handling confidential
information
• Analyzed, interpreted, and implemented policy and balanced tasks in a fast-paced work environment
• Learned new policies, tools and technology on a daily basis to keep up
with constantly changing workloads
• Assumed responsibility for maintaining quality standards in processing
claims
• Worked both independently and as a team member to meet our office
goals

Before & Afterthoughts, Owner-Manager


November 2001-April 2003
Formed S-corporation, performed all bookkeeping, paid and managed
employees.

Underhill Games, Co-owner/Board Member


May 2001-April 2002
Researched business models, informed the board members about the various corporation types and gave a recommendation to form an S-corporation, attended trade conventions as company representative, created and edited our online store, and set up payment and shipping for business merchandise.

Purdue University West Lafayette, Residence Hall Counselor


August 1997-May 1998
Planned and implemented programs and activities, enforced University policies,
resolved disputes between residents, and served as contact for University
resource referrals.

REFERENCES

Dr. J. Eric Dietz, Computer and Information Technology, Purdue University,


Purdue Homeland Security Institute (PHSI), Gerald D. and Edna E. Mann Hall,
Room 166, 203 S. Martin Jischke Drive, West Lafayette, IN 47907-1971
Jeffrey Sprankle, Computer and Information Technology, Purdue University,
Knoy Hall, Room 221, 401 N. Grant Street, West Lafayette, IN 47907
PUBLICATION
