
Session 12

A total of 115 British Airways flights, or 13 percent of services, were
cancelled on Sunday while 311 services, or 35 percent, were delayed, according
to FlightAware, a Houston-based plane-tracking service. The carrier scrapped a
combined 418 flights at Heathrow and Gatwick airport, south of London, on
Saturday and 568 were delayed, the research company said. British Airways has
declined to specify figures for flights or customers affected.
British Airways Incident
ü “Human Error” is all too often used by firms to hide a multitude of
datacentre design and training flaws, caused by years of underinvestment
in their server farms.
ü Switching off the power supply to the servers was the main cause of the outage
ü It caused major damage to the servers the airline uses to run its online
check-in, baggage handling and customer contact systems, resulting in
flights from Heathrow and Gatwick being grounded for the best part of
two days.
British Airways Incident
ü “Management decisions about budget and cost and spending have not
allowed these facilities to be upgraded over time to keep up with the
demand and criticality of these systems.”
ü In the airline industry in particular, flight operators are under mounting
pressure to cut costs in the face of growing competition from budget
carriers, said Kirby, and the upkeep of their IT estates can be the first thing
to suffer.
ü “A lot of the systems the airlines use have been around since the late
1970s and they weren’t really [designed] for client-facing systems. They
were for internal use,” he said.
British Airways Incident
British Airways owner IAG SA said a power outage that led to the
cancellation of hundreds of flights last month probably cost it about 80
million pounds ($102 million) in lost revenue and the expense of
accommodating, re-booking and compensating thousands of passengers……

Likely cost: about $200 million

After 6 weeks:
'Total chaos' at Heathrow as British Airways computers crash yet again
HOLIDAYMAKERS are facing huge delays after British Airways systems at two
of the UK's busiest airports crashed once more.
Business Continuity Management (BCM)
ü Business Continuity Management is a holistic management
process that identifies potential impacts that threaten an
organization and provides a framework for building resilience
and capability for an effective response that safeguards the
interests of its key stakeholders, reputation, brand and
value-creating activities.
ü Business continuity means maintaining the uninterrupted
availability of all key business resources required to support
essential business activities.
BC and DR - Definitions
ü Business Continuity - Overall continuation of business
functions during an emergency event.
ü Disaster Recovery – Recovery of the systems, applications and
processing capabilities
Why BCP and DRP?

Causes of disruption: data corruption, component failure, application failure,
user error, maintenance, site outage


BCP and DRP
Fair amount of Confusion in terminology
Business Continuity Plan
ü Prepared at Business level
ü Includes IT
ü Covers all relevant functions
ü Considers all aspects such as:
§ Communication to external agencies
§ Communication to customers, suppliers
§ Quick response to correct misinformation
§ Handling increased vulnerability during emergencies
§ Keeping controls, security and integrity in place
§ Avoiding making the situation worse than it is
BCP - Definition
A documented, tested, rehearsed plan to minimize financial losses to the
institution, serve customers with minimal disruptions, and mitigate the
negative effects of disruptions on business operations.

What is BCP for?


To continue the essential services to key stakeholders when the
organization faces:
§ catastrophic events such as floods, earthquakes, or acts of terrorism
§ accidents or sabotage
§ outages due to an application error, hardware or network failures
BCP – Team Structure
Business Continuity Committee (Management Authorization)
BCP Team Leader
BCP Spokesperson
Internal Auditor
Execution Teams:
§ Emergency Action Team
§ Damage Asst. & Salvage Team
§ Relocation Team
§ Admin, Security & Support Team
§ IT Operations Team
BCP – Documentation
Documentation should cover:
ü Risk Management
ü Environmental Management
ü Emergency Management
ü Crisis Management
ü IT Disaster Recovery
ü Knowledge Management
ü Facility Management
ü Human Management
ü Supply Chain Management
ü Security and Privacy
ü Health and Safety
ü Communications / PR
Enterprise business process, people and technology


BCP - Process
ü Initiated and Supported by Top Management
ü Assess risks and vulnerabilities
ü Actions to protect people, environment, assets
ü Actions to contain and prevent further damage
ü Business Impact Analysis
BCP - Process
ü Identify the essential activities that must continue during emergencies
and level of service targeted
ü Identify all resources needed to provide such services:
Place, People, Data, Facilities (Security, food, water, IT, communication,
transportation), Raw Materials and other equipment (as necessary), Prior
Permission from relevant authorities, Service Provider support, Contact lists,
Authorisation, access and escalation procedures, Budgets
BCP - Process
ü Identify who among the available top management will invoke the BCP
during emergencies
ü Define the communication process to stakeholders
ü Arrange for the people, premises, IT facilities etc. to be available when needed
ü Train people
ü Test the facilities, remedy weaknesses
ü Document the process in a brief document
ü Have the document externally audited and complete the audit actions
ü Make a senior businessperson accountable for ongoing preparedness of BCP
arrangements
BCM Compliance Standards
ü Standards in Business Continuity
§ ISO 22301
§ FFIEC
§ NIST 800
§ NFPA 1600
§ SEC
§ FISMA
§ FINRA
§ Supply Chain Resilience Leadership Council
ü Measure compliance in these BCM dimensions
§ Program Administration
§ Crisis Management
§ Business Recovery
§ IT Disaster Recovery
§ Fire & Life Safety
§ Supply Chain Risk Management
§ Third Party Management
BCP & DRP - Differences
Business Continuity Plan (BCP)
ü Focused on recovery of individual business processes, departments, functions,
facilities etc. (revenue, production and operational management)
ü Recovery Time Objective (RTO) is typically measured in days or weeks…
sometimes months
ü Active business and IT participation
ü Recovery addresses people, process, and support technologies required to
continue the business
ü Continuity plans are usually by process, department, function and/or facility
Disaster Recovery Plan (DRP)
ü Focused on recovery of Enterprise IT applications and supporting
infrastructure (support the business)
ü Recovery Time Objective (RTO) is typically measured in minutes or hours…
sometimes days
ü Active IT participation with little to no business participation during an event
ü Recovery addresses enterprise data center/computing, facility and support
staff needs
ü Recovery plans are usually by application suite, platform and/or data center
facility
DRP – IT Component of BCP
Owned by IT
Consistent with rest of BCP
Objectives:
Recovery Time Objective (RTO) – maximum permissible outage time
Recovery Point Objective (RPO) – maximum permissible data loss, expressed as how far
back in time from the disruption the most recent recoverable data may be
Facilities:
Cold Site: A facility that is environmentally conditioned, but devoid of any equipment.
Hot Site: An alternate facility with workspace for the personnel, fully equipped
with all resources and stand-by computer facilities.
Mirrored Site: It is identical in all aspects to the primary site, right down to the
information availability. It is equivalent to having a redundant site in normal times and
is naturally the most expensive option.
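The two objectives can be checked against a recovery drill or a real incident. The sketch below is illustrative only: the `RecoveryObjectives` class, the `evaluate_drill` function, the four-hour RTO, the one-hour RPO and all timestamps are assumptions made for the example, not values from these slides. Data loss is taken as the gap between the last recoverable copy and the failure; downtime is the gap between the failure and restored service.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class RecoveryObjectives:
    """Hypothetical targets agreed with the business."""
    rto: timedelta  # maximum permissible outage time
    rpo: timedelta  # maximum permissible window of lost data


def evaluate_drill(objectives, last_good_copy, failure_time, service_restored):
    """Compare one recovery drill (or real incident) with the objectives."""
    data_loss = failure_time - last_good_copy   # work done since the last copy is lost
    downtime = service_restored - failure_time  # time the service was unavailable
    return {
        "data_loss": data_loss,
        "downtime": downtime,
        "rpo_met": data_loss <= objectives.rpo,
        "rto_met": downtime <= objectives.rto,
    }


# Illustrative numbers only.
objectives = RecoveryObjectives(rto=timedelta(hours=4), rpo=timedelta(hours=1))
result = evaluate_drill(
    objectives,
    last_good_copy=datetime(2024, 1, 1, 9, 30),
    failure_time=datetime(2024, 1, 1, 10, 0),
    service_restored=datetime(2024, 1, 1, 15, 0),
)
print(result["rpo_met"], result["rto_met"])  # True False
```

In this made-up drill only 30 minutes of data are lost, so the RPO is met, but service takes five hours to return, so the RTO is missed.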
Data Recovery – Facilities
ü Conventional Backup
ü RAID
ü Remote Journaling
ü Electronic Vaulting – transmits data electronically and automatically
creates the backup offsite.
ü Disk Replication (Mirroring, Shadowing) – data is written to both the primary
server and the replicated server, giving an up-to-date copy of the data and an
excellent RPO
ü Clustering – solution for high availability; use a secondary server to
provide access to applications and data when the primary server fails.
Cost of RTO, RPO
Recovery Objectives
Figure: a timeline centred on the disruption. Data loss (Recovery Point
Objective) extends back from seconds to weeks; downtime (Recovery Time
Objective) extends forward from seconds to weeks.
Data-loss side: Mirroring / Replication, Backup, Vaulting
Downtime side: Clustering, Restore from Disk, Restore from Tape

Testing
ü Methods: Tabletop, simulation, full dress rehearsal
ü Scenarios to test
ü Take care to avoid creating confusion and panic
ü Measure and document the tests
ü Identify weaknesses and improvements
ü Implement improvement actions
Cloud DR as a service
ü Migrating entire IT operations, or only the DR solution, to the
cloud, and replicating or moving data to the cloud, can bring
significant cost savings and lower recovery times.
ü Cloud resources can shrink and grow in response to demand.
In normal times the system runs in ‘Replication Mode’, which
requires fewer resources and incurs low cost.
ü When a business disruption occurs, the system enters
‘Failover Mode’, which requires more resources; these
scale smoothly without requiring large upfront
investments.
ü Cloud computing removes the need for identical hardware at
the primary datacenter and in the cloud.
ü Cloud server start-up can be easily automated and
managed.
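The cost argument for the two modes can be made concrete with simple arithmetic. The sketch below is a rough model under stated assumptions: the function `annual_dr_cost` and every rate and hour count are hypothetical and do not come from the slides or from any cloud provider's pricing.

```python
def annual_dr_cost(replication_rate, failover_rate, failover_hours,
                   hours_per_year=8760):
    """Estimate yearly DR spend.

    replication_rate: hourly cost while only replicating data (small footprint)
    failover_rate:    hourly cost while running production workloads in the cloud
    failover_hours:   hours per year spent in failover mode (tests and incidents)
    """
    if not 0 <= failover_hours <= hours_per_year:
        raise ValueError("failover_hours must be between 0 and hours_per_year")
    replication_hours = hours_per_year - failover_hours
    return replication_hours * replication_rate + failover_hours * failover_rate


# Hypothetical rates: a standby site that always runs at full capacity,
# versus cloud DR that scales up only for 48 hours of tests and incidents.
always_on = annual_dr_cost(replication_rate=20.0, failover_rate=20.0,
                           failover_hours=0)
cloud_dr = annual_dr_cost(replication_rate=2.0, failover_rate=20.0,
                          failover_hours=48)
print(always_on, cloud_dr)  # 175200.0 18384.0
```

With these invented numbers the pay-as-you-grow model costs roughly a tenth of the always-on standby; real figures depend on data volumes, transfer charges and how often failover is exercised.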
Platform Recovery Strategies
DRP Document Format
1. Introduction
2. Business Impact Analysis - including a sample impact matrix
3. DRP organization responsibilities pre and post disaster - DRP / BCP checklist
4. Backup strategy for data centers, departmental file servers, wireless network
servers, data at outsourced sites, desktops (in office and "at home"), laptops and
mobiles
5. Recovery strategy, including approach (e.g. log file transfers, data backup, spare
servers, dedicated DataComm lines), escalation plan process and decision points
6. DR facility to be set up / upgraded (provide details of location, capacity, data
communication and data centre facilities)
DRP Document Format
7. User Area readiness – contact point, tech support, alternate DataComm lines, user
access set-up
8. Accountability to decide on invoking DRP actions (Person 1/2/3/4)
9. Disaster Recovery Procedures in a tabular format, with columns:
sequence, action, who, when, verification of correct completion, communication
10. Technical Appendix
11. Communication process, including necessary phone numbers and contact points
12. Role of outsourced support providers (IT support, communication, security,
transport)
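The tabular procedure format in item 9 maps naturally onto a small record type. The following is a minimal sketch, assuming a hypothetical `ProcedureStep` class and made-up example steps; a real DRP would hold these rows in the plan document or a runbook tool.

```python
from dataclasses import dataclass


@dataclass
class ProcedureStep:
    """One row of the recovery procedure table (columns from item 9)."""
    sequence: int
    action: str
    who: str
    when: str
    verification: str
    communication: str
    done: bool = False


def next_step(steps):
    """Return the lowest-sequence step not yet completed, or None if all are done."""
    pending = [s for s in steps if not s.done]
    return min(pending, key=lambda s: s.sequence) if pending else None


# Example rows are invented for illustration.
steps = [
    ProcedureStep(2, "Restore database from latest backup", "DBA on call",
                  "After DR servers are up", "Row counts match backup report",
                  "Notify DR incident manager"),
    ProcedureStep(1, "Start DR servers", "IT operations",
                  "On DRP invocation", "Servers respond to health check",
                  "Notify DR incident manager"),
]
steps[1].done = True  # step 1 has been completed and verified
print(next_step(steps).action)  # Restore database from latest backup
```

Keeping verification and communication as required fields reflects the slide's point that each action needs a check of correct completion and a named party to inform.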
Recovery Action when Disaster strikes
1. Protection to People, Environment, Assets
2. Actions to Contain or eliminate further damage
3. Call the Support Organisations
4. Inform Senior Management
5. DR Incident manager and other roles for DRP clarified
6. Communicate to end users, customers, suppliers, employees, service
providers and support functions – what to do till normalcy returns
7. Recovery strategy clarified to all, including approach (e.g. log file
transfers, data backup, spare servers, dedicated DataComm lines),
escalation plan process and decision points.
Recovery Action when Disaster strikes
8. Disaster Recovery Procedures in a tabular format
9. Sequence, action, who, when, verification of correct completion,
communication
10. User Area actions initiated – contact point, tech support, alternate
DataComm lines, user access set-up
11. Technical information shared for reference
12. Correctness of recovery checked and reviewed (audit team involved, if
possible)
Recovery Action when Disaster strikes
13. Select users asked to test pre-defined software options
14. Communicate to all to resume operations
15. Separate team to work on Recovering the Original site
16. Test Original site for readiness
17. Plan and Communicate Transition back to original site
18. Perform transition to Original site
19. Bring the DR site to readiness for any new emergencies
20. Document observations and learning
Planning and Setting up a DR system
ü Set the RPO, RTO Objectives based on Business need
ü Decide DR Location
ü Decide Recovery strategy
ü Decide on growth of Processing, Disk space and Communication needs
ü Decide on data (log files) movement from “Production” to DR Site
ü Set up the data centre; install servers, disk space and communication lines
ü Test the data movement and recovery up to RPO
ü Set up the data backup routine at the DR centre
ü Synchronize “Production” and DR System
ü Initiate log file movement
Recovery Actions

Subject: Protect people / environment / assets
§ Action: Evacuate building
§ Who does: Data Centre Manager
§ When: Immediately on noticing flood / fire / building collapse
§ Who else involved: Anyone on site, preferably trained
§ Remarks/Guidelines: Speed of action is vital

Subject: Contain further damage
§ Action: First aid, fire-fighting, stop power / water flow, call civil authorities
§ Who does: Data Centre Manager
§ When: After protecting people
§ Who else involved: Personnel trained in first aid and fire-fighting
§ Remarks/Guidelines: Call ambulance / doctor to site immediately
Updating BCP/DRP
ü A named person is accountable
ü Update BCP/DRP when changes happen in business or IT
ü Audited by an external auditor
Disaster “Recovery”
ü Getting back to BAU
ü Site readiness
ü People readiness
ü Data consistency or gaps
ü Involve Internal Audit teams
ü Preserve data back-ups at key points
§ May be needed in investigation, insurance etc.
ü Controlled move back to usual site
Closure:
§ Document the incidents and actions
§ Document learnings
§ Document gaps, close gaps
Incident Management
An incident is an unplanned interruption to an IT service or a reduction in
the quality of an IT service.
Incident Management - Activities
ü Identification - the incident is detected or reported.
ü Registration - the incident is registered in an ICM System.
ü Categorization - the incident is categorized by priority, SLA etc.
ü Communication
ü Prevention of further damage.
ü Seek help
ü Prioritization - the incident is prioritized for better utilization of the
resources and the Support Staff time.
ü Diagnosis - the full symptoms of the incident are established.
Incident Management - Activities
ü Escalation - should the Support Staff need support from other
organizational units.
ü Investigation and diagnosis - if no existing solution from the past could
be found the incident is investigated and root cause found.
ü Resolution and recovery - once the solution is found the incident is
resolved.
ü Rectify the effect of the problem.
ü Put in place temporary or permanent fix; if temporary, initiate
permanent fix.
ü Incident closure - the registry entry of the incident in the ICM System is
closed by providing the end-status of the incident.
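The activities above can be read as a lifecycle with a fixed set of permitted moves. The sketch below is one possible reading, not a prescribed ICM design: the `Stage` names, the `TRANSITIONS` table and the `Incident` class are assumptions made for illustration. Escalation and investigation are treated as optional, since the slides say investigation happens only when no existing solution is found.

```python
from enum import Enum


class Stage(Enum):
    IDENTIFIED = "identified"
    REGISTERED = "registered"
    CATEGORIZED = "categorized"
    PRIORITIZED = "prioritized"
    DIAGNOSED = "diagnosed"
    ESCALATED = "escalated"
    INVESTIGATED = "investigated"
    RESOLVED = "resolved"
    CLOSED = "closed"


# Permitted moves. A diagnosed incident with a known fix can be resolved
# directly; otherwise it is escalated and/or investigated first.
TRANSITIONS = {
    Stage.IDENTIFIED: {Stage.REGISTERED},
    Stage.REGISTERED: {Stage.CATEGORIZED},
    Stage.CATEGORIZED: {Stage.PRIORITIZED},
    Stage.PRIORITIZED: {Stage.DIAGNOSED},
    Stage.DIAGNOSED: {Stage.ESCALATED, Stage.INVESTIGATED, Stage.RESOLVED},
    Stage.ESCALATED: {Stage.INVESTIGATED, Stage.RESOLVED},
    Stage.INVESTIGATED: {Stage.RESOLVED},
    Stage.RESOLVED: {Stage.CLOSED},
    Stage.CLOSED: set(),
}


class Incident:
    def __init__(self, summary):
        self.summary = summary
        self.stage = Stage.IDENTIFIED
        self.history = [Stage.IDENTIFIED]  # audit trail for the closing entry

    def advance(self, new_stage):
        """Move to new_stage, rejecting moves that skip required activities."""
        if new_stage not in TRANSITIONS[self.stage]:
            raise ValueError(
                f"cannot move from {self.stage.value} to {new_stage.value}")
        self.stage = new_stage
        self.history.append(new_stage)


incident = Incident("Online check-in unavailable")
for stage in (Stage.REGISTERED, Stage.CATEGORIZED, Stage.PRIORITIZED,
              Stage.DIAGNOSED, Stage.RESOLVED, Stage.CLOSED):
    incident.advance(stage)
print(incident.stage.value)  # closed
```

Rejecting out-of-order moves means an incident cannot be closed without first being registered, prioritized and resolved, which keeps the registry usable for later root cause analysis.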
Incident Manager - Responsibilities
ü Understand the incident/fault.
ü Gather sufficient information to start an analysis.
ü Maintain a general overview of the incident including communication.
ü Understand the functionality of multiple areas.
ü Obtain guidance on priorities for the teams starting the immediate,
urgent, unexpected recovery work.
Incident Closure
ü Root cause analysis.
ü Read-across to other parts of the system and actions to prevent recurrence.
ü “How could the event have been anticipated or detected early?”
ü Training or HR actions as identified.
ü Update documentation and processes suitably.
