System Administration
Disaster Recovery
Topics
1. Planning
2. Disasters
3. Mitigation
Disaster Recovery Plans
1. Define (un)acceptable loss.
How much could you lose in a disaster?
2. Back up everything.
Back up data, metadata, and instructions on
how to restore your system.
3. Organize everything.
Can you find the backup tapes you need when
disaster strikes?
Define loss
Loss of service
How much employee productivity is lost?
How much customer revenue is lost?
Loss of data
Irreplaceable data
Medical image records
Stock purchases
Re-creatable data. At what cost?
Code for a software product
Simulation results
Backup Everything
On a system
Project directories
Home directories
System files (fstab, kernel, passwd, LVM, MBR)
Types of systems
Laptops
Connect, then back up to the backup server on command.
Desktops
Store everything on network disks.
Servers
Permanent connection to backup system.
Organize Everything
What resources do you back up?
On what schedule?
Media organization
Bar code labels on each tape.
Stored securely at proper temp/humidity.
Media database
Maps servers/drives to tapes and their locations.
Indicates whether tapes are on- or off-site.
Must itself be backed up, with a human-readable label.
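A media database of the kind described above can be sketched in a few lines. This is only an illustration: the `TapeRecord` fields, the helper names, and the sample inventory (host names, bar codes, locations) are all hypothetical and not taken from any particular backup product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TapeRecord:
    barcode: str    # bar code label on the tape
    server: str     # server the backup came from
    drive: str      # drive or filesystem that was backed up
    location: str   # where the tape is physically stored
    offsite: bool   # True if the tape is stored off-site


def tapes_for(records, server):
    """Return every tape that holds data from the given server."""
    return [r for r in records if r.server == server]


def printable_index(records):
    """Render the database as plain text that a person can read on paper."""
    lines = []
    for r in sorted(records, key=lambda r: (r.server, r.drive, r.barcode)):
        site = "off-site" if r.offsite else "on-site"
        lines.append(f"{r.server}:{r.drive} -> tape {r.barcode} @ {r.location} ({site})")
    return "\n".join(lines)


# Hypothetical sample inventory.
inventory = [
    TapeRecord("A00017", "web01", "/dev/sda", "Server room, rack 3", False),
    TapeRecord("A00018", "web01", "/dev/sdb", "Off-site vault", True),
    TapeRecord("A00042", "db01", "/dev/sda", "Off-site vault", True),
]

print(printable_index(inventory))
```

Keeping a plain-text export alongside the database means the tape index can still be read if the system that normally hosts the database is the one being restored.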
Document
Store documentation in portable format.
Ensure documentation accessible in disaster.
Paper copies on and off-site.
Test
Can other people understand procedures?
Test-restore a sample of tapes on a regular (weekly) basis.
Attempt a full system recovery twice a year.
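One way to check a test restore is to compare checksums of the restored files against the originals. The sketch below assumes both copies are reachable as directories. The function names and the demo files are made up for illustration, and a real restore test would also need to check ownership, permissions, and bootability.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path):
    """Hash a file in chunks so large files do not need to fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def verify_restore(original_dir, restored_dir):
    """Return relative paths that are missing or differ in the restored copy."""
    original_dir, restored_dir = Path(original_dir), Path(restored_dir)
    problems = []
    for src in sorted(original_dir.rglob("*")):
        if not src.is_file():
            continue
        rel = src.relative_to(original_dir)
        dst = restored_dir / rel
        if not dst.is_file() or sha256_of(src) != sha256_of(dst):
            problems.append(rel.as_posix())
    return problems


# Demo with throwaway directories: one file restores cleanly, one is corrupted.
with tempfile.TemporaryDirectory() as tmp:
    orig = Path(tmp, "orig")
    restored = Path(tmp, "restored")
    orig.mkdir()
    restored.mkdir()
    (orig / "fstab").write_text("example fstab contents\n")
    (orig / "passwd").write_text("example passwd contents\n")
    (restored / "fstab").write_text("example fstab contents\n")
    (restored / "passwd").write_text("corrupted\n")
    demo_problems = verify_restore(orig, restored)

print(demo_problems)
```

An empty result means every original file was found in the restored copy with identical contents.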
What is a Disaster?
A catastrophic event that causes loss of data
and/or service.
Human disasters
Accidental or intentional.
Typo, backhoe, or hacker tools.
Natural disasters
Small scale: Hardware or power failure.
Large scale: Hurricane, earthquake, fire.
Types of Disasters
User errors
Accidental file deletion / overwrite.
Very common. Snapshots can automate recovery.
Sysadmin errors
Accidental mass file destruction.
Regular backups limit the loss to changes made since the last backup.
Drive failure
Single disk failure: RAID can prevent loss.
System failure
Loss of an entire system.
RAID won’t help. Need backups.
CIT 470: Advanced Network and System Administration Slide #13
Types of Disasters
Power/Network Failure
Need UPS/generator or redundant network.
Software Failure
Software corrupts its own or other apps' data stores.
Need regular and perhaps historical backups.
Security Breach
An attacker / worm destroys/corrupts data.
Need long-term historical backups.
Natural Disaster
Potential loss of entire data center, incl. backups.
Need off-site backups to restore data.
Need off-site (virtual) data center to restore service.
Risk Analysis
Evaluate risk cost of disaster
Cost * Probability
Determines budget for disaster mitigation.
Ex: power failure
70% chance per year
Average downtime: 4 hours
Average web site revenue / hour: $1000
Budget = 4 hrs * (1000 $/hr) * 0.7/yr = $2800/yr
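The power-failure example can be reproduced as a short calculation. This is a minimal sketch of the Cost * Probability rule using the figures above. The function and parameter names are illustrative only.

```python
def annual_risk_cost(cost_per_hour, downtime_hours, probability_per_year):
    """Expected yearly loss: cost of one incident times its yearly probability."""
    incident_cost = cost_per_hour * downtime_hours
    return incident_cost * probability_per_year


# Power failure: $1000/hr of revenue, 4 hours of downtime, 70% chance per year.
budget = annual_risk_cost(cost_per_hour=1000, downtime_hours=4, probability_per_year=0.7)
print(f"${budget:,.0f}/yr")  # prints "$2,800/yr"
```

Running the same function over each disaster type gives a rough ranking of where mitigation money is best spent.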
Disaster Mitigation
Power Failures
UPS
Generator
System Failures
Redundancy: CPU, ECC RAM, NICs, power
Cluster of servers
Network Failures
Multiple internet connections from different ISPs.
Disaster Mitigation
Drive Failures
RAID
Backups
Accidental Deletion
Snapshots
Backups
Security Incident
Backups
Redundant Site
Redundant site at a different location
– Location far enough away to be unaffected by
whatever disaster took down primary site.
– Automatic or manual switchover.
• DNS names with short expiration times (TTLs).
Cheaper solution: use existing second site
– Duplicate critical services at both data centers.
– Rebuild less critical servers at second site.