
Exchange 2010 High Availability

Exchange Deployment Planning Services

Exchange 2010 High Availability

Ideal audience for this workshop: Messaging SME, Network SME, Security SME

Exchange 2010 High Availability


During this session, focus on the following: How will we leverage this functionality in our organization? What availability and service level requirements do we have for our messaging solution?

Agenda
Review of Exchange Server 2007 Availability Solutions
Overview of Exchange 2010 High Availability
Exchange 2010 High Availability Fundamentals
Exchange 2010 High Availability Deep Dive
Exchange 2010 Site Resilience

Exchange Server 2007 Single Copy Clustering

SCC out-of-box provides little high availability value
On Store failure, SCC restarts the store on the same machine; no CMS failover
SCC does not automatically recover from storage failures
SCC does not protect your data, your most valuable asset
SCC does not protect against site failures
SCC redundant network is not leveraged by CMS
Conclusion: SCC only provides protection from server hardware failures and bluescreens, the relatively easy components to recover; it supports rolling upgrades without losing redundancy

Exchange Server 2007 Continuous Replication


[Diagram: Exchange 2007 continuous replication. From the active database's log stream (E00.log, E0000000011.log, E0000000012.log): 1. Copy logs, 2. Inspect logs, 3. Replay logs into the database copy.]

Log shipping to a local disk
Log shipping within a cluster
Log shipping to a standby server or cluster

Exchange Server 2007 HA Solution (CCR + SCR)


[Diagram: CCR + SCR topology. Outlook (MAPI) and OWA/ActiveSync/Outlook Anywhere clients connect via Client Access Servers in AD site San Jose. Two CCR clusters (CCR #1 Nodes A/B and CCR #2 Nodes A/B), each on its own Windows cluster, host DB1-DB3 and DB4-DB6. SCR replicates databases to a standby server in AD site Dallas, which has its own Client Access Server.]

Key callouts: manual activation of the remote mailbox server; the Mailbox server role can't co-exist with other roles; SCR is managed separately with no GUI; clustering knowledge is required; a database failure requires a server failover

Exchange 2010 High Availability Goals



Reduce complexity
Reduce cost
Native solution, no single point of failure
Improve recovery times
Support larger mailboxes
Support large scale deployments

Make High Availability Exchange deployments mainstream!

Exchange 2010 High Availability Architecture


[Diagram: a DAG spanning AD sites Dallas and San Jose. Mailbox Servers 1-5 in Dallas and Mailbox Server 6 in San Jose each host a mix of active and passive copies of DB1-DB5; clients connect through Client Access Servers in each site.]

Key callouts: all clients connect via CAS servers; failover is managed within Exchange; failover is database (DB) centric; the DAG is easy to extend across sites

Exchange 2010 High Availability Fundamentals



Database Availability Group (DAG), Server, Database, Database Copy, Active Manager, RPC Client Access service

Exchange 2010 High Availability Fundamentals


Database Availability Group
A group of up to 16 servers hosting a set of replicated databases
Wraps a Windows failover cluster
  Manages server membership in the group
  Heartbeats servers, quorum, cluster database
Defines the boundary of database replication
Defines the boundary of failover/switchover (*over)
Defines the boundary for the DAG's Active Manager
[Diagram: Mailbox Server 1 through Mailbox Server 16 in a single DAG]

Exchange 2010 High Availability Fundamentals


Server

Unit of membership for a DAG
Hosts the active and passive copies of multiple mailbox databases
Executes Information Store, CI, Assistants, and other services on active mailbox database copies
Executes replication services on passive mailbox database copies
[Diagram: Mailbox Servers 1-3, each hosting a mix of active and passive copies of DB1-DB4]

Exchange 2010 High Availability Fundamentals


Server (Continued)

Provides the connection point between the Information Store and RPC Client Access
Very few server-level properties are relevant to HA: the server's Database Availability Group, the server's Activation Policy, RCA
[Diagram: Mailbox Servers 1-3, each hosting a mix of active and passive copies of DB1-DB4]

Exchange 2010 High Availability Fundamentals


Mailbox Database

Unit of *over
A database has one active copy; the active copy can be mounted or dismounted
Maximum number of passive copies = number of servers in the DAG - 1
[Diagram: Mailbox Servers 1-3, each hosting a mix of active and passive copies of DB1-DB4]

Exchange 2010 High Availability Fundamentals


Mailbox Database (Continued)
Database *overs take ~30 seconds
Server failover/switchover involves moving all active databases to one or more other servers
Database names are unique across a forest
Defines properties relevant at the database level
  GUID: a database's unique ID
  EdbFilePath: path at which copies are located
  Servers: list of servers hosting copies

Exchange 2010 High Availability Fundamentals


Active/Passive vs. Source/Target

Availability Terms
Active: Selected to provide email services to clients Passive: Available to provide email services to clients if active fails

Replication Terms
Source: Provides data for copying to a separate location Target: Receives data from the source

Exchange 2010 High Availability Fundamentals


Mailbox Database Copy
Scope of replication
A copy is either the source or the target of replication at any given time
A copy is either active or passive at any given time
Only one copy of each database in a DAG is active at a time
A server may not host more than one copy of any database

[Diagram: Mailbox Servers 1 and 2, each hosting copies of DB1-DB3]

Exchange 2010 High Availability Fundamentals


Mailbox Database Copy
Defines properties applicable to an individual database copy
Copy status: Healthy, Initializing, Failed, Mounted, Dismounted, Disconnected, Suspended, FailedAndSuspended, Resynchronizing, Seeding
ActiveCopy, CopyQueueLength, ActivationSuspended, ReplayQueueLength
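The properties above can be inspected from the shell. A minimal sketch (hypothetical server and database names):

# Status, copy queue length, and replay queue length for every copy of DB1
Get-MailboxDatabaseCopyStatus -Identity DB1 | Format-Table Name,Status,CopyQueueLength,ReplayQueueLength,ActivationSuspended

# All database copies hosted on a specific DAG member
Get-MailboxDatabaseCopyStatus -Server MBX1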

Exchange 2010 High Availability Fundamentals


Active Manager
Exchange-aware resource manager (high availability's "brain")
Runs on every server in the DAG Manages which copies should be active and which should be passive Definitive source of information on where a database is active or mounted
Provides this information to other Exchange components (e.g., RPC Client Access and Hub Transport) Information stored in cluster database

Exchange 2010 High Availability Fundamentals


Active Manager
Active Directory is still the primary source for configuration info
Active Manager is the primary source for changeable state information (such as active and mounted)
The Replication service monitors the health of all mounted databases, and monitors ESE for I/O errors or failures

Exchange 2010 High Availability Fundamentals


Continuous Replication

Continuous replication has the following basic steps:


Database copy seeding of target Log copying from source to target Log inspection at target Log replay into database copy

Exchange 2010 High Availability Fundamentals


Database Seeding

There are several ways to seed the target instance:


Automatic seeding
Update-MailboxDatabaseCopy cmdlet
  Can be performed from the active or a passive copy
Manually copy the database
Backup and restore (VSS)
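For illustration, a hedged sketch of reseeding with Update-MailboxDatabaseCopy (hypothetical names; -SourceServer is the SP1 parameter for seeding from a passive copy):

# Reseed the copy of DB1 hosted on MBX2, discarding any existing files first
Update-MailboxDatabaseCopy -Identity DB1\MBX2 -DeleteExistingFiles

# SP1: seed from a passive copy instead of the active copy
Update-MailboxDatabaseCopy -Identity DB1\MBX2 -SourceServer MBX3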

Exchange 2010 High Availability Fundamentals


Log Shipping

Log shipping in Exchange 2010 leverages TCP sockets
Supports encryption and compression Administrator can set TCP port to be used

The Replication service on the target notifies the active instance of the next log file it expects, based on the last log file it inspected

The Replication service on the source responds by sending the required log file(s)
Copied log files are placed in the target's Inspector directory

Exchange 2010 High Availability Fundamentals


Log Inspection

The following actions are performed to verify the log file before replay:
Physical integrity inspection
Header inspection
Any Exx.log files that exist on the target are moved to the ExxOutofDate folder if the target was previously a source

If inspection fails, the file is recopied and inspected (up to 3 times)
If the log file passes inspection, it is moved into the database copy's log directory

Exchange 2010 High Availability Fundamentals


Log Replay
Log replay has moved to the Information Store
The following validation tests are performed prior to log replay:
  Recalculate the required log generations by inspecting the database header
  Determine the highest generation present in the log directory to ensure that a log file exists
  Compare the highest log generation present in the directory to the highest log generation required
  Make sure the logs form the correct sequence
  Query the checkpoint file, if one exists

Replay the log file using a special recovery mode (undo phase is skipped)

Exchange 2010 High Availability Fundamentals


Lossy Failure Process
In the event of a failure, the following steps occur for the failed database:
  Active Manager determines the best copy to activate
  The Replication service on the target server attempts to copy missing log files from the source (ACLL)
  If successful, the database mounts with zero data loss
  If unsuccessful (lossy failure), the database mounts based on the AutoDatabaseMountDial setting

The mounted database generates new log files (using the same log generation sequence)
Transport Dumpster requests are initiated for the mounted database to recover lost messages
When the original server or database recovers, it runs through divergence detection and performs an incremental reseed or requires a full reseed
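The lossy-mount decision is governed by the mount dial on each Mailbox server. A minimal sketch (hypothetical server name):

# Allow automatic mounts with up to 12 lost logs (BestAvailability)
Set-MailboxServer -Identity MBX1 -AutoDatabaseMountDial BestAvailability

# Or require zero loss before an automatic mount
Set-MailboxServer -Identity MBX1 -AutoDatabaseMountDial Lossless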

Exchange 2010 High Availability Fundamentals


Backups

Streaming backup APIs for public use have been removed; you must use VSS for backups
Back up from any copy of the database/logs; always choose the passive (or active) copy
Back up an entire server
Designate a dedicated backup server for a given database
Restore from any of these backup scenarios
[Diagram: a Database Availability Group with Mailbox Servers 1-3 hosting copies of DB1-DB3; a VSS requestor backs up one of the copies]

Multiple Database Copies Enable New Scenarios


Site/server/disk failure: Exchange 2010 HA
Archiving/compliance: e-mail archive
Recover deleted items: extended/protected dumpster retention

[Diagram: a Database Availability Group with Mailbox Servers 1-3, each hosting copies of DB1-DB3; Mailbox Server 3 holds a 7-14 day lag copy]

Mailbox Database Copies



Create up to 16 copies of each mailbox database
Each mailbox database must have a unique name within the organization
Mailbox database objects are global configuration objects
All mailbox database copies use the same GUID
No longer connected to specific Mailbox servers

Mailbox Database Copies

Each DAG member can host only one copy of a given mailbox database
Database path and log folder path for copy must be identical on all members

Copies have settable properties


Activation Preference
RTM: used as a secondary sort key during best copy selection
SP1: used for distributing active databases; used as the primary sort key when the mount dial is set to Lossless

Replay Lag and Truncation Lag


Using these features affects your storage design
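As an illustration (hypothetical names), the activation preference is set when a copy is added, or adjusted later:

# Add a copy of DB2 on MBX3 as the second activation preference
Add-MailboxDatabaseCopy -Identity DB2 -MailboxServer MBX3 -ActivationPreference 2

# Adjust the preference of an existing copy
Set-MailboxDatabaseCopy -Identity DB2\MBX3 -ActivationPreference 3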

Lagged Database Copies

A lagged copy is a passive database copy with a replay lag time greater than 0
Lagged copies are for point-in-time protection only; they are not a replacement for point-in-time backups
  Logical corruption and/or mailbox deletion prevention scenarios
  Provide a maximum of 14 days of protection

When should you deploy a lagged copy?


Useful only to mitigate a risk May not be needed if deploying a backup solution (e.g., DPM 2010)

Lagged copies are not HA database copies


Lagged copies are never automatically activated by the system
Steps for manual activation are documented at http://technet.microsoft.com/en-us/library/dd979786.aspx

Lagged copies affect your storage design
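A hedged sketch of creating a lagged copy and keeping it out of automatic activation (hypothetical names; the lag value is an example):

# Add a copy with a 7-day replay lag (14 days is the maximum)
Add-MailboxDatabaseCopy -Identity DB3 -MailboxServer MBX4 -ReplayLagTime 7.00:00:00 -ActivationPreference 4

# Keep replication running but block activation of the lagged copy
Suspend-MailboxDatabaseCopy -Identity DB3\MBX4 -ActivationOnly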

DAG Design
Two failure models
Design for all database copies activated
  Design for the worst case: the server architecture handles 100 percent of all hosted database copies becoming active

Design for targeted failure scenarios


Design server architecture to handle the active mailbox load during the worst failure case you plan to handle
A 1-member failure requires 2 or more HA copies and 2 or more servers
A 2-member failure requires 3 or more HA copies and 4 or more servers

Requires Set-MailboxServer <Server> -MaximumActiveDatabases <Number>

DAG Design

It's all in the layout


Consider this scenario
8 servers, 40 databases with 2 copies
Server 1: DB1 DB2 DB3 DB4 DB5 DB36 DB37 DB38 DB39 DB40
Server 2: DB6 DB7 DB8 DB9 DB10 DB31 DB32 DB33 DB34 DB35
Server 3: DB11 DB12 DB13 DB14 DB15 DB26 DB27 DB28 DB29 DB30
Server 4: DB16 DB17 DB18 DB19 DB20 DB21 DB22 DB23 DB24 DB25
Server 5: DB21 DB22 DB23 DB24 DB25 DB16 DB17 DB18 DB19 DB20
Server 6: DB26 DB27 DB28 DB29 DB30 DB11 DB12 DB13 DB14 DB15
Server 7: DB31 DB32 DB33 DB34 DB35 DB6 DB7 DB8 DB9 DB10
Server 8: DB36 DB37 DB38 DB39 DB40 DB1 DB2 DB3 DB4 DB5

DAG Design

It's all in the layout
If I have a single server failure... life is good
Server 1: DB1 DB2 DB3 DB4 DB5 DB36 DB37 DB38 DB39 DB40
Server 2: DB6 DB7 DB8 DB9 DB10 DB31 DB32 DB33 DB34 DB35
Server 3: DB11 DB12 DB13 DB14 DB15 DB26 DB27 DB28 DB29 DB30
Server 4: DB16 DB17 DB18 DB19 DB20 DB21 DB22 DB23 DB24 DB25
Server 5: DB21 DB22 DB23 DB24 DB25 DB16 DB17 DB18 DB19 DB20
Server 6: DB26 DB27 DB28 DB29 DB30 DB11 DB12 DB13 DB14 DB15
Server 7: DB31 DB32 DB33 DB34 DB35 DB6 DB7 DB8 DB9 DB10
Server 8: DB36 DB37 DB38 DB39 DB40 DB1 DB2 DB3 DB4 DB5

DAG Design

It's all in the layout
If I have a double server failure... life could be good
Server 1: DB1 DB2 DB3 DB4 DB5 DB36 DB37 DB38 DB39 DB40
Server 2: DB6 DB7 DB8 DB9 DB10 DB31 DB32 DB33 DB34 DB35
Server 3: DB11 DB12 DB13 DB14 DB15 DB26 DB27 DB28 DB29 DB30
Server 4: DB16 DB17 DB18 DB19 DB20 DB21 DB22 DB23 DB24 DB25
Server 5: DB21 DB22 DB23 DB24 DB25 DB16 DB17 DB18 DB19 DB20
Server 6: DB26 DB27 DB28 DB29 DB30 DB11 DB12 DB13 DB14 DB15
Server 7: DB31 DB32 DB33 DB34 DB35 DB6 DB7 DB8 DB9 DB10
Server 8: DB36 DB37 DB38 DB39 DB40 DB1 DB2 DB3 DB4 DB5

DAG Design

It's all in the layout
If I have a double server failure... life could be bad
Server 1: DB1 DB2 DB3 DB4 DB5 DB36 DB37 DB38 DB39 DB40
Server 2: DB6 DB7 DB8 DB9 DB10 DB31 DB32 DB33 DB34 DB35
Server 3: DB11 DB12 DB13 DB14 DB15 DB26 DB27 DB28 DB29 DB30
Server 4: DB16 DB17 DB18 DB19 DB20 DB21 DB22 DB23 DB24 DB25
Server 5: DB21 DB22 DB23 DB24 DB25 DB16 DB17 DB18 DB19 DB20
Server 6: DB26 DB27 DB28 DB29 DB30 DB11 DB12 DB13 DB14 DB15
Server 7: DB31 DB32 DB33 DB34 DB35 DB6 DB7 DB8 DB9 DB10
Server 8: DB36 DB37 DB38 DB39 DB40 DB1 DB2 DB3 DB4 DB5

DAG Design

It's all in the layout
Now let's consider this scenario
4 servers, 12 databases with 3 copies
Server 1: DB1 DB4 DB7 DB2 DB3 DB5 DB6 DB9 DB10
Server 2: DB4 DB1 DB8 DB5 DB6 DB3 DB7 DB11 DB12
Server 3: DB7 DB8 DB9 DB2 DB3 DB4 DB10 DB11 DB12
Server 4: DB10 DB1 DB6 DB11 DB12 DB2 DB5 DB8 DB9

With a single server failure:


Server 1: DB1 DB4 DB7 DB2 DB3 DB5 DB6 DB9 DB10
Server 2: DB4 DB1 DB8 DB5 DB6 DB3 DB7 DB11 DB12
Server 3: DB7 DB2 DB10 DB8 DB9 DB3 DB4 DB11 DB12
Server 4: DB10 DB1 DB6 DB11 DB12 DB2 DB5 DB8 DB9

With a double server failure:


Server 1: DB1 DB4 DB7 DB2 DB3 DB5 DB6 DB9 DB10
Server 2: DB4 DB1 DB8 DB5 DB6 DB7 DB3 DB7 DB2 DB11 DB12 DB10
Server 3: DB8 DB9 DB3 DB4 DB11 DB12
Server 4: DB10 DB1 DB6 DB11 DB12 DB2 DB5 DB8 DB9

Deep Dive on Exchange 2010 High Availability Basics


Quorum Witness DAG Lifecycle DAG Networks

Quorum

Quorum

Used to ensure that only one subset of members is functioning at one time
A majority of members must be active and in communication with each other
Represents a shared view of members (voters and some resources)
Dual usage:
  Data shared between the voters representing configuration, etc.
  Number of voters required for the solution to stay running (majority); quorum is a consensus of voters

When a majority of voters can communicate with each other, the cluster has quorum
When a majority of voters cannot communicate with each other, the cluster does not have quorum

Quorum

Quorum is not only necessary for cluster functions, but it is also necessary for DAG functions
In order for a DAG member to mount and activate databases, it must participate in quorum

Exchange 2010 uses only two of the four available cluster quorum models
  Node Majority (DAGs with an odd number of members)
  Node and File Share Majority (DAGs with an even number of members)

Quorum = (N/2) + 1 (whole numbers only; round N/2 down)


6 members: (6/2) + 1 = 4 votes for quorum (can lose 3 voters)
9 members: (9/2) + 1 = 5 votes for quorum (can lose 4 voters)
13 members: (13/2) + 1 = 7 votes for quorum (can lose 6 voters)
15 members: (15/2) + 1 = 8 votes for quorum (can lose 7 voters)

Witness and Witness Server

Witness

A witness is a share on a server that is external to the DAG that participates in quorum by providing a weighted vote for the DAG member that has a lock on the witness.log file
Used only by DAGs that have an even number of members

Witness server does not maintain a full copy of quorum data and is not a member of the DAG or cluster

Witness

Represented by File Share Witness resource


File share witness cluster resource, directory, and share are automatically created and removed as needed
Uses the cluster IsAlive check for availability
If the witness is not available, cluster core resources are failed and moved to another DAG member
If the other DAG member does not bring the witness resource online, the resource remains in a Failed state, with restart attempts every 60 minutes
See http://support.microsoft.com/kb/978790 for details on this behavior

Witness

If in a Failed state and needed for quorum, the cluster will try to bring the File Share Witness resource online once
  If the witness cannot be restarted, it is considered failed and quorum is lost
  If the witness can be restarted, it is considered successful and quorum is maintained
    An SMB lock is placed on witness.log
    Node PAXOS information is incremented and the updated PAXOS tag is written to witness.log

If in an Offline state and needed for quorum, the cluster will not try to restart it; quorum is lost

Witness

When the witness is no longer needed to maintain quorum, the lock on witness.log is released
Any member that locks the witness retains the weighted vote (locking node)
  Members in contact with the locking node are in the majority and maintain quorum
  Members not in contact with the locking node are in the minority and lose quorum

Witness Server

No pre-configuration typically necessary
Exchange Trusted Subsystem must be member of local Administrators group on Witness Server if Witness Server is not running Exchange 2010

Cannot be a member of the DAG (present or future) Must be in the same Active Directory forest as DAG

Witness Server

Can be Windows Server 2003 or later
File and Printer Sharing for Microsoft Networks must be enabled

Replicating witness directory/share with DFS not supported Not necessary to cluster Witness Server
If you do cluster witness server, you must use Windows 2008

Single witness server can be used for multiple DAGs


Each DAG requires its own unique Witness Directory/Share

Database Availability Group Lifecycle

Database Availability Group Lifecycle Create a DAG


New-DatabaseAvailabilityGroup -Name DAG1 -WitnessServer EXHUB1 -WitnessDirectory C:\DAG1FSW -DatabaseAvailabilityGroupIpAddresses 10.0.0.8
New-DatabaseAvailabilityGroup -Name DAG2 -DatabaseAvailabilityGroupIpAddresses 10.0.0.8,192.168.0.8

Add Mailbox Servers to DAG


Add-DatabaseAvailabilityGroupServer -Identity DAG1 -MailboxServer EXMBX1
Add-DatabaseAvailabilityGroupServer -Identity DAG1 -MailboxServer EXMBX2

Add a Mailbox Database Copy


Add-MailboxDatabaseCopy -Identity DB1 -MailboxServer EXMBX2
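A quick verification sketch after the commands above (names reused from the example):

# Confirm DAG membership and witness configuration
Get-DatabaseAvailabilityGroup -Identity DAG1 -Status | Format-List Name,Servers,WitnessServer,WitnessDirectory

# Confirm the new copy is healthy and replicating
Get-MailboxDatabaseCopyStatus -Identity DB1\EXMBX2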

Database Availability Group Lifecycle

DAG is created initially as an empty object in Active Directory
Uses continuous replication, or third-party replication using Third Party Replication mode
  Once changed to Third Party Replication mode, the DAG cannot be changed back

DAG is given a unique name and configured for IP addresses (or configured to use DHCP)

Database Availability Group Lifecycle

When the first Mailbox server is added to a DAG


A failover cluster is formed with the name of the DAG, using Node Majority quorum
The server is added to the DAG object in Active Directory
A cluster name object (CNO) for the DAG is created in the default Computers container using the security context of the Replication service
The name and IP address of the DAG are registered in DNS
The cluster database for the DAG is updated with info about local databases

Database Availability Group Lifecycle

When the second and subsequent Mailbox servers are added to a DAG:
The server is joined to the cluster for the DAG
The quorum model is automatically adjusted
The server is added to the DAG object in Active Directory
The cluster database for the DAG is updated with info about local databases

Database Availability Group Lifecycle

After servers have been added to a DAG


Configure the DAG
Network encryption Network compression Replication port

Configure DAG networks


Network subnets
Collapse DAG networks into a single network with multiple subnets
Enable/disable MAPI traffic/replication
Block network heartbeat cross-talk (Server1\MAPI !<-> Server2\Repl)
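A minimal sketch of the DAG-level settings listed above (hypothetical DAG name; values shown are defaults or illustrative):

# Compression and encryption accept Disabled, Enabled, InterSubnetOnly, or SeedOnly
Set-DatabaseAvailabilityGroup -Identity DAG1 -NetworkCompression InterSubnetOnly -NetworkEncryption InterSubnetOnly

# Change the single TCP port used for log shipping and seeding
Set-DatabaseAvailabilityGroup -Identity DAG1 -ReplicationPort 64327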

Database Availability Group Lifecycle

After servers have been added to a DAG


Configure DAG member properties
Automatic database mount dial: BestAvailability, GoodAvailability, Lossless, custom value
Database copy automatic activation policy: Blocked, IntrasiteOnly, Unrestricted
Maximum active databases

Create mailbox database copies


Seeding is performed automatically, but you have options

Monitor health and status of database copies and perform switchovers as needed
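A minimal sketch covering the member properties above plus a manual switchover (hypothetical names; values are illustrative):

# Configure mount dial, activation policy, and maximum active databases on a member
Set-MailboxServer -Identity MBX2 -AutoDatabaseMountDial GoodAvailability -DatabaseCopyAutoActivationPolicy IntrasiteOnly -MaximumActiveDatabases 20

# Perform a database switchover to another copy
Move-ActiveMailboxDatabase -Identity DB1 -ActivateOnServer MBX2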

Database Availability Group Lifecycle



Before you can remove a server from a DAG, you must first remove all replicated databases from the server
When a server is removed from a DAG:
  The server is evicted from the cluster
  The cluster quorum is adjusted
  The server is removed from the DAG object in Active Directory

Before you can remove a DAG, you must first remove all servers from the DAG

DAG Networks

DAG Networks

A DAG network is a collection of subnets
All DAGs must have:
  Exactly one MAPI network
    The MAPI network connects DAG members to network resources (Active Directory, other Exchange servers, etc.)
  Zero or more Replication networks
    Separate network on separate subnet(s)
    Used by continuous replication only
    LRU determines which replication network to use when multiple replication networks are configured

DAG Networks

Initially created DAG networks based on enumeration of cluster networks


Cluster enumeration based on subnet One cluster network is created for each subnet

DAG Networks
Server / Network | IP Address / Subnet Bits | Default Gateway
EX1 MAPI | 192.168.0.15/24 | 192.168.0.1
EX2 MAPI | 192.168.0.16/24 | 192.168.0.1
EX1 REPLICATION | 10.0.0.15/24 | N/A
EX2 REPLICATION | 10.0.0.16/24 | N/A

Name | Subnet(s) | Interface(s) | MAPI Access Enabled | Replication Enabled
DAGNetwork01 | 192.168.0.0/24 | EX1 (192.168.0.15), EX2 (192.168.0.16) | True | True
DAGNetwork02 | 10.0.0.0/24 | EX1 (10.0.0.15), EX2 (10.0.0.16) | False | True

DAG Networks
Server / Network | IP Address / Subnet Bits | Default Gateway
EX1 MAPI | 192.168.0.15/24 | 192.168.0.1
EX1 REPLICATION | 10.0.0.15/24 | N/A
EX2 MAPI | 192.168.1.15/24 | 192.168.1.1
EX2 REPLICATION | 10.0.1.15/24 | N/A

Name | Subnet(s) | Interface(s) | MAPI Access Enabled | Replication Enabled
DAGNetwork01 | 192.168.0.0/24 | EX1 (192.168.0.15) | True | True
DAGNetwork02 | 10.0.0.0/24 | EX1 (10.0.0.15) | False | True
DAGNetwork03 | 192.168.1.0/24 | EX2 (192.168.1.15) | True | True
DAGNetwork04 | 10.0.1.0/24 | EX2 (10.0.1.15) | False | True

DAG Networks

To collapse subnets into two DAG networks and disable replication for the MAPI network:
Set-DatabaseAvailabilityGroupNetwork -Identity DAG2\DAGNetwork01 -Subnets 192.168.0.0,192.168.1.0 -ReplicationEnabled:$false
Set-DatabaseAvailabilityGroupNetwork -Identity DAG2\DAGNetwork02 -Subnets 10.0.0.0,10.0.1.0
Remove-DatabaseAvailabilityGroupNetwork -Identity DAG2\DAGNetwork03
Remove-DatabaseAvailabilityGroupNetwork -Identity DAG2\DAGNetwork04

Before:
Name | Subnet(s) | Interface(s) | MAPI Access Enabled | Replication Enabled
DAGNetwork01 | 192.168.0.0/24 | EX1 (192.168.0.15) | True | True
DAGNetwork02 | 10.0.0.0/24 | EX1 (10.0.0.15) | False | True
DAGNetwork03 | 192.168.1.0/24 | EX2 (192.168.1.15) | True | True
DAGNetwork04 | 10.0.1.0/24 | EX2 (10.0.1.15) | False | True

DAG Networks

To collapse subnets into two DAG networks and disable replication for the MAPI network:
Set-DatabaseAvailabilityGroupNetwork -Identity DAG2\DAGNetwork01 -Subnets 192.168.0.0,192.168.1.0 -ReplicationEnabled:$false
Set-DatabaseAvailabilityGroupNetwork -Identity DAG2\DAGNetwork02 -Subnets 10.0.0.0,10.0.1.0
Remove-DatabaseAvailabilityGroupNetwork -Identity DAG2\DAGNetwork03
Remove-DatabaseAvailabilityGroupNetwork -Identity DAG2\DAGNetwork04

After:
Name | Subnet(s) | Interface(s) | MAPI Access Enabled | Replication Enabled
DAGNetwork01 | 192.168.0.0/24, 192.168.1.0/24 | EX1 (192.168.0.15), EX2 (192.168.1.15) | True | False
DAGNetwork02 | 10.0.0.0/24, 10.0.1.0/24 | EX1 (10.0.0.15), EX2 (10.0.1.15) | False | True

DAG Networks

Automatic network detection occurs only when members added to DAG


If networks are added after member is added, you must perform discovery

Set-DatabaseAvailabilityGroup -DiscoverNetworks

DAG network configuration persisted in cluster registry


HKLM\Cluster\Exchange\DAG Network

DAG networks include built-in encryption and compression


Encryption: Kerberos SSP EncryptMessage/DecryptMessage APIs Compression: Microsoft XPRESS, based on LZ77 algorithm

DAGs use a single TCP port for replication and seeding


Default is TCP port 64327 If you change the port and you use Windows Firewall, you must manually change firewall rules

Deeper Dive on Exchange 2010 High Availability Advanced Features


Active Manager Best Copy Selection Datacenter Activation Coordination Mode

Active Manager

Active Manager

Exchange component that manages *overs


Runs on every server in the DAG
Selects the best available copy on failovers
Is the definitive source of information on where a database is active
  Stores this information in the cluster database
  Provides this information to other Exchange components (e.g., RPC Client Access and Hub Transport)

Active Manager

Active Manager roles


Standalone Active Manager Primary Active Manager (PAM) Standby Active Manager (SAM)

Active Manager client runs on CAS and Hub

Active Manager

Transitions of role state are logged to the Microsoft-Exchange-HighAvailability/Operational event log (Crimson Channel)

Active Manager

Primary Active Manager (PAM)


Runs on the node that owns the cluster core resources (cluster group)
Gets topology change notifications
Reacts to server failures
Selects the best database copy on *overs
Detects failures of the local Information Store and local databases

Active Manager

Standby Active Manager (SAM)


Runs on every other node in the DAG Detects failures of local Information Store and local databases
Reacts to failures by asking PAM to initiate a failover

Responds to queries from CAS/Hub about which server hosts the active copy

Both roles are necessary for automatic recovery


If the Replication service is stopped, automatic recovery will not happen

Best Copy Selection

Best Copy Selection

The process of finding the best copy to activate for an individual database, given a list of status results for the potential copies
Active Manager selects the best copy to become the new active copy when the existing active copy fails

Best Copy Selection RTM

Sorts copies by copy queue length to minimize data loss, using activation preference as a secondary sort key if necessary
Selects from the sorted list based on which set of criteria is met by each copy
Attempt Copy Last Logs (ACLL) runs and attempts to copy missing log files from the previous active copy

Best Copy Selection SP1

Sorts copies by activation preference when the auto database mount dial is set to Lossless
  Otherwise, sorts copies by copy queue length, with activation preference used as a secondary sort key if necessary

Selects from the sorted list based on which set of criteria is met by each copy
Attempt Copy Last Logs (ACLL) runs and attempts to copy missing log files from the previous active copy

Best Copy Selection

Is the database mountable? Is the copy queue length <= AutoDatabaseMountDial?

If yes, the database is marked as the current active and a mount request is issued
If not, the next best copy is tried (if one is available)

During best copy selection, any servers that are unreachable or activation blocked are ignored

Best Copy Selection


Criteria | Copy Queue Length | Replay Queue Length | Content Index Status
1 | < 10 logs | < 50 logs | Healthy
2 | < 10 logs | < 50 logs | Crawling
3 | N/A | < 50 logs | Healthy
4 | N/A | < 50 logs | Crawling
5 | N/A | < 50 logs | N/A
6 | < 10 logs | N/A | Healthy
7 | < 10 logs | N/A | Crawling
8 | N/A | N/A | Healthy
9 | N/A | N/A | Crawling
10 | Any database copy with a status of Healthy, DisconnectedAndHealthy, DisconnectedAndResynchronizing, or SeedingSource

Best Copy Selection RTM



Four copies of DB1; DB1 currently active on Server1
[Diagram: Server1 (failed, marked X), Server2, Server3, and Server4 each hold a copy of DB1]

Database Copy | Activation Preference | Copy Queue Length | Replay Queue Length | CI State | Database State
Server2\DB1 | 2 | 4 | 0 | Healthy | Healthy
Server3\DB1 | 3 | 2 | 2 | Healthy | DiscAndHealthy
Server4\DB1 | 4 | 10 | 0 | Crawling | Healthy

Best Copy Selection RTM

Sort the list of available copies by copy queue length (using activation preference as a secondary sort key if necessary):
1. Server3\DB1
2. Server2\DB1
3. Server4\DB1

Database Copy | Activation Preference | Copy Queue Length | Replay Queue Length | CI State | Database State
Server2\DB1 | 2 | 4 | 0 | Healthy | Healthy
Server3\DB1 | 3 | 2 | 2 | Healthy | DiscAndHealthy
Server4\DB1 | 4 | 10 | 0 | Crawling | Healthy

Best Copy Selection RTM

Only two copies meet the first set of criteria for activation (CQL < 10; RQL < 50; CI = Healthy):
1. Server3\DB1
2. Server2\DB1
(Server4\DB1 does not meet the criteria)
The lowest copy queue length is tried first

Database Copy | Activation Preference | Copy Queue Length | Replay Queue Length | CI State | Database State
Server2\DB1 | 2 | 4 | 0 | Healthy | Healthy
Server3\DB1 | 3 | 2 | 2 | Healthy | DiscAndHealthy
Server4\DB1 | 4 | 10 | 0 | Crawling | Healthy

Best Copy Selection SP1



Four copies of DB1; DB1 currently active on Server1; auto database mount dial set to Lossless
[Diagram: Server1 (failed, marked X), Server2, Server3, and Server4 each hold a copy of DB1]

Database Copy | Activation Preference | Copy Queue Length | Replay Queue Length | CI State | Database State
Server2\DB1 | 2 | 4 | 0 | Healthy | Healthy
Server3\DB1 | 3 | 2 | 2 | Healthy | DiscAndHealthy
Server4\DB1 | 4 | 10 | 0 | Crawling | Healthy

Best Copy Selection SP1

Sort the list of available copies by activation preference:
1. Server2\DB1
2. Server3\DB1
3. Server4\DB1

Database Copy | Activation Preference | Copy Queue Length | Replay Queue Length | CI State | Database State
Server2\DB1 | 2 | 4 | 0 | Healthy | Healthy
Server3\DB1 | 3 | 2 | 2 | Healthy | DiscAndHealthy
Server4\DB1 | 4 | 10 | 0 | Crawling | Healthy

Best Copy Selection SP1

Sort the list of available copies by activation preference:
1. Server2\DB1
2. Server3\DB1
3. Server4\DB1
The lowest preference value is tried first

Database Copy | Activation Preference | Copy Queue Length | Replay Queue Length | CI State | Database State
Server2\DB1 | 2 | 4 | 0 | Healthy | Healthy
Server3\DB1 | 3 | 2 | 2 | Healthy | DiscAndHealthy
Server4\DB1 | 4 | 10 | 0 | Crawling | Healthy

Best Copy Selection

After Active Manager determines the best copy to activate


The Replication service on the target server attempts to copy missing log files from the source (ACLL)
If successful, the database mounts with zero data loss
If unsuccessful (lossy failure), the database mounts based on the AutoDatabaseMountDial setting
If the data loss is outside the dial setting, the next copy is tried

Best Copy Selection

After Active Manager determines the best copy to activate


The mounted database generates new log files (using the same log generation sequence)
Transport Dumpster requests are initiated for the mounted database to recover lost messages
When the original server or database recovers, it runs through divergence detection and either performs an incremental resync or requires a full reseed

Datacenter Activation Coordination Mode

Datacenter Activation Coordination Mode



DAC mode is a property of a DAG
Acts as an application-level form of quorum
  Designed to prevent multiple copies of the same database from mounting on different members due to loss of network

Datacenter Activation Coordination Mode

RTM: DAC Mode is only for DAGs with three or more members that are extended to two Active Directory sites
Don't enable for two-member DAGs where each member is in a different AD site, or for DAGs where all members are in the same AD site
DAC Mode also enables use of the Site Resilience tasks:
  Stop-DatabaseAvailabilityGroup
  Restore-DatabaseAvailabilityGroup
  Start-DatabaseAvailabilityGroup

SP1: DAC Mode can be enabled for all DAGs
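Enabling DAC mode is a single property change on the DAG, for example (hypothetical DAG name):

# Enable Datacenter Activation Coordination mode; Off is the default
Set-DatabaseAvailabilityGroup -Identity DAG1 -DatacenterActivationMode DagOnly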

Datacenter Activation Coordination Mode

Uses Datacenter Activation Coordination Protocol (DACP), which is a bit in memory set to either:
0 = can't mount; 1 = can mount

Datacenter Activation Coordination Mode

Active Manager startup sequence


DACP is set to 0
The DAG member communicates with the other DAG members it can reach to determine the current value of their DACP bits
  If the starting DAG member can communicate with all other members, its DACP bit switches to 1
  If the other DACP bits are set to 0, the starting DAG member's DACP bit remains at 0
  If another DACP bit is set to 1, the starting DAG member's DACP bit switches to 1

Improvements in Service Pack 1


Replication and Copy Management enhancements in SP1

Improvements in Service Pack 1


Continuous replication changes
Enhanced to reduce data loss Eliminates log drive as single point of failure

Automatically switches between modes:


File mode (original, log file shipping) Block mode (enhanced log block shipping)

Switching process:
Initial mode is file mode
Block mode is triggered when the target needs the Exx.log file (e.g., copy queue length = 0)
All healthy passives are processed in parallel
File mode is triggered when block mode falls too far behind (e.g., copy queue length > 0)

Improvements in Service Pack 1


[Diagram: continuous replication file mode vs. block mode. In file mode, the passive copy requests "Send me the latest log files; I have log 2" and complete log files (Log File 3 onward) are shipped and inspected. In block mode, log data is written to the active copy's ESE log buffer and, in parallel, to a replication log buffer on each passive copy; the log fragment is detected, inspected, and converted into a complete log file on the target, keeping the database copy up to date.]

Improvements in Service Pack 1

SP1 introduces the RedistributeActiveDatabases.ps1 script (keeps active databases balanced across DAG members)
  Moves databases to the most preferred copy
  If cross-site, tries to balance between sites

Targetless admin switchover altered for stronger activation preference affinity


First pass of best copy selection is sorted by activation preference, not copy queue length
This trades a potentially longer activation time (a copy with more logs to play may be picked) for a better distribution of databases
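A hedged usage sketch of the RedistributeActiveDatabases.ps1 script mentioned above (DAG name is hypothetical; parameter names assumed from the SP1 Scripts folder):

# Rebalance active databases to their most preferred copies across the DAG
.\RedistributeActiveDatabases.ps1 -DagName DAG1 -BalanceDbsByActivationPreference -Confirm:$false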

Improvements in Service Pack 1

*over Performance Improvements


In RTM, a *over immediately terminated replay on the copy that was becoming active, and the mount operation performed the necessary log recovery
In SP1, a *over drives the database to clean shutdown by playing all logs on the passive copy, so no recovery is required on the new active

Improvements in Service Pack 1

DAG Maintenance Scripts


StartDAGServerMaintenance.ps1
It runs Suspend-MailboxDatabaseCopy for each database copy hosted on the DAG member
It pauses the node in the cluster, which prevents it from being and becoming the PAM
It sets the DatabaseCopyAutoActivationPolicy parameter on the DAG member to Blocked
It moves all active databases currently hosted on the DAG member to other DAG members
If the DAG member currently owns the default cluster group, it moves the default cluster group (and therefore the PAM role) to another DAG member

Improvements in Service Pack 1

DAG Maintenance Scripts


StopDAGServerMaintenance.ps1
It runs Resume-MailboxDatabaseCopy for each database copy hosted on the DAG member
It resumes the node in the cluster, which enables full cluster functionality for the DAG member
It sets the DatabaseCopyAutoActivationPolicy parameter on the DAG member to Unrestricted
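A hedged usage sketch of both maintenance scripts (server name is hypothetical; the -serverName parameter is assumed from the SP1 Scripts folder):

# Put a DAG member into maintenance: suspend copies, block activation, move actives and PAM away
.\StartDagServerMaintenance.ps1 -serverName MBX2

# Take the member back out of maintenance when work is complete
.\StopDagServerMaintenance.ps1 -serverName MBX2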

Improvements in Service Pack 1

CollectOverMetrics.ps1 and CollectReplicationMetrics.ps1 rewritten

Improvements in Service Pack 1

Exchange Management Console enhancements in SP1


Manage DAG IP addresses Manage witness server/directory and alternate witness server/directory

Switchovers and Failovers (*overs)

Exchange 2010 *Overs


Within a datacenter
Database *over Server *over

Between datacenters
Single database *over Server *over

Datacenter switchover

Single Database Cross-Datacenter *Over

Database mounted in another datacenter and another Active Directory site
Serviced by new Hub Transport servers
Different OwningServer for routing
  Transport dumpster re-delivery now from both Active Directory sites
Transport dumpster re-delivery now from both Active Directory sites

Serviced by new CAS


Different CAS URL for protocol access
Outlook Web App now redirects the connection to the second CAS farm
Other protocols proxy or redirect (varies)

Datacenter Switchover
Customers can evolve to site resilience: standalone, then local redundancy, then site resilience
Consider namespace design at first deployment
Keep extending the DAG!
Monitoring and many other concepts/skills are simply reapplied
Normal administration remains unchanged
Disaster recovery, not an HA event

Site Resilience

Agenda

Understand the steps required to build and activate a standby site for Exchange 2010
Site Resilience Overview
Site Resilience Models
Planning and Design
Site Activation Steps
Client Behavior

Site Resilience Drivers

Business requirements drive site resilience


When a risk assessment reveals a high-impact threat to meeting SLAs for data loss and loss of availability
Site resilience is required to mitigate the risk
Business requirements dictate a low recovery point objective (RPO) and recovery time objective (RTO)

Site Resilience Overview

Ensuring business continuity brings expense and complexity


A site switchover is a coordinated effort with many stakeholders that requires practice to ensure the real event is handled well

Exchange 2010 reduces cost and complexity


Low-impact testing can be performed with cross-site single database switchover

Exchange 2007
Site resilience choices

CCR+SCR and /recoverCMS
SCC+SCR and /recoverCMS
CCR stretched across datacenters
SCR and database portability
SCR and /m:RecoverServer
SCC stretched across datacenters with synchronous replication

Exchange 2010 makes it simpler

Database Availability Group (DAG) with members in different datacenters/sites


Supports automatic and manual cross-site database switchovers and failovers (*overs)
No stretched Active Directory site
No special networking needed
No /recoverCMS

Suitability of site resilience solutions


Solution | RTO goal | RPO goal | Deployment complexity
Ship backups and restore | High | High | Low
Standby Exchange 2003 clusters | Moderate | Low | High
CCR+SCR in separate AD sites | Moderate | Low | Moderate
CCR in a stretched AD site | Low | Low | High
Exchange 2010 DAGs | Low | Low | Low

Site Resilience Models

Voter Placement and Infrastructure Design

Infrastructure Design

There are two key models you have to take into account when designing site resilient solutions
Datacenter / Namespace Model User Distribution Model

When planning for site resilience, each datacenter is considered active


Exchange Server 2010 site resilience requires active CAS, Hub Transport, and UM servers in the standby datacenter
These services are used by databases mounted in the standby datacenter after a single database *over

Infrastructure Design
User Distribution Models

The locality of the users will ultimately determine your site resilience architecture
Are users primarily located in one datacenter?
Are users located in multiple datacenters?
Is there a requirement to maintain the user population in a particular datacenter?

Active/Passive user distribution model


Database copies deployed in the secondary datacenter, but no active mailboxes are hosted there

Active/Active user distribution model


User population dispersed across both datacenters with each datacenter being the primary datacenter for its specific user population

Infrastructure Design
Client Access Arrays 1 CAS array per AD site

Multiple DAGs within an AD site can use the same CAS array

FQDN of the CAS array needs to resolve to a load-balanced virtual IP address in DNS, but only in internal DNS
You need a load balancer for CAS array, as well

Set the databases in the AD site to use the CAS array via the Set-MailboxDatabase -RPCClientAccessServer property
By default, new databases have the RPCClientAccessServer value set on creation
  If the database was created before the CAS array, it is set to a random CAS FQDN (or the local machine if roles are co-located)
  If the database is created after the CAS array, it is set to the CAS array FQDN
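An illustrative sketch (hypothetical FQDN and site name):

# Create a CAS array for the AD site, then point databases at the array FQDN
New-ClientAccessArray -Name "outlook.contoso.com" -Fqdn "outlook.contoso.com" -Site "Dallas"
Get-MailboxDatabase | Set-MailboxDatabase -RpcClientAccessServer "outlook.contoso.com"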

Voter Placement

Majority of voters should be deployed in primary datacenter


Primary = datacenter with majority of user population

If user population is spread across datacenters, deploy multiple DAGs to prevent WAN outage from taking one datacenter offline

Voter Placement
[Diagram: two DAGs (DAG1 and DAG2) with members MBX01-MBX08 spread across the Portland and Seattle datacenters, each site with CAS and Hub Transport servers (CAS01-CAS04, HUB01-HUB04). Each DAG's witness (DAG1 Witness, DAG2 Witness) is placed in the datacenter hosting the majority of its users, with an alternate witness (DAG1 Alt Wit, DAG2 Alt Wit) in the other datacenter.]

Site Resilience

Namespace, Network and Certificate Planning

Planning for site resilience


Namespaces
Each datacenter is considered active and needs its own namespaces
Each datacenter needs the following namespaces:
  OWA/OA/EWS/EAS namespace
  POP/IMAP namespace
  RPC Client Access namespace
  SMTP namespace

In addition, one of the datacenters will maintain the Autodiscover namespace

Planning for site resilience


Namespaces

Best practice: use split DNS for the Exchange hostnames used by clients
Goal: minimize the number of hostnames
  mail.contoso.com for Exchange connectivity on intranet and Internet
  mail.contoso.com has different IP addresses in intranet/Internet DNS

Important: before moving down this path, be sure to map out all hostnames (outside of Exchange) that you want to create in the internal zone

Planning for site resilience


Namespaces
Datacenter 1:
  External DNS: Mail.contoso.com, Pop.contoso.com, Imap.contoso.com, Autodiscover.contoso.com, Smtp.contoso.com
  Internal DNS: Mail.contoso.com, Pop.contoso.com, Imap.contoso.com, Autodiscover.contoso.com, Smtp.contoso.com, Outlook.contoso.com
  Exchange config: ExternalURL = mail.contoso.com; CAS array = outlook.contoso.com; OA endpoint = mail.contoso.com

Datacenter 2:
  External DNS: Mail.region.contoso.com, Pop.region.contoso.com, Imap.region.contoso.com, Smtp.region.contoso.com
  Internal DNS: Mail.region.contoso.com, Pop.region.contoso.com, Imap.region.contoso.com, Smtp.region.contoso.com, Outlook.region.contoso.com
  Exchange config: ExternalURL = mail.region.contoso.com; CAS array = outlook.region.contoso.com; OA endpoint = mail.region.contoso.com

[Diagram: each datacenter contains CAS, HT, MBX, and AD servers]

Planning for site resilience


Network

Design High Availability for Dependencies


Active Directory
Network services (DNS, TCP/IP, etc.)
Telephony services (Unified Messaging)
Backup services
Infrastructure (power, cooling, etc.)

Planning for site resilience


Network Latency

Must have less than 250 ms round trip

Network cross-talk must be blocked


Router ACLs should be used to block traffic between MAPI and replication networks If DHCP is used for the replication network, DHCP can be used to deploy static routes

Lower TTL for all Exchange records to 5 minutes


OWA/EAS/EWS/OA, IMAP/POP, SMTP, RPCCAS Both internal and external DNS zone

Planning for site resilience


Certificates
Certificate Type | Pros | Cons
Wildcard certs | One cert for both sites; flexible if names change | Wildcard certs can be expensive, or impossible to obtain; WM 5 clients don't work with wildcard certs; setting the Cert Principal Name to *.company.com is global to all CAS in the forest
Same config in both sites: intelligent firewall | Traffic is forwarded to the correct CAS | Requires ISA or another firewall that can forward based on properties; additional hardware required; AD replication delays affect publishing rules
Same config in both sites: load balancer | Load balancer can listen for both external names and forward to the correct CAS; just an A record change required after site failover | Requires multiple certificates; requires multiple IPs; requires a load balancer
Manipulate Cert Principal Name | Minimal configuration changes required after failover; works with all clients | No way to run the DR site as active during normal operation; setting the Cert Principal Name to mail.company.com is global to all CAS in the forest

Planning for site resilience


Certificates

Best practice: minimize the number of certificates


1 certificate for all CAS servers + reverse proxy + Edge/Hub
Use a Subject Alternative Name (SAN) certificate, which can cover multiple hostnames
1 additional certificate if using OCS

OCS requires certificates with <=1024 bit keys and the server name in the certificate principal name

If leveraging a certificate per datacenter, ensure the Certificate Principal Name is the same on all certificates
Outlook Anywhere won't connect if the Principal Name on the certificate does not match the value configured in msstd: (by default it matches the OA RPC endpoint)
Set-OutlookProvider EXPR -CertPrincipalName msstd:mail.contoso.com

Datacenter Switchover

Switchover Tasks

Datacenter Switchover Process



Failure occurs
Activation decision
Terminate the partially running primary datacenter
Activate the secondary datacenter
  Validate prerequisites
  Activate mailbox servers
  Activate other roles (in parallel with the previous step)

Service is restored

Datacenter Switchovers
Primary to Standby:
1. Primary site fails
2. Stop-DatabaseAvailabilityGroup <DAGName> -ActiveDirectorySite <PSiteName> -ConfigurationOnly (run this in both datacenters)
3. Stop-Service clussvc
4. Restore-DatabaseAvailabilityGroup <DAGName> -ActiveDirectorySite <SSiteName>
5. Databases mount (assuming no activation blocks)
6. Adjust DNS records for SMTP and HTTPS

Standby to Primary:
1. Verify all services are working
2. Start-DatabaseAvailabilityGroup <DAGName> -ActiveDirectorySite <PSiteName>
3. Set-DatabaseAvailabilityGroup <DAGName> -WitnessDirectory <Directory> -WitnessServer <ServerName>
4. Reseed data
5. Schedule downtime for dismount
6. Change DNS records back
7. Move-ActiveMailboxDatabase <DBName> -ActivateOnServer <ServerName>
8. Mount databases in the primary datacenter
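For illustration only, the primary-to-standby steps with concrete hypothetical names, assuming DAG1 spans Portland (primary) and Seattle (standby):

# Run in both datacenters: mark the failed primary-site members as stopped (configuration only)
Stop-DatabaseAvailabilityGroup -Identity DAG1 -ActiveDirectorySite Portland -ConfigurationOnly

# On each surviving DAG member in the standby site: stop the cluster service
Stop-Service clussvc

# Activate the standby-site members (uses the alternate witness if one is configured)
Restore-DatabaseAvailabilityGroup -Identity DAG1 -ActiveDirectorySite Seattle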

Datacenter Switchover Tasks



Stop-DatabaseAvailabilityGroup
  Adds failed servers to the stopped list
  Removes servers from the started list

Restore-DatabaseAvailabilityGroup
  Forces quorum
  Evicts stopped nodes
  Starts using the alternate file share witness if necessary

Start-DatabaseAvailabilityGroup
  Removes servers from the stopped list
  Joins servers to the cluster
  Adds joined servers to the started list

Client Experiences
Typical Outlook Behavior

All Outlook versions behave consistently in a single datacenter scenario
Profile points to RPC Client Access Server array Profile is unchanged by failovers or loss of CAS

All Outlook versions should behave consistently in a datacenter switchover scenario


The primary datacenter Client Access Server DNS name is bound to the IP address of the standby datacenter's Client Access Server
Autodiscover continues to hand out the primary datacenter CAS name as the Outlook RPC endpoint
The profile remains unchanged

Client Experiences

Outlook Cross-Site DB Failover Experience


Behavior is to perform a direct connect from the CAS array in the first datacenter to the mailbox hosting the active copy in the second datacenter You can only get a redirect to occur by changing the RPCClientAccessServer property on the database

Client Experiences
Other Clients

Other client behavior varies based on protocol and scenario


Client | In-Site *Over Scenario | Out-of-Site *Over Scenario | Datacenter Switchover
OWA | Reconnect | Manual redirect | Reconnect
OA | Reconnect | Reconnect / Autodiscover | Reconnect
EAS | Reconnect | Redirect or proxy | Reconnect
POP/IMAP | Reconnect | Proxy | Reconnect
EWS | Reconnect | Autodiscover | Reconnect
Autodiscover | Reconnect | Seamless | Reconnect
SMTP / PowerShell | N/A | N/A | N/A

End of Exchange 2010 High Availability Module

For More Information



Exchange Server Tech Center
http://technet.microsoft.com/en-us/exchange/default.aspx

Planning services
http://technet.microsoft.com/en-us/library/cc261834.aspx

Microsoft IT Showcase Webcasts


http://www.microsoft.com/howmicrosoftdoesitwebcasts

Microsoft TechNet
http://www.microsoft.com/technet/itshowcase

© 2008 Microsoft Corporation. All rights reserved. Microsoft, Windows, Windows Vista and other product names are or may be registered trademarks and/or trademarks in the U.S. and/or other countries. The information herein is for informational purposes only and represents the current view of Microsoft Corporation as of the date of this presentation. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information provided after the date of this presentation. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.
