Exchange 2010 High Availability
Ideal audience for this workshop: Messaging SME, Network SME, Security SME
Agenda
Review of Exchange Server 2007 Availability Solutions
Overview of Exchange 2010 High Availability
Exchange 2010 High Availability Fundamentals
Exchange 2010 High Availability Deep Dive
Exchange 2010 Site Resilience
SCC out-of-box provides little high availability value:
On Store failure, SCC restarts the Store on the same machine; no CMS failover
SCC does not automatically recover from storage failures
SCC does not protect your data, your most valuable asset
SCC does not protect against site failures
The SCC redundant network is not leveraged by CMS
Conclusion: SCC only provides protection from server hardware failures and bluescreens, the relatively easy components to recover
Supports rolling upgrades without losing redundancy
Log
Log
Database
1. Copy logs
2. Inspect logs
3. Replay logs
Local
Cluster
File Share
Standby
AD site: Dallas
Client Access Server
DB4 DB5
Standby Server
DB6
SCR
CCR #1 Node A
CCR #1 Node B
CCR #2 Node A
CCR #2 Node B
SCR managed separately; no GUI. Clustering knowledge required. Database failure requires server failover.
Windows cluster
Windows cluster
Client
DB1 DB3
Mailbox Server 6
DB5
Mailbox Server 1
Mailbox Server 2
Mailbox Server 3
Mailbox Server 4
Mailbox Server 5
DB2 DB3
DB5 DB1
DB4
DB2
Defines the boundary of database replication
Defines the boundary of failover/switchover (*over)
Defines the boundary for the DAG's Active Manager
Mailbox Server 1 Mailbox Server 2 Mailbox Server 3 Mailbox Server 4 Mailbox Server 16
DB3 DB4
DB3 DB4
Unit of *over. A database has one active copy; the active copy can be mounted or dismounted. Maximum # of passive copies == # of servers in DAG minus 1.
Mailbox Server 1 Mailbox Server 2 Mailbox Server 3
Availability Terms
Active: Selected to provide email services to clients. Passive: Available to provide email services to clients if the active fails.
Replication Terms
Source: Provides data for copying to a separate location Target: Receives data from the source
Mailbox Server 1
Mailbox Server 2
Tracks database state information (such as active and mounted). The Replication service monitors the health of all mounted databases, and monitors ESE for I/O errors or failures.
The Replication service on the target notifies the active instance of the next log file it expects, based on the last log file it inspected.
The Replication service on the source responds by sending the required log file(s). Copied log files are placed in the target's Inspector directory.
The following actions are performed to verify the log file before replay:
Physical integrity inspection
Header inspection
Move any Exx.log files that exist on the target to the ExxOutofDate folder, if the target was previously a source
If inspection fails, the file is recopied and inspected (up to 3 times). If the log file passes inspection, it is moved into the database copy's log directory.
Replay the log file using a special recovery mode (undo phase is skipped)
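The inspect-and-retry admission flow above can be sketched as follows. This is an illustrative model, not the actual Replication service code; `inspect` and `recopy` are stand-ins for the physical-integrity/header checks and the recopy-from-source step.

```python
MAX_ATTEMPTS = 3  # per the slide: a failed log is recopied and inspected up to 3 times

def admit_log(name, inspect, recopy):
    """Admit a copied log file into the database copy's log directory.

    `inspect` performs the physical integrity and header checks;
    `recopy` fetches the file from the source again after a failure.
    """
    for _ in range(MAX_ATTEMPTS):
        if inspect(name):
            return f"logdir/{name}"  # passed: move out of Inspector into the log directory
        recopy(name)                 # failed: recopy from the source and try again
    raise RuntimeError(f"{name} failed inspection {MAX_ATTEMPTS} times")
```

Only after a log passes inspection does it become eligible for replay with the special recovery mode described above.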
The mounted database will generate new log files (using the same log generation sequence).
Transport Dumpster requests will be initiated for the mounted database to recover lost messages.
When the original server or database recovers, it will run through divergence detection and perform an incremental reseed, or require a full reseed.
Streaming backup APIs for public use have been cut; you must use VSS for backups.
Backup from any copy of the database/logs; always choose a passive (or active) copy.
Backup an entire server.
Designate a dedicated backup server for a given database.
Restore from any of these backup scenarios.
Database Availability Group
Mailbox Server 1
Mailbox Server 2
Mailbox Server 3
DB1
DB2
DB3
VSS requestor
Mailbox Server 1
Mailbox Server 2
Each DAG member can host only one copy of a given mailbox database
Database path and log folder path for copy must be identical on all members
A lagged copy is a passive database copy with a replay lag time greater than 0. Lagged copies provide point-in-time protection, but they are not a replacement for point-in-time backups.
Logical corruption and/or mailbox deletion protection scenarios. Provide a maximum of 14 days of protection.
DAG Design
Two failure models: design for all database copies activated.
Design for the worst case: the server architecture handles 100 percent of all hosted database copies becoming active.
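As a back-of-the-envelope check of this sizing rule, the per-server copy count can be computed like this (illustrative helper, not an Exchange tool; the 40-database/8-server figures are inferred from the example layout below):

```python
import math

def worst_case_active(total_dbs, copy_count, servers):
    """Copies each DAG member hosts with an even layout; the design rule
    above sizes every server for all of them becoming active at once."""
    return math.ceil(total_dbs * copy_count / servers)

# Example layout: 40 databases, 2 copies each, spread over 8 servers
# -> each server hosts 10 copies (e.g., Server 1: DB1-DB5 plus DB36-DB40)
# and must be sized for all 10 becoming active.
```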
DAG Design
Life is good
Server 1: DB1 DB2 DB3 DB4 DB5 | DB36 DB37 DB38 DB39 DB40
Server 2: DB6 DB7 DB8 DB9 DB10 | DB31 DB32 DB33 DB34 DB35
DAG Design
Quorum
Used to ensure that only one subset of members is functioning at one time.
A majority of members must be active and have communications with each other.
Represents a shared view of members (voters and some resources).
Dual usage:
Data shared between the voters, representing configuration, etc.
The number of voters required for the solution to stay running (majority); quorum is a consensus of voters.
When a majority of voters can communicate with each other, the cluster has quorum When a majority of voters cannot communicate with each other, the cluster does not have quorum
Quorum
Quorum is not only necessary for cluster functions, but it is also necessary for DAG functions
In order for a DAG member to mount and activate databases, it must participate in quorum
Exchange 2010 uses only two of the four available cluster quorum models
Node Majority (DAGs with an odd number of members) Node and File Share Majority (DAGs with an even number of members)
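The majority rule and the model choice can be sketched as follows. This is an illustrative model only; the cluster service applies these rules, not the administrator.

```python
def has_quorum(responsive_voters, total_voters):
    """Quorum = a strict majority of voters can communicate."""
    return responsive_voters > total_voters // 2

def quorum_model(dag_members):
    """Exchange 2010 picks the model from member-count parity; an even
    DAG adds the file share witness as the tie-breaking vote."""
    return "Node Majority" if dag_members % 2 else "Node and File Share Majority"

# A 4-member DAG uses the witness, so it has 5 voters: losing two
# members still leaves 3 of 5 voters, and the DAG keeps quorum.
```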
Witness
A witness is a share on a server that is external to the DAG that participates in quorum by providing a weighted vote for the DAG member that has a lock on the witness.log file
Used only by DAGs that have an even number of members
Witness server does not maintain a full copy of quorum data and is not a member of the DAG or cluster
Witness
If in a Failed state and needed for quorum, the cluster will try to bring the File Share Witness resource online once.
If the witness cannot be restarted, it is considered failed and quorum is lost; if the witness can be restarted, the restart is considered successful and quorum is maintained.
An SMB lock is placed on witness.log; the node's PAXOS information is incremented and the updated PAXOS tag is written to witness.log.
If in an Offline state and needed for quorum, the cluster will not try to restart it, and quorum is lost.
Witness
When the witness is no longer needed to maintain quorum, the lock on witness.log is released. Any member that locks the witness retains the weighted vote (locking node).
Members in contact with the locking node are in the majority and maintain quorum; members not in contact with the locking node are in the minority and lose quorum.
Witness Server
No pre-configuration typically necessary
Exchange Trusted Subsystem must be member of local Administrators group on Witness Server if Witness Server is not running Exchange 2010
Cannot be a member of the DAG (present or future) Must be in the same Active Directory forest as DAG
Witness Server
Can be Windows Server 2003 or later
File and Printer Sharing for Microsoft Networks must be enabled
Replicating witness directory/share with DFS not supported Not necessary to cluster Witness Server
If you do cluster witness server, you must use Windows 2008
DAG is created initially as empty object in Active Directory Continuous replication or 3rd party replication using Third Party Replication mode
Once changed to Third Party Replication mode, the DAG cannot be changed back
DAG is given a unique name and configured for IP addresses (or configured to use DHCP)
When the second and each subsequent Mailbox server is added to a DAG:
The server is joined to the cluster for the DAG
The quorum model is automatically adjusted
The server is added to the DAG object in Active Directory
The cluster database for the DAG is updated with info about local databases
Monitor health and status of database copies and perform switchovers as needed
Before you can remove a DAG, you must first remove all servers from the DAG
DAG Networks
A DAG network is a collection of subnets All DAGs must have:
Exactly one MAPI network
MAPI network connects DAG members to network resources (Active Directory, other Exchange servers, etc.)
DAG Networks
Server / Network | IP Address / Subnet Bits | Default Gateway
EX1 MAPI | 192.168.0.15/24 | 192.168.0.1
EX1 REPLICATION | 10.0.0.15/24 | N/A
EX2 MAPI | 192.168.0.16/24 | 192.168.0.1
EX2 REPLICATION | 10.0.0.16/24 | N/A
DAG Networks
Server / Network | IP Address / Subnet Bits | Default Gateway
EX1 MAPI | 192.168.0.15/24 | 192.168.0.1
EX1 REPLICATION | 10.0.0.15/24 | N/A
EX2 MAPI | 192.168.1.15/24 | 192.168.1.1
EX2 REPLICATION | 10.0.1.15/24 | N/A
DAG Networks
To collapse subnets into two DAG networks and disable replication for the MAPI network:
Set-DatabaseAvailabilityGroupNetwork -Identity DAG2\DAGNetwork01 -Subnets 192.168.0.0,192.168.1.0 -ReplicationEnabled:$false
Set-DatabaseAvailabilityGroupNetwork -Identity DAG2\DAGNetwork02 -Subnets 10.0.0.0,10.0.1.0
Remove-DatabaseAvailabilityGroupNetwork -Identity DAG2\DAGNetwork03
Remove-DatabaseAvailabilityGroupNetwork -Identity DAG2\DAGNetwork04
Name | Subnet(s) | Interface(s) | MAPI Access Enabled | Replication Enabled
DAGNetwork01 | 192.168.0.0/24 | EX1 (192.168.0.15) | True | True
DAGNetwork02 | 10.0.0.0/24 | EX1 (10.0.0.15) | False | True
DAGNetwork03 | 192.168.1.0/24 | EX2 (192.168.1.15) | True | True
DAGNetwork04 | 10.0.1.0/24 | EX2 (10.0.1.15) | False | True
DAG Networks
To collapse subnets into two DAG networks and disable replication for the MAPI network:
Set-DatabaseAvailabilityGroupNetwork -Identity DAG2\DAGNetwork01 -Subnets 192.168.0.0,192.168.1.0 -ReplicationEnabled:$false
Set-DatabaseAvailabilityGroupNetwork -Identity DAG2\DAGNetwork02 -Subnets 10.0.0.0,10.0.1.0
Remove-DatabaseAvailabilityGroupNetwork -Identity DAG2\DAGNetwork03
Remove-DatabaseAvailabilityGroupNetwork -Identity DAG2\DAGNetwork04
Name | Subnet(s) | Interface(s) | MAPI Access Enabled | Replication Enabled
DAGNetwork01 | 192.168.0.0/24, 192.168.1.0/24 | EX1 (192.168.0.15), EX2 (192.168.1.15) | True | False
DAGNetwork02 | 10.0.0.0/24, 10.0.1.0/24 | EX1 (10.0.0.15), EX2 (10.0.1.15) | False | True
DAG Networks
Set-DatabaseAvailabilityGroup -Identity <DAGName> -DiscoverNetworks
Active Manager
Transitions of role state are logged to the Microsoft-Exchange-HighAvailability/Operational event log (Crimson Channel)
Active Manager
Responds to queries from CAS/Hub about which server hosts the active copy
The process of finding the best copy to activate for an individual database, given a list of status results for the potential copies. Active Manager selects the best copy to become the new active copy when the existing active copy fails.
Sorts copies by activation preference when the auto database mount dial is set to Lossless; otherwise, sorts copies by copy queue length to minimize data loss, with activation preference used as a secondary sorting key if necessary.
Selects from the sorted list based on which set of criteria is met by each copy.
Attempt Copy Last Logs (ACLL) runs and attempts to copy missing log files from the previous active copy.
During best copy selection, any servers that are unreachable or activation blocked are ignored.
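The selection order described above can be sketched as a filtered sort. This is a simplified model: real Active Manager also applies the criteria sets and ACLL afterwards, and the field names here are illustrative.

```python
def best_copy_order(copies, lossless=False):
    """Order candidate copies for activation.

    Unreachable or activation-blocked copies are ignored. With the auto
    database mount dial at Lossless, sort purely by activation
    preference; otherwise sort by copy queue length (least data loss
    first), with activation preference as the tie-breaker.
    """
    candidates = [c for c in copies
                  if c.get("reachable", True) and not c.get("activation_blocked", False)]
    if lossless:
        return sorted(candidates, key=lambda c: c["activation_preference"])
    return sorted(candidates,
                  key=lambda c: (c["copy_queue_length"], c["activation_preference"]))
```

For example, three surviving copies with preferences 2, 3, and 4 sort with the shortest copy queues first, and preference only breaks ties.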
[Criteria table: ten criteria sets, each combining thresholds on copy queue length (< 10 logs), replay queue length (< 50 logs), and content index state (Healthy or Crawling). Set 1 is the strictest; later sets progressively relax the requirements, with N/A marking criteria not evaluated for that set.]
[Diagram: DB1 has four copies; the failed active copy is marked with an X. A table lists the surviving copies with activation preferences 2, 3, and 4, along with each copy's CI state and database state.]
Sort the list of available copies by copy queue length (using activation preference as a secondary sort key if necessary):
Server3\DB1, Server2\DB1, Server4\DB1
Only two copies meet the first set of criteria for activation (CQL < 10; RQL < 50; CI = Healthy):
Server3\DB1, Server2\DB1, Server4\DB1
The copy with the lowest copy queue length is tried first.
RTM: DAC mode is only for DAGs with three or more members that are extended to two Active Directory sites.
Don't enable it for two-member DAGs where each member is in a different AD site, or for DAGs where all members are in the same AD site.
DAC mode also enables use of the site resilience tasks:
Stop-DatabaseAvailabilityGroup Restore-DatabaseAvailabilityGroup Start-DatabaseAvailabilityGroup
Uses Datacenter Activation Coordination Protocol (DACP), which is a bit in memory set to either:
0 = can't mount; 1 = can mount
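The DACP rule can be sketched as a small check per member. This is a simplified model of the documented behavior, not the actual cluster code:

```python
def may_mount(my_bit, reachable_bits, other_members):
    """DAC mode mount check for one DAG member.

    A member boots with its DACP bit at 0 and may flip it to 1 (and
    mount databases) only if a reachable member already has its bit at
    1, or it can communicate with every other DAG member.
    """
    if my_bit == 1:
        return True          # bit already set: mounting is allowed
    if 1 in reachable_bits:
        return True          # another member has activated: safe to mount
    return len(reachable_bits) == other_members  # full contact with the DAG
```

This is what prevents split brain after a datacenter failure: a recovering minority that cannot reach an activated member, or the whole DAG, keeps its bit at 0 and refuses to mount.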
Continuous replication changes
Enhanced to reduce data loss; eliminates the log drive as a single point of failure
Switching process:
Initial mode is file mode
Block mode is triggered when the target needs the Exx.log file (e.g., copy queue length = 0)
All healthy passives are processed in parallel
File mode is triggered when block mode falls too far behind (e.g., copy queue length > 0)
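The mode-switch heuristic above, as a tiny state function (illustrative only):

```python
def next_mode(current_mode, copy_queue_length):
    """Continuous replication mode switching per the slide: enter block
    mode once the copy queue drains to 0 (the target needs the live
    Exx.log), fall back to file mode when block mode lags behind
    (the queue grows above 0)."""
    if current_mode == "file" and copy_queue_length == 0:
        return "block"
    if current_mode == "block" and copy_queue_length > 0:
        return "file"
    return current_mode
```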
[Diagram: block mode replication. On the target, the log is built up from shipped writes while the database copy stays up to date; when a log fragment is complete, it is detected, inspected, and converted to a complete log file (Log File 1 through Log File 7).]
SP1 introduces the RedistributeActiveDatabases.ps1 script (keeps active database copies balanced across DAG members)
Moves databases to the most preferred copy; if cross-site, tries to balance between sites
Within a datacenter
Database *over Server *over
Between datacenters
Single database *over Server *over
Datacenter switchover
Database mounted in another datacenter and another Active Directory site
Serviced by new Hub Transport servers
Different OwningServer for routing
Transport dumpster re-delivery now from both Active Directory sites
Datacenter Switchover
Customers can evolve to site resilience: standalone, then local redundancy, then site resilience
Consider namespace design at first deployment
Keep extending the DAG!
Monitoring and many other concepts/skills are simply reapplied; normal administration remains unchanged
Disaster recovery, not an HA event
Site Resilience
Agenda
Understand the steps required to build and activate a standby site for Exchange 2010
Site Resilience Overview Site Resilience Models Planning and Design Site activation steps Client Behavior
Exchange 2007
Site resilience choices
CCR+SCR and /recoverCMS
SCC+SCR and /recoverCMS
CCR stretched across datacenters
SCR and database portability
SCR and /m:RecoverServer
SCC stretched across datacenters with synchronous replication
Infrastructure Design
There are two key models you have to take into account when designing site resilient solutions
Datacenter / Namespace Model User Distribution Model
Infrastructure Design
User Distribution Models
The locality of the users will ultimately determine your site resilience architecture
Are users primarily located in one datacenter? Are users located in multiple datacenters? Is there a requirement to maintain user population in a particular datacenter?
Infrastructure Design
Client Access Arrays 1 CAS array per AD site
Multiple DAGs within an AD site can use the same CAS array
FQDN of the CAS array needs to resolve to a load-balanced virtual IP address in DNS, but only in internal DNS
You need a load balancer for the CAS array as well
Set the databases in the AD site to utilize the CAS array via the Set-MailboxDatabase -RpcClientAccessServer property; by default, new databases have this value set on creation
If the database was created prior to creating the CAS array, it is set to a random CAS FQDN (or the local machine with role co-location); if the database is created after creating the CAS array, it is set to the CAS array FQDN
Voter Placement
If user population is spread across datacenters, deploy multiple DAGs to prevent WAN outage from taking one datacenter offline
Voter Placement
[Diagram: Portland and Seattle datacenters hosting CAS01-CAS04, HUB01-HUB04, and MBX01-MBX08 across two DAGs. DAG1 and DAG2 each stretch both sites; each DAG's witness (DAG1 Witness, DAG2 Witness) sits in one site, with an alternate witness (Alt Wit) in the other.]
Site Resilience
Best practice: use split DNS for the Exchange hostnames used by clients. Goal: minimize the number of hostnames.
mail.contoso.com for Exchange connectivity on intranet and Internet; mail.contoso.com has different IP addresses in intranet/Internet DNS.
Important: before moving down this path, be sure to map out all host names (outside of Exchange) that you want to create in the internal zone.
Datacenter 1
Exchange Config: ExternalURL = mail.contoso.com; CAS Array = outlook.contoso.com; OA endpoint = mail.contoso.com
Internal DNS: mail.contoso.com, pop.contoso.com, imap.contoso.com, autodiscover.contoso.com, smtp.contoso.com, outlook.contoso.com

Datacenter 2
External DNS: mail.region.contoso.com, pop.region.contoso.com, imap.region.contoso.com, smtp.region.contoso.com
Exchange Config: ExternalURL = mail.region.contoso.com; CAS Array = outlook.region.contoso.com; OA endpoint = mail.region.contoso.com
Internal DNS: mail.region.contoso.com, pop.region.contoso.com, imap.region.contoso.com, smtp.region.contoso.com, outlook.region.contoso.com

[Diagram: each datacenter hosts CAS, HT, MBX, and AD servers behind an intelligent firewall and load balancer.]
Load Balancer
The load balancer can listen for both external names and forward to the correct CAS
Just an A record change required after site failover; minimal configuration changes required
Works with all clients
OCS requires certificates with <=1024 bit keys and the server name in the certificate principal name
If leveraging a certificate per datacenter, ensure the Certificate Principal Name is the same on all certificates
Outlook Anywhere won't connect if the Principal Name on the certificate does not match the value configured in msstd: (the default matches the OA RPC endpoint)
Set-OutlookProvider EXPR -CertPrincipalName msstd:mail.contoso.com
Datacenter Switchover
Switchover Tasks
Service is restored
Datacenter Switchovers
Primary to Standby
1. Primary site fails
2. Stop-DatabaseAvailabilityGroup <DAGName> -ActiveDirectorySite <PSiteName> -ConfigurationOnly (run this in both datacenters)
3. Stop-Service clussvc
4. Restore-DatabaseAvailabilityGroup <DAGName> -ActiveDirectorySite <SSiteName>
5. Databases mount (assuming no activation blocks)
6. Adjust DNS records for SMTP and HTTPS

Standby to Primary
1. Verify all services working
2. Start-DatabaseAvailabilityGroup <DAGName> -ActiveDirectorySite <PSiteName>
3. Set-DatabaseAvailabilityGroup <DAGName> -WitnessDirectory <Directory> -WitnessServer <ServerName>
4. Reseed data
5. Schedule downtime for dismount
6. Change DNS records back
7. Move-ActiveMailboxDatabase <DBName> -ActivateOnServer <ServerName>
8. Mount databases in primary datacenter
Restore-DatabaseAvailabilityGroup
Forces quorum, evicts stopped nodes, and starts using the alternate file share witness if necessary
Start-DatabaseAvailabilityGroup
Removes servers from the stopped list, joins them to the cluster, and adds the joined servers to the started list
Client Experiences
Typical Outlook Behavior
All Outlook versions behave consistently in a single datacenter scenario
Profile points to RPC Client Access Server array Profile is unchanged by failovers or loss of CAS
Client Experiences
Other Clients
Planning services
http://technet.microsoft.com/en-us/library/cc261834.aspx
Microsoft TechNet
http://www.microsoft.com/technet/itshowcase
© 2008 Microsoft Corporation. All rights reserved. Microsoft, Windows, Windows Vista and other product names are or may be registered trademarks and/or trademarks in the U.S. and/or other countries. The information herein is for informational purposes only and represents the current view of Microsoft Corporation as of the date of this presentation. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information provided after the date of this presentation. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.