
VMware Site Recovery Manager

Reference & Troubleshooting Guide


(Version: SRM Reference Guide_x.docx)

Information to help you with your SRM experience!

This guide has been created for VMware SEs, as well as our partners' SEs, who are responsible for working with our products at customer sites. It provides design, scalability, troubleshooting, and general information about SRM. It is intended for knowledgeable practitioners who are VMware staff or VMware partners. The information in this guide can help an experienced virtualization system engineer, but it can also hurt if you do not know what you are doing. This information comes with no warranty, implied or otherwise, and it is not sanctioned by VMware. Corrections and suggestions are gratefully welcomed at mwhite@vmware.com.

Contents
Background
Educational materials
Some things to think about for a successful SRM project
When is SRM not a good solution?
Install / Uninstall Information
  Where should I get SRAs from?
  Install account
  Install and Configure information for specific environments
  Install Overview
  Install Test Outline
  Uninstall Information
  Installing (uninstalling) on Windows 2008

Upgrade / Patch Information
  What are the SRM build numbers?
  Upgrading to SRM 4.1.1
  Upgrading to SRM 4.1 including upgrading to vSphere VirtualCenter 4.1
    VirtualCenter 4.1
    Upgrade to SRM 4.1
    Migration - SRM
    Undo
  To our next major release - SRM 4.0

Design Guidelines
  What goes wrong in SRM projects?
  Large VI environments
  Suggested Recommendations - aka Best Practices
  Failback Outline
  Bandwidth Usage
  Multiple Tier Applications
  Application References
  Protecting View Desktops
  Physical to virtual disaster recovery - P2V DR
  Shared Recovery
  Failback (plug-ins)
  A lost protected site and failing back to it
  A sample recovery plan for testing an application
    Exchange Recovery Plan
  Adding scripts to a Recovery Plan in a call out
  What should I the PowerShell command look like to have it called from SRM?
  How can I see the environment variables that the admin guide says are available for scripts?
  Can a script execution in a recovery plan impact the inside of a protected VM?
  Will a non-zero script exit in a recovery plan stop the recovery plan?
  User designed callout has returned a non-zero value: 1
SRM Administration Information

SRM Reference Guide


  What VM parameters are not failed over?
  Does number of PG impact order of start for high priority VMs?
  What about backing up the SRM databases?
  Can I change the Run button to work like the Test button?
  Can I use VMware Heartbeat to protect SRM and VC?
  How can I capture the log and configuration information for support to work with?
  Where are the SRM server logs stored?
  How do I capture the SRM plug-in log and config info?
  Where are the Linux Image Customization logs stored?
  I would like to retain the SRM logs longer
  What happens when
    I add a new hard drive to an existing and successfully protect VM?
    I add CPU and memory to an existing protected VM
    I add a network card to an existing protected VM
    I add a new VM to an existing protection group
    I remove a protected VM from a protection group
  What travels with VMs between PG and recovery plans?
  How can I tell the SRM version from the log files?
  Installation logs
  Automated Install
  Changing log details
  I would like to have a automated SRM type solution without SRM
  How can I have SSL communications between SRM and NetApp
  What happens when I Storage VMotion a protect VM or how does changes to VM storage affects protection?
  What should I know about using the bulk IP utility?
  SRM Licensing Information
    How does the SRM 4.1 licensing work?
    How does the SRM 4.0 licensing work?
    How does the SRM 1.0 licensing work?
    What does it look like if my VI is licensed for SRM?
    What does it look like if my vSphere is licensed for SRM - after Update 1?
    What will happen if my license expires?
  What is the account that is asked for during install used for?
  Is Essentials and Essentials Plus supported for SRM?
  How do I plan for disk utilization due to SRM database?
  I would like to use trusted certificates with SRM - help!
  Can I change the IP information for the SRM server?
  Can network customization work for operating systems other than Windows?
  Understanding order of operation for bringing VMs back online
  How many VMs can SRM start?
  Can I start more than, or less than, 2 VMs per host?
  What does the Repair button do?
  Is it all over when the recovery plan fails?
  Can I move an SRM server to a new host?
  How can I configure a second HBA rescan?
  Recommended minimum alarm notifications
  SRM VirtualCenter events
  Is thin provisioned VMs support with SRM?
  What does Microsoft offer for licenses for DR test?

  What vendors have application consistency options?
  What vendors have application consistency options that work with continuous replication?
  What rights does a user require to be a DR operator?
  SRM service doesn't start, and event logs show errors with event ID of 7000 and 7009
  How can I have syntax highlighting to help read SRM log files?
    Text Wrangler
    EditPlus

Troubleshooting
  Things to watch out for
  How can I change the command Timeout?
  My Celerra prepare storage fails, and the error has a null in it
  Where is the new Run and Test privileges?
  I have accidently deleted my Shadow VMs - what should I do to fix this?
  SQL Authentication, and database access issues
  Why cannot I customize Windows 2008?
  Why does my recovery plan show error on VM status but the VMs are ok?
  ESX 2.5 accessing protected datastore will cause recomputed datastore failures
  What causes the Recompute Datastore Group task?
  Why is my IP customization taking about 10 minutes extra per VM?
  When using Bulk Import I get column errors
  I would like to avoid the messages about shutdown
  Unable to find any array script files Please check your SRM installation
  My Linux VMs don't have the host file changed after IP customization
  dr.secondary.fault.WrongVmInventoryPlacement
  Pairing Issues
  I cannot run more than one simultaneous recovery plan with my MirrorView SRA
  What time guidelines can I expect for protecting VMs?
  What time guidelines can I expect for failing over VMs?
  When trying to do Inventory Mappings the VI Client hangs
  Failed to connect to the management system address when executing the discoverArrays command.
  How can I re-initialize the SRM database
  Error LUNs with duplicate IDs or numbers received from SAN integration scripts
  Error: Failed to recover datastore:
  SRM unlicensed error in logs but you have a good license
  I cannot uninstall SRM successfully - what can I do?
  SRM doesn't start, and you just uninstalled an SRA
  Unable to create placeholder virtual machine at the recovery site: host, resource pool, and datastore are not compatible
  Network device needed by recovered virtual machine could not be found at recovery or test time
  SRM doesn't start and nothing in SRM logs or event logs - what to do?
  Only three Recovery Plans can run at the same time
  Why is Port 80 used in the install but port 443 later?
  Failed to test failover luns. Existing with failure
  I can't install the plug in - get an error
  For SQL server use, does the SRM DB user need the DB_OWNER permission?
  Unexpected MethodFault (dr.san.fault.ManagementSystemNotFound)

  Changing passwords after SRM is working
  My recovery site is only using x number of hosts to start VMs but it should be using y number
  Error: A general system error occurred: cannot execute scripts
  Permission to perform this operation failed
  Priority Levels in Recovery Plan don't reflect my changes
  What does SRM database corruption look like?
  Error:Expected virtual machine file path .. vm-vmname/vm-vmname.vmx cannot be found
  SRM 4.0 cannot start - I just updated to vSphere 4.0 Update 1
  ESXi not supported at 1.0.0 nor is ESX / VC Update 2
  My script needs more time to execute
  Database access issues
  No available Customization specifications found
  Errors with using Network Customization
  Operation Timeout error when doing test recovery
  Recovery Plan error: Unable to access the VM config error message
  Grayed out options for creating and editing of protection group
  Net::SSLeay::load_error_strings
  Array with key xxxxxxxxx not found error message
  Is there a limitation of DR failover LUNs for some iSCSI arrays and some Hosts?
  Can I have a VM with multiple VMDKs spread across two NetApp SRAs?
  Not sure the error name but interesting problem
  Failed to launch SAN integration scripts
  Failed to connect to NFC during test failover with IP customization
  No visible LUNs during configuration of the array
  Review Replicate Datastores window of Array Manager is blank
  How do I find the Managed object reference (MoRef) for a VM?
  Null parameter name:key error
  Missing testbubble switch on recovery host
  Error occurred MirrorViewSRACore.dll not found
  You do not hold system privilege System.View on ServiceInstance DrServiceInstance
  Install hangs at 90%, and install log shows VIEINSTUTIL: Failed to open service control manager
  Execution of scripts is disabled on this system
  Protection Group configuration times out
  Failed to update Perl installation directories
  Error: The operation is not supported on this object
  You do not see a newly added LUN when creating a PG?
  Operation failedDetails: VI API Version 4.1 is not supported
  SRM LUN discovery, test, failover fail with file write errors
  SRM SRA Errata
    LeftHand Networks
    NetApp
    EMC
    FalconStor
    IBM
    Dell EqualLogic
    Compellent
    HP

Miscellaneous Information - URLs
Syntax highlight module info
  Text Wrangler language module
  EditPlus

Lab Exercises
  Lab 1 - Installing SRM
  Lab 2 - Configuring SRM
  Lab 3 - IP Customization
    Helpful Starters
    Procedure Hints
    Reference Materials
    Sample 1 - Bulk IP Load Screenshot
    Conclusion
  Lab 4 - Script Intro
    Scenario
    Helpful Starters
    Procedure Tips
    Reference Materials
    Things to Remember about Scripts
    Conclusion

What's New - additions or deletions or changes


Background
This document has been designed to help you work with VMware Site Recovery Manager (SRM) and to make your time with it more productive. It is an attempt to share knowledge and experience among users of SRM. For that reason, please share corrections, suggestions, or comments with the author (Michael White, mwhite@vmware.com). This document is for the person who has installed SRM once or twice and needs a little help, as well as for people working with a new SRA. It continues to grow as people submit new or updated information, and it now also helps with design and troubleshooting.

Educational materials
There is a guide called the SRM Evaluation Guide. This is the most well written and informative guide on SRM. It is important to read every page and fully understand it before implementing SRM at a customer site, or trying to technically sell someone SRM. It can be found at http://www.vmware.com/files/pdf/vcenter-srm-evaluators-guide.pdf .

The SRM documentation is found at http://www.vmware.com/support/pubs/srm_pubs.html . The Admin guide in particular has lots of important information, so you should be familiar with it.

Prior to SRM, it was still easier to do DR with virtualization than with a totally physical environment, even though it was manual. For a very good understanding of that, visit http://www.vmware.com/resources/techresources/1063 .

It surprises me that I get questions where the answers are in the release notes. Troubleshooting is sometimes quickest when you are familiar with the release notes. In addition, the reason that release notes are HTML instead of PDF is that they are updated as necessary.
SRM 4.1.1 - http://www.vmware.com/support/srm/srm_releasenotes_4_1_1.html
SRM 4.1 - http://www.vmware.com/support/srm/srm_releasenotes_4_1.html

There is a book called Administering VMware's Site Recovery Manager by Mike Laverick that is interesting. Find it at http://www.lulu.com/product/paperback/administering-vmwares-site-recovery-manager/3688988?productTrackingContext=center_search_results .

Some things to think about for a successful SRM project


Here are some things to think about that can really help in your SRM project. A Business Impact Assessment (BIA) or an existing run book can really help make sure that SRM is successful by protecting what is actually important to the business. The BIA or run book identifies the key apps and their dependencies. Often department heads will debate which apps are important. Remember, important is defined from the company point of view!

SRM Reference Guide

Page 7 of 166

- A strong team that will enhance the success of the project will include storage, virtualization, server, and network resources. Senior and experienced people in each category are of particular importance. Storage understanding is key. A close relationship with the technical staff at your storage vendor is very helpful.
- A corporate sponsor is useful to help break blockage when two different business units declare their app as most important. They can also help with funding and vendor / BU relationships.
- Lab work or a proof of concept is very important to make sure that the entire DR / BC team fully understands the building blocks. Pick only one app and its dependencies and work all the way through including a fallback. This should also help you understand what might go wrong in an SRM implementation and how to manage or mitigate it.
- Have a strong partner to help. Use VMware PSO or someone else but make sure they have experience. Get proof in the form of references!
- A strong plan is a big part of success.
- Really worry about the storage and the SRA. They are often poorly understood and poorly documented.
- Start small and go one step at a time.
- Triple check for compatibility issues before you start the actual work!

When is SRM not a good solution?


While SRM is a great DR tool that is also very good for datacenter migration projects, it is not for everything. Bear in mind that it needs to start VMs. That alone will determine if an RTO is too aggressive for SRM to handle. If you have 10 VMs, each requires 10 minutes to start, and they start one after another, then your RTO cannot be shorter than 100 minutes. Generally anything that is real time, or just minutes, is something that SRM cannot work with. In simple terms we can handle an RPO of zero or near zero, but we cannot handle the same in RTO. Starting VMs takes time. If a customer needs an RTO of zero they need a High Availability solution, which VMware doesn't have.
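The RTO arithmetic above can be sketched in a few lines. This is only a rough floor based on VM start times; the function name and the concurrent_starts parameter are my own illustration and not anything SRM exposes.

```python
import math

def minimum_rto_minutes(vm_start_minutes, concurrent_starts=1):
    """Rough floor on RTO from VM start times alone.

    concurrent_starts is a hypothetical knob for how many VMs start at
    the same time; 1 matches the one-after-another example in the text.
    """
    if concurrent_starts < 1:
        raise ValueError("concurrent_starts must be at least 1")
    total = sum(vm_start_minutes)
    longest = max(vm_start_minutes, default=0)
    # No plan can finish before the longest single VM is up, nor faster
    # than the total work divided across the available start slots.
    return max(longest, math.ceil(total / concurrent_starts))

# Ten VMs at ten minutes each, started serially.
print(minimum_rto_minutes([10] * 10))
```

Storage failover, rescans, and IP customization all add to this number, so treat the result as the best case only.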

Install / Uninstall Information


SRM is a simple application to install. There is a useful order (it is described below in the Install Overview section) that helps it be a smoother process but it is still simple. It is easy to install a Storage Replication Adapter (SRA) but troubleshooting one is quite complicated sometimes. Some SRAs require very different things than others.

It is very important to make sure that ESX, VC, and the storage are on the compatibility lists, and in fact are at the specific patch or firmware level as necessary. This information can be found easily at the URLs below. Do not assume that everything is compatible! Make sure by checking the matrixes.
VMware SRM compatibility matrixes - http://www.vmware.com/pdf/srm_compat_matrix_4_1.pdf
Storage compatibility - http://www.vmware.com/pdf/srm_storage_partners.pdf

Where should I get SRAs from?


The only place to get certified SRAs is the VMware web site. Do not use SRAs that have been found elsewhere, as they are potentially not certified and you may have issues! I have two examples personally where there was a stalled and angry SRM install and the problem was an SRA that was not certified and not from VMware. Avoid the issue and ONLY use SRAs from www.vmware.com.

Install account
When you install SRM you are prompted for an account and password to connect to VC. This account will be stored in a protected fashion and will be used by SRM to talk to VC. It should be treated like a service account. It has a limit of 31 characters and must have a password that is all ASCII. You should not change the password of this account without other steps or SRM will not work. You can find information on this later in the document.

Besides the 31-character limit on the username, the host name for VC is limited to 32 characters, and the account name field for dr-ip-customizer.exe is 25 or so characters. Update 8/8/10: I believe that this has been fixed and the character length is now 80, but I have not confirmed that.

I recommend that you use a service type account for the install, which is domain admin and admin in VC; after the install it will become SRM admin too. It should be used for the ODBC account, and to run the VC service as well.

It has been brought to my attention (thanks Brock) that our admin guide suggests using the local administrator account for the install, and for running repair activities. I have never done this, and many customers I have worked with do not have access to local admin accounts. I am still using the domain account and will continue to unless there is actually a technical reason to not do this, which I am not aware of.
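A quick pre-install check of the account against the limits above might look like the sketch below. The function name is hypothetical, and I am assuming the 31-character limit applies to the username exactly as you type it into the installer, which I have not confirmed.

```python
def check_install_account(username, password, max_username_len=31):
    """Return a list of problems with the SRM install account.

    The limits come from the notes above; whether a DOMAIN\\ prefix
    counts toward the length is an assumption, not a confirmed fact.
    """
    problems = []
    if len(username) > max_username_len:
        problems.append(
            f"username is {len(username)} characters, limit is {max_username_len}"
        )
    if not password.isascii():
        problems.append("password contains non-ASCII characters")
    return problems

print(check_install_account("svc_srm", "P@ssw0rd!"))
```

An empty list means the account passes these two checks; it says nothing about permissions in VC.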

Install and Configure information for specific environments


The links below provide install info that is most useful for education and working in labs. Much of it is with virtual storage. It is still a VERY good way to learn, as well as useful to test out ideas and learn additional skills.
- EMC Celerra VSA - http://nickapedia.com/2010/10/04/play-it-again-sam-celerra-uber-v3-2/ and http://nickapedia.com/2011/02/05/how-to-uber-new-celerra-uber-vsa-guide/
- FalconStor NSS Virtual Appliance - http://communities.vmware.com/docs/DOC-11410
- LeftHand Networks VSA - http://communities.vmware.com/docs/DOC-11408
- SRM with Left Hand Networks in a box - http://www.virtuallifestyle.nl/2008/11/vmware-site-recoverymanager-with-lefthand-vsa/#more-60
- SRM with NetApp in a box - http://tendam.files.wordpress.com/2008/11/site-recovery-manager-in-a-boxpreview.pdf

At the end of this document there is information on various SRA related issues. Make sure to read through the section that pertains to your install.

Install Overview
It is important to understand the SRM installation overview. You must install using the order of operation as shown in the lab section of this document. You must do this on the protected site first, followed by the recovery side. Here is the outline:


1. You will need to create a DB at both sides before you start.
2. SRM application installed at Protected Site
3. SRM application plug-in installed in VI clients that connect with the Protected Site
4. SRA installed at the Protected Site
5. SRM application installed at Recovery Site
6. SRM application plug-in installed in VI clients that connect with the Recovery Site
7. SRA installed at the Recovery Site
8. SRM configured at the Protected Site
   a. SRM server pairing
   b. Array Configured both Protected Site and Recovery Site
   c. Inventory Mapping
   d. Protection Group
9. SRM configured at the Recovery Site
   a. Recovery Plan created

You should now test and tweak SRM. Remember the goal is to have the required VMs running at the recovery site in the least amount of time. When you are testing, you are testing for the applications to fail over in the shortest amount of time, and to be functional when they are failed over!

Install Test Outline


When you have your storage ready, and SRM installed, here is a recommended test overview to maximize learning, but also to make sure things work in appropriate order. Some of these steps, if you are doing them with the customer, will require stops for education.

1. Simple test failover
   a. Use no IP customization, or network changes
   b. Do each CG / LUN individually in a PG (remember, single app or business unit are the most common models for organizing storage). However, another idea that may have serious merit at many customers is to organize by tiers, meaning all Tier 1 apps recovered together. This may help order of operation and improve performance.
   c. Now do a single RP that covers off each PG
2. Enhanced test failover
   a. IP customization
   b. Isolated VLAN
   c. Callouts
3. Performance
   a. Does the SRA support simultaneous test failovers?
   b. Does the storage support simultaneous access?
   c. Use the info in this document to try and improve failover time

Now you know everything works.

Uninstall Information
It is good to uninstall the SRAs first, then the plug-ins, and finally SRM. Make sure to clean up the database and other plug-ins. Do this on both sides. Sometimes the scripts folder will be left in the SRM folder after an uninstall. This is due to some miscellaneous SRA files not being removed during the uninstall. To be tidy, and to avoid potential issues when you install SRM again on this machine, you should remove those folders. If you are doing this on Win2K8 see the section Installing (uninstalling) on Windows 2008 below.


If you do re-install make sure you have not missed anything above, and make sure the SRM database has been deleted and recreated as well.

Installing (uninstalling) on Windows 2008


If you have an install on Win2K8 with UAC turned on you will have issues with doing custom installs. You will also not be able to uninstall SRM. An attempted repair or uninstall will hang at around 80% forever. This occurs for me on Win2K8 R2 as well. The solution is to right-click on the installer file and use Run as Administrator. See http://kb.vmware.com/kb/1028443 for more info.

Upgrade / Patch Information


Currently we have not had many updates or any that were complicated. It is important to understand that two different versions of SRM cannot communicate and will cause errors in the logs. Generally it is best to start by upgrading one SRM server, then the plug-in, followed by the other side. It is also a good idea to uninstall the SRM plug-in first. The release notes should generally have details on doing upgrades as well. Sometimes the patches have recommended uninstalling the plug-in. If they don't mention that you can likely skip that step. Sometimes the release notes may indicate something else with respect to the plug-in so be aware.

When doing patching, the above is likely all that is necessary. However, when upgrading it is quite possible you may need to upgrade your SRA as well as SRM. So watch out for that. As an example, I upgraded my storage and that required a new SRA. It was not noted anywhere (but in this document now), so I had some errors that were cleared up when I upgraded the SRA.

What are the SRM build numbers?


SRM 4.1.1 - 340092 - Feb 10, 2011
SRM 4.1 - 267817 - July 13, 2010
SRM 4.0.1 - 236215
SRM 4.0 - 192291
SRM 1.0 Update 1 - 128004
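Since two different versions of SRM cannot communicate, it can help to turn the build numbers you read off each site into version names and compare them. The sketch below only uses the builds listed above; the function names are made up for illustration.

```python
# Build numbers as listed in this guide; later builds are not covered.
SRM_BUILDS = {
    340092: "SRM 4.1.1",
    267817: "SRM 4.1",
    236215: "SRM 4.0.1",
    192291: "SRM 4.0",
    128004: "SRM 1.0 Update 1",
}

def srm_version_for_build(build):
    """Map a build number to its release name, or 'unknown build'."""
    return SRM_BUILDS.get(int(build), "unknown build")

def sites_match(protected_build, recovery_build):
    """True when both sites run the same known SRM release."""
    protected = srm_version_for_build(protected_build)
    recovery = srm_version_for_build(recovery_build)
    return protected != "unknown build" and protected == recovery

print(srm_version_for_build(267817))
print(sites_match(340092, 267817))
```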

Upgrading to SRM 4.1.1


This is an easy upgrade since you don't need to worry about upgrading to 64-bit OS environments. While technically you do not need to upgrade to vCenter 4.1 Update 1 first, you must for supportability. Fortunately that is an easy upgrade. If you need to upgrade to vCenter from prior to vCenter 4.1 see the information below. I do need to mention that this upgrade will only upgrade SRM 4.1 and not older versions. This SRM upgrade is minor, and there are no new features, but there is a significant list of fixes. Find out more at http://www.vmware.com/support/srm/srm_releasenotes_4_1_1.html . I have done a number of these upgrades without issue. My blog on this is at http://blogs.vmware.com/uptime/2011/02/vmware-vcenter-site-recovery-manager-411-isreleased.html .


Upgrading to SRM 4.1 including upgrading to vSphere VirtualCenter 4.1


The upgrade process for 4.1 is a lot more complex than previous upgrades. I have used the instructions below a few times and they work. I had SQL Express on the VC / SRM server and that was not a good upgrade path. In fact, it would be easier to delete and install new. I changed my lab to use off host SQL and the upgrade process was much easier.

VirtualCenter 4.1
This is more accurately referred to as a migration, since we are moving from a 32-bit host operating system to a 64-bit operating system. The steps below will help you move from an SRM 4.0 / VC 4.0 environment where SRM and VC are co-located (although it doesn't impact this process much if they are not) and SQL is remote. I will try to point out useful information along the way to help in other migration scenarios. I recommend that you read this document and its references carefully and completely, then plan an appropriate outage and work all the way through. You do want to minimize the outage window of both VC and SRM! A very useful reference is the release notes (link here), and the upgrade guide (link here).

Some interesting background
- VirtualCenter ISO 4.1 build 259021
- ESXi 4.1 build 260247
- ESX 4.1 build 260247
- SRM 4.1 build 267817
- You must use a 64-bit DSN for VC and remember to make it using SQL Native. You can find SQLncli_x64.msi near the bottom of the page at http://www.microsoft.com/downloads/details.aspx?FamilyId=50b97994-8453-4998-8226-fa42ec403d17&displaylang=en .
- You will need to use a 32-bit DSN for VUM so see the KB article at http://kb.vmware.com/kb/1010401 for help in making the 32-bit DSN in a 64-bit OS.

Things to get ready
- Make sure you have a good backup of everything that is going to change, which means your VC server and database.
- Your new host that is 64-bit will need to have the same name and IP as the old host. This is important. So you will need to build it when it is not on the network. Avoid conflict with the existing VC.
- You need to preserve the VC and SRM FQDN name through the migration.
- Make sure you have access to your service account information for VC and SRM.
- Remove your vSphere Client plug-ins. This is not always necessary but it helped sometimes in this upgrade process.
- The files we need:
   o VirtualCenter ISO: we need the ISO as it comes with a folder we need, that the normal .zip doesn't. The folder is called datamigration. You can extract the ISO to a location that you will have access to when working on either the old or new VC.
   o If you have a spreadsheet that details the VM to LUN relationship that is good to have.
   o An outage: you will have no VC and no SRM for a number of hours. With preparation, and understanding of what is needed, you should be able to keep your outage to around 3 to 4 hours. But this will vary widely! SRM should be available approximately 1 hour after your VC is again available. But that will vary depending on your prep work.

datamigration folder that is only present when you have downloaded the ISO

Migration - VC
1. Your database for VC is remote, but still make sure to have a backup of it.
2. You should be logged into the current VC (or current VC / SRM host). Then, either by using / mounting the ISO, or if you have extracted the files from it, click on autorun so that you get the main screen of the install. Near the bottom of it, under Utility, select Agent Pre-Upgrade Check.
   a. If the check finds any issues, you need to resolve them before continuing. They will generally have KB articles to help.


Autorun screen with the Agent Pre-upgrade check highlighted.

Make sure to use the Windows credentials that is your VC service account, or your domain admin.

3. On the existing VC / SRM machine, copy the datamigration folder to the local hard drive and expand it.

4. Make sure that VMware Update Manager (VUM), VMware VirtualCenter, and VMware VirtualCenter Management Web Services are stopped. Use the commands:
   a. net stop "VMware VirtualCenter Server"
   b. net stop "VMware Update Manager Service"
   c. net stop "VMware vCenter Site Recovery Manager Server" (this may not be on this host if your SRM is not co-located with your VC)
5. Now you need to use the backup.bat file that is in the datamigration folder to do a backup of your Virtual Infrastructure environment. Note the log folder? The backup.log file will provide info on how the backup went, as it echoes the work done. The datamigration folder now has a data folder that contains the backup. This backup doesn't back up your remote database, but it does back up the port settings in use, SSL certificates, and licensing information.
   a. When you execute the batch file, it will normally only have a few questions.
   b. It will ask if you wish to include ESX or VM patches. You should generally say yes.
6. Now copy the entire datamigration folder to a location that you can copy it from in the future to the new VC host.
7. Now you must turn off your existing host. Disconnect the network from it to make sure it is not accidentally turned on.
8. You will now turn on your new host, which has the same IP and FQDN. You may need to patch it now, or join it to the domain. Do what is necessary to make it part of your domain and healthy. This includes the 64-bit SQL Native client install, creating the 64-bit DSN, and creating a 32-bit DSN. The URLs earlier can help find what you need.
9. You need to copy the entire datamigration folder to the new host.
10. On the new host, you need to have access to the install media so map a drive.
11. Use the install.bat file from the datamigration folder to start the install process.
   a. You will be asked for the path to VC and then VUM. If you are using the ISO, or have extracted the ISO, the path will be the same for both VC and VUM.

The start of the install.bat process.

The install.bat is in the datamigration folder.


   b. You will see the normal install prompts for VC.
   c. Use the same DSN information.
   d. Notice how you have a choice at some point to do an automatic, or manual update of the VC agents on hosts? I used automatic.
   e. Select the same path as you had previously used (on the old VC).
   f. Use the same ports.

A nice improvement!

   g. The next prompt is about the size of the JVM memory. Use the default or make a more appropriate choice.
   h. After the VC install is finished the install will return you to the install.bat file and start the VUM install process.
   i. VUM will now be installed.
      i. Enter your VC service credentials.
      ii. Use the appropriate 32-bit DSN.
      iii. Accept the defaults.
   j. After the install is finished you will be returned to the install.bat file.
12. There is a restore.log file in the logs folder if you need to see what was done.
13. It is important to understand that the install.bat is very smart. If you, like me, don't have the 32-bit DSN for VUM, and exit, you can start the install batch again after you have the 32-bit DSN and it will continue where it should!
14. Confirm that the VUM, VC, and VC Web services are running, and with the correct credentials. They are likely NOT.
15. Now install the VI client from the autorun screen.
16. Connect to the VC.

17. Install the VUM plug-in. This could be on your VC or your desktop. But as I mentioned earlier, remove the plug-ins first.
18. Now check your VUM config, and any other items, to make sure what you have is 4.1 and your config.

After the upgrade, your VC should show a version of something like above.
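Step 4 of the VC migration stops the services before the backup is taken. A small sketch of scripting that is below. The service names are the ones used in this guide; check them against the Services console on your own host, since I am assuming they match the display names there. The helper names are hypothetical.

```python
import subprocess

# Service names as given in step 4 of this guide (assumed to be display names).
VC_SERVICES = [
    "VMware VirtualCenter Server",
    "VMware Update Manager Service",
]
SRM_SERVICE = "VMware vCenter Site Recovery Manager Server"

def build_stop_commands(srm_colocated=True):
    """Return one 'net stop' argument list per service to stop."""
    names = VC_SERVICES + ([SRM_SERVICE] if srm_colocated else [])
    return [["net", "stop", name] for name in names]

def as_command_line(command):
    """Render an argument list the way you would type it at a prompt."""
    return subprocess.list2cmdline(command)

def stop_services(commands, dry_run=True):
    """Print the commands, or run them when dry_run is False (Windows only)."""
    for command in commands:
        if dry_run:
            print(as_command_line(command))
        else:
            subprocess.run(command, check=True)

stop_services(build_stop_commands(srm_colocated=True))
```

The dry run is the default on purpose, so that nothing is stopped until you have read the list. Note that names with spaces need the quotes that the rendered command line adds.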

You have now upgraded one of your VirtualCenter servers. You need to do the other one now!

Note 1: In all of the work I did, the VC / VUM services did NOT start, and we had to assign the proper credentials instead of Local Service, and then it worked.

Note 2: Be careful with the 64-bit and 32-bit DSNs as it is easy to get careless. If you make a mistake, you can cancel the VC or VUM install process, fix the DSN issue, and restart the install batch file. It will not redo an unnecessary install but rather start where you last finished successfully. A very nicely done install.bat file!
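To keep the DSN bitness straight, a tiny lookup like the one below can help. The VC, VUM, and SRM requirements are the ones stated in this guide. The tool paths are the standard locations of the ODBC Data Source Administrator on 64-bit Windows (the 32-bit one lives under SysWOW64); the function name and the windows_dir default are my own.

```python
# Which DSN bitness each product needs on a 64-bit OS, per this guide.
REQUIRED_DSN_BITS = {"VC": 64, "VUM": 32, "SRM": 32}

def odbc_administrator_path(dsn_bits, windows_dir=r"C:\Windows"):
    """Path of the ODBC administrator that creates DSNs of the given bitness."""
    if dsn_bits == 64:
        return windows_dir + r"\System32\odbcad32.exe"
    if dsn_bits == 32:
        # On 64-bit Windows the 32-bit tools are kept in SysWOW64.
        return windows_dir + r"\SysWOW64\odbcad32.exe"
    raise ValueError("dsn_bits must be 32 or 64")

for product, bits in REQUIRED_DSN_BITS.items():
    print(product, bits, odbc_administrator_path(bits))
```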

Upgrade to SRM 4.1


This too is more a migration, but this time we don't have the lovely datamigration script to help us! But the steps below will help! The process below is for migrating to 4.1 when you are using a 32-bit host OS. If you are in fact already using a 64-bit OS, you can do an in-place upgrade. We do not step through that process below, but the information below might help.
Release notes URL - http://www.vmware.com/support/srm/srm_releasenotes_4_1.html
SRM Admin guide URL - http://www.vmware.com/pdf/srm_admin_4_1.pdf

Some interesting background
- SRM 4.1 build 267817

Things to get ready
- You will need to have your VC infrastructure already upgraded!
- Make sure that SRM can do a test failover!
- You will need to confirm that your SRA has been upgraded or certified for 4.1. Check the VMware download site for SRM and the current version of SRAs, and the SRA download for a readme that talks about what it is certified for.
- SRM 4.1 bits
- SRA bits
- Be aware that SRM requires a 32-bit DSN, which is NOT the default on a 64-bit OS. If you need help with that see above in the VC section.
- Remove your SRM plug-ins.
- You need an SRM backup, but it needs to be taken at the same time on each side, as if it was in a consistency group. BTW, it needs to be restored like that too.
- You should also have history reports as hardcopy just in case.
- Copy your vmware-dr.xml file from each SRM server to a location where you will be able to access the file later. The default location is C:\Program Files\VMware\VMware Site Recovery Manager\config .
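Copying the vmware-dr.xml file from each SRM server is easy to forget, so here is a minimal sketch of doing it with a per-site folder. The default path is the one given above; the function name, the site labels, and the backup location are made up for illustration.

```python
import shutil
from pathlib import Path

# Default config location as given in this guide.
DEFAULT_CONFIG_DIR = Path(r"C:\Program Files\VMware\VMware Site Recovery Manager\config")

def backup_srm_config(site_label, backup_root, config_dir=DEFAULT_CONFIG_DIR):
    """Copy vmware-dr.xml into <backup_root>/<site_label>/ and return the copy's path."""
    source = Path(config_dir) / "vmware-dr.xml"
    target_dir = Path(backup_root) / site_label
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / source.name
    shutil.copy2(source, target)  # copy2 also keeps the file timestamps
    return target

# Example (run on each SRM server; the share name is hypothetical):
# backup_srm_config("protected", r"\\fileserver\srm-migration")
```

Keeping one folder per site matters later, because the Advanced Settings for each site must be restored from that site's own file.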

Migration - SRM
1. Remember that your new SRM host must have the same name / FQDN, so you will need to turn off your old SRM host after you have your backups and .xml file so you can deploy the new host.
2. Back up the SRM database on each of the two (or more) sides.
3. Turn off the old SRM host.
4. Turn on the new SRM host.
5. Make sure you have a 32-bit DSN.
6. Create a new install of SRM 4.1 on the new host. Important to note:
   a. If you are re-using the SRM 4.0 database, make sure to use a copy and not the original. Errors or a cancellation could corrupt your database.
   b. You will be prompted about there being an SRM extension installed already. This is due to using your old database with a new install. You should select Yes.

Select Yes at this prompt.

   c. You will need to select the Automatically create the certificate choice.
   d. Remember the DSN is 32-bit.
   e. SRM will likely NOT start. Change the credentials on the service to the proper SRM service account, and hit retry. It should continue fine.
7. You now have SRM running, but not configured completely.
8. If the plug-in has not been removed, remove it, and install it again. Several times in my testing, right after the SRM upgrade, the plug-in had the name of vDr instead of the fully spelled out name. It still worked, and after a restart of the VI Client the name changed.
9. Install the SRA.
10. Now get the other side done.
11. This is the time, if you have changed advanced settings, when you will need to migrate them. Make sure to do that before you continue on. See the section below for help.
12. Now you will need to re-create the site pair, and reconfigure the array manager credentials, in particular the authentication information. When doing this, there is a small thing to remember. After entering the correct credentials, you will need to select the array.

When you re-enter your credentials to the Array Manager, make sure to select your array!

13. You should now do a test and make sure it works!


You are now complete. If you have any issues, please do not hesitate to contact our support organization but also leave me a comment!

Migrating Changes to Advanced Settings
If you have not made any changes to Advanced Settings, which should be true for most customers, you do not need to do this section. Changes would be things like SanProvider.CommandTimeout or SanProvider.hostRescanRepeatCnt. See below for a screenshot of the Advanced Settings categories. If you know the changes you made you can just add them to your new install. But if you are not sure, you will need to work through the process below.

See this by <right+click> on the Site Recovery lightning bolt.


This is how you access Advanced Settings

Start by loading your vmware-dr.xml file. Load the one from the Recovery Site when editing the Advanced Settings for the Recovery Site and do the same for the Protected Site.
1. When you are in the Advanced Settings window, you will need to work through each category.
2. One example is localSiteStatus. Search your vmware-dr.xml file for that phrase.
3. In the section of the vmware-dr.xml file where you find the category, in our example localSiteStatus, look for variables that match those in the Advanced Settings category and change the value in Advanced Settings to match what is in your vmware-dr.xml file. See below for an example.

This is a sample of the vmware-dr.xml file.

After we see what is in the vmware-dr.xml file we record it in the Advanced Settings. See below for that.


LocalSiteStatus of the Advanced Settings.
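Searching the saved file by hand for each category gets tedious, so a small script can list what is recorded under one. This is only a sketch: I am assuming the category shows up as an XML element (for example localSiteStatus) whose child elements are the settings, and the setting name in the sample is invented. Check the real layout of your own vmware-dr.xml before trusting the output.

```python
import xml.etree.ElementTree as ET

def settings_for_category(xml_text, category):
    """Return {setting: value} for every element named like the category.

    Assumes settings are direct child elements of the category element,
    which is an assumption about the file layout, not a documented fact.
    """
    root = ET.fromstring(xml_text)
    wanted = category.lower()
    found = {}
    for element in root.iter():
        if element.tag.lower() == wanted:
            for child in element:
                found[child.tag] = (child.text or "").strip()
    return found

# Hypothetical fragment; 'checkInterval' is an invented setting name.
sample = """
<Config>
  <localSiteStatus>
    <checkInterval>60</checkInterval>
  </localSiteStatus>
</Config>
"""
print(settings_for_category(sample, "localSiteStatus"))
```

Whatever it prints still has to be entered by hand in the Advanced Settings window for the matching site.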

Remember you will need to work through this process for each category.

Some issues I found
I have mentioned these issues elsewhere but thought I would mention them here again.
- Forgot to update the credentials for the arrays.
- Used a 64-bit DSN for SRM. Then tried the 32-bit and it worked!
- Didn't know to install the 64-bit SQL Native client on the Win2K8R2 SRM server. So I did. And it worked.
- Did not see the Site Recovery Manager plug-in, but saw one that was called vcDr and it worked. Restarting the vSphere Client also cleared it up.
- None of the VC services started when they were supposed to. Changing the credentials on the services to the correct ones solved the issue easily. I kept finding LocalService but once changed all was good.
- The SRM service never started, but when I changed Local Service to the proper credentials it did and all was good.
- Didn't know VUM needed a 32-bit DSN. So created one and restarted!
- It may not be connected to the upgrade, but after three successful test failovers, I had one fail. The error was Error:Error occurred: failed to prepare shadowVM for recovery. One VM was successfully recovered but three were not. I removed the protection group that held the VMs, then made sure that the folders on the ShadowVM LUN associated with those newly unprotected VMs were gone (several were not). I then recreated the PG, attached it back to the recovery plan, and it worked fine for many tests with no issues.

Undo
If you wish to undo this migration it is almost easy. You would turn off the new hosts, and turn on the old ones. They would not be happy since the databases that would still be in use would be the new ones. You would need to stop all of the VC and SRM services, and restore the backup copy of the databases I mentioned you needed to have. Then start the services and you should be good to go.

To our next major release SRM 4.0


Note: I will delete this section in one of the next updates of this document to save space.

You will need to use the steps below to successfully and smoothly upgrade to the next release.
Release notes - http://www.vmware.com/support/srm/srm_releasenotes_4_0.html
Upgrade KB article - http://kb.vmware.com/kb/1013166
Upgrade blog posting - http://blogs.vmware.com/uptime/2009/10/srm-40-is-here-the-wait-for-vsphereand-nfs-support-is-over.html

Important Note: You can upgrade an existing site with no changes, especially to the database, OR you can install new. If you try to install new and use the old database there will likely be corruption. However, there is a workaround: do a simple upgrade, allow it to do the database upgrade, and then install new and point it at the upgraded database. This would work fine.

Important Note two: You need to check the SRM download page to see if there are new SRAs that you should use. With NFS support there will be new versions of the SRA, but if you are not going to use NFS, there may not be new versions. But you should check.

Important Note three: You will need to have your new license, and you may need to restart your SRM service one extra time after the update to deal with license related issues. You can learn about the new licenses on page 42.

Important Note four: I was using a legitimate SRM 4.0 license before the vSphere Update 1 upgrade. After Update 1, when I should have seen the new SRM Solution Licensing, I saw nothing and SRM didn't work. I discovered that I had old style licensing. I don't think any customers would have this sort of license but you are warned now. If this happens to you call support and tell them I sent you!

1. Protection Site
   a. Make a backup of the VC database.
   b. Upgrade to VC 4.0.
   c. Make a backup of the SRM database in case you need to roll back.
   d. Make sure your SRM plug-in is enabled. If it is not, you will not be able to enable it after the upgrade.
   e. Install / upgrade the SRM server.
   f. Upgrade the plug-in.
   g. Restart the VI client.
2. Recovery Site
   a. Same steps as above.
   b. I suggest that you use linked mode for VC so that you can work with SRM more easily than by having two clients open. When in linked mode, you can also license in only one place and yet select both sides to apply the license to. So it is a bit easier for licensing too.
3. Now install the licenses! See screen shot below.
4. When you decide to start upgrading ESX hosts to vSphere, remember that ESX 4 cannot fail over to ESX 3 IF the VMs have been upgraded to virtual hardware version 7 (VH7). But they can fail over if they have not been upgraded to VH7.
   a. It might be easier in a Protected Site / Recovery Site situation to start upgrading ESX hosts on the recovery side first AND not upgrade to VH7 or VMware Tools until after the protected site is also upgraded.
   b. If both sites are hosting protected VMs it could be interesting! But the same idea might be good: update everything to ESX 4 but without updating the VH or VMware Tools until all is done. This can be done if necessary at the cluster level too, I think.
5. Test the test failover, and as soon as possible test a real failover.

Notes:
   a. A recovery will run fine if the protection site is upgraded but not the recovery side. Try to avoid this but it does work.
   b. If you try a test recovery on the recovery side while the protected side is being upgraded there may be issues so try to not do this.
   c. Upgrade quickly so there is minimal outage / exposure.
   d. Make sure that DHCP client, Protected Storage, Server (lanmanserver), and Workstation (lanmanworkstation) are all running on the SRM server before the upgrade.
   e. If you upgrade the OS to Win2K8 as part of the upgrade, make sure the Protected Storage mentioned above is running.
   f. If you have issues, and cannot proceed, you should uninstall / reinstall SRM. But first roll back your VC to 2.5 Ux.
   g. Hand made modifications to any of the <SRM root>/conf/* files will be overwritten. There should be backup copies of those files from which you can then copy and paste back the custom entries.
   h. As noted above, make sure to watch out for the VH levels, as you can get errors trying to configure a PG to fail over inappropriately, i.e. ESX 4 hosted VMs that are at VH7 to a cluster that is held by an ESX 3 host.

See below for the place to add the license.


Advanced Settings option in SRM 4.0 to replace manually editing the vmware-dr.xml file, and where to add the license.
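The virtual hardware rule from the upgrade notes above (a VM at VH7 cannot fail over to an ESX 3 host, while a VM that has not been upgraded can) is simple enough to check against an inventory list before you build protection groups. The sketch below is an illustration only; the function names and the inventory format are hypothetical.

```python
def can_fail_over(vm_hardware_version, recovery_esx_major):
    """Apply the rule from the notes: VH7 VMs need ESX 4 at the recovery side."""
    if vm_hardware_version >= 7 and recovery_esx_major < 4:
        return False
    return True

def blocked_vms(inventory, recovery_esx_major):
    """Names of VMs that could not fail over to the given recovery ESX version.

    inventory is a hypothetical {vm_name: hardware_version} mapping.
    """
    return sorted(
        name
        for name, version in inventory.items()
        if not can_fail_over(version, recovery_esx_major)
    )

inventory = {"exch01": 7, "sql01": 4, "web01": 7}
print(blocked_vms(inventory, recovery_esx_major=3))
```

This is the reason for upgrading the recovery side hosts first and holding off on VH7 until both sides are on ESX 4.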

Design Guidelines
This section will look at some design information. The Admin guide has some very good information but we will look at things in this section that are not covered in our guide.
- It is important to understand that for a test, or a real failover, having all of the VMs you want to fail over together in the same LUN(s) provides the best situation. Remember the whole LUN must fail over! It is worth thinking about having a department, or an application worth of VMs, in a LUN or LUNs to provide the best flexibility in a test or real failover. As part of this I would include some XP VMs for test purposes.
- A Powershell script that can help with understanding where VMs and their disk files are, and on which LUN, can be found at http://www.peetersonline.nl/index.php/vmware/another-way-to-gather-vmware-disk-info-withpowershell/ .
- Some vendors use Consistency Groups (CG) to group LUNs and this becomes the granularity that is seen through SRM. It is often very successful for the greatest granularity during a failover or test failover when each CG hosts one or more LUNs that hold only one app or one business unit.
- A VM must have its VMDK(s) on the same storage vendor's arrays and NOT on two different storage vendors' storage.
- If VC is using trusted certificates then SRM must too. This is not simple but instructions are in this document that will make it much easier!
- Look at the SRA information in this document, as it will sometimes provide information that will impact your design.
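Because the whole LUN must fail over, it is worth working out which other applications get dragged along when you fail over one app. A rough sketch is below, assuming you already have a VM to LUN list (for example from the spreadsheet or script mentioned above). The tuple format, names, and sample data are all made up for illustration.

```python
from collections import defaultdict

def failover_impact(placements, app):
    """Return (LUNs the app touches, other apps' VMs on those LUNs).

    placements is a hypothetical list of (vm_name, app_name, lun_name) tuples,
    one tuple per LUN that a VM has disk files on.
    """
    luns = {lun for _vm, owner, lun in placements if owner == app}
    dragged = defaultdict(set)
    for vm, owner, lun in placements:
        if lun in luns and owner != app:
            dragged[owner].add(vm)
    return sorted(luns), {owner: sorted(vms) for owner, vms in dragged.items()}

placements = [
    ("exch01", "Exchange", "LUN1"),
    ("exch02", "Exchange", "LUN2"),
    ("sql01", "Finance", "LUN2"),
    ("web01", "Intranet", "LUN3"),
]
luns, dragged = failover_impact(placements, "Exchange")
print(luns)
print(dragged)
```

An empty second result for every app you care about is what good storage organization looks like from the SRM point of view.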

What goes wrong in SRM projects?


The issues in SRM projects that become difficult or expensive generally fall into three categories.

1. Storage Organization: I have seen one potential SRM customer that used 4 TB LUNs throughout their virtualization world, because it appeared easier to them than anything smaller. I suspect they still have not upgraded to vSphere! Another issue is when there is no pattern for where applications are stored, so Exchange might be scattered across 10 different LUNs. In a failover, all of the apps on those 10 LUNs will need to fail over, and there will not be any granularity. The best idea is to slowly migrate the applications you want to protect to new replicated LUNs. You will get the granularity for testing or failover, and it will be easier to upgrade the array in the future. Most people have little storage organization, so this will need to be taken into account!
2. Application knowledge: we need to know what the corporation thinks is the most important app, not just what an IT manager thinks. We then need to know all of the upstream and downstream services that the application needs in order to be considered working. All of that information is necessary to build a test plan. This can be quite big when you consider all of the applications that companies might have! In addition, most companies are not entirely sure what apps or what services they need. If the customer has a Business Impact Assessment (BIA) report it will help enormously, but most don't have that either. Change Control will sometimes have very good info to help with understanding applications and their necessary services.
3. Storage Replication Adapter (SRA): this tiny piece of software can cause a great deal of grief. Sometimes it needs a path change that its own installer didn't make. Sometimes it needs a special license, like SnapView for MirrorView or space efficient for IBM. Sometimes these requirements are not written down anywhere easy to find. EMC finally has very good release notes, but they are hard to find as they don't ship with the SRA. Also, the SRAs often don't support all of the features that the replication supports, which can confuse and frustrate customers. So investigate the SRA carefully.
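The storage organization point can be made concrete with a small sketch. The Python below is only an illustration: the VM names, application names, and LUN names are made up, and the inventory would really come from something like the PowerShell disk report mentioned earlier. It shows which LUNs must fail over for one application and which other applications get dragged along because they share those LUNs.

```python
from collections import defaultdict

# Hypothetical inventory: VM name -> (application, LUN). Replace with real data.
INVENTORY = {
    "exch-mbx01": ("Exchange", "LUN01"),
    "exch-mbx02": ("Exchange", "LUN02"),
    "exch-cas01": ("Exchange", "LUN02"),
    "sql-prod01": ("SQL", "LUN02"),
    "web-01": ("Intranet", "LUN03"),
}


def failover_footprint(inventory, application):
    """Return (LUNs that must fail over, other applications dragged along)."""
    # Every LUN that holds at least one VM of the application must fail over.
    luns = {lun for app, lun in inventory.values() if app == application}
    # Any other VM on those LUNs fails over too, wanted or not.
    dragged = defaultdict(set)
    for vm, (app, lun) in inventory.items():
        if lun in luns and app != application:
            dragged[app].add(vm)
    return sorted(luns), {app: sorted(vms) for app, vms in dragged.items()}


luns, dragged = failover_footprint(INVENTORY, "Exchange")
print("LUNs that must fail over:", luns)
print("Other applications affected:", dragged)
```

An empty second result for every protected application is the goal of the migration to dedicated replicated LUNs described in item 1.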

Large VI environments
If you have a large number of hosts and VMs, you may have some issues with SRM. These are scalability issues in both the platform and the UI, and they normally only occur when there are very large numbers of VMs and hosts. We are working hard and fast to make these problems go away, but in the meantime here is some useful information. Each of the next major releases will continue to allow more scalability.

Design your SRM infrastructure in a POD design. A pod should only manage approximately 750 unprotected VMs (and 1000 protected VMs) and fewer than 150 replicated LUNs. This allows SRM to work better, as the full thousands of VMs, both protected and not protected, are not seen by any one SRM. In this design it is a good idea to separate SRM and VC. So each POD would have up to approximately 750 unprotected VMs and up to 1000 protected VMs, with VC and SRM installed on separate servers. Align each POD with a business or departmental unit and it will lessen the impact of the extra VCs to manage. In addition, Linked Mode in vSphere will help too.

So if you have 2500 VMs total, and 1500 are protected, I would create two PODs, and if possible divide the protected VMs and the unprotected VMs between them. However, division by business or departmental guidelines is more likely. Each POD would have separate SRM and VC servers, and hopefully would be backed by a corporate production SQL or Oracle cluster. Some other recommendations would include:

Large recovery plans may require more resources (processor / RAM / ESX servers) at the recovery site than at the protected site, due to the nature of failovers and trying to start everything so quickly.
You should separate the VC and SRM databases as they are heavily used during a recovery.
As a general comment, adding VMs to protection groups is less costly in resource usage than adding PGs. Fewer PGs speed up recoveries, but do not hesitate to use what is necessary.
VMware Tools speed up recoveries, because if they are not installed we must wait for the timeout to occur!
High priority should only be used where necessary as it slows things down. Of course, that is as designed, so that we can exactly determine the order of VM recoveries. But only use it when you need to.
To maximize performance when doing simultaneous recoveries, try to have each recovery plan target a separate cluster.

Another way of doing things, which may reduce the need for a POD design, is to do a 3 year sizing forecast and figure out what the end state architecture needs to be to support the number of projected workloads and the RTOs (i.e. how much horizontal scaling), then work back from the end state picture to what you will implement on day 1. That way you will know it will scale without breaking. Do the storage layout as well, so that everyone agrees on it and on how it will grow. If, however, you are starting with 3000 VMs to protect on day 1, the POD design will help.

Page 23 in the 1.0.1 U1 Admin guide shows the SRM maximums. They include:
500 protected VMs - enforced
150 Protection Groups - enforced
150 Replicated LUNs - advisory only (this could be more than an actual 150 LUNs depending on how your LUNs are managed)
3 running recovery plans - advisory only

Approximately page 11 of the SRM 4.0 Admin guide shows the new SRM maximums:
1000 protected VMs - enforced
500 protected VMs in a single protection group - enforced
150 Protection Groups - enforced
150 Replicated LUNs - advisory only (this could be more than an actual 150 LUNs depending on how your LUNs are managed)
3 running recovery plans - advisory only
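A planned design can be compared against the SRM 4.0 maximums listed above with a few lines of Python. This is a minimal sketch: the limit values are the ones quoted from the Admin guide, while the dictionary keys and the design figures are invented for the illustration.

```python
# Maximums quoted above from the SRM 4.0 Admin guide.
SRM40_LIMITS = {
    "protected_vms": 1000,            # enforced
    "vms_per_protection_group": 500,  # enforced
    "protection_groups": 150,         # enforced
    "replicated_luns": 150,           # advisory only
    "running_recovery_plans": 3,      # advisory only
}


def check_design(design, limits=SRM40_LIMITS):
    """Return one message for every planned value above its maximum."""
    problems = []
    for name, maximum in limits.items():
        planned = design.get(name, 0)
        if planned > maximum:
            problems.append(f"{name}: planned {planned}, maximum {maximum}")
    return problems


# Hypothetical design figures for one SRM instance.
design = {
    "protected_vms": 1200,
    "vms_per_protection_group": 80,
    "protection_groups": 40,
    "replicated_luns": 60,
    "running_recovery_plans": 2,
}
for problem in check_design(design):
    print(problem)
```

Any message from this check suggests the design should be split into more than one pod.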

When you need to build SRM in pods like this it can increase the complexity of management, or perhaps just increase the frustration factor. Try to minimize this by building the pods within the limits above but also as department / business unit / or maybe even application / service based. This may help minimize the frustration and make the management a little more logical.
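As a rough starting point for the number of pods, the per-pod guidance in this section (about 1000 protected and about 750 unprotected VMs) can be turned into simple arithmetic. The sketch below assumes VMs can be divided evenly between pods and ignores the replicated LUN guidance; in practice the departmental or business unit split will usually decide the final layout.

```python
import math

# Approximate per-pod guidance from this section.
MAX_PROTECTED_PER_POD = 1000
MAX_UNPROTECTED_PER_POD = 750


def pods_needed(total_vms, protected_vms):
    """Smallest pod count that keeps both VM classes within the guidance."""
    unprotected_vms = total_vms - protected_vms
    by_protected = math.ceil(protected_vms / MAX_PROTECTED_PER_POD)
    by_unprotected = math.ceil(unprotected_vms / MAX_UNPROTECTED_PER_POD)
    return max(1, by_protected, by_unprotected)


# The example from this section: 2500 VMs in total, 1500 of them protected.
print(pods_needed(2500, 1500))
```

For the 2500 / 1500 example this gives two pods, which matches the recommendation above.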

Suggested Recommendations aka Best Practices


People are always looking for easy answers. These recommendations are not the easy answers that you may be looking for. SRM is a very small part of a DR solution. We can make some suggestions around SRM and some other general observations, but so much of the work around DR is outside of SRM that it is hard for VMware to have suggestions or best practices. As well, I am concerned about the terminology of best practices, because much of our customer base will consider them best for them, but we cannot do best practices for everyone. People should look at the recommended best practices and see if they apply. The recommendations below are the first recommendations I have done for SRM and I think they should apply to most, but still, please make sure they apply to you before implementing them.
1. Log settings: you should increase the settings around logs. The log files compress very well after they are used, physical machines generally have plenty of large drives, and virtual machines can be given large drives. Think about keeping 100 x 10 MB files. The 10 MB files will compress very well! See page 37 for more info on this.
2. Increase the number of threads: if you are not using SRM 4.1, you should increase the number of threads in use to avoid some timeout issues. See how to do this on page 67. This change is already included in SRM 4.1.
3. Maximum power on: can this be changed? By default we power on 2 VMs per host, to a maximum of 10 hosts. You can change this if you have plenty of memory, processor, and storage bandwidth. See how to do this on page 46.
4. Service Level Agreements: SLAs are something you should tread carefully around, since they can sometimes be a factor in a problem with SRM. SRM must start VMs, and that time needs to be measured before any SLA is agreed to! This means that while we can support an RPO of zero or near zero, we cannot support the same in RTO, as we need time to start VMs. Plus, remember that the decision time must be part of the RTO.
5. Alarms: you should configure the minimum set of alarms, or at least think about them and consciously decide not to use them. See the recommendations on page 48.
6. Script usage: you should think about your scripts. Should you use one big script or many small scripts when it comes to IP Customization? You should definitely store those scripts in one place, which should be on the SRM server. I would also strongly recommend the use of the VIX API for the scripts. See page 33 and page 35 for more info.
7. Patch regularly: SRM is not frequently updated, but it is important to upgrade when patches are available. When you unexpectedly need SRM to work you really need it to go, and patches can solve issues that would otherwise stop SRM from working when you need it.
8. Use the account suggestions on page 9.
9. Plan specifically for a partial failover, meaning you can fail your individual tier one applications over without failing anything else. This makes testing much easier, and gives the customer the opportunity to have very granular failovers if they need them. Experience suggests they will need partial failovers more often than complete ones. This is accomplished by organizing the storage so that you can in fact fail over just one app.
10. The RTO should always include the time to make the decision to execute a failover.
11. More protection groups lengthen failover; fewer shorten it. Within reason, of course. For example, in internal tests, 100 VMs in 1 PG failed over in approximately the same time as 30 VMs in 30 PGs. Applications can make big differences in this testing, but it is a good idea to minimize the number of PGs where reasonable. Adding more PGs is more costly than adding VMs.
12. With NFS it is less costly to have fewer, bigger mounts and more costly to have more, smaller mounts. Costly in this case means the time to mount / dismount.
13. High priority provides maximum control but the slowest execution. Perhaps it is best to minimize the use of high priority and to plan on using recovery plans to provide the control instead of priority. This is, of course, tricky to manage, but it is also powerful.
14. Let replication finish before adding the newly replicated LUN to SRM, and make sure it is visible in the Array Configuration of SRM before attempting to use it.
15. I like the idea of doing a Health Check before starting an SRM project. In particular it is worth doing it on the DR site to make sure it will be healthy when it is required.

16. Each tier 1 application should have its own PG / RP, as well as being in an RP for a larger group, perhaps the whole company.
17. I recommend, if possible, having a 5 GB shadow VM location, so that its size will prevent people from putting real VMs on it by mistake.
18. You should check out the events and tweak them as appropriate. See more on page 48.
19. I have started to only use SQL accounts for vCenter, VUM, and SRM and have been very happy with that. I am starting to think that this should be a recommendation.
20. Do not co-mingle protected and unprotected VMs in the same replicated LUN. This is important, so do not forget it.
21. Do not use multi-extent volumes, since SRM will have issues with them.
22. Most SRM projects will likely benefit, especially when troubleshooting failed tests, from some sort of software like vADM. This is application discovery and mapping software that can tie servers to applications and help you understand what is missing from a test.
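Items 3, 4, and 10 can be combined into a back-of-the-envelope RTO estimate. The Python below is a deliberately crude sketch: the default of 2 VMs per host on at most 10 hosts comes from item 3, but the idea that VMs start in equal "waves", and every timing figure used, are assumptions made only for this illustration. Measured test failover times should always replace these numbers before an SLA is agreed.

```python
import math


def concurrent_power_ons(hosts, per_host=2, max_hosts=10):
    """Default behaviour described in item 3: 2 VMs per host, up to 10 hosts."""
    return per_host * min(hosts, max_hosts)


def rough_rto_minutes(vm_count, hosts, decision_minutes,
                      storage_minutes, minutes_per_wave):
    """Decision time + storage failover + power-on waves (a crude model)."""
    waves = math.ceil(vm_count / concurrent_power_ons(hosts))
    return decision_minutes + storage_minutes + waves * minutes_per_wave


# Hypothetical figures: 100 VMs, 8 recovery hosts, 30 minutes to decide,
# 15 minutes of storage work, and 5 minutes per power-on wave.
print(concurrent_power_ons(8))
print(rough_rto_minutes(100, 8, 30, 15, 5))
```

Even in this simple model the decision time is a large share of the total, which is why it belongs in the RTO.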

Failback Outline
EMC is providing automated failback tools, such as the Celerra plug-in for VC, and other vendors like FalconStor are doing this too. I expect to see more from EMC as well. But it is important to understand what the outline should be so that you understand the big picture better. On page 53 of the SRM 1.0 Admin guide there is a very nice checklist for doing a failback (page 41 in the SRM 4.0 Admin guide). In addition, both NetApp and FalconStor have good documents for doing failback that include both the storage and VMware steps. It is ideal to have one of these documents if possible.

The general idea of failback is to do what you have already done, in reverse. Clean up the Protection Groups and mappings at the previously protected site and the recovery plan(s) at the previous recovery site, and start over. The steps might look like:
1. Cleanup
   a. On the protected side, rescan the HBA; the failed-over VMs and PGs are seen as invalid. Delete them.
   b. On the recovery side, delete the shadow VMs.
2. Configure replication to now go back to the original protected site. Be aware that a number of vendors start replication automatically after a failover, so this may be done already. Some HP SRAs, some EMC SRAs, and HDS do this. Make sure the replication finishes.
3. Set up SRM to fail back, which means you are setting up SRM to fail over to the original protected site.
   a. Reconfigure the array manager for the new direction.
   b. Inventory mappings, etc.
4. Set up the original protected site, but first clean up!
   a. Clean up any artifacts that remain from the original failover and the subsequent failback.
      i. Remove the recovered VMs from VC and delete them from storage at the recovery site.
      ii. Remove the PG and RP you used to fail back.
      iii. Remove the placeholder VMs.
   b. Set up replication.
   c. Configure the array manager.
   d. Inventory mappings, etc.

Bandwidth Usage
I don't have specific numbers, but in general SRM consumes very little bandwidth between the sites. Once protection is up and running and SRM is essentially idle, the bandwidth between the sites should be almost nil (just periodic heartbeat/ping messages, and summaries of changes to the VC inventories). During operations such as protection and unprotection of VMs, there is some traffic between the sites, but I would estimate this to be on the order of 100s of KB per VM during protection, and almost none during unprotection. There can be brief spikes if SRM's connection to the remote VC server drops and gets reestablished, but this should likewise not involve more than 100s of KB per VM. Don't take these numbers as gospel, but I cannot imagine a real-world situation in which SRM bandwidth is not utterly dwarfed by that of the SAN.
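To put the "100s of KB per VM" estimate in perspective, the arithmetic can be written out. The 300 KB per VM used below is an assumed value inside that range, not a measured one.

```python
def protection_traffic_mb(vm_count, kb_per_vm=300):
    """Rough inter-site SRM traffic while protecting vm_count VMs, in MB."""
    return vm_count * kb_per_vm / 1024


# Hypothetical case: protecting 500 VMs at an assumed 300 KB each.
print(round(protection_traffic_mb(500), 1))
```

Under that assumption, protecting 500 VMs moves roughly 150 MB in total between the sites, which is tiny compared with the replication traffic for the same VMs.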

Multiple Tier Applications


These kinds of applications are often not handled well by our serial or parallel recovery models. However, if you take the first tier of each of the multiple tier applications and add them to a recovery plan, then take the second tier of each and add them to a second recovery plan, and then do the same for the third tier and the fourth tier, you will end up with 3, 4, 5, or maybe 6 different recovery plans, and in each one you can use normal or parallel operation for the best speed. The recovery plans ensure that you get the proper tier at the proper time, and you will end up with your multiple tier apps working much faster than if you used serial or parallel operations alone.
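The grouping described above can be sketched in a few lines of Python. The application names, VM names, and plan names are hypothetical, and the sketch only produces the grouping; the recovery plans themselves still have to be created in SRM.

```python
from collections import defaultdict

# Hypothetical applications: each maps tier number -> VMs in that tier.
APPLICATIONS = {
    "Payroll": {1: ["pay-db01"], 2: ["pay-app01"], 3: ["pay-web01"]},
    "CRM": {1: ["crm-db01"], 2: ["crm-app01", "crm-app02"]},
}


def plans_by_tier(applications):
    """Group VMs from all applications into one recovery plan per tier."""
    plans = defaultdict(list)
    for app in sorted(applications):
        for tier, vms in applications[app].items():
            plans[tier].extend(vms)
    # The plans are meant to be run in ascending tier order.
    return [(f"Tier {tier} plan", plans[tier]) for tier in sorted(plans)]


for name, vms in plans_by_tier(APPLICATIONS):
    print(name, vms)
```

Within each resulting plan the VMs have no ordering dependency on each other, so they are free to start in parallel.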

Application References
Are there any application references for SRM and application X? This is the spot where I will start to accumulate links to application SRM support or implementation guides.
SAP - http://www.vmware.com/files/pdf/partners/sap-srm-cx-final.pdf
FUJIFILM Medical Systems - http://www.vmware.com/files/pdf/FUJI_SRM_Final.pdf
PTC Windchill Solutions - http://www.vmware.com/resources/techresources/10064

Protecting View Desktops


This is something that comes up often. In View 4.x it was only possible with extensive and complex scripting that I was not comfortable with. There was too much scripting for safety. However, there are changes in View 4.5 that should allow SRM to work in protecting View desktops. But that has not been developed or confirmed yet. In the meantime, when I talk to customers about this, I provide the solution below, which doesn't require SRM at this time.

A possible solution would involve two URLs: one to the production View environment and one to the DR site View environment. Both environments would use the same version of templates to provide the same application and desktop experience. The DR site would not have the personalization that the users would or could do in the production site. The users' home drives, or shared data drives, would be replicated regularly between the two sites. This would, with some AD help, allow the users to have a new desktop with their personal and shared data at the DR site when necessary. This is not perfect, but it would meet the basic requirement.

Additional information on protecting View can be found in TECH-EUC-301 from PartnerExchange 2011. It is a presentation titled Designing Disaster Recovery for View and was done by Matt Coppinger and Mark Benson. It is very well done and can be found on PartnerCentral.

Physical to virtual disaster recovery - P2V DR


This is something that is often talked about, and VMware is often asked to provide it. It is not something that VMware is currently talking about doing, but it can be done without much pain and suffering, and it will be a little easier in the future. The idea of P2V DR is to have physical machines recovered as virtual machines in a crisis. This is often the most acceptable way for an application to be virtualized. Below is an outline of the steps to make this work. There are at least two software packages that facilitate this, provided by Symantec and Acronis. What is important about these apps is:
Incremental imaging, which makes for short backups that minimize impact
Universal restore, which allows or facilitates restore to a new (virtual) hardware platform

The general outline of P2V DR would be:
1. Install and configure your backup tool.
2. Test the universal recovery in a VM.
3. Configure daily incremental backups.
4. Each morning, manually configure SRM to protect the VMs. If the destination is a replicated LUN, you don't need to do a Storage vMotion.
5. Do a test failover and make sure everything works.

Currently, SRM 4.1 doesn't have an API on the protected side. It is likely that a future release, perhaps the very next one, will have an API, and that would make scripting of some of the steps above possible.

Shared Recovery
This was created for our developers by our developers and has since been released as Shared Recovery. If Site A and Site B are both protected, with recovery at Site C, remember that VMs from Site A would go back to Site A, and the same for Site B. To protect VMs on Site C, you would need another SRM install, and would protect those VMs with another site. It could be A or B or D, but it would need a new SRM instance. Using A or B is a little tricky, since the new instance would be using / seeing ESX hosts that are being used by a different SRM. It would work but is messy, and thus you should use Site D. It would be less messy if different storage were in use compared to A or B. Shared Recovery is mostly targeted at outsourced DR organizations. See the Shared Recovery documentation at http://www.vmware.com/pdf/srm_shared_recovery.pdf .

Failback (plug-ins)
Currently, with SRM 4.x, VMware doesn't have the ability to do an automated failback. Elsewhere in here (page 29) there is a fairly straightforward outline of how to do failback. It is not that hard, but it does have an order of operations to follow. Vendors are now providing failback plug-ins for vSphere. It is important to understand that the majority of them actually do storage failback, and not VM failback. This will improve at some point, but with any and all failback plug-ins, make sure they do start order management and IP Customization (back to the original, no less!). If they do not, they are likely not good enough for your customer. I am not aware at this point (12/31/10) of any vendor plug-ins that can do failback with these two necessary features.

A lost protected site and failing back to it


If you lose your protected site, you can fail over to the recovery site and get your people back working. That is the whole premise of SRM. However, when you bring that lost site back, you will NOT be able to fail back to it, because it will have a completely new infrastructure. You will, however, be able to protect your applications and fail over to that site. You will need to start from scratch: do your replication, build PG / RP, and test. This is because you have lost all of your hardware and software and will need to buy and build new. This new gear will not be the same as the old, so you will not be able to fail your storage back.

SRM Administration Information


The information in this section will help with using SRM.

A sample recovery plan for testing an application


It is sometimes hard to organize the information needed to create or prepare for a recovery plan (RP). Sometimes preparing for a test of the failover is harder than the actual failover. Below is a sample recovery plan. While this example RP is written for testing an app, it can very easily be adapted for doing a real failover. It is important to successfully test an application first, and that is why we have this section!

Exchange Recovery Plan


This is a sample to help you get started.

Goal: Successfully test Exchange

Defined:
1) Send and receive emails using users / groups
2) Book meeting with user / group / room

Required: The various things that would be needed as part of this test would be:
1) Address information: users, groups
2) Credentials to access the accounts; this would need to be current information
3) Domain controller to provide credentials
4) XP desktops including Outlook
5) Mailbox servers, including OWA servers
6) Security servers such as anti-virus or anti-spyware, maybe a PGP server
Note: Make sure to understand "required". We need to make sure we think of all of the upstream or downstream services that this app requires.

Setup: How will we set this up?
1) Domain controller: it will need to be in this plan, or available to it, and it will need to be current so that it has the current AD information. You can use a weekly script to take a cold clone of a DC and move it to the proper test network. It can be a manually executed script.
2) XP workstations need to be available as well.
3) LUN organization needs to support this, meaning that the LUN(s) have all the necessary VMs stored on them. They should not have extra VMs, as that could be a problem in a real failover, since the whole LUN must be failed over.
4) Isolated VLAN: this is important as it allows testing without impact on production resources. You need to understand the IP scheme, and whether you need external connections; you will likely need them for a conference room or something like that. For example, there may be an AIX or other midsize box that needs to be connected to the test network. Make sure it is a partitioned part of that AIX or midsize box before connecting it. The isolated VLAN will need to be added to all the recovery side hosts as well as to the recovery plan.
5) External Resources: this would be anti-spam or anti-virus. Again, you can take hot clones, or if you have a hardware appliance, sometimes it has a spare network port that can be used for the isolated test VLAN, or perhaps there is a spare appliance on the recovery side that you can use.
6) Exchange: this is the subject of the test after all! But do we need to take all of it for the test, or can we take a subset? And what subset should we take?

Test: This is the test plan itself, so after the recovery plan has been executed we would use this information to test the application. A form that is signed after the test would be best.
1) Exchange Test Plan Name:______________ Date: ____________ Pass / Fail: _______
   a. Login with your normal account?
   b. Start Outlook client with no errors?
   c. Access your mailbox via OWA with no errors?
   d. Address message successfully to
      i. Partner (in test)
      ii. Stranger (not in test or in your cache)
      iii. Group
   e. Book meeting successfully with your partner?
   f. Look up phone number for someone?
   g. And so on.

Build Plan (Infrastructure): this is the information to build out the plan and its infrastructure.
1) Isolated VLAN: this covers the network side (cabling and configs) as well as the VI team (virtual switches).
2) DC in or on the VLAN: clone or whatever method you use.
3) XP VMs: they must be built, configured, and have Office on them. They should be tested and in the proper LUNs to be available for the test and during the test.
4) Exchange: we need to get copies of the Exchange servers in the test VLAN.
5) Replication: is it working and is everything in place for us?
6) It is suggested that you keep a detailed to-do list with name / date info to make sure it is done smoothly.

Build Plan (SRM): this covers building out the SRM infrastructure to support this plan.
1) Protection groups: make sure of the proper LUN!
2) Recovery plan: watch the order of recovery, DC first for example.

Approval section: When this test plan is a written document it should have a number of names on it, some for approval, but some for simple communications. This document, when created and approved, is very useful to have at the recovery site.
1) The approval would come from the data owner, who is sometimes called the application owner.
2) Some other info would include:
   a. Network contact
   b. Virtualization / server / operations contact
   c. DR team contact
   d. Application owner test representative contact

Adding scripts to a Recovery Plan in a call out


When you add a script to a call out in a recovery plan, you are given an empty dialog. Use the information below to add a script that will work as expected. It is important to understand that the scripts or commands must be in the path on the SRM server. Use full paths to all executables, for example c:\windows\system32\cmd.exe instead of cmd.exe. You can use .exe or .com files only! Command line scripts can only call executables.

To run a batch file you should start the shell command with c:\windows\system32\cmd.exe. So it would look like c:\windows\system32\cmd.exe /c c:\scripts\alarmscript.bat.

These scripts are executed under the Local Security authority of the SRM server. They can be stored wherever you like, but it is likely best to have them on the local SRM disk and not on a remote network share. Example: Add to a script callout the line:
C:\windows\system32\cmd.exe /C c:\scripts\call.cmd

Have a c:\scripts folder on the SRM server. In it have a batch file called call.cmd that contains:
@echo off
c:\scripts\test.cmd

In the c:\scripts folder have another file called test.cmd and it will contain for example:
@echo off
date /T >> c:\scripts\test.log
time /T >> c:\scripts\test.log
echo Recovery Test %VMware_RecoveryName% Executed! >> c:\scripts\test.log
echo Running in %VMware_RecoveryMode% mode! >> c:\scripts\test.log
echo Executed on %computername% - SRM server >> c:\scripts\test.log
echo VM name is %VMware_VM_Name% >> c:\scripts\test.log
echo ++++++++++++++++++++++++++ >> c:\scripts\test.log

This will execute during test or recovery and create and update a test.log file with the date / time and some additional information. This is an easy example for the purpose of showing you how to call a script. You can do anything you want from inside of the test.cmd file. For more information on the environment variables I am using in this script, please see below for how they can all be displayed, or see the page in the admin guide to learn more.

Remember that the script file is stored on the SRM server, and executed on the SRM server. If you need to make changes inside a VM, you will need to use something like the VIX API, which will allow you to have a script on the SRM server and yet make changes inside of a VM. If you use PowerShell scripts you may experience an odd issue; find it and the solution on page 71.

You can find a blog article on this at: http://blogs.vmware.com/uptime/2010/09/vmware-vcenter-siterecovery-manager-and-scripting-.html and also check out http://blogs.vmware.com/uptime/2010/08/cana-script-or-message-call-out-stop-a-recovery-plan-and-a-little-bit-more.html to learn about script placement.
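The same logging idea can be written in a scripting language rather than batch. The sketch below uses Python purely as an illustration; it assumes a Python interpreter is installed on the SRM server (nothing in SRM provides one) and that the callout starts the interpreter by its full path, since callouts can only call executables. The environment variable names are the ones used in test.cmd above; the sample values are made up.

```python
from datetime import datetime


def callout_log_lines(env, now=None):
    """Build the same log entries as test.cmd from the callout environment."""
    now = now or datetime.now()
    return [
        now.strftime("%Y-%m-%d %H:%M:%S"),
        f"Recovery Test {env.get('VMware_RecoveryName', '?')} Executed!",
        f"Running in {env.get('VMware_RecoveryMode', '?')} mode!",
        f"Executed on {env.get('COMPUTERNAME', '?')} - SRM server",
        f"VM name is {env.get('VMware_VM_Name', '?')}",
        "+" * 26,
    ]


# Made-up values standing in for what SRM would place in the environment.
sample_env = {
    "VMware_RecoveryName": "Exchange Test",
    "VMware_RecoveryMode": "test",
    "COMPUTERNAME": "SRM01",
    "VMware_VM_Name": "exch-mbx01",
}
for line in callout_log_lines(sample_env):
    print(line)
```

In a real callout you would pass os.environ instead of the sample dictionary and append the lines to a log file on the local SRM disk.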

What should the PowerShell command look like to have it called from SRM?
You can think of this as a scheduled event but rather than Windows executing it on a schedule it is executed by SRM as required. So write your PowerShell command line as if you were going to put it into a Scheduled Task. But instead put it in the test.cmd file above. You will need to have PowerShell and PowerCLI installed on the SRM server remember! See the example below:

C:\WINDOWS\system32\windowspowershell\v1.0\powershell.exe -PSConsoleFile "C:\Program Files\VMware\Infrastructure\vSphere PowerCLI\vim.psc1" "& 'C:\Scripts\MyScript.ps1'"

See more info on this at http://www.virtu-al.net/2009/07/10/running-a-powercli-scheduled-task/.

How can I see the environment variables that the admin guide says are available for scripts?
The environment variables that SRM puts into the environment during the test are listed in the admin guide on page 51. But if you wish to see them in action, you can use the command below.
C:\windows\system32\cmd.exe /C set

This command will echo all the environment variable values to the SRM log file.
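The full environment dump is long, so it can help to filter it down to the variables SRM adds. The Python sketch below assumes those variables all share the VMware_ prefix, which is inferred from the three names used in the test.cmd example and should be checked against the admin guide. The comparison ignores case because Windows environment variable names are not case sensitive.

```python
import os


def srm_variables(env, prefix="VMware_"):
    """Return only the variables whose names start with the given prefix."""
    return {name: value for name, value in sorted(env.items())
            if name.lower().startswith(prefix.lower())}


# Prints nothing unless it is run from an SRM callout (or a shell where
# such variables have been set by hand for testing).
for name, value in srm_variables(os.environ).items():
    print(f"{name}={value}")
```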

Can a script execution in a recovery plan impact the inside of a protected VM?
The scripts that are executed by an RP are held on the local hard disk of the SRM server, but they can use the VIX API library to impact the inside of a VM. For more see http://communities.vmware.com/community/developer . There is no other way I am aware of to have a script execute on the SRM server console yet impact the inside of a VM. If the script is inside the VM, then SRM alone cannot execute it, and the audit trail that SRM provides will not record the execution of the script.

Will a non-zero script exit in a recovery plan stop the recovery plan?
Both the SRM 1.x documentation and the beta documentation for the next major release state that if a script callout in a recovery plan ends with a non-zero return value, it will stop the recovery plan from finishing. This is a documentation bug, and is NOT correct. It will be deleted from the SRM beta documentation before GA. Check out http://blogs.vmware.com/uptime/2010/08/cana-script-or-message-call-out-stop-a-recovery-plan-and-a-little-bit-more.html for more info on this.

User designed callout has returned a non-zero value: 1


This occurred in my lab in interesting circumstances. I had freshly installed SRM again. I used scripts I had used before with no issues, but now during a test failover the error above occurred. Very odd. My SRM server was now Win2K8 R2 instead of Win2K3, and the test.cmd batch file I was trying to execute was actually named test.cmd.txt. I had to rename it at the DOS prompt (there must be a better way) and the error went away. I think this name change occurred when I copied the script over. I probably would not have seen this error if I had recreated the script.
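A hidden extra extension like this is easy to miss in Explorer, so a quick check of the script names can save some time. The Python sketch below is an illustration only; the list of acceptable extensions is an assumption based on the batch file examples in this guide.

```python
from pathlib import PureWindowsPath

# Script types expected for callouts launched through cmd.exe (an assumption).
EXPECTED_SUFFIXES = {".cmd", ".bat"}


def script_name_problem(path):
    """Return a warning string if the file name looks wrong, else None."""
    suffixes = [s.lower() for s in PureWindowsPath(path).suffixes]
    if not suffixes or suffixes[-1] not in EXPECTED_SUFFIXES:
        return f"{path}: expected a .cmd or .bat file"
    return None


for candidate in (r"c:\scripts\test.cmd", r"c:\scripts\test.cmd.txt"):
    print(candidate, "->", script_name_problem(candidate) or "ok")
```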

What VM parameters are not failed over?


This may not sound logical, but there are things about a VM that are NOT failed over. It is not obvious or expected. Here are the things to be aware of:
1. If you have any fields in Annotations (VM Summary tab), they will not be failed over. This is because these fields are not stored in the vmx file.
2. If you have any text in the Notes field in the Annotations area, it will be failed over.
3. VM permissions will not be failed over. The thinking was that there may be a different security approach in the failover center.
4. Resources: things like memory / CPU reservations / shares are not failed over. The thinking was that the resource decisions / standards on the DR side would be different than on the protected site. However, there is a workaround: when a failover occurs, the resource configuration of the shadow VM is copied to the recovered VM. So you can edit a shadow VM to give it the resource configuration that is important, and it will be copied to the VM during the recovery operation.

Does the number of PGs impact the start order of high priority VMs?


No. It turns out that SRM will look at all the protection groups that are in the RP and start all of the high priority VMs followed by the normal / low as if there was only 1 PG.

What about backing up the SRM databases?


You need to keep the two SRM databases in sync. So back them up in a way that lets you restore both of them, with what is restored still in sync.

Can I change the Run button to work like the Test button?
I am setting up SRM for a computer show, and I don't want anyone to use the Run button, and I am not sure about using roles / permissions to manage this. Is there another way? If you are using the current GA version of 4.1, or later, you do have a configuration file option that can do this. In the vmware-dr.xml file on the recovery side, locate the <RecoverySecondary> section and add to it an indented line that is <testOnly>true</testOnly>. You will then have a Run button that looks like Run when you execute it, but it is in fact a test. The history report will confirm that. To revert to the normal behavior, change the true to false or remove the line you added. Since this change is made directly in the vmware-dr.xml file, you will need to restart the SRM service. Make sure you make this change on the recovery side.
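As a minimal sketch of the testOnly change described above, the Python below adds (or updates) the testOnly element inside RecoverySecondary using the standard library's xml.etree.ElementTree. The sample XML string is a hypothetical, stripped-down stand-in for vmware-dr.xml: the real file has many more sections, and the root element name Config used here is an assumption. Only the RecoverySecondary and testOnly names come from the text above.

```python
import xml.etree.ElementTree as ET

# Hypothetical, minimal stand-in for vmware-dr.xml; the real file is much larger
# and the root element name "Config" is an assumption for this sketch.
SAMPLE_CONFIG = """<Config>
  <RecoverySecondary>
  </RecoverySecondary>
</Config>"""


def set_test_only(xml_text, enabled=True):
    """Return xml_text with <testOnly> set inside <RecoverySecondary>."""
    root = ET.fromstring(xml_text)
    section = root.find(".//RecoverySecondary")
    if section is None:
        raise ValueError("no <RecoverySecondary> section found")
    node = section.find("testOnly")
    if node is None:
        # Add the element only if it is not already there.
        node = ET.SubElement(section, "testOnly")
    node.text = "true" if enabled else "false"
    return ET.tostring(root, encoding="unicode")


print(set_test_only(SAMPLE_CONFIG))
```

If you try something like this against a real configuration file, work on a copy first, and remember that the SRM service still has to be restarted on the recovery side before the change is read.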

Can I use VMware Heartbeat to protect SRM and VC?


You can use HB to protect VC that is hosting SRM, but it will not yet protect SRM. Find out how to do this at http://kb.vmware.com/kb/1014266 .

How can I capture the log and configuration information for support to work with?
This is most easily done after Update 1 by using the Generate Site Recovery Manager Log Bundle command in the VMware \ VMware Site Recovery Manager Start Menu folder. Run this command on the SRM server. It will produce a zipped file on your desktop, named in a MM-DDYYYY-HH-MM.zip format, where the fields are Month, Day, Year, Hours, and Minutes. Always provide the logs with your request for help! I strongly recommend you use this method. Very often people send support just one of the log files, and support will not be able to help with that; they will need to wait for the other logs. Please, always send the entire log bundle that is created with this tool. It captures things like core dumps and configuration info as well as all of the log files!

Where are the SRM server logs stored?


They can be found in:

C:\Documents and Settings\All Users\Application Data\VMware\VMware Site Recovery Manager\Logs

You will need to check the vmware-dr-index file to see what is the current log file. Make sure to confirm the number from the index file to make sure you are working with the proper log file. In SRM 4.1 (4.0) the currently used log file will not be zipped, and the other files not in use will be zipped. For SRM 4.1 logs on Win2K8 R2 servers you can find the SRM log location below.
C:\ProgramData\VMware\VMware vCenter Site Recovery Manager\Logs

How do I capture the SRM plug-in log and config info?


This can be accomplished by using the command below when you are in the folder that contains the SRM plug-in. By default it would be c:\Program Files\VMware\Infrastructure\Virtual Infrastructure Client\Plugins\VMware Site Recovery Manager. The command is cscript srm-plugin-support.wsf . The script will produce a zip on your desktop. This is used when you are, for example, trying to solve an issue with UI scalability.

Where are the Linux Image Customization logs stored?


They are kept in /var/log/vmware/imc and /var/log/vmware-imc folders.

I would like to retain the SRM logs longer


With the default settings, an SRM log file will grow to 5 MB, then another log is started and the previous one is gzipped. There is a limit of 10 files. If something is happening that is generating a lot of info for the log files, you could end up rotating through the 10 files and lose something important. Use the steps below to increase the size of the log files and the number of log files you keep, and make sure to do this on both sides! These settings are not part of Advanced Settings, so you will need to change the vmware-dr.xml file and restart the SRM service.
1. In the SRM program folder there is a config folder. Locate it.
2. In the config folder you will find the vmware-dr.xml file, which is the SRM configuration file.
3. Open the vmware-dr.xml file, and find the log section, which is denoted by <log>.
4. Add the following lines between <log> and </log>:
a. <maxFileSize>x</maxFileSize>
b. <maxFileNum>y</maxFileNum>
5. x is the value for the maximum file size.
6. y is the value for the maximum number of files.
When you are finished it should look like the figure below.


These changes will not be active until you restart the SRM service. Make sure no one is using SRM before you do that! Also, don't forget to do this on both sides. In the example above, we are changing the settings to 10 MB in size, and keeping 100 copies. Remember that the 10 MB files will be gzipped to a very small size.
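The log rotation steps above can also be sketched in Python with xml.etree.ElementTree. The sample XML is a hypothetical cut-down stand-in for vmware-dr.xml (the root element name Config is an assumption), and expressing 10 MB as a byte count is also an assumption: the guide does not say which unit maxFileSize uses, so confirm it against your own file before relying on it.

```python
import xml.etree.ElementTree as ET

# Hypothetical, minimal stand-in for the <log> section of vmware-dr.xml.
SAMPLE_CONFIG = """<Config>
  <log>
    <level>verbose</level>
  </log>
</Config>"""


def set_log_rotation(xml_text, max_file_size, max_file_num):
    """Set <maxFileSize> and <maxFileNum> inside the <log> section."""
    root = ET.fromstring(xml_text)
    log = root.find(".//log")
    if log is None:
        raise ValueError("no <log> section found")
    for tag, value in (("maxFileSize", max_file_size), ("maxFileNum", max_file_num)):
        node = log.find(tag)
        if node is None:
            # Create the element when it is not present yet.
            node = ET.SubElement(log, tag)
        node.text = str(value)
    return ET.tostring(root, encoding="unicode")


# ASSUMPTION: the size is given in bytes; the guide only says "10 MB".
print(set_log_rotation(SAMPLE_CONFIG, 10 * 1024 * 1024, 100))
```

The function leaves every other element in the log section untouched, so an existing level or directory line survives the edit.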

What happens when


This section will cover off a number of specific questions on how SRM handles things.

I add a new hard drive to an existing and successfully protected VM?


Nothing happens. It works. Neither the protection group nor the recovery plan requires any changes. You do need to wait for replication to finish before testing this.

I add CPU and memory to an existing protected VM


Nothing happens. It works. I did wait for replication to occur before testing but the test was fine.

I add a network card to an existing protected VM


In this case I did have a VM displaying that something needed to be configured. So I right clicked on it and selected Configure and that was it. After replication I did a test recovery and it worked fine. Nothing in the RP needed to be configured.

I add a new VM to an existing protection group


I had a warning that the VM was not protected in the PG. So I right clicked on it and selected Configure and it was fine. You can have an email alert (or an SNMP trap) when a new VM is added to an existing PG. This new VM was at the bottom of the list of protected VMs in the Priority level that was specified when configuring. I did have to configure the PG to protect this VM, but it required no changes in the RP.

I remove a protected VM from a protection group


This will generate an Invalid setting for the VM in the PG it used to be part of. You can right click and select remove protection and nothing else is required.

What travels with VMs between PG and recovery plans?


When I make changes to things like callout scripts or IP customization for a VM, and the VM becomes part of a different recovery plan, is anything impacted? No. When you set up callout scripts or IP customization files for a VM while it is in a PG attached to a RP, and that PG is then attached to a different RP, nothing like callouts or IP customization is lost. This situation is often seen when you have one application in one or more protection groups attached to a recovery plan that is specific to that application, and those protection groups are also attached to a companywide recovery plan.

How can I tell the SRM version from the log files?
The first line of the SRM log files will hold the release info. The version=1.0.0 tells the version and build=build-97878 tells the build. One exception to this is SRM 1.0 Patch 3. It didn't change the build level, and thus the log file will not reflect the proper build. You will need to check Add / Remove to see if Patch 3 has been installed or not. I am told that this is now tested by QA so it should not be missed again. It is certainly on my test list!
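If you need to check many log files, the version and build tokens described above can be pulled out with a small regular expression. This is only a sketch: the sample line is hypothetical, and the only parts taken from the guide are the version=... and build=build-... tokens. The rest of the first line's layout is assumed.

```python
import re

# Hypothetical first line; only the version=... and build=build-... tokens
# come from the guide, the surrounding text is an assumption.
SAMPLE_FIRST_LINE = "Log for VMware Site Recovery Manager, version=1.0.0, build=build-97878"


def parse_release_info(first_line):
    """Return (version, build_number) found in the first line of an SRM log."""
    version = re.search(r"version=([\d.]+)", first_line)
    build = re.search(r"build=build-(\d+)", first_line)
    if not version or not build:
        raise ValueError("no version/build tokens found")
    # Strip a trailing dot in case the token is followed by sentence punctuation.
    return version.group(1).rstrip("."), int(build.group(1))


print(parse_release_info(SAMPLE_FIRST_LINE))
```

Keep the SRM 1.0 Patch 3 exception in mind: the build reported this way will not show whether that patch is installed.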

Installation logs
SRM 1.0
You can create an installation log using the command line parameters /s /Vlve installlog.txt. The command line will look like:
VMware-srm-1.0.0-<build_number>.exe /s /Vlve installlog.txt

SRM 4.0
Installation logs are always created by default and can be found in C:\Documents and Settings\<user name>\Local Settings\Temp\vmsrminst.log. On Win2K8 R2 the installation logs will be in a different location. That location is:
C:\users\Install_user\AppData\Local\Temp

You can also generate full logs with the command below, but you will need to execute it from the command line. The log file will be generated in the folder where you execute the command.
VMware-srm-4.0.0-192291.exe /V/lve installfull.log

Automated Install
If you would like to have an automated install, you can use the following command line, but remember to add your own information to it!
vmware-srm-<version information>.exe /s /v"/qn AgreeToLicense=Yes DR_CB_HOSTNAME_IP=<DR hostname> DR_TXT_VCHOSTNAME=<VC hostname> DR_TXT_VCUSR=<Windows user> DR_TXT_VCPWD=<Windows user password> DR_TXT_LSN=<site name> DR_TXT_ADMINEMAIL=<administrator's e-mail address> DR_CB_DC=<SQL Server|Oracle> DR_TXT_DSN=<System DSN> DR_TXT_DBUSR=<DB user> DR_TXT_DBPWD=<DB user's password> DR_RB_CERTSEL=1 DR_TXT_CERTORG=<Arbitrary organization name> DR_TXT_CERTPWD=<arbitrary password> DR_TXT_CERTFILE=\"C:\Program Files\VMware\VMware vCenter Site Recovery Manager\bin\<VC hostname>.p12\" DR_TXT_CERTORGUNIT=<Arbitrary organization Unit> VC_CERTIFICATE_THUMBPRINT=<untrusted VC certificate thumbprint> DR_TXT_PLUGIN_DESC=<extension description> DR_TXT_PLUGIN_COMPANY=<company name>"

Changing log details


You can easily change the log detail level by editing a configuration file. However, to have that change read by SRM you will need to restart the SRM service. The file name is vmware-dr.xml and is found by default in C:\Program Files\VMware\VMware Site Recovery Manager\config . Remember that when you restart the service that you will interrupt anyone working with SRM.

Look for the line that looks like:
<directory>C:\Documents and Settings\All Users\Application Data\VMware\VMware Site Recovery Manager\Logs</directory>
Below it you will find a line that looks like:
<level>verbose</level>
You can change verbose to trivia, which will generate more entries, or to info, which generates fewer. From least to most reporting the options are: error, warning, info, verbose, and trivia. It is important to understand that if you increase the level of detail, the logs will fill faster, the files may rotate, and you may lose what you need. You can change the rollover settings by using the information on page 37.
You can set a different level of logging at the sub-component level. You can have a default level of verbose for the overall log file, but one component could be set to something more detailed. Look for the sections in the config (vmware-dr.xml) file with the names below. Some of the interesting component levels are:
Vmware-dr (DR service)
PrimarySanProvider (protected side array manager)
SecondarySanProvider (recovery side array manager)
SanConfigManager (managed storage configuration datastore computation)
You should confirm that changes like this are seen. The change should be visible in the log as SRM starts, so you can confirm that the change you made has been accepted.

I would like to have an automated SRM type solution without SRM


It would not be automated, but it would be functional through the use of scripts and senior experience in virtualization / scripting / storage. It would work well, and of course virtualization is what makes it work well. Full information on this can be found in http://www.vmware.com/resources/techresources/1063 . The amount of work should scare anyone into purchasing SRM!

How can I have SSL communications between SRM and NetApp


By default the communication between SRM and NetApp is not secured or encrypted. This has been difficult to encrypt using SSL. It is now much easier since our great guy in Cork has documented it (thanks Cormac). Find it at http://communities.vmware.com/docs/DOC-11545 .

What happens when I Storage VMotion a protected VM, or how do changes to VM storage affect protection?
This is a very complicated area. For the most detailed and complete information please see the wonderful KB article at http://kb.vmware.com/kb/1009900. But here is some of the key information.
A VM, to be recovered safely, must have all of the datastores that its storage uses recovered at once. Because of this, any changes to storage may require editing the PG.
Storage VMotion, or even migration across PG boundaries, is generally not good, and you will need to revisit those VMs to confirm their protection: select the Virtual Machines tab in the RP and clear any unconfigured errors, or do the same in the original PG or the new PG.
If the protected VM is migrated to a replicated datastore which is not part of any Protection Group, it will stay protected and its datastore will be added to the PG.
If the protected VM is migrated to a replicated datastore which is part of some other PG, then the VM will become invalid and a user will need to re-protect it. I believe it will show a little yellow triangle.

What should I know about using the bulk IP utility?


You should make sure your recovery plan is done so the utility can pull out the VM info. Page 53 of the Admin guide provides additional info, but some useful tidbits are here:
Use the generate parameter to pull info down in the form of a CSV file.
Use the create parameter to push up info after you tweak the file. If the file has already been used, only new info will be executed.
Make sure that each line for each VM has the VM name, VM ID, and the Adapter ID in it!
Only new customization info is implemented in a create.
The recreate parameter deletes then creates.
A sample command would look like dr-ip-customizer.exe --cfg ..\config\vmware-dr.xml --csv c:\example.csv --cmd generate . You would take example.csv, open it in Excel, and add your IP changes.
The Adapter 0 reference is for all of the potential network cards that may be in a VM. The only information to add to this line is the domain suffix or DNS server.
For each line, the Adapter 1 reference is for the actual virtual network card in a VM, and you can fill in the DNS domain, IP address, subnet mask, gateway, and DNS server.
Do not use the spacebar to clear any fields in the CSV file. It will generate confusing errors.
For Linux, do not put DNS domain info on Adapter 1, but rather on Adapter 0.
If the VM has been deployed from a template and has never been turned on, and you try to customize its IP during a test or failover, it will take perhaps 10 minutes. You can solve this by turning it on and letting sysprep finish.
It appears that if there are values in both line 0 and line 1, the value in line 1 wins. I have not been able to test this as well as I would like.
If you need to specify multiple default gateways, you will need to have an additional line to specify the second gateway.
The account name field for this utility is approximately 25 characters long. This will be increased soon (early 2010).
If you use DHCP in the IP column you will not need to add other info.
You cannot assign two IPs to the same network card with this utility.

Troubleshooting IP Customization
If you are having serious issues with IP Customization, especially when the VM times out, you can check the logs that are in the VM. Use this process.
1. Run a test recovery.
2. Insert a Message prompt before any VM would power on. Wait for the Plan to stop at the Message Prompt.
3. Make a note of the time that the VM recovery timed out.
4. Use the VI Client to log into one of the VMs that timed out.
5. Make note of the time that you logged into the VM. Also make note if customization kicked off when you logged in.
6. Inside the VM go to c:\windows\temp\vmware-imc (this path may change depending on the version of Windows). You may need to find the temp folder to locate the vmware-imc folder.
7. There should be several log files in this directory. You will need them.
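Two of the bulk IP utility tips earlier in this section (every row needs the VM name, VM ID, and Adapter ID, and no field should be "cleared" with the spacebar) are easy to check before you push a file back up. The sketch below uses Python's csv module. The column headings and sample rows are assumptions for illustration only, so match them to the header row that the utility actually generates in your version.

```python
import csv
import io

# ASSUMED column headings; check them against the file the utility generates.
REQUIRED_COLUMNS = ("VM ID", "VM Name", "Adapter ID")

# Hypothetical sample data: the last row has a missing name and a space-only field.
SAMPLE_CSV = """VM ID,VM Name,Adapter ID,IP Address,Subnet Mask
vm-101,web01,0,,
vm-101,web01,1,10.0.0.15,255.255.255.0
vm-102,,1, ,255.255.255.0
"""


def find_csv_problems(csv_text):
    """Return a list of (line_number, message) for rows that break the rules."""
    problems = []
    reader = csv.DictReader(io.StringIO(csv_text))
    for line_number, row in enumerate(reader, start=2):  # line 1 is the header
        for column in REQUIRED_COLUMNS:
            if not (row.get(column) or "").strip():
                problems.append((line_number, "missing " + column))
        for column, value in row.items():
            # A field holding only spaces was probably "cleared" with the spacebar.
            if isinstance(value, str) and value and not value.strip():
                problems.append((line_number, "whitespace-only value in " + str(column)))
    return problems


for line_number, message in find_csv_problems(SAMPLE_CSV):
    print("line", line_number, ":", message)
```

A check like this does not replace the utility's own validation. It only catches the two mistakes called out in the tips before they turn into confusing errors.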


SRM Licensing Information


This section will detail information about SRM licensing.

How does the SRM 4.1 licensing work?


Before sometime in September (2010), it works exactly like SRM 4.0 licensing described below. However, after that point in time, it will be based on per VM licensing. So if you have 10 protected VMs you will need to have 10 per VM SRM licenses. Likely in the next update of this guide I will provide pictures. This is, however, an easy system to understand: how many VMs do you want to protect (in other words, have in a protection group)? The Licensing Reporting Manager will help, and there are alarms (page 48) that can alert you if you go outside what you have licenses for. It should be noted that through to December 15, 2010 you can purchase either per proc or per VM licenses, and you can use either but not both. You can find out more in http://blogs.vmware.com/uptime/2010/09/vmware-vcentersite-recovery-manager-and-per-vm-licensing.html . If the licenses expire, there will be no failover possible. Unlike SRM 1.x, the service will still start. SRM will total the number of protected virtual machines and every 24 hours report if there are more than are licensed. The same check happens every time a VM is protected or unprotected.

How does the SRM 4.0 licensing work?


After SRM 4.0 is released, and before vSphere Update 1 is released, SRM licenses will be added in the License section of the Advanced Settings, which you access via a <right + click> on Site Recovery in the navigation pane (see below). Once vSphere Update 1 is released and installed, this option will not be available (it will be invisible). If you had SRM before Update 1, your license information will be migrated to the Solution Licensing area of vSphere, but if you are installing SRM new you will enter the license info in the vSphere Solution Licensing area. If during an upgrade to vSphere Update 1 you don't see SRM in the License area, you will need to restart the SRM service.

SRM 1.0 licenses will not work in SRM 4.0, but new licenses can be obtained from the customer license portal by customers who have registered their existing SRM licenses. We do not use Flex licensing any longer in SRM. There is no longer a host license, and SRM will NOT require a license to work but only to protect VMs; that is a 25-character license that defines what can be protected. The SRM server will continue to work even if it becomes unlicensed: SRM works, but no failovers. There is no cross-site license communication, so licenses will need to be installed at both sites if appropriate.

Evaluation licenses are checked once per 24 hours to see if they are still active, and this check is not done when there are no evaluation licenses or they have expired. Expiring licenses are managed the same way. Protected VMs are counted whether turned on or not, and the state of protected assets is reported to VC every five minutes. The SRM license in vSphere Update 1 or later will look different. See below for an example.

How does the SRM 1.0 licensing work?


The philosophy behind the SRM licensing is to do basic checks to help a customer ensure that they maintain license compliance, but not to attempt to strictly enforce compliance. Given that SRM is a DR product, the last thing we want is any possibility that a failover would fail due to a license compliance check. SRM licenses are pooled rather than assigned to specific host CPUs, and most elements of license usage are handled through periodic reporting rather than through check-in / check-out operations. The key elements in the current SRM implementation are:
SRM comes with a built-in 60 day evaluation license.
We require an SRM server license (PROD_SRM) for the SRM server to start on protected and recovery sites.
We periodically take the list of all VMs in all recovery plans, look to see which ESX hosts they are currently running on, count the number of CPU sockets in those hosts, and then compare that against the number of host capacity licenses (SRM_PROTECTED_HOST) in the license file.
If there are insufficient licenses, we create an alert / warning in VC reporting an insufficient licenses error but don't take any specific action. The customer is responsible for ensuring that they aren't using more host CPUs than they have licenses for; SRM doesn't try to control that.
If the SRM license expires a failover would still work, until the SRM service was restarted. After the restart no failover would work.
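The periodic host capacity check described above can be illustrated with a few lines of Python. This is a sketch of the counting logic only, not SRM's actual code: the inventory dictionaries and names are hypothetical, and it assumes each host is counted once however many recovery plan VMs it runs.

```python
def count_required_host_licenses(vm_to_host, host_cpu_sockets):
    """Count CPU sockets across the distinct hosts that run recovery plan VMs."""
    hosts_in_use = set(vm_to_host.values())  # ASSUMPTION: each host counts once
    return sum(host_cpu_sockets[host] for host in hosts_in_use)


def license_warning(vm_to_host, host_cpu_sockets, licensed_sockets):
    """Return a warning string if usage exceeds licenses, else None.

    Mirrors the report-only behavior: nothing is blocked, a warning is raised.
    """
    required = count_required_host_licenses(vm_to_host, host_cpu_sockets)
    if required > licensed_sockets:
        return "insufficient licenses: need %d, have %d" % (required, licensed_sockets)
    return None


# Hypothetical inventory: VMs in recovery plans and the hosts they run on.
vm_to_host = {"web01": "esx-a", "db01": "esx-a", "app01": "esx-b"}
host_cpu_sockets = {"esx-a": 2, "esx-b": 4, "esx-c": 2}

print(license_warning(vm_to_host, host_cpu_sockets, licensed_sockets=4))
```

In this made-up inventory, host esx-c runs no recovery plan VMs, so its sockets are not counted.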

What does it look like if my VI is licensed for SRM?


See the screen shot below for an example of a licensed SRM install.


If you do not see the licenses you expect, this might be due to an odd issue that SRM has with licensing. While it uses FLEX licensing, if you only drop off the .lic file in the Licenses folder and reread the license file(s) you will not see something like the screen above until you restart SRM!

What does it look like if my vSphere is licensed for SRM after Update 1?
See the screen below for an example of a licensed SRM install.

What will happen if my license expires?


First, it is recommended to use SRM alerts to make sure you are not surprised by your SRM licenses expiring! In SRM 1.x, if the licenses expire you can still fail over, unless you restart the SRM server at the protected side after the license expires. After such a restart the SRM service will not start. In SRM 4.x failover will not work if the license expires, but the service will still start.

What is the account that is asked for during install used for?
The 1.0 installer prompted for a username during installation. This is the account SRM will use to communicate with the local VC server. Since SRM constantly monitors the local VC inventory, this user will be constantly logged into the local VC server. Changing the password for this account will make it impossible to use SRM. Please note that this should be an account in the Administrators group. By default, when you install SRM 1.0 or SRM 1.0 U1, all accounts in the Administrators group have complete access to SRM managed objects. Again, this has not changed with U1. Please try to use AD accounts when you install SRM, and when you log into SRM. Using local accounts can work, but it is a little tricky. If you need some guidance on using local accounts I can help. This account is NOT the account that the SRM service runs as; the SRM service uses the Local System Account.

Are Essentials and Essentials Plus supported for SRM?


This is an interesting question. I have heard it a number of times. This is a great example of how people don't look for answers, but rather just ask for the answer. If you check the compatibility guide, these are not mentioned and thus are not supported. And I checked to confirm they are not supported! The lesson here is that if something is mentioned in our compatibility guides it is supported (perhaps with caveats that would be listed), but if it is not mentioned it is not supported.

How do I plan for disk utilization due to SRM database?


Recently we brought out the database-sizing tool. Find it for SRM 1.x at http://www.vmware.com/files/pdf/Site_Recovery_Manager_1.0U1_Database_Sizing_Calculator.xls . You can find the SQL one for SRM 4.x at http://www.vmware.com/files/pdf/Site_Recovery_Manager_4.0_Database_Sizing_Calculator_SQL.xls . The Oracle one for SRM 4.x can be found at http://www.vmware.com/files/pdf/Site_Recovery_Manager_4.0_Database_Sizing_Calculator_ORACLE.xls .

I would like to use trusted certificates with SRM - help!


You can use your own trusted certificates with SRM but it is more complicated than you might expect. There is some excellent information to help you be successful at http://viops.vmware.com/home/docs/DOC-1261 . The new URL path is http://communities.vmware.com/docs/DOC-11411 . Also be aware of http://kb.vmware.com/kb/1008390 and http://kb.vmware.com/kb/1021031 .

Can I change the IP information for the SRM server?


SRM 4.x
You can use Add / Remove in the Control Panel to start the install tool and redo all of the install configuration information.
SRM 1.x
I would like to change the IP info for the SRM server once it is installed. Is this safe or is there a specific way to do this without issues? When the IP info for the SRM server or the credentials (account or password) need to be changed, you will need to use a special utility to make either change. Once the change is done you will also need to pair the two sites again. You can find detailed info on how to do this on page 85, in Appendix C of the SRM Admin Guide.

Can network customization work for operating systems other than Windows?
Yes. This includes operating systems from Novell, and Red Hat. The specific version information can be found in the SRM Compatibility Matrix document. SRM 4.0 adds in Ubuntu as well to the Linux flavors that can be customized.

Understanding order of operation for bringing VMs back online


During the recovery period, the order in which VMs are recovered is not as obvious as you might expect. Normal and Low priority protection groups (VMs) will be started one VM per ESX host, up to a limit that varies according to the version of SRM / VC (see the next point, How many VMs can SRM start?). So you could have a number of Normal priority VMs starting at the same time but spread across various ESX servers. However, High priority VMs are started serially regardless of how many hosts are involved. Misconfiguration of the security for storage arrays may impact the start order of VMs. For example, if the security of the array means it cannot talk to a particular ESX host, then that host will not be used to start VMs during a recovery plan. It is possible to see this without any obvious error messages!

How many VMs can SRM start?


This is something you may need to be aware of when you have a very large SRM install. If you have 45 ESX servers at the recovery side and you expect to use all of them to restart recovery VMs, it will not happen. VC 2.5 has a limit of 16 VMs started at the same time, and SRM shares that limit, which means 16 hosts can power on VMs. With SRM 4.0, since it works with vSphere 4.0, there is a limit of 20 VMs powering on at the same time, and thus with SRM 4.0 starting two VMs per host, only 10 hosts can start VMs simultaneously. See the information below to change the number of concurrent power on VMs value.
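The arithmetic behind those host counts is simple integer division, sketched below in Python with the figures quoted above. The function name is made up for illustration.

```python
def hosts_powering_on(concurrent_vm_limit, power_ons_per_host):
    """Number of hosts that can be powering on VMs at the same moment."""
    if power_ons_per_host < 1:
        raise ValueError("power_ons_per_host must be at least 1")
    return concurrent_vm_limit // power_ons_per_host


# Figures quoted in the text above.
print(hosts_powering_on(16, 1))  # VC 2.5 with one VM per host
print(hosts_powering_on(20, 2))  # vSphere 4.0 with SRM 4.0 starting two VMs per host
```

This is why a recovery site with 45 ESX servers will not see all of them starting VMs at once: the concurrent power on limit is reached long before every host is used.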

Can I start more than, or less than, 2 VMs per host?


Yes. With the current version of SRM (4.0.1 and later) you have access to a vmware-dr.xml file setting that can tweak the number of concurrent power ons for a host. Bear in mind that Virtual Center rules apply, and that this will not solve all issues. In my lab with Nehalem processors I can easily start more than 2 VMs concurrently, but if you have small processors and small amount of RAM per host, you may want to start less than 2 VMs concurrently. Do not change the default unless you have carefully thought about it, and understand the impact! Also be aware this is not a supported change (currently) but it is always easy to change back if necessary. This change is not in the UI so it does require a restart of the SRM service. You will likely need to add a new section to the vmware-dr.xml file on the recovery side. If it is already there that is fine. In the <config> section add:
<Recovery>
<powerOnsPerHost>x</powerOnsPerHost>
</Recovery>

Where x is the value of the number of concurrent power on operations. Default is 2.

What does the Repair button do?


The Repair button is used when the protected site is not available and some array reconfiguration is required. Normally the reconfiguration would be done at the protected site, but if it is not available then the Repair button can be used. An example of when to use it is when the protected site is gone and you realize that last week you changed the storage credentials, and that is now stopping you from recovering. The Repair option would allow you to correct the credentials and continue with the failover recovery.

Is it all over when the recovery plan fails?


You can have a test recovery plan fail with some sort of error, but it will complete anything that it can complete. You could then address and solve the error and run the recovery plan again, and if you have correctly addressed the error your test may in fact complete correctly this time. It will not redo things that it has already done correctly. Once I had a problem with a VM starting, so I let the replication finish, did a manual HBA refresh, and tried again. The two VMs that had already started were not touched, but the third VM, which had just finished replicating, was in fact started. In a non-test failover this may behave differently, as it depends on the storage and the stage at which the issues occur.

Can I move an SRM server to a new host?


This is possible but requires a number of detailed steps. It would be good to avoid if possible but it can be done. Full info can be found at http://kb.vmware.com/kb/1008426 .

How can I configure a second HBA rescan?


I have been told that my particular array will need a second rescan for my failovers to work. HP has confirmed this is one of their requirements.
SRM 4.x
This can be done in the Advanced Settings that are available after a <right + click> on the SRM lightning bolt icon. See the graphic below.

Advanced Settings right click on Site Recovery seen at top left

SRM 1.x
This is easy and can be configured. Use the steps below:
1. Edit the vmware-dr.xml file on the protected side.
2. Add a <hostRescanRepeatCnt> element in the <SanProvider> element. The value of <hostRescanRepeatCnt> should be set to 2.
3. Make sure no one is using the SRM Plug-in, and restart the SRM service.
4. Now do the same thing on the recovery side.
Below is an example.
<SanProvider>
. . .
<hostRescanRepeatCnt>2</hostRescanRepeatCnt>
</SanProvider>

SRM 4.x and 1.x
You should confirm that changes like this are seen. The change should be visible in the log as SRM starts, so you can confirm that the change you made has been accepted. See http://kb.vmware.com/kb/1008283 as it is now in the kb.

Recommended minimum alarm notifications


We suggest the following alarm notifications. You can set them on the Alarm tab of the SRM status summary page. Most organization will utilize email notifications but there are other choices as well. Remember to set these suggested alarm notifications at both sides as appropriate. Remote Site Down Remote Site Ping Failed Replication Group Removed Recovery Plan Destroyed License Server Unreachable (SRM 1.x) Recovery Plan Started / Recovery Plan Execute Test License Expiring / License Expired Protected VM limit exceeded VM Added (and waiting for you to protect it) VM Not Protected (meaning the VM has been added, but something about it requires additional work for example, a new not replicated VMDK or CD ISO). Recovery Plan Prompt Display important as this may stop the RP, or just a VM from being recovered. So the quicker you acknowledge the quicker the plan will continue. Recovery Profile Prompt Response so you know that something has been acknowledged.

You may want to consider as well:
VM Protection Invalid: I am not sure what triggers this one!

With SRM 4.1 (4.0), these alerts are not part of the improved vSphere environment. So if you set to be alerted on Remote Site Up, you will be alerted very frequently! Remember that these alarms are configured at both the protected and recovery sites. Some of them are not necessarily appropriate on both or either side. Check out my blog for more information on this, and I will update it as necessary. It is at http://blogs.vmware.com/uptime/2011/02/recommended-alarms-for-srm-admins-to-watch.html .

SRM VirtualCenter events


SRM will raise VC events for the following conditions:
Disk space low (on the SRM server)
CPU use exceeded limit (on the SRM server)
Memory low (on the SRM server)
Remote Site not responding
Remote Site heartbeat failed
Recovery Plan Test started, ended, succeeded, failed, or cancelled
Virtual Machine Recovery started, ended, succeeded, failed, or reports a warning

The trigger thresholds for some of these can be changed. For example, the minimum disk space is 100 MB and you may wish to make it 500 MB. You can change the disk, CPU or memory thresholds in the vmware-dr.xml file in the SRM config folder. Search for the terms below (in vmware-dr.xml) to see where to make the change, and then restart the SRM service.
Disk (minDiskSpace), where the default is 100.
CPU (maxCpuUsage), where the default is 80.
Memory (minMemory), where the default is 32.

Are thin provisioned VMs supported with SRM?


Yes. No issues.

What does Microsoft offer for licenses for DR test?


Microsoft offers what they call disaster recovery rights for every license that you have covered by Software Assurance (SA). This is the language from Microsoft's use rights document:
Cold Disaster Recovery Rights. For each instance of eligible server software you run in a physical or virtual operating system environment on a licensed server, you may temporarily run a backup instance in a physical or virtual operating system environment on a server dedicated to disaster recovery. The product use rights for the software and the following limitations apply to your use of software on a disaster recovery server.
The server must be turned off except for (i) limited software self-testing and patch management, or (ii) disaster recovery.
The server may not be in the same cluster as the production server.
You may run the backup and production instances at the same time only while recovering the production instance from a disaster.
Your right to run the backup instances ends when your Software Assurance coverage ends.
The trick is that every software component must be under SA (OS, apps, the works) for this to be practical. That's 25% of the license fee per year, for every license. The good part of this grant is that it makes it completely clear that you can run DR testing all you want to prepare for a disaster ("limited" isn't defined; it's purely a matter of what the licensee chooses to do), and that you can run the two production instances (primary and failover) simultaneously when recovering from the disaster. The bad part is that it is extremely expensive.
1. This language was developed before SRM and virtualization made DR practical to test and tune on a broad scale. It's probably not the best approach to licensing SRM, due to high cost. If the customer already has SA everywhere, that's great. If not, here's what I recommend:
2. Create the SRM development environment in an isolated, non-production virtual network segment. Be completely sure that production IT traffic cannot leverage resources in the SRM development environment.
3. Acquire enough Microsoft Developer Network (MSDN) subscriptions to license the OS and applications that will be used in the DR site. These are very low cost, but are fully functional and allow any development, non-production use.
4. Test and tune SRM using MSDN licenses until it works as desired.

When the customer is ready to test production failover, they may want to ask for permission from Microsoft to re-assign their licenses on a short-term basis. The failover test itself is permitted: the customer will re-assign all their licenses to the disaster site hardware. However, Microsoft rules state (with some specific exceptions) that re-assignment may not be done more than once every 90 days. The customer would need to either wait 90 days before testing the recovery phase, or ask Microsoft to acknowledge that they can test this critical business function without violating the terms of their license. An important note is that many corporate accounts have SA in enough volume that this test process is not an issue.

What vendors have application consistency options?


When storage arrays replicate VMs to another location, and SRM starts them, the condition of the VM will be crash consistent. This is not always an issue, but for sophisticated applications such as SQL, Oracle, and Exchange it can be. The vendors below can help avoid these issues by using application consistent technologies in the replication.
EMC has a tool called Replication Manager that works with at least four of the five different EMC replication technologies. It has agents that can provide application consistency for a variety of applications including Exchange, Oracle and SQL. See more at http://www.emc.com/products/detail/software/replication-manager.htm .
NetApp has a tool called SnapManager for Virtual Infrastructure that has agents and can do application consistent snapshots that are replicated. See more at http://www.netapp.com/us/products/managementsoftware/snapmanager-virtual.html .
FalconStor has something called Snapshot Director for VMware. More information on it can be found at http://www.falconstor.com/en/pages/index.cfm?pn=VMware&bhcp=1 and a direct link to the product datasheet can be found at http://www.falconstor.com/?tk=3Z18DEF34F7873223089B0D956F6EBD8 . This will work with Oracle, Exchange, SQL, IBM DB2, and Lotus Domino.
Dell has something called the Automated Snapshot Manager for VMware, but I am unable to find what applications it supports.
LeftHand Networks has this capability but I don't know much about it. If you have knowledge, good links or other info, please send it on to me.
Hitachi SANs do not have, from what customers and storage people tell me, any agent based system. This includes the HP SANs that are based on HDS gear. This is a critical issue as it means that complex and modern applications will be replicated, at best, with crash consistency. I do not know if there are options that are manual or use scripts.


What vendors have application consistency options that work with continuous replication?
This is a little different: with continuous replication it is hard to use agents to work with a point in time snapshot, because the replication is perhaps real time, or maybe every 2 seconds, so there is not enough time to work with the agents to produce application consistent snapshots. So everything will show up as crash consistent. HDS has the ability to have application consistent continuous replication for physical machines but not virtual machines at this time. When I asked one of the architects at FalconStor about this, I got what is written below. Thanks very much David!
For the Continuous Replication, the way we can achieve better than Crash Consistency is through our "Snapshot Director" (virtual appliance) and our Snapshot Agents (loaded in the VM's); the main difference, compared to using our "Periodical Replication", is that instead of creating a periodical replication point, we create a "Snapshot Marker" on regular intervals, and that marker gets replicated instantly to the remote site (the TimeMark is then created on arrival on the REPLICA volume). So the end effect is you get incredible RPO using Continuous Replication (any-point crash consistent state), but you also get the benefits of amazing RTO through "application consistent snapshot" via periodical snapshot markers that are trickled down to the DR site via continuous replication (but these quiescent application consistent snapshots are still periodical, thus spaced apart, as if we were doing periodical replication).
As for the Continuous Replication question from your previous email, we do not play the snapshot quiesce action "offline". We truly quiesce the VM's applications at the Protected site, but instead of waiting for a "replication interval" (as opposed to Periodical replication), the "state pointer" which is like a bookmark (aka Snapshot Marker) is inserted into the CDR Journal (Continuous Data Replication Journal) at the time right after the filesystem flush, and replicated immediately. The TimeMark is then created on the DR Replica disk, almost right away, as opposed to having to wait for an upcoming replication session (when using Periodical). So bottom line --> VM's are quiesced, but no NSS Snapshot is created on Protected Site, instead, we just insert a bookmark in the I/O Journal queue, and as the journal is flushed out to the remote site, when the remote site processes the bookmark, it creates the TimeMark on the Replica disk.
As I learn more about this I will share what other vendors can do.

What rights does a user require to be a DR operator?


If you want a particular user to be a DR type operator who can trigger plans, but without sweeping rights, it can be a little tricky. A customer figured this out, and a PSO guy confirmed it was accurate. It likely has extra permissions due to the way it was figured out. This will give a user the ability to trigger failovers without being a VC admin user. The rights that are required are:
Protected Site
Read-Only at the vCenter root (Virtual Machine User Operator)
Read-Only at the datacenter inventory object (Virtual Machine Operator)
Protection Virtual Machine Administrator role at the virtual machine level (propagate), applied to VM folders
Protection SRM Administrator role at the SRM site recovery root level (propagate)
Protection Groups Administrator role at the SRM protection groups level (propagate)
Recovery Site
Recovery Inventory Administrator role at the vCenter root
Recovery Datacenter Administrator role at the datacenter level (propagate). Include Virtual Machine Interaction, Host CIM and Rename Datastore
Recovery Host Administrator role at the host and cluster level (include Browse Datastore, Assign VM to Resource Pool, Reset Guest Information, Console interaction, Power ON/OFF and Reset)
Recovery Virtual Machine Administrator at the resource pool and folder levels (propagate). Didn't work at the customer unless assigned at cluster level.
Recovery SRM Administrator at the SRM root level (propagate)
Recovery Plans Administrator at the SRM recovery plans level (propagate)

SRM service doesn't start, and event logs show errors with event ID of 7000 and 7009
This will not normally be seen in a production environment where SQL / SRM / VC are well designed, but in a lab with limited resources this can and does happen. It is due to the Windows Service Control Manager expecting a "Service started successfully" message within 30 seconds. You can change a global setting to increase the 30 seconds to 60 seconds, and it appears that will solve this issue. Use the steps below to make this change.
1. In Registry Editor, locate, and then right-click the following registry subkey:
HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control
2. Point to New, and then click DWORD Value. In the right pane of Registry Editor, notice that New Value #1 (the name of a new registry entry) is selected for editing.
3. Type ServicesPipeTimeout to replace New Value #1, and then press ENTER.
4. Right-click the ServicesPipeTimeout registry entry that you created in step 3, and then click Modify. The Edit DWORD Value dialog box appears.
5. In the Value data text box, type 60000 (the value is in milliseconds, so make sure Decimal is selected), and then click OK.
Now you can restart and you should have no issue with SRM starting, since there is more time for SQL and VC to start. Thanks to Scott for this great info!
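If you need to repeat the registry change on several lab machines, the steps above can be captured as a .reg file instead of clicking through Registry Editor. The Python sketch below only builds the text of such a file; the function name is made up for this example, and importing the file and restarting are still up to you.

```python
def reg_dword_file(key: str, name: str, value: int) -> str:
    """Build the text of a .reg file that sets one DWORD value under key."""
    if not 0 <= value <= 0xFFFFFFFF:
        raise ValueError("a DWORD must fit in 32 bits")
    # .reg files store DWORDs as 8 hexadecimal digits.
    return (
        "Windows Registry Editor Version 5.00\r\n\r\n"
        f"[{key}]\r\n"
        f'"{name}"=dword:{value:08x}\r\n'
    )


text = reg_dword_file(
    r"HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control",
    "ServicesPipeTimeout",
    60000,  # milliseconds, i.e. 60 seconds
)
print(text)
```

Note that 60000 decimal shows up as 0000ea60 in the generated file, which is the same value you would see in Registry Editor with Hexadecimal selected.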

How can I have syntax highlighting to help read SRM log files?
This is very useful and can be done on both Macs and PCs with a little work. Much less work now that I have done it for you! On the Mac you need to use TextWrangler (http://www.barebones.com/products/TextWrangler/) and on the PC you need to use EditPlus (http://www.editplus.com/ ). This is possible with other editors, but I only show you these two. TextWrangler is freeware, while EditPlus is shareware. It is very popular with developers.

TextWrangler
In the appendix there is a sample file that you can copy and paste to create a text file called log.plist. Then use the following steps to make it live.
1. If necessary, create a folder called Language Modules in your ~/Library/Application Support/TextWrangler folder.
2. Move your file to this folder.
3. You should now be able to open a file that has an extension of .log and see words like error in color.
4. The Preferences can help you adjust as necessary. In the Suffix Mappings section you can map .log to the language module called Log (it takes its name from the filename).
See below for the end result.

EditPlus
In the appendix there is the information to copy and paste that you can use to create a text file called log.stx. Use the following steps to make it live.
1. Copy this file to the C:\Documents and Settings\user_name\Application Data\EditPlus 3 folder.
2. Now in the Documents \ Permanent Settings we need to add this new file into EditPlus.
3. Under the Settings & syntax menu, define a log file type.
4. In the File extensions section, add the log type.
5. In the Syntax file section, load your log.stx file.
See below for a completed set of preferences as well as a sample file.


Troubleshooting
This information will help with troubleshooting of SRM and SRM related issues.

Things to watch out for


There are a number of things to check when troubleshooting an issue. Often the errors visible in a Recovery Plan or history report are in plain English, and you can start your troubleshooting with the error message. The SRM logs are very useful as well, but remember to check the production site or the recovery side as appropriate. Don't forget that SRAs have logs too, but generally all of the standard output and input of the SRA is captured in the SRM log.
You can often search the log for things like error] but you can also search it for what you see in the history report. You can use the date / time of the report / error to look for information. Some other things that may be useful to search for include credentials, failure or warning. Some of these may occur naturally, so be careful. The start of a recovery plan in the logs looks something like RootStepList-xxx HAS [-1] CHILDREN.
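The searches described above can be roughed out in a few lines of Python. This is a sketch, not a supported tool: the line pattern is inferred from the log excerpts shown elsewhere in this guide (timestamp, thread ID, level, then the component in quotes), and the keyword list is just the terms mentioned above.

```python
import re

KEYWORDS = ("error", "warning", "failure", "credentials")

# Assumed line shape: [2010-08-02 12:19:21.850 02940 error 'Vdb'] message text
LINE_RE = re.compile(
    r"\[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}) (\d+) (\w+) '([^']*)'\] (.*)"
)


def find_suspects(lines, keywords=KEYWORDS):
    """Return (timestamp, level, component, message) for lines worth a look."""
    hits = []
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # not a standard log line, skip it
        stamp, _thread, level, component, message = match.groups()
        haystack = f"{level} {message}".lower()
        if any(word in haystack for word in keywords):
            hits.append((stamp, level, component, message))
    return hits


sample = [
    "[2010-08-02 12:19:21.740 02940 info 'App'] Intializing the DBManager",
    "[2010-08-02 12:19:21.850 02940 error 'Vdb'] Connection: Could not connect to database: -1",
]
for hit in find_suspects(sample):
    print(hit)
```

Because the timestamp comes back as the first field, it is easy to narrow the hits to the date / time you saw in the history report.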

Also, where possible, it is useful to troubleshoot when the Continue function has not been issued, so the RP is in effect paused. This gives you access to storage that you will not have when the RP is complete and cleaned up. Always check SRM compatibility for your situation (with things like ESX patch level or SAN compatibility), and do not forget that the SRA often has prerequisites that you need to worry about. If Create Protection Group is grayed out, that generally means that SRM cannot see the storage. Use the Array Manager configuration LUN view to see if there are any clues. That generally has helped me. There are some odd things that you need to remember, such as that an attached CD can be an issue in a failover. Sometimes it is worth starting vmware-dr.exe directly to see if you can see anything that can help. This is particularly useful when you have tried to start the SRM service and it fails, but nothing is seen in the SRM logs. This can mean that a problem stops SRM from starting before it can touch the logs. Always check the release notes as well!

How can I change the command Timeout?


I am using an EMC SRA and I have heard that I may want to extend the execution timeout so that I can avoid timeout errors. Timeout errors can be found in the SRM log on the recovery side. Look in the vmware-dr.xml file, which is by default in the C:\Program Files\VMware\VMware Site Recovery Manager\config folder. Search for CommandTimeout to find where to make the change. The default is 300 seconds (5 minutes); see the relevant section of the file below. Remember that after you make a change to the vmware-dr.xml file you need to restart the SRM service.

I have heard that if you go from 300 to 1500, some EMC SRAs will no longer error and will work. I also know that the next generation of EMC SRAs will be much faster. The previous information is for SRM 1.0. For SRM 4.0 you make this change in the Advanced Settings dialog and do not need to restart the service.

My Celerra prepare storage fails, and the error has a null in it


This is due to a bug in the current Celerra SRA, and in the second or third week of December 2009 there should be an updated SRA that doesn't have this error. It is related to the name of the Celerra VSA, but it occurs with physical arrays as well!
1/29/10: EMC has released an updated SRA (4.0.17) that is supposed to solve this issue. I have not yet tested this myself, nor have I heard that it does solve the problem. But it is supposed to!
3/26/10: This error is still in the wild. There is a new SRA 4.0.19 that supposedly fixes the issue, but it has not been released yet.
May 2010: The updated 4.0.19 SRA has been released and is said to fix this issue, but I have not confirmed that myself.

Where are the new Run and Test privileges?


After you update to SRM 1.0 Update 1 you should see a Run and a Test privilege in the roles and privileges area but you may not. Restart VC and you will see them.

I have accidentally deleted my Shadow VMs, what should I do to fix this?


The shadow VMs are important for several reasons. SRM will replace them with the real VMs during a failover, and they are placeholders that let you know where VMs may end up at some point. You can fix this easily by accessing the Protection Group that hosts these VMs on the protected side and configuring the VMs again; they will then be created again on the recovery side. RP customization around these VMs will be lost if this is necessary. In SRM 4.0 there is an option in the UI to fix this situation. Important Note: When your protected VMs have their shadow VM created again, they will have a new ID assigned to them, which will cause you issues with IP customization and will require you to redo it. The VM name will match before and after the shadow VM re-creation, but the ID will not. You will need to recreate your CSV and reassign your IP customization.

SQL Authentication, and database access issues


If you have issues with not being able to start the SRM service, it may be a SQL Server account issue. You may see something like the login failure in the log below.
Section for VMware vCenter Site Recovery Manager, pid=1348, version=4.1.0, build=build-267817, option=Release
[2010-08-02 12:19:21.740 00560 warning 'App'] Failed to create console writer
[2010-08-02 12:19:21.740 02940 info 'App'] Set dump dir to 'C:\ProgramData\VMware\VMware vCenter Site Recovery Manager\DumpFiles'
[2010-08-02 12:19:21.740 02940 info 'App'] Intializing the DBManager
[2010-08-02 12:19:21.850 02940 error 'Vdb'] Connection: Could not connect to database: -1
[2010-08-02 12:19:21.850 02940 warning 'App'] DBManager error: Could not initialize Vdb connection: ODBC error: (28000) - [Microsoft][SQL Server Native Client 10.0][SQL Server]Login failed for user 'VMW-NE\SITE1-SRM01$'.
[2010-08-02 12:19:21.850 02940 error 'App'] Application initialization error: Could not initialize Vdb connection: ODBC error: (28000) - [Microsoft][SQL Server Native Client 10.0][SQL Server]Login failed for user 'VMW-NE\SITE1-SRM01$'.
[2010-08-02 12:19:21.850 02228 info 'App'] [serviceWin32,421] vmware-dr service stopped


When you use Windows Authentication to access the DB you must run the SRM service as the DB user account. When using SQL Authentication, you can leave the default local System user.

Why can't I customize Windows 2008?


This ability was added in at the Patch 4 of SRM 1.0.1 timeframe. Upgrade to it or SRM 4.0 and you will be fine.

Why does my recovery plan show error on VM status but the VMs are ok?
The reason for this error was introduced in Update 3 of ESX, where we adjusted how often we check for the VMware Tools heartbeat. While the VMs are recovered successfully, the history report shows errors on the VMware Tools status. You can adjust the Recovery Plan Response Times "wait for OS heartbeat" value from 300 to 450 and you will get rid of the error status, but the test will take longer! For a better fix, instead of adjusting the Recovery Plan wait for Tools timeout, use the following information. Start in the ESX Service Console at the command prompt. Edit the /etc/vmware/hostd/config.xml file and find the vmsvc section (which starts with <vmsvc>). There may be a line that starts with <heartbeatDelayInSecs>XX; if so, change the value of XX to 40. If not, you will need to add the complete section, which will look like the example below.
<vmsvc>
<heartbeatDelayInSecs>40</heartbeatDelayInSecs>
<enabled>true</enabled>
</vmsvc>

You will need to restart the management agent to load this change. The command is service mgmt-vmware restart, followed by service vmware-vpxa restart. It is important to note that this may cause an issue with restarting VMs. Avoid this by disabling the automatic startup and shutdown of VMs. See more about this at http://kb.vmware.com/kb/1003490 . You will find info below about avoiding the Shutdown Tracker situation. This recently became a supported fix: http://kb.vmware.com/kb/1008059 . In addition, as of 1/30/09, there is a fix so this manual work is not required. Find out more at http://kb.vmware.com/kb/1006651 .

I am not aware at this time of a way to implement this change in ESXi. 8/8/10 AOK.

ESX 2.5 accessing a protected datastore will cause recompute datastore failures
If you have ESX 2.5 hosts accessing a protected datastore you will see recompute datastore failures. Remove the ESX 2.5 host from the datastore. This was fixed in Update 1 of SRM.

What causes the Recompute Datastore Group task?


A number of things can cause this, including:
Existing VM is deleted or unregistered
VM is storage VMotioned off
New disk is attached to a VM on a datastore previously NOT used by that VM
New datastore is created
Existing datastore is expanded
New VM is created

Why is my IP customization taking about 10 minutes extra per VM?


This has been seen to occur when a VM has been deployed from a template and never turned on. It turns out that sysprep is trying to run while we use sysprep to re-IP. This can be avoided by deploying the VM, turning it on and letting the deployment finish, and then trying to re-IP during a test or failover. Now it should work much better.

When using Bulk Import I get column errors


This can occur if you edit the .CSV file with Excel 2003. This will not occur with Excel 2007. The problem is that Excel 2003 will strip off the last comma, which will screw up your columns.
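A quick way to spot a stripped trailing comma before you import is to compare each row's column count against the header row. The Python sketch below does that; the column names in the sample are invented for the example and are not the real bulk import headers.

```python
import csv
import io


def rows_with_wrong_width(csv_text):
    """Return (line_number, column_count) for rows that do not match the header width."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    expected = len(header)
    return [
        (number, len(row))
        for number, row in enumerate(reader, start=2)  # line 1 is the header
        if row and len(row) != expected  # ignore completely blank lines
    ]


# Hypothetical file: the last row has lost its trailing comma (an empty final column).
sample = (
    "VM ID,VM Name,Adapter ID,DNS Suffixes\n"
    "vm-1,web01,0,\n"
    "vm-2,web02,0\n"
)
print(rows_with_wrong_width(sample))
```

Any row reported here is one column short (or long) compared with the header and is a likely source of the column errors.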

I would like to avoid the messages about shutdown


By default, you will see a prompt when you log in after a failover asking why the computer went down. This sometimes looks poor during a demo. How can I avoid this? You can use a GPO to avoid this. It is in Computer Configuration, Administrative Templates, System.

Unable to find any array script files. Please check your SRM installation
This can mean a few things. Your SRM install could be on D:\ while your EMC Solutions Enabler is installed on C:\. This error can also occur with any storage vendor if you have not restarted SRM after installing their SRA.

My Linux VMs don't have the host file changed after IP customization
This is a current bug in SRM 4.0. It is being worked on and hopefully will be addressed soon. The IP customization on the Linux VM does actually work except for the change to the host table. This was fixed in 4.0.1 and later.

dr.secondary.fault.WrongVmInventoryPlacement
This error can occur when you are creating a PG and have mapped inventory items that are not compatible. For example, between an ESX host with VMs that are at VH7 level, with an ESX 3 host. Basically it means that host, network, or resource pool are not compatible at the other side. The log will have errors like:
[2009-09-19 18:27:04.252 06140 verbose 'Replication'] Creation of shadow VM failed with error (dr.secondary.fault.WrongVmInventoryPlacement) {
[#2] dynamicType = <unset>,
[#2] faultCause = (vmodl.MethodFault) null,
[#2] resourcePool = 'vim.ResourcePool:resgroup-392',
[#2] datastore = 'vim.Datastore:datastore-2840',
[#2] host = <unset>,
[#2] msg = "Host, resource pool and datastore are not compatible.",
[#2] }

You will need to find out what resource pool resgroup-392 and datastore datastore-2840 are to discover where the conflict is. This is done by browsing to the following URL on the recovery side VC:
https://vc_recovery_side/mob/?moid=datastore-2840

You would change datastore-2840 to the other IDs as appropriate.
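If the fault lists several objects, it can be quicker to pull the IDs out of the log text and build the lookup links in one go. The Python sketch below does that; the host name is a placeholder, and the /mob/?moid= form of the link is an assumption based on how the managed object browser is normally addressed, so adjust it if your VC differs.

```python
import re

# Matches entries such as 'vim.Datastore:datastore-2840' in the fault text.
MOID_RE = re.compile(r"'vim\.(\w+):([\w-]+)'")


def mob_urls(fault_text, vc_host):
    """Map each object found in the fault text to a managed object browser URL."""
    return {
        f"{kind}:{moid}": f"https://{vc_host}/mob/?moid={moid}"
        for kind, moid in MOID_RE.findall(fault_text)
    }


fault = (
    "resourcePool = 'vim.ResourcePool:resgroup-392', "
    "datastore = 'vim.Datastore:datastore-2840', host = <unset>"
)
# "vc_recovery_side" is a placeholder for your recovery side VC host name.
for name, url in mob_urls(fault, "vc_recovery_side").items():
    print(name, url)
```

Fields that are <unset> in the fault, such as host in this example, are simply skipped.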

Pairing Issues
If you have an issue at approximately 24%, it could be related to the license file not being live or installed. Reread the license file or restart the license service. If you have an issue at approximately 82 or 84%, you should make sure that the account you used to connect to the Recovery site has both VC and SRM admin rights. The specific role for SRM is Protected Site Administrator, and on the Recovery Site it is called Recovery Site Administrator. This issue occurs most in a Microsoft domain world. The Administrator role includes both the Protected and Recovery site admin roles. Things to check when troubleshooting pairing issues include firewalls between the sites and whether the recovery site is running VC successfully.

I cannot run more than one simultaneous recovery plan with my MirrorView SRA
I need to run more than one recovery plan at a time so that I can cut my RTO, but I have not been able to do that with my MirrorView SRA. I can in fact do it with other SRAs, so I am curious. This is (as of 12/12/09) correct, and is a precautionary measure to provide better response time while running the recovery plan. In the future Navisphere engineering will make some design improvements that will allow additional simultaneous recovery plan operations. In the meantime, with no support, and in a lab, for experimental purposes only, you can change the limit from 1 simultaneous plan to 3 with a registry change. The registry change should be made on the SRM server. The key is HKLM\SOFTWARE\EMC\MirrorViewSRA\Options\NumSimultaneousInvocationsAllowed with a value of 3.

What time guidelines can I expect for protecting VMs?


I wonder if my SRM is taking longer than it should to create a protection group. What is a baseline? Below is a guideline from our QA department. Your mileage may vary, especially depending on your storage.
100 VMs in 7 minutes
200 VMs in 15 minutes
300 VMs in 22 minutes
400 VMs in 27 minutes
500 VMs in 35 minutes

What time guidelines can I expect for failing over VMs?


This is something that varies a lot. For example, NFS prepares for use much faster than VMFS FC or iSCSI storage. IP customization and script execution can add minutes per VM as well. The processor / memory of the recovery hosts, and storage performance can make a big difference as well in terms of how fast we can have a recovery plan complete.


Important note: this information is for discussion and not indicative of what you can expect. Remember there are a lot of variables! But the info below can be used for understanding if you are seeing good numbers or not. Another important note: I have seen tests where I thought everything was the same, and yet they took different times, sometimes different by even 5 or 6 minutes. So the test times indicated below are very rough and should not be taken seriously.

Number of VMs | Scripts and/or IP Customization | Storage Information | Time for a test (including clean up) | Comments
8 Windows | No / No (none) | Virtual FalconStor running on dedicated physical FalconStor hardware. | 12 minutes | 2 hosts on recovery side.
8 Windows | Yes / Yes (all) | Virtual FalconStor running on dedicated physical FalconStor hardware. | 29 minutes | 2 hosts on recovery side.

A customer reported to me that he failed over 100 virtual machines, from 12 protection groups, in 120 minutes. I have no other info on that. But I am looking for more information for this section.

When trying to do Inventory Mappings the VI Client hangs


This can occur when you have more than 7 ESX hosts. There is a Patch (Patch 2) that can solve this issue. This can still occur when you have hundreds of ESX servers or thousands of VMs. That is harder to solve but information in the design section can help. In addition, Patch 3 should help this situation further. SRM 4.0 improved this a lot!

Failed to connect to the management system address when executing the discoverArrays command.
You should not see this often, but it can be addressed by making sure the SRA is in fact installed on the recovery side. You may also need to check routing between the sites (in particular to the recovery side SRA / storage management interface). This can also occur after storage is mounted but the datastore cannot be found. There are MANY causes of this error. Use storage troubleshooting to figure it out. Before continuing the test, check the storage and confirm it is readable and has VMs. You need to find the boundary between working and not working in the storage world and then deal with that. I have seen this with the MirrorView SRA and its odd ports, as well as with RecoverPoint.

How can I re-initialize the SRM database


SRM 1.0: You can do it using the commands below:
cd <SRM bin folder>
Initdb.exe ..\config\vmware-dr.xml recreate

This is not something you need to do often. In fact I never have. It may be useful if you suspect your database information is corrupt.
SRM 4.0: You can use the Change option for SRM in the Add / Remove control panel applet. It will allow you to make a number of changes, including changing the VC account / password, deleting the contents of the SRM database and more.

Error: LUNs with duplicate IDs or numbers received from SAN integration scripts
When adding an array in the array configuration manager you may see this error in a popup window. In this example it occurred in an EMC Symmetrix and SRM 1.0 U1 environment. In the SRM logs you could see the same WWN for all LUNs. You will need to talk to the storage team and make sure the correct flags are set on ALL FA ports. EMC will normally recommend the following flags be set on all FA ports in an ESX environment.
Common serial number (C)
Auto negotiation (EAN) set
Fibrepath enabled on this port (VCM)
SCSI 3 (SC3) set (enabled)
Unique world wide name (UWN)
SP-2 (Decal) (SPC2) flag is required

Error: Failed to recover datastore:


This error usually indicates that the recovery side cannot communicate with the array on the recovery side. In the SRM logs on the recovery side you can see a Mapped LUN line (s) that will help you see what the protected side is mapped to on the recovery side. This will sometimes help you fix this error message.

SRM unlicensed error in logs but you have a good license


If you change the SRM license file(s) you may have a small issue, as it is not the same process as changing an ESX or VC license. You would follow the normal steps of dropping the file in the license folder and rereading the license folder in the license tool. This would be enough for VC or ESX, but it is not enough for SRM. After these steps you could see the license in the VC Admin License view, but you would still see the unlicensed errors in the SRM log. You need to restart the SRM service for the license change to take effect. You can find more information on SRM licensing in other parts of this document. For example, see SRM Licensing Information.

I cannot uninstall SRM successfully, what can I do?


Uninstalling SRM will normally require access to the VC that it is paired with. If you do not have that VC running, it is hard to uninstall SRM. If you don't cleanly uninstall SRM, you cannot install it again. It is possible to uninstall with no VC if you read the screens carefully and answer appropriately, but I have seen cases where that doesn't work. Use one of the ideas below to help if you need it. It is always best to use the Add / Remove Programs method to uninstall, but if that doesn't work the ideas below should.
msiexec.exe /qn /x {35A202EA-1549-4592-97A5-65F5E4CCDEC9}

Microsoft's uninstall utility: http://support.microsoft.com/kb/29031


SRM doesn't start, and you just uninstalled an SRA


This was an interesting issue! I decided to not use a particular storage array any longer, so I migrated all of the protected VMs on it to another storage array. Next, I removed the SRA and deleted the virtual storage arrays. For an unrelated reason I restarted SRM, but it would not start. When I looked at the logs I noticed it was crashing on issues related to the SRA I had just uninstalled. So I installed the SRA again, and removed the Array Manager config for the removed SRA and the associated PGs. When I removed the SRA again, there were no issues. The moral of this story is that when you remove an SRA, make sure to remove the Array Manager configs as well as the PGs that point to it!

Unable to create placeholder virtual machine at the recovery site: host, resource pool, and datastore are not compatible
This is a frustrating error message. I first saw it when I started using distributed switches at one site and not the other. This error message means that you have mapped resources that are, for some reason, not compatible. One simple example is when you have mapped a VM network to a network that one host doesn't have access to. You should also confirm that the Shadow VM location is visible to all hosts at the recovery side. You will need client and server logs to investigate this further. Another cause of this issue can be mapping between a 4.x cluster and a 3.x cluster. You can map between a 3.x and a 4.x cluster, which will work for failover but not failback. I also saw this once after an SRM service restart during a test recovery. Restarting both VC and SRM servers solved it.

Network device needed by recovered virtual machine could not be found at recovery or test time
This error will occur when your protected virtual machines are using dVS switches. With 4.0 or 4.0.1, dVS is not supported even though it is supposed to be. This problem is in two parts, with the first being a cosmetic issue in VC, and then the error above, which stops a recovery from being successful. As of 5/22/10 there is a patch that has been confirmed to work available from GSS, which means you need an SR to get it. In our next major release, and in our next patch, we will include this fix. Both of these will be available in the summer of 2010. The VC issue will be fixed in vSphere VC 4.0 Update 2. To confirm you have this issue, you will find NetworkDeviceNotFound in your SRM log. A few lines after that error you will see dvportgroup-xxxx messages. In the History Report you will get an error something like Network device needed by recovered virtual machine couldn't be found at recovery or test time. Update: SRM 4.1 doesn't have these issues, and if you use SRM 4.0.2 and VC 4.0 U2 you will not have this issue. The KB article can be found at http://kb.vmware.com/kb/1019890 .
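The log check described above can be sketched in Python. This is only an illustration: the function name, the 10-line window and the sample log lines are assumptions made up for the example. Only the strings NetworkDeviceNotFound and dvportgroup come from the text.

```python
import re

def find_dvs_issue(log_lines, window=10):
    """Return indexes of NetworkDeviceNotFound lines that are followed,
    within `window` lines, by a dvportgroup-<number> reference."""
    hits = []
    for i, line in enumerate(log_lines):
        if "NetworkDeviceNotFound" not in line:
            continue
        following = log_lines[i + 1 : i + 1 + window]
        if any(re.search(r"dvportgroup-\d+", nxt) for nxt in following):
            hits.append(i)
    return hits

# Made-up sample lines, for demonstration only (not real SRM log output).
sample = [
    "info: starting recovery step",
    "error: fault NetworkDeviceNotFound",
    "verbose: device backing",
    "verbose: portgroupKey = dvportgroup-1234",
]
print(find_dvs_issue(sample))
```

To try it against a real log, read the file with open(path).read().splitlines() and pass the result in. The window size is a guess and may need to be widened.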

SRM doesn't start and there is nothing in the SRM logs or event logs, what to do?
The reason nothing is in the SRM logs is that SRM really hasn't started yet, so it is no surprise when there is nothing in the event logs either. I have seen this several times and there are two things to think about.
1. Use depends.exe to determine what missing DLL is hurting SRM. I once had SRM not start for me due to a missing DLL by the name of MSVCP71.dll. By using depends.exe to start vmware-dr.exe (the SRM service) I was able to determine which DLL was missing and replace it with a copy from a different SRM server. Incidentally, depends.exe comes with Visual Studio.
2. Start vmware-dr.exe manually and you may see a message such as msg=Login failed due to a bad username or password. This can occur after changing the password that is tied to SRM. The message may or may not be in the SRM log file, and if it is there it can be hard to find.

Only three Recovery Plans can run at the same time


I am not sure what the error message is if you try to run more than 3, but at least you now know that only 3 should be executed at the same time. Update: this limit is not enforced. The limit of 3 is due to the level of QA testing and will be significantly improved in the future. It is rumored that up to 6 running RPs will work without issue, but above 6 there are issues, and by 10 there are for sure consistent and serious issues. Only three is supported! It is important to note that not all SRAs will support this. For example, due to issues in Navisphere, the MirrorView SRA will only support 1 running RP. Always check the readme or release notes for the SRAs.

Why is Port 80 used in the install but port 443 later?


During the install of SRM, port 80 is specified and you cannot type in 443, but after the install is complete SRM talks to VC on 443, so why is 80 specified in the install? Even though SRM uses SSL when it communicates with VC, it does not use port 443. SRM establishes a TCP connection to port 80, then uses an HTTP CONNECT request to establish a tunnel to the VC server, then does an SSL handshake with the VC over that tunneled connection. The SRM installation enforces these semantics.
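For readers who have not seen the CONNECT pattern before, here is a minimal, generic Python sketch of "TCP to port 80, HTTP CONNECT, then SSL over the tunnel". It is not SRM's actual code. The tunnel target that SRM asks for is not stated in this guide, so the host and port passed in are placeholders, and the function names are made up for the example.

```python
import socket
import ssl

def build_connect_request(target_host, target_port):
    """Build the HTTP CONNECT request that asks the listener on port 80
    to open a tunnel to target_host:target_port."""
    authority = f"{target_host}:{target_port}"
    return (
        f"CONNECT {authority} HTTP/1.1\r\n"
        f"Host: {authority}\r\n"
        "\r\n"
    ).encode("ascii")

def open_tunnel(vc_host, target_host, target_port, timeout=10):
    """Connect to port 80, request a tunnel, then do the SSL handshake
    over the same TCP connection. Not called below: it needs a live server."""
    sock = socket.create_connection((vc_host, 80), timeout=timeout)
    sock.sendall(build_connect_request(target_host, target_port))
    reply = sock.recv(4096)  # simplification: assumes the reply fits in one read
    status_line = reply.split(b"\r\n", 1)[0]
    if b" 200" not in status_line:
        sock.close()
        raise ConnectionError(f"tunnel refused: {status_line!r}")
    # Default context verifies the certificate; a self-signed VC certificate
    # would need its CA loaded with context.load_verify_locations().
    context = ssl.create_default_context()
    return context.wrap_socket(sock, server_hostname=target_host)

# Placeholder host name and port, for illustration only.
print(build_connect_request("vc.example.local", 443).decode("ascii"))
```

The point of the sketch is the ordering: the SSL handshake happens only after the plain-text CONNECT exchange has succeeded, which is why a firewall rule for port 80 is still needed even though the traffic is encrypted.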

Failed to test failover luns. Exiting with failure


This is from the EMC SRDF SRA. The error snippet is:
[#4] [07/20 07:03:15 CopyLuns.cpp 1089 CopyLuns::ValidateOptionsFileDevicePairs ] Enter
[#4] [07/20 07:03:15 CopyLuns.cpp 1098 CopyLuns::ValidateOptionsFileDevicePairs ] Checking if the number of input devices is same as the number of source devices in the options file
[#4] [07/20 07:03:15 CopyLuns.cpp 1101 CopyLuns::ValidateOptionsFileDevicePairs ] [ERROR]: One or more input RDF devices are missing from the device pair list within the options file
[#4] [07/20 07:03:15 CopyLuns.cpp 1258 CopyLuns::ValidateOptionsFileDevicePairs ] Exit
[#4] [07/20 07:03:15 CopyLuns.cpp 0154 CopyLuns::TestFailover ] [ERROR]: Options file device pairs validation succeeded but one/many of the adapter's conditions have not met. Exiting with failure
[#4] [07/20 07:03:15 EmcSrdfSra.h 0040 SymapiSession::~SymapiSession ] SymCommit() and SymExit()
[#4] [07/20 07:03:16 CopyLuns.cpp 0206 CopyLuns::TestFailover ] Exit
[#4] [07/20 07:03:16 EmcSrdfSra.cpp 1203 wmain ] [ERROR]: Failed to test failover luns. Exiting with failure

The question is which RDF devices it is talking about, and which options file? In the adapters directory on the recovery side there should be a file called EmcSrdfSraOptions.xml. In that file you need to specify the R2 devices and their associated BCV pairs as part of the <TestFailoverInfo> information. You need to find the associated BCV device names for each of those devices, for example by using the "symmir" command and specifying the device group containing those devices. Then modify EmcSrdfSraOptions.xml to include entries in the <TestFailoverInfo> stanza such as (for example, if 477's BCV is 35F):
<DevicePair>
<Source>0477</Source>
<Target>035F</Target>
</DevicePair>
Then run the test again, since this is the "options" file that the SRDF adapter is looking for. You will have to create this pairing information for each R2 device you plan to test. The output from the adapter will summarize what it thinks is specified in the EmcSrdfSraOptions.xml file, for example if the output has:
[#4] [07/16 08:57:16 EmcSrdfSra.cpp 0655 SrdfSraOptionsReader::DisplaySrdfSraOptions] save_pool_name = n/a
[#4] [07/16 08:57:16 EmcSrdfSra.cpp 0673 SrdfSraOptionsReader::DisplaySrdfSraOptions] devices = n/a
[#4] [07/16 08:57:16 EmcSrdfSra.cpp 0676 SrdfSraOptionsReader::DisplaySrdfSraOptions] gold_copy_type = BCV

Where the output shows "devices = n/a", the adapter thinks you haven't set any DevicePair settings. After you modify EmcSrdfSraOptions.xml you can also run the adapter binary by hand (EmcSrdfSra.exe -env), where the -env flag will cause it to print out what it thinks is in the options file. EMC can probably give more details as to the purpose of the options file. This all assumes you are using standard Timefinder for snapshots; if you are using BCV clones you will need to modify the EmcSrdfSraOptions.xml file accordingly, including specifying the save pool name.
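Before re-running the test, it can help to list the device pairs that are actually in the options file. The Python sketch below does that using only the element names mentioned above (TestFailoverInfo, DevicePair, Source, Target). The <Options> wrapper in the sample is a placeholder, since the real root element of EmcSrdfSraOptions.xml is not shown in this guide, and the function name is made up for the example.

```python
import xml.etree.ElementTree as ET

def list_device_pairs(xml_text):
    """Return (source, target) tuples found under any <TestFailoverInfo>."""
    root = ET.fromstring(xml_text)
    pairs = []
    for info in root.iter("TestFailoverInfo"):
        for pair in info.iter("DevicePair"):
            source = (pair.findtext("Source") or "").strip()
            target = (pair.findtext("Target") or "").strip()
            pairs.append((source, target))
    return pairs

# The <Options> wrapper is a placeholder; check the real file for its root element.
sample = """
<Options>
  <TestFailoverInfo>
    <DevicePair>
      <Source>0477</Source>
      <Target>035F</Target>
    </DevicePair>
  </TestFailoverInfo>
</Options>
"""
print(list_device_pairs(sample))
```

An empty list from the real file would line up with the adapter reporting devices = n/a.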

I can't install the plug-in, I get an error


The information of where to install the plug-in from is held in extension.xml which is in the install folder. It may have the wrong path. This could be due to an issue during the install.

For SQL server use, does the SRM DB user need the DB_OWNER permission?
For SQL Server, the SRM DB user doesn't need the DB_OWNER permission. As long as the schema has the same name as the username, is the default schema for that user, and is owned by that user, you are OK.

Unexpected MethodFault (dr.san.fault.ManagementSystemNotFound)


This error occurs after you upgrade the EqualLogic PS Series Interface SRA adapter to the Dell EqualLogic PS Series Interface. You can uninstall the new SRA and install the old one as a workaround, but there is another option. You can locate the manifest.xml file in the SRA installation directory, modify the SRA name in it, and restart the SRM service, and you should be good to go. This problem was fixed in Update 1.

Changing passwords after SRM is working


Update 11/26/09: For SRM 4.0 and later you can do this sort of thing much more easily via Add / Remove Programs, using the SRM Repair option. The information below is still appropriate for SRM 1.0. You can have some issues with changing account passwords after everything is working. In theory you can use the installcreds.exe file, but it has been reported to not always work. In the near future there will be an update to make this process easier, but for now you must use the srm-config.exe command. When it is complete you will be able to restart the SRM service and have communication between the SRM servers (you will need to repair the communication by doing the pairing again). The format for this command is complex. You must run it twice: the first time to obtain a thumbprint, and then the second time to actually make the change. Below is a sample command line. This utility is found in the bin directory of the c:\program files\VMware\VMware Site Recovery Manager\config folder. You can find parameter values (such as the value for sitename) in the vmware-dr.xml file found in the config folder.
Srm-config.exe -cmd confuserbased -sitename <local site name> -cfg <SRM configuration file> -u <username> -vc <host[:port]> [-thumbprint <sha-1 server certificate thumbprint>]
Srm-config.exe -cmd confuserbased -sitename srm-primary -cfg vmware-dr-primary.xml -u administrator -vc 10.10.10.10 -thumbprint 96:E0:E8:F5:59:1C:BF:6D:81:6C:A2:AB:51:76:24:DE:31:D1:E8

Without the password you will need to use the thumbprint. So run this command the first time without the thumbprint parameter and you will be shown the thumbprint, and then run it again with the thumbprint. If your site name contains spaces, enclose the name in quotes. You will need to worry about this if you cannot get the SRM service to start. You will see messages in the error log about ERROR 1920 Service VMware SRM Service (vmware-dr) failed to start. You can see a little more about this on page 44. This is easier in SRM 4.0 and is covered in the admin guide.
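The two-pass procedure can be sketched as a small Python helper that only builds the command lines (it does not run anything). It assumes every parameter name takes a leading dash, as the -cfg, -u, -vc and -thumbprint forms in the syntax line suggest. The helper name and the sample values are made up for illustration.

```python
import subprocess

def srm_config_args(site_name, cfg_file, user, vc_host, thumbprint=None):
    """Build the srm-config.exe argument list. Leave thumbprint as None for
    the first pass, then pass the value printed by that run for the second."""
    args = [
        "srm-config.exe",
        "-cmd", "confuserbased",
        "-sitename", site_name,
        "-cfg", cfg_file,
        "-u", user,
        "-vc", vc_host,
    ]
    if thumbprint:
        args += ["-thumbprint", thumbprint]
    return args

# First pass: no thumbprint yet. The site name with a space gets quoted.
first = srm_config_args("Primary Site", "vmware-dr.xml", "administrator", "10.10.10.10")
print(subprocess.list2cmdline(first))

# Second pass: "AA:BB" stands in for the thumbprint shown by the first run.
second = srm_config_args("Primary Site", "vmware-dr.xml", "administrator", "10.10.10.10", "AA:BB")
print(subprocess.list2cmdline(second))
```

Building the arguments as a list and letting list2cmdline do the quoting avoids the easy mistake of forgetting the quotes around a site name that contains spaces.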

My recovery site is only using x number of hosts to start VMs but it should be using y number
When I experienced this, it was because the host that was not starting VMs did not have access to the storage array. This was due to it not having a VMkernel port that LHN required. I have seen this with other vendors where there was no security between the ESX host in question and the storage array. There are no error messages associated with this situation, so make sure you test for it. I have seen a similar error where the single host at the recovery site didn't have an IP entered for the iSCSI array. In addition, make sure that DRS is healthy. If there are wide deltas between the build / patch levels of the hosts in the cluster, it is possible that certain hosts will not be used by SRM since DRS is not using them. Test that all hosts can be used by VMotion by setting all hosts, one by one, in and out of Maintenance mode to confirm things are OK.

Error: A general system error occurred: cannot execute scripts


If you see this error, the manual power on should work fine but you should provide your logs to VMware support and figure out what the issue is and get it fixed.

Permission to perform this operation failed


This may occur with a variety of different operations when you are using an account that is not the one the install occurred under, or if you have tweaked the permissions of the account. Just having VC and local admin rights is not enough. The SRM log would have errors in it that include (vim.fault.NoPermission). To solve this issue, make sure the user account has the protect privilege.

Priority Levels in Recovery Plan don't reflect my changes


You have made changes in the Protection Group to the priority level of some of your protected VMs. But when you refresh the Recovery Steps you see your VMs with the original priority and not the new one that you set in the Protection Group. This is correct behavior. It may be improved in the future. It is due to the difference in security permissions on both sides. Otherwise it would be possible for someone on the protected side to make changes that affect VMs on the recovery side. This may or may not be appropriate. Until there is a good solution, just right click on the VM in question and use the Move Up or Move Down options to change its execution order priority.

What does SRM database corruption look like?


I would like to know what I might see in the SRM logs if my SRM database is corrupted?
[2009-08-04 21:15:18.077 'SecondaryReplication' 1768 verbose] Loading ShadowVm from DB object
[2009-08-04 21:15:18.077 'DrServiceInstance' 1768 warning] Initializing service content: Unexpected exception 'class Vmacore::Xml::XMLParseException' unclosed token
[2009-08-04 21:15:18.077 'App' 1768 error] Application error: unclosed token. Shutting down ...
[2009-08-04 21:15:18.187 'App' 6344 info] [serviceWin32,414] vmware-dr service stopped

Above is an example of what you might see in the SRM log files when the SRM database is corrupt. You can restore the database if necessary, but make sure to do it on both sides and have SRM not running when you do it.
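If you want to scan a large log for this signature, a short Python filter will do. The pattern is taken from the sample above only and is an assumption, since other kinds of corruption may well produce different messages. The function and variable names are made up for the example.

```python
import re

# Signature strings taken from the sample log above; not an exhaustive list.
PATTERN = re.compile(r"XMLParseException|Application error: .*Shutting down")

def corruption_hints(log_text):
    """Return the log lines that match the corruption signature."""
    return [line for line in log_text.splitlines() if PATTERN.search(line)]

sample_log = "\n".join([
    "[2009-08-04 21:15:18.077 'SecondaryReplication' 1768 verbose] Loading ShadowVm from DB object",
    "[2009-08-04 21:15:18.077 'DrServiceInstance' 1768 warning] Initializing service content: "
    "Unexpected exception 'class Vmacore::Xml::XMLParseException' unclosed token",
    "[2009-08-04 21:15:18.077 'App' 1768 error] Application error: unclosed token. Shutting down ...",
    "[2009-08-04 21:15:18.187 'App' 6344 info] [serviceWin32,414] vmware-dr service stopped",
])

for hit in corruption_hints(sample_log):
    print(hit)
```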

Error:Expected virtual machine file path .. vm-vmname/vm-vmname.vmx cannot be found


This can occur during test or recovery, and it means quite simply that the VM referenced in the error is not in the replicated SAN datastore where it is expected. This most often occurs when you add another VM to the protected datastore and start a test recovery before it has had time to replicate. The solution is to wait until the replication catches up and try the test again. Sometimes this is due to a VM that is Storage VMotioned, or migrated off of a host, at the worst possible time!

SRM 4.0 cannot start, I just updated to vSphere 4.0 Update 1


There is a proxy.xml file that maps between SRM and VC. This file is reset to its default values during the vSphere Update 1 process. This means no mapping and as a result SRM will not be able to start. This can be easily solved after it occurs by using the Repair SRM option in the Add / Remove Programs area. See more at http://blogs.vmware.com/uptime/2009/11/srm-40-license-change-after-upgrading-tovsphere-40-update-1.html including a link there to the KB article.

ESXi not supported at 1.0.0 nor is ESX / VC Update 2


It is not as obvious as it should be, but ESXi is not supported with 1.0.0 of SRM. And as the title mentions, nor is Update 2 of ESX or VirtualCenter. Update 2 does seem to work, at least for me. As of Update 1 for SRM, U3 of both ESX and VC is supported. Even ESXi on FC is supported. However, ESXi on anything else is NOT supported. This will change when a patch for ESXi is released, which should occur in late January. Update: this patch has been released and thus ESXi is supported.

My script needs more time to execute


There is a variable that controls this. It is in the vmware-dr.xml file. It doesn't appear to be in the SRM 4.0 GUI, so changing it will need a restart of the SRM service. Remember that if you make this change every script will now have longer to run, and this may impact your recovery RTO. In SRM 4.0 it would look like:
<Recovery>
<calloutCommandLineTimeout>500</calloutCommandLineTimeout>
</Recovery>

Database access issues


Use Windows Authentication if the DB server is local to the SRM server, and SQL Authentication if the DB server is remote to the SRM server. Make sure the schema for the database has the same name as the user.

No available Customization specifications found


You can create customizations using the View \ Edit Customization command in the VI client. This is how you can change a network setting in a recovery. This is like sysprep, and you are required to fill in all of the necessary information, but only the network info will be used. You will need to create your customization specification on the recovery site. Remember that you can export and import customizations, so if necessary it doesn't take much to move them between your protected and recovery sites.

Errors with using Network Customization


This problem is seen when you try to change a VM during recovery from using DHCP to a static IP. If this doesn't work you should check the vmware-dr.xml file on the recovery site for the following line:
<disableNFCServerCertificateChecks>false</disableNFCServerCertificateChecks>
You will need to change the false to true and then restart the VMware SRM service. You will likely need to re-pair. This error should not occur at GA; I confirmed that as of build 97878 this has been fixed.

Operation Timeout error when doing test recovery


I have recently seen device timeout errors when doing test recoveries where the storage is RecoverPoint 3.3. Some of the VMs were properly recovered, and some were not. In this case the issue is RecoverPoint and there is a forthcoming patch to address it. Sometimes this error may occur on older and slower storage, and it may be called operation timeout. There is a fix for this that involves a configuration change. It will be made in future releases, but you can do it today if you see this operation timeout error. On the recovery side, open the vmware-dr.xml file and look for the section below:
<vmacore>
<threadPool>
<initializedCOM>mta</initializedCOM>
</threadPool>

and it continues on . . . . Add the line <TaskMax>20</TaskMax> so the section will look like:
<vmacore>
<threadPool>
<initializedCOM>mta</initializedCOM>
<TaskMax>20</TaskMax>
</threadPool>

and it continues on . . . . Remember to restart the SRM service. Currently, the value of TaskMax is 10, and sometimes that is not enough. We will increase the value to 20 in future releases.
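If you have to make this change on several SRM servers, the edit can be scripted. Below is a minimal Python sketch that adds (or updates) TaskMax inside vmacore / threadPool. The <Config> wrapper in the sample is a placeholder for whatever the real root element of vmware-dr.xml is, and the function name is made up for the example. Note that ElementTree drops XML comments when it parses, so work on a backup copy of the file.

```python
import xml.etree.ElementTree as ET

def ensure_task_max(xml_text, value=20):
    """Set <TaskMax> under <vmacore><threadPool>, adding it if missing.
    Returns the modified XML as a string."""
    root = ET.fromstring(xml_text)
    pool = root.find(".//vmacore/threadPool")
    if pool is None:
        raise ValueError("no <vmacore><threadPool> section found")
    node = pool.find("TaskMax")
    if node is None:
        node = ET.SubElement(pool, "TaskMax")
    node.text = str(value)
    return ET.tostring(root, encoding="unicode")

# <Config> is a placeholder root element for this illustration.
sample = (
    "<Config><vmacore><threadPool>"
    "<initializedCOM>mta</initializedCOM>"
    "</threadPool></vmacore></Config>"
)
updated = ensure_task_max(sample)
print(updated)
```

Running the function a second time on its own output only rewrites the existing value, so it is safe to re-run. The SRM service still needs a restart afterwards.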

Recovery Plan error: Unable to access the VM config error message


I have seen this error in a number of different situations. Does your recovery server have a software initiator that points to your shared and replicated storage? This is configured with the Properties button (and Configure on the General tab) on the iSCSI adapter in the Storage Adapters area of the configuration area of the ESX server in question. For LeftHand SRAs you need to have an Authentication Group on the recovery side, and if you don't this error can occur. Along with the Authentication Group you also need a volume list.

This can also occur when you have a cluster that you are recovering to and some of the hosts in the cluster do not have access to the storage! For example no iSCSI access to the recovering storage arrays.

Grayed out options for creating and editing of protection group


This happened several times in earlier builds and I was not able to understand why or what the problem was. But the solution was to log into the VSA LeftHand Networks CMC software. After you expanded and looked at the VSAs all was good. This can also occur when you have no datastore groups.

Net::SSLeay::load_error_strings
This comes from the Perl module for OpenSSL, which is required by some SRAs (such as NetApp), and it means that Perl is not installed on the recovery SRM server.

Array with key xxxxxxxxx not found error message


I received this error recently, with a real array name in place of the xxxx, and it was due to me setting the protected site LeftHand Networks SRA to the VIP instead of the management IP. I got through the Array Management configuration with no errors (but no green checkmark), but the test failover failed with the error above. The fix was easy: I used the right IP in the array configuration, saw the green checkmark, and all was good.

Is there a limitation of DR failover LUNs for some iSCSI arrays and some Hosts?
There is a hard limit of 64 iSCSI arrays per host. However, when using SRM there is a limit of approximately 23 recovery iSCSI LUNs on the recovery side only. For more information about this please visit http://kb.vmware.com/kb/1005867 . This is not specific to SRM but to any DR setup you might test.

Can I have a VM with multiple VMDKs spread across two NetApp SRAs?
No. If you have one VM, with two VMDK files, and one is on the NetApp FC / iSCSI SRA, and one is on NetApp NFS you will get an error. This is true for any SRAs. You cannot spread a VM between arrays.

Not sure the error name but interesting problem


Shadow VM issue (thanks Jason): Customer cannot configure protection group because SRM throws the following error:
[2009-02-11 16:16:45.804 'SecondarySanProvider' 9896 warning] Failed to prepare shadow vm for recovery: Unexpected MethodFault (vim.fault.FileNotFound) {
[#2] dynamicType = <unset>,
[#2] file = "[DATASTORE-SRM-VDISK1]",
[#2] msg = "A file was not found.
[#2] [DATASTORE-SRM-VDISK1]"
[#2] }

Technically it's not an adapter problem because the adapter successfully returned the replicated LUN. However, the shadow VM needs to be on a temporary datastore at the recovery site, and this datastore name looks a little strange. Further up in the log I see that datastore:

[2009-02-11 16:16:45.460 'SecondarySanProvider' 9896 verbose] Adding datastore 'DATASTORE-SRMVDISK1' with MoId 'datastore-220' and VMFS volume UUID '4992af7e-6a5f6312-7a66-001cc4bd0c2e' spanning 1 LUNs

Hmm, the protection site has a datastore with the same name as the recovery site ... could it be that the customer has somehow exposed the replicated datastore to the recovery site and is trying to use it as the temporary datastore? Further up in the log I see that the datastore UUID is:
[2009-02-11 16:16:45.382 'SecondarySanProvider' 9896 trivia] Added vmfs extent 'host-69;vmhba1:0:2' with key 'host69;4992af7e-6a5f6312-7a66-001cc4bd0c2e;0' LUN vmhba1:0:2

Then in discoverLuns I see that LUN #2 is replicated:


[#2] <ReplicaLunList>
[#2] <ReplicaLun key="\Virtual Disks\SRM - TEST\SRM-VDISK1\ACTIVE">
[#2] <Number initiatorGroupId="\Hosts\SECOURS\LYONSEC1">2</Number>
[#2] <Number initiatorGroupId="\Hosts\SECOURS\LYONSEC2">2</Number>
[#2] <Number initiatorGroupId="\Hosts\SECOURS\LYONSEC3">2</Number>
[#2] </ReplicaLun>
[#2] </ReplicaLunList>

So, it seems the customer thought they needed to specify the replicated datastore as the shadow VM datastore, so perhaps they split replication, made the replicated datastore visible, then resynchronized replication (so the remote LUN is read-only). Now when SRM tries to create the shadow VM there, the creation fails. The customer corrected the issue by selecting a non-replicated datastore at the recovery side for shadow VMs.

Failed to launch SAN integration scripts


If you are using SRDF and get the error below when configuring your array, you have a path issue. The error is Failed to launch SAN integration scripts to execute discoverArrays command. The issue is that the SYMCLI folder is missing from the path. The solution is to add the path to the SYMCLI bin folder to the PATH environment variable under System variables. The default path is C:\Program Files\EMC\SYMCLI\bin and you will need to restart the SRM server service after the PATH change. This exact error is from an issue with SRDF, but it may occur with other SRAs from the same or other vendors. This issue can also be caused when there is no SRA installed!
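A quick way to confirm whether the SYMCLI folder is really on the PATH is a small check like the Python sketch below. It compares entries case-insensitively and ignores a trailing backslash, the way Windows paths behave. The function name is made up for the example, and the folder shown is the default mentioned above, so adjust it if SYMCLI was installed elsewhere.

```python
import ntpath

def is_on_path(folder, path_value):
    """True if `folder` is one of the entries in a Windows PATH string."""
    wanted = ntpath.normcase(ntpath.normpath(folder))
    for entry in path_value.split(";"):
        entry = entry.strip().strip('"')
        if entry and ntpath.normcase(ntpath.normpath(entry)) == wanted:
            return True
    return False

SYMCLI_BIN = r"C:\Program Files\EMC\SYMCLI\bin"

# Made-up PATH values for demonstration.
print(is_on_path(SYMCLI_BIN, r"C:\Windows;c:\program files\emc\symcli\bin\;C:\Tools"))
print(is_on_path(SYMCLI_BIN, r"C:\Windows;C:\Tools"))
```

On the SRM server itself you would pass os.environ["PATH"] as the second argument. Keep in mind that the SRM service only picks up a changed PATH after it is restarted, so a check from a fresh command prompt can pass while the running service still has the old value.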

Failed to connect to NFC during test failover with IP customization


I accidentally restarted my recovery side SRM server during a test failover with IP customization. I have done this before, with no IP customization, without issue. The recovery plan, after SRM starts again, is in paused mode and you just continue it. But this time, when doing IP customization, I had an issue. During the rest of the failover the IP customization failed. I removed the IP customization and everything worked. I attached the IP customization again and the problem came back: failed to connect to NFC. I found a KB article (http://kb.vmware.com/kb/1009903 ) that I didn't like, and thought would be an issue, so I ignored it! I restarted the recovery side SRM service with no change or improvement. I restarted both SRM services and I had no more issues. I could do test failovers with IP customization with no error.

No visible LUNs during configuration of the array


This will occur if there are NO VMs in the protected datastore. Add a VM to the protected datastore and the LUN will be visible in the array configuration. Technically it is looking for the .vmx files. Recently this occurred for a very odd reason: the SPC-2 bit was not set properly in an SRDF environment. See http://www.yellow-bricks.com/2009/07/21/srdf-sra-and-the-spc-2-bit/ for more information. The case of no VM or VMFS on the replicated storage has been fixed in SRM 4.0. Now a message is displayed.

Review Replicate Datastores window of Array Manager is blank


When you are configuring your SRA, the last step is to show you the replicated LUNs. If you see nothing, you have a problem. Using the Rescan button doesn't cause the LUN(s) to be displayed. To work around this issue, use the following steps:
1. In the VI Client,
2. Go to the ESX host configuration area
3. Now select Storage
4. In the upper right area select the Refresh option.
5. Now return to the SRM Array Manager configuration,
6. Select Rescan,
7. Then select Back,
8. Now select Next
9. You should now see your LUN information displayed.
It should be noted that this could occur if there are NO VMs on the LUN. In SRM 4.0, if you have no VMs on your storage you will now be able to see it, but with a warning.

How do I find the Managed object reference (MoRef) for a VM?


Sometimes you will have to find the MoRef of a VM, and you can use the info below to do that. Sometimes you will not see the name of a VM, but rather its MoRef. This would be in the SRM log, for example, where the VM might be known as vm-xxxx.
By DNS:
1. Visit https://VC_HOST_NAME/mob/?moid=SearchIndex&method=findByDnsName
2. Under dnsName, enter the DNS name of the virtual machine as seen by the vCenter server in the Summary tab
3. Under vmSearch, enter true
4. Click Invoke Method
By IP:
1. Visit https://VC_HOST_NAME/mob/?moid=SearchIndex&method=findByIp
2. Under ip, enter the IP of the virtual machine as seen by the vCenter server in the Summary tab
3. Under vmSearch, enter true
4. Click Invoke Method

See more info in http://kb.vmware.com/kb/1017126. This very new KB article shows a different method than above. That may be due to the process above being old and not usable any longer. I will test this when I get a chance and correct as necessary. But for now, use the process above if it works!
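Two small helpers can save some typing here. The Python sketch below builds the MOB search URL for a given vCenter host and pulls vm-xxxx style MoRefs out of a log line. The function names are made up for the example, and the sketch assumes the xxxx part is numeric. It does not log in to or call the MOB; it only prepares the URL for your browser.

```python
import re
from urllib.parse import urlencode

def mob_search_url(vc_host, method):
    """Build the Managed Object Browser URL for a SearchIndex method."""
    query = urlencode({"moid": "SearchIndex", "method": method})
    return f"https://{vc_host}/mob/?{query}"

def vm_morefs(text):
    """Return the vm-<number> references found in a piece of log text."""
    return re.findall(r"\bvm-\d+\b", text)

# VC_HOST_NAME is the same placeholder used in the steps above.
print(mob_search_url("VC_HOST_NAME", "findByDnsName"))

# Made-up log fragment, for illustration only.
print(vm_morefs("Recovering vm-1234 and then vm-77"))
```

The by-IP method name from the steps above can be passed as the second argument in the same way.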

Null parameter name:key error


If you are adding a protection group and you get an error with a value of null parameter name:key in it, the solution at this time is to restart the SRM service on both the protected and recovery sites.

Missing testbubble switch on recovery host


When you are checking your test recovery VMs for network connectivity, you find that the VMs on one ESX host can talk to each other, but on other ESX hosts there is no connectivity. Further checking shows that only one recovery ESX host has the testbubble switch, and the other hosts do not have that switch even though the recovery VMs are configured to use it. Therefore the VMs configured to use a testbubble switch that doesn't exist will not be able to communicate. This is a recently discovered bug that will likely be fixed in the very next release - confirmed.

Error occurred MirrorViewSRACore.dll not found


If you see an Error occurred in array configuration, and then in the SRM log you see an unable to load DLL MirrorViewSRACore.DLL, you have this problem. The problem is not a missing MirrorViewSRACore.DLL file. Rather, you are potentially missing four other DLL files that you need to make sure are in the path. They are: msvcr80.DLL, storapi.DLL, symapi.DLL, and msjava.DLL. I have also seen this error myself when Solutions Enabler was not installed. I have also seen this error when you have 64 bit SE on a 64 bit platform. You need to use 32 bit SE since the SRA is also 32 bit.

You do not hold system privilege System.View on ServiceInstance DrServiceInstance


This error is referred to in http://kb.vmware.com/kb/1016875 and is not frequently seen. However, this error can be avoided by using an AD account that is Domain Admin, and admin in both VC and SRM (although this comes after the install adds the SRM objects). The account should be the ODBC account as well, and you should log into the VC server with it as well. Make sure that account is used by the VC service as well. This account will be used in the SRM install as the account that is kept by the install. It will at that point turn into a service account. After that point, you will not use that account again. Other VC / SRM work will be done with your own account. The error above will generally occur when you are not following the suggestions above.

Install hangs at 90%, and install log shows VIEINSTUTIL: Failed to open service control manager
This error can occur when you are installing SRM with a partial admin account. In point of fact you are missing the privilege to add a service. Redo the install with an admin account.

Execution of scripts is disabled on this system


If you are trying to execute PowerShell scripts during the execution of a recovery plan and it is not working, you may see the above error in the SRM logs. The solution is to enable remote signing. You can get help on this in PowerShell with the command get-help about_signing or you can check out http://technet.microsoft.com/en-us/library/ee176949.aspx#EEAA .

Protection Group configuration times out


If you have done a failover, and are trying to failback, and you get timeouts when you are configuring the protection group, it can be tricky to troubleshoot. The Shadow VMs would be created, but the task would not complete. This problem has been reported (thanks John) when during the cleanup after the failover the LUNs with the name snap-xxx were NOT renamed. Once they were renamed the timeout errors would go away and the configuration of the protection group went through without issue.

Failed to update Perl installation directories


This was reported in our KB specifically with Win2K8 R2, but I have received reports from the field that indicate it may occur with Win2K3 as well. See our KB for help at http://kb.vmware.com/kb/1028918 but you may also need info from MS at http://support.microsoft.com/kb/210638 . I think between both of these you should be fine. If not, call support, and let me know how it all goes. BTW, when I had this issue, I could not find the fsutil tool, but I did learn that you can make the change in the registry at HKLM\System\CurrentControlSet\Control\FileSystem\NtfsDisable8dot3NameCreation and change the 0 to a 1. Then do the install again.

Error: The operation is not supported on this object


This was very confusing for me and hard to troubleshoot. It has been seen in the wild once or twice, and now it was happening to me. It does have a KB now but for the longest time it didn't. When the error occurs it is a pop-up error: "unable to create placeholder virtual machine at recovery site: Recovery virtual machine could not be created: the operation is not supported on the object". It turns out that this error is due to a .VMX parameter error. I had deployed several virtual machines (from template) that were Win2K8 R2 and they could not be protected. The VMX file setting that was causing the issue was svga.autodetect=true together with svga.vramsize=167772161. The workaround was to have svga.autodetect say false, or the vramsize parameter say 4194304. Neither of which was the default.

For the virtual machines already deployed I could edit the .VMX file, remove them from inventory and add them back, and they could be protected fine. The template was hard to fix. I actually had to turn it into a VM, power it on, change the video settings (to set it exactly to 4 MB) and turn it back into a template. I think the 4 MB could also be 8 MB but I have not tested that. One issue with this suggested solution is that you will see in the VM events a notice that you can have larger video memory. Ignore it. The KB is at http://kb.vmware.com/kb/1020796. I have not seen this error with XP or Win2K3 but it is in theory possible with them. This is a problem that can occur with any of our software that does re-configuration, like View or Converter.
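If you have many virtual machines to check, the .VMX files can be scanned for this combination. This is only a sketch in Python, and the rule it applies is an assumption based on my notes above: it flags svga.autodetect set to true together with an svga.vramsize that is not 4194304. The function names are made up for this example.

```python
import re

# Value taken from the text above: 4194304 bytes (4 MB) was the working size.
SAFE_VRAM_BYTES = "4194304"

# Matches lines of the form: key = "value"
_VMX_LINE = re.compile(r'^\s*([^#\s=]+)\s*=\s*"(.*)"\s*$')


def parse_vmx(text):
    """Parse key = "value" lines of a .vmx file into a dict with lowercased keys."""
    settings = {}
    for line in text.splitlines():
        match = _VMX_LINE.match(line)
        if match:
            settings[match.group(1).lower()] = match.group(2)
    return settings


def has_svga_problem(text):
    """Assumed rule: autodetect is true and vramsize is set to something other than 4 MB."""
    settings = parse_vmx(text)
    autodetect = settings.get("svga.autodetect", "").lower() == "true"
    vram = settings.get("svga.vramsize")
    return autodetect and vram is not None and vram != SAFE_VRAM_BYTES


if __name__ == "__main__":
    sample = 'svga.autodetect = "TRUE"\nsvga.vramSize = "167772161"\n'
    print("Needs attention:", has_svga_problem(sample))
```

A VM that is flagged is only a candidate. Confirm against the KB before editing the file.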

You do not see a newly added LUN when creating a PG?


This happened to a friend of mine lately. When you start replicating a new LUN, you should wait until it is finished replicating before adding it to SRM. That is not required but a suggestion. Before you add it to a new PG, you will generally need to use the Array Manager and click Next through it until the last screen, where you should see your new LUN. This may not always be necessary, in that there may be some sort of scheduled activity that allows SRM to see a newly replicated LUN, but I am not sure that is true, so I always use this method and I have no issues as a result. So, the suggested best practice is to let your replication finish, and then use the Array Manager to make sure you can see that new LUN. I think if you do that, you will not have a problem when creating your new PG.

Operation failed...Details: VI API Version 4.1 is not supported


This is covered in Primus Article 247587. It is related to MirrorView. See below for the solution:

ID: emc247587
Domain: EMC1
Solution Class: 3.X Compatibility

Fact: Product: CLARiiON CX4
Fact: Product: VMware VCentre Server 4.1
Fact: Product: VMware ESX Server 4.x
Fact: Application SW: VMware SRM 4.1
Fact: Application SW: MirrorView Insight for VMware 1.4.0.15
Fact: Application SW: MirrorView Insight for VMware 1.4.0.16

Symptom: Error when executing MirrorView Insight for VMware
Symptom: Operation failed...Details: VI API Version 4.1 is not supported

Cause: At the time of MirrorView Insight for VMware (MVIV) release in the year 2009, vCenter Server 4.1 was not yet available and the official support was only for VMware Virtual Center Server v2.5u2 and vCenter Server 4.0. Subsequent to the release of the vCenter Server 4.1, MVIV was qualified with vCenter Server 4.1. However, to enable MVIV to recognize vCenter Server 4.1 as the supported version, a registry key must be added.

Fix: Follow these steps to create or modify the following registry entry:

For 64-bit machines
1. Start the registry editor.
2. Navigate to: My Computer\HKEY_LOCAL_MACHINE\SOFTWARE\Wow6432Node\emc\MirrorViewInsightForVMWare\Preferences
3. Modify the "SupportedVIAPIVersion" data so it reads as follows: 2.5u2;4.0;4.1
4. If this entry is not there, create a new string value of "SupportedVIAPIVersion" with the data of 2.5u2;4.0;4.1. The entry should look like this:
   SupportedVIAPIVersion REG_SZ 2.5u2;4.0;4.1

For 32-bit machines
1. Start the registry editor.
2. Navigate to: My Computer\HKEY_LOCAL_MACHINE\SOFTWARE\emc\MirrorViewInsightForVMWare\Preferences
3. Modify the "SupportedVIAPIVersion" data so it reads as follows: 2.5u2;4.0;4.1
4. If this entry is not there, create a new string value of "SupportedVIAPIVersion" with the data of 2.5u2;4.0;4.1. The entry should look like this:
   SupportedVIAPIVersion REG_SZ 2.5u2;4.0;4.1
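If you want to script the "modify or create" part of the fix above, the data string can be built so that any versions already listed are kept. This is a sketch only, in Python; merge_supported_versions is a name made up for this example and is not an EMC or VMware tool.

```python
def merge_supported_versions(current, required=("2.5u2", "4.0", "4.1")):
    """Return a semicolon-separated list that contains every required version.

    `current` is the existing registry data, or None if the value does not exist.
    Existing entries are kept in order; missing ones are appended.
    """
    versions = [v.strip() for v in (current or "").split(";") if v.strip()]
    for version in required:
        if version not in versions:
            versions.append(version)
    return ";".join(versions)


if __name__ == "__main__":
    print(merge_supported_versions("2.5u2;4.0"))
    print(merge_supported_versions(None))
```

On the Windows server itself the result could then be written as a REG_SZ value with Python's standard winreg module (winreg.SetValueEx), but test that on a non-production machine first.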

SRM LUN discovery, test, failover fail with file write errors
Brock reported this to me so thanks very much for that. It is related to IBM SVC but it is an interesting one. The SRM log will show Error writing to C:\users\srmadmin\appdata\local\temp\vmwaresrmadmin\dr-sanprovider6984-0 or something similar. For the solution and more details see http://kb.vmware.com/kb/1033871. It turns out that this is caused by a Java garbage collection issue!

SRM SRA Errata


This section has very specific notes on SRAs that I work with, or that others share with me. Don't forget that specific errors with any SRA will be reported above in the troubleshooting section, but background information will be below in the appropriate section.

LeftHand Networks
The LHN adapter requires the account / password of the CMC management app. The protected side array configuration should reference the SRA installed on the protected side! I originally noted that both IP fields should contain the same IP information, which should be the VSA on the protected side. Update: the two IP fields for the LHN SRA do not need to contain the same IP information, nor do both need to be filled. Only the first one needs to be used. The SRA must talk to a manager, and NOT to a virtual IP. You can put more than one IP address in a field by separating them with a comma. If you have five managers it would be a good idea to put at least two of them into the first, or first and second, IP fields.

The original certified version was 7.0.01.6066, and it was followed by 8.0.00.1682. There is a new version of the VSA and of the SRA and they both work well with Update 1 of SRM. The current version of LHN is 9.0 and the SRA is 9.0.0.3561 (11/11/10). There are a lot of new features in the SAN/iQ software, but there are not many changes required for this document in terms of install and configure of the array.

An old report of Lessons Learned is still interesting at http://frankdenneman.nl/2009/10/lefthand-sanlessons-learned/ . Good info on using this excellent gear.

LeftHand snap left visible after test recovery


This is appropriate with the current versions of the LeftHand VSA and SRM. It will disappear according to the retention guidelines in the LeftHand CMC.

Miscellaneous Information
When you install your VSAs make sure to follow the LHN instructions step by step. Then on your protected site use the wizards to configure the VSA to be able to present storage to the protected site ESX server. Make sure it is seen in ESX before continuing. Once this is done you can work on the recovery site VSA, but your configuration will be different. Create a Remote Scheduled Copy from the protected site to the recovery site. As part of this, create a remote volume on the recovery side. If you stop now you will apparently have working shared storage that is replicating, but you will get the error mentioned in the Appendix about being unable to access the VM configuration. You will need to use the Tasks menu in the CMC to create a Volume List and then an Authentication Group. Once this is done your Recovery Plan should work fine. All storage vendor SRAs require a restart of the SRM service after the install.

The LHN VSA uses remote scheduled copies to do the replication, and this means that while the test failover is progressing the remote copy process is not interrupted. One of the remote copies is mounted for the recovery site to work with, but that doesn't stop the replication / copy process. I recently upgraded from 8.0 to 8.1 and had a few interesting things happen. I forgot to upgrade my SRA. So after I did the VSA upgrade my test failover failed. It only took 5 seconds to fail. The error message in the history report was almost misleading. It looked like a credentials issue. It said that it failed to authenticate with the array management system during a test failover. Had it only said that it failed to authenticate, it would have been true, but with the extra detail it pointed at a different issue. Upgrading the SRA cleared this issue.

NetApp
When using SRM and NetApp, and when using NFS and OnTap version 8, you may have a configuration issue stopping you from successful configuration. More info on this can be found at the link below, but the workaround is simple. Make sure you put your NFS IP address into the NFS IP field even if you think it is not necessary because you are using the same IPs. http://now.netapp.com/NOW/cgi-bin/bol?Type=Detail&Display=464045

This SRA requires the account / password to the simulator or NetApp device, but it only has one IP address field, compared to the LeftHand SRA which has two. The minimum software version on the NetApp devices that the NetApp SRA requires is 7.2.2. Of significant note, the simulator can be one VM running two instances, so that it can provide both the protected and failover storage with much less resource usage. NetApp has mentioned that they designed the SRA to support simultaneous recovery plan operations but did not test it extensively.

By default OnTap doesn't have SSH enabled. It should be enabled for SRM, although this is not required. NetApp uses Flexclone to provide storage for the test failover, so the replication of data is not impacted during a test failover. The NetApp SRA uses SSL to talk to the NetApp controller and there are no other ports required. However, I have another report that it uses unsecure HTTP over port 80, and that this can be configured as something else. I have confirmed it works by default over HTTP but that it can in fact be changed to use HTTPS.

When working with NetApp it is worth having one LUN per volume. Storage is sometimes configured as a very large volume that has multiple LUNs on it, but for the best flexibility when working with SRM it is best, if possible, to have one LUN in each volume. If you have recovery site igroups configured in your protected site igroups you will have errors. Don't do this.
When reading through SRM logs, you will sometimes see lines that start with <StoragePort id=> and what you see after the =, when it starts with 50:0A, it means NetApp. This is useful to know if you think you are using something else such as IBM or EMC. You can troubleshoot communication by using a browser and connecting to the filer as http://aa.bb.cc.dd/na_admin and you should get the FilerView page.
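The 50:0A check above is easy to automate when a log is long. Here is a minimal Python sketch. It assumes the log lines look like <StoragePort id="..."> with a colon-separated WWPN, which may not match your SRM version exactly, and the function names are made up for this example.

```python
import re

# Assumed line shape: <StoragePort id="50:0a:09:...">, quotes optional.
_STORAGE_PORT = re.compile(r'<StoragePort\s+id="?([0-9A-Fa-f:]+)')


def storage_port_ids(log_text):
    """Return every StoragePort id found in the log text, in order."""
    return _STORAGE_PORT.findall(log_text)


def looks_like_netapp(port_id):
    """Per the note above, ids that start with 50:0A indicate NetApp."""
    return port_id.upper().startswith("50:0A")


if __name__ == "__main__":
    # Made-up sample ids, for illustration only.
    sample = (
        '<StoragePort id="50:0a:09:81:00:00:00:01">\n'
        '<StoragePort id="50:06:01:60:00:00:00:02">\n'
    )
    for port in storage_port_ids(sample):
        print(port, "NetApp" if looks_like_netapp(port) else "other")
```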
There is a "VMware SRM in a NetApp Environment" document that is quite useful. Find it at http://media.netapp.com/documents/tr-3671.pdf . If you get SRM/VC events about "Virtual machines have one or more devices which don't have file backings on the replicated site", this is due to a CD being attached to a VM. As of 4/3/09 an updated SRA (or as NetApp says, DRA) is available that solves a number of issues, and you should be sure to use it with SRM 1.0.1 Update 1. The new version number of the DRA is v1.0.1. Both it and the IBM N-Series 1.0.1 adapter are available at the VMware SRM download page.

SRM Test recovery error: failed to recover datastore - background information


SRM will set the LVM settings it needs, so you should not need to mess with those. You should see that disablesnapshot and enableresignature are BOTH set to 1. The messages you have copied below come from the fact that ESX has rescanned its devices and has just processed the "read-only" snapmirror destination lun and worked out that it is what we call a snapshot lun, which in storage terms is just a replica. We have not done anything else with the device at this stage.

When you execute a recovery plan in "Test" mode you will not see the snapmirror destination lun status change at all. Its state should remain "online,snapmirrored,read-only". The reason for this is that during a "Test" the netapp SRA will dynamically provision a flexclone of the destination volume and present that device to the ESX hosts being used at the recovery site. During the running of a test, if you open Netapp FilerView, navigate to the "Manage Volumes" screen and keep hitting the refresh button, then once you see our "Preparing Storage..." task hit around 25% complete you should see a new volume get created. On the netapp side it will appear as "testfailoverclone.xxxxxxxxxx", and on the VMware ESX side it will appear as a snap-xxxxxx<originalvmfsname> datastore. Note the customer must have the netapp flexclone license installed to be able to create these devices.

If you run your recovery plans in "Run" or full-failover mode then you will see the netapp SRA break off the snapmirror relationship and present the actual destination lun to ESX rather than use a flexclone. If the customer starts to manually alter the status of the devices that SRM+SRA are expecting to control, then they will see errors during the recovery plan execution, since the customer has changed the state of the device and SRM+SRA was expecting to do that itself, so it will report the fact that something unexpected has happened.
There are some NetApp configurations that cause their SRA problems, such as having multiple filers at both protected and recovery sites where each site's filers are clustered together, but we should be able to identify that setup from the full SRM log. NetApp had a new SRA coming that I believe supports this. It is out now (3/22/09); see http://blogs.netapp.com/virtualization/2009/04/some-news-on-the-netapp-srmfront.html#more for more info on this updated SRA.

Logs
[2009-02-18 09:41:04.448 'SecondarySanProvider' 4176 trivia] 'Prepare 1 groups for test' took 34.845 seconds
[2009-02-18 09:41:04.448 'SecondarySanProvider' 4176 trivia] Firing CallOnDestruction callback
[2009-02-18 09:41:04.448 'SanConfigManager' 4176 trivia] Scheduling lun group computation in 0 seconds
[2009-02-18 09:41:04.448 'RSStorageOperation-8814-Task' 4176 verbose] Result set to (dr.secondary.ReplicationManager.SingleVmFailure) [
[#14] (dr.secondary.ReplicationManager.SingleVmFailure) {
[#14] dynamicType = <unset>,
[#14] vm = 'dr.secondary.ShadowVm:shadow-vm-8688',
[#14] fault = (dr.san.fault.RecoveredDatastoreNotFound) {
[#14] dynamicType = <unset>,
[#14] datastore = (dr.vimext.SanProviderDatastoreLocator) {
[#14] dynamicType = <unset>,
[#14] primaryUrl = "sanfs://vmfs_uuid:4994e685-01ee1320-a88a001ec9f48f03/",
[#14] },
[#14] reason = (vmodl.MethodFault) null,
[#14] msg = ""
[#14] },
[#14] }
[#14] ]

---------------
Now the problem seems to be that the replicated LUN is seen as a snapshot by the ESX host.
---------------

vmhba2:0:5 vml.020005000060a98000486e2f39535a4e674c59674e4c554e202020

Feb 18 00:47:13 vmkernel: 27:17:19:37.383 cpu4:1221)LinBlock: 1994: VFS: Disk change detected on device 3:0
Feb 18 00:47:13 vmkernel: 27:17:19:37.408 cpu7:1339)LVM: 5573: Device vml.020005000060a98000486e2f39535a4e674c59674e4c554e202020:1 detected to be a snapshot:
Feb 18 00:47:13 vmkernel: 27:17:19:37.408 cpu7:1339)LVM: 5580: queried disk ID: <type 2, len 22, lun 5, devType 0, scsi 5, h(id) 11890432529146075181>
Feb 18 00:47:13 vmkernel: 27:17:19:37.408 cpu7:1339)LVM: 5587: on-disk disk ID: <type 2, len 22, lun 20, devType 0, scsi 5, h(id) 3407130522988133436>
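In the vmkernel lines above, the "queried" and "on-disk" disk IDs differ in the lun and h(id) fields, which goes along with ESX calling the device a snapshot. When you have a lot of vmkernel log to read, a small parser can pull out the differing fields for you. This is a sketch only, in Python; the regular expression assumes the line layout shown in this particular log, and disk_id_differences is a name made up for this example.

```python
import re

# Matches the LVM "queried" / "on-disk" disk ID lines as they appear in this log.
_DISK_ID = re.compile(
    r"LVM: \d+: (queried|on-disk)(?: disk ID:)? <type (\d+), len (\d+), lun (\d+), "
    r"devType (\d+), scsi (\d+), h\(id\) (\d+)>"
)
_FIELDS = ("type", "len", "lun", "devType", "scsi", "h(id)")


def disk_id_differences(log_text):
    """Return {field: (queried, on_disk)} for every field that differs.

    Returns an empty dict if either line is missing. If several pairs are
    present, only the last one of each kind is compared.
    """
    ids = {}
    for kind, *values in _DISK_ID.findall(log_text):
        ids[kind] = dict(zip(_FIELDS, values))
    if "queried" not in ids or "on-disk" not in ids:
        return {}
    return {
        field: (ids["queried"][field], ids["on-disk"][field])
        for field in _FIELDS
        if ids["queried"][field] != ids["on-disk"][field]
    }


if __name__ == "__main__":
    sample = (
        "LVM: 5580: queried disk ID: <type 2, len 22, lun 5, devType 0, scsi 5, h(id) 11890432529146075181>\n"
        "LVM: 5587: on-disk disk ID: <type 2, len 22, lun 20, devType 0, scsi 5, h(id) 3407130522988133436>\n"
    )
    print(disk_id_differences(sample))
```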

MetroCluster background information (thanks Lee!)

The short version is that SRM is a DR solution, and MetroCluster is a stretched HA solution. BTW, our HA was not designed for distance, and thus the KB article to help with that at http://kb.vmware.com/kb/1001783 . MetroCluster is basically a dual controller NetApp system, stretched across two sites, with the disks from each site synchronously mirrored over fibre to the other site. The idea being that you can lose one site, and the surviving controller takes over - just like a normal controller failover process. If you're going to stretch a storage system across two sites, then the chances are you'll have a decent network between the sites, and you'd want to stretch your ESX HA Clusters across the two sites as well (so you can VMotion from one site to the other). Then losing one site results in a MetroCluster failover, followed shortly by an HA restart of the VMs on the surviving ESX servers in the surviving site.

In terms of high-level comparison we could do this:

                              SRM    MetroCluster
Distance Limited              No     Yes
vCenter Integrated            Yes    No
DR Workflow Creation          Yes    No
Transparent Failover          No     Yes
Non-disruptive DR testing     Yes    No
Site Failure VM Protection    Yes    Yes
NFS Support                   Yes    Yes

Campus cluster / stretched HA environments (i.e. MetroCluster) work well if you have the right kind of infrastructure, but they are not really DR solutions, as typically the two sites are very close together and most customers I work with do not consider a DR site true DR if it is located within a certain distance of the primary. We had a couple of customers a few years ago whose "campus" solution was wiped out entirely when the UK oil field disaster struck and took out both datacentres at the same time (they were 0.5 miles apart). Extreme example maybe, but it illustrates the difference. If you can live with the limitations of a campus cluster solution and they fit your needs then they can work well.

As we say in the UK, take what the whitepapers say with a pinch of salt until you've tried it yourself. With any cross site storage architecture I have implemented, there will be **some** kind of pause whilst the system sorts things out. The amount of time this takes depends entirely on what failed. It could be 2 seconds, or it could be 2 minutes or more, and then you need to wait for HA to kick in. So when talking about failover initiation I would not say SRM and stretched HA solutions are really any different time wise. Indeed, if you wanted to automate the initiation of an SRM recovery plan you can do this, though if it were my pair of sites I would want this process to be kick started manually by someone once the true nature of the event was understood. With an SRM recovery plan the storage integration "tells" the storage to come online rather than having to wait for a failover heartbeat or similar to be detected by the storage itself. Going back to campus clustering, although array/disk shelf failover can be automated, in my experience this does not always happen automatically either. Again, sometimes it may require a manual intervention (click a button, or type a command to failover) and you need to have the process defined clearly for that event.

Losing a controller in either site should be no big deal for most vendors, and the failover operation should take care of the storage side. If you lose the entire site, then manual intervention will (probably) be required to failover. It can sometimes be possible to script round this using staged heartbeats, but again that still adds time to the failover.

If we look at failback, with the campus implementation the process to failback is not as simple as bringing up Site1 and then just vmotioning the VM's back from Site2. Again, it depends on the failure. If you lost Site1 completely and have had to failover to the disk shelves at Site2, then the VM's will now (once HA has restarted them) all be running from the disk shelves at Site2. If you simply VMotion them back to Site1 when it is ready, the storage will still be accessed via Site2's controller / disks until you tell the storage arrays to go back to their default configuration, which will require restarting the VM's again and will incur downtime in the same way as an SRM failback would. I cannot imagine you would want a situation where Site1 came back online and you vmotioned 50% of your workload back to Site1 but left 100% of your disk workload
running at Site2. I think in all cases the customers I have put this in with have wanted the storage to "go back to how it was", ready for the next event or failure. The biggest difference in terms of the customer feedback I receive is that the ability to perform automated, repeatable, non-disruptive DR testing is one of the key factors moving customers towards SRM.

The only other items you need to be thinking about with campus cluster are below. I am not adding these to say "SRM is better". These are simply things I have had to work through when implementing campus cluster, and some of these nuances don't always make it into the whitepapers/datasheets, shall we say.

- VC Inventory / Layout: be careful with the design. As everything is stretched, you need to be very consistent and accurate with naming conventions across all the inventory objects the VM's will use.
- DRS/HA settings: with campus clustering, ensure that you know which VM's are important and define the correct settings per VM for recovery. Unless you have N+1 capacity spare at each site you will need to put in place HA/DRS settings that bring the most important VM's online first, so that you don't end up in a failure situation with all your dev/test VM's online and half the production VM's "down" because you did not set correct priorities in HA. In SRM this is something the recovery plan handles and you can control.
- Split Brain: if you run the two sites as one big HA/DRS cluster, ensure you test out the various failure scenarios. For example, if DRS (or manual VMotion) moves a bunch of VM's from Site1 to Site2 but no failure has occurred at that time, you now end up with the VM's CPU/Memory/Network contexts running on hosts at Site2 but accessing their VMDK's on Site1. This will work but is not always desirable from a latency point of view (it might be a non-issue if bandwidth is sufficient). However, what happens next if you now suffer a disk outage at Site1? At this point the VM's will not crash immediately at Site2 and it will take HA some time to realise these VM's have an issue. Try it and see. If you disconnect storage from a VM, the VM will cling on to life (assuming the IO pattern is normal) for quite some time before a bluescreen is seen.
- Storage Presentation: if your vendor wants the zone across the sites to effectively be "open" to all ESX hosts, then ensure you understand the implications of the ESX LVM settings with regards to snapshot / disk resignature. You potentially will have ESX hosts that could at some point access both a source and target lun at the same time if someone or something altered the LVM defaults.
- Zoning: if the vsan / zones are truly open, or all hosts are in the same zone, then certain fabric events can be a potential pain. Any rogue events such as RSCN will disrupt both sites at the same time if all ESX hosts are on the same open fabric, so be careful here. It is not something that is too common but I have seen it hurt a few customers. It usually comes down to a bad HBA or cables but can be a real pain to track down.
- VC / ESX limits: as you build the design out for campus cluster, ensure the design won't
have you quickly reaching the limits of what is supported in terms of things like max number of VMs/VC, max number of luns/ESX host, max number of paths/lun/ESX host etc.

As much as I like SRM solutions, I also like the campus cluster / single pane of glass approach where it works/fits. Both use-cases are valid, but ensure you work out what you actually need.

And some more on this subject from Lee:

......by the way....metrocluster's competition is NOT SRM....it's EMC VPLEX....VPLEX is EMC's stretched storage / HA solution....but as with NetApp, EMC also integrate ALL of their platforms with SRM as they know that is what provides DR.

Let's compare some component basics. With SRM the architecture is designed so that the recovery process does not depend on any component from the protected site to work. Simple example: SRM uses two vCenters, meaning if the protected site VC dies it does not affect your ability to recover. This is not the case with metrocluster. With metrocluster you're using a single vCenter instance across the two datacenter rooms, so HA can recover the VMs if vCenter is lost, but in your design you're now using a single vCenter namespace across two sites, and this needs to be taken into account when you're adding objects to your vCenter inventory (naming consistency, scale...limits etc). Also there are failover scenarios to think about with metrocluster. It's not ALL automated by VMware HA by any means. SRM's strength as a DR solution (combined with netapp snapmirror) is that it allows customers to build repeatable recovery workflows that bring their infrastructure back online in a specific order. VMware HA does NOT do that.
Other factors to consider that might not be immediately obvious (don't get me wrong here, I'm not bashing metrocluster with netapp/vmware, it's a good solution, you just need to be sure it's what the customer needs...otherwise SRM/netapp/vmware is a good solution also :)

Sometimes I find metrocluster is wrongly sold to customers who really needed DR...those are the situations to be careful of. Some (not all) netapp account teams will try and only sell metrocluster because that solution is more $$$$$ for them, and sometimes it is because they just don't understand that netapp integrate with SRM very well!!!

Other caveats of a metrocluster solution that customers need to understand and be happy with are below....if your sites are close together and you truly want stretched HA then you will most likely be fine with these points, BUT if you really wanted DR then these points will usually come as a surprise to customers and annoy them!

- In a metrocluster deployment, granularity in the filer is at the aggregate level (highest level!!!) NOT at the volume level. With SRM you can simply set up snapmirror for the volumes / luns you want to protect. This is NOT the case with metrocluster.
- Metrocluster has no offline/non-disruptive testing capability as we do with SRM/Netapp Flexclones, so how do you prove you can failover successfully?
- Metrocluster has distance limits (2GB link 500m stretch or 4GB link 270M stretch, for greater distance need fabric/switch MC for up to 100KM max distance span).
- Metrocluster solution *must* use recommended brocade switches to be supported.
- All disks login to switches as hosts. Max limit is 672 logins, cannot go beyond this limit (672 = 6 switches) and would need 6 x 2 x 2 for redundancy at both sites...lot of switches :)
- ALL disk shelves must be mirrored. In a metrocluster solution you're using SyncMirror NOT Snapmirror. If you had say 20 disk shelves and broke those up into 4 aggregates, and within each aggregate had say 40 volumes each containing a single VMware NFS export, then with SRM you can simply snapmirror the volumes (exports) you need to replicate. So lots of flexibility and granularity, and no need to replicate things you don't need at the DR site. Why waste bandwidth???? With a Metrocluster solution ALL disk shelves MUST be mirrored even if the VMs within them are not needed for DR or are not business critical...so less flexibility when carving up the storage.

If you suspect the customer is being mis-sold metrocluster, ask the customer simple questions to try and work out what they want:

- Control of the failover?
- Orchestration, using recovery plans to build recovery workflows that match what their business wants to happen
- Ability to pre-build recovery workflows that can be tested, validated and invoked during an outage knowing the recovery will take place following the pre-programmed workflow
- Ability to perform non-disruptive tests
- Ability to run pre/post power on scripts
- Ability to customize network at remote sites as part of the recovery
- Ability to run callout scripts that talk to other pieces of their infrastructure
- Ability to run the recovery in a pre-defined sequence that matches their own business recovery processes and SLA's
- No single point of failure....i.e. both sites run with their own management layers (vCenter), meaning if one of the sites is lost nothing from that site is needed to recover.....in a stretched HA+metrocluster environment that is NOT the case. Metrocluster uses a single vCenter server for both sites. vCenter is the single point of failure for some scenarios here.

If the answer to the above is YES then they are looking for a DR solution....hence they need NetApp Snapmirror combined with NetApp's SRA for SRM.

If the customer has two sites / server rooms that are VERY close together and they do just want to run both rooms as "one" then that might be a good fit for metrocluster. For example, if the two rooms were <20KM apart, for some industries that might mean those two sites couldn't be classed by the regulator/auditor as DR anyway because they are too close together....so if that is the case...metrocluster works for them there, as they might be breaking the law by claiming to have DR protection if their datacenters are too close, i.e. could both be wiped out by the same disaster such as a flood or a power blackout to a city.

Hope this information helps you work out what solution fits for your customer.

HDS
One important thing to remember with HDS is that immediately after a real failover it will reverse direction and start replication in a new direction. This can be changed.

Lots of help can be found in http://www.hds.com/assets/pdf/hitachi-storage-replication-adapter-softwarevmware-vcenter-site-recovery-manager-deployment-guide.pdf . Some help can be found in http://www.hds.com/assets/pdf/implementing-vmware-site-recovery-managerwith-hitachi-enterprise-storage-systems.pdf . Originally the HDS SRA didn't set the path to the perl binary. It was supposed to do that properly in the next release, and it does. The SRA also needs the HDS CCI component installed.

SRM will only look for replicated datastores on devices that are presented to ports on the array that are returned by the discoverArrays command. If discoverArrays does not return a WWPN, then SRM assumes that devices on that port are not for use by SRM, even if LUNs on that port are made visible to the ESX hosts. The HDS adapter is not returning the port which presents the snapshot LUN. The reason seems to be some logic in the adapter which determines the ports on the array by looping through the list of source (L or local) replicated and shadow image devices on the target array. However, the shadow image snapshot is actually a remote volume (R) because the L local volume is actually the replicated target, so the adapter is not returning the port of this volume. Because it is not returning the port, SRM ignores it (by design) and the test fails. This problem does not occur when all volumes are on the same port. VMware Engineering is working on this with HDS.

EMC
As always, make sure you have checked the SRM HCL but also confirm that your SRA pre-requisites like SE, FLARE/DART/RP versions are correct.

A video that talks about all four EMC replication technologies for SRM: http://www.emc.com/collateral/demos/microsites/mediaplayer-video/video-walsworthtothepoint-vmsrm.htm
VSI version 4 is out - http://itzikr.wordpress.com/2010/12/20/emc-virtual-storage-integratorvsi4-is-out/

Celerra
This supports simultaneous recovery plan operation. Make sure the plans do not step on each other in terms of LUNs / VMs or their components.
A new SRA (4.0.22) is out and I think it is important - http://itzikr.wordpress.com/2010/12/20/new-celerrasra-and-a-celerra-failback-plug-in-for-vmware-srm/
The Celerra 2.0 beta SRA has a log location of [SRM_InstallDir]\scripts\SAN\celerra\log\sra.log .
Celerra and VMware Techbook - http://www.emc.com/collateral/hardware/technicaldocumentation/h5536-vmware-esx-srvr-using-celerra-stor-sys-wp.pdf
Celerra SRA release notes - http://powerlink.emc.com/km/live1/en_US/Offering_Technical/Technical_Documentation/300-007023.pdf
Celerra Failback plug-in release notes - http://powerlink.emc.com/km/live1/en_US/Offering_Technical/Software_Download/SRMFailbackWizard_read_me_first.pdf
http://www.emc.com/collateral/hardware/technical-documentation/h5536-vmware-esx-srvr-using-celerrastor-sys-wp.pdf
Celerra and NFS with SRM - http://communities.vmware.com/docs/DOC-11541
Plug-in - http://powerlink.emc.com/km/live1/en_US/Offering_Technical/Software_Download/EMC_Celerra_Failback_Plug-in_for_VMware_vCenter_SRM.zip
Celerra VSA, great for learning and testing - http://nickapedia.com/2010/09/12/ubertastic-celerra-ubervsa-v3-unisphere/

Changing the Celerra passwords

Use the following procedure to reset the root password:
1. Access to the console of the Control Station is required, so either connect the console physically or use a serial console.
2. Boot the Control Station, or reboot it, or reset the power switch if shutdown commands cannot be issued.
3. When the BIOS checks complete and GRUB is loading, press any key (an arrow key is best) to stop it from auto booting in 10 seconds.
4. Press e to edit the line it is highlighting ("Linux" would be the normal word).
5. Select/highlight the line starting with the word "kernel" and press e to edit.
6. At the end of the line, append the word "single" with a space in front of it and press ENTER.
7. Now the highlighted line should show the word "single" at the end.
8. Press b to boot from this modified line.
9. The Control Station will now boot to single user mode, with a # prompt appearing, which means it has already logged in as root.
10. Issue the passwd command and enter the new password (with confirmation of the same).
11. "init 6" will reboot, and it should boot automatically as a normal boot. Use the new password set at step 10 for root.
With this root login, reset the nasadmin password if required. You would do this after logging in with the root account.

CLARiiON
Currently the CLARiiON has a limit of 32 characters for CG names. The Solutions Enabler API trims the last two characters from the name, which causes SRM issues. Until the next release of the SE software it is best to avoid this issue by using only 30 characters in the CG name.
When using SnapView, remember that SV must snap to THICK LUNs. On the CX the snap name should have the following prefix: VMWARE_SRM_SNAP .
While the Solutions Enabler (SE) can be installed on the SRM server, a physical host, or a VM, it will sometimes make things easier to have it on the SRM server. This is for both CX and DMX equipment. I have been told that this is a requirement that is not documented anywhere.
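The two naming rules above are easy to check before a test failover. The following is a minimal sketch; the function names and the sample names are hypothetical, and the only facts taken from the text are the 32 character limit, the two trimmed characters and the VMWARE_SRM_SNAP prefix.

```python
MAX_CG_NAME_LEN = 32   # CLARiiON limit quoted above
TRIMMED_CHARS = 2      # characters the SE API is reported to trim
SNAP_PREFIX = "VMWARE_SRM_SNAP"


def cg_name_is_safe(name):
    """True if the CG name survives the two-character trim (30 characters or fewer)."""
    return len(name) <= MAX_CG_NAME_LEN - TRIMMED_CHARS


def snap_name_is_valid(name):
    """True if the CX snapshot name starts with the prefix the SRA looks for."""
    return name.startswith(SNAP_PREFIX)


# Hypothetical names, for illustration only.
for cg in ("SRM_CG_Exchange", "SRM_CG_" + "X" * 25):
    print(cg, len(cg), "ok" if cg_name_is_safe(cg) else "too long, will be trimmed")

for snap in ("VMWARE_SRM_SNAP_LUN12", "LUN12_snap"):
    print(snap, "ok" if snap_name_is_valid(snap) else "missing prefix")
```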

Something that may be useful when the SRM error is "failed to create LUN snapshots": http://blog.virtualtacit.com/home/2009/7/30/clariion-cg-snap-session-limit-smack-down-during-srm-testfa.html .
EMC CLARiiON - http://communities.vmware.com/docs/DOC-11544
http://www.emc.com/collateral/software/solution-overview/h2197-vmware-esx-clariion-stor-syst-ldv.pdf

DMX
When working with DMX and using BCVs, you cannot use TimeFinder snapshots. This is not a limitation of or by VMware but rather an EMC limitation.
While the Solutions Enabler (SE) can be installed on the SRM server, a physical host, or a VM, it will sometimes make things easier to have it on the SRM server. This is for both CX and DMX equipment. Remember that for DMX equipment the SE will need a gatekeeper LUN, and if the SE host is a VM, the gatekeeper LUN will need to be a pRDM. The DMX will need to have its LUNs in a device group.
http://www.emc.com/collateral/hardware/solution-overview/h2529-vmware-esx-svr-w-symmetrix-wpldv.pdf
New version of SRA 2.2.0.3 - http://itzikr.wordpress.com/2010/12/16/new-emc-srdf-sra-for-srm-getthe-scoop-inside-3/ - this is a big release and an important one!
SPC-2 - http://www.yellow-bricks.com/2009/12/08/spc-2-set-or-not/

SRDF
A new tech book on SRDF and SRM is now available in both hard copy and soft copy. It covers version 2.2 of the SRA, including install and configuration, plus how to use the new features, which include:
- Test failover using TimeFinder/Snap off of an SRDF/A R2 (new with 5875)
- Test failover without using TimeFinder technologies, instead running the test failover directly off of the SRDF R2
- How to use the new VSI SRA utilities
- Information in the Appendix on SE licensing
Powerlink (soft copy): http://powerlink.emc.com/km/live1/en_US/Offering_Basics/White_Paper/h7061-srdf-adapter-vcenter-srm.pdf
Vervante (hard copy): http://store.vervante.com/c/v/V4081409244.html?base_cat=EMC%3a%20EMC%20TechBooks&pard=emc

Important Note: the EMC VSI Plug-in version 4 does NOT write the SRDF configuration out to EmcSrdfSraOptions.xml, although it says it did, when the vSphere Client has not been started with Administrator rights (using the right click and start as admin option). This may, or may not, be mentioned in the release notes. It will be mentioned in the future if it is not, and EMC is thinking of other ways to manage this.
http://itzikr.wordpress.com/2011/01/10/srm-automatic-failback-using-emc-symmetrix-vmax/
New version of SRA 2.2.0.3 - http://itzikr.wordpress.com/2010/12/16/new-emc-srdf-sra-for-srm-getthe-scoop-inside-3/ - this is a big release and an important one!

To make your work with SRDF and SRM successful you will need two documents. The first is the SRA release notes, which will be found in PartnerLink. The second is a new SRDF and SRM techbook, which can be found at https://powerlink.emc.com/nsepn/webapps/btg548664833igtcuup4826/km/live1/en_US/Offering_Basics/White_Paper/h7061-srdf-adapter-vcenter-srm.pdf or at http://www.emc.com/collateral/software/technical-documentation/h7061-srdf-adapter-vcenter-srm.pdf
Latest SRDF SRA release notes - http://powerlink.emc.com/km/live1/en_US/Offering_Technical/Technical_Documentation/300-010235_a03.pdf
I have had troubles with both links at different times, and both links have worked for me at times. If you cannot get the document I can send it to you!
A View/SRDF/SRM white paper - http://www.emc.com/collateral/software/white-papers/h6971-businesscontinuity-view-srdf-wp.pdf

What licenses are necessary to successfully use SRDF and SRM? Generally you will require:
- BASE
- SERVER (to allow it to be an API-SERVER)
- SRDFA (to allow it to manipulate SRDF/A RDF groups)
- SRDF (to allow it to manipulate RDF devices)
- TimeFinder (to allow it to use TimeFinder/Mirror)
- TimeFinder-Clone (to allow it to use BCVs for testing)

A useful SRA and SRM document can be found at: http://www.emc.com/collateral/software/whitepapers/h6368-using-emc-srdf-adapter-v2-vmware-srm-wp.pdf - I believe this may have been replaced with the document above.
12/12/09: I have heard but have not confirmed that SRDF will, immediately after a failover, reverse direction and start replication.
10/2/09: there is an updated SRA and Storage plug-in that make things work more easily! Make sure that you use them. EMC just told me the latest code went on Powerlink today. Look under: Home > Support > Product and Diagnostic Tools > Symmetrix Tools > Symmetrix Tools for VMware

When you are performing the failover test, what kind of devices are you working with: sync (SRDF/S) using TimeFinder/Snaps (VDEVs), or async (SRDF/A) using TimeFinder/Clones (aka BCVs)? At the recovery site you need to pair up the R2 devices (replicas) in the datastore groups being tested with appropriate target devices for testing. These target devices are the TimeFinder devices just mentioned: VDEVs if you are sync and BCVs if you are async.

The device pair list (as mentioned in your error) is stored in an xml file in the Program Files\Vmware\Vmware Site Recovery Manager\scripts\SAN\EMC Symmetrix folder. The file is called EmcSrdfSraOptions.xml. In that file you need to specify the R2 devices and their associated VDEV/BCV pairs as part of the <TestFailoverInfo> information inside the device pair list element. Example device pair entry:
<DevicePair>
  <Source>0477</Source>
  <Target>035F</Target>
</DevicePair>
Once you have a device pair for ALL of the R2 devices in your recovery plan, save the xml file and try the test again.

The purpose of the EMC Storage plugin (latest one) is that it now includes an EMC SRDF SRA tab in vCenter that allows you to match up the pairs in vCenter and then save the xml file from that tab, so no manual editing is required. I have included a screenshot below of what this looks like.
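If the pairs are edited by hand, the device pair entries described above can be generated, and missing R2 devices spotted, with a short script. This is a rough sketch only: the element names DevicePair, Source and Target come from the example entry, while the function names and the second device number are hypothetical. It does not write EmcSrdfSraOptions.xml itself, since the surrounding elements of that file are not shown in this guide.

```python
import xml.etree.ElementTree as ET


def device_pair_elements(pairs):
    """Build one <DevicePair> element per (R2 device, VDEV/BCV target) tuple."""
    elements = []
    for source, target in pairs:
        pair = ET.Element("DevicePair")
        ET.SubElement(pair, "Source").text = source
        ET.SubElement(pair, "Target").text = target
        elements.append(pair)
    return elements


def missing_pairs(r2_devices, pairs):
    """Return the R2 devices in the recovery plan that have no pair yet."""
    paired = {source for source, _ in pairs}
    return sorted(set(r2_devices) - paired)


# 0477 / 035F is the pair from the example entry; 0478 is a made-up R2 device.
pairs = [("0477", "035F")]
xml_text = "".join(ET.tostring(e, encoding="unicode") for e in device_pair_elements(pairs))
print(xml_text)
print("R2 devices still needing a pair:", missing_pairs(["0477", "0478"], pairs))
```

The second function is the useful part in practice, because the test only works once every R2 device in the recovery plan has a pair.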
All of this is also covered in the SRDF guide (let me know if you don't have this and I can send it separately).

SRDF adapter version 2.0 does not reference the netcnfg file any longer. Instead you are expected to specify a resolvable host name or IP address in the address field of the Array Manager. You can even add a port to it if you are not using the default of 2707.

SYMAPI_C_NET_Handshake_FAILED error
This usually occurs when there is a security level mismatch between client and server, and sometimes when the Solutions Enabler versions are mismatched. Check the options file in the symapi/config folder and change the sym server security level from the default of ANY to NONSECURE. Confirm both sides.

On a Symmetrix using SRDF/A, you may find a failover that successfully proceeds past the storage configuration but fails when powering on the VMs. This issue may be due to SPC-2. It needs to be enabled on the front-end adapter on the recovery Symmetrix that is exposing the RDF and BCV LUNs to the ESX host; otherwise SRM cannot match the WWN of the LUN returned by the SRA with the WWN of the LUN presented to the ESX host. You can find information on this in our forums, and also in document emc71378 in EMC's Powerlink KB.

This SPC-2 flag can be set either on the FA or on the initiator itself. By making the change on the initiator, you can avoid moving hosts off the FA to make the change there. http://www.yellow-bricks.com/2009/12/08/spc-2-set-or-not/ .

All devices in a consistency group (device group) must be failed over together. This means they can only have one protection group and one recovery plan. If the customer needs more, they will need to create multiple consistency groups.

If you are using the SRDF SRA you will need the following manual step to avoid errors in the log that appear to indicate a path issue, as well as the UI error "Failed to launch SAN integration scripts to execute discoverArrays command". This occurs when you are trying to configure the arrays during the initial SRM setup. The solution is to add the path to the SYMCLI binaries to the Path system environment variable. The default path to SYMCLI is C:\Program Files\EMC\SYMCLI\bin . After adding the path you will need to restart the SRM server service.

For SRM setups, EMC recommends using the Solutions Enabler in a "client-server" fashion because the SRM server typically does not have direct fiber connectivity to the SAN (whereas the ESX host does). To use SE in a client-server fashion, SE needs to be installed on the SRM server (Windows version) as an SE "client", and also on at least 1 ESX host (RH Linux version) as the SE "server". On the SE "client", you edit the netcnfg file to tell SE who the SE "server" is. The edited line contains a "service name" (which can be arbitrary, whatever you want), the hostname of the SE server (in this case, the ESX server), and the IP address of the SE server. The "service name" is the name that should be entered for the SYMCLI_CONNECT environment variable on the SRM server. That is how the SE "client" identifies the SE "server" to direct its SYMCLI commands to. The use of netcnfg means that there is a single point of failure.
Some clients might use the Control Center as the SYMAPI server to avoid this.

Put the path of the SE bin folder into the system variable PATH. By default it is C:\Program Files\EMC\SYMCLI\bin . You will see errors about this if you don't. In addition, you should restart the SRM server after you are finished with the SE install and tweaks. This is when using SRDF/A. The SYMCLI is what is required by the SRA to talk to SRM.

The SRA by default creates a log under \program files\emc\symapi\log with the name symvmwsrm<date>.log.

If you wish to have application consistent VMs after a failover you will need to use Replication Manager to arrange that. I am not sure if it is yet compatible with SRM.

Using EMC SRDF Adapter for VMware Site Recovery Manager - http://www.vmware.com/files/pdf/VMware_SRM_SRDF_bestpractices.pdf

SRDF DM doesn't work with SRM. The EMC adapter seems to be coded to skip any devices in Adaptive Copy state (Data Mobility is the fancy name for Adaptive Copy); as these devices won't be reported to SRM, any VMs on these LUNs cannot be added to a protection group. In addition, SRDF DM copies dirty tracks out of order to the R2 devices, so it likely cannot guarantee a consistent image and is not a good SRM candidate.

SRDF issue (thanks Jason for this sample): Customer claims datastore DMX-25-SRM-Testing-955 is on a replicated LUN; however, SRM does not create a datastore group including this datastore.

Looking at SRM log, I see:


[2009-03-11 13:04:34.962 'SanConfigManager' 13084 verbose] Adding datastore 'DMX-25-SRM-Testing-955' with MoId 'datastore-8023' and VMFS volume UUID '49b66ac2-f8e6e22e-e912-002264f6252c' spanning 1 LUNs

I see that this UUID is on the following extent:


[2009-03-11 13:04:32.431 'SanConfigManager' 13084 trivia] Added vmfs extent 'host-7422;vmhba1:3:14' with key 'host7422;49b66ac2-f8e6e22e-e912-002264f6252c;0'

That is vmhba1:3:14, i.e. LUN 14 on target 3 of hba1 on host-7422. This host sees this LUN's UUID as:
[2009-03-11 13:04:23.493 'SanConfigManager' 13084 trivia] Added LUN '10:00:00:00:C9:7A:42:65;14;50:06:04:82:D5:2E:89:09' with keys 'host-7422;vmhba1:3:14' and 'host7422;02000e00006006048000019010205253303039353553594d4d4554'

The LUN WWN is encoded within the UUID (the last token of this line) as the 32 hex characters that follow the first 10, i.e. 60060480000190102052533030393535. However, discoverLuns returns only 1 replicated LUN, which is not this WWN:
[2009-03-11 14:30:46.094 'PrimarySanProvider' 14848 trivia] 'discoverLuns' returned <?xml version='1.0' standalone='yes'?>
[#2] <Response>
[#2] <LunList arrayId="000190102052">
[#2] <Lun consistencyGroupId="RA::9" id="8F3" wwn="60:06:04:80:00:01:90:10:20:52:53:30:30:38:46:33">
[#2] <Peer>
[#2] <ArrayKey>000187401329</ArrayKey>
[#2] <ReplicaLunKey>738</ReplicaLunKey>
[#2] </Peer>
[#2] </Lun>
[#2] </LunList>
[#2] <ReturnCode>0</ReturnCode>
[#2] </Response>

How could there be only 1 replicated LUN? looking further up in the log the SRDF SRA reports several messages such as:
[#2] 20090311 14:30:45 INFO Skipping SID [000190102052] RDF device [82C] config [#2] [RDF1+R-5] mode [Adaptive Copy] pair state [SyncInProg] [#2] star mode [False] meta type [Member]

So it is skipping several LUNs that presumably are RDF1 but are in the "SyncInProg" state in Adaptive Copy mode. The SRDF adapter only supports SRDF/S or SRDF/A, so they would have to be in "Synchronous" or "Asynchronous" mode. The LUN is being replicated but not in the right mode, the adapter is skipping it, and so SRM cannot map vmhba1:3:14 to a replicated LUN. The solution is for the customer to correct the LUN on which the datastore was created so that it is in synchronous or asynchronous mode, not adaptive copy (which in fact is the mode used when you do the initial full sync from R1 to R2).
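The comparison done by hand in this example can be scripted. The sketch below pulls the WWN out of the LUN UUID seen by the host and checks it against the WWNs that discoverLuns returned. The offsets (skip the first 10 characters, take the next 32) are taken from this one log sample and are an assumption for any other array or SRM version; the function name is hypothetical.

```python
def wwn_from_lun_uuid(lun_uuid, start=10, end=42):
    """Extract the WWN embedded in the LUN UUID and format it as colon-separated bytes."""
    hex_wwn = lun_uuid[start:end].upper()
    return ":".join(hex_wwn[i:i + 2] for i in range(0, len(hex_wwn), 2))


# Values copied from the log excerpts above.
lun_uuid = "02000e00006006048000019010205253303039353553594d4d4554"
replicated_wwns = {"60:06:04:80:00:01:90:10:20:52:53:30:30:38:46:33"}

wwn = wwn_from_lun_uuid(lun_uuid)
is_replicated = wwn in replicated_wwns
print(wwn, "reported by discoverLuns" if is_replicated else "NOT reported by discoverLuns")
```

For this sample the WWN is not in the discoverLuns result, which is the cue to look further up the log for the "Skipping" messages from the SRA.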

MirrorView
If you are using MV with Clariion you will need to use Solutions Enabler (for communications) and Navisphere for the replication management. If you use MV with the Celerra platform you will require neither. Replication Manager is useful for both.
As of 12/12/09 you can only run 1 simultaneous recovery plan. Elsewhere in this document you can see how to experimentally change this.
If you are working on a 64-bit SRM host, you must use the 32-bit Solutions Enabler software. SE can be installed with no configuration or extra bits.
The MV SRA works on ports 80/443, but if they are not used, you will end up using 2162 / 2163.
You can get a "failed to create LUN snapshots" error when working with MirrorView. It is generally a problem in the EMC configuration. You can sometimes avoid it by using the following steps:

1. Create the source volume and mask it to the ESX hosts on the production site.
2. Build a VMFS datastore on the source and add a guest to the datastore.
3. Use the Navisphere MirrorView wizard to create a target volume.
4. Create the MV/S or MV/A relationship.
5. Add the target volume to the MV/A or MV/S relationship.
6. Once synced, create snapshots on both sides.
7. Add the production side snapshot into the storage group for the production site ESX hosts.
8. Add the target volume and its newly created snapshot into the DR side ESX host storage group.
9. Create a consistency group on the production array and add the MirrorView relationship(s) to the consistency group.

Some specific suggestions for MirrorView/S on Clariion would include:
1. Solutions Enabler 6.5.2 or later.
2. SRA 1.3 or later.
3. Consistency Group names must use purely alphanumeric characters, or a real failover will work but a test will not.
4. The snapshot must have VMWARE_SRM_SNAP in the name somewhere. It appears to me that you (or the storage admin) create the snapshot before it is required and then the SRA activates it. This is for test failover only.
5. You can have only 1 recovery plan active when using this SRA. Hopefully this will be improved in future releases of the SRA. There was some disagreement about this, but the release notes say no and I have confirmed this is correct. It will take Navisphere engineering changes to support running more than one RP. But see above for how you could do it now if you need to test it.

With MirrorView you will need to make sure the EMC array scripts are in the same folder structure as the SRM install. This is only relevant if you have installed SRM to a different drive. This will impact a number of applications.
In addition, it has been suggested that all of the Solutions Enabler options need to be installed. When installing Solutions Enabler, accept all the defaults and perform a complete install.
If you have performance issues with the failover you may be using an old version of the Solutions Enabler. In Powerlink article emc203510 you can find the SE Patch Release 6.5.2.20, which reduces the time required for storage preparation by more than 50%.

Replication Manager
Currently Replication Manager (RM) doesn't co-exist with SRM, but in December 2008 it may be supported. It has been said it will make it dramatically easier to set up the array-to-array replication.

RecoverPoint
You should avoid having spaces in a Consistency Group name. It appears that the SRA cannot handle them, and you can see errors in the SRM log about not being able to find CGs. The CG should also be a CRR consistency group for remote replication, and not CDP/local or CLR/local-remote. The CG policies that must be set include reservations support and VMware ESX or VMware ESX Windows as the host.
It seems that RecoverPoint and SRM have issues if the ~ character is in the CG name. Avoid that.
The RecoverPoint SRA uses TCP 7115 to talk to the RPAs. If the MUI cannot talk to the RPA, neither will the SRA.
http://www.emc.com/collateral/software/white-papers/h7261-business-continuity-vsphere-recoverpointwp.pdf
The log location for the SRA is c:\program files\EMC\SYMAPI\log .

In addition, when there were 10 CGs and one VM that had 23 VMDKs attached and spread around those 10 CGs, we were not able to do a failover. We changed it to one CG with the rest the same and it worked. It was the RP SRA 3.1, which is 1.0.2.1.

A customer recently had 19 LUNs in a CG and was failing over unsuccessfully. There were device 0 and device 1 errors for the VM configuration, since the VM configuration files were not being seen in the time that SRM required. A manual refresh on the host brought all the VMs online. This is a clue that indicates the solution: you need to do two HBA refreshes. This has been reported as necessary for HP, and sometimes with big HDS and SRDF environments, but now with a large RecoverPoint CG as well. See "How can I configure a second HBA rescan?" for help on fixing this.

With the number of CGs that are currently supported, and given that the number will grow in the future, it is suggested to think about having one app or business unit per CG. That would be one or more LUNs, as that app or business unit would require. This would provide the greatest flexibility in testing and failover.

It has been reported to me that RecoverPoint will support simultaneous recovery plan operation. You must organize the plans so that they do not impact each other, but it works.

The account that you use in the SRM Array Manager to talk to the RecoverPoint appliance must be configured as admin in the RecoverPoint appliance.

SPC2 issue with RecoverPoint and DMX
If you see the error message below when working in the Array Manager and trying to configure your connection to a RecoverPoint that is using DMX storage, you may have an SPC2 problem.

This error occurred after entering your credentials and selecting Connect. This occurred due to the FA flag on the DMX source storage that was not set for the RPA but was in fact set for the ESX servers. EMC was a very quick help with this issue.

Site Management IP
Using RecoverPoint (RP), which server talks to the RP management server? During the Array Manager configuration a connection is created to the Site Management IP for RP. The question came up for a case where the Site Management IP is in a protected management network and a firewall rule is required to provide access: which is the source server, the VirtualCenter server, the SRM server, or the ESX server? It is the SRM server, which is often located on the VC server, that needs to communicate with the RecoverPoint Site Management IP.

Unable to connect SRM to the RecoverPoint Management Server
The 3.0 RecoverPoint SRA uses the same ports as the RP GUI (1099 and 4401). So if you can open the RP GUI from the SRM server, the SRA should work.
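When the RP GUI is not handy, a plain TCP connection test from the SRM server gives a similar answer for the firewall question. This is a minimal sketch: the function names are hypothetical, the default ports are the two quoted above for the 3.0 SRA, and a successful TCP connect only shows that the firewall path is open, not that the SRA credentials or versions are right.

```python
import socket


def port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_ports(host, ports=(1099, 4401), timeout=3.0):
    """Map each port to True/False depending on whether it accepted a connection."""
    return {port: port_open(host, port, timeout) for port in ports}
```

Run it on the SRM server against the Site Management IP, for example check_ports("192.0.2.10") with your own address substituted for that placeholder. Any port reported as False is a candidate for a missing firewall rule.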

WARNING: UNKNOWN_ERROR
When you see an error that looks like [#1] Fri Mar 20 09:25:05 PDT 2009 WARNING: UNKNOWN_ERROR it can mean that an older SRA is in use. Make sure you are using v1.0.2.1 or later.

Am I using the latest version of the SRA?


If you are in the RP folder on the SRM server you can run:
C:\Program Files\VMware\VMware Site Recovery Manager\SAN\array-typerecoverypoint>..\..\..\external\perl-5.8.8\bin/perl.exe command.pl version

And you should see:


EMC RecoverPoint Adapter for VMware Site Recovery Manager version 1.0 SP2 P1

Which makes v1.0.2.1. The instructions above are not quite right: there is an issue with the path or format. It would be good if someone could test this.

Test Recovery fails with already accessing image error message
If you do a test failover when using RecoverPoint and it fails, and near the bottom of the very large error in the history report you see a message about "already accessing image", you will know that the recovery side (or target) LUN was already set for access before the SRA arranged for it to be set for access, and this generates an error.

DMX and RecoverPoint
This is from a support guy on how an RP issue with a DMX was fixed.
Connecting ESX to the DMX requires that the SPC2 bit be set on the DMX array. This bit was set on the FA ports that the ESX host was connected to, though somehow the HBA WWNs were excluded on the symmask list. Additionally, the RPA connections to those LUNs on the DMX did not have any SPC2 bit setting in place on the FA ports or for the RPA initiators. This caused a mismatch between the LUN UIDs that SRM saw on the ESX host and those seen by the RecoverPoint Appliances. RecoverPoint engineering identified this and the necessary changes were made in the lab.
After the SPC2 bit was set, the HBAs on the ESX hosts had to be reset, as well as the RPAs on both the source and target site. The target site RPAs had to be reset because the target site journals actually resided on the same source site DMX, as part of this specific lab environment (it would not happen like this in an actual implementation). That is why the SPC2 bit setting issue hit both the source and target SRM implementations even though the RecoverPoint target storage was on Clariion (which doesn't require the SPC2 bit).
So the lesson learned is to make sure that the SPC2 bit setting is in place for both the ESX hosts and the RPAs for those LUNs prior to deploying SRM. The SPC2 bit can be set at the FA port level or at the initiator level.

EMC Q & A
Q: What happens if SRM uses fewer LUNs than those contained in an RA group?
A: EMC's newly published adapter populates the consistencyGroupId field of the SRM XML specification by defining all of the R1 LUNs in an RA group as part of the same consistency group, which means SRM will always try to fail over those LUNs as a unit regardless of the presence of VMs on them.

Q: What if the adapter does not have visibility to all of the LUNs in the RA group?
A: EMC recommends the adapter use EMC Control Center (ECC) as the management server for manipulating the RDF devices; ECC should have the ability to manage any device regardless of its visibility to any host, and if not, then managing that group would be impossible for any script (not just an SRA).

Q: What if VMs not part of a recovery plan use the same RA group?
A: A customer that replicates VMs using SRDF but does not recover them as part of a recovery plan is probably wasting bandwidth replicating data that is never consumed, so this is probably not a best practice. EMC codes their adapter to best practices.

Q: What if the RA group contains LUNs not used by ESX hosts?
A: This question can be turned against any DR software; what about the script that tries to fail over the LUNs and is running on the non-ESX hosts (e.g. a Solaris cluster, etc.)? It will impact SRM protected VMs. Using the SRM API there is always a way to integrate failover among disparate clusters.

Q: Why is the test recovery required to be performed on the entire RA group when it is possible to snapshot individual LUNs using BCVs?
A: This is a good question that was posed to the EMC engineering team; my understanding of their answer is that if an RA group is created it represents a consistency set of data that must be tested together, such as a multi-tiered application that requires cross-application consistency.

FalconStor
You should be aware that the FalconStor replication product will not allow a takeover of the replicated data if there are still hosts with live iSCSI connections to the primary volumes. This is designed behavior by FalconStor. In a DR situation this would NOT be an issue since the primary volumes would not exist. But this is an issue for test modes. You can address this by disconnecting the primary ESX hosts from the primary targets prior to failover, or you can script this disconnect as part of the recovery plan (after the primary VM shutdown, but before the secondary storage recovery, you could use a script callout for this). Not correct any longer.
It is reported that only FalconStor has a product that integrates with SRM with this particular requirement. I have not been able to confirm that this behavior exists with both IPStor 5.1 and IPStor 6.0, but at this time I believe it does. 4/7/09 Update: this is not an issue with current versions of the SRA.
The FalconStor virtual appliance is both a gateway and a storage device. FalconStor is better known as a company that provides gateway products rather than storage. This is very useful in the BC / DR space.

FalconStor SRA invalid keycode error


The FalconStor SRA requires a key before it will work. If you have no key you get an invalid keycode error. As of the most current SRA (3/26/10) you will see the keycode is inserted for you.

Error: Non-fatal error information reported during execution of array.


The complete error is "Error: Non-fatal error information reported during execution of array integration script: Failed to create lun snapshots". I have seen this twice. Once was when TimeMark was not enabled on the recovery side, which sort of makes sense when you think of the error message. The other time was when the NSS Appliance had not been fully patched. By installing all patches as of 3/17/09 this error went away.

Does the FalconStor log file have any extra info?


No. However, if you have trouble finding SRA info in the SRM log, it may be easier for you to look at the SRA log. It has the same info, but none of the VMware stuff. It is possible to set different logging levels for the SRA and produce more info in the SRA log than what goes to the SRM log, but that is not the default. The SRA logging level is set in config.ini; for more info see the FalconStor troubleshooting section of the SRA admin guide. For information on changing the SRM log levels see the SRM admin guide.

IBM
IBM DS4000/5000
IBM branded SRA (SMSRAinstaller-WS32-101.01.35.06.exe) on Win2K8 will not install
This LSI SRA will not install on Windows 2008; the installer will quit. You can of course have it run in compatibility mode and it will install fine. Just right click on the installer, select Compatibility, and then run it as Win2K3 (SP1). It will proceed fine.

Old information
The IBM DS4xxx SRA, when installed, has two issues that stop it from working. The first is that the correct path to perl.exe is not set. The second is that there are capitalization errors in two files. Both of these are mentioned in the readme, so make sure to read it.
The path that needs to be added to the environment is c:\Program Files\VMware\VMware Site Recovery Manager\external\perl5.8.8 . You can confirm this by looking for the directory that perl.exe is in. You should also add c:\program files\VMware\VMware Site Recovery Manager\scripts\SAN\IBM to the path.
The two files that you need to edit are command.pl and common.pm, and they are both in the C:\Program Files\VMware\VMware Site Recovery Manager\scripts\SAN\IBM folder. You need to look for the $XML_RETURNCODE = Returncode line and change it to ReturnCode.
Another path issue that has been reported (11/26/09) shows itself with errors when trying to configure the array. This has been reported for both the IBM SVC and DS8K. It is a version difference in the path. To solve this issue you need to change the following file:
C:\Program Files\VMware\VMware Site Recovery Manager\config\vmware-dr.xml

You will need to look for the line under SanProvider and change the ConfigPath variable.
<ConfigPath>=C:\Program Files\VMware\VMware vCenter Site Recovery Manager\scripts\SAN\</

to read:
<ConfigPath>=C:\Program Files\VMware\VMware Site Recovery Manager\scripts\SAN\</

It is important to note that you need to restart the SRM service to make this work. Also, no other vendor SRAs will work once this change is made! And one day the IBM SRA will use the new SRM 4.0 path, and you will then need to return the file to the way it was!
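The ConfigPath change above can be scripted so that a backup is always taken first. This is a minimal sketch in Python, not a VMware or IBM tool: the function names are hypothetical, the two path strings are the ones quoted above, and UTF-8 is assumed for the file encoding. It only swaps the path text and leaves the rest of vmware-dr.xml untouched.

```python
from pathlib import Path
import shutil

# The two SAN script paths quoted in the text above.
# A raw string cannot end in a backslash, so the trailing one is appended.
OLD_SAN_PATH = r"C:\Program Files\VMware\VMware vCenter Site Recovery Manager\scripts\SAN" + "\\"
NEW_SAN_PATH = r"C:\Program Files\VMware\VMware Site Recovery Manager\scripts\SAN" + "\\"


def swap_config_path(xml_text: str, old: str = OLD_SAN_PATH, new: str = NEW_SAN_PATH) -> str:
    """Return xml_text with the SAN scripts path swapped; raise if it is absent."""
    if old not in xml_text:
        raise ValueError("expected ConfigPath value not found; nothing changed")
    return xml_text.replace(old, new)


def patch_file(config_file: Path) -> Path:
    """Copy vmware-dr.xml to vmware-dr.xml.bak, then rewrite it in place."""
    backup = config_file.with_name(config_file.name + ".bak")
    shutil.copy2(config_file, backup)
    # Assumption: the file is UTF-8 text.
    text = config_file.read_text(encoding="utf-8")
    config_file.write_text(swap_config_path(text), encoding="utf-8")
    return backup
```

Calling swap_config_path with the two default arguments reversed returns the file to the SRM 4.0 path later. The SRM service still has to be restarted by hand after either change.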
SRM Reference Guide Page 93 of 166

IBM DS8000
This SRA has a configuration utility that is case sensitive. The manual does talk about the utility, but doesn't mention that it is case sensitive! For example, when you configure a field with p4, it will not work if the array sees that field as P4.

The DS8000 is only able to pause all of the replicated LUNs. This means that if you replicate three LUNs and one of them is not managed by SRM, it will still be impacted during a failover: it will be paused along with the LUNs that SRM manages or uses. This is an IBM issue, and hopefully in 2Q2011 there will be a microcode update for the DS8000 that will fix this. It is not currently known whether an SRA update will be required as well.

IBM XIV
This new hardware is supported by SRM 4.x and information on how it can be made to work is in http://communities.vmware.com/docs/DOC-12372 .

IBM SVC
IBM recently released additional SRA support for other devices, namely SVC. Check the Storage Partners compatibility matrix for the specifically supported hardware. With 1.20.10713 it seems they have fixed most of the issues below, but there are still some things to understand.

If the customer has only one IBM SVC Console to manage both the protected and recovery arrays, you will run into the errors mentioned in http://kb.vmware.com/kb/1013643 , which are easily solved by having two consoles.

The IBM SVC SRA also installs a utility called IBMSVCSRAUtil.exe on the SRM server desktop. You need to understand this utility; the document included with the SRA explains it starting on page 11. Thanks again to Brock for this!

Some things to note about the SVC SRA include:
- My experience is that it had the perl issue mentioned above.
- Consistency Groups are not supported.
- SVC host object names must be 9 characters or less.
- LUNs assigned to ESX must have vmware in the SVC's hostname (see http://kb.vmware.com/kb/1013616 for examples when this is not done).
- The path error mentioned above, about perl.exe not being in the path, is a problem for the SVC SRA.
- If the test recovery fails around 3%, make sure that the flashcopy is deleted. It is likely still there due to a failed test.
- If the test recovery fails around 14%, it is likely due to the configuration and you should check it. In my case, it was that my management server also managed the recovery side.
- I have confirmed that the SVC SRA supports simultaneous recovery plan operation.
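The two naming rules in the notes above are easy to check before the SRA is configured. This is a minimal sketch in Python with a hypothetical function name; it assumes both rules apply to the SVC host object name and that the vmware match is case-insensitive, neither of which the source spells out.

```python
MAX_HOST_NAME_LEN = 9  # limit stated in the notes above


def check_svc_host_name(name: str) -> list[str]:
    """Return a list of rule violations for one SVC host object name."""
    problems = []
    if len(name) > MAX_HOST_NAME_LEN:
        problems.append(
            f"{name!r} is {len(name)} characters; the limit is {MAX_HOST_NAME_LEN}"
        )
    # Assumption: the match is case-insensitive; the source does not say.
    if "vmware" not in name.lower():
        problems.append(f"{name!r} does not contain 'vmware'")
    return problems
```

An empty list means the name passes both checks. Anything else is worth fixing on the SVC before running a test recovery.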

In addition, my first experience with the SVC SRA was with a client that had IBM install the SVC hardware. They followed best practices, but that meant that the SRA didn't work. The problem was that the SVC management host on the protected side, and the one on the recovery side, could each manage either side of the SVC. So when the SRA was installed on the SRM server and pointed at the protected side SVC management host, it was confused because it saw both sides. The workaround is to put a management agent on the SRM server (on each side) and make sure it can only see / manage the one side it is assigned to! This was using the 1.0.1 SRA from IBM as well as the unreleased next build of it.


Dell EqualLogic
Dell firmware 4 requires SRA version 1.01. Our site currently has version 1.0. If you use 1.0 with EqualLogic firmware 4, a failover will fail with a missing share LUN flag on the disk array management. You can find help installing the SRA at the URL below. It is for an old version of SRM but it is still helpful.
http://www.equallogic.com/resourcecenter/assetview.aspx?id=5261

Compellent
Currently there is a small issue with installing the SRA on Win2K8. It is due to registry security issues. Make sure the key HKEY_LOCAL_MACHINE\SOFTWARE\VMware, Inc.\VMware Site Recovery Manager exists and that its InstallPath value contains the path to the SRM installation folder, without quotes. The regedit type file would look like:
Windows Registry Editor Version 5.00

[HKEY_LOCAL_MACHINE\SOFTWARE\VMware, Inc.\VMware Site Recovery Manager]
"InstallPath"="C:\\Program Files (x86)\\VMware\\VMware vCenter Site Recovery Manager"

Once you have corrected the registry, install the SRA again and restart the service again, and it should work fine.
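The doubled backslashes are the part of a regedit file that is easiest to get wrong by hand. This is a minimal sketch in Python that builds the file text from an ordinary Windows path. The function name is hypothetical, and writing the result to disk (including the choice of encoding) is left to the reader.

```python
def build_install_path_reg(install_path: str) -> str:
    """Return regedit file text that sets InstallPath for SRM."""
    key = r"HKEY_LOCAL_MACHINE\SOFTWARE\VMware, Inc.\VMware Site Recovery Manager"
    # Backslashes inside a quoted .reg value must be doubled.
    escaped = install_path.replace("\\", "\\\\")
    return (
        "Windows Registry Editor Version 5.00\r\n"
        "\r\n"
        f"[{key}]\r\n"
        f'"InstallPath"="{escaped}"\r\n'
    )


if __name__ == "__main__":
    print(build_install_path_reg(
        r"C:\Program Files (x86)\VMware\VMware vCenter Site Recovery Manager"
    ))
```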

HP
EVA
After a failover the replication direction is reversed and replication is started. I had heard this, and have since confirmed it. This can be a big surprise, and an issue if there is low bandwidth.

There is an excellent best practices guide for working with HP and SRM at:
http://h20195.www2.hp.com/V2/GetDocument.aspx?docname=4AA2-8848ENW&cc=us&lc=en

An online guide that helps with the EVA can be found at:
http://h20000.www2.hp.com/bizsupport/TechSupport/Document.jsp?lang=en&cc=us&taskId=120&prodSeriesId=499896&prodTypeId=18964&objectID=c01493772

When working with an EVA in a PofC I found the above guide very useful. It did neglect to mention several things:
- You need to select, or deselect, the arrays in the array configuration depending on which side you are working with.
- You need the management IP addresses, account name and password fields filled in for both sides! This means both values (protected site and recovery site) are entered in each field, separated by a semicolon.
- I could not find anywhere that mentioned that the mode of access should be set to none.
- When you start a failover (not a test) the replication will automatically be reversed. Watch out for this, as a slow connection can cause issues!
- You need to set the rescan to two times; find out how in How can I configure a second HBA rescan?


As well here is some background information - http://www.yellow-bricks.com/2009/01/20/sradiscoverluns/ .

Background Information
The info below may not be necessary for the latest SRA, but it might be useful for background information. The version / firmware information below is what works with the HP StorageWorks EVA Virtualization Adapter version 1.

HW Models   Firmware              CommandView
EVA4000     XCS 6.1xx or 6.2xx    v 8.0.1
EVA4100     XCS 6.1xx or 6.2xx    v 8.0.1
EVA4400     XCS 9.xxx             v 8.0.1
EVA6000     XCS 6.1xx or 6.2xx    v 8.0.1
EVA6100     XCS 6.1xx or 6.2xx    v 8.0.1
EVA8000     XCS 6.1xx or 6.2xx    v 8.0.1
EVA8100     XCS 6.1xx or 6.2xx    v 8.0.1

Miscellaneous information
- Replicated vdisks on the EVA MUST be zoned to both sites (but must be created on the primary Command View server).
- Replicated vdisks must be in a DR group, and ALL vdisks in that DR group must be used by the primary / recovery site hosts. If vdisks in the DR group are used by other hosts, the SRA discards them.
- If using two Command View EVA management nodes, both IP addresses must be entered in the SRA config wizard: primary first, then secondary, separated by a ;.
- The recovery site EVA must have enough space to contain the snapshot volumes.
- It has been seen once that the SRM service needed a domain account to talk to the Command View EVA. I have not been able to confirm this myself (no HP gear in my lab other than LHN), but this might change depending on whether the Command View EVA is local or not. I have confirmed in a VMware QA lab that this change was not required. This was seen due to an error 4 in the SRM logs: when discoverluns runs as part of the setup through the SRM plug-in, it fails with error 4.
- When using HP Command View with HP EVA, the HP best practice is to run CV servers active / active, where the CV server in the protected site manages its EVA actively and the remote EVA passively, and similarly the CV server at the recovery site manages its EVA actively and the protected site EVA passively. Then if either CV server fails, the other will take over with no manual intervention required.

In some combinations of storage hardware and FC drivers, the driver does not deliver information about new LUNs to the ESX kernel in a timely fashion, so on a rescan the ESX kernel doesn't learn about the LUNs. A second rescan is necessary in order to deliver info about the new LUN to the ESX kernel. You can configure SRM 1.0.1 Update 1 to do a second scan if necessary.

A workaround for the EVA is available right now. Use the following steps to have a successful test. After setting up the replication pair between two EVA arrays, at the secondary array take a manual snapshot of each of the target Vdisks and present those snapshots to the ESX host using the default LUN number that the EVA management system picks (usually the lowest available LUN number). Then do a rescan twice on the ESX host so it discovers those LUN numbers for the first time. Then destroy the snapshot. Now if you do a test recovery, the SRA for EVA will create a snapshot of each of the replicated LUNs and present those snapshots to the ESX host using the default LUN number. Since those numbers will already have been discovered by ESX because of the manual steps above, only a single scan will be required, so the test will succeed.

Real solutions are not far off for this problem. It has been confirmed that some HP arrays will need a second rescan to make the recovered LUNs visible. You can find information on how to manage that elsewhere in this document.

HP Storage Virtualization EVA Adapter configuration information - http://h20000.www2.hp.com/bizsupport/TechSupport/Document.jsp?lang=en&cc=us&taskId=120&prodSeriesId=499896&prodTypeId=18964&objectID=c01493772

SRM has to be set up with HP Command View (CV) EVA so that the DR site always hosts the HP EVA failover primary (i.e. manages it actively). This should make sense: in a failover you would lose the primary site, and you would not want to have to make CV active manually.

One area that seems to cause HP EVA customers a lot of problems is working out which CV setup/config they have in place, and then working out how this should be entered into the GUI in SRM.

Example: let's assume you have two Command View (CV) EVA servers, one in datacenter A (DC A) and one in datacenter B (DC B), and in each datacenter you have an EVA: EVA-A in DC A and EVA-B in DC B. Usually HP will recommend that the CV servers are configured active/passive, with each CV server actively managing the EVA in its local DC and passively managing the EVA at the opposite DC. So in this example we would have:
CV-DC A manages EVA1 actively, EVA2 passively
CV-DC B manages EVA2 actively, EVA1 passively

When entering the information in the "Configure Array" wizard you can include the IP addresses for both CV servers on the same line, separated with a ";". This also assumes that both CV servers can be accessed using a single login username/password. When you hit the "Connect" button, two storage arrays will appear. When entering the protected side info, simply check the box for the local EVA at the protected site, and when you get to the recovery side screen select the other EVA.

Some customers have an alternate configuration, which is NOT HP best practice:
CV-DC A manages EVA1 actively and EVA2 actively
CV-DC B manages EVA2 and EVA1 passively

The HP SRA cannot associate vdisks with DR groups when connected to a passive Command View host, and I think this configuration has caused some customers issues during the setup stage. I believe this second config works, but you need to be careful in the config array wizard as we cannot currently force the passive CV to become active.

Below are some other checks that you may want to look at. Verify that the vdisks are correctly presented to the ESX hosts at both sites. I have seen issues where customers don't have the access method set correctly for the vdisks. The HP documentation seems to make customers believe that the replica vdisks at the recovery site need to be accessible to the recovery site ESX hosts at all times, i.e. read/write. All that is actually required (as with other replicated array configs) is that the replicated LUNs, at the LUN device level, are in the same zone as the ESX hosts at the recovery site (i.e. within VC they will appear on rescan in the storage adapter screen, but not in the storage/VMFS datastores screen by default).

Other things we have seen include:
- Replicated vdisks on the EVA MUST be zoned to both sites (and must be created on the primary Command View server). Check that the customer has done this.
- Replicated vdisks must be in a DR group, and ALL vdisks in that DR group must be used by the primary/recovery site ESX hosts. If vdisks in the DR group are used by other hosts, the SRA discards them. So again, the customer needs to verify this.
- If using two Command View EVA management nodes (as described above), both IP addresses must be entered in the SRA config wizard: primary Command View EVA first, then secondary Command View EVA, separated by a ;
- The recovery site Command View EVA should be defined as the site that is the failover primary.
- The customer must ensure the recovery site EVA has enough space to contain the snapshot volumes.
- The SRA produces a log (hpsrmeva.log), which is a good place to look for other error messages.
- We have seen cases where the issue is a misconfiguration of the SRA / Array Manager. During the setup, because of the way Command View works, you are presented with both EVAs in the Protected Arrays and Recovery Arrays screens. You need to uncheck the relevant EVA at each screen. Failure to do so can cause issues when you run test plans.
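The semicolon-separated Command View entry is simple, but the order matters (primary first, then secondary). This is a minimal sketch in Python of building that field value. The function name and the addresses in the usage line are hypothetical.

```python
def build_cv_address_field(primary_cv: str, secondary_cv: str) -> str:
    """Join two Command View addresses as 'primary;secondary'."""
    cleaned = []
    for address in (primary_cv, secondary_cv):
        address = address.strip()
        # Reject empty values and stray separators before joining.
        if not address or ";" in address:
            raise ValueError(f"invalid Command View address: {address!r}")
        cleaned.append(address)
    return ";".join(cleaned)


if __name__ == "__main__":
    # Hypothetical lab addresses.
    print(build_cv_address_field("10.0.0.10", "10.0.1.10"))
```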

Miscellaneous Information URLs


SRM 4.1 Release notes - http://www.vmware.com/support/srm/srm_releasenotes_4_1.html
SRM 4.1 upgrade blog - http://blogs.vmware.com/uptime/2010/07/upgrading-to-srm-41-includingupgrading-to-vsphere-virtualcenter-41.html
SRM 4.0 Release notes - http://www.vmware.com/support/srm/srm_releasenotes_4_0.html
SRM 4.0 Upgrade KB article - http://kb.vmware.com/kb/1013166
SRM 1.0 Release Notes - http://www.vmware.com/support/srm/srm_10_releasenotes.html
SRM 1.0 Installation and Configuration guide - http://www.vmware.com/pdf/srm_10_admin.pdf
SRM install and configure video - http://mylearn.vmware.com/register.cfm?course=22279
List of SRM links - http://tendam.wordpress.com/srm-links/
VMware SRM in a NetApp environment - http://media.netapp.com/documents/tr-3671.pdf
SRM 4.0 performance whitepaper - http://www.vmware.com/resources/techresources/10076
Dell EqualLogic guide - http://www.equallogic.com/uploadedFiles/Resources/Tech_Reports/TR1039-Dell-EqualLogic-PS-Series-SAN-and-VMware-SRM.pdf
Using EMC SRDF Adapter for VMware Site Recovery Manager - http://www.vmware.com/files/pdf/VMware_SRM_SRDF_bestpractices.pdf
VMware vCenter SRM in a NetApp Environment - http://media.netapp.com/documents/tr-3671.pdf
VMware Uptime blog (VMware and Business Continuity) - http://blogs.vmware.com/uptime
Availability Zone of VI:OPS - http://viops.vmware.com/home/community/availability - this includes a lot of lab setup info for various storage arrays.
LeftHand Networks SRA Failback Procedure for SRM - http://www.lefthandnetworks.com/document.aspx?oid=a0e0000000000NxAAI
SRM in a can, EMC with automated failback info - http://virtualgeek.typepad.com/virtual_geek/2009/07/updated-site-recovery-manager-in-a-can-doc-nowwith-extra-emc-automated-failback--.html
HP Storage Virtualization EVA Adapter configuration information - http://h20000.www2.hp.com/bizsupport/TechSupport/Document.jsp?lang=en&cc=us&taskId=120&prodSeriesId=499896&prodTypeId=18964&objectID=c01493772

Syntax highlight module info


Text Wrangler language module
Use the info below to create a text file called log.plist.
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple Computer//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
    <key>BBEditDocumentType</key>
    <string>CodelessLanguageModule</string>
    <key>BBLMLanguageCode</key>
    <string>MWTR</string>
    <key>BBLMLanguageDisplayName</key>
    <string>Log</string>
    <key>BBLMSuffixMap</key>
    <array>
        <dict>
            <key>BBLMLanguageSuffix</key>
            <string>.log</string>
        </dict>
    </array>
    <key>BBLMColorsSyntax</key>
    <true/>
    <key>BBLMIsCaseSensitive</key>
    <false/>
    <key>BBLMKeywordList</key>
    <array>
        <string>authorization</string>
        <string>credentials</string>
        <string>authentication</string>
    </array>
    <key>Language Features</key>
    <dict>
        <key>Identifier and Keyword Characters</key>
        <string>ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz</string>
        <key>String Pattern</key>
        <string>(error|warning|critical)</string>
    </dict>
</dict>
</plist>

EditPlus
Use the information below to create a text file called log.stx.
#TITLE=XML
; XML syntax file written by ES-Computing.
; This file is required for EditPlus to run correctly.
#DELIMITER=[]:()
#CASE=y
#KEYWORD=Error
error
ERROR
#KEYWORD=Warning
warning
WARNING
#KEYWORD=Verbose
verbose
info
trivia
#
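Where neither editor is at hand, the same keywords can be pulled out of a log with a short script. This is a minimal sketch in Python that reuses the (error|warning|critical) pattern from the Text Wrangler module and matches case-insensitively, as that module does. The function name and the sample log lines are made up for illustration.

```python
import re

# Same alternation as the Text Wrangler "String Pattern" entry.
LOG_PATTERN = re.compile(r"(error|warning|critical)", re.IGNORECASE)


def flag_lines(lines):
    """Return (line number, keyword, text) for every line that matches."""
    hits = []
    for number, line in enumerate(lines, start=1):
        match = LOG_PATTERN.search(line)
        if match:
            hits.append((number, match.group(1).lower(), line.rstrip("\n")))
    return hits


if __name__ == "__main__":
    # Hypothetical log lines, not real SRM output.
    sample = [
        "[info] SRM service started\n",
        "[warning] Rescan took longer than expected\n",
        "[Error] discoverLuns failed\n",
    ]
    for number, keyword, text in flag_lines(sample):
        print(f"{number}: {keyword}: {text}")
```

To scan a real log, pass an open file object in place of the sample list.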

Lab Exercises
Below are a number of lab exercises from the Partner Exchange 2010 SRM bootcamp lab. They may be useful to people new to SRM.

Lab 1 Installing SRM


VMware vCenter Site Recovery Manager (SRM) Installation Lab Station 01

Create the SRM database instance
1. Using SQL Server Management Studio Express, create a new database instance.
2. Provide a name for the database instance.
3. Close SQL Server Management Studio Express.

Create the ODBC data source connection for SRM
4. Start > All Programs > Administrative Tools > Data Sources (ODBC)
5. Select the System DSN tab.
6. Click the Add button to open the Create New Data Source window.
7. Scroll down the list and select SQL Native Client.
8. Provide a Name and Description (optional) for the Data Source connection.
9. Using the Server drop-down menu, select the local server\database. The number in the server name must match the station you are sitting at. Example: Station 16 would select STU16-VC-A\SQLEXP_VIM when installing in the Protected Site, STU16-VC-B\SQLEXP_VIM when installing in the Recovery Site.
10. For authentication, keep the default settings (which should be With Integrated Windows authentication).
11. Change the default database to the SRM database instance name.
12. On the last window, keep the default settings and click Finish.
13. Click the Test Data Source button, which should result in a test completed successfully message.
14. Click the OK button to close the ODBC Data Source Administrator window. This step completes the setup of the SRM database and the ODBC data source configuration.
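The server name chosen in step 9 depends only on the station number and the site. This is a minimal sketch in Python of that naming rule, with a hypothetical function name. It assumes the protected-site name is hyphenated the same way as the STU16-VC-B recovery-site example, and that single-digit stations are zero-padded; the lab text confirms neither.

```python
def lab_sql_instance(station: int, site: str) -> str:
    """Return the lab server\\instance name for a station and site."""
    # Assumption: A = protected site, B = recovery site, as in the step 9 example.
    suffix = {"protected": "A", "recovery": "B"}[site]
    # Assumption: station numbers are zero-padded to two digits.
    return f"STU{station:02d}-VC-{suffix}\\SQLEXP_VIM"


if __name__ == "__main__":
    print(lab_sql_instance(16, "protected"))
    print(lab_sql_instance(16, "recovery"))
```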

SRM Installation
15. Locate the SRM installation files in c:\files\SRM
16. Click the VMware vCenter Site Recovery Manager link.
17. Click the Next button.
18. Accept the license agreement and click the Next button.
19. Leave the default settings for the Destination Folder.
20. Enter the vCenter Server address and credentials. The number in the server address must correspond with the station you are sitting at. Example: Station 16 would select STU16-VC-A\SQLEXP_VIM when installing in the Protected Site, STU16-VC-B\SQLEXP_VIM when installing in the Recovery Site.
21. If you receive a security warning, click Yes to proceed.
22. Keep the default settings for the Certificate Source.
23. For the Organization and Organization Unit, enter VMware.
24. Provide a Local Site name and Administrator E-mail address.
25. In the Database Configuration window, use the Data Source Name drop-down menu to select the ODBC DSN created earlier. Enter the database user credentials.
26. Click the Install button to initiate and complete the SRM installation.

SRM Plugin installation
27. Going back to the original VMware vCenter Site Recovery Manager Installer window, select the VMware vCenter Site Recovery Manager Plugin link to start the SRM plugin installation.
28. Click the Next button.
29. Accept the license agreement and click the Next button.
30. Click the Install button.
31. Click the Finish button.
32. You can verify the SRM plugin installation by opening the vSphere Client and clicking on the Plug-ins menu item.
33. You can also click on Home in the menu bar and look for the Site Recovery button under Solutions and Applications.

Storage Replication Adapter (SRA) installation
34. Navigate to c:\files\SRA.
35. Double-click the FalconStorSRA executable to begin the SRA installation. Click the Next button to begin the installation.
36. Accept the license agreement and click the Next button.
37. Enter the customer information and click the Next button.
38. You will be prompted for a keycode. This can be found in the text file located in c:\files\SRA. Copy and paste this keycode into the Keycode - License window. Click the Next button.
39. Click the Install button.
40. Click the Finish button.
41. It is important to review the readme file included with SRAs. These often include information about features, known issues, etc. that is important to know when implementing SRM.
42. This completes the installation of the SRA.

Lab 2 Configuring SRM


Start your vSphere client, and select the SRM application. Your not-yet-configured SRM should look like this on the protected side:

First, connect the protected side to the recovery side (click Configure). Add the recovery site's VC (stuXX-vc-b; the short DNS name is sufficient):

Click Next and accept the certificate on the error page (this is self-signed, so you should work with a valid one in your production environment):

Authenticate with administrative credentials:

Accept the certificate again:

Wait for it to go through the connection procedure:

Enter administrative credentials for the remote VC and click OK:

Click Finish and you're through.

Now go configure the Array Managers. Click Add:

Select FalconStor NSS Series from the Manager Type pull-down menu:

Fill out the rest of the information and click Connect:

The array(s) should show up in the lower part of the window:

Click OK and your Protected Site Array Managers should look like this:

Click Next and repeat the process for Recovery Site Array Managers. It should look like this when completed:

Click Next and confirm the datastores:

Click Finish.

Click Configure on Inventory Mappings to pull up:

Select VM Network and click Configure

Select VM Network and click OK. Configure the Protected Apps Resource Pool to map to Recovery Apps:

Map the Protected Apps Virtual Machine Folders to Recovery Apps:

Click on Site Recovery to get back to the SRM home screen:

Go to Protection Groups and create a Protection Group called Production Group:

Click Next and select the protected datastore. Note the VMs show up at the bottom:

Click Next and select the datastore labeled esxXXb-shadow for your placeholder VMs:

Click Finish.

To set up the Recovery Plan, you need to move over to the recovery side vCenter. If you are using vCenter Linked Mode like we are in the lab, it is easy to do. Just pull down and select the appropriate vCenter from the drop-down Site Recovery list in the breadcrumb trail:

You will get a similar window, but notice the Protection Setup section is almost empty. That's OK. We're not protecting anything here.

Go to Recovery Plans near the bottom and click Create. Name the Recovery Plan:

Click Next and select Production Group to protect.

Click Next. Accept the defaults for Response Times:

Click Next. Map VM Network to the Test Network:

Click Next. There are no VMs that we are suspending, but feel free to look:

Select Production Recovery Group on the left:

Click on the Test button:

Confirm:

Move over to the Recovery Steps tab to watch the progress:

At around 54% on the progress bar, the default message will show:

Click Continue. Notice that after it finishes step 11, all info disappears! Not to worry, SRM has saved it in a History Report. Click on the History tab, select your recovery plan (RP) and click View:

That will pull up an HTML report detailing the entire Test process.

Lab 3 IP Customization
You have a healthy SRM implementation and are protecting hundreds of virtual machines. You now want to ensure your VMs are connected to the right network on the recovery side for testing and for real failovers. You have created a set of VLANs at the recovery facility that will be used during a test but since the IP information in the recovery facility is different than what you have in production you need a way to change the IP information for each VM during a test or actual failover. When you have only a few virtual machines in your recovery plan, it is relatively easy to create a customization specification to change the IP information and connect it to the individual virtual machines. However, when you have 50, 80 or hundreds of virtual machines it becomes much more time consuming to create a custom specification for each one. The bulk IP tool we use here is designed to make it easier to create custom specifications. This lab will help you understand this tool and how to use it.

Helpful Starters
The general idea when working with the Bulk IP Load tool is to create a CSV file that contains vital information about the virtual machines that are part of the recovery plan. When first created, the CSV file contains a list of virtual machine names, with several IP related fields next to each name that may be modified to fit your needs. You use this template as a starting place to provide the proper IP information that will be required by the virtual machines at the recovery site. When complete, you import the CSV back into SRM. This process creates a series of customization specifications which will be read during the boot process for the affected virtual machines on the recovery side.

Procedure Hints
1. Run the dr-ip-customizer executable. You will find this utility in the VMware Site Recovery Manager program files folder under the bin directory. Think: You are working to customize a recovery plan, which side should you be working in?
2. You will need to execute the utility from a DOS prompt so that you may provide it with the correct parameters. To tell the utility to pull information from SRM, your command should look as follows:
   a. dr-ip-customizer.exe --cfg ..\config\vmware-dr.xml --csv c:\down.csv --cmd generate
3. You will be prompted to trust a server twice, and you will need to authenticate as well.
4. Did the command succeed? Tip: If the command worked, you should have a file named down.csv in the root of the C:\ drive.
5. You can now use Microsoft Excel to open the CSV file. NOTE: Excel is not installed on any of these machines. Instead, Excel.exe is wrapped in a VMware ThinApp package and simply runs as a self-contained exe. Check out VMware ThinApp when you have a chance!
6. Please see below for an example of a clean dr-ip-customizer export.
7. Now you will introduce some changes to the file so the referenced virtual machines will be associated with new customization specifications. Follow these steps closely when modifying the file:
   a. The 2nd column contains the name of your virtual machine(s). For simplicity's sake, we will be modifying only 1 virtual machine and we will use the one at the bottom of your list.
   b. Click the row header to highlight the last row, and copy it to the clipboard.
   c. In the blank row directly below your last row, paste the contents of the clipboard. Tip: Before you paste, click on the first cell in that new row (in the A column).
   d. In your new row, change the value in the Adapter ID column from 0 to 1. Tip: The values for Adapter ID can range from 0 to 4. 0 means global, and 1-4 refer to specific adapters. Since we have only 1 adapter, its ID is 1.
   e. In the DNS Domain column, type vmworldtest.com.
   f. In the IP Address column, type dhcp.
   g. Save the CSV file.
8. Import the file back into SRM by executing the following command:
   a. dr-ip-customizer.exe --cfg ..\config\vmware-dr.xml --csv c:\down.csv --cmd create
9. You will be prompted to trust twice again and authenticate.
10. Verify your import was successful. Think! Dr-ip-customizer creates customization specifications. Where in vCenter can you view Customization Specifications? Tip: Go to the Home screen.
11. To complete this lab, you can run a test recovery and when you get to the yellow pause message (in the Recovery Steps tab), go to the virtual machine which was associated with the customization specification and check out its IP information. The DNS domain should now be vmworldtest.com and it should be configured to use DHCP.
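With 50 or more virtual machines the spreadsheet edits in step 7 are worth scripting. This is a minimal sketch in Python that performs steps 7b to 7f on the exported CSV text. The function name is hypothetical, and the header row in the sample is a simplified assumption: only the Adapter ID, DNS Domain and IP Address column names come from the lab text, and a real export has more columns.

```python
import csv
import io


def add_adapter_row(csv_text: str, dns_domain: str = "vmworldtest.com",
                    ip_address: str = "dhcp") -> str:
    """Copy the last VM row, point it at adapter 1, and set DNS domain and IP."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(reader)
    if not rows:
        raise ValueError("export contains no virtual machine rows")
    new_row = dict(rows[-1])             # steps 7b and 7c: copy the last row
    new_row["Adapter ID"] = "1"          # step 7d
    new_row["DNS Domain"] = dns_domain   # step 7e
    new_row["IP Address"] = ip_address   # step 7f
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows + [new_row])
    return out.getvalue()


if __name__ == "__main__":
    # Hypothetical, simplified export; a real down.csv has more columns.
    sample = (
        "VM ID,VM Name,Adapter ID,DNS Domain,IP Address\n"
        "vm-1,web01,0,,\n"
        "vm-2,db01,0,,\n"
    )
    print(add_adapter_row(sample))
```

The result still has to be saved and imported with the create command from step 8. The sketch only prepares the file contents.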

Reference Materials
Sample 1 Bulk IP Load Screenshot

Conclusion
In Lab 3, you used the bulk IP customization utility to alter the IP information for a virtual machine in a recovery plan. This utility created a customization specification and associated it with the virtual machine. When the virtual machine is powered on for a test in the recovery facility, it automatically obtains custom IP attributes by taking direction from the customization specification. This is a very valuable tool, especially for very large SRM environments where IP information may need to be altered for a large number of virtual machines, or even all of them. Dr-ip-customizer saves administrative time and minimizes errors.

Lab 4 Script Intro


Scenario
In this lab, we are going to investigate the use of scripts during a recovery effort. Scripting is quite powerful, and the ability to make callouts during a test or full recovery will serve to even further streamline the recovery effort. For this lab, we will make a simple call to a pre-written script. This script is a control script and thus calls another script, which simply records the date and time of a virtual machine power on operation during a test recovery.

Helpful Starters
1. With scripting, syntax is important. To that end, remember to always use full path references and be sure of your spelling and punctuation.
2. Consider whether the script should be executed before the VM powers on, or after it boots.
3. You are applying extended attributes to a virtual machine that is being protected by SRM so that when it is recovered, the script will execute. Think: Which side of SRM contains a list of the actual virtual machines that are being protected?
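The first starter (always use full path references) can be checked mechanically before a command is entered. The following is a minimal Python sketch under stated assumptions: the helper name relative_paths and the list of extensions are hypothetical, and the check only splits on whitespace, so it does not handle quoted paths that contain spaces.

```python
import ntpath

# Assumption: tokens with these extensions are treated as scripts or programs.
SCRIPT_EXTENSIONS = {".cmd", ".bat", ".exe", ".ps1"}


def relative_paths(command):
    """Return the path-like tokens in a command line that are not full paths."""
    flagged = []
    for token in command.split():
        extension = ntpath.splitext(token)[1].lower()
        looks_like_path = "\\" in token or extension in SCRIPT_EXTENSIONS
        # ntpath applies Windows path rules regardless of the host OS.
        if looks_like_path and not ntpath.isabs(token):
            flagged.append(token)
    return flagged
```

An empty result means every script or program reference in the command is a full path; any returned token is a candidate for correction.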

Procedure Tips
1. Be sure there are two scripts located on your SRM server. There should be scripts named call.cmd and test.cmd. Both should be located in c:\scripts. If you do not see these scripts, please let a lab proctor know.
2. On the Protected side, highlight your protection group in the left pane, and click on the Virtual Machines tab in the right pane.
3. For each virtual machine in the list, click Configure Protection.
4. Click Next through to the very last section, Post Power On.
5. Click Add Command to insert a call out to your control script. Type the following into the Add Command dialog box:
   a. C:\windows\system32\cmd.exe /C c:\scripts\call.cmd
6. Be sure to repeat steps 4 and 5 for each virtual machine in the protection group.
7. Click Finish to store the configuration. Your virtual machines are now configured to run a script post power on.
8. Flip to the recovery side and run a test failover.
9. The scripts are configured to record the date and time stamp (for lack of anything more interesting) to a log file located in the c:\scripts directory of the SRM server on the recovery side.
10. Once the yellow banner appears in the Recovery Steps tab, open the c:\scripts\test.log file. You should see date and timestamp entries.
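The contents of call.cmd and test.cmd are not shown in this lab, so the following is only a Python stand-in for the pattern the steps describe: a control script that delegates to a second script, which appends a date and time stamp to a log file. The function names and the timestamp format are assumptions made for illustration.

```python
import datetime


def record_power_on(log_path, now=None):
    """Stand-in for test.cmd: append one date/time stamp line to the log."""
    when = now if now is not None else datetime.datetime.now()
    stamp = when.strftime("%Y-%m-%d %H:%M:%S")
    # Append mode keeps earlier entries, so each run adds one line.
    with open(log_path, "a", encoding="utf-8") as log:
        log.write(stamp + "\n")
    return stamp


def control(log_path):
    """Stand-in for call.cmd: the control script only calls the worker."""
    return record_power_on(log_path)
```

Keeping the control script separate from the worker means the command entered in SRM never changes, while the work it triggers can be swapped out on the SRM server.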

Reference Materials
Things to Remember about Scripts

Conclusion
In Lab 4, you learned how to inject scripts into your SRM environment. Scripts are useful for a variety of reasons, from simple diagnostics and logging to more complex integration with the DNS environment, load balancers, and other services that may require updating on the recovery side.

What's New: additions, deletions, or changes


2/26/11
Added info on what a PowerShell script should look like to be executed by SRM; see it on page 34.
I increased the complexity of the sample demo script on page 33.
I added a section on upgrading to 4.1.1 on page 11.
I added a reference to my suggested alarms blog on page 48, as well as updating some of the recommended alarms.
I added more to the best practices (#15, 16, 17, and 18) on page 27. Also added #19 about using SQL accounts.
Added various URLs (evaluation guide, storage, and release notes).
Added the Win2K8 R2 install log location on page 39.
I added some info on using the dr-ip-customizer tool on page 41.
Added some info on protecting View desktops on page 30.
Added some blog links on scripting on page 33.
Added an important new best practice (#20) on page 27.
Added additional resources for working with certificates on page 45.
Added a section about not being able to failback to a lost protected site on page 32.
Added a MirrorView issue and solution on page 72.
Added another couple of best practices around multi extents and application discovery and mapping software on page 27.
Added a reference to the new SRDF techbook in the EMC \ SRDF section.
Added info about a NetApp issue on page 75.
Added some miscellaneous information to the install account section as a result of comments from Brock (thanks BTW). See page 9 for more info.
Added a simpler solution and a KB about it to issues installing / uninstalling on Win2K8 (thanks again, Brock); see page 11 for more info.
Updated some IBM SVC SRA and issue info (thanks to Brock for sharing). See this info on page 94.
IBM SVC and Java issue added on page 74.

12/31/10
Added some info in best practices to improve the section (points 10 through 13, page 27).
Corrected spelling / grammar issues.
Added a URL to a blog about LHN lessons learned in LHN (page 74).
Added a section on Shared Recovery (page 31).
Added some new release info (on SRDF ARA, VSI4, and Celerra SRA) in EMC (page 82).
Added a section on failback plug-ins on page 31.
Added additional information on NetApp MetroCluster from Lee D on page 77.
Added a section on what rights are required to execute recovery plans when you don't want your user to be a VC admin user (page 51).
Added info on an error (configuring protection group timeout error) on page 71.
Added an important note about the VSI and how it may not write SRA configuration information out (page 84).
Added information on case sensitivity, and replicated LUNs issue with the DS8000 SRA (page 94).
Added a section on P2V DR (page 31).
Added some extra info on 4.1 SRM licensing on page 42. Also cleared up the 1.0 section a little (page 43).
Added info on install error with Perl (page 72).
Added info on a script error (page 35).
Added info on a hard to troubleshoot / fix error with a pop up error about not being able to protect a VM (page 72).
Added error info on not visible LUN (not able to create a PG with it) to page 72, but also a new suggestion (number 14) in best practices on page 27.

10/16/10
Added a little text on the title page to explain this guide: its information can be useful or dangerous, so make sure you know what you are doing!
How to reset the Celerra root and NASADMIN accounts (page 82).
Account solution help for an error (page 71).
I added a little more info, and a blog reference, to SRM 4.1 licensing (page 42).
Added a little more detail to where SRM doesn't fit (page 8).
Added a section on protecting View desktops; be warned it doesn't, yet, include SRM (page 30).
Some general readability improvements.
I added some details to the best practices section (page 27).
Added a new section to the IBM SRA section: IBM DS4000/5000 (page 93).
Added a new issue (install hangs at 90%) on page 71.
Added PowerShell signature issue and solution on page 71.

8/7/10
Added info on upgrading to SRM 4.1 (page 11).
Added 4.1 build information.
Added info on tweaking the log parameters (page 37).
Added some info on best practices (page 27).
Updated minimum Alarm notification suggestions (page 48).
Added two URLs to help with HDS SRA installation (page 81).
Added some info on SQL authentication and starting SRM issues. See page 56.
Updated SRM scripting info in several places.
Updated the network device not found info (page 62).
Added some EMC video links.
Added Application References section; see page 30.
Added information on what VM parameters are not failed over; see page 35.
Added info on high priority start order and multiple protection groups (page 36).
Miscellaneous link and text updates, mostly spelling / grammar.

5/22/10
Added info on the "Network device needed by recovered virtual machine" error.
Added a URL reference to the new SRDF techbook.
Added a brief note on when SRM is not a solution to consider.
Added a "Can I change the Run button to work like the Test button?" section.
Updated the how to find a name of a VM when I have a MoRef article.
Added an error / solution: operation timeout.
Added RecoverPoint TCP port info.

3/26/10
Added another HP link to online help for the EVA and SRM.
Added a solution to a Compellent SRA issue on Win2K8.
Added two new alarms to the recommended alarms. Also provided info on how SRM 1.0 and SRM 4.x would handle an expired license situation.
Added info on thick LUNs to MirrorView.
Fixed error in script example.
Added some extra info about which storage arrays support application consistency.
Added info on SRDF / SRM EMC SRDF licenses.
Added KB article to IBM SVC for an error previously reported; the KB article has extra info.
Added an update to the Celerra and null issue.
Added link to issue / solution for CLARiiON issue.
Added a workaround for an issue with RecoverPoint and a CG with a large number of LUNs.
Added an SRDF white paper URL.
Added info in the script section to show how you could see the variables that are available during the run of the failover or test.
Corrected the path to the how to use trusted certificates document.
Corrected the path to the SSL and NetApp document.
Added to the EMC RecoverPoint section the issue with using the ~ in CG names.

2/4/10
Added some additional Script info, including a sample.
Added Labs to work with the SRM Boot camp at PEX. While they are designed for a specific lab, they are still useful for someone who wants to learn more.
Added some additional info on IP Customization.
Did some miscellaneous spelling corrections.
Added info on the null Celerra issue; it is supposedly fixed.
Added detail about protecting multiple tiered applications.
Added some basic detail on backups of SRM.
Added information on the three things that derail SRM projects and Proof of Concepts.
Added an additional suggested alert condition.
Added info on time required to protect VMs.

12/12/09
Added some additional info on redoing the SRM db.
Working on improving spelling and grammar. Slowly but will keep at it.
Added info on the two NetApp SRA question.
Some additional info for the HP EVA.
Added additional info on the mirrorviewsracore.dll issue and SE info in the EMC MirrorView section.
Added a solution to SRM not starting with event log errors 7000 / 7009.
Added information on FalconStor SRA log levels.
Added info on what travels with a VM between recovery plans.
Added problem / solution of failed to connect to NFC.
Added information to change the concurrent power on value.
Some general cleanup and adding of info to various sections of the document.
Added information on expectations you can have for how long to fail over.
Added some additional information on number of character limits for a number of tools.
Added some SPC-2 info to EMC DMX area.
Added info on avoiding shutdown tracker prompt.
Added a LHN error / solution (arrayxxxx not found).
For HDS, SRDF, EVA added in info about the replication direction changed during a failover.
Added a section on vendors and their tools that can do application consistent replication.
Added a section on vendors who can do application consistent continuous replication.
Added info on how to change the MirrorView SRA to support 3 simultaneous recovery plans instead of the default and supported one.
Added some MirrorView port usage info to the EMC MirrorView section.
Added an MV error and solution, and background on MV in the general section.

11/27/09
Added info on Linux IP customization issue.
Added info on IBM path issues.
Re-initialize SRM database.
URL to SRM 4.0 performance whitepaper.
Added info on repairing in SRM 4.0 (Repair SRM in add remove instead of srmconfig).
Added info on incompatible host / resource pool etc.
Added info on extending script timeouts.
Updated, and corrected, SRM 4.0 license info in post Update 1 of vSphere.
Added problem and solution of the proxy issue after Update 1.
Added a screen shot for what new license looks like.
Various edits and suggestions from Rob N. Thanks Rob!!
SRM build numbers added.
Some additional IP customization things to watch out for.
Also added some grayed out create PG general help in troubleshooting.
Corrected Changing passwords after SRM is working for SRM 4.0 new method.
Added the null Celerra error.

10/30/09
Added additional info, and clarity around how many VMs can be started.
Added info around why you cannot see a LUN during array config.
Added to install section info on 15-character limit for VC username.
Added how to catch the install and srm-config log files during install.
Some HP EVA info was added.
Added info on Heartbeat and SRM.
Finding a VM moid.
Added info on Symmetrix and duplicate WWN issue.
Added info on dr-ip-customizer issues.

10/2/09
Miscellaneous updates for SRM 4.0.
Correction for how many VMs can be started in a SRM 1.0 world, and added SRM 4.0 info.

8/8/09
Comments about starting vmware-dr.exe in the general troubleshooting section.
Fixed a misspelling in EditPlus section.
HP EVA setting issue added.

7/5/09
Added Linux customization log info.
Upgrade information for next release.
RecoverPoint log location.
Added symapi_c_net_handshake_failed error to SRDF section.

6/20/09
Added some additional info on upgrades.
Updated the app test plan.
Added a post install test outline.

6/13/09
Added info on scalability.

6/4/09
LeftHand VSA upgrade info.
Added info on syntax highlighting.
Information on uninstalling an SRA which stops SRM from working!
Some info on MS licensing and DR testing.
NetApp MetroCluster background info (thanks Lee!).
A RecoverPoint / DMX / SPC2 issue!

6/2/09
Added info on SVC and simultaneous recovery operations.
Additional info on log file component logging.
Added the error message to changing install passwords after install and how to fix.
Updated MirrorView for possible working of simultaneous recovery plan operation.
Added Celerra section to EMC and that it works with simultaneous recovery plan operation. Also said it works for RecoverPoint.
Celerra 2.0 SRA logs location.
Added additional info on uninstalls.
Some additional clarity on what the Repair button is for.
Two possible solutions for SRM not starting.
Added EMC SRDF error and solution to the Troubleshooting section and NOT to the EMC section.
Documentation bug: non-zero exit crash not true.
Added some info on what database corruption looks like.

Filename: Z:\Downloads\SRM Reference Guide_x.docx
Revision: 46
Last Save By: Michael White at 2/26/2011 16:16
Created by: VMware, Inc at 8/8/2010 10:52
Last printed at: 2/26/2011 16:16
Comments / suggestions / corrections / changes to MWhite@VMware.com
