SRM Reference Guide
Information to help you with your SRM experience! This guide has been created for VMware SEs, as well as our partners' SEs, who are responsible for working with our products at customer sites. It provides design, scalability, troubleshooting, and general information about SRM. This guide is intended for knowledgeable practitioners who are VMware staff or VMware partners. The information in this guide can help an experienced virtualization system engineer, but it can also hurt if you do not know what you are doing. This information comes with no warranty, implied or otherwise, and is not VMware sanctioned or warrantied. Corrections and suggestions are gratefully welcomed at mwhite@vmware.com.
Contents
Background
Educational materials
Some things to think about for a successful SRM project
When is SRM not a good solution?
Install / Uninstall Information
Where should I get SRAs from?
Install account
Install and Configure information for specific environments
Install Overview
Install Test Outline
Uninstall Information
Installing (uninstalling) on Windows 2008
What goes wrong in SRM projects?
Large VI environments
Suggested Recommendations aka Best Practices
Failback Outline
Bandwidth Usage
Multiple Tier Applications
Application References
Protecting View Desktops
Physical to virtual disaster recovery - P2V DR
Shared Recovery
Failback (plug-ins)
A lost protected site and failing back to it
A sample recovery plan for testing an application
Exchange Recovery Plan
Adding scripts to a Recovery Plan in a call out
What should the PowerShell command look like to have it called from SRM?
How can I see the environment variables that the admin guide says are available for scripts?
Can a script execution in a recovery plan impact the inside of a protected VM?
Will a non-zero script exit in a recovery plan stop the recovery plan?
User designed callout has returned a non-zero value: 1
What VM parameters are not failed over?
Does number of PG impact order of start for high priority VMs?
What about backing up the SRM databases?
Can I change the Run button to work like the Test button?
Can I use VMware Heartbeat to protect SRM and VC?
How can I capture the log and configuration information for support to work with?
Where are the SRM server logs stored?
How do I capture the SRM plug-in log and config info?
Where are the Linux Image Customization logs stored?
I would like to retain the SRM logs longer
What happens when
I add a new hard drive to an existing and successfully protected VM?
I add CPU and memory to an existing protected VM
I add a network card to an existing protected VM
I add a new VM to an existing protection group
I remove a protected VM from a protection group
What travels with VMs between PG and recovery plans?
How can I tell the SRM version from the log files?
Installation logs
Automated Install
Changing log details
I would like to have an automated SRM type solution without SRM
How can I have SSL communications between SRM and NetApp
What happens when I Storage VMotion a protected VM, or how do changes to VM storage affect protection?
What should I know about using the bulk IP utility?
SRM Licensing Information
How does the SRM 4.1 licensing work?
How does the SRM 4.0 licensing work?
How does the SRM 1.0 licensing work?
What does it look like if my VI is licensed for SRM?
What does it look like if my vSphere is licensed for SRM after Update 1?
What will happen if my license expires?
What is the account that is asked for during install used for?
Are Essentials and Essentials Plus supported for SRM?
How do I plan for disk utilization due to SRM database?
I would like to use trusted certificates with SRM - help!
Can I change the IP information for the SRM server?
Can network customization work for operating systems other than Windows?
Understanding order of operation for bringing VMs back online
How many VMs can SRM start?
Can I start more than, or less than, 2 VMs per host?
What does the Repair button do?
Is it all over when the recovery plan fails?
Can I move an SRM server to a new host?
How can I configure a second HBA rescan?
Recommended minimum alarm notifications
SRM VirtualCenter events
Are thin provisioned VMs supported with SRM?
What does Microsoft offer for licenses for DR test?
What vendors have application consistency options?
What vendors have application consistency options that work with continuous replication?
What rights does a user require to be a DR operator?
SRM service doesn't start, and event logs show errors with event ID of 7000 and 7009
How can I have syntax highlighting to help read SRM log files?
Text Wrangler
EditPlus
Troubleshooting
Things to watch out for
How can I change the command Timeout?
My Celerra prepare storage fails, and the error has a null in it
Where are the new Run and Test privileges?
I have accidentally deleted my Shadow VMs - what should I do to fix this?
SQL Authentication, and database access issues
Why can't I customize Windows 2008?
Why does my recovery plan show error on VM status but the VMs are ok?
ESX 2.5 accessing protected datastore will cause recomputed datastore failures
What causes the Recompute Datastore Group task?
Why is my IP customization taking about 10 minutes extra per VM?
When using Bulk Import I get column errors
I would like to avoid the messages about shutdown
Unable to find any array script files Please check your SRM installation
My Linux VMs don't have the host file changed after IP customization
dr.secondary.fault.WrongVmInventoryPlacement
Pairing Issues
I cannot run more than one simultaneous recovery plan with my MirrorView SRA
What time guidelines can I expect for protecting VMs?
What time guidelines can I expect for failing over VMs?
When trying to do Inventory Mappings the VI Client hangs
Failed to connect to the management system address when executing the discoverArrays command.
How can I re-initialize the SRM database
Error LUNs with duplicate IDs or numbers received from SAN integration scripts
Error: Failed to recover datastore:
SRM unlicensed error in logs but you have a good license
I cannot uninstall SRM successfully - what can I do?
SRM doesn't start, and you just uninstalled an SRA
Unable to create placeholder virtual machine at the recovery site: host, resource pool, and datastore are not compatible
Network device needed by recovered virtual machine could not be found at recovery or test time
SRM doesn't start and nothing in SRM logs or event logs - what to do?
Only three Recovery Plans can run at the same time
Why is Port 80 used in the install but port 443 later?
Failed to test failover luns. Existing with failure
I can't install the plug in - get an error
For SQL server use, does the SRM DB user need the DB_OWNER permission?
Unexpected MethodFault (dr.san.fault.ManagementSystemNotFound)
Changing passwords after SRM is working
My recovery site is only using x number of hosts to start VMs but it should be using y number
Error: A general system error occurred: cannot execute scripts
Permission to perform this operation failed
Priority Levels in Recovery Plan don't reflect my changes
What does SRM database corruption look like?
Error:Expected virtual machine file path .. vm-vmname/vm-vmname.vmx cannot be found
SRM 4.0 cannot start - I just updated to vSphere 4.0 Update 1
ESXi not supported at 1.0.0 nor is ESX / VC Update 2
My script needs more time to execute
Database access issues
No available Customization specifications found
Errors with using Network Customization
Operation Timeout error when doing test recovery
Recovery Plan error: Unable to access the VM config error message
Grayed out options for creating and editing of protection group
Net::SSLeay::load_error_strings
Array with key xxxxxxxxx not found error message
Is there a limitation of DR failover LUNs for some iSCSI arrays and some Hosts?
Can I have a VM with multiple VMDKs spread across two NetApp SRAs?
Not sure the error name but interesting problem
Failed to launch SAN integration scripts
Failed to connect to NFC during test failover with IP customization
No visible LUNs during configuration of the array
Review Replicate Datastores window of Array Manager is blank
How do I find the Managed object reference (MoRef) for a VM?
Null parameter name:key error
Missing testbubble switch on recovery host
Error occurred MirrorViewSRACore.dll not found
You do not hold system privilege System.View on ServiceInstance DrServiceInstance
Install hangs at 90%, and install log shows VIEINSTUTIL: Failed to open service control manager
Execution of scripts is disabled on this system
Protection Group configuration times out
Failed to update Perl installation directories
Error: The operation is not supported on this object
You do not see a newly added LUN when creating a PG?
Operation failedDetails: VI API Version 4.1 is not supported
SRM LUN discovery, test, failover fail with file write errors
SRM SRA Errata
LeftHand Networks
NetApp
EMC
FalconStor
IBM
Dell EqualLogic
Compellent
HP
Background
This document is designed to help you work with VMware Site Recovery Manager (SRM) and to make your time with it more productive. It is an attempt to share knowledge and experience among SRM users. For that reason, please share corrections, suggestions, or comments with the author (Michael White, mwhite@vmware.com). This document is for the person who has installed SRM once or twice and needs a little help, as well as for people working with a new SRA. It continues to grow as people submit new or updated information, and it now also helps with design and troubleshooting.
Educational materials
There is a guide called the SRM Evaluation Guide. It is the best written and most informative guide on SRM. Read every page and fully understand it before implementing SRM at a customer site or trying to sell SRM to someone on technical grounds. It can be found at http://www.vmware.com/files/pdf/vcenter-srm-evaluators-guide.pdf .

The SRM documentation is found at the URL below. The Admin guide has lots of important information, so you should be familiar with it.
http://www.vmware.com/support/pubs/srm_pubs.html

Even before SRM, DR was easier with virtualization than with a totally physical environment, even though it was manual. For a very good explanation of that, visit http://www.vmware.com/resources/techresources/1063 .

It surprises me that I get questions whose answers are in the release notes. Troubleshooting is sometimes quickest when you are familiar with the release notes. In addition, the release notes are HTML instead of PDF so that they can be updated as necessary.
SRM 4.1.1 - http://www.vmware.com/support/srm/srm_releasenotes_4_1_1.html
SRM 4.1 - http://www.vmware.com/support/srm/srm_releasenotes_4_1.html

There is an interesting book called Administering VMware's Site Recovery Manager by Mike Laverick. Find it at http://www.lulu.com/product/paperback/administering-vmwares-site-recoverymanager/3688988?productTrackingContext=center_search_results .
Some things to think about for a successful SRM project
• A strong team that will enhance the success of the project includes storage, virtualization, server, and network resources. Senior, experienced people in each category are of particular importance.
• Storage understanding is key. A close relationship with the technical staff at your storage vendor is very helpful.
• A corporate sponsor is useful to help break a deadlock when two different business units each declare their app the most important. A sponsor can also help with funding and vendor / BU relationships.
• Lab work or a proof of concept is very important to make sure that the entire DR / BC team fully understands the building blocks. Pick only one app and its dependencies and work all the way through, including a failback. This should also help you understand what might go wrong in an SRM implementation and how to manage or mitigate it.
• Have a strong partner to help. Use VMware PSO or someone else, but make sure they have experience. Get proof in the form of references!
• A strong plan is a big part of success.
• Really worry about the storage and the SRA. They are often poorly understood and poorly documented.
• Start small and go one step at a time.
• Triple check for compatibility issues before you start the actual work!
examples personally where there was a stalled and angry SRM install and the problem was an SRA that was not certified and not from VMware. Avoid the issue and ONLY use SRAs from www.vmware.com.
Install account
When you install SRM you are prompted for an account and password to connect to VC. This account is stored in a protected fashion and is used by SRM to talk to VC, so treat it like a service account. The account name has a limit of 31 characters, and the password must be all ASCII. Do not change the password of this account without taking other steps, or SRM will stop working. You can find information on this later in the document.

Be aware of the other field limits during install as well: the host name for VC is limited to 32 characters, and the account name field for dr-ip-customizer.exe to 25 or so characters. Update 8/8/10: I believe that this has been fixed and the character length is now 80, but I have not confirmed that.

I recommend that you use a service type account for the install that is a domain admin and an admin in VC; after the install it will become an SRM admin too. It should also be used for the ODBC account and to run the VC service.

It has been brought to my attention (thanks Brock) that our admin guide suggests using the local administrator account for the install and for running repair activities. I have never done this, and many customers I have worked with do not have access to local admin accounts. I am still using the domain account and will continue to do so unless there is a technical reason not to, which I am not aware of.
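The limits above are easy to check before you sit down at the installer. The following is only a minimal sketch in Python, not an SRM tool: the function name check_install_account and the sample account are made up for illustration, and the limits are simply the ones quoted in this section (31 characters for the account name, 32 for the VC host name, ASCII-only password). It assumes the limit applies to the whole name as typed, which this guide does not confirm, and later builds may allow longer names.

```python
"""Pre-install sanity check for the SRM install account (illustrative sketch)."""

# Limits quoted in this guide; they may differ in later builds, so verify them.
MAX_ACCOUNT_NAME_LEN = 31   # account name limit in the installer
MAX_VC_HOST_LEN = 32        # VC host name limit in the installer


def check_install_account(account_name, password, vc_host):
    """Return a list of problems found; an empty list means nothing was flagged."""
    problems = []
    if len(account_name) > MAX_ACCOUNT_NAME_LEN:
        problems.append(
            f"account name is {len(account_name)} characters, limit is {MAX_ACCOUNT_NAME_LEN}"
        )
    if not password.isascii():
        problems.append("password contains non-ASCII characters")
    if len(vc_host) > MAX_VC_HOST_LEN:
        problems.append(
            f"VC host name is {len(vc_host)} characters, limit is {MAX_VC_HOST_LEN}"
        )
    return problems


if __name__ == "__main__":
    # Hypothetical values, for demonstration only.
    for problem in check_install_account("DOMAIN\\svc-srm", "Passw0rd!", "vc01.example.com"):
        print("WARNING:", problem)
```

Running it with your real account name and VC host name before the install takes seconds and avoids finding out about a truncated field halfway through.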
Install Overview
It is important to understand the SRM installation overview. You must install in the order of operations shown in the lab section of this document. Do this on the protected site first, followed by the recovery site. Here is the outline:
1. You will need to create a DB at both sides before you start.
2. SRM application installed at Protected Site
3. SRM application plug-in installed in VI clients that connect with the Protected Site
4. SRA installed at the Protected Site
5. SRM application installed at Recovery Site
6. SRM application plug in installed in VI clients that connect with the Recovery Site
7. SRA installed at the Recovery Site
8. SRM configured at the Protected Site
   a. SRM server pairing
   b. Array Configured both Protected Site and Recovery Site
   c. Inventory Mapping
   d. Protection Group
9. SRM configured at the Recovery Site
   a. Recovery Plan created

You should now test and tweak SRM. Remember the goal is to have the required VMs running at the recovery site in the least amount of time. Remember when you are testing that you are testing for the applications to fail over in the shortest amount of time, and be functional when they are failed over!
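Because the order matters, it can help to keep the outline above as a checklist that refuses out-of-order progress. This is a rough sketch only: the step wording is paraphrased from the outline, and the names INSTALL_STEPS and next_step are invented for this example and are not part of SRM.

```python
"""The SRM install outline as an ordered checklist (illustrative sketch)."""

# Steps paraphrased from the outline above, in the required order.
INSTALL_STEPS = [
    "Create the SRM database at both sites",
    "Install SRM at the Protected Site",
    "Install the SRM plug-in in VI clients for the Protected Site",
    "Install the SRA at the Protected Site",
    "Install SRM at the Recovery Site",
    "Install the SRM plug-in in VI clients for the Recovery Site",
    "Install the SRA at the Recovery Site",
    "Configure SRM at the Protected Site",
    "Configure SRM at the Recovery Site",
]


def next_step(completed):
    """Return the next step to do, or None when the outline is finished.

    Raises ValueError if a step is unknown or was done ahead of an earlier one.
    """
    done = set(completed)
    unknown = done - set(INSTALL_STEPS)
    if unknown:
        raise ValueError(f"unknown step(s): {sorted(unknown)}")
    for index, step in enumerate(INSTALL_STEPS):
        if step not in done:
            # Anything completed after the first gap was done out of order.
            ahead = [s for s in INSTALL_STEPS[index + 1:] if s in done]
            if ahead:
                raise ValueError(f"'{ahead[0]}' was done before '{step}'")
            return step
    return None


if __name__ == "__main__":
    print(next_step(INSTALL_STEPS[:3]))
```

The sub-steps of the two configuration items (pairing, array configuration, inventory mapping, protection group, recovery plan) are left out to keep the sketch short; they could be added as further entries in the same list.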
Uninstall Information
It is best to uninstall the SRAs first, then the plug-ins, and finally SRM. Make sure to clean up the database and other plug-ins. Do this on both sides. Sometimes the scripts folder is left in the SRM folder after an uninstall, because some miscellaneous SRA files were not removed. To be tidy, and to avoid potential issues when you install SRM again on this machine, you should remove those folders. If you are doing this on Win2K8, see the section Installing (uninstalling) on Windows 2008.
If you do re-install make sure you have not missed anything above, and make sure the SRM database has been deleted and recreated as well.
VirtualCenter 4.1
This is more accurately referred to as migration since we are moving from a 32-bit host operating system to a 64-bit operating system. The steps below will help you move from an SRM 4.0 / VC 4.0 environment where SRM and VC are co-located (although it doesn't impact this process much if they are not) and SQL is remote. I will try to point out useful information along the way to help in other migration scenarios. I recommend you read this document and its references completely, understand them, and then plan an appropriate outage and work all the way through. You do want to minimize the outage window of both VC and SRM! A very useful reference is the release notes (link here), and the upgrade guide (link here).
Some interesting background
VirtualCenter ISO build 4.1 build 259021
ESXi 4.1 build 260247
ESX 4.1 build 260247
SRM 4.1 build 267817
You must use a 64-bit DSN for VC and remember to make it using SQL Native. You can find SQLncli_x64.msi near the bottom of the page at http://www.microsoft.com/downloads/details.aspx?FamilyId=50b97994-8453-49988226-fa42ec403d17&displaylang=en . You will need to use a 32-bit DSN for VUM, so see the KB article at http://kb.vmware.com/kb/1010401 for help in making the 32-bit DSN in a 64-bit OS.
Things to get ready
Make sure you have a good backup of everything that is going to change, which means your VC server and database.
Your new 64-bit host will need to have the same name and IP as the old host. This is important. So you will need to build it when it is not on the network, to avoid conflict with the existing VC. You need to preserve the VC and SRM FQDN through the migration.
Make sure you have access to your service account information for VC and SRM.
Remove your vSphere Client plug-ins. This is not always necessary but it helped sometimes in this upgrade process.
The files we need:
o VirtualCenter ISO: we need the ISO as it comes with a folder we need, that the normal .zip doesn't.
The folder is called datamigration. You can extract the ISO
to a location that you will have access to when working on either the old or new VC.
datamigration folder that is only present when you have downloaded the ISO
o If you have a spreadsheet that details the VM to LUN relationship, that is good to have.
o An outage: you will have no VC and no SRM for a number of hours. With preparation, and understanding of what is needed, you should be able to keep your outage to around 3 to 4 hours. But this will vary widely! SRM should be available approximately 1 hour after your VC is again available. But that will vary depending on your prep work.
Migration - VC
1. Your database for VC is remote, but still make sure to have a backup of it.
2. You should be logged into the current VC (or current VC / SRM host). Then, either by using / mounting the ISO, or if you have extracted the files from it, click on autorun so that you get the main screen of the install. Near the bottom of it, under Utility, select Agent Pre-Upgrade Check.
   a. If the check finds any issues, you need to resolve them before continuing. They will generally have KB articles to help.
Make sure to use the Windows credentials of your VC service account, or your domain admin.
3. On the existing VC / SRM machine, copy the datamigration folder to the local hard drive and expand it.
4. Make sure that VMware Update Manager (VUM), VMware VirtualCenter, and VMware VirtualCenter Management Web Services are stopped. Use the commands:
   a. net stop "VMware VirtualCenter Server"
   b. net stop "VMware Update Manager Service"
   c. net stop "VMware vCenter Site Recovery Manager Server" (this may not be on this host if your SRM is not co-located with your VC)
5. Now you need to use the backup.bat file that is in the datamigration folder to do a backup of your Virtual Infrastructure environment. Note the log folder? Its backup.log file echoes the work done and shows how the backup went. The datamigration folder now has a data folder that contains the backup. This backup does not back up your remote database, but it does back up the port settings in use, SSL certificates, and licensing information.
   a. When you execute the batch file, it will normally only have a few questions.
   b. It will ask whether you wish to include ESX or VM patches, and you should generally say yes.
6. Now copy the entire datamigration folder to a location that you can copy it from in the future to the new VC host.
7. Now you must turn off your existing host. Disconnect the network from it to make sure it is not accidentally turned on.
8. You will now turn on your new host, which has the same IP and FQDN. You may need to patch it now, or join it to the domain. Do what is necessary to make it part of your domain and healthy. This includes the 64-bit SQL Native client install, creating the 64-bit DSN, and creating a 32-bit DSN. The URLs earlier can help find what you need.
9. You need to copy the entire datamigration folder to the new host.
10. On the new host, you need to have access to the install media, so map a drive.
11. Use the install.bat file from the datamigration folder to start the install process.
   a. You will be asked for the path to VC and then VUM. If you are using the ISO, or have extracted the ISO, the path will be the same for both VC and VUM.
   b. You will see the normal install prompts for VC.
   c. Use the same DSN information.
   d. Notice how you have a choice at some point to do an automatic or manual update of the VC agents on hosts? I used automatic.
   e. Select the same path as you had previously used (on the old VC).
   f. Use the same ports.
A nice improvement!
   g. The next prompt is about the size of the JVM memory. Use the default or make a more appropriate choice.
   h. After the VC install is finished the install will return you to the install.bat file and start the VUM install process.
   i. VUM will now be installed.
      i. Enter your VC service credentials.
      ii. Use the appropriate 32-bit DSN.
      iii. Accept the defaults.
   j. After the install is finished you will be returned to the install.bat file.
12. There is a restore.log file in the logs folder if you need to see what was done.
13. It is important to understand that the install.bat is very smart. If you, like me, don't have the 32-bit DSN for VUM and have to exit, you can start the install batch again after you have created the 32-bit DSN and it will continue where it should!
14. Confirm that VUM, VC, and the VC Web service are running, and with the correct credentials. They are likely NOT.
15. Now install the VI client from the autorun screen.
16. Connect to the VC.
17. Install the VUM plugin. This could be on your VC or your desktop. But as I mentioned earlier, remove the plug-ins first. 18. Now check your VUM config, and any other items to make sure what you have is 4.1 and your config.
After the upgrade, your VC should show a version of something like above.
You have now upgraded one of your VirtualCenter servers. You need to do the other one now!
Note 1: In all of the work I did, the VC / VUM services did NOT start, and we had to assign the proper credentials in place of Local Service, and then it worked.
Note 2: Be careful with the 64-bit and 32-bit DSNs as it is easy to get careless. If you make a mistake, you can cancel the VC or VUM install process, fix the DSN issue, and restart the install batch file. It will not redo an unnecessary install but rather start where you last finished successfully. A very nicely done install.bat file!
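The service stop step from the VC migration can be scripted. The following is a minimal Python sketch, assuming the three service display names given in this guide; it only builds and prints the net stop commands, and the helper name is hypothetical. Passing each name as its own list element avoids quoting problems with the spaces in the service names.

```python
# Service display names as listed in this guide. The SRM service may not
# exist on this host if SRM is not co-located with VC.
SERVICES = [
    "VMware VirtualCenter Server",
    "VMware Update Manager Service",
    "VMware vCenter Site Recovery Manager Server",
]


def stop_commands(services):
    """Build one 'net stop' argument list per service name."""
    return [["net", "stop", name] for name in services]


if __name__ == "__main__":
    for cmd in stop_commands(SERVICES):
        # Print the command as it would be typed at a Windows prompt.
        print('%s %s "%s"' % (cmd[0], cmd[1], cmd[2]))
        # To actually run it on the VC host, one option (an assumption,
        # not tested here) is: subprocess.run(cmd, check=False)
```

Printing first and running second makes it easy to review exactly what will be stopped before the outage starts.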
You need an SRM backup, but it needs to be taken at the same time as if it was in a consistency group. BTW, it needs to be restored like that too. You should also have history reports as hardcopy just in case. Copy your vmware-dr.xml file from each SRM server to a location where you will be able to access the file later. The default location is C:\Program Files\VMware\VMware Site Recovery Manager\config .
Migration - SRM
1. Remember that your new SRM host must have the same name / FQDN, so you will need to turn off your old SRM host after you have your backups and .xml file so you can deploy the new host.
2. Backup the SRM database on each of the two (or more) sides.
3. Turn off the old SRM host.
4. Turn on the new SRM host.
5. Make sure you have a 32-bit DSN.
6. Create a new install of SRM 4.1 on the new host. Important to note:
   a. If you are re-using the SRM 4.0 database, make sure to use a copy and not the original. Errors or a cancellation could corrupt your database.
   b. You will be prompted about there being an SRM extension installed already. This is due to using your old database with a new install. You should select Yes.
c. You will need to select the Automatically create the certificate choice.
   d. Remember the DSN is 32-bit.
   e. SRM will likely NOT start. Change the service credentials to the proper SRM service account, and hit retry. It should continue fine.
7. You now have SRM running, but not configured completely.
8. If the plug-in has not been removed, remove it, and install it again. Several times in my testing, right after the SRM upgrade, the plug-in had the name of vDr instead of the fully spelled out name. It still worked, and after a restart of the VI Client the name changed.
9. Install the SRA.
10. Now get the other side done.
11. If you have changed advanced settings, this is the time to migrate them. Make sure to do that before you continue on. See the section below for help.
12. Now you will need to re-create the site pair, and reconfigure the array manager, in particular the authentication information. When doing this, there is a small thing to remember: after entering the correct credentials, you will need to select the array.
When you re-enter your credentials to the Array Manager, make sure to select your array!
You are now complete. If you have any issues, please do not hesitate to contact our support organization, but also leave me a comment!
Migrating Changes to Advanced Settings
If you have not made any changes to Advanced Settings, you do not need to do this section, which should be true for most customers. Changes would be things like SanProvider.CommandTimeout or SanProvider.hostRescanRepeatCnt. See below for a screenshot of the Advanced Settings categories. If you know the changes you made you can just add them to your new install. But if you are not sure, you will need to work through the process below.
Start by loading your vmware-dr.xml file. Load the one from the Recovery Site when editing the Advanced Settings for the Recovery Site, and the one from the Protected Site when editing the Protected Site.
1. When you are in the Advanced Settings window, you will need to work through each category.
2. One example is localSiteStatus. Search your vmware-dr.xml file for that phrase.
3. In the section of the vmware-dr.xml file where you find the category (localSiteStatus in our example), look for variables that match those in the Advanced Settings category and change the values in Advanced Settings to match what is in your vmware-dr.xml file. See below for an example.
After we see what is in the vmware-dr.xml file we record it in the Advanced Settings. See below for that.
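Searching the saved vmware-dr.xml by hand for each category gets tedious, so here is a minimal Python sketch of the lookup. It assumes each Advanced Settings category appears as an XML element whose children are the individual settings; the sample XML, its root element, and the setting names and values in it are hypothetical placeholders, so check them against your own file.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment standing in for a saved vmware-dr.xml file.
SAMPLE_XML = """
<Config>
  <localSiteStatus>
    <exampleSettingA>300</exampleSettingA>
    <exampleSettingB>100</exampleSettingB>
  </localSiteStatus>
  <SanProvider>
    <CommandTimeout>600</CommandTimeout>
  </SanProvider>
</Config>
"""


def settings_for_category(xml_text, category):
    """Return {setting name: value} for the first element named category."""
    root = ET.fromstring(xml_text.strip())
    node = root.find(".//" + category)
    if node is None:
        return {}
    return {child.tag: (child.text or "").strip() for child in node}


if __name__ == "__main__":
    for name, value in settings_for_category(SAMPLE_XML, "localSiteStatus").items():
        print(name, "=", value)
```

Run it once per category, against the file from the matching site, and enter any non-default values in the Advanced Settings window of the new install.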
Remember you will need to work through this process for each category.
Some issues I found
I have mentioned these issues elsewhere but thought I would mention them here again.
Forgot to update the credentials for the arrays.
Used a 64-bit DSN for SRM. Then tried the 32-bit and it worked!
Didn't know to install the 64-bit SQL Native client on the Win2K8R2 SRM server. So I did. And it worked.
Did not see the Site Recovery Manager plug-in, but saw one that was called vcDr and it worked. Restarting the vSphere Client also cleared it up.
None of the VC services started when they were supposed to. Changing the credentials on the services to the correct ones solved the issue easily. I kept finding LocalService, but once changed all was good.
The SRM service never started, but when I changed local service to the proper credentials it did and all was good.
Didn't know VUM needed a 32-bit DSN. So created one and restarted!
It may not be connected to the upgrade, but after three successful test failovers, I had one fail. The error was Error:Error occurred: failed to prepare shadowVM for recovery. One VM was successfully recovered but three were not. I removed the protection group that held the VMs, then made sure that the folders on the ShadowVM LUN associated with those newly unprotected VMs were gone (several were not). I then recreated the PG, attached it back to the recovery plan and it worked fine, for many tests with no issues.
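Several of the issues above come down to DSN bitness. As a minimal Python sketch, the requirements stated in this guide (64-bit DSN for VC, 32-bit for VUM, 32-bit for SRM) can be written as a small lookup and checked against what was actually created; the function name and the input format are hypothetical.

```python
# DSN bitness per component, as stated in this guide for the 4.1 migration.
REQUIRED_DSN_BITS = {"VC": 64, "VUM": 32, "SRM": 32}


def dsn_problems(configured):
    """Compare configured DSN bitness against the requirements.

    configured maps a component name to the bitness (32 or 64) of the DSN
    that was created for it; a missing component is reported as None.
    """
    problems = []
    for component, required in REQUIRED_DSN_BITS.items():
        actual = configured.get(component)
        if actual != required:
            problems.append(
                "%s: needs a %d-bit DSN, found %s" % (component, required, actual)
            )
    return problems


if __name__ == "__main__":
    # Example of the mistake described above: a 64-bit DSN used for SRM.
    for line in dsn_problems({"VC": 64, "VUM": 32, "SRM": 64}):
        print(line)
```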
Undo
If you wish to undo this migration it is almost easy. You would turn off the new hosts, and turn on the old ones. They would not be happy, since the databases still in use would be the new ones. You would need to stop all of the VC and SRM services, and restore the backup copy of the databases I mentioned you needed to have. Then start the services and you should be good to go.
   b. I suggest that you use linked mode for VC so that you can work with SRM more easily than having two clients open. When in linked mode, you can also license in only one place and yet select both sides to apply the license to. So it is a bit easier for licensing too.
3. Now install the licenses! See screen shot below.
4. When you decide to start upgrading ESX hosts to vSphere, remember that ESX 4 cannot failover to ESX 3 IF the VMs have been upgraded to virtual hardware version 7 (VH7). But they can failover if they have not been upgraded to VH7.
   a. It might be easier in a Protected Site / Recovery Site situation to start upgrading ESX hosts on the recovery side first AND not upgrade to VH7 or VMware Tools until after the protected site is also upgraded.
   b. If both sites are hosting protected VMs it could be interesting! But the same idea might be good: update everything to ESX 4 but without updating the VH or VMware Tools until all is done. This can be done if necessary at the cluster level too, I think.
5. Test the test failover, and as soon as possible test a real failover.
Notes:
   a. A recovery will run fine if the protected site is upgraded but not the recovery side. Try to avoid this but it does work.
   b. If you try a test recovery on the recovery side while the protected side is being upgraded there may be issues, so try to not do this.
   c. Upgrade quickly to keep the outage / exposure minimal.
   d. Make sure that DHCP client, Protected Storage, Server (lanmanserver), and Workstation (lanmanworkstation) are all running on the SRM server before the upgrade.
   e. If you upgrade the OS to Win2K8 as part of the upgrade, make sure the Protected Storage mentioned above is running.
   f. If you have issues, and cannot proceed, you should uninstall / reinstall SRM. But first roll back your VC to 2.5 Ux.
   g. Hand-made modifications to any of the <SRM root>/conf/* files will be overwritten. There should be backup copies of those files from which you can then copy and paste back the custom entries.
   h. As noted above, make sure to watch out for the VH levels, as you can get errors trying to configure a PG to fail over inappropriately, i.e. ESX 4 hosted VMs that are at VH7 to a cluster that is held by an ESX 3 host.
See below for the place to add the license.
Advanced Settings option in SRM 4.0 to replace manually editing the vmware-dr.xml file, and where to add the license.
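The virtual hardware rule from the upgrade notes can be expressed as a simple check. This is a minimal Python sketch of that one rule only (a VM at virtual hardware version 7 cannot fail over to ESX 3); the function name and the way versions are passed in are hypothetical, and a real environment has other compatibility factors that are not modelled.

```python
def can_fail_over(vm_hardware_version, recovery_esx_major):
    """Apply the VH7 rule: VH7 (or later) VMs need ESX 4 at the recovery site."""
    if vm_hardware_version >= 7 and recovery_esx_major < 4:
        return False
    return True


if __name__ == "__main__":
    # (VM virtual hardware version, ESX major version at the recovery site)
    for vh, esx in [(4, 3), (7, 3), (7, 4)]:
        print("VH%d -> ESX %d: %s" % (vh, esx, can_fail_over(vh, esx)))
```

Running a check like this over an inventory export before building protection groups would flag the VH7-to-ESX-3 combinations that produce configuration errors.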
Design Guidelines
This section will look at some design information. The Admin guide has some very good information, but we will look at things in this section that are not covered in our guide.
It is important to understand that a test, or a real failover, works best when all of the VMs involved are in the same LUN(s). Remember the whole LUN must failover! It is worth thinking about having a department, or an application worth of VMs, in a LUN or LUNs to provide the best flexibility in a test or real failover. As part of this I would include some XP VMs for test purposes. A PowerShell script that can help with understanding where VMs and their disk files are, and on which LUN, can be found at http://www.peetersonline.nl/index.php/vmware/another-way-to-gather-vmware-disk-info-withpowershell/ .
Some vendors use Consistency Groups (CG) to group LUNs and this becomes the granularity that is seen through SRM. The greatest granularity during a failover or test failover is often achieved when each CG hosts one or more LUNs that hold only one app or one business unit.
A VM must have its VMDK(s) on the same storage vendor's arrays and NOT on two different storage vendors' storage.
If VC is using trusted certificates then SRM must too. This is not simple, but instructions are in this document that will make it much easier!
Look at the SRA information in this document, as it will sometimes provide information that will impact your design.
1. Storage Organization: I have seen one potential customer for SRM that used 4 TB LUNs throughout their virtualization world. This was due to it appearing to be easier for them than anything smaller. I suspect they still have not upgraded to vSphere! But another issue is when there is no pattern for where applications are stored. So Exchange might be scattered on 10 different LUNs. This will mean in a failover that all of the apps on the 10 different LUNs will need to failover, and there will not be any granularity. The best idea is to slowly migrate your applications to be protected to new replicated LUNs. You will get the granularity for testing or failover, and it will be easier to upgrade the array in the future. Most people have little storage organization so this will need to be taken into account!
2. Application knowledge: we need to know what the corporation thinks is the most important app, not just what an IT manager thinks. We then need to know all of the upstream and downstream services that application needs to be considered working. All of that information is necessary to build a test plan. This can be quite big when you consider all of the applications that companies might have! In addition, most companies are in the category of not entirely sure what apps or what services they need. If the customer has a Business Impact Assessment (BIA) report it will help enormously, but most don't have that either. Change Control will sometimes have very good info to help with understanding applications and their necessary services.
3. Storage Replication Adapter (SRA): this little tiny piece of software can cause a great deal of grief. Sometimes it needs a path change that its own installer didn't do. Sometimes it needs a special license like SnapView for MirrorView or space efficient for IBM. Sometimes these requirements are not written anywhere easy to find. EMC has finally gotten very good release notes, but they are hard to find as they don't ship with the SRA. Also, often the SRAs don't support all of the features that the replication supports. So this can confuse and frustrate customers. So investigate the SRA carefully.
Large VI environments
If you have a large number of hosts and VMs, you may have some issues with SRM. These issues are considered scalability issues in both the platform and UI. They normally only occur when there are very large numbers of VMs and hosts. We are working hard and fast to make these problems go away but in the meantime here is some useful information. Each of the next major releases will continue to allow more scalability.
Design your SRM infrastructure in a POD design. The pod should only manage approximately 750 unprotected VMs (and 1000 protected VMs) and less than 150 replicated LUNs. This will allow SRM to work better, as the full 1000s of VMs, both protected and not protected, are not seen by SRM. It is a good idea in this example to separate SRM and VC. So each POD would have up to approximately 750 unprotected VMs and up to 1000 protected VMs, with VC and SRM installed on separate servers. Align each POD with a business or departmental unit and it will lessen the impact of the extra VCs to manage. In addition, Linked Mode in vSphere will help too. So if you have 2500 VMs total, and 1500 are protected, I would create two PODs, and if possible divide the protected VMs and the unprotected VMs between them. However, more likely is the division by business or departmental guidelines. Each POD would have separate SRM and VC servers, and hopefully would be backed by a corporate production SQL or Oracle cluster. Some other recommendations would include:
Large recovery plans may require more resources (processor / RAM / ESX servers) at the recovery site than at the protected side due to the nature of failovers and trying to start everything so quickly.
You should separate the VC and SRM databases as they are heavily used during a recovery.
A general comment would be that adding VMs to protection groups is less costly in resource usage than adding PGs. Fewer PGs speed up recoveries, but do not hesitate to use what is necessary.
VMware Tools speed recoveries, as if they are not installed we must wait for the timeout to occur!
High priority should only be used where necessary as it slows things down. Of course, that is as designed so that we can exactly determine the order of VM recoveries. But only use it when you need to.
To maximize performance you should, when doing simultaneous recoveries, try to have each recovery plan target a separate cluster.
Another way of doing things, that may reduce the need to use a POD design, is to do a 3 year sizing forecast and figure out what the end state architecture needs to be to support the number of projected workloads and the RTOs (i.e. how much horizontal scaling), then backdate the end state picture to what you will be implementing on day 1. That way you will know it will scale without breaking. Do the storage layout just so, so that everyone agrees on it and how it will grow. If however you are starting with 3000 VMs to protect on day 1 the POD design will help.
Page 23 in the 1.0.1 U1 Admin guide shows the SRM maximums. They include:
500 protected VMs - enforced
150 Protection Groups - enforced
150 Replicated LUNs - advisory only (this could be more than an actual 150 LUNs depending on how your LUNs are managed)
3 running recovery plans - advisory only
Approximately page 11 in the SRM 4.0 Admin guide shows the new SRM maximums.
1000 protected VMs - enforced
500 protected VMs in a single protection group - enforced
150 Protection Groups - enforced
150 Replicated LUNs - advisory only (this could be more than an actual 150 LUNs depending on how your LUNs are managed)
3 running recovery plans - advisory only
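A planned design can be checked against the SRM 4.0 maximums listed above. This is a minimal Python sketch using only those numbers; the dictionary keys and the function name are hypothetical, and enforced limits are reported separately from advisory ones.

```python
# SRM 4.0 maximums as listed above. Key names are made up for this sketch.
SRM40_ENFORCED = {
    "protected_vms": 1000,
    "vms_in_largest_protection_group": 500,
    "protection_groups": 150,
}
SRM40_ADVISORY = {
    "replicated_luns": 150,
    "running_recovery_plans": 3,
}


def check_limits(design):
    """Return (errors, warnings): names of exceeded enforced/advisory limits."""
    errors = [k for k, limit in SRM40_ENFORCED.items() if design.get(k, 0) > limit]
    warnings = [k for k, limit in SRM40_ADVISORY.items() if design.get(k, 0) > limit]
    return errors, warnings


if __name__ == "__main__":
    errors, warnings = check_limits(
        {"protected_vms": 1200, "protection_groups": 40, "replicated_luns": 160}
    )
    print("enforced limits exceeded:", errors)
    print("advisory limits exceeded:", warnings)
```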
When you need to build SRM in pods like this it can increase the complexity of management, or perhaps just increase the frustration factor. Try to minimize this by building the pods within the limits above but also as department / business unit / or maybe even application / service based. This may help minimize the frustration and make the management a little more logical.
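As a rough sizing aid, the POD guidance in this section (approximately 750 unprotected VMs and up to 1000 protected VMs per POD) can be turned into arithmetic. This is a minimal Python sketch that assumes VMs can be split evenly across PODs, which real business or departmental boundaries will often prevent; the function name is hypothetical.

```python
import math


def pods_needed(total_vms, protected_vms, max_unprotected=750, max_protected=1000):
    """Estimate the POD count from the per-POD guidance in this section."""
    unprotected = total_vms - protected_vms
    return max(
        1,
        math.ceil(unprotected / max_unprotected),
        math.ceil(protected_vms / max_protected),
    )


if __name__ == "__main__":
    # The example from this section: 2500 VMs total, 1500 of them protected.
    print(pods_needed(2500, 1500))
```

For the 2500 / 1500 example this gives two PODs, matching the recommendation in the text. Treat the result as a lower bound, since the replicated LUN count per POD and organizational boundaries can push the number higher.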
consider them best for them but we cannot do best practices for everyone. People should look at the recommended best practices and see if they apply. The recommendations below are the first recommendations I have done for SRM and I think they should apply to most, but still, please make sure they apply to you before implementing them.
1. Log Settings: you should increase the settings around logs. The log files compress very well after they are used, and generally there are a lot of big drives in physical machines, and in virtual machines you could have big drives. Think about keeping 100x 10 MB files. The 10 MB files will compress very well! See page 37 for more info on this.
2. Increase the number of threads: if you are not using SRM 4.1 you should increase the number of threads in use to avoid some time out issues. See how to do this on page 67. This change is already in SRM 4.1.
3. Maximum power on: can this be changed? By default we power on 2 VMs per host to a maximum of 10 hosts. You can change this if you have lots of resources in terms of memory and processor, and storage bandwidth as well. See how to do this on page 46.
4. Service Level Agreements: SLAs are something you should tread carefully around since they can sometimes be a factor in a problem with SRM. SRM must start VMs, and that time is something that needs to be measured before any SLA should be agreed to! This means while we can support an RPO of zero or near zero, we cannot support the same in RTO as we need time to start VMs. Plus, remember that the decision time must be part of the RTO.
5. Alarms: you should configure the minimum set of alarms, or at least think of them and decide to not use them. See the recommendations at page 48.
6. Script usage: you should think about your scripts. Should you use the idea of one big script, or many small scripts, when it comes to IP Customization? Definitely you should store those scripts in one place, which should be on the SRM server. I would strongly recommend the use of the VIX API for the scripts as well. You can see page 33 and page 35 for more info.
7. Patch regularly: SRM is not frequently updated, but it is important to upgrade when patches are available. When you unexpectedly need SRM to work you really need it to go, and patches can solve issues that would stop SRM from working when you need it.
8. Use the account suggestions on page 9.
9. Plan specifically for a partial failover, meaning you can fail your individual tier one applications over without failing anything else. This makes testing much easier, and provides significant opportunity for the customer to have very granular failovers if they need them. Experience suggests they will need partial failovers more often than complete ones. This is accomplished by organizing the storage so that you can in fact fail over just one app.
10. The RTO should always include the time to make the decision to execute a failover.
11. More protection groups lengthen failover. Fewer shorten it. Within reason of course. For example, in internal tests, 100 VMs in 1 PG failed over in approximately the same time as 30 VMs in 30 PGs. Applications can make big differences in this testing, but it is a good idea, where reasonable, to minimize the number of PGs. Adding more PGs is more costly than adding VMs.
12. With NFS it is less costly to have fewer and bigger mounts, and more costly to have more and smaller. Costly in this case impacts time to mount / dismount.
13. High priority provides maximum control but the slowest execution. Perhaps it is best to minimize the use of high priority and to plan for the use of recovery plans to provide the control instead of priority. This is, of course, tricky to manage but it is also powerful.
14. Let replication finish before adding the newly replicated LUN to SRM, and make sure it is visible in the Array Configuration of SRM before attempting to use it.
15. I like the idea of doing a Health Check before starting an SRM project. In particular it is worth doing it on the DR site to make sure it will be healthy when it is required.
16. Each tier 1 application should have its own PG / RP, as well as be in an RP for a larger group, perhaps the whole company.
17. I recommend if possible having a 5 GB shadow VM location, so that the size will prevent any confusion by people putting real VMs on it.
18. You should check out the events and tweak as appropriate. See more of them on page 48.
19. I have started to only use SQL accounts for vCenter, VUM, and SRM and have been very happy with that. Starting to think that this should be a recommendation.
20. Do not co-mingle protected and not protected VMs in the same replicated LUN. This is important and do not forget it.
21. Do not use multi extent volumes since SRM will have issues with them.
22. It will likely help most SRM projects, especially in troubleshooting of failed tests, to use some sort of software like vADM. This is application discovery and mapping software that can tie servers into applications and help understand what is missing from a test.
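The power-on default and the RTO points in the recommendations above can be combined into a back-of-envelope estimate. This is a deliberately simplified Python sketch: it uses the stated default of 2 VMs per host up to a maximum of 10 hosts, and it assumes VMs start in equal-length waves. The minutes-per-wave figure and both function names are hypothetical, and storage, customization, and application start times are not modelled, so measure real test failovers before agreeing to any SLA.

```python
import math


def power_on_waves(vm_count, hosts, vms_per_host=2, max_hosts=10):
    """Number of power-on waves at the default concurrency described above."""
    concurrent = vms_per_host * min(hosts, max_hosts)
    return math.ceil(vm_count / concurrent)


def estimated_rto_minutes(decision_minutes, vm_count, hosts, minutes_per_wave):
    """Decision time plus power-on time; every other step is ignored."""
    return decision_minutes + power_on_waves(vm_count, hosts) * minutes_per_wave


if __name__ == "__main__":
    # 100 VMs on 12 recovery hosts: only 10 hosts count, so 20 VMs per wave.
    print(power_on_waves(100, 12))
    print(estimated_rto_minutes(30, 100, 12, 4))
```

Even this crude model shows why the decision time belongs in the RTO: it is added on top of however long the VMs take to start.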
Failback Outline
EMC is providing automated failback tools, such as the Celerra plug-in for VC, and there are other vendors like FalconStor that are doing this. I expect to see more from EMC as well. But it is important to understand what the outline should be so you understand the big picture better. On page 53 of the SRM 1.0 Admin guide there is a very nice checklist for doing a failback (page 41 in the SRM 4.0 Admin guide). In addition, both NetApp and FalconStor have good documents for doing failback that include both the storage and VMware steps. It is ideal to have one of these documents if possible. A general idea of the failback is to do what you have already done in reverse. Clean up the Protection Groups and mappings at the previously protected site, and the recovery plan(s) at the previous recovery side, and start over. The steps might look like:
1. Cleanup
   a. On the protected side, rescan the HBA, and the failed over VMs and PGs are seen as invalid. Delete them.
   b. On the recovery side, delete the shadow VMs.
2. Configure replication to now be back to the original protected site. Be aware a number of vendors start replication automatically after a failover, so this may be done already. Some HP SRAs, some EMC SRAs, and HDS do this. Make sure the replication finishes.
3. Set up SRM to failback, which means you are setting up SRM to fail over to the original protected site.
   a. Reconfigure the array manager for the new direction
   b. Inventory mappings, etc.
4. Set up the original protected site, but first clean up!
   a. Clean up any artifacts that remain from the original failover and the subsequent failback.
      i. Remove the recovered VMs from VC and delete them from storage at the recovery site.
      ii. Remove the PG and RP you used to failback.
      iii. Remove the placeholder VMs
   b. Set up replication
   c. Configure array manager
   d. Inventory mappings, etc.
Bandwidth Usage
I don't have specific numbers, but in general SRM consumes very little bandwidth between the sites. Once protection is up and running and SRM is essentially idle, the bandwidth between the sites should be almost nil (just periodic heartbeat/ping messages, and summaries of changes to the VC inventories). During operations such as protection and unprotection of VMs, there is some traffic between the sites, but I would estimate this to be on the order of 100s of KB per VM during protection, and almost none during unprotection. There can be brief spikes if SRM's connection to the remote VC server drops and gets reestablished, but this should likewise not involve more than 100s of KB per VM. Don't take these numbers as gospel, but I cannot imagine a real-world situation in which SRM bandwidth is not utterly dwarfed by that of the SAN.
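To put the "100s of KB per VM" estimate above into perspective, here is a small arithmetic sketch in Python. The 300 KB per VM default is an assumed illustrative value chosen from within that range, not a measured number, and the function name is hypothetical.

```python
def protection_traffic_mb(vm_count, kb_per_vm=300):
    """Rough inter-site SRM traffic, in MB, when protecting vm_count VMs."""
    return vm_count * kb_per_vm / 1024


if __name__ == "__main__":
    # Protecting 1000 VMs at an assumed 300 KB each.
    print(round(protection_traffic_mb(1000)), "MB")
```

Under that assumption, protecting 1000 VMs moves a few hundred MB once, which supports the point that SRM traffic is small next to the replication traffic of the SAN.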
Application References
Are there any application references for SRM and application X? This is a spot where I will start to accumulate links to application SRM support or implementation guides.
SAP - http://www.vmware.com/files/pdf/partners/sap-srm-cx-final.pdf
FUJIFILM Medical Systems - http://www.vmware.com/files/pdf/FUJI_SRM_Final.pdf
PTC Windchill Solutions - http://www.vmware.com/resources/techresources/10064
Shared Recovery
This was created for our developers by our developers and has since been released as Shared Recovery. If you have Site A, and Site B, protected and recovery at Site C, you should remember that VMs from Site A would go back to Site A, and the same for Site B. To protect VMs on Site C, you would need to have another SRM install, and protect those VMs with another site. It could be A or B or D, but it would need a new SRM instance. If using A or B, it is a little tricky since it would be using / seeing ESX hosts that are being used by a different SRM. It would work but is messy and thus you should use Site D. It would be less messy if different storage were in use compared to A or B. Shared Recovery is mostly targeted to outsource DR organizations. See the Shared Recovery documentation at http://www.vmware.com/pdf/srm_shared_recovery.pdf .
Failback (plug-ins)
Currently, with SRM 4.x, VMware does not have the ability to do an automated failback. Elsewhere in here (page 29) there is a fairly straightforward outline of how to do failback. It is not that hard, but it does have an order of operations to follow. Vendors are now providing failback plug-ins for vSphere. It is important to understand that the majority of them actually do storage failback, and not VM failback. This will improve at some point, but with any and all failback plug-ins, make sure they do start order management and IP customization (back to the original, no less!). If they do not, they are likely not good enough for your customer. As of 12/31/10, I am not aware of any vendor plug-ins that can do failback with these two necessary features.
5) External Resources: this would be anti-spam or anti-virus. Again, you can take hot clones, or if you have a hardware appliance, sometimes it has a spare network port that can be used for the isolated test VLAN, or perhaps there is a spare appliance on the recovery side that you can use.
6) Exchange: this is the subject of the test after all! But do we need to take all of it for the test, or can we take a subset? And what subset should we take?
Test: This is the test plan itself, so after the recovery plan has been executed, we would use this information to test the application. A form that is signed after the test would be best.
1) Exchange Test Plan
Name:______________ Date: ____________ Pass / Fail: _______
a. Login with your normal account?
b. Start Outlook client with no errors?
c. Access your mailbox via OWA with no errors?
d. Address message successfully to
   i. Partner (in test)
   ii. Stranger (not in test or in your cache)
   iii. Group
e. Book meeting successfully with your partner?
f. Look up phone number for someone?
g. And so on.
Build Plan - Infrastructure: this is the information to build out the plan and its infrastructure.
1) Isolated VLAN: this covers the network side (cabling and configs) as well as the VI team (virtual switches).
2) DC in or on the VLAN: clone or whatever method you use.
3) XP VMs: must be built, configured, and have Office on them. They should be tested and in the proper LUNs to be available for the test and during the test.
4) Exchange: we need to get copies of the Exchange servers in the test VLAN.
5) Replication: is it working and is everything in place for us?
6) It is suggested to have a detailed to-do list with name / date info to make sure it is done smoothly.
Build Plan (SRM): this covers building out the SRM infrastructure to support this plan.
1) Protection groups: make sure proper LUN!
2) Recovery plan: watch order of recovery, DC first for example.
Approval section: When this test plan is a written document it should have a number of names on it, some for approval, but some for simple communications. This document, once created and approved, is very useful to have at the recovery site.
1) The approval would come from the data owner, who is sometimes called the application owner.
2) Some other info would include:
a. Network contact
b. Virtualization / server / operations contact
c. DR team contact
d. Application owner test representative contact
To run a batch file you should start the shell command with c:\windows\system32\cmd.exe. So it would look like c:\windows\system32\cmd.exe /c c:\scripts\alarmscript.bat.
These scripts are executed under the Local Security authority of the SRM server. They can be stored wherever you like, but it is likely best to keep them on the local SRM disk and not on a remote network share. Example: add a script callout with the line:
C:\windows\system32\cmd.exe /C c:\scripts\call.cmd
Have a c:\scripts folder on the SRM server. In it have a batch file called call.cmd that contains:
@echo off
c:\scripts\test.cmd
In the c:\scripts folder have another file called test.cmd and it will contain for example:
@echo off
date /T >> c:\scripts\test.log
time /T >> c:\scripts\test.log
echo Recovery Test %VMware_RecoveryName% Executed! >> c:\scripts\test.log
echo Running in %VMware_RecoveryMode% mode! >> c:\scripts\test.log
echo Executed on %computername% - SRM server >> c:\scripts\test.log
echo VM name is %VMware_VM_Name% >> c:\scripts\test.log
echo ++++++++++++++++++++++++++ >> c:\scripts\test.log
This will execute during test or recovery and create and update a test.log file with the date / time, and some additional information. This is an easy example for the purpose of showing you how to call a script. You can do anything you want from inside of the test.cmd file. For more information on the environment variables I am using in this script, see below for how they can all be displayed, or see the page in the admin guide. Remember that the script file is stored on the SRM server, and executed on the SRM server. If you need to make changes inside a VM, you will need to use something like the VIX API, which allows a script on the SRM server to make changes inside of a VM. If you use PowerShell scripts you may experience an odd issue; find it and the solution on page 71. You can find a blog article on this at: http://blogs.vmware.com/uptime/2010/09/vmware-vcenter-siterecovery-manager-and-scripting-.html and also check out http://blogs.vmware.com/uptime/2010/08/cana-script-or-message-call-out-stop-a-recovery-plan-and-a-little-bit-more.html to learn about script placement.
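The same callout logic can be written in any language the SRM server can run. As a rough sketch only, assuming Python happens to be installed on the SRM server (this guide does not require it), the following mirrors what test.cmd logs. The function names are hypothetical; the environment variable names are the ones used in test.cmd above.

```python
import os
from datetime import datetime


def build_log_lines(env=None, now=None):
    """Return the lines that test.cmd would append to test.log."""
    env = os.environ if env is None else env
    now = datetime.now() if now is None else now

    def get(name):
        # Variables that SRM did not set show up as a visible placeholder.
        return env.get(name, "<not set>")

    return [
        now.strftime("%Y-%m-%d"),
        now.strftime("%H:%M"),
        f"Recovery Test {get('VMware_RecoveryName')} Executed!",
        f"Running in {get('VMware_RecoveryMode')} mode!",
        f"Executed on {get('COMPUTERNAME')} - SRM server",
        f"VM name is {get('VMware_VM_Name')}",
        "++++++++++++++++++++++++++",
    ]


def append_log(path, lines):
    """Append the lines to the log file, creating it if needed."""
    with open(path, "a", encoding="utf-8") as handle:
        handle.write("\n".join(lines) + "\n")
```

A callout would then call append_log with a path such as c:\scripts\test.log and the result of build_log_lines(). As with the batch version, this runs on the SRM server, not inside the VM.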
What should the PowerShell command look like to have it called from SRM?
You can think of this as a scheduled event, but rather than Windows executing it on a schedule, it is executed by SRM as required. So write your PowerShell command line as if you were going to put it into a Scheduled Task, but instead put it in the test.cmd file above. Remember that you will need to have PowerShell and PowerCLI installed on the SRM server! See the example below:
How can I see the environment variables that the admin guide says are available for scripts?
The environment variables that SRM puts into the environment during the test are listed in the admin guide on page 51. But if you wish to see them in action, you can use the command below.
C:\windows\system32\cmd.exe /C set
This command will echo all the environment variable values to the SRM log file.
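If you only want the SRM-specific values rather than the whole environment, a small filter helps. This is a sketch under two assumptions: that Python is available on the SRM server, and that the SRM variables all share the VMware_ prefix seen in the test.cmd example (the admin guide is the authority on the full list). The function name is hypothetical.

```python
import os


def srm_variables(env=None, prefix="VMware_"):
    """Return only the environment variables whose name starts with prefix."""
    env = os.environ if env is None else env
    wanted = prefix.upper()
    # Compare case-insensitively: Windows variable names ignore case,
    # and Python upper-cases os.environ keys on Windows.
    return {
        name: value
        for name, value in env.items()
        if name.upper().startswith(wanted)
    }
```

Printing the result from a callout should land it in the SRM log in the same way as the cmd.exe output described above.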
Can a script execution in a recovery plan impact the inside of a protected VM?
The scripts that are executed by an RP are held on the local hard disk of the SRM server, but they can use the VIX API library to impact the inside of a VM. For more see http://communities.vmware.com/community/developer . There is no other way I am aware of to have a script execute on the SRM server console yet impact the inside of a VM. If the script is inside the VM, then SRM alone cannot execute it, and the audit trail that SRM provides will not record the execution of the script.
Will a non-zero script exit in a recovery plan stop the recovery plan?
In both the SRM 1.x documentation and the beta documentation for the next major release, it is said that if a script callout during a recovery plan has a non-zero return at the end of the script, it will stop the recovery plan from finishing. This is a documentation bug, and is NOT correct. It will be deleted from the SRM beta documentation before GA. Check out http://blogs.vmware.com/uptime/2010/08/cana-script-or-message-call-out-stop-a-recovery-plan-and-a-little-bit-more.html for more info on this.
security thinking in the failover center.
4. Resources: things like memory / CPU reservations / shares are not failed over. The thinking was that the resource decisions / standards on the DR side would be different than on the protected site. However, there is a workaround: when a failover occurs, the resource configuration of the shadow VM is copied to the recovered VM. So you can edit a shadow VM to have the resource configuration that is important, and it will be copied to the VM during the recovery operation.
Can I change the Run button to work like the Test button?
I am setting up SRM for a computer show, and I don't want anyone to use the Run button, and I am not sure about using roles / permissions to manage this. Is there another way? If you are using the current GA version of 4.1, or later, you do have a configuration file option that can do this. In the vmware-dr.xml file on the recovery side, locate the section <RecoverySecondary>, and add to it an indented line that is <testOnly>true</testOnly>. You will then have a Run button that looks like Run when you execute it, but it is in fact a test. The history report will confirm that. You can change the true to a false to revert to the normal behavior, or remove the line you added. Since this changes the vmware-dr.xml file directly, you will need to restart the SRM service. Make sure you make this change on the recovery side.
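If you make this change on several lab servers, scripting it avoids typos in the element. The following is a minimal sketch, not a supported tool: it only assumes that vmware-dr.xml is well-formed XML containing a <RecoverySecondary> element as described above. The function name and the sample wrapper element are hypothetical.

```python
import xml.etree.ElementTree as ET


def set_test_only(xml_text, enabled=True):
    """Add or update <testOnly> inside <RecoverySecondary> and return the XML."""
    root = ET.fromstring(xml_text)
    # The section may be the root element or nested somewhere below it.
    if root.tag == "RecoverySecondary":
        section = root
    else:
        section = root.find(".//RecoverySecondary")
    if section is None:
        raise ValueError("no <RecoverySecondary> section found")
    node = section.find("testOnly")
    if node is None:
        node = ET.SubElement(section, "testOnly")
    node.text = "true" if enabled else "false"
    return ET.tostring(root, encoding="unicode")
```

Keep a backup of the original file: ElementTree drops XML comments when it parses, so the rewritten file will not be byte-for-byte what the installer wrote. The SRM service still needs a restart afterwards.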
How can I capture the log and configuration information for support to work with?
This is most easily done after Update 1 by using the Generate Site Recovery Manager Log Bundle command in the VMware \ VMware Site Recovery Manager Start Menu folder. Run this command on the SRM server. It will produce a zipped file on your desktop. The file name will be in MM-DDYYYY-HH-MM.zip format, where the fields are Month, Day, Year, Hours, and Minutes. Always provide the logs with your request for help! I strongly recommend you use this method. Very often people send support just one of the support files, and support will not be able to help with that. They will need to wait for the other logs. Please, always send the entire log bundle that is created with this tool. It captures things like core dumps and configuration info as well as all of the log files!
You will need to check the vmware-dr-index file to see what is the current log file. Make sure to confirm the number from the index file to make sure you are working with the proper log file. In SRM 4.1 (4.0) the currently used log file will not be zipped, and the other files not in use will be zipped. For SRM 4.1 logs on Win2K8 R2 servers you can find the SRM log location below.
C:\ProgramData\VMware\VMware vCenter Site Recovery Manager\Logs
5. X is the value for the maximum file size.
6. Y is the value for the maximum number of files.
When you are finished it should look like the figure below.
These changes will not be active until you restart the SRM service. Make sure no one is using SRM before you do that! Also, don't forget to do this on both sides. In the example above, we are changing the settings to 10 MB in size, and keeping 100 copies. Remember that the 10 MB files will be gzipped to a very small size.
How can I tell the SRM version from the log files?
The first line of the SRM log files will hold the release info. The version=1.0.0 tells the version and build=build-97878 tells the build. One exception to this is SRM 1.0 Patch 3. It did not change the build level, and thus the log file will not reflect the proper build. You will need to check Add / Remove Programs to see if Patch 3 has been installed or not. I am told that this is now a test by QA so it should not be missed again. It is certainly on my test list!
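When you are sorting through many log bundles, pulling the version and build out with a script saves time. Here is a small sketch that assumes only that the first line contains the version=... and build=build-... tokens quoted above; the rest of the line layout is not assumed, and the function name is hypothetical. It cannot detect the SRM 1.0 Patch 3 case, since that patch left the build unchanged.

```python
import re

_VERSION = re.compile(r"version=(\d+(?:\.\d+)*)")
_BUILD = re.compile(r"build=build-(\d+)")


def parse_release(first_line):
    """Return (version, build) from a log header line, or None for missing parts."""
    version = _VERSION.search(first_line)
    build = _BUILD.search(first_line)
    return (
        version.group(1) if version else None,
        int(build.group(1)) if build else None,
    )
```

For example, a header containing version=1.0.0 and build=build-97878 gives the pair ("1.0.0", 97878).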
Installation logs
SRM 1.0 You can create an installation log using the command line parameters of /s /Vlve installlog.txt. The command line will look like:
VMware-srm-1.0.0-<build_number>.exe /s /Vlve installlog.txt
SRM 4.0 Installation logs are always created by default and can be found in C:\Documents and Settings\<user name>\Local Settings\Temp\vmsrminst.log. For installation logs on Win2K8 R2 they will be in a different location. That location is:
C:\users\Install_user\AppData\Local\Temp
You can also generate full logs with the command below, but you will need to execute it from the command line. The log file will be generated in the folder from which you execute the command.
VMware-srm-4.0.0-192291.exe /V/lve installfull.log
Automated Install
If you would like to have an automated install, you can use the following command line, but remember to add your own information to it!
vmware-srm-<version information>.exe /s /v"/qn AgreeToLicense=Yes DR_CB_HOSTNAME_IP=<DR hostname> DR_TXT_VCHOSTNAME=<VC hostname> DR_TXT_VCUSR=<Windows user> DR_TXT_VCPWD=<Windows user password> DR_TXT_LSN=<site name> DR_TXT_ADMINEMAIL=<administrator's e-mail address> DR_CB_DC=<SQL Server|Oracle> DR_TXT_DSN=<System DSN> DR_TXT_DBUSR=<DB user> DR_TXT_DBPWD=<DB user's password> DR_RB_CERTSEL=1 DR_TXT_CERTORG=<Arbitrary organization name> DR_TXT_CERTPWD=<arbitrary password> DR_TXT_CERTFILE=\"C:\Program Files\VMware\VMware vCenter Site Recovery Manager\bin\<VC hostname>.p12\" DR_TXT_CERTORGUNIT=<Arbitrary organization Unit> VC_CERTIFICATE_THUMBPRINT=<untrusted VC certificate thumbprint> DR_TXT_PLUGIN_DESC=<extension description> DR_TXT_PLUGIN_COMPANY=<company name>
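A command line this long is easy to get wrong by hand, so it can help to assemble it from a table of values. The sketch below is an assumption-laden helper, not a VMware tool: it uses only property names from the command above, follows the usual InstallShield convention that everything after /v is passed to the MSI inside one pair of double quotes, and escapes values containing spaces as \"...\" in the same way the DR_TXT_CERTFILE example does. The function name is hypothetical.

```python
def build_silent_install(installer, properties):
    """Build an 'installer /s /v"/qn NAME=value ..."' command line string."""
    parts = ["/qn"]
    for name, value in properties.items():
        if " " in value:
            # Inner quotes must be escaped because the whole /v argument
            # is itself wrapped in double quotes.
            value = '\\"' + value + '\\"'
        parts.append(f"{name}={value}")
    return f'{installer} /s /v"' + " ".join(parts) + '"'
```

Be careful where the generated string ends up: it contains the vCenter and database passwords in clear text, so avoid writing it to shared logs or scripts.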
Look for the line that looks like:
<directory>C:\Documents and Settings\All Users\Application Data\VMware\VMware Site Recovery Manager\Logs</directory>
Below it you will find a line that looks like:
<level>verbose</level>
You can change the verbose to trivia, which will generate more entries, or to info, which generates fewer. From least to most reporting the options are: error, warning, info, verbose, and trivia. It is important to understand that if you increase the level of detail, the logs will grow faster, files may rotate sooner, and you may lose what you need. You can change the rollover settings by using the information on page 37.
You can set a different level of logging at the sub-component level. You can have a default level of verbose for the overall log file, but one component could be set to something more detailed. Look for the sections in the config (vmware-dr.xml) file with the names from below. Some of the interesting component levels are:
Vmware-dr (DR service)
PrimarySanProvider (protected side array manager)
SecondarySanProvider (recovery side array manager)
SanConfigManager (managed storage configuration datastore computation)
You should confirm that changes like this are picked up. The change should be seen in the log as SRM starts, which confirms that the change you made has been accepted.
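The default log level can also be changed with a short script that refuses values outside the known set. This sketch assumes only what is described above: a <level> element sitting next to the <directory> element in vmware-dr.xml. The enclosing element name in the sample (<log>) and the function name are hypothetical.

```python
import xml.etree.ElementTree as ET

# Ordered from least to most detail.
LEVELS = ["error", "warning", "info", "verbose", "trivia"]


def set_default_log_level(xml_text, level):
    """Set the <level> that sits beside the log <directory> element."""
    if level not in LEVELS:
        raise ValueError(f"unknown log level {level!r}")
    root = ET.fromstring(xml_text)
    for parent in root.iter():
        node = parent.find("level")
        # Only touch the <level> that shares a parent with <directory>,
        # leaving any per-component levels alone.
        if parent.find("directory") is not None and node is not None:
            node.text = level
            return ET.tostring(root, encoding="unicode")
    raise ValueError("no <level> found next to a <directory> element")
```

As with any direct edit of vmware-dr.xml, back up the file first and restart the SRM service, then check the log at startup to confirm the new level was accepted.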
What happens when I Storage VMotion a protected VM, or how do changes to VM storage affect protection?
This is a very complicated area. For the most detailed and complete information please see the wonderful KB article at http://kb.vmware.com/kb/1009900. But here is some of the key information. A VM, to be recovered safely, must have all of the datastores that its storage uses recovered at once. Because of this, any changes to storage may require editing the PG. Storage VMotion, or even migration across PG boundaries, is generally not good, and you will need to revisit those VMs to confirm their protection: select the Virtual Machines tab in the RP and clear any unconfigured errors, or do the same in the original PG, or the new PG. If the protected VM is migrated to a replicated datastore that is not part of any Protection Group, it will stay protected and its datastore will be added to the PG.
If the protected VM is migrated to a replicated datastore that is part of some other PG, then the VM will become invalid and a user will need to re-protect it. I believe it will show a little yellow triangle.
SRM 1.0 licenses will not work in SRM 4.0, but new licenses can be obtained from the customer license portal if the customer has registered their existing SRM licenses. We no longer use Flex licensing in SRM. There is no longer a host license, and SRM does NOT require a license to run, only to protect VMs; that license is a 25-character key that defines what can be protected. The SRM server will continue to work even if it becomes unlicensed: SRM works, but no failovers. There is no cross-site license communication, so licenses will need to be installed at both sites if appropriate.
Evaluation licenses are checked once per 24 hours to see if they are still active, and this check is not done when there are no evaluation licenses or they have expired. Expiring licenses are managed the same way. Protected VMs are counted whether turned on or not, and the state of protected assets is reported to VC every five minutes. The SRM license in vSphere Update 1 or later will look different. See below for an example.
If you do not see the licenses you expect, this might be due to an odd issue that SRM has with licensing. While it uses FLEX licensing, if you only drop off the .lic file in the Licenses folder and reread the license file(s) you will not see something like the screen above until you restart SRM!
What does it look like if my vSphere is licensed for SRM after Update 1?
See the screen below for an example of a licensed SRM install.
What is the account that is asked for during install used for?
The 1.0 installer prompted for a username during installation. This is the account SRM will use to communicate with the local VC server. Since SRM constantly monitors the local VC inventory, this user will be constantly logged into the local VC server. Changing the password for this account will make it impossible to use SRM. Please note that this should be an account in the Administrators group. By default, when you install SRM 1.0 or SRM 1.0 U1, all accounts in the Administrators group have complete access to SRM managed objects. Again, this has not changed with U1. Please try to use AD accounts when you install SRM, and when you log into SRM. Using local accounts can work, but it is a little tricky. If you need some guidance on using local accounts I can help. This account is NOT the account the SRM service itself runs under; the SRM service uses the Local System Account.
Can network customization work for operating systems other than Windows?
Yes. This includes operating systems from Novell, and Red Hat. The specific version information can be found in the SRM Compatibility Matrix document. SRM 4.0 adds in Ubuntu as well to the Linux flavors that can be customized.
have a number of Normal priority VMs starting at the same time but spread across various ESX servers. However, High priority starts VMs serially regardless of how many hosts are involved. Misconfiguration of the security for storage arrays may impact the start order of VMs. For example, if the security of the array means it cannot talk to a particular ESX host, then that host will not be used to start VMs during a recovery plan. It is possible to see this without any obvious error messages!
third VM that had just finished replicating, was in fact started. In a non-test failover, this may perform differently as it depends on the storage and what stage the issues occur in.
SRM 1.x: This is easy and can be configured. Use the steps below:
1. Edit the vmware-dr.xml file on the protected side.
2. Add a <hostRescanRepeatCnt> element in the <SanProvider> element. The value of <hostRescanRepeatCnt> should be set to 2.
3. Make sure no one is using the SRM Plug-in, and restart the SRM service.
4. Now do the same thing on the recovery side.
Below is an example.
<SanProvider>
   . . .
   <hostRescanRepeatCnt>2</hostRescanRepeatCnt>
</SanProvider>
SRM 4.x and 1.x: You should confirm that changes like this are picked up. The change should be seen in the log as SRM starts, which confirms that the change you made has been accepted. See http://kb.vmware.com/kb/1008283 as this is now in the KB.
You may want to consider as well: VM Protection invalid. I am not sure what triggers this one!
With SRM 4.1 (4.0), these alerts are not part of the improved vSphere environment. So if you set to be alerted on Remote Site Up, you will be alerted very frequently! Remember that these alarms are configured at both the protected and recovery sites. Some of them are not necessarily appropriate on both or either side. Check out my blog for more information on this, and I will update it as necessary. It is at http://blogs.vmware.com/uptime/2011/02/recommended-alarms-for-srm-admins-to-watch.html .
Recovery Plan Test: started, ended, succeeded, failed, or cancelled
Virtual Machine Recovery: started, ended, succeeded, failed, or reports a warning
Some of these can be changed in how they are triggered. For example, the minimum disk space is 100 MB and you may wish to have it 500 MB. You can change disk, CPU or memory in the vmware-dr.xml file in the SRM config folder. Search for the terms below (in vmware-dr.xml) to see where to make the change, and then restart the SRM service.
Disk (minDiskSpace), where the default is 100.
CPU (maxCpuUsage), where the default is 80.
Memory (minMemory), where the default is 32.
3. Acquire enough Microsoft Developer Network (MSDN) subscriptions to license the OS and applications that will be used in the DR site. These are very low cost, but are fully functional and allow any development, non-production use. 4. Test and tune SRM using MSDN licenses until it works as desired.
When the customer is ready to test production failover, they may want to ask for permission from Microsoft to re-assign their licenses on a short-term basis. The failover test is permitted: the customer will re-assign all their licenses to the disaster site hardware. However, Microsoft rules state (with some specific exceptions) that re-assignment may not be done more than once every 90 days. The customer would need to either wait 90 days before testing the recovery phase, or ask Microsoft to acknowledge that they can test this critical business function without violating the terms of their license. I think an important note is that many corporate accounts have SA in enough volume to make this test process not an issue.
What vendors have application consistency options that work with continuous replication?
This is a little different in that with continuous replication it is hard to use agents to work with the point in time snapshot, because the replication is perhaps real time or maybe every 2 seconds, so there is not enough time to work with the agents to produce application consistent snapshots. So everything will show up as crash consistent. HDS has the ability to have application consistent continuous replication for physical machines but not virtual machines at this time. When I asked one of the architects at FalconStor about this, I got what is written below. Thanks very much, David!
For the Continuous Replication, the way we can achieve better than Crash Consistency is through our "Snapshot Director" (virtual appliance) and our Snapshot Agents (loaded in the VM's); the main difference, compared to using our "Periodical Replication", is that instead of creating a periodical replication point, we create a "Snapshot Marker" on regular intervals, and that marker gets replicated instantly to the remote site (the TimeMark is then created on arrival on the REPLICA volume). So the end effect is you get incredible RPO using Continuous Replication (any-point crash consistent state), but you also get the benefits of amazing RTO through "application consistent snapshot" via periodical snapshot markers that are trickled down to the DR site via continuous replication (but these quiescent application consistent snapshots are still periodical, thus spaced apart, as if we were doing periodical replication). As for the Continuous Replication question from your previous email, we do not play the snapshot quiesce action "offline".
We truly quiesce the VM's applications at the Protected site, but instead of waiting for a "replication interval" (as opposed to Periodical replication), the "state pointer" which is like a bookmark (aka Snapshot Marker) is inserted into the CDR Journal (Continuous Data Replication Journal) at the time right after the filesystem flush, and replicated immediately. The TimeMark is then created on the DR Replica disk, almost right away, as opposed to having to wait for an upcoming replication session (when using Periodical). So bottom line --> VM's are quiesced, but no NSS Snapshot is created on Protected Site, instead, we just insert a bookmark in the I/O Journal queue, and as the journal is flushed out to the remote site, when the remote site processes the bookmark, it creates the TimeMark on the Replica disk.
As I learn more about this I will share what other vendors can do.
Protection SRM Administrator role at the SRM site recovery root level (propagate)
Protection Groups Administrator role at the SRM protection groups level (propagate).
Recovery Site
Recovery Inventory Administrator role at the vCenter root
Recovery Datacenter Administrator role at the datacenter level (propagate). Include Virtual Machine Interaction, Host CIM and Rename Datastore
Recovery Host Administrator role at the host level and cluster (include Browse Datastore, Assign VM to Resource Pool, Reset Guest Information, Console interaction, Power ON/OFF and Reset)
Recovery Virtual Machine Administrator at the resource pool and folder levels (propagate). This did not work at a customer unless assigned at cluster level.
Recovery SRM Administrator at the SRM root level (propagate)
Recovery Plans Administrator at the SRM recovery plans level (propagate).
SRM service does not start, and event logs show errors with event IDs 7000 and 7009
This will not normally be seen in a production environment where SQL / SRM / VC are well designed, but in a lab with limited resources this can and does happen. This is due to the Windows Service Control Manager expecting a Service started successfully message within 30 seconds. You can change a global setting to increase the timeout from 30 to 60 seconds, and it appears that will solve this issue. Use the steps below to make this improvement.
1. In Registry Editor, locate, and then right-click the following registry subkey:
HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control
2. Point to New, and then click DWORD Value. In the right pane of Registry Editor, notice that New Value #1 (the name of a new registry entry) is selected for editing.
3. Type ServicesPipeTimeout to replace New Value #1, and then press ENTER.
4. Right-click the ServicesPipeTimeout registry entry that you created in step 3, and then click Modify. The Edit DWORD Value dialog box appears.
5. In the Value data text box, select Decimal, type 60000, and then click OK (the value is in milliseconds).
Now you can restart and you should have no issue with SRM starting since there is more time for SQL and VC to start. Thanks to Scott for this great info!
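If you rebuild lab servers often, the same registry change can be captured as a .reg file and imported instead of clicking through Registry Editor. The sketch below just generates the text of such a file; the function name is hypothetical, while the key, value name, and 60000 millisecond value are the ones from the steps above. A .reg file stores DWORD values as eight hexadecimal digits, so 60000 appears as 0000ea60.

```python
def services_pipe_timeout_reg(milliseconds=60000):
    """Return .reg file text that sets ServicesPipeTimeout to the given value."""
    if not 0 <= milliseconds <= 0xFFFFFFFF:
        raise ValueError("a DWORD must fit in 32 bits")
    return (
        "Windows Registry Editor Version 5.00\r\n"
        "\r\n"
        "[HKEY_LOCAL_MACHINE\\SYSTEM\\CurrentControlSet\\Control]\r\n"
        # DWORD values are written as 8 hex digits.
        f'"ServicesPipeTimeout"=dword:{milliseconds:08x}\r\n'
    )
```

Save the returned text to a file with a .reg extension and import it on the SRM server as an administrator. A restart is still needed afterwards.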
How can I have syntax highlighting to help read SRM log files?
This is very useful and can be done on both the Mac and PCs with a little work. Much less work now that I have done it for you! On the Mac you need to use TextWrangler (http://www.barebones.com/products/TextWrangler/) and on the PC you need to use EditPlus (http://www.editplus.com/ ). This is possible with other editors, but I only show you these two. TextWrangler is freeware but EditPlus is only shareware. It is very popular with developers.
TextWrangler
In the appendix there is a sample file that you can copy and paste to create a text file called log.plist. Then use the following steps to make it live.
1. If necessary create a folder called Language Modules in your ~/Library/Application Support/TextWrangler folder.
2. Move your file to this folder.
3. You should now be able to open a file that has an extension of .log and see words like error in color.
4. The Preferences file can help adjust as necessary. In the Suffix Mappings section you can map the .log to the Language Module called Log due to the filename.
See below for the end result.
EditPlus
In the appendix there is the information to copy and paste that you can use to create a text file called log.stx. Use the following steps to make it live.
1. You will need to copy this file to the C:\Documents and Settings\user_name\Application Data\EditPlus 3 folder.
2. Now in the Documents \ Permanent Settings we need to add this new file into EditPlus.
3. Under the Settings & syntax menu, you will need to define a log file type.
4. In the File extensions section add the log type.
5. In the Syntax file section you should load your log.stx file.
See below for a completed set of preferences as well as a sample file.
Troubleshooting
This information will help with troubleshooting of SRM and SRM related issues.
You can often search the log for things like error] but also you can search it for what you see in the history report. You can use the date / time of the report / error to look for information. Some other things that may be useful to search for include credentials, failure or warning. Some of these may occur naturally so be careful. The start of a recovery plan in the logs looks something like RootStepList-xxx HAS [-1] CHILDREN.
Also, where possible, it is useful to troubleshoot when the Continue function has not been issued, so the RP is in effect paused. This gives you access to storage that you will not have when the RP is complete and cleaned up. Always check the SRM compatibility matrix for your situation (with things like ESX patch level or SAN compatibility), but also do not forget that the SRA often has prereqs that you need to worry about. If the Create Protection Group is grayed out, that generally means that SRM cannot see the storage. Use the Array Manager configuration LUN view to see if there are any clues. That generally has helped me. There are some odd things that you need to remember, such as that an attached CD can be an issue in a failover. Sometimes it is worth starting vmware-dr.exe to see if you can see anything that can help. This is particularly useful when you have tried to start the SRM service and it fails, but nothing is seen in the SRM logs. This can mean a problem occurs that stops SRM from starting before it can touch the logs. Always check the release notes as well!
I have heard that if you go from 300 to 1500, some EMC SRAs will not error and will work. I also know that the next generation of EMC SRAs will be much faster. The previous information is for SRM 1.0. For SRM 4.0 you will make this change in the Advanced Settings dialog and will not need to restart the service.
When you use Windows Authentication to access the DB you must run the SRM service as the DB user account. When using SQL Authentication, you can leave the default local System user.
Why does my recovery plan show error on VM status but the VMs are ok?
The cause of this error was introduced in Update 3 of ESX. We adjusted the frequency of how often we check for the VMware Tools heartbeat. While the VMs are recovered successfully, the history report does show errors on the VMware Tools Status. You can adjust the Recovery Plan Response Times wait for OS heartbeat from 300 to 450 and you will get rid of the error status, but the test will take longer! You can also use the following information for a better fix, instead of adjusting the Recovery Plan wait for Tools timeout. Start in the ESX Service Console at the command prompt. Edit the /etc/vmware/hostd/config.xml file. You will need to change the vmsvc section (which will look like <vmsvc>). There may be a line that starts with <heartbeatDelayInSecs>XX; once it is located, change the value of XX to 40. If not, you will need to add the complete section, which will look like below.
<vmsvc>
   <heartbeatDelayInSecs>40</heartbeatDelayInSecs>
   <enabled>true</enabled>
</vmsvc>
You will need to restart the management agent to load this change. The command is service mgmt-vmware restart, followed by service vmware-vpxa restart. It is important to note that this may cause an issue with restarting VMs. Avoid this by disabling the automatic startup and shutdown of VMs. See more about this at http://kb.vmware.com/kb/1003490 . You will find info below about avoiding the Shutdown Tracker situation. This recently became a supported fix: http://kb.vmware.com/kb/1008059 . In addition, as of 1/30/09, there is a fix so this manual work is not required. Find out more at http://kb.vmware.com/kb/1006651 .
At this time I am not aware of a way to implement this change in ESXi. 8/8/10 AOK.
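If you would rather script the config.xml edit than do it by hand, here is a minimal Python sketch. It assumes the vmsvc section sits somewhere under the root element of the file, and the function name set_heartbeat_delay is my own. Note that ElementTree discards XML comments when it parses, so work on a backup copy and compare the result before you put it in place.

```python
import xml.etree.ElementTree as ET


def set_heartbeat_delay(config_xml: str, seconds: int = 40) -> str:
    """Return config_xml with <vmsvc><heartbeatDelayInSecs> set to `seconds`."""
    root = ET.fromstring(config_xml)
    # Assumption: <vmsvc> is the root itself or sits somewhere below it.
    vmsvc = root if root.tag == "vmsvc" else root.find(".//vmsvc")
    if vmsvc is None:
        # Section missing: add the complete section described in the text.
        vmsvc = ET.SubElement(root, "vmsvc")
        ET.SubElement(vmsvc, "enabled").text = "true"
    delay = vmsvc.find("heartbeatDelayInSecs")
    if delay is None:
        delay = ET.Element("heartbeatDelayInSecs")
        vmsvc.insert(0, delay)
    delay.text = str(seconds)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    sample = "<config><vmsvc><enabled>true</enabled></vmsvc></config>"
    print(set_heartbeat_delay(sample))
```

You would still need to restart the management agent afterwards, exactly as with the manual edit.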
ESX 2.5 accessing protected datastore will cause recomputed datastore failures
If you have ESX 2.5 hosts accessing a protected datastore you will see recomputed datastore failures. Remove the ESX 2.5 host from the datastore. This was fixed in Update 1 of SRM.
Unable to find any array script files Please check your SRM installation
This can mean a few things. For example, SRM could be installed on D:\ while your EMC Solutions Enabler is installed on C:\. This error can also occur with any storage vendor if you have not restarted SRM after installing their SRA.
My Linux VMs don't have the hosts file changed after IP customization
This was a bug in SRM 4.0. The IP customization on the Linux VM does actually work, except for the change to the hosts table. This was fixed in 4.0.1 and later.
dr.secondary.fault.WrongVmInventoryPlacement
This error can occur when you are creating a PG and have mapped inventory items that are not compatible. For example, a mapping between an ESX host with VMs that are at virtual hardware version 7 (VH7) and an ESX 3 host. Basically it means that the host, network, or resource pool are not compatible at the other side. The log will have errors like:
[2009-09-19 18:27:04.252 06140 verbose 'Replication'] Creation of shadow VM failed with error (dr.secondary.fault.WrongVmInventoryPlacement) {
[#2] dynamicType = <unset>,
[#2] faultCause = (vmodl.MethodFault) null,
[#2] resourcePool = 'vim.ResourcePool:resgroup-392',
[#2] datastore = 'vim.Datastore:datastore-2840',
[#2] host = <unset>,
[#2] msg = "Host, resource pool and datastore are not compatible.",
[#2] }
You will need to discover what resource group 392 and datastore 2840 are in order to find where the conflict is. You can do this by pointing a browser at the Managed Object Browser of the recovery-side VC with the following URL:
https://vc_recovery_side/mob/mob?moid=datastore-2840
Substitute the other identifiers (for example resgroup-392) for datastore-2840 as appropriate.
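When the fault lists several objects it can be handy to pull every managed object ID out of the log text and build the lookup URLs in one go. The sketch below does that in Python. The host name is a placeholder you supply, the URL pattern is copied from the text above, and the function name mob_urls is my own.

```python
import re

# Matches managed object references such as 'vim.Datastore:datastore-2840'.
MOREF = re.compile(r"'vim\.(\w+):([\w-]+)'")


def mob_urls(fault_text: str, vc_host: str) -> dict:
    """Map each managed object ID found in the fault text to a MOB URL."""
    urls = {}
    for _kind, moid in MOREF.findall(fault_text):
        # URL pattern taken from this guide; vc_host is a placeholder.
        urls[moid] = f"https://{vc_host}/mob/mob?moid={moid}"
    return urls


if __name__ == "__main__":
    fault = ("[#2] resourcePool = 'vim.ResourcePool:resgroup-392', "
             "[#2] datastore = 'vim.Datastore:datastore-2840',")
    for moid, url in mob_urls(fault, "vc_recovery_side").items():
        print(moid, url)
```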
Pairing Issues
If pairing fails at approximately 24%, it could be related to the license file not being live or installed. Reread the license file or restart the license service. If it fails at approximately 82 or 84%, make sure that the account you used to connect to the Recovery site has both VC and SRM admin rights. The specific SRM role on the Protected site is Protected Site Administrator, and on the Recovery site it is called Recovery Site Administrator. This issue occurs most often in a Microsoft domain environment. The Administrator role includes both the Protected and Recovery site admin roles. Other things to check when troubleshooting pairing issues include firewalls between the sites and whether VC is running successfully at the recovery site.
I cannot run more than one simultaneous recovery plan with my MirrorView SRA
I need to run more than one recovery plan at a time so that I can cut my RTO, but I have not been able to do that with my MirrorView SRA. I can in fact do it with other SRAs, so I am curious. This is (as of 12/12/09) correct, and it is a precautionary measure to provide better response time while running the recovery plan. In the future NaviSphere engineering will make some design improvements that will allow additional simultaneous recovery plan operations. In the meantime, with no support, and in a lab, for experimental purposes only, you can change the limit from 1 simultaneous plan to 3 with a registry change. The registry change should be made on the SRM server. The key is HKLM\SOFTWARE\EMC\MirrorViewSRA\Options\NumSimultaneousInvocationsAllowed with a value of 3.
Important note: this information is for discussion and is not indicative of what you can expect. Remember there are a lot of variables! But the info below can be used to understand whether you are seeing good numbers or not. Another important note: I have seen tests where I thought everything was the same, and yet they took different times, sometimes different by even 5 or 6 minutes. So the test times indicated below are very rough and should not be taken too seriously.

Test 1
Number of VMs: 8 Windows
Scripts and / or IP Customization: No / No
Storage Information: Virtual FalconStor running on dedicated physical FalconStor hardware.
Time for a test (including clean up): 12 minutes

Test 2
Number of VMs: 8 Windows
Scripts and / or IP Customization: none
Storage Information: Virtual FalconStor running on dedicated physical FalconStor hardware.
Time for a test (including clean up): 29 minutes
A customer reported to me that he failed over 100 virtual machines, from 12 protection groups, in 120 minutes. I have no other info on that. But I am looking for more information for this section.
Failed to connect to the management system address when executing the discoverArrays command.
You should not often see this, but it can be addressed by making sure the SRA is in fact installed on the recovery side. You may also need to check routing between the sites (in particular to the Recovery side SRA / storage management interface). This can occur after storage is mounted, but the datastore cannot be found. There are MANY causes of this error. Use storage troubleshooting to figure it out. Before continuing the test, check the storage and confirm it is readable and has VMs. You need to find the boundary between working and not working in the storage world and then deal with that. I have seen this with the MirrorView SRA and its odd ports, as well as with RecoverPoint.
This is not something you need to do often. In fact I never have. It would perhaps be useful if you suspect your database information is corrupt.
SRM 4.0
In SRM 4.0 you can use the Change option for SRM in the Add / Remove control panel applet. It allows you to make a number of changes, including changing the VC account / password, deleting the contents of the SRM database, and more.
Error LUNs with duplicate IDs or numbers received from SAN integration scripts
When adding an array in the array configuration manager you may see this error in a popup window. In this example it occurred in an EMC Symmetrix and SRM 1.0 U1 environment. In the SRM logs you could see the same WWN for all LUNs. You will need to talk to the storage team and make sure the correct flags are set on ALL FA ports. EMC will normally recommend that the following flags are set on all FA ports in an ESX environment.
Common serial number (C)
Auto negotiation (EAN) set
Fibrepath enabled on this port (VCM)
SCSI 3 (SC3) set (enabled)
Unique world wide name (UWN)
SP-2 (Decal) (SPC2) flag is required
Unable to create placeholder virtual machine at the recovery site: host, resource pool, and datastore are not compatible
This is a frustrating error message. I first saw it when I started using distributed switches at one site and not the other. This error message means that you have mapped resources that are, for some reason, not compatible. One simple example is when you have mapped a VM network to a network that one host doesn't have access to. You should also confirm that the Shadow VM location is visible to all hosts at the recovery side. You will need client and server logs to investigate this further. Another cause of this issue can be mapping between a 4.x cluster and a 3.x cluster. You can map between a 3.x and a 4.x cluster, which will work for failover but not failback. I also saw this once after an SRM service restart during a test recovery. Restarting both VC and SRM servers solved it.
Network device needed by recovered virtual machine could not be found at recovery or test time
This error will occur when your protected virtual machines are using dVS switches. With 4.0 or 4.0.1, dVS does not work even though it is supposed to be supported. This problem is in two parts: the first is a cosmetic issue in VC, and then there is the error above, which stops a recovery from being successful. As of 5/22/10 there is a patch, confirmed to work, available from GSS, which means you need an SR to get it. In our next major release, and in our next patch, we will include this fix. Both of these will be available in the summer of 2010. The VC issue will be fixed in vSphere VC 4.0 Update 2. To confirm you have this issue, look for NetworkDeviceNotFound in your SRM log. A few lines after that error you will see dvportgroup-xxxx messages. In the History Report you will get an error something like "Network device needed by recovered virtual machine couldn't be found at recovery or test time". Update: SRM 4.1 doesn't have these issues, and if you use SRM 4.0.2 and VC 4.0 U2 you will not have this issue. The KB article can be found at http://kb.vmware.com/kb/1019890 .
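SRM logs get big, so here is a small Python sketch of the check described above: find each NetworkDeviceNotFound line and collect any dvportgroup identifiers that show up in the lines right after it. The window of 10 lines is my own guess at "a few lines", and the function name is hypothetical.

```python
import re

DVPG = re.compile(r"dvportgroup-\w+")


def find_dvs_failures(log_lines, window=10):
    """Return (line_number, [dvportgroup ids]) for each NetworkDeviceNotFound hit.

    log_lines is a list of strings; window is how many following lines to scan
    (an assumption, tune it to your logs).
    """
    hits = []
    for i, line in enumerate(log_lines):
        if "NetworkDeviceNotFound" not in line:
            continue
        following = log_lines[i + 1 : i + 1 + window]
        groups = [g for text in following for g in DVPG.findall(text)]
        hits.append((i + 1, groups))
    return hits


if __name__ == "__main__":
    sample = ["start", "error NetworkDeviceNotFound", "backing dvportgroup-123", "end"]
    print(find_dvs_failures(sample))
```

A hit with an empty list of port groups is worth a manual look, since the error may then have a different cause.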
SRM doesn't start and there is nothing in the SRM logs or event logs. What to do?
The reason nothing is in the SRM logs is that SRM really hasn't started yet, so it is no surprise that there is nothing in the event logs either. I have seen this several times and there are two things to think about.
1. Use depends.exe to determine what missing DLL is hurting SRM. I once had SRM not start for me and it was due to a missing DLL by the name of MSVCP71.dll. By using depends.exe to start vmware-dr.exe (the SRM service) I was able to determine what DLL was missing and replace it with a copy from a different SRM server. Incidentally, depends.exe comes with Visual Studio.
2. Start vmware-dr.exe manually and you may see a message such as msg=Login failed due to a bad username or password. This can occur after changing the password that is tied to SRM. The message may or may not be in the SRM log file; if it is there, it can be hard to find.
The question is what RDF is it talking about, and which options file? In the adapters directory on the recovery side there should be a file called EmcSrdfSraOptions.xml. In that file you need to specify the R2 devices and their associated BCV pairs as part of the <TestFailoverInfo> information. You need to find the associated BCV device names for each of those devices, for example by using the "symmir" command and specifying the device group containing those devices. Then modify EmcSrdfSraOptions.xml to include entries in the <TestFailoverInfo> stanza such as (for example if 477's BCV is 35F)
<DevicePair>
  <Source>0477</Source>
  <Target>035F</Target>
</DevicePair>
Then run the test again, since these are the "options" that the SRDF adapter is looking for. You will have to create this pairing information for each R2 device you plan to test. The output from the adapter will summarize what it thinks is specified in the EmcSrdfSraOptions.xml file, for example if the output has:
[#4] [07/16 08:57:16 EmcSrdfSra.cpp 0655 SrdfSraOptionsReader::DisplaySrdfSraOptions] save_pool_name = n/a
[#4] [07/16 08:57:16 EmcSrdfSra.cpp 0673 SrdfSraOptionsReader::DisplaySrdfSraOptions] devices = n/a
0676 SrdfSraOptionsReader::DisplaySrdfSraOptions]
Where the output shows "devices = n/a", the adapter thinks you haven't set any DevicePair settings. After you modify EmcSrdfSraOptions.xml you can also run the adapter binary by hand (EmcSrdfSra.exe -env), where the env flag will cause it to print out what it thinks is in the options file. EMC can probably give more details as to the purpose of the options file. This all assumes you are using standard Timefinder for snapshots; if you are using BCV clones you will need to modify the EmcSrdfSraOptions.xml file accordingly, including specifying the save pool name.
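If you have many R2 devices to test, typing the DevicePair entries by hand gets old. Here is a minimal Python sketch that generates them from a list of pairs. It only uses the element names quoted above (TestFailoverInfo, DevicePair, Source, Target); the real options file may need other elements inside the stanza that are not shown in this guide, so treat the output as a fragment to paste and check rather than a complete file. The function name is my own.

```python
import xml.etree.ElementTree as ET


def build_test_failover_info(pairs):
    """Build a <TestFailoverInfo> fragment from (r2_device, bcv_device) pairs.

    Device names are passed as strings so leading zeros (e.g. "0477") survive.
    """
    info = ET.Element("TestFailoverInfo")
    for source, target in pairs:
        pair = ET.SubElement(info, "DevicePair")
        ET.SubElement(pair, "Source").text = source
        ET.SubElement(pair, "Target").text = target
    return ET.tostring(info, encoding="unicode")


if __name__ == "__main__":
    # The example pair from the text: device 477 with BCV 35F.
    print(build_test_failover_info([("0477", "035F")]))
```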
For SQL server use, does the SRM DB user need the DB_OWNER permission?
For SQL Server, the SRM DB user doesn't need the DB_OWNER permission. As long as the schema has the same name as the username, is the default schema for that user, and is owned by that user, you are ok.
Without the password you will need to use the thumbprint. So run this command the first time without the thumbprint parameter and you will be shown the thumbprint, and then run it again with the thumbprint. If your site name contains spaces, enclose the name in quotes. You will need to worry about this if you cannot get the SRM service to start. You will see messages in the error log about ERROR 1920 Service VMware SRM Service (vmware-dr) failed to start. You can see a little more about this on page 44. This is easier in SRM 4.0 and is covered in the admin guide.
My recovery site is only using x number of hosts to start VMs but it should be using y number
When I experienced this, it was because the host that was not starting VMs did not have access to the storage array. That in turn was because it did not have a VMkernel port, which LHN required. I have seen this with other vendors where no security had been configured between the ESX host in question and the storage array. There are no error messages associated with this situation, so make sure you test for it. I have seen a similar error where the single host at the recovery site didn't have an IP entered for the iSCSI array. In addition, make sure that DRS is healthy. If there are wide deltas between the build / patch levels of the hosts in the cluster, it is possible that certain hosts will not be used by SRM since DRS is not using them. Test that all hosts can be used by VMotion by putting the hosts, one by one, in and out of Maintenance mode to confirm things are ok.
[2009-08-04 21:15:18.077 'DrServiceInstance' 1768 warning] Initializing service content: Unexpected exception 'class Vmacore::Xml::XMLParseException' unclosed token
[2009-08-04 21:15:18.077 'App' 1768 error] Application error: unclosed token. Shutting down ...
[2009-08-04 21:15:18.187 'App' 6344 info] [serviceWin32,414] vmware-dr service stopped
Above is an example of what you might see in the SRM log files when the SRM database is corrupt. You can restore the database if necessary, but make sure to do it on both sides and have SRM not running when you do it.
customization specification on the recovery site. Remember that you can export and import customizations, so if necessary it doesn't take much to move them between your protected and recovery sites.
and it continues on . . . . And add the line <TaskMax>20</TaskMax> so the section will look like:
<vmacore>
  <threadPool>
    <initializedCOM>mta</initializedCOM>
    <TaskMax>20</TaskMax>
  </threadPool>
and it continues on . . . . Remember to restart the SRM service. Currently, the value of TaskMax is 10, and sometimes that is not enough. We will increase the value for it to 20 in current releases.
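For what it is worth, the TaskMax edit can also be scripted. This Python sketch adds or updates TaskMax inside the vmacore / threadPool section of whatever XML text you give it. The name of the configuration file is not given in this part of the guide, so the sketch works on a string and leaves reading and writing the file to you. The function name is hypothetical, and ElementTree drops XML comments, so keep a backup.

```python
import xml.etree.ElementTree as ET


def set_task_max(config_xml: str, value: int = 20) -> str:
    """Return config_xml with <vmacore><threadPool><TaskMax> set to `value`."""
    root = ET.fromstring(config_xml)
    # Assumption: <vmacore> is the root or lives somewhere below it.
    vmacore = root if root.tag == "vmacore" else root.find(".//vmacore")
    pool = vmacore.find("threadPool") if vmacore is not None else None
    if pool is None:
        raise ValueError("no <vmacore><threadPool> section found")
    task_max = pool.find("TaskMax")
    if task_max is None:
        task_max = ET.SubElement(pool, "TaskMax")
    task_max.text = str(value)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    sample = ("<Config><vmacore><threadPool><initializedCOM>mta</initializedCOM>"
              "</threadPool></vmacore></Config>")
    print(set_task_max(sample))
```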
This can also occur when you have a cluster that you are recovering to and some of the hosts in the cluster do not have access to the storage! For example no iSCSI access to the recovering storage arrays.
Net::SSLeay::load_error_strings
This comes from the Perl module for OpenSSL, which is required by some SRAs (such as NetApp) and means that perl is not installed on the recovery SRM server.
Is there a limitation of DR failover LUNs for some iSCSI arrays and some Hosts?
There is a hard limit of 64 iSCSI arrays per host. However, when using SRM there is a limit of approximately 23 recovery iSCSI LUNs on the recovery side only. For more information about this please visit http://kb.vmware.com/kb/1005867 . This is not specific to SRM but to any DR setup you might test.
Can I have a VM with multiple VMDKs spread across two NetApp SRAs?
No. If you have one VM, with two VMDK files, and one is on the NetApp FC / iSCSI SRA, and one is on NetApp NFS you will get an error. This is true for any SRAs. You cannot spread a VM between arrays.
Technically it's not an adapter problem because the adapter successfully returned the replicated LUN. However, the shadow VM needs to be on a temporary datastore at the recovery site, and this datastore name looks a little strange. Further up in the log I see that datastore:
[2009-02-11 16:16:45.460 'SecondarySanProvider' 9896 verbose] Adding datastore 'DATASTORE-SRMVDISK1' with MoId 'datastore-220' and VMFS volume UUID '4992af7e-6a5f6312-7a66-001cc4bd0c2e' spanning 1 LUNs
Hmm, the protection site has a datastore with the same name as the recovery site ... could it be that the customer has somehow exposed the replicated datastore to the recovery site and is trying to use it as the temporary datastore? Further up in the log I see that the datastore UUID is:
[2009-02-11 16:16:45.382 'SecondarySanProvider' 9896 trivia] Added vmfs extent 'host-69;vmhba1:0:2' with key 'host69;4992af7e-6a5f6312-7a66-001cc4bd0c2e;0' LUN vmhba1:0:2
So, it seems the customer thought they needed to specify the replicated datastore as the shadow VM datastore, so perhaps they split replication, made the replicated datastore visible, then resynchronized replication (so the remote LUN is read-only). Now when SRM tries to create the shadow VM there, the creation fails. The customer corrected the issue by selecting a non-replicated datastore at the recovery side for the shadow VMs.
See more info in http://kb.vmware.com/kb/1017126. This very new KB article shows a different method than above. That may be due to the process above being old and not usable any longer. I will test this when I get a chance and correct as necessary. But for now, use the process above if it works!
Install hangs at 90%, and install log shows VIEINSTUTIL: Failed to open service control manager
This error can occur when you are installing SRM with a partial admin account. In point of fact you are missing the privilege to add a service. Redo the install with an admin account.
Fact
Application SW: MirrorView Insight for VMware 1.4.0.16

Symptom
Error when executing MirrorView Insight for VMware
Operation failed...Details: VI API Version 4.1 is not supported

Cause
At the time of the MirrorView Insight for VMware (MVIV) release in 2009, vCenter Server 4.1 was not yet available and the official support was only for VMware Virtual Center Server v2.5u2 and vCenter Server 4.0. Subsequent to the release of vCenter Server 4.1, MVIV was qualified with vCenter Server 4.1. However, to enable MVIV to recognize vCenter Server 4.1 as a supported version, a registry key must be added.

Fix
Follow these steps to create or modify the following registry entry:

For 64-bit machines
1. Start the registry editor.
2. Navigate to: My Computer\HKEY_LOCAL_MACHINE\SOFTWARE\Wow6432Node\emc\MirrorViewInsightForVMWare\Preferences
3. Modify the "SupportedVIAPIVersion" data so it reads as follows: 2.5u2;4.0;4.1
4. If this entry is not there, create a new string value of "SupportedVIAPIVersion" with the data of 2.5u2;4.0;4.1. The entry should look like this:
SupportedVIAPIVersion REG_SZ 2.5u2;4.0;4.1

For 32-bit machines
1. Start the registry editor.
2. Navigate to: My Computer\HKEY_LOCAL_MACHINE\SOFTWARE\emc\MirrorViewInsightForVMWare\Preferences
3. Modify the "SupportedVIAPIVersion" data so it reads as follows: 2.5u2;4.0;4.1
4. If this entry is not there, create a new string value of "SupportedVIAPIVersion" with the data of 2.5u2;4.0;4.1. The entry should look like this:
SupportedVIAPIVersion REG_SZ 2.5u2;4.0;4.1
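If you have to make this change on more than one machine, a .reg file saves some clicking. The Python sketch below only builds the text of such a file for the 64-bit or 32-bit key path given above; it does not touch the registry itself. The function name reg_file is my own, and you should review the generated text before importing it with the registry editor.

```python
KEY_64 = r"HKEY_LOCAL_MACHINE\SOFTWARE\Wow6432Node\emc\MirrorViewInsightForVMWare\Preferences"
KEY_32 = r"HKEY_LOCAL_MACHINE\SOFTWARE\emc\MirrorViewInsightForVMWare\Preferences"


def reg_file(is_64bit: bool, versions=("2.5u2", "4.0", "4.1")) -> str:
    """Return .reg file text that sets SupportedVIAPIVersion as a string value."""
    key = KEY_64 if is_64bit else KEY_32
    data = ";".join(versions)
    return (
        "Windows Registry Editor Version 5.00\r\n\r\n"
        f"[{key}]\r\n"
        f'"SupportedVIAPIVersion"="{data}"\r\n'
    )


if __name__ == "__main__":
    print(reg_file(is_64bit=True))
```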
SRM LUN discovery, test, failover fail with file write errors
Brock reported this to me so thanks very much for that. It is related to IBM SVC but it is an interesting one. The SRM log will show Error writing to C:\users\srmadmin\appdata\local\temp\vmwaresrmadmin\dr-sanprovider6984-0 or something similar. For the solution and more details see http://kb.vmware.com/kb/1033871. It turns out that this is caused by a Java garbage collection issue!
LeftHand Networks
The LHN adapter requires the account / password of the CMC management app. The protected side array configuration should reference the SRA installed on the protected side! I originally wrote that both IP fields should contain the same IP information, which should be the VSA on the protected side. Update: the two IP fields for the LHN SRA do not need to contain the same IP information, nor do both need to be filled in. Only the first one needs to be used. The SRA must talk to a manager, and NOT to a virtual IP. You can put more than one IP address in the fields by separating them with a comma. If you have five managers it would be a good idea to put at least two of them into the first, or first and second, IP fields.
The original certified version is 7.0.01.6066, and after that it was 8.0.00.1682. There is a new version of the VSA and of the SRA and they both work well with Update 1 of SRM. The current version of LHN is 9.0 and the SRA is 9.0.0.3561 (11/11/10). There are a lot of new features in the SAN/iQ software, but there are not many changes required for this document in terms of install and configure of the array.
An old report of Lessons Learned is still interesting at http://frankdenneman.nl/2009/10/lefthand-sanlessons-learned/ . Good info on using this excellent gear.
Miscellaneous Information
When you install your VSAs, make sure to follow the LHN instructions step by step. Then on your protected site use the wizards to configure the VSA to present storage to the protected site ESX server. Make sure it is seen in ESX before continuing. Once this is done you can work on the recovery site VSA, but your configuration will be different. Create a Remote Scheduled Copy from the protected site to the recovery site. As part of this, create a remote volume on the recovery side. If you stop now you will apparently have working shared storage that is replicating, but you will get the error mentioned in the Appendix about being unable to access the VM configuration. You will need to use the Tasks menu in the CMC to create a Volume List and then an Authentication Group. Once this is done your Recovery Plan should work fine. All storage vendor SRAs require a restart of the SRM service after the install.
The LHN VSA uses remote scheduled copies to do the replication, and this means that while the test failover is progressing the remote copy process is not stopped. One of the remote copies is mounted for the recovery site to work with, but that doesn't stop the replication / copy process. I recently upgraded from 8.0 to 8.1 and had a few interesting things happen. I forgot to upgrade my SRA, so after I did the VSA upgrade my test failover failed. It only took 5 seconds to fail. The error message in the history report was almost misleading. It looked like a credentials issue: it said that it failed to authenticate with the array management system during a test failover. If it had only said it failed to authenticate, that would have been true, but with the extra information it pointed to a different issue. Upgrading the SRA cleared this issue.
NetApp
When using SRM and NetApp with NFS and OnTap version 8, you may have a configuration issue stopping you from a successful configuration. More info on this can be found at the link below, but the workaround is simple. Make sure you put your NFS IP address into the NFS IP field, even if you think it is not necessary because you are using the same IPs.
http://now.netapp.com/NOW/cgi-bin/bol?Type=Detail&Display=464045
This SRA requires the account / password to the simulator or NetApp device, but it only has one IP address field compared to the Left Hand SRA, which has 2. The minimum software version on the NetApp devices that the NetApp SRA requires is 7.2.2. Of significant note, the simulator can be one VM but run two instances, so that it can provide both the protected and failover storage with much less resource usage. NetApp has mentioned that they designed the SRA to support simultaneous recovery plan operations but did not test it extensively.
By default OnTap doesn't have SSH enabled. It should be enabled for SRM, but this is not required. NetApp uses Flexclone to provide storage for the test failover, which means the replication of data is not impacted during the test.
The NetApp SRA uses SSL to talk to the NetApp controller and there are no other ports required. However, I have another report that it uses unsecure HTTP over port 80. This can be configured as something else. I have confirmed it works by default over HTTP but that it can in fact be changed to use HTTPS.
When working with NetApp it is worth having one LUN per volume. Storage is sometimes configured as a very large volume that has multiple LUNs on it, but for the best flexibility when working with SRM it is best, where possible, to have one LUN on each volume. If you have recovery site igroups configured in your protected site igroups you will have errors. Don't do this.
When reading through SRM logs, you will sometimes see lines that start with <StoragePort id=> and what you see after the =, when it starts with 50:0A, it means NetApp. This is useful to know if you think you are using something else such as IBM or EMC. You can troubleshoot communication by using a browser and connecting to the filer as http://aa.bb.cc.dd/na_admin and you should get the FilerView page.
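If you want to check a whole log for this rather than eyeball it, the following Python sketch pulls out StoragePort ids and keeps the ones that start with 50:0A. The exact layout of the StoragePort element in the log (for example whether the id is quoted) is an assumption on my part, and the function name is made up.

```python
import re

# Assumption: the id follows 'id=' and may or may not be wrapped in quotes.
PORT_ID = re.compile(r'<StoragePort id="?([0-9A-Fa-f:]+)')


def netapp_ports(log_text: str):
    """Return StoragePort ids that start with 50:0A (NetApp, per the text above)."""
    return [p for p in PORT_ID.findall(log_text) if p.upper().startswith("50:0A")]


if __name__ == "__main__":
    sample = '<StoragePort id="50:0a:09:81:86:f7:c4:5d"/>'
    print(netapp_ports(sample))
```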
There is a "VMware SRM in a NetApp Environment" document that is quite useful. Find it at http://media.netapp.com/documents/tr-3671.pdf . If you get SRM/VC events saying "Virtual machines have one or more devices which don't have file backings on the replicated site", this is due to a CD being attached to a VM. As of 4/3/09 an updated SRA (or, as NetApp says, DRA) is available that solves a number of issues, and you should be sure to use it with SRM 1.0.1 Update 1. The new version number of the DRA is v1.0.1. Both it and the IBM N-Series 1.0.1 adapter are available at the VMware SRM download page.
(dr.secondary.ReplicationManager.SingleVmFailure) [
[#14] (dr.secondary.ReplicationManager.SingleVmFailure) {
[#14] dynamicType = <unset>,
[#14] vm = 'dr.secondary.ShadowVm:shadow-vm-8688',
[#14] fault = (dr.san.fault.RecoveredDatastoreNotFound) {
[#14] dynamicType = <unset>,
[#14] datastore = (dr.vimext.SanProviderDatastoreLocator) {
[#14] dynamicType = <unset>,
[#14] primaryUrl = "sanfs://vmfs_uuid:4994e685-01ee1320-a88a001ec9f48f03/",
[#14] },
[#14] reason = (vmodl.MethodFault) null,
[#14] msg = ""
[#14] },
[#14] }
[#14] ]
---------------
Now the problem seems to be that the replicated LUN is seen as a snapshot by the ESX host.
------------
vmhba2:0:5 vml.020005000060a98000486e2f39535a4e674c59674e4c554e202020
Feb 18 00:47:13 vmkernel: 27:17:19:37.383 cpu4:1221)LinBlock: 1994: VFS: Disk change detected on device 3:0
Feb 18 00:47:13 vmkernel: 27:17:19:37.408 cpu7:1339)LVM: 5573: Device vml.020005000060a98000486e2f39535a4e674c59674e4c554e202020:1 detected to be a snapshot:
Feb 18 00:47:13 vmkernel: 27:17:19:37.408 cpu7:1339)LVM: 5580: queried disk ID: <type 2, len 22, lun 5, devType 0, scsi 5, h(id) 11890432529146075181>
Feb 18 00:47:13 vmkernel: 27:17:19:37.408 cpu7:1339)LVM: 5587: on-disk disk ID: <type 2, len 22, lun 20, devType 0, scsi 5, h(id) 3407130522988133436>
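The two disk ID lines in the vmkernel log are easy to misread, so here is a small Python sketch that parses the queried and on-disk IDs and reports which fields differ. In the log above that is the LUN number (5 versus 20) and the h(id) value, which is consistent with the host treating the LUN as a snapshot. The regular expression tolerates the "disk ID:" label being missing, and the function names are my own.

```python
import re

# Captures the <...> body after "queried" or "on-disk"; "disk ID:" is optional.
DISK_ID = re.compile(r"(queried|on-disk)\s+(?:disk ID:\s*)?<([^>]*)>")


def parse_fields(body: str) -> dict:
    """Turn 'type 2, len 22, lun 5' into {'type': '2', 'len': '22', 'lun': '5'}."""
    fields = {}
    for part in body.split(","):
        name, _, value = part.strip().rpartition(" ")
        fields[name] = value
    return fields


def disk_id_mismatches(log_text: str) -> dict:
    """Return {field: (queried, on_disk)} for every field that differs."""
    ids = {kind: parse_fields(body) for kind, body in DISK_ID.findall(log_text)}
    queried, on_disk = ids.get("queried", {}), ids.get("on-disk", {})
    return {
        name: (value, on_disk.get(name))
        for name, value in queried.items()
        if value != on_disk.get(name)
    }


if __name__ == "__main__":
    log = ("LVM: 5580: queried disk ID: <type 2, len 22, lun 5, devType 0, scsi 5, h(id) 1> "
           "LVM: 5587: on-disk disk ID: <type 2, len 22, lun 20, devType 0, scsi 5, h(id) 2>")
    print(disk_id_mismatches(log))
```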
MetroCluster background information (thanks Lee!)
The short version is that SRM is a DR solution, and MetroCluster is a stretched HA solution. BTW, our HA was not designed for distance, and thus the KB article to help with that at http://kb.vmware.com/kb/1001783 . MetroCluster is basically a dual controller NetApp system, stretched across two sites, with the disks from each site synchronously mirrored over fibre to the other site. The idea is that you can lose one site and the surviving controller takes over, just like a normal controller failover process. If you're going to stretch a storage system across two sites, then the chances are you'll have a decent network between the sites, and you'd want to stretch your ESX HA clusters across the two sites as well (so you can VMotion from one site to the other). Then losing one site results in a MetroCluster failover, followed shortly by an HA restart of the VMs on the surviving ESX servers in the surviving site. In terms of high-level comparison we could do this:
                     SRM    MetroCluster
Distance Limited     No     Yes
Other comparison rows: vCenter Integrated, DR Workflow Creation, Transparent Failover, Non-disruptive DR testing, Site Failure VM Protection, NFS Support
Campus cluster / stretched HA environments (i.e. MetroCluster) work well if you have the right kind of infrastructure, but they are not really DR solutions, as typically the two sites are very close together and most customers I work with do not consider a DR site true DR if it is located within a certain distance of the primary. We had a couple of customers a few years ago whose "campus" solution was wiped out entirely when the UK oil field disaster struck and took out both datacentres at the same time (they were 0.5 miles apart). An extreme example maybe, but it illustrates the difference. If you can live with the limitations of a campus cluster solution and they fit your needs, then they can work well. As we say in the UK, take what the whitepapers say with a pinch of salt until you've tried it yourself.
With any cross-site storage architecture I have implemented, there will be **some** kind of pause whilst the system sorts things out. The amount of time this takes depends entirely on what failed. It could be 2 seconds, or it could be 2 minutes or more, and then you need to wait for HA to kick in. So when talking about failover initiation I would not say SRM and stretched HA solutions are really any different time wise. Indeed, if you wanted to automate the initiation of an SRM recovery plan you can do this, though if it were my pair of sites I would want this process to be kick started manually by someone once the true nature of the event was understood. With an SRM recovery plan the storage integration "tells" the storage to come online, rather than having to wait for a failover heartbeat or similar to be detected by the storage itself. Going back to campus clustering, although array / disk shelf failover can be automated, in my experience this does not always happen automatically either. Again, sometimes it may require a manual intervention (click a button, or type a command to fail over) and you need to have the process defined clearly for that event.
Losing a controller in either site should be no big deal for most vendors, and the failover operation should take care of the storage side. If you lose the entire site, then manual intervention will (probably) be required to fail over. It can sometimes be possible to script round this using staged heartbeats, but again that still adds time to the failover.
If we look at failback, with the campus implementation the process to fail back is not as simple as bringing up Site1 and then just VMotioning the VMs back from Site2; again it depends on the failure. If you lost Site1 completely and have had to fail over to the disk shelves at Site2, then the VMs will (once HA has restarted them) all be running from the disk shelves at Site2. If you simply VMotion them back to Site1 when it is ready, then the storage will still be accessed via Site2's controller / disks until you tell the storage arrays to go back to their default configuration, which will require restarting the VMs again and will incur downtime in the same way an SRM failback would. I cannot imagine you would want a situation where Site1 came back online and you VMotioned 50% of your workload back to Site1 but left 100% of your disk workload running at Site2. I think in all cases the customers I have put this in with have wanted the storage to "go back to how it was", ready for the next event or failure.
The biggest difference in terms of customer feedback I receive is that the ability to perform automated, repeatable, non-disruptive DR testing is one of the key factors moving customers towards SRM.
The only other items you need to be thinking about with campus cluster are below. I am not adding these to say "SRM is better". These are simply things I have had to work through when implementing campus cluster, and some of these nuances don't always make it into the whitepapers / datasheets, shall we say.
VC Inventory / Layout: be careful with the design. As everything is stretched you need to be very consistent and accurate with naming conventions across all inventory objects the VMs will use.
DRS/HA settings: with campus clustering, ensure that you know which VMs are important and define the correct settings per VM for recovery. Unless you have N+1 capacity spare at each site you will need to put in place HA/DRS settings that bring the most important VMs online first, so that you don't end up in a failure situation with all your dev/test VMs online and half the production VMs "down" because you did not set correct priorities in HA. In SRM this is something the recovery plan handles and you can control.
Split Brain: if you run the two sites as one big HA/DRS cluster, ensure you test out the various failure scenarios. For example, if DRS (or manual VMotion) moves a bunch of VMs from Site1 to Site2 but no failure has occurred at that time, you now end up with VMs whose CPU / memory / network contexts are running on hosts at Site2 but which are accessing their VMDKs on Site1.
This will work but is not always desirable from a latency point of view (it might be a non-issue if bandwidth is sufficient). However, what happens next if you now suffer a disk outage at Site1? At this point the VMs will not crash immediately at Site2, and it will take HA some time to realise these VMs have an issue. Try it and see: if you disconnect storage from a VM, the VM will cling on to life (assuming the IO pattern is normal) for quite some time before a bluescreen is seen.
Storage Presentation: if your vendor wants the zone across the sites to effectively be "open" to all ESX hosts, then ensure you understand the implications of the ESX LVM settings with regard to snapshot / disk resignature. You will potentially have ESX hosts that could at some point access both a source and a target LUN at the same time if someone or something altered the LVM defaults.
Zoning: if the VSAN / zones are truly open, or all hosts are in the same one, then certain fabric events can be a potential pain. Any rogue events such as RSCNs will disrupt both sites at the same time if all ESX hosts are on the same open fabric, so be careful here. It is not something that is too common, but I have seen it hurt a few customers; it usually comes down to a bad HBA or cables, but can be a real pain to track down.
VC / ESX limits: as you build the design out for campus cluster, ensure the design won't
have you quickly reaching the limits of what is supported in terms of things like max number of VMs/VC, max number of LUNs/ESX host, max number of paths/LUN/ESX host, etc.
As much as I like SRM solutions, I also like the campus cluster / single pane of glass approach where it works/fits. Both use cases are valid, but ensure you work out what you actually need.
And some more on this subject from Lee:
By the way, MetroCluster's competition is NOT SRM; it is EMC VPLEX. VPLEX is EMC's stretched storage / HA solution, but as with NetApp, EMC also integrates ALL of its platforms with SRM, as they know that is what provides DR. Let's compare some component basics. With SRM, the architecture is designed so that the recovery process does not depend on any component from the protected site. A simple example: SRM uses two vCenters, meaning that if the protected site VC dies it does not affect your ability to recover. This is not the case with MetroCluster. With MetroCluster you are using a single vCenter instance across the two datacenter rooms, so although HA can recover the VMs if vCenter is lost, in your design you are now using a single vCenter namespace across two sites, and this needs to be taken into account when you are adding objects to your vCenter inventory (naming consistency, scale limits, etc.). There are also failover scenarios to think about with MetroCluster; it is not ALL automated by VMware HA by any means. SRM's strength as a DR solution (combined with NetApp SnapMirror) is that it allows customers to build repeatable recovery workflows that bring their infrastructure back online in a specific order. VMware HA does NOT do that.
Other factors to consider that might not be immediately obvious (don't get me wrong here, I'm not bashing MetroCluster with NetApp/VMware; it is a good solution, you just need to be sure it is what the customer needs, otherwise SRM/NetApp/VMware is a good solution also :) Sometimes I find MetroCluster is wrongly sold to customers who really needed DR. Those are the situations to be careful of. Some (not all) NetApp account teams will try to sell only MetroCluster because that solution is more $$$$$ for them, and sometimes it is because they just don't understand that NetApp integrates with SRM very well!
Other caveats of a MetroCluster solution that customers need to understand and be happy with are below. If your sites are close together and you truly want stretched HA, then you will most likely be fine with these points, BUT if you really wanted DR then these points will usually come as a surprise to customers and annoy them!
In a MetroCluster deployment, granularity in the filer is at the aggregate level (the highest level!) NOT at the volume level. With SRM you can simply set up SnapMirror for the volumes / LUNs you want to protect; this is NOT the case with MetroCluster.
MetroCluster has no offline/non-disruptive testing capability as we do with SRM/NetApp FlexClones, so how do you prove you can fail over successfully?
MetroCluster has distance limits (2Gb link 500m stretch or 4Gb link 270m stretch,
for greater distances you need fabric/switch MC, for up to 100KM max distance span).
A MetroCluster solution *must* use the recommended Brocade switches to be supported.
All disks log in to the switches as hosts. The max limit is 672 logins and you cannot go beyond it (672 = 6 switches), and you would need 6 x 2 x 2 for redundancy at both sites... a lot of switches :)
ALL disk shelves must be mirrored. In a MetroCluster solution you are using SyncMirror NOT SnapMirror. If you had, say, 20 disk shelves and broke those up into 4 aggregates, and within each aggregate had, say, 40 volumes each containing a single VMware NFS export, then with SRM you can simply SnapMirror the volumes (exports) you need to replicate. So there is lots of flexibility and granularity, and no need to replicate things you don't need at the DR site; why waste bandwidth? With a MetroCluster solution ALL disk shelves MUST be mirrored even if the VMs within them are not needed for DR or are not business critical, so there is less flexibility when carving up the storage.
If you suspect the customer is being mis-sold MetroCluster, ask them simple questions to try to work out what they want:
- Control of the failover?
- Orchestration, using recovery plans to build recovery workflows that match what their business wants to happen
- Ability to pre-build recovery workflows that can be tested, validated and invoked during an outage, knowing the recovery will take place following the pre-programmed workflow
- Ability to perform non-disruptive tests
- Ability to run pre/post power on scripts
- Ability to customize the network at remote sites as part of the recovery
- Ability to run callout scripts that talk to other pieces of their infrastructure
- Ability to run the recovery in a pre-defined sequence that matches their own business recovery processes and SLAs
- No single point of failure, i.e. both sites run with their own management layers (vCenter), meaning that if one of the sites is lost nothing from that site is needed to recover. In a stretched HA + MetroCluster environment that is NOT the case. MetroCluster uses a single vCenter server for both sites, so vCenter is the single point of failure for some scenarios here.
If the answer to the above is YES then they are looking for a DR solution, hence they need NetApp SnapMirror combined with NetApp's SRA for SRM. If the customer has two sites / server rooms that are VERY close together and they do just want to run both rooms as "one", then that might be a good fit for MetroCluster. For example, if the two rooms were <20KM apart, for some industries that might mean those two sites couldn't be classed by the regulator/auditor as DR anyway because they are too close together. If that is the case, MetroCluster works for them there, as they might be breaking the law by claiming to have DR protection if their datacenters are too close, i.e. could both be wiped out by the same disaster such as a flood or a power blackout to a city. Hope this information helps you work out what solution fits for your customer.
HDS
One important thing to remember with HDS is that immediately after a real failover it will reverse direction and start replication in a new direction. This can be changed.
Lots of help can be found in http://www.hds.com/assets/pdf/hitachi-storage-replication-adapter-softwarevmware-vcenter-site-recovery-manager-deployment-guide.pdf . Some help can be found in http://www.hds.com/assets/pdf/implementing-vmware-site-recovery-managerwith-hitachi-enterprise-storage-systems.pdf .
Currently the HDS SRA doesn't set the path to the perl binary, but it should do that properly in the next release (update: it does). The SRA also needs the HDS CCI component installed.
SRM will only look for replicated datastores on devices that are presented to ports on the array that are returned by the discoverArrays command. If discoverArrays does not return a WWPN, then SRM assumes that devices on that port are not for use by SRM, even if LUNs on that port are made visible to the ESX hosts. The HDS adapter is not returning the port that presents the snapshot LUN. The reason seems to be some logic in the adapter which determines the ports on the array by looping through the list of source (L or local) replicated and shadow image devices on the target array. However, the shadow image snapshot is actually a remote volume (R), because the L local volume is actually the replicated target, so the adapter is not returning the port of this volume. Because it is not returning the port, SRM ignores it (by design) and the test fails. This problem does not occur when all volumes are on the same port. VMware Engineering is working on this with HDS.
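The port filtering described above can be pictured with a small sketch. It only illustrates the behaviour as described in this section and is not SRM or SRA code: the Lun type, the function name, and the WWPN values are all made up for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Lun:
    name: str
    port_wwpn: str  # array port the LUN is presented on (hypothetical field)


def luns_srm_will_consider(luns, discovered_wwpns):
    """Keep only LUNs presented on ports that discoverArrays reported."""
    known = {wwpn.lower() for wwpn in discovered_wwpns}
    return [lun for lun in luns if lun.port_wwpn.lower() in known]


# Hypothetical layout: the replica and the snapshot LUN sit on different ports.
luns = [
    Lun("replica", "50:06:0e:80:00:00:00:01"),
    Lun("snapshot", "50:06:0e:80:00:00:00:02"),
]
# The adapter reports only the first port, so the snapshot LUN is dropped.
reported = ["50:06:0E:80:00:00:00:01"]
visible = luns_srm_will_consider(luns, reported)
print([lun.name for lun in visible])
```

With both LUNs on the same port the snapshot LUN would pass the filter, which matches the observation that the problem does not occur when all volumes are on one port.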
EMC
As always, make sure you have checked the SRM HCL but also confirm that your SRA pre-requisites like SE, FLARE/DART/RP versions are correct.
A video that talks about all four EMC replication technologies for SRM: http://www.emc.com/collateral/demos/microsites/mediaplayer-video/video-walsworthtothepoint-vmsrm.htm
VSI version 4 is out - http://itzikr.wordpress.com/2010/12/20/emc-virtual-storage-integratorvsi4-is-out/
Celerra
This supports simultaneous recovery plan operation. Make sure the plans do not step on each other in terms of LUNs / VMs or their components.
A new SRA (4.0.22) is out and I think it is important - http://itzikr.wordpress.com/2010/12/20/new-celerrasra-and-a-celerra-failback-plug-in-for-vmware-srm/
The Celerra 2.0 beta SRA has a log location of [SRM_InstallDir]\scripts\SAN\celerra\log\sra.log .
Celerra and VMware Techbook - http://www.emc.com/collateral/hardware/technical-documentation/h5536-vmware-esx-srvr-using-celerra-stor-sys-wp.pdf
Celerra SRA release notes - http://powerlink.emc.com/km/live1/en_US/Offering_Technical/Technical_Documentation/300-007023.pdf
Celerra Failback plug-in release notes - http://powerlink.emc.com/km/live1/en_US/Offering_Technical/Software_Download/SRMFailbackWizard_read_me_first.pdf
Celerra and NFS with SRM - http://communities.vmware.com/docs/DOC-11541
Plug-in - http://powerlink.emc.com/km/live1/en_US/Offering_Technical/Software_Download/EMC_Celerra_Failback_Plug-in_for_VMware_vCenter_SRM.zip
Celerra VSA, great for learning and testing - http://nickapedia.com/2010/09/12/ubertastic-celerra-ubervsa-v3-unisphere/
Changing the Celerra passwords
Use the following procedure to reset the root password:
1. Access to the console of the Control Station is required, so either connect to the console physically or use a serial console.
2. Boot the Control Station, or reboot it, or use the power switch to reset it if shutdown commands cannot be issued.
3. When the BIOS checks complete and GRUB is loading, press any key (an arrow key is best) to stop it from auto-booting in 10 seconds.
4. Press e to edit the line that is highlighted ("Linux" would be the normal word).
5. Select/highlight the line starting with the word "kernel" and press e to edit.
6. At the end of the line, append the word "single" with a space in front of it and press ENTER.
7. The highlighted line should now show the word "single" at the end.
8. Press b to boot from this modified line.
9. The Control Station will now boot to single user mode, with a # prompt appearing, which means you are already logged in as root.
10. Issue the passwd command and enter the new password (and confirm it).
11. "init 6" will reboot, and it should then boot normally. Use the new password set at step 10 for root.
With this root login you can reset the nasadmin password, if required. You would do this after logging in with the root account.
CLARiiON
Currently the CLARiiON has a limit of 32 characters for CG names. The Solutions Enabler API is trimming the last two characters off the name, which causes SRM issues. Until the next release of the SE software it is best to avoid this issue by using only 30 characters in the CG name.
When using SnapView, remember that SV must snap to THICK LUNs. On the CX the snap name should have the following prefix: VMWARE_SRM_SNAP .
While the Solutions Enabler (SE) can be installed on the SRM server, a physical host, or a VM, it will sometimes make things easier to have it on the SRM server. This is for both CX and DMX equipment. I have been told that this is a requirement that is not documented anywhere.
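The two naming rules above are easy to check before a test failover. The following is a minimal sketch under the assumptions stated in this section (32-character limit, two characters trimmed, VMWARE_SRM_SNAP prefix); the function and constant names are invented for the example, and it does not talk to the array.

```python
CG_NAME_LIMIT = 32   # CLARiiON limit quoted above
TRIMMED_BY_SE = 2    # characters the SE API is reported to cut off
SAFE_CG_NAME_LENGTH = CG_NAME_LIMIT - TRIMMED_BY_SE
SNAP_PREFIX = "VMWARE_SRM_SNAP"


def clariion_name_problems(cg_name, snap_name):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    if len(cg_name) > SAFE_CG_NAME_LENGTH:
        problems.append(
            f"CG name has {len(cg_name)} characters; "
            f"use {SAFE_CG_NAME_LENGTH} or fewer"
        )
    if not snap_name.startswith(SNAP_PREFIX):
        problems.append(f"snap name should start with {SNAP_PREFIX}")
    return problems


# Hypothetical names, for illustration only.
print(clariion_name_problems("Exchange_CG_01", "VMWARE_SRM_SNAP_LUN14"))
print(clariion_name_problems("A" * 31, "snap_LUN14"))
```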
Something that may be useful if the SRM error is "failed to create LUN snapshots" - http://blog.virtualtacit.com/home/2009/7/30/clariion-cg-snap-session-limit-smack-down-during-srm-testfa.html .
EMC CLARiiON - http://communities.vmware.com/docs/DOC-11544
http://www.emc.com/collateral/software/solution-overview/h2197-vmware-esx-clariion-stor-syst-ldv.pdf
DMX
When working with DMX and using BCVs, you cannot use TimeFinder snapshots. This is not a limitation of or by VMware but rather an EMC limitation.
While the Solutions Enabler (SE) can be installed on the SRM server, a physical host, or a VM, it will sometimes make things easier to have it on the SRM server. This is for both CX and DMX equipment. Remember that for DMX equipment the SE will need to have a gatekeeper LUN, and if the SE host is a VM, the gatekeeper LUN will need to be a pRDM. The DMX will need to have its LUNs in a device group.
http://www.emc.com/collateral/hardware/solution-overview/h2529-vmware-esx-svr-w-symmetrix-wpldv.pdf
New version of SRA 2.2.0.3 - http://itzikr.wordpress.com/2010/12/16/new-emc-srdf-sra-for-srm-getthe-scoop-inside-3/ - this is a big release and an important one!
SPC-2 - http://www.yellow-bricks.com/2009/12/08/spc-2-set-or-not/
SRDF
A new tech book on SRDF and SRM is now available, in both hard copy and soft copy. It covers version 2.2 of the SRA and its install / configuration, plus how to use the new features, which include:
Test failover using TimeFinder/Snap off of an SRDF/A R2 (new with 5875)
Test failover without using TimeFinder technologies, instead running the test failover directly off of the SRDF R2
How to use the new VSI SRA utilities
And information in the Appendix on SE licensing.
Powerlink (soft copy): http://powerlink.emc.com/km/live1/en_US/Offering_Basics/White_Paper/h7061-srdf-adapter-vcenter-srm.pdf
Vervante (hard copy): http://store.vervante.com/c/v/V4081409244.html?base_cat=EMC%3a%20EMC%20TechBooks&pard=emc
Important note: the EMC VSI Plug-in version 4 does NOT write the SRDF configuration out to EmcSrdfSraOptions.xml, although it says it did, when it has NOT been started with Administrator rights. Or rather, when the vSphere Client has not been started with them (using the right-click, start-as-admin option). This may, or may not, be mentioned in the release notes. It will be mentioned in the future if it is not, and EMC is
thinking of other ways to manage this.
http://itzikr.wordpress.com/2011/01/10/srm-automatic-failback-using-emc-symmetrix-vmax/
New version of SRA 2.2.0.3 - http://itzikr.wordpress.com/2010/12/16/new-emc-srdf-sra-for-srm-getthe-scoop-inside-3/ - this is a big release and an important one!
To make your work with SRDF and SRM successful you will need two documents. The first is the SRA release notes, which can be found in PartnerLink. The second is a new SRDF and SRM techbook, which can be found at https://powerlink.emc.com/nsepn/webapps/btg548664833igtcuup4826/km/live1/en_US/Offering_Basics/White_Paper/h7061-srdf-adapter-vcenter-srm.pdf or at http://www.emc.com/collateral/software/technical-documentation/h7061-srdf-adapter-vcenter-srm.pdf
Latest SRDF SRA release notes - http://powerlink.emc.com/km/live1/en_US/Offering_Technical/Technical_Documentation/300-010235_a03.pdf
I have had troubles with both links at different times, and both links have worked for me at times. If you cannot get the document I can send it to you!
A View/SRDF/SRM white paper - http://www.emc.com/collateral/software/white-papers/h6971-businesscontinuity-view-srdf-wp.pdf
What licenses are necessary to successfully use SRDF and SRM? Generally you will require:
BASE SERVER (to allow it to be an API-SERVER)
SRDFA (to allow it to manipulate SRDF/A RDF groups)
SRDF (to allow it to manipulate RDF devices)
TimeFinder (to allow it to use TimeFinder/Mirror)
TimeFinder-Clone (to allow it to use BCVs for testing)
A useful SRA and SRM document can be found at: http://www.emc.com/collateral/software/whitepapers/h6368-using-emc-srdf-adapter-v2-vmware-srm-wp.pdf - I believe this may have been replaced with the document above.
12/12/09 I have heard, but have not confirmed, that SRDF will reverse direction and start replication immediately after a failover.
10/2/09 there is an updated SRA and Storage plug-in that make things work more easily! Make sure that you use them. EMC just told me the latest code went on Powerlink today. Look under: Home > Support > Product and Diagnostic Tools > Symmetrix Tools > Symmetrix Tools for VMware
When you are performing the failover test, what kind of devices are you working with: sync (SRDF/S) using TimeFinder/Snaps (VDEVs), or async (SRDF/A) using TimeFinder/Clones (aka BCVs)? At the recovery site you need to pair up the R2 devices (replicas) in the datastore groups being tested with appropriate target devices for testing. These target devices are the TimeFinder devices just mentioned: VDEVs if you are sync and BCVs if you are async. The device pair list (as mentioned in your error) is stored in an xml file in the Program Files\Vmware\Vmware Site Recovery Manager\scripts\SAN\EMC Symmetrix folder. The file is called EmcSrdfSraOptions.xml. In that file you need to specify the R2 devices and their associated VDEV/BCV pairs as part of the <TestFailoverInfo> information inside the device pair list element. Example device pair entry:
<DevicePair>
<Source>0477</Source>
<Target>035F</Target>
</DevicePair>
Once you have a device pair for ALL of the R2 devices in your recovery plan, save the xml file and try the test again. The purpose of the EMC Storage plugin (latest one) is that it now includes an EMC SRDF SRA tab in vCenter that allows you to match up the pairs in vCenter and then save the xml file from that tab, so no manual editing is required. I have included a screenshot below of what this looks like.
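If you do end up editing the file by hand, a short script can generate the device pair entries and flag R2 devices that have no pair yet. This is only a sketch: it produces the <DevicePair> fragment shown above and nothing else, because the rest of the EmcSrdfSraOptions.xml layout is not reproduced in this guide. The function names and the second device ID are made up for the example.

```python
import xml.etree.ElementTree as ET


def device_pair_element(source, target):
    """Build one <DevicePair> entry for an R2 device and its VDEV/BCV."""
    pair = ET.Element("DevicePair")
    ET.SubElement(pair, "Source").text = source
    ET.SubElement(pair, "Target").text = target
    return pair


def unpaired_r2_devices(r2_devices, pairs):
    """Return the R2 devices that do not appear as a Source in any pair."""
    paired = {source for source, _target in pairs}
    return sorted(set(r2_devices) - paired)


pairs = [("0477", "035F")]            # the example pair from the text
r2_devices = ["0477", "0478"]         # "0478" is a hypothetical second R2

for source, target in pairs:
    element = device_pair_element(source, target)
    print(ET.tostring(element, encoding="unicode"))

print("R2 devices still missing a pair:", unpaired_r2_devices(r2_devices, pairs))
```

The check for unpaired devices matters because the test needs a pair for ALL of the R2 devices in the recovery plan.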
All of this is also covered in the SRDF guide (let me know if you don't have this and I can send it separately).
SRDF adapter version 2.0 no longer references the netcnfg file. Instead you are expected to specify a resolvable host name or IP address in the address field of the Array Manager. You can even add a port to it if you are not using the default 2707.
SYMAPI_C_NET_Handshake_FAILED error
This usually occurs when there is a security level mismatch between client and server, and sometimes when the Solutions Enabler versions are mismatched. Check the options file in the symapi/config folder and change the sym server security level from the default of ANY to NONSECURE. Confirm both sides.
On a Symmetrix using SRDF/A, you may find a failover that successfully proceeds past the storage configuration but fails when powering on the VMs. This issue may be due to SPC-2. It needs to be enabled on the front-end adapter on the recovery Symmetrix that is exposing the RDF and BCV LUNs to the ESX host; otherwise SRM cannot match the WWN of the LUN returned by the SRA with the WWN of the LUN presented to the ESX host. You can find information on this in our forums, and also in document emc71378 in EMC's Powerlink KB.
This SPC-2 flag can be set either on the FA OR on the initiator itself. By making the change on the initiator, you can avoid moving hosts off the FA to make the change there. http://www.yellow-bricks.com/2009/12/08/spc-2-set-or-not/ .
All devices in a consistency group (device group) must be failed over together. This means they can only have one protection group and one recovery plan. If they need more, you will need to create multiple consistency groups.
If you are using the SRDF SRA you will need the following manual step to avoid errors in the log that appear to indicate a path issue, as well as an error in the UI that reads "Failed to launch SAN integration scripts to execute discoverArrays command". This will occur when you are trying to configure the arrays during the initial SRM setup. The solution is to add the path to the SYMCLI binaries to the Path variable under System variables. The default path to SYMCLI is C:\Program Files\EMC\SYMCLI\bin . After adding the path you will need to restart the SRM server service.
For SRM setups, EMC recommends using Solutions Enabler in a "client-server" fashion, because the SRM server typically does not have direct fibre connectivity to the SAN (whereas the ESX host does). To use SE in a client-server fashion, SE needs to be installed both on the SRM server (Windows version) as an SE "client" and on at least 1 ESX host (RH Linux version) as the SE "server". On the SE "client", you edit the netcnfg file to tell SE who the SE "server" is. The edited line contains a "service name" (which can be arbitrary), the hostname of the SE server (in this case, the ESX server), and the IP address of the SE server. The "service name" is the name that should be entered for the SYMCLI_CONNECT environment variable on the SRM server. That is how the SE "client" identifies the SE "server" to direct its SYMCLI commands to. The use of netcnfg means that there is a single point of failure.
Some clients might use the Control Center as the SYMAPI server to avoid this.
Put the path of the SE bin folder into the System Variable PATH. By default it is C:\Program Files\EMC\SYMCLI\bin . You will see errors about this if you don't. In addition, you should restart the SRM server after you have completed the SE install and tweaks. This is when using SRDF/A.
The SYMCLI is what is required by the SRA to talk to SRM. The SRA by default creates a log under \program files\emc\symapi\log with the name of symvmwsrm<date>.log.
If you wish to have application consistent VMs after a failover you will need to use Replication Manager to arrange that. I am not sure if it is yet compatible with SRM.
Using EMC SRDF Adapter for VMware Site Recovery Manager - http://www.vmware.com/files/pdf/VMware_SRM_SRDF_bestpractices.pdf
SRDF DM doesn't work with SRM. The EMC adapter seems to be coded to skip any devices in Adaptive Copy state (Data Mobility is the fancy name for Adaptive Copy); as these devices won't be reported to SRM, any VMs on these LUNs cannot be added to a protection group. In addition, SRDF DM copies dirty tracks out of order to the R2 devices, so it likely cannot guarantee a consistent image and is not a good SRM candidate.
SRDF issue (thanks Jason for this sample): The customer claims datastore DMX-25-SRM-Testing-955 is on a replicated LUN; however, SRM does not create a datastore group including this datastore.
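A quick way to confirm the PATH step took effect is to check whether the SYMCLI bin folder appears in the PATH value. The sketch below is a hypothetical helper, not part of SE or SRM; it compares entries the way Windows does (case-insensitively, ignoring a trailing backslash), and it only sees the PATH of the process that runs it.

```python
import ntpath
import os

DEFAULT_SYMCLI_BIN = r"C:\Program Files\EMC\SYMCLI\bin"  # default quoted above


def _normalise(path):
    # ntpath applies Windows path rules on any platform:
    # lower-case, backslashes, no trailing separator.
    return ntpath.normcase(ntpath.normpath(path.strip().strip('"')))


def dir_on_path(directory=DEFAULT_SYMCLI_BIN, path_value=None):
    """Return True if `directory` is one of the entries in a Windows PATH."""
    if path_value is None:
        path_value = os.environ.get("PATH", "")
    wanted = _normalise(directory)
    entries = [entry for entry in path_value.split(";") if entry.strip()]
    return any(_normalise(entry) == wanted for entry in entries)


if __name__ == "__main__":
    print("SYMCLI bin folder on PATH:", dir_on_path())
```

A service only picks up a changed system PATH when it is restarted, which is why the restart of the SRM server service is still needed even when this check passes in a new command prompt.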
vmhba1:3:14, i.e. LUN 14 on target 3 of hba1 on host-7422, this host sees this LUN's UUID as:
[2009-03-11 13:04:23.493 'SanConfigManager' 13084 trivia] Added LUN '10:00:00:00:C9:7A:42:65;14;50:06:04:82:D5:2E:89:09' with keys 'host-7422;vmhba1:3:14' and 'host7422;02000e00006006048000019010205253303039353553594d4d4554'
the LUN WWN is encoded within the UUID (last token of this line) as characters 10 through 42, i.e. 600604800001901020525330303935355 However, discoverLuns returns only 1 replicated LUN, which is not this WWN:
[2009-03-11 14:30:46.094 'PrimarySanProvider' 14848 trivia] 'discoverLuns' returned <?xml version='1.0' standalone='yes'?>
[#2] <Response>
[#2] <LunList arrayId="000190102052">
[#2] <Lun consistencyGroupId="RA::9" id="8F3" wwn="60:06:04:80:00:01:90:10:20:52:53:30:30:38:46:33">
[#2] <Peer>
[#2] <ArrayKey>000187401329</ArrayKey>
[#2] <ReplicaLunKey>738</ReplicaLunKey>
[#2] </Peer>
[#2] </Lun>
[#2] </LunList>
[#2] <ReturnCode>0</ReturnCode>
[#2] </Response>
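The comparison made by hand above can be scripted. This is a minimal sketch using the two values from the log excerpts; the function name is invented for the example. Note one assumption: it takes 32 hex digits (16 bytes) starting at offset 10 so that the result has the same length as the wwn attribute in the discoverLuns response, which is one digit shorter than the string quoted in the text.

```python
def wwn_from_uuid(uuid, offset=10, hex_digits=32):
    """Extract the LUN WWN embedded in an ESX LUN UUID as colon-separated hex."""
    hex_wwn = uuid[offset:offset + hex_digits]
    if len(hex_wwn) != hex_digits:
        raise ValueError("UUID is too short to contain a WWN")
    pairs = [hex_wwn[i:i + 2] for i in range(0, hex_digits, 2)]
    return ":".join(pairs).lower()


# UUID the host reports for vmhba1:3:14 (from the SanConfigManager line).
uuid = "02000e00006006048000019010205253303039353553594d4d4554"
# WWNs the SRA returned from discoverLuns (only one in this sample).
reported_by_sra = {"60:06:04:80:00:01:90:10:20:52:53:30:30:38:46:33"}

host_wwn = wwn_from_uuid(uuid)
print(host_wwn)
print("replicated according to the SRA:", host_wwn in reported_by_sra)
```

When the last line prints False, as it does here, the next step is to look for the reason the SRA left the device out.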
How could there be only 1 replicated LUN? Looking further up in the log, the SRDF SRA reports several messages such as:
[#2] 20090311 14:30:45 INFO Skipping SID [000190102052] RDF device [82C] config
[#2] [RDF1+R-5] mode [Adaptive Copy] pair state [SyncInProg]
[#2] star mode [False] meta type [Member]
So it is skipping several LUNs that presumably are RDF1 but are in the "SyncInProg" state in Adaptive Copy mode. The SRDF adapter only supports SRDF/S or SRDF/A, so they would have to be in "Synchronous" or "Asynchronous" mode. The LUN is being replicated, but not in the right mode, and the adapter is skipping it, so SRM cannot map vmhba1:3:14 to a replicated LUN. The solution is for the customer to correct the LUN on which the datastore was created so that it is in synchronous or asynchronous mode, not adaptive copy (which is in fact the mode used for the initial full sync from R1 to R2).
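When a log contains many of these messages, it can help to pull out the skipped devices and their modes. The sketch below is based only on the single message format shown above, so the pattern is an assumption and may need adjusting for other SRA versions; the names are made up for the example.

```python
import re

# Pattern built from the one sample message above; DOTALL lets it span the
# "[#2]" continuation lines that the SRM log inserts.
SKIP_RE = re.compile(
    r"Skipping SID \[(?P<sid>[^\]]+)\] RDF device \[(?P<device>[^\]]+)\]"
    r".*?mode \[(?P<mode>[^\]]+)\] pair state \[(?P<state>[^\]]+)\]",
    re.DOTALL,
)


def skipped_devices(log_text):
    """Return one dict per 'Skipping SID' message found in the log text."""
    return [match.groupdict() for match in SKIP_RE.finditer(log_text)]


sample = (
    "[#2] 20090311 14:30:45 INFO Skipping SID [000190102052] RDF device [82C] config\n"
    "[#2] [RDF1+R-5] mode [Adaptive Copy] pair state [SyncInProg]\n"
    "[#2] star mode [False] meta type [Member]\n"
)

for entry in skipped_devices(sample):
    print(entry["device"], entry["mode"], entry["state"])
```

Any device listed with mode Adaptive Copy is one the adapter will not report to SRM.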
MirrorView
If you are using MV with Clariion you will need to use Solutions Enabler (for communications) and Navisphere for the replication management. If you use MV with the Celerra platform you will require neither. Replication Manager is useful for both. As of 12/12/09 you can only run 1 recovery plan at a time. Elsewhere in this document you can
experimentally change this. If you are working on a 64-bit SRM host, you must use the 32-bit Solutions Enabler software. SE can be installed with no configuration or extra bits. The MV SRA works on ports 80/443, but if they are not used you will end up using 2162 / 2163.
You can get a "failed to create LUN snapshots" error when working with MirrorView. It is generally a problem in the EMC configuration. You can sometimes avoid it by using the following steps:
Create the source volume and mask it to the ESX hosts on the production site
Build a VMFS datastore on the source and add a guest to the datastore
Use the Navisphere MirrorView wizard to create a target volume
Create the MV/S or MV/A relationship
Add the target volume to the MV/A / MV/S relationship
Once synced, create snapshots on both sides
Add the production side snapshot into the storage group for the production site ESX hosts
Add the target volume and its newly created snapshot into the DR side ESX host storage group
Create a consistency group on the production array and add the MirrorView relationship(s) to the consistency group.
Some specific suggestions for MirrorView/S on Clariion would include:
1. Solutions Enabler 6.5.2 or later
2. SRA 1.3 or later
3. Consistency Groups must use purely alphanumeric characters, or a real failover will work but a test will not.
4. The snapshot must have VMWARE_SRM_SNAP in the name somewhere. It appears to me that you (or the storage admin) create the snapshot before it is required and then the SRA activates it. This is for test failover only.
5. You can have only 1 recovery plan active when using this SRA. Hopefully this will be improved in future releases of the SRA. There was some disagreement about this, but the release notes say no and I have confirmed this is correct. It will take Navisphere engineering changes to support running more than one RP. But see above for how you could do it now if you need to test it.
With MirrorView you will need to make sure the EMC array scripts are in the same folder structure as the SRM install. This is only relevant if you have installed SRM to a different drive. This will impact a number of applications.
In addition, it has been suggested that all of the Solutions Enabler options need to be installed. When installing Solutions Enabler, accept all the defaults and perform a complete install.
If you have some performance issues with the failover you may be using an old version of the Solutions Enabler. In Powerlink article emc203510 you can find the SE Patch Release 6.5.2.20 that reduces the time required for storage preparation by more than 50%.
Replication Manager
Currently Replication Manager (RM) doesn't co-exist with SRM, but in December 2008 it may be supported. It has been said that it will be dramatically easier to set up the array-to-array replication.
RecoverPoint
You should avoid having spaces in a Consistency Group name. It appears that the SRA cannot handle them; you can see errors in the SRM log about not being able to find CGs. The CG should also be a CRR consistency group for remote replication and not CDP/local or CLR / local-remote. The CG policies that must be set include reservations support and VMware ESX or VMware ESX Windows as the host. It seems that RecoverPoint and SRM have issues if the ~ is in the CG name. Avoid that.
The RecoverPoint SRA uses TCP 7115 to talk to the RPAs. If the MUI cannot talk to the RPA, neither will the SRA.
http://www.emc.com/collateral/software/white-papers/h7261-business-continuity-vsphere-recoverpointwp.pdf
The log location for the SRA is c:\program files\EMC\SYMAPI\log .
In addition, when there were 10 CGs and one VM that had 23 VMDKs attached and spread around those 10 CGs, we were not able to do a failover. We changed it to one CG with the rest the same and it worked. It was the RP SRA 3.1, which is 1.0.2.1.
A customer recently had 19 LUNs in a CG and was failing over unsuccessfully. There were device 0 and device 1 errors for the VM configuration, since the VM configuration files were not being seen in the time that SRM required. A manual refresh on the host brought all the VMs online. This is a clue that indicates the solution: you need to do two HBA refreshes. This has been reported as necessary for HP, and sometimes with big HDS and SRDF environments, but now with a large RecoverPoint CG as well. See "How can I configure a second HBA rescan?" for help on fixing this.
With the number of CGs that are currently supported, and given that the number will grow in the future, it is suggested to think about having one app or business unit per CG. That would be one or more LUNs, as that app or business unit would require. This would provide the greatest flexibility in testing and failover.
It has been reported to me that RecoverPoint will support simultaneous recovery plan operation.
You must organize things so that the plans do not impact each other, but it works.
The account that you use in SRM Array Manager to talk to the RecoverPoint appliance must be configured as admin in the RecoverPoint appliance.
SPC2 issue with RecoverPoint and DMX
If you see the error message below when working in the Array Manager and trying to configure your connection to a RecoverPoint that is using DMX storage, you may have an SPC2 problem.
This error occurred after entering your credentials and selecting Connect. This occurred due to the FA flag on the DMX source storage that was not set for the RPA but was in fact set for the ESX servers. EMC was a very quick help with this issue.
Site Management IP
Using RecoverPoint (RP), which server talks to the RP management server? During the Array Manager configuration a connection is created to the Site Management IP for RP. The question was asked: when the Site Management IP is in a protected management network and a firewall rule must be created to provide access, which is the source server: the VirtualCenter server, the SRM server, or the ESX server? It is the SRM server (which is often located on the VC server) that needs to communicate with the RecoverPoint Site Management IP.
Unable to connect SRM to the RecoverPoint Management Server
The 3.0 RecoverPoint SRA uses the same ports as the RP GUI (1099 and 4401). So if you can open the RP GUI from the SRM server, the SRA should work.
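If the RP GUI is not handy on the SRM server, a plain TCP connection test against the same ports gives a first indication of whether a firewall is in the way. The sketch below uses only the Python standard library; the function names and the host name in the usage comment are placeholders, and a successful TCP connect does not prove the SRA itself will work.

```python
import socket

RP_GUI_PORTS = (1099, 4401)  # ports quoted above for the 3.0 RecoverPoint SRA


def port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_ports(host, ports=RP_GUI_PORTS, timeout=3.0):
    """Return a {port: reachable} mapping for the given host."""
    return {port: port_open(host, port, timeout) for port in ports}


# Example, run from the SRM server (placeholder host name):
#   print(check_ports("rp-site-mgmt.example.com"))
```

Run it from the SRM server itself, since that is the machine that needs the firewall rule.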
WARNING: UNKNOWN_ERROR
When you see an error that looks like [#1] Fri Mar 20 09:25:05 PDT 2009 WARNING: UNKNOWN_ERROR it can mean that an older SRA is in use. Make sure you are using v1.0.2.1 or later.
Which makes v1.0.2.1. The instructions above are not quite right; there is an issue with the path or format. It would be good if someone could test this.

Test Recovery fails with already accessing image error message
If a test failover with RecoverPoint fails, and near the bottom of the very large error in the history report you see a message about already accessing image, then the recovery side (target) LUN was already set for access before the SRA tried to set it for access, and this generates the error.

DMX and RecoverPoint
This is from a support engineer on how an RP issue with a DMX was fixed.

Connecting ESX to the DMX requires the SPC2 bit to be set on the DMX array. The bit was set on the FA ports that the ESX host was connected to, though somehow the HBA WWNs were excluded on the symmask list. Additionally, the RPA connections to those LUNs on the DMX did not have any SPC2 bit setting in place on the FA ports or for the RPA initiators. This caused a mismatch between the LUN UIDs that SRM saw on the ESX host and those seen by the RecoverPoint Appliances. RecoverPoint engineering identified this and the necessary changes were made in the lab.

After the SPC2 bit was set, the HBAs on the ESX hosts had to be reset, as well as the RPAs at both the source and target sites. The target site RPAs had to be reset because the target site journals actually resided on the same source site DMX, as part of this specific lab environment (it would not be set up like this in an actual implementation). That is why the SPC2 bit setting issue hit both the source and target SRM implementations, even though the RecoverPoint target storage was on Clariion (which does not require the SPC2 bit).

The lesson learned is to make sure that the SPC2 bit setting is in place for both the ESX hosts and the RPAs for those LUNs before deploying SRM. The SPC2 bit can be set at the FA port level or at the initiator level.
EMC Q & A
Q: What happens if SRM uses fewer LUNs than those contained in an RA group?
A: EMC's newly published adapter populates the consistency GroupId field of the SRM XML specification by defining all of the R1 LUNs in an RA group as part of the same consistency group, which means SRM will always try to fail over those LUNs as a unit regardless of the presence of VMs on them.

Q: What if the adapter does not have visibility to all of the LUNs in the RA group?
A: EMC recommends the adapter use EMC Control Center (ECC) as the management server for manipulating the RDF devices. ECC should have the ability to manage any device regardless of its visibility to any host, and if not, then any script (not just an SRA) to manage that group would be impossible.

Q: What if VMs not part of a recovery plan use the same RA group?
A: A customer that replicates VMs using SRDF but does not recover them as part of a recovery plan is probably wasting bandwidth replicating data that is never consumed, so this is probably not a best practice. EMC codes their adapter to best practices.

Q: What if the RA group contains LUNs not used by ESX hosts?
A: This question can be turned against any DR software. What about the script that tries to fail over the LUNs and is running on the non-ESX hosts (e.g. a Solaris cluster)? It will impact SRM protected VMs. Using the SRM API there is always a way to integrate failover among disparate clusters.

Q: Why is the test recovery required to be performed on the entire RA group when it is possible to snapshot a LUN using BCVs?
A: This is a good question that was posed to the EMC engineering team. My understanding of their answer is that if an RA group is created, it represents a consistency set of data that must be tested together, such as a multi-tiered application that requires cross-application consistency.
FalconStor
You should be aware that the FalconStor replication product will not allow a takeover of the replicated data if there are still hosts with live iSCSI connections to the primary volumes. This is designed behavior by FalconStor. In a DR situation this would NOT be an issue since the primary volumes would not exist, but it is an issue for test modes. You can address this by disconnecting the primary ESX hosts from the primary targets prior to failover, or you can script this disconnect as part of the recovery plan (you could use a script callout for this, after the primary VM shutdown but before the secondary storage recovery). Not correct any longer.

It is reported that only FalconStor has a product that integrates with SRM with this particular requirement. I have not been able to confirm that this behavior exists with both IPStor 5.1 and IPStor 6.0, but at this time I believe it does. 4/7/09 Update: this is not an issue with current versions of the SRA.

The FalconStor virtual appliance is both a gateway and a storage device. FalconStor is better known as a company that provides gateway products rather than storage. This is very useful in the BC / DR space.
when the NSS Appliance had not been fully patched. By installing all patches as of 3/17/09 this error went away.
IBM
IBM DS4000/5000
IBM branded SRA (SMSRAinstaller-WS32-101.01.35.06.exe) on Win2K8 will not install
This LSI SRA will not install on Windows 2008; the installer simply quits. You can, however, run it in compatibility mode and it will install fine. Right click on the installer, select Compatibility, and then run it as Win2K3 (SP1). It will proceed fine.
Old information
The IBM DS4xxx SRA, when installed, has two issues that stop it from working. The first is that the correct path to perl.exe is not set. The second is that there are capitalization errors in two files. Both of these are mentioned in the readme, so make sure to read it.

The path that needs to be added to the environment is c:\Program Files\VMware\VMware Site Recovery Manager\external\perl5.8.8 . You can confirm this by looking for the directory that perl.exe is in. You should also add c:\program files\VMware\VMware Site Recovery Manager\scripts\SAN\IBM to the path.

The two files that you need to edit are command.pl and common.pm, and they are both in the C:\Program Files\VMware\VMware Site Recovery Manager\scripts\SAN\IBM folder. Look for the $XML_RETURNCODE = Returncode line and change Returncode to ReturnCode.

Another path issue that has been reported (11/26/09) shows itself with errors when trying to configure the array. This has been reported for both the IBM SVC and DS8K. It is a version difference in the path. To solve this issue you need to change the following file:
C:\Program Files\VMware\VMware Site Recovery Manager\config\vmware-dr.xml
You will need to look for the line under SanProvider and change the ConfigPath variable.
<ConfigPath>=C:\Program Files\VMware\VMware vCenter Site Recovery Manager\scripts\SAN\</
to read:
<ConfigPath>=C:\Program Files\VMware\VMware Site Recovery Manager\scripts\SAN\</
It is important to note that you need to restart the SRM service to make this work. As well, no other vendor SRAs will work once this change is made! And one day the IBM SRA will use the new SRM 4.0 path and you will then need to return the file to the way it was!
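Because this change has to be undone later, it can help to script it and keep a backup. The following is a minimal Python sketch, not a supported tool: it treats vmware-dr.xml as plain text and swaps the folder name quoted above, and it assumes the file is UTF-8 encoded. The function names and the .bak suffix are made up for illustration. You still need to restart the SRM service afterwards.

```python
import shutil
from pathlib import Path

# Folder names as quoted in this guide for the ConfigPath line.
OLD_DIR = r"VMware vCenter Site Recovery Manager\scripts\SAN"
NEW_DIR = r"VMware Site Recovery Manager\scripts\SAN"


def swap_config_path(text, old=OLD_DIR, new=NEW_DIR):
    """Return (updated_text, changed) with the SAN scripts folder replaced."""
    updated = text.replace(old, new)
    return updated, updated != text


def patch_file(xml_path):
    """Back up xml_path next to itself, then rewrite it if the old path is present."""
    path = Path(xml_path)
    original = path.read_text(encoding="utf-8")  # assumption: file is UTF-8
    updated, changed = swap_config_path(original)
    if changed:
        # Keep a copy so the file can be returned to the way it was.
        shutil.copy2(path, path.with_name(path.name + ".bak"))
        path.write_text(updated, encoding="utf-8")
    return changed
```

Calling swap_config_path with old and new reversed gives the way back when the IBM SRA moves to the new path.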
IBM DS8000
This SRA has a configuration utility that is case sensitive. The manual does talk about the utility, but does not mention that it is case sensitive! For example, when you configure a field with p4, it will not work if the array sees that field as P4.

The DS8000 is only able to pause all of the replicated LUNs. This means that if you replicate three LUNs and one of them is not managed by SRM, it will still be impacted during a failover: it will be paused along with the LUNs that SRM manages and uses. This is an IBM issue, and hopefully in 2Q2011 there will be a microcode update for the DS8000 that will fix this. It is not currently known if an SRA update will be required as well.
IBM XIV
This new hardware is supported by SRM 4.x and information on how it can be made to work is in http://communities.vmware.com/docs/DOC-12372 .
IBM SVC
IBM recently released additional SRA support for other devices, namely SVC. Check the Storage Partners compatibility matrix for the specifically supported hardware info. With 1.20.10713 it seems they have fixed most of the issues below, but there are still some things to understand.

If the customer has only one IBM SVC Console to manage both the protected and recovery arrays, you will run into the errors mentioned in http://kb.vmware.com/kb/1013643 . These are easily solved by having two consoles.

As well, the IBM SVC SRA installs a utility called IBMSVCSRAUtil.exe on the SRM server desktop. You need to understand this utility; the document included with the SRA explains it starting on page 11. Thanks again to Brock for this!

Some things to note about the SVC SRA include:
My experience is that it had the perl issue mentioned above.
Consistency Groups are not supported.
SVC host object names must be 9 characters or less.
LUNs assigned to ESX must have vmware in the SVC's hostname (see http://kb.vmware.com/kb/1013616 for examples when this is not done).
The path error mentioned above, about perl.exe not being in the path, is a problem for the SVC SRA.
If the test recovery fails around 3%, make sure that the flashcopy is deleted. It is likely still there due to a failed test.
If the test recovery fails around 14%, it is likely due to the configuration and you should check it. In my case, it was that my management server also managed the recovery side.
I have confirmed that the SVC SRA supports simultaneous recovery plan operation.
In addition, my first experience with the SVC SRA was with a client that had IBM install the SVC hardware. They followed best practices, but that meant that the SRA did not work. The problem was that the SVC management host on the protected side, and the one on the recovery side, could each manage either side of the SVC. So when the SRA was installed on the SRM server and pointed at the protected side SVC management host, it was confused because it saw both sides. The workaround is to put a management agent on the SRM server (on each side) and make sure it can only see and manage the one side it is assigned to! This was using the 1.0.1 SRA from IBM as well as the unreleased next build of it.
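The two naming rules in the SVC notes above (9 characters or less, and vmware in the name) are easy to check before a test recovery. Here is a minimal Python sketch; the function name is made up for illustration, and matching vmware without regard to case is an assumption, since the guide does not say whether case matters.

```python
MAX_HOST_NAME_LEN = 9  # "9 characters or less", per the SVC notes above


def check_svc_host_name(name):
    """Return a list of problems found with an SVC host object name."""
    problems = []
    if len(name) > MAX_HOST_NAME_LEN:
        problems.append(f"{name!r} is longer than {MAX_HOST_NAME_LEN} characters")
    # Assumption: the vmware marker is matched without regard to case.
    if "vmware" not in name.lower():
        problems.append(f"{name!r} does not contain 'vmware'")
    return problems
```

An empty list means the name passes both checks; anything else tells you which rule to look at.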
Dell EqualLogic
Dell firmware 4 requires the SRA that is 1.01. Our site currently has version 1.0. If you use 1.0 with EqualLogic firmware 4 you will have a failover fail with a missing share LUN flag on the disk array management. You can find help installing the SRA at the URL below. It is for an old version of SRM but it still is helpful. http://www.equallogic.com/resourcecenter/assetview.aspx?id=5261
Compellent
Currently there is a small issue with installing the SRA on Win2K8. It is due to registry security issues. Make sure that the key HKEY_LOCAL_MACHINE\SOFTWARE\VMware, Inc.\VMware Site Recovery Manager exists and that its InstallPath value holds the path to the SRM installation folder, without quotes. The regedit type file would look like:
Windows Registry Editor Version 5.00

[HKEY_LOCAL_MACHINE\SOFTWARE\VMware, Inc.\VMware Site Recovery Manager]
"InstallPath"="C:\\Program Files (x86)\\VMware\\VMware vCenter Site Recovery Manager"
Once you have corrected the registry, install the SRA again, restart the service again and it should work fine.
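The doubled backslashes in the regedit file are easy to get wrong by hand. The sketch below, in Python, builds the same file text from an ordinary Windows path. It is only an illustration: the function name is made up, and saving the result in an encoding that regedit accepts is left to you.

```python
def build_reg_file(install_path):
    """Return .reg file text that sets InstallPath to install_path."""
    # String values in a .reg file double each backslash.
    escaped = install_path.replace("\\", "\\\\")
    lines = [
        "Windows Registry Editor Version 5.00",
        "",
        r"[HKEY_LOCAL_MACHINE\SOFTWARE\VMware, Inc.\VMware Site Recovery Manager]",
        f'"InstallPath"="{escaped}"',
        "",
    ]
    return "\r\n".join(lines)
```

Pass in the folder where SRM is actually installed on your server; the path shown in the example above is just the one from this case.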
HP
EVA
I have heard, but have not confirmed, that after a failover the direction will be reversed and replication started. This can be a big surprise and an issue if there is low bandwidth. Confirmed.

There is an excellent best practices guide for working with HP and SRM at: http://h20195.www2.hp.com/V2/GetDocument.aspx?docname=4AA2-8848ENW&cc=us&lc=en

An online guide that helps with the EVA can be found at: http://h20000.www2.hp.com/bizsupport/TechSupport/Document.jsp?lang=en&cc=us&taskId=120&prodSeriesId=499896&prodTypeId=18964&objectID=c01493772

In my experience working with an EVA in a PofC I found the above guide very useful. It did neglect to mention several things:
You need to select, or deselect, the arrays in the array configuration depending on which side you are working with.
You need the management IP address, account name and password fields filled in for both sides! This means both values (protected site and recovery site) are entered in each field, separated by a semicolon.
I could not find anywhere that mentioned that the mode of access should be set to none.
When you start a failover (not a test) the replication will automatically be reversed. Watch out for this, as it can cause issues if you have a slow connection!
You need to set the rescan to two times. Find out how in How can I configure a second HBA rescan?
Background Information
The info below may not be necessary for the latest SRA, but it might be useful for background information. The version / firmware information below is what works with the HP StorageWorks EVA Virtualization Adapter version 1.

HW Models    Firmware              CommandView
EVA4000      XCS 6.1xx or 6.2xx    v 8.0.1
EVA4100      XCS 6.1xx or 6.2xx    v 8.0.1
EVA4400      XCS 9.xxx             v 8.0.1
EVA6000      XCS 6.1xx or 6.2xx    v 8.0.1
EVA6100      XCS 6.1xx or 6.2xx    v 8.0.1
EVA8000      XCS 6.1xx or 6.2xx    v 8.0.1
EVA8100      XCS 6.1xx or 6.2xx    v 8.0.1

Miscellaneous information
Replicated vdisks on the EVA MUST be zoned to both sites (but must be created on the primary Command View server).
Replicated vdisks must be in a DR group and ALL vdisks in that DR group must be used by the primary / recovery site hosts. If vdisks in the DR group are used by other hosts then the SRA discards them.
If using two Command View EVA management nodes then both IP addresses must be entered in the SRA config wizard, primary first, then secondary, separated by a ;.
The recovery site EVA must have enough space to contain the snapshot volumes.

It has been seen once that the SRM service needed a domain account to talk to the Command View EVA. I have not been able to confirm this (no HP gear in my lab other than LHN), and it might depend on whether the Command View EVA is local or not. I have confirmed in a VMware QA lab that this change was not required. The issue showed up as an error 4 in the SRM logs: when discoverluns runs as part of the setup through the SRM plug-in, it fails with error 4.

When using HP Command View with HP EVA, the HP best practice is to run the CV servers active / active, where the CV server in the protected site manages its EVA actively and the remote EVA passively, and similarly the CV server at the recovery site manages its EVA actively and the protected site EVA passively. Then if either CV server fails, the other will take over with no manual intervention required.
In some combinations of storage hardware and FC drivers, the driver does not deliver information about new LUNs to the ESX kernel in a timely fashion, so on a rescan the ESX kernel does not learn about the LUNs. A second rescan is necessary in order to deliver info about the new LUN to the ESX kernel. You can configure SRM 1.0.1 Update 1 to do a second scan if necessary.

A workaround for the EVA is available right now. Use the following steps to have a successful test. After setting up the replication pair between two EVA arrays, at the secondary array take a manual snapshot of each of the target vdisks and present those snapshots to the ESX host using the default LUN number that the EVA management system picks (usually the lowest available LUN number). Then do a rescan twice on the ESX host so it discovers those LUN numbers for the first time. Then destroy the snapshot. Now if you do a test recovery, the SRA for EVA will create a snapshot of each of the replicated LUNs and present those snapshots to the ESX host using the default LUN number. Since those numbers will already have been discovered by ESX because of the manual steps above, only a single scan will be required, so the test will succeed.
Real solutions are not far off for this problem. It has been confirmed that some HP arrays will need a second rescan to make the recovered LUNs visible. You can find information on how to manage that elsewhere in this document.

HP Storage Virtualization EVA Adapter configuration information
http://h20000.www2.hp.com/bizsupport/TechSupport/Document.jsp?lang=en&cc=us&taskId=120&prodSeriesId=499896&prodTypeId=18964&objectID=c01493772

SRM has to be set up with HP Command View (CV) EVA so that the DR site always hosts the HP EVA failover primary (i.e. manages it actively). This should make sense, since in a failover you would lose the primary site and would not really want to have to manually make CV active.

One area that seems to cause HP EVA customers a lot of problems is working out which CV setup / config they have in place, and then working out how this should be entered into the GUI in SRM.

Example: let's assume you have two Command View (CV) EVA servers, one in datacenter A (DC A) and one in datacenter B (DC B), and in each datacenter you have an EVA: EVA-A in DC A and EVA-B in DC B. Usually HP will recommend that the CV servers are configured active/passive, with each CV server actively managing the EVA in its local DC and passively managing the EVA at the opposite DC. So in this example we would have:
CV-DC A manages EVA1 actively, EVA2 passively
CV-DC B manages EVA2 actively, EVA1 passively

When entering the information in the "Configure Array" wizard you can include the IP addresses for both CV servers on the same line, separated with a ";". This also assumes both CV servers can be accessed using a single login username / password. When you hit the "Connect" button two storage arrays will appear. When entering the protected side info, simply check the box for the local EVA at the protected site, and when you get to the recovery side screen select the other EVA.

Some customers have an alternate configuration, which is NOT HP best practice:
CV-DC A manages EVA1 actively and EVA2 actively
CV-DC B manages EVA2 and EVA1 passively

The HP SRA cannot associate vdisks with DR groups when connected to a passive Command View host, and I think this configuration has caused some customers issues during the setup stage. I believe this second config works, but you need to be careful in the config array wizard as we cannot currently force the passive CV to become active.

Below are some other checks that you may want to look at.

Verify that the vdisks are correctly presented to the ESX hosts at both sites. I have seen issues where customers don't have the access method set correctly for the vdisks. The HP documentation seems to make customers believe that the replica vdisks at the recovery site need to be made accessible to the recovery site ESX hosts at all times, i.e. read/write. All that is actually required (as with other replicated array configs) is that the replicated LUNs, at the LUN device level, are in the same zone as the ESX hosts at the recovery site (i.e. within VC they will appear on rescan in the storage adapter screen but not in the storage / VMFS datastores screen by default).

Other things we have seen include:
Replicated vdisks on the EVA MUST be zoned to both sites (and must be created on the primary Command View server). Check they have done this.
Replicated vdisks must be in a DR group and ALL vdisks in that DR group must be used by the primary / recovery site ESX hosts. If vdisks in the DR group are used by other hosts then the SRA discards them. So again they need to verify this.
If using two Command View EVA management nodes (as described above) then both IP addresses must be entered in the SRA config wizard, primary Command View EVA first then secondary Command View EVA, separated by a ;
The recovery site Command View EVA should be defined as the site that is the failover primary.
The customer must ensure the recovery site EVA has enough space to contain the snapshot volumes.
The SRA produces a log (hpsrmeva.log) which is a good place to look for other error messages.

We have seen that sometimes the issue is a misconfiguration of the SRA / Array Manager. During the setup, because of the way Command View works, you are presented with both EVAs in the Protected Arrays and Recovery Arrays screens. You need to uncheck the relevant EVA at each screen. Failure to do so can cause issues when you run test plans.
Dell EqualLogic guide - http://www.equallogic.com/uploadedFiles/Resources/Tech_Reports/TR1039-Dell-EqualLogic-PS-Series-SAN-and-VMware-SRM.pdf
Using EMC SRDF Adapter for VMware Site Recovery Manager - http://www.vmware.com/files/pdf/VMware_SRM_SRDF_bestpractices.pdf
VMware vCenter SRM in a NetApp Environment - http://media.netapp.com/documents/tr-3671.pdf
VMware Uptime blog (VMware and Business Continuity) - http://blogs.vmware.com/uptime
Availability Zone of VI:OPS - http://viops.vmware.com/home/community/availability - this includes a lot of lab setup info for various storage arrays.
LeftHand Networks SRA Failback Procedure for SRM - http://www.lefthandnetworks.com/document.aspx?oid=a0e0000000000NxAAI
SRM in a can, EMC with automated failback info - http://virtualgeek.typepad.com/virtual_geek/2009/07/updated-site-recovery-manager-in-a-can-doc-nowwith-extra-emc-automated-failback--.html
HP Storage Virtualization EVA Adapter configuration information - http://h20000.www2.hp.com/bizsupport/TechSupport/Document.jsp?lang=en&cc=us&taskId=120&prodSeriesId=499896&prodTypeId=18964&objectID=c01493772
<string>credentials</string>
<string>authentication</string>
</array>
<key>Language Features</key>
<dict>
<key>Identifier and Keyword Characters</key>
<string>ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz</string>
<key>String Pattern</key>
<string>(error|warning|critical)</string>
EditPlus
Use the information below to create a text file called log.stx.
#TITLE=XML
; XML syntax file written by ES-Computing.
; This file is required for EditPlus to run correctly.
#DELIMITER=[]:()
#CASE=y
#KEYWORD=Error
error
ERROR
#KEYWORD=Warning
warning
WARNING
#KEYWORD=Verbose
verbose
info
trivia
#
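If you do not have EditPlus handy, the same keyword groups can be used to pull the interesting lines out of a log from a script. The following is a minimal Python sketch: the keyword lists are copied from the syntax file above, while the function names and the whole-word, case-sensitive matching are assumptions made for illustration rather than a description of how EditPlus itself matches.

```python
import re

# Keyword groups copied from the log.stx definition above.
KEYWORD_GROUPS = {
    "Error": ("Error", "error", "ERROR"),
    "Warning": ("Warning", "warning", "WARNING"),
    "Verbose": ("Verbose", "verbose", "info", "trivia"),
}

# One case-sensitive, whole-word pattern per group.
_PATTERNS = {
    group: re.compile(r"\b(?:" + "|".join(map(re.escape, words)) + r")\b")
    for group, words in KEYWORD_GROUPS.items()
}


def classify(line):
    """Return the first keyword group found in line, or None."""
    for group, pattern in _PATTERNS.items():
        if pattern.search(line):
            return group
    return None


def filter_log(lines, wanted=("Error", "Warning")):
    """Keep only the lines whose group is in wanted."""
    return [line for line in lines if classify(line) in wanted]
```

Groups are checked in the order Error, Warning, Verbose, so a line that contains both an error and a warning keyword is reported as Error.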
Lab Exercises
Below are a number of lab exercises from the Partner Exchange 2010 SRM bootcamp lab. They may be useful to people new to SRM.
3. Close SQL Server Management Studio Express.
Create the ODBC data source connection for SRM
4. Start > All Programs > Administrative Tools > Data Sources (ODBC)
5. Select the System DSN tab.
6. Click the Add button to open the Create New Data Source window.
7. Scroll down the list and select SQL Native Client.
8. Provide a Name and Description (optional) for the Data Source connection.
9. Using the Server drop-down menu, select the local server\database. The number in the server name must match the station you are sitting at. Example: Station 16 would select STU16-VCA\SQLEXP_VIM when installing in the Protected Site, STU16-VC-B\SQLEXP_VIM when installing in the Recovery Site.
10. For authentication, keep the default settings (which should be With Integrated Windows authentication).
11. Change the default database to the SRM database instance name.
12. On the last window, keep the default settings and click Finish.
13. Click the Test Data Source button which should result in a test completed successfully message.
14. Click the OK button to close the ODBC Data Source Administrator window. This step completes the setup of the SRM database and the ODBC data source configuration.
18. Accept the license agreement and click the Next button.
20. Enter the vCenter Server address and credentials. The number in the server address must correspond with the station you are sitting at. Example: Station 16 would select STU16-VCA\SQLEXP_VIM when installing in the Protected Site, STU16-VC-B\SQLEXP_VIM when installing in the Recovery Site.
25. In the Database Configuration window, use the Data Source Name drop-down menu to select the ODBC DSN created earlier. Enter the database user credentials.
26. Click the Install button to initiate and complete the SRM installation.
SRM Plugin installation 27. Going back to the original VMware vCenter Site Recovery Manager Installer window, select the VMware vCenter Site Recovery Manager Plugin link to start the SRM plugin installation.
29. Accept the license agreement and click the Next button.
32. You can verify the SRM plugin installation by opening the vSphere Client and clicking on the Plug-ins menu item.
33. You can also click on Home in the menu bar and look for the Site Recovery button under Solutions and Applications.
35. Double-click the FalconStorSRA executable to begin the SRA installation. Click the Next button to begin the installation.
36. Accept the license agreement and click the Next button.
37. Enter the customer information and click the Next button.
38. You will be prompted for a keycode. This can be found in the text file located in c:\files\SRA. Copy and paste this keycode into the Keycode - License window. Click the Next button.
41. It is important to review the readme file included with SRAs. These often include information about features, known issues, etc. about the SRA that are important to know when implementing SRM.
First, connect the protected side to the recovery side (click Configure). Add the recovery site's VC (stuXX-vc-b; the short DNS name is sufficient):
Click next and accept the certificate on the error page (this is self-signed, so you should work with a valid one in your production environment):
Select FalconStor NSS Series from the Manager Type pull-down menu:
Click OK and your Protected Site Array Managers should look like this:
Click Next and repeat the process for Recovery Site Array Managers. It should look like this when completed:
Click Finish.
Select VM Network and click OK. Configure the Protected Apps Resource Pool to map to Recovery Apps:
Go to Protection Groups and click Create a Protection Group called Production Group:
Click Next and select the protected datastore. Note the VMs show up at the bottom:
Click Next and select the datastore labeled esxXXb-shadow for your placeholder VMs:
Click Finish.
To set up the Recovery Plan, you need to move over to the recovery side vCenter. If you are using vCenter Linked Mode like we are in the lab, it is easy to do. Just pull down and select the appropriate vCenter from the drop-down Site Recovery list in the breadcrumb trail:
You will get a similar window, but notice the Protection Setup section is almost empty. That's OK. We're not protecting anything here.
Go to Recovery Plans near the bottom and click Create. Name the Recovery Plan:
Click Next. There are no VMs that we are suspending, but feel free to look:
Confirm:
At around 54% on the progress bar, the default message will show:
Click Continue. Notice after it finishes step 11, all info disappears! Not to worry, SRM has saved it in a History Report. Click on the History tab and select your RP and click View:
That will pull up an HTML report detailing the entire Test process.
Lab 3 IP Customization
You have a healthy SRM implementation and are protecting hundreds of virtual machines. You now want to ensure your VMs are connected to the right network on the recovery side for testing and for real failovers. You have created a set of VLANs at the recovery facility that will be used during a test but since the IP information in the recovery facility is different than what you have in production you need a way to change the IP information for each VM during a test or actual failover. When you have only a few virtual machines in your recovery plan, it is relatively easy to create a customization specification to change the IP information and connect it to the individual virtual machines. However, when you have 50, 80 or hundreds of virtual machines it becomes much more time consuming to create a custom specification for each one. The bulk IP tool we use here is designed to make it easier to create custom specifications. This lab will help you understand this tool and how to use it.
Helpful Starters
The general idea when working with the Bulk IP Load tool is to create a CSV file that contains vital information about the virtual machines which are a part of the recovery plan. When first created, the CSV file contains a list of virtual machine names, and then several IP related fields adjacent to the name that may be modified to fit your needs. You use this template as a starting place to provide the proper IP information which will be required by the virtual machines at the recovery site. When complete, you import the CSV back into SRM. This process creates a series of customization specifications which will be read during the boot process for the affected virtual machines on the recovery side.
Procedure Hints
1. Run the dr-ip-customizer executable. You will find this utility in the VMware Site Recovery Manager program files folder under the bin directory. Think: You are working to customize a recovery plan, which side should you be working in?
2. You will need to execute the utility from a DOS prompt so that you may provide it with the correct parameters. To tell the utility to pull information from SRM, your command should look as follows:
   a. dr-ip-customizer.exe --cfg ..\config\vmware-dr.xml --csv c:\down.csv --cmd generate
3. You will be prompted to trust a server twice, and you will need to authenticate as well.
4. Did the command succeed? Tip: If the command worked, you should have a file named down.csv in the root of the C:\ drive.
5. You can now use Microsoft Excel to open the CSV file. NOTE: Excel is not installed on any of these machines. Instead, Excel.exe is wrapped in a VMware ThinApp package and simply runs as a self contained exe. Check out VMware ThinApp when you have a chance!
6. Please see below for an example of a clean dr-ip-customizer export.
7. Now you will introduce some changes to the file so the referenced virtual machines will be associated with new customization specifications. Follow these steps closely when modifying the file:
   a. The 2nd column contains the name of your virtual machine(s). For simplicity's sake, we will be modifying only 1 virtual machine and we will use the one on the bottom of your list.
   b. Click the row header to highlight the last row, and copy it to the clipboard.
   c. In the blank row directly below your last row, paste the contents of the clipboard. Tip: Before you paste, click on the first cell in that new row (in the A column).
   d. In your new row, change the value in the Adapter ID column from 0 to 1. Tip: The values for Adapter ID can range from 0 to 4. 0 means global, and 1-4 refer to specific adapters. Since we have only 1 adapter, its ID is 1.
   e. In the DNS Domain column, type vmworldtest.com.
   f. In the IP Address column, type dhcp.
   g. Save the CSV file.
8. Import the file back into SRM by executing the following command:
   a. dr-ip-customizer.exe --cfg ..\config\vmware-dr.xml --csv c:\down.csv --cmd create
9. You will be prompted to trust twice again and authenticate.
10. Verify your import was successful. Think! Dr-ip-customizer creates customization specifications. Where in vCenter can you view Customization Specifications? Tip: Go to the Home screen.
11. To complete this lab, you can run a test recovery and, when you get to the yellow pause message (in the Recovery Steps tab), go to the virtual machine which was associated with the customization specification and check out its IP information. The DNS domain should now be vmworldtest.com and it should be configured to use DHCP.
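With many virtual machines, the copy and edit in steps 7b to 7f is the part worth scripting rather than doing in Excel. Below is a minimal Python sketch of that edit for the last VM in the file. The column header names (Adapter ID, DNS Domain, IP Address) are taken from the wording of the steps and are an assumption about the exact headers in the generated CSV, so check them against your own down.csv first. The function name is made up for illustration.

```python
import csv
import io

# Assumed header names, taken from the wording of the lab steps.
REQUIRED = ("Adapter ID", "DNS Domain", "IP Address")


def add_adapter_row(csv_text, adapter_id="1",
                    dns_domain="vmworldtest.com", ip_address="dhcp"):
    """Copy the last data row, then set the three fields used in the lab."""
    reader = csv.DictReader(io.StringIO(csv_text))
    fieldnames = list(reader.fieldnames or [])
    for name in REQUIRED:
        if name not in fieldnames:
            raise KeyError(f"column {name!r} not found in CSV header")
    rows = list(reader)
    if not rows:
        raise ValueError("CSV has no data rows")
    new_row = dict(rows[-1])           # step 7b/7c: duplicate the last row
    new_row["Adapter ID"] = adapter_id   # step 7d
    new_row["DNS Domain"] = dns_domain   # step 7e
    new_row["IP Address"] = ip_address   # step 7f
    rows.append(new_row)
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=fieldnames, lineterminator="\r\n")
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()
```

Read down.csv into a string, pass it through add_adapter_row, write the result back, and then continue with the import in step 8.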
Conclusion
In Lab 3, you utilized the bulk IP customization utility to alter the IP information for a virtual machine in a recovery plan. This utility created a customization specification and associated it with the virtual machine. When the virtual machine is powered on for test in the recovery facility, it automatically obtains custom IP attributes by taking direction from the customization specification. This is a considerably valuable tool, especially for very large SRM environments where IP information may need to be altered for a large number of virtual machines, or even all of them. Dr-ip-customizer saves administrative time and minimizes errors.
Helpful Starters
1. With scripting, syntax is important. To that end, remember to always use full path references and be sure of your spelling and punctuation.
2. Consider whether the script should be executed before the VM powers on, or after it boots.
3. You are applying extended attributes to a virtual machine that is being protected by SRM so that when it is recovered, the script will execute. Think: What side of SRM contains a list of the actual virtual machines that are being protected?
Procedure Tips
1. Be sure there are two scripts located on your SRM server. There should be scripts named call.cmd and test.cmd. Both should be located in c:\scripts. If you do not see these scripts, please let a lab proctor know.
2. On the Protected side, highlight your protection group in the left pane, and click on the Virtual Machines tab in the right pane.
3. For each virtual machine in the list, click Configure Protection.
4. Click Next through to the very last section, Post Power On.
5. Click Add Command to insert a call out to your control script. Type the following into the Add Command dialog box:
   a. C:\windows\system32\cmd.exe /C c:\scripts\call.cmd
6. Be sure to repeat steps 4 and 5 for each virtual machine in the protection group.
7. Click Finish to store the configuration. Your virtual machines are now configured to run a script post power on.
8. Flip to the recovery side and run a test failover.
9. The scripts are configured to record the date and time stamp (for lack of anything else more interesting) to a log file located in the c:\scripts directory of the SRM server on the recovery side.
10. Once the yellow banner appears in the Recovery Steps tab, open up the c:\scripts\test.log file. You should see date and timestamp entries.
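The contents of call.cmd and test.cmd are not reproduced in this guide. As an illustration only, the logging behavior described in step 9 (append one date and time stamp per run to a log file) could look like the following Python sketch. The function name append_timestamp and the timestamp format are assumptions; the lab scripts themselves are Windows command files and may format their entries differently.

```python
from datetime import datetime


def append_timestamp(log_path, now=None):
    """Append one date and time stamp line to the log file and return it.

    log_path is a hypothetical stand-in for c:\\scripts\\test.log.
    now can be supplied for testing; it defaults to the current local time.
    """
    stamp = (now or datetime.now()).strftime("%Y-%m-%d %H:%M:%S")
    # Mode "a" creates the file on the first run and appends afterwards,
    # so every recovered virtual machine adds its own line.
    with open(log_path, "a", encoding="utf-8") as log:
        log.write(stamp + "\n")
    return stamp
```

Because the file is opened in append mode, a test failover of a protection group with several virtual machines leaves one entry per post power on call out, which is what you look for in step 10.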
Conclusion
In Lab 4, you learned how to inject scripts into your SRM environment. Scripts are useful for a variety of purposes, from simple diagnostics and logging to more complex integration with the DNS environment, load balancers, and other services that may require updating on the recovery side.
Added information on case sensitivity, and a replicated LUNs issue with the DS8000 SRA (page 94).
Added a section on P2V DR (page 31).
Added some extra info on 4.1 SRM licensing (page 42). Also cleared up the 1.0 section a little (page 43).
Added info on an install error with Perl (page 72).
Added info on a script error (page 35).
Added info on a hard to troubleshoot / fix error with a pop-up error about not being able to protect a VM (page 72).
Added error info on a LUN that is not visible (not able to create a PG with it) to page 72, and also a new suggestion (number 14) in best practices on page 27.

10/16/10
Added a little text on the title page to explain this guide. Its information can be useful, or dangerous, so make sure you know what you are doing!
How to reset the Celerra root and NASADMIN accounts (page 82).
Account solution help for an error (page 71).
I added a little more info, and a blog reference, to SRM 4.1 licensing (page 42).
Added a little more detail to where SRM doesn't fit (page 8).
Added a section on protecting View desktops (be warned, it doesn't yet include SRM; page 30).
Some general readability improvements.
I added some details to the best practices section (page 27).
Added a new section to the IBM SRA section, IBM DS4000/5000 (page 93).
Added a new issue, install hangs at 90%, on page 71.
Added PowerShell signature issue and solution on page 71.

8/7/10
Added info on upgrading to SRM 4.1 (page 11).
Added 4.1 build information.
Added info on tweaking the log parameters (page 37).
Added some info on best practices (page 27).
Updated minimum Alarm notification suggestions (page 48).
Added two URLs to help with HDS SRA installation (page 81).
Added some info on SQL authentication and starting SRM issues (see page 56).
Updated SRM scripting info in several places.
Updated the network device not found info (page 62).
Added some EMC video links.
Added Application References section (see page 30).
Added information on what VM parameters are not failed over (see page 35).
Added info on high priority start order and multiple protection groups (page 36).
Miscellaneous link and text updates, mostly spelling / grammar.

5/22/10
Added info on the "Network device needed by recovered virtual machine" error.
Added URL reference to the new SRDF techbook.
Added a brief note on when SRM is not a solution to consider.
Added a "Can I change the Run button to work like the Test button?" section.
Updated the "how to find a name of a VM when I have a MoRef" article.
Added an error / solution for operation timeout.
Added RecoverPoint TCP port info.

3/26/10
Added another HP link to online help for the EVA and SRM.
Added a solution to a Compellent SRA issue on Win2K8.
Added two new alarms to the recommended alarms.
Also provided info on how SRM 1.0 and SRM 4.x would handle an expired license situation.
Added info on thick LUNs to MirrorView.
Fixed error in script example.
Added some extra info about which storage arrays support application consistency.
Added info on SRDF / SRM EMC SRDF licenses.
Added KB article to IBM SVC for an error previously reported; the KB article has extra info.
Added an update to the Celerra and null issue.
Added link to issue / solution for CLARiiON issue.
Added a workaround for an issue with RecoverPoint and a CG with a large number of LUNs.
Added an SRDF white paper URL.
Added info in the script section to show how you could see the variables that are available during the run of the failover or test.
Corrected the path to the how to use trusted certificates document.
Corrected the path to the SSL and NetApp document.
Added to the EMC RecoverPoint section the issue with using the ~ in CG names.

2/4/10
Added some additional Script info, including a sample.
Added Labs to work with the SRM Boot camp at PEX. While they are designed for a specific lab, they are still useful for someone who wants to learn more.
Added some additional info on IP Customization.
Did some miscellaneous spelling corrections.
Added info on the null Celerra issue (it is supposedly fixed).
Added detail about protecting multiple tiered applications.
Added some basic detail on backups of SRM.
Added information on the three things that derail SRM projects and Proof of Concepts.
Added an additional suggested alert condition.
Added info on time required to protect VMs.

12/12/09
Added some additional info on redoing the SRM db.
Working on improving spelling and grammar. Slowly, but will keep at it.
Added info on the two NetApp SRA question.
Some additional info for the HP EVA.
Added additional info on the mirrorviewsracore.dll issue and SE info in the EMC MirrorView section.
Added a solution to SRM not starting with event log errors 7000 / 7009.
Added information on FalconStor SRA log levels.
Added info on what travels with a VM between recovery plans.
Added problem / solution of failed to connect to NFC.
Added information to change the concurrent power on value.
Some general cleanup and adding of info to various sections of the document.
Added information on expectations you can have for how long it takes to fail over.
Added some additional information on character limits for a number of tools.
Added some SPC-2 info to EMC DMX area.
Added info on avoiding shutdown tracker prompt.
Added a LHN error / solution (arrayxxxx not found).
For HDS, SRDF, and EVA, added info about the replication direction changed during a failover.
Added a section on vendors and their tools that can do application consistent replication.
Added a section on vendors who can do application consistent continuous replication.
Added info on how to change the MirrorView SRA to support 3 simultaneous recovery plans instead of the default and supported one.
Added some MirrorView port usage info to the EMC MirrorView section.
Added an MV error and solution, and background on MV in the general section.

11/27/09
Added info on Linux IP customization issue.
Added info on IBM path issues.
Re-initialize SRM database.
URL to SRM 4.0 performance whitepaper.
Added info on repairing in SRM 4.0 (Repair SRM in add remove instead of srmconfig).
Added info on incompatible host / resource pool etc.
Added info on extending script timeouts.
Updated, and corrected, SRM 4.0 license info in post Update 1 of vSphere.
Added problem and solution of the proxy issue after Update 1.
Added a screen shot for what the new license looks like.
Various edits and suggestions from Rob N. Thanks Rob!!
SRM build numbers added.
Some additional IP customization things to watch out for.
Also added some grayed out create PG general help in troubleshooting.
Corrected "Changing passwords after SRM is working" for the new SRM 4.0 method.
Added the null Celerra error.

10/30/09
Added additional info, and clarity, around how many VMs can be started.
Added info around why you cannot see a LUN during array config.
Added to the install section info on the 15-character limit for VC username.
Added how to catch the install and srm-config log files during install.
Some HP EVA info was added.
Added info on Heartbeat and SRM.
Finding a VM moid.
Added info on Symmetrix and duplicate WWN issue.
Added info on dr-ip-customizer issues.

10/2/09
Miscellaneous updates for SRM 4.0.
Correction for how many VMs can be started in an SRM 1.0 world, and added SRM 4.0 info.

8/8/09
Comments about starting vmware-dr.exe in the general troubleshooting section.
Fixed a misspelling in EditPlus section.
HP EVA setting issue added.

7/5/09
Added Linux customization log info.
Upgrade information for next release.
RecoverPoint log location.
Added symapi_c_net_handshake_failed error to SRDF section.

6/20/09
Added some additional info on upgrades.
Updated the app test plan.
Added a post install test outline.

6/13/09
Added info on scalability.

6/4/09
LeftHand VSA upgrade info.
Added info on syntax highlighting.
Information on uninstalling an SRA which stops SRM from working!
Some info on MS licensing and DR testing.
NetApp MetroCluster background info (thanks Lee!).
A RecoverPoint / DMX / SPC2 issue!

6/2/09
Added info on SVC and simultaneous recovery operations.
Additional info on log file component logging.
Added the error message to changing install passwords after install and how to fix.
Updated MirrorView for possible working of simultaneous recovery plan operation.
Added Celerra section to EMC and that it works with simultaneous recovery plan operation. Also said it works for RecoverPoint.
Celerra 2.0 SRA logs location.
Added additional info on uninstalls.
Some additional clarity on what the Repair button is for.
Two possible solutions for SRM not starting.
Added EMC SRDF error and solution to the Troubleshooting section and NOT to the EMC section.
Documentation bug: non-zero exit crash not true.
Added some info on what database corruption looks like.

Filename: Z:\Downloads\SRM Reference Guide_x.docx
Revision: 46
Last Save By: Michael White at 2/26/2011 16:16
Created by: VMware, Inc at 8/8/2010 10:52
Last printed at: 2/26/2011 16:16
Comments / suggestions / corrections / changes to MWhite@VMware.com