Hello,
Occasionally I get this error while running plans.
Let's say I have five VMs spread over two volumes. Three of the five may recover fine, but the others will generate this error. The VMs will be there, but the "change network settings" step (and everything after it) won't have been done.
When it happens, I can press "continue" to finish the task and re-run the test later, which will normally work.
Last time this happened, I found the following error in the vmware-dr.log file:
Section for VMware vCenter Site Recovery Manager, pid=1532, version=4.0.0, build=build-236215, option=Release
RecordOp ASSIGN: runtimeInfo.runtimeStatus, RSVm-87974
RecordOp ASSIGN: runtimeInfo.finishTime, RSVm-87974
Released VC LRO semaphore, token = '947'
RecordOp ASSIGN: info.error, RSGroup-87813SecondaryShadowVMRecover-7724
Error set to (dr.san.fault.RecoveredDatastoreNotFound) {
faultCause = (vmodl.MethodFault) null,
datastore = (dr.vimext.SanProviderDatastoreLocator) {
primaryUrl = "sanfs://vmfs_uuid:4bfe3ea4-fc89b582-71eb-0024817df61b/",
reason = (vmodl.MethodFault) null,
RecordOp ASSIGN: info.completeTime, RSGroup-87813SecondaryShadowVMRecover-7724
RecordOp ASSIGN: info.state, RSGroup-87813SecondaryShadowVMRecover-7724
Not Starting Tasks All Tasks Complete
MRT-DoneCallback Task RSGroup-87813SecondaryShadowVMRecover-7724 for RSGroup-87813
SetRuntimeStatus for RSGroup-87813 from running to error
RecordOp ASSIGN: runtimeInfo.runtimeStatus, RSGroup-87813
RuntimeInfoError (dr.san.fault.RecoveredDatastoreNotFound) {
faultCause = (vmodl.MethodFault) null,
datastore = (dr.vimext.SanProviderDatastoreLocator) {
primaryUrl = "sanfs://vmfs_uuid:4bfe3ea4-fc89b582-71eb-0024817df61b/",
reason = (vmodl.MethodFault) null,
RecordOp ASSIGN: runtimeInfo.finishTime, RSGroup-87813
RecordOp ASSIGN: runtimeInfo.runtimeFault, RSGroup-87813
FormatField: Optional unset (dr.san.fault.RecoveredDatastoreNotFound.reason)
Starting Task RSGroup-87994SecondaryShadowVMRecover-7942 for step RSGroup-87994
MultipleRecoveryTask Info max 1 cur 1 remain 5
RecordOp ASSIGN: info.startTime, RSGroup-87994SecondaryShadowVMRecover-7942
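When this error appears, the affected volume can be identified from the vmfs UUID inside the RecoveredDatastoreNotFound fault. Here is a minimal sketch of pulling those UUIDs out of vmware-dr.log; the line format is assumed from the excerpt above, so treat it as illustrative rather than a supported parsing method:

```python
import re

# Matches the primaryUrl line that follows a RecoveredDatastoreNotFound
# fault, e.g.: primaryUrl = "sanfs://vmfs_uuid:4bfe3ea4-...61b/"
UUID_RE = re.compile(r'primaryUrl = "sanfs://vmfs_uuid:([0-9a-f-]+)/"')


def failed_datastore_uuids(log_lines):
    """Return the set of vmfs UUIDs named in RecoveredDatastoreNotFound faults."""
    uuids = set()
    in_fault = False
    for line in log_lines:
        if "dr.san.fault.RecoveredDatastoreNotFound" in line:
            in_fault = True
        elif in_fault:
            m = UUID_RE.search(line)
            if m:
                uuids.add(m.group(1))
                in_fault = False
    return uuids
```

Feeding it the log excerpt above would give back 4bfe3ea4-fc89b582-71eb-0024817df61b, which you can then match against the datastores visible on the recovery hosts.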
The datastores are obviously available, as the VMs are being presented. We are running various EqualLogic SANs to host the VMs. I logged this with VMware a while ago and they suggested it could be caused by a loss of contact between the two VCs, but the error they found was actually from slightly earlier in the day and related to another issue (a timeout).
Any thoughts on this one?
With inconsistent errors registering a virtual machine with this error, I have found that setting the ESX servers to rescan for storage twice provides a resolution. There is actually a KB article outlining the storage arrays for which this is recommended. On the recovery site, edit the SRM advanced settings and set SanProvider.hostRescanRepeatCnt = 2 (the default value is 1). Try it, it can't hurt anything!
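You can also approximate what that setting does by hand, to confirm it helps before changing SRM. On each recovery-site ESX host, rescan the adapter twice and refresh the VMFS volume list; "vmhba33" below is a placeholder for the software iSCSI adapter name, which varies per host:

```shell
# Find the adapter name for the software iSCSI initiator on this host:
esxcfg-scsidevs -a

# Rescan the adapter twice, mimicking SanProvider.hostRescanRepeatCnt=2
# ("vmhba33" is a placeholder -- substitute your adapter name):
esxcfg-rescan vmhba33
esxcfg-rescan vmhba33

# Refresh the VMFS volume list so any newly surfaced datastore appears:
vmkfstools -V
```

If the manually double-rescanned datastore shows up reliably, the SRM advanced setting should address the intermittent failures during recovery.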
Tim Oudin
Hi,
Could you please give me the details of the primary and recovery site configurations: the number of ESX servers at the primary site, the number of nodes in the primary site cluster, and the number of ESX servers and cluster nodes at the recovery site? I think it may be a resource problem, because we faced the same issue due to a resource shortage at the recovery site.
Regards,
Vijaya
We have five DL380 G6s at our production site with:
Total CPU: 117 GHz
Total Memory: 319.96 GB
DR is smaller: we have four DL380 G5s.
Total CPU: 60 GHz
Total Memory: 73.99 GB
Not sure how much that tells you, but we also run a test environment at DR and keep it running while DR tests are ongoing. Although the DR boxes are slower and nowhere near as beefy (older-generation CPUs), so the VMs obviously run slower, I am generally not close to maxing out the host CPUs or RAM.
The other difference is storage. Production has much quicker storage because we have more units and more spindles, and we use multiple paths to the storage units. While the DR hosts are also configured for multipathing, VMware defaults newly connected volumes to the "fixed" path policy. The DR hosts also connect to volumes on a single SAN (48 disks at DR compared to 128 disks across six units in production).
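For what it's worth, the "fixed" default can be changed per device from the service console. On ESX 4.x the path selection policy is managed with esxcli nmp; the device identifier below is a placeholder, and you should verify the exact syntax against your build and your array vendor's multipathing recommendations before changing anything:

```shell
# Show devices and their current path selection policy:
esxcli nmp device list

# Switch one volume from Fixed to Round Robin
# (<naa_id> is a placeholder for the device identifier from the list above):
esxcli nmp device setpolicy --device <naa_id> --psp VMW_PSP_RR
```

This wouldn't explain the RecoveredDatastoreNotFound fault by itself, but it narrows the storage differences between production and DR when comparing test results.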
VMware are pointing at the SRA, but the storage is being presented to the hosts and the hosts are seeing it (I can see it listed as a datastore and I can browse it). I think SRM may be attempting to register or perform some operation on the failing VMs before the hosts have finished configuring the connection to the datastore (or something like that).
Did you try this scenario: create a recovery plan containing only one VM that fails with the network configuration error? Then we can pinpoint where the problem starts.
Regards,
Not specifically one VM, but I split the plan into three smaller plans and it does seem much more reliable (the smaller plans haven't failed). Even the one big plan would often work, though; the failures were intermittent.
Did you check whether the VM network is set to auto-configure in the recovery plan?
No, my test networks all go onto a specific VMPG on all my plans. Why do you think that could be a factor?
During SRM certification we used the auto network configuration only, because the SRM config guide from VMware says to set the network config to auto.
Can you test whether it works with auto config? Also, may I know what VMPG means? I am not familiar with the term.
Regards,
vijaya
VMPG = Virtual Machine Port Group.
I have set it to auto and all is OK at the moment. I am going to do some more testing over the next few days and will report back.
Also found this in the Admin guide:
"By default, the test network is specified as Auto, which creates an isolated test network. If you would
prefer to specify an existing recovery site network as the test network, click Auto and select the network
from the drop-down menu."
So it should work on networks other than auto...
Just got the error again with the networks set to auto.
Thanks Tim, there are a couple of those settings I am intending to look at: that one, and some of the timeout ones. I will make a couple of changes and see what happens.
Found KB 1008283 for reference, all thanks to @dawoo's response to my Twitter rant.
This refers to a total failure to recover a datastore but the concept is the same.
Tim Oudin
Well, after more testing yesterday, changing the rescan value to 2 seems to have done the trick. I did have a few timeout issues (which, when I have seen them in the past, have been down to communication between the two VCs), but the "datastore not found" error hasn't occurred in about 10 attempts. On Friday it was occurring maybe 40% of the time.
Is there anything new in the logs after the datastore mount failures?
Tim Oudin