Hi There,
We've always had working vMotion in our cluster, and suddenly it stopped working.
There must be a reason for it, but we just can't seem to find it.
What happens is as follows: some machines cannot be vMotioned. Cold migration always works.
All ESXi hosts are in the same cluster. The logs do not provide any details; the error message is just generic:
"Timed out waiting for migration start request. The vMotion failed because the destination host did not receive data from the source host on the vMotion network. Please check your vMotion network settings and physical network configuration and ensure they are correct."
Here is what we've checked:
1. All settings: VMkernel ports, IPs, switches, jumbo frames.
2. We checked all logs, and there is nothing there that would indicate why migration fails.
3. We turned VAAI off and tested vMotion.
For some reason some VMs can be vMotioned and some cannot. All Windows 7 VDI machines vMotion across hosts with no problem. However, the majority of VMs with Windows Server 2008 R2 or Windows Server 2012 cannot be vMotioned. This is something new.
Any ideas / suggestions are appreciated.
Thank you
Are you able to vmkping the vMotion IP from source to destination and vice versa?
If yes,
try to migrate using the vSphere Web Client.
vmkping works, but vMotion does not.
What do you mean by migrating using the Web Client?
It does not work.
Hello,
Please see the following useful KB articles:
The first covers the exact error message as it appears in the vmware.log file.
The second is just a handy KB article about each step of a vMotion and what to troubleshoot.
Understanding and troubleshooting vMotion (1003734) | VMware KB
I hope this is useful to you!
Kind Regards,
RJ
Thank you. I will try this. I went through this a few days ago, but not in every detail. I will try again.
Have the KB articles helped resolve the issue, or are you still having a problem?
Kind Regards,
RJ
Try this:
1. On every host, check which port group is tagged for vMotion. If you have only one port group carrying both management traffic and vMotion, try unchecking vMotion on all nodes, re-enabling it, and checking again.
2. If the vMotion port group is separate, disable the vMotion port (uncheck it) and try vMotion across the same port group as the management port group to see if this works.
3. Why are *some* VMs working and some not? Is it across the same hosts that some VMs are able to vMotion? If so, what is the difference between the VMs that work and the ones that do not?
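For step 1, the tagging can also be checked from the ESXi shell. A rough sketch, assuming ESXi 5.1 or later esxcli syntax (the interface name vmk1 is a placeholder; substitute your own):

```shell
# List every VMkernel interface with its port group and MTU
esxcli network ip interface list

# Show which services (Management, VMotion, ...) are tagged on a given vmk
# (vmk1 is a placeholder for your vMotion interface)
esxcli network ip interface tag get -i vmk1
```

If the tags differ between hosts (e.g. one host carries both Management and VMotion on the same vmk while another doesn't), that mismatch alone can break migrations.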
Hi RJ,
We've tried; however, the problem is still there and we're kind of at a loss.
Thank you
Maybe this can help! ERROR: "Timed out waiting for migration start request. The vMotion failed because the destination host did not receive data from the source host on the vMotion network."
The above error indicates that the remote host did not accept the connection within the allowed time limit.
NOTE: It could be an issue with jumbo frames and MTU settings on the NICs and switches.
Multi NIC vMotion with jumbo frames on directly connected ESXi
Performing vMotion fails despite vmkping succeeding from source to target IP address (2042654)
Raul.
VMware VDI Administrator
vMotion and Management use the same port group.
vMotion fails across all hosts for the same machine. In other words, if a machine fails to migrate, it will fail across all hosts; however, a machine that does migrate will migrate to any host in the cluster.
Thank you Raul.
We had jumbo frames enabled. Then we changed the settings back to MTU 1500; that did not work.
Now we're changing back to jumbo frames, so we're in the process of doing that.
It's all just frustrating. It seems there is something we are obviously overlooking, but we don't know what, and the lack of information in the VMware logs does not help.
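While re-enabling jumbo frames, keep in mind the MTU has to match on the vSwitch, the vmk port, and every physical switch port in the path. A rough check sequence, assuming standard vSwitches (vmk1 and the destination IP are placeholders):

```shell
# MTU configured on each standard vSwitch
esxcli network vswitch standard list

# MTU on each VMkernel interface
esxcli network ip interface list

# End-to-end jumbo test: 8972 bytes of ICMP payload + 28 bytes of
# IP/ICMP headers = 9000, with -d forbidding fragmentation
vmkping -I vmk1 -d -s 8972 <destination_vmotion_ip>
```

If the 8972-byte ping fails while a small ping succeeds, some device in the path is still at MTU 1500.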
It's definitely a connection issue from the destination host to the source host; you must check every single VMkernel IP. I had the same issue when the VMkernels got the same IPs on different VSSes, and I fixed the problem just by replacing the IPs with new ones, in order, and creating an "Exclusion Range" in DHCP. Try that!
Raul.
VMware VDI Administrator
I will do that. And currently working on it.
By the way: did you have the same issue, with some machines migrating from the same host and some not?
I would try (if you haven't already) creating a VMkernel port just for vMotion on its own separate vSwitch for each host. This way it's completely separate from everything else, and you can then enter all the configuration again and ensure the IPs, subnet masks, default gateway, and possibly the VLAN are correct.
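If you want to script that dedicated setup across hosts, a sketch along these lines should work (every name, uplink, and IP below is a placeholder; the tag command needs ESXi 5.1+, on older builds tick the vMotion box in the vSphere Client instead):

```shell
# New vSwitch dedicated to vMotion, with its own physical uplink
esxcli network vswitch standard add -v vSwitch2
esxcli network vswitch standard uplink add -v vSwitch2 -u vmnic3

# Port group and VMkernel interface on that vSwitch
esxcli network vswitch standard portgroup add -v vSwitch2 -p vMotion-PG
esxcli network ip interface add -i vmk2 -p vMotion-PG
esxcli network ip interface ipv4 set -i vmk2 -t static -I 192.168.50.11 -N 255.255.255.0

# Tag the new vmk so it carries vMotion traffic
esxcli network ip interface tag add -i vmk2 -t VMotion
```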
For the MTU settings, like Raul said, I guess you checked both the NIC and the physical switch?
Kind Regards,
RJ
Yes! Some VMs migrated and the others did not, from the same host, same cluster, same datacenter, same domain.
Oh, this is the same as what happens here. I will create exclusion ranges in DHCP.
Thank you! Will test and let you know.
Yes, RJ, we've tried that with two hosts.
I have 12 hosts in the production cluster. So we took two hosts, separated vMotion and Management on them, and tried migrating a VM from one host to the other, and vMotion failed.
I am now checking every IP on the network. The hosts' IPs have already been checked; however, everything needs to be double-checked.
Thank you
Hello everyone. I have the same issues as you. Did you fix it in the end?
Best regards
Vladimir
Check whether you have more than one VMkernel enabled for vMotion. If you do, disable the others and keep just one vMotion vmk per host, all in the same subnet.
Try this to check your jumbo frames peer to peer (-I selects the vmk interface, -d sets the don't-fragment bit, -s sets the payload size):
vmkping -I <vmotion_vmk_number> -d -s 8000 <dst_vmkernel_ip>
vmkping -I <vmotion_vmk_number> -d -s 1300 <dst_vmkernel_ip>
I would also try enabling only the management vmk, usually vmk0, to perform the vMotion tasks.
Lastly, check that the vMotion port, TCP 8000, is allowed. It should be allowed in and out by the ESXi internal firewall, but also check your firewall or any port filtering in your network.
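The firewall side can be verified from the ESXi shell; something like this should do it (the ruleset id "vMotion" is the standard built-in one, and nc ships with ESXi; the destination IP is a placeholder):

```shell
# Confirm the vMotion ruleset is enabled and list its ports
esxcli network firewall ruleset list | grep -i vmotion
esxcli network firewall ruleset rule list --ruleset-id vMotion

# From the source host, test raw TCP reachability to the
# destination host's vMotion port
nc -z <destination_vmotion_ip> 8000
```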
Hello,
what does vmkernel.log say at that timestamp?
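If it helps, the relevant lines can usually be pulled out with a quick grep around the failure time, on both the source and destination hosts (standard ESXi log paths assumed):

```shell
# Recent vMotion/migration messages on each host
grep -i vmotion /var/log/vmkernel.log | tail -n 50
grep -i migrate /var/log/hostd.log | tail -n 50
```

The per-VM vmware.log in the VM's datastore folder usually records a migration ID for the failed attempt, which you can then grep for in the host logs on both sides.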