I work for a cloud service, and we host many customer private clusters across the country on one 6.5 U2g VCSA. We have one customer cluster, 6 hosts and 60 VMs, that keeps having VMs come up with the error "this virtual machine failed to become vsphere ha protected...". It's easy to correct, but it has to be done manually, by turning HA off and then back on. It's happened 4 times in the last 3 days. At this point I'm not even sure HA is really working, since it seems to break every time a VM migrates to a new host, and, no, it is not always the same source or destination host. I don't want a host failure to be the moment we discover it really isn't working, with a high-paying customer having 8-10 VMs down until they're manually restarted. The hosts are all physically identical, at the same firmware levels and the same version of ESXi, 6.5 U3, with Enterprise Plus licensing.
Is there a more permanent fix for this issue? I haven't been able to find anything in VMware's knowledge base other than the fix I'm already doing, which seems to last maybe until the next VM migration, or maybe not at all.
Moderator: Moved to Availability: HA & FT Discussions
Hey, hope you are doing fine
Can you disable HA and re-enable it? That sometimes resolves the issue.
What does fdm.log have to say about this?
Which version of ESXi do you have?
Does the error match this? https://kb.vmware.com/s/article/2020082
Yes, I have done that, 4 times in the last 3 days. It stays "fixed" until the next VM migration.
Can you share fdm.log so we can check whether there's an issue there?
Also, can you tell us a little more about your HA configuration? How is admission control configured? Are you using any reservations? Which datastores are used for heartbeating, and are they selected automatically?
Hey @dgingeri,
The first thing that doesn't look right is that your ESXi hosts are on a higher version than vCenter. Always ensure that your vCenter Server is equal to or higher in version than your ESXi hosts.
As @nachogonzalez mentioned, the fdm.log will have the details of your issue. When you check it, take a look at this KB because it applies to your version: https://kb.vmware.com/s/article/66928
Can you provide the fdm.log file (/var/log/fdm.log), with the timestamp and VM name too?
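To pull out just the relevant lines before posting, you can filter fdm.log by VM name and the "protect" keyword. A minimal sketch below; the log lines are hypothetical samples written only to illustrate the filtering (real fdm.log entries will look different), and `web01` is a placeholder VM name:

```shell
#!/bin/sh
# Hypothetical fdm.log excerpt, created here so the example is self-contained.
# On a live host you would skip this and grep /var/log/fdm.log directly.
cat > /tmp/fdm-sample.log <<'EOF'
2021-03-01T10:15:02.123Z verbose fdm[12345] [vm=/vmfs/volumes/ds1/web01/web01.vmx] protection state change
2021-03-01T10:15:03.456Z warning fdm[12345] [vm=/vmfs/volumes/ds1/web01/web01.vmx] failed to become vSphere HA protected
2021-03-01T10:16:00.789Z info fdm[12345] unrelated message
EOF

VM_NAME="web01"   # substitute the affected VM's name
# Keep only lines that mention the VM AND protection state.
grep -i "$VM_NAME" /tmp/fdm-sample.log | grep -i "protect"
```

This prints the two protection-related lines for the VM (with their timestamps) and drops everything else, which is usually all that's needed for a forum post or a GSS ticket.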
If you work for a cloud provider, then you're probably in the VSPP program. Why don't you open a ticket with GSS instead since this impacts one of your customers?
6.5 U2g VCSA
6.5 U3 ESXi
Am I right that there's a mismatch here? The VCSA version should be the same as or higher than the ESXi version.
6.5 U2g VCSA
6.5 U3 ESXi
The VCSA version should be the same or higher. You have VCSA on 6.5 U2g and ESXi on 6.5 U3, so your ESXi hosts are on a newer version than your vCenter.
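The rule above can be sanity-checked mechanically. This is a minimal sketch that compares only the update designators from this thread (U2g vs U3) with a version sort; comparing full build numbers against VMware's builds list would be more precise:

```shell
#!/bin/sh
# Illustrative check of the rule "vCenter must be equal to or newer than ESXi".
# The designators below are the ones reported in this thread.
VCSA_UPDATE="U2g"
ESXI_UPDATE="U3"

# sort -V (version sort) orders U2g before U3, so the first line of the
# sorted pair tells us which side is older.
LOWEST=$(printf '%s\n%s\n' "$VCSA_UPDATE" "$ESXI_UPDATE" | sort -V | head -n1)

if [ "$LOWEST" = "$VCSA_UPDATE" ] && [ "$VCSA_UPDATE" != "$ESXI_UPDATE" ]; then
    echo "MISMATCH: VCSA ($VCSA_UPDATE) is older than ESXi ($ESXI_UPDATE); upgrade vCenter first."
else
    echo "OK: vCenter is equal to or newer than ESXi."
fi
```

For this thread's versions the script reports a mismatch, which matches the advice here: upgrade the VCSA to 6.5 U3 (or later) before chasing the HA protection errors further.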
Please also get the fdm.log, as mentioned previously.
I realize this thread is over a year and a half old, but we just recently started running into this exact problem on our cluster. This cluster has been in operation for over 2 years, and we had never encountered this error until recently, and we ONLY encounter it when creating new VMs, either from a template or built from scratch. The one recent change in the cluster is that we migrated from an aging Fibre Channel NetApp SAN to a new NFS NetApp SAN, but I'm not sure whether that is merely coincidental; I don't see why the new SAN would be the cause. Disabling/enabling vSphere HA resolves the issue, but it's still troubling that this has suddenly started happening.
Anyway, here are the specifics of our cluster:
These are the latest versions of each, according to the VMware versions and builds info. Attached is an excerpt of the fdm.log for one of the affected VMs. I know this error is very low-risk, but I would still like to know why it has only recently started happening. I just don't want to run into a situation where a small problem grows into a larger one.
I have since learned that when this happens, the alert only appears on freshly moved or newly created VMs, but HA is actually broken for the ENTIRE cluster. It may not alert that HA isn't working, but it does not bring VMs back up if a host fails.
I have not been able to find a real solution either. The only fix I've found is rebuilding the hosts and the cluster: reformat and reinstall each host, re-add them to vCenter under a brand-new cluster one at a time, and move the VMs over by removing them from the old cluster's inventory and adding them to the new one, again one at a time. It's a pain in the behind, but it does eliminate the issue.