Hi,
We have a weird issue that I can't figure out.
We have a 2-node vSAN cluster and a witness server, all on the same layer 2 network for management/witness traffic.
The 2 nodes also have direct-connect 10G for vSAN data (witness traffic separation, WTS, has been implemented).
Everything is working and healthy, but if I put 1 node into maintenance mode and disconnect the management NICs (we were doing switch maintenance; the vSAN data links aren't even touched, as they are direct connect), the VMs on the other node fail and VMware says connectivity to storage is lost...
Any ideas?
I logged a ticket with VMware and they haven't been able to figure it out yet...
Update:
I have narrowed down the issue
If you put a host in a 2-node vSAN cluster into maintenance mode and disconnect the vmnic that the witness traffic (with WTS implemented) is using, it terminates the VMs on the other node!
This happens regardless of your fault domain settings (I tried preferred and secondary, no difference).
It only happens if you have HA turned on and the host is in maintenance mode.
If you disconnect the witness traffic vmnics when the host isn't in maintenance mode, nothing happens.
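To make the trigger conditions easier to compare, here is a minimal Python sketch of the behaviour as observed in these tests. The function name and arguments are made up for illustration; it models only what was seen here, not how vSAN or HA work internally.

```python
def vms_terminated(maintenance_mode, ha_enabled, witness_nic_down):
    """Observed behaviour in this thread only (hypothetical helper).

    maintenance_mode: the host whose witness vmnic is pulled is in maintenance mode
    ha_enabled:       vSphere HA is turned on for the cluster
    witness_nic_down: the vmnic carrying witness traffic is disconnected
    """
    # VMs on the other node were terminated only when all three were true.
    return bool(maintenance_mode and ha_enabled and witness_nic_down)


cases = [
    ("maintenance mode + HA on + witness vmnic down", (True, True, True)),
    ("host not in maintenance mode", (False, True, True)),
    ("HA turned off", (True, False, True)),
]
for label, args in cases:
    print(f"{label}: VMs terminated = {vms_terminated(*args)}")
```

The fault domain setting (preferred or secondary) is deliberately not an input, since it made no difference in the tests.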
VMware has acknowledged this as a bug and escalated it to engineering.
Hello,
Thanks for sharing. I have been facing the same issue. VMware and HPE said that it is a bug.
We are waiting for a permanent fix. If you are using HPE servers and the HPE customized VMware image, there is a workaround. Please request it from VMware GSS. The issue seems to be related to the NICMGMTD daemon.
Thanks for the reply.
This is Dell, but it's not related to the ESXi ISO or hardware, as I can replicate it in my nested lab with the native ESXi builds.
I have figured out a few workarounds:
1) Don't put the host in maintenance mode when working on the witness vmnics (turn off DRS and move the VMs off it, of course)
2) Turn off HA while doing your maintenance
3) Change the advanced vSAN setting VSAN.AutoTerminateGhostVm to 0 on each host (not recommended, as vSAN won't terminate the VMs if a real host isolation occurs)
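For workaround 3, a rough Python sketch that just prints the esxcli commands to run on each host. It assumes the setting is exposed to esxcli under the option path /VSAN/AutoTerminateGhostVm (check with the list command first), and the host names are placeholders. It doesn't connect to anything.

```python
# Assumed esxcli option path for the VSAN.AutoTerminateGhostVm setting.
OPTION = "/VSAN/AutoTerminateGhostVm"


def ghost_vm_commands(value):
    """Return the esxcli commands to show and then set the option on one host."""
    if value not in (0, 1):
        raise ValueError("value must be 0 or 1")
    return [
        f"esxcli system settings advanced list -o {OPTION}",
        f"esxcli system settings advanced set -o {OPTION} -i {value}",
    ]


# Placeholder host names; replace with the two data nodes of the cluster.
for host in ("esxi-node-a", "esxi-node-b"):
    for cmd in ghost_vm_commands(0):
        print(f"[{host}] {cmd}")
```

Calling ghost_vm_commands(1) gives the commands to turn the setting back on once the maintenance is finished.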
What was the workaround VMware gave you?
They gave me a script that refreshes the "NICMGMTD" daemon when its memory allocation becomes full. I scheduled it to run every 5 minutes via crond.