vSAN cluster fails when host put into maintenance mode and disconnected from network

digitalbath · 07-21-2019

Hi,

We have a weird issue that I can't figure out.

We have a 2-node vSAN cluster and a witness server, all on the same layer 2 network for management/witness traffic.

The 2 nodes also have direct-connect 10G for vSAN data (Witness Traffic Separation, WTS, has been implemented).

Everything is working and healthy, but if I put 1 node into maintenance mode and disconnect its management NICs (not even the vSAN data NICs, as those are direct connect), the VMs on the other node fail and VMware says connectivity to storage was lost. (We were doing switch maintenance at the time.)

Any ideas?

I logged a ticket with VMware and they can't figure it out yet...

digitalbath · 07-25-2019

Update:

I have narrowed down the issue.

If you put a host in a 2-node vSAN cluster into maintenance mode and disconnect the vmnic that the witness traffic (with WTS implemented) is using… it terminates the VMs on the other node!

Regardless of your fault domain settings (I tried preferred and secondary – no difference)

This only happens if you have HA turned on and the host is in maintenance mode

If you disconnect the witness traffic vmnics when the host isn’t in maintenance mode… nothing happens

VMware have acknowledged this as a bug and have escalated it to engineering
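
The conditions reported above can be summarized as a small truth table. This is only a sketch of the behavior as described in this thread, not of vSAN or HA internals, and the function and parameter names are made up for illustration.

```python
from itertools import product


def vms_terminated(ha_enabled: bool, in_maintenance_mode: bool,
                   witness_vmnic_disconnected: bool) -> bool:
    """Model of the behavior reported in this thread (not of vSAN internals).

    VMs on the other node were terminated only when all three
    conditions held at the same time.
    """
    return ha_enabled and in_maintenance_mode and witness_vmnic_disconnected


def truth_table():
    """Return every combination of the three conditions with its outcome."""
    return [(ha, mm, nic, vms_terminated(ha, mm, nic))
            for ha, mm, nic in product([True, False], repeat=3)]


if __name__ == "__main__":
    for ha, mm, nic, bad in truth_table():
        print(f"HA={ha!s:5} maintenance={mm!s:5} "
              f"witness_nic_down={nic!s:5} -> VMs terminated: {bad}")
```

Only one of the eight combinations ends with VMs being terminated, which is why turning off HA or skipping maintenance mode avoids the problem.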

TolgaAsik · 07-28-2019

Hello,

Thanks for sharing. I have been facing the same issue. VMware and HPE said that it is a bug.

We are waiting for a permanent fix. If you are using HPE servers and the HPE customized VMware image, there is a workaround. Please request it from VMware GSS. The issue seems to be related to the NICMGMTD daemon.

digitalbath · 07-28-2019

Thanks for the reply

This is Dell - but it's not related to the ESXi ISO or hardware, as I can replicate it in my nested lab with the native ESXi builds.

I have figured out a few workarounds:

1) Don't put the host in maintenance mode when working on the witness vmnics (turn off DRS and move the VMs off it of course)

2) Turn off HA while doing your maintenance

3) Change the advanced vSAN setting on each host for VSAN.AutoTerminateGhostVm to 0 (not recommended as vSAN won't terminate the VMs if a real host isolation occurs)
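
For workaround 3, the per-host commands might look something like the sketch below. It assumes the setting is exposed to esxcli under the option path /VSAN/AutoTerminateGhostVm (the usual mapping for an advanced setting named VSAN.AutoTerminateGhostVm) and that the accepted values are 0 and 1; check both on your build. The host names are placeholders. The script only prints the commands and does not run anything.

```python
# Hypothetical host names; replace with your own.
HOSTS = ["esxi-node-1.example.local", "esxi-node-2.example.local"]

# Assumed esxcli option path for the advanced setting VSAN.AutoTerminateGhostVm.
OPTION = "/VSAN/AutoTerminateGhostVm"


def set_option_command(value: int) -> str:
    """Build the esxcli command that sets the option to an integer value."""
    if value not in (0, 1):
        raise ValueError("expected 0 (disabled) or 1 (enabled)")
    return f"esxcli system settings advanced set -o {OPTION} -i {value}"


def show_option_command() -> str:
    """Build the esxcli command that displays the option's current value."""
    return f"esxcli system settings advanced list -o {OPTION}"


def plan(hosts, value):
    """Return (host, command) pairs: set on each host, then list to verify."""
    steps = []
    for host in hosts:
        steps.append((host, set_option_command(value)))
        steps.append((host, show_option_command()))
    return steps


if __name__ == "__main__":
    # 0 is the value suggested in workaround 3; set it back after maintenance.
    for host, cmd in plan(HOSTS, 0):
        print(f"[{host}] {cmd}")
```

Printing the plan first makes it easy to review, and to remember to set the value back once the maintenance is finished.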

What was the workaround VMware gave you?

TolgaAsik · 07-28-2019

They gave me a script that refreshes the "NICMGMTD" daemon when its memory allocation becomes full. I scheduled it to run every 5 minutes via crond.
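
The GSS script itself is not shown in this thread, so the following is only a rough sketch of the general idea: a check that decides when a refresh is due, plus a cron line for the 5-minute schedule. The script path, the 90% threshold, and the function names are all assumptions for illustration, and how the daemon's memory usage is actually read is left out.

```python
# Everything here is hypothetical; the real GSS script is not shown in the thread.
CRON_SCHEDULE = "*/5 * * * *"                # standard cron syntax: every 5 minutes
SCRIPT_PATH = "/path/to/refresh_daemon.sh"   # placeholder path


def cron_entry(schedule: str = CRON_SCHEDULE, script: str = SCRIPT_PATH) -> str:
    """Build the crontab line that runs the script on the given schedule."""
    return f"{schedule} {script}"


def needs_refresh(used_kb: int, limit_kb: int, threshold: float = 0.9) -> bool:
    """Return True when memory usage has reached the assumed threshold.

    used_kb and limit_kb would come from whatever the real script inspects;
    that part is not covered here.
    """
    if limit_kb <= 0:
        raise ValueError("limit_kb must be positive")
    return used_kb / limit_kb >= threshold


if __name__ == "__main__":
    print(cron_entry())
    print(needs_refresh(used_kb=950, limit_kb=1000))
```

Keeping the decision in a separate function means the cron job can run often and still do nothing unless the daemon is actually close to its limit.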
