Dear Community
We have a vSAN stretched cluster as below (VMware ESXi 7.0.3, build 21313628):
2 data nodes (1 in each site) and a witness host.
Most of the VMs have their storage replicated; we use "should" rules to spread them across the 2 sites.
We have some VMs that are pinned to a specific site, as below. We use "must" rules for them, and we can afford downtime on them during maintenance activities.
Site disaster tolerance: None - keep data on Secondary (stretched cluster)
Failures to tolerate: No data redundancy

Site disaster tolerance: None - keep data on Preferred (stretched cluster)
Failures to tolerate: No data redundancy
When we try to put a data node in either site into maintenance mode via "Ensure accessibility" or "No data migration", the operation fails.
We even tried powering off the VMs that use local storage on the impacted data node, but even then the cluster is not able to migrate the VMs with replicated storage to the other data node.
Is this expected behaviour? Are we doing something wrong?
Appreciate your help & guidance, as always.
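For reference, the same maintenance-mode choices can also be driven from the ESXi shell of the data node being evacuated; a minimal sketch (the mapping between the esxcli mode names and the UI labels is noted in the comments):

```shell
# vSAN data evacuation modes for maintenance mode, run on the host itself.
# "ensureObjectAccessibility" = "Ensure accessibility" in the UI,
# "evacuateAllData"           = "Full data migration",
# "noAction"                  = "No data migration".

# Ensure accessibility (in this 2-node layout there is no second host
# in the same fault domain to evacuate to, so this can fail):
esxcli system maintenanceMode set --enable true --vsanmode ensureObjectAccessibility

# No data migration:
esxcli system maintenanceMode set --enable true --vsanmode noAction

# Verify the current state:
esxcli system maintenanceMode get
```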
@anandgopinath This is expected behaviour and working as intended: how could the cluster satisfy accessibility of the objects when you are putting the only node where this data resides into maintenance mode? "No data migration" (No Action) is the only option that will work with the configuration you describe.
@TheBobkin : Thanks for the quick help
As mentioned in my post, even with the "No data migration" option we cannot get the host into maintenance mode, as the host cannot migrate a VM with replicated storage and a "should run" rule to the other host.
Not sure what we are doing wrong.
@anandgopinath Are you aware from previous testing whether this particular VM can actually live vMotion? It is not common, but there can be various configurations at the VM level (e.g. a passthrough device) that prevent this from being possible. Is it possible to power off the VM, re-register/cold vMotion it, and power it back on? If yes, then there is probably a configuration on it preventing live vMotion.
It can always be other things as well, e.g. a backup proxy that still has the base VMDK attached and locked. What error message are you getting when trying to vMotion it?
@TheBobkin, there is no issue with vMotion of the VM itself. When we power off the ESXi host in question, the VM is restarted on the other ESXi host.
It is only when we try to enter maintenance mode that the VM is not moved.
Powering off a host doesn't lead to a vMotion; that is HA taking action.
Try manually migrating the VM from one host to another host to see if that works.
Just wondering: do those hosts contain the vCLS VMs by any chance? I have seen situations where those are not automatically powered off when a host enters maintenance mode, which blocks the maintenance mode operation from completing.
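One quick way to check for that case from the host itself is to list the running VM processes; a sketch (run in the ESXi shell of the host that is entering maintenance mode):

```shell
# List the VM processes still running on this host and filter for the
# vCLS agent VMs; any match means a vCLS VM is still powered on here
# and may be blocking the maintenance mode operation.
esxcli vm process list | grep -i vcls
```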
@depping, it seems the issue is with the HA admission control setting below. If HA admission control is disabled, the host can enter maintenance mode.
So does this mean that HA admission control is not compatible with a 2-node stretched cluster?
[Screenshot of the admission control setting "CPU reserved for failover"]
@anandgopinath, if you are reserving one node's worth of compute resources, then it won't allow putting the other node into MM, as it cannot satisfy that reservation. So no, you shouldn't have it configured like that.
@TheBobkin , Thanks for the quick help 🙂
We have the same issue of maintenance mode not working when we choose "Ensure accessibility" as well as "Full data migration".
Same behaviour even if we disable the "must run" rules for the VMs pinned to each site.
The only option that works is "No data migration".
Is this also a limitation of the 2-node stretched cluster?
Thanks in advance for your continued support & guidance.
@anandgopinath, I answered this already - this was literally your initial question.
@anandgopinath wrote:
@TheBobkin , Thanks for the quick help 🙂
We have the same issue of maintenance mode not working when we choose "Ensure accessibility" as well as "Full data migration".
Same behaviour even if we disable the "must run" rules for the VMs pinned to each site.
The only option that works is "No data migration".
Is this also a limitation of the 2-node stretched cluster?
Thanks in advance for your continued support & guidance.
Yes, as you cannot migrate the data anywhere. You should indeed be using "No data migration".
Thanks for the quick response as always 🙂
I am a bit lost here.
So we have VMs with both the replication policy (storage has 2 copies, one in each failure domain) and the local-site policy (storage is in only 1 failure domain).
For VMs with the local-site policy (storage is in only 1 failure domain),
why should the option "Full data migration" or "Ensure accessibility" not work if the other failure domain has storage capacity?
I don't pin these VMs with "must run" rules anymore.
The same goes for VMs with the replication policy (storage has 2 copies, one in each failure domain):
why should the option "Full data migration" or "Ensure accessibility" not work if the other failure domain has storage capacity?
@anandgopinath wrote:
The same goes for VMs with the replication policy (storage has 2 copies, one in each failure domain):
why should the option "Full data migration" or "Ensure accessibility" not work if the other failure domain has storage capacity?
Why would we migrate the data to a host which already holds the data?
@anandgopinath wrote:
So we have VMs with both the replication policy (storage has 2 copies, one in each failure domain) and the local-site policy (storage is in only 1 failure domain).
For VMs with the local-site policy (storage is in only 1 failure domain),
why should the option "Full data migration" or "Ensure accessibility" not work if the other failure domain has storage capacity?
I don't pin these VMs with "must run" rules anymore.
Because you specified in which fault domain the data needs to reside. If you specify the Preferred site, the data can only move to another host in that fault domain, which you don't have.
@depping: Got it now, thanks. So basically, for these options to work we need more than 1 host per failure domain.
Sorry for all the questions; we have been testing various failure/maintenance scenarios like this, and at times what you read/understood from the documentation beforehand gets lost 🙂
Thanks for taking time out of your busy schedule to support the community. Much appreciated.
Correct, if you want to move data you would need more hosts.