I need your expertise regarding our vSAN cluster. Can someone help me with this issue?
We have a virtual machine running Windows Server 2008. It is a SQL Server. The VM has 5 partitions. One partition keeps getting disconnected or becoming inaccessible, and our only remedy is to restart the VM.
Our boss is annoyed that this keeps happening. This VM has a scheduled IDPA backup that creates a snapshot. Could the snapshot creation be the reason why the partition gets disconnected or suddenly becomes inaccessible?
You have to check what is special about that partition. Does the partition get unmounted only at the guest OS level, or is the virtual disk in a disconnected state at the VM level every time the backup takes a snapshot? If the snapshot causes the issue, why is only this disk impacted and not the other disks or other VMs?
Is there any way to reproduce the issue, or does it occur randomly? More info would help to isolate the issue.
Hi SureshKumarMuth.
This partition holds the database files. Yes, the partition gets unmounted only at the guest OS level. When we click on the partition, the VM hangs. When the partition is inaccessible, our operations stop. We restart the VM to get the partition working again.
I have no idea whether the snapshot is the reason for the downtime. In your opinion, what is the culprit?
I have something to add. Yes, the issue occurs randomly: sometimes after a couple of days, sometimes after weeks.
At the time of this troubleshooting, what is the backup state? Does the VM hold any snapshot, or is a backup job in progress? And does the issue occur only during the backup or after the backup task? Basically, I want to exclude certain things to narrow down the issue.
What do the event logs say about the disk? Have you checked vmware.log (the virtual machine log) to see whether any I/O errors were reported on the particular virtual disk at the time of the issue?
Do you have the time stamp of the issue, and the vmdk name of the impacted disk?
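As a rough sketch of the vmware.log check suggested above: the helper below lists the lines in a VM log that mention a given disk and also contain a generic error keyword. The function name `scan_vm_log` and the keyword list (`error`, `fail`, `timeout`) are assumptions made for illustration, not an official list of VMware log messages, so treat any hits only as a starting point.

```shell
#!/bin/sh
# Hypothetical helper (not a VMware tool): show lines in a VM log that
# mention one disk and contain a generic error keyword.
# The keywords below are an assumption; adjust them to what your logs contain.
scan_vm_log() {
    log_file="$1"    # path to vmware.log in the VM's directory on the datastore
    disk_name="$2"   # file name of the impacted virtual disk (the .vmdk)

    # First keep only lines naming the disk (fixed string, no regex),
    # then keep only those that also look like an error.
    grep -F -- "$disk_name" "$log_file" | grep -iE 'error|fail|timeout'
}
```

It could be called as `scan_vm_log /vmfs/volumes/<datastore>/<vm-folder>/vmware.log <disk>.vmdk`, where the datastore, folder, and disk names are placeholders for your own. The output can then be compared by eye against the time stamp of the incident.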
@jestorba, I see from the vmware.log that you are doing a quiesced snapshot (e.g. including VM memory). Is this mandatory for the OS/application?
If not, can you test a snapshot-based backup without this option selected to see whether you observe the same behaviour?
Have you confirmed that there is nothing wrong with, and no state change of, the object backing this vmdk when this issue occurs? (Cluster > Monitor > vSAN > Skyline Health > Retest)
Have you checked the vmkernel.log/vobd.log of all nodes to confirm there are no vSAN checksum error messages relating to this object's components? (grep -i checksum /var/log/vobd.log)
Have you looked at the vSAN latency on this VM/vmdk while this issue is occurring? (VM > Monitor > vSAN > Performance)
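To make the checksum check above easier to repeat on every node, here is a minimal sketch that wraps the same `grep -i checksum` search and prints a per-file count. The function name `count_checksum_lines` is hypothetical, and a non-zero count only means the word "checksum" appears in the log; the matching lines still need to be read to see whether they relate to this object's components.

```shell
#!/bin/sh
# Hypothetical helper: for each readable log file given as an argument,
# print "<file>: <number of lines containing 'checksum'>" (case-insensitive).
count_checksum_lines() {
    for f in "$@"; do
        # Skip files that do not exist or cannot be read on this host.
        [ -r "$f" ] || continue
        # grep -c prints the count even when it is 0.
        printf '%s: %s\n' "$f" "$(grep -ic 'checksum' "$f")"
    done
}
```

On each ESXi host this could be run as `count_checksum_lines /var/log/vobd.log /var/log/vmkernel.log`.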
Do you have the time stamp of the issue? = Between 2:50 and 3:30.
And the vmdk name of the impacted disk? = These logs were taken on the VM itself.
Attached is the screenshot.
All good now?