We are currently facing the following issue:
- We have a VM acting as a Samba file server (for about 50 clients).
- It is backed up using Veeam B&R. Veeam uses the VMware API to create a snapshot and removes it after the backup is done.
- During snapshot removal the VM freezes and is completely unavailable, usually for about 40 seconds, but sometimes for more than 60 seconds and up to 20 minutes. (60 seconds is critical because it impacts our users.)
- The snapshot removal progress is not smooth; there are long periods of no progress at all.
- When I try to SSH to the ESXi host at that time (e.g. to cd or ls in the datastore directory), I get an unresponsive SSH session or "device or resource busy" timeouts (I don't know whether this is normal, as I have seen such waits even when no backup was running).
- The VM has 5 virtual hard disks (about 9 TB in total); 4 disks are on Datastore2 and 1 disk (1 TB) is on Datastore1.
- Our cluster consists of 2 ESXi hosts with shared storage (a direct-attached FC HPE MSA5020) that provides Datastore1 and Datastore2.
- Datastore2 is dedicated to this VM.
When I grep the vmware.log, I see many occurrences of the VM being stopped (stunned) for snapshot removal:
2019-10-08T10:00:52.645Z| vcpu-0| I125: Checkpoint_Unstun: vm stopped for 1177715 us
2019-10-08T10:04:23.472Z| vcpu-0| I125: Checkpoint_Unstun: vm stopped for 12895 us
2019-10-08T10:04:38.091Z| vcpu-0| I125: Checkpoint_Unstun: vm stopped for 155503 us
2019-10-08T10:05:17.391Z| vcpu-0| I125: Checkpoint_Unstun: vm stopped for 30895458 us
2019-10-08T10:05:22.945Z| vcpu-0| I125: Checkpoint_Unstun: vm stopped for 166767 us
2019-10-08T10:05:25.206Z| vcpu-0| I125: Checkpoint_Unstun: vm stopped for 144651 us
2019-10-08T11:00:55.800Z| vcpu-0| I125: Checkpoint_Unstun: vm stopped for 1268898 us
2019-10-08T11:04:31.733Z| vcpu-0| I125: Checkpoint_Unstun: vm stopped for 13710 us
2019-10-08T11:04:51.409Z| vcpu-0| I125: Checkpoint_Unstun: vm stopped for 18886631 us
2019-10-08T11:04:53.100Z| vcpu-0| I125: Checkpoint_Unstun: vm stopped for 197801 us
2019-10-08T11:05:09.481Z| vcpu-0| I125: Checkpoint_Unstun: vm stopped for 8160928 us
2019-10-08T11:05:15.162Z| vcpu-0| I125: Checkpoint_Unstun: vm stopped for 187385 us
2019-10-08T11:05:17.492Z| vcpu-0| I125: Checkpoint_Unstun: vm stopped for 164755 us
2019-10-08T12:00:52.716Z| vcpu-0| I125: Checkpoint_Unstun: vm stopped for 1238173 us
2019-10-08T12:04:08.472Z| vcpu-0| I125: Checkpoint_Unstun: vm stopped for 10503 us
2019-10-08T12:04:09.289Z| vcpu-0| I125: Checkpoint_Unstun: vm stopped for 108357 us
2019-10-08T12:04:26.429Z| vcpu-0| I125: Checkpoint_Unstun: vm stopped for 16180417 us
2019-10-08T12:05:00.877Z| vcpu-0| I125: Checkpoint_Unstun: vm stopped for 25686061 us
2019-10-08T12:05:46.096Z| vcpu-0| I125: Checkpoint_Unstun: vm stopped for 39624419 us
2019-10-08T12:05:48.275Z| vcpu-0| I125: Checkpoint_Unstun: vm stopped for 126040 us
2019-10-08T13:00:57.400Z| vcpu-0| I125: Checkpoint_Unstun: vm stopped for 1269665 us
2019-10-08T13:04:43.269Z| vcpu-0| I125: Checkpoint_Unstun: vm stopped for 11101 us
2019-10-08T13:05:17.012Z| vcpu-0| I125: Checkpoint_Unstun: vm stopped for 32848352 us
2019-10-08T13:05:18.540Z| vcpu-0| I125: Checkpoint_Unstun: vm stopped for 159029 us
2019-10-08T13:05:26.983Z| vcpu-0| I125: Checkpoint_Unstun: vm stopped for 147776 us
2019-10-08T13:05:58.774Z| vcpu-0| I125: Checkpoint_Unstun: vm stopped for 26297009 us
2019-10-08T13:06:01.072Z| vcpu-0| I125: Checkpoint_Unstun: vm stopped for 158871 us
2019-10-08T14:00:53.754Z| vcpu-0| I125: Checkpoint_Unstun: vm stopped for 1084127 us
2019-10-08T14:04:09.445Z| vcpu-0| I125: Checkpoint_Unstun: vm stopped for 11057 us
2019-10-08T14:07:20.351Z| vcpu-0| I125: Checkpoint_Unstun: vm stopped for 189794746 us
2019-10-08T14:07:22.001Z| vcpu-0| I125: Checkpoint_Unstun: vm stopped for 176603 us
2019-10-08T14:12:58.620Z| vcpu-0| I125: Checkpoint_Unstun: vm stopped for 326074922 us
2019-10-08T14:14:10.905Z| vcpu-0| I125: Checkpoint_Unstun: vm stopped for 66778768 us
2019-10-08T14:14:13.307Z| vcpu-0| I125: Checkpoint_Unstun: vm stopped for 152553 us
2019-10-09T12:17:22.774Z| vcpu-0| I125: Checkpoint_Unstun: vm stopped for 1902985 us
2019-10-09T12:23:22.198Z| vcpu-0| I125: Checkpoint_Unstun: vm stopped for 13032 us
2019-10-09T12:32:32.823Z| vcpu-0| I125: Checkpoint_Unstun: vm stopped for 549832173 us
2019-10-09T12:32:34.558Z| vcpu-0| I125: Checkpoint_Unstun: vm stopped for 390240 us
2019-10-09T12:33:56.769Z| vcpu-0| I125: Checkpoint_Unstun: vm stopped for 73481855 us
2019-10-09T12:42:53.251Z| vcpu-0| I125: Checkpoint_Unstun: vm stopped for 530946216 us
2019-10-09T12:42:55.595Z| vcpu-0| I125: Checkpoint_Unstun: vm stopped for 167243 us
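The stun durations in these log lines are in microseconds, which makes the long stuns hard to spot at a glance. A small script along these lines (a sketch only; the log format is taken from the excerpt above, and the sample lines are hard-coded where you would read the real vmware.log) can extract the events and convert them to seconds:

```python
import re

# Matches lines like:
# 2019-10-08T14:12:58.620Z| vcpu-0| I125: Checkpoint_Unstun: vm stopped for 326074922 us
STUN_RE = re.compile(r"^(\S+?)\|.*Checkpoint_Unstun: vm stopped for (\d+) us")

def stun_events(lines):
    """Yield (timestamp, stun duration in seconds) for each Checkpoint_Unstun line."""
    for line in lines:
        m = STUN_RE.search(line)
        if m:
            yield m.group(1), int(m.group(2)) / 1_000_000

# Sample lines from the excerpt above; on a real host you would iterate
# over open("/vmfs/volumes/<datastore>/<vm>/vmware.log") instead.
sample = [
    "2019-10-08T14:12:58.620Z| vcpu-0| I125: Checkpoint_Unstun: vm stopped for 326074922 us",
    "2019-10-09T12:32:32.823Z| vcpu-0| I125: Checkpoint_Unstun: vm stopped for 549832173 us",
]
for ts, secs in stun_events(sample):
    print(f"{ts}  stunned for {secs:.1f} s")
```

Sorting the resulting durations (or filtering on, say, `secs > 60`) gives a quick overview of how often the stuns exceed the critical one-minute mark.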
I can hardly believe that this is "normal" behaviour. Does anybody have ideas on how to narrow down the issue?
regards
Andreas
Message edited by Andreas Baier
There are multiple reasons why a virtual machine may freeze during a snapshot operation. This is sometimes caused by a bug in ESXi, sometimes by the virtual machine configuration.
First of all, can you answer my questions below?
ESXi version and build number?
What is the virtual machine hardware version?
What is the brand and model of the physical hardware?
We are using:
ESX Version:
VM Hardware:
Phys Equipment:
During a snapshot operation, the virtual machine is briefly frozen (stunned). If the CPU and memory resources of the virtual machine are insufficient during this time, performance problems may occur during the snapshot. Can you increase the CPU and memory resources? Can you check the latency of the datastores in the ESXi performance monitor?
What guest OS are you using? And what version of VMware Tools is installed on the virtual machine?
Hi,
We installed some updates in the meantime, now the build is VMware ESXi, 6.7.0, 14320388
Nah, increasing the resources of the VM won't make a big difference. It usually comes down to the change rate on disk (I/O intensity) and the storage system itself (how fast is it?). It could also be that the host is low on resources and that the resources needed to merge the snapshot with the base disk are the limitation. Try moving the VM to another host to see if that makes a difference.
The ESXi hosts were doing almost nothing: CPU was at about 8% and memory at 30% (max). I can't see any increase in memory, CPU, or I/O during the backup / snapshot removal.
We can't find the serial killer when all you give us is the times of the last kills. The events in between are the useful ones.
I attached vmware logs.
I have the same issue with one particular VM (it is also CentOS 7).
I have 3 hosts with plenty of resources available on both the hosts and the VMs. I also have a lot of CentOS 7 machines that do not have this issue.
The only thing that is somewhat different is that this machine has an unusual LVM/partitioning setup: I've mixed raw and partitioned volumes with LVM across 3 virtual hard disks. I'm not sure if that is the issue.
Hello @AndreasAPS
Can you monitor I/O and latency on the storage system?
As your VM is 'alone' on the datastore, we can expect that all I/O is coming from your file server.
BTW: is your storage volume independent from other LUNs? I mean, does it have a RAID set of its own, or is it a virtual volume on a shared RAID set?
What VMFS version do you have on your datastore volumes?
Kind regards
Michael