We are currently facing the following issue:
- We have a VM acting as a Samba file server (for about 50 clients).
- It is backed up using Veeam B&R. Veeam uses the VMware API to create a snapshot and removes it after the backup is done.
- During snapshot removal the VM freezes and is completely unavailable, usually for about 40 seconds, but sometimes for more than 60 seconds and up to 20 minutes. (60 seconds is critical because it impacts our users.)
- The snapshot removal progress is not smooth; there are long periods of no progress at all.
- When I try to SSH to the ESXi host at that time (e.g. to cd or ls in the datastore directory), I get an unresponsive SSH session or "device or resource busy" timeouts (I don't know whether this is normal, as I have seen such waits even when no backup was running).
- The VM has 5 virtual hard disks (about 9 TB in total); 4 disks are on Datastore2 and 1 disk (1 TB) is on Datastore1.
- Our cluster consists of 2 ESXi hosts with shared storage (a direct-attached FC HPE MSA5020) that provides Datastore1 and Datastore2.
- Datastore2 is dedicated to this VM.
When I grep the vmware.log, I see many occurrences of the VM being stopped (stunned) for snapshot removal:
2019-10-08T10:00:52.645Z| vcpu-0| I125: Checkpoint_Unstun: vm stopped for 1177715 us
2019-10-08T10:04:23.472Z| vcpu-0| I125: Checkpoint_Unstun: vm stopped for 12895 us
2019-10-08T10:04:38.091Z| vcpu-0| I125: Checkpoint_Unstun: vm stopped for 155503 us
2019-10-08T10:05:17.391Z| vcpu-0| I125: Checkpoint_Unstun: vm stopped for 30895458 us
2019-10-08T10:05:22.945Z| vcpu-0| I125: Checkpoint_Unstun: vm stopped for 166767 us
2019-10-08T10:05:25.206Z| vcpu-0| I125: Checkpoint_Unstun: vm stopped for 144651 us
2019-10-08T11:00:55.800Z| vcpu-0| I125: Checkpoint_Unstun: vm stopped for 1268898 us
2019-10-08T11:04:31.733Z| vcpu-0| I125: Checkpoint_Unstun: vm stopped for 13710 us
2019-10-08T11:04:51.409Z| vcpu-0| I125: Checkpoint_Unstun: vm stopped for 18886631 us
2019-10-08T11:04:53.100Z| vcpu-0| I125: Checkpoint_Unstun: vm stopped for 197801 us
2019-10-08T11:05:09.481Z| vcpu-0| I125: Checkpoint_Unstun: vm stopped for 8160928 us
2019-10-08T11:05:15.162Z| vcpu-0| I125: Checkpoint_Unstun: vm stopped for 187385 us
2019-10-08T11:05:17.492Z| vcpu-0| I125: Checkpoint_Unstun: vm stopped for 164755 us
2019-10-08T12:00:52.716Z| vcpu-0| I125: Checkpoint_Unstun: vm stopped for 1238173 us
2019-10-08T12:04:08.472Z| vcpu-0| I125: Checkpoint_Unstun: vm stopped for 10503 us
2019-10-08T12:04:09.289Z| vcpu-0| I125: Checkpoint_Unstun: vm stopped for 108357 us
2019-10-08T12:04:26.429Z| vcpu-0| I125: Checkpoint_Unstun: vm stopped for 16180417 us
2019-10-08T12:05:00.877Z| vcpu-0| I125: Checkpoint_Unstun: vm stopped for 25686061 us
2019-10-08T12:05:46.096Z| vcpu-0| I125: Checkpoint_Unstun: vm stopped for 39624419 us
2019-10-08T12:05:48.275Z| vcpu-0| I125: Checkpoint_Unstun: vm stopped for 126040 us
2019-10-08T13:00:57.400Z| vcpu-0| I125: Checkpoint_Unstun: vm stopped for 1269665 us
2019-10-08T13:04:43.269Z| vcpu-0| I125: Checkpoint_Unstun: vm stopped for 11101 us
2019-10-08T13:05:17.012Z| vcpu-0| I125: Checkpoint_Unstun: vm stopped for 32848352 us
2019-10-08T13:05:18.540Z| vcpu-0| I125: Checkpoint_Unstun: vm stopped for 159029 us
2019-10-08T13:05:26.983Z| vcpu-0| I125: Checkpoint_Unstun: vm stopped for 147776 us
2019-10-08T13:05:58.774Z| vcpu-0| I125: Checkpoint_Unstun: vm stopped for 26297009 us
2019-10-08T13:06:01.072Z| vcpu-0| I125: Checkpoint_Unstun: vm stopped for 158871 us
2019-10-08T14:00:53.754Z| vcpu-0| I125: Checkpoint_Unstun: vm stopped for 1084127 us
2019-10-08T14:04:09.445Z| vcpu-0| I125: Checkpoint_Unstun: vm stopped for 11057 us
2019-10-08T14:07:20.351Z| vcpu-0| I125: Checkpoint_Unstun: vm stopped for 189794746 us
2019-10-08T14:07:22.001Z| vcpu-0| I125: Checkpoint_Unstun: vm stopped for 176603 us
2019-10-08T14:12:58.620Z| vcpu-0| I125: Checkpoint_Unstun: vm stopped for 326074922 us
2019-10-08T14:14:10.905Z| vcpu-0| I125: Checkpoint_Unstun: vm stopped for 66778768 us
2019-10-08T14:14:13.307Z| vcpu-0| I125: Checkpoint_Unstun: vm stopped for 152553 us
2019-10-09T12:17:22.774Z| vcpu-0| I125: Checkpoint_Unstun: vm stopped for 1902985 us
2019-10-09T12:23:22.198Z| vcpu-0| I125: Checkpoint_Unstun: vm stopped for 13032 us
2019-10-09T12:32:32.823Z| vcpu-0| I125: Checkpoint_Unstun: vm stopped for 549832173 us
2019-10-09T12:32:34.558Z| vcpu-0| I125: Checkpoint_Unstun: vm stopped for 390240 us
2019-10-09T12:33:56.769Z| vcpu-0| I125: Checkpoint_Unstun: vm stopped for 73481855 us
2019-10-09T12:42:53.251Z| vcpu-0| I125: Checkpoint_Unstun: vm stopped for 530946216 us
2019-10-09T12:42:55.595Z| vcpu-0| I125: Checkpoint_Unstun: vm stopped for 167243 us
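The stun durations in these log lines are in microseconds, which makes the long stuns hard to spot at a glance. A small script along these lines (a sketch only; the log format is taken from the excerpt above, and the sample lines are hard-coded where you would read the real vmware.log) can extract the events and convert them to seconds:

```python
import re

# Matches lines like:
# 2019-10-08T14:12:58.620Z| vcpu-0| I125: Checkpoint_Unstun: vm stopped for 326074922 us
STUN_RE = re.compile(r"^(\S+?)\|.*Checkpoint_Unstun: vm stopped for (\d+) us")

def stun_events(lines):
    """Yield (timestamp, stun duration in seconds) for each Checkpoint_Unstun line."""
    for line in lines:
        m = STUN_RE.search(line)
        if m:
            yield m.group(1), int(m.group(2)) / 1_000_000

# Sample lines from the excerpt above; on a real host you would iterate
# over open("/vmfs/volumes/<datastore>/<vm>/vmware.log") instead.
sample = [
    "2019-10-08T14:12:58.620Z| vcpu-0| I125: Checkpoint_Unstun: vm stopped for 326074922 us",
    "2019-10-09T12:32:32.823Z| vcpu-0| I125: Checkpoint_Unstun: vm stopped for 549832173 us",
]
for ts, secs in stun_events(sample):
    print(f"{ts}  stunned for {secs:.1f} s")
```

Sorting the resulting durations (or filtering on, say, `secs > 60`) gives a quick overview of how often the stuns exceed the critical one-minute mark.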
I can hardly believe that this is "normal" behaviour. Does anybody have ideas on how to narrow down the issue?
regards
Andreas
Message edited by Andreas Baier
There are multiple reasons why a virtual machine may freeze during a snapshot operation. This is sometimes caused by a bug in ESXi, sometimes by the virtual machine configuration.
First of all, can you answer my questions below?
ESXi version and build number?
What is the virtual machine hardware version?
What is the brand and model of the physical hardware?
We are using:
ESX Version:
VM Hardware:
Phys Equipment:
During a snapshot operation, the virtual machine is briefly frozen (stunned). If the CPU and memory resources of the virtual machine are insufficient during this time, performance problems may occur during the snapshot. Can you increase the CPU and memory resources? Can you check the latency of the datastores in the ESXi performance monitor?
What guest OS are you using? And what version of VMware Tools is installed on the virtual machine?
Hi,
We installed some updates in the meantime, now the build is VMware ESXi, 6.7.0, 14320388
Nah, increasing the resources of the VM won't make a big difference. It usually comes down to the change rate on disk (I/O intensity) and the storage system itself (how fast is it?). It could also be that the host is low on resources and that the resources needed to merge the snapshot with the base disk are the limitation. Try moving the VM to another host to see if that makes a difference.
The ESXi hosts were doing almost nothing: CPU was at about 8% and memory at 30% (max). I can't see any increase in memory, CPU, or I/O during the backup / snapshot removal.
We can't find the serial killer when all you give us is the times of the last kills. The events in between are the useful ones.
I attached vmware logs.
I have the same issue with one particular VM (it is also CentOS 7).
I have 3 hosts with plenty of resources available on both the hosts and the VMs. I also have a lot of CentOS 7 machines that do not have this issue.
The only thing that is somewhat different is that this machine has an unusual LVM/partitioning setup: I've mixed raw and partitioned volumes with LVM across 3 virtual hard disks. I'm not sure if that is the issue.
Hello @AndreasAPS
Can you monitor I/O and latency on the storage system?
As your VM is 'alone' on the datastore, we can expect that all I/O is coming from your file server.
BTW: is your storage volume independent from other LUNs? I mean, does it have a RAID set of its own, or is it a virtual volume on a shared RAID set?
What VMFS version do you have on your datastore volumes?
Kind regards
Michael