Re: VM stopped for about 20min

DaniAvni · ‎12-17-2020

I have an ESXi host running version 6.7.0 (build 13981272) with 2 sockets x 16 cores x 2 hyperthreads. The VMs running on this host have 44 vCPUs assigned in total. There are no vCPU reservations or other special settings on any guest VM. The host has been up for 570 days.
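
For reference, the overcommit arithmetic implied by these numbers can be sketched as follows. This is only an illustration using the figures from this post; the helper name `overcommit_ratio` is made up for the example, and whether a given ratio is a problem depends on the workload.

```python
# Host topology and vCPU total as described in the post.
SOCKETS = 2
CORES_PER_SOCKET = 16
THREADS_PER_CORE = 2   # hyperthreading enabled
TOTAL_VCPUS = 44       # sum of vCPUs over all VMs on the host

physical_cores = SOCKETS * CORES_PER_SOCKET        # 32
logical_cpus = physical_cores * THREADS_PER_CORE   # 64


def overcommit_ratio(vcpus: int, cpus: int) -> float:
    """Hypothetical helper: assigned vCPUs per available CPU."""
    if cpus <= 0:
        raise ValueError("cpus must be positive")
    return vcpus / cpus


# Ratio against physical cores and against logical (hyperthreaded) CPUs.
print(f"vCPU : physical core = {overcommit_ratio(TOTAL_VCPUS, physical_cores):.3f}")
print(f"vCPU : logical CPU   = {overcommit_ratio(TOTAL_VCPUS, logical_cpus):.3f}")
```

So the host is only mildly overcommitted against physical cores (about 1.4 vCPUs per core) and not overcommitted at all against logical CPUs.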

Today one of the guest VMs was unavailable for about 20 min. The VM has 5 vCPUs assigned to it. The machine is running Windows Server 2012 R2 with SQL Server on it (MAXDOP is set to 5, although I am not sure this question is SQL related at all).

I have noticed the following things in our monitoring platform:

The co-stop of the VM that was unavailable peaked at 80% (at 5 min measurement intervals the values were 0%-5%-80%-62%-0%).
In vSphere I saw that the host had 3 cores at 100%, although no individual guest VM was at 100% at that time (they were all below 50% CPU usage).

The same thing happened to this specific machine in November and now again in December. The machine is completely unresponsive while co-stop is high.

Any ideas on how we could find the root cause of this? I did not observe this at all on other hosts in the same cluster or on other guests on the same host.

jburen · ‎12-20-2020

Based on your input the host has two NUMA nodes. The VM has an odd number of vCPUs. 2 vCPUs will be running on NUMA Node 0 while 3 vCPUs will be running on NUMA Node 1. This is not an ideal situation. Personally, I would configure the VM with 4 vCPUs (or 6). Or even start with 2 vCPUs and see if you hit any limit.

Consider giving Kudos if you think my response helped you in any way.

DaniAvni · ‎12-21-2020

Thanks for the reply. Could this cause the VM to hang like this? Or cause the host to have a few cores so busy (as I said, 3 cores at 100% doing something) that the guest VM is totally unresponsive for 20 min?

Reading through https://blogs.vmware.com/performance/2017/03/virtual-machine-vcpu-and-vnuma-rightsizing-rules-of-thu... it seems that since I am using only 16GB of RAM for the VM (out of 192GB, so less than half of the host memory) and only 5 vCPUs (fewer than the number of cores in a single physical CPU), it would still be NUMA optimal.
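
A rough sketch of that check, as I understand the rule of thumb: the VM should fit within one NUMA node's cores and memory. It assumes one NUMA node per socket and memory split evenly across the two sockets (96GB each), neither of which has been verified on this host. `NumaNode` and `fits_single_node` are names invented for the example, and the ESXi scheduler takes more into account than this.

```python
from dataclasses import dataclass


@dataclass
class NumaNode:
    cores: int        # physical cores in the node
    memory_gb: float  # memory local to the node


def fits_single_node(vcpus: int, vm_memory_gb: float, node: NumaNode) -> bool:
    """Hypothetical check: does the VM fit inside one NUMA node?"""
    return vcpus <= node.cores and vm_memory_gb <= node.memory_gb


# Assumption: 192GB host memory is split evenly across 2 sockets.
node = NumaNode(cores=16, memory_gb=192 / 2)

print(fits_single_node(vcpus=5, vm_memory_gb=16, node=node))   # True
print(fits_single_node(vcpus=20, vm_memory_gb=16, node=node))  # False: more vCPUs than cores
```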

jburen · ‎12-21-2020

I hear what you are saying and it should work but... This is from https://codenotary.com/blog/vmware-cpu-co-stop-and-sql-server-performance/

VMware’s CPU Co-Stop metric shows you the amount of time that a parallelized request spends trying to line up the vCPU schedulers for the simultaneous execution of a task on multiple vCPUs. It’s measured in milliseconds spent in the queue per vCPU per polling interval. Higher is bad. Very bad. The operating system is constantly reviewing the running processes, and checking their runtime states. It can detect that a CPU isn’t keeping up with the others, and might actually flag a CPU is actually BAD if it can’t keep up and the difference is too great.

If you see blips above zero, you’ve got a performance challenge. The higher the number gets, the worse the performance impact can be. And… it’s not just the performance of this VM. It’s the performance of all of the VMs on the host. The vCPUs on the other VMs are sure to be impacted by this scheduling delay, and their performance will be negatively impacted as well.
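
To relate the millisecond values described above to the percentages reported in the first post, here is a small sketch of the commonly used conversion for vCenter summation counters: divide the summation in ms by the interval length in ms, then average over the vCPUs. It assumes the monitoring platform reports co-stop as a per-vCPU average percentage, which has not been confirmed in this thread. The function names are made up for the example.

```python
def costop_percent(summation_ms: float, interval_s: float, vcpus: int = 1) -> float:
    """Convert a co-stop summation (ms) into a percentage of the interval.

    Assumes summation_ms is the total over all vCPUs of the VM.
    """
    if interval_s <= 0 or vcpus <= 0:
        raise ValueError("interval_s and vcpus must be positive")
    return 100.0 * summation_ms / (interval_s * 1000.0 * vcpus)


def costop_ms(percent: float, interval_s: float, vcpus: int = 1) -> float:
    """Inverse of costop_percent: percentage back to total milliseconds."""
    if interval_s <= 0 or vcpus <= 0:
        raise ValueError("interval_s and vcpus must be positive")
    return percent * interval_s * 1000.0 * vcpus / 100.0


# Readings from the first post, taken at 5 min (300 s) intervals.
readings_percent = [0, 5, 80, 62, 0]
INTERVAL_S = 300

for pct in readings_percent:
    stalled_s = costop_ms(pct, INTERVAL_S) / 1000.0
    print(f"{pct:>3}% co-stop = about {stalled_s:.0f} s per vCPU stalled in a {INTERVAL_S} s interval")
```

Under that assumption, the 80% reading means each vCPU spent roughly 240 of the 300 seconds in co-stop, which matches a guest that appears completely hung.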

So you might want to check other VMs as well.


DaniAvni · ‎12-22-2020

I have reduced the number of vCPUs to 4, and it looks like the co-stop has gone down a bit compared to the previous day. I will keep monitoring this.

jhondavid442 · ‎12-23-2020

I have the same problem but didn't know where to ask. Now we both have the solution.
Thanks, this will help me too.
