Solved: Significant guest os memory performance decrease after suspend

MixuS · 03-16-2021

Hi! I'm trying to narrow down the possible causes of a significant memory performance decrease in the guest OS. Before tearing everything apart I'd be interested in hearing if anyone else has had any similar experiences and what their root causes were.

Description:

What happens is that the guest OS performs normally when started from a complete shutdown. However, very often (but not always), when the guest OS is suspended and later resumed, it is very sluggish and nearly unusable.

When the issue occurs, memory performance in the guest OS is only a fraction of what it usually is.
According to sysbench, when the guest behaves normally, its memory write speed is roughly 6500 MiB/s:

$ sysbench memory run
sysbench 1.0.18 (using system LuaJIT 2.1.0-beta3)

Running the test with following options:
Number of threads: 1
Initializing random number generator from current time


Running memory speed test with the following options:
  block size: 1KiB
  total size: 102400MiB
  operation: write
  scope: global

Initializing worker threads...

Threads started!

Total operations: 67290110 (6728078.01 per second)

65713.00 MiB transferred (6570.39 MiB/sec)

After resuming from suspend, the performance is often (but not always) only a fraction of the baseline:

$ sysbench memory run
sysbench 1.0.18 (using system LuaJIT 2.1.0-beta3)

Running the test with following options:
Number of threads: 1
Initializing random number generator from current time


Running memory speed test with the following options:
  block size: 1KiB
  total size: 102400MiB
  operation: write
  scope: global

Initializing worker threads...

Threads started!

Total operations: 396116 (39605.69 per second)

386.83 MiB transferred (38.68 MiB/sec)

When the issue occurs, the memory write performance is always less than 40 MiB/s.
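To put a number on the difference between the two runs above, the throughput lines can be compared directly. This is only a minimal sketch: the helper name `parse_throughput` is made up for illustration, and the regular expression assumes the summary line looks exactly like the sysbench 1.0.18 output pasted in this post.

```python
import re

# Matches the sysbench summary line as pasted above, e.g.
# "65713.00 MiB transferred (6570.39 MiB/sec)".
# Assumption: other sysbench versions may format this line differently.
THROUGHPUT_RE = re.compile(r"MiB transferred \(([\d.]+) MiB/sec\)")


def parse_throughput(sysbench_output: str) -> float:
    """Return the MiB/sec figure from sysbench memory test output."""
    match = THROUGHPUT_RE.search(sysbench_output)
    if match is None:
        raise ValueError("no 'MiB transferred' line found in sysbench output")
    return float(match.group(1))


# The two summary lines quoted in this post.
baseline = parse_throughput("65713.00 MiB transferred (6570.39 MiB/sec)")
after_resume = parse_throughput("386.83 MiB transferred (38.68 MiB/sec)")

slowdown = baseline / after_resume
print(f"baseline:     {baseline:.2f} MiB/sec")
print(f"after resume: {after_resume:.2f} MiB/sec")
print(f"slowdown:     about {slowdown:.0f}x")
```

With the figures from the two runs, the degraded guest is roughly 170 times slower at memory writes than the freshly booted one.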

Simply restarting the guest OS does not fix the issue; the guest has to be fully shut down and then started again. When the issue occurs, the host OS performs completely normally. I run multiple virtual machines, both simultaneously and one at a time, and the issue occurs on all of them either way.

Since the issue occurs very often but not every time the guest OS is suspended, it would be interesting to find out what triggers it: something that happens in the host OS under some conditions, something that VMware Workstation does under some circumstances, some hardware configuration, or something else.

The described issue has occurred on this hardware configuration for a few years, across multiple VMware versions (both Player and Workstation).

My colleague, running a similar setup (similar host OS / guest OS setup with similar hardware specs and the same VMware Workstation version) but on an AMD Ryzen Threadripper, has not encountered similar issues.

Details:

Hardware:
- Intel Core i7-6700K on Intel Z170 chipset, 64GB of memory.
- All virtualization extensions are enabled in BIOS.
- BIOS: American Megatrends Inc. 3801 / SMBIOS 3.0
- Host OS is installed on a well performing M.2 SSD.
- All virtual machines are stored on their own dedicated well performing M.2 SSD.

Host OS:
- Windows 10 Pro 10.0.19041 / HAL 10.0.19041.844
- Bitlocker is disabled on all drives
- No 3rd party antivirus applications are installed
- Windows integrated security is enabled with "Virus & threat protection" and "Device security"

VMware configuration for all guest OSes:
- VMware Workstation 16 Pro, 16.1.0 build-17198959
- Minimum of 16GB memory allocated
- VT-x, IOMMU and CPU performance counter virtualization are disabled
- Side channel mitigations are enabled

All guest OSes:
- Ubuntu 20.04.2 LTS on Linux 5.4.0-67-generic #75-Ubuntu SMP Fri Feb 19 18:03:38 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
- SELinux is not installed
- AppArmor is enabled
- Virtual machine logs contain no errors or anything else that would indicate issues

bluefirestorm · 03-18-2021

You mentioned that side channel mitigations are enabled, which means the Windows 10 host has Hyper-V enabled and the user-level Microsoft Hypervisor API is the hypervisor instead of the root Intel VT-x.

With the root VT-x, the virtual RAM is managed by the CPU (for Intel it is EPT). I don't know exactly what is used when the VMM is the Hypervisor API but I would assume there is some software overhead for this.

Does the AMD Ryzen Threadripper host also have Hyper-V enabled? Does the problem go away when you disable Hyper-V and go back to the root VT-x VMM?

MixuS · 03-23-2021

Thank you for your response!

That could be it, but neither the Ryzen setup nor this Intel setup (the one affected by the issue) has Hyper-V enabled.
Do the side channel mitigations rely on Hyper-V being enabled, and if so, could the described issue occur because it's not enabled?

bluefirestorm · 03-23-2021

AFAIK, the side channel mitigations checkbox in the Advanced settings only appears if the Workstation Pro 16.x software detects Hyper-V.

You can confirm this in the vmware.log of any VM. Look for the text "Monitor Mode". If it shows ULM, the User Level Monitor (i.e. the Hypervisor API) is in use. If it shows CPL0, it is the ring 0 privileged VMM (which is much faster than the ULM).

| vmx| I005: Monitor Mode: CPL0

Have a look at this KB. There could be other stuff enabled and/or other steps that need to be done to go back to ring 0 VMM.

https://kb.vmware.com/s/article/2146361
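The log check described above could be scripted roughly as follows. This is a sketch, not an official VMware tool: the function names are hypothetical, and the pattern assumes the line format shown in the example above ("Monitor Mode: " followed by the mode name). Only the two modes named in this post are interpreted.

```python
import re
from typing import Optional

# Assumption: the log line looks like the example quoted in this post,
# e.g. "| vmx| I005: Monitor Mode: CPL0".
MONITOR_MODE_RE = re.compile(r"Monitor Mode:\s*(\w+)")


def find_monitor_mode(log_text: str) -> Optional[str]:
    """Return the first monitor mode reported in the log text, or None."""
    for line in log_text.splitlines():
        match = MONITOR_MODE_RE.search(line)
        if match:
            return match.group(1)
    return None


def describe(mode: Optional[str]) -> str:
    """Translate the mode string into the meaning given in this thread."""
    if mode == "CPL0":
        return "CPL0: ring 0 VMM"
    if mode == "ULM":
        return "ULM: User Level Monitor (Hypervisor API), expect it to be slower"
    return "monitor mode not found or not recognized"


# Example using the line format from this post. For a real check, read the
# VM's own log instead, for instance:
#   log_text = pathlib.Path("vmware.log").read_text(errors="replace")
sample_log = "| vmx| I005: Monitor Mode: CPL0"
print(describe(find_monitor_mode(sample_log)))
```

The same function can be pointed at the vmware.log in each VM's directory to see which monitor every guest was started with.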

MixuS · 03-24-2021

You are correct, and there it is.

2021-03-24T10:35:44.769+02:00| vmx| I005: Monitor Mode: ULM

The link you provided indicates that in this case it must be due to WSL2 being enabled on the Host OS.
Now I just need to decide whether to stop using WSL2 or just live with the issue.

Thank you very much for the assistance!
