VMware Communities
mkubecek
Hot Shot

17.5.0 modules on 6.6-rc6 kernel trigger a warn check in RCU code

When running 17.5.0 with unpatched modules on openSUSE Leap with kernel 6.6-rc6, it consistently triggers a warn check in rcu_flavor_sched_clock_irq() when I start a WinXP VM (see attachment for full traces).

Note 1: I get this with Player 17.5.0, but the kernel modules are the same and more people are likely to see the report here. Feel free to move this to the Player section if you believe it is more relevant there.

Note 2: applying this patch to the vmmon source seems to help; I no longer get the warnings with it.

bcdonadio_com
Contributor

I can confirm that this freely available patch, from someone **not paid** by a company that berates its employees by telling them to "take your butt back to office", fixes the otherwise useless software, which crashes not only itself but takes the whole system down with it. I also pay for a **bleep** subscription to keep it up to date, yet even with the fix ready and delivered to their doorstep for free, the company has done absolutely nothing to distribute it for more than a whole month so far.

Running Linux 6.6 on FC39, with love.

gbohn
Enthusiast

> even having the fix ready and delivered to their doorstep for free to distribute it does absolutely nothing for

> more than a whole month so far

I don't know if the Broadcom takeover is at all related, but they might be a bit distracted at VMware, what with all the layoffs ( https://www.sdxcentral.com/articles/analysis/vmware-layoffs-and-other-cuts-start-as-broadcom-takes-o... ).

Also, it looks like Broadcom plans to jettison Workstation to who knows where... (https://www.theregister.com/2023/12/07/broadcom_q4_2023/?td=rt-3a).

Things don't look too good for the home team, as best I can tell... 😞

wila
Immortal


@gbohn wrote:

> Things don't look too good for the home team, as best I can tell... 😞

It might not be that bad; please see:

https://communities.vmware.com/t5/VMware-Fusion-Discussions/Impact-of-VMware-s-acquisition-by-Broadc...

--
Wil

| Author of Vimalin. The virtual machine Backup app for VMware Fusion, VMware Workstation and Player |
| More info at vimalin.com | Twitter @wilva
CarloFrancesco
Contributor

VMware was also causing my laptop to freeze before applying this patch. The patch seems to work fine for kernel 6.7 as well. Thanks!

FYI, Oracle VirtualBox suffers from a similar problem (even the latest 7.0.14) and I didn't find a solution for that.

CarloFrancesco
Contributor

Hello all,

the 17.5.1 release doesn't fix the problem.

The patch is still needed.

mkubecek
Hot Shot

Yes, the 17.5.1 kernel modules are exactly the same as in 17.5.0; this update seems to address only a CVE affecting the userspace part.

mkubecek
Hot Shot

@john5333 I'm sorry, but while quite long, your "reply" is completely useless and does not provide any usable information or insight. To be honest, your comment sounds more like an "AI"-generated block of text than a genuine human reaction. If I'm wrong, please read the initial report again; it provides enough relevant information, including a full stack trace.

CarloFrancesco
Contributor

@mkubecek Right, but it was worth mentioning. Yes, I used the patch on the new modules, which have different line positions, but it still works fine.

@john5333 It's interesting that you mention that I should ensure I run the latest "stable" kernel. Stable for whom?

I run Canonical Ubuntu, with the "vanilla" kernel source. Namely, I get the kernel straight from the Linux archive and compile it myself. I'm currently using the 6.7.9 Linux "stable" version.

Out of curiosity, I tested the packages provided in PPA repositories from Canonical, and the RCU problem appeared with them too. I strongly believe that the issue appears on all kernels >= 6.6-rc.

Without the patch, I can't start any other process after the kernel gives me the rcu_flavor_sched_clock_irq() warning. To recover, I can only press Ctrl+Alt+F3 to switch to tty3, end the VMware process, remove its modules, and then start the window manager again. I don't know if this is also true for other people.

Still, although VMware Workstation works better and no longer freezes after applying the patch, there are many other warnings reporting a UBSAN: array-index-out-of-bounds problem:
[   12.653889] ================================================================================
[   12.653890] ================================================================================
[   12.653890] UBSAN: array-index-out-of-bounds in /tmp/modconfig-kf9Q2W/vmmon-only/common/vmx86.c:3652:38
[   12.653891] index 0 is out of range for type 'MSRReply [*]'
[   12.653891] CPU: 0 PID: 1808 Comm: modprobe Tainted: P           OE      6.7.9-060709-generic #202403061535
[   12.653891] Hardware name: LENOVO 82WQ/LNVNB161216, BIOS KWCN39WW 07/24/2023
[   12.653892] Call Trace:
[   12.653892]  <TASK>
[   12.653892]  dump_stack_lvl+0x48/0x70
[   12.653893]  dump_stack+0x10/0x20
[   12.653894]  __ubsan_handle_out_of_bounds+0xc6/0x110
[   12.653895]  Vmx86GenFindCommonIntelVTCap+0x144a/0x1540 [vmmon]
[   12.653899]  Vmx86_CheckMSRUniformity+0x695/0x700 [vmmon]
[   12.653903]  ? __pfx_LinuxDriverInit+0x10/0x10 [vmmon]
[   12.653907]  init_module+0x57/0x1b0 [vmmon]
[   12.653910]  ? __pfx_LinuxDriverInit+0x10/0x10 [vmmon]
[   12.653914]  do_one_initcall+0x5b/0x340
[   12.653916]  do_init_module+0x97/0x290
[   12.653917]  load_module+0xba1/0xcf0
[   12.653918]  ? security_kernel_post_read_file+0x75/0x90
[   12.653920]  init_module_from_file+0x96/0x100
[   12.653921]  ? init_module_from_file+0x96/0x100
[   12.653922]  idempotent_init_module+0x11c/0x2b0
[   12.653923]  __x64_sys_finit_module+0x64/0xd0
[   12.653924]  do_syscall_64+0x5d/0xf0
[   12.653925]  ? ext4_llseek+0xc3/0x130
[   12.653926]  ? ksys_lseek+0x7d/0xd0
[   12.653927]  ? exit_to_user_mode_prepare+0x30/0xb0
[   12.653929]  ? syscall_exit_to_user_mode+0x2e/0x50
[   12.653930]  ? do_syscall_64+0x6c/0xf0
[   12.653931]  ? do_syscall_64+0x6c/0xf0
[   12.653932]  ? irqentry_exit+0x43/0x50
[   12.653933]  ? exc_page_fault+0x94/0x1b0
[   12.653934]  entry_SYSCALL_64_after_hwframe+0x6e/0x76
[   12.653935] RIP: 0033:0x75340ed25cfd
[   12.653936] Code: ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d eb 80 0d 00 f7 d8 64 89 01 48
[   12.653937] RSP: 002b:00007ffc8837f4a8 EFLAGS: 00000246 ORIG_RAX: 0000000000000139
[   12.653937] RAX: ffffffffffffffda RBX: 00005f8bcaa63eb0 RCX: 000075340ed25cfd
[   12.653938] RDX: 0000000000000000 RSI: 00005f8bca5d9727 RDI: 0000000000000003
[   12.653938] RBP: 00005f8bca5d9727 R08: 0000000000000040 R09: 00007ffc8837f5b0
[   12.653938] R10: ffffffffffffffc0 R11: 0000000000000246 R12: 0000000000040000
[   12.653939] R13: 00005f8bcaa635f0 R14: 00005f8bcaa64ab0 R15: 00005f8bcaa640e0
[   12.653940]  </TASK>

 

 

 

 

bluefirestorm
Champion

The hardware looks like a Lenovo Legion Pro 7 with a 13th Gen Intel CPU (based on the Lenovo 82WQ identifier).

Further down the call trace there is:

[ 12.653895] Vmx86GenFindCommonIntelVTCap+0x144a/0x1540 [vmmon]
[ 12.653899] Vmx86_CheckMSRUniformity+0x695/0x700 [vmmon]

My guess is that the MSRs are not uniform between e-cores and p-cores (and the VT-x capabilities probably are not either), which results in the UBSan out-of-bounds report. This is possible because, as VM transitions (VMexit and VMentry) are made, execution may not necessarily land on the same type of core (hence the difference in MSR values). The microcode versions may not even be the same.

cat /proc/cpuinfo | grep microcode
will likely show different values for e-cores and the p-cores.
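
To make any difference easier to spot than in raw grep output, the values can be listed per logical CPU and then tallied. This is only a minimal sketch, assuming an x86 Linux host where /proc/cpuinfo carries `processor` and `microcode` fields; the function names `cpu_microcode` and `microcode_summary` are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helpers (not part of any VMware or kernel tooling).

# Print "<processor> <microcode>" for each logical CPU found in
# /proc/cpuinfo-style text read from stdin.
cpu_microcode() {
    awk -F': *' '
        /^processor[ \t]*:/ { cpu = $2 }
        /^microcode[ \t]*:/ { print cpu, $2 }
    '
}

# Count how many logical CPUs report each microcode revision.
# More than one output line means the revisions are not uniform.
microcode_summary() {
    cpu_microcode |
        awk '{ count[$2]++ } END { for (m in count) print m, count[m] }' |
        sort
}

# Usage on a live host:
#   cpu_microcode < /proc/cpuinfo
#   microcode_summary < /proc/cpuinfo
```

If `microcode_summary` prints a single line, all logical CPUs report the same revision and the microcode part of the guess does not apply to that machine.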

What you could try is to set core/thread affinity of the VM(s) to use only either the p-cores or e-cores but not a mix. This can be done either in the vmx or /etc/vmware/config (instead of applying it on every VM vmx) but try first on one VM.

The example below (see Spoiler) uses p-cores only on an i9-13900HX (assuming 8 p-cores with HT enabled plus 16 e-cores). If you want to use only the e-cores, just flip the "TRUE"/"FALSE" values around.

Spoiler
Processor0.use = "TRUE"
Processor1.use = "TRUE"
Processor2.use = "TRUE"
Processor3.use = "TRUE"
Processor4.use = "TRUE"
Processor5.use = "TRUE"
Processor6.use = "TRUE"
Processor7.use = "TRUE"
Processor8.use = "TRUE"
Processor9.use = "TRUE"
Processor10.use = "TRUE"
Processor11.use = "TRUE"
Processor12.use = "TRUE"
Processor13.use = "TRUE"
Processor14.use = "TRUE"
Processor15.use = "TRUE"
Processor16.use = "FALSE"
Processor17.use = "FALSE"
Processor18.use = "FALSE"
Processor19.use = "FALSE"
Processor20.use = "FALSE"
Processor21.use = "FALSE"
Processor22.use = "FALSE"
Processor23.use = "FALSE"
Processor24.use = "FALSE"
Processor25.use = "FALSE"
Processor26.use = "FALSE"
Processor27.use = "FALSE"
Processor28.use = "FALSE"
Processor29.use = "FALSE"
Processor30.use = "FALSE"
Processor31.use = "FALSE"
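
For a CPU with a different core count, a list like the one above can be generated instead of typed by hand. The sketch below assumes, as in the example, that the logical CPUs to allow are numbered first (0 to ALLOWED-1) and the rest follow; `gen_affinity` is a hypothetical helper name, and the numbering should be checked against the actual host before relying on the output.

```shell
#!/bin/sh
# Hypothetical helper: emit ProcessorN.use lines for a vmx or
# /etc/vmware/config file. Logical CPUs 0..(allowed-1) get "TRUE",
# CPUs allowed..(total-1) get "FALSE".
gen_affinity() {
    allowed=$1
    total=$2
    i=0
    while [ "$i" -lt "$total" ]; do
        if [ "$i" -lt "$allowed" ]; then
            value=TRUE
        else
            value=FALSE
        fi
        printf 'Processor%d.use = "%s"\n' "$i" "$value"
        i=$((i + 1))
    done
}

# 16 p-core threads allowed out of 32 logical CPUs, matching the
# i9-13900HX example above.
gen_affinity 16 32
```

The output can be pasted into one VM's vmx file first, as suggested above, before moving it to /etc/vmware/config.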
andreaplanet
Enthusiast

Same boat. Host FC39 Kernel 6.7 freezes without the patch.

VMware could at least merge the patches provided by mkubecek. In many cases, the required changes are only a few lines of code.

michaeljclark
Contributor

The UBSAN warnings can be disabled by adding CONFIG_UBSAN=n to the make command when compiling the modules:

git clone https://github.com/mkubecek/vmware-host-modules.git
cd vmware-host-modules
git checkout workstation-17.5.1
make VM_UNAME=$(uname -r) CONFIG_UBSAN=n
sudo make install
