VMware Cloud Community
jhyicraft
Contributor

VM Power on failures after adding PCIe device (GPU passthrough)

Hello

I am experiencing VM power-on failures on vSphere 7 (the same happens in the ESXi Host Client) after adding an NVIDIA GPU as a PCIe device.

- ESXi Task result

jhyicraft_0-1707979846834.png

- vSphere Power On Failures

jhyicraft_1-1707979938264.png

- ESXi Host Configure

jhyicraft_2-1707980349084.png

jhyicraft_3-1707980368991.png

- VM Configure
There are no NVIDIA GRID vGPU profiles either (the profile list is empty).

jhyicraft_4-1707980441128.png

I cannot even check the HCL for Supermicro in Lifecycle Manager.
There is no vendor add-on or anything similar for Supermicro.

jhyicraft_5-1707980640797.png

 

  •  ESXi
    • VMware ESXi 7.0.3 (VMKernel Release Build 22348816)
    • Chassis
      • Supermicro SYS-221H-TNR
      • CPU: 2 x Intel Xeon(R) Platinum 8480+
      • Memory: 1 TB
      • GPU
        • 2 x NVIDIA L40
        • Host Driver
          • NVIDIA-GRID-vSphere-7.0-535.129.03-537.70
  • vCenter
    • VMware vCenter Server Appliance 7.0.0.10300
    • Runs as a VM on an ESXi host

What should I do to resolve this?

Thank you.

11 Replies
berndweyand
Expert

both L40s show 0 bytes of memory - is the gpu manager installed correctly?

try "nvidia-smi" on the console to check

jhyicraft
Contributor

Thank you for the check.

The GPU manager was not installed, so I installed it and rebooted the host.

  • nvidia-smi works fine on the ESXi host
  • /etc/init.d/nvdGpuMgmtDaemon status shows
    daemon_nvdGpuMgmtDaemon is running

Now I can see the vGPU profiles for the L40 GPUs!

Also, each L40 GPU now shows 44.9GB of memory.

But the VM still cannot start, with either DirectPath I/O or a vGPU profile.

 

++

I also tried

  • disabling all passthrough settings on the GPUs
  • adding a PCIe device to the guest VM with the vGPU profile 'nvidia_l40-48q'

but I still get the same error and cannot start the guest VM with the GPU.

bluefirestorm
Champion

Have a look at the vmware.log; it might give a clue as to why the power-on is failing.
Considering that each L40 has 48GB of VRAM, it might be the MMIO size: 2 x 48GB = 96GB. The MMIO size has to be a power of 2, starting at 32GB (32GB, 64GB, 128GB, etc.), so 96GB would need to be rounded up to 128GB.

Assuming the VM is already configured for EFI virtual firmware, you could try adding/editing the vmx with the following lines to increase the MMIO size.

pciPassthru.use64bitMMIO = "TRUE"
pciPassthru.64bitMMIOSizeGB = "128"
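
The sizing rule described above can be sketched in a few lines of Python. This only illustrates the arithmetic: the function name `mmio_size_gb` is made up for this example, and the 32GB starting point and the round-up-to-a-power-of-two rule are taken from this post rather than from an official formula.

```python
def mmio_size_gb(gpu_vram_gb, floor_gb=32):
    """Round the total GPU memory up to the next power of two, in GB.

    Hypothetical helper: the power-of-two rule and the 32 GB starting
    point come from the post above, not from a VMware API.
    floor_gb is assumed to be a power of two itself.
    """
    total = sum(gpu_vram_gb)
    size = floor_gb
    while size < total:
        size *= 2
    return size


# Two L40 cards with 48 GB each: 96 GB total, rounded up to 128 GB.
print(mmio_size_gb([48, 48]))  # 128
```

For a single 48GB card the same rule would give 64, which is the kind of value that would go into pciPassthru.64bitMMIOSizeGB.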

 

berndweyand
Expert

do you want to assign the L40 with passthrough or with gpu profiles?

with passthrough you assign the whole L40 to one vm, with profiles many vms can share one L40

afaik you don't need the gpu manager on the host with passthrough.

if you want gpu profiles the gpu manager is required, you also need valid nvidia grid licenses and an nvidia license server in CLS or DLS mode

jhyicraft
Contributor

@bluefirestorm 

The VM's Configuration Parameters that you mentioned were set up already.

What is the vmware.log and where can I find it?

The event logs in vSphere and the ESXi Host Client do not show any details about the VM power-on failure.

(vCenter is provisioned as a VM on one of the ESXi hosts in the cluster.)

 

@berndweyand 

I want to assign the L40 with passthrough first.

But with passthrough enabled, the VM cannot start (without the GPU manager on the ESXi host).

bluefirestorm
Champion

The vmware.log files should be in the same location where the VM is stored.
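
As a rough aid for reading the log, the sketch below filters a vmware.log for lines that mention passthrough or errors. The keyword list and the function name `suspicious_lines` are assumptions chosen for illustration, and the path in the usage comment is a placeholder for the VM's own directory.

```python
def suspicious_lines(lines, keywords=("pcipassthru", "error", "fail")):
    """Return (line_number, text) pairs whose text contains any keyword.

    Matching is case-insensitive. The default keywords are a guess at
    what is worth reading first, not an official list.
    """
    hits = []
    for number, line in enumerate(lines, start=1):
        lowered = line.lower()
        if any(keyword in lowered for keyword in keywords):
            hits.append((number, line.rstrip("\n")))
    return hits


# Usage (placeholder path; point it at the VM's own directory):
# with open("/vmfs/volumes/<datastore>/<vm-name>/vmware.log", errors="replace") as log:
#     for number, text in suspicious_lines(log):
#         print(number, text)
```

The lines just before the first hit are often worth reading as well, since the cause may be logged ahead of the failure message.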

berndweyand
Expert

have you tried "dynamic directpath i/o"?

is the vm configured for efi boot?

have you reserved all memory for the vm?
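
Some of these checks can be read straight from the VM's .vmx file. The sketch below parses `key = "value"` lines and reports whether EFI firmware and the 64-bit MMIO settings mentioned earlier in this thread are present. The function names are hypothetical and the parsing is simplified; memory reservation and Dynamic DirectPath I/O are not covered here.

```python
def parse_vmx(text):
    """Parse simple `key = "value"` lines from a .vmx file into a dict.

    Keys are lowercased for lookup, which is a convenience assumption
    of this sketch.
    """
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        settings[key.strip().lower()] = value.strip().strip('"')
    return settings


def passthrough_checks(settings):
    """Return check name -> bool for a few passthrough prerequisites."""
    size = settings.get("pcipassthru.64bitmmiosizegb", "")
    return {
        "efi_firmware": settings.get("firmware", "").lower() == "efi",
        "use_64bit_mmio": settings.get("pcipassthru.use64bitmmio", "").upper() == "TRUE",
        "mmio_size_set": size.isdigit(),
    }


# Example input using the settings quoted earlier in the thread.
example = '''
firmware = "efi"
pciPassthru.use64bitMMIO = "TRUE"
pciPassthru.64bitMMIOSizeGB = "128"
'''
print(passthrough_checks(parse_vmx(example)))
```

Any check that comes back False points at a setting worth revisiting in the VM's configuration before the next power-on attempt.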

jhyicraft
Contributor

@berndweyand 

I already tried Dynamic DirectPath I/O too.

EFI boot and memory reservation are also set up.

 

@bluefirestorm 

I found that NVIDIA vGPU does not support ESXi 7, but ESXi 8 would work with the L40.
Supported Products :: NVIDIA Virtual GPU Software Documentation

But I cannot find any reason why passthrough mode is still not working.

Right now I am installing the GPU in a bare-metal Windows host (a workstation), so I will check the vmware.log later.

 

Thank you

berndweyand
Expert

According to the link the L40 is supported with ESXi 7 and 8

I have 20 hosts with 3 Tesla GPUs each, running on ESXi 7.0U3

berndweyand
Expert

btw: your vcenter is 7.0a from May 2020 and your host is 7.0u3o from Sep 2023?

jhyicraft
Contributor

@berndweyand 

My vCenter is VMware vCenter Server Appliance 7.0.0.10300

and the ESXi Host is ESXi-7.0U3g-20328353-standard
