Hi all,
I'm trying to configure a POC environment and am currently stuck getting the NVIDIA driver to load.
[root@localhost:~] nvidia-smi
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.
[root@localhost:~] esxcli software vib list | grep -i nvidia
NVIDIA-VMware_ESXi_7.0.2_Driver 470.63-1OEM.702.0.0.17630552 NVIDIA VMwareAccepted 2021-09-13
[root@localhost:~] dmesg | grep -E "NVRM|nvidia"
2021-09-13T23:36:11.216Z cpu0:2097152)Loading nvidia_b.v00...
2021-09-13T23:36:11.217Z cpu0:2097152)VisorFSTar: 1871: nvidia_b.v00 for 0x5e18082 bytes
2021-09-13T23:36:43.142Z cpu93:2100393)SchedVsi: 2098: Group: host/vim/vmvisor/plugins/nvidia(18804): max=70 min=70 minLimit=unlimited shares=1000, units: mb
2021-09-13T23:36:43.182Z cpu80:2098541)Starting service nvidia-init
2021-09-13T23:36:43.243Z cpu80:2098541)Activating Jumpstart plugin nvidia-init.
2021-09-13T23:36:51.802Z cpu56:2098541)Jumpstart plugin nvidia-init activated.
2021-09-13T23:36:52.801Z cpu42:2100998)SchedVsi: 1016: Group nvidia could not be created: Already exists
The host is a Dell R740. I have enabled SR-IOV in the BIOS and disabled the built-in video controller. I also changed MMIO to 12TB.
Thanks.
Ciao
What is the model of the NVIDIA card?
Sorry, I should have said: it's an A40.
Please check whether the card is configured for Passthrough in the ESXi host's settings.
André
Hi there,
Yes, it is:
Unless I need to do the below on the host as well? It errors out anyway.
Ciao
Take the GPU out of Passthrough mode and use a vGPU profile instead. Then run nvidia-smi again.
Fabio
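For anyone following along, here is a rough way to confirm from the ESXi shell whether the NVIDIA kernel module actually loaded after taking the card out of passthrough. This is only a sketch: `nvidia_module_loaded` is a made-up helper name, and it assumes the module shows up with "nvidia" in its name in the `vmkload_mod -l` listing.

```shell
# nvidia_module_loaded: hypothetical helper. Reads a module listing on stdin
# and succeeds if any line mentions "nvidia" (case-insensitive).
nvidia_module_loaded() {
    grep -qi 'nvidia'
}

# Only runs on an ESXi host, where vmkload_mod is available.
if command -v vmkload_mod >/dev/null 2>&1; then
    if vmkload_mod -l | nvidia_module_loaded; then
        echo "NVIDIA kernel module is loaded"
    else
        echo "NVIDIA kernel module is not loaded"
    fi
fi
```

If the module is not listed while the VIB is installed, the passthrough setting on the card is one thing worth checking before reinstalling the driver.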
Hi Fabio,
OK, I removed the passthrough and now get the nvidia-smi output below:
[root@localhost:~] nvidia-smi
Tue Sep 14 10:53:42 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.63       Driver Version: 470.63       CUDA Version: N/A      |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A40          Off  | 00000000:3B:00.0 Off |                    0 |
|  0%   31C    P0   101W / 300W |      0MiB / 45634MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
But when I try to select a vGPU profile, the list is empty. I may try rebooting the host again.
OK, try another reboot.
In the meantime, what vSphere licenses do you have? Also check how Hardware Graphics is configured on the host.
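The Hardware Graphics setting can also be read from the ESXi shell with `esxcli graphics host get`. A minimal sketch, assuming the output contains a "Default Graphics Type:" line; `default_graphics_type` is just an illustrative helper name. vGPU needs the host set to Shared Direct, which the CLI reports as SharedPassthru.

```shell
# default_graphics_type: hypothetical helper. Reads `esxcli graphics host get`
# output on stdin and prints the value of "Default Graphics Type".
default_graphics_type() {
    sed -n 's/^ *Default Graphics Type: *//p'
}

# Only runs on an ESXi host, where esxcli is available.
if command -v esxcli >/dev/null 2>&1; then
    gtype=$(esxcli graphics host get | default_graphics_type)
    echo "Default graphics type: $gtype"
    # Shared Direct (SharedPassthru) is required for vGPU profiles.
    if [ "$gtype" != "SharedPassthru" ]; then
        echo "Host is not set to Shared Direct"
    fi
fi
```

If it reports something else, `esxcli graphics host set --default-type SharedPassthru` changes it; a host reboot afterwards is the safe way to make sure the change is picked up.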
Hi Fabio,
After removing the card from passthrough and rebooting the host, I can now see the profiles. Thank you for your help.
Regards,
Tom
I am now seeing the error below when trying to power on the machine with vGPU configured. It only appears in vCenter; I can power the VM on fine from the ESXi host.
Any suggestions?
TESTVM
The operation is not allowed in the current state of the host.
Just to add to this: if I power on the machine from the ESXi host, it starts fine. I managed to configure the card there and obtain a license from the NVIDIA license server. Starting the VM from vCenter fails, though, and I need to fix this before deploying desktop pools. Any help is greatly appreciated.
Ciao
What vSphere licenses do you have installed? What error do you see when starting the VM from vCenter?
Hi Fabio,
Error is:
The operation is not allowed in the current state of the host.
Licenses:
VMware vCenter Server 7 Standard
vSphere 7 Desktop Host
Regards,
Tom
Ciao
It could be a communication problem between the ESXi host and the vCenter.
Try disconnecting and reconnecting the ESXi host to the vCenter.
I did try this, but it did not resolve the issue. The issue only occurs when the vGPU is added; without the vGPU, the machine starts fine in vCenter.
Ciao
Can you check the VM log for errors?
Locating virtual machine log files on an ESXi host (1007805) (vmware.com)
I tried that yesterday. The problem is that nothing is appended to the log when I try to boot the machine from vCenter. Is there a location on the vCenter that would have detailed logging of vCenter activities?
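On the logging question: vCenter-side activity goes to vpxd.log on the vCenter Server Appliance, and on the ESXi host the vCenter agent and host daemon write to vpxa.log and hostd.log. A rough sketch for pulling out lines about the VM; `vm_log_errors` is a hypothetical helper, and the keywords are only a guess at what a failed power-on might log.

```shell
# vm_log_errors: hypothetical helper. Prints lines from LOG_FILE that mention
# VM_NAME together with an error-like keyword.
vm_log_errors() {
    vm_name=$1
    log_file=$2
    grep -i -- "$vm_name" "$log_file" | grep -iE 'error|fault|not allowed'
}

# Standard locations; only the files present on the current system are read.
#   vCenter Server Appliance: /var/log/vmware/vpxd/vpxd.log
#   ESXi host:                /var/log/vpxa.log and /var/log/hostd.log
for f in /var/log/vmware/vpxd/vpxd.log /var/log/vpxa.log /var/log/hostd.log; do
    if [ -r "$f" ]; then
        vm_log_errors TESTVM "$f" || true
    fi
done
```

Since the VM's own vmware.log stays empty when the power-on is rejected, the vpxd and vpxa logs around the time of the failed attempt are the more likely place to find a reason.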
Ciao
Do you have DRS enabled?
Have you already tried to remove the VM from the vCenter inventory and put it back?
Can you send me the screenshot of the HW assigned to the VM?
Hi Fabio,
I just checked with colleagues from NVIDIA and noticed that the vCenter is not on the latest version. I am updating it now to see if that resolves the issue. Once it is updated I will provide further feedback.
Thanks for your help so far!
Regards,
Tom
I can confirm that after updating to the latest vCenter and ESXi, the issue is no longer present!
Thanks for your help!