VMware Cloud Community
GregChristopher
Enthusiast

Hidden state on passthru affecting boot after hardware change

Hi All,
I'm having a REALLY interesting (and of course frustrating) issue with one hypervisor using passthrough.

The host in question was using DirectPath I/O for:

-audio and video of an AMD GPU
-FL1100 USB controller
-Promise RAID PCI card

I planned to switch out the video card. I made the mistake of doing that before enabling kernel VGA ( esxcli system settings kernel set -s vga -v TRUE ). The machine was not contactable via the network after the swap.

I found I was able to connect back to the box and change the setting, but only after I put all the original cards back in, in exactly the same slots.

Once I was able to get the console up again, I made sure that DirectPath I/O passthrough was disabled for everything. I installed the new card, removing one of the others so that everything fit, and now the console told me that it was unable to detect any supported network cards. OK, this was getting interesting. So I hit F2 to get a command prompt and... the keyboard was no longer working either! I'm guessing here, but something in the boot order was blocking other drivers from working afterwards.

Thinking this was an artifact of the hardware combination, and/or that the installer needed to run with the correct hardware present, I created an install USB and did a fresh install of the same 7.0.3 build. EVERYTHING WORKED. Thinking that I had fixed it, and that maybe the original problem was something that got messed up in the VIBs or kernel drivers, I used

vim-cmd hostsvc/firmware/backup_config

to back up the config from the system running with its original hardware installed. Then I used

vim-cmd hostsvc/firmware/restore_config

to restore the configuration onto the new boot device, which was working perfectly with a fresh install.

Unfortunately, the restore worked better than I thought it would: the hypervisor exhibited EXACTLY the same behavior it did earlier. It rebooted slowly, then came up with no network connectivity and no keyboard.

So now I'm completely baffled. Something that clearly must be related to passthru remains as state in the system but it's not obvious what. All passthrough is disabled. I did not however modify passthru.map which had a single entry related to the FL1100 (which I'm still not sure is right).
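One place to look for leftover passthrough state is the host's esx.conf, which is part of the configuration that gets backed up and restored. Below is a minimal sketch, assuming ownership is recorded there as lines of the form /device/<address>/owner = "passthru". That format is an assumption to verify on the host, and both function names are hypothetical.

```shell
#!/bin/sh
# Sketch only. Assumes esx.conf records passthrough ownership as
#   /device/<pci-address>/owner = "passthru"
# Verify that format on your own host before relying on it.

# Print every line in the given config file that assigns a device to passthru.
# Defaults to the live config when no file is given.
list_passthru_owners() {
    conf="${1:-/etc/vmware/esx.conf}"
    grep 'owner = "passthru"' "$conf"
}

# Print the active (non-comment, non-blank) entries of a passthru.map file.
list_passthru_map_entries() {
    map="${1:-/etc/vmware/passthru.map}"
    grep -v -e '^[[:space:]]*#' -e '^[[:space:]]*$' "$map"
}
```

Calling list_passthru_owners with no argument reads the live /etc/vmware/esx.conf. Any line it prints while passthrough is supposedly disabled for everything would be a candidate for the state that survives the restore.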

I think some bugs are lurking here:

- Hidden state related to passthru is maintained in state.tgz that we have no idea about
- When a passthru issue happens during bootup sequence, it breaks other unrelated devices

I haven't looked at vmkernel.log yet and of course it's difficult because once the problem happens, I'm in the business of switching around the cards again to get any access whatsoever.

An additional trick I tried: my local.sh dumps the output of several "esxcli network" commands to a VMFS volume during bootup. Unfortunately, the local.sh script does not get a chance to complete when the system is borked. The script of course works perfectly when the system has the original cards installed 🙂
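Since local.sh may be cut off partway through, one option is to flush each command's output to disk as soon as it is written, so that partial results survive. This is a minimal sketch: dump_cmd and the datastore path in the comments are hypothetical, and it cannot help if local.sh never starts at all.

```shell
#!/bin/sh
# Sketch of a helper for /etc/rc.local.d/local.sh.
# Appends one command's output to a log file, then forces it to disk,
# so an interrupted boot still leaves whatever was captured so far.
dump_cmd() {
    out="$1"; shift
    {
        echo "=== $(date) :: $*"
        "$@" 2>&1
        echo "=== exit status: $?"
    } >> "$out"
    sync
}

# Example calls (the log path is a placeholder, adjust to a real datastore):
#   dump_cmd /vmfs/volumes/datastore1/bootdiag.log esxcli network nic list
#   dump_cmd /vmfs/volumes/datastore1/bootdiag.log esxcli hardware pci list
```

Putting the PCI listing first may be more useful than the network commands, since it should show which devices the kernel enumerated even when no NIC driver has attached.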

I'm totally at a loss as to what to try next, and it's vexing.
