Issue
The host goes into an unresponsive state due to "Bootbank cannot be found at path '/bootbank'" and the boot device is in an APD state.
This issue occurs when the boot device stops responding and enters an APD (All Paths Down) state. In some cases, the host becomes unresponsive and shows as disconnected in vCenter.
As of 7.0 Update 1, the format of the ESX-OSData boot data partition has been changed. Instead of FAT, it now uses a new format called VMFS-L. This format allows much more, and much faster, I/O to the partition. The resulting level of read and write traffic overwhelms and corrupts many less capable SD cards.
We have come across a lot of customers reporting bootbank errors (hosts booting from SD cards) and hosts going into an unresponsive state on ESXi version 7.
Our VMware engineering team is gathering information for a fix, and a new vmkusb driver version is available for testing. The current workaround is to install version 2 of the vmkusb driver and monitor the host.
The action plan for a longer-term resolution is to replace the SD card(s) with a capable device or disk, per the best practices in the Installation guide.
The version 7.0 Update 2 VMware ESXi Installation and Setup Guide, page 12, specifically says that the ESX-OSData partition "must be created on high-endurance storage devices".
https://docs.vmware.com/en/VMware-vSphere/7.0/vsphere-esxi-702-installation-setup-guide.pdf
You can also refer to the below KB:
Reference: https://kb.vmware.com/s/article/83376?lang=en_US
Resolution
VMware engineering has a fix that will be in the next release, 7.0 U2 P03, which is planned for sometime in July 2021.
Any word on the imminent release of U2P03? My team is tired of playing Whack-A-Mole. 🙂
Larry
Sorry, no specific date has been verified yet for the GA release; I would assume sometime in Aug 2021.
We have sort of mitigated the issue by scripting reboots of cluster nodes. We also stopped Turbonomic from managing DRS in the cluster, which had appeared to significantly increase I/O according to the logs. esxcfg-rescan -d vmhba32 seems to work on hosts that are not fully disconnected from the cluster.
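For anyone scripting the rescan, here is a minimal Python sketch of a wrapper around the command mentioned above. The function names and the dry-run default are my own assumptions for illustration; the only thing taken from this thread is the command itself (esxcfg-rescan -d followed by the adapter name). It only prints the command unless dry_run is switched off.

```python
import subprocess


def build_rescan_command(adapter):
    """Build the argv for the rescan command quoted in this thread.

    The adapter name (e.g. "vmhba32") is host specific; this helper is
    hypothetical and only checks that the name looks like a vmhba.
    """
    if not adapter.startswith("vmhba"):
        raise ValueError("expected an adapter name such as 'vmhba32'")
    return ["esxcfg-rescan", "-d", adapter]


def rescan(adapter, dry_run=True):
    """Return the command; execute it only when dry_run is False."""
    cmd = build_rescan_command(adapter)
    if not dry_run:
        # check=True raises CalledProcessError if the command fails.
        subprocess.run(cmd, check=True)
    return cmd


if __name__ == "__main__":
    # Dry run: just show what would be executed on the host.
    print(" ".join(rescan("vmhba32")))
```

Keeping the dry run as the default seemed safer to me, since the adapter behind the SD card is not necessarily vmhba32 on every host.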
Here is the comm from support.. note the promise for U3 by mid August.. clock is ticking vmware!
"
Thank you for your time over the course of this SR:21237061007 and thank you for choosing VMware Products!
I will now proceed in placing this Support Request in an archived state. This state means the Support Request can be re-activated by replying to this mail or by calling VMware Customer Support at any stage within the next 21 days.
To ensure clarity on the resolution of your issue and as a record for yourself below is a summary of what we worked on:
Summary
ESXi 7 host frequently disconnecting from vcenter
Cause
2021-07-07T23:07:41.135Z cpu12:2097520)WARNING: NMP: nmp_DeviceRequestFastDeviceProbe:237: NMP device "mpx.vmhba32:C0:T0:L0" state in doubt; requested fast path state update...
2021-07-07T23:07:41.135Z cpu12:2097520)ScsiDeviceIO: 4315: Cmd(0x45d95fcd2100) 0x28, cmdId.initiator=0x43079ee36ac0 CmdSN 0x1 from world 4817311 to dev "mpx.vmhba32:C0:T0:L0" failed H:0x5 D:0x0 P:0x0 Cancelled from path layer. Cmd count Active:1
2021-07-07T23:07:41.135Z cpu12:2097520)Queued:2
2021-07-07T23:07:41.136Z cpu27:4817311)VFAT: 5144: Failed to get object 36 type 2 uuid 5f525e1a-4f3300a9-443a-36db70100038 cnum 0 dindex fffffffecdate 0 ctime 0 MS 0 :Timeout
2021-07-07T23:07:41.179Z cpu6:4817326)ALERT: Bootbank cannot be found at path '/bootbank'
2021-07-07T23:07:41.770Z cpu22:2097521)ScsiPath: 8058: Cancelled Cmd(0x45b960955000) 0x0, cmdId.initiator=0x45393781bc58 CmdSN 0x0 from world 0 to path "vmhba32:C0:T0:L0". Cmd count Active:0 Queued:2.
2021-07-07T23:07:41.770Z cpu12:4784715)VMW_SATP_LOCAL: satp_local_updatePath:856: Failed to update path "vmhba32:C0:T0:L0" state. Status=Transient storage condition, suggest retrys..
Resolution
As you are running ESXi 7.0 Update 2 from an SD card, the host becoming unresponsive with the /bootbank cannot be found message is a known issue, and an action plan was shared with you regarding it.
The fix for the issue will be released in ESXi 7.0 patch 3, which is due to be released in a couple of days, by mid-August at the latest. In the meantime, you can perform the following as a workaround:
1. Reboot the affected host. ESXi then starts talking to the SD card again, until the SD card is once more overwhelmed by the I/O sent by our kernel.
2. If rebooting the ESXi host is not an option and VMs are running, rescan the vmhba using the command: esxcfg-rescan -d vmhba32"
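To spot the pattern from the log excerpt above without reading the whole log by eye, a short Python sketch along these lines might help. It assumes the two message formats are exactly as quoted in the support mail; the function name is hypothetical, and the sample lines are copied from the excerpt. In practice you would pass in the lines of the host's vmkernel log instead.

```python
import re

# Message text copied from the ALERT line in the excerpt above.
BOOTBANK_ALERT = "ALERT: Bootbank cannot be found at path '/bootbank'"
# Captures the device name from the NMP "state in doubt" warning.
STATE_IN_DOUBT = re.compile(r'NMP device "(?P<device>[^"]+)" state in doubt')


def scan_vmkernel_lines(lines):
    """Return (timestamps of bootbank alerts, sorted devices in doubt)."""
    alert_times = []
    devices = set()
    for line in lines:
        if BOOTBANK_ALERT in line:
            # The timestamp is the first space-separated field.
            alert_times.append(line.split(" ", 1)[0])
        match = STATE_IN_DOUBT.search(line)
        if match:
            devices.add(match.group("device"))
    return alert_times, sorted(devices)


sample = [
    '2021-07-07T23:07:41.135Z cpu12:2097520)WARNING: NMP: '
    'nmp_DeviceRequestFastDeviceProbe:237: NMP device '
    '"mpx.vmhba32:C0:T0:L0" state in doubt; requested fast path state update...',
    "2021-07-07T23:07:41.179Z cpu6:4817326)ALERT: Bootbank cannot be found "
    "at path '/bootbank'",
]

alert_times, devices = scan_vmkernel_lines(sample)
print(alert_times)  # ['2021-07-07T23:07:41.179Z']
print(devices)      # ['mpx.vmhba32:C0:T0:L0']
```

The device name it reports (here mpx.vmhba32:C0:T0:L0) tells you which adapter to use in the rescan workaround.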
Seems like a patch is imminent:
Resolution: This issue is resolved in VMware vSphere ESXi 7.0 U2c. To download go to the Customer Connect Patch Downloads page.
It's available within the lifecycle manager in vCenter.
Test Environment patched, will see how it goes before moving onto prod. I hope VMware are not going through a bad patch with the updates/ patches again like they did years ago. Patch one thing and introduce another bug just as bad...
Thanks to one of our awesome VMware TAMs (that's probably why every account should have a TAM covering them), who provided me with this Skyline update, which proactively detects vSphere-VMFS-L-SDCard for potential VMFS-L Locker partition corruption with low-endurance boot devices on ESXi.
https://twitter.com/VMwareSkyline/status/1430246999475900417
get a TAM & get Skyline rolling
and what does it do when it detects the corruption!
Automatically fix it. email VMware Support advising you not to reboot ever!
Has anyone confirmed that P03 fixes the SD IO saturation?
it's still a recommendation to use "High Endurance Flash" even with this patch!
it will be interesting to see if Dell/HPE retract their statements about SD cards!
Only time will tell, if it fixes it, whatever it was!!!! There seem to be many different scenarios which occur.
for me, it crapped out after 13 minutes of new install and high endurance flash media! with no VMware Tools, no VMs.... I will try the same situation and see if I can get it to corrupt the install!
All seems fine for the majority of customers; however, I do have a few for which Skyline detects possible SD card issues.
Only time will tell but so far so good
and what does Skyline do ? or recommend ?
It just points you to the KB article.
As it did for us even though we only have 6.7 hosts. I guess because the article states that 6.7 is also affected (with no resolution) even though it also states:
"Potential VMFS-L Locker partition corruption on SD cards in ESXi 7.0"
"Starting in ESXi 7.0, the boot partition is formatted as VMFS-L instead of FAT"
Does anyone read these articles before they publish them?
Unfortunately, it seems that long gone are the days when we only had to wait for U1 to consider the new ESXi version stable. We obviously have to change our policy and consider it beta until at least U3.
Skyline detects for potential issues with the new VMFS-L on low endurance SD cards
how does Skyline KNOW - this - low endurance SD cards ???
does it have some sort of AI ?
and what does it do, automatically fix it? or just point you to a useless KB!
It doesn't know. It just sees an SD card and points you to the article, it does nothing.
@vbabic thanks
exactly, I have no idea, "why everyone thinks this is the next best thing since sliced bread!"
Hi,
does anyone know if I can install this patch if I'm running the DellEMC customized 7.0U2 version? Usually I would wait until Dell releases their custom ISO/ZIP; however, this issue is really annoying...
thanks
yes!
Same here using customized Dell ISOs and I have successfully updated my hosts with the U2c patch using Lifecycle Manager.