Issue
The host becomes unresponsive with the error "Bootbank cannot be found at path '/bootbank'", and the boot device enters an APD (All Paths Down) state.
This issue occurs when the boot device stops responding and enters the APD state. In some cases, the host becomes unresponsive and shows as disconnected from vCenter.
As of 7.0 Update 1, the format of the ESX-OSData boot data partition has changed. Instead of FAT, it uses a new format called VMFS-L, which supports far more frequent and faster I/O to the partition. This level of read and write traffic is overwhelming and corrupting many less capable SD cards.
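If you suspect a host is affected, one quick check from an SSH session on the host is to list the mounted filesystems and try to read the bootbank. This is a sketch; volume names and log contents vary per host.

```shell
# List mounted filesystems; on 7.0 U1+ the OSData volume shows type VMFS-L
esxcli storage filesystem list

# On an affected host, listing the bootbank typically hangs or fails
ls /bootbank

# vmkernel.log usually shows APD/bootbank errors for the boot device
grep -i -E 'bootbank|APD' /var/log/vmkernel.log | tail -n 20
```

These commands must be run on the ESXi host itself; if `ls /bootbank` hangs, that is consistent with the boot device being in APD.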
We have come across many customers reporting bootbank errors (on hosts booting from SD cards) and hosts becoming unresponsive on ESXi version 7.
Our VMware engineering team is gathering information for a fix, and a new vmkusb driver version is available for testing. The current workaround is to install version 2 of the vmkusb driver and monitor the host.
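To check which vmkusb driver a host is currently running, and to apply a replacement VIB obtained from support, something like the following can be used. The depot path below is a placeholder for illustration, not a real file name; use the actual bundle support provides.

```shell
# Show the currently installed vmkusb driver version
esxcli software vib list | grep -i vmkusb

# Install the replacement driver supplied by VMware support
# (replace the path with the offline bundle you actually received)
esxcli software vib install -d /tmp/vmkusb-offline-bundle.zip

# A reboot is required for the new driver to load
reboot
```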
The long-term resolution is to replace the SD card(s) with a more capable device/disk, per the best practices in the Installation guide.
The version 7.0 Update 2 VMware ESXi Installation and Setup Guide, page 12, specifically says that the ESX-OSData partition "must be created on high-endurance storage devices".
https://docs.vmware.com/en/VMware-vSphere/7.0/vsphere-esxi-702-installation-setup-guide.pdf
You can also refer to the below KB:
Reference: https://kb.vmware.com/s/article/83376?lang=en_US
Resolution
VMware engineering has a fix that will be in the next release, 7.0 Update 2 Patch 3, which is planned for sometime in July 2021.
@sbd27 wrote: @PatrickDLong So you are correct. I would not recommend replacing any current embedded ESXi solution, mainly because, at least with Dell, you can't! When you purchase a diskless server from Dell without a PERC card and drive cages, they do not support installing them afterwards; you are stuck.
What makes matters even worse for me (and, I have to assume, other customers) is that I have some R730s that are diskless with only the IDSDM solution, and the R730 (which is still an ESXi-supported server) does not support the BOSS card. If this fix does not work, I have to replace servers I did not budget for in my upgrade project.
However, all new servers that I purchase will no longer use a diskless config. I can easily have a non-technical person replace a bad hot-swappable SSD RAID drive, but replacing a BOSS card or its attached SSD requires downtime and opening the hood of the server. No thanks!
Fortunately, all my Dell servers were acquired with SSDs, so I don't have this issue with the Dells, only with HPE.
Is the update out yet to resolve this?
It seems it got delayed until the end of August (rumors only). Since there are no official statements, it is really difficult to give an exact release date.
My latest occurrence was on 14/07 with 2 hosts, and today was a nightmare: 5 hosts with the issue.
I have never had these numbers in the same cluster. So in a 12-host cluster, 5 had the issue today (or during the weekend). One of those was running vCenter, so I had double the trouble: until I recovered the ESXi host where vCenter was running, everything was crazy and unstable.
PS: If you leave an affected ESXi host in that state for a long time (10-12 h), VMs then start to be affected, with 100% CPU usage and performance issues.
I managed to get my hands on a BOSS card for one of our hosts and moved all the VMs to that host. That will hold me over. I feel bad for people with bigger environments where it's not an option to replace the boot device for dozens or hundreds of hosts.
Status of our SD card test server: 33 days of uptime with no problems. It runs alongside 19 other servers in a cluster.
Quick question for the HPE owners: have you updated the firmware with the latest SPP for Gen9 (2021.05.0) and installed ESXi with the U2a customized image? Maybe this prevents or slows down the issue?
A patch release is due next month to resolve this and also support Secure Boot.
@A13x Do you care to share either your source or your confidence level in the statement "due next month"? I will point out that the OP's (employee) statement "recommending the install of P3 in July sometime" was clearly either incorrect from the outset or invalidated as the date approached, and Duncan clarified this in his response to complaints on this thread after nothing was released on July 15, as had been widely speculated here and elsewhere.
"release dates are typically not shared, mainly as they change based on various aspects. In this case your source was/is wrong."
It seems exceedingly clear to me that VMware is not going to make any official statement regarding the release date for this patch, and speculation on release dates only serves to improperly set expectations, justified or not.
The source is VMware, via an SR. I obtained the patch before, but it never supported Secure Boot. I opened a new case to ask for an ETA and was told ESXi patches for 6.7 and 7 will be released next month.
They also confirmed it several times: this SD card patch will be included and will also support Secure Boot.
We too have hit this issue with HPE BL460c Gen9 blades in a dev cluster on 7.0 U2. We asked VMware to be put on the pre-release of the patch, which is supposedly in U3, mid-August. Sounds like VMware needs to validate this fix and release it ASAP. Sorry for those who have this issue in production!
Apparently downgrading to 7.0 U1 is an option to get around this issue. Anybody know if you can do that with VUM? I've been doing VMware for a million years and never had to downgrade a host. I suppose we could just install a fresh copy of U1 on each.
esxcfg-rescan -d vmhba32 just hangs and hangs
https://www.provirtualzone.com/vsphere-7-update-2-loses-connection-with-sd-cards-workaround/
ls -al on the server this morning... still hanging 8 hours later. The server hasn't disconnected, but it's basically useless other than hosting the VMs it already has.
Does recovery mode work for you? Shift+R at boot to roll back?
Even the workaround does not resolve the issue.
@vivithemage wrote: Even the workaround does not resolve the issue.
The workaround is temporary. It is mainly to recover an affected ESXi host so it can be rebooted properly.
@vmrulz wrote:Apparently downgrading to 7.01 is an option to get around this issue. Anybody know if you can do that with VUM? I've been doing vmware for a million years and never had to downgrade a host. I suppose we could just install a fresh copy of U1 on each.
esxcfg-rescan -d vmhba32 just hangs and hangs
https://www.provirtualzone.com/vsphere-7-update-2-loses-connection-with-sd-cards-workaround/
ls -al on server this morning.. still hanging 8 hours later.. server hasn't disconnected but it's basically useless other than hosting vms.
If the host has the issue, you can't run ls or even df -h or other shell commands; they will hang. You first need to fix the issue with esxcfg-rescan -d vmhba3 and reboot the host. After that, the commands run normally.
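As a sketch of that recovery sequence: the vmhba number of the USB storage adapter differs per host (both vmhba32 and vmhba3 are reported in this thread), so identify it first, then force the rescan and reboot.

```shell
# Identify the USB storage adapter (the driver column shows vmkusb)
esxcli storage core adapter list

# Force a delete-and-rescan of the hung USB adapter
# (substitute the vmhba number found above)
esxcfg-rescan -d vmhba32

# Once the rescan completes, reboot the host cleanly
reboot
```

Note that the esxcfg-rescan step itself can take a long time on an affected host; several posters report it hanging before eventually completing.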
The fix is a later version of the VIB from VMware, which you can request via a support SR. The release is hopefully due next month along with the rest of the VMware host and vCenter patches.
Ah, so what was that workaround for, then? It's in their fix bulletin.
I only use the free version, so no support contract.
@vivithemage wrote:
- I thought the cache tools workaround was the fix?
On some of my ESXi hosts it did fix the issue. On others it reduced how often I get the issue: instead of every 24-48 h, I get it once a week.
So it is not a 100% silver bullet.
@A13x wrote:the fix is a later version of the vib from vmware which you can request via a support SR. the release is hopefully due next month with the rest of the vmware host and vcenter patches
Unfortunately, upgrading (or even downgrading) the vmkusb VIB did not fix all systems; only a couple were fixed. Many customers have stated that this solution did not fix the issue and that they still hit the U2a issue on their ESXi hosts.