VMware Cloud Community
jwince
Contributor

ESXi 6.7 U3 vSAN issues

Hello folks, I've got a weird one for you and could use some help. (@TheBobkin - You saved my butt like 15 times this year, maybe you'll have some fun with this one too.)

General Overview:
We are running ESXi 6.7 U3 on all hosts: 6 hosts in total, with 2 disk groups per host, each containing 1 cache drive and 2 capacity drives. We had a cluster shutdown the other day, and after booting it back up, vSAN behaved much like a previous issue we had where a host dropped its unicast agent list and many of our VMs were unavailable.
This time vCenter is affected and unavailable as well.

The issue was passed to the next tier of engineers, so I won't be able to get new information. I did pull all of /var/log from every host for analysis.

Symptoms:
- All VMs are currently unavailable, in one of the following states:
     - Listed as a 4-digit number
     - Listed in black text and unable to be started
     - Listed in blue text as normal, but unable to enumerate all disks
     - Listed in blue text as normal, but unable to be started in the current state (powered off)
- vSAN health shows inaccessible objects (149 inaccessible, 256 reduced availability with no rebuild, and 54 healthy)
- vSAN health also showed physical disk alarms (two disks on the same host failed)
- The vCenter Server VM is affected; we can't even find it in the datastore to attempt to re-register it (a few other VMs are in the same boat). How we went looking for it from the host shell is sketched below.
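
For context, this is roughly how we went looking for it from a host shell without vCenter. The datastore name is a placeholder for ours; vim-cmd lists the VMs registered with that host's hostd, and each VM folder on the vSAN datastore is a symlink to its namespace object, so a missing or dangling entry points at an inaccessible namespace object rather than a deleted VM:

     # vim-cmd vmsvc/getallvms
     # ls -la /vmfs/volumes/<vsan-datastore-name>/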

We deal with issues like this fairly often; here is a list of checks we ran (unfortunately I don't have a copy of all the outputs to share):


Troubleshooting:
- #esxcli vsan cluster get: All hosts show as expected, e.g. the Master's Local Node UUID matches the Sub-Cluster Master UUID, and the Sub-Cluster Membership UUID matches across all hosts.
- Unicast agent list is populated correctly on all hosts.
- vmkping between the vSAN vmks passes from every host to every other host.
- We even used tcpdump to verify unicast communications: the vSAN Master and Backup talk to all nodes, and the other 4 hosts talk to the Master and Backup. (A rough sketch of these per-host checks follows the prepare output below.)
- We suspect an automation issue with our shutdown process, so we attempted to use:

     #python /usr/lib/vmware/vsan/bin/reboot_helper.py prepare
     Begin to recover the cluster...
     Time among connected hosts are synchronized.
     Scheduled vSAN cluster restore task.
     Waiting for the scheduled task...(18s left)
     Checking network status...
     Recovery is not ready, retry after 10s...
     Recovery is not ready, retry after 10s...
     Recovery is not ready, retry after 10s...
     Timeout, please try again later

After a reboot we get the same results with the recover script.
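
For anyone following along, the rough per-host sequence behind those checks looks like this (vmk1 and the peer IP are placeholders for our environment, and the tcpdump filter assumes vSAN unicast CMMDS traffic on UDP 12321):

     # esxcli vsan cluster get
     # esxcli vsan cluster unicastagent list
     # vmkping -I vmk1 <peer-vsan-vmk-ip>
     # tcpdump-uw -i vmk1 udp port 12321
     # python /usr/lib/vmware/vsan/bin/reboot_helper.py recover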

We noticed that esxcli vsan debug resync shows we have ~3 TB of data identified to resync. We ran the command below beforehand and compared it to a second output file taken the next morning; a diff shows no change.
#esxcli vsan debug resync list > /tmp/object_resync1.txt
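
The next-morning comparison, for completeness (the file names are just what we happened to use, and the resync summary sub-command is from memory, so double-check it):

     # esxcli vsan debug resync list > /tmp/object_resync2.txt
     # diff /tmp/object_resync1.txt /tmp/object_resync2.txt
     # esxcli vsan debug resync summary get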


Digging through the logs, the only errors that seem out of the norm are in /var/log/vmkwarning.log (truncated to the specific messages):

(Host 1 has all of these warnings starting a few days prior, but on the other 5 hosts they started at the same time, right after the shutdown.)

WARNING: CMMDS: MasterSendHeartbeatLogMsg:1890: Send heartbeat to all agents: Failure. Duration of previous failed sends: 0 secs
WARNING: ScsiPath: 9274: Adapter Invalid does not exist
WARNING: PCI: 1211: 0000:00:14.0 is nameless
WARNING: etherswitch: PortCfg_ModInit:910: Skipped initializing etherswitch portcfg for VSS to use cswitch and portcfg module
WARNING: Device: 1462: Failed to register device 0x43088b06a070 logical#swdevroot#com.vmware.iscsi_vmk0 com.vmware.iscsi_vmk (parent=0x578743088b06a3f7): Already exists
WARNING: VSAN: VsanIoctlCtrlNodeCommon:2908: 543ee361-d690-81dc-7072-b47af14047fc: RPC to DOM op readPolicy returned: No connection
WARNING: VSAN: VsanIoctlCtrlNodeCommon:2908: 543ee361-d690-81dc-7072-b47af14047fc: RPC to DOM op readPolicy returned: No connection
WARNING: VSAN: VsanIoctlCtrlNodeCommon:2908: 543ee361-d690-81dc-7072-b47af14047fc: RPC to DOM op aggregateAttributes returned: No connection
WARNING: com.vmware.vmklinkmpi: VmklinkMPIMsgRecv:413: [osfs-vmklink] : No slot found for receive (ID: 0x2f1)
WARNING: com.vmware.vmklinkmpi: VmklinkMPICallback:615: [osfs-vmklink] : discarding user reply for message 2f1 (no request waiting)
WARNING: VMW_VAAIP_CX: cx_claim_device:257: CX device naa.50060160c9e01b1750060160c9e01b17 is not in ALUA mode. ALUA mode is required for VAAI.
WARNING: com.vmware.vmklinkmpi: VmklinkMPI_CallSync:1303: No response received for message 0x2f1 on osfs-vmklink (wait status Timeout)
WARNING: ScsiDevice: 3544: Full GetDeviceAttributes during registration of device 'naa.50060160c9e01b1750060160c9e01b17': failed with I/O error
WARNING: VSAN: Vsan_OpenDevice:1324: Failed to open VSAN device 'mpx.vmhba1:C0:T66:L0:2' with DevLib: Not found
WARNING: VSAN: Vsan_OpenDevice:1324: Failed to open VSAN device 'mpx.vmhba1:C0:T64:L0:2' with DevLib: Not found
WARNING: VSAN: Vsan_OpenDevice:1324: Failed to open VSAN device 'mpx.vmhba1:C0:T65:L0:2' with DevLib: Not found
WARNING: VSAN: Vsan_OpenDevice:1324: Failed to open VSAN device 'mpx.vmhba1:C0:T67:L0:2' with DevLib: Not found
WARNING: ScsiUid: 411: vmhba1:C0:T66:L0: NAA identifier type has an unknown naa value of 0x3
WARNING: ScsiUid: 411: vmhba1:C0:T65:L0: NAA identifier type has an unknown naa value of 0x3
WARNING: ScsiUid: 411: vmhba1:C0:T64:L0: NAA identifier type has an unknown naa value of 0x3
WARNING: ScsiUid: 411: vmhba1:C0:T67:L0: NAA identifier type has an unknown naa value of 0x3
WARNING: MemSchedAdmit: 1238: Group likewise: Requested memory limit 0 KB insufficient to support effective reservation 18340 KB

The host with the failed drives has the below in addition to the above:
WARNING: VSAN: Vsan_OpenDevice:1324: Failed to open VSAN device 'cdf1e261-0843-4b39-f337-b47af14047fc' with DevLib: Busy (hundreds of copies of this line are present)
WARNING: VSAN: Vsan_OpenDevice:1377: Failed to initialize client object cdf1e261-0843-4b39-f337-b47af14047fc: Not found
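
If it helps anyone else reading along, the object UUID in those warnings can be chased down from the host shell with something like the below. The syntax is from memory, so treat it as a sketch rather than gospel; cmmds-tool queries the cluster directory entry for the object, and objtool pulls its attributes/owner:

     # cmmds-tool find -t DOM_OBJECT -u cdf1e261-0843-4b39-f337-b47af14047fc -f json
     # /usr/lib/vmware/osfs/bin/objtool getAttr -u cdf1e261-0843-4b39-f337-b47af14047fc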


We were unable to add the Veeam backups of vCenter to the vSAN datastore. We could put them on the one VM that does boot, but copying the files to the datastore results in a simple "failed" message; pscp also failed.

I'm used to finding cluster/vSAN issues at this point, but without vCenter/RVC it is a little harder.

Is it possible this is as simple as data corruption? I feel like there is some underlying issue; I'm really just looking for other steps I can take in the future to help find and correct the problem.
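
In case it helps the next person stuck without vCenter/RVC, these are the host-side summaries I'd pull first next time (sub-command names are from memory, so verify them before relying on the output):

     # esxcli vsan debug object health summary get
     # esxcli vsan debug disk summary get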

Hope you all are having a great day! (Edit: Formatting)


1 Reply
jwince
Contributor

Ah, I just found the rest of my notes. The remaining checks are below:

- vSAN Daemon liveness is red; EPD failed on 1 of 6 hosts, though I reckon that isn't the root cause.
- esxcli vsan cluster get AND esxcli system maintenanceMode get show that no host is in maintenance mode.
- Verified clomd, hostd, and vsanmgmtd are running on all hosts (rough commands sketched below).
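
The rough commands behind those last checks, for reference (the init scripts are the stock ESXi ones; the ps | grep is just a quick liveness spot check, and the epd match is an assumption about the daemon's process name):

     # esxcli system maintenanceMode get
     # /etc/init.d/clomd status
     # /etc/init.d/vsanmgmtd status
     # /etc/init.d/hostd status
     # ps | grep -E 'clomd|vsanmgmtd|epd'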

Occasionally when running # esxcli vsan cluster health get we got a return of "Invalid Login" (can't remember the exact syntax).

Hopefully these are helpful as well.
