Hello,
I'm getting these messages on many hosts in the vSAN cluster:
2014-11-29T01:14:48.834Z cpu13:32888)HBX: 258: Reclaimed heartbeat for volume 54718a58-1fe911f6-9b43-002590f9c358 (588a7154-7605-87d3-9eed-002590f9c358): [Timeout] Offset 3387392
I can ping all the hosts in the cluster, and the disks are healthy.
Any idea how I can solve this?
Txs
Ezequiel
Is it always the same host?
Is it always the same volume? 54718a58-1fe911f6-9b43-002590f9c358
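One way to answer both questions is to pull vmkernel.log from each host and count the "Reclaimed heartbeat" lines per volume. The following is a minimal Python sketch, not a VMware tool. The regular expression is derived only from the single log line quoted in this thread, so other ESXi builds may format the message differently, and the group name second_id is a placeholder because the meaning of the UUID in parentheses is not stated here.

```python
import re
from collections import Counter

# Pattern built from the one log line quoted in this thread (an assumption:
# other builds may word or space the message differently).
HB_RE = re.compile(
    r"^(?P<ts>\S+)\s+cpu\d+:\d+\)HBX: \d+: Reclaimed heartbeat for volume "
    r"(?P<volume>[0-9a-f-]+) \((?P<second_id>[0-9a-f-]+)\): \[(?P<reason>[^\]]+)\]"
)

def tally_reclaims(lines):
    """Count 'Reclaimed heartbeat' events per (volume UUID, reason)."""
    counts = Counter()
    for line in lines:
        m = HB_RE.match(line.strip())
        if m:
            counts[(m.group("volume"), m.group("reason"))] += 1
    return counts

# The first entry is the line from the original post; the second is a
# deliberately non-matching filler line.
sample = [
    "2014-11-29T01:14:48.834Z cpu13:32888)HBX: 258: Reclaimed heartbeat for "
    "volume 54718a58-1fe911f6-9b43-002590f9c358 "
    "(588a7154-7605-87d3-9eed-002590f9c358): [Timeout] Offset 3387392",
    "some unrelated log line",
]

counts = tally_reclaims(sample)
for (volume, reason), n in counts.most_common():
    print(volume, reason, n)
```

Running it once per host log and comparing the tallies should show whether a single volume or a single host dominates.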
You can try examining the stats of the various disks to see whether one of the error counters is incrementing: esxcli storage core device stats get
Have you verified that all the hardware components (controller and SSD/flash device) are on the HCL/VCG? Have you checked the driver and firmware levels?
Do you run any other feature on the controller? e.g. HP Smart Path or similar. Can you turn these off?
Any significant load when these errors occur? Backup for example?
HTH
Cormac
This is happening on every host. When it occurs, all vSAN hosts start losing management from vCenter.
You can log in to the host using SSH, but neither esxcli nor vCenter works.
I checked for network issues but found no errors.
I will check the features on the controller.
This event indicates that the ESX host's connectivity to the volume (for which this event was generated) degraded because the host was unable to renew its heartbeat for a period of approximately 16 seconds (the VMFS lock-breaking lease timeout). After the periodic heartbeat renewal fails, VMFS declares that the heartbeat to the volume has timed out and suspends all I/O activity on the device until connectivity is restored or the device is declared inoperable.
Heartbeat Interval = 3 Seconds
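To make those two numbers concrete, here is a small Python sketch of the arithmetic. It only illustrates the timing described above and is not how VMFS implements heartbeats. The renewal timestamps are hypothetical, and the function names are made up for this example.

```python
from datetime import datetime, timedelta

HEARTBEAT_INTERVAL_S = 3   # "Heartbeat Interval = 3 Seconds" from the text
LEASE_TIMEOUT_S = 16       # approximate lease timeout from the text

def renewals_within_timeout(interval_s=HEARTBEAT_INTERVAL_S,
                            timeout_s=LEASE_TIMEOUT_S):
    """Whole renewal intervals that fit inside the timeout window."""
    return timeout_s // interval_s

def find_timeouts(renewal_times, timeout_s=LEASE_TIMEOUT_S):
    """Return (last_ok, next_ok) pairs whose gap exceeds the timeout."""
    gaps = []
    for prev, cur in zip(renewal_times, renewal_times[1:]):
        if (cur - prev).total_seconds() > timeout_s:
            gaps.append((prev, cur))
    return gaps

# Hypothetical successful renewals, as second offsets from an arbitrary start.
t0 = datetime(2014, 11, 29, 1, 14, 0)
offsets = [0, 3, 6, 9, 30, 33]
renewals = [t0 + timedelta(seconds=s) for s in offsets]

missed = renewals_within_timeout()
gaps = find_timeouts(renewals)
print("renewal intervals inside the timeout window:", missed)
for start, end in gaps:
    print("gap of", (end - start).total_seconds(), "seconds after", start)
```

Under these assumptions, roughly five consecutive renewals have to fail before the timeout is declared, so a brief stall of a few seconds on the device should not by itself produce the message.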
If an ESX host has mounted a volume san-lun-100 from device naa.60060160b4111600826120bae2e3dd11:1 and loses connectivity (due to a cable pull, disk array failure, and so on) to the device for a period exceeding 16 seconds, the following error message appears:
Lost access to volume 496befed-1c79c817-6beb-001ec9b60619 (san-lun-100) due to connectivity issues. Recovery attempt is in progress and outcome will be reported shortly.
All I/O and metadata operations to the specific volume from the COS, the user interface (vSphere Client), or virtual machines are internally queued and retried for some time. If the volume or storage device connectivity is not restored within that time, such I/O operations fail. This might affect virtual machines that are already running as well as any new power-on operations.
To resolve this issue:
To resolve this issue using the service console:
Note: For additional information, see Troubleshooting LUN connectivity issues on ESXi/ESX hosts (1003
Please mark my Answer correct and like if it helped. Thanks