I have a 5-host ESXi 5.5 cluster that experienced a hard shutdown due to a severe-weather power outage. The cluster came back up once power was restored; however, HA now fails to initialize. The operation times out and the cluster election fails. In this particular instance, the hosts lost power along with the domain controllers, so they all came back up together, and the hosts most likely came up before AD services and DNS were actually online. Since this acts like a DNS problem, should I put each host in maintenance mode and reboot it? I have already disabled and then re-enabled HA on the cluster with no joy.
Thanks for clarifying,
Tony
Only a guess, but maybe restarting the Management Agents on the host may help. Did you already try this?
André
No, I haven't. That's a good suggestion. As I recall, there's no downtime incurred doing that, correct?
No downtime required, just make sure there aren't any active tasks running (like a migration, backup, ...).
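For reference, the management agents can also be restarted from an SSH session on each host instead of through the DCUI. A minimal sketch for ESXi 5.x:

```shell
# Restart all ESXi management agents in one go (ESXi 5.x)
/sbin/services.sh restart

# Or restart only hostd and vpxa individually:
/etc/init.d/hostd restart
/etc/init.d/vpxa restart
```

Running VMs keep running while the agents restart; the host just drops out of vCenter's view for a minute or two while hostd/vpxa come back.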
André
I restarted the management agents on all hosts, both with and without HA enabled, with no success. I did test the management network for connectivity and DNS resolution, which was successful. Would enabling jumbo frames help with the timeout issue? I pulled the FDM logs from each host; here is a portion from one of them. The cluster seems to be running normally except for the failure of HA to initialize.
2015-04-28T20:52:12.723Z [7CF55B70 warning 'Election' opID=SWI-6058ed8] [ClusterElection::UpdateHostListWork] localhost not in new host list
2015-04-28T20:52:12.723Z [7CF55B70 warning 'Election' opID=SWI-6058ed8] Election error
2015-04-28T20:52:12.723Z [7D15CB70 info 'Election'] MasterShutdown
2015-04-28T20:52:12.723Z [7CF55B70 info 'Election' opID=SWI-6058ed8] [ClusterElection::ChangeState] Slave => Startup : Election error
2015-04-28T20:52:12.723Z [7CF55B70 info 'Cluster' opID=SWI-6058ed8] Change state to Startup:0
2015-04-28T20:52:12.723Z [7CF55B70 verbose 'HalCnx' opID=SWI-6058ed8] [HalCnx] Configuring firewall to close tcp(8182) and udp(8182)
2015-04-28T20:52:12.723Z [7CF14B70 verbose 'Cluster' opID=SWI-56f32f43] [ClusterManagerImpl::CheckElectionState] Transitioned from Slave to Startup
2015-04-28T20:52:12.723Z [7CF14B70 info 'Invt' opID=SWI-56f32f43] [InventoryManagerImpl::NotifyDatastoreUnlockedLocally] Invoked for datastore (/vmfs/volumes/50979f56-0fe91490-145e-14feb5dbcb15).
2015-04-28T20:52:12.723Z [7CF14B70 info 'Invt' opID=SWI-56f32f43] [InventoryManagerImpl::NotifyDatastoreUnlockedLocally] Invoked for datastore (/vmfs/volumes/50991870-1dd1308c-903b-14feb5dc1153).
2015-04-28T20:52:12.723Z [7CF14B70 info 'Invt' opID=SWI-56f32f43] [InventoryManagerImpl::NotifyDatastoreUnlockedLocally] Invoked for datastore (/vmfs/volumes/51518f14-b6c7387c-5619-14feb5dc1153).
2015-04-28T20:52:12.723Z [7D15CB70 info 'Message'] Destroying connection
2015-04-28T20:52:12.723Z [7CF14B70 info 'Invt' opID=SWI-56f32f43] [InventoryManagerImpl::NotifyDatastoreUnlockedLocally] Invoked for datastore (/vmfs/volumes/501a5188-e10753e0-e805-14feb5dc1153).
2015-04-28T20:52:12.723Z [7CF14B70 info 'Invt' opID=SWI-56f32f43] [InventoryManagerImpl::NotifyDatastoreUnlockedLocally] Invoked for datastore (/vmfs/volumes/51518b33-73864b6a-93c3-14feb5dc1153).
2015-04-28T20:52:12.723Z [7CF14B70 info 'Invt' opID=SWI-56f32f43] [InventoryManagerImpl::NotifyDatastoreUnlockedLocally] Invoked for datastore (/vmfs/volumes/53c68fb9-b514e950-9bbd-c81f66f4bf36).
2015-04-28T20:52:12.724Z [7CF14B70 info 'Invt' opID=SWI-56f32f43] [InventoryManagerImpl::NotifyDatastoreUnlockedLocally] Invoked for datastore (/vmfs/volumes/4e8ad8d2-21b8fd24-ad4b-14feb5dc1153).
2015-04-28T20:52:12.724Z [7CF14B70 info 'Invt' opID=SWI-56f32f43] [InventoryManagerImpl::NotifyDatastoreUnlockedLocally] Invoked for datastore (/vmfs/volumes/4e8ad73b-dcf6f68c-f93a-14feb5dbcb15).
2015-04-28T20:52:12.724Z [7CF14B70 info 'Invt' opID=SWI-56f32f43] [InventoryManagerImpl::NotifyDatastoreUnlockedLocally] Invoked for datastore (/vmfs/volumes/53c6ea60-1c993a40-bd12-14feb5dc1153).
2015-04-28T20:52:12.724Z [7CF14B70 info 'Invt' opID=SWI-56f32f43] [InventoryManagerImpl::NotifyDatastoreUnlockedLocally] Invoked for datastore (/vmfs/volumes/5501b069-94d2dd2a-5b6d-14feb5dc1153).
2015-04-28T20:52:12.724Z [7CF14B70 info 'Invt' opID=SWI-56f32f43] [InventoryManagerImpl::NotifyDatastoreUnlockedLocally] Invoked for datastore (/vmfs/volumes/5098ddc8-bcaeb5c4-d7b4-14feb5dc1153).
2015-04-28T20:52:12.724Z [7CF14B70 info 'Invt' opID=SWI-56f32f43] [InventoryManagerImpl::NotifyDatastoreUnlockedLocally] Invoked for datastore (/vmfs/volumes/5501ac0f-6dce8512-7e1c-14feb5dc1153).
2015-04-28T20:52:12.724Z [7CF14B70 info 'Invt' opID=SWI-56f32f43] [InventoryManagerImpl::NotifyDatastoreUnlockedLocally] Invoked for datastore (/vmfs/volumes/524483ae-7eafed78-5fe8-14feb5dc1153).
2015-04-28T20:52:12.724Z [7CF14B70 info 'Invt' opID=SWI-56f32f43] [InventoryManagerImpl::NotifyDatastoreUnlockedLocally] Invoked for datastore (/vmfs/volumes/4fabcb0d-563cad06-dff3-14feb5dc1153).
2015-04-28T20:52:12.724Z [7CF14B70 info 'Invt' opID=SWI-56f32f43] [InventoryManagerImpl::NotifyDatastoreUnlockedLocally] Invoked for datastore (/vmfs/volumes/5097a735-6348db78-0c4d-14feb5dc1153).
2015-04-28T20:52:12.724Z [7CF14B70 info 'Invt' opID=SWI-56f32f43] [InventoryManagerImpl::NotifyDatastoreUnlockedLocally] Invoked for datastore (/vmfs/volumes/53a492fa-c5a80408-cfd5-14feb5dc1153).
2015-04-28T20:52:12.724Z [7CF14B70 info 'Cluster' opID=SWI-56f32f43] [ClusterManagerImpl::MainLoop] curState 1 lastState 4
2015-04-28T20:52:12.724Z [7D099B70 info 'Invt' opID=SWI-4b28ee67] [InventoryManagerImpl::ProcessClusterChange] Cluster state changed to Startup
2015-04-28T20:52:12.724Z [7D099B70 verbose 'PropertyProvider' opID=SWI-4b28ee67] RecordOp ASSIGN: clusterState, fdmService
2015-04-28T20:52:12.724Z [7D099B70 verbose 'FDM' opID=SWI-4b28ee67] [FdmService::Handle::ClusterStateNotification] Cluster state changed: Slave -> Startup
2015-04-28T20:52:12.724Z [7D099B70 verbose 'Placement' opID=SWI-4b28ee67] [PlacementManagerImpl::Handle<ClusterStateNotification>] New cluster state is 1
2015-04-28T20:52:12.724Z [7D099B70 verbose 'Execution' opID=SWI-4b28ee67] [ExecutionManagerImpl::ClusterStateListener::Handle] New cluster state is 1
2015-04-28T20:52:12.724Z [7D099B70 verbose 'Policy' opID=SWI-4b28ee67] [PolicyManager::Handle(ClusterStateNotification)] Transitioning to startup (1). Disabling global policy and enabling local policy.
2015-04-28T20:52:12.724Z [7D099B70 verbose 'Monitor' opID=SWI-4b28ee67] [IsoAddressMonitor::Handle::ClusterStateNotification] Cluster state changed to 1
2015-04-28T20:52:12.724Z [7D099B70 verbose 'Monitor' opID=SWI-4b28ee67] [PingableAddressMonitor::Handle::ClusterStateNotification] Cluster state changed to 1
2015-04-28T20:52:12.724Z [7D099B70 verbose 'Monitor' opID=SWI-4b28ee67] [HostAccessMonitor::ClusterStateListener] Cluster state changed to 1
2015-04-28T20:52:12.730Z [7CF55B70 verbose 'HalCnx' opID=SWI-6058ed8] [HalCnx] Skip disabling FT firewall ruleset
2015-04-28T20:52:12.730Z [7CF55B70 verbose 'HalCnx' opID=SWI-6058ed8] [HalCnx] Disabling fdm firewall ruleset
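The connectivity and DNS tests mentioned above can be repeated from each host's shell, and the HA (FDM) agent can be restarted on its own without touching hostd/vpxa. A sketch for ESXi 5.x; the hostnames and IP are placeholders for your environment:

```shell
# Verify forward DNS resolution from the host (names are examples)
nslookup esxi01.example.local
nslookup vcenter.example.local

# Ping another host's management vmkernel interface
vmkping 192.168.1.11

# Restart just the HA (FDM) agent on this host
/etc/init.d/vmware-fdm restart
```

If name resolution works now but the hosts booted before DNS was available, restarting the FDM agent (or reconfiguring HA per host) forces a fresh election attempt with working resolution.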
Hello,
have you tried right-clicking each of your hosts and selecting "Reconfigure host for High Availability"? KB link: VMware KB: Performing a Reconfigure for VMware HA operation on a master node causes an unexpected vi...
Yes, and that also times out.
Wait... I reread the KB you posted. I haven't tried reconfiguring the FDM policy yet.
Thanks for the idea, but configuring HA fails on the individual hosts as well as on the cluster. It doesn't seem to be a network issue, as no latency is noticed.
What may also be worth a try is to remove the hosts from the cluster and then move them back in (one host after the other), and/or disconnect the hosts and reconnect them.
André