VMware Cloud Community
kc5ruj
Contributor

Cluster Election Fails on HA Configuration

I have a 5-host ESXi 5.5 cluster that experienced a hard shutdown during a severe-weather power outage. The cluster came back up once power was restored; however, HA now fails to initialize. The operation times out and the cluster election fails. In this particular instance the hosts were powered up along with the domain controllers when power was restored, so they all came up together, and the hosts most likely came online before AD services and DNS were actually available. Should I put each host into maintenance mode and reboot it, since this acts like a DNS problem? I have already disabled and then re-enabled HA on the cluster with no joy.
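If it helps, the quick sanity check I was planning to run from each host's shell to confirm name resolution and management-network reachability looks something like this (the hostname and IP below are placeholders, not my real values):

# confirm the host's DNS configuration
cat /etc/resolv.conf
# confirm forward resolution of vCenter and a peer host
nslookup vcenter.example.local
# confirm the management vmkernel port can reach a peer host
vmkping 192.168.1.12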

Thanks for clarifying,

Tony

9 Replies
a_p_
Leadership

Only a guess, but restarting the Management Agents on the hosts may help. Did you already try this?
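If you do it from the ESXi shell (assuming SSH or local Tech Support Mode is enabled), something along these lines should restart the agents without affecting the running VMs:

# restart the host agent and the vCenter agent
/etc/init.d/hostd restart
/etc/init.d/vpxa restart
# or restart all management agents in one go
services.sh restart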

André

kc5ruj
Contributor

No, that's a good suggestion. As I recall, there's no downtime incurred by doing that. Correct?

a_p_
Leadership

No downtime required, just make sure there aren't any active tasks running (like migration, backup, ...).
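If you want to double-check directly on a host, vim-cmd can show what the host agent currently has running, for example:

# list active/recent tasks known to the host agent
vim-cmd vimsvc/task_list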

André

kc5ruj
Contributor

I restarted the management agents on all hosts, both with and without HA enabled, with no success. I did test the management network for connectivity and DNS resolution, both of which were successful. Would enabling jumbo frames help with the time-out issue? I pulled the FDM logs from each host; here is a portion from one host. The cluster seems to be running normally except for the failure of HA to initialize.
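For reference, the checks I ran from each host's shell were roughly the following (the IP is a placeholder); the large don't-fragment vmkping was to rule out an MTU mismatch, and 8182 is the FDM agent port mentioned in the log below:

# management-network reachability to a peer host
vmkping 192.168.1.12
# large don't-fragment ping, only meaningful if jumbo frames are expected end to end
vmkping -d -s 8972 192.168.1.12
# state of the fdm firewall ruleset and any connections on port 8182
esxcli network firewall ruleset list | grep -i fdm
esxcli network ip connection list | grep 8182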

2015-04-28T20:52:12.723Z [7CF55B70 warning 'Election' opID=SWI-6058ed8] [ClusterElection::UpdateHostListWork] localhost not in new host list
2015-04-28T20:52:12.723Z [7CF55B70 warning 'Election' opID=SWI-6058ed8] Election error
2015-04-28T20:52:12.723Z [7D15CB70 info 'Election'] MasterShutdown
2015-04-28T20:52:12.723Z [7CF55B70 info 'Election' opID=SWI-6058ed8] [ClusterElection::ChangeState] Slave => Startup : Election error
2015-04-28T20:52:12.723Z [7CF55B70 info 'Cluster' opID=SWI-6058ed8] Change state to Startup:0
2015-04-28T20:52:12.723Z [7CF55B70 verbose 'HalCnx' opID=SWI-6058ed8] [HalCnx] Configuring firewall to close tcp(8182) and udp(8182)
2015-04-28T20:52:12.723Z [7CF14B70 verbose 'Cluster' opID=SWI-56f32f43] [ClusterManagerImpl::CheckElectionState] Transitioned from Slave to Startup
2015-04-28T20:52:12.723Z [7CF14B70 info 'Invt' opID=SWI-56f32f43] [InventoryManagerImpl::NotifyDatastoreUnlockedLocally] Invoked for datastore (/vmfs/volumes/50979f56-0fe91490-145e-14feb5dbcb15).
2015-04-28T20:52:12.723Z [7CF14B70 info 'Invt' opID=SWI-56f32f43] [InventoryManagerImpl::NotifyDatastoreUnlockedLocally] Invoked for datastore (/vmfs/volumes/50991870-1dd1308c-903b-14feb5dc1153).
2015-04-28T20:52:12.723Z [7CF14B70 info 'Invt' opID=SWI-56f32f43] [InventoryManagerImpl::NotifyDatastoreUnlockedLocally] Invoked for datastore (/vmfs/volumes/51518f14-b6c7387c-5619-14feb5dc1153).
2015-04-28T20:52:12.723Z [7D15CB70 info 'Message'] Destroying connection
2015-04-28T20:52:12.723Z [7CF14B70 info 'Invt' opID=SWI-56f32f43] [InventoryManagerImpl::NotifyDatastoreUnlockedLocally] Invoked for datastore (/vmfs/volumes/501a5188-e10753e0-e805-14feb5dc1153).
2015-04-28T20:52:12.723Z [7CF14B70 info 'Invt' opID=SWI-56f32f43] [InventoryManagerImpl::NotifyDatastoreUnlockedLocally] Invoked for datastore (/vmfs/volumes/51518b33-73864b6a-93c3-14feb5dc1153).
2015-04-28T20:52:12.723Z [7CF14B70 info 'Invt' opID=SWI-56f32f43] [InventoryManagerImpl::NotifyDatastoreUnlockedLocally] Invoked for datastore (/vmfs/volumes/53c68fb9-b514e950-9bbd-c81f66f4bf36).
2015-04-28T20:52:12.724Z [7CF14B70 info 'Invt' opID=SWI-56f32f43] [InventoryManagerImpl::NotifyDatastoreUnlockedLocally] Invoked for datastore (/vmfs/volumes/4e8ad8d2-21b8fd24-ad4b-14feb5dc1153).
2015-04-28T20:52:12.724Z [7CF14B70 info 'Invt' opID=SWI-56f32f43] [InventoryManagerImpl::NotifyDatastoreUnlockedLocally] Invoked for datastore (/vmfs/volumes/4e8ad73b-dcf6f68c-f93a-14feb5dbcb15).
2015-04-28T20:52:12.724Z [7CF14B70 info 'Invt' opID=SWI-56f32f43] [InventoryManagerImpl::NotifyDatastoreUnlockedLocally] Invoked for datastore (/vmfs/volumes/53c6ea60-1c993a40-bd12-14feb5dc1153).
2015-04-28T20:52:12.724Z [7CF14B70 info 'Invt' opID=SWI-56f32f43] [InventoryManagerImpl::NotifyDatastoreUnlockedLocally] Invoked for datastore (/vmfs/volumes/5501b069-94d2dd2a-5b6d-14feb5dc1153).
2015-04-28T20:52:12.724Z [7CF14B70 info 'Invt' opID=SWI-56f32f43] [InventoryManagerImpl::NotifyDatastoreUnlockedLocally] Invoked for datastore (/vmfs/volumes/5098ddc8-bcaeb5c4-d7b4-14feb5dc1153).
2015-04-28T20:52:12.724Z [7CF14B70 info 'Invt' opID=SWI-56f32f43] [InventoryManagerImpl::NotifyDatastoreUnlockedLocally] Invoked for datastore (/vmfs/volumes/5501ac0f-6dce8512-7e1c-14feb5dc1153).
2015-04-28T20:52:12.724Z [7CF14B70 info 'Invt' opID=SWI-56f32f43] [InventoryManagerImpl::NotifyDatastoreUnlockedLocally] Invoked for datastore (/vmfs/volumes/524483ae-7eafed78-5fe8-14feb5dc1153).
2015-04-28T20:52:12.724Z [7CF14B70 info 'Invt' opID=SWI-56f32f43] [InventoryManagerImpl::NotifyDatastoreUnlockedLocally] Invoked for datastore (/vmfs/volumes/4fabcb0d-563cad06-dff3-14feb5dc1153).
2015-04-28T20:52:12.724Z [7CF14B70 info 'Invt' opID=SWI-56f32f43] [InventoryManagerImpl::NotifyDatastoreUnlockedLocally] Invoked for datastore (/vmfs/volumes/5097a735-6348db78-0c4d-14feb5dc1153).
2015-04-28T20:52:12.724Z [7CF14B70 info 'Invt' opID=SWI-56f32f43] [InventoryManagerImpl::NotifyDatastoreUnlockedLocally] Invoked for datastore (/vmfs/volumes/53a492fa-c5a80408-cfd5-14feb5dc1153).
2015-04-28T20:52:12.724Z [7CF14B70 info 'Cluster' opID=SWI-56f32f43] [ClusterManagerImpl::MainLoop] curState 1 lastState 4
2015-04-28T20:52:12.724Z [7D099B70 info 'Invt' opID=SWI-4b28ee67] [InventoryManagerImpl::ProcessClusterChange] Cluster state changed to Startup
2015-04-28T20:52:12.724Z [7D099B70 verbose 'PropertyProvider' opID=SWI-4b28ee67] RecordOp ASSIGN: clusterState, fdmService
2015-04-28T20:52:12.724Z [7D099B70 verbose 'FDM' opID=SWI-4b28ee67] [FdmService::Handle::ClusterStateNotification] Cluster state changed: Slave -> Startup
2015-04-28T20:52:12.724Z [7D099B70 verbose 'Placement' opID=SWI-4b28ee67] [PlacementManagerImpl::Handle<ClusterStateNotification>] New cluster state is 1
2015-04-28T20:52:12.724Z [7D099B70 verbose 'Execution' opID=SWI-4b28ee67] [ExecutionManagerImpl::ClusterStateListener::Handle] New cluster state is 1
2015-04-28T20:52:12.724Z [7D099B70 verbose 'Policy' opID=SWI-4b28ee67] [PolicyManager::Handle(ClusterStateNotification)] Transitioning to startup (1). Disabling global policy and enabling local policy.
2015-04-28T20:52:12.724Z [7D099B70 verbose 'Monitor' opID=SWI-4b28ee67] [IsoAddressMonitor::Handle::ClusterStateNotification] Cluster state changed to 1
2015-04-28T20:52:12.724Z [7D099B70 verbose 'Monitor' opID=SWI-4b28ee67] [PingableAddressMonitor::Handle::ClusterStateNotification] Cluster state changed to 1
2015-04-28T20:52:12.724Z [7D099B70 verbose 'Monitor' opID=SWI-4b28ee67] [HostAccessMonitor::ClusterStateListener] Cluster state changed to 1
2015-04-28T20:52:12.730Z [7CF55B70 verbose 'HalCnx' opID=SWI-6058ed8] [HalCnx] Skip disabling FT firewall ruleset
2015-04-28T20:52:12.730Z [7CF55B70 verbose 'HalCnx' opID=SWI-6058ed8] [HalCnx] Disabling fdm firewall ruleset

Alistar
Expert

Hello,

Have you tried right-clicking each of your hosts and selecting "Reconfigure host for High Availability"? KB link: VMware KB: Performing a Reconfigure for VMware HA operation on a master node causes an unexpected vi...
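If the reconfigure task keeps timing out, it may also be worth checking and restarting the FDM agent directly on a host from the shell (this assumes the HA agent is still installed on the host):

# check and restart the HA (FDM) agent on the host
/etc/init.d/vmware-fdm status
/etc/init.d/vmware-fdm restart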

Stop by my blog if you'd like 🙂 I dabble in vSphere troubleshooting, PowerCLI scripting and NetApp storage - and I share my journeys at http://vmxp.wordpress.com/
kc5ruj
Contributor

Yes, and that also times out.

kc5ruj
Contributor

Wait... I reread the KB you posted. I haven't tried reconfiguring the FDM policy yet.

kc5ruj
Contributor

Thanks for the idea, but configuring HA fails on the individual hosts as well as on the cluster. It doesn't seem to be a network issue, as no latency is observed.

a_p_
Leadership

What may also be worth a try is to remove the hosts from the cluster and then move them back in (one host at a time), and/or to disconnect and reconnect the hosts.
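When a host is removed from an HA-enabled cluster, vCenter should uninstall the FDM agent, and re-adding the host pushes it again; you can verify that from the host's shell with something like:

# the vmware-fdm VIB should disappear after removal and reappear after re-adding
esxcli software vib list | grep -i fdm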

André
