VMware Cloud Community
colin_jamieson
Contributor

Problem enabling HA

While I wait for my support contract to be renewed, I wonder if anyone has any idea what is going on here...

I have four ESX 3i Servers (ESXi 3.5 Installable Update 4) and a vCenter Server (2.5 Update 5). The ESXi Servers use an NFS datastore for virtual machine storage and Scratch (ScratchConfig.ConfiguredScratchLocation and ScratchConfig.ConfiguredSwapState are configured in Advanced Settings).

DRS is working OK, but HA is not. When I enable it, one of the four ESXi hosts is correctly reconfigured for HA (it can be any one of the four, and often a different one each time), but the other three always report an error. Details of the related event on the server are:

Name: Reconfigure HA host
Target: esx01.mydomain.com
Initiated by: (my user account)
Status: An error occurred during configuration of the HA Agent on the host.

Related Events:

29/09/2009 16:57:41, HA agent on esx01.mydomain.com in Cluster Production in DataCentre1 has an error: cmd addnode failed for primary node: /opt/vmware/aam/bin/ft_startup failed to complete within 3 minutes.

29/09/2009 16:53:58, Enabling HA agent on esx01.mydomain.com in cluster Production in DataCentre1.

I have tried every suggestion I have found in the forums and knowledgebase so far (including disabling and re-enabling HA, renaming the cluster, verifying DNS is resolving the names correctly) and I have also tried the hosts in a new vCenter Server with no success.

4 Replies
jb12345
Enthusiast

Have you checked the time? Make sure the hosts are all pointing at a common time server.

admin
Immortal

Nine times out of ten, the answer is covered in KB 1003691, "Diagnosing a VMware High Availability cluster configuration failure".

Rick Blythe

Social Media Specialist

VMware Inc.

colin_jamieson
Contributor

Thanks for the suggestions. Since these hosts run ESX 3i, there is no service console to connect to, so not all of the steps in the self-help document can be followed. What I have been able to verify is:

All ESXi Servers use a common time server and the time is the same on all of them.

There are sufficient licenses available for HA.

Name resolution is working correctly on the vCenter Server.

The required network ports are open.

The vCenter Server service has been restarted (several times).

There is only one service console HA can be trying to configure on.

The HA cluster is not corrupt - it has been recreated several times and on another vCenter Server too.
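
For anyone repeating the name resolution check above, here is a rough Python sketch of how forward and reverse lookups could be compared from the vCenter Server or any other machine. It only uses the standard socket module. The function name check_name_resolution is made up for illustration, and every host name other than esx01.mydomain.com is a placeholder. It is not known whether name resolution is actually the cause of the ft_startup timeout; this only automates one item on the checklist.

```python
import socket


def check_name_resolution(hostnames,
                          forward=socket.gethostbyname,
                          reverse=socket.gethostbyaddr):
    """Compare forward and reverse lookups for each host name.

    Returns {hostname: (ip, reverse_name, ok)}, where ok is True only if
    the reverse lookup of the resolved address gives back the same name
    (case-insensitive, ignoring a trailing dot). The lookup functions are
    parameters so they can be swapped out when testing offline.
    """
    results = {}
    for name in hostnames:
        try:
            ip = forward(name)
            reverse_name = reverse(ip)[0]
        except OSError:
            # socket.gaierror and socket.herror are both OSError subclasses.
            results[name] = (None, None, False)
            continue
        ok = reverse_name.lower().rstrip(".") == name.lower().rstrip(".")
        results[name] = (ip, reverse_name, ok)
    return results


if __name__ == "__main__":
    # Placeholder list: replace with the real FQDNs of the hosts in the cluster.
    hosts = ["esx01.mydomain.com"]
    for host, (ip, rname, ok) in check_name_resolution(hosts).items():
        print(host, ip, rname, "OK" if ok else "MISMATCH OR LOOKUP FAILED")
```

Note that this checks resolution from the machine it runs on only; the ESXi hosts use their own DNS settings, so a clean result here does not prove the hosts can resolve each other.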

I do hope to have my support contract renewed again soon but if there are any other ideas I'd be grateful to hear them.

Since KB 1003691 is based on ESX Server, I am about to look into this thread I found today:

http://communities.vmware.com/message/1332054

jb12345
Enthusiast

Have you tried adding HA as root?
