VMware Cloud Community
lgftsa
Contributor

HA only works on 1 host in my cluster

I have a 2-node ESXi (3.5 U2) cluster running on Sun Fire X4200 hosts. The VC (2.5 U1) and VI client are running on a separate X4200, and all are connected with a 3Com gigabit switch and a StorageTek FC disk array. The hosts have correct DNS (forward and reverse) and are pingable.
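
For readers who want to repeat the forward and reverse DNS check from the VC machine, here is a minimal Python sketch. It is an illustration only, not anything from VMware: the function name `check_dns` and the injectable `forward`/`reverse` parameters are made up for this example, and the upper-case check reflects the mixed-case hostname concern that support raises later in this thread.

```python
import socket


def check_dns(hostname, forward=socket.gethostbyname, reverse=socket.gethostbyaddr):
    """Return a list of problems found for one hostname (empty list = none found).

    `forward` and `reverse` default to the standard resolver calls; they are
    parameters only so the check can be exercised without a live DNS server.
    """
    problems = []

    # Mixed-case hostnames are one of the things support asks about.
    if hostname != hostname.lower():
        problems.append("hostname contains upper-case characters")

    try:
        ip = forward(hostname)
    except OSError as exc:  # socket.gaierror is a subclass of OSError
        problems.append(f"forward lookup failed: {exc}")
        return problems

    try:
        name, aliases, _addresses = reverse(ip)
    except OSError as exc:  # socket.herror is a subclass of OSError
        problems.append(f"reverse lookup of {ip} failed: {exc}")
        return problems

    # Accept either the fully qualified name or the short name coming back.
    known = {n.lower() for n in [name, *aliases]}
    short = {n.split(".")[0] for n in known}
    wanted = hostname.lower()
    if wanted not in known and wanted not in short:
        problems.append(f"reverse lookup of {ip} returned {name}, not {hostname}")

    return problems
```

Calling `check_dns("esx1")` and `check_dns("esx2")` from the VC machine should return an empty list for each host if forward and reverse records agree. A clean result only rules out DNS; it says nothing about whether HA itself will configure.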

The symptom is that HA only enables on one of the two hosts - the first one added. I have tried adding hosts to the cluster without HA then enabling HA, enabling HA and then adding each host individually, disabling HA and then re-enabling it, deleting the uwswap files, deleting the cluster and datacenter and re-creating them, uninstalling VC and database and re-installing from scratch, and every other combination I can think of. I've re-installed ESXi on the hosts a dozen times over the last few days.

The error reported by VC is:

/opt/bin/vmware/aam/bin/ft_startup failed

I have attached /var/log/vmware/aam/aam_config_util_addnode.log from each node. In this case, I added esx1 first, then esx2, but I have tried the opposite order multiple times with the same results: the first one added works, the second one fails.

The logfile for the failing node (esx2) shows some errors, but I don't know how to debug further.

Please let me know if I should include any other logs.

Thanks, glen.

7 Replies
jayolsen
Expert

You might consider going to VC 2.5 Update 3. It includes some HA fixes, at least compared with Update 2, so it might also help if you are coming from Update 1.

lgftsa
Contributor

I've been speaking to VMware support, and found that the location of FT_HOSTS has been moved from /etc/ to /etc/opt/vmware/aam/.

When HA is configured, that directory is created but the FT_HOSTS file is not. When HA is unconfigured, the directory (and its contents) is deleted. When HA is reconfigured, the directory is deleted and then re-created, again without the file.

Manually creating the file is possible, but it's useless as enabling HA immediately deletes it.

Catch 22.

glen.

lgftsa
Contributor

Well, VMware hasn't managed to find the problem yet. I've upgraded to the latest VC and ESXi, and have exactly the same symptoms, though the error message has changed slightly. It's now:

cmd addnode failed for primary node: /opt/vmware/aam/bin/ft_startup failed to complete within 3 minutes

A little more digging on my part has found that pkill is missing from ESXi, though it's used in the HA scripts:

/var/log/vmware/aam # cat pn-esx2_agent.err

/bin/sh: pkill: not found

The script which I believe needs this command is aam_config_util.pl, and it is trying to kill any VMap processes.
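
To check the other agent logs for the same kind of failure, a small Python sketch like the one below can list every command the shell reported as missing. It assumes the error lines have exactly the `/bin/sh: pkill: not found` shape shown above; the function name `missing_commands` is made up for this illustration, and the script is meant to be run against a copy of the log on a machine that has Python.

```python
import re

# Matches lines such as "/bin/sh: pkill: not found".
# Assumption: the shell's error format is exactly the one quoted in this post.
NOT_FOUND = re.compile(r"^(?P<shell>\S+): (?P<command>[^:\s]+): not found\s*$")


def missing_commands(log_text):
    """Return the commands reported as 'not found', in first-seen order."""
    found = []
    for line in log_text.splitlines():
        match = NOT_FOUND.match(line.strip())
        if match and match.group("command") not in found:
            found.append(match.group("command"))
    return found
```

Feeding it the contents of pn-esx2_agent.err would show whether pkill is the only missing command or just the first of several.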

Can anyone help me? VMware support doesn't seem to care; they're bumbling around with hosts files and telling me to check for the tenth time that my DNS is working and that I don't have mixed-case hostnames. I get the feeling that they're hoping I'll decide to cut my losses and walk away.

cnhianda
Contributor

The problem isn't fixed in U3 either (HP C class blades - 495c)

Running U3 on two ESXi hosts in a cluster

same issue as lgftsa

lgftsa
Contributor

I've since found out from our local Sun reseller that the HA component of ESXi does not work on X4100 or X4200 server hardware. I installed ESX 3.5 and it worked immediately.

VMware support is the last place I'll be calling for support in the future.

imsochobo
Contributor

I've got the same issue. I haven't found anything yet, but I'll let you know if I find something.

My issue came after I fixed DNS issues. I had one host up, fixed DNS, and got the other one up; then the first host got this issue, and I killed the install while trying to fix it.

I did a new install and configured it, and got the same issue. This was my failure-free ESXi host, and now it's the one with issues.

vancod
Contributor

Never mind - it would appear there are actually updates that I have not applied <blush>
