Solved: HA agent disabled on Host...

WinkIT · ‎08-26-2008

One of my ESX hosts keeps showing "HA agent disabled on HOST in CLUSTER in DC".

So far, I've restarted the host, reconfigured HA, and removed/added the host to the cluster, but I can't get the message to go away.

The host in question is named br-vm02. Attached are the most recent contents of the vmware_br-vm02.log file under /var/log/vmware/aam:

=========================================================================

Primary Agent version 5.1 running on Linux 2.4

Restarted at Tue Aug 26 08:14:41(CDT) 2008

Events posted before this process started may not be found in this log file.

Check other agent log files.

===================================

Info NODE Tue Aug 26 08:14:41 2008

By: FT/Agent on Node: br-vm03

MESSAGE: Agent on br-vm02 has started.

===================================

Info FT Tue Aug 26 08:14:43 2008

By: ftProcMon on Node: br-vm03

MESSAGE: Node br-vm03 has started receiving heartbeats from node br-vm02.

===================================

Info FT Tue Aug 26 08:14:43 2008

By: ftProcMon on Node: br-vm01

MESSAGE: Node br-vm01 has started receiving heartbeats from node br-vm02.

===================================

Info NODE Tue Aug 26 08:14:46 2008

By: FT/Agent on Node: br-vm03

MESSAGE: Node br-vm02 is running.

===================================

Info FT Tue Aug 26 08:14:49 2008

By: ftProcMon on Node: br-vm03

MESSAGE: Node br-vm03 has started receiving heartbeats from node br-vm02.

===================================

Info FT Tue Aug 26 08:14:49 2008

By: ftProcMon on Node: br-vm01

MESSAGE: Node br-vm01 has started receiving heartbeats from node br-vm02.

===================================

Info FT Tue Aug 26 08:14:51 2008

By: ftProcMon on Node: br-vm03

MESSAGE: Node br-vm03 has started receiving heartbeats from node br-vm02.

===================================

Info FT Tue Aug 26 08:14:51 2008

By: ftProcMon on Node: br-vm01

MESSAGE: Node br-vm01 has started receiving heartbeats from node br-vm02.

===================================

Info FT Tue Aug 26 08:14:51 2008

By: ftProcMon on Node: br-vm02

MESSAGE: Node br-vm02 has started receiving heartbeats from node br-vm01.

===================================

Info FT Tue Aug 26 08:14:51 2008

By: ftProcMon on Node: br-vm03

MESSAGE: Node br-vm03 has started receiving heartbeats from node br-vm01.

===================================

Info FT Tue Aug 26 08:14:52 2008

By: ftProcMon on Node: br-vm02

MESSAGE: Node br-vm02 has started receiving heartbeats from node br-vm03.

===================================

Info FT Tue Aug 26 08:14:52 2008

By: ftProcMon on Node: br-vm01

MESSAGE: Node br-vm01 has started receiving heartbeats from node br-vm03.

===================================

Info PROC Tue Aug 26 08:14:56 2008

By: FT/Agent on Node: br-vm03

MESSAGE: Started process VMap_br-vm02 on br-vm02
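
One quick way to read a log like this is to reduce the heartbeat MESSAGE lines to sender/receiver pairs and confirm that every pair of nodes shows up. This is only a minimal sketch, assuming the message wording shown above; `heartbeat_pairs` is a hypothetical helper, not part of the HA agent.

```shell
# heartbeat_pairs LOGFILE: print each distinct "sender -> receiver" pair
# found in the "has started receiving heartbeats" messages.
# Hypothetical helper; it assumes the MESSAGE wording seen in this thread.
heartbeat_pairs() {
    sed -n 's/^MESSAGE: Node \([^ ]*\) has started receiving heartbeats from node \([^ ]*\)\.[[:space:]]*$/\2 -> \1/p' "$1" | sort -u
}

# Example (the path is an assumption based on the post above):
#   heartbeat_pairs /var/log/vmware/aam/vmware_br-vm02.log
```

With three hosts, six distinct pairs are expected when every node hears every other node; the excerpt above contains all six.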

mike_laspina · ‎08-26-2008

Hello,

In VC, select the ESX host, go to the Tasks & Events tab, and select View: Events.

There are some descriptive items for HA failures there which may reveal the cause.

Generally, two items are important: DNS and the VMkernel network(s).

http://blog.laspina.ca/ vExpert 2009

depping · ‎08-26-2008

Indeed,

and check the hosts files, and check the ft_hosts file... it happens a lot that these are out of sync.

Duncan

WinkIT · ‎08-26-2008

The Events tab doesn't show anything except "HA agent is configured correctly".

The hosts file contains the following:

127.0.0.1 localhost.localdomain localhost

10.24.20.62 br-vm02.domain.com br-vm02

ft_hosts is as follows:

10.24.20.61 br-vm01

10.24.20.62 br-vm02

10.24.20.63 br-vm03

And finally, hosts and ft_hosts from a "working" server:

hosts br-vm01:

127.0.0.1 localhost.localdomain localhost

10.24.20.61 br-vm01.domain.com br-vm01

ft_hosts br-vm01:

10.24.20.61 br-vm01

10.24.20.62 br-vm02

10.24.20.63 br-vm03
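
Comparing these files by eye works for three hosts, but the same check can be scripted. The following is a rough sketch under the assumption that each host's file has already been copied to a local path; `normalize_entries` and `same_entries` are hypothetical helper names, and the file paths are supplied by the caller rather than taken from any VMware default.

```shell
# normalize_entries FILE: drop blank lines, squeeze whitespace, sort.
# Hypothetical helper, not a VMware tool.
normalize_entries() {
    awk 'NF { $1 = $1; print }' "$1" | sort
}

# same_entries FILE_A FILE_B: succeed when both files list the same
# "IP name" lines, ignoring blank lines, ordering, and extra spaces.
# The comparison is case-sensitive, so BR-VM02 and br-vm02 differ.
same_entries() {
    a=$(normalize_entries "$1")
    b=$(normalize_entries "$2")
    [ "$a" = "$b" ]
}

# Example, with copies of ft_hosts saved under assumed names:
#   same_entries ft_hosts.br-vm01 ft_hosts.br-vm02 || echo "out of sync"
```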

mike_laspina · ‎08-26-2008

Looks like it's working. The VC may not be refreshing the host state.

Are you able to restart your VC?

WinkIT · ‎08-26-2008

Restarted VC server service and reconfigured HA, and it's still showing as disabled.

mike_laspina · ‎08-26-2008

Use the vmkping command at the console and make sure it can reach all VMkernel networks.

The other day I had a cluster with this behavior. I disabled HA at the cluster level, waited for it to finish removing the HA agents, and then enabled it again.

The issue started with a downed VMkernel network, but the host stayed stuck in that state until I toggled the cluster HA setting.

Also verify that each host can ping the others by both short name and FQDN.

The /etc/hosts file normally takes precedence over DNS, so it is worth checking those files too, since a stale entry could change the IP that one host resolves.
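
The short-name and long-name ping checks can be run in one pass from each host. This is a sketch only: `names_to_check` and `check_names` are hypothetical helpers, and the host names and `domain.com` suffix are the ones quoted in this thread. It uses the ordinary service console `ping`, so it does not replace the vmkping test of the VMkernel networks.

```shell
# names_to_check DOMAIN HOST...: print the short and long name of each host.
# Hypothetical helper for illustration.
names_to_check() {
    domain="$1"
    shift
    for host in "$@"; do
        printf '%s\n%s.%s\n' "$host" "$host" "$domain"
    done
}

# check_names DOMAIN HOST...: ping every name once and report failures.
check_names() {
    names_to_check "$@" | while read -r name; do
        ping -c 1 "$name" >/dev/null 2>&1 || echo "FAILED: $name"
    done
}

# Example, run on each ESX host in turn:
#   check_names domain.com br-vm01 br-vm02 br-vm03
```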

WinkIT · ‎08-26-2008

Both vmkping and normal ping can reach every host by FQDN, short name, and IP. Verified from another ESX host as well.

I'll try yanking everything at the cluster level.

WinkIT · ‎08-26-2008

Well this is interesting. I yanked HA at the cluster level, and the host that I was having issues with is sitting at 37% for an Unconfiguring HA task. Looks like the agent is hanging on something?

Edit: It finally finished unconfiguring. I re-added HA to the cluster and it's doing the same thing (HA agent disabled).

mike_laspina · ‎08-26-2008

I think it's waiting for another host to respond. The timeout is fairly long.

I would check on

cat /var/log/vmware/aam/agent/run.log

to see what it's doing

WinkIT · ‎08-26-2008

I have no idea how to read this file, so I'll leave it up to people smarter than I.

mike_laspina · ‎08-26-2008

Every ftcli agent command was 100% successful, no errors. Darn thing should be happy.

Do you have more than one gateway?

WinkIT · ‎08-26-2008

Nope, just one. It's a pretty simple network setup as far as VMware is concerned. The 3 BL480c blades come out of a virtual connect and are trunked into a 6513. Nothing fancy.

mike_laspina · ‎08-26-2008

There's got to be some sort of network issue.

You could post the file generated by

esxcfg-info -n > /tmp/esxcfg-net.doc

This will contain your ESX networking config, so post it at your own security discretion.

Also check the rpm version between a good host and the problem host with

rpm -qa | grep aam
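
To compare the two hosts, the output of that command can be saved to a file on each host and the two files checked against each other. A minimal sketch, assuming the lists were saved under names of your choosing; `only_in_one` is a hypothetical helper, and it relies on each list containing no repeated lines.

```shell
# only_in_one FILE_A FILE_B: print lines that appear in just one of the
# two package lists. Empty output means the lists match.
# Hypothetical helper; assumes no line is repeated within a single file.
only_in_one() {
    sort "$1" "$2" | uniq -u
}

# Example, with assumed file names:
#   rpm -qa | grep aam > /tmp/aam-good.txt     (on the good host)
#   rpm -qa | grep aam > /tmp/aam-problem.txt  (on the problem host)
#   only_in_one /tmp/aam-good.txt /tmp/aam-problem.txt
```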

martinrobert · ‎08-27-2008

I've had this issue before and forgot that HA needs to have the HOST NAME listed in LOWER CASE LETTERS in order for HA to converge properly.

Two of my 7 ESX hosts were listed with uppercase letters in VC 2.5. I changed them to lowercase from within VC 2.5 under Configuration / DNS and Routing, then placed each ESX server in maintenance mode, evacuated all guests, and rebooted it. After the reboot, HA remained stable.
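
The lowercase requirement described here is easy to check from the service console before touching the cluster. This is only a sketch; `is_lowercase_name` is a hypothetical helper, not a VMware tool, and it checks nothing beyond letter case.

```shell
# is_lowercase_name NAME: succeed when NAME contains no uppercase letters.
# Hypothetical helper for illustration.
is_lowercase_name() {
    name="$1"
    lowered=$(printf '%s' "$name" | tr '[:upper:]' '[:lower:]')
    [ "$name" = "$lowered" ]
}

# Example, run on each ESX host:
#   is_lowercase_name "$(hostname)" || echo "host name has uppercase letters"
```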

turkina · ‎09-02-2008

Anyone else having these issues or know of a fix?

We just decided to start the upgrade process today for our VC 2.5 u1 box and pair of ESX 3.5 u1 hosts.

Started by running the update on VC to update 2. Everything completed without issues.

Once we updated our clients and logged in to VC to begin updating the ESX hosts, we noticed the HA agent was not running. Rebooting VC, reconfiguring HA on the hosts, and disabling and re-enabling HA at the cluster level has not worked. Everything was fine prior to updating to VC 2.5 u2.

mwerne01 · ‎09-10-2008

I can tell you this. I am in the midst of an ESX 3.5 Update 2 brand new, fresh install, no upgrade, and I am seeing the same sort of problems.

-_HunterGathere · ‎09-18-2008

I had the same issue. Thanks to this thread, I got it to work. The person that stood up our last VM host put the host name in all caps. As soon as I changed it to lowercase and rebooted the host, all is well with HA.
