One of my ESX hosts keeps showing "HA agent disabled on HOST in CLUSTER in DC".
So far, I've restarted the host, reconfigured HA, and removed/added the host to the cluster, but I can't get the message to go away.
The host in question is named br-vm02. Attached are the most recent contents of the vmware_br-vm02.log file under /var/log/vmware/aam:
=========================================================================
Primary Agent version 5.1 running on Linux 2.4
Restarted at Tue Aug 26 08:14:41(CDT) 2008
Events posted before this process started may not be found in this log file.
Check other agent log files.
===================================
Info NODE Tue Aug 26 08:14:41 2008
By: FT/Agent on Node: br-vm03
MESSAGE: Agent on br-vm02 has started.
===================================
Info FT Tue Aug 26 08:14:43 2008
By: ftProcMon on Node: br-vm03
MESSAGE: Node br-vm03 has started receiving heartbeats from node br-vm02.
===================================
Info FT Tue Aug 26 08:14:43 2008
By: ftProcMon on Node: br-vm01
MESSAGE: Node br-vm01 has started receiving heartbeats from node br-vm02.
===================================
Info NODE Tue Aug 26 08:14:46 2008
By: FT/Agent on Node: br-vm03
MESSAGE: Node br-vm02 is running.
===================================
Info FT Tue Aug 26 08:14:49 2008
By: ftProcMon on Node: br-vm03
MESSAGE: Node br-vm03 has started receiving heartbeats from node br-vm02.
===================================
Info FT Tue Aug 26 08:14:49 2008
By: ftProcMon on Node: br-vm01
MESSAGE: Node br-vm01 has started receiving heartbeats from node br-vm02.
===================================
Info FT Tue Aug 26 08:14:51 2008
By: ftProcMon on Node: br-vm03
MESSAGE: Node br-vm03 has started receiving heartbeats from node br-vm02.
===================================
Info FT Tue Aug 26 08:14:51 2008
By: ftProcMon on Node: br-vm01
MESSAGE: Node br-vm01 has started receiving heartbeats from node br-vm02.
===================================
Info FT Tue Aug 26 08:14:51 2008
By: ftProcMon on Node: br-vm02
MESSAGE: Node br-vm02 has started receiving heartbeats from node br-vm01.
===================================
Info FT Tue Aug 26 08:14:51 2008
By: ftProcMon on Node: br-vm03
MESSAGE: Node br-vm03 has started receiving heartbeats from node br-vm01.
===================================
Info FT Tue Aug 26 08:14:52 2008
By: ftProcMon on Node: br-vm02
MESSAGE: Node br-vm02 has started receiving heartbeats from node br-vm03.
===================================
Info FT Tue Aug 26 08:14:52 2008
By: ftProcMon on Node: br-vm01
MESSAGE: Node br-vm01 has started receiving heartbeats from node br-vm03.
===================================
Info PROC Tue Aug 26 08:14:56 2008
By: FT/Agent on Node: br-vm03
Hello,
In VC, select the ESX host, go to the Tasks & Events tab, and select View: Events.
There are some descriptive items for HA failures there which may reveal the cause.
Generally, two items are important: DNS and the VMkernel network(s).
Events tab doesn't show anything except "HA agent is configured correctly"
The hosts file contains the following:
127.0.0.1 localhost.localdomain localhost
10.24.20.62 br-vm02.domain.com br-vm02
ft_hosts is as follows:
10.24.20.61 br-vm01
10.24.20.62 br-vm02
10.24.20.63 br-vm03
And finally, hosts and ft_hosts from a "working" server:
hosts br-vm01:
127.0.0.1 localhost.localdomain localhost
10.24.20.61 br-vm01.domain.com br-vm01
ft_hosts br-vm01:
10.24.20.61 br-vm01
10.24.20.62 br-vm02
10.24.20.63 br-vm03
Looks like it's working. The VC may not be refreshing the host state.
Are you able to restart your VC?
Restarted VC server service and reconfigured HA, and it's still showing as disabled.
Use the vmkping command at the console and make sure it can reach all vmkernel nets.
The other day I had one cluster with this behavior. I disabled HA at the cluster level, waited for it to finish removing the HA agents, and then enabled it again.
The issue started with a downed vmkernel net, but the error stuck around until I toggled the cluster HA setting.
Also verify that pings by both short name and FQDN resolve correctly from each host to the others.
The /etc/hosts file takes precedence over all other DNS resolution methods, so it is worth checking on every host; a stale entry there could make one host resolve a name to a different IP.
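A minimal sketch of the name check described above, to be run from the service console of each host. The host names and domain are the ones from the hosts files posted in this thread; the function names and the PING_CMD variable are made up for illustration, and plain ping is only an assumption (vmkping could be substituted to test the vmkernel side).

```shell
#!/bin/sh
# Sketch: check that every cluster host answers by short name and by FQDN.
# HOSTS and DOMAIN come from the hosts files posted in this thread.
# PING_CMD is an assumption; override it to use a different ping tool.
HOSTS="br-vm01 br-vm02 br-vm03"
DOMAIN="domain.com"
PING_CMD="${PING_CMD:-ping -c 1}"

# Print the short form and the fully qualified form of one host name.
candidate_names() {
    printf '%s\n%s.%s\n' "$1" "$1" "$2"
}

# Ping every form of every host; return non-zero if any name fails.
check_all_names() {
    failed=0
    for host in $HOSTS; do
        for name in $(candidate_names "$host" "$DOMAIN"); do
            if $PING_CMD "$name" >/dev/null 2>&1; then
                echo "ok   $name"
            else
                echo "FAIL $name"
                failed=1
            fi
        done
    done
    return $failed
}
```

Calling check_all_names on each host in turn should show quickly whether any one host resolves or reaches a name differently from the others.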
Both vmkping and normal ping are able to resolve all hosts FQDN both long and short and IP. Verified from another ESX host as well.
I'll try yanking everything at the cluster level.
Well this is interesting. I yanked HA at the cluster level, and the host that I was having issues with is sitting at 37% for an Unconfiguring HA task. Looks like the agent is hanging on something?
Edit: It finally finished unconfiguring. I re-added HA to the cluster and it's doing the same thing (HA agent disabled).
I think it's waiting for another host to respond. The timeout is fairly long.
I would check on
cat /var/log/vmware/aam/agent/run.log
to see what it's doing
Every ftcli agent command was 100% successful, no errors. Darn thing should be happy.
Do you have more than one gateway?
Nope, just one. It's a pretty simple network setup as far as VMware is concerned. The 3 BL480c blades come out of a virtual connect and are trunked into a 6513. Nothing fancy.
There's got to be some sort of network issue.
You could post the file generated by
esxcfg-info -n > /tmp/esxcfg-net.doc
This will contain your ESX networking config, so posting it is at your own security discretion.
Also check the rpm version between a good host and the problem host with
rpm -qa | grep aam
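One way to do that comparison, as a rough sketch: save the output of the rpm command above to a file on each host, copy both files to one place, and list only the lines that differ. The function name and the two file arguments are hypothetical.

```shell
#!/bin/sh
# Print lines that appear in only one of two saved package listings.
# $1 and $2 are hypothetical files holding the output of
# "rpm -qa | grep aam" from a working host and from the problem host.
compare_pkg_lists() {
    good_sorted=$(mktemp) || return 2
    bad_sorted=$(mktemp) || return 2
    sort "$1" > "$good_sorted"
    sort "$2" > "$bad_sorted"
    # comm -3 hides lines common to both files, leaving only the differences:
    # unindented lines exist only in $1, tab-indented lines only in $2.
    comm -3 "$good_sorted" "$bad_sorted"
    rm -f "$good_sorted" "$bad_sorted"
}
```

Empty output means both hosts report the same aam packages.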
I've had this issue before and forgot that HA needs the host name to be listed in lowercase letters in order to converge properly.
I had two out of seven ESX hosts listed with uppercase letters in VC 2.5. I changed them to lowercase from within VC 2.5 under Configuration / DNS and Routing, then placed each ESX server in maintenance mode, evacuated all guests, and rebooted it. After the reboots, HA remained stable.
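A quick way to spot the problem from the service console, as a sketch only: the two function names are made up for illustration, and the check assumes nothing beyond the lowercase requirement described above.

```shell
#!/bin/sh
# Succeed (exit status 0) when the given name contains an uppercase letter.
has_uppercase() {
    lowered=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
    [ "$lowered" != "$1" ]
}

# Print a warning for a name that contains uppercase letters.
check_name_case() {
    if has_uppercase "$1"; then
        echo "WARN: '$1' contains uppercase letters"
    else
        echo "ok: '$1' is all lowercase"
    fi
}
```

Running check_name_case "$(hostname)" on each host would flag any host whose configured name still has capitals in it.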
Anyone else having these issues or know of a fix?
We just decided to start the upgrade process today for our VC 2.5 u1 box and pair of ESX 3.5 u1 hosts.
Started by running the update on VC to update 2. Everything completed without issues.
Once we updated our clients and logged in to VC to begin updating the ESX hosts, we noticed the HA agent was not running. Rebooting VC, reconfiguring HA on the hosts, and disabling and re-enabling HA at the cluster level has not worked. Everything was fine prior to updating to VC 2.5 u2.
I can tell you this: I am in the midst of a brand-new, fresh install of ESX 3.5 Update 2 (not an upgrade), and I am seeing the same sort of problems.
I had the same issue. Thanks to this thread, I got it to work. The person that stood up our last VM host put the host name in all caps. As soon as I changed it to lowercase and rebooted the host, all is well with HA.