VMware Cloud Community
capitaseanb
Contributor

ESXi - HA Error - cmd addnode failed for primary node - Internal AAM error

Hi all,

Seen a few threads on here related to this error, but no fix has worked for me up to now.
I have a cluster of three ESXi 4.1.0 build 433742 servers. Two of them run HA fine, but the third is now throwing the error below for some reason.
HA agent on esx03 in cluster ESXCluster has an error : cmd addnode failed for primary node: Internal AAM Error - agent could not start.:  Unknown HA error error

The /var/log/vmware/aam/aam_config_util_addnode.log file is full of entries like:

Thu Oct 27 11:24:00 2011: Starting Agent...

10/27/11 11:24:05 [ft_startup_monitor  ] 

10/27/11 11:24:05 [ft_startup_monitor  ]  This agent has been promoted while it was down. It will now be restarted as a primary agent.

10/27/11 11:24:05 [ft_startup_monitor  ]

10/27/11 11:24:05 [ft_startup_monitor  ] ft_startup_ret has evaluated to 3

10/27/11 11:24:05 [elapsed_time        ] ft_startup_monitor: elapsed time  0 minute(s) and 3 second(s)

10/27/11 11:24:05 [remove_from_dead_hos] hosts without running agents:

10/27/11 11:24:05 [active_primary_ftcli] active primary is 'stpaul-bsfesx01'

10/27/11 11:24:05 [active_primary_ftcli] command is 'listnodes'

10/27/11 11:24:05 [issue_cli_cmd       ] command is '/opt/vmware/aam/bin/ftcli -domain vmware -connect stpaul-bsfesx01 -port 8042 -timeout 15 -cmd "listnodes"'

10/27/11 11:24:06 [issue_cmd           ] CMD:    /opt/vmware/aam/bin/ftcli -domain vmware -connect stpaul-bsfesx01 -port 8042 -timeout 15 -cmd "listnodes"

10/27/11 11:24:06 [issue_cmd           ] STATUS: 0

10/27/11 11:24:06 [issue_cmd           ] RESULT:

10/27/11 11:24:06 [issue_cmd           ] *** Node stpaul-bsfesx01 is the master primary ***

10/27/11 11:24:06 [issue_cmd           ]         Node              Type              State

10/27/11 11:24:06 [issue_cmd           ] -----------------------  ------------    --------------

10/27/11 11:24:06 [issue_cmd           ]   stpaul-bsfesx01        Primary      Agent Running

10/27/11 11:24:06 [issue_cmd           ]   stpaul-bsfesx03        Primary      Agent Failed

10/27/11 11:24:06 [issue_cmd           ]   stpaul-bsfexs02        Primary      Agent Running

10/27/11 11:24:06 [issue_cmd           ]

10/27/11 11:24:06 [active_primary_ftcli] command ran successfully on 'stpaul-bsfesx01'.

10/27/11 11:24:06 [wait_agent_startup  ] waiting for agent 'stpaul-bsfesx03' to come alive, status is : 'failed'

10/27/11 11:24:16 [active_primary_ftcli] active primary is 'stpaul-bsfesx01'

10/27/11 11:24:16 [active_primary_ftcli] command is 'listnodes'

and, further down:

10/27/11 11:28:07 [wait_agent_startup  ] Waiting for heartbeat_config and ConfigurationStatus=complete

10/27/11 11:28:11 [issue_cmd           ] CMD:    /opt/vmware/aam/bin/Cli -cmd "getnode stpaul-bsfesx03"

10/27/11 11:28:11 [issue_cmd           ] STATUS: 0

10/27/11 11:28:11 [issue_cmd           ] RESULT:

10/27/11 11:28:11 [issue_cmd           ]

10/27/11 11:28:11 [issue_cmd           ]

10/27/11 11:28:11 [issue_cmd           ]   Description       :

10/27/11 11:28:11 [issue_cmd           ]   System Name       :

10/27/11 11:28:11 [issue_cmd           ]   Operating System  : Unknown

10/27/11 11:28:11 [issue_cmd           ]   Kernel Arch       :

10/27/11 11:28:11 [issue_cmd           ]   Main Memory (MB)  : 0

10/27/11 11:28:11 [issue_cmd           ]   Swap space  (MB)  : 0

10/27/11 11:28:11 [issue_cmd           ]   Supported DS      :

10/27/11 11:28:11 [issue_cmd           ]   Node Attributes   :

10/27/11 11:28:11 [issue_cmd           ]   LAAM Version      : 5.1.2

10/27/11 11:28:11 [issue_cmd           ]   Installed Patches : 0

10/27/11 11:28:11 [issue_cmd           ]   LAAM Version Info : Version 5.1.2

10/27/11 11:28:11 [issue_cmd           ]   Build Date        :

10/27/11 11:28:11 [issue_cmd           ]   State             : Agent Failed

and ends with:

Backing up the AAM configuration to persistent storage

10/27/11 11:30:32 [issue_cmd           ]

10/27/11 11:30:32 [stop_aam            ] copying /var/lib/vmware/aam/vmware-sites to /var/log/vmware/aam/aam_config_util_addnode.log

FULLTIME_SITES_TID 00000023

+ 1:8042,8042,8043 stpaul-bsfesx01    vmware #FT_Agent_Port=8045

+ 2:8042,8042,8043 stpaul-bsfesx03 vmware

+ 3:8042,8042,8043 stpaul-bsfexs02 vmware

10/27/11 11:30:32 [vpxa_respond        ] VMwareerrortext=Internal AAM Error - agent could not start.

10/27/11 11:30:32 [vpxa_respond        ] VMwareerrorcat=internalerror

10/27/11 11:30:32 [myexit              ] copying /var/lib/vmware/aam/vmware-sites to /var/log/vmware/aam/aam_config_util_addnode.log

FULLTIME_SITES_TID 00000023

+ 1:8042,8042,8043 stpaul-bsfesx01    vmware #FT_Agent_Port=8045

+ 2:8042,8042,8043 stpaul-bsfesx03 vmware

+ 3:8042,8042,8043 stpaul-bsfexs02 vmware

10/27/11 11:30:32 [myexit              ] Failure location:

10/27/11 11:30:32 [myexit              ]      function main::myexit called from line 2306

10/27/11 11:30:32 [myexit              ]      function main::start_agent called from line 1238

10/27/11 11:30:32 [myexit              ]      function main::add_aam_node called from line 210

10/27/11 11:30:32 [myexit              ] VMwareresult=failure

10/27/11 11:30:32 [elapsed_time        ] Total time for script to complete:  6 minute(s) and 33 second(s)

I've confirmed DNS resolves from esx03 to esx02 and esx01, and they all have hosts file entries in place.
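For reference, this is roughly what can be checked from the Tech Support Mode shell on esx03 (a sketch only; the hostnames are taken from the log above, and nslookup may or may not be present in the busybox shell):

# static host entries ESXi can fall back on
cat /etc/hosts

# forward resolution of the other two hosts
nslookup stpaul-bsfesx01
nslookup stpaul-bsfexs02

# reachability over the management (vmkernel) network
vmkping stpaul-bsfesx01
vmkping stpaul-bsfexs02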

I've tried uninstalling the HA agent manually through SSH, well, the remote tech support console. Can I get a full console on ESXi?

I've also tried disabling HA on the cluster, removing the server, re-adding it, manually uninstalling the AAM agent, etc.
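As a cross-check (just a sketch, reusing the command and paths that appear in the addnode log above), the failing host's view can be compared against the active primary's:

# ask the active primary which nodes it knows about and their agent state
/opt/vmware/aam/bin/ftcli -domain vmware -connect stpaul-bsfesx01 -port 8042 -timeout 15 -cmd "listnodes"

# the node/port list the agent was configured with
cat /var/lib/vmware/aam/vmware-sites

# tail of the most recent addnode attempt
tail -n 50 /var/log/vmware/aam/aam_config_util_addnode.log

In my case listnodes keeps coming back with stpaul-bsfesx03 as 'Agent Failed', exactly as in the log above.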


Any thoughts would be appreciated.


Thanks

9 Replies
athlon_crazy
Virtuoso

You've done everything, but have you tried creating a new cluster and adding the third node to it? Once that's okay, you can try adding the rest of the nodes to this new cluster. Warning: you could lose performance data by doing this.

http://www.no-x.org
capitaseanb
Contributor

That worked. I created a test cluster and added esx03 to it, and HA enabled fine.

Is it going to be possible to move my other two servers from their current cluster into the new one without downtime for their guest VMs?

athlon_crazy
Virtuoso

Yes, you can, and it shouldn't cause any downtime to your VMs unless you drag and drop your hosts into the new cluster, which requires maintenance mode. Otherwise, please use vMotion.

http://www.no-x.org
capitaseanb
Contributor

How do I ensure I use vMotion? Is that the Migrate option?


Thanks

athlon_crazy
Virtuoso

If you are required to enter maintenance mode for a host, first migrate all of its VMs to another host using vMotion.

http://www.no-x.org
bparlier
VMware Employee

Yes, you can right-click the guest and use the Migrate option. (There are several ways to do it, but migrating the guest is what you want to do.)

capitaseanb
Contributor

So: I created a test cluster and added esx03 into it, and HA enabled fine. I then migrated my other two servers; esx02 moved across OK and HA enabled, but esx01 brought up the same error as in the OP.

So it's not my server config, as esx01 was in HA mode before. Could it be a licensing error? It seems I can't have more than two servers in this particular cluster. I'm confused now.

pccbryan
Contributor

I'm having the exact same issue. I just added a new host, bringing me to 3 total. Once all 3 are up, I get the above error on one of the older servers. Any idea what is going on?

capitaseanb
Contributor

Out of interest, what version of vCenter are you using, and what version of ESXi is on the servers? I haven't managed to resolve this yet, by the way.
