ESXi - HA Error - cmd addnode failed for primary node - Internal AAM error

capitaseanb · 10-27-2011

Hi all,

Seen a few threads on here related to this error, but no fix has worked for me up to now.

I have a cluster of 3 ESXi 4.1.0 (build 433742) servers. Two run in HA fine, but the third is now throwing the error below for some reason.

HA agent on esx03 in cluster ESXCluster has an error : cmd addnode failed for primary node: Internal AAM Error - agent could not start.:  Unknown HA error error

The /var/log/vmware/aam/aam_config_util_addnode error log is full of stuff like:

Thu Oct 27 11:24:00 2011: Starting Agent...
10/27/11 11:24:05 [ft_startup_monitor ]
10/27/11 11:24:05 [ft_startup_monitor ] This agent has been promoted while it was down. It will now be restarted as a primary agent.
10/27/11 11:24:05 [ft_startup_monitor ]
10/27/11 11:24:05 [ft_startup_monitor ] ft_startup_ret has evaluated to 3
10/27/11 11:24:05 [elapsed_time ] ft_startup_monitor: elapsed time 0 minute(s) and 3 second(s)
10/27/11 11:24:05 [remove_from_dead_hos] hosts without running agents:
10/27/11 11:24:05 [active_primary_ftcli] active primary is 'stpaul-bsfesx01'
10/27/11 11:24:05 [active_primary_ftcli] command is 'listnodes'
10/27/11 11:24:05 [issue_cli_cmd ] command is '/opt/vmware/aam/bin/ftcli -domain vmware -connect stpaul-bsfesx01 -port 8042 -timeout 15 -cmd "listnodes"'
10/27/11 11:24:06 [issue_cmd ] CMD: /opt/vmware/aam/bin/ftcli -domain vmware -connect stpaul-bsfesx01 -port 8042 -timeout 15 -cmd "listnodes"
10/27/11 11:24:06 [issue_cmd ] STATUS: 0
10/27/11 11:24:06 [issue_cmd ] RESULT:
10/27/11 11:24:06 [issue_cmd ] *** Node stpaul-bsfesx01 is the master primary ***
10/27/11 11:24:06 [issue_cmd ] Node                    Type         State
10/27/11 11:24:06 [issue_cmd ] ----------------------- ------------ --------------
10/27/11 11:24:06 [issue_cmd ] stpaul-bsfesx01         Primary      Agent Running
10/27/11 11:24:06 [issue_cmd ] stpaul-bsfesx03         Primary      Agent Failed
10/27/11 11:24:06 [issue_cmd ] stpaul-bsfexs02         Primary      Agent Running
10/27/11 11:24:06 [issue_cmd ]
10/27/11 11:24:06 [active_primary_ftcli] command ran successfully on 'stpaul-bsfesx01'.
10/27/11 11:24:06 [wait_agent_startup ] waiting for agent 'stpaul-bsfesx03' to come alive, status is : 'failed'
10/27/11 11:24:16 [active_primary_ftcli] active primary is 'stpaul-bsfesx01'
10/27/11 11:24:16 [active_primary_ftcli] command is 'listnodes'

and,

10/27/11 11:28:07 [wait_agent_startup  ] Waiting for heartbeat_config and ConfigurationStatus=complete
10/27/11 11:28:11 [issue_cmd           ] CMD:    /opt/vmware/aam/bin/Cli -cmd "getnode stpaul-bsfesx03"
10/27/11 11:28:11 [issue_cmd           ] STATUS: 0
10/27/11 11:28:11 [issue_cmd           ] RESULT:
10/27/11 11:28:11 [issue_cmd           ] 
10/27/11 11:28:11 [issue_cmd           ] 
10/27/11 11:28:11 [issue_cmd           ]   Description       : 
10/27/11 11:28:11 [issue_cmd           ]   System Name       : 
10/27/11 11:28:11 [issue_cmd           ]   Operating System  : Unknown
10/27/11 11:28:11 [issue_cmd           ]   Kernel Arch       : 
10/27/11 11:28:11 [issue_cmd           ]   Main Memory (MB)  : 0
10/27/11 11:28:11 [issue_cmd           ]   Swap space  (MB)  : 0
10/27/11 11:28:11 [issue_cmd           ]   Supported DS      : 
10/27/11 11:28:11 [issue_cmd           ]   Node Attributes   : 
10/27/11 11:28:11 [issue_cmd           ]   LAAM Version      : 5.1.2
10/27/11 11:28:11 [issue_cmd           ]   Installed Patches : 0
10/27/11 11:28:11 [issue_cmd           ]   LAAM Version Info : Version 5.1.2
10/27/11 11:28:11 [issue_cmd           ]   Build Date        : 
10/27/11 11:28:11 [issue_cmd           ]   State             : Agent Failed

and ends on,

 Backing up the AAM configuration to persistent storage
10/27/11 11:30:32 [issue_cmd           ] 
10/27/11 11:30:32 [stop_aam            ] copying /var/lib/vmware/aam/vmware-sites to /var/log/vmware/aam/aam_config_util_addnode.log
FULLTIME_SITES_TID 00000023
+ 1:8042,8042,8043 stpaul-bsfesx01    vmware #FT_Agent_Port=8045 
+ 2:8042,8042,8043 stpaul-bsfesx03 vmware
+ 3:8042,8042,8043 stpaul-bsfexs02 vmware
10/27/11 11:30:32 [vpxa_respond        ] VMwareerrortext=Internal AAM Error - agent could not start.
10/27/11 11:30:32 [vpxa_respond        ] VMwareerrorcat=internalerror
10/27/11 11:30:32 [myexit              ] copying /var/lib/vmware/aam/vmware-sites to /var/log/vmware/aam/aam_config_util_addnode.log
FULLTIME_SITES_TID 00000023
+ 1:8042,8042,8043 stpaul-bsfesx01    vmware #FT_Agent_Port=8045 
+ 2:8042,8042,8043 stpaul-bsfesx03 vmware
+ 3:8042,8042,8043 stpaul-bsfexs02 vmware
10/27/11 11:30:32 [myexit              ] Failure location:
10/27/11 11:30:32 [myexit              ]      function main::myexit called from line 2306
10/27/11 11:30:32 [myexit              ]      function main::start_agent called from line 1238
10/27/11 11:30:32 [myexit              ]      function main::add_aam_node called from line 210
10/27/11 11:30:32 [myexit              ] VMwareresult=failure
10/27/11 11:30:32 [elapsed_time        ] Total time for script to complete:  6 minute(s) and 33 second(s)
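When the log repeats the `listnodes` table many times, it can help to pull out just the node states. Below is a minimal Python sketch of that, not a VMware tool: the function names `parse_listnodes` and `failed_nodes` are made up for illustration, and the line format is assumed only from the excerpt above (a timestamp, a bracketed function tag, then node name, type, and a state beginning with "Agent").

```python
import re

# Timestamp plus "[function_name ]" prefix, as seen on each line of the excerpt.
PREFIX = re.compile(r"^\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2} \[[^\]]*\]\s*")
# A table row: node name, node type, then a state that starts with "Agent".
ROW = re.compile(r"^(\S+)\s+(\S+)\s+(Agent\s+\S.*)$")


def parse_listnodes(log_text):
    """Return {node: (type, state)}, keeping the last state seen per node."""
    nodes = {}
    for line in log_text.splitlines():
        body = PREFIX.sub("", line).strip()
        match = ROW.match(body)
        if match:
            node, node_type, state = match.groups()
            nodes[node] = (node_type, state)
    return nodes


def failed_nodes(nodes):
    """Return the sorted names of nodes whose state is not 'Agent Running'."""
    return sorted(n for n, (_, state) in nodes.items() if state != "Agent Running")


# Lines copied from the excerpt above, names spelled exactly as logged.
SAMPLE = """\
10/27/11 11:24:06 [issue_cmd ] *** Node stpaul-bsfesx01 is the master primary ***
10/27/11 11:24:06 [issue_cmd ] Node Type State
10/27/11 11:24:06 [issue_cmd ] ----------------------- ------------ --------------
10/27/11 11:24:06 [issue_cmd ] stpaul-bsfesx01 Primary Agent Running
10/27/11 11:24:06 [issue_cmd ] stpaul-bsfesx03 Primary Agent Failed
10/27/11 11:24:06 [issue_cmd ] stpaul-bsfexs02 Primary Agent Running
"""

if __name__ == "__main__":
    nodes = parse_listnodes(SAMPLE)
    for name in failed_nodes(nodes):
        print(name, nodes[name][1])
```

Run against the sample, this reports only stpaul-bsfesx03 as failed, which matches what the `wait_agent_startup` lines say.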

I've confirmed that DNS resolves from esx03 to esx02 and esx01. All three hosts have hosts files in place.
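A name-resolution check like this can be scripted. Here is a minimal Python sketch, assuming it runs on a machine that uses the same DNS servers (or hosts file entries) as the ESXi hosts. The function names `check_resolution` and `unresolved` are illustrative only. It checks forward lookups and nothing else.

```python
import socket


def check_resolution(hostnames, resolve=socket.gethostbyname):
    """Map each hostname to its IPv4 address, or None if the lookup fails.

    `resolve` is injectable so the logic can be exercised without a network.
    """
    results = {}
    for name in hostnames:
        try:
            results[name] = resolve(name)
        except OSError:  # socket.gaierror is a subclass of OSError
            results[name] = None
    return results


def unresolved(results):
    """Return the sorted hostnames that did not resolve."""
    return sorted(name for name, addr in results.items() if addr is None)


# Example usage, with the names exactly as they appear in the log above:
#   results = check_resolution(
#       ["stpaul-bsfesx01", "stpaul-bsfesx03", "stpaul-bsfexs02"]
#   )
#   print(unresolved(results))
```

Feeding it the names as the cluster itself records them, rather than retyping them, checks the names HA actually uses.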

I tried uninstalling the HA agent manually over SSH (well, the remote tech support console). Can I get a full console on ESXi?

I also tried disabling HA on the cluster, removing the server and re-adding it, manually uninstalling the AAM agent, etc.

Any thoughts would be appreciated.

Thanks

athlon_crazy · 10-27-2011

You've tried everything else, but have you tried creating a new cluster and adding the third node to it? Once that works, you can try adding the rest of the nodes to the new cluster. Warning: you could lose performance data by doing this.

http://www.no-x.org

capitaseanb · 10-27-2011

That worked. Created a Test cluster, added esx03 to the test cluster. HA enabled fine.

Is it going to be possible to move my other 2 servers from their current cluster into the new one without downtime for their guest VMs?

athlon_crazy · 10-27-2011

Yes, you can, and it shouldn't cause any downtime for your VMs, unless you drag and drop your hosts into the new cluster, which requires maintenance mode. To avoid that, please utilise vMotion.

capitaseanb · 10-27-2011

How do I ensure I use vMotion? Is that the Migrate option?

Thanks

athlon_crazy · 10-27-2011

If you are required to put the host into maintenance mode, first migrate all of its VMs to another host using vMotion.

bparlier · 10-27-2011

Yes, you can right-click the guest and use the Migrate option. There are several ways to do it, but migrating the guest is what you want to do.

capitaseanb · 10-27-2011

So, I created a test cluster and added esx03 to it. HA enabled fine. I then migrated my other 2 servers: esx02 moved across OK and HA enabled. esx01, however, brought up the same error as in the OP.

So it's not my server config, as esx01 was running in HA before. Could it be a licensing error? It seems I can't have more than 2 servers in this particular cluster. I'm confused now.

pccbryan · 11-09-2011

Having the exact same issue. I just added a new host, which brought me to 3 total. Once all 3 are up, I get the above error on one of the older servers. Any idea what is going on?

capitaseanb · 11-10-2011

What version of vCenter are you using, btw? And what version of ESXi is on the servers, out of interest? I've not managed to resolve this yet.