Re: HA Agents

Stephen_Murphy · ‎06-29-2006

I cannot get HA to work.

I have two DL585s, both connected to an MSA1000 SAN.

DNS works on both ESX Servers.

If I try to enable HA in a newly created cluster, I get the following error:

/opt/LGTOaam512/bin/ft_startup failed

on both ESX machines!

So I searched the forum but found nothing except this command:

perl /opt/LGTOaam512/vmware/aam_config_util.pl -z -cmd=addnode -traceon=1 > addnode_output.txt

So I get a text file, but I still don't know why I can't start the HA agents.

Here is the output of the text file:

CMD: hostname -s

RESULT:

----

acn049ffmesx301

CMD: /opt/LGTOaam512/bin/ft_gethostbyname acn049ffmesx301 |grep FAILED

RESULT:

----

CMD: /opt/LGTOaam512/bin/ftcli -domain vmware -connect acn049ffmesx301 -port 8042 -timeout 60 -cmd "listnodes"

RESULT:

----

add_aam_node

CMD: cp -f /opt/LGTOaam512/samples/host.cfg /opt/LGTOaam512/config/acn049ffmesx301.cfg

RESULT:

----

This is the primary agent -- 1st node in cluster.

Primary agent: acn049ffmesx301

CMD: cp /opt/LGTOaam512/vmware/vmware_first_node.pl /opt/LGTOaam512/bin/runInit

RESULT:

----

CMD: /opt/LGTOaam512/bin/ft_setup -domain=vmware -upgrade=n -noprompt=y -hostname=acn049ffmesx301 -port1=8042 -licensekey=AMCFNEET-4YRDDN53CTHMBDSJ -mailserver=none -primaryagent=acn049ffmesx301

RESULT:

----

Legato Automated Availability Manager setup script.

Setting environment from /opt/LGTOaam512/config/agent_env.Linux

Setting up the Legato Automated Availability Manager agent for domain vmware

Welcome to Automated Availability Manager. (Release 5.1 )

Configuring Agent for current node: acn049ffmesx301

Enter the name of your domain [vmware]:

Using comand line argument domain of : vmware

A previous installation has been detected in this directory.

Is this a software upgrade? (y/n) :

Upgrade command line argument: n

WARNING: your previous configuration and database will be overwritten.

Do you want to continue? (y/n) :

Configuration requires the node name of a primary agent. If you

are configuring the first node in the domain, enter the name

of this node. (i.e. acn049ffmesx301) If this is a subsequent installation

enter the name of an existing primary agent node.

Enter the name of a Primary Agent Node:

Using input argument of acn049ffmesx301 for Primary Agent

Performing a primary node configuration.

Agents require the use of 4 network ports through which to

communicate. These port numbers must be available and consistent

across each of the nodes in the domain. If you are unsure about

specifying port numbers or defining primary nodes please read the

appropriate sections of the user documentation provided with this

product.

Specify the first of the 4 port numbers: [8042]

Using argument for port1: 8042

Ports 8042, 8043, 8044 and 8045 will be used.

Enter your license key: Version: 51

Expires: Permanent License

Features: Site Permanent

Enter the name of your SMTP mail server (optional):

Installation for this node is complete.

To start the Agent run the "ft_startup" command.

VMwareprogress=0

CMD: cp /tmp/aam/*.incarn /opt/LGTOaam512/log/backbone/

RESULT:

----

VMwareprogress=20

VMwareprogress=22

VMwareprogress=25

CMD: cp -f /opt/LGTOaam512/config/ftbb.prm /opt/LGTOaam512/config/ftbb.prm.bck

RESULT:

----

Waiting for /opt/LGTOaam512/bin/ft_startup to complete

VMwareprogress=25

CMD: /opt/LGTOaam512/bin/ft_startup

RESULT:

----

Legato Automated Availability Manager startup script.

Setting environment from /opt/LGTOaam512/config/agent_env.Linux

Starting agent for domain vmware

Starting Backbone...

...

Backbone started successfully.

Starting Agent...

Agent startup failed.

Unexplained fatal error. No $FT_DIR/log/agent/acn049ffmesx301_fatal.out file found.

VMwareprogress=39

ft_startup_monitor: elasped time 0 minute(s) and 22 second(s)

VMwareprogress=39

Waiting for /opt/LGTOaam512/bin/ft_startup to complete

VMwareprogress=39

CMD: /opt/LGTOaam512/bin/ft_startup

RESULT:

----

Legato Automated Availability Manager startup script.

Setting environment from /opt/LGTOaam512/config/agent_env.Linux

Starting agent for domain vmware

Bind info: Address already in use

Backbone's network ports are in use.

Assuming the backbone is running.

Starting Agent...

Agent startup failed.

Unexplained fatal error. No $FT_DIR/log/agent/acn049ffmesx301_fatal.out file found.

val: 14228 root 14228 1 0 03:28 pts/0 00:00:00 /opt/LGTOaam512/bin/ftbb -S/opt/LGTOaam512/config/vmware-sites -R/opt/LGTOaam512/config/ftbb.rc

val: 14230 root 14230 14228 0 03:28 pts/0 00:00:00 -d. -P1:2:50 -S/opt/LGTOaam512/config/vmware-sites

List: 14228 14230

VMwareerrortext=/opt/LGTOaam512/bin/ft_startup failed

VMwareerrorcat=internalerror

Copying /opt/LGTOaam512/config/vmware-sites to /opt/LGTOaam512/log/aam_config_util_addnode.log

VMwareresult=failure

Total time for script to complete: 0 minute(s) and 27 second(s)
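
The trace output above reports its status through lines of the form VMware<key>=<value> (VMwareprogress, VMwareerrortext, VMwareerrorcat, VMwareresult). When the file gets long, pulling just those lines out can make the outcome easier to spot. The following is a minimal sketch, assuming that pattern holds for the whole file; the function name parse_vmware_markers is made up for illustration and is not part of any VMware or Legato tool.

```python
import re

def parse_vmware_markers(trace_text):
    """Collect the VMware<key>=<value> lines from the addnode trace output."""
    markers = {}
    for line in trace_text.splitlines():
        match = re.match(r"^VMware(\w+)=(.*)$", line.strip())
        if match:
            key, value = match.groups()
            # Later lines overwrite earlier ones, so "progress" ends up
            # holding the last value that was reported.
            markers[key] = value.strip()
    return markers

# A few lines copied from the output posted above.
sample = """\
VMwareprogress=39
Agent startup failed.
VMwareerrortext=/opt/LGTOaam512/bin/ft_startup failed
VMwareerrorcat=internalerror
VMwareresult=failure
"""
summary = parse_vmware_markers(sample)
print(summary)
```

To run it against a real trace, read addnode_output.txt into a string and pass that in place of sample.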

VMadmin · ‎06-29-2006

I haven't done this yet but here are some high-level notes from a VMware session I was in on Troubleshooting HA & DAS:

-Check IP, routing, and DNS for each host

-Make sure that storage and network are available across the cluster

-Verify logs: /opt/LGTOaam512/* and /opt/LGTOaam512/vmsupport/*

-Ensure that the hosts are not managed directly: perform all host management through VC

It isn't much but it's all I got.

Good luck!

Jasemccarty · ‎06-29-2006

I had the same issue.

I disconnected my ESX hosts, which I had registered as 10.x.x.x, and re-registered them in VC by their names (VI3-0x.domain.com). I made sure that their DNS entries were correct, and then rebooted them.

No problems since then.

The issue is basically name resolution. If you address that on the local ESX, or in DNS, and use FQDN, you should be fine.

Jase McCarty - @jasemccarty

Nicke · ‎06-29-2006

Also make sure you give your VMFS volumes unique names across the cluster. Instead of naming the local VMFS "vmfs_local" on multiple hosts, give each one a unique name that includes the hostname, for instance "vmfs_host1".

Might not be relevant to your problem, but at least it won't make it worse.

/Nicke

Niclas Borgström
Arrow ECS Sweden

admin · ‎06-29-2006

It looks like your FQDN is greater than 30 characters, in which case HA will not configure properly. This is a known bug in VC 2.0 (see KB article 2259). It will be fixed in the next bug-fix release, but for now the workaround is (quoting from the KB article):

- If the host short name is less than or equal to 29 characters, change the HOSTNAME entry in /etc/sysconfig/network to the short name.

- If you are using an FQDN that is greater than 29 characters:

1) Change the FQDN to less than or equal to 29 characters.

2) Remove the existing cluster.

3) Create a new cluster.

4) Add all the hosts back to the cluster.
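
A length check like the one in the workaround is easy to script before adding hosts to a cluster. The sketch below assumes the 29-character limit quoted from the KB article; the function name and the long example name are invented for illustration.

```python
MAX_NAME_LENGTH = 29  # limit taken from the quoted KB workaround

def name_fits_ha_limit(name, limit=MAX_NAME_LENGTH):
    """True if the host name (short name or FQDN) is within the limit."""
    return len(name) <= limit

# The first name is the short name from the trace earlier in the thread;
# the second is a made-up long FQDN.
for candidate in ("acn049ffmesx301", "acn049ffmesx301.some.long.example.com"):
    status = "ok" if name_fits_ha_limit(candidate) else "too long"
    print(candidate, len(candidate), status)
```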

Stephen_Murphy · ‎06-30-2006

Hello

First, thanks for the answers!

I changed the names of the servers to shorter ones, rebooted, and created a new cluster. Now the HA agent configured fine on the first one, but on the second I get the error:

Could not find a primary host to configure DAS on

What should I do now?

Greetings

Stephen

PepeVM · ‎06-30-2006

Try rebooting the second server as well; maybe taking it out of the domain and putting it back in will help.

Stephen_Murphy · ‎06-30-2006

Hello

I gave up and reinstalled ESX Server on both machines and...

it's working now.

But thanks for helping, guys.

Greetings

Stephen

PepeVM · ‎06-30-2006

Maybe that's not just giving up ... it's finding the only solution applicable in that case!!

Bill_Oyler · ‎07-19-2006

I am receiving the very same errors:

"An error occurred during the configuration of the HA Agent on the host."

and

"/opt/LGTOaam512/bin/ft_startup failed"

This happens on 2 of my 3 ESX 3 servers. I have all DNS working properly. Anyone else run into this strange "/opt/LGTOaam512/bin/ft_startup failed" error?

Bill Oyler Systems Engineer

Bill_Oyler · ‎07-21-2006

I solved the HA error issue (inadvertently) by Repairing my VirtualCenter installation (which wiped out my VC database!). So I really can't say if the solution had anything to do with VirtualCenter, or rebuilding my VC database, or re-adding the hosts to the database, or something entirely different. At any rate, that solved the HA error for me.


Phril · ‎07-24-2006

I had the same error with one of my hosts.

I added an A record to DNS. This did not help.

I disconnected, then removed the server from VC, and added it back with the dns short name and it worked without error.

chad_sanders · ‎07-28-2006

I am having the same issue as described; however, my error reads "Internal AAM error. Agent did not start." Any suggestions?

admin · ‎07-28-2006

When do you see this error? When you add the host to the cluster, or after rebooting a host? If the former, could you post the contents of the /opt/LGTOaam512/log/aam_config_util_addnode.log file from the host that has the error? If the latter, please check that PortFast is enabled on your gateway. There is a known issue where the AAM agents will time out and not start correctly if PortFast is disabled.

chad_sanders · ‎08-15-2006

Figured it out. DNS was being a little too touchy if you ask me, but the fix was recreating the entire server and re-adding it to VC using the FQDN. Came right up like it should.

castle-cs · ‎08-25-2006

HA won't configure if the ESX servers can't see a default gateway. I found this was the most common problem with HA.
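
One way to confirm that a default gateway is at least configured on a Linux-based console is to read the kernel routing table. The sketch below parses text in the /proc/net/route format and lists default routes. It assumes a little-endian (x86) host for decoding the hex gateway field, the sample table is made up, and the function name is hypothetical. It only shows that a default route exists; whether the gateway actually answers still needs a ping.

```python
import socket
import struct

RTF_GATEWAY = 0x0002  # Linux route flag: the route goes through a gateway

def default_gateways(route_table_text):
    """Return (interface, gateway_ip) pairs for default routes."""
    found = []
    for line in route_table_text.splitlines()[1:]:  # first line is the header
        fields = line.split()
        if len(fields) < 4:
            continue
        iface, destination, gateway_hex, flags_hex = fields[:4]
        if destination == "00000000" and int(flags_hex, 16) & RTF_GATEWAY:
            # Assumption: little-endian host, so the hex value is byte-swapped.
            packed = struct.pack("<L", int(gateway_hex, 16))
            found.append((iface, socket.inet_ntoa(packed)))
    return found

# Made-up routing table in /proc/net/route layout.
sample = (
    "Iface\tDestination\tGateway\tFlags\tRefCnt\tUse\tMetric\tMask\tMTU\tWindow\tIRTT\n"
    "eth0\t0001A8C0\t00000000\t0001\t0\t0\t0\t00FFFFFF\t0\t0\t0\n"
    "eth0\t00000000\t0101A8C0\t0003\t0\t0\t0\t00000000\t0\t0\t0\n"
)
print(default_gateways(sample))
```

An empty result means no default route is configured on that host.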

jivnjt · ‎09-18-2006

I found that it all depended on the order in which you re-enable your agents. In my case I started from the bottom up and for some odd reason they started working.

vmmeup · ‎10-03-2006

I ran into the same error after having HA up and running for over a month. It happened after adding 10 GB of RAM to one of my clustered servers.

Process followed to install the RAM:

1.) VMotioned off all virtual servers

2.) Put server in Maintenance Mode

3.) Shut down server

4.) Installed RAM

5.) Turned server back on

6.) Removed server from Maintenance Mode

At this point HA tried to reconfigure and failed.

These were my troubleshooting steps:

1.) Reconfigured for HA (Failed)

2.) Rebooted Physical server and added back to cluster (Failed)

3.) Reconfigured for HA. (Failed)

4.) Removed Server from VC and re-added. (Failed)

Got to love how they word that: "Destroy Host". That's some scary wording.

5.) Removed HA from the whole cluster and re-added HA (Failed)

When unconfiguring HA, it seemed as if all the servers were stuck. The server status said "In Progress" with no status for about 30 minutes. At that point I restarted our Virtual Center Server. When VC came back, 4 of the servers had finished and 3 were still "In Progress". The server that had the memory installed was still reporting an HA error.

6.) Re-enabled HA on the cluster. This enabled HA on 4 of my servers and failed on 3. The original server with the ram installed is still failing as well as two additional. (Failed)

7.) Reconfigured HA on all hosts one at a time, starting with the working hosts and going from first to last. (Failed)

At this point I'm fairly disappointed with the product. Not only was it a nightmare to get running, it's proving to be a nightmare to keep running. It has not proven to remain highly available, only highly annoying. Currently I have HA disabled until they release the new patches.

Background:

I had HA running for over a month, and all DNS issues were resolved in the original setup. The FQDN is less than 30 characters, and the ESX servers can contact each other by FQDN as well as by hostname. Hopefully they will have fixed these issues in the service release.

In short they should rename the product to HU (Highly Unavailable) until they resolve the problem.

Sid Smith ----- VCP, VTSP, CCNA, CCA(Xen Server), MCTS Hyper-V & SCVMM08 [http://www.dailyhypervisor.com] - Don't forget to award points for correct and helpful answers. 😉

admin · ‎10-03-2006

If a "Configure HA" task fails, the log file /opt/LGTOaam512/log/aam_config_util_addnode.log usually has some useful information. Can you post the last 10-20 lines of that file?
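
For anyone who wants to grab just the tail of that log, here is a small sketch. It works on any iterable of lines, so an open file handle for /opt/LGTOaam512/log/aam_config_util_addnode.log can be passed in directly. The function name last_lines is made up, and the demo lines are placeholders rather than real log content.

```python
from collections import deque

def last_lines(lines, count=20):
    """Return the final `count` lines from any iterable of lines."""
    # A deque with maxlen keeps only the most recent `count` items while
    # iterating, so a large log is never held in memory all at once.
    return [line.rstrip("\n") for line in deque(lines, maxlen=count)]

# Demo on placeholder lines; for a real log, pass an open file handle.
demo = ["line %d\n" % n for n in range(1, 51)]
excerpt = last_lines(demo, count=20)
print("\n".join(excerpt))
```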

vmmeup · ‎10-03-2006

After a lot of frustration with getting error after error when trying to get HA to work again, I created a new cluster. One server at a time, I VMotioned off the VMs, put the server in maintenance mode, and removed it from the current cluster. I then took the server out of maintenance mode and added it to the new cluster. Once all servers were added, I enabled HA on the cluster and all is good. A lot of work to fix the problem, whatever it was, but it's working better than it ever was before. When I originally set up the first HA cluster I noticed a lag in VMotion, and now VMotion flies once again.....

