I came into work this morning to find VirtualCenter reporting that 3 of my ESX 3.5 hosts were disconnected.
A reconnect fails with the error:
'Unable to access the specified host. It either does not exist, the server software is not responding, or there is a network problem'.
The hosts are up and the VMs seem to be running just fine; it's the VirtualCenter side of things that seems to be broken.
I can ping the hosts from the VC server and restarting the agent service on the hosts makes no difference. At one point I even had a message about VirtualCenter Agent possibly needing upgrading.
I've also tried restarting the VC server but that made no difference.
Also connecting the Infrastructure Client directly to a host fails:
VMware Infrastructure Client could not establish the initial connection with server "MYSERVER".
Details: A connection failure occurred.
Nor can I use a web browser to manage the 3 hosts in question.
I know Update Manager is installed, but I have not configured it in any way, so unless it applies updates to hosts by default I can't blame it for applying dodgy patches.
Other than just reinstalling ESX what else can I try?
Ta
Mark
Hello,
I have encountered this problem and solved it by restarting the management service. Restarting it does not impact the VMs.
Use this command at the Service Console:
service mgmt-vmware restart
DNS failures can also cause similar problems.
If you already tried to restart the management service and that didn't work then you can try a couple other things. One question, did you upgrade VC and this broke, or did they just suddenly disconnect? V2.0.2 VC is a nightmare with agents, btw.
These steps assume you have already restarted the mgmt-vmware service, and also vmware-vpxa and sometimes even the web portion, vmware-webAccess. Try all those first, then the steps below. Once you restart these services it can still take a few minutes for them to start responding within VC.
One thing you can do is try to remove the host then re-add it. This will check the version of the vpxa and upgrade if needed.
You can check the version yourself and make sure it matches your VC: 'rpm -qa | grep vpxa'. If the agent install is failing and the versions do not match, you can do the upgrade manually. For the manual part, look in \Program Files\VMware\VMware VirtualCenter 2.0\upgrade and open bundleversion.xml to figure out which file you need for your version of ESX. Then ftp or scp (WinSCP is far easier) the appropriate file to the server, change its permissions to 755 and type 'sh ./filename.sh'.
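If you have several hosts to compare, a small helper along these lines may save some squinting at rpm output. This is only a sketch: it assumes the agent package shows up in the listing as something like VMware-vpxa-<version>-<build>, and the function name vpxa_version is made up for illustration.

```shell
# Hypothetical helper: pull the vpxa version out of `rpm -qa` output on stdin.
# Assumption: the package name looks like VMware-vpxa-<version>-<build>.
vpxa_version() {
    grep -i 'vpxa' \
        | sed -n 's/^.*vpxa-\([0-9][0-9.]*-[0-9][0-9]*\).*$/\1/p' \
        | head -n 1
}

# Typical use on the Service Console:
#   rpm -qa | vpxa_version
```

Run it on each host and compare the strings with the version your VC expects.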
If none of this works respond with versions for everything and some more details on the infrastructure, if anything worked before, etc.
Good luck,
Lee
I restarted vmware-webAccess on all three hosts and on all three it failed to stop the service but started it ok. Which suggests to me that it wasn't running in the first place.
It fixed one of them. I can now use the ESX web interface and VC was able to reconnect. Yeah!
However the other two are still failing to connect and the web interface still doesn't work.
I've tried restarting the vpxa service as well but that made no difference. I'd remove and re-add them to VC, but as the web interface doesn't work on the two hosts it suggests to me that the fault is not with VC.
I'd just restart the hosts but I'm worried that I'll then lose the VMs that are still running. HA isn't running and it won't let me enable it. I just get loads of 'An error occurred during configuration of the HA agent on the host' messages. Apparently we are unable to contact a primary HA agent in the cluster, whatever that means.
I can't understand why it's happening now. I've not changed anything since the upgrade from 3 to 3.5 last December. I can only assume that the automatic update thing has screwed something up. Is that possible seeing as I've installed it but not configured it in any way?
Incidentally, I'm apparently running version 2.5.0-64192 of vpxa, and that appears to be true of all hosts.
Ta
Mark
Hello,
You can solve issues by unloading and loading VIC components; however, I would suggest using caution with this method in a production environment. I have paid a price for it in the past.
There are interdependencies in that code and it is not always safe to mess with vpxa directly.
Sometimes you need to schedule a host reboot rather than mess with highly sensitive production services.
You can see that the details of these service control scripts are not trivial.
Check out:
cat /etc/rc.d/init.d/mgmt-vmware
cat /etc/rc.d/init.d/vmware-vpxa
vpxa normally starts and stops with the mgmt-vmware script in a controlled manner using macros.
Here you can see the vpxa process ID before stopping mgmt-vmware services.
PID TTY TIME CMD
2016 ? 00:46:06 vpxa
Stopping VMware ESX Server Management services:
VMware ESX Server Host Agent Services
VMware ESX Server Host Agent Watchdog
Here you can see that the vpxa process was unloaded and reloaded with a new PID directly after it.
The mgmt-vmware service knows the dependencies.
PID TTY TIME CMD
18178 ? 00:00:00 vpxa
Starting VMware ESX Server Management services:
VMware ESX Server Host Agent (background)
Availability report startup (background)
After starting, the service uses the current instance of vpxa.
PID TTY TIME CMD
18178 ? 00:00:00 vpxa
There are other processes that depend on and interact with it.
PID TTY STAT TIME MAJFL TRS DRS RSS %MEM COMMAND
2007 ? S 0:00 470 549 3702 952 0.3 /bin/sh /opt/vmware/vpxa/bin/vmware-watchdog -s vpxa -u 30 -q 5 /opt/vmware/vpxa/sbin/vpxa
18178 ? S 0:02 4456 28435 29284 26668 9.9 /opt/vmware/vpxa/vpx/vpxa
18265 pts/0 S 0:00 279 549 3710 1192 0.4 /bin/sh /usr/bin/vmware-watchdog -s hostd -u 60 -q 5 -c /usr/sbin/vmware-hostd-support /usr/sbin/vmware-hostd -u -a
Here you can see the difference: the watchdog that was PID 2007 has been unloaded and reloaded.
PID TTY STAT TIME MAJFL TRS DRS RSS %MEM COMMAND
18265 pts/0 S 0:00 279 549 3710 1192 0.4 /bin/sh /usr/bin/vmware-watchdog -s hostd -u 60 -q 5 -c /usr/sbin/vmware-hostd-support /usr/sbin/vmware-hostd -u -a
18532 ? S 0:00 376 549 3694 1204 0.4 /bin/sh /opt/vmware/vpxa/bin/vmware-watchdog -s vpxa -u 30 -q 5 /opt/vmware/vpxa/sbin/vpxa
18540 ? S 0:00 4503 28435 25312 18968 7.0 /opt/vmware/vpxa/vpx/vpxa
So in general, use mgmt-vmware to resolve connectivity issues. Going beyond that will require weighing the outage risk for the running VMs.
Hope that helps.
"Apparently we are unable to contact a primary HA agent in the cluster, whatever that means"
Make sure that the ESX hosts can resolve each other from the console. This can be a DNS issue.
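A quick way to run that check from each console is to loop over the host names and report any that do not resolve. A rough sketch, with a few assumptions: the host names in the usage line are placeholders, getent is available on the console, and RESOLVE_CMD is a made-up hook so you can swap in another lookup command.

```shell
# Print every host name given as an argument that fails to resolve.
# RESOLVE_CMD is a hypothetical override; the default is "getent hosts".
unresolved_hosts() {
    for h in "$@"; do
        ${RESOLVE_CMD:-getent hosts} "$h" >/dev/null 2>&1 || echo "$h"
    done
}

# Typical use (placeholder names), run on every host in the cluster:
#   unresolved_hosts esx01.example.com esx02.example.com esx03.example.com
```

No output means every name resolved. Check both short names and fully qualified names if you are not sure which form the cluster uses.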
Really, from the errors you described I sure thought you were having VC problems. You would be surprised on what disconnecting and reconnecting the VC will do.
If you are hesitant about restarting the services then migrate everything over to another host and reboot the problem child.
I had failed to mention that when I restart all these services, I restart them in the order they appear in the rc3.d directory (ls /etc/rc.d/rc3.d/). That is, if I restart more than just mgmt-vmware.
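To read that order off without eyeballing the listing, something like the following might help. It is a sketch only: it assumes the usual S<two digits><name> naming of start links and simply keeps the entries with "vmware" in the name; vmware_start_order is an invented name.

```shell
# Read `ls /etc/rc.d/rc3.d/` output on stdin and print the VMware-related
# services in start order (assumes S<NN><name> start links).
vmware_start_order() {
    grep '^S[0-9][0-9]' \
        | grep -i 'vmware' \
        | LC_ALL=C sort \
        | sed 's/^S[0-9][0-9]//'
}

# Typical use:
#   ls /etc/rc.d/rc3.d/ | vmware_start_order
```

Each name it prints can then be passed to 'service <name> restart' in that order.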
Check your file systems too (vdf -h) and make sure / is not full, or any of the others for that matter. I had a full fs doing this to me once.
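If you want the full ones flagged for you, here is a sketch that filters df output. It uses df -P rather than vdf because the -P column layout is predictable. The 90 percent default and the name full_filesystems are my own choices, and mount points containing spaces are not handled.

```shell
# Read `df -P` output on stdin and print "mountpoint use%" for every
# file system at or above the threshold (default 90 percent).
full_filesystems() {
    threshold="${1:-90}"
    awk -v limit="$threshold" '
        NR > 1 {
            use = $5
            sub(/%/, "", use)          # strip the percent sign
            if (use + 0 >= limit) print $6 " " $5
        }'
}

# Typical use:
#   df -P | full_filesystems 90
```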
As for the HA, in the cluster disable HA for all of them. Let it finish, then turn it back on. ONLY do this after you get all the hosts back in VC. Actually, you should uncheck HA while you restart these services otherwise you could trigger all the guests to shut down if you have the default config.
Mike, I am very interested in hearing what kinds of problems restarting these services gave you, as I do this fairly often. I would much rather learn from you getting bit than get bit myself.
When you say disconnect and reconnect to VC do you mean remove from the cluster and add it back?
If I am unable to connect to the host directly with the VI client and the web interface on the host isn't working will I even be able to add it back in?
You say migrate everything off first. I can't because it's not in VC, or do you mean some clever command line thing?
Ta
Mark
I had the same problem after upgrading the VCMS server; it failed to install the agents on the 3.0.2 servers.
I issued the following commands and then successfully reconnected to the servers. Agent installation succeeded as well.
service mgmt-vmware restart
service vmware-vmkauthd restart
I realize these commands appear elsewhere in this post and in other posts, but I wanted to be clear to the reader of my entry about the exact commands used. I find that postings in this forum sometimes suffer from ambiguity.
Naaa, that didn't work either.
Why would the web interface fail?
I'm trying desperately to avoid restarting the host. I'll need to use windows remote desktop to connect to the VMs to power them off as there's no other way to get to them. Then reboot the host and hope to Cliff it comes back up and VC will talk to it. Otherwise I'm left reregistering 20 plus VMs with other hosts and having to reinstall ESX. Again. It'll have to be done out of hours as well as there's some important VMs on there.
Ta
Mark
I'm no longer clear on what the problem is.
I thought you weren't able to install the VC agents?
vlchild
I had issues once when I upgraded the VIC agents and once when patching. Some failed to upgrade and I restarted the vpxa service after trying the management service and one server restarted. Don't know why. I have 20+ ESX host servers running. Same behavior on the patch instance.
Hello,
Could you post the contents of the vpxa.log file 60 seconds after restarting the mgmt-vmware service?
cat /var/log/vmware/vpx/vpxa.log
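The whole log can be long, so it may be easier to post only what was written after the restart. A rough sketch of one way to do that: note the line count before restarting, then print everything past it. The function names are invented, and this assumes the log is not rotated in between.

```shell
# Count the lines currently in a log file (0 if it does not exist yet).
log_line_count() {
    if [ -f "$1" ]; then
        wc -l < "$1" | tr -d ' '
    else
        echo 0
    fi
}

# Print only the lines appended after the given line count.
log_new_lines() {
    tail -n "+$(( $2 + 1 ))" "$1"
}

# Typical use:
#   before=$(log_line_count /var/log/vmware/vpx/vpxa.log)
#   service mgmt-vmware restart
#   sleep 60
#   log_new_lines /var/log/vmware/vpx/vpxa.log "$before"
```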
Well it's no longer an issue.
I had a call this morning from work (on my day off) saying that the VMs that I thought were running ok on the disconnected hosts were in fact down.
So I had little choice but to have the hosts rebooted. They came back up but they seemed to have lost their network configuration. ifconfig showed no vswif0 interface. I have no idea how to fix this let alone talk my trainee through it over the phone so we just powered them off, removed the hosts from VC along with their VMs and I manually registered the now orphaned VMs back with VC.
All VMs are back up now but I have two hosts to reinstall tomorrow when I'm back at work.
That now makes 4 hosts over the past few days that have lost their network config. We lost 2 on Friday, which I rebuilt on Saturday morning as clean 3.5 installs. I should have figured out that the 2 that were still running but not talking to VC were having the same problem. I have a fifth that isn't showing any networking details in the VI client, so I expect that to fail soon too. Could Update Manager have patched the hosts even though I've not actually configured it? It's installed but I've never touched it. I wonder if a patch screwed everything up.
The only hosts that seem to be okay are those that were clean 3.5 installs rather than upgrades from version 3.
Ta
Mark