Hi all,
Firstly, yes I know this is a repeat of something that's been asked before but I've tried every resolution I can find - nothing has worked so far. In the vSphere client, all hosts currently have a state of "Not responding". For reference, though, here's what I've tried/checked up until now:
- vCenter and ESXi are all in evaluation mode.
- I'm not using the HP-branded ESXi ISO (as referenced by http://communities.vmware.com/message/1852500 plus many other community posts). The hardware *is* HP hardware, though.
- Changing the vCenter server security policy (as referenced by http://communities.vmware.com/thread/271809).
- Gone through all the VMware-published checks listed in http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=100340....
- Checked network connectivity using PING, nslookup and telnet on port 902. All ESXi hosts respond on port 902.
- AD DNS monitoring checks all say "Pass".
- There are no firewalls on this network (it's physically isolated from everything, including the internet ... obviously I'm writing this post on a different network).
- Connecting directly to the ESXi hosts works fine - this only happens when connecting through vCenter (therefore the running VMs are contactable and usable without any issues ... there's just no HA).
- /etc/vmware/vmware.lic has a license key filled with "0" (confirms evaluation state, I think).
- /etc/vmware/vmware.lic mode is as follows ... -rw------T.
- All hosts are on the same version of ESXi.
- There's only 1 vCenter server.
One odd thing is that a few articles, including the VMware ones, say to run "rpm" and "service" in a variety of different ways. Neither "rpm" nor "service" is a valid command on my ESXi servers. This seems strange to me, unless ESXi 5.0 has removed those commands from the shell.
Can anyone suggest anything else that could be causing this?
Thanks!
Try checking vpxa.log, hostd.log, and vmkernel.log on the ESXi hosts. Look especially for "NMP" or "All Paths Down" (APD) messages, or any storage warnings, in vmkernel or hostd. Storage issues could cause this type of behavior.
Fix the storage issues and reboot the ESXi hosts if you see these messages in the logs (reboot is the only way to fix APD after storage has hosed hostd and vpxa).
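For anyone who would rather not read the logs by eye, here is a minimal sketch of that search in Python. It assumes the log files have been copied somewhere Python can read them; the function names are made up for illustration, and the exact wording of storage messages varies by ESXi build, so treat the pattern as a starting point rather than a complete list.

```python
import re
import sys

# Pattern built from the advice above: "NMP", "APD" or "All Paths Down".
# Assumption: these strings appear literally in the log lines of interest.
STORAGE_PATTERN = re.compile(r"\bNMP\b|\bAPD\b|All Paths Down", re.IGNORECASE)


def find_storage_warnings(lines):
    """Return (line_number, text) pairs for lines that mention NMP or APD."""
    hits = []
    for number, line in enumerate(lines, start=1):
        if STORAGE_PATTERN.search(line):
            hits.append((number, line.rstrip("\n")))
    return hits


def scan_file(path):
    """Scan one log file, tolerating any odd bytes in it."""
    with open(path, errors="replace") as handle:
        return find_storage_warnings(handle)


if __name__ == "__main__":
    # Pass the copied log files as arguments, e.g. vmkernel.log hostd.log vpxa.log
    for path in sys.argv[1:]:
        for number, text in scan_file(path):
            print(f"{path}:{number}: {text}")
```

No output means none of the scanned lines matched, which is consistent with storage not being the cause but does not prove it.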
So you have the hosts connecting for a short time (like 1-5 minutes) and then disconnecting, is that right? You need to be sure that the ESX servers are able to connect back to the vCenter on port 902 TCP/UDP. As ESXi does not have telnet, you can test this from a VM on the same vmnic/subnet as the ESX servers. I would recommend turning off the vCenter Windows firewall and anything else (antivirus, antispyware, etc.) that could be interfering with this.
Indeed, service and rpm are not present on any version of ESXi.
To restart services on ESXi: services.sh restart
To check installed packages: esxupdate query
LukasLundell,
Thanks for the info. I've gone through all the logs and can't find any errors that refer to NMP or APD. I've also checked the storage configuration on all 3 hosts and confirmed that the configuration is the same on all of them, including the disks that are presented and mounted.
I've shut down all the hosts and restarted only one of them - even by itself it says 'Not responding' 1-2 minutes after logging into vCenter.
marcelo.soares,
Thanks for the info. As I mentioned in my original question, I've checked that all the hosts respond on port 902 when I try a telnet session. There are no firewalls, anti-spyware or anti-virus anywhere on this network.
I'm not sure what else I can try ... ?
You need to test the connection back: not only from VC to the ESX hosts, but from the ESX hosts to VC also. Port 902 is responsible for the heartbeats, and USUALLY this problem happens when the ESX host cannot connect back to the VC on 902 TCP/UDP.
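If telnet is not available on the machine you are testing from, a TCP connection attempt can be sketched in a few lines of Python. This is only a rough equivalent of "telnet host 902": the function name is made up for illustration, and it covers TCP only, so it says nothing about UDP heartbeats.

```python
import socket


def tcp_port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the name and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution failures.
        return False
```

Run it from a VM on the same subnet as the hosts, for example tcp_port_open("vcenter.lab.local", 902), where "vcenter.lab.local" is a placeholder for the real vCenter address.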
Apologies for being unclear earlier - thanks for the clarification.
I can't telnet to the VC on port 902 from anywhere, not even localhost. Windows Firewall is completely disabled on the VC server.
If the firewall is disabled and I'm trying from localhost (telnet client is installed there), should the server respond on port 902 in a similar way to the ESXi hosts?
Thanks
I just looked at the firewall rules on the servers (VC put a ton of them in there). There's a port 902 rule in there but it's UDP - telnet won't work. I did try 'netcat' from a Linux server but that doesn't respond either.
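Since telnet only speaks TCP, a UDP check needs a different approach. The sketch below sends a single datagram and reports what comes back. It is an illustration under assumptions, not a definitive test: the function name and payload are made up, the listener on 902 may well ignore an arbitrary payload, and silence from a UDP port cannot distinguish "open" from "filtered".

```python
import socket


def udp_probe(host, port, payload=b"\x00", timeout=2.0):
    """Send one datagram and classify the result.

    Returns "reply" if any datagram comes back, "closed" if the OS reports
    that the port is unreachable, and "open|filtered" if nothing is heard.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        # Connecting a UDP socket sends nothing, but it lets the OS deliver
        # ICMP port-unreachable errors back to this socket.
        sock.connect((host, port))
        try:
            sock.send(payload)
            sock.recv(1024)
            return "reply"
        except socket.timeout:
            return "open|filtered"
        except (ConnectionRefusedError, ConnectionResetError):
            return "closed"
```

A "closed" result is the informative one: it means the target answered that nothing is listening on that port. An "open|filtered" result is what you would see both from a working listener that ignores the payload and from a firewall silently dropping the packet.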
I think you're right (UDP only)... but the docs say TCP/UDP.
Can you try to test with an ESX host on the same subnet as the VC, and on the same physical switch if possible? (I'm trying to rule out any hardware/software in the middle.)
It's probably worth mentioning at this point that the VC is virtual and running on the ESXi host that I'm trying to add to the cluster. Unfortunately, in this demo lab I don't have a spare server to use as a physical VC.
Is there something in the ESXi/VC configuration that prevents port 902 UDP communication if the VC is virtual?
All hardware is on the same physical switch, there are no VLANs, everything is on the same subnet, etc. There's no need to separate the devices here as this is for demo only and won't ever be used for production or in a secure environment.
If the VC is virtual and the problem is happening on the same ESXi host it resides on, and everything is on the same subnet, I'm almost sure something on your Windows server is blocking it. Check the steps in KB http://kb.vmware.com/kb/1029919 to try to discover what is going on.
After the VC was joined to our lab domain, the VC firewall rules weren't being applied. I'd turned off the firewall before joining the domain but joining the domain re-enabled it for the domain profile. How annoying.
Thanks for your help Marcelo.
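For anyone who hits the same thing, a quick way to confirm which Windows Firewall profiles are actually on after a domain join is to read the output of "netsh advfirewall show allprofiles state". The Python sketch below parses that output. It assumes an English-language Windows and the usual "Domain Profile Settings:" / "State ON" layout; the function names are made up for illustration.

```python
import re
import subprocess


def parse_profile_states(netsh_output):
    """Map profile name -> True when that profile's State line reads ON."""
    states = {}
    current = None
    for raw in netsh_output.splitlines():
        line = raw.strip()
        header = re.match(r"^(\w+) Profile Settings", line)
        if header:
            current = header.group(1)
            continue
        state = re.match(r"^State\s+(ON|OFF)\b", line, re.IGNORECASE)
        if state and current:
            states[current] = state.group(1).upper() == "ON"
    return states


def read_profile_states():
    """Run netsh on the local machine (Windows only) and parse its output."""
    result = subprocess.run(
        ["netsh", "advfirewall", "show", "allprofiles", "state"],
        capture_output=True, text=True, check=True,
    )
    return parse_profile_states(result.stdout)
```

On the vCenter server, read_profile_states() returning True for "Domain" while the other profiles are False would match the situation described above, where the domain join switched the firewall back on for the domain profile only.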