VMware Cloud Community
digitalformula
Contributor

ESXi 5.0 hosts all say "not responding" after short time

Hi all,

Firstly, yes I know this is a repeat of something that's been asked before but I've tried every resolution I can find - nothing has worked so far.  In the vSphere client, all hosts currently have a state of "Not responding". For reference, though, here's what I've tried/checked up until now:

- vCenter and ESXi are all in evaluation mode.

- I'm not using the HP-branded ESXi ISO (as referenced by http://communities.vmware.com/message/1852500 plus many other community posts).  The hardware *is* HP hardware, though.

- Changing the vCenter server security policy (as referenced by http://communities.vmware.com/thread/271809).

- Gone through all the VMware-published checks listed in http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=100340....

- Checked network connectivity using PING, nslookup and telnet on port 902.  All ESXi hosts respond on port 902.

- AD DNS monitoring checks all say "Pass".

- There are no firewalls on this network (it's physically isolated from everything, including the internet ... obviously I'm writing this post on a different network).

- Connecting directly to the ESXi hosts works fine - this only happens when connecting through vCenter (therefore the running VMs are contactable and usable without any issues ... there's just no HA).

- /etc/vmware/vmware.lic has a license key filled with "0" (confirms evaluation state, I think).

- /etc/vmware/vmware.lic mode is as follows ... -rw------T.

- All hosts are on the same version of ESXi.

- There's only 1 vCenter server.
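
For anyone repeating the port 902 check from the list above without a telnet client, here is a minimal Python sketch. It is only an illustration: the function names are made up, and a successful TCP connect shows that something is listening on the port, not that vCenter heartbeats are getting through.

```python
import socket


def tcp_port_open(host, port=902, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the name and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and DNS resolution failures.
        return False


def check_hosts(hosts, port=902, timeout=3.0):
    """Map each host name to the result of the TCP check."""
    return {host: tcp_port_open(host, port, timeout) for host in hosts}
```

Calling `check_hosts(["esxi01.lab.local", "esxi02.lab.local"])` (hypothetical host names) returns a dict of host name to True/False. Note that this covers TCP only, so it says nothing about UDP traffic on the same port.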

One odd thing is that a few articles, including the VMware ones, say to run "rpm" and "service" in a variety of different ways.  Neither "rpm" nor "service" is a valid command on my ESXi servers.  This seems strange to me, unless ESXi 5.0 has removed those commands from the shell.

Can anyone suggest anything else that could be causing this?

Thanks!


Accepted Solutions
marcelo_soares
Champion

If VC is virtual and the problem is happening on the same ESXi host it resides on, with everything on the same subnet, I'm almost sure something on your Windows server is blocking it. Check the steps in KB http://kb.vmware.com/kb/1029919 to try to discover what is going on.

Marcelo Soares

10 Replies
LukasLundell
Contributor

Try checking vpxa.log, hostd.log, and vmkernel.log on the ESXi hosts.  Look especially for "NMP" or "All Paths Down" (APD) type messages, or warnings regarding storage, in vmkernel or hostd.  Storage issues could cause this type of behavior.

Fix the storage issues and reboot the ESXi hosts if you see these messages in the logs (reboot is the only way to fix APD after storage has hosed hostd and vpxa).
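
If the logs have been copied off the host, the search described above can be sketched in a few lines of Python. This is only a rough filter: the pattern just looks for the strings "NMP", "APD" and "All Paths Down", the function names are made up, and where the log files live on a given ESXi build is something to verify rather than assume.

```python
import re

# Case-insensitive match on the keywords mentioned above. Word boundaries
# keep "nmp" from matching inside unrelated words such as "snmpd".
STORAGE_PATTERN = re.compile(r"\bNMP\b|\bAPD\b|all paths down", re.IGNORECASE)


def find_storage_warnings(lines):
    """Return (line_number, text) pairs for lines that mention NMP or APD."""
    return [
        (number, line.rstrip("\n"))
        for number, line in enumerate(lines, start=1)
        if STORAGE_PATTERN.search(line)
    ]


def scan_log(path):
    """Scan one log file; undecodable bytes are replaced rather than fatal."""
    with open(path, errors="replace") as handle:
        return find_storage_warnings(handle)
```

For example, `scan_log("vmkernel.log")` on a copied log returns the matching lines with their line numbers. An empty list only means none of these keywords appeared, not that storage is healthy.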

marcelo_soares
Champion

So the hosts connect for a short time (like 1-5 minutes) and then disconnect, is that right? You need to be sure that the ESX servers are able to connect back to the vCenter on port 902 TCP/UDP. As ESXi does not have telnet, you can try using a VM on the same vmnic/subnet as the ESX servers to check this. I would recommend turning off the vCenter Windows firewall and anything else (antivirus, antispyware, etc.) that could be interfering with this.

Indeed, service and rpm are not present in any version of ESXi.

To restart services on ESXi: services.sh restart

To check installed packages: esxupdate query

Marcelo Soares
digitalformula
Contributor

LukasLundell,

Thanks for the info.  I've gone through all the logs and can't find any errors that refer to NMP or APD.  I've also checked the storage configuration on all 3 hosts and confirmed that the configuration is the same on all of them, including the disks that are presented and mounted.

I've shut down all the hosts and restarted only one of them - even by itself it says 'Not responding' 1-2 minutes after logging into vCenter.

marcelo.soares,

Thanks for the info.  As I mentioned in my original question, I've checked that all the hosts respond on port 902 when I try a telnet session.  There are no firewalls, anti-spyware or anti-virus anywhere on this network.

I'm not sure what else I can try ... ?

marcelo_soares
Champion

You need to test the connection back as well: not only from VC to the ESX hosts, but also from the ESX hosts to VC. Port 902 is responsible for the heartbeats, and USUALLY this problem happens when the ESX host cannot connect back to the VC on 902 TCP/UDP.

Marcelo Soares
digitalformula
Contributor

Apologies for being unclear earlier - thanks for the clarification.

I can't telnet to the VC on port 902 from anywhere, not even localhost.  Windows Firewall is completely disabled on the VC server.

If the firewall is disabled and I'm trying from localhost (telnet client is installed there), should the server respond on port 902 in a similar way to the ESXi hosts?

Thanks

digitalformula
Contributor

I just looked at the firewall rules on the servers (VC put a ton of them in there).  There's a port 902 rule in there but it's UDP - telnet won't work.  I did try 'netcat' from a Linux server but that doesn't respond either.
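
Since telnet only speaks TCP, a UDP port has to be probed differently, and the result is much less conclusive. The Python sketch below (not part of any VMware tooling, names made up) sends one datagram and reports what comes back. It assumes nothing about the heartbeat protocol: the one-byte payload is arbitrary and a listener is not expected to answer it, so "no-reply" can mean either an open port or a firewall silently dropping packets. Only "closed" is a definite answer.

```python
import socket


def udp_probe(host, port=902, payload=b"\x00", timeout=2.0):
    """Send one datagram and classify the outcome.

    Returns "reply" if any datagram comes back, "closed" if the OS reports
    an ICMP port-unreachable error, and "no-reply" on timeout (ambiguous).
    """
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        # A connected UDP socket lets the OS surface ICMP errors to us.
        sock.connect((host, port))
        try:
            sock.send(payload)
            sock.recv(1024)
            return "reply"
        except (ConnectionRefusedError, ConnectionResetError):
            # Linux raises the former, Windows typically the latter.
            return "closed"
        except socket.timeout:
            return "no-reply"
```

Running `udp_probe("vcenter.lab.local")` (hypothetical name) from a VM on the same subnet as the hosts gives one of the three strings. A silent result from netcat or from this probe therefore does not prove the port is blocked.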

marcelo_soares
Champion

I think you're right (UDP only)... but the docs say TCP/UDP.

Can you try to test with some ESX on the same subnet as VC and on the same physical switch if possible? (trying to avoid any middle piece of hardware/software).

Marcelo Soares
digitalformula
Contributor

It's probably worth mentioning at this point that the VC is virtual and running on the ESXi host that I'm trying to add to the cluster.  Unfortunately, in this demo lab I don't have a spare server to use as a physical VC.

Is there something in the ESXi/VC configuration that prevents port 902 UDP communication if the VC is virtual?

All hardware is on the same physical switch, there are no VLANs, everything is on the same subnet, etc.  There's no need to separate the devices here as this is for demo only and won't ever be used for production or in a secure environment.

marcelo_soares
Champion

If VC is virtual and the problem is happening on the same ESXi host it resides on, with everything on the same subnet, I'm almost sure something on your Windows server is blocking it. Check the steps in KB http://kb.vmware.com/kb/1029919 to try to discover what is going on.

Marcelo Soares
digitalformula
Contributor

After the VC was joined to our lab domain, the VC firewall rules weren't being applied.  I'd turned off the firewall before joining the domain but joining the domain re-enabled it for the domain profile.  How annoying.

Thanks for your help, Marcelo. :)
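
For anyone hitting the same thing: the firewall state is per profile (Domain, Private, Public), so it is worth checking all three rather than a single on/off switch. Running `netsh advfirewall show allprofiles state` on the Windows server prints them. Below is a small Python sketch that parses that output. It assumes an English-language Windows install, and the function names are made up for illustration.

```python
import re
import subprocess


def parse_profile_states(netsh_output):
    """Turn netsh's per-profile listing into {"Domain": True, ...}.

    True means the firewall is ON for that profile.
    """
    states = {}
    current = None
    for raw in netsh_output.splitlines():
        line = raw.strip()
        header = re.match(r"^(\w+) Profile Settings:", line)
        if header:
            current = header.group(1)
            continue
        state = re.match(r"^State\s+(\S+)", line)
        if state and current:
            states[current] = state.group(1).upper() == "ON"
    return states


def firewall_profile_states():
    """Run netsh on the local Windows machine and parse the result."""
    result = subprocess.run(
        ["netsh", "advfirewall", "show", "allprofiles", "state"],
        capture_output=True, text=True, check=True,
    )
    return parse_profile_states(result.stdout)
```

If `firewall_profile_states()` shows the Domain profile as True after a domain join, that matches what happened here: the firewall was back on for the one profile that mattered.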
