Hello,
I have a vCenter cluster of 6 HP ProLiant DL380/DL580 Gen10 servers, all running ESXi 7.0.2, build 17630552.
DRS and HA were enabled.
One day I logged in to vCenter and one of the servers was "not responding". Many VM migrations were in progress from all servers to the others; some succeeded and some failed. I disabled DRS and no new migrations started.
After an hour the server reconnected, but a few of the VMs on it showed "unknown" as their status, with "invalid" next to their names. The VMs themselves were working fine, and all services, databases and websites were alive.
I tried to migrate VMs off this server, but the operation failed for all of them, both the VMs in the invalid state and the ones that looked healthy.
I should add that VMs that had been powered off before this happened moved to another server successfully.
Then I logged in to ESXi directly, and the VMs were invalid there too.
I tried to unregister the VMs from vCenter; it failed.
I tried to unregister them from ESXi; it reported that they were unregistered successfully, but the VMs were still listed.
I tried to unregister them from the CLI; that operation failed too and nothing happened.
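For anyone following along, a CLI unregister attempt on ESXi usually looks roughly like the sketch below. It assumes the standard `vim-cmd vmsvc/getallvms` and `vim-cmd vmsvc/unregister` subcommands; the `RUN` switch and the function names are made up for this example so the commands can be previewed before anything is executed. It is not the exact set of commands used in this thread.

```shell
#!/bin/sh
# Sketch only: list registered VMs and unregister one by ID on an ESXi host.
# RUN is a hypothetical dry-run switch: it defaults to "echo" (print only).
# Set RUN to an empty value (RUN=) to actually execute the commands.
RUN="${RUN-echo}"

list_vms() {
    # Prints one row per registered VM; the first column is the VM ID.
    $RUN vim-cmd vmsvc/getallvms
}

unregister_vm() {
    # $1 must be the numeric VM ID taken from the getallvms output.
    case "$1" in
        ''|*[!0-9]*) echo "usage: unregister_vm <vmid>" >&2; return 1 ;;
    esac
    $RUN vim-cmd vmsvc/unregister "$1"
}
```

With the default dry run, `unregister_vm 12` only prints the command it would run. In my case the real command failed as well, so this does not help while the management agents are in the broken state.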
Then I tried to open the storage to see what was going on in the VM folders, but the server with the problem could not browse the storage (SAN storage).
Even from the CLI the server could not browse the storage, although the VMs were running from that same storage and all services were alive.
Finally I stopped all services on the VMs on the troubled server and rebooted ESXi. Because of HA, all the VMs were transferred to other servers, and after the reboot they came online in normal status; there was no "invalid" or "unknown" state.
After the ESXi reboot I tested the troubled server by migrating VMs to it and from it under every state of network connectivity (each server has 4 network cables: 2 go to a switch for Ethernet and 2 to the data switch). There was no problem!
This incident created a huge amount of work for our team, who had to stop the services of the VMs on the troubled server and reboot ESXi.
I searched for this and found nothing similar in the VMware Knowledge Base.
FT is not enabled yet, because in a test it caused high latency on the network connection of the test machine. We need to test that more.
So, thanks to everyone who replied (no one).
I tried very hard and managed to fix this problem.
At first I restarted the Management Agents, but then I got this error when opening the ESXi web console:
503 Service Unavailable (Failed to connect to endpoint: [N7Vmacore4Http16LocalServiceSpecE:0x000000691a805070] _serverNamespace = / action = Allow _port = 8309)
The fix everyone suggested was to restart these services,
so I enabled SSH and ran this on the ESXi host:
/etc/init.d/hostd restart
/etc/init.d/vpxa restart
But no luck. vpxa would not start, even though it showed as on in the service list.
So I restarted all services:
services.sh restart
Again no luck, and I still could not load the ESXi web console.
Then I tried this:
/etc/init.d/vpxa stop
services.sh restart
Yes! The problem was fixed: I could connect the hosts to vCenter again and storage access was OK in ESXi.
This problem took me about 2 weeks. Maybe this solution can help you.
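The order that worked (stop vpxa first, then restart everything) can be wrapped in a small script. This is just a sketch of the steps above: the function name and the `RUN` dry-run switch are made up for the example, and the two status calls at the end are an extra check that assumes the init scripts accept a `status` argument.

```shell
#!/bin/sh
# Sketch of the sequence from this thread, to be run over SSH on the ESXi host.
# RUN is a hypothetical dry-run switch: it defaults to "echo" (print only).
# Set RUN to an empty value (RUN=) to actually execute the commands.
RUN="${RUN-echo}"

recover_management_agents() {
    # Step 1: stop vpxa on its own before touching anything else.
    $RUN /etc/init.d/vpxa stop
    # Step 2: restart all services; stop here if that fails.
    $RUN services.sh restart || return 1
    # Extra check (assumption): ask both agents for their status afterwards.
    $RUN /etc/init.d/hostd status
    $RUN /etc/init.d/vpxa status
}
```

Calling `recover_management_agents` with the default setting only prints the four commands in order, so the sequence can be reviewed before running it for real with `RUN=`.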