VMware Cloud Community
George_B
Enthusiast

ESX 3.5 cluster failure with SAN & network connectivity loss, BMC error?

Hi All,

I have just had a strange, catastrophic failure of one of the hosts in my cluster, and I have not yet been able to pinpoint what caused it or how to fix it. Any thoughts, comments or pointers would be appreciated.

In brief, I have a two-node ESX 3.5 U4 cluster with Enterprise licensing (DRS, vMotion etc. with vCenter) running on HP DL380 G5 servers. These are connected via QLogic PCIe iSCSI HBAs to a Nortel ERS 5500 switch and from there to an EqualLogic iSCSI cluster of a PS100E and a PS4000E running version 4.2.1 firmware.

This is used to provide numerous high-value but low-load development and demonstration virtual machines. At about 5 this evening, one of the nodes disconnected from VC, along with all the VMs running on it. Before this, no changes had been made to the configuration of the switches, ESX or the servers. The only thing that seemed to trigger it was changing the vSwitch connections of some VMs' NICs.

At first glance, all the VMs running on the one remaining host seem to be OK, but on closer inspection anything that writes to disk fails, in much the same way as when there is an iSCSI timeout due to a network failure. VC, which runs in a VM, allows logins and functions, but you can't use RDP to get to it or communicate with the VM in any other way.

There are also a number of errors on the physical consoles of the servers which also indicate a network connectivity failure. See screen grab below:

However, the SAN is up, the network is only one switch and it is up, and the VCB physical server that's connected to the same independent iSCSI SAN network can mount LUNs and happily read from and write to them. This would seem to rule out the network and the SAN as the fault.

One server responds only to ping and SSH; the other responds to ping, SSH, a direct VI Client connection and VC. As they both still respond in some way on their original IPs, there can't have been a vswif misconfiguration.
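For anyone who wants to repeat the comparison between the two hosts, here is a minimal sketch of the kind of check I mean: it simply tries a TCP connection to each management service and reports which ones answer. The port numbers are my assumption of the ESX 3.5 defaults (22 for SSH, 443 and 902 for VI Client and VC traffic), and the function name `probe_services` is just something made up for illustration. It does not test ping (ICMP), only TCP reachability.

```python
import socket

# Assumed default ESX 3.5 service ports; adjust if your hosts differ.
SERVICE_PORTS = {
    "ssh": 22,
    "https (VI Client / VC)": 443,
    "authd (VI Client / VC)": 902,
}


def probe_services(host, ports=SERVICE_PORTS, timeout=3.0):
    """Return {service_name: bool} showing which TCP ports accept a connection."""
    results = {}
    for name, port in ports.items():
        try:
            # A successful connect only proves the port is listening,
            # not that the service behind it is healthy.
            with socket.create_connection((host, port), timeout=timeout):
                results[name] = True
        except OSError:
            # Covers refused connections, timeouts and unreachable hosts.
            results[name] = False
    return results
```

Calling `probe_services("esx-host-1")` and `probe_services("esx-host-2")` (placeholder hostnames) and comparing the two dictionaries should show the same pattern described above: one host answering only on SSH, the other on all three.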

Logging in physically to both servers brings up similar errors relating to the BMC. A quick search of Google and the forums suggests this might be a known issue, but why has it just happened without a reboot, upgrade etc.? The KB article that seems to relate to this does not load and keeps asking for authorization.

So any help is much appreciated!

George_B
Enthusiast

In addition, I have now tested with the QLogic SANsurfer iscli utility installed on both nodes, and both HBAs can ping and see the SAN.
