VMware Cloud Community
chouse
Enthusiast

Weird ESX/VM network problem - loses connectivity after VM reboot or VMotion to particular server

I have 2 HP BL45p G1 blades, each with 4 NICs, running ESX 3.0.1 build 39823. Both have been up for 80 days. Both have 4 dual-core Opteron 875 processors and 24 GB of RAM. Both are in a cluster with DRS and HA enabled.

vSwitch0 has vmnic0 and vmnic1 teamed, with portgroups for Service Console and VMotion.

vSwitch1 has vmnic2 and vmnic3 teamed, with a portgroup called "Production" for virtual machines.

Both vSwitches on both servers are configured with 56 ports, and neither shows that many in use (i.e., there are free ports on the portgroups/switches; I solved a similar problem this summer where not enough ports were available).

Both vSwitches are configured for Port ID load balancing and Link Status failure detection.

Both servers report all 4 NICs online at 1000/Full, connected to Cisco GESM switches (for HP p-Class blade enclosures).

Other (Windows) servers in this p-Class enclosure are fine.

VLAN trunking is not enabled on the switch and VLAN IDs are not entered on any of the portgroups.

The first server "uranus" is fine, running 36 virtual machines.

The second server, "venus", has this weird problem. It is currently running 17 virtual machines, which can talk to the external network and function just fine. However, whenever one of these VMs reboots, it can no longer talk to anything outside of vSwitch1. It can still ping and talk to the other VMs on the "Production" portgroup on vSwitch1. Likewise, whenever a VM VMotions from uranus to venus, the same problem happens: the VM now on venus can't talk to anything outside of venus's vSwitch1. VMotioning it back to uranus solves the problem in both cases (whether the VM rebooted on venus or was VMotioned to venus).
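To make the symptom repeatable, the two checks described above (a VM on the same portgroup versus a host outside vSwitch1) could be scripted from inside an affected VM. The following is only a minimal Python sketch, assuming a Linux guest with the standard `ping` command; the function names `ping` and `classify` are hypothetical, and the fault labels are rough guesses rather than a definitive diagnosis.

```python
import subprocess


def ping(host, count=2, timeout_s=2):
    """Return True if `host` answers ICMP echo (Linux `ping` flags assumed)."""
    result = subprocess.run(
        ["ping", "-c", str(count), "-W", str(timeout_s), host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def classify(peer_ok, external_ok):
    """Map the two ping results onto a likely fault location (a rough guess)."""
    if peer_ok and external_ok:
        return "healthy"
    if peer_ok and not external_ok:
        # Matches the symptom on venus: traffic stays inside vSwitch1.
        return "isolated to vSwitch: suspect uplinks or physical switch"
    if not peer_ok and external_ok:
        return "peer unreachable: suspect the peer VM itself"
    return "no connectivity: suspect the guest NIC or its virtual port"
```

Calling `classify(ping(peer), ping(gateway))` with the address of another VM on the "Production" portgroup and the address of the default gateway would give one line to record before and after each reboot or VMotion.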

I'm sure if I reboot venus the problem will go away, but I don't quite have enough spare capacity on uranus to put venus in maintenance mode (I am bringing a third ESX host online now). Does anybody have any ideas about where to check to see what's going on? There is nothing in the vmware.log files of the VMs that reboot or VMotion; everything looks good. There isn't anything in the vmkernel log either regarding a VM losing its network connection.

The network guys say there aren't any MAC address limits on the external switch ports that might restrict the number of MAC addresses allowed on an access port (I solved that problem this summer).

Just wondering if anybody else has seen this before.

3 Replies
Chris_S_UK
Expert

What I would do next is run a packet sniffer inside the problem VM to see if it sees any traffic at all from the main LAN. That should help indicate whether it is a virtual switch problem or a physical switch one.
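As a rough illustration of what to look for in such a capture, here is a minimal Python sketch that splits captured Ethernet frames by source MAC address. It assumes the frames have already been captured as raw bytes by some sniffer and that the MAC addresses of the other VMs on vSwitch1 are known; the function names `mac_str` and `summarize_sources` are hypothetical.

```python
import struct
from collections import Counter


def mac_str(raw):
    """Format 6 raw bytes as a lowercase colon-separated MAC address."""
    return ":".join(f"{b:02x}" for b in raw)


def summarize_sources(frames, local_macs):
    """Count frames per source MAC, split into known-local and other senders."""
    known = {m.lower() for m in local_macs}
    local, other = Counter(), Counter()
    for frame in frames:
        if len(frame) < 14:
            continue  # too short to hold an Ethernet header
        # Ethernet header: destination MAC, source MAC, EtherType.
        _dst, src, _ethertype = struct.unpack("!6s6sH", frame[:14])
        src_mac = mac_str(src)
        if src_mac in known:
            local[src_mac] += 1
        else:
            other[src_mac] += 1
    return local, other
```

If `other` stays empty over a capture of reasonable length (not even broadcasts such as ARP requests from the main LAN arrive), that would suggest nothing from outside vSwitch1 is reaching the VM; if outside frames do arrive but replies never come back, the outbound path is the more likely suspect.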

Chris

gsxr
Contributor

Check whether the VMware Tools inside your VMs are up to date; I learned this today with the same problem.

I upgraded the VMware Tools and the problem was gone. I have static addresses for my servers, no DHCP.

dinny
Expert