lee_harris
Contributor

ESX 3.0.2 Network Configuration Help?!

Hi, we've just installed a couple of new DL580 G4s running ESX 3.0.2 Update 1. We're struggling a bit with the network setup side of things. We thought we'd cracked it, but once I started migrating VMs onto the two new ESX hosts our network admins started seeing host flapping, which was causing ARP broadcast storms on two new Cisco 4948 switches.

Our setup is as follows:

ESX 3.0.2 with 5 NICs

NIC1 -> vSwitch0 -> 100Mbps for Service Console

NIC2 -> vSwitch1 -> 1Gbps for VMotion

NIC3 -> vSwitch2 -> 1Gbps for VM Data Backup LAN (NetBackup)

NIC4 -> vSwitch3 -> 1Gbps for VM Public LAN A

NIC5 -> vSwitch4 -> 1Gbps for VM Public LAN B.

The ESX server physically connects to three different switches: one for the Service Console (which we're not worried about), another for the VM Data Backup LAN (also not a concern), and two Cisco 4948s installed purely for the VM infrastructure. Into these two 4948s we have the single VMotion NIC patched into one of them, and the two VM Public NICs patched one into each switch.
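For reference, the host side was built from the service console with commands roughly like these (the vmnic numbering below is illustrative, not necessarily our actual mapping; esxcfg-nics -l shows the real one):

# list the physical NICs to confirm which vmnic maps to which port
esxcfg-nics -l

# vSwitch1 - VMotion, single 1Gbps uplink
esxcfg-vswitch -a vSwitch1
esxcfg-vswitch -L vmnic1 vSwitch1
esxcfg-vswitch -A "VMotion" vSwitch1

# vSwitch3 / vSwitch4 - VM Public LAN A and B, one pNIC each, patched into separate 4948s
esxcfg-vswitch -a vSwitch3
esxcfg-vswitch -L vmnic3 vSwitch3
esxcfg-vswitch -A "VM Public A" vSwitch3
esxcfg-vswitch -a vSwitch4
esxcfg-vswitch -L vmnic4 vSwitch4
esxcfg-vswitch -A "VM Public B" vSwitch4

(vSwitch0 for the Service Console and vSwitch2 for the backup LAN follow the same pattern.)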

Essentially all of our VMs run Linux, so we configured each guest as follows:

3 x virtual NICs.

1 dedicated for NetBackup (connected to vSwitch2)

2 configured as an active-backup bond (failover) for the public LAN, named VM Public A and VM Public B, with A connected to vSwitch3 and B connected to vSwitch4.
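Inside each guest the bond is set up along these lines (RHEL-style config shown as an example; the device names and address are illustrative):

# /etc/modprobe.conf -- load the bonding driver in active-backup (failover) mode
alias bond0 bonding
options bond0 mode=active-backup miimon=100

# /etc/sysconfig/network-scripts/ifcfg-bond0 -- the bonded public interface
DEVICE=bond0
BOOTPROTO=none
ONBOOT=yes
IPADDR=192.168.10.21   # illustrative address
NETMASK=255.255.255.0

# /etc/sysconfig/network-scripts/ifcfg-eth1 -- slave on VM Public A (vSwitch3)
DEVICE=eth1
MASTER=bond0
SLAVE=yes
ONBOOT=yes
BOOTPROTO=none

# /etc/sysconfig/network-scripts/ifcfg-eth2 -- slave on VM Public B (vSwitch4)
DEVICE=eth2
MASTER=bond0
SLAVE=yes
ONBOOT=yes
BOOTPROTO=none

Note that in active-backup mode the bonding driver presents a single MAC address which, by default, both slaves share, so the same MAC can end up visible behind both vSwitches.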

However, this setup is causing the host flapping: our two 4948s see the same MAC address from a VM on a port on both switches, which triggers the ARP broadcast storms. Ideally we wanted to load balance VMs across pNICs and have those pNICs patched into different switches for resilience.

The idea was that with 20 VMs, 10 would use vSwitch3 as their primary connection and the other 10 would use vSwitch4, so under normal operation each group of 10 VMs runs through its own vSwitch and a single 1Gbps pNIC into a separate 4948. In the event of a failure, all 20 VMs would then run through a single vSwitch, pNIC and 4948.

However, the host flapping is causing us big problems, and we don't know whether we've missed something in the ESX network configuration or whether something needs to be done on the Cisco switches to sort this out.

Any ideas would be greatly appreciated. Many thanks - Lee

RParker
Immortal

The problem I see is the load balancing. If it is truly a 1Gb port, 10 VMs are NOT going to use all that bandwidth, even 20; probably not until 50 will you see any sort of bandwidth problem.

1Gb is a HUGE pipe. Even if ALL 10 were banging on it at 100% utilization, that is still pretty much a dedicated 100Mbps stream each, which is a LOT of network traffic. I doubt very seriously that you really need that.

Since you have 2 vSwitches with the same failover from 20 VMs, there is your redundant MAC address. You can't tell ESX which 10 get which switch, so ALL 20 will show up on both switches. Take one of the switches away; you won't need it. Make one vSwitch for your VMs, or set up 2 different VLANs, like a 130 segment and a 140 segment, something like that.

It would make things much easier to manage than some sort of load balancing. Put some VMs on 130 and some on 140, and your MAC problem will disappear.
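Roughly what that looks like from the ESX service console (the VLAN IDs and port group names are just examples, and the 4948 ports facing those pNICs would need to trunk the VLANs):

# single vSwitch for the VMs, with VLAN-tagged port groups instead of two vSwitches
esxcfg-vswitch -A "VM Public 130" vSwitch3
esxcfg-vswitch -v 130 -p "VM Public 130" vSwitch3
esxcfg-vswitch -A "VM Public 140" vSwitch3
esxcfg-vswitch -v 140 -p "VM Public 140" vSwitch3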
