We are having some very odd networking problems and working with the network team we are running out of ideas.
The problem:
VMs on standard vSwitches are experiencing problems talking to other systems on the same VLAN, resulting in dropped packets and RPC errors, even when on the same VM host.
Here is a quick and dirty how the network is laid out: http://imgur.com/a/iJIFJ
ESXi NICs are hooked up to Nexus1 and Nexus2. When VMs attempt to communicate the path will often go to the wrong physical address, fail to communicate, and then update ARP tables and go to the other NIC.
The vSwitches are configured for load balancing with "Route based on IP hash", and because we are using standard vSwitches we do NOT have LACP enabled.
We have noticed this problem only exists on the Dell cluster which is hooked directly into the nexus environment. The UCS cluster which is plugged into FIs does not share the problem. I suspect this is because the UCS FIs share ARP tables and it never makes it back to the Nexus 5ks.
My suspicion is that there is a problem between the 2 Nexus 5k switches and they are not sharing ARP tables properly, but the network team is insisting that the problem lies with either the ESXi or Windows OS layer within the VMs. I'm not versed enough in low level network operations to argue this, but I'm at a loss for how to troubleshoot this further and get a definitive answer.
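For anyone trying to reproduce this, it's worth capturing exactly what teaming policy the hosts are running before arguing with the network team. A quick check from the ESXi shell (the vSwitch name `vSwitch0` is an assumption, substitute yours):

```shell
# Show the teaming/failover policy on a standard vSwitch.
# "vSwitch0" is a placeholder -- substitute your actual vSwitch name.
esxcli network vswitch standard policy failover get -v vSwitch0

# Also confirm which physical uplinks the vSwitch is using:
esxcli network vswitch standard list -v vSwitch0
```

Given the config described above, the first command should report IP hash as the load-balancing policy.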
Some solutions we've tried:
1. Setting up a dvSwitch on a test box and enabling LACP. This saw no change.
2. Dropping 1 NIC on the vSwitch to force all paths to go up Nexus1. This caused the problem to stop, but is unacceptable as a solution as it removes our path redundancy.
Some things network team wants us to try but we haven't done yet:
1. Manually changing the VMs' "reachable time" on the NICs to a lower value.
2. Changing out the VMXNET3 interfaces for E1000
3. Enabling LACP on the Standard vSwitches (This isn't supported)
4. Upgrading to ESXi 6 (we're not ready for this migration yet)
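For reference, item 1 (lowering the neighbor "reachable time" inside the Windows VMs) can be done with netsh rather than editing the registry. The interface name and value below are examples, not our actual settings:

```shell
:: Run inside the Windows VM from an elevated prompt.
:: "Ethernet0" and 15000 ms are example values -- adjust for your environment.
netsh interface ipv4 set interface "Ethernet0" basereachable=15000

:: Verify the setting took effect:
netsh interface ipv4 show interface "Ethernet0"
```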
We're really pulling our hair out over this one...has anyone ever encountered these problems before?
Hi
Can you cross-check with the network team whether there are ARP entries in both of the Nexus 5K switches with the same MAC address?
There are not. Network team is seeing ARP entries for the VM's MAC address on 1 nexus device but not the other.
Then this is the problem. They need to have ARP entries at both switches with the same MAC address and IP address of the ESXi hosts.
That's what I thought too, but they're telling me that that's not the case. According to the network team the ARP entry should only be on the switch where the physical connection is. Once again, I'm not versed enough in networking to dispute this with them, but I did think that that was the way it was supposed to work with both 5k switches syncing their ARP tables.
Hi,
Second thought!
Have you tried rebooting the ESXi host? It might be that the ARP entries in the cache are stale or not present in the ARP table.
Oh yeah, we've rebooted a bunch. I don't think the ARP tables on the ESXi management network are the issue. This seems to be a problem with the VMs and the Nexus gear not handling moving MAC addresses properly for some reason. The ARP tables on the VMs keep pointing at Nexus 1, and then the MAC moves to Nexus 2 (or vice versa) and it tries to go down the wrong path...resulting in dropped packets until it figures it out.
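If it helps anyone chasing the same flap: you can watch it happen from the ESXi shell by polling the host's neighbor table. A minimal sketch (the 5-second interval is arbitrary):

```shell
# From the ESXi shell: dump the host's ARP/neighbor cache every 5 seconds
# to catch entries moving between uplinks. Ctrl-C to stop.
while true; do
    date
    esxcli network ip neighbor list
    sleep 5
done
```

Running `arp -a` in a loop inside the affected Windows VMs at the same time shows the guest-side view of the same flap.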
Did you ever get a resolution to this? We are experiencing what seems to be the same situation, and like you and your network team, pulling our hair out!!
1. vDS
2. Route based on originating virtual port
3. spanning-tree port type edge trunk
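For anyone landing here later: the spanning-tree change in step 3 goes on the Nexus interfaces facing the ESXi uplink NICs. A sketch of the NX-OS config (the interface ID is a placeholder):

```shell
! On each Nexus 5K interface facing an ESXi uplink NIC.
! Ethernet1/1 is a placeholder -- use your actual interface IDs.
interface Ethernet1/1
  switchport mode trunk
  spanning-tree port type edge trunk
```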
FYI I have resolved this issue by upgrading the NIC i40e driver to: 2.0.6
Nice. Good to hear you have resolved the issue.
The vSwitches are configured to load balancing with "Route based on IP hash" and because we are using standard vSwitches we do NOT have LACP enabled.
Since LACP (port channel) is not enabled on the physical switch side, you cannot use Route based on IP hash. The reason is that IP hash requires both source and destination IP addresses to make the decision.
So the solution would be: 1) configure a port channel on the physical switch side and use IP hash, or
2) change IP hash to Route based on the originating virtual port
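Option 2 can be applied from the ESXi shell. A sketch, assuming the vSwitch is named `vSwitch0`:

```shell
# Change the standard vSwitch load-balancing policy from IP hash to
# "Route based on originating virtual port" (portid).
# "vSwitch0" is a placeholder -- substitute your actual vSwitch name.
esxcli network vswitch standard policy failover set -v vSwitch0 -l portid

# Confirm the change:
esxcli network vswitch standard policy failover get -v vSwitch0
```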
The link below should give you more info.