We are having some very odd networking problems and working with the network team we are running out of ideas.
The problem:
VMs on standard vSwitches are experiencing problems talking to other systems on the same VLAN, resulting in dropped packets and RPC errors, even when on the same VM host.
Here is a quick and dirty how the network is laid out: http://imgur.com/a/iJIFJ
ESXi NICs are hooked up to Nexus1 and Nexus2. When VMs attempt to communicate the path will often go to the wrong physical address, fail to communicate, and then update ARP tables and go to the other NIC.
The vSwitches are configured for load balancing with "Route based on IP hash", and because we are using standard vSwitches we do NOT have LACP enabled.
We have noticed this problem only exists on the Dell cluster which is hooked directly into the nexus environment. The UCS cluster which is plugged into FIs does not share the problem. I suspect this is because the UCS FIs share ARP tables and it never makes it back to the Nexus 5ks.
My suspicion is that there is a problem between the 2 Nexus 5k switches and they are not sharing ARP tables properly, but the network team is insisting that the problem lies with either the ESXi or Windows OS layer within the VMs. I'm not versed enough in low level network operations to argue this, but I'm at a loss for how to troubleshoot this further and get a definitive answer.
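For anyone trying to reproduce this, it's worth capturing exactly what teaming policy the hosts are running before arguing with the network team. A quick check from the ESXi shell (the vSwitch name `vSwitch0` is an assumption, substitute yours):

```shell
# Show the teaming/failover policy on a standard vSwitch.
# "vSwitch0" is a placeholder -- substitute your actual vSwitch name.
esxcli network vswitch standard policy failover get -v vSwitch0

# Also confirm which physical uplinks the vSwitch is using:
esxcli network vswitch standard list -v vSwitch0
```

Given the config described above, the first command should report IP hash as the load-balancing policy.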
Some solutions we've tried:
1. Setting up a dvSwitch on a test box and enabling LACP. This saw no change.
2. Dropping 1 NIC on the vSwitch to force all paths to go up Nexus1. This caused the problem to stop, but is unacceptable as a solution as it removes our path redundancy.
Some things network team wants us to try but we haven't done yet:
1. Manually changing the VMs' "reachable time" on the NICs to a lower value.
2. Changing out the VMXNET3 interfaces for E1000
3. Enabling LACP on the Standard vSwitches (This isn't supported)
4. Upgrading to ESXi 6 (we're not ready for this migration yet)
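For reference, item 1 (lowering the neighbor "reachable time" inside the Windows VMs) can be done with netsh rather than editing the registry. The interface name and value below are examples, not our actual settings:

```shell
:: Run inside the Windows VM from an elevated prompt.
:: "Ethernet0" and 15000 ms are example values -- adjust for your environment.
netsh interface ipv4 set interface "Ethernet0" basereachable=15000

:: Verify the setting took effect:
netsh interface ipv4 show interface "Ethernet0"
```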
We're really pulling our hair out over this one...has anyone ever encountered these problems before?
Hi
Can you cross-check with the network team whether there are ARP entries in both of the Nexus 5K switches with the same MAC address?
There are not. Network team is seeing ARP entries for the VM's MAC address on 1 nexus device but not the other.
Then this is the problem. They need to have ARP entries at both switches with the same MAC address and IP address of the ESXi hosts.
That's what I thought too, but they're telling me that that's not the case. According to the network team the ARP entry should only be on the switch where the physical connection is. Once again, I'm not versed enough in networking to dispute this with them, but I did think that that was the way it was supposed to work with both 5k switches syncing their ARP tables.
Hi,
Second thought!
Have you tried rebooting the ESXi host? It might be that the ARP entries in the cache are stale or not present in the ARP table.
Oh yeah, we've rebooted a bunch. I don't think the ARP tables on the ESXi management network are the issue. This seems to be a problem with the VMs and the Nexus gear not handling moving MAC addresses properly for some reason. The ARP tables on the VMs keep pointing at Nexus 1, and then the MAC moves to Nexus 2 (or vice versa) and it tries to go down the wrong path...resulting in dropped packets until it figures it out.
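If it helps anyone chasing the same flap: you can watch it happen from the ESXi shell by polling the host's neighbor table. A minimal sketch (the 5-second interval is arbitrary):

```shell
# From the ESXi shell: dump the host's ARP/neighbor cache every 5 seconds
# to catch entries moving between uplinks. Ctrl-C to stop.
while true; do
    date
    esxcli network ip neighbor list
    sleep 5
done
```

Running `arp -a` in a loop inside the affected Windows VMs at the same time shows the guest-side view of the same flap.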
Did you ever get a resolution to this? We are experiencing what seems to be the same situation, and like you and your network team, pulling our hair out!!
1. vDS
2. Route based on originating virtual port
3. spanning-tree port type edge trunk
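For anyone landing here later: the spanning-tree change in step 3 goes on the Nexus interfaces facing the ESXi uplink NICs. A sketch of the NX-OS config (the interface ID is a placeholder):

```shell
! On each Nexus 5K interface facing an ESXi uplink NIC.
! Ethernet1/1 is a placeholder -- use your actual interface IDs.
interface Ethernet1/1
  switchport mode trunk
  spanning-tree port type edge trunk
```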
FYI I have resolved this issue by upgrading the NIC i40e driver to: 2.0.6
Nice. Good to hear you have resolved the issue.
The vSwitches are configured to load balancing with "Route based on IP hash" and because we are using standard vSwitches we do NOT have LACP enabled.
Since LACP (port channel) is not enabled on the physical switch side, you cannot use Route based on IP hash. The reason is that IP hash requires both source and destination IP addresses to make the decision.
So the solution would be: 1) configure a port channel on the physical switch side and use IP hash, or
2) change IP hash to Route based on the originating virtual port
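Option 2 can be applied from the ESXi shell. A sketch, assuming the vSwitch is named `vSwitch0`:

```shell
# Change the standard vSwitch load-balancing policy from IP hash to
# "Route based on originating virtual port" (portid).
# "vSwitch0" is a placeholder -- substitute your actual vSwitch name.
esxcli network vswitch standard policy failover set -v vSwitch0 -l portid

# Confirm the change:
esxcli network vswitch standard policy failover get -v vSwitch0
```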
The link below should give you more info.