VMware Networking Community
PiotrBerent
Contributor

Issue with routing inside NSX / Edge cluster

We have VCF on-prem (version 4.5) with 3 separate workload domains: management (m01), vRA for virtual machines (w01), and Tanzu for containers (w02).
We host it across two separate locations (LocationA & LocationB). The edges are pinned to those locations by DRS rules on the vCenter clusters.
Each domain has a pair of edge instances working as a cluster and managed by a separate NSX-T instance. So we have:
w01-en01(LocationB) + w01-en02(LocationA) -> w01-ec01 (cluster)
m01-en01(LocationB) + m01-en02(LocationA) -> m01-ec01 (cluster)
w02-en01(LocationB) + w02-en02(LocationA) -> w02-ec01 (cluster)
All 3 NSX-T are in the same version: 3.2.1.2.0.20541212
All 6 Edges are in the same version: 3.2.1.2.0.20541219

Those versions come with the BOM for the VCF 4.5 update. We did that update recently and then started to hit routing issues.
Before the update to VCF 4.5 (and from NSX 3.1 to NSX 3.2) everything worked fine and we had connectivity throughout our environment. Now we have issues that are pretty strange.
We had a call with PSO engineers, who checked the configuration and confirmed during the call that it is fine. We also raised a support ticket, but no solution has been given (the ticket is still open).
Despite that, we are still unable to establish a connection between VMs that were migrated from the old VMware environment, whose gateways are located on physical switches (Cisco-based), and machines that are connected to NSX-managed network segments with T1 gateways located on NSX-T.

Apart from this issue, we can reach the machines on "vlan" networks from every point in our network, and we can also reach all VMs on NSX "segments" that use an NSX port group on the VDS.

Our config is T0 - Active/Active, T1 - Active/Standby

Communication works fine when we shut down one edge VM in the w01 domain, and it makes no difference which one.
If both edges in w01 are up, we are able to reach the machines from all networks in the LAN, including the old VMware instance.
If both edges in w01 are up, we are able to reach tkg-management, which is on an NSX-T segment, from outside, but the tkg-management VM cannot reach vCenter, which is connected via a VLAN from the physical network.
If both edges in w01 are up, VMs in the "legacy network" (just VLANs assigned to the VDS on the vCenter cluster) cannot communicate with VMs that are connected to networks managed by NSX-T.

If both edges are up, packet sniffing for a ping from VMs inside a "vlan" to VMs inside an NSX "segment" shows only the incoming request, but no reply (both ways).

If we shut down one edge (it doesn't matter which one in the cluster of a given workload domain), all communications are restored.
If we block BGP for one edge, all communications are restored.
If we move an edge to a single location (which causes the BGP neighbour on its "primary" site to go down, as expected)
So it all comes down to communication with the T0 active/active config.
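
One way to picture why removing an edge changes the behaviour is the sketch below. It is only an illustration, not a description of how NSX or the physical routers actually hash flows: the `pick_edge` function, the SHA-256 hash, and the IP addresses are all assumptions made for the example. With an active/active T0, both edges forward traffic, so the request and the reply of one conversation can be sent through different edges. With a single edge left, both directions necessarily use the same one.

```python
import hashlib

# Edge node names taken from the post; everything else here is hypothetical.
EDGES = ["w01-en01", "w01-en02"]


def pick_edge(src, dst, edges):
    """Pick one edge by hashing the (src, dst) pair.

    This is a stand-in for ECMP next-hop selection, not the real algorithm.
    """
    if not edges:
        raise ValueError("no edges available")
    digest = hashlib.sha256(f"{src}->{dst}".encode()).digest()
    return edges[digest[0] % len(edges)]


def is_symmetric(a, b, edges):
    """Return True when the a->b flow and the b->a flow use the same edge."""
    return pick_edge(a, b, edges) == pick_edge(b, a, edges)


if __name__ == "__main__":
    vlan_vm = "10.20.0.15"    # hypothetical VM on a legacy VLAN
    segment_vm = "10.30.0.25"  # hypothetical VM on an NSX segment
    print("two edges:", is_symmetric(vlan_vm, segment_vm, EDGES))
    print("one edge: ", is_symmetric(vlan_vm, segment_vm, EDGES[:1]))
```

With one edge in the list the paths are always symmetric. With two edges, symmetry depends on how each direction happens to hash, which is consistent with only some flows failing.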

A somewhat related discussion: https://communities.vmware.com/t5/VMware-NSX-Discussions/bd-p/4001

I'm not sure I've described the issue well, but I hope someone can help answer the question: "What could be wrong with T0 active/active while having T1 active/standby?"


Accepted Solutions
CyberNils
Hot Shot

Hi,

Not sure if I've grasped your problem correctly, but you can try setting URPF Mode to None to see if that resolves your issue. I would also cross-check whether you have any IP address conflicts among your Edge Nodes.
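
To illustrate why the URPF setting can matter with asymmetric paths, here is a minimal sketch of a strict reverse-path check. It is a simplified model, not NSX code: the route table, interface names, and addresses are hypothetical, and a real implementation would also accept any interface that is one of several equal-cost best paths. In strict mode, a packet is accepted only if the route back to its source points out of the interface it arrived on. With the mode set to None, the check is skipped.

```python
import ipaddress


def best_route_interface(routes, address):
    """Longest-prefix match: interface of the most specific route covering address."""
    addr = ipaddress.ip_address(address)
    matches = [(net, iface) for net, iface in routes if addr in net]
    if not matches:
        return None
    return max(matches, key=lambda m: m[0].prefixlen)[1]


def urpf_accepts(routes, src_ip, in_iface, mode="STRICT"):
    """Simplified uRPF check for a packet from src_ip arriving on in_iface."""
    if mode == "NONE":
        return True  # no reverse-path check at all
    # STRICT: the best route back to the source must use the arrival interface.
    return best_route_interface(routes, src_ip) == in_iface


# Hypothetical routing table on one edge: (prefix, outgoing interface).
ROUTES = [
    (ipaddress.ip_network("0.0.0.0/0"), "uplink-1"),
    (ipaddress.ip_network("10.20.0.0/16"), "uplink-2"),  # assumed legacy VLAN range
]

if __name__ == "__main__":
    # A packet from the legacy range arrives on uplink-1, but the route back uses uplink-2.
    print(urpf_accepts(ROUTES, "10.20.5.7", "uplink-1", mode="STRICT"))
    print(urpf_accepts(ROUTES, "10.20.5.7", "uplink-1", mode="NONE"))
```

In this model the strict check drops the packet that arrived on the "wrong" interface, while mode None lets it through, which would match a request being seen but never answered.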



Nils Kristiansen
https://cybernils.net/

PiotrBerent
Contributor

Man! It seems to be working. Still testing, but it looks promising.

PiotrBerent
Contributor

However, it now looks like both edges need to be up.

When I shut down one edge, all communication is lost for 60 seconds.

serge40
Contributor

Have you tried resetting the inter-SR iBGP?